[Python] Retrieving chat (comments) from a YouTube Live archive

More than 5 years have passed since this article was last updated.

Overview

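I wanted to save the chat from my favorite VTuber's live stream archives before they retire because of gout, so I scraped it with Python.
Perhaps because I was working on Windows, I ran into a few errors, so I'm noting down the code that worked.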

Environment

Windows 8.1
Python 3.8

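Reference sites

Thinking about how to save the various information from my favorite VTuber's YouTube archives - Qiita

My favorite VTuber is gone, and I don't even know why. If you follow VTubers, sooner or later you will face the day one of them disappears. Retirement, graduation, contract termination: there are various names for it, but in practice it is best described as the death of the VTuber. And unfortunately, now...

requests-HTML v0.3.4 documentation

Retrieving chat (comments) from YouTube Live archives with Python (revised edition) - 雑記帳(@watagasi_)

Note: the content of this article is quite outdated. The YouTube Live chat specification changed around November 2020, and as of now (2021/12/19) using the pytchat library seems to be the better choice. github.com watagass...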

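What I did

Following the reference sites, I set out to retrieve the chat from a live stream archive.
The libraries I installed are listed below. I installed them through PyCharm, so I'm only listing the package names.
Normally you would install them with something like pip install <package name>.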

requests
lxml
requests-html
google-api-python-client
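
As a quick check that the installs worked, the sketch below reports which modules cannot be found in the current environment. The helper name `missing_modules` is hypothetical. Note that the import names differ from the package names for two of them: `requests_html` for requests-html and `googleapiclient` for google-api-python-client.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names corresponding to the four packages listed above.
    wanted = ["requests", "lxml", "requests_html", "googleapiclient"]
    print("missing:", missing_modules(wanted))
```

An empty list means all four are importable; anything listed still needs to be installed.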

With this code, retrieving the chat for a 27-minute video (about 6,000 comments) took 6 minutes 30 seconds.
Shortening or removing the sleep should make it somewhat faster.
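
To see how much a change to the sleep actually helps, the run can be timed. This is a minimal sketch; `timed` is a hypothetical helper, and the workload here is a stand-in rather than the real chat download. For reference, the figures above work out to roughly 15 comments per second (6,000 comments in 390 seconds).

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; for the article's case this would be the chat download.
    total, seconds = timed(sum, range(1_000_000))
    print(f"sum={total}, took {seconds:.3f}s")
```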

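Maybe it was my environment, but with beautifulsoup4 alone I could not get the chat URL, seemingly because the page was rendered too slowly (beautifulsoup4 only parses HTML and does not execute JavaScript itself). So I installed requests-html, which renders the page in a browser engine, and fetched the HTML with that.
In my environment the chat URL could not be retrieved without sleep=1, so I left it in. As a result, retrieval takes a little longer.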
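
Once the page has been rendered, picking the chat replay URL out of the iframe is plain string handling, so it can be tried without a network connection. A minimal sketch, assuming the iframe `src` values have already been collected into a list; `pick_replay_url` and the sample values are hypothetical.

```python
from urllib.parse import urljoin

YOUTUBE_BASE = "https://www.youtube.com"


def pick_replay_url(srcs, base=YOUTUBE_BASE):
    """Return the first chat-replay URL among iframe src values, or None."""
    for src in srcs:
        if src and "live_chat_replay" in src:
            # urljoin leaves absolute URLs alone and resolves relative ones.
            return urljoin(base, src)
    return None


if __name__ == "__main__":
    # Made-up src values for illustration only.
    print(pick_replay_url(["/live_chat_replay?continuation=EXAMPLE"]))
    print(pick_replay_url(["/embed/abc"]))
```

Returning None when nothing matches lets the caller decide what to do when the iframe has not been rendered yet, for example retrying with a longer sleep.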

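Also, on Windows you apparently get an error when opening a text file unless you specify the encoding, so I specify it.
Everything else should be almost the same as the reference code...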
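
On the encoding point: when `encoding` is omitted, `open()` uses a platform-dependent default, which on a Japanese-locale Windows is typically cp932 (an assumption about the environment here). Chat text often contains characters such as emoji that cp932 cannot represent. The sketch below shows that failure and a UTF-8 round trip; both helper names are hypothetical.

```python
import os
import tempfile


def cp932_can_encode(text):
    """Report whether text can be encoded as cp932 (the Japanese Windows code page)."""
    try:
        text.encode("cp932")
        return True
    except UnicodeEncodeError:
        return False


def save_and_reload(lines, path):
    """Write lines as UTF-8 and read them back with the same encoding."""
    with open(path, "w", encoding="utf-8") as fp:
        fp.writelines(lines)
    with open(path, "r", encoding="utf-8") as fp:
        return fp.readlines()


if __name__ == "__main__":
    sample = ["こんにちは\n", "草 \U0001F600\n"]
    print([cp932_can_encode(s) for s in sample])
    with tempfile.TemporaryDirectory() as tmp:
        print(save_and_reload(sample, os.path.join(tmp, "chat.txt")) == sample)
```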

Here is the code I wrote.

import json
import os
# from datetime import datetime

from requests_html import HTMLSession

LIVECOMMENTPATH = "livecomment_data"


def GetLiveComment(target_id, folder=LIVECOMMENTPATH):
    target_url = "https://www.youtube.com/watch?v=" + target_id

    # Start a session
    session = HTMLSession()
    r = session.get(target_url)

    # Have the browser engine generate the HTML. Rendering seemed slow and the
    # URL could not be retrieved without waiting, hence the sleep.
    r.html.render(sleep=1)

    comment_data = []
    next_url = None

    for iframe in r.html.find("#chatframe"):
        src = iframe.attrs.get("src", "")
        if "live_chat_replay" in src:
            # Get the chat replay URL
            next_url = "https://www.youtube.com" + src
            # print(next_url)
            break

    # Loop only while there is a URL to fetch (none if no replay iframe was found)
    while next_url:
        try:
            html_data = session.get(next_url)
            html_data.html.render()

            dict_str = None
            for scrp in html_data.html.find("script"):
                if "window[\"ytInitialData\"]" in scrp.text:
                    # Split only once so " = " inside a comment does not truncate the data
                    dict_str = scrp.text.split(" = ", 1)[1]
                    break
            if dict_str is None:
                break

            # The payload is JSON, so parse it with json.loads instead of eval
            dics = json.loads(dict_str.rstrip("; \n"))
            chat = dics["continuationContents"]["liveChatContinuation"]

            for comment in chat["actions"][1:]:
                comment_data.append(str(comment) + "\n")

            # Look up the next URL
            continuation = chat["continuations"][0]["liveChatReplayContinuationData"]["continuation"]
            temp = "https://www.youtube.com/live_chat_replay?continuation=" + continuation
            # print(temp)
            if temp == next_url:
                # Sometimes next_url stops changing no matter how often it is fetched. Give up then.
                break
            next_url = temp
        # An exception is raised when, for example, there is no next URL, so stop there
        except Exception:
            break

    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, target_id + "_livecomment.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.writelines(comment_data)


if __name__ == "__main__":
    # print("start " + str(datetime.now()))
    VIDEOID = "yDn9UwxxDF8"  # video ID
    GetLiveComment(VIDEOID)
    # print("done " + str(datetime.now()))
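
The step that pulls the ytInitialData dictionary out of a script tag, and the lookup of the next continuation token, can be tried in isolation without contacting YouTube. A minimal sketch, assuming the script text has the form `window["ytInitialData"] = {...};` that the code above expects. The function names are hypothetical, the sample string is hand-made rather than real YouTube output, and the payload is parsed with `json.loads`.

```python
import json

MARKER = 'window["ytInitialData"]'


def parse_initial_data(script_text):
    """Return the dict assigned to window["ytInitialData"], or None if absent."""
    if MARKER not in script_text:
        return None
    # partition splits at the first " = " only, so " = " inside chat text is kept.
    _, sep, payload = script_text.partition(" = ")
    if not sep:
        return None
    return json.loads(payload.rstrip("; \n"))


def next_continuation(data):
    """Return the replay continuation token, or None when the chain ends."""
    try:
        chat = data["continuationContents"]["liveChatContinuation"]
        return chat["continuations"][0]["liveChatReplayContinuationData"]["continuation"]
    except (KeyError, IndexError, TypeError):
        return None


if __name__ == "__main__":
    # Hand-made sample shaped like the keys the code above reads.
    sample = (
        'window["ytInitialData"] = {"continuationContents": {"liveChatContinuation": '
        '{"actions": [], "continuations": [{"liveChatReplayContinuationData": '
        '{"continuation": "TOKEN"}}], "ok": true}}};'
    )
    print(next_continuation(parse_initial_data(sample)))
```

Returning None instead of raising makes the end of the chat explicit, which avoids relying on a broad exception handler to stop the loop.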

That's all.