Collecting Huya live-stream data for later analysis, case 24 of 120 Python crawler cases

"Offer arrives, dig friends to pick up! I am participating in the 2022 Spring Recruitment Check-in Event, click to view the details of the event ."

Today we will scrape Huya's live-streaming channel pages. As with earlier posts in this series, the focus of this blog is still the multi-threaded crawler.


Target data analysis

The data to be collected this time is listed below. It comes from a server interface that is called when switching pages, so this case is an interface-oriented multi-threaded crawler.

The interface API is as follows:

https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=getLiveListJsonpCallback&page=2
https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=getLiveListJsonpCallback&page=3

Request method: GET
Response format: JSON
Parameter description:

  • m: presumably the module or channel name;
  • do: interface name;
  • tagAll: tag name;
  • callback: JSONP callback function name;
  • page: page number.

Testing the interface shows that all parameters must be passed; only the callback parameter can be omitted.
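To see the required-versus-optional split concretely, here is a minimal sketch that builds the request URL with all five parameters while leaving callback empty (build_url is a made-up helper for illustration, not part of Huya's API):

```python
from urllib.parse import urlencode

def build_url(page, callback=""):
    # All five parameters are always sent; callback may stay empty,
    # in which case the server returns plain JSON instead of JSONP.
    params = {
        "m": "LiveList",
        "do": "getLiveListByPage",
        "tagAll": 0,
        "callback": callback,
        "page": page,
    }
    return "https://www.huya.com/cache.php?" + urlencode(params)

print(build_url(2))
```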

When page exceeds totalPage, the returned data looks like this:

{
  "status": 200,
  "message": "",
  "data": {
    "page": 230,
    "pageSize": 120,
    "totalPage": 228,
    "totalCount": 0,
    "datas": [],
    "time": 1630141551
  }
}

Based on the response above, you can call the interface once to obtain totalPage, then generate all of the links to be crawled.
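As a sketch, the page URLs can then be generated in one list comprehension (make_urls is a hypothetical helper; note the +1 so the last page is included):

```python
def make_urls(total_page):
    # Build one crawl URL per page, from 1 through total_page inclusive.
    fmt = ("https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage"
           "&tagAll=0&callback=&page={}")
    return [fmt.format(i) for i in range(1, total_page + 1)]

print(len(make_urls(228)))  # 228
```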

Coding time

The code for this case is moderately difficult to write; the core issue is the data returned by the server. Because the data is loaded asynchronously, the response looks as shown below: when the callback parameter has a value, the returned data is wrapped in that parameter's value, getLiveListJsonpCallback (Figure 1); with the parameter left empty, the response is plain JSON (Figure 2).

[Figure 1: response wrapped in the getLiveListJsonpCallback callback]

[Figure 2: plain JSON response when the callback parameter is left empty]

If you keep the callback parameter, you can clean the returned data with the following code, i.e. delete the redundant callback name at the head of the content, then delete the trailing closing bracket.

res.encoding = 'utf-8'
text = res.text
# Remove the JSONP wrapper: the callback-name prefix ...
text = text.replace('getLiveListJsonpCallback(', '')
# ... and the trailing ")"
text = text[:-1]
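If you prefer something sturdier than string slicing, a regular expression can strip whatever callback name wraps the payload before json.loads parses it (a sketch; strip_jsonp is a made-up helper):

```python
import json
import re

def strip_jsonp(text, callback="getLiveListJsonpCallback"):
    # Match "callback( ... )" around the whole body, with an optional
    # trailing semicolon; return the body, or the text unchanged if
    # the wrapper is absent (i.e. the response is already plain JSON).
    m = re.fullmatch(rf"\s*{re.escape(callback)}\((.*)\)\s*;?\s*", text, re.S)
    return m.group(1) if m else text

raw = 'getLiveListJsonpCallback({"status": 200, "data": {"page": 1}})'
payload = json.loads(strip_jsonp(raw))
print(payload["status"])  # 200
```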

Complete Huya live JSON data crawling

The implementation logic of this case is basically the same as the previous one; the differences are only in how the data is requested and parsed. Compare the two cases as you study.

The final data is stored directly in JSON format; you can convert it to other formats yourself.
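Since each page is saved as one JSON file, a later conversion is straightforward. A minimal sketch, assuming the saved file keeps the full interface response and taking the CSV column names from the first record so no field names have to be guessed (json_to_csv is a hypothetical helper):

```python
import csv
import json

def json_to_csv(json_path, csv_path):
    # Load one saved page and flatten data["datas"] into a CSV file.
    with open(json_path, encoding="utf-8") as f:
        rows = json.load(f)["data"]["datas"]
    if not rows:
        return 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```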

import os
import random
import threading

import requests


class Common:
    def __init__(self):
        pass

    def get_headers(self):
        # Pick a random User-Agent for each request.
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            # add more User-Agent strings here
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers


def run(index, url, semaphore, headers):
    semaphore.acquire()  # take a concurrency slot
    try:
        res = requests.get(url, headers=headers, timeout=5)
        res.encoding = 'utf-8'
        # Strip the JSONP wrapper so the payload is plain JSON.
        text = res.text
        text = text.replace('getLiveListJsonpCallback(', '')
        text = text[:-1]
        save(index, text)
    finally:
        semaphore.release()  # free the slot


def save(index, text):
    with open(f"./虎牙/{index}.json", "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Data for page {index} written")


if __name__ == '__main__':
    os.makedirs("./虎牙", exist_ok=True)
    # First request: read the total page count from the interface.
    first_url = 'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=&page=1'
    c = Common()
    res = requests.get(url=first_url, headers=c.get_headers())
    data = res.json()
    total_page = 0
    if data['status'] == 200:
        total_page = data['data']['totalPage']

    url_format = 'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=&page={}'
    # Build the full URL list; +1 keeps the last page in range.
    urls = [url_format.format(i) for i in range(1, total_page + 1)]
    # Allow at most 5 threads to run at the same time.
    semaphore = threading.BoundedSemaphore(5)
    threads = []
    for i, url in enumerate(urls):
        t = threading.Thread(target=run, args=(i, url, semaphore, c.get_headers()))
        t.start()
        threads.append(t)
    # Wait for every worker instead of busy-waiting on active_count().
    for t in threads:
        t.join()
    print('All threads finished')
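The manual semaphore pattern above can also be expressed with the standard library's thread pool, where max_workers caps concurrency the same way the semaphore does and the context manager waits for all workers. A sketch with a placeholder fetch standing in for the real run() and made-up URLs:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(index, url):
    # Placeholder for run(): request the page, strip JSONP, save to disk.
    return index, url

urls = [f"hypothetical-url-page-{i}" for i in range(1, 6)]
# max_workers=5 plays the role of BoundedSemaphore(5).
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, range(len(urls)), urls))
print(len(results))  # 5
```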

== You've made it this far: how about a comment, a like, and a bookmark? ==

Today is day 205/365 of writing every day. You are welcome to follow me, like, comment, and bookmark.


Source: juejin.im/post/7080329933859323912