Python Crawler Beginner Tutorial (7): Crawling Tencent Video Barrage Comments

Preface

The text and pictures in this article come from the Internet and are for learning and communication purposes only; they are not for any commercial use. If you have any questions, please contact us.

Case tutorial videos on Python crawlers, data analysis, website development, and more are free to watch online:

https://space.bilibili.com/523606542

 

Previous installments

 

Python Crawler Beginner Tutorial (1): Crawling Douban Movie Ranking Information

Python Crawler Beginner Tutorial (2): Crawling Novels

Python Crawler Beginner Tutorial (3): Crawling Lianjia Second-Hand Housing Data

Python Crawler Beginner Tutorial (4): Crawling 51job.com Recruitment Information

Python Crawler Beginner Tutorial (5): Crawling Bilibili Video Barrage Comments

Python Crawler Beginner Tutorial (6): Making Word Cloud Diagrams

Basic development environment

  • Python 3.6
  • Pycharm

Use of related modules

  • requests
  • jieba
  • wordcloud

Install Python, add it to the environment variables, and use pip to install the required modules.

1. Clarify the requirement

Select <Happy Comedian Season 7> and crawl the barrage comments sent by viewers.

 

2. Analyze web data

Copy a barrage comment from the webpage and search for it in the developer tools.

 


The corresponding barrage data appears in one of the responses. This URL has a telltale feature: the link contains the word danmu. So try it boldly: filter the requests for the keyword danmu and check whether similar content shows up.

 


By comparing the links, you can see how the parameters change from one URL to the next.

 


Looping over these parameter values makes it possible to crawl the barrage comments for the entire video.
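To make the parameter pattern concrete, here is a small offline sketch based on the parameters observed above: only the timestamp value changes, advancing in steps of 15, so the request parameters for every segment of the video can be generated in a loop (the helper name build_params is my own, not from the original code).

```python
def build_params(timestamp):
    """Build the query parameters for one danmu segment (only 'timestamp' varies)."""
    return {
        'otype': 'json',
        'target_id': '6416481842&vid=t0035rsjty9',
        'session_key': '30475,0,1611577043',
        'timestamp': timestamp,
        '_': '1611577043296',
    }

# One parameter dict per segment, matching the loop used later in this article
all_params = [build_params(t) for t in range(15, 150, 15)]
print(len(all_params))                    # 9 segments
print(all_params[0]['timestamp'])         # 15
print(all_params[-1]['timestamp'])        # 135
```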

3. Parse the data

 

Here is a question: what kind of data do you think a request to this URL returns? Take three seconds to think about it.

1... 2... 3...

Okay, the answer: it is a string. You heard that right. If you call response.json() directly, you will get an error.
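To see why, here is a minimal offline illustration using a made-up, shortened JSONP-style string (the real response wraps its JSON payload in a jQuery callback in the same way):

```python
import json

# Miniature stand-in for the server response: valid JSON wrapped in a
# JavaScript callback call, so the body as a whole is NOT valid JSON.
jsonp_text = 'jQuery19108312825154929784_1611577043265({"comments": []})'

try:
    json.loads(jsonp_text)              # this is what response.json() attempts
except json.JSONDecodeError as e:
    print('not valid JSON:', e)         # this branch runs

# Stripping the callback wrapper leaves parseable JSON
inner = jsonp_text[jsonp_text.index('(') + 1 : jsonp_text.rindex(')')]
print(json.loads(inner))                # {'comments': []}
```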

 


So how can we turn it into JSON data? After all, JSON data is easier to extract from.

The first method

 

  • Use a regular expression to extract the JSON portion in the middle of the response
  • Import the json module and convert the string into JSON data
import requests
import re
import json
import pprint

url = 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19108312825154929784_1611577043265&target_id=6416481842%26vid%3Dt0035rsjty9&session_key=30475%2C0%2C1611577043&timestamp=105&_=1611577043296'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# Strip the jQuery callback wrapper and keep only the JSON inside the parentheses
result = re.findall(r'jQuery19108312825154929784_1611577043265\((.*?)\)', response.text)[0]
# Convert the extracted string into a Python dict
json_data = json.loads(result)
pprint.pprint(json_data)

 

The second method

Delete callback=jQuery19108312825154929784_1611577043265 from the link, and then you can use response.json() directly.

import requests
import pprint

# Without the callback parameter, the server returns plain JSON
url = 'https://mfm.video.qq.com/danmu?otype=json&target_id=6416481842%26vid%3Dt0035rsjty9&session_key=30475%2C0%2C1611577043&timestamp=105&_=1611577043296'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
json_data = response.json()  # parse the response body as JSON directly
pprint.pprint(json_data)

This also works, and it makes the code simpler.

Tips:

pprint is a formatted-output module; it prints JSON-like nested data in a much more readable layout than a plain print().
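A quick illustration with a made-up dictionary shaped like the danmu response (the field names here are just for demonstration):

```python
import pprint

# A nested structure similar in shape to the crawled JSON
data = {'comments': [{'content': 'hello', 'upcount': 3},
                     {'content': 'world', 'upcount': 7}]}

# pformat returns the same pretty-printed text that pprint.pprint prints;
# width controls where lines wrap
text = pprint.pformat(data, width=40)
print(text)
```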

Complete implementation code

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
# The timestamp parameter advances in 15-second steps, one request per segment
for page in range(15, 150, 15):
    url = 'https://mfm.video.qq.com/danmu'
    params = {
        'otype': 'json',
        'target_id': '6416481842&vid=t0035rsjty9',
        'session_key': '30475,0,1611577043',
        'timestamp': page,
        '_': '1611577043296',
    }
    response = requests.get(url=url, params=params, headers=headers)
    json_data = response.json()
    contents = json_data['comments']
    for i in contents:
        content = i['content']
        # Append each barrage comment to a text file, one per line
        with open('喜剧人弹幕.txt', mode='a', encoding='utf-8') as f:
            f.write(content)
            f.write('\n')
            print(content)

The code is fairly simple; there is nothing particularly difficult about it.

 

Origin blog.csdn.net/m0_48405781/article/details/113250671