Preface
The text and images in this article are from the Internet and are for learning and communication purposes only; they have no commercial use. If you have any questions, please contact us.
Video tutorials on Python crawling, data analysis, website development, and other cases are free to watch online:
https://space.bilibili.com/523606542
Previous articles in this series
Python Crawler Beginner Tutorial (1): Crawling Douban movie ranking information
Python Crawler Beginner Tutorial (2): Crawling novels
Python Crawler Beginner Tutorial (3): Crawling Lianjia second-hand housing data
Python Crawler Beginner Tutorial (4): Crawling 51job.com recruitment information
Python Crawler Beginner Tutorial (5): Crawling Bilibili video barrage
Python Crawler Beginner Tutorial (6): Making word cloud diagrams
Basic development environment
- Python 3.6
- PyCharm
Related modules used
- requests
- jieba
- wordcloud
Install Python, add it to the environment variables, and use pip to install the required modules.
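If any module is missing, it can be installed from the command line (module names as listed above):

pip install requests jieba wordcloud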
1. Clarify the requirements
Select <Happy Comedian Season 7> and crawl the barrage (danmu) comments sent by viewers.
2. Analyze the web page data
Copy a barrage comment from the web page and search for it in the developer tools.
The corresponding barrage data shows up in one of the captured requests. The URL has a telltale feature: the link contains danmu. So boldly filter and search for the keyword danmu, and check whether similar requests appear.
By comparing the parameters of these URLs, you can see which parameter changes from request to request.
Looping over that parameter lets you crawl the barrage for the entire video, as sketched below.
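A minimal sketch of that loop, assuming (as the captured requests suggest) that only the timestamp parameter changes, stepping by 15:

# Sketch: build one request URL per segment; only the timestamp value changes
base_url = 'https://mfm.video.qq.com/danmu?otype=json&target_id=6416481842%26vid%3Dt0035rsjty9&timestamp={}'
for timestamp in range(15, 150, 15):
    print(base_url.format(timestamp))

The complete crawler at the end of this article applies the same idea using the params argument of requests.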
3. Parse the data
Here's a question: what kind of data do you think this URL address returns? I'll give everyone three seconds to think.
1... 2... 3...
OK, the answer: it is a string. You heard that right. If you call response.json() directly, you will get an error (typically a JSONDecodeError), because the body is not pure JSON.
So how can we turn it into JSON data? After all, JSON data is easier to extract from.
Method 1
- Use a regular expression to extract the JSON part in the middle of the response
- Import the json module and convert the string to JSON data
import requests
import re
import json
import pprint

url = 'https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19108312825154929784_1611577043265&target_id=6416481842%26vid%3Dt0035rsjty9&session_key=30475%2C0%2C1611577043&timestamp=105&_=1611577043296'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# The response is JSONP: JSON wrapped in a jQuery callback.
# Extract the part between the parentheses with a regular expression.
result = re.findall(r'jQuery19108312825154929784_1611577043265\((.*?)\)', response.text)[0]
# Convert the extracted string into a Python dict
json_data = json.loads(result)
pprint.pprint(json_data)
Method 2
Delete the callback=jQuery19108312825154929784_1611577043265 parameter from the link, so that response.json() can be used directly.
import requests
import pprint

# Without the callback parameter the server returns plain JSON
url = 'https://mfm.video.qq.com/danmu?otype=json&target_id=6416481842%26vid%3Dt0035rsjty9&session_key=30475%2C0%2C1611577043&timestamp=105&_=1611577043296'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
# No regular expression needed: parse the response directly
json_data = response.json()
pprint.pprint(json_data)
This works just as well and makes the code simpler.
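Once parsed, the barrage text sits under the comments key. Continuing from the script above, a quick check (field names as used in the complete code below):

# Each item in 'comments' carries the barrage text in its 'content' field
for comment in json_data['comments']:
    print(comment['content'])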
Tip:
pprint is a pretty-printing module; it makes nested, JSON-like data much easier to read.
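A tiny comparison with made-up data shows the difference:

import pprint

data = {'comments': [{'content': '哈哈哈哈'}, {'content': '前排'}]}
print(data)          # plain print: everything on one line
pprint.pprint(data)  # pprint: nested structure shown with indentation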
Complete implementation code
import requests

for page in range(15, 150, 15):
    url = 'https://mfm.video.qq.com/danmu'
    # Only the timestamp parameter changes between requests (15-second steps)
    params = {
        'otype': 'json',
        'target_id': '6416481842&vid=t0035rsjty9',
        'session_key': '30475,0,1611577043',
        'timestamp': page,
        '_': '1611577043296',
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, params=params, headers=headers)
    json_data = response.json()
    contents = json_data['comments']
    for i in contents:
        content = i['content']
        # Append each barrage comment to a text file, one per line
        with open('喜剧人弹幕.txt', mode='a', encoding='utf-8') as f:
            f.write(content)
            f.write('\n')
        print(content)
The code is relatively simple. There is no particular difficulty.
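Since jieba and wordcloud are listed among the related modules, here is a minimal follow-up sketch that turns the saved barrage file into a word cloud (the font path is an assumption; point it at any font on your system that supports Chinese):

import jieba
import wordcloud

# Read the barrage text saved by the crawler above
with open('喜剧人弹幕.txt', mode='r', encoding='utf-8') as f:
    text = f.read()

# Segment the Chinese text with jieba; WordCloud expects space-separated tokens
words = ' '.join(jieba.lcut(text))

# font_path is an assumption: replace it with a Chinese-capable font available locally
wc = wordcloud.WordCloud(font_path='msyh.ttc', width=800, height=600, background_color='white')
wc.generate(words)
wc.to_file('弹幕词云.png')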