How to crawl Weibo comments with Python

Preface

The text and pictures in this article come from the Internet and are for learning and communication purposes only, not for any commercial use. If you have any concerns, please contact us and we will handle them.



Part 1: Theory

Imagine the problem: we want to grab the comment data under a post by a Weibo big V (a high-profile verified account). How should we go about it? The easiest way is to find Weibo's comment data interface, then vary the parameters to fetch the latest data and save it. The first place to look for a comment interface is Weibo's official API.


Unfortunately, that interface is rate-limited, and after being caught a few times the account was banned. The attempt went cold before it ever took off.

So instead I turned to Weibo's mobile site. Log in first, then find the post whose comments we want to grab, open the browser's built-in network analysis tool, keep scrolling down through the comments, and the comment data interface shows up among the captured requests.

Then click the "Params" tab to inspect the request parameters.

There are 4 parameters in total. The first two are the id of the post itself, which works like a person's ID number: it is effectively the post's "ID number". max_id is the paging parameter; it must change on every request, and the value of the next max_id is carried in the data returned by the current request.
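To make this concrete, here is a hypothetical sketch of the request. The endpoint path and parameter names come from my own inspection of the mobile site and may have changed since; the id value is made up.

```python
# Hypothetical sketch of the mobile comment interface; the endpoint path and
# parameter names are from my own inspection and may have changed since.
base_url = "https://m.weibo.cn/comments/hotflow"

params = {
    "id": "4123456789012345",   # the post's id (made-up value)
    "mid": "4123456789012345",  # the same id again, the post's "ID number"
    "max_id_type": "0",
    # "max_id" is omitted on the first request; every response carries the
    # max_id value to send with the next request.
}
```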

Part 2: Practice

With the above foundation, let's write the Python code to implement it.


1. First, distinguish the URLs: the first request carries no max_id, while each subsequent request uses the max_id returned by the previous response, as sketched below.
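A minimal sketch of that logic, reusing the parameter names from Part 1 (build_params is a helper name I made up):

```python
def build_params(weibo_id, max_id=None):
    # First request: no max_id. Later requests: pass the max_id
    # returned by the previous response.
    params = {"id": weibo_id, "mid": weibo_id, "max_id_type": "0"}
    if max_id:
        params["max_id"] = max_id
    return params
```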

2. The request needs to carry cookie data. Weibo cookies stay valid for a relatively long time, long enough to capture one post's worth of comment data. The cookie value can be copied out of the browser's network analysis tool, as sketched below.
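A sketch of the request with the cookie attached, reusing base_url and build_params from above; both header values are placeholders to be replaced with the ones from your own logged-in session:

```python
import requests

headers = {
    # Copy both values from your own logged-in session in the browser's
    # network tool; the strings below are placeholders.
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2 like Mac OS X)",
    "Cookie": "SUB=<your-cookie-here>",
}

response = requests.get(
    base_url, params=build_params("4123456789012345"), headers=headers
)
```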

3. Then parse the returned data as JSON and pull out the comment content, the commenter's nickname, the comment time, and other fields, as sketched below.
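A parsing sketch that continues from the response above, assuming the layout I observed: a data object holding the comment list plus the max_id for the next page.

```python
data = response.json()["data"]
next_max_id = data["max_id"]  # feeds the next page's request

for item in data["data"]:
    text = item["text"]                     # comment content (may contain HTML)
    nickname = item["user"]["screen_name"]  # commenter's nickname
    created_at = item["created_at"]         # comment time
    print(created_at, nickname, text)
```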

4. To save the comment content cleanly, we need to strip the emoji markup out of each comment, which is handled with regular expressions, as sketched below.
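A minimal cleaning function; the patterns assume the mobile interface renders emoji as <span> markup, which is my observation rather than a documented guarantee:

```python
import re

def clean_comment(text):
    # Emoji arrive as <span>...</span> markup (an observed assumption);
    # strip those first, then any remaining HTML tags.
    text = re.sub(r"<span.*?</span>", "", text)
    text = re.sub(r"<.*?>", "", text)
    return text.strip()
```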

5. After that, save the content to a txt file with a simple open() call, as sketched below.
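The file name comments.txt is arbitrary; append mode lets repeated runs keep adding to the same file:

```python
def save_comment(line, path="comments.txt"):
    # Append one comment per line; utf-8 keeps Chinese text intact.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
```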

6. Here comes the key point: this interface only returns 16 pages of data (20 comments per page). There are reports online of getting 50 pages back, but that is a different interface and returns a different amount of data. So I added a for loop to traverse all the pages in one go, as sketched below.
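A sketch tying the helpers above together; it is a plain loop over the 16 pages, chaining each response's max_id into the next request:

```python
def job(weibo_id="4123456789012345"):
    # Fetch up to 16 pages (about 20 comments each) in one pass.
    max_id = None
    for _ in range(16):
        resp = requests.get(
            base_url, params=build_params(weibo_id, max_id), headers=headers
        )
        data = resp.json()["data"]
        for item in data["data"]:
            line = "{}\t{}\t{}".format(
                item["created_at"],
                item["user"]["screen_name"],
                clean_comment(item["text"]),
            )
            save_comment(line)
        max_id = data["max_id"]
        if not max_id:  # no further pages
            break
```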

7. The fetching function is named job here. To keep grabbing the latest data, we can use the schedule library to add a timer to the program so that it runs every 10 minutes or every half hour, as sketched below.
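A sketch with the third-party schedule library (pip install schedule):

```python
import time

import schedule  # third-party: pip install schedule

# Run job every 10 minutes; schedule.every(30).minutes would give half an hour.
schedule.every(10).minutes.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
```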

8. De-duplicate the acquired data: if a comment has already been saved, skip it; if not, add it. A sketch follows.
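One simple way to do it, shown here as a de-duplicating replacement for the save_comment above, with an in-memory set of lines already written (a sketch, not necessarily the original author's exact approach):

```python
seen = set()

def save_comment(line, path="comments.txt"):
    # If the comment is already there, just pass; otherwise add it.
    if line in seen:
        return
    seen.add(line)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
```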

And with that, the job is basically done.

Part 3: Summary

Although this method cannot capture all of the data, it is still a reasonably effective approach given the restrictions Weibo imposes.


Source: blog.csdn.net/pythonxuexi123/article/details/112841211