Python tips: how to scrape Weibo comments with Python and see the interesting, funny comments under a Weibo big V's posts!

【Part1——Theory】

Consider a question: if we want to scrape the comment data of a Weibo big V, how should we do it? The easiest way is to find the Weibo comment data interface, then change its parameters to fetch the latest data and save it. First, look for the interface that serves Weibo comments, as shown in the figure below.

Unfortunately, this interface is rate-limited and gets banned after too many requests. Just as the crawler is ready to take off, it goes cold.

Next, I switched to the mobile Weibo site: log in first, find the post whose comments we want to scrape, open the browser's built-in network analysis tool, and keep scrolling down through the comments until the comment data interface appears, as shown below.

Then click on the "Parameters" tab, and you can see the parameters shown in the figure below:

You can see that there are four parameters in total. The first and second are the id of the Weibo post: just as everyone has an ID number, this is the post's "ID number". max_id is the paging parameter; it changes with every request, and the next max_id value is contained in the response data of the current request.
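As an illustration, the four parameters might look like this in Python. The names follow the m.weibo.cn comment interface as commonly observed in the network tool, and the id value below is made up:

```python
# Hypothetical example of the four query parameters seen in the
# browser's network tool (the id value is a placeholder):
params = {
    "id": "4459122399545557",   # the post's id, its "ID number"
    "mid": "4459122399545557",  # the same post id, sent a second time
    "max_id": "",               # paging cursor; empty on the first page
    "max_id_type": "0",
}

# The next max_id arrives inside the JSON body of each response,
# e.g. at response.json()["data"]["max_id"].
```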

【Part2——Practice】

With the above foundation, let's start writing the code and implement it in Python.

1. Distinguish the URLs first: the first request doesn't need max_id, while each later request uses the max_id returned by the previous one.
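A minimal sketch of this step. The endpoint path is an assumption based on the mobile-site interface found in the network tool; verify it against what your browser shows:

```python
def build_comment_url(weibo_id, max_id=None):
    """Build the comment-interface URL: no max_id on the first request,
    then the max_id returned by the previous response afterwards."""
    base = ("https://m.weibo.cn/comments/hotflow"
            f"?id={weibo_id}&mid={weibo_id}&max_id_type=0")
    if max_id:
        base += f"&max_id={max_id}"
    return base
```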

2. The request needs to carry cookie data. Weibo cookies stay valid for a relatively long time, long enough to scrape one post's comment data. The cookie value can be found in the browser's analysis tool.
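A sketch of the request carrying the cookie, using the requests library. The cookie string below is a placeholder; copy your own from the browser's network tool after logging in:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0",
    # Placeholder — replace with the cookie copied from your browser:
    "Cookie": "SUB=your_cookie_value_here",
}

def fetch_page(url):
    """Request one page of comment data and decode the JSON body."""
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
```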

3. Then convert the returned data to JSON format and extract the comment text, the commenter's nickname, the comment time, and other fields; the output is shown in the figure below.
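A sketch of the extraction step. The field names (`data.data`, `user.screen_name`, `created_at`, `text`) follow the response layout commonly seen from the mobile interface; re-check them in the network tool for your own requests:

```python
def parse_comments(payload):
    """Pull nickname, time and text out of one page of decoded JSON."""
    rows = []
    for item in payload.get("data", {}).get("data", []):
        rows.append({
            "user": item["user"]["screen_name"],  # commenter's nickname
            "time": item["created_at"],           # comment time
            "text": item["text"],                 # comment content
        })
    return rows
```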

4. To save the comment content cleanly, we need to remove the emoji embedded in the comments, handled with regular expressions, as shown in the figure below.
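A sketch of the cleanup, assuming emoji appear as HTML fragments and bracketed names (e.g. `<span><img alt="[笑]"></span>`) inside the comment text:

```python
import re

def clean_text(text):
    """Strip embedded emoji/HTML from a comment with regular expressions."""
    text = re.sub(r"<[^>]+>", "", text)      # drop HTML tags
    text = re.sub(r"\[[^\]]+\]", "", text)   # drop [emoji] placeholders
    return text.strip()
```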

5. Then save the content to a txt file, using a simple open() call, as shown in the figure below.
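A sketch of the save step, assuming each comment is a dict with the `time`, `user` and `text` keys extracted earlier:

```python
def save_comments(rows, path="comments.txt"):
    """Append cleaned comments to a txt file with a plain open() call."""
    with open(path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(f"{row['time']}\t{row['user']}\t{row['text']}\n")
```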

6. Here comes the key point. This interface returns only 16 pages of data (20 comments per page). Some reports online say it can return 50 pages, but the count differs between interfaces, so I added a for loop to traverse them all in one go, as shown in the figure below.
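A sketch of the paging loop. To keep it self-contained it takes a `fetch_json` callable (a hypothetical helper that requests one page, receiving the previous max_id and returning the decoded JSON):

```python
def crawl_all(fetch_json, max_pages=16):
    """Loop over the ~16 pages the interface returns (20 comments each),
    feeding each response's max_id into the next request."""
    comments, max_id = [], None
    for _ in range(max_pages):
        payload = fetch_json(max_id)        # max_id is None on page 1
        data = payload.get("data", {})
        comments.extend(data.get("data", []))
        max_id = data.get("max_id")
        if not max_id:                      # no more pages
            break
    return comments
```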

7. The function is named job here. To keep fetching the latest data, we can use schedule to add a timer to the program, grabbing once every 10 minutes or half an hour, as shown in the figure below.
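A sketch of the timing setup with the third-party `schedule` package (`pip install schedule`). The `job` body is a placeholder for the crawl-and-save routine; `schedule` is imported lazily so the sketch stays importable without the package:

```python
import time

def job():
    # Placeholder: call the crawl-and-save routine here
    print("grabbing latest comments ...")

def run_every(minutes=10):
    """Run job() on a fixed interval using the `schedule` package."""
    import schedule  # third-party: pip install schedule
    schedule.every(minutes).minutes.do(job)
    while True:
        schedule.run_pending()
        time.sleep(1)
```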

8. De-duplicate the acquired data, as shown in the figure below: if a comment is already stored, simply skip it; otherwise, append it.
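A sketch of the de-duplication, using a set for the membership check (the helper name is made up):

```python
def merge_new(existing, new_comments):
    """Append only comments not already collected; skip the rest."""
    seen = set(existing)
    for comment in new_comments:
        if comment in seen:
            continue  # already there, just pass
        existing.append(comment)
        seen.add(comment)
    return existing
```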

That basically completes the job.


Origin blog.csdn.net/Python_xiaobang/article/details/112274536