Crawling Sina Weibo followers with a crawler

Tools

cloud mining crawler

Task details

We are given several hundred personal homepage addresses, such as this one:

https://weibo.com/u/1688541667?refer_flag=1005050008_&is_hot=1

The task is to grab the follower data of these bloggers.

As shown in the figure, we first obtain the address of the follower list:
[Figure: address of the follower list]
We then page through the list to get the followers on the first 5 pages. At 20 records per page, that gives 100 followers per blogger.
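
The paging step can be sketched in Python. This is only an illustration: the article uses a visual crawler tool rather than code, and `FANS_URL_TEMPLATE` is a hypothetical URL pattern, not the real follower-list address (which should be copied from the browser). The helper names `extract_uid` and `follower_page_urls` are also made up for this sketch.

```python
from urllib.parse import urlparse

# Hypothetical pattern for a follower-list page. The real address should be
# taken from the browser; this template is an assumption for illustration.
FANS_URL_TEMPLATE = "https://weibo.com/{uid}/fans?page={page}"


def extract_uid(homepage_url):
    """Return the numeric user ID from a https://weibo.com/u/<uid> address."""
    parts = [p for p in urlparse(homepage_url).path.split("/") if p]
    if len(parts) != 2 or parts[0] != "u" or not parts[1].isdigit():
        raise ValueError("unexpected homepage URL: " + homepage_url)
    return parts[1]


def follower_page_urls(homepage_url, pages=5):
    """Build the addresses of the first `pages` follower-list pages."""
    uid = extract_uid(homepage_url)
    return [FANS_URL_TEMPLATE.format(uid=uid, page=n) for n in range(1, pages + 1)]


urls = follower_page_urls(
    "https://weibo.com/u/1688541667?refer_flag=1005050008_&is_hot=1"
)
for url in urls:
    print(url)
```

The query string of the homepage address is ignored on purpose, since only the user ID in the path is needed to build the list pages.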

With 700 bloggers, we need to capture about 70,000 records in total.
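
For reference, the estimate follows directly from the numbers above. A short Python check, using only figures stated in the article:

```python
BLOGGERS = 700
PAGES_PER_BLOGGER = 5
FOLLOWERS_PER_PAGE = 20

# 5 pages x 20 records per page
followers_per_blogger = PAGES_PER_BLOGGER * FOLLOWERS_PER_PAGE
# 700 bloggers x 100 followers each
total_records = BLOGGERS * followers_per_blogger

print(followers_per_blogger, total_records)  # prints: 100 70000
```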

Here are the fields we need to grab:

[Figure: fields to grab]

Build the login module

We grab the data by simulating a browser login with an account, so we need to build a separate login module.

The login interface of Sina Weibo:
[Figure: Sina Weibo login interface]
Flow chart:
[Figure: login module flow chart]
The login module mainly enters the account name and password and then clicks the login button. It is very simple to build with the flow chart.

We can also check the login status, which is either success or failure. If the login succeeds, we start the collection task; if it fails, we report an error.
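
The success/failure branch can be written out as a small Python sketch. The article builds this in the crawler's flow chart, not in code, so everything here is illustrative: `run_task`, `LoginError`, and the stand-in functions `fake_login` and `fake_collect` are hypothetical names, and no real Weibo login is performed.

```python
from typing import Callable


class LoginError(RuntimeError):
    """Raised when the simulated login does not succeed."""


def run_task(login: Callable[[str, str], bool],
             collect: Callable[[], int],
             account: str,
             password: str) -> int:
    # Enter the account and password and click "log in" (delegated to `login`).
    if not login(account, password):
        # Failure branch: report an error and do not start collecting.
        raise LoginError("login failed; collection task not started")
    # Success branch: start the collection task.
    return collect()


# Stand-ins for the real browser steps, used only to exercise both branches.
def fake_login(account: str, password: str) -> bool:
    return password == "correct-password"


def fake_collect() -> int:
    return 100  # pretend 100 follower records were collected


print(run_task(fake_login, fake_collect, "user", "correct-password"))
```

Keeping the login step behind a single function with a boolean result makes it easy to swap in a real browser automation step later without touching the collection logic.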

Collection process

The overall process is as follows:

[Figure: overall collection flow chart]

At first, our idea was to crawl each user's personal homepage to obtain the user information, including gender, profile, region, and so on. We then calculated that 70,000 followers would mean grabbing 70,000 addresses, which is too many, so we changed our minds and took the data directly from the list.

[Figure: follower list]
The user name, gender, and region are all shown in the list, so our crawler does not need a details page.
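
A rough comparison of the two approaches, as a Python calculation. It assumes one request per page and ignores login, retries, and any extra requests the site may trigger, so the numbers are only an estimate:

```python
BLOGGERS = 700
PAGES_PER_BLOGGER = 5
FOLLOWERS_PER_PAGE = 20

total_followers = BLOGGERS * PAGES_PER_BLOGGER * FOLLOWERS_PER_PAGE

# Details-page approach: one personal homepage per follower.
detail_page_requests = total_followers

# List-only approach: just the follower-list pages themselves.
list_page_requests = BLOGGERS * PAGES_PER_BLOGGER

print(detail_page_requests, list_page_requests,
      detail_page_requests // list_page_requests)
# prints: 70000 3500 20
```

Note that the details-page approach would still need the list pages to discover the followers in the first place, so the real gap is slightly larger than this estimate.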

So in the flowchart, we use a [table data page]:

[Figure: table data page in the flowchart]
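
What a table data page does can be approximated in Python: each repeated list item becomes one row, and selected child elements become columns. The sketch below uses the standard library's `html.parser`. The markup in `SAMPLE_HTML` and the class names `follow_item`, `name`, and `addr` are invented for illustration and are not Weibo's real page structure.

```python
from html.parser import HTMLParser

# Invented markup standing in for one follower-list page.
SAMPLE_HTML = """
<ul>
  <li class="follow_item"><a class="name">Alice</a>
      <i class="W_icon icon_female"></i><span class="addr">Beijing</span></li>
  <li class="follow_item"><a class="name">Bob</a>
      <i class="W_icon icon_male"></i><span class="addr">Shanghai</span></li>
</ul>
"""


class FollowerListParser(HTMLParser):
    """Collect one dict per list item: name, gender icon class, region."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # row currently being filled, if any
        self._field = None  # text field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "follow_item" in classes:
            self._row = {"name": "", "gender_class": "", "region": ""}
        elif self._row is not None:
            if tag == "a" and "name" in classes:
                self._field = "name"
            elif tag == "span" and "addr" in classes:
                self._field = "region"
            elif tag == "i":
                # Gender is not text; keep the icon class for later mapping.
                self._row["gender_class"] = " ".join(classes)

    def handle_data(self, data):
        if self._row is not None and self._field:
            self._row[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = FollowerListParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```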

As for gender, the page does not display male or female directly as text. Instead, we read the class of the gender icon and replace it with the corresponding value.

[Figure: gender icon class]

Let's do the replacement:

[Figure: replacement of the gender icon class]
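
In code, the same replacement is a simple lookup. This Python sketch assumes the icon's class attribute contains a token such as `icon_male` or `icon_female`; those class names are assumptions for illustration and should be checked against the actual page.

```python
# Assumed class-name tokens; verify them against the real page source.
GENDER_BY_ICON_CLASS = {
    "icon_male": "male",
    "icon_female": "female",
}


def gender_from_class(class_attr):
    """Map an icon's class attribute to a gender label."""
    for token in class_attr.split():
        if token in GENDER_BY_ICON_CLASS:
            return GENDER_BY_ICON_CLASS[token]
    # No known token found: leave the value explicit rather than guessing.
    return "unknown"


print(gender_from_class("W_icon icon_female"))
print(gender_from_class("W_icon icon_male"))
print(gender_from_class("W_icon"))
```

Matching on whole class tokens rather than substrings avoids a subtle bug: the string `icon_female` contains `male`, so a naive substring test for `male` would label every user as male.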

The fetched results are as follows:

[Figure: fetched results]


Origin blog.csdn.net/milu2003516/article/details/107096298