Crawling Sina Weibo data with a crawler

Tool: cloud mining crawler

Goal: Crawl all of a blogger's Weibo posts

Analyzing the web page structure:

Our approach is to simulate a browser automatically visiting and scraping the pages.

Let's look at the page structure first. Each Weibo list page is loaded lazily in three or four chunks; once the page-turning button appears at the bottom, the page can be considered fully loaded.
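As a rough sketch of this idea (assuming Selenium drives the browser; the profile URL and the ".m-page" selector are placeholders, not the tool's actual configuration), loading a list page and waiting for the page-turn button could look like this:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
driver.get("https://weibo.com/u/1234567890")  # hypothetical blogger home page

# The list loads lazily; scroll a few times so every chunk is fetched.
for _ in range(4):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

# Treat the appearance of the page-turning button at the bottom as the
# "page fully loaded" signal. The ".m-page" selector is an assumption.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".m-page"))
)
html = driver.page_source
```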

 

 

The login problem:

Crawling requires a logged-in account, so how do we log in?

Logging in normally does not require a verification code; one is only requested after a failed attempt, so logging in poses no technical difficulty.

We can create a login module: log in once with a browser, and then crawl all pages using the cookies shared by that browser session.
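A minimal sketch of this cookie-sharing idea, assuming the login is done by hand in a Selenium-driven browser and its cookies are then reused in a requests session (the URLs below are placeholders):

```python
import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://weibo.com/login.php")  # placeholder login URL
input("Log in manually in the browser window, then press Enter here...")

# Copy the browser's cookies into a requests session so every later
# page fetch is made as the logged-in user.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))

resp = session.get("https://weibo.com/u/1234567890?page=1")  # hypothetical list page
print(resp.status_code)
```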

 

 

Flow chart design:

 

 

 

We don't need a detail page for each Weibo post, so the crawler flow has no detail-page step; all the data is extracted directly from the list pages.
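As an illustration of extracting everything from the list pages alone (the selectors and field names below are assumptions for the sketch, not the tool's actual extraction rules):

```python
from bs4 import BeautifulSoup

def parse_list_page(html):
    """Extract post text and timestamp from one list page; no detail-page requests."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for card in soup.select(".card-wrap"):        # one card per Weibo post (assumed selector)
        text_node = card.select_one(".txt")
        time_node = card.select_one(".from a")
        if text_node is None:
            continue
        posts.append({
            "text": text_node.get_text(strip=True),
            "time": time_node.get_text(strip=True) if time_node else "",
        })
    return posts
```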

Crawling results:

Crawling 10 pages took about 5 minutes in total and yielded roughly 400 posts; the count is modest because I don't post on Weibo very often.

The data is as follows:

 

 

Make a simple word cloud:
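A minimal word-cloud sketch, assuming the crawled posts were saved to a CSV file with a `text` column and that `jieba` handles Chinese word segmentation (the file name, column name, and font path are all assumptions):

```python
import jieba
import pandas as pd
from wordcloud import WordCloud

# Assumed data format: the crawled posts were saved to weibo.csv with a "text" column.
df = pd.read_csv("weibo.csv")
tokens = " ".join(jieba.cut(" ".join(df["text"].astype(str))))

cloud = WordCloud(
    font_path="simhei.ttf",      # a CJK-capable font is required for Chinese text
    width=800,
    height=600,
    background_color="white",
).generate(tokens)
cloud.to_file("wordcloud.png")
```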



Origin: blog.csdn.net/milu2003516/article/details/106208880