Sina Weibo content scraping

While crawling Sina Weibo content recently, I ran into many problems and solved many of them. At first I used httpclient to fetch the pages, but I later found that much of the content on Weibo pages is embedded in JS, so I switched to htmlunit. Here are some highlights from my experience!


The advantages and disadvantages of httpclient, htmlunit, and the Sina Weibo API:

httpclient:

Advantages: The crawler is relatively stable, and its usage is documented in detail. You can refer to the book "Write Your Own Web Crawler".

Disadvantages: It cannot execute JS, so content generated by scripts is missing from the fetched page.

htmlunit:

Advantages: It can execute JS and simulate form submission, and it is relatively simple to use.

Disadvantages: The crawler is unstable; it may throw exceptions, especially when executing JS.

Sina Weibo API:

Advantages: Provides access to many kinds of content.

Disadvantages: Too many restrictions.


Since I use htmlunit to crawl Sina Weibo content, let me go over some of the problems I ran into with the crawler.

1. Simulated login

To simulate login, you must understand the Sina Weibo login process. Because this process changes often, you need to capture packets and analyze it yourself; I recommend installing a packet capture tool such as httpfox.

Login process:

1. Simulate the user name input box losing focus, which sends a request to the server.

2. Read the data returned by the request in step 1 and extract the parameters.

3. Encrypt the password. (Sina's encryption algorithm may have changed.)

4. Put the parameters from steps 2 and 3 into a POST request and send it.

5. When the request succeeds, get the cookie and save it.
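
Steps 2 and 4 above can be sketched with only the Java standard library. This is a minimal sketch, not the actual Sina login code: the sample response, the parameter names (servertime, nonce), and the form field names are assumptions for illustration, and the real ones have to be read from captured traffic. The encrypted password from step 3 is treated as an opaque placeholder, since the algorithm may have changed.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of login steps 2 and 4: extract parameters, then build the POST body. */
final class LoginSketch {

    // Matches flat "key":"value" or "key":123 pairs inside a JSON-like response.
    private static final Pattern PAIR =
            Pattern.compile("\"(\\w+)\"\\s*:\\s*(?:\"([^\"]*)\"|(-?\\d+))");

    /** Step 2: collect the flat key/value pairs found in the response text. */
    static Map<String, String> extractParams(String response) {
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(response);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            params.put(m.group(1), value);
        }
        return params;
    }

    /** Step 4: encode the fields as an application/x-www-form-urlencoded body. */
    static String buildPostBody(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Hypothetical response to the step 1 request; capture the real one.
        String response =
                "callback({\"retcode\":0,\"servertime\":1700000000,\"nonce\":\"AB12CD\"})";
        Map<String, String> params = extractParams(response);

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "user@example.com");       // hypothetical field name
        fields.put("password", "ENCRYPTED-IN-STEP-3");    // placeholder, not real cipher text
        fields.put("servertime", params.get("servertime"));
        fields.put("nonce", params.get("nonce"));

        System.out.println(buildPostBody(fields));
    }
}
```

A regular expression is enough here only because the assumed response is a flat object; a nested response would call for a real JSON parser.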

2. JS parsing

Method 1: Call the save method of HtmlPage to save the fetched page locally. When you read the file back, you will find that the JS in it has already been executed.

Method 2: Extract the specific JS. The content inside the JS follows a regular pattern, so we only need to extract the HTML embedded in it and then parse that HTML with htmlcleaner.

Method 3: Use the getByXPath method provided by HtmlPage directly. This method is unstable, though: the element object may not be returned.
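
Method 2 can be sketched as follows, using only the Java standard library. This is a minimal sketch under assumptions: it supposes that each script passes a JSON object with an "html" string field, and the render(...) call in the sample is hypothetical. Check the real page source to see how the markup is actually embedded. The strings it returns are what would then be handed to htmlcleaner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of Method 2: pull HTML fragments out of script blocks. */
final class EmbeddedHtmlExtractor {

    // Captures the string value of an "html" field, allowing escaped characters.
    private static final Pattern HTML_FIELD =
            Pattern.compile("\"html\"\\s*:\\s*\"((?:\\\\.|[^\"\\\\])*)\"");

    /** Returns every embedded HTML fragment found in the page source, unescaped. */
    static List<String> extract(String pageSource) {
        List<String> fragments = new ArrayList<>();
        Matcher m = HTML_FIELD.matcher(pageSource);
        while (m.find()) {
            fragments.add(unescape(m.group(1)));
        }
        return fragments;
    }

    /** Undoes JSON string escaping (quotes, slashes, control and unicode escapes). */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                continue;
            }
            char next = s.charAt(++i);
            switch (next) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    if (i + 4 < s.length()) {
                        // Four hex digits follow; malformed digits would throw here.
                        out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                        i += 4;
                    } else {
                        out.append("\\u"); // truncated escape, keep it as-is
                    }
                    break;
                default: out.append(next); // covers escaped quote, backslash, slash
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical page source with one embedded fragment.
        String page = "<script>render({\"pid\":\"feed\",\"html\":"
                + "\"<div class=\\\"post\\\">hello<\\/div>\"})</script>";
        for (String fragment : extract(page)) {
            System.out.println(fragment);
        }
    }
}
```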

3. Verification code

Method one:

1. Send a request to get the verification code image.

2. Save it locally.

3. Open the image and enter the verification code manually.

4. Take the entered verification code value and send it with the authentication request.
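
Steps 2 and 3 of this manual flow can be sketched as follows. Fetching the image bytes (step 1) and submitting the code (step 4) depend on the HTTP client in use and are left out; the class and method names are hypothetical, and the empty byte array stands in for a real image.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of the manual verification-code flow (steps 2 and 3). */
final class ManualCaptcha {

    /** Step 2: write the downloaded image bytes to a local file. */
    static Path saveImage(byte[] imageBytes, Path target) throws IOException {
        return Files.write(target, imageBytes);
    }

    /** Step 3: read the code a person typed; returns "" if nothing was entered. */
    static String readCode(BufferedReader in) throws IOException {
        String line = in.readLine();
        return line == null ? "" : line.trim();
    }

    public static void main(String[] args) throws IOException {
        byte[] imageBytes = new byte[0]; // placeholder for the bytes fetched in step 1
        Path saved = saveImage(imageBytes, Files.createTempFile("captcha", ".png"));
        System.out.println("Open " + saved.toAbsolutePath() + " and type the code:");

        BufferedReader console =
                new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String code = readCode(console);
        if (code.isEmpty()) {
            System.out.println("No verification code was entered.");
        } else {
            System.out.println("Code to send with the authentication request: " + code);
        }
    }
}
```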

Method two:

Crack the verification code. This method is very hard!

Method three:

Bypass the verification code by crawling the mobile version of Weibo instead. Of course, the content on the mobile side is less detailed than the content on the web side, and some things may be missing.


That is about all I have to write for now; other problems may come up in the future. These notes are only groundwork and are for reference only! I will not post my code: it was written too casually to be worth showing.
