Learning Python - an exercise in crawling the first 20 movies in Douban's romance category

I have written crawlers in Java before; this time I want to use Python to crawl page information.

First, open Douban Movies: https://movie.douban.com/

Right-click and choose Inspect, open the Network tab and select the XHR filter, then click the Romance category in the movie chart.

 

 

 

Here you can see the request URL behind the page's response. Copy it: https://movie.douban.com/j/chart/top_list?type=13&interval_id=100%3A90&action=&start=0&limit=20
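As a small aside, the copied URL can also be rebuilt from its query parameters, which makes it easier to change the page size later. This is only a sketch: the parameter names and values come from the URL above, while the helper name build_chart_url and the reading of start and limit as offset and count are my assumptions.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/chart/top_list"


def build_chart_url(start=0, limit=20):
    """Rebuild the copied request URL from its query parameters (hypothetical helper)."""
    params = {
        "type": 13,               # value taken from the copied URL
        "interval_id": "100:90",  # urlencode escapes ':' as %3A
        "action": "",             # empty in the copied URL
        "start": start,           # assumed to be the offset of the first entry
        "limit": limit,           # assumed to be the number of entries returned
    }
    return BASE_URL + "?" + urlencode(params)


chart_url = build_chart_url()
print(chart_url)
```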

Look at the corresponding request headers:

 

Copy the header information and pass it to the Python request in JSON-like form (a dictionary).

Then write the corresponding Python code; html is a module I defined myself.
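The original code was shown only as a screenshot, so here is a minimal sketch of what such a request could look like using the standard library. The header values are placeholders (copy the real ones from the browser), and the function names build_request and fetch_text are hypothetical, not the author's html module. Decoding the body as UTF-8 is also an assumption.

```python
import urllib.request

# Placeholder headers; replace with the ones copied from the Network tab.
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://movie.douban.com/",
}

URL = ("https://movie.douban.com/j/chart/top_list"
       "?type=13&interval_id=100%3A90&action=&start=0&limit=20")


def build_request(url, headers):
    """Attach the copied headers to a GET request (hypothetical helper)."""
    return urllib.request.Request(url, headers=headers)


def fetch_text(url, headers, timeout=10):
    """Send the request and return the body as text, assuming UTF-8."""
    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Only build the request here; nothing is sent until fetch_text is called.
request = build_request(URL, HEADERS)
print(request.full_url)
```

Calling fetch_text(URL, HEADERS) would then perform the actual request.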

 

 

Run it and print the JSON result.

An error occurs:

 

 

Checking the headers shows that the response is being returned in a compressed format.

 

 

Remove the corresponding parameter from the headers (the one that asks for a compressed response, typically Accept-Encoding) and run the code again.
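A sketch of that fix follows. The source does not name the parameter, so I am assuming it is Accept-Encoding, which is the header a browser uses to ask for gzip or similar compression. The helper name without_compression and the header values are made up for illustration.

```python
def without_compression(headers):
    """Return a copy of the headers without Accept-Encoding (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


# Example headers as they might be copied from the browser (placeholder values).
copied_headers = {
    "Accept-Encoding": "gzip, deflate, br",
    "User-Agent": "Mozilla/5.0",
}

clean_headers = without_compression(copied_headers)
print(clean_headers)
```

Without that header the server should send an uncompressed body, which can be decoded as plain text.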

The crawled JSON is now displayed.

 

You can see that the information is displayed successfully.

Paste the output into an online JSON parsing and formatting site to check whether it is correct.
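The same check can also be done locally with the json module, as a rough alternative to the online tool. The sample string below is made up; only the url field name comes from the source, and title and score are assumed field names.

```python
import json

# Made-up sample standing in for the crawled response text.
raw = ('[{"title": "Example Movie", "score": "9.0", '
       '"url": "https://movie.douban.com/subject/0000000/"}]')

movies = json.loads(raw)  # raises json.JSONDecodeError if the text is not valid JSON
pretty = json.dumps(movies, ensure_ascii=False, indent=2)  # keep Chinese text readable
print(pretty)
```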

 

 

 

You can see that the crawled movie data matches the romance chart on the site.

 

 

 

Next, extract the link to each movie's page.

The URL of each movie's page is in the url field of the JSON.

 

 

Iterate over the url fields and print them to check that the output is correct.
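The traversal could look roughly like this. It assumes the response is a JSON array of objects that each carry a url field, as described above; the sample URLs and the helper name extract_urls are invented.

```python
import json


def extract_urls(json_text):
    """Return the url field of every movie in the JSON array."""
    return [movie["url"] for movie in json.loads(json_text)]


# Made-up sample in place of the real response.
sample = json.dumps([
    {"url": "https://movie.douban.com/subject/0000001/"},
    {"url": "https://movie.douban.com/subject/0000002/"},
])

urls = extract_urls(sample)
for url in urls:
    print(url)
```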

 

 

The output is as follows:

 

 

The traversal output is correct.

Then open the Farewell My Concubine page and analyze the information and reviews on it.

The Farewell My Concubine synopsis can be found inside a span element:

 

 

 

Use a regular expression to extract the text inside it:

(?<=<span property="v:summary" class>)[\s\S]*?(?=</span>)

Then follow the same steps as above.

 

Set the regular expression that matches the short reviews:

(?<=<span class="short">)[\s\S]*?(?=</span>)
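To see how the two patterns behave, here is a small sketch that runs them against a made-up HTML fragment. The patterns are copied from the text as written; the sample HTML is invented to match them exactly, and the real page markup may differ (for example in the class attribute), so treat this as an illustration only.

```python
import re

# Patterns copied from the text; lookbehind/lookahead keep the tags out of the match.
SUMMARY_RE = re.compile(r'(?<=<span property="v:summary" class>)[\s\S]*?(?=</span>)')
SHORT_RE = re.compile(r'(?<=<span class="short">)[\s\S]*?(?=</span>)')

# Made-up fragment standing in for a movie page.
sample_html = (
    '<span property="v:summary" class>\n  A made-up synopsis.\n</span>'
    '<span class="short">First made-up review.</span>'
    '<span class="short">Second made-up review.</span>'
)


def extract(pattern, html):
    """Return every match of the pattern with surrounding whitespace removed."""
    return [match.strip() for match in pattern.findall(html)]


summaries = extract(SUMMARY_RE, sample_html)
reviews = extract(SHORT_RE, sample_html)
print(summaries)
print(reviews)
```

The [\s\S]*? part matches any character including newlines, and the lazy quantifier stops at the first closing span tag.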

Putting everything together, the complete crawler code is:
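The author's full code was only available as a screenshot. The following is my own rough sketch of how the steps might be combined, not the original program. The crawl function takes a fetch callable so that it can be tried offline; the fake pages and all names here are hypothetical.

```python
import json
import re

SUMMARY_RE = re.compile(r'(?<=<span property="v:summary" class>)[\s\S]*?(?=</span>)')
SHORT_RE = re.compile(r'(?<=<span class="short">)[\s\S]*?(?=</span>)')


def crawl(chart_url, fetch):
    """Fetch the chart JSON, then each movie page, and collect synopsis and reviews.

    fetch is any callable that takes a URL and returns the response text.
    """
    results = []
    for movie in json.loads(fetch(chart_url)):
        page = fetch(movie["url"])
        summary = SUMMARY_RE.search(page)
        results.append({
            "url": movie["url"],
            "summary": summary.group(0).strip() if summary else None,
            "reviews": [r.strip() for r in SHORT_RE.findall(page)],
        })
    return results


# Offline stand-in for real HTTP requests, with made-up content.
FAKE_PAGES = {
    "chart": '[{"url": "page-1"}]',
    "page-1": ('<span property="v:summary" class> Made-up synopsis. </span>'
               '<span class="short">Made-up review.</span>'),
}

results = crawl("chart", FAKE_PAGES.__getitem__)
print(results)
```

For a real run, fetch would be replaced by a function that sends the request with the copied headers.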

 

 

Then we print the crawled results directly.

The crawl results are as follows:

 

 


Origin www.cnblogs.com/halone/p/12452803.html