I have written crawlers in Java before; this time I want to use Python to crawl information from web pages.
First, open Douban Movies: https://movie.douban.com/
Right-click and choose Inspect, select the XHR filter in the Network tab, then click the Romance category in the movie rankings.
The request URL that produces the page's response appears here. Copy it: https://movie.douban.com/j/chart/top_list?type=13&interval_id=100%3A90&action=&start=0&limit=20
Look at the corresponding request headers:
Copy the header information, convert it to JSON (key/value) form, and pass it along with the request in Python.
Then write the corresponding Python code; html is a module I defined myself.
Print the JSON result after running it.
An error occurs.
Checking the headers shows that the response is being returned in a compressed format.
Remove the corresponding parameter from the headers and run it again.
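The fix described above can be sketched roughly as follows. This is a minimal sketch, not the original code: it assumes the offending parameter is the Accept-Encoding header copied from the browser, uses the standard-library urllib instead of the author's own html module, and the header values shown are placeholders to be replaced with the ones from the Network tab.

```python
import json
import urllib.request

TOP_LIST_URL = ("https://movie.douban.com/j/chart/top_list"
                "?type=13&interval_id=100%3A90&action=&start=0&limit=20")

# Placeholder header values (assumption); copy the real ones from the browser.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://movie.douban.com/",
}


def strip_compression(headers):
    """Return a copy of the headers without Accept-Encoding (any letter case)."""
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


def build_request(url, headers):
    """Build a request that does not ask the server for a compressed body."""
    return urllib.request.Request(url, headers=strip_compression(headers))


def fetch_json(url, headers, timeout=10):
    """Send the request and decode the body as JSON (performs a network call)."""
    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (not run here because it needs network access):
# movies = fetch_json(TOP_LIST_URL, BROWSER_HEADERS)
```

Without Accept-Encoding the server is free to send an uncompressed body, so the bytes can be decoded as text directly. The alternative would be to keep the header and decompress the response in code.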
The crawled JSON is now displayed.
You can see the information has been displayed successfully.
We paste the output into an online JSON parsing site to format it and check whether it is correct.
The crawled movie data matches the data in Douban's Romance list.
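The same check can also be done locally rather than on an online site. The following is a small sketch that pretty-prints a JSON string with the standard json module; the sample record and its title field are made up for illustration.

```python
import json


def pretty(json_text):
    """Re-indent a JSON string; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(json.loads(json_text), indent=2, ensure_ascii=False)


# Made-up sample record (assumption), standing in for the crawled response.
RAW = '[{"title":"示例","url":"https://movie.douban.com/subject/0000001/"}]'
print(pretty(RAW))
```

If the text is not valid JSON, json.loads raises an error, which is itself a quick correctness check.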
Next, extract the links to the corresponding web pages.
In the JSON, the url field of each entry holds the URL of that movie's page.
Iterate over the url values and check whether the output is correct.
The output is as follows.
The traversal output is correct.
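A minimal sketch of this traversal, assuming the response is a JSON array of objects that each carry a url field, as described above. The sample entries and subject IDs are placeholders, not real crawl output.

```python
import json

# Placeholder data (assumption) in the shape of the top_list response.
SAMPLE = '''[
  {"title": "Movie A", "url": "https://movie.douban.com/subject/0000001/"},
  {"title": "Movie B", "url": "https://movie.douban.com/subject/0000002/"}
]'''


def extract_urls(json_text):
    """Collect the url field of every entry that has one."""
    return [item["url"] for item in json.loads(json_text) if "url" in item]


for url in extract_urls(SAMPLE):
    print(url)
```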
Then open the Farewell My Concubine page and analyze the information and reviews on it.
The Farewell My Concubine synopsis can be found inside this span element.
Here a regular expression is used to extract the text inside it.
Use (?<=<span property="v:summary" class>)[\s\S]*?(?=</span>)
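As a rough illustration, the pattern can be applied with Python's re module as below. The sample HTML is made up and uses exactly the opening tag the pattern expects; if the markup actually served differs (for example class=""), the lookbehind would need to be adjusted to match.

```python
import re

# Lookbehind for the opening tag, lazy match of anything, lookahead for </span>.
SUMMARY_RE = re.compile(
    r'(?<=<span property="v:summary" class>)[\s\S]*?(?=</span>)'
)

# Made-up page fragment (assumption), not real Douban markup.
SAMPLE_HTML = '''<div id="link-report">
  <span property="v:summary" class>
    A placeholder synopsis line.
  </span>
</div>'''


def extract_summary(html):
    """Return the synopsis text with surrounding whitespace removed, or None."""
    match = SUMMARY_RE.search(html)
    return match.group(0).strip() if match else None


print(extract_summary(SAMPLE_HTML))
```

The [\s\S] class is used instead of . so that the match can span line breaks, and the lazy *? stops at the first closing </span>.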
The reviews are handled with the same steps as above.
Set the regular expression that matches the review text:
(?<=<span class="short">)[\s\S]*?(?=</span>)
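Because a page holds several short reviews, findall is a natural fit here. This is a sketch on made-up HTML; the surrounding div and p tags are assumptions, and only the span with class "short" comes from the pattern above.

```python
import re

SHORT_RE = re.compile(r'(?<=<span class="short">)[\s\S]*?(?=</span>)')

# Made-up review markup (assumption).
SAMPLE_HTML = '''
<div class="comment"><p><span class="short">First placeholder review.</span></p></div>
<div class="comment"><p><span class="short">Second placeholder
review on two lines.</span></p></div>
'''


def extract_short_reviews(html):
    """Return the text of every short review found in the page."""
    return [text.strip() for text in SHORT_RE.findall(html)]


for review in extract_short_reviews(SAMPLE_HTML):
    print(review)
```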
Putting everything together, the complete crawler code is:
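One way the steps could be combined is sketched below. It is an illustration under assumptions rather than the original program: it uses urllib instead of the author's html module, the header values are placeholders, and the fetch parameter is a hypothetical hook that lets the function be tried without network access.

```python
import json
import re
import urllib.request

TOP_LIST_URL = ("https://movie.douban.com/j/chart/top_list"
                "?type=13&interval_id=100%3A90&action=&start=0&limit=20")

# Placeholder headers (assumption); note there is no Accept-Encoding entry.
HEADERS = {"User-Agent": "Mozilla/5.0", "Referer": "https://movie.douban.com/"}

SUMMARY_RE = re.compile(
    r'(?<=<span property="v:summary" class>)[\s\S]*?(?=</span>)'
)
SHORT_RE = re.compile(r'(?<=<span class="short">)[\s\S]*?(?=</span>)')


def fetch_text(url, timeout=10):
    """Download a URL and return its body as text (performs a network call)."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def crawl(list_url=TOP_LIST_URL, fetch=fetch_text):
    """Fetch the list, visit each movie URL, and extract synopsis and reviews."""
    results = []
    for movie in json.loads(fetch(list_url)):
        page = fetch(movie["url"])
        summary = SUMMARY_RE.search(page)
        results.append({
            "url": movie["url"],
            "summary": summary.group(0).strip() if summary else None,
            "reviews": [r.strip() for r in SHORT_RE.findall(page)],
        })
    return results


# Usage (needs network access):
# for entry in crawl():
#     print(entry["url"], entry["summary"], entry["reviews"])
```

A real run would probably also want a short pause between requests so as not to hit the site too quickly.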
Then we look at the crawled results directly.
The crawl results are as follows: