Crawling the Douban Movie Top 250 with a Python crawler

Python environment: Python 3.5

Let's first look at the web page.
[Screenshot: the Douban Movie Top 250 page]
Douban movie website link: https://movie.douban.com/top250
We will extract each movie's name, link, rating, number of reviewers, and one-sentence description.
1. Check and copy the movie's XPath information
[Screenshot: copying the XPath of the movie title]
The XPath information of the movie title is as follows:
//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]
Then follow the usual crawler code routine:

from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
title = s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
print(title)

Output result:
[Screenshot: program output]
The sixth line of code adds [0] at the end because xpath() returns a list. Without [0], the result would be printed as a list, which looks untidy.
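To see why the [0] matters without touching the network, here is a minimal sketch that runs the same kind of query on a small hand-written HTML snippet. The snippet and the names SAMPLE_HTML, first_as_list and first_title are assumptions made for illustration; the markup is much simpler than Douban's real page.

```python
from lxml import etree

# Hand-written stand-in markup (an assumption, not Douban's real page).
SAMPLE_HTML = """
<div id="content">
  <ol>
    <li><span>Movie A</span></li>
    <li><span>Movie B</span></li>
  </ol>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# xpath() returns a list, even when only one node matches.
first_as_list = tree.xpath('//*[@id="content"]/ol/li[1]/span/text()')
print(first_as_list)   # prints a one-element list: ['Movie A']

# Indexing with [0] takes the string out of the list.
first_title = first_as_list[0]
print(first_title)     # prints: Movie A
```

Note that [0] raises an IndexError when nothing matches, so it is only safe when the element is known to exist.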
2. Extract the names of different movies on the same page.
Using the same method as for "The Shawshank Redemption", compare the XPath information of "Farewell My Concubine", "Léon: The Professional", and "Forrest Gump":
[Screenshot: XPath information of the other movie names]
The XPath of each movie name differs only in the index after li, and that index matches the movie's position on the page. So after removing the index, we get a general XPath:

//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]

Then we crawl all the movie names on the page:

from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
title = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
for movies in title:
    print(movies)

Output results:
[Screenshot: program output]
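The effect of dropping the index after li can be checked offline with a small sketch. The HTML below is a hand-written stand-in (an assumption, not Douban's markup), and the variable names are only for illustration.

```python
from lxml import etree

# Hand-written stand-in markup with three list items.
SAMPLE_HTML = """
<div id="content">
  <ol>
    <li><span>Movie A</span></li>
    <li><span>Movie B</span></li>
    <li><span>Movie C</span></li>
  </ol>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# With an index, only that one <li> is selected (XPath indexes start at 1).
second_only = tree.xpath('//*[@id="content"]/ol/li[2]/span/text()')

# Without an index, every <li> under the <ol> is selected, in page order.
all_titles = tree.xpath('//*[@id="content"]/ol/li/span/text()')

print(second_only)
print(all_titles)
```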
A similar method is used below to extract the movie ratings:

from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
score = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
for i in score:
    print(i)

The output is:
[Screenshot: program output]
The next thing to do is to output each movie together with its rating:

from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
score = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
for i in range(25):
    print("{} {}".format(file[i],score[i]))

The output is:
[Screenshot: program output]
Here we assume that the movie names and ratings are all complete and in matching order. This assumption usually holds, but it is fragile: if one list comes back with fewer or more entries than the other, names and ratings will be paired incorrectly. So how can we avoid this error?
After some thought, we find that if we take each movie as the unit and extract its information from within that unit, the pairing is guaranteed to be correct.
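The mismatch can be illustrated with plain lists, no crawling needed. All data below is hypothetical, and zip() stands in for pairing by index. The first half shows two page-wide lists drifting apart when one movie lacks a field, and the second half shows that keeping each movie's fields together avoids the problem.

```python
# Hypothetical data: three movies, but the second one has no description,
# so a page-wide query would return only two descriptions.
names = ["Movie A", "Movie B", "Movie C"]
descriptions = ["About A", "About C"]

# Pairing by position attaches C's description to B and loses C entirely.
paired_by_index = list(zip(names, descriptions))
print(paired_by_index)

# Keeping each movie's fields together cannot drift apart.
records = [
    {"name": "Movie A", "description": "About A"},
    {"name": "Movie B", "description": None},
    {"name": "Movie C", "description": "About C"},
]
paired_by_record = [(r["name"], r["description"]) for r in records]
print(paired_by_record)
```

With file[i] and score[i] inside a fixed range(25) loop, the same situation would either pair the wrong values or raise an IndexError once the shorter list runs out.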
The tag holding the movie name must lie inside the block for that movie, so we look upward from the movie-name tag, find the tag that encloses the entire movie, and copy its XPath information:
[Screenshot: the tag enclosing the entire movie]
//*[@id="content"]/div/div[1]/ol/li[1]

Then we compare the XPath of the whole movie with the XPaths of the other information:

//*[@id="content"]/div/div[1]/ol/li[1]
//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]
//*[@id="content"]/div/div[1]/ol/li[2]/div/div[2]/div[2]/div/span[2]

It is not hard to see that the XPaths of the title and the rating begin with the same prefix as the XPath of the whole movie (up to the li index). So we can write the XPath lookups like this:

file = s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]')
for div in file:
    movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')
    movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')

Let's try it in actual code:

from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]')
for div in file:
    movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
    movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
    print("{} {}".format(movies_name,movies_score))

The output is:
[Screenshot: program output]
Above we crawled the information of a single movie, so how do we crawl the whole page? Simply remove the [1] after li. Let's take a look at the new code:

from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li')
for div in file:
    movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
    movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
    print("{} {}".format(movies_name,movies_score))

The result is:
[Screenshot: program output]
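The per-movie approach can also be tried offline. The sketch below uses a hand-written snippet (an assumption, far simpler than Douban's markup) in which the second movie deliberately has no rating. Because each relative query starts with "./" and is evaluated inside one li, the missing rating cannot shift the other pairs.

```python
from lxml import etree

# Hand-written stand-in markup; the second movie has no rating on purpose.
SAMPLE_HTML = """
<ol id="movies">
  <li><span class="title">Movie A</span><span class="score">9.0</span></li>
  <li><span class="title">Movie B</span></li>
  <li><span class="title">Movie C</span><span class="score">8.5</span></li>
</ol>
"""

tree = etree.HTML(SAMPLE_HTML)
rows = []
for li in tree.xpath('//*[@id="movies"]/li'):
    # A path starting with "./" is evaluated relative to this <li> only.
    name = li.xpath('./span[@class="title"]/text()')[0]
    score = li.xpath('./span[@class="score"]/text()')
    # An empty list means the field is missing for this movie.
    rows.append((name, score[0] if score else None))

for name, score in rows:
    print(name, score)
```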
The other fields are extracted in a similar way, so I won't go into details. Here is the code with all the fields:

from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li')
for div in file:
    movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
    movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
    movies_href = div.xpath('./div/div[2]/div[1]/a/@href')[0]
    movies_number = div.xpath('./div/div[2]/div[2]/div/span[4]/text()')[0].strip("(").strip().strip(")")
    movie_scrible = div.xpath('./div/div[2]/div[2]/p[2]/span/text()')
    print("{} {} {} {} {}".format(movies_name,movies_href,movies_score,movies_number,movie_scrible[0]))

The result is:
[Screenshot: program output]
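The chained strip("(").strip().strip(")") used for the number of reviewers is worth a closer look, because its result depends on the order of the calls. The sketch below compares it with a single strip() that removes parentheses and whitespace together. The input strings are hypothetical (the exact text on the live page is an assumption), and clean_chain and clean_set are illustrative names, not part of the original code.

```python
import string

def clean_chain(text):
    # Same chain as in the crawler: "(" first, then whitespace, then ")".
    return text.strip("(").strip().strip(")")

def clean_set(text):
    # One call that strips parentheses and whitespace in any order.
    return text.strip("()" + string.whitespace)

# Hypothetical raw values.
samples = ["(1000 votes)", "( 1000 votes )", "\n(1000 votes)\n"]
for raw in samples:
    print(repr(clean_chain(raw)), repr(clean_set(raw)))
```

For the first sample both versions give the same result. For the other two the chain leaves a stray space or parenthesis behind, while the single call still returns the bare text.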
In this way, we have extracted the information on the first page. So how do we extract all the pages? Compare the URLs of different pages:

First page: https://movie.douban.com/top250?start=0
Second page: https://movie.douban.com/top250?start=25
Third page: https://movie.douban.com/top250?start=50
Fourth page: https://movie.douban.com/top250?start=75
...
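The pattern in the URLs above can be written down as a short sketch that only builds the page URLs and sends no requests. The constant names are assumptions for illustration. The page count of 10 follows from 250 movies at 25 per page.

```python
BASE_URL = "https://movie.douban.com/top250?start={}"
PAGE_SIZE = 25    # movies per page, as seen in the URLs above
PAGE_COUNT = 10   # 250 movies / 25 per page

# start takes the values 0, 25, 50, ..., 225
page_urls = [BASE_URL.format(page * PAGE_SIZE) for page in range(PAGE_COUNT)]

for page_url in page_urls:
    print(page_url)
```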

The URL pattern is very simple: only the number after start= changes, increasing by 25 each time, so a loop is enough. Let's run the complete code below and extract the information from all 10 pages.

from lxml import etree
import requests
import time
# 10 pages of 25 movies each: start = 0, 25, ..., 225
for a in range(10):
    url = 'https://movie.douban.com/top250?start={}'.format(a*25)
    data = requests.get(url).text
    # print(data)
    s = etree.HTML(data)
    file = s.xpath('//*[@id="content"]/div/div[1]/ol/li')
    for div in file:
        movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
        movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
        movies_href = div.xpath('./div/div[2]/div[1]/a/@href')[0]
        movies_number = div.xpath('./div/div[2]/div[2]/div/span[4]/text()')[0].strip("(").strip().strip(")")
        movie_scrible = div.xpath('./div/div[2]/div[2]/p[2]/span/text()')
        # time.sleep(1)
        # Some movies have no one-sentence description, so check before indexing
        if len(movie_scrible)>0:
            print("{} {} {} {} {}".format(movies_name,movies_href,movies_score,movies_number,movie_scrible[0]))
        else:
            print("{} {} {} {}".format(movies_name,movies_href,movies_score,movies_number))

The result is:
[Screenshot: part of the program output]
This screenshot shows only part of the output; the full output contains 250 movies.

uvv

Origin blog.csdn.net/uukuvv/article/details/105471009