Python environment: python3.5
Let's first look at the web page.
Douban movie website link: https://movie.douban.com/top250
We will extract each movie's name, link, rating, number of reviewers, and one-sentence description.
1. Check and copy the movie's XPath information
For the first movie's title, the information is as follows:
//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]
Following the usual crawler routine, the code is:
from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
title = s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
print(title)
Output result:
The sixth line of code adds [0] at the end because xpath() returns a list. Without it, the title would be printed wrapped in a list instead of as a plain string.
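The point about [0] can be checked offline. The following is a minimal sketch that assumes a small, made-up HTML fragment (SAMPLE_HTML is hypothetical and much simpler than the real Douban page); it only illustrates that xpath() returns a list and that an empty result needs a guard before indexing.

```python
from lxml import etree

# Hypothetical stand-in for the real page; this markup is invented for illustration.
SAMPLE_HTML = """
<div id="content">
  <ol>
    <li><span class="title">Movie A</span></li>
    <li><span class="title">Movie B</span></li>
  </ol>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# A text() query returns a list, even when exactly one node matches.
matches = tree.xpath('//*[@id="content"]/ol/li[1]/span/text()')
print(matches)      # the list
print(matches[0])   # the string inside it

# When nothing matches, the list is empty, so check before indexing.
missing = tree.xpath('//*[@id="content"]/ol/li[9]/span/text()')
first_missing = missing[0] if missing else None
print(first_missing)
```

Indexing an empty result with [0] would raise IndexError, which is why the guard is worth having when a field may be absent.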
2. Extract the names of different movies on the same page.
Using the same method as for "The Shawshank Redemption", compare the XPath information of "Farewell My Concubine", "Léon: The Professional", and "Forrest Gump".
The XPath of each film name differs only in the index after li, and that index matches the movie's position on the page. Removing the index therefore gives a general XPath:
//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]
Then we crawl all the movie names on the page:
from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
title = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
for movies in title:
    print(movies)
Output results
The following uses a similar method to extract movie ratings
from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
score = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
for i in score:
    print(i)
The output is:
The next thing to do is to output each movie together with its rating:
from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
score = s.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
# Each page lists 25 movies; names and ratings are paired by position
for i in range(25):
    print("{} {}".format(file[i], score[i]))
The output is:
Here we assume by default that the movie names and ratings are both complete and in the same order. This assumption usually holds, but it is fragile: if one list ends up with fewer or more entries than the other, names and ratings will be paired incorrectly. So how can we avoid this error?
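To make the matching problem concrete, here is a minimal sketch on invented markup in which the second movie has no rating. The fragment, the id "list", and the class names are assumptions for illustration only and do not reflect the real Douban page.

```python
from lxml import etree

# Made-up markup: the second movie has no rating element.
SAMPLE_HTML = """
<ol id="list">
  <li><span class="title">Movie A</span><span class="rating">9.6</span></li>
  <li><span class="title">Movie B</span></li>
  <li><span class="title">Movie C</span><span class="rating">9.4</span></li>
</ol>
"""

tree = etree.HTML(SAMPLE_HTML)

# Two independent page-wide queries produce two independent flat lists.
titles = tree.xpath('//ol[@id="list"]/li/span[@class="title"]/text()')
ratings = tree.xpath('//ol[@id="list"]/li/span[@class="rating"]/text()')
print(len(titles), len(ratings))

# Pairing by position gives Movie B the rating that belongs to Movie C,
# and Movie C is dropped without any error being raised.
pairs = list(zip(titles, ratings))
for title, rating in pairs:
    print(title, rating)
```

Nothing fails loudly here, which is what makes positional pairing risky: the output looks plausible but is wrong.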
If we instead take each movie as the unit and extract its information from within that unit, the pairing is guaranteed to be correct.
The movie name's tag must sit inside the tag that frames the whole movie, so we walk up from the name's tag until we find the tag covering the entire movie, and copy its XPath:
//*[@id="content"]/div/div[1]/ol/li[1]
Then we compare the xPath information of the whole movie with other information
//*[@id="content"]/div/div[1]/ol/li[1]
//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]
//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[2]/div/span[2]
It is easy to see that the XPaths of the movie title and the rating both begin with the XPath of the entire movie. So we can first locate the movie, then locate each field relative to it, like this:
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]')
movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')
movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')
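The leading "./" in the two relative paths matters. As a minimal sketch on invented markup (the fragment and class names are assumptions, not the real page), the following compares a path starting with "./" against one starting with "//" when both are evaluated on an individual li element.

```python
from lxml import etree

# Made-up markup with two movies.
SAMPLE_HTML = """
<ol id="list">
  <li><span class="title">Movie A</span><span class="rating">9.6</span></li>
  <li><span class="title">Movie B</span><span class="rating">9.5</span></li>
</ol>
"""

tree = etree.HTML(SAMPLE_HTML)
items = tree.xpath('//ol[@id="list"]/li')

# "./" starts from the current <li>, so each lookup sees only that movie.
relative = [li.xpath('./span[@class="title"]/text()') for li in items]

# "//" starts from the document root even when called on an <li>,
# so every call returns the titles of all movies.
absolute = [li.xpath('//span[@class="title"]/text()') for li in items]

print(relative)
print(absolute)
```

Forgetting the dot is an easy mistake to make, and it reintroduces the page-wide lists that the per-movie approach is meant to avoid.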
Try it in the actual code
from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li[1]')
for div in file:
    movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
    movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
    print("{} {}".format(movies_name, movies_score))
The output result is:
Above we crawled the information for a single movie, so how do we crawl the whole page? Simply remove the [1] after li. Let's take a look at the new code:
from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li')
for div in file:
    movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
    movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
    print("{} {}".format(movies_name, movies_score))
The result is:
The other fields are extracted in a similar way, so I won't go into detail. Here is the code again:
from lxml import etree
import requests
url = 'https://movie.douban.com/top250'
data = requests.get(url).text
s = etree.HTML(data)
file = s.xpath('//*[@id="content"]/div/div[1]/ol/li')
for div in file:
    movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
    movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
    movies_href = div.xpath('./div/div[2]/div[1]/a/@href')[0]
    movies_number = div.xpath('./div/div[2]/div[2]/div/span[4]/text()')[0].strip("(").strip().strip(")")
    movie_scrible = div.xpath('./div/div[2]/div[2]/p[2]/span/text()')
    print("{} {} {} {} {}".format(movies_name, movies_href, movies_score, movies_number, movie_scrible[0]))
The result is:
In this way, we have extracted the information on the first page. So how can we extract all the pages? Compare the URLs of different pages:
First page: https://movie.douban.com/top250?start=0
Second page: https://movie.douban.com/top250?start=25
Third page: https://movie.douban.com/top250?start=50
Fourth page: https://movie.douban.com/top250?start=75
...
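The URLs listed above differ only in the value of start, which grows by 25 from one page to the next. As a minimal sketch, they can be generated without any network access; the names BASE_URL, PAGE_SIZE, TOTAL_PAGES, and page_urls are hypothetical, and the page count of 10 assumes 250 movies at 25 per page.

```python
BASE_URL = "https://movie.douban.com/top250?start={}"
PAGE_SIZE = 25    # movies per page, as seen in the URLs above
TOTAL_PAGES = 10  # assumed: 250 movies / 25 per page


def page_urls(total_pages=TOTAL_PAGES, page_size=PAGE_SIZE):
    """Return the URL of every page, in order."""
    return [BASE_URL.format(page * page_size) for page in range(total_pages)]


urls = page_urls()
print(urls[0])   # first page
print(urls[-1])  # last page
```

Building the list up front makes it easy to check the first and last URL before sending any requests.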
The URL pattern is simple: only the number after start= changes, and it increases by 25 per page, so a loop is enough. Let's run the complete code below and extract the information from all 10 pages.
from lxml import etree
import requests
import time
for a in range(10):
    # start = 0, 25, 50, ..., 225
    url = 'https://movie.douban.com/top250?start={}'.format(a * 25)
    data = requests.get(url).text
    # print(data)
    s = etree.HTML(data)
    file = s.xpath('//*[@id="content"]/div/div[1]/ol/li')
    for div in file:
        movies_name = div.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
        movies_score = div.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
        movies_href = div.xpath('./div/div[2]/div[1]/a/@href')[0]
        movies_number = div.xpath('./div/div[2]/div[2]/div/span[4]/text()')[0].strip("(").strip().strip(")")
        movie_scrible = div.xpath('./div/div[2]/div[2]/p[2]/span/text()')
        # time.sleep(1)
        # Some movies have no one-sentence description, so check before indexing
        if len(movie_scrible) > 0:
            print("{} {} {} {} {}".format(movies_name, movies_href, movies_score, movies_number, movie_scrible[0]))
        else:
            print("{} {} {} {}".format(movies_name, movies_href, movies_score, movies_number))
The result is:
This is only part of the screenshot; the full output contains 250 movies.
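The length check for the description and the chained strip() calls for the reviewer count could also be folded into small helpers. This is only a sketch of one possible tidy-up: first_or_default and parse_count are hypothetical names, and the sample string "(123456 reviews)" is invented, since the exact text of that field on the page is not shown here.

```python
import re


def first_or_default(values, default=""):
    """Return the first item of an xpath() result list, or default if it is empty."""
    return values[0] if values else default


def parse_count(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Made-up inputs standing in for xpath() results.
print(first_or_default(["a one-sentence description"]))
print(repr(first_or_default([])))
print(parse_count("(123456 reviews)"))
print(parse_count("no digits here"))
```

With helpers like these, the loop body would need only one print call, because a missing description becomes an empty string instead of an IndexError.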