After getting familiar with Python's requests and re libraries, you can try building a simple crawler. We will crawl the names, scores, cover images and other details of the Douban Top 250 movies: the site's page structure is relatively stable, and this crawl will not put a heavy load on Douban's servers.
I. Web Page Analysis
1. Page Overview
First, open the crawl target, the Douban Top 250, in your browser at https://movie.douban.com/top250?start=225&filter= , which gives the following screen.
Checking the Douban site's robots protocol, this path does not appear under any Disallow entry, indicating that the site does not restrict crawling it.
2. Matching Analysis
Next, press F12 to open Chrome's DevTools. We find that the complete content of the first movie (The Shawshank Redemption) sits inside a <div> tag whose class is "item", and every movie after it follows the same structure.
Looking at the source code to work out what to match, the figure below shows that the <div> with class "pic" stores the movie's ranking and the URL of its poster image.
We can therefore use non-greedy regular-expression matching, (.*?), to extract each movie's ranking and image URL; the matching must be non-greedy because the page source contains many movie entries. In the fragment below, the first (.*?) captures the ranking and the second (.*?) captures the image URL.
<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
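This fragment can be sanity-checked in isolation. The snippet below is a minimal, hand-built stand-in for one entry of the page (an assumption, not the real Douban source), just enough to exercise the two capture groups:

```python
import re

# Hypothetical snippet mimicking the page structure (not the real Douban source)
snippet = ('<div class="pic"><em class="">1</em>'
           '<a href="#"><img alt="poster" src="https://example.com/p1.jpg" class=""></a></div>')
pattern = '<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?'

# re.S lets .*? cross line breaks in the real page source
print(re.findall(pattern, snippet, re.S))  # [('1', 'https://example.com/p1.jpg')]
```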
Likewise, the movie name sits in the <span> tag whose class is "title" and the aliases in the <span> tag whose class is "other"; combining these with the expression above gives the following:
<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)
Next come the director, cast, year, country and genre; locating their tags extends the regular expression to:
<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)
</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?
Then come the star rating, score and number of raters; merging their tag positions into the expression gives:
<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)
</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?
class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?
Finally, we extract the one-line classic quote stored in the <span> tag whose class is "inq":
<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)
</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?
class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*
span class="inq"?>(.*?)</span>
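Before wiring the assembled expression into the crawler, it can be verified against a small hand-built snippet. The HTML below is an assumed stand-in that reproduces the tag skeleton of one movie entry, not the real page source:

```python
import re

regix = ('<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?'
         'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)'
         '</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?'
         'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?'
         'span class="inq"?>(.*?)</span>')

# Hand-built snippet with the same skeleton as one movie entry (hypothetical values)
html = ('<div class="pic"><em class="">1</em><img src="https://example.com/p.jpg" class="">'
        '<div class="info"><div class="hd"><span class="title">肖申克的救赎</span>'
        '<span class="other"> / 月黑高飞(港)</span></div>'
        '<div class="bd"><p class="">导演: 弗兰克·德拉邦特<br>1994 / 美国 / 犯罪 剧情</p>'
        '<div class="star"><span class="rating5-t"></span>'
        '<span class="rating_num" property="v:average">9.7</span>'
        '<span>1893209人评价</span></div>'
        '<span class="inq">希望让人自由。</span>')

res = re.findall(regix, html, re.S)
print(res[0][0], res[0][7])  # ranking and score of the single match
```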
II. Writing the Crawler
1. Fetching the Page
With the above analysis done, we have the core regular expression of this article; now we start writing code to fetch the page.
First import the relevant libraries, store the Douban Top 250 URL from above in the url variable, define the browser headers, and then call the requests library's get method to obtain the page source.
import requests
import re
import json
url = "https://movie.douban.com/top250?start=0&filter="
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}
response = requests.get(url, headers=headers)
text = response.text
2. Information Extraction
Next, store the regular expression above as a string and call the re library's findall function to match every substring that satisfies it.
regix = '<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?' \
        'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)' \
        '</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?' \
        'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?' \
        'span class="inq"?>(.*?)</span>'
res = re.findall(regix, text, re.S)
print(res)
The output shows each movie's ranking, cover URL, name, director and cast, score, number of raters and quote, as a list of tuples:
[('1',
'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg',
'肖申克的救赎',
' / 月黑高飞(港) / 刺激1995(台)',
'\n 导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...',
'\n 1994 / 美国 / 犯罪 剧情\n ',
'rating5-t',
'9.7',
'1893209人评价',
'希望让人自由。'),
...]
Since each image file requires a separate request, we define an image-download function here, calling Python's built-in open function to write the response content to a .jpg file.
# Define the image-download function
def down_image(url, name, headers):
    r = requests.get(url, headers=headers)
    # Save as "<main title>.jpg" under film_pic/ (the directory must exist)
    with open("film_pic/" + name.split('/')[0] + ".jpg", 'wb') as f:
        f.write(r.content)
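One caveat: open fails if the film_pic directory does not exist, so create it once before crawling. The sketch below also shows how the saved file name is derived, using a hypothetical movie name:

```python
import os

# down_image writes into film_pic/, which must already exist; this is idempotent
os.makedirs("film_pic", exist_ok=True)

# The file name is everything before the first '/' of the movie name,
# which separates the main title from its aliases
name = "肖申克的救赎 / 月黑高飞(港)"
print("film_pic/" + name.split('/')[0].strip() + ".jpg")  # film_pic/肖申克的救赎.jpg
```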
On this basis, we integrate the code above into a page-parsing function that fetches a page, extracts the information, processes it and outputs it. Here yield turns the function into a generator, which can hand back multiple values over the course of the call, a clear advantage over a single return.
# Define the page-parsing function
def parse_html(url):
    response = requests.get(url, headers=headers)
    text = response.text
    # Regex groups ([1: ranking 2: image] [3: name 4: aliases] [5: director 6: year/country/genre]
    # [7: star class 8: score 9: number of raters] [10: quote])
    regix = '<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?' \
            'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)' \
            '</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?' \
            'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?' \
            'span class="inq"?>(.*?)</span>'
    # Match all results
    res = re.findall(regix, text, re.S)
    for item in res:
        rank = item[0]
        down_image(item[1], item[2], headers=headers)
        name = item[2] + ' ' + re.sub(' ', '', item[3])
        actor = re.sub(' ', '', item[4].strip())
        year = item[5].split('/')[0].strip()
        country = item[5].split('/')[1].strip()
        tp = item[5].split('/')[2].strip()
        # The star class ('rating5-t', 'rating45-t', ...) encodes the star count
        tmp = [i for i in item[6] if i.isnumeric()]
        if len(tmp) == 1:
            score = tmp[0] + '星/' + item[7] + '分'
        else:
            score = tmp[0] + '星半/' + item[7] + '分'
        rev_num = item[8][:-3]
        inq = item[9]
        # Yield the result dictionary
        yield {
            '电影名称': name,
            '导演和演员': actor,
            '类型': tp,
            '年份': year,
            '国家': country,
            '评分': score,
            '排名': rank,
            '评价人数': rev_num,
            '评价': inq
        }
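The advantage of yield can be seen with a toy generator: values are produced one at a time as the caller iterates, rather than being collected into a list up front. The function below is a stand-in for parse_html, not part of the crawler itself:

```python
def toy_parser():
    # Stands in for parse_html: each yield hands one record to the caller
    for rank in range(1, 4):
        yield {'排名': str(rank)}

gen = toy_parser()
print(next(gen))                 # {'排名': '1'} — first record only
print([d['排名'] for d in gen])  # ['2', '3'] — the rest, on demand
```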
3. Saving the Data
The function above yields dictionaries, so we call the json library's dumps method to encode each dictionary as JSON and append it to the text file top250_douban_film.txt.
# Define the output function
def write_movies_file(item):
    with open('top250_douban_film.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
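The ensure_ascii=False argument matters here: without it, dumps escapes every Chinese character, which makes the output file unreadable. A quick comparison with a sample record:

```python
import json

record = {'电影名称': '肖申克的救赎', '评分': '5星/9.7分'}
print(json.dumps(record))                      # non-ASCII escaped: {"\u7535\u5f71...
print(json.dumps(record, ensure_ascii=False))  # {"电影名称": "肖申克的救赎", "评分": "5星/9.7分"}
```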
4. Looping Over Pages
The code above crawls only 25 entries in total. Clicking through to the next page and comparing URLs shows that each page's URL differs only in the trailing start parameter, which is always a multiple of 25.
Given this, a loop and string concatenation let us crawl all of the pages:
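The ten start values and the resulting page URLs can be previewed without sending any requests:

```python
# The start parameter takes ten values: 0, 25, ..., 225
offsets = list(range(0, 250, 25))
print(offsets)  # [0, 25, 50, 75, 100, 125, 150, 175, 200, 225]

urls = ['https://movie.douban.com/top250?start=' + str(o) + '&filter=' for o in offsets]
print(urls[-1])  # https://movie.douban.com/top250?start=225&filter=
```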
# Define the main function
def main():
    for offset in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
        for item in parse_html(url):
            print(item)
            write_movies_file(item)

if __name__ == '__main__':
    main()
The cover images and movie information we finally obtain look like this:
This concludes the Douban Top 250 crawling walkthrough; the complete crawler code can be obtained by replying "top250" on the official account. Of course, a single crawler will not make anyone proficient: proficiency comes from working through many cases and analyzing each problem and its concrete solution yourself.
To summarize the crawler above: first we analyzed the page structure and the robots protocol and built the matching regular expression; then we used the requests library to request the target page; next we extracted information from the page source with the re library and the regular expression; after that we stored the extracted information and images with the json library and the open function; and finally we used a loop and string concatenation to crawl the remaining pages.
However, on some sites such as Taobao or JD.com you will find that the information you want cannot be extracted from the page source, because those sites load their content dynamically, whereas the Douban movie site is a static page. Those techniques will be explained in later articles. For the basics covered earlier, refer to the following links:
Python web crawler data collection in practice: the basics
Python web crawler data collection in practice: the requests and re libraries