Python web crawler data collection in practice: crawling the Douban Top 250

Now that you are familiar with Python's requests and re libraries, you can try building a simple crawler. We will crawl the Douban Top 250 movie list, whose page structure is relatively stable and which places little load on the site's servers, extracting each film's name, rating, cover image, and other details.

Table of Contents

I. Web page analysis

    1. Page overview

    2. Pattern analysis

II. Writing the crawler

    1. Fetching the page

    2. Information extraction

    3. Saving the data

    4. Loop structure


 

I. Web page analysis

    1. Page overview

    First, enter the following URL in your browser to open the Douban Top 250 page we are going to crawl: https://movie.douban.com/top250?start=225&filter= , which brings up the movie list page.

    Checking the robots protocol on Douban's official site, this path does not appear under any Disallow rule, which indicates the site does not forbid crawling it.
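As a sanity check, the standard library's urllib.robotparser can evaluate robots.txt rules programmatically. A minimal sketch, using a hypothetical robots.txt excerpt rather than the site's actual file:

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt excerpt (not Douban's actual file)
# and check whether a given path may be fetched.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /subject_search",
])
print(rp.can_fetch("*", "https://movie.douban.com/top250"))          # True
print(rp.can_fetch("*", "https://movie.douban.com/subject_search"))  # False
```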

    2. Pattern analysis

    Next, press F12 to open Chrome's DevTools. You will find that the complete entry for the first film (i.e. The Shawshank Redemption) sits inside a <div> tag whose class is "item", and every film after it follows the same structure.

    Moving on to the information we want to match: looking at the source code, you can see that the div with class "pic" stores the film's ranking and the URL of its poster image.

    We can therefore use non-greedy regular expression matching to extract each film's ranking and picture URL. The non-greedy group (.*?) is needed because the page source contains many films, and a greedy match would swallow everything up to the last occurrence. In the code below, the first (.*?) captures the ranking and the second captures the picture URL.

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?
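To see the pattern in action, it can be run against a simplified HTML fragment modeled on the page structure (the fragment below is illustrative, not the live page source):

```python
import re

# Run the non-greedy pattern against a simplified fragment of the page.
html = '''
<div class="pic">
    <em class="">1</em>
    <img width="100" alt="肖申克的救赎" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" class="">
</div>
'''
pattern = '<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">'
print(re.findall(pattern, html, re.S))
# [('1', 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg')]
```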

    In the same way, the film's title sits in a <span> tag with class "title" and its alternate names in a <span> tag with class "other"; extending the regular expression accordingly gives the following:

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)

    Next come the positions of the director, cast, year, country, and genre labels, with the corresponding regular expression:

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?

    Then the positions of the rating and number-of-raters labels, with the consolidated regular expression:

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?

    Finally, we extract the one-line quote stored in the <span> tag with class "inq":

<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?span class="inq"?>(.*?)</span>

II. Writing the crawler

    1. Fetching the page

    With the analysis above done, we have the core regular expression of this article; now we can start writing code to fetch the page.

    First import the relevant libraries, store the Douban Top 250 URL in the url variable, define a browser header, and then call the requests library's get method to retrieve the page source.

import requests
import re
import json

url = "https://movie.douban.com/top250?start=0&filter="
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}
response = requests.get(url, headers=headers)
text = response.text

    2. Information Extraction

    Next, store the regular expression above as a string and call the re library's findall function to match all substrings that satisfy it.

regix = '<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?' \
        'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)' \
        '</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?' \
        'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?' \
        'span class="inq"?>(.*?)</span>'
res = re.findall(regix, text, re.S)
print(res)

    The output is a list of tuples, one per film, holding each film's ranking, cover URL, name, director and cast, rating, number of raters, and quote.

[('1',  'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg',  '肖申克的救赎',  '&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)',  '\n                            导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...',  '\n                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情\n                        ',  'rating5-t',  '9.7',  '1893209人评价',  '希望让人自由。'),

    Since the image files require separate requests, we define an image-download function here that uses Python's built-in open function to write the response content to a .jpg file.

# Image-download function (assumes the film_pic/ directory already exists)
def down_image(url, name, headers):
    r = requests.get(url, headers=headers)
    filename = re.search('/public/(.*?)$', url, re.S).group(1)
    with open("film_pic/" + name.split('/')[0] + ".jpg", 'wb') as f:
        f.write(r.content)
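The re.search call inside down_image pulls the final path segment after /public/ out of the poster URL; a quick check with a sample URL from the results above:

```python
import re

# Extract the filename portion of a poster URL, as down_image does.
url = "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg"
filename = re.search('/public/(.*?)$', url, re.S).group(1)
print(filename)  # p480747492.jpg
```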

    On this basis, we integrate the code above into a page-parsing function that fetches a complete page, extracts its information, processes it, and outputs it. The yield keyword is used here: a generator can return multiple values over the course of a call, which is a clear advantage over a single return.

# Page-parsing function
def parse_html(url):
    response = requests.get(url, headers=headers)
    text = response.text
    # Regular expression (groups: 1 ranking, 2 picture; 3 name, 4 aliases; 5 director, 6 year/country/genre; 7 stars, 8 rating, 9 raters; 10 quote)
    regix = '<div class="pic">.*?<em class="">(.*?)</em>.*?<img.*?src="(.*?)" class="">.*?' \
            'div class="info.*?class="hd".*?class="title">(.*?)</span>.*?class="other">(.*?)' \
            '</span>.*?<div class="bd">.*?<p class="">(.*?)<br>(.*?)</p>.*?' \
            'class="star.*?<span class="(.*?)"></span>.*?span class="rating_num".*?average">(.*?)</span>.*?<span>(.*?)</span>.*?' \
            'span class="inq"?>(.*?)</span>'
    # Match all entries
    res = re.findall(regix, text, re.S)
    for item in res:
        rank = item[0]
        down_image(item[1], item[2], headers=headers)
        name = item[2] + ' ' + re.sub('&nbsp;', '', item[3])
        actor = re.sub('&nbsp;', '', item[4].strip())
        year = item[5].split('/')[0].strip('&nbsp;').strip()
        country = item[5].split('/')[1].strip('&nbsp;').strip()
        tp = item[5].split('/')[2].strip('&nbsp;').strip()
        tmp = [i for i in item[6] if i.isnumeric()]
        if len(tmp) == 1:
            score = tmp[0] + '星/' + item[7] + '分'
        else:
            score = tmp[0] + '星半/' + item[7] + '分'
        rev_num = item[8][:-3]
        inq = item[9]
        # Yield a result dictionary
        yield {
            '电影名称': name,
            '导演和演员': actor,
            '类型': tp,
            '年份': year,
            '国家': country,
            '评分': score,
            '排名': rank,
            '评价人数': rev_num,
            '评价': inq
        }
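The star-count branch above relies on the captured CSS class (e.g. rating5-t or rating45-t) encoding the number of stars. A small sketch of that logic in isolation, with the helper name star_label being our own invention:

```python
# Turn a rating CSS class plus a numeric rating into the score string
# built inside parse_html; star_label is a hypothetical helper name.
def star_label(css_class, rating):
    digits = [c for c in css_class if c.isnumeric()]
    if len(digits) == 1:
        return digits[0] + '星/' + rating + '分'   # e.g. "rating5-t": whole stars
    return digits[0] + '星半/' + rating + '分'     # e.g. "rating45-t": a half star

print(star_label('rating5-t', '9.7'))   # 5星/9.7分
print(star_label('rating45-t', '9.6'))  # 4星半/9.6分
```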

    3. Save data

    The function above returns dictionaries, so we call the json library's dumps method to encode each dictionary as JSON and write it to the text file top250_douban_film.txt.

# Output function
def write_movies_file(item):
    with open('top250_douban_film.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
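The ensure_ascii=False argument matters here: it keeps the Chinese text readable in the output file instead of escaping it as \uXXXX sequences. A quick illustration with a made-up record:

```python
import json

# ensure_ascii=False writes Chinese characters verbatim rather than escaped.
record = {'电影名称': '肖申克的救赎', '评分': '9.7'}
print(json.dumps(record, ensure_ascii=False))  # {"电影名称": "肖申克的救赎", "评分": "9.7"}
print(json.dumps(record))                      # escaped \uXXXX form
```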

    4. cyclic structure

    The code above only crawls 25 entries in total. Clicking through the pages and comparing their URLs, we find that each page differs only in the start parameter at the end, which is always a multiple of 25.
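This observation can be checked by generating the candidate URLs directly; range(0, 250, 25) produces exactly the ten start offsets:

```python
# Generate the ten page URLs from the start-offset pattern.
base = 'https://movie.douban.com/top250?start={}&filter='
urls = [base.format(offset) for offset in range(0, 250, 25)]
print(len(urls))  # 10
print(urls[0])    # https://movie.douban.com/top250?start=0&filter=
print(urls[-1])   # https://movie.douban.com/top250?start=225&filter=
```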

    Given this, a loop plus string concatenation lets us crawl all the pages:

# Main function
def main():
    for offset in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start=' + str(offset) + '&filter='
        for item in parse_html(url):
            print(item)
            write_movies_file(item)

main()

    The cover images and movie information finally crawled are as follows:

    That concludes this hands-on crawl of the Douban Top 250; the complete crawler code can be obtained by replying "top250" to the public account. Of course, a single crawler will not make anyone proficient: mastery comes from revisiting cases repeatedly and working through the analysis ideas and concrete solutions.

    To summarize what the crawler above did: first we analyzed the page structure and the robots protocol and built the matching regular expression; then we used the requests library to request the target page; next we used the re library and the regular expression to extract information from the page source; after that we used the json library and the open function to store the extracted information and images; and finally we used a loop and string concatenation to crawl the remaining pages.

    However, for some sites such as Taobao or JD.com, you will find that the information you want cannot be extracted from the page source, because those sites load their content dynamically, whereas the Douban movie site serves static pages. Those techniques will be explained in later articles. For the basics covered earlier, see the following links:

Python web crawler data collection in practice: the basics

Python web crawler data collection in practice: the requests and re libraries


Origin blog.csdn.net/qq_36936730/article/details/104668162