Python Crawler 1: Scraping the Douban Top 250

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reproducing it, please attach the original source link and this statement.
This link: https://blog.csdn.net/weixin_44517301/article/details/100566947

Goal: scrape the Douban Top 250, save each movie's cover image locally, and save each movie's basic information (title, director, stars, year, rating, number of ratings, quote) to a txt file.
First, the analysis approach

Open the Douban Top 250 page in Chrome: https://movie.douban.com/top250. The first movie is The Shawshank Redemption, and its name, director, stars, year, rating, and number of ratings are the information we need. When a browser (or Python) sends a request to the server, HTML code is returned; the neat page we normally see in a browser is actually the result of the browser rendering that HTML. So we need to find where the information we want to capture sits in the HTML code. This is called HTML parsing, and there are many parsing tools, for example regular expressions, BeautifulSoup, XPath, and CSS selectors. The XPath method is used here.
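For example, here is a minimal sketch of fetching the page HTML with urllib (the User-Agent header is not part of the original steps, just a common precaution, since some sites reject requests that do not look like they come from a browser):

    from urllib import request

    # Build a request for the Douban Top 250 page; the User-Agent header is optional,
    # but some sites refuse requests that do not look like a browser.
    req = request.Request('https://movie.douban.com/top250',
                          headers={'User-Agent': 'Mozilla/5.0'})
    # Read the response bytes and decode them into an HTML string
    html = request.urlopen(req).read().decode('utf-8')
    print(html[:200])  # the first 200 characters of the raw HTML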

How do we find where a piece of information sits in the HTML? First, right-click the page and choose Inspect to open the HTML code of the current page. Click the element-picker arrow (arrow 1), then move the mouse over the information you want to locate, for example the movie name at arrow 2: The Shawshank Redemption. The panel on the right will then highlight where that information lives in the HTML code (arrow 3). The full HTML of a page looks overwhelming when expanded, but it is in fact structured layer by layer and very regular. Everything wrapped in a pair of angle brackets is a tag, e.g. the <span>...</span> at arrow 3 is a span tag: span is the tag name, class="title" is an attribute of the tag, and "肖申克的救赎" (The Shawshank Redemption) is the tag's content. The full path of this span tag in the HTML code is:

body/div[@id="wrapper"]/div[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]. Finding elements with XPath means following this path down layer by layer. [@id="wrapper"] restricts the match to tags with that attribute; adding attributes makes it much easier to locate an element without searching all the way from the beginning. div[1] and span[1] mean the first div tag and the first span tag found at that level. span[1]/text() extracts the tag's content, and span[1]/@class extracts the value of its class attribute.
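As a small illustration of text() and @class, here is a sketch that runs XPath on a tiny hand-written HTML snippet (not the real page, just the structure described above):

    from lxml import etree

    # A tiny hand-written snippet that mimics the structure of the title span
    html = etree.HTML('<div class="hd"><a><span class="title">肖申克的救赎</span></a></div>')
    # text() extracts the tag content, @class extracts the attribute value
    print(html.xpath('//div[@class="hd"]/a/span[1]/text()'))  # ['肖申克的救赎']
    print(html.xpath('//div[@class="hd"]/a/span[1]/@class'))  # ['title']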

To download a movie cover, we only need the image's URL; urllib.request.urlretrieve() can then download it directly. Finding the image link works the same way as above: move the mouse over the cover and the panel on the right will show where the link is.
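For example (a sketch only; the image URL and the save path are placeholders, the real cover URL is taken from the img tag's src attribute):

    from urllib import request

    # Download an image directly to a local file; pic_url is a placeholder here
    pic_url = 'https://example.com/cover.jpg'
    request.urlretrieve(pic_url, filename='F:/top250/1.jpg')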

How the URL changes from page to page: each page shows 25 movies, so the 250 movies span 10 pages. Looking at the first few page URLs, the pattern is easy to spot: the parameter after start changes and equals (page number - 1) * 25, and dropping the trailing filter parameter makes no difference (a short sketch of building the URLs follows the examples below).

Page 1: https://movie.douban.com/top250
Page 2: https://movie.douban.com/top250?start=25&filter=
Page 3: https://movie.douban.com/top250?start=50&filter=
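In code, that pattern amounts to the following sketch:

    # start = (page number - 1) * 25; the filter parameter can be dropped
    for i in range(10):
        url = 'https://movie.douban.com/top250?start=' + str(25 * i)
        print(url)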
Second, the code implementation

First send the request with urllib and get back the HTML source code. The returned HTML is a string; it has to be converted with etree.HTML() into an object that XPath can handle. Looking at the HTML code, each <li> tag corresponds to exactly one movie, so we first locate each li tag and then, inside each individual li tag, parse out that movie's information:

    data_title - movie title
    data_info - movie information (director, stars, year)
    data_quote - movie quote
    data_score - movie rating
    data_num - number of ratings
    data_picurl - movie cover image URL

    The code is as follows:

    from urllib import request
    from lxml import etree
    # Crawl the information on page i
    def crow(i):
        # Build the URL of page i
        url='https://movie.douban.com/top250?start='+str(25*i)
        # Send the request and save the returned HTML code in the variable html
        html=request.urlopen(url).read().decode('utf-8')
        # Convert the returned HTML string into an object that XPath can handle
        html=etree.HTML(html)
        # Locate the li tags first; datas is a list of 25 li elements, i.e. 25 movies
        datas = html.xpath('//ol[@class="grid_view"]/li')
        a=0
        for data in datas:
            data_title=data.xpath('div/div[2]/div[@class="hd"]/a/span[1]/text()')
            data_info=data.xpath('div/div[2]/div[@class="bd"]/p[1]/text()')
            data_quote=data.xpath('div/div[2]/div[@class="bd"]/p[2]/span/text()')
            data_score=data.xpath('div/div[2]/div[@class="bd"]/div/span[@class="rating_num"]/text()')
            data_num=data.xpath('div/div[2]/div[@class="bd"]/div/span[4]/text()')
            data_picurl=data.xpath('div/div[1]/a/img/@src')
            print("No: "+str(i*25+a+1))
            print(data_title)
            # Save the movie information to a txt file and download the cover image
            with open('douban250.txt','a',encoding='utf-8')as f:
                # Save path and file name for the cover image (the F:/top250 directory must already exist)
                picname='F:/top250/'+str(i*25+a+1)+'.jpg'
                f.write("No: "+str(i*25+a+1)+'\n')
                f.write(data_title[0]+'\n')
                f.write(str(data_info[0]).strip()+'\n')
                f.write(str(data_info[1]).strip()+'\n')
                # A few movies have no quote, so check first to avoid an IndexError
                if data_quote:
                    f.write(data_quote[0]+'\n')
                f.write(data_score[0]+'\n')
                f.write(data_num[0]+'\n')
                f.write('\n'*3)
                # Download the cover image to the local file picname
                request.urlretrieve(data_picurl[0],filename=picname)
            a+=1
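    # Crawl all 10 pages (i = 0..9), 25 movies per page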
    for i in range(10):
        crow(i)
    

