Hands-on Python web scraping for beginners: crawling the Maoyan Movies Top100 ranking

Original link: https://www.cnblogs.com/NFii/p/11576616.html

This article crawls the movie title, starring actors, and release time of every film on the Top100 list and saves them as an Excel spreadsheet. Other similar ranking lists can be crawled by following the same pattern.

First, open the URL to crawl, https://maoyan.com/board/4. Clicking "Next page" repeatedly shows that the URL changes in a regular pattern:

https://maoyan.com/board/4?offset=0
https://maoyan.com/board/4?offset=10
https://maoyan.com/board/4?offset=20

Across pages, only the number after offset changes, and it increases in multiples of 10.
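The paging rule above can be sketched in a few lines. This is only an illustration: the helper name build_page_urls and its parameters are made up here, and it assumes 10 movies per page, so the 100 movies span offsets 0 through 90.

```python
BASE_URL = "https://maoyan.com/board/4?offset="


def build_page_urls(pages=10, page_size=10):
    """Return one URL per page; the offset grows by page_size each time."""
    # Hypothetical helper: assumes 10 pages of 10 movies each.
    return [BASE_URL + str(page * page_size) for page in range(pages)]


page_urls = build_page_urls()
print(page_urls[0])   # first page
print(page_urls[-1])  # last page
```

Each URL in page_urls can then be passed to whatever function downloads a page.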

Python libraries used

1. requests -> request the pages
2. re -> match the content we want to extract
3. pandas -> give the content more structure and help us save it to a file

Start writing the crawler

  • Get the page source
import requests

base_url = 'https://maoyan.com/board/4?offset='
# Fake a request header; plenty of examples can be found online
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}


def get_every_page(url):
    result = requests.get(url, headers=headers)
    # Return the page source if the request succeeded
    if result.status_code == requests.codes.ok:
        return result.text
    return None
  • Analyze the page source, write a regular expression, and extract the main content
<dd>
    <i class="board-index board-index-20">20</i>
    <a href="/films/428" title="指环王3:王者无敌" class="image-link" data-act="boarditem-click" data-val="{movieId:428}">
        <img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
        <img data-src="https://p0.meituan.net/movie/932bdfbef5be3543e6b136246aeb99b8123736.jpg@160w_220h_1e_1c" alt="指环王3:王者无敌" class="board-img" />
    </a>
    <div class="board-item-main">
        <div class="board-item-content">
            <div class="movie-item-info">
                <p class="name"><a href="/films/428" title="指环王3:王者无敌" data-act="boarditem-click" data-val="{movieId:428}">指环王3:王者无敌</a></p>
                <p class="star">
                    主演:伊莱贾·伍德,伊恩·麦克莱恩,丽芙·泰勒
                </p>
                <p class="releasetime">上映时间:2004-03-15</p>
            </div>
            <div class="movie-item-number score-num">
                <p class="score">
                    <i class="integer">9.</i>
                    <i class="fraction">2</i>
                </p>
            </div>
        </div>
    </div>
</dd>

The returned source shows that each movie's information sits inside a <dd> tag. Following that pattern, we write the regular expression below to extract the movie title, the starring actors, and the release time.

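# filmname = []
# actor = []
# stime = []
html = get_every_page(url)
if html:
    # Extract the movie information.
    # Important: do not forget the re.S flag. Without it, "." does not match
    # the newlines inside each <dd> block and nothing is matched at all!
    data = re.findall('<dd.*?title="(.*?)".*?"star">(.*?)<.*?">(.*?)</p>', html, re.S)
    # Each item in data is a tuple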

The tuples that come out (taking one page as an example):

('霸王别姬', '\n                主演:张国荣,张丰毅,巩俐\n        ', '上映时间:1993-01-01')
('肖申克的救赎', '\n 主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿\n ', '上映时间:1994-09-10(加拿大)')
('罗马假日', '\n 主演:格利高里·派克,奥黛丽·赫本,埃迪·艾伯特\n ', '上映时间:1953-09-02(美国)')
('这个杀手不太冷', '\n 主演:让·雷诺,加里·奥德曼,娜塔莉·波特曼\n ', '上映时间:1994-09-14(法国)')
('泰坦尼克号', '\n 主演:莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩\n ', '上映时间:1998-04-03')
('唐伯虎点秋香', '\n 主演:周星驰,巩俐,郑佩佩\n ', '上映时间:1993-07-01(中国香港)')
('魂断蓝桥', '\n 主演:费雯·丽,罗伯特·泰勒,露塞尔·沃特森\n ', '上映时间:1940-05-17(美国)')
('乱世佳人', '\n 主演:费雯·丽,克拉克·盖博,奥利维娅·德哈维兰\n ', '上映时间:1939-12-15(美国)')
('天空之城', '\n 主演:寺田农,鹫尾真知子,龟山助清\n ', '上映时间:1992-05-01')
('辛德勒的名单', '\n 主演:连姆·尼森,拉尔夫·费因斯,本·金斯利\n ', '上映时间:1993-12-15(美国)')
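To see why the re.S flag matters, the article's pattern can be tried on a small sample. The snippet below is a sketch: SAMPLE_HTML is a trimmed-down, hand-written version of one <dd> block rather than a real response, and the variable names are illustrative.

```python
import re

# Pattern copied from the article.
PATTERN = '<dd.*?title="(.*?)".*?"star">(.*?)<.*?">(.*?)</p>'

# Assumption: a shortened <dd> block that keeps only the parts the pattern touches.
SAMPLE_HTML = """<dd>
    <a href="/films/428" title="指环王3:王者无敌" class="image-link"></a>
    <p class="star">
        主演:伊莱贾·伍德,伊恩·麦克莱恩,丽芙·泰勒
    </p>
    <p class="releasetime">上映时间:2004-03-15</p>
</dd>"""

# With re.S, "." also matches newlines, so the pattern can span the whole block.
with_flag = re.findall(PATTERN, SAMPLE_HTML, re.S)
# Without re.S, "." stops at each line break and no match is found.
without_flag = re.findall(PATTERN, SAMPLE_HTML)

print(with_flag)
print(without_flag)
```

On this sample the first call yields one tuple of three strings and the second yields an empty list, because the title, actors, and release time sit on different lines.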

The raw output is inconsistently formatted, so let's normalize it.

# data = re.findall('<dd.*?title="(.*?)".*?"star">(.*?)<.*?">(.*?)</p>', html, re.S)
# Strip whitespace and extra characters, then extract the title, the actors, and the release time
for i in data:
    filmname.append(i[0].strip())
    actor.append((i[1].strip())[3:])
    stime.append(i[2][5:].strip())
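The slice offsets 3 and 5 in the cleanup step correspond to the lengths of the label prefixes in the page text. The sketch below spells that out; the function clean_record and the two constants are illustrative names, not part of the original code.

```python
ACTOR_PREFIX = "主演:"        # "Starring:" label, 3 characters, hence the [3:] slice
RELEASE_PREFIX = "上映时间:"  # "Release time:" label, 5 characters, hence the [5:] slice


def clean_record(record):
    """Strip whitespace and drop the label prefixes from one matched tuple."""
    title, actors, release = record
    return (
        title.strip(),
        actors.strip()[len(ACTOR_PREFIX):],
        release[len(RELEASE_PREFIX):].strip(),
    )


# First tuple from the sample output shown earlier.
raw = ('霸王别姬', '\n                主演:张国荣,张丰毅,巩俐\n        ', '上映时间:1993-01-01')
cleaned = clean_record(raw)
print(cleaned)
```

Note that this cleanup leaves any region suffix on the release time, such as "(美国)", untouched, just as the original slicing does.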
  • Save the results to a file
    Convert the contents into a DataFrame object, then save it to a file
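    # Column names: movie title, starring actors, release time
    tdict = {'电影名': filmname, '主演': actor, '上映时间': stime}
    tdict = pd.DataFrame(tdict, index=range(1, 101))
    # File name means "Top100 movie ranking". The encoding argument is dropped
    # because recent pandas versions no longer accept it in to_excel.
    tdict.to_excel('Top100电影排行榜.xlsx')
    print(tdict)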
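If an Excel file is not strictly required, a similar table can be produced with only the standard library. The sketch below swaps pandas for the csv module, so it is an alternative rather than the article's method; the function name to_csv_text and the English column headers are assumptions made for illustration.

```python
import csv
import io


def to_csv_text(filmname, actor, stime):
    """Render the three parallel lists as CSV text with a 1-based rank column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "title", "actors", "release_time"])
    for rank, row in enumerate(zip(filmname, actor, stime), start=1):
        writer.writerow([rank, *row])
    return buffer.getvalue()


# Two cleaned rows taken from the sample output shown earlier.
sample_csv = to_csv_text(
    ["霸王别姬", "肖申克的救赎"],
    ["张国荣,张丰毅,巩俐", "蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿"],
    ["1993-01-01", "1994-09-10(加拿大)"],
)
print(sample_csv)
```

Because the actor field contains commas, the csv module quotes it automatically. The rank column is derived from the number of rows instead of a fixed range of 1 to 100, so the output stays consistent even if a page fails to download.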
  • The complete code
import requests
import re
import pandas as pd

base_url = 'https://maoyan.com/board/4?offset='
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}


def get_every_page(url):
    result = requests.get(url, headers=headers)
    if result.status_code == requests.codes.ok:
        return result.text
    return None


def main():
    filmname = []
    actor = []
    stime = []
    # Offsets 0, 10, ..., 90 cover the ten pages of the Top100 list
    for offset in range(0, 100, 10):
        url = base_url + str(offset)
        html = get_every_page(url)
        if html:
            data = re.findall('<dd.*?title="(.*?)".*?"star">(.*?)<.*?">(.*?)</p>', html, re.S)
            for i in data:
                filmname.append(i[0].strip())
                actor.append((i[1].strip())[3:])
                stime.append(i[2][5:].strip())
    # Column names: movie title, starring actors, release time
    tdict = {'电影名': filmname, '主演': actor, '上映时间': stime}
    tdict = pd.DataFrame(tdict, index=range(1, 101))
    # The encoding argument is dropped because recent pandas versions no longer accept it
    tdict.to_excel('Top100电影排行榜.xlsx')
    print(tdict)


main()
  • Successful test run

Opening the generated Top100 movie ranking spreadsheet shows that the results were written out perfectly. Nice!

(Screenshot of the Top 10)

Origin www.cnblogs.com/busishum/p/11713616.html