Python Web Scraping Study Notes - Day 2

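I. Regular Expressions

1. Introduction
   A regular expression is a powerful tool for processing strings, with its own specific syntax (the Chinese term 正则表达式 is also rendered 正规表示法, 正规表示式, or 常规表示法). It is a computer-science concept, usually used to search for and replace text that matches a given pattern.
   Using the regular-expression testing tool provided by OSChina (http://tool.oschina.net/regex/), enter the following text to match against:
   Hello, my name is Frank, my qq is 924406573 and email is [email protected], and my blog website is https://blog.csdn.net/weixin_42937385
   Clicking the item to look for on the right-hand side extracts the corresponding information from the text, and the regular-expression field fills with an odd-looking string. That string is the regular expression, in this case the one for email addresses: [\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?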
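
The same email pattern can also be tried directly in Python. The following is only a sketch: the address frank@example.com is a made-up stand-in (the address in the sample text above is not recoverable), and EMAIL_RE is a name chosen for this illustration.

```python
import re

# The email pattern quoted above, written as raw strings so the
# backslashes reach the regex engine unchanged.
EMAIL_RE = re.compile(
    r"[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?"
)

# Hypothetical sample text; the email address is invented for this example.
text = ("Hello, my name is Frank, my qq is 924406573 and email is "
        "frank@example.com, and my blog website is "
        "https://blog.csdn.net/weixin_42937385")

emails = EMAIL_RE.findall(text)
print(emails)
```

Because the pattern contains only non-capturing groups, findall() returns the full text of each match.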


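2. Common matching rules
  • \w matches a letter, digit, or underscore (in Python 3 string patterns this includes Unicode letters such as Chinese characters)
  • \W matches any character that is not a letter, digit, or underscore
  • \s matches any whitespace character, equivalent to [ \t\n\r\f\v]
  • \S matches any non-whitespace character; [\s\S] therefore matches any character at all (a catch-all)
  • \d matches any digit, equivalent to [0-9] for ASCII text
  • \D matches any non-digit character
  • \A matches the start of the string
  • \Z matches the end of the string; in Perl-style engines it also matches just before a trailing newline, whereas in Python's re module it matches only at the very end
  • \z matches only at the very end of the string, after any trailing newline (Python's re module has historically rejected this escape and uses \Z for that purpose)
  • \G matches the position where the previous match ended (not supported by Python's re module)
  • \n matches a newline
  • \t matches a tab
  • r'...' is a Python string prefix rather than a regex escape: prefixing a string literal with r makes it a raw string, which avoids errors from forgetting to double a backslash and is easier to read (the escape \r itself matches a carriage return)
  • ^ matches the start of the string (or of each line in multiline mode)
  • $ matches the end of the string (or of each line in multiline mode)
  • . matches any character except a newline
  • […] denotes a set of characters, listed individually; for example [mk231] matches m, k, 2, 3, or 1
  • [^…] matches any character not in the set; for example [^12] matches any character other than 1 and 2
  • * matches zero or more repetitions of the preceding expression
  • + matches one or more repetitions of the preceding expression
  • ? matches zero or one occurrence of the preceding expression; placed after another quantifier (as in *? or +?) it makes that quantifier non-greedy
  • {n} matches exactly n repetitions of the preceding expression
  • {n,m} matches n to m repetitions of the preceding expression, greedily
  • a|b matches either a or b, trying the left alternative first
  • () matches the expression inside the parentheses and also defines a group
  • (?P<name>...) gives a group a name
  • (?P=name) refers back to the text matched by the group named name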
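    Example: a URL can be matched with [a-zA-z]+://[^\s]*. Here [a-zA-z]+ matches the bracketed set one or more times. The set is meant to cover all English letters, but the range A-z also admits the punctuation characters that sit between Z and a in ASCII, so [a-zA-Z] is the stricter form. Next come one : and two /, and finally [^\s]* matches any non-whitespace character zero or more times.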
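
A few of the rules above are easier to see in a short experiment. The sketch below is illustrative only; the sample strings and variable names are invented here. It shows the A-z range pitfall, greedy versus non-greedy matching, and a named group with a backreference.

```python
import re

# The range A-z covers ASCII 65..122, which includes [ \ ] ^ _ and `
# in addition to the letters.
loose = re.compile(r'[a-zA-z]+')
strict = re.compile(r'[a-zA-Z]+')
loose_ok = loose.fullmatch('ht^tp') is not None    # True: ^ falls inside A-z
strict_ok = strict.fullmatch('ht^tp') is not None  # False: ^ is not a letter
print(loose_ok, strict_ok)

# Greedy vs. non-greedy: .* takes as much as possible, .*? as little as possible.
html = '<b>one</b><b>two</b>'
greedy = re.findall(r'<b>(.*)</b>', html)
lazy = re.findall(r'<b>(.*?)</b>', html)
print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']

# A named group and a backreference to it: find a doubled word.
m = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'this is is a test')
doubled = m.group('word')
print(doubled)  # is
```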


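3. Using the re module

Regular expressions in Python are normally used through a module named re. The following briefly introduces some of its functions.

a. The match() method
	re.match checks whether a string matches a regular expression, starting at the beginning of the string. If it matches, match() returns a match object; otherwise it returns None rather than an empty string. In other words, re.match() matches strings that begin with the given pattern.
	A match object has a group() method, which returns the matched part of the string.
	Its span() method returns the range (start and end indices) of the match.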
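import re

content = 'Hello, my name is Frank'
# A raw string passes \w to the regex engine unchanged.
result = re.match(r'^He\w\w\w', content)
print(result)          # the match object
print(result.group())  # the matched text
print(result.span())   # (start, end) indices of the match

Output: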

<re.Match object; span=(0, 5), match='Hello'>
Hello
(0, 5)
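
Since match() returns None when nothing matches, calling group() on the result without checking raises an AttributeError. A minimal sketch of the usual guard follows; the function name first_word_if_greeting is invented for this example.

```python
import re

def first_word_if_greeting(text):
    """Return the matched text if `text` starts with He plus three word characters, else None."""
    result = re.match(r'He\w\w\w', text)
    if result is None:
        # No match at the start of the string: nothing to call .group() on.
        return None
    return result.group()

hit = first_word_if_greeting('Hello, my name is Frank')
miss = first_word_if_greeting('Frank says Hello')
print(hit)   # Hello
print(miss)  # None
```
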
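b. The search() method
match() starts matching at the beginning of the string; if the beginning does not match, the whole match fails.
search() scans the entire string and returns the first successful match; if nothing is found after scanning the whole string, it returns None.
import re

content = 'Hello, my name is Frank'
result = re.match(r'F\w\w\w', content)    # None: the string does not start with F
print(result)
result2 = re.search(r'F\w\w\w', content)  # finds 'Fran' later in the string
print(result2)
result3 = re.search(r'U\w\w\w', content)  # None: there is no U in the string
print(result3)
Output: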

None
<re.Match object; span=(18, 22), match='Fran'>
None
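
search() combines naturally with a capture group when only part of the match is wanted. The sketch below pulls the QQ number out of a shortened version of the sample sentence from the introduction; the pattern is just one possible choice.

```python
import re

text = 'Hello, my name is Frank, my qq is 924406573'
m = re.search(r'qq is (\d+)', text)
if m:
    whole = m.group(0)   # the entire match
    qq = m.group(1)      # only the text captured by the parentheses
    print(whole)
    print(qq)
    print(m.span(1))     # where the captured digits sit in the string
```
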
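c. The findall() method
	To get every piece of text that matches a regular expression, the two methods above are not suitable. That is the job of findall(), which searches the whole string and returns all matches.
import re

content = 'Hello, my name is Frank'
result1 = re.match(r'\w+', content)    # first word only, anchored at the start
print(result1)
result2 = re.search(r'\w+', content)   # first word only
print(result2)
result3 = re.findall(r'\w+', content)  # every word, as a list of strings
print(result3)


Output: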

<re.Match object; span=(0, 5), match='Hello'>
<re.Match object; span=(0, 5), match='Hello'>
['Hello', 'my', 'name', 'is', 'Frank']
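
What findall() returns depends on the number of capture groups in the pattern, which matters for the scraper later in these notes. A small sketch with invented sample data:

```python
import re

content = 'Frank=24, Alice=31'  # made-up sample data

no_group = re.findall(r'\w+=\d+', content)        # no groups: whole matches
one_group = re.findall(r'(\w+)=\d+', content)     # one group: list of strings
two_groups = re.findall(r'(\w+)=(\d+)', content)  # several groups: list of tuples

print(no_group)    # ['Frank=24', 'Alice=31']
print(one_group)   # ['Frank', 'Alice']
print(two_groups)  # [('Frank', '24'), ('Alice', '31')]
```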

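d. The sub() method

Besides extracting information, regular expressions can also be used to modify text.
The sub() method replaces the text that was matched.
import re

content = 'Hello my name is Frank'
result = re.sub(r'\s', '_', content)  # replace each whitespace character with _
print(result)

Output: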
Hello_my_name_is_Frank
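
sub() can also reuse captured groups in the replacement and limit how many replacements it makes. The sketch below is illustrative; the date-reordering task is invented for this example.

```python
import re

date = '2019-02-28'
# \1, \2, \3 in the replacement refer to the three captured groups.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)
print(swapped)  # 28/02/2019

# count=1 stops after the first replacement.
once = re.sub(r'\s', '_', 'Hello my name is Frank', count=1)
print(once)     # Hello_my name is Frank
```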

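e. The compile() method

	compile() compiles a regular expression into a pattern object so that it can be reused in later matches.
	For example, to strip the time from three timestamps with sub(), the regular expression is the same each time, so there is no need to write it three times: compile it once into a pattern object and reuse it.
import re

content1 = '2019-02-28 20:00'
content2 = '2019-03-01 20:00'
content3 = '2019-03-02 20:00'

# Compile once, then reuse the same pattern object for every call.
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1)
print(result2)
print(result3)

Output: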

2019-02-28 
2019-03-01 
2019-03-02 
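
A compiled pattern also has its own methods, so pattern.sub(...) can be called directly. The results above keep a trailing space because only the time is removed; the variant sketched below (a suggestion, not the original code) absorbs the preceding whitespace too.

```python
import re

# \s* also consumes the whitespace in front of the time.
pattern = re.compile(r'\s*\d{2}:\d{2}')
contents = ['2019-02-28 20:00', '2019-03-01 20:00', '2019-03-02 20:00']

dates = [pattern.sub('', c) for c in contents]
print(dates)  # ['2019-02-28', '2019-03-01', '2019-03-02']
```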

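II. Scraping the Douban Movie Top 250

	Use the requests and re libraries to scrape the rank, title, country, and director of each film in the Douban Movie Top 250 (https://movie.douban.com/top250).
1. Analyzing the target
The target site is https://movie.douban.com/top250; opening it shows the ranking list.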


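	Scrolling to the bottom of the page reveals the pagination links. After clicking page 2, the URL changes to https://movie.douban.com/top250?start=25&filter=.
	Compared with the first page's URL, it gains the suffix "?start=25&filter=". Opening page 3 gives https://movie.douban.com/top250?start=50&filter=, where the suffix is now "?start=50&filter=". So each further page adds 25 to the value of "start=", and each page lists 25 films.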
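
The pagination rule above can be turned into a list of page URLs. This is a minimal sketch: the names BASE_URL, PAGE_SIZE, TOTAL, and page_urls are chosen here, and it assumes the first page also answers to start=0.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # films per page, as observed above
TOTAL = 250      # size of the list

def page_urls():
    """Build one URL per page: start=0, 25, 50, ..., 225."""
    return [f'{BASE_URL}?start={offset}&filter='
            for offset in range(0, TOTAL, PAGE_SIZE)]

urls = page_urls()
for url in urls[:3]:
    print(url)
```
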
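2. Fetching the first page in code
	To fetch the first page, define a get_one_page() function that takes a url parameter and returns the fetched page content, and call it from main(). The code is as follows:
import requests

def get_one_page(url):
    # Adjacent string literals are concatenated, so the long User-Agent can span lines.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) '
                             'AppleWebKit/534.50 (KHTML, like Gecko) '
                             'Version/5.1 Safari/534.50'
    }
    response = requests.get(url, headers=headers)
    # Return the HTML only on a 200 response; otherwise signal failure with None.
    if response.status_code == 200:
        return response.text
    return None

def main():
    url = 'https://movie.douban.com/top250'
    html = get_one_page(url)
    print(html)

main()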
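3. Extracting with the re library
Using Chrome's Inspect feature, click the small arrow at the top left of the developer-tools panel and then select the first film on the page; the panel jumps to that film's markup.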


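	The rank is shown inside '<em class="">1</em>'; a non-greedy match extracts it: <em class="">(.*?)</em>. The title is in the a node under class="hd" that follows, matched with <span class="title">[^&nbsp](.*?)</span>. Note that [^&nbsp] is a character class: it matches exactly one character other than &, n, b, s, or p, presumably to skip spans that begin with &nbsp;, but it also consumes the first character of the title and leaves it outside the capture group. The director is in the p node under class="bd", matched with <p class="">[\s\S](.*?)&nbsp. The country is in the same node as the director, matched with <br>[\s\S]*?&nbsp;/&nbsp;(.*?)&nbsp;/&nbsp.
	Combined, the expression is <em class="">(.*?)</em>[\s\S]*?<span class="title">[^&nbsp](.*?)</span>[\s\S]*?<p class="">[\s\S](.*?)&nbsp[\s\S]*?<br>[\s\S]*?&nbsp;/&nbsp;(.*?)&nbsp;/&nbsp
	This yields the contents of the first page. The code is as follows: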
import requests
import re

def get_one_page(url):
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 14_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None

def parse_one_page(html):
    pattern = re.compile(
            r'<em class="">(.*?)</em>[\s\S]*?<span class="title">[^&nbsp](.*?)</span>[\s\S]*?<p class="">[\s\S](.*?)&nbsp[\s\S]*?<br>[\s\S]*?&nbsp;/&nbsp;(.*?)&nbsp;/&nbsp')
    # findall returns one tuple per film: (rank, title, director line, country).
    # The result must be returned, otherwise main() prints None.
    return re.findall(pattern, html)

def main():
    url = 'https://movie.douban.com/top250'
    html = get_one_page(url)
    lis = parse_one_page(html)
    print(lis)
main()


Output:
[('1', '申克的救赎', '                            导演: 弗兰克·德拉邦特 Frank Darabont', '美国'), ('2', '王别姬', '                            导演: 陈凯歌 Kaige Chen', '中国大陆 香港'), ('3', '个杀手不太冷', '                            导演: 吕克·贝松 Luc Besson', '法国'), ('4', '甘正传', '                            导演: 罗伯特·泽米吉斯 Robert Zemeckis', '美国'), ('5', '丽人生', '                            导演: 罗伯托·贝尼尼 Roberto Benigni', '意大利'), ('6', '坦尼克号', '                            导演: 詹姆斯·卡梅隆 James Cameron', '美国'), ('7', '与千寻', '                            导演: 宫崎骏 Hayao Miyazaki', '日本'), ('8', '德勒的名单', '                            导演: 史蒂文·斯皮尔伯格 Steven Spielberg', '美国'), ('9', '梦空间', '                            导演: 克里斯托弗·诺兰 Christopher Nolan', '美国 英国'), ('10', '犬八公的故事', '                            导演: 莱塞·霍尔斯道姆 Lasse Hallström', '美国 英国'), ('11', '器人总动员', '                            导演: 安德鲁·斯坦顿 Andrew Stanton', '美国'), ('12', '傻大闹宝莱坞', '                            导演: 拉库马·希拉尼 Rajkumar Hirani', '印度'), ('13', '上钢琴师', '                            导演: 朱塞佩·托纳多雷 Giuseppe Tornatore', '意大利'), ('14', '牛班的春天', '                            导演: 克里斯托夫·巴拉蒂 Christophe Barratier', '法国 瑞士 德国'), ('15', '门的世界', '                            导演: 彼得·威尔 Peter Weir', '美国'), ('16', '话西游之大圣娶亲', '                            导演: 刘镇伟 Jeffrey Lau', '香港 中国大陆'), ('17', '际穿越', '                            导演: 克里斯托弗·诺兰 Christopher Nolan', '美国 英国 加拿大 冰岛'), ('18', '猫', '                            导演: 宫崎骏 Hayao Miyazaki', '日本'), ('19', '父', '                            导演: 弗朗西斯·福特·科波拉 Francis Ford Coppola', '美国'), ('20', '炉', '                            导演: 黄东赫 Dong-hyuk Hwang', '韩国'), ('21', '间道', '                            导演: 刘伟强 / 麦兆辉', '香港'), ('22', '狂动物城', '                            导演: 拜伦·霍华德 Byron Howard / 瑞奇·摩尔 Rich Moore', '美国'), ('23', '幸福来敲门', '                            导演: 加布里尔·穆奇诺 Gabriele Muccino', '美国'), ('24', '然心动', '                            导演: 罗伯·莱纳 Rob Reiner', '美国'), ('25', '不可及', '                            导演: 
奥利维·那卡什 Olivier Nakache / 艾力克·托兰达 Eric Toledano', '法国')]
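
Every title in the output above has lost its first character (for instance 申克的救赎, where the film is 肖申克的救赎). This is the effect of the character class [^&nbsp], which consumes one character before the capture group starts. One possible fix, sketched below, is a negative lookahead, which checks that the span does not begin with &nbsp; without consuming anything. The sample markup is a simplified, hand-written stand-in for the page's HTML, not a copy of it.

```python
import re

# Simplified stand-in for one entry's title markup (an assumption for this example).
sample = (
    '<span class="title">肖申克的救赎</span>\n'
    '<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>\n'
)

# Pattern from the notes: [^&nbsp] eats the first character of the title.
original = re.findall(r'<span class="title">[^&nbsp](.*?)</span>', sample)

# Lookahead variant: (?!&nbsp;) rejects the alternate-title span but consumes nothing.
fixed = re.findall(r'<span class="title">(?!&nbsp;)(.*?)</span>', sample)

print(original)  # ['申克的救赎']
print(fixed)     # ['肖申克的救赎']
```
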
4. Improving the code

The output above is rather messy. The improved code is as follows:



import requests
import re
import json
import time
import random
from multiprocessing import Pool

MAXSLEEPTIME = 3
MINSLEEPTIME = 1
STATUS_OK = 200
MAX_PAGE_NUM = 10
# The four range constants below are defined but not used in this script.
SERVER_ERROR_MIN = 500
SERVER_ERROR_MAX = 600
CLIENT_ERROR_MIN = 400
CLIENT_ERROR_MAX = 500

def get_one_page(url):
    headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 14_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == STATUS_OK:
        return response.text
    return None

def parse_one_page(html):
    # Note: [^&nbsp] consumes the first character of each title.
    pattern = re.compile(
            r'<em class="">(.*?)</em>[\s\S]*?<span class="title">[^&nbsp](.*?)</span>[\s\S]*?<p class="">[\s\S](.*?)&nbsp[\s\S]*?<br>[\s\S]*?&nbsp;/&nbsp;(.*?)&nbsp;/&nbsp')
    items = re.findall(pattern, html)
    for item in items:
        yield {
                'Top': item[0],
                'title': item[1],
                # Despite the key name, this holds the director line; [3:] drops the leading "导演:".
                'actor': item[2].strip()[3:],
                'countries': item[3]
              }

def write_to_file(item):
    # Append one JSON object per line; ensure_ascii=False keeps Chinese text readable.
    with open("豆瓣.txt", 'a', encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False)+'\n')

def crawl_one_page(offset):
    url = 'https://movie.douban.com/top250?start='+str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed; skip this page instead of passing None to the parser.
        return
    for item in parse_one_page(html):
        write_to_file(item)
    # Pause for a random 1 to 3 seconds to avoid hammering the site.
    time.sleep(random.randint(MINSLEEPTIME,MAXSLEEPTIME))

if __name__ == "__main__":
    pool = Pool(20)
    pool.map(crawl_one_page, [i*25 for i in range(MAX_PAGE_NUM)])
    pool.close()
    pool.join()

Output:
{"Top": "101", "title": "剧之王", "actor": " 周星驰 Stephen Chow / 李力持 Lik-Chi Lee", "countries": "香港"}
{"Top": "102", "title": "一", "actor": " 杨德昌 Edward Yang", "countries": "台湾 日本"}
{"Top": "103", "title": "失的爱人", "actor": " 大卫·芬奇 David Fincher", "countries": "美国"}
{"Top": "104", "title": "雕英雄传之东成西就", "actor": " 刘镇伟 Jeffrey Lau", "countries": "香港"}
{"Top": "105", "title": "光姐妹淘", "actor": " 姜炯哲 Hyeong-Cheol Kang", "countries": "韩国"}
{"Top": "106", "title": "蜜蜜", "actor": " 陈可辛 Peter Chan", "countries": "香港"}
{"Top": "107", "title": "在黎明破晓前", "actor": " 理查德·林克莱特 Richard Linklater", "countries": "美国 奥地利 瑞士"}
{"Top": "108", "title": "森林 夏秋篇", "actor": " 森淳一 Junichi Mori", "countries": "日本"}
{"Top": "109", "title": "耳倾听", "actor": " 近藤喜文 Yoshifumi Kondo", "countries": "日本"}
{"Top": "110", "title": "辣椒", "actor": " 今敏 Satoshi Kon", "countries": "日本"}
{"Top": "111", "title": "女幽魂", "actor": " 程小东 Siu-Tung Ching", "countries": "香港"}
{"Top": "112", "title": "怖直播", "actor": " 金秉祐 Byeong-woo Kim", "countries": "韩国"}
{"Top": "113", "title": "帝之城", "actor": " Kátia Lund / Fernando Meirelles", "countries": "巴西 法国"}
{"Top": "114", "title": "之谷", "actor": " 宫崎骏 Hayao Miyazaki", "countries": "日本"}
{"Top": "115", "title": "以你的名字呼唤我", "actor": " 卢卡·瓜达尼诺 Luca Guadagnino", "countries": "意大利 法国 巴西 美国 荷兰 德国"}
{"Top": "116", "title": "脱", "actor": " 托尼·凯耶 Tony Kaye", "countries": "美国"}
{"Top": "117", "title": "龙高手", "actor": " 迪恩·德布洛斯 Dean DeBlois / 克里斯·桑德斯 Chris Sanders", "countries": "美国"}
{"Top": "118", "title": "在日落黄昏时", "actor": " 理查德·林克莱特 Richard Linklater", "countries": "美国"}
{"Top": "119", "title": "次郎的夏天", "actor": " 北野武 Takeshi Kitano", "countries": "日本"}
{"Top": "120", "title": "福终点站", "actor": " 史蒂文·斯皮尔伯格 Steven Spielberg", "countries": "美国"}
{"Top": "121", "title": "利·波特与死亡圣器(下)", "actor": " 大卫·叶茨 David Yates", "countries": "美国 英国"}
{"Top": "122", "title": "人回忆", "actor": " 奉俊昊 Joon-ho Bong", "countries": "韩国"}
{"Top": "123", "title": "森林 冬春篇", "actor": " 森淳一 Junichi Mori", "countries": "日本"}
{"Top": "124", "title": "偷奶爸", "actor": " 皮艾尔·柯芬 Pierre Coffin / 克里斯·雷纳德 Chris Renaud", "countries": "美国 法国"}
{"Top": "125", "title": "东西的小人阿莉埃蒂", "actor": " 米林宏昌 Hiromasa Yonebayashi", "countries": "日本"}
{"Top": "26", "title": "世佳人", "actor": " 维克多·弗莱明 Victor Fleming / 乔治·库克 George Cukor", "countries": "美国"}
{"Top": "27", "title": "蝠侠:黑暗骑士", "actor": " 克里斯托弗·诺兰 Christopher Nolan", "countries": "美国 英国"}
{"Top": "28", "title": "着", "actor": " 张艺谋 Yimou Zhang", "countries": "中国大陆 香港"}
{"Top": "29", "title": "年派的奇幻漂流", "actor": " 李安 Ang Lee", "countries": "美国 台湾 英国 加拿大"}
{"Top": "30", "title": "堂电影院", "actor": " 朱塞佩·托纳多雷 Giuseppe Tornatore", "countries": "意大利 法国"}
{"Top": "31", "title": "子来了", "actor": " 姜文 Wen Jiang", "countries": "中国大陆"}
{"Top": "32", "title": "方证人", "actor": " 比利·怀尔德 Billy Wilder", "countries": "美国"}
{"Top": "1", "title": "申克的救赎", "actor": " 弗兰克·德拉邦特 Frank Darabont", "countries": "美国"}
{"Top": "33", "title": "二怒汉", "actor": " Sidney Lumet", "countries": "美国"}
{"Top": "2", "title": "王别姬", "actor": " 陈凯歌 Kaige Chen", "countries": "中国大陆 香港"}
{"Top": "34", "title": "环王3:王者无敌", "actor": " 彼得·杰克逊 Peter Jackson", "countries": "美国 新西兰"}
{"Top": "3", "title": "个杀手不太冷", "actor": " 吕克·贝松 Luc Besson", "countries": "法国"}
{"Top": "35", "title": "空之城", "actor": " 宫崎骏 Hayao Miyazaki", "countries": "日本"}
{"Top": "4", "title": "甘正传", "actor": " 罗伯特·泽米吉斯 Robert Zemeckis", "countries": "美国"}
{"Top": "36", "title": "屋环游记", "actor": " 彼特·道格特 Pete Docter / 鲍勃·彼德森 Bob Peterson", "countries": "美国"}
{"Top": "37", "title": "跤吧!爸爸", "actor": " 涅提·蒂瓦里 Nitesh Tiwari", "countries": "印度"}
{"Top": "5", "title": "丽人生", "actor": " 罗伯托·贝尼尼 Roberto Benigni", "countries": "意大利"}
{"Top": "38", "title": "击俱乐部", "actor": " 大卫·芬奇 David Fincher", "countries": "美国 德国"}
{"Top": "6", "title": "坦尼克号", "actor": " 詹姆斯·卡梅隆 James Cameron", "countries": "美国"}
{"Top": "39", "title": "话西游之月光宝盒", "actor": " 刘镇伟 Jeffrey Lau", "countries": "香港 中国大陆"}
{"Top": "7", "title": "与千寻", "actor": " 宫崎骏 Hayao Miyazaki", "countries": "日本"}
{"Top": "40", "title": "马假日", "actor": " 威廉·惠勒 William Wyler", "countries": "美国"}
{"Top": "8", "title": "德勒的名单", "actor": " 史蒂文·斯皮尔伯格 Steven Spielberg", "countries": "美国"}
{"Top": "41", "title": "尔的移动城堡", "actor": " 宫崎骏 Hayao Miyazaki", "countries": "日本"}
{"Top": "9", "title": "梦空间", "actor": " 克里斯托弗·诺兰 Christopher Nolan", "countries": "美国 英国"}
{"Top": "42", "title": "香识女人", "actor": " 马丁·布莱斯 Martin Brest", "countries": "美国"}
{"Top": "10", "title": "犬八公的故事", "actor": " 莱塞·霍尔斯道姆 Lasse Hallström", "countries": "美国 英国"}
{"Top": "43", "title": "听风暴", "actor": " 弗洛里安·亨克尔·冯·多纳斯马尔克 Florian Henckel von Donnersmarck", "countries": "德国"}
{"Top": "11", "title": "器人总动员", "actor": " 安德鲁·斯坦顿 Andrew Stanton", "countries": "美国"}
{"Top": "44", "title": "护人", "actor": " 杨宇硕 Woo-seok Yang", "countries": "韩国"}
{"Top": "12", "title": "傻大闹宝莱坞", "actor": " 拉库马·希拉尼 Rajkumar Hirani", "countries": "印度"}
{"Top": "45", "title": "杆大烟枪", "actor": " Guy Ritchie", "countries": "英国"}
{"Top": "13", "title": "上钢琴师", "actor": " 朱塞佩·托纳多雷 Giuseppe Tornatore", "countries": "意大利"}
{"Top": "46", "title": "越疯人院", "actor": " 米洛斯·福尔曼 Miloš Forman", "countries": "美国"}
{"Top": "176", "title": "断蓝桥", "actor": " 茂文·勒鲁瓦 Mervyn LeRoy", "countries": "美国"}
{"Top": "14", "title": "牛班的春天", "actor": " 克里斯托夫·巴拉蒂 Christophe Barratier", "countries": "法国 瑞士 德国"}
{"Top": "47", "title": "亡诗社", "actor": " 彼得·威尔 Peter Weir", "countries": "美国"}
{"Top": "15", "title": "门的世界", "actor": " 彼得·威尔 Peter Weir", "countries": "美国"}
{"Top": "177", "title": "火车", "actor": " 丹尼·博伊尔 Danny Boyle", "countries": "英国"}
{"Top": "48", "title": "代皇帝", "actor": " 贝纳尔多·贝托鲁奇 Bernardo Bertolucci", "countries": "英国 意大利 中国大陆 法国 美国"}
{"Top": "16", "title": "话西游之大圣娶亲", "actor": " 刘镇伟 Jeffrey Lau", "countries": "香港 中国大陆"}
{"Top": "178", "title": "麻的部屋", "actor": " 今敏 Satoshi Kon", "countries": "日本"}
{"Top": "49", "title": "字仇杀队", "actor": " 詹姆斯·麦克特格 James McTeigue", "countries": "美国 英国 德国"}
{"Top": "17", "title": "际穿越", "actor": " 克里斯托弗·诺兰 Christopher Nolan", "countries": "美国 英国 加拿大 冰岛"}
{"Top": "179", "title": "仿游戏", "actor": " 莫滕·泰杜姆 Morten Tyldum", "countries": "英国 美国"}
{"Top": "18", "title": "猫", "actor": " 宫崎骏 Hayao Miyazaki", "countries": "日本"}
{"Top": "50", "title": "环王2:双塔奇兵", "actor": " 彼得·杰克逊 Peter Jackson", "countries": "美国 新西兰"}
{"Top": "180", "title": "个叫欧维的男人决定去死", "actor": " 汉内斯·赫尔姆 Hannes Holm", "countries": "瑞典"}
{"Top": "181", "title": "块广告牌", "actor": " 马丁·麦克唐纳 Martin McDonagh", "countries": "美国 英国"}
{"Top": "19", "title": "父", "actor": " 弗朗西斯·福特·科波拉 Francis Ford Coppola", "countries": "美国"}
{"Top": "182", "title": "生门", "actor": " 黑泽明 Akira Kurosawa", "countries": "日本"}
{"Top": "51", "title": "父2", "actor": " 弗朗西斯·福特·科波拉 Francis Ford Coppola", "countries": "美国"}
{"Top": "20", "title": "炉", "actor": " 黄东赫 Dong-hyuk Hwang", "countries": "韩国"}
{"Top": "52", "title": "媛", "actor": " 李濬益 Jun-ik Lee", "countries": "韩国"}
{"Top": "183", "title": "间", "actor": " 伦尼·阿伯拉罕森 Lenny Abrahamson", "countries": "爱尔兰 加拿大 英国 美国"}
{"Top": "21", "title": "间道", "actor": " 刘伟强 / 麦兆辉", "countries": "香港"}
{"Top": "53", "title": "环王1:魔戒再现", "actor": " 彼得·杰克逊 Peter Jackson", "countries": "新西兰 美国"}
{"Top": "184", "title": "美陌生人", "actor": " 保罗·格诺维瑟 Paolo Genovese", "countries": "意大利"}
{"Top": "185", "title": "犬八公物语", "actor": " Seijirô Kôyama", "countries": "日本"}
{"Top": "22", "title": "狂动物城", "actor": " 拜伦·霍华德 Byron Howard / 瑞奇·摩尔 Rich Moore", "countries": "美国"}
{"Top": "54", "title": "豚湾", "actor": " 路易·西霍尤斯 Louie Psihoyos", "countries": "美国"}
{"Top": "186", "title": "怖游轮", "actor": " 克里斯托弗·史密斯 Christopher Smith", "countries": "英国 澳大利亚"}
{"Top": "55", "title": "食男女", "actor": " 李安 Ang Lee", "countries": "台湾 美国"}
{"Top": "187", "title": "飞正传", "actor": " 王家卫 Kar Wai Wong", "countries": "香港"}
{"Top": "23", "title": "幸福来敲门", "actor": " 加布里尔·穆奇诺 Gabriele Muccino", "countries": "美国"}
{"Top": "56", "title": "丽心灵", "actor": " 朗·霍华德 Ron Howard", "countries": "美国"}
{"Top": "188", "title": "女宅急便", "actor": " 宫崎骏 Hayao Miyazaki", "countries": "日本"}
{"Top": "24", "title": "然心动", "actor": " 罗伯·莱纳 Rob Reiner", "countries": "美国"}
{"Top": "57", "title": "子王", "actor": " Roger Allers / 罗伯·明可夫 Rob Minkoff", "countries": "美国"}
{"Top": "25", "title": "不可及", "actor": " 奥利维·那卡什 Olivier Nakache / 艾力克·托兰达 Eric Toledano", "countries": "法国"}
{"Top": "189", "title": "水", "actor": " 汤姆·提克威 Tom Tykwer", "countries": "德国 法国 西班牙 美国"}
{"Top": "58", "title": "书", "actor": " 岩井俊二 Shunji Iwai", "countries": "日本"}
{"Top": "190", "title": "潮", "actor": " 丹尼斯·甘塞尔 Dennis Gansel", "countries": "德国"}
{"Top": "59", "title": "梦环游记", "actor": " 李·昂克里奇 Lee Unkrich / 阿德里安·莫利纳 Adrian Molina", "countries": "美国"}
{"Top": "191", "title": "读者", "actor": " 史蒂芬·戴德利 Stephen Daldry", "countries": "美国 德国"}
{"Top": "60", "title": "琴家", "actor": " 罗曼·波兰斯基 Roman Polanski", "countries": "法国 德国 英国 波兰"}
{"Top": "61", "title": "国往事", "actor": " 赛尔乔·莱翁内 Sergio Leone", "countries": "意大利 美国"}
{"Top": "192", "title": "吒闹海", "actor": " 严定宪 Dingxian Yan / 王树忱 Shuchen Wang", "countries": "中国大陆"}
{"Top": "62", "title": "杰明·巴顿奇事", "actor": " 大卫·芬奇 David Fincher", "countries": "美国"}
{"Top": "193", "title": "可西里", "actor": " 陆川 Chuan Lu", "countries": "中国大陆 香港"}
{"Top": "63", "title": "鞋子", "actor": " 马基德·马基迪 Majid Majidi", "countries": "伊朗"}
{"Top": "194", "title": "客帝国3:矩阵革命", "actor": " Andy Wachowski / Larry Wachowski", "countries": "美国 澳大利亚"}
{"Top": "64", "title": "客帝国", "actor": " 安迪·沃卓斯基 Andy Wachowski / 拉娜·沃卓斯基 Lana Wachowski", "countries": "美国 澳大利亚"}
{"Top": "195", "title": "街日记", "actor": " 是枝裕和 Hirokazu Koreeda", "countries": "日本"}
{"Top": "196", "title": "争之王", "actor": " 安德鲁·尼科尔 Andrew Niccol", "countries": "美国 法国"}
{"Top": "65", "title": "西里的美丽传说", "actor": " 朱塞佩·托纳多雷 Giuseppe Tornatore", "countries": "意大利 美国"}
{"Top": "197", "title": "影重重", "actor": " 道格·里曼 Doug Liman", "countries": "美国 德国 捷克"}
{"Top": "66", "title": "不见的客人", "actor": " 奥里奥尔·保罗 Oriol Paulo", "countries": "西班牙"}
{"Top": "198", "title": "影重重2", "actor": " 保罗·格林格拉斯 Paul Greengrass", "countries": "美国 德国"}
{"Top": "199", "title": "岭街少年杀人事件", "actor": " 杨德昌 Edward Yang", "countries": "台湾"}
{"Top": "67", "title": "子弹飞", "actor": " 姜文 Wen Jiang", "countries": "中国大陆 香港"}
{"Top": "68", "title": "救大兵瑞恩", "actor": " 史蒂文·斯皮尔伯格 Steven Spielberg", "countries": "美国"}
{"Top": "200", "title": "球上的星星", "actor": " 阿米尔·汗 Aamir Khan", "countries": "印度"}
{"Top": "69", "title": "命魔术", "actor": " 克里斯托弗·诺兰 Christopher Nolan", "countries": "美国 英国"}
{"Top": "70", "title": "宗罪", "actor": " 大卫·芬奇 David Fincher", "countries": "美国"}
{"Top": "71", "title": "闹天宫", "actor": " 万籁鸣 Laiming Wan / 唐澄 Cheng  Tang", "countries": "中国大陆"}
{"Top": "72", "title": "嫌弃的松子的一生", "actor": " 中岛哲也 Tetsuya Nakashima", "countries": "日本"}
{"Top": "73", "title": "利·波特与魔法石", "actor": " Chris Columbus", "countries": "美国 英国"}
{"Top": "74", "title": "乐之声", "actor": " 罗伯特·怀斯 Robert Wise", "countries": "美国"}
{"Top": "75", "title": "俗小说", "actor": " 昆汀·塔伦蒂诺 Quentin Tarantino", "countries": "美国"}
{"Top": "126", "title": "号房的礼物", "actor": " 李焕庆 Hwan-kyeong Lee", "countries": "韩国"}
{"Top": "226", "title": "钻", "actor": " 爱德华·兹威克 Edward Zwick", "countries": "美国 德国"}
{"Top": "127", "title": "兽电力公司", "actor": " 彼特·道格特 Pete Docter / 大卫·斯沃曼 David Silverman", "countries": "美国"}
{"Top": "227", "title": "闯夺命岛", "actor": " 迈克尔·贝 Michael Bay", "countries": "美国"}
{"Top": "151", "title": "蛮故事", "actor": " 达米安·斯兹弗隆 Damián Szifron", "countries": "阿根廷 西班牙"}
{"Top": "128", "title": "月神偷", "actor": " 罗启锐 Alex Law", "countries": "香港 中国大陆"}
{"Top": "129", "title": "火之森", "actor": " 大森贵弘 Takahiro Omori", "countries": "日本"}
{"Top": "152", "title": "横四海", "actor": " 吴宇森 John Woo", "countries": "香港"}
{"Top": "228", "title": "脸", "actor": " 吴宇森 John Woo", "countries": "美国"}
{"Top": "130", "title": "伯虎点秋香", "actor": " 李力持 Lik-Chi Lee", "countries": "香港"}
{"Top": "229", "title": "焦", "actor": " 托马斯·麦卡锡 Thomas McCarthy", "countries": "美国"}
{"Top": "131", "title": "武士", "actor": " 黑泽明 Akira Kurosawa", "countries": "日本"}
{"Top": "153", "title": "父3", "actor": " 弗朗西斯·福特·科波拉 Francis Ford Coppola", "countries": "美国"}
{"Top": "230", "title": "速5厘米", "actor": " 新海诚 Makoto Shinkai", "countries": "日本"}
{"Top": "132", "title": "能陆战队", "actor": " 唐·霍尔 Don Hall / 克里斯·威廉姆斯 Chris Williams", "countries": "美国"}
{"Top": "154", "title": "旺达饭店", "actor": " 特瑞·乔治 Terry George", "countries": "英国 南非 意大利 美国"}
{"Top": "231", "title": "条橙", "actor": " Stanley Kubrick", "countries": "英国 美国"}
{"Top": "155", "title": "具总动员3", "actor": " 李·昂克里奇 Lee Unkrich", "countries": "美国"}
{"Top": "133", "title": "蝠侠:黑暗骑士崛起", "actor": " 克里斯托弗·诺兰 Christopher Nolan", "countries": "美国 英国"}
{"Top": "232", "title": "金三镖客", "actor": " Sergio Leone", "countries": "意大利 西班牙 西德"}
{"Top": "134", "title": "锯惊魂", "actor": " 詹姆斯·温 James Wan", "countries": "美国"}
{"Top": "156", "title": "拉斯买家俱乐部", "actor": " 让-马克·瓦雷 Jean-Marc Vallée", "countries": "美国"}
{"Top": "135", "title": "爱至上", "actor": " 理查德·柯蒂斯 Richard Curtis", "countries": "英国 美国 法国"}
{"Top": "233", "title": "001太空漫游", "actor": " 斯坦利·库布里克 Stanley Kubrick", "countries": "英国 美国"}
{"Top": "157", "title": "样年华", "actor": " 王家卫 Kar Wai Wong", "countries": "香港"}
{"Top": "234", "title": "萨布兰卡", "actor": " 迈克尔·柯蒂兹 Michael Curtiz", "countries": "美国"}
{"Top": "136", "title": "影重重3", "actor": " 保罗·格林格拉斯 Paul Greengrass", "countries": "美国 德国"}
{"Top": "158", "title": "美的世界", "actor": " 克林特·伊斯特伍德 Clint Eastwood", "countries": "美国"}
{"Top": "76", "title": "使爱美丽", "actor": " 让-皮埃尔·热内 Jean-Pierre Jeunet", "countries": "法国 德国"}
{"Top": "235", "title": "常嫌疑犯", "actor": " 布莱恩·辛格 Bryan Singer", "countries": "德国 美国"}
{"Top": "159", "title": "边的曼彻斯特", "actor": " 肯尼斯·罗纳根 Kenneth Lonergan", "countries": "美国"}
{"Top": "137", "title": "狂原始人", "actor": " 科克·德·米科 Kirk De Micco / 克里斯·桑德斯 Chris Sanders", "countries": "美国"}
{"Top": "236", "title": "鹰坠落", "actor": " 雷德利·斯科特 Ridley Scott", "countries": "美国"}
{"Top": "77", "title": "默的羔羊", "actor": " 乔纳森·戴米 Jonathan Demme", "countries": "美国"}
{"Top": "160", "title": "洋", "actor": " 雅克·贝汉 Jacques Perrin / 雅克·克鲁奥德 Jacques Cluzaud", "countries": "法国 瑞士 西班牙 美国 阿联酋"}
{"Top": "237", "title": "王的演讲", "actor": " 汤姆·霍珀 Tom Hooper", "countries": "英国 澳大利亚 美国"}
{"Top": "138", "title": "火虫之墓", "actor": " 高畑勋 Isao Takahata", "countries": "日本"}
{"Top": "78", "title": "敢的心", "actor": " 梅尔·吉布森 Mel Gibson", "countries": "美国"}
{"Top": "238", "title": "爱你", "actor": " 秋昌民 Chang-min Choo", "countries": "韩国"}
{"Top": "161", "title": "口脱险", "actor": " 杰拉尔·乌里 Gérard Oury", "countries": "法国 英国"}
{"Top": "79", "title": "刀手爱德华", "actor": " 蒂姆·波顿 Tim Burton", "countries": "美国"}
{"Top": "139", "title": "宴", "actor": " 李安 Ang Lee", "countries": "台湾 美国"}
{"Top": "239", "title": "国丽人", "actor": " 萨姆·门德斯 Sam Mendes", "countries": "美国"}
{"Top": "201", "title": "次别离", "actor": " 阿斯哈·法哈蒂  Asghar Farhadi", "countries": "伊朗 法国"}
{"Top": "140", "title": "邪西毒", "actor": " 王家卫 Kar Wai Wong", "countries": "香港 台湾"}
{"Top": "80", "title": "蝶效应", "actor": " 埃里克·布雷斯 Eric Bress / J·麦基·格鲁伯 J. Mackye Gruber", "countries": "美国 加拿大"}
{"Top": "162", "title": "恋笔记本", "actor": " 尼克·卡索维茨 Nick Cassavetes", "countries": "美国"}
{"Top": "202", "title": "随", "actor": " 克里斯托弗·诺兰 Christopher Nolan", "countries": "英国"}
{"Top": "240", "title": "钧一发", "actor": " 安德鲁·尼科尔 Andrew Niccol", "countries": "美国"}
{"Top": "141", "title": "民窟的百万富翁", "actor": " 丹尼·鲍尔 Danny Boyle / 洛芙琳·坦丹 Loveleen Tandan", "countries": "英国 美国"}
{"Top": "163", "title": "看起来好像很好吃", "actor": " 藤森雅也 Masaya Fujimori", "countries": "日本"}
{"Top": "81", "title": "光乍泄", "actor": " 王家卫 Kar Wai Wong", "countries": "香港 日本 韩国"}
{"Top": "241", "title": "海蓝天", "actor": " Luc Besson", "countries": "法国 美国 意大利"}
{"Top": "203", "title": "蛇", "actor": " 徐克 Hark Tsui", "countries": "香港"}
{"Top": "142", "title": "雄本色", "actor": " 吴宇森 John Woo", "countries": "香港"}
{"Top": "164", "title": "情岁月", "actor": " 爱德华·兹威克 Edward Zwick", "countries": "美国"}
{"Top": "82", "title": "鼠游戏", "actor": " 史蒂文·斯皮尔伯格 Steven Spielberg", "countries": "美国 加拿大"}
{"Top": "143", "title": "忆碎片", "actor": " 克里斯托弗·诺兰 Christopher Nolan", "countries": "美国"}
{"Top": "242", "title": "愿清单", "actor": " 罗伯·莱纳 Rob Reiner", "countries": "美国"}
{"Top": "204", "title": "魂记", "actor": " Alfred Hitchcock", "countries": "美国"}
{"Top": "144", "title": "天鹅", "actor": " 达伦·阿罗诺夫斯基 Darren Aronofsky", "countries": "美国"}
{"Top": "83", "title": "灵捕手", "actor": " 格斯·范·桑特 Gus Van Sant", "countries": "美国"}
{"Top": "165", "title": "十二", "actor": " 郭柯 Ke Guo", "countries": "中国大陆"}
{"Top": "205", "title": "结者2:审判日", "actor": " 詹姆斯·卡梅隆 James Cameron", "countries": "美国 法国"}
{"Top": "243", "title": "狂的麦克斯4:狂暴之路", "actor": " 乔治·米勒 George Miller", "countries": "澳大利亚 美国"}
{"Top": "145", "title": "人知晓", "actor": " 是枝裕和 Hirokazu Koreeda", "countries": "日本"}
{"Top": "84", "title": "达佩斯大饭店", "actor": " 韦斯·安德森 Wes Anderson", "countries": "美国 德国 英国"}
{"Top": "244", "title": "国病人", "actor": " 安东尼·明格拉 Anthony Minghella", "countries": "美国 英国"}
{"Top": "206", "title": "车", "actor": " 保罗·哈吉斯 Paul Haggis", "countries": "美国 德国"}
{"Top": "166", "title": "解救的姜戈", "actor": " 昆汀·塔伦蒂诺 Quentin Tarantino", "countries": "美国"}
{"Top": "146", "title": "迷宫", "actor": " 忻钰坤 Yukun Xin", "countries": "中国大陆"}
{"Top": "207", "title": "狂的石头", "actor": " 宁浩 Hao Ning", "countries": "中国大陆 香港"}
{"Top": "85", "title": "闭岛", "actor": " Martin Scorsese", "countries": "美国"}
{"Top": "245", "title": "野生存", "actor": " 西恩·潘 Sean Penn", "countries": "美国"}
{"Top": "147", "title": "慢与偏见", "actor": " 乔·怀特 Joe Wright", "countries": "法国 英国 美国"}
{"Top": "167", "title": "脑特工队", "actor": " 彼特·道格特 Pete Docter / 罗纳尔多·德尔·卡门 Ronaldo Del Carmen", "countries": "美国"}
{"Top": "86", "title": "殓师", "actor": " 泷田洋二郎 Yôjirô Takita", "countries": "日本"}
{"Top": "208", "title": "代码", "actor": " 邓肯·琼斯 Duncan Jones", "countries": "美国 加拿大"}
{"Top": "148", "title": "战钢锯岭", "actor": " 梅尔·吉布森 Mel Gibson", "countries": "美国 澳大利亚"}
{"Top": "246", "title": "岛余生", "actor": " 罗伯特·泽米吉斯 Robert Zemeckis", "countries": "美国"}
{"Top": "168", "title": "川时代", "actor": " 卡洛斯·沙尔丹哈 Carlos Saldanha / 克里斯·韦奇 Chris Wedge", "countries": "美国"}
{"Top": "87", "title": "条纹睡衣的男孩", "actor": " 马克·赫尔曼 Mark Herman", "countries": "英国 美国"}
{"Top": "247", "title": "盗电台", "actor": " 理查德·柯蒂斯 Richard Curtis", "countries": "英国 德国 法国"}
{"Top": "169", "title": "中曲", "actor": " 斯坦利·多南 Stanley Donen / 吉恩·凯利 Gene Kelly", "countries": "美国"}
{"Top": "149", "title": "人", "actor": " 巴瑞·莱文森 Barry Levinson", "countries": "美国"}
{"Top": "209", "title": "萝莉的猴神大叔", "actor": " 卡比尔·汗 Kabir Khan", "countries": "印度"}
{"Top": "88", "title": "灵公主", "actor": " 宫崎骏 Hayao Miyazaki", "countries": "日本"}
{"Top": "248", "title": "火", "actor": " 杜琪峰 Johnnie To", "countries": "香港"}
{"Top": "150", "title": "空恋旅人", "actor": " 理查德·柯蒂斯 Richard Curtis", "countries": "英国"}
{"Top": "170", "title": "是山姆", "actor": " 杰茜·尼尔森 Jessie Nelson", "countries": "美国"}
{"Top": "210", "title": "次出发之纽约遇见你", "actor": " 约翰·卡尼 John Carney", "countries": "美国"}
{"Top": "89", "title": "凡达", "actor": " 詹姆斯·卡梅隆 James Cameron", "countries": "美国 英国"}
{"Top": "249", "title": "夫", "actor": " 周星驰 Stephen Chow", "countries": "中国大陆 香港"}
{"Top": "171", "title": "敌破坏王", "actor": " 瑞奇·莫尔 Rich Moore", "countries": "美国"}
{"Top": "211", "title": "履不停", "actor": " 是枝裕和 Hirokazu Koreeda", "countries": "日本"}
{"Top": "90", "title": "光灿烂的日子", "actor": " 姜文 Wen Jiang", "countries": "中国大陆 香港"}
{"Top": "172", "title": "工智能", "actor": " 史蒂文·斯皮尔伯格 Steven Spielberg", "countries": "美国"}
{"Top": "212", "title": "龙门客栈", "actor": " 李惠民 Raymond Lee", "countries": "香港 中国大陆"}
{"Top": "250", "title": "士", "actor": " 加文·欧康诺 Gavin O'Connor", "countries": "美国"}
{"Top": "91", "title": "丽和马克思", "actor": " 亚当·艾略特 Adam Elliot", "countries": "澳大利亚"}
{"Top": "173", "title": "的名字。", "actor": " 新海诚 Makoto Shinkai", "countries": "日本"}
{"Top": "213", "title": "恋这件小事", "actor": " 达伦·阿伦诺夫斯基 Darren Aronofsky", "countries": "美国"}
{"Top": "215", "title": "京物语", "actor": " 小津安二郎 Yasujirô Ozu", "countries": "日本"}
{"Top": "92", "title": "六感", "actor": " M·奈特·沙马兰 M. Night Shyamalan", "countries": "美国"}
{"Top": "174", "title": "裂鼓手", "actor": " 达米恩·查泽雷 Damien Chazelle", "countries": "美国"}
{"Top": "175", "title": "越时空的少女", "actor": " 细田守 Mamoru Hosoda", "countries": "日本"}
{"Top": "216", "title": "在午夜降临前", "actor": " 理查德·林克莱特 Richard Linklater", "countries": "美国 希腊"}
{"Top": "93", "title": "命ID", "actor": " James Mangold", "countries": "美国"}
{"Top": "217", "title": "耻混蛋", "actor": " Quentin Tarantino", "countries": "美国 德国"}
{"Top": "94", "title": "猎", "actor": " 托马斯·温特伯格 Thomas Vinterberg", "countries": "丹麦 瑞典"}
{"Top": "218", "title": "市之光", "actor": " Charles Chaplin", "countries": "美国"}
{"Top": "95", "title": "庆森林", "actor": " 王家卫 Kar Wai Wong", "countries": "香港"}
{"Top": "96", "title": "背山", "actor": " 李安 Ang Lee", "countries": "美国 加拿大"}
{"Top": "219", "title": "书奇谭", "actor": " 王树忱 Shuchen Wang / 钱运达 Yunda Qian", "countries": "中国大陆"}
{"Top": "220", "title": "里奇迹", "actor": " Frank Darabont", "countries": "美国"}
{"Top": "97", "title": "勒比海盗", "actor": " 戈尔·维宾斯基 Gore Verbinski", "countries": "美国"}
{"Top": "98", "title": "登时代", "actor": " 查理·卓别林 Charles Chaplin", "countries": "美国"}
{"Top": "221", "title": "星来的那一夜", "actor": " 詹姆斯·沃德·布柯特 James Ward Byrkit", "countries": "美国 英国"}
{"Top": "222", "title": "个男人来自地球", "actor": " 理查德·沙因克曼 Richard Schenkman", "countries": "美国"}
{"Top": "99", "title": "白", "actor": " 中岛哲也 Tetsuya Nakashima", "countries": "日本"}
{"Top": "100", "title": "鱼", "actor": " 蒂姆·波顿 Tim Burton", "countries": "美国"}
{"Top": "223", "title": ".T. 外星人", "actor": " Steven Spielberg", "countries": "美国"}
{"Top": "224", "title": "路狂花", "actor": " 雷德利·斯科特 Ridley Scott", "countries": "美国 法国"}
{"Top": "225", "title": "蒂和爷爷", "actor": " 阿兰·葛斯彭纳 Alain Gsponer", "countries": "德国 瑞士 南非"}

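Summary:

  • 1. Working out the regular expressions took too long; more practice is needed.
  • 2. Crawling with multiple processes is much faster, but the results come back in no particular order rather than sorted by rank; this still needs handling.
  • 3. After the crawl finished, film No. 214 turned out to be missing. The cause is not yet known and needs further investigation. One clue in the output: record 213 pairs the title 恋这件小事 with director Darren Aronofsky and country 美国, which may mean that part of entry 213 failed to match and the lazy [\s\S]*? ran on into entry 214.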
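
For the ordering problem in point 2, one option is to sort the saved records by rank after the crawl. The sketch below assumes the one-JSON-object-per-line format written by write_to_file(); the function name sort_by_rank and the three shortened sample lines are chosen for illustration.

```python
import json

def sort_by_rank(lines):
    """Parse JSON lines and order the records by numeric rank."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Convert to int: as strings, "101" would sort before "26".
    return sorted(records, key=lambda record: int(record['Top']))

# Shortened sample lines in the same shape as the output file.
lines = [
    '{"Top": "101", "title": "剧之王"}',
    '{"Top": "26", "title": "世佳人"}',
    '{"Top": "1", "title": "申克的救赎"}',
]

ordered = sort_by_rank(lines)
ranks = [record['Top'] for record in ordered]
print(ranks)  # ['1', '26', '101']
```

Another possibility, not tried in these notes, would be to have each worker return its items and write them from the parent process, since Pool.map returns results in the order of its inputs.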

Reposted from blog.csdn.net/weixin_42937385/article/details/88080306