Python Douban Movie Top250 Crawl

Crawling the Douban Movie Top250 with Python and saving the results to a CSV file

A long-overdue blog update, and there is a story behind it. One evening a friend asked me to help crawl the Douban Movie Top250 data for analysis. The articles I found online were not good enough, so I rewrote the crawler myself to fill that gap.
Although this crawl is easy, since the URL serves a static page whose source can be parsed directly, it is still worth a look: there are a few subtle points you may not have seen before.

The webpage link is this: https://movie.douban.com/top250
We can right-click the page and choose "Inspect Element" to find the information we need directly. Our goal this time is to extract, for each movie:
'Movie Ranking', 'Movie Name', 'Director', 'Release Year', 'Production Country', 'Movie Type', 'Film Rating', 'Number of Raters', 'Movie Short Quote'.
Let's start with the code:

import requests			# you again, the most frequently used library
from lxml import etree		# provides xpath; I strongly recommend xpath over BeautifulSoup
import re				# regex, for extracting a few bits of text, as you'll see shortly

If lxml raises an import error, you need to install it: enter pip install lxml in a cmd window.
The installation may occasionally fail, most likely due to network problems; just retry a few times.
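If you have not used xpath before, here is a minimal, self-contained sketch of the kind of extraction used below. The HTML fragment is made up, but it imitates the structure of one Douban list item:

```python
from lxml import etree

# a toy HTML fragment imitating one Douban list item (hypothetical markup)
html_text = '''
<div class="item">
    <div><em>1</em><a href="#"><img alt="肖申克的救赎"/></a></div>
</div>
<p class="quote"><span>希望让人自由。</span></p>
'''

html_ele = etree.HTML(html_text)
# xpath always returns a list of matches, even when there is only one
rank = html_ele.xpath('//div[@class="item"]/div/em/text()')
name = html_ele.xpath('//div[@class="item"]/div/a/img/@alt')
quote = html_ele.xpath('//p[@class="quote"]/span/text()')
print(rank, name, quote)   # ['1'] ['肖申克的救赎'] ['希望让人自由。']
```

On the real page the same expressions return 25 results per request, one per movie.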

Let's start with the entry point of the program:

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    # movieInfo starts with the CSV header row; we will append the data row by row,
    # otherwise the saved file will not look the way we want
    movieInfo = [['电影排名','电影名称','导演','上映年份','制片国家','电影类型','电影评分','评价人数','电影短评']]
    for i in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}'.format(i)
        try:
            # a static page poses no difficulty, so one function handles it all
            DoubanSpider(movieInfo, url, headers=headers)
        except:
            break
    # save movieInfo in CSV format
    with open('movie.csv', 'w', encoding='utf-8') as f:
        for info in movieInfo:
            # join the list elements with commas, write, then move to the next line
            f.write(','.join(info) + '\n')
    print('Crawl finished')
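As a side note, the standard library's csv module is a safer way to write the file than the plain ','.join() above: it quotes any field that itself contains a comma. A minimal sketch of that variant, using a made-up two-row movieInfo:

```python
import csv

# made-up rows with the same shape as movieInfo
movieInfo = [['电影排名', '电影名称'], ['1', '肖申克的救赎']]

# newline='' prevents blank lines between rows on Windows;
# csv.writer quotes fields that contain commas, which ','.join() would corrupt
with open('movie.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(movieInfo)
```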

Now for the main event. The code is honestly nothing to brag about; it was written in a hurry with no optimization, so please bear with me.

def DoubanSpider(movieInfo, url, headers):
    response = requests.get(url, headers=headers)
    html_ele = etree.HTML(response.text)
    # the first few fields can be extracted directly with xpath; the later ones
    # are trickier. If you don't know xpath yet, it is well worth learning
    rank = html_ele.xpath('//div[@class="item"]/div/em/text()')
    film_name = html_ele.xpath('//div[@class="item"]/div/a/img/@alt')
    quote = html_ele.xpath('//p[@class="quote"]/span/text()')
    # inspecting the page shows that the rating and the number of raters
    # come back as one combined list
    temp_content = html_ele.xpath('//div[@class="star"]/span/text()')
    score = []
    people = []
    index = 1
    for i in temp_content:
        if index % 2 == 1:      # odd positions are ratings
            score.append(i)
        else:                   # even positions are rater counts
            people.append(i)
        index += 1

    # the trickiest part of the whole script: the remaining fields
    # are all mashed together in one blob of text
    film_content = html_ele.xpath('//div[@class="bd"]/p/text()')
    # use re to match digit runs; this also picks up some useless '0' strings
    # release year
    film_year_temp = re.findall(r'\d+', str(film_content))
    film_year = []
    for i in film_year_temp:
        if i != '0':            # filter out the stray '0' strings
            film_year.append(i)
    # extract the director's name
    film_director = re.findall(r'导演:(.*?)主演', str(film_content))
    director = []
    for i in film_director:
        director.append(i.split(' ')[1])    # just to normalize the output
    # the rest is one long run of text split into several small lists
    country = []
    index = 1
    for i in film_content:
        if index % 4 == 2:
            country.append(i.split('\xa0')[2])
        index += 1

    film_type = []
    index = 1
    for i in film_content:
        if index % 4 == 2:
            film_type.append(i.split('\xa0')[4].strip())
        index += 1
    # time to show the real technique :)
    # zip binds the lists together so the loop below can take the matching
    # element from every list at the same time.
    # the reason: each call to this function grows every field list to 25 items,
    # so we cannot dump the lists into movieInfo in one go, or the saved CSV will
    # not come out the way we want; try it yourself if you don't believe me.
    # alternatively, you could restructure the code to append one complete row to
    # movieInfo as soon as it is assembled.
    nvs = zip(rank, film_name, director, film_year, country, film_type, score, people, quote)
    for rank, film_name, film_director, film_year, country, film_type, score, people, quote in nvs:
        movieInfo.append([rank, film_name, film_director, film_year, country, film_type, score, people, quote])
    print(url, 'done')
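To make the regex and the split('\xa0') juggling above concrete, here is a toy run on fabricated text nodes. The sample strings imitate what the Douban page returns (the real ones contain more surrounding whitespace), so treat the exact indices as an assumption about that format:

```python
import re

# fabricated text nodes imitating the two lines of a Douban movie entry;
# '\xa0' is the non-breaking space Douban uses as a separator
info_line = '导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins'
year_line = '\n    1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n'

# the (.*?) group captures everything between '导演:' and '主演';
# splitting on a normal space then isolates the Chinese name
director = re.findall(r'导演:(.*?)主演', info_line)[0].split(' ')[1]
# the only digit run in the second line is the release year
year = re.findall(r'\d+', year_line)[0]
parts = year_line.split('\xa0')      # ['\n    1994', '/', '美国', '/', '犯罪 剧情\n']
country = parts[2]
film_type = parts[4].strip()

print(director, year, country, film_type)   # 弗兰克·德拉邦特 1994 美国 犯罪 剧情
```

Seeing the split result spelled out explains the magic indices [2] and [4] in the function above.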

You're done; the final step is acceptance!
The movie.csv file is generated in the same directory as the .py file. It can be opened with Notepad, but opening it in Excel is what we really want.
When you open it with Excel...
What is this? It turns out to be the classic garbled-text problem when Excel opens a UTF-8 CSV. You can refer to a solution for garbled CSV files; I personally tested method 2 and it works well:
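One common fix can also be applied in the code itself: writing the file with encoding='utf-8-sig' prepends a byte-order mark (BOM), which Excel uses to detect UTF-8. A minimal sketch with made-up rows:

```python
# made-up rows with the same shape as movieInfo
rows = [['电影排名', '电影名称'], ['1', '肖申克的救赎']]

# 'utf-8-sig' writes a BOM at the start of the file;
# Excel sees it and decodes the file as UTF-8 instead of the local codepage
with open('movie.csv', 'w', encoding='utf-8-sig') as f:
    for row in rows:
        f.write(','.join(row) + '\n')
```

With this one-word change to the open() call in the main block, the file opens cleanly in Excel with no extra steps.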
Finally done!
Give it a like if this helped!

Origin blog.csdn.net/weixin_43594279/article/details/106107186