Douban Movie Scraper

import requests  # for fetching the page
from bs4 import BeautifulSoup  # for parsing the HTML
import json  # for storing the results

Fetching the data

def get_page():
    url = 'https://movie.douban.com/cinema/nowplaying/chengdu/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    return response.text

Parsing the data

def parse_page(text):
    movies = []
    soup = BeautifulSoup(text, 'lxml')  # lxml is a parsing library that supports both HTML and XML
    lilist = soup.find_all('li', attrs={'data-category': 'nowplaying'})
    for li in lilist:
        movie = {}
        title = li['data-title']  # tag attributes are read with dict-style access
        release = li['data-release']
        duration = li['data-duration']
        director = li['data-director']
        actors = li['data-actors']
        img = li.find('img')
        thumbnail = img['src']
        # assemble the movie dict
        movie['title'] = title
        movie['release'] = release
        movie['duration'] = duration
        movie['director'] = director
        movie['actors'] = actors
        movie['thumbnail'] = thumbnail
        movies.append(movie)
    return movies

Saving the data

def save_page(data):
    with open('douban.json', 'w', encoding='utf-8') as fp:
        # encoding: the character encoding used for the output file
        json.dump(data, fp, ensure_ascii=False)

In short, json.dump takes an argument that behaves like a file pointer (not an actual pointer; more precisely, a file-like object), so it integrates with file operations: it converts the dict to a str and writes that str into the file.

json.dumps, on the other hand, returns the str directly; it simply converts the dict to a str.
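The difference can be seen with a small self-contained sketch that uses an in-memory StringIO in place of a real file (the sample dict here is made up for illustration):

```python
import io
import json

data = {'title': 'Example Movie', 'duration': '125min'}

# json.dumps returns the JSON text as a str
s = json.dumps(data, ensure_ascii=False)

# json.dump writes the same text to a file-like object instead
buf = io.StringIO()
json.dump(data, buf, ensure_ascii=False)

# both produce identical JSON text
assert buf.getvalue() == s
```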

if __name__ == '__main__':
    text = get_page()
    movies = parse_page(text)
    save_page(movies)


Reposted from blog.csdn.net/sdsc1314/article/details/88824850