1. Create the Scrapy project
scrapy startproject [project_name]
Generate the spider file inside the spiders directory:
cd spiders
scrapy genspider douban_spider [domain]
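For orientation, a freshly generated project typically looks like the tree below (the exact files can vary slightly between Scrapy versions; douban is the project name assumed here, matching the import used later):

douban/
    scrapy.cfg                 # deployment configuration
    douban/
        __init__.py
        items.py               # data structure definitions (step 2)
        middlewares.py
        pipelines.py
        settings.py            # project settings (USER_AGENT, etc.)
        spiders/
            __init__.py
            douban_spider.py   # generated by scrapy genspider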
2. Define the objectives
Be clear about what needs to be captured, and define the data structure in items.py:
import scrapy

class DoubanItem(scrapy.Item):
    # ranking number
    serial_number = scrapy.Field()
    # movie title
    movie_name = scrapy.Field()
    # introduction
    introduce = scrapy.Field()
    # star rating
    star = scrapy.Field()
    # number of reviews
    evaluate = scrapy.Field()
    # description
    describe = scrapy.Field()
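Purely as an illustration (this snippet is not part of the project): a scrapy.Item behaves like a dict, but only the fields declared above are accepted.

item = DoubanItem()
item['movie_name'] = '肖申克的救赎'   # OK: declared field
print(item['movie_name'])
# item['director'] = '...'           # would raise KeyError: undeclared field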
3. Write the spider file
Open douban_spider.py; by default it contains three attributes and an empty parse method:
class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # allowed domains; links outside these domains will not be crawled
    allowed_domains = ['movie.douban.com']
    # entry url
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        pass
The content-parsing logic goes inside parse. As a first test, just print the response:

def parse(self, response):
    print(response.text)
4. Run the Scrapy project
Start it from the command line:
# douban_spider is the spider name defined in douban_spider.py
scrapy crawl douban_spider
If a 403 error is returned, the USER_AGENT is set incorrectly. Add a browser user agent in settings.py:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
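As a side note, if you prefer to scope the header to this one spider rather than the whole project, Scrapy also honours a custom_settings dict on the spider class; a minimal sketch of that alternative:

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    # per-spider override, an alternative to editing settings.py
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }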
To start the spider from PyCharm instead, create a main.py file:

from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute('scrapy crawl douban_spider'.split())
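An equivalent way to launch the spider from Python, shown here only as a sketch, is Scrapy's CrawlerProcess API together with the project settings:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    # loads settings.py, then runs the spider by name
    process = CrawlerProcess(get_project_settings())
    process.crawl('douban_spider')
    process.start()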
5. Write the parsing method
The parsing logic is written inside def parse(self, response).
Extract content with XPath
You need to know basic XPath syntax; the movie entries are selected like this:
movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
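Before hard-wiring an expression into the spider, it can be tested interactively with scrapy shell (run it from the project directory so the USER_AGENT from settings.py is picked up); a rough session might look like this:

$ scrapy shell "https://movie.douban.com/top250"
>>> movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
>>> movie_list[0].xpath(".//div[@class='info']//a/span/text()").extract_first()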
Wrap the extracted values in the item defined earlier in items.py:
from douban.items import DoubanItem
The full parsing code:
# Select each list item with XPath first, then use text() to get the content
movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
for item in movie_list:
    douban_item = DoubanItem()
    douban_item['serial_number'] = item.xpath(".//div[@class='item']//em/text()").extract_first()
    douban_item['movie_name'] = item.xpath(".//div[@class='info']//a/span/text()").extract_first()
    content = item.xpath(".//div[@class='bd']/p[1]/text()").extract()
    content_set = list()
    for i_content in content:
        tmp = ""
        for temp in i_content.split():
            tmp += temp
        content_set.append(tmp)
    douban_item['introduce'] = content_set
    douban_item['star'] = item.xpath(".//div[@class='star']/span[2]/text()").extract_first()
    douban_item['evaluate'] = item.xpath(".//div[@class='star']/span[4]/text()").extract_first()
    douban_item['describe'] = item.xpath(".//div[@class='bd']/p[2]/span/text()").extract_first()
    # Important: yield the parsed item
    yield douban_item
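The inner loop that strips whitespace from each introduction line can be condensed with str.join; this is only a stylistic alternative with the same behaviour:

# equivalent to the tmp/temp loop above
content_set = ["".join(i_content.split()) for i_content in content]
douban_item['introduce'] = content_set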
After an item has been fully parsed, make sure to submit it with yield:
yield douban_item
6. Crawl the next page
The code above only reads the current page. You also need to extract the link to the next page and yield a new request for it:
# get the link to the next page
next_link = response.xpath("//span[@class='next']/link/@href").extract()
# if this is not the last page
if next_link:
    next = next_link[0]
    yield scrapy.Request("https://movie.douban.com/top250" + next, callback=self.parse)
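Hard-coding the base URL works here, but Scrapy's response.urljoin resolves a relative href against the current page URL and avoids the concatenation; a small sketch of that variant:

if next_link:
    # response.urljoin builds the absolute URL from the relative link
    yield scrapy.Request(response.urljoin(next_link[0]), callback=self.parse)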
7. Save the output
Add the -o parameter to the crawl command; JSON (saved Unicode-encoded) and CSV, among other formats, are supported:
scrapy crawl douban_spider -o test.json
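By default the JSON exporter escapes non-ASCII characters; if you want the Chinese text stored as readable UTF-8 instead, Scrapy's FEED_EXPORT_ENCODING setting can be added to settings.py (an optional tweak, not required for the crawl itself):

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'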