First experience with Scrapy

1. Create a Scrapy project

scrapy startproject <project_name>
  • Generate the spider file in the spiders directory:

    cd spiders
    scrapy genspider douban_spider <domain>
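
For a concrete run matching the rest of this tutorial (assumption: the project is named douban, consistent with the from douban.items import DoubanItem import used in step 5, and the domain is movie.douban.com):

    scrapy startproject douban
    cd douban/douban/spiders
    scrapy genspider douban_spider movie.douban.com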


2. Define the objectives

Be clear about what you need to scrape, and define the data structure in items.py:

import scrapy

class DoubanItem(scrapy.Item):
    # serial number (ranking)
    serial_number = scrapy.Field()
    # movie name
    movie_name = scrapy.Field()
    # introduction
    introduce = scrapy.Field()
    # star rating
    star = scrapy.Field()
    # number of ratings
    evaluate = scrapy.Field()
    # description
    describe = scrapy.Field()
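
As a minimal usage sketch (assuming the project is named douban): an Item behaves like a dict, but only the declared fields may be assigned.

from douban.items import DoubanItem  # assumes the project is named "douban"

douban_item = DoubanItem()
douban_item['movie_name'] = '肖申克的救赎'  # declared field: OK
# douban_item['director'] = '...'          # undeclared field: raises KeyError
print(douban_item)  # prints dict-style: {'movie_name': '肖申克的救赎'}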


3. Write the spider file

Open the generated douban_spider.py; by default it contains three attributes and an empty parse method:

class DoubanSpiderSpider(scrapy.Spider):
    # spider name
    name = 'douban_spider'
    # allowed domains; links outside them will not be crawled
    allowed_domains = ['movie.douban.com']
    # entry URL
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        pass

Write the content-parsing logic inside the parse method; as a first test, just print the response:

def parse(self, response):
    print(response.text)


4. Run the Scrapy project

  • Start from the command line

    # douban_spider is the spider name defined in douban_spider.py
    scrapy crawl douban_spider

    If the crawl fails with a 403, the user agent is not set properly; configure it in settings.py (see also the robots.txt note after this list):

    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
  • Start from PyCharm

    Create a main.py file at the project root (the command must run inside the Scrapy project):

    from scrapy import cmdline
    
    if __name__ == '__main__':
        cmdline.execute('scrapy crawl douban_spider'.split())
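
One related pitfall: newer Scrapy project templates set ROBOTSTXT_OBEY = True, which silently filters any request that the target site's robots.txt disallows. If the spider logs "Forbidden by robots.txt", this settings.py tweak disables that check (a sketch; whether it is needed depends on the site and your Scrapy version):

# settings.py
# Disable robots.txt filtering (Scrapy's default template enables it)
ROBOTSTXT_OBEY = False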


5. Write the parsing method

All parsing logic goes inside def parse(self, response).

  • Extract content with XPath

    You first need to learn some XPath syntax; a scrapy shell tip for testing selectors follows this list.

    movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
  • Import the item class defined earlier in items.py

    from douban.items import DoubanItem
  • The full parsing code

    # First select the nodes with XPath, then call text() to get their content
    movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
    for item in movie_list:
        douban_item = DoubanItem()
        douban_item['serial_number'] = item.xpath(".//div[@class='item']//em/text()").extract_first()
        douban_item['movie_name'] = item.xpath(".//div[@class='info']//a/span/text()").extract_first()
        # The introduction spans several text nodes; strip whitespace from each
        content = item.xpath(".//div[@class='bd']/p[1]/text()").extract()
        content_set = list()
        for i_content in content:
            tmp = ""
            for temp in i_content.split():
                tmp += temp
            content_set.append(tmp)
        douban_item['introduce'] = content_set
        douban_item['star'] = item.xpath(".//div[@class='star']/span[2]/text()").extract_first()
        douban_item['evaluate'] = item.xpath(".//div[@class='star']/span[4]/text()").extract_first()
        douban_item['describe'] = item.xpath(".//div[@class='bd']/p[2]/span/text()").extract_first()
        # Key point: yield inside the loop so every movie item is submitted
        yield douban_item
  • Once an item is fully parsed, be sure to submit it with yield

    yield douban_item
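
As mentioned above, a convenient way to experiment with these XPath expressions is Scrapy's interactive shell. A minimal sketch (the global -s flag overrides a setting for one invocation; reuse the USER_AGENT from step 4 so the site does not return 403):

scrapy shell -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36' 'https://movie.douban.com/top250'

# inside the shell, selectors can be tried interactively:
>>> movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
>>> len(movie_list)   # expect 25 entries per page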


6. Crawl the next page

The code above only parses the current page; you also need to extract the link to the next page and yield a new request for it:

# Get the next-page link
next_link = response.xpath("//span[@class='next']/link/@href").extract()
# If this is not the last page
if next_link:
    next_page = next_link[0]
    yield scrapy.Request("https://movie.douban.com/top250" + next_page, callback=self.parse)
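
The string concatenation above works because the extracted href is relative to the Top 250 URL. A slightly more defensive sketch lets Scrapy resolve the link with response.urljoin instead:

# Same logic, but let Scrapy resolve the relative href
next_link = response.xpath("//span[@class='next']/link/@href").extract_first()
if next_link:
    yield scrapy.Request(response.urljoin(next_link), callback=self.parse)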


7. Save Output

Append the -o parameter to the crawl command; JSON (saved with Unicode escapes by default), CSV, and other formats are supported:

scrapy crawl douban_spider -o test.json
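
By default Scrapy's JSON exporter writes non-ASCII characters as \uXXXX escapes, which is the "Unicode encoded" save mentioned above. If you prefer readable UTF-8 output, the standard FEED_EXPORT_ENCODING setting controls this:

# settings.py
# Write feed exports (e.g. -o test.json) as UTF-8 instead of \uXXXX escapes
FEED_EXPORT_ENCODING = 'utf-8'

Saving as CSV works the same way: scrapy crawl douban_spider -o test.csv.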

Origin: www.cnblogs.com/yisany/p/11227781.html