Advanced crawling with the Scrapy framework

A record of the error-prone areas

1. Importing the Item class in spider files

from movie.items import MovieItem
# package_name.module_name import ClassName
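
For reference, here is a minimal sketch of what movie/items.py might contain (the post does not show this file; the title field is inferred from the spider code below):

# movie/items.py (hypothetical sketch; only the title field is inferred from the spider)
import scrapy

class MovieItem(scrapy.Item):
    title = scrapy.Field()  # filled in by the spider's parse callback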

2. The response.follow() method

The response.follow() method generates the next request and schedules its parsing. The first argument is the URL of the next page, and the second is the callback method used to parse the response returned from that URL.

import scrapy
from ..items import MovieItem

class ShuichanSpider(scrapy.Spider):
    name = 'shuichan'
    allowed_domains = ['bbs.liyang-tech.com']
    start_urls = ['http://bbs.liyang-tech.com/forum.php?mod=forumdisplay&fid=4']

    def parse(self, response):
        # Build the URLs of pages 1-50 and queue a request for each one
        urls = ['http://bbs.liyang-tech.com/forum.php?mod=forumdisplay&fid=4&page=%s' % i
                for i in range(1, 51)]
        for url in urls:
            yield response.follow(url, self.parse_title)  # pass the callback without parentheses

    def parse_title(self, response):
        item = MovieItem()
        # Extract the thread titles from the listing page
        txt = response.xpath('//*/tr/th/a[2]/text()').extract()
        for title in txt:
            item['title'] = title
            yield item

The second argument is the callback method itself. Do not add parentheses after it, otherwise the method is called immediately and an error is raised.
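
In other words, inside parse() the difference looks like this (a minimal illustration):

def parse(self, response):
    url = 'http://bbs.liyang-tech.com/forum.php?mod=forumdisplay&fid=4&page=2'
    # Wrong: parse_title() runs immediately and its return value is passed as the callback
    # yield response.follow(url, self.parse_title())
    # Right: pass the method object itself; Scrapy calls it later with the response
    yield response.follow(url, self.parse_title)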

3. The extract() method

We often use Scrapy's built-in XPath support to parse the page and pull out the information we want, but response.xpath() returns Selector objects, which cannot be used directly in string operations. Call extract() to convert the result into a list of plain strings, which can then be handled like any other str.
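
A minimal sketch inside a parse callback (the XPath expression is the one from the spider above, used purely for illustration):

def parse_title(self, response):
    selectors = response.xpath('//*/tr/th/a[2]/text()')  # SelectorList of Selector objects, not strings
    texts = selectors.extract()            # list of plain str
    first = selectors.extract_first()      # first match as str, or None if nothing matched
    for text in texts:
        yield {'title': text.strip()}      # ordinary string methods now work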

4. yield

After parsing, the data must be yielded back, otherwise nothing will be output. First instantiate the class defined in the items file, then fill it with the parsed data:

item = MovieItem()

and finally yield the item.

Settings

settings.py is the configuration file for the whole project. In it you can set the number of concurrent requests, download delays, output format, default headers, and so on. For this project the configuration might look like this:

BOT_NAME = 'appinn'

SPIDER_MODULES = ['appinn.spiders']
NEWSPIDER_MODULE = 'appinn.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'appinn (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Everything above is generated automatically; our own settings start below
# Item pipelines to enable
ITEM_PIPELINES = {
    'appinn.pipelines.AppinnPipeline': 300,  # 300 is the order; with several pipelines, lower numbers run first
}
FEED_FORMAT = 'csv'  # format of the exported file
FEED_URI = 'appin_windows_apps.csv'  # name of the exported file

# To avoid putting too much load on the target site, enable auto-throttling
# and set the target concurrency to 5
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 5
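
The ITEM_PIPELINES entry above points at appinn.pipelines.AppinnPipeline, which is not shown in this post. A minimal sketch of such a pipeline might look like this (the strip() step is only an illustrative assumption):

# appinn/pipelines.py (hypothetical sketch of the pipeline named in ITEM_PIPELINES)
class AppinnPipeline:
    def process_item(self, item, spider):
        # Every item yielded by a spider passes through here before the feed export
        if item.get('title'):
            item['title'] = item['title'].strip()
        return item  # returning the item lets it continue to the exporter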

Finally

Be sure to pay attention to capitalization and spelling in the settings. For example, setting the output format to TXT is wrong: Scrapy has no built-in TXT feed exporter, while CSV works fine.
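
For example (the formats listed here are among Scrapy's built-in feed exporters):

# Built-in feed export formats include 'csv', 'json', 'jsonlines' and 'xml'
FEED_FORMAT = 'csv'    # works
# FEED_FORMAT = 'txt'  # no built-in exporter for this; the export will fail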

Source: blog.csdn.net/qq_17802895/article/details/108504355