Python distributed crawler framework Scrapy 4-3: Scrapy debugging tips in PyCharm

First, look at our spider:

import scrapy

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']  # allowed domains
    start_urls = ['http://news.cnblogs.com/']  # start URLs

    def parse(self, response):
        pass

It inherits from scrapy.Spider, which already provides many default methods.

start_urls is a list that holds all the URLs we need to crawl. When working out the crawling strategy, you can programmatically piece together the URLs of all the pages and simply put them into start_urls.
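For example, if the list pages follow a predictable numbering scheme, start_urls can be built with a simple list comprehension. A minimal sketch, assuming a hypothetical /n/page/<N>/ pagination pattern (check the real site before relying on it):

import scrapy

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']
    # Pre-build the first 10 list-page URLs; the URL pattern here is an assumption
    start_urls = ['http://news.cnblogs.com/n/page/%d/' % page for page in range(1, 11)]

    def parse(self, response):
        pass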

Click through to the inherited Spider class and you will find this method:

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

start_requests simply traverses start_urls and yields a Request for each URL, either directly or via make_requests_from_url. In fact, make_requests_from_url just returns a Request as well:

    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)
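As the deprecation warning above suggests, newer code should override start_requests itself rather than rely on make_requests_from_url. A minimal sketch of how that could look for our spider:

import scrapy
from scrapy import Request

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']

    def start_requests(self):
        # Yield the initial request ourselves instead of listing it in start_urls
        yield Request('http://news.cnblogs.com/', dont_filter=True, callback=self.parse)

    def parse(self, response):
        pass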

The Request we yield is handed to Scrapy's downloader, which downloads everything described by the request. Once the download is complete, the parse function in cnblogs.py is called.

To recap the principle: the downloaded content of every URL ends up in the parse function, which receives a response object, similar to Django's HttpResponse.
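To make the analogy concrete, here is a purely illustrative sketch of reading a few attributes off the response inside parse (nothing cnblogs-specific):

    def parse(self, response):
        # Some of the attributes the response exposes, much like Django's HttpResponse
        print(response.url)          # the URL that was downloaded
        print(response.status)       # HTTP status code, e.g. 200
        print(len(response.body))    # size of the raw page source in bytes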

Set a breakpoint on the pass statement in the parse function, then debug and take a look.

How do we debug it? PyCharm does not provide a Scrapy run template, so we cannot debug the spider directly. Instead, we can write our own main.py that invokes the Scrapy command line, and debug through that file.

Create a new main.py in the project root directory:

from scrapy.cmdline import execute

import sys
import os

# os.path.abspath(__file__) is the path of this file: F:\Program Files\爬虫项目\new\Spider\main.py
# os.path.dirname(os.path.abspath(__file__)) is the path of the folder containing this file: F:\Program Files\爬虫项目\new\Spider
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Calling execute runs the scrapy command-line script
execute(["scrapy", "crawl", "cnblogs"])

In fact, the command-line command that starts the spider is:

scrapy crawl cnblogs

You can see the relationship: the command-line startup command is simply passed to the execute function as a list of arguments in the startup script.
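Because execute simply receives the command line as a list of arguments, any extra options can be passed the same way. For example, with Scrapy's standard -o flag for writing scraped items to a file (the file name here is arbitrary):

from scrapy.cmdline import execute

# Equivalent to running: scrapy crawl cnblogs -o items.json
execute(["scrapy", "crawl", "cnblogs", "-o", "items.json"])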

Before debugging, change one setting in settings.py to False:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

Its default value is True. If we do not change it to False, the spider reads the robots.txt of every site and filters out any URL that does not comply with the robots protocol.
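As an alternative to editing the project-wide settings.py, Scrapy also lets a single spider override settings through the custom_settings class attribute; a small sketch:

import scrapy

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['news.cnblogs.com']
    start_urls = ['http://news.cnblogs.com/']
    # Per-spider override; takes precedence over settings.py for this spider only
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }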

Debug the main.py script we just wrote, and you will see that execution successfully reaches the parse function.

In the debugger you can inspect the contents of the response:

First, it is of type HtmlResponse; Scrapy actually has several Response classes, such as TextResponse.

url is the URL being accessed.

A status of 200 means the request returned normally.

Notice that DEFAULT_ENCODING is ascii, but Scrapy itself sets the encoding to utf-8.

body is the entire page source; the parsing we do later actually operates on this body.
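To operate on the body, we normally go through the response's selector API rather than the raw bytes. A minimal sketch; the .news_entry a selector and the parse_detail helper are assumptions for illustration, not verified against the real cnblogs markup:

    def parse(self, response):
        # Extract article links from the list page (the selector is an assumption)
        for href in response.css('.news_entry a::attr(href)').extract():
            # response.follow resolves relative URLs and yields a new Request
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Pull the page title out of the detail page
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}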


Origin blog.csdn.net/liujh_990807/article/details/100027866