Getting Started with Scrapy (2)

Debugging

There are usually two ways to debug a Scrapy project: one is through the scrapy shell, and the other is through the IDE's debug function. Both are covered below, with the focus on the second.

Runtime Environment
- Language: Python 3.6
- IDE: VS Code
- Browser: Chrome

scrapy shell

On the command line, enter scrapy shell [url of the page you want to visit].
Once it starts successfully, you are dropped into the Scrapy shell, where you can test selectors with
response.xpath('...')
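
For example, a quick session against the Douban books Top 250 page (the URL here is only assumed from the spider name used later, and the XPath is purely illustrative) might look like this:

scrapy shell "https://book.douban.com/top250"
# ... Scrapy fetches the page and drops you into an interactive Python shell ...
>>> response.status                                    # should be 200 if the page was fetched successfully
>>> response.xpath('//title/text()').extract_first()   # experiment with selectors before putting them in the spider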

IDE Debug

First, in the same directory as items.py, create a file named run.py with the following code:

from scrapy import cmdline

name = 'douban_book_top250'            # the name attribute of the spider to run
cmd = 'scrapy crawl {0}'.format(name)  # the same command you would type on the command line
cmdline.execute(cmd.split())           # hand it to Scrapy's command-line runner

Here, name is the name attribute of the spider from the previous part. Set a breakpoint in the spider file (or wherever you want to inspect), then start a debug session in VS Code with run.py as the entry point. The program pauses at the breakpoint, and you can inspect the corresponding variables and responses.

(screenshot: debugging in VS Code)
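
If you would rather not go through the command-line wrapper, a run.py based on CrawlerProcess works just as well as a debug entry point. A small sketch, not from the original article:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# run the spider in-process so breakpoints inside the spider are hit directly
process = CrawlerProcess(get_project_settings())
process.crawl('douban_book_top250')   # same spider name as above
process.start()                       # blocks until the crawl finishes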

Grab company information items

CrawlSpider

I received a task to crawl a Yellow Pages-style website and extract the companies' basic information (name, business category, telephone, email). Visiting the website and looking at the relevant URLs:

(screenshot: a company profile URL on the site)

It turns out that the URLs of the company information pages basically follow the format .../en/profile/<number+letters>. Inspired by the Xiaobai Advanced series, CrawlSpider is used for the crawl. CrawlSpider is designed for crawling URLs that follow a regular pattern; the code is as follows:

import re
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from YellowPagesCrawler.items import YellowPagesCrawlerItem

class YPCrawler(CrawlSpider):

    name = 'YPCrawler'

    allowed_domains = ['yellowpages.co.th']
    start_urls = ['http://www.yellowpages.co.th']

    rules = (
        # no allow pattern here: extract every link and filter inside parse_item instead
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@href]')),
             callback='parse_item', follow=True),   # the trailing comma keeps rules a tuple
    )

    def parse_item(self,response):

        if (re.search(r'^http://www.yellowpages.co.th/en/profile/\S+',response.url)):
            print(response.url)
            item = YellowPagesCrawlerItem()
            item['CompanyURL'] = response.url
            item['CompanyName'] = response.xpath('.//h2/text()').extract()[0]
            item['CompanyCategory'] = response.xpath('.//strong/parent::*/following-sibling::*/text()').extract()[0]
            item['CompanyTel'] = ""
            telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/a/text()').extract()

            if telnumbers == []:
                telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/text()').extract()

            for tel in telnumbers:
                item['CompanyTel'] = item['CompanyTel'] + tel.strip() + ' ' 

            mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/a/text()').extract()

            if mail == []:
                mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/text()').extract()

            if mail != []:    
                item['CompanyMail'] = mail[0].strip()
            else:
                item['CompanyMail'] = "no Email"

            return item

        else:
            pass

The regular expression in parse_item was originally placed in the Rule's allow parameter, but then a crawl starting from the homepage never reached the company information pages: the homepage does not link directly to any profile page, so no extracted URL matched allow and there was nothing to follow, even with follow=True. The solution is to leave allow empty, i.e. crawl every URL under allowed_domains, and filter responses by response.url inside parse_item instead. About the Rule parameters: allow and restrict_xpaths both filter the extracted URLs, callback names the function that handles the response, and follow decides whether links found on the fetched pages are followed in turn. One important detail: rules must be an iterable, so remember the trailing comma after the single Rule definition, otherwise an error is raised.
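
For completeness, the YellowPagesCrawlerItem imported above is not shown in the article; a minimal items.py consistent with the fields filled in parse_item and read by the pipeline would look like this:

import scrapy

class YellowPagesCrawlerItem(scrapy.Item):
    # one Field per value filled in parse_item and written out by the pipeline
    CompanyName = scrapy.Field()
    CompanyCategory = scrapy.Field()
    CompanyTel = scrapy.Field()
    CompanyMail = scrapy.Field()
    CompanyURL = scrapy.Field()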

Save item to Excel via item pipeline

In Scrapy, items are processed and persisted in the item pipeline. Here the item content is saved to an Excel file. First, the code:

from openpyxl import Workbook

class YellowpagescrawlerPipeline(object):
    wb = Workbook()
    ws = wb.active
    # header row: company name, business category, telephone, email, URL
    ws.append(['Company Name', 'Business Category', 'Telephone', 'Email', 'URL'])

    def process_item(self, item, spider):
        line = [item['CompanyName'], item['CompanyCategory'], item['CompanyTel'],
                item['CompanyMail'], item['CompanyURL']]
        self.ws.append(line)
        self.wb.save('CompanyInfo.xlsx')   # saved after every item: simple, and nothing is lost if the crawl is interrupted
        return item

openpyxl is a third-party library (installable with pip install openpyxl), and the code is straightforward.
For the pipeline to take effect, it must be registered under ITEM_PIPELINES in settings.py, as follows:

ITEM_PIPELINES = {
    'YellowPagesCrawler.pipelines.YellowpagescrawlerPipeline': 200,
}

200 is the priority, and the smaller the value, the higher the priority.
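
If several pipelines are enabled, these numbers decide the order in which each item passes through them; for example (the second pipeline below is hypothetical):

ITEM_PIPELINES = {
    'YellowPagesCrawler.pipelines.SomeValidationPipeline': 100,       # hypothetical: would run first
    'YellowPagesCrawler.pipelines.YellowpagescrawlerPipeline': 200,   # runs second
}
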
Crawl results:

(screenshot: crawl results in the generated Excel file)

Changed requirements: grab company information for a specific category

A temporary project requirement came in: fetch company information for a specific category, for example:
http://www.yellowpages.co.th/en/heading/Plastics-Specialties-Wholesales&Manufacturers?page=0

(screenshot: the company list page for this category)

As you can see, each page contains a list of companies, and clicking a company leads to a page with its detailed information. So the spider is modified as follows:

import re
import scrapy
from scrapy import Request
from YellowPagesCrawler.items import YPItem   # assumed import path for the item class used below

class YPSpider(scrapy.Spider):

    name = "YPSpider"
    allowed_domains = ['yellowpages.co.th']
    base_url = 'http://www.yellowpages.co.th/en/heading/Plastics-Specialties-Wholesales&Manufacturers?page='

    def start_requests(self):

        for i in range(0,77):              # the category spans 77 result pages (hard-coded)
            url = self.base_url + str(i)
            print(url)
            yield Request(url,self.parse)

    def parse(self,response):

        urls = response.xpath('.//h3/a/@href').extract()
        CoNames = response.xpath('.//h3/a/text()').extract()

        for index in range(0,len(urls)):
            print(urls[index])
            yield Request(urls[index],callback=self.getItems,meta={'CoName':CoNames[index]})

    def getItems(self,response):

        item = YPItem()
        item['CompanyName'] = str(response.meta['CoName'])
        item['CompanyURL'] = response.url

        # parse the two kinds of detail-page layouts separately
        if (re.search(r'^http://www.yellowpages.co.th/en/profile/\S+',response.url)):

            item['CompanyTel'] = ""
            telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/a/text()').extract()

            if telnumbers == []:
                telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/text()').extract()

            for tel in telnumbers:
                item['CompanyTel'] = item['CompanyTel'] + tel.strip() + ' ' 

            if item['CompanyTel'] == "":
                item['CompanyTel'] = "no Telephone"

            mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/a/text()').extract()

            if mail == []:
                mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/text()').extract()

            if mail != []:    
                item['CompanyMail'] = mail[0].strip()
            else:
                item['CompanyMail'] = "no Email"

        else:

            item['CompanyTel'] = ""
            telnumbers =  response.xpath('.//a[contains(@href,"tel")]/nobr/text()').extract()
            for telNum in telnumbers:
                item['CompanyTel'] = item['CompanyTel'] + telNum + ' '

            if item['CompanyTel'] == "":
                item['CompanyTel'] = "no Telephone"

            item['CompanyMail'] = ""
            mail = response.xpath('.//a[contains(@href,"mailto")]/text()').extract()
            if mail != []:    
                item['CompanyMail'] = mail[0]
            else:
                item['CompanyMail'] = "no Email"

        return item

Note that different requests can be given different callback functions, and when yielding a Request you can pass values through meta for the callback to use.

Passing the variables:

yield Request(urls[index],callback=self.getItems,meta={'CoName':CoNames[index]})

Use these variables in the callback function:

item['CompanyName'] = str(response.meta['CoName'])
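
As a side note (not in the original article), Scrapy 1.7 and later also offer cb_kwargs, which delivers the values as ordinary keyword arguments of the callback:

yield Request(urls[index], callback=self.getItems, cb_kwargs={'CoName': CoNames[index]})

def getItems(self, response, CoName):   # CoName arrives as a parameter instead of via response.meta
    ...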
