Getting Started with Scrapy (2)

Debugging

There are generally two ways to debug a Scrapy project: through scrapy shell, or through an IDE's debug feature. Both are described below.

Environment
- Language: Python 3.6
- IDE: VS Code
- Browser: Chrome

scrapy shell

On the command line, run scrapy shell <URL of the page to fetch>.
Once it starts successfully you are dropped into the Scrapy shell, where you can test selectors interactively with response.xpath('...').
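For example, a quick session might look like this (the URL, XPath, and output are illustrative):

scrapy shell http://www.yellowpages.co.th
>>> response.status
200
>>> response.xpath('//h2/text()').extract()
['Some Company Name']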

IDE Debug

First, create run.py in the same directory as items.py. Its code is as follows:

from scrapy import cmdline

# name must match the spider's `name` attribute
name = 'douban_book_top250'
cmd = 'scrapy crawl {0}'.format(name)
cmdline.execute(cmd.split())

Here, name is the name attribute of the spider created earlier. Set a breakpoint in the spider file (or wherever you like), then launch run.py under the VS Code debugger. Execution pauses at the breakpoint, where you can inspect the relevant state.
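A minimal .vscode/launch.json for this setup might look like the sketch below (the configuration name is arbitrary, and run.py is assumed to sit at the workspace root):

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug Scrapy spider",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/run.py",
            "console": "integratedTerminal"
        }
    ]
}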

[debug.jpg: screenshot of the VS Code debugger paused at a breakpoint]

Project: Crawling Company Information

CrawlSpider

The task: crawl a Yellow-Pages-style website and extract each company's basic information (name, business category, telephone, email). After browsing the site, the relevant URLs look like this:

[ypurl: screenshot of the company-profile URL format]

The company-profile URLs all follow the pattern .../en/profile/<digits and letters>. Inspired by the 小白进阶 tutorial series, a CrawlSpider is used; CrawlSpider is designed for crawling URLs that match a set of rules. The code is as follows:

import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from YellowPagesCrawler.items import YellowPagesCrawlerItem

class YPCrawler(CrawlSpider):

    name = 'YPCrawler'

    allowed_domains = ['yellowpages.co.th']
    start_urls = ['http://www.yellowpages.co.th']

    rules = (
        # extract every link on the page; the real filtering happens in parse_item
        Rule(LinkExtractor(allow=(), restrict_xpaths='//a[@href]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):

        # only company-profile pages carry the information we want
        if re.search(r'^http://www.yellowpages.co.th/en/profile/\S+', response.url):
            print(response.url)
            item = YellowPagesCrawlerItem()
            item['CompanyURL'] = response.url
            item['CompanyName'] = response.xpath('.//h2/text()').extract()[0]
            item['CompanyCategory'] = response.xpath('.//strong/parent::*/following-sibling::*/text()').extract()[0]
            item['CompanyTel'] = ""
            telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/a/text()').extract()

            if telnumbers == []:
                telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/text()').extract()

            for tel in telnumbers:
                item['CompanyTel'] = item['CompanyTel'] + tel.strip() + ' ' 

            mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/a/text()').extract()

            if mail == []:
                mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/text()').extract()

            if mail != []:    
                item['CompanyMail'] = mail[0].strip()
            else:
                item['CompanyMail'] = "no Email"

            return item


Originally the regular expression now in parse_item was placed in allow. Crawling from the homepage then found nothing: the homepage has no direct links to profile pages, and even with follow=True nothing happens, because once allow filters the extracted links none are left to follow. So allow is left empty, meaning every URL under allowed_domains is crawled, and the filtering is done inside parse_item against response.url.

About the Rule parameters: allow and restrict_xpaths both filter which URLs are extracted; callback names the function that processes the response; follow controls whether links found on crawled pages are followed in turn. One thing to watch out for: rules must be an iterable (typically a tuple), so the trailing comma after the Rule definition is essential; without it rules is just a plain Rule object and Scrapy raises an error.
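A minimal illustration of the trailing-comma gotcha (the LinkExtractor arguments are placeholders):

# WRONG: without a trailing comma these parentheses only group the
# expression, so `rules` is a single Rule object and Scrapy errors out
rules = (
    Rule(LinkExtractor(allow=()), callback='parse_item', follow=True)
)

# RIGHT: the trailing comma makes `rules` a one-element tuple
rules = (
    Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
)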

Saving Items to Excel via an Item Pipeline

In Scrapy, items are processed and persisted in pipelines. Here the items are written to an Excel file. Code first:

from openpyxl import Workbook

class YellowpagescrawlerPipeline(object):
    # one workbook shared by the pipeline, initialised with a header row
    wb = Workbook()
    ws = wb.active
    ws.append(['Company Name', 'Business Category', 'Telephone', 'Email', 'URL'])

    def process_item(self, item, spider):
        line = [item['CompanyName'], item['CompanyCategory'], item['CompanyTel'],
                item['CompanyMail'], item['CompanyURL']]
        self.ws.append(line)
        # re-save after every item; a forward slash avoids the backslash-escape
        # problems of the original Windows-style path '.\CompanyInfo.xlsx'
        self.wb.save('./CompanyInfo.xlsx')
        return item

openpyxl is a third-party library, and the code is straightforward.
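A standalone sketch of the openpyxl calls used above (the file name and rows are illustrative):

from openpyxl import Workbook

wb = Workbook()        # new in-memory workbook
ws = wb.active         # the default worksheet
ws.append(['Company Name', 'Telephone'])       # header row
ws.append(['ACME Co., Ltd.', '02-123-4567'])   # one data row
wb.save('demo.xlsx')   # write to disk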
To enable the pipeline, register it in the ITEM_PIPELINES setting in settings.py:

ITEM_PIPELINES = {
    'YellowPagesCrawler.pipelines.YellowpagescrawlerPipeline': 200,
}

Here 200 is the priority: the lower the number, the earlier the pipeline runs.
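When several pipelines are enabled, each item passes through them in ascending order of these numbers (Scrapy expects values in the 0-1000 range). A sketch with a second, purely hypothetical cleaning pipeline:

ITEM_PIPELINES = {
    'YellowPagesCrawler.pipelines.CleaningPipeline': 100,            # hypothetical; runs first
    'YellowPagesCrawler.pipelines.YellowpagescrawlerPipeline': 200,  # runs second
}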
Crawl results:

[result: screenshot of the generated CompanyInfo.xlsx]

Changed Requirement: Crawling a Specific Category of Companies

A last-minute project requirement: crawl the companies under one specific category, for example:
http://www.yellowpages.co.th/en/heading/Plastics-Specialties-Wholesales&Manufacturers?page=0


Each page contains a list of companies, and clicking a company leads to its detail page. The spider is therefore modified as follows:

import re
import scrapy
from scrapy.http import Request
from YellowPagesCrawler.items import YPItem  # adjust to wherever YPItem is defined

class YPSpider(scrapy.Spider):

    name = "YPSpider"
    allowed_domains = ['yellowpages.co.th']
    base_url = 'http://www.yellowpages.co.th/en/heading/Plastics-Specialties-Wholesales&Manufacturers?page='

    def start_requests(self):

        # the category listing spans pages 0-76
        for i in range(0, 77):
            url = self.base_url + str(i)
            print(url)
            yield Request(url, self.parse)

    def parse(self, response):

        # company names and the links to their detail pages
        urls = response.xpath('.//h3/a/@href').extract()
        CoNames = response.xpath('.//h3/a/text()').extract()

        for index in range(0, len(urls)):
            print(urls[index])
            yield Request(urls[index], callback=self.getItems, meta={'CoName': CoNames[index]})

    def getItems(self, response):

        item = YPItem()
        item['CompanyName'] = str(response.meta['CoName'])
        item['CompanyURL'] = response.url

        # the two page layouts (profile pages vs. other detail pages) are parsed separately
        if re.search(r'^http://www.yellowpages.co.th/en/profile/\S+', response.url):

            item['CompanyTel'] = ""
            telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/a/text()').extract()

            if telnumbers == []:
                telnumbers = response.xpath('.//div[contains(text(),"Telephone")]/following::*[1]/text()').extract()

            for tel in telnumbers:
                item['CompanyTel'] = item['CompanyTel'] + tel.strip() + ' ' 

            if item['CompanyTel'] == "":
                item['CompanyTel'] = "no Telephone"

            mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/a/text()').extract()

            if mail == []:
                mail = response.xpath('.//div[contains(text(),"Email")]/following::*[1]/text()').extract()

            if mail != []:    
                item['CompanyMail'] = mail[0].strip()
            else:
                item['CompanyMail'] = "no Email"

        else:

            item['CompanyTel'] = ""
            telnumbers = response.xpath('.//a[contains(@href,"tel")]/nobr/text()').extract()
            for telNum in telnumbers:
                item['CompanyTel'] = item['CompanyTel'] + telNum + ' '

            if item['CompanyTel'] == "":
                item['CompanyTel'] = "no Telephone"

            item['CompanyMail'] = ""
            mail = response.xpath('.//a[contains(@href,"mailto")]/text()').extract()
            if mail != []:    
                item['CompanyMail'] = mail[0]
            else:
                item['CompanyMail'] = "no Email"

        return item

Note that different requests can be given different callback functions, and that when yielding a request you can pass whatever variables you need through meta, so the callback can use them.

Passing a variable:

yield Request(urls[index], callback=self.getItems, meta={'CoName': CoNames[index]})

Using it in the callback:

item['CompanyName'] = str(response.meta['CoName'])

Reposted from blog.csdn.net/Tomato_Sir/article/details/80068823