Scraping lead information from PaginasAmarillas, the Spanish-language yellow pages, with Scrapy

I needed to develop clients in South America for work, so I thought of https://www.paginasamarillas.com, the yellow pages site for Spanish-speaking countries, and a quick search there did turn up plenty of listings.

It was also a good opportunity to practice Scrapy.

Source code:

./spiders/paginasamarillas_spider.py

from scrapy import Request
from scrapy.spiders import Spider
from paginasamarillas.items import PaginasAmarillasItem
import time

class PaginasAmarillasSpider(Spider):
    name = "empaque_flexible"

    def start_requests(self):
        # Start from the flexible-packaging services category on the Colombian site
        url = 'http://www.paginasamarillas.com.co/servicios/empaque-flexible'
        yield Request(url)

    def parse(self, response):
        # Each listing on the results page sits in its own col-sm-10 block
        empresas = response.xpath('//div[@class="col-sm-10"]')
        for empresa in empresas:
            item = PaginasAmarillasItem()
            # Company name, website and description; extract_first() returns
            # None instead of raising when a field is missing
            item['nombre'] = empresa.xpath('.//span[@class="semibold"]/text()').extract_first()
            item['sitio'] = empresa.xpath('.//div[@class="url"]/a/@href').extract_first()
            item['des'] = empresa.css('div.col-sm-12.infoBox p::text').extract_first()
            yield item

        # Queue the remaining result pages; Scrapy's default duplicate filter
        # drops the repeated URLs that this loop yields from every response
        for i in range(2, 35):
            time.sleep(5)
            next_url = "http://www.paginasamarillas.com.co/servicios/empaques-y-envases-flexibles?page=" + str(i)
            yield Request(next_url)
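A side note on the time.sleep(5) calls above: they block Scrapy's single-threaded event loop while they run, so the more idiomatic way to pace requests is the built-in DOWNLOAD_DELAY setting. A minimal alternative, assuming a 5-second delay is what we want, would be to drop the sleeps (and the import time) and add this to settings.py:

# Wait 5 seconds between consecutive requests to the same site
DOWNLOAD_DELAY = 5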

./items.py

import scrapy

class PaginasAmarillasItem(scrapy.Item):
    # Fields collected for each lead
    nombre = scrapy.Field()  # company name
    sitio = scrapy.Field()   # website URL
    des = scrapy.Field()     # short description

./settings.py

At this point we find that this little spider usually can't scrape anything at all. Why? Is the code wrong somewhere? No! It's a hidden Scrapy pitfall: by default Scrapy obeys the site's robots.txt rules, so if the site disallows crawling, it simply won't crawl. Ridiculous, right? So we open settings.py in the project directory, find ROBOTSTXT_OBEY, and set it to False.

In addition, USER_AGENT needs to be set, because the default one essentially tells the server, "I'm a crawler, please reject me!"

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
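
With these two settings in place, the spider can be run from the project directory and the scraped leads exported directly via Scrapy's built-in feed export (the output filename here is just an example):

scrapy crawl empaque_flexible -o leads.csv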

Reposted from blog.csdn.net/weixin_42616808/article/details/80927368