3. A Simple Crawler: Scraping Lagou Job Postings (Part 2)

This article is for learning purposes only; if you spot any mistakes, please point them out.

1. Create the project directory

mkdir lagou

2. Enter the directory and create a Python 3 virtual environment

cd lagou
pipenv install scrapy

3. Enter the pipenv shell and use the scrapy command to create the crawler project

pipenv shell
scrapy startproject lagou
cd lagou
scrapy genspider -t crawl test www.lagou.com

4. Analyze the pages
Clicking into a job category takes us to a URL such as https://www.lagou.com/zhaopin/Java/?labelWords=label
Copy a few more and compare:
https://www.lagou.com/zhaopin/Java/?labelWords=label
https://www.lagou.com/zhaopin/chanpinjingli1/?labelWords=label
https://www.lagou.com/zhaopin/xinmeitiyunying/?labelWords=label
They all share the zhaopin/ path segment, so in the spider's rules we can define a regex that matches every URL on the page fitting that pattern:

Rule(LinkExtractor(allow=r'zhaopin/'), follow=True),

Opening a job posting brings up the detail page; the fields we need are in its body. Open a few postings and compare their URLs:
https://www.lagou.com/jobs/4597655.html
https://www.lagou.com/jobs/4182278.html
https://www.lagou.com/jobs/4692422.html
They all contain the jobs/ segment, so the second rule we add is:

Rule(LinkExtractor(allow=r'jobs/\d+.*'), callback='parse_item', follow=True),
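
Putting the two rules together, the CrawlSpider generated by scrapy genspider -t crawl ends up looking roughly like this (a sketch only; the class name and start_urls follow from the genspider command above, and parse_item is filled in in the next section):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']

    rules = (
        # Follow job-category listing pages (zhaopin/...) without parsing them
        Rule(LinkExtractor(allow=r'zhaopin/'), follow=True),
        # Parse each job detail page (jobs/<id>.html) with parse_item
        Rule(LinkExtractor(allow=r'jobs/\d+.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        pass  # extraction code is added in the next section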

5. Write the item extraction code

    def parse_item(self, response):
        item = zhaoping_xinxi_itemloader()
        item['name'] = response.css('.company::text')[0].extract()
        item['title'] = response.css('.job-name .name::text')[0].extract()
        item['salary'] = response.css('.salary::text')[0].extract()
        item['city'] = response.css('.job_request span::text')[1].extract().split('/')[1]
        item['jingyan'] = response.css('.job_request span::text')[2].extract().split('/')[0]
        item['xueli'] = response.css('.job_request span::text')[3].extract().split('/')[0]
        item['lists'] = response.css('.position-label li::text').extract()  # multiple values (list of labels)
        item['miaosu'] = response.css('.job_bt')[0].extract()
        yield item

The corresponding Item also has to be defined in items.py.
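
A plain version of that Item, using the same field names as the extraction code above, might look like this (a minimal sketch; the version with processors appears in section 8):

import scrapy


class zhaoping_xinxi_itemloader(scrapy.Item):
    name = scrapy.Field()     # company name
    title = scrapy.Field()    # job title
    salary = scrapy.Field()
    city = scrapy.Field()
    jingyan = scrapy.Field()  # required experience
    xueli = scrapy.Field()    # required education
    lists = scrapy.Field()    # position labels
    miaosu = scrapy.Field()   # job description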

6. Run the crawler

scrapy crawl test -o test.json   

We find that the content returned to us is
https://www.lagou.com/utrack/trackMid.html?f=https%3A%2F%2Fpassport.lagou.com%2Flogin%2Flogin.html%3Fmsg%3Dvalidation%26uStatus%3D2%26clientIp%3D101.66.185.15&t=1528418878&_ti=1
This is a login page. In other words, the pages matched by the first rule can be crawled freely, but when we try to fetch the job detail pages matched by the second rule, Lagou redirects us to a page that requires login, so we need to simulate a logged-in request.
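
One way to confirm this from inside the spider (a small sketch, not part of the original code) is to add a guard at the top of parse_item that checks whether the request was redirected into the login flow:

    def parse_item(self, response):
        # When not logged in, lagou bounces job detail pages to its login flow
        if 'passport.lagou.com' in response.url or 'login' in response.url:
            self.logger.warning('Redirected to login page: %s', response.url)
            return
        # ...normal extraction continues here...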

7. Simulate login
In the Spider base-class source code we can see:

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

The class has an attribute called custom_settings, which lets an individual spider override project-wide settings, including the default request headers, so we can use it to make our requests look like they come from a logged-in browser. The Cookie value below was captured from a logged-in browser session, so substitute your own.
Set up the simulated request:

    custom_settings = {
        "COOKIES_ENABLED": False,
        "DOWNLOAD_DELAY": 1,
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': 'JSESSIONID=ABAAABAAAFCAAEGBC99154D1A744BD8AD12BA0DEE80F320; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=0; _ga=GA1.2.1111395267.1516570248; _gid=GA1.2.1409769975.1516570248; user_trace_token=20180122053048-58e2991f-fef2-11e7-b2dc-525400f775ce; PRE_UTM=; LGUID=20180122053048-58e29cd9-fef2-11e7-b2dc-525400f775ce; index_location_city=%E5%85%A8%E5%9B%BD; X_HTTP_TOKEN=7e9c503b9a29e06e6d130f153c562827; _gat=1; LGSID=20180122055709-0762fae6-fef6-11e7-b2e0-525400f775ce; PRE_HOST=github.com; PRE_SITE=https%3A%2F%2Fgithub.com%2Fconghuaicai%2Fscrapy-spider-templetes; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F4060662.html; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516569758,1516570249,1516570359,1516571830; _putrc=88264D20130653A0; login=true; unick=%E7%94%B0%E5%B2%A9; gate_login_token=3426bce7c3aa91eec701c73101f84e2c7ca7b33483e39ba5; LGRID=20180122060053-8c9fb52e-fef6-11e7-a59f-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1516572053; TG-TRACK-CODE=index_navigation; SEARCH_ID=a39c9c98259643d085e917c740303cc7',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        }
    }

Crawl again, and this time it succeeds. Note that COOKIES_ENABLED is set to False so that Scrapy's cookie middleware does not manage cookies itself and the Cookie string in DEFAULT_REQUEST_HEADERS is sent unchanged with every request.

8. Simplify the spider with ItemLoader

    def parse_item(self, response):
        item_loader = putongItem(item=zhaoping_xinxi_itemloader(), response=response)
        item_loader.add_css('name', '.company::text')
        item_loader.add_css('title', '.job-name .name::text')
        item_loader.add_css('salary', '.salary::text')
        item_loader.add_css('city', '.job_request span:nth-child(1)::text')
        item_loader.add_css('jingyan', '.job_request span:nth-child(2)::text')
        item_loader.add_css('xueli', '.job_request span:nth-child(3)::text')
        item_loader.add_css('lists', '.position-label li::text')
        item_loader.add_css('miaosu', '.job_bt')
        yield item_loader.load_item()
# items.py
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst


class zhaoping_xinxi_itemloader(scrapy.Item):
    name = scrapy.Field(
        # change is the author's helper module with string-cleaning functions (not shown here)
        input_processor=MapCompose(change.change_string)
    )
    title = scrapy.Field()
    salary = scrapy.Field()
    city = scrapy.Field()
    jingyan = scrapy.Field()
    xueli = scrapy.Field()
    lists = scrapy.Field()
    miaosu = scrapy.Field()


class putongItem(ItemLoader):
    default_output_processor = TakeFirst()
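
The change.change_string processor is not shown in the original post; presumably it is a small string-cleaning helper along these lines (purely an assumed implementation):

# change.py (hypothetical helper module referenced by the Item above)
def change_string(value):
    # strip surrounding whitespace/newlines from an extracted text node
    return value.strip() if value else value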

The crawl is still quite slow; later I will learn about Redis-based distributed crawling.


Reposted from blog.csdn.net/llh_e/article/details/80617968