Scrapy: Crawling Tencent Job Postings

Scrapy
	. requests + selenium --> cover ~90% of crawling needs
	. Scrapy --> the remaining ~10% --> faster, more powerful crawlers
	. What is Scrapy? A framework
	. regex / bs4 / lxml are modules; a module is a hand, a framework is the whole body
	. Built on Twisted, an asynchronous networking framework that speeds up downloads; it leans heavily on callbacks (closures)
	Scrapy tutorial:
	https://docs.scrapy.org/en/latest/intro/tutorial.html
	
Scrapy workflow (key point)
	1. The Spider hands the URLs to be requested, via the Scrapy Engine, to the Scheduler.
	2. The Scheduler queues and orders the requests, then sends them back through the Engine to the Downloader Middleware (user-agent, cookie, proxy) and on to the Downloader.
	3. The Downloader sends the request to the internet, receives the response, and returns it through the Engine and the Spider Middleware to the Spider.
	4. The Spider parses the response, extracts the data, and passes it through the Engine to the Item Pipeline, which saves it.
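A minimal spider skeleton (hypothetical names, not part of the Tencent project below) shows where each step of this flow lives in code:

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    # start_urls: Spider --> Engine --> Scheduler
    start_urls = ['https://example.com']

    def parse(self, response):
        # the response arrived via Downloader --> Engine --> Spider Middleware
        # yielding data sends it through the Engine to the Item Pipeline
        yield {'title': response.css('title::text').get()}
        # yielding a Request sends a new URL back to the Scheduler:
        # yield scrapy.Request(next_url, callback=self.parse)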

Creating a project
D:\2020st\pycharm.professional\xxx
$ cd scrapy框架
Create a Scrapy project:
$ scrapy startproject mySpider
$ cd mySpider
# generate a spider
$ scrapy genspider db douban.com
Run it once created:
scrapy框架\mySpider> scrapy crawl db
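startproject generates a directory layout like this; db.py appears after genspider runs:

mySpider/
    scrapy.cfg          # deploy configuration
    mySpider/
        __init__.py
        items.py        # Item definitions
        middlewares.py  # spider / downloader middleware
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/
            __init__.py
            db.py       # created by scrapy genspider db douban.com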

Scrapy
How did we handle pagination before, with plain requests?
	Find the pattern in the page numbers
	Call requests.get(url) for each page
Pagination in Scrapy
	Find the next page's URL (or the pattern behind it)
	Build a Request for the next page's URL and pass it to the scheduler, as in the sketch below
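A generic sketch of that pattern, assuming an HTML listing page with a "next" link (the Tencent API below paginates with a pageIndex parameter instead; names here are placeholders):

import scrapy

class PagingSpider(scrapy.Spider):
    name = 'paging_demo'
    start_urls = ['https://example.com/jobs?page=1']

    def parse(self, response):
        # ...extract data from the current page...
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            # response.follow resolves relative URLs and enqueues the Request
            yield response.follow(next_page, callback=self.parse)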
Goal: crawl the responsibilities and requirements of every job posting

# list-page URLs [pages 1 2 3 4]
Page 1:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615465802464&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=us
Page 3:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615467778990&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex=3&pageSize=10&language=zh-cn&area=us

# detail-page URL
https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1615466673257&postId=1369980726107185152&language=zh-cn
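Between list pages only pageIndex (and the millisecond timestamp) changes; the detail page is keyed by postId. To inspect the JSON shape before writing the spider, a quick look with requests works (parameters trimmed, the API may want more of them; field names match the spider code below):

import time
import requests

# one-off check, not part of the Scrapy project
url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?timestamp={}&pageIndex=1&pageSize=10&language=zh-cn').format(int(time.time() * 1000))
data = requests.get(url).json()
for post in data['Data']['Posts']:
    print(post['RecruitPostName'], post['CategoryName'], post['PostId'])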

1. Open the Terminal panel in the IDE
2. cd into the working directory (copy the path)
3. Create the project: scrapy startproject tencent
4. Enter the project directory: cd tencent
5. Create the spider file: scrapy genspider hr2 tencent.com
6. Write the code below in that file
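hr2.py imports TencentItem from items.py. The original post does not show items.py, but from the fields used below it would look like:

items.py

import scrapy

class TencentItem(scrapy.Item):
    # fields inferred from hr2.py
    zhi_type = scrapy.Field()     # job title (RecruitPostName)
    zhi_jishu = scrapy.Field()    # job category (CategoryName)
    zhi_duty = scrapy.Field()     # responsibilities
    zhi_require = scrapy.Field()  # requirements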

hr2.py

import scrapy
import json
from tencent.items import TencentItem

class HrSpider(scrapy.Spider):
    name = 'hr2'
    allowed_domains = ['tencent.com']
    # list-page URL template
    one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615465802464&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=us'
    # detail-page URL template
    two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1615466673257&postId={}&language=zh-cn'
    start_urls = [one_url.format(1)]

    def parse(self, response):
        # fan out requests for list pages 1 and 2
        for page in range(1, 3):
            url = self.one_url.format(page)
            yield scrapy.Request(
                url=url,
                callback=self.parse_one
            )

    def parse_one(self, response):
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()
            item['zhi_jishu'] = job['CategoryName']
            item['zhi_type'] = job['RecruitPostName']
            # because the fields are declared in items.py, a mistyped field name fails loudly:
            # KeyError: 'TencentItem does not support field: zhi_type111'
            post_id = job['PostId']
            # build the detail-page URL
            detail_url = self.two_url.format(post_id)

            yield scrapy.Request(
                url=detail_url,
                meta={'item': item},
                callback=self.parse_two
            )
            # print(job)

    def parse_two(self, response):
        # first way to receive the value passed through meta:
        # item = response.meta['item']
        # second way:
        item = response.meta.get('item')
        # print(response.text)
        # print(type(response.text))   # <class 'str'>

        data = json.loads(response.text)
        item['zhi_duty'] = data['Data']['Responsibility']
        item['zhi_require'] = data['Data']['Requirement']
        print(item)
        # print(type(item))  # <class 'tencent.items.TencentItem'>
        # hand the finished item to the item pipeline
        yield item
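The yielded item travels through the Engine to the item pipeline. A minimal pipeline sketch that writes each item to a JSON-lines file (hypothetical; the original post does not show pipelines.py):

pipelines.py

import json

class TencentPipeline:
    def open_spider(self, spider):
        # runs once when the spider starts
        self.f = open('tencent_jobs.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) turns the TencentItem into a plain dict
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()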

Runner file
start.py

from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'hr2'])

Edit the config file: add LOG_LEVEL = 'WARNING'
settings.py


BOT_NAME = 'tencent'
LOG_LEVEL = 'WARNING'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'
Comment out ROBOTSTXT_OBEY (Scrapy then falls back to its default of not obeying robots.txt):
# ROBOTSTXT_OBEY = True
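If you add the pipeline sketched above, it also has to be enabled here; the number is a priority (lower runs first):

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}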

Reposted from blog.csdn.net/weixin_45905671/article/details/114582356