Learning Scrapy by Example: Scraping Tencent's Technical Job Postings

Page URL:

https://careers.tencent.com/search.html?pcid=40001

Goal:

Output each scraped posting's job title, responsibilities, requirements, and publish date as a dictionary.

Scrapy project layout:

Approach:

Use the browser's developer tools to find the pattern in the request URLs (the most important part of scraping); once the pattern is known, simply extract the data from the responses.

Figure 1

As Figure 1 shows, there are 187 pages of job postings, so we need to loop over all of them.

The actual request URL captured in the browser's network panel is:

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1575882949947&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

Pattern: pageIndex=1 is the page number; the other parameters can stay unchanged for now.

(After finishing the crawl I noticed there is a timestamp here too; it seems to have no effect, and a fixed value still returns data.)
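Given the pattern above, a list-page URL for any page number can be assembled with a simple template. This is a minimal sketch; `list_page_url` is a hypothetical helper name, and the timestamp is kept fixed since, as noted, a stale value still returns data.

```python
# Fixed Query API URL with a {page} placeholder for pageIndex
BASE = ('https://careers.tencent.com/tencentcareer/api/post/Query'
        '?timestamp=1575882949947&countryId=&cityId=&bgIds=&productId='
        '&categoryId=40001001,40001002,40001003,40001004,40001005,40001006'
        '&parentCategoryId=&attrId=&keyword=&pageIndex={page}'
        '&pageSize=10&language=zh-cn&area=cn')

def list_page_url(page):
    """Return the Query API URL for the given 1-based page number."""
    return BASE.format(page=page)

print(list_page_url(3))
```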

Figure 2

Figure 3

The PostId in the response is the identifier appended to the job detail page URL:

Example job detail page:

https://careers.tencent.com/jobdesc.html?postId=1203886892391600128

Figure 4

From the requests captured in the browser, the detail API URL follows this pattern:

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1575882580480&postId=1203886892391600128&language=zh-cn

"https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=" is the fixed part; what follows is the current timestamp and the postId.
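Putting the fixed prefix, a current millisecond timestamp, and a PostId together gives the detail API URL. A minimal sketch (`detail_url` is a hypothetical helper name, not from the original code):

```python
import time

def detail_url(post_id):
    """Assemble the ByPostId API URL: fixed prefix + 13-digit timestamp + postId."""
    ts = int(time.time() * 1000)  # current time in milliseconds
    return ('https://careers.tencent.com/tencentcareer/api/post/ByPostId'
            '?timestamp={}&postId={}&language=zh-cn'.format(ts, post_id))

print(detail_url('1203886892391600128'))
```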

 

The Python code is as follows:

For the timestamp part, see: https://blog.csdn.net/qq_31603575/article/details/83343791
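The referenced post builds the 13-digit stamp from `datetime` fields; assuming only millisecond precision is needed, a one-line equivalent is simply seconds since the epoch times 1000, truncated:

```python
import time

# 13-digit (millisecond) timestamp in one line,
# equivalent in practice to the get_time_stamp13() helper below
stamp = int(time.time() * 1000)
print(stamp)  # a value like 1575882949947
```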

# -*- coding: utf-8 -*-
import scrapy
import json
import time
import datetime


def get_time_stamp13():
    """Build a 13-digit (millisecond) timestamp, e.g. 1540281250399."""
    datetime_now = datetime.datetime.now()
    # 10 digits: whole seconds since the UNIX epoch
    date_stamp = str(int(time.mktime(datetime_now.timetuple())))
    # 3 digits: milliseconds (the first three digits of the microsecond field)
    data_microsecond = str("%06d" % datetime_now.microsecond)[0:3]
    return int(date_stamp + data_microsecond)


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['tencent.com']

    def start_requests(self):
        # 187 list pages in total (see Figure 1); generating them all here
        # avoids re-yielding the whole page loop from every parse() call,
        # which with dont_filter=True would request the same pages endlessly.
        base = ('https://careers.tencent.com/tencentcareer/api/post/Query'
                '?timestamp=1575855782891&countryId=&cityId=&bgIds=&productId='
                '&categoryId=&parentCategoryId=40001&attrId=&keyword='
                '&pageIndex={}&pageSize=10&language=zh-cn&area=cn')
        for i in range(1, 188):
            yield scrapy.Request(url=base.format(i), callback=self.parse)

    def parse(self, response):
        # Each list page returns JSON; Data.Posts holds the job summaries
        html_json = json.loads(response.text)
        for lj in html_json['Data']['Posts']:
            time_format = str(get_time_stamp13())
            # Assemble the detail-page API URL from the fixed prefix,
            # the current timestamp, and the job's PostId
            desc_url = ('https://careers.tencent.com/tencentcareer/api/post/ByPostId'
                        '?timestamp=' + time_format + '&postId=' + lj['PostId'] +
                        '&language=zh-cn')
            # Request each job's detail page
            yield scrapy.Request(url=desc_url, callback=self.parse_detail,
                                 dont_filter=True)

    def parse_detail(self, response):
        # Convert the JSON response to a dict and pick out the four fields
        detail = json.loads(response.text)
        job_dic = {}
        job_dic['job title'] = detail['Data']['RecruitPostName']
        job_dic['responsibilities'] = detail['Data']['Responsibility']
        job_dic['publish date'] = detail['Data']['LastUpdateTime']
        job_dic['requirements'] = detail['Data']['Requirement']
        print(job_dic)
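The field extraction can be sanity-checked without running the spider by feeding the same logic a hand-made response body. The JSON shape below mirrors the fields the spider reads (`RecruitPostName`, `Responsibility`, `LastUpdateTime`, `Requirement`); the sample values are invented for illustration.

```python
import json

# Hypothetical ByPostId response body; values are made up,
# but the keys match what the spider extracts
sample = json.dumps({
    'Data': {
        'RecruitPostName': 'Backend Engineer',
        'Responsibility': 'Build services',
        'LastUpdateTime': 'December 9, 2019',
        'Requirement': 'Python experience',
    }
})

data = json.loads(sample)['Data']
job_dic = {
    'job title': data['RecruitPostName'],
    'responsibilities': data['Responsibility'],
    'publish date': data['LastUpdateTime'],
    'requirements': data['Requirement'],
}
print(job_dic)
```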








 

Sample output:



Reprinted from blog.csdn.net/OthersOnlyMe/article/details/103461921