Learning Scrapy by Example: Scraping Tencent's Technical Job Postings

Page URL:

https://careers.tencent.com/search.html?pcid=40001

Goal:

Output each scraped posting's job title, responsibilities, requirements, and publish date as a dictionary.

Scrapy project layout:

Approach:

Use the browser's developer tools to find the pattern in the request URLs (the most important part of scraping); once the pattern is known, simply extract the data from the responses.

Figure 1

As Figure 1 shows, there are 187 pages of job postings, so we need to loop over all of them.

The actual request URL captured in the browser's network panel is:

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1575882949947&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

Pattern: pageIndex=1 is the page number; the other parameters can stay unchanged for now.

(After finishing the crawl I noticed there is a timestamp here too; it seems to have no effect, and a fixed value still returns data.)
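Given the pattern above, a list-page URL for any page number can be assembled with a simple template. This is a minimal sketch; `list_page_url` is a hypothetical helper name, and the timestamp is kept fixed since, as noted, a stale value still returns data.

```python
# Fixed Query API URL with a {page} placeholder for pageIndex
BASE = ('https://careers.tencent.com/tencentcareer/api/post/Query'
        '?timestamp=1575882949947&countryId=&cityId=&bgIds=&productId='
        '&categoryId=40001001,40001002,40001003,40001004,40001005,40001006'
        '&parentCategoryId=&attrId=&keyword=&pageIndex={page}'
        '&pageSize=10&language=zh-cn&area=cn')

def list_page_url(page):
    """Return the Query API URL for the given 1-based page number."""
    return BASE.format(page=page)

print(list_page_url(3))
```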

Figure 2

Figure 3

The PostId in the response is the identifier appended to the job detail page URL:

Example job detail page:

https://careers.tencent.com/jobdesc.html?postId=1203886892391600128

Figure 4

From the requests captured in the browser, the detail API URL follows this pattern:

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1575882580480&postId=1203886892391600128&language=zh-cn

"https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=" is the fixed part; what follows is the current timestamp and the postId.
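Putting the fixed prefix, a current millisecond timestamp, and a PostId together gives the detail API URL. A minimal sketch (`detail_url` is a hypothetical helper name, not from the original code):

```python
import time

def detail_url(post_id):
    """Assemble the ByPostId API URL: fixed prefix + 13-digit timestamp + postId."""
    ts = int(time.time() * 1000)  # current time in milliseconds
    return ('https://careers.tencent.com/tencentcareer/api/post/ByPostId'
            '?timestamp={}&postId={}&language=zh-cn'.format(ts, post_id))

print(detail_url('1203886892391600128'))
```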

 

The Python code is as follows:

For the timestamp part, see: https://blog.csdn.net/qq_31603575/article/details/83343791
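The referenced post builds the 13-digit stamp from `datetime` fields; assuming only millisecond precision is needed, a one-line equivalent is simply seconds since the epoch times 1000, truncated:

```python
import time

# 13-digit (millisecond) timestamp in one line,
# equivalent in practice to the get_time_stamp13() helper below
stamp = int(time.time() * 1000)
print(stamp)  # a value like 1575882949947
```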

# -*- coding: utf-8 -*-
import scrapy
import json
import time
import datetime


def get_time_stamp13():
    """Build a 13-digit (millisecond) timestamp, e.g. 1540281250399."""
    datetime_now = datetime.datetime.now()
    # 10 digits: whole seconds since the UNIX epoch
    date_stamp = str(int(time.mktime(datetime_now.timetuple())))
    # 3 digits: milliseconds (the first three digits of the microsecond field)
    data_microsecond = str("%06d" % datetime_now.microsecond)[0:3]
    return int(date_stamp + data_microsecond)


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['tencent.com']

    def start_requests(self):
        # 187 list pages in total (see Figure 1); generating them all here
        # avoids re-yielding the whole page loop from every parse() call,
        # which with dont_filter=True would request the same pages endlessly.
        base = ('https://careers.tencent.com/tencentcareer/api/post/Query'
                '?timestamp=1575855782891&countryId=&cityId=&bgIds=&productId='
                '&categoryId=&parentCategoryId=40001&attrId=&keyword='
                '&pageIndex={}&pageSize=10&language=zh-cn&area=cn')
        for i in range(1, 188):
            yield scrapy.Request(url=base.format(i), callback=self.parse)

    def parse(self, response):
        # Each list page returns JSON; Data.Posts holds the job summaries
        html_json = json.loads(response.text)
        for lj in html_json['Data']['Posts']:
            time_format = str(get_time_stamp13())
            # Assemble the detail-page API URL from the fixed prefix,
            # the current timestamp, and the job's PostId
            desc_url = ('https://careers.tencent.com/tencentcareer/api/post/ByPostId'
                        '?timestamp=' + time_format + '&postId=' + lj['PostId'] +
                        '&language=zh-cn')
            # Request each job's detail page
            yield scrapy.Request(url=desc_url, callback=self.parse_detail,
                                 dont_filter=True)

    def parse_detail(self, response):
        # Convert the JSON response to a dict and pick out the four fields
        detail = json.loads(response.text)
        job_dic = {}
        job_dic['job title'] = detail['Data']['RecruitPostName']
        job_dic['responsibilities'] = detail['Data']['Responsibility']
        job_dic['publish date'] = detail['Data']['LastUpdateTime']
        job_dic['requirements'] = detail['Data']['Requirement']
        print(job_dic)
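The field extraction can be sanity-checked without running the spider by feeding the same logic a hand-made response body. The JSON shape below mirrors the fields the spider reads (`RecruitPostName`, `Responsibility`, `LastUpdateTime`, `Requirement`); the sample values are invented for illustration.

```python
import json

# Hypothetical ByPostId response body; values are made up,
# but the keys match what the spider extracts
sample = json.dumps({
    'Data': {
        'RecruitPostName': 'Backend Engineer',
        'Responsibility': 'Build services',
        'LastUpdateTime': 'December 9, 2019',
        'Requirement': 'Python experience',
    }
})

data = json.loads(sample)['Data']
job_dic = {
    'job title': data['RecruitPostName'],
    'responsibilities': data['Responsibility'],
    'publish date': data['LastUpdateTime'],
    'requirements': data['Requirement'],
}
print(job_dic)
```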








 

Sample output:



Reprinted from blog.csdn.net/OthersOnlyMe/article/details/103461921