Scraping job listings from Lagou

Copyright notice: feel free to repost without asking, just attach a link to this post, thanks (づ ̄3 ̄)づ╭❤~
https://blog.csdn.net/xiaoduan_/article/details/80835231


I noticed that Lagou serves its job listings through a JSON endpoint. With such a convenient data interface, how could I not scrape it?
I'm fond of Spark, so let's scrape the Spark job listings and store them in MongoDB.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author  : Anthony_Duan
# @Time    : 25/06/2018 15:53
# @File    : lagou.py
# @Software: PyCharm

import requests
from fake_useragent import UserAgent
import time
from pymongo import MongoClient

client = MongoClient()
db = client.lagou  # connect to the lagou database; it is created on first use if missing
my_set = db.spark_job  # the spark_job collection under lagou, also created automatically

headers = {
    "Cookie": "JSESSIONID=ABAAABAAAIAACBICB3D046BA1BEA314A00EA18BD6391426; SEARCH_ID=f8e30fdbd29e42f5bd02662ab2cef21f; user_trace_token=20180625154048-198ac502-c5d4-4114-907d-7dcca0c7dd47; _ga=GA1.2.810701623.1529912450; _gat=1; LGSID=20180625154049-149d80f2-784b-11e8-b069-525400f775ce; PRE_UTM=; PRE_HOST=static.dcxueyuan.com; PRE_SITE=https%3A%2F%2Fstatic.dcxueyuan.com%2Fcontent%2Fdisk%2Ftrain%2Fother%2F70b2c405-138b-4862-ad49-138656aef0d6.html; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGUID=20180625154049-149d839e-784b-11e8-b069-525400f775ce; X_HTTP_TOKEN=188845654580e592f42f58c18962c06c; LGRID=20180625154312-699841c7-784b-11e8-b06b-525400f775ce",
    "Referer": "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput="
}


def get_job_info(page, kd):
    for i in range(1, page + 1):  # pages 1..page inclusive
        url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0"

        payload = {
            "first": "true",
            "pn": i,
            "kd": kd
        }
        ua = UserAgent()
        headers['User-Agent'] = ua.random
        # positionAjax.json expects form-encoded data (like an HTML form), so the
        # dict goes in the data parameter of a POST request. The params keyword
        # would instead place key/value pairs in the URL query string after a
        # question mark, e.g. http://bin.org/get?key=val.
        response = requests.post(url, data=payload, headers=headers, timeout=20)

        if response.status_code == 200:
            job_json = response.json()['content']['positionResult']['result']
            my_set.insert_many(job_json)  # insert the list of job documents into the collection
        else:
            print("something wrong\t")
            print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())+"\n")

        print('Crawling page ' + str(i) + '\t')
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        time.sleep(3)


if __name__ == '__main__':
    get_job_info(10, "spark")
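The scraper assumes the response JSON nests the listings under `content` → `positionResult` → `result`. A minimal sketch of pulling fields out of that structure, using a hypothetical sample payload (the field names inside each record are illustrative, not guaranteed to match Lagou's actual schema):

```python
import json

# Hypothetical sample mimicking the nesting the scraper relies on.
sample = json.loads("""
{
  "content": {
    "positionResult": {
      "result": [
        {"positionName": "Spark Engineer", "city": "Beijing", "salary": "25k-40k"},
        {"positionName": "Big Data Developer", "city": "Shanghai", "salary": "20k-35k"}
      ]
    }
  }
}
""")

# Same path the scraper walks before inserting into MongoDB.
jobs = sample["content"]["positionResult"]["result"]
for job in jobs:
    print(job["positionName"], job["city"], job["salary"])
```

Each element of `result` is a plain dict, which is why the list can be handed to `insert_many` as-is.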

The scraped data looks roughly like this:
