Copyright notice: feel free to repost without asking, but please include a link to this post, thanks (づ ̄3 ̄)づ╭❤~
https://blog.csdn.net/xiaoduan_/article/details/80835231
Scraping Lagou job listings
Lagou turns out to serve its job listings from a JSON endpoint, and with such a convenient data interface, how could I not scrape it?
I'm rather fond of Spark, so let's scrape the Spark job postings and put them into MongoDB.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Author : Anthony_Duan
# @Time : 25/06/2018 15:53
# @File : lagou.py
# @Software: PyCharm
import requests
from fake_useragent import UserAgent
import time
from pymongo import MongoClient
client = MongoClient()
db = client.lagou  # connect to the database; it is created automatically if it does not exist
my_set = db.spark_job  # the job collection under the lagou database; also created automatically
headers = {
    "Cookie": "JSESSIONID=ABAAABAAAIAACBICB3D046BA1BEA314A00EA18BD6391426; SEARCH_ID=f8e30fdbd29e42f5bd02662ab2cef21f; user_trace_token=20180625154048-198ac502-c5d4-4114-907d-7dcca0c7dd47; _ga=GA1.2.810701623.1529912450; _gat=1; LGSID=20180625154049-149d80f2-784b-11e8-b069-525400f775ce; PRE_UTM=; PRE_HOST=static.dcxueyuan.com; PRE_SITE=https%3A%2F%2Fstatic.dcxueyuan.com%2Fcontent%2Fdisk%2Ftrain%2Fother%2F70b2c405-138b-4862-ad49-138656aef0d6.html; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; LGUID=20180625154049-149d839e-784b-11e8-b069-525400f775ce; X_HTTP_TOKEN=188845654580e592f42f58c18962c06c; LGRID=20180625154312-699841c7-784b-11e8-b06b-525400f775ce",
    "Referer": "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput="
}
def get_job_info(page, kd):
    for i in range(1, page):
        url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0"
        payload = {
            "first": "true",
            "pn": i,   # page number
            "kd": kd   # search keyword
        }
        ua = UserAgent()
        headers['User-Agent'] = ua.random
        # This endpoint expects the search parameters as form-encoded data in a
        # POST body, so pass the dict via the data argument of requests.post.
        # (Query-string parameters on a GET request would go in params instead,
        # e.g. http://bin.org/get?key=val.)
        response = requests.post(url, data=payload, headers=headers, timeout=20)
        if response.status_code == 200:
            job_json = response.json()['content']['positionResult']['result']
            my_set.insert_many(job_json)  # insert the list of postings into the collection
        else:
            print("something wrong\t")
            print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + "\n")
        print('Crawling page ' + str(i) + '\t')
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        time.sleep(3)
if __name__ == '__main__':
    get_job_info(10, "spark")
The scraped data looks roughly like this:
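As a rough sketch, each document inserted into spark_job is one element of the content.positionResult.result list the script unwraps. The field names below (positionName, city, salary) are illustrative assumptions for demonstration, not Lagou's exact schema:

```python
# A minimal sample mimicking the nested structure the script navigates:
# response.json()['content']['positionResult']['result']
# Field names are illustrative assumptions, not Lagou's exact schema.
sample = {
    "content": {
        "positionResult": {
            "result": [
                {"positionName": "Spark开发工程师", "city": "北京", "salary": "20k-40k"},
                {"positionName": "大数据工程师", "city": "上海", "salary": "15k-30k"},
            ]
        }
    }
}

# Unwrap the list of postings exactly as get_job_info does,
# then print a few fields from each one.
jobs = sample["content"]["positionResult"]["result"]
for job in jobs:
    print(job["positionName"], job["city"], job["salary"])
```

Each element of that list is a plain dict, which is why it can be handed straight to insert_many.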