[Python Web Scraping by Example] Part 1: Fetching Job Listings from Lagou

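Graduation season is coming, so I wanted to scrape Lagou (拉勾网) for internship postings. After writing a few lines of code and running them, every request came back with:
{"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"223.155.85.177","state":2402}
The msg field means "You are making requests too frequently, please try again later." Opening the same page in a browser at that moment worked normally and showed no such message, so the site must have an anti-crawler mechanism. After some experimenting, the cause turned out to be cookies. The detailed fix follows.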

1. The problem

At first I tried to fetch the listings with the urllib library, but the response was always the "too frequent" message. The code:

from urllib import request,parse

KeyWord="python"
url="https://www.lagou.com/jobs/list_"+KeyWord+"?&cl=false&fromSearch=true&labelWords=&suginput="
url_GetJob="https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36",
    "Referer":url
}
data={
    "first":"true",
    "pn":"1",
    "kd":KeyWord
}

req=request.Request(url_GetJob,headers=headers,data=parse.urlencode(data).encode('utf-8'))
response=request.urlopen(req)
print(response.read().decode('utf-8'))

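The response:
[Image: the error message]
Visiting the page directly in a browser at the same time:
[Image: the page loading normally]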
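
Because the refusal arrives as ordinary JSON rather than as an HTTP error, it helps to test for it explicitly before treating a body as real data. The following is a minimal sketch, not part of the original code: `is_rate_limited` is a hypothetical helper name, and it assumes that `"status": false` is a sufficient signal (the only refusal seen in this article also carries `"state": 2402`).

```python
import json

# The exact body quoted in this article.
BLOCKED_BODY = (
    '{"status":false,"msg":"您操作太频繁,请稍后再访问",'
    '"clientIp":"223.155.85.177","state":2402}'
)


def is_rate_limited(body: str) -> bool:
    """Return True when the body looks like the anti-crawler refusal.

    Assumption: a JSON object with "status" set to false means the
    request was refused. Non-JSON bodies are reported as not refused.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") is False


if is_rate_limited(BLOCKED_BODY):
    print("Refused by the server:", json.loads(BLOCKED_BODY)["msg"])
```

Calling this on `response.read().decode('utf-8')` makes the failure visible in code instead of only in the printed output.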

2. Solution

Method 1: use http.cookiejar

from urllib import request, parse
import http.cookiejar

KeyWord = "python"
url = "https://www.lagou.com/jobs/list_" + KeyWord + "?&cl=false&fromSearch=true&labelWords=&suginput="
url_GetJob = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36",
    "Referer": url
}
data = {
    "first": "true",
    "pn": "1",
    "kd": KeyWord
}

cookie_jar = http.cookiejar.CookieJar()
handler = request.HTTPCookieProcessor(cookie_jar)
opener = request.build_opener(handler)

req = request.Request(url, headers=headers)
opener.open(req)  # Visit the listing page first so the cookie jar picks up its cookies
req2 = request.Request(url_GetJob, headers=headers, data=parse.urlencode(data).encode('utf-8'))
res = opener.open(req2)
print(res.read().decode('utf-8'))
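
The two-step pattern above (a GET to collect cookies, then a POST that reuses them) can be packaged so that it also works for non-ASCII keywords such as 实习 (internship), which urllib cannot send in a URL unless they are percent-encoded. This is a sketch under assumptions: `list_url`, `build_requests`, and `make_opener` are hypothetical helper names, and only the URLs, headers, and form fields already used in this article are taken as given.

```python
import http.cookiejar
from urllib import parse, request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36"
)
AJAX_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"


def list_url(keyword: str) -> str:
    """Listing-page URL with the keyword percent-encoded."""
    return (
        "https://www.lagou.com/jobs/list_"
        + parse.quote(keyword)
        + "?&cl=false&fromSearch=true&labelWords=&suginput="
    )


def build_requests(keyword: str, page: int = 1):
    """Return (warm-up GET, search POST) for one keyword and page."""
    referer = list_url(keyword)
    headers = {"User-Agent": USER_AGENT, "Referer": referer}
    form = {"first": "true", "pn": str(page), "kd": keyword}
    warmup = request.Request(referer, headers=headers)
    # Supplying `data` is what turns a urllib Request into a POST.
    search = request.Request(
        AJAX_URL, headers=headers, data=parse.urlencode(form).encode("utf-8")
    )
    return warmup, search


def make_opener():
    """Return an opener plus the cookie jar it writes to."""
    jar = http.cookiejar.CookieJar()
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    return opener, jar


# Usage (performs real network requests, so it is left commented out):
# opener, jar = make_opener()
# warmup, search = build_requests("实习")
# opener.open(warmup)                      # fills the jar
# print([cookie.name for cookie in jar])   # inspect what the site set
# print(opener.open(search).read().decode("utf-8"))
```

Returning the jar alongside the opener makes it easy to check which cookies the first request actually produced when the second request is still refused.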

Method 2: use requests.session

import requests

KeyWord = "python"
url = "https://www.lagou.com/jobs/list_" + KeyWord + "?&cl=false&fromSearch=true&labelWords=&suginput="
url_GetJob = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36",
    "Referer": url
}

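# Create a session; the first GET stores the site's cookies on it
session = requests.session()
res1 = session.get(url, headers=headers, verify=False)
# Reuse the same session (and its cookies) to submit the form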
data = {
    "first": "true",
    "pn": "1",
    "kd": KeyWord
}
res = session.post(url_GetJob, headers=headers, data=data, verify=False)
print(res.text)
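
The form above asks only for page 1 ("pn": "1"). To walk several pages with the same session, one possible sketch follows. It is not from the original code and rests on stated assumptions: `form_for_page` and `fetch_pages` are hypothetical names, "first" is assumed to be "true" only for the first page (the article only ever sends "true"), the 2-second pause is an arbitrary guess, and `post` is any callable that submits the form and returns the response text.

```python
import json
import time
from typing import Callable, Dict, Iterator


def form_for_page(keyword: str, page: int) -> Dict[str, str]:
    """Form fields for one results page (same keys as `data` above)."""
    return {
        # Assumption: only the first page is flagged as "first".
        "first": "true" if page == 1 else "false",
        "pn": str(page),
        "kd": keyword,
    }


def fetch_pages(
    post: Callable[[Dict[str, str]], str],
    keyword: str,
    pages: int,
    delay: float = 2.0,
) -> Iterator[dict]:
    """Yield the parsed JSON of each page, stopping at the first refusal."""
    for page in range(1, pages + 1):
        if page > 1:
            time.sleep(delay)  # pause between requests to stay polite
        payload = json.loads(post(form_for_page(keyword, page)))
        if payload.get("status") is False:
            # Same refusal as in section 1: stop instead of hammering the site.
            break
        yield payload
```

With the session from Method 2, `post` could be `lambda form: session.post(url_GetJob, headers=headers, data=form, verify=False).text`; keeping the HTTP call behind a callable means the same loop also works with the urllib opener from Method 1.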


3. Result

[Image: the correct response]

WeChat official account: 小术快跑

Reposted from blog.csdn.net/qq_40528553/article/details/103931286