Special statement
As is well known, Lagou updates its anti-crawling mechanisms frequently, so the approach shared in this post is not guaranteed to work forever; it is time-sensitive. This study of the site's anti-crawling mechanism is for learning and exchange only, and must never be used for anything illegal.
Website introduction
Lagou (拉勾网) is a professional Internet recruitment platform. The site hosts hiring information from many companies, which makes it a popular target for web crawlers.
Analysis
This post uses "Python" as the example keyword: enter Python in the site's search box and click Search.
The browser jumps to the results page shown above; the data we want to crawl is the list of jobs below it.
Looking at the page source, we cannot find the data we want, so the data must come from some other URL request. The next step is to capture and analyze the traffic.
Through packet capture we find the request URL, and the request method is POST. However, when we send that POST request directly, we get no data: the server simply returns "请求太频繁" ("requests too frequent" -- a misleading message). Adding more fields to the request headers did not help either. After further analysis, it became clear that the backend likely validates the cookie on each request and only returns data when the cookie is correct. So the idea is simple: obtain a valid cookie first, then attach that cookie to the POST request.
The core idea:
To get the job listings we want, we take two steps.
Step one: send a GET request to the page URL we would normally visit in the browser. The purpose of this step is to obtain the cookie, in preparation for the next request.
The first request URL: https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=
Step two: use the cookie obtained in step one to send a POST request. The request URL is:
https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false
Implementation
We can use the session mechanism of the requests library: create a session object that stores the cookie from the first request, then reuse that cookie when sending the second request.
Test code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2020/1/19 18:03
# @Author: Martin
# @File: lagou.py
# @Software: PyCharm
"""
Analysis of Lagou's anti-crawling mechanism:
the job list is fetched in two requests --
the first request loads the original page to obtain the cookie,
the second request reuses the cookie obtained in the first.
"""
import requests

# URL of the first request (obtains the cookie)
first_url = 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput='
# URL of the second request (returns the job data)
second_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# Forged request headers
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Content-Length': '25',  # length of the form body 'first=true&pn=1&kd=Python'
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest'
}
# Create a session object; it keeps cookies across requests
session = requests.session()
# Form data for the POST request: first page, keyword "Python"
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'Python'
}
# First request: visit the search page so the session picks up the cookie
session.get(first_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
})
# Second request: POST with the cookie attached automatically by the session
result = session.post(second_url, headers=headers, data=data, allow_redirects=False)
print(result.json())
The results are as follows:
As we can see, this method successfully retrieves the data.
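Once the JSON comes back, the individual postings can be pulled out of the nested payload. Here is a minimal sketch, assuming the response keeps the content → positionResult → result nesting observed at the time of writing (these field names are an assumption and may change as the site evolves):

```python
def extract_positions(payload):
    """Return the list of job dicts from a positionAjax.json-style response.

    Uses .get() with defaults so a missing or reshaped payload
    yields an empty list instead of raising KeyError.
    """
    return (payload.get('content', {})
                   .get('positionResult', {})
                   .get('result', []))


# Example with a trimmed-down payload in the assumed shape:
sample = {
    'content': {
        'positionResult': {
            'result': [
                {'positionName': 'Python Developer', 'city': 'Beijing', 'salary': '15k-25k'},
            ]
        }
    }
}

for job in extract_positions(sample):
    print(job['positionName'], job['city'], job['salary'])
```

In real use, `extract_positions(result.json())` would replace the `sample` dict above.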
Precautions
When crawling data this way, frequent requests will be detected by the site and your IP will be banned. There are two ways to deal with this.
Method one: set a time interval between requests, so that you do not send a large number of requests in a short period.
Method two: use proxy IPs. When one IP is banned, switch to another proxy IP to bypass the block and keep fetching data.
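Method one can be as simple as sleeping for a randomized interval between page requests. A minimal sketch; the 3-8 second range here is an arbitrary illustrative choice, not a value dictated by the site:

```python
import random
import time


def polite_pause(min_s=3.0, max_s=8.0):
    """Sleep for a random interval between requests to avoid bursts."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


# Crawling several pages would then look like (sketch, reusing the
# session/data/headers variables from the script above):
# for pn in range(1, 6):
#     data['pn'] = str(pn)
#     result = session.post(second_url, headers=headers, data=data)
#     polite_pause()
```

A randomized interval is preferable to a fixed one, since perfectly regular request timing is itself easy for a server to spot.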
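Method two can be sketched as a simple proxy-rotation loop. The proxy addresses below are placeholders for illustration only, not working proxies; a real crawl would need its own proxy pool:

```python
import requests

# Placeholder proxy addresses (illustration only, not real proxies).
PROXIES = [
    {'https': 'https://203.0.113.10:8080'},
    {'https': 'https://203.0.113.11:8080'},
]


def fetch_with_rotation(session, url, proxies, **kwargs):
    """Try each proxy in turn until one returns HTTP 200."""
    for proxy in proxies:
        try:
            resp = session.get(url, proxies=proxy, timeout=10, **kwargs)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            continue  # this proxy is dead or blocked; try the next one
    raise RuntimeError('all proxies failed')
```

The same idea applies to the POST request: pass `proxies=proxy` to `session.post()` and rotate whenever a proxy is banned.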
Postscript
I have only just begun to study web crawlers. Anyone who also wants to learn crawling is welcome to follow along, so we can share, learn, and make progress together.
This article was written on January 21, 2020, at home.