Lagou.com Anti-Crawler Mechanism (January 21, 2020)

Special statement

As is well known, Lagou.com updates its anti-crawler mechanism quite frequently, so the approach shared in this post is not necessarily permanent; it is time-sensitive. This study of the site's anti-crawler mechanism is for learning and exchange only, and must never be used for anything illegal.

Website introduction

Lagou.com is a professional Internet recruitment platform. The site carries a large amount of company hiring information, which makes it a popular target for web crawlers.

Analysis

This post uses a search for the keyword Python as an example. First, enter Python in the search box on the site and click Search.
The browser then jumps to the search results page; the data we want to crawl is the list of job postings shown there.
Looking at the page source, we cannot find the data we want, so we can conclude that the data is fetched through a separate URL request. The next step is packet-capture analysis.
Through packet capture, we find the request URL; the request method is POST. But when we send that POST request directly, we cannot get the data: the server simply returns "requests too frequent" (this message is probably a decoy). I tried adding more fields to the request headers, still to no avail. After further analysis I realized that the site's backend probably checks the cookie attached to the request and only returns the data when the cookie is valid. So the approach is clear: obtain the cookie first, then attach it to the data request.
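To make the failure concrete, here is a minimal sketch of the naive direct POST described above (a hedged reconstruction; the exact response body may differ):

import requests

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
data = {'first': 'true', 'pn': '1', 'kd': 'Python'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
}
# Without a valid cookie the server does not return job data; instead it
# answers with a "requests too frequent" style message.
response = requests.post(url, headers=headers, data=data)
print(response.json())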

The core idea:

To get the job postings we want, we take two steps.
Step one: send a GET request to the URL of the page we visit in the browser. The purpose of this step is to obtain a cookie, in preparation for the next request.
URL of the first request: https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=
Step two: use the cookie obtained in step one to send a POST request. The request URL is:
https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false

Implementation

Specifically, we use the requests library's Session object, which stores the cookie from the first request, and then make the second request through the same session so that the cookie is sent automatically.

Test code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time:    2020/1/19 18:03
# @Author:  Martin
# @File:    lagou.py
# @Software:PyCharm
"""
拉勾网反爬机制分析:
通过两次请求来获取职位列表,
第一次请求原始页面获取cookie
第二次请求时利用第一次获取到的cookie

"""
import requests
# URL of the first request
first_url = 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput='
# URL of the second request
second_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# Spoofed request headers
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    # Content-Length is omitted; requests computes it automatically
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest'
}
# Create a session object to persist the cookie
session = requests.Session()
# POST form data
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'Python'
}
# First request: load the list page so the session stores the cookie
session.get(first_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
})
# Second request: POST using the cookie held by the session
result = session.post(second_url, headers=headers, data=data, allow_redirects=False)
print(result.json())


Running the script prints the JSON job data returned by the site, confirming that we can successfully obtain the data this way.
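As a follow-up, here is a small sketch of how the returned JSON could be unpacked. The field names used here (content → positionResult → result, positionName, companyFullName, salary) are assumptions about the response structure at the time of writing and may have changed:

# Hypothetical parsing of the JSON returned by the test code above;
# the field names are assumptions and may differ if Lagou changes its API.
payload = result.json()
positions = payload.get('content', {}).get('positionResult', {}).get('result', [])
for position in positions:
    print(position.get('positionName'), position.get('companyFullName'), position.get('salary'))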

Precautions

When crawling data this way, frequent requests will be detected by the site, and your IP will be banned. There are two ways to deal with this.
Method one: set a time interval between requests, and do not send a large number of requests in a short period.
Method two: use proxy IPs; when one IP is banned, switch to another proxy IP to bypass the ban and keep fetching data. A sketch of both methods follows.
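Here is a minimal sketch of both precautions. The delay range and the proxy address are illustrative placeholders, not recommended values:

import time
import random
import requests

session = requests.Session()

# Method one: pause for a randomized interval between requests so that
# a burst of traffic does not trigger the site's rate limiting.
for page in range(1, 4):
    time.sleep(random.uniform(2, 5))  # wait 2-5 seconds before each request
    # ... fetch page `page` here, as in the test code above ...

# Method two: route requests through a proxy IP; when one IP is banned,
# switch to another. The address below is a placeholder, not a real proxy.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:3128',
}
response = session.get('https://www.lagou.com/', proxies=proxies, timeout=10)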

Postscript

I have only just started learning web crawling. If you are also learning crawlers, feel free to follow me so we can share, learn, and make progress together.

This article was written at home on January 21, 2020.



Origin: blog.csdn.net/Deep___Learning/article/details/104064641