Crawling job postings from Lagou.com

Summary of common anti-crawler strategies

Check the request header

User-Agent identification

Solution: construct a list of User-Agent strings and randomly pick one to put into the headers of each request, as in the snippet below.

import random

agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
agents = random.sample(agent, 1)[0]
# Note: random.sample() returns a list containing a single element, so we take [0]

Referer identification

In the browser's developer tools, open the Network tab, find the target request you need to send, and look at its request headers: among them is a Referer field, which indicates where the request came from.

[Referer: identifies the page from which the request was sent. The server can use this information for things like traffic-source statistics and hotlink protection. (Cui Qingcai's personal site explains requests, request headers, and the request process in detail, including the meaning of the various fields in the message headers: https://cuiqingcai.com/5465.html)]

Where should the Referer point? When crawling Lagou.com, if the Referer of the request is not a search-result page, the crawl almost always fails.
Note: the posting crawled this time is for "python数据分析" (python data analysis). Since "数据分析" consists of Chinese characters, you need urlencode from the urllib.parse module to transform them when constructing the Referer: a request URL cannot contain non-ASCII characters, which are considered unsafe, so they must be percent-encoded.
An example follows.

from urllib.parse import urlencode
from urllib.parse import quote

url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
para = {
    'xl': '本科', 'px': 'default', 'yx': '2k-5k',
    'gx': '实习', 'city': '北京', 'district': '朝阳区', 'isSchoolJob': '1'}
url_search = url_search + urlencode(para)

The usage of, and difference between, quote and urlencode in Python:
https://blog.csdn.net/zjz155/article/details/88060427

urllib.parse.urlencode()
Parameter: a dict
Return value: a string
Function: turns each key: value pair into the form key=encoded_value, joining pairs with &

urllib.parse.quote()
Parameter: a str (for example, Chinese characters)
Return value: the percent-encoded string
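
To see the difference concretely, a minimal sketch (outputs shown as comments):

from urllib.parse import urlencode, quote

# quote() percent-encodes a single string
print(quote('python数据分析'))
# -> python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90

# urlencode() encodes a whole dict into a query string
print(urlencode({'kd': 'python数据分析', 'pn': 1}))
# -> kd=python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&pn=1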

Cookies identification

Take Lagou.com as an example: the site checks the cookies of the requests it receives. Observing the XHR request (also next to the Referer in developer tools, labeled XMLHttpRequest), you can see that its cookies include those of the search-result page, so when imitating browser behavior you first need to obtain all the cookies of the search-result page and add them to the request.
There are usually two ways to do this:
① Copy all the Cookies from the target request's headers straight into your code. This is convenient, but hard to maintain if the cookies later change (a sketch of this option follows the Session example below).
② Use the Session method from requests:
import requests

s = requests.Session()
s.get(url_search, headers=headers, timeout=5)
# Setting a timeout is necessary; otherwise, if the server never responds,
# we would be stuck here indefinitely. You can also write timeout=(3, 7),
# meaning a 3-second connect timeout and a 7-second read timeout.
cookie = s.cookies
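
For completeness, a minimal sketch of option ① (the Cookie value is a made-up placeholder, and url_search / url_final / payload refer to the variables built elsewhere in this post; in practice you would paste the string shown in your own browser's developer tools):

import requests

headers_with_cookie = {
    'User-Agent': 'Mozilla/5.0 ...',  # any UA string from the list above
    'Referer': url_search,            # the search-result page, as before
    'Cookie': 'JSESSIONID=XXXX; user_trace_token=XXXX',  # hypothetical placeholder
}
response = requests.post(url_final, data=payload, headers=headers_with_cookie, timeout=5)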

IP identification

The principle is that the server records the IP address and device code of any device that sends requests too frequently (though I still don't know what such a device code looks like), and blocks those IPs.
The solution is to set up an IP proxy pool yourself; I haven't learned how to do that yet, but see the sketch below for the basic building block.
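
Even without a full proxy pool, requests can route a single request through a proxy, which is the building block a pool would rotate over. A minimal sketch, assuming a working proxy at 127.0.0.1:8888 (a made-up address):

import requests

# Hypothetical proxy address; replace with a proxy you actually have access to
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}
response = requests.get('https://www.lagou.com', proxies=proxies, timeout=5)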

Overall code and process

# -*- coding: utf-8 -*-
"""
Created on Thu Oct 15 15:17:49 2020

method: POST
type: XHR
@author: djx
"""
from urllib.parse import urlencode
from urllib.parse import quote
import random

import requests
import pymongo


def getpage(url_final: str, page: int):
    # Build the search-result page URL; it serves as the Referer and is
    # also fetched once to obtain its cookies.
    url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
    para = {
        'xl': '本科', 'px': 'default', 'yx': '2k-5k', 'gx': '实习',
        'city': '北京', 'district': '朝阳区', 'isSchoolJob': '1'}
    url_search = url_search + urlencode(para)
    agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
             'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
             'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
             'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
             'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
             'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
             'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
             'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
    agents = random.sample(agent, 1)[0]
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Host': 'www.lagou.com',
        'User-Agent': agents,
        'Referer': url_search,  # must be the search-result page
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'}
    # This GET only serves to collect the search-result page's cookies
    s = requests.Session()
    s.get(url_search, headers=headers, timeout=5)

    results = []
    j = 1
    while j <= page:
        print(j)
        # 'first' is true only on the first page; 'pn' is the page number
        payload = {'first': 'true' if j == 1 else 'false',
                   'pn': j, 'kd': 'python数据分析'}
        try:
            response = requests.post(url_final, data=payload, headers=headers,
                                     cookies=s.cookies, timeout=5)
            if response.status_code == 200:
                results.append(response.json())
            j += 1
        except Exception as e:
            print("error" + str(e))
            return None
    return results


def main():
    url_final = "https://www.lagou.com/jobs/positionAjax.json?"
    para = {
        'xl': '本科', 'px': 'default', 'yx': '2k-5k', 'gx': '实习',
        'city': '北京', 'district': '朝阳区',
        'needAddtionalResult': 'false', 'isSchoolJob': '1'}
    url_final = url_final + urlencode(para)  # the final URL whose response is JSON
    contents = getpage(url_final, 5)
    if not contents:
        return
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['JobInfo']
    collection = db.SimpleInfo

    for content in contents:
        result = content["content"]["positionResult"]["result"]
        for i in result:
            collection.insert_one(i)


main()
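
To sanity-check what was stored, a short example of reading the records back out of MongoDB (same database and collection names as above):

import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
collection = client['JobInfo'].SimpleInfo

print(collection.count_documents({}))  # how many postings were saved
print(collection.find_one())           # peek at one stored document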


Originally published at https://blog.csdn.net/weixin_44123346/article/details/109263982