Scraping Lagou Job Listings -- A First Look at Anti-Scraping

A Summary of Common Anti-Scraping Strategies

Request Header Inspection

User-Agent Detection

Solution: build a list of User-Agent strings and, on each request, randomly pick one to inject into the headers:

import random

agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
         'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
         'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
         'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
         'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
         'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
         'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
         'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
agents = random.sample(agent, 1)[0]
# Note: random.sample() returns a list containing a single element, so we take [0].
# random.choice(agent) would do the same thing more directly.

Referer Detection

In the browser's developer tools, open the Network tab, find the target request you need to send, and look at its request headers: there is a Referer field, which indicates where the request came from.

[Referer: identifies the page from which the request was sent. The server can read this information and act on it accordingly, e.g. for traffic-source statistics or hotlink protection. (Cui Qingcai's personal site, https://cuiqingcai.com/5465.html, explains request methods, request headers, and the request process in detail, including the meaning of each field in the message headers.)]

So what should the Referer be? When scraping Lagou, if the Referer of the request is not the search-results page, the request almost always fails.
Note: the position scraped here is "python数据分析" (Python data analysis). Since "数据分析" consists of Chinese characters, it must be converted with urlencode from the urllib.parse module when constructing the Referer: a request URL cannot contain non-ASCII characters, which are considered unsafe, so they have to be percent-encoded.
An example follows.

from urllib.parse import urlencode
from urllib.parse import quote

url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
para = {
    'xl': '本科', 'px': 'default', 'yx': '2k-5k',
    'gx': '实习', 'city': '北京', 'district': '朝阳区', 'isSchoolJob': '1'
}
url_search = url_search + urlencode(para)

The usage of, and difference between, quote and urlencode in Python:
https://blog.csdn.net/zjz155/article/details/88060427

urllib.parse.urlencode()
Argument: a dict
Return value: a string
Function: turns each key: value pair into key=encoded-value, joined with &

urllib.parse.quote()
Argument: a str, e.g. one containing Chinese characters
Return value: the percent-encoded value
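
A quick demonstration of the difference (these two calls are illustrative, not part of the scraper itself; the expected output is shown in the comments):

from urllib.parse import urlencode, quote

print(quote('python数据分析'))
# python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90

print(urlencode({'kd': 'python数据分析', 'pn': 1}))
# kd=python%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&pn=1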

Cookie Detection

Taking Lagou as an example, the site checks the cookies of incoming requests. Observing the XHR request (in the same Network panel, labeled XHR for XMLHttpRequest), you can see that its cookies include those of the search-results page. So, to mimic browser behavior, you first need to fetch all cookies from the search-results page and add them to the request.
There are generally two ways to do this:
① Copy all the Cookies from the target request's headers straight into the code (a short sketch follows the Session example below). This is convenient, but hard to maintain if the cookies change later.
② Use a requests Session:
import requests

s = requests.Session()
s.get(url_search, headers=headers, timeout=5)
# Setting timeout is necessary; otherwise we would hang indefinitely if the
# server never responds. You can also write timeout=(3, 7), meaning 3 s to
# connect and 7 s for the response.
cookie = s.cookies
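
For completeness, option ① amounts to hardcoding the Cookie header; the value below is a placeholder you would replace with the string copied from the browser's dev tools:

headers = {
    'User-Agent': agents,                                # one UA from the list above
    'Cookie': 'JSESSIONID=XXXX; user_trace_token=XXXX',  # placeholder -- paste your own
}
# then pass these headers to requests.post() as in the full code below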

IP Detection

The principle: for devices that send requests too frequently, the site records the IP address and device fingerprint (though where to see the device fingerprint, or what it looks like, I still don't know) and blocks those IPs.
The solution is to build your own pool of proxy IPs; I can't do that yet, since I haven't learned it. A rough sketch of what it would look like with requests follows.
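
A minimal sketch of rotating proxies with requests, assuming you already have a list of working proxies (the addresses below are placeholders):

import random
import requests

# Placeholder proxy addresses -- replace with entries from your own pool.
proxy_pool = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']

proxies = {'http': random.choice(proxy_pool),
           'https': random.choice(proxy_pool)}
response = requests.get('https://www.lagou.com', proxies=proxies, timeout=5)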

Full Code and Workflow

# -*- coding: utf-8 -*-
"""
Created on Thu Oct 15 15:17:49 2020

method: POST
type: XHR
@author: djx
"""
import random
from urllib.parse import urlencode
from urllib.parse import quote
import requests
import pymongo


def getpage(url_final: str, page: int):
    # Build the Referer: the search-results page URL with the same filters.
    url_search = "https://www.lagou.com/jobs/list_" + quote('python数据分析') + "?"
    para = {
        'xl': '本科', 'px': 'default', 'yx': '2k-5k', 'gx': '实习',
        'city': '北京', 'district': '朝阳区', 'isSchoolJob': '1'
    }
    url_search = url_search + urlencode(para)
    agent = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
             'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
             'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
             'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
             'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
             'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
             'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
             'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
             'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
             'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)']
    agents = random.sample(agent, 1)[0]  # pick a random User-Agent
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Host': 'www.lagou.com',
        'User-Agent': agents,
        'Referer': url_search,  # must be the search-results page
        'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
    }
    # Fetch the search-results page first so the session holds its cookies;
    # they are reused for the XHR POST requests below.
    s = requests.Session()
    s.get(url_search, headers=headers, timeout=5)

    results = []
    j = 1
    while j <= page:
        print(j)
        # 'first' presumably marks the first page; 'pn' is the page number.
        payload = {'first': 'true' if j == 1 else 'false',
                   'pn': j, 'kd': 'python数据分析'}
        try:
            response = requests.post(url_final, data=payload, headers=headers,
                                     cookies=s.cookies, timeout=5)
            if response.status_code == 200:
                results.extend(response.json()["content"]["positionResult"]["result"])
            j += 1
        except Exception as e:
            print("error" + str(e))
            return None
    return results
        

def main():
    url_final = "https://www.lagou.com/jobs/positionAjax.json?"
    para = {
        'xl': '本科', 'px': 'default', 'yx': '2k-5k', 'gx': '实习',
        'city': '北京', 'district': '朝阳区',
        'needAddtionalResult': 'false', 'isSchoolJob': '1'
    }
    url_final = url_final + urlencode(para)  # final URL of the JSON (XHR) request
    result = getpage(url_final, 5)  # scrape 5 pages of positions
    if not result:
        return
    # Store each position record in MongoDB.
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['JobInfo']
    collection = db.SimpleInfo
    for i in result:
        collection.insert_one(i)


if __name__ == '__main__':
    main()

Reposted from blog.csdn.net/weixin_44123346/article/details/109263982