Crawling Lagou job-listing data with Python

0 Requirements

Crawl job listings related to the keyword "embedded software" (嵌入式软件) from Lagou (https://www.lagou.com/).

1 Analysis

Searching the page source for the information we want (company names and so on) turns up no matches, which indicates that the data is loaded dynamically.

Open the browser's inspection tools, refresh the page, and look through the captured packets in the Network panel for the one that returns the data. (Filtering by the XHR tag helps narrow things down.)

At this point the job is basically half done; the rest depends on how aggressive the site's anti-crawling mechanisms are.

Switching to the Headers tab, we can see this packet's header information:

Request URL: https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false

Request method: POST (unlike GET, the request data is carried in FormData and may be encrypted)

(The site is fairly considerate here: the data is not encrypted)

Request headers: some may need to be added; the cookie in particular can cause trouble

 

 

With this analysis done, we can start writing code and tackle problems as they come up.

 

2 Crawling

To start, naturally assume that no anti-crawling mechanism is in place: write out the URL, headers, and FormData, then call requests.post() directly.
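As a minimal sketch of that first naive attempt (the helper names `build_request` and `fetch_naive` are mine, and the header values here are illustrative, not the full set used later):

```python
import requests

# the Ajax endpoint found in the Network panel (city parameter is URL-encoded)
AJAX_URL = ("https://www.lagou.com/jobs/positionAjax.json"
            "?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false")

def build_request(keyword="嵌入式软件", page=1):
    """Assemble the headers and FormData for the naive POST attempt."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": "https://www.lagou.com/jobs/list_%E5%B5%8C%E5%85%A5%E5%BC%8F%E8%BD%AF%E4%BB%B6",
        "X-Requested-With": "XMLHttpRequest",
    }
    form_data = {"first": "false", "pn": str(page), "kd": keyword}
    return headers, form_data

def fetch_naive(page=1):
    """The direct call -- this is the version that triggers the anti-crawl reply."""
    headers, form_data = build_request(page=page)
    return requests.post(AJAX_URL, data=form_data, headers=headers)
```

Calling `fetch_naive()` is what produces the "too frequent" response described next.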

I won't describe the whole process, only the problem we run into: the request above returns a response saying we are "visiting too frequently."

But in fact we are not visiting frequently at all. This is the site's anti-crawling mechanism at work; the term of art for this phenomenon is being "poisoned."

 

Since the FormData is not encrypted, we can be fairly confident the problem lies with the cookie.

A fixed cookie will not work either. Given how HTTP requests and cookies interact, we need to find the request whose response performs the Set-Cookie.
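To sketch the idea (the function names are mine; `get_fresh_cookies` would hit the real list page, while `cookie_header` merely shows locally what ends up being sent back):

```python
import requests

# the ordinary search-result page; its response carries the Set-Cookie headers
LIST_PAGE = ("https://www.lagou.com/jobs/list_%E5%B5%8C%E5%85%A5%E5%BC%8F"
             "%E8%BD%AF%E4%BB%B6/p-city_215?px=default")

def get_fresh_cookies():
    """GET the normal list page first; requests collects the server's
    Set-Cookie values into the response's cookie jar for us."""
    r = requests.get(LIST_PAGE, headers={"User-Agent": "Mozilla/5.0"})
    return r.cookies

def cookie_header(jar):
    """Render a cookie jar (or plain dict) the way it would appear in a
    Cookie request header -- purely to illustrate what is sent back."""
    return "; ".join(f"{name}={value}" for name, value in dict(jar).items())
```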

(Source: https://www.cnblogs.com/fanying/p/11650034.html)

 

 

Go back to the Lagou page and delete its existing cookies (I use Chrome; they can be cleared from the developer tools).

Then refresh the page and analyze the captured packets, looking for the one that sets the cookie.

The packet's name may vary from what is shown here, but it will carry a Set-Cookie header in its response.

 

 

Once we can obtain the cookie, the rest is simple: write a function that performs one request just to get the cookie, then pass the returned cookie into the POST request's parameters to get the data we want.
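An alternative worth noting: instead of manually ferrying the cookie between calls, a `requests.Session` keeps any Set-Cookie values it receives and sends them back automatically. A sketch under that assumption (the function name is mine):

```python
import requests

def crawl_with_session(list_url, ajax_url, headers, form_data):
    """One Session does both steps: the GET stores the Set-Cookie values,
    and the following POST sends them back without extra plumbing."""
    s = requests.Session()
    s.headers.update(headers)
    s.get(list_url)                            # step 1: receive Set-Cookie
    return s.post(ajax_url, data=form_data)    # step 2: cookies reused

# the persistence itself needs no network call to demonstrate:
demo = requests.Session()
demo.cookies.set("user_trace_token", "demo-value")
```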

 

 

 

3 Data processing

With the data in hand, we would normally filter and extract it. I have no special needs here, so for learning purposes I simply write a few key fields straight into a CSV file.

If visualization is needed, tools such as pandas and pyecharts can be used.
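For instance, a quick look at the saved CSV with pandas might go like this (the sample rows below are made up; only the column layout matches what we save):

```python
import io
import pandas as pd

# made-up rows in the same shape as the jobs.csv saved by the script
sample = io.StringIO(
    "job title,company name,company size,salary\n"
    "嵌入式软件工程师,公司A,50-150人,15k-25k\n"
    "嵌入式驱动工程师,公司B,150-500人,20k-35k\n"
)

df = pd.read_csv(sample)
size_counts = df["company size"].value_counts()  # postings per company-size bracket
print(size_counts)
```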

 

4 Code

 

# -*- encoding: utf-8 -*-
'''
@File    :   lagou.py
@Time    :   2020/03/30 22:12:38
@Author  :   bAdblocks 
@Version :   1.0
@Contact :   [email protected]
'''

# here put the import lib
import json
import requests
import pprint # pretty-print (for inspecting the returned JSON)

url = "https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false"
headers = {
# "Cookie": getCookie(),
"Host": "www.lagou.com",
"Origin": "https://www.lagou.com",
"Referer": "https://www.lagou.com/jobs/list_%E5%B5%8C%E5%85%A5%E5%BC%8F%E8%BD%AF%E4%BB%B6/p-city_215?px=default", # 防盗链
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
"X-Anit-Forge-Code": "0",
"X-Anit-Forge-Token": "None",
"X-Requested-With": "XMLHttpRequest"
}

form_data = {
"first": "false",
"pn": "1",# 页码
"kd": "嵌入式软件"
}

def getCookie():
    cookie_url = "https://www.lagou.com/jobs/list_%E5%B5%8C%E5%85%A5%E5%BC%8F%E8%BD%AF%E4%BB%B6/p-city_215?px=default"
    cookie_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    cookie_r = requests.get(cookie_url, headers=cookie_headers)
    return cookie_r.cookies

cookies = getCookie()

r = requests.post(url=url, data=form_data, headers=headers, cookies=cookies)
print(r.text)

data = r.json()
# the job list sits under content -> positionResult -> result in the returned JSON
position_data = data['content']['positionResult']['result']
with open('jobs.csv', mode="w+", encoding="utf-8") as f:
    header = ['job title', 'company name', 'company size', 'salary']
    f.write(','.join(header))
    f.write('\n')
    for item in position_data:
        d = [item['positionName'], item['companyFullName'], item['companySize'], item['salary']]
        f.write(','.join(d))
        f.write('\n')

 

 

 


Origin: https://www.cnblogs.com/Irvingcode/p/12616467.html