A beginner's crawler: scraping job listings from Lagou

The main library used: requests

1. The original URL is https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=. If we view the page source, we cannot find the job listings we want, because Lagou has an anti-crawler mechanism: the job data is loaded dynamically via Ajax.
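To verify this yourself, here is a minimal sketch (the trimmed User-Agent and the salary string to search for are placeholders, not from the original post): fetch the page source with requests and search it for a job detail you can see in the browser; it should be missing, because the listings are filled in by JavaScript after the page loads.

import requests

page_url = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
# fetch the raw page source, the same text "view source" shows
html = requests.get(page_url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=3).text
# a salary such as '15k-25k' is visible in the rendered page but absent here,
# because the job list is injected by the Ajax call found in step 2
print('15k-25k' in html)  # expected: False for the static HTML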

2. Press F12 to open the developer tools and go to the Network tab. On the left, find the request named positionAjax.json?needAddtionalResult=false; on the right, look at its Response.

 

Paste the response content into http://www.bejson.com/jsonviewernew/ to display it as formatted JSON:

 

In the formatted output we can find the job listings we want.
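Instead of an online viewer, the JSON can also be formatted locally. A small sketch, assuming the response text has already been fetched as in step 3 below:

import json

# pretty-print the Ajax response with indentation, keeping Chinese
# characters readable instead of \u escapes
data = json.loads(response)
print(json.dumps(data, indent=2, ensure_ascii=False))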

3. Building a simple crawler

import requests

# the actual URL to crawl (the Ajax endpoint)
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'

payload = {
    'first': 'true',  # whether this is the first search
    'pn': '1',        # page number
    'kd': 'python',   # search keyword
}

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'Accept': 'application/json, text/javascript, */*; q=0.01'
}

# the original page URL
urls = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='

# create a session
s = requests.Session()
# visit the search page first to obtain cookies
s.get(urls, headers=header, timeout=3)
# grab the cookies from the session
cookie = s.cookies
# POST the Ajax request and get the response text
response = s.post(url, data=payload, headers=header, cookies=cookie, timeout=5).text
print(response)

Part of the output is as follows:

 
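From here the raw text can be parsed into Python objects. A minimal sketch of extracting a few fields per listing; the content.positionResult.result path and field names such as positionName and salary are assumptions read off the response structure seen in the JSON viewer and may change if Lagou updates its API:

import json

data = json.loads(response)
# drill down to the list of jobs in the response
jobs = data['content']['positionResult']['result']
for job in jobs:
    # print a few fields per listing; field names follow the observed JSON
    print(job['positionName'], job['companyFullName'], job['salary'], job['city'])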


Original post: www.cnblogs.com/xiximayou/p/11703742.html