The main library used: requests
1. The original URL is https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=. If we view the page source, we cannot find the job listings we want, because Lagou has an anti-crawler mechanism: the job listings are loaded dynamically via Ajax.
2. Press F12 to open the developer tools and switch to the Network tab. On the left, find the request named positionAjax.json?needAddtionalResult=false; the Response panel on the right contains the data.
The content is in JSON format; pasting it into http://www.bejson.com/jsonviewernew/ formats it nicely.
In it we find the job listings we want.
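Once the JSON is formatted, the nesting of the response becomes clear and the job list can be pulled out with Python's json module. The sketch below uses a small inline sample; the field names (content.positionResult.result, positionName, city) are assumptions based on what the formatted response typically looks like, so confirm them against the real response before relying on them.

```python
import json

# A minimal sample shaped like the Ajax response (field names are
# assumptions for illustration; inspect the real response to confirm).
sample = '''
{
  "content": {
    "positionResult": {
      "result": [
        {"positionName": "Python Developer", "city": "Beijing"},
        {"positionName": "Python Engineer", "city": "Shanghai"}
      ]
    }
  }
}
'''

data = json.loads(sample)
# Drill down to the list of job postings
jobs = data["content"]["positionResult"]["result"]
for job in jobs:
    print(job["positionName"], "-", job["city"])
```

The same json.loads call can be applied to the text returned by the crawler, replacing the inline sample with the real response body.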
3. Building a simple crawler
import requests

# The URL that is actually requested (the Ajax endpoint)
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
payload = {
    'first': 'true',
    'pn': '1',
    'kd': 'python',
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'Accept': 'application/json, text/javascript, */*; q=0.01'
}
# The original page URL
urls = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
# Establish a session
s = requests.Session()
# GET the search page first so the session collects its cookies
s.get(urls, headers=header, timeout=3)
# Grab the cookies from the session
cookie = s.cookies
# POST to the Ajax endpoint with those cookies and get the response text
response = s.post(url, data=payload, headers=header, cookies=cookie, timeout=5).text
print(response)
The output of this section is as follows: