I crawled the Python-related job postings on Tencent's recruitment site. The code comes first, followed by some notes on the ideas and process behind it. (PS: I am a beginner. This was homework from a tutorial video series I am following on Bilibili. The Tencent recruitment site has changed so much that the teacher's code no longer runs, so I am posting my own version.)
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Cookie': '__ga=GA1.2.212176558.1568885824; pgv_pvi=2298593280; _gcl_au=1.1.1370638257.1568885828; loading=agree',
    'Referer': 'https://careers.tencent.com/search.html?keyword=python',
    'Authority': 'careers.tencent.com',
    "Dnt": "1"
}


# Fetch the JSON dict for the result page given by indexNum
def GetJsonByIndexUrl(indexNum):
    base_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1575374831812&countryId=&cityId=" \
               "&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex={}" \
               "&pageSize=10&language=zh-cn&area=cn"
    url = base_url.format(indexNum)  # insert indexNum to build the full list URL
    response = requests.get(url, headers=HEADERS)
    postDict = response.json()
    return postDict


# Collect the Id of every job posting in the dict
def GetPostIdByDict(postDict):
    postIds = []
    data = postDict["Data"]
    posts = data["Posts"]
    for post in posts:
        postId = post["PostId"]
        postIds.append(postId)
    return postIds


# Once we have the Ids, fetch the content of each posting.
# post_url = "https://careers.tencent.com/jobdesc.html?postId=" is the details page,
# but its data also comes from JSON, so we request the JSON directly.
# That request is detail_url below.
def GetDetailByPostId(postIds):
    detail_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1575389747280&postId={}&language=zh-cn"
    for post_id in postIds:
        detail_url_byId = detail_url.format(post_id)
        rsp = requests.get(detail_url_byId)
        detailData = rsp.json()
        print(detailData["Data"])


if __name__ == '__main__':
    for x in range(1, 11):  # fetch the first 10 pages of results
        myDict = GetJsonByIndexUrl(x)
        postIds = GetPostIdByDict(myDict)
        print("Page", x, "*" * 20)
        GetDetailByPostId(postIds)
        print("*" * 20)
Some notes on the ideas and process:
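① When I started, I found that the job list is not part of the HTML of the current page, so crawling the page itself does not return this information. Looking at the Network panel in the browser's developer tools, I found the request that returns the list data.
I named it base_url. Calling requests.get on it returns the list, from which each posting's postId can be read.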
② Opening the details page of a posting, I found that the details are not in the page itself either. They come from another request, which I named detail_url.
Calling requests.get on it returns the desired information.
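The two steps above can also be sketched without the long hand-written query strings. This is only a rough sketch under some assumptions: the helper names (build_list_url, build_detail_url, extract_post_ids) are hypothetical, the timestamp is assumed to be the current time in milliseconds, and it is assumed that the server accepts a query with the empty parameters (countryId, cityId and so on) left out. The sample payload is made up for illustration and only mirrors the Data / Posts / PostId shape used in the code above.

```python
import time
from urllib.parse import urlencode

API_ROOT = "https://careers.tencent.com/tencentcareer/api/post"


def current_millis():
    # Assumption: the "timestamp" parameter is the current time in milliseconds.
    return int(time.time() * 1000)


def build_list_url(page_index, keyword="python", page_size=10):
    """Step 1: build the list request (base_url) for one page of results."""
    params = {
        "timestamp": current_millis(),
        "keyword": keyword,
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    return API_ROOT + "/Query?" + urlencode(params)


def build_detail_url(post_id):
    """Step 2: build the details request (detail_url) for one posting."""
    params = {
        "timestamp": current_millis(),
        "postId": post_id,
        "language": "zh-cn",
    }
    return API_ROOT + "/ByPostId?" + urlencode(params)


def extract_post_ids(payload):
    """Read every PostId from a list response; return [] if nothing is there."""
    posts = (payload.get("Data") or {}).get("Posts") or []
    return [post["PostId"] for post in posts]


if __name__ == "__main__":
    # Made-up payload for illustration only, not real API output.
    sample = {"Data": {"Posts": [{"PostId": "1001"}, {"PostId": "1002"}]}}
    for post_id in extract_post_ids(sample):
        print(build_detail_url(post_id))
```

Each URL built this way can be passed to requests.get exactly as in the code above. Keeping the URL building and the JSON parsing in separate functions makes it possible to check them without sending any request.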