Crawling Tencent job postings with Python (working as of 2019/12/4)

I crawled Python-related job postings from the Tencent careers site. The code comes first, followed by some notes on the ideas and process behind it. (PS: I am a beginner. This started as homework from an instructional video I was following on Bilibili; the Tencent recruitment site has changed considerably since then and the teacher's code no longer runs, so I am posting my own version.)

import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Cookie': '__ga=GA1.2.212176558.1568885824; pgv_pvi=2298593280; _gcl_au=1.1.1370638257.1568885828; loading=agree',
    'Referer': 'https://careers.tencent.com/search.html?keyword=python',
    'Authority': 'careers.tencent.com',
    "Dnt": "1"
}


# Fetch the result Dict for the page given by indexNum
def GetJsonByIndexUrl(indexNum):
    base_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1575374831812&countryId=&cityId" \
               "=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=Python&pageIndex={}" \
               "&pageSize=10&language=zh-cn&area=cn"
    url = base_url.format(indexNum)  # insert indexNum to build the full list URL
    response = requests.get(url, headers=HEADERS)
    postDict = response.json()
    return postDict


# Collect the Id of every job in the Dict
def GetPostIdByDict(postDict):
    postIds = []
    data = postDict["Data"]
    posts = data["Posts"]
    for post in posts:
        postId = post["PostId"]
        postIds.append(postId)
    return postIds


# With the Ids in hand, fetch the content of each position.
# post_url = "https://careers.tencent.com/jobdesc.html?postId=" is the details page,
# but its data also comes from JSON, so we request the JSON directly.
# That JSON endpoint is detail_url below.
def GetDetailByPostId(postIds):
    detail_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1575389747280&postId={}&language=zh-cn"
    for id in postIds:
        detail_url_byId = detail_url.format(id)
        rsp = requests.get(detail_url_byId)
        detailData = rsp.json()
        print(detailData["Data"])


if __name__ == '__main__':
    for x in range(1, 11):  # fetch the first 10 pages of results
        myDict = GetJsonByIndexUrl(x)
        postIds = GetPostIdByDict(myDict)
        print("Page", x, "*" * 20)
        GetDetailByPostId(postIds)
        print("*" * 20)
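The list URL in the code is assembled by hand and carries a fixed timestamp. As a minimal sketch of an alternative, the query string could be built from a dict with the standard library's urlencode. The parameter names are taken from the URL in the code; the helper name build_list_url is hypothetical, and treating the 13-digit timestamp as the current time in milliseconds is my assumption, not something the site documents.

```python
import time
from urllib.parse import urlencode

LIST_API = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_list_url(page_index, keyword="Python", page_size=10, timestamp=None):
    """Return the list URL for one page of search results (hypothetical helper)."""
    if timestamp is None:
        # Assumption: the timestamp parameter is the current Unix time in milliseconds.
        timestamp = int(time.time() * 1000)
    # Parameter names and order follow the hand-written URL in the code above.
    params = {
        "timestamp": timestamp,
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "",
        "parentCategoryId": "",
        "attrId": "",
        "keyword": keyword,
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    return LIST_API + "?" + urlencode(params)


# Build the URL for page 1 with the timestamp used in the original code.
print(build_list_url(1, timestamp=1575374831812))
```

Keeping the parameters in a dict makes it easier to change the keyword or page size without editing a long string. Whether the server actually checks the timestamp is not something I have verified.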

Some notes on the ideas and process:

① When I started, I found that the job list is not part of the page's own HTML, so crawling the page itself does not return this information. Looking in the browser's Network panel, I found the request that actually returns the list data.

I named it base_url; calling requests.get on it returns the list, from which each postId can be read.

② Opening a job's details page shows the same thing: the details are not in the page's own HTML but come from a separate request, which I named detail_url.

Calling requests.get on it returns the desired information.
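The two steps can also be sketched without touching the network, which makes the parsing easy to check. This is only an illustration: the function names, the injected get_json callable, and the sample data are all hypothetical. The response shape (Data, Posts, PostId) is what the code above expects, and the detail URL is copied from it.

```python
DETAIL_URL = ("https://careers.tencent.com/tencentcareer/api/post/ByPostId"
              "?timestamp=1575389747280&postId={}&language=zh-cn")


def extract_post_ids(list_json):
    """Step 1: collect PostId values from a list response shaped like
    {"Data": {"Posts": [{"PostId": ...}, ...]}}, skipping entries without one."""
    data = list_json.get("Data") or {}
    posts = data.get("Posts") or []
    return [post["PostId"] for post in posts if "PostId" in post]


def fetch_details(post_ids, get_json):
    """Step 2: request the detail JSON for each id and yield its "Data" field.

    get_json is any callable that takes a URL and returns the parsed JSON,
    so a fake can stand in for the real HTTP call.
    """
    for post_id in post_ids:
        detail = get_json(DETAIL_URL.format(post_id))
        yield detail.get("Data")


# Made-up sample data standing in for a real list response.
sample_list = {"Data": {"Posts": [{"PostId": "1001"}, {"PostId": "1002"}]}}


def fake_get_json(url):
    # Stand-in for the HTTP request: echoes the URL back instead of calling the site.
    return {"Data": {"RequestedUrl": url}}


for detail in fetch_details(extract_post_ids(sample_list), fake_get_json):
    print(detail)
```

For a real run, get_json could be something like lambda url: requests.get(url, headers=HEADERS).json(), using the HEADERS from the code above.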

 


Origin www.cnblogs.com/fudanxi/p/11980549.html