An Example of Scraping Lagou Job Listings with Python 3

This article mainly introduces how to scrape job listings from Lagou with Python 3. The sample code is explained in detail and should have some reference value for anyone learning or working with Python 3; friends who need it can follow along below.
Preface

To get a sense of the Python data-analysis job market, roughly its salary levels and hiring requirements, I decided to scrape listings from the web and analyze them. Since analysis needs data, I chose Lagou and braved the tiger's den to take the information from them. I have to say, Lagou's anti-scraping technology is quite strong; I will explain that later. Without further ado, let's get started.

First, clarify the purpose

Every crawler needs a clear purpose, unless it exists just to try something new and test the waters. What I want to know is the requirements and salary levels of Python data-analysis jobs; therefore salary, education, work experience, and job requirements are my targets.

With the purpose clear, the next step is finding where the data lives, so open the browser and locate the target. Sites like Lagou generally load their listings via Ajax: after typing "python数据分析" (python data analysis) and pressing Enter, the page renders before the job list fills in, and clicking through the page numbers changes the jobs while the page URL barely changes. We can safely guess the data comes from a POST request, so we focus on the XHR and document requests in the Network panel, and soon find what we want.
Click Preview and you can see the details stored as JSON, where 'salary', 'workYear', 'education', and 'positionId' (the id of the job-detail page) are the fields we want; the response shape is sketched below. Then look at the Form Data: kd is the search keyword and pn is the page number, and these are the parameters we carry with each request. Also pay attention to the Referer request header, which we will use a little later. Once the target is clear, let's get to it!
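
For orientation, here is roughly the shape of the JSON that comes back, heavily abridged: only the fields the crawler uses below are shown, and the example values are invented.

# Abridged shape of the response JSON (field names as used below; example values invented)
{
    "content": {
        "positionResult": {
            "result": [
                {
                    "positionId": 5203727,   # id of the job-detail page (invented example)
                    "education": "本科",      # e.g. bachelor's degree
                    "workYear": "1-3年",      # e.g. 1-3 years of experience
                    "salary": "10k-20k"
                }
            ]
        }
    }
}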

Second, start the crawler

First set the request headers; the User-Agent is usually required. Then attach the form data and POST directly with the requests library: requests.post(url, headers=headers, data=formdata). The result is an error: {"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"......","state":2402} ("You are operating too frequently, please visit again later").
The key to solving this problem is understanding Lagou's anti-scraping mechanism. Before reaching the Python data-analysis listings, you pass through the site's search page, which we may as well call start_url, where you enter the keyword and jump to the results. During that step the server sets cookies; if we carry those cookies with our request, we can get what we want. So: first request start_url to obtain the cookies, then request the target url with those cookies, and when requesting the target also send the Referer header. The Referer roughly means: I am telling the server which page this link came from, and the server can use that information when deciding how to respond. In addition, set the sleep time fairly long, or you will be blocked very easily. Now that the anti-scraping mechanism is clear, on to the code.
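
For context, here is a minimal sketch of the configuration the methods below assume. The URL and header values are illustrative rather than guaranteed (Lagou's endpoints and required headers may have changed since this was written), and in the author's script start_url, target_url, and headers live on the crawler class as self attributes.

# Illustrative configuration; not guaranteed to match Lagou's current site
start_url = 'https://www.lagou.com/jobs/list_python数据分析'  # search page that sets the cookies
target_url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'  # Ajax endpoint
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # look like a normal browser
    'Referer': 'https://www.lagou.com/jobs/list_python数据分析',  # tell the server which page we came from
}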

'''
@author: Max_Lyu
Create time: 2019/4/1
url: https://github.com/MaxLyu/Lagou_Analyze
'''
import json
import time

import requests


# The functions below are methods of the crawler class (note the self parameter)

# Request the start url to obtain the cookies Lagou sets on the search page
def get_start_url(self):
    session = requests.session()
    session.get(self.start_url, headers=self.headers, timeout=3)
    cookies = session.cookies
    return cookies


# POST to target_url together with the returned cookies and fetch the data
def post_target_url(self):
    cookies = self.get_start_url()
    for pn in range(1, 31):  # crawl 30 pages of results
        formdata = {
            'first': 'false',
            'pn': pn,  # page number
            'kd': 'python数据分析'  # search keyword: "python data analysis"
        }
        response = requests.post(self.target_url, data=formdata, cookies=cookies,
                                 headers=self.headers, timeout=3)
        self.parse(response)
        time.sleep(60)  # Lagou's anti-scraping is strict; short sleeps get you blocked


# Parse the response and collect the items
def parse(self, response):
    items = []
    data = json.loads(response.text)['content']['positionResult']['result']

    if len(data):
        for i in range(len(data)):
            positionId = data[i]['positionId']
            education = data[i]['education']
            workYear = data[i]['workYear']
            salary = data[i]['salary']
            items.append([positionId, education, workYear, salary])
    self.save_data(items)
    time.sleep(1.3)

The save_data(items) call saves the results to a file; I saved them to a CSV. Space is limited, so the full script is not reproduced here.
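
Still, for completeness, here is a minimal sketch of one way save_data could be written, appending each batch to analyst.csv (the file name matches the one read back later; the implementation itself is my reconstruction, not the author's actual code).

import csv

def save_data(self, items):
    # Append each batch of [positionId, education, workYear, salary] rows;
    # write the header only while the file is still empty (reconstruction, not the author's code)
    with open('analyst.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if f.tell() == 0:
            writer.writerow(['ID', 'education', 'workYear', 'salary'])
        writer.writerows(items)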

Third, fetch the job details

As mentioned above, the positionId lets us reach the detail page, which contains the job requirements we want. Fetching the page is fairly easy, but processing the text is not so simple: I can only locate the qualifications through the word '要求' ("requirements"), and since some postings label the section differently (for example as job skills), there is a trade-off.

'''
@author: Max_Lyu
Create time: 2019/4/1
url: https://github.com/MaxLyu/Lagou_Analyze
'''
import csv
import re
import time

import requests
from lxml import etree


def get_url():
    urls = []
    with open("analyst.csv", 'r', newline='') as file:
        # Read the csv saved by the crawler
        reader = csv.reader(file)
        for row in reader:
            # Build the detail-page url from the positionId, skipping the header row
            if row[0] != "ID":
                url = "https://www.lagou.com/jobs/{}.html".format(row[0])
                urls.append(url)
    return urls


# Fetch the detailed requirements from each job page
def get_info():
    urls = get_url()
    length = len(urls)
    for url in urls:
        print(url)
        description = ''
        print(length)  # remaining pages, counted down below
        response = requests.get(url, headers=headers)  # headers: same User-Agent headers as before
        response.encoding = 'utf-8'
        content = etree.HTML(response.text)
        detail = content.xpath('//*[@id="job_detail"]/dd[2]/div/p/text()')
        print(detail)

        for i in range(1, len(detail)):
            # '要求' means "requirements"; keep everything from the first line after it
            if '要求' in detail[i - 1]:
                for j in range(i, len(detail)):
                    detail[j] = detail[j].replace('\xa0', '')
                    detail[j] = re.sub('[、;;.0-9。]', '', detail[j])
                    description = description + detail[j] + '/'
                print(description)
                break  # stop after the first match so lines are not appended twice
        write_file(description)
        length -= 1
        time.sleep(3)
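
write_file(description) is another small helper the article does not show. Here is a minimal sketch under the assumption that one cleaned description per line is enough; the file name is hypothetical.

def write_file(description):
    # Append one job's requirement text as a single line (hypothetical helper and file name)
    with open('descriptions.txt', 'a', encoding='utf-8') as f:
        f.write(description + '\n')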

Fourth, the results
At this point, the crawling task is done. Once the data is in hand there is still a bit of analysis to do, which I will summarize next time.

Summary

That's all for this article. I hope its contents serve as a useful reference for everyone's learning or work.


