Previously we worked out the top 10 positions with the highest starting salaries:
Next, let's crawl the detail pages of all positions in the top 10. The detail page for a position looks like this:
We need to crawl the work experience, education requirement, job function, and keywords.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import csv
from itertools import chain
import threading

def load_datas():
    '''
    Load data from high10_url.csv
    :return: dataset datas
    '''
    datas = []
    with open('high10_url.csv', encoding='utf-8') as fp:
        r = csv.reader(fp)
        for row in r:
            datas.append(row[0])
    return datas

def get_desc(url):
    ''' Crawl the detailed job description: experience, education, position, skill keywords '''
    try:
        html = urlopen(url)
    except HTTPError as e:
        print('Page was not found', e.filename)
        return []

    job_desc = []  # position details
    try:
        exp, edu, position, keys = '', '', '', []  # experience, education, position, skill keywords
        bsObj = BeautifulSoup(html.read())
        contents = bsObj.find('p', {'class': 'msg ltype'}).contents
        exp = contents[2].strip()  # experience
        edu = contents[4].strip()  # education
        print(edu)
        a_list = bsObj.findAll('a', {'class': 'el tdn'})
        for i, a in enumerate(a_list):
            if i == 0:
                position = a.get_text()  # position
            else:
                keys.append(a.get_text())  # skill keywords
        job_desc.append((exp, edu, position, keys))
    except AttributeError as e:
        print(e)
        job_desc = []
    return job_desc

def crawl(urls):
    '''
    :param urls: URLs of the position detail pages
    '''
    print('start crawling data...')
    job_desc = [get_desc(url) for url in urls]
    print('crawling finished')
    return job_desc

def save_data(all_jobs, f_name):
    '''
    Save the information to the target file
    :param all_jobs: two-dimensional list, each element holds one position's information
    '''
    print('saving data...')
    with open(f_name, 'w', encoding='utf-8', newline='') as fp:
        w = csv.writer(fp)
        # flatten into a two-dimensional list of rows
        t = list(chain(*all_jobs))
        w.writerows(t)
        print('saved {} rows of data'.format(len(t)))

urls = load_datas()
job_desc = crawl(urls)
print(job_desc)
save_data(job_desc, 'job_desc.csv')
high10_url.csv was saved earlier and holds all 64 URLs of the top 10 positions. The results in job_desc.csv are as follows:
The education column has a problem: the fifth row shows "recruit 1 person". In fact this position has no education requirement, so all "recruit x people" records are changed to "none".
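The clean-up described above can be sketched roughly as follows. This is only a minimal sketch, not the code actually used: the function name `clean_edu`, the regular expression, and the placeholder value are assumptions, and the pattern matches the translated wording shown here, so it would need adapting to the text in the real CSV.

```python
import re

# Assumption: a headcount value looks like "recruit 1 person" or "recruit 3 people".
# Adapt this pattern to the wording that actually appears in job_desc.csv.
HEADCOUNT = re.compile(r'^\s*recruit(ing)?\s+\d+\s+(person|people)\s*$', re.IGNORECASE)

def clean_edu(rows, placeholder='none'):
    """Return a copy of rows with headcount text in the education column (index 1) replaced."""
    cleaned = []
    for row in rows:
        row = list(row)  # copy so the input rows are left untouched
        if HEADCOUNT.match(row[1]):
            row[1] = placeholder
        cleaned.append(row)
    return cleaned

# Hypothetical sample rows: experience, education, position, keywords
rows = [
    ['5-7 years', 'bachelor', 'architect', "['java']"],
    ['3-4 years', 'recruit 1 person', 'engineer', '[]'],
]
print(clean_edu(rows))
```

Doing the replacement in code rather than by hand keeps the step repeatable if the crawl is run again.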
Then we can compute statistics by experience, education, and job function:
import csv
import pandas as pd
import numpy as np

def load_datas():
    '''
    Load data from job_desc.csv
    :return: dataset datas
    '''
    datas = []
    with open('job_desc.csv', encoding='utf-8') as fp:
        r = csv.reader(fp)
        for row in r:
            datas.append(row)
    return datas

def analysis(datas):
    ''' Data analysis '''
    df = pd.DataFrame({'exp': datas[:, 0],
                       'edu': datas[:, 1],
                       'position': datas[:, 2],
                       'keys': datas[:, 3]})
    count(df, 'exp', 'experience')    # statistics by experience
    count(df, 'edu', 'education')     # statistics by education
    count(df, 'position', 'position') # statistics by position

def count(df, idx, name):
    ''' Group statistics '''
    print(('grouped by ' + name).center(60, '-'))
    c = df[idx].value_counts(sort=True)
    print(c)

if __name__ == '__main__':
    # read the cleaned data
    datas = np.array(load_datas())
    analysis(datas)
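To make the grouping step concrete, here is a small standard-library sketch of what counting one column amounts to. The function name `count_column` and the sample rows are hypothetical and are not taken from job_desc.csv.

```python
from collections import Counter

def count_column(rows, idx):
    """Count how often each value appears in column idx, most frequent first."""
    return Counter(row[idx] for row in rows).most_common()

# Hypothetical sample rows: experience, education, position
sample = [
    ['5-7 years', 'bachelor', 'architect'],
    ['5-7 years', 'bachelor', 'senior software engineer'],
    ['3-4 years', 'college', 'senior software engineer'],
]
print(count_column(sample, 0))  # [('5-7 years', 2), ('3-4 years', 1)]
```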
Candidates with 5 to 7 years of experience really do find high-paying jobs most easily, though most employers require a bachelor's degree.
The job function statistics are messier. Senior software engineer and architect positions are the most common, and project manager salaries are generally lower than engineers' salaries, which is also as expected:
The skill keywords do not look friendly:
The first record reflects the skill requirements well, but the second is of no use. The keywords are added by HR themselves, and most HR staff do not understand the technology well, so a keyword analysis like this yields little.
It seems we have to resort to word segmentation to extract keywords from the job descriptions.
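As a rough illustration of the idea, technology names written in Latin letters can be pulled out of free-text job descriptions with a regular expression and counted. This is a simplified stand-in, not real word segmentation (which would need a dedicated segmentation library); the function name `extract_keywords`, the stop-word set, and the sample descriptions are all hypothetical.

```python
import re
from collections import Counter

# A token starts with a letter and may contain digits, '+', '#', '.' (e.g. c++, c#, node.js)
TOKEN = re.compile(r'[A-Za-z][A-Za-z0-9+#.]*')
# Assumption: a tiny stop-word set; a real list would be much longer
STOP_WORDS = {'and', 'or', 'the', 'of', 'with', 'in'}

def extract_keywords(descriptions):
    """Count in how many descriptions each lower-cased token appears."""
    counter = Counter()
    for text in descriptions:
        # A set counts each token once per description; strip a trailing sentence period
        tokens = {t.lower().rstrip('.') for t in TOKEN.findall(text)}
        counter.update(tokens - STOP_WORDS)
    return counter

# Hypothetical job descriptions
descs = [
    'Familiar with Java and Spring, experience with MySQL.',
    'Proficient in Java, Redis and MySQL.',
]
keywords = extract_keywords(descs)
print(keywords.most_common(5))
```

Ordinary words such as "familiar" still get through, so in practice the counts would need filtering against a list of known technology names.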
Next, we will continue by looking at which skills are in demand.
Author: I am an 8-bit
Source: http://www.cnblogs.com/bigmonkey
This article is for learning, research, and sharing. To reprint it, please contact me and credit the author and the source; non-commercial use only!