Salary status of Suzhou Java positions (2)

  The previous post compiled the top 10 positions with the highest starting salaries:

  Next, we crawl the detail pages of all the top-10 positions. A position's detail page looks like this:

  We need to crawl the work experience, education requirement, position, and skill keywords.

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import csv
from itertools import chain

def load_datas():
    '''
    Load the data from high10_url.csv
    :return: dataset datas
    '''
    datas = []
    with open('high10_url.csv', encoding='UTF-8') as fp:
        r = csv.reader(fp)
        for row in r:
            datas.append(row[0])
    return datas

def get_desc(url):
    '''Crawl the job details, including: experience, education, position, skill keywords'''
    try:
        html = urlopen(url)
    except HTTPError as e:
        print('Page was not found', e.filename)
        return []

    job_desc = []  # job details
    try:
        exp, edu, position, keys = '', '', '', []  # experience, education, position, skill keywords
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        contents = bsObj.find('p', {'class': 'msg ltype'}).contents
        exp = contents[2].strip()  # experience
        edu = contents[4].strip()  # education
        print(edu)
        a_list = bsObj.findAll('a', {'class': 'el tdn'})
        for i, a in enumerate(a_list):
            if i == 0:
                position = a.get_text()  # position
            else:
                keys.append(a.get_text())  # skill keywords
        job_desc.append((exp, edu, position, keys))
    except AttributeError as e:
        print(e)
        job_desc = []
    return job_desc

def crawl(urls):
    '''
    :param urls: urls of the job-detail pages
    '''
    print('start crawling data...')
    job_desc = [get_desc(url) for url in urls]
    print('crawling finished')
    return job_desc

def save_data(all_jobs, f_name):
    '''
    Save the information to the target file
    :param all_jobs: a 2D list; each element holds the information of one position
    '''
    print('saving data...')
    with open(f_name, 'w', encoding='UTF-8', newline='') as fp:
        w = csv.writer(fp)
        # flatten the 3D list into a 2D list
        t = list(chain(*all_jobs))
        w.writerows(t)
        print('stored {} records'.format(len(t)))

urls = load_datas()
job_desc = crawl(urls)
print(job_desc)
save_data(job_desc, 'job_desc.csv')

  high10_url.csv, saved earlier, contains all 64 URLs of the top-10 positions. The results in job_desc.csv are as follows:

  There is a problem in the education column: the fifth row shows "hiring 1 person", but that position actually has no education requirement, so all records of the form "hiring x people" are changed to "None".
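A minimal sketch of this cleaning step, assuming the raw records follow the pattern "招N人" ("hiring N people") and the replacement value is "无" ("None"); the helper name `clean_edu` is my own:

```python
import re

def clean_edu(edu):
    # rows like '招1人' ('hiring 1 person') carry no education
    # requirement, so normalize them to '无' ('None')
    if re.fullmatch(r'招\d+人', edu):
        return '无'
    return edu

print(clean_edu('招1人'))  # → 无
print(clean_edu('本科'))   # → 本科
```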

  Then we can compute statistics by experience, education, and position:

import csv
import pandas as pd
import numpy as np

def load_datas():
    '''
    Load the data from job_desc.csv
    :return: dataset datas
    '''
    datas = []
    with open('job_desc.csv', encoding='UTF-8') as fp:
        r = csv.reader(fp)
        for row in r:
            datas.append(row)
    return datas

def analysis(datas):
    '''Data analysis'''
    df = pd.DataFrame({'exp': datas[:, 0],
                       'edu': datas[:, 1],
                       'position': datas[:, 2],
                       'keys': datas[:, 3]})
    count(df, 'exp', 'experience')  # count by experience
    count(df, 'edu', 'education')   # count by education
    count(df, 'position', 'position')  # count by position

def count(df, idx, name):
    '''Group statistics'''
    print(('group by ' + name).center(60, '-'))
    c = df[idx].value_counts(sort=True)
    print(c)

if __name__ == '__main__':
    # read and clean the data
    datas = np.array(load_datas())
    analysis(datas)
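To illustrate what `count` prints, `value_counts` tallies each distinct value in a column, most frequent first (the sample data below is made up):

```python
import pandas as pd

# made-up sample of the 'exp' column
df = pd.DataFrame({'exp': ['5-7年经验', '3-4年经验', '5-7年经验',
                           '8-9年经验', '5-7年经验']})
c = df['exp'].value_counts(sort=True)
print(c)  # '5-7年经验' appears 3 times and is listed first
```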

  

  5 to 7 years of experience really is the sweet spot for landing a high-paying job, and most employers require a bachelor's degree.

  The position statistics are messier: senior software engineer and architect posts are the most common, while positions such as project manager generally pay less than engineers, which matches expectations:

  The skill keywords do not look so friendly:

  The first record reflects the skill requirements well; the second is useless. Because the keywords are added by HR staff, most of whom do not really understand the technology, records like the third appear, whose keywords contribute nothing to the analysis.

  It seems we will have to resort to word-segmentation techniques to extract keywords from the job descriptions.

  Next, we will continue to look at which skills are in demand.


  Author: "I am 8-bit"

  Source: http://www.cnblogs.com/bigmonkey

  This article is for learning, research, and sharing. For reprints, please contact me and credit the author and source; non-commercial use only!

  Scan the QR code to follow the official account "I am 8-bit"


Origin www.cnblogs.com/bigmonkey/p/11775400.html