A Python programmer crawled nearly 100,000 job postings to show you which talent and which skills are most in demand! | Force Plan ...

Author | Huang supreme    Editor | Guo Rui

Produced by | CSDN Blog

Cover image | CSDN, downloaded from Visual China

With the rapid development of technology, data is growing explosively and nobody can avoid dealing with it, so society's demand for "data" talent keeps increasing. What kinds of talent are companies currently recruiting, and what skills do they require? Whether you are a student or a job seeker, these questions are worth understanding.

With that in mind, we crawled a large amount of nationwide data from the 51job recruitment website for big data, data analysis, data mining, machine learning, artificial intelligence and other related positions. We then compared the salaries and education requirements of different positions, compared the demand for these roles across regions and industries, and compared the knowledge and skill requirements of different positions.

The final results of the project are shown below:

The dynamic (animated) effect is shown below:

Data crawling (based on the 51job recruitment website)

  • Positions crawled: big data, data analysis, machine learning, artificial intelligence and other related positions;

  • Fields crawled: company name, job title, work location, salary, posting date, job description, company type, company size (number of employees), industry;

  • Description: on 51job we searched nationwide for "数据" (data) positions, which gave roughly 2,000 pages of results. The fields we crawled come partly from the first-level listing pages and partly from the second-level detail pages;

  • Crawling approach: first parse the data of a single listing page, then parse the corresponding second-level detail pages, and finally loop over the remaining pages;

  • Tools used: Python + requests + lxml + pandas + time

  • Page parsing method: XPath (a short sketch follows this list)
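Before writing the full crawler, here is a minimal sketch (not from the original post) of how lxml parses HTML and how XPath pulls out attributes and text. The HTML fragment and URL below are made up, shaped loosely like one 51job listing row, purely for illustration.

from lxml import etree

html = '''
<div class="el">
  <p class="t1"><span><a target="_blank" title="数据分析师" href="https://example.com/job/1">数据分析师</a></span></p>
  <span class="t2"><a target="_blank" title="某某科技有限公司">某某科技有限公司</a></span>
  <span class="t3">上海</span>
  <span class="t4">1-1.5万/月</span>
</div>
'''

dom = etree.HTML(html)
# @title selects an attribute, text() selects the text content of a node
print(dom.xpath('//div[@class="el"]//p/span/a[@target="_blank"]/@title'))  # ['数据分析师']
print(dom.xpath('//div[@class="el"]/span[@class="t3"]/text()'))            # ['上海']
print(dom.xpath('//div[@class="el"]/span[@class="t4"]/text()'))            # ['1-1.5万/月']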

1) Import the required libraries

import requests
import pandas as pd
from pprint import pprint
from lxml import etree
import time
import warnings
warnings.filterwarnings("ignore")

2) The page URL pattern

# URL of the first page
https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,1.html?
# URL of the second page
https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,2.html?
# URL of the third page
https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,3.html?

Note: comparing the URLs above, only the page number at the end changes, so we can build each page's URL by simple string concatenation and then crawl the pages in a loop.
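As a quick illustration (the same concatenation the full crawler below uses), the page URLs can be generated like this:

url_pre = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,"
url_end = ".html?"
page_urls = [url_pre + str(i) + url_end for i in range(1, 4)]  # URLs of the first three pages
for u in page_urls:
    print(u)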

3) The complete crawling code

import requests
import pandas as pd
from pprint import pprint
from lxml import etree
import time
import warnings
warnings.filterwarnings("ignore")


for i in range(1,1501):
    print("正在爬取第" + str(i) + "页的数据")
    url_pre = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,"
    url_end = ".html?"
    url = url_pre + str(i) + url_end
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    }
    web = requests.get(url, headers=headers)
    web.encoding = "gbk"
    dom = etree.HTML(web.text)
    # 1. Job title
    job_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@title')
    # 2. Company name
    company_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t2"]/a[@target="_blank"]/@title')
    # 3. Work location
    address = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t3"]/text()')
    # 4. Salary
    salary_mid = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t4"]')
    salary = [s.text for s in salary_mid]
    # 5. Posting date
    release_time = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t5"]/text()')
    # 6. URLs of the second-level (detail) pages
    deep_url = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@href')
    RandomAll = []
    JobDescribe = []
    CompanyType = []
    CompanySize = []
    Industry = []
    for j in range(len(deep_url)):
        web_test = requests.get(deep_url[j], headers=headers)
        web_test.encoding = "gbk"
        dom_test = etree.HTML(web_test.text)
        # 7. Experience and education info; kept together in one field for now and cleaned later. Named random_all
        random_all = dom_test.xpath('//div[@class="tHeader tHjob"]//div[@class="cn"]/p[@class="msg ltype"]/text()')
        # 8. Job description
        job_describe = dom_test.xpath('//div[@class="tBorderTop_box"]//div[@class="bmsg job_msg inbox"]/p/text()')
        # 9. Company type
        company_type = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[1]/@title')
        # 10. Company size (number of employees)
        company_size = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[2]/@title')
        # 11. Industry (of the company)
        industry = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[3]/@title')
        # Append the information above to the corresponding lists
        RandomAll.append(random_all)
        JobDescribe.append(job_describe)
        CompanyType.append(company_type)
        CompanySize.append(company_size)
        Industry.append(industry)
        # Sleep between requests to avoid anti-crawling measures
        time.sleep(1)
    # Since we crawl many pages, we save after every page instead of saving everything
    # at the end, so that a single failure does not lose all of the data.
    df = pd.DataFrame()
    df["岗位名称"] = job_name
    df["公司名称"] = company_name
    df["工作地点"] = address
    df["工资"] = salary
    df["发布日期"] = release_time
    df["经验、学历"] = RandomAll
    df["公司类型"] = CompanyType
    df["公司规模"] = CompanySize
    df["所属行业"] = Industry
    df["岗位描述"] = JobDescribe
    # Writing out may occasionally fail, so we handle it with a try-except.
    try:
        df.to_csv("job_info.csv", mode="a+", header=None, index=None, encoding="gbk")
    except:
        print("当页数据写入失败")
    time.sleep(1)
print("数据爬取完毕,是不是很开心!!!")

As you can see, we crawl more than 1,000 pages of data for the final analysis. We therefore save the data after every crawled page rather than storing everything in one go at the end, which could fail and lose all of the work. Testing also showed that writing some pages to disk fails, so to keep the rest of the code running we wrap the write in a try-except block.

From the first-level listing pages we crawled the fields "job title", "company name", "work location", "salary", "posting date" and "second-level URL".

From the second-level detail pages we crawled the fields "experience and education", "job description", "company type", "company size" and "industry".

Data preprocessing

A look at a sample of the crawled data shows that it is quite messy. Messy data is not suitable for analysis, so we preprocess it with the analysis goals in mind, ending up with data that can be used for visualization.

1) Import the related libraries and read the data

import numpy as np
import pandas as pd
import re
import jieba

df = pd.read_csv(r"G:\8泰迪\python_project\51_job\job_info1.csv",engine="python",header=None)
# Set the row index of the DataFrame
df.index = range(len(df))
# Set the column names of the DataFrame
df.columns = ["岗位名","公司名","工作地点","工资","发布日期","经验与学历","公司类型","公司规模","行业","工作描述"]

2) Data deduplication

We treat records with the same company name and the same job title as duplicates. We therefore use the drop_duplicates(subset=[...]) function to remove duplicates based on the "job title" and "company name" columns.

# Number of records before deduplication
print("去重之前的记录数",df.shape)
# Drop duplicate records
df.drop_duplicates(subset=["公司名","岗位名"],inplace=True)
# Number of records after deduplication
print("去重之后的记录数",df.shape)

3) Processing the job title field

① Explore the job title field

df["岗位名"].value_counts()
df["岗位名"] = df["岗位名"].apply(lambda x:x.lower())

Note: first we count how often each job title occurs, and we can see that the job title field is far too messy to use for statistical analysis directly. We then convert all titles to lowercase, so that, for example, "AI" and "Ai" are treated as the same thing.

② Construct the target positions to analyze and filter the data

df.shape    # number of records before filtering
target_job = ['算法', '开发', '分析', '工程师', '数据', '运营', '运维']
index = [df["岗位名"].str.count(i) for i in target_job]
index = np.array(index).sum(axis=0) > 0
job_info = df[index]
job_info.shape    # number of records after filtering

Note: first we construct the seven target keywords above. Then, using the count() function, we check every record for those keywords: if a job title contains at least one of them the record is kept, otherwise it is dropped. Finally we check how many records remain after the filtering.
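A small toy example (made-up job titles, not from the crawled data) may make the boolean indexing above easier to follow:

import numpy as np
import pandas as pd

titles = pd.Series(["大数据开发工程师", "销售经理", "数据分析专员"])
target_job = ['算法', '开发', '分析', '工程师', '数据', '运营', '运维']
counts = [titles.str.count(k) for k in target_job]    # one count Series per keyword
mask = np.array(counts).sum(axis=0) > 0               # True where any keyword appears
print(mask)          # [ True False  True]
print(titles[mask])  # rows 0 and 2 are kept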

③ Normalize the target job titles (the raw titles are too messy, so we unify them)

job_list = ['数据分析', "数据统计","数据专员",'数据挖掘', '算法', 
            '大数据','开发工程师', '运营', '软件工程', '前端开发',
            '深度学习', 'ai', '数据库', '数据库', '数据产品',
            '客服', 'java', '.net', 'andrio', '人工智能', 'c++',
            '数据管理',"测试","运维"]
job_list = np.array(job_list)
def rename(x=None,job_list=job_list):
    index = [i in x for i in job_list]
    if sum(index) > 0:
        return job_list[index][0]
    else:
        return x
job_info["岗位名"] = job_info["岗位名"].apply(rename)
job_info["岗位名"].value_counts()
# Unify 数据统计 and 数据专员 under 数据分析
job_info["岗位名"] = job_info["岗位名"].apply(lambda x:re.sub("数据专员","数据分析",x))
job_info["岗位名"] = job_info["岗位名"].apply(lambda x:re.sub("数据统计","数据分析",x))

Note: first we define job_list, the list of target job titles we want to map to, and convert it to an ndarray. We then define a function: if a record contains one keyword from job_list, the record is replaced by that keyword; if it contains several keywords from job_list, only the first matching keyword is used. We then use value_counts() to check the frequency of the titles after the replacement. Finally, we group "数据专员" and "数据统计" together under "数据分析".
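As a quick sanity check of this behaviour (toy inputs, assuming the rename() function and job_list defined above):

print(rename("大数据开发工程师"))  # -> '大数据'  ('大数据' comes before '开发工程师' in job_list)
print(rename("前台接待"))          # -> '前台接待' (no keyword matched, the original title is kept)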

4) Processing the salary field

The salary field comes in forms such as "20-30万/年" (200,000-300,000 per year), "2.5-3万/月" (25,000-30,000 per month) and "3.5-4.5千/月" (3,500-4,500 per month). We need to unify these into a single "yuan per month" format, then take the two numbers and average them.

job_info["工资"].str[-1].value_counts()
job_info["工资"].str[-3].value_counts()


index1 = job_info["工资"].str[-1].isin(["年","月"])
index2 = job_info["工资"].str[-3].isin(["万","千"])
job_info = job_info[index1 & index2]


def get_money_max_min(x):
    try:
        if x[-3] == "万":
            z = [float(i)*10000 for i in re.findall("[0-9]+\.?[0-9]*",x)]
        elif x[-3] == "千":
            z = [float(i) * 1000 for i in re.findall("[0-9]+\.?[0-9]*", x)]
        if x[-1] == "年":
            z = [i/12 for i in z]
        return z
    except:
        return x


salary = job_info["工资"].apply(get_money_max_min)
job_info["最低工资"] = salary.str[0]
job_info["最高工资"] = salary.str[1]
job_info["工资水平"] = job_info[["最低工资","最高工资"]].mean(axis=1)

Note: first we filter the data: for each record, we keep it only if its last character is "年" or "月" and its third-to-last character is "万" or "千"; otherwise it is dropped. We then define a function that converts everything into a unified "yuan per month" format. Finally, we average the minimum and maximum salaries to obtain the final "salary level" field.
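A few worked examples (toy salary strings, using the get_money_max_min() function defined above) show the conversion:

print(get_money_max_min("2.5-3万/月"))    # [25000.0, 30000.0]           already yuan per month
print(get_money_max_min("20-30万/年"))    # roughly [16666.67, 25000.0]  yearly figures divided by 12
print(get_money_max_min("3.5-4.5千/月"))  # [3500.0, 4500.0]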

5) Processing the work location field

Since the data covers the whole country, a very large number of cities appear. We therefore define a custom list of target work locations and unify the field against it.

#job_info["工作地点"].value_counts()
address_list = ['北京', '上海', '广州', '深圳', '杭州', '苏州', '长沙',
                '武汉', '天津', '成都', '西安', '东莞', '合肥', '佛山',
                '宁波', '南京', '重庆', '长春', '郑州', '常州', '福州',
                '沈阳', '济南', '宁波', '厦门', '贵州', '珠海', '青岛',
                '中山', '大连','昆山',"惠州","哈尔滨","昆明","南昌","无锡"]
address_list = np.array(address_list)


def rename(x=None,address_list=address_list):
    index = [i in x for i in address_list]
    if sum(index) > 0:
        return address_list[index][0]
    else:
        return x
job_info["工作地点"] = job_info["工作地点"].apply(rename)

Note: first we define the list of target work locations and convert it to an ndarray. We then define a function that replaces the original work location of each record with the matching target city.

6) Processing the company type field

This one is simple: values that are too short to be valid are set to NaN, and the surrounding "['...']" wrapper characters are stripped off.

job_info.loc[job_info["公司类型"].apply(lambda x:len(x)<6),"公司类型"] = np.nan
job_info["公司类型"] = job_info["公司类型"].str[2:-2]

7) Processing the industry field

A company may carry several industry labels, but by default we take only the first one as the company's industry.

# job_info["行业"].value_counts()
job_info["行业"] = job_info["行业"].apply(lambda x:re.sub(",","/",x))
job_info.loc[job_info["行业"].apply(lambda x:len(x)<6),"行业"] = np.nan
job_info["行业"] = job_info["行业"].str[2:-2].str.split("/").str[0]

8) Processing the experience and education field

I thought for a while about how to describe the processing of this field and could not find a good way to put it, so please work through the code yourself; a small worked example follows the code.

job_info["学历"] = job_info["经验与学历"].apply(lambda x:re.findall("本科|大专|应届生|在校生|硕士",x))
def func(x):
    if len(x) == 0:
        return np.nan
    elif len(x) == 1 or len(x) == 2:
        return x[0]
    else:
        return x[2]
job_info["学历"] = job_info["学历"].apply(func)

9) Processing the job description field

For each row, we segment the description into words with jieba and then remove the stop words (a small example follows the code).

with open(r"G:\8泰迪\python_project\51_job\stopword.txt","r") as f:
    stopword = f.read()
stopword = stopword.split()
stopword = stopword + ["任职","职位"," "]


job_info["工作描述"] = job_info["工作描述"].str[2:-2].apply(lambda x:x.lower()).apply(lambda x:"".join(x))\
    .apply(jieba.lcut).apply(lambda x:[i for i in x if i not in stopword])
job_info.loc[job_info["工作描述"].apply(lambda x:len(x) < 6),"工作描述"] = np.nan

10) Processing the company size field

#job_info["公司规模"].value_counts()
def func(x):
    if x == "['少于50人']":
        return "<50"
    elif x == "['50-150人']":
        return "50-150"
    elif x == "['150-500人']":
        return '150-500'
    elif x == "['500-1000人']":
        return '500-1000'
    elif x == "['1000-5000人']":
        return '1000-5000'
    elif x == "['5000-10000人']":
        return '5000-10000'
    elif x == "['10000人以上']":
        return ">10000"
    else:
        return np.nan
job_info["公司规模"] = job_info["公司规模"].apply(func)

11) Construct the new dataset

Finally, from the cleaned data we select the fields to be analyzed and save them.

feature = ["公司名","岗位名","工作地点","工资水平","发布日期","学历","公司类型","公司规模","行业","工作描述"]
final_df = job_info[feature]
final_df.to_excel(r"G:\8泰迪\python_project\51_job\词云图.xlsx",encoding="gbk",index=None)

Special processing of the job description field

Since we later need to build a separate word cloud for each type of position, and the visualization is done in Tableau, we group the data by job title and compute the keyword frequency statistics for each position.

import numpy as np
import pandas as pd
import re
import jieba
import warnings
warnings.filterwarnings("ignore")


df = pd.read_excel(r"G:\8泰迪\python_project\51_job\new_job_info1.xlsx",encoding="gbk")
df


def get_word_cloud(data=None, job_name=None):
    words = []
    describe = data['工作描述'][data['岗位名'] == job_name].str[1:-1]
    describe.dropna(inplace=True)
    [words.extend(i.split(',')) for i in describe]
    words = pd.Series(words)
    word_fre = words.value_counts()
    return word_fre


zz = ['数据分析', '算法', '大数据','开发工程师', '运营', '软件工程','运维', '数据库','java',"测试"]
for i in zz:
    word_fre = get_word_cloud(data=df, job_name='{}'.format(i))
    word_fre = word_fre[1:].reset_index()[:100]
    word_fre["岗位名"] = pd.Series("{}".format(i),index=range(len(word_fre)))
    word_fre.to_csv(r"G:\8泰迪\python_project\51_job\词云图\bb.csv", mode='a',index=False, header=None,encoding="gbk")

Tableau visualization

1) TOP10 popular cities by employment demand

2) Number of job postings in the TOP10 popular cities

3) Bubble chart of work locations for different positions

4) Salary levels of popular positions

5) Salary levels of popular industries

6) The final large-screen visualization

"Dynamic" 7) visualization of large-screen display

Note: no written conclusions are given for the final analysis, because they can be seen clearly from the charts themselves.

Disclaimer: this article was originally written by CSDN blogger "Huang supreme" and is published with CSDN's official authorization.

Original link: https://blog.csdn.net/weixin_41261833/article/details/104924038

【End】
