The main focus of data analysis job recruitment case, a simple analysis of the data
surroundings
win8, python3.7, pycharm, jupyter notebook
''' 遇到python不懂的问题,可以加Python学习交流群:1004391443一起学习交流,群文件还有零基础入门的学习资料 '''
text
1. Identify the purpose of analysis
For the latest job recruitment data analysis, including regional distribution, education requirements, experience requirements, salary levels and so on.
2. Data Collection
Here With reptile, crawling recruitment site recruitment information, and then analyze the relevant salary and recruitment requirements.
2.1 Analysis of the target site
Through the analysis of the target site, we need to determine the manner requested destination site, and page structure.
2.2 New scrapy project
1. Perform the following code at any path cmd command line window, for example, in: New zhaopin items under "D pythonTests" directory.
d:
cd D:pythonTests
scrapy startproject zhaopin
2. Upon completion of the project zhaopin created, the next step is in zhaopin project file folder in the new spider crawling the main program
cd zhaopin
scrapy genspider zhaopinSpider zhaopin.com
This completes the creation of the project zhaopin, start writing our program now.
2.3 Definitions items
Definition requires crawling jobs in items.py file.
import scrapy
from scrapy.item import Item, Field
class zhaopinItem(Item):
# define the fields for your item here like:
# name = scrapy.Field()
JobTitle = Field() #职位名称
CompanyName = Field() #公司名称
CompanyNature = Field() #公司性质
CompanySize = Field() #公司规模
IndustryField = Field() #所属行业
Salary = Field() #薪水
Workplace = Field() #工作地点
Workyear = Field() #要求工作经验
Education = Field() #要求学历
RecruitNumbers = Field() #招聘人数
ReleaseTime = Field() #发布时间
Language = Field() #要求语言
Specialty = Field() #要求专业
PositionAdvantage = Field() #职位福利
2.4 The main program written reptiles
Write reptiles in the main program file zhaopinSpider.py
import scrapy
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from scrapy.http import Request
from zhaopin.items import zhaopinItem
class ZhaoPinSpider(scrapy.Spider):
name = "ZhaoPinSpider"
allowed_domains = ['zhaopin.com']
start_urls = ['https://xxxx.com/list/2,{0}.html?'.format(str(page)) for page in range(1, 217)]
def parse(self, response):
'''
开始第一页
:param response:
:return:
'''
yield Request(
url = response.url,
callback = self.parse_job_url,
meta={},
dont_filter= True
)
def parse_job_url(self, response):
'''
获取每页的职位详情页url
:param response:
:return:
'''
selector = Selector(response)
urls = selector.xpath('//div[@class="el"]/p/span')
for url in urls:
url = url.xpath('a/@href').extract()[0]
yield Request(
url = url,
callback = self.parse_job_info,
meta = {},
dont_filter = True
)
def parse_job_info(self, response):
'''
解析工作详情页
:param response:
:return:
'''
item = Job51Item()
selector = Selector(response)
JobTitle = selector.xpath('//div[@class="cn"]/h1/text()').extract()[0].strip().replace(' ','').replace(',',';')
CompanyName = selector.xpath('//div[@class="cn"]/p[1]/a[1]/text()').extract()[0].strip().replace(',',';')
CompanyNature = selector.xpath('//div[@class="tCompany_sidebar"]/div/div[2]/p[1]/text()').extract()[0].strip().replace(',',';')
CompanySize = selector.xpath('//div[@class="tCompany_sidebar"]/div/div[2]/p[2]/text()').extract()[0].strip().replace(',',';')
IndustryField = selector.xpath('//div[@class="tCompany_sidebar"]/div/div[2]/p[3]/text()').extract()[0].strip().replace(',',';')
Salary = selector.xpath('//div[@class="cn"]/strong/text()').extract()[0].strip().replace(',',';')
infos = selector.xpath('//div[@class="cn"]/p[2]/text()').extract()
Workplace = infos[0].strip().replace(' ','').replace(',',';')
Workyear = infos[1].strip().replace(' ','').replace(',',';')
if len(infos) == 4:
Education = ''
RecruitNumbers = infos[2].strip().replace(' ', '').replace(',',';')
ReleaseTime = infos[3].strip().replace(' ', '').replace(',',';')
else:
Education = infos[2].strip().replace(' ', '').replace(',',';')
RecruitNumbers = infos[3].strip().replace(' ', '').replace(',',';')
ReleaseTime = infos[4].strip().replace(' ', '').replace(',',';')
if len(infos) == 7:
Language, Specialty = infos[5].strip().replace(' ',''), infos[6].strip().replace(' ','').replace(',',';')
elif len(infos) == 6:
if (('英语' in infos[5]) or ('话' in infos[5])):
Language, Specialty = infos[5].strip().replace(' ','').replace(',',';'), ''
else:
Language, Specialty = '', infos[5].strip().replace(' ','').replace(',',';')
else:
Language, Specialty = '', ''
Welfare = selector.xpath('//div[@class="t1"]/span/text()').extract()
PositionAdvantage = ';'.join(Welfare).replace(',', ';')
item['JobTitle'] =JobTitle
item['CompanyName'] =CompanyName
item['CompanyNature'] =CompanyNature
item['CompanySize'] = CompanySize
item['IndustryField'] = IndustryField
item['Salary'] =Salary
item['Workplace'] = Workplace
item['Workyear'] =Workyear
item['Education'] =Education
item['RecruitNumbers'] = RecruitNumbers
item['ReleaseTime'] =ReleaseTime
item['Language'] = Language
item['Specialty'] = Specialty
item['PositionAdvantage'] = PositionAdvantage
yield item
Csv file to save 2.5
Save to csv file via pipelines pipeline project
class Job51Pipeline(object):
def process_item(self, item, spider):
with open(r'D:DataZhaoPin.csv','a', encoding = 'gb18030') as f:
job_info = [item['JobTitle'], item['CompanyName'], item['CompanyNature'], item['CompanySize'], item['IndustryField'], item['Salary'], item['Workplace'], item['Workyear'], item['Education'], item['RecruitNumbers'], item['ReleaseTime'],item['Language'],item['Specialty'],item['PositionAdvantage'],'
']
f.write(",".join(job_info))
return item
2.6 configuration setting
Set the user agent, download delay 0.5s, close cookie tracking, call pipelines
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
DOWNLOAD_DELAY = 0.5
COOKIES_ENABLED = False
ITEM_PIPELINES = {
'job51.pipelines.Job51Pipeline': 300,
}
2.7 to run the program
New main.py file and execute the following code
from scrapy import cmdline
cmdline.execute('scrapy crawl zhaopin'.split())
Thus began crawling data, eventually crawling to more than 9,000 pieces of data, before analyzing the data, take a look at what kind of data is to enter the data overview link.
3. Data Overview
3.1 to read data
import pandas as pd
df = pd.read_csv(r'D:aPythonDataDataVisualizationshujufenxishiJob51.csv')
#由于原始数据中没有字段, 需要为其添加字段
df.columns = ['JobTitle','CompanyName','CompanyNature','CompanySize','IndustryField','Salary','Workplace','Workyear','Education','RecruitNumbers', 'ReleaseTime','Language','Specialty','PositionAdvantage']
df.info()
Throws an exception: UnicodeDecodeError: 'utf-8' codec can not decode byte 0xbd in position 0: invalid start byte
Solutions; use Notepad ++ coding will be converted to utf-8 bom format
After the conversion, perform again
抛出异常: ValueError: Length mismatch: Expected axis has 15 elements, new values have 14 elements
解决办法: 在列表['JobTitle..... PositionAdvantage ']后面追加' NNN ', 从而补齐15个元素.
追加之后, 再次执行, 执行结果为:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9948 entries, 0 to 9947
Data columns (total 15 columns):
JobTitle 9948 non-null object
CompanyName 9948 non-null object
CompanyNature 9948 non-null object
CompanySize 9948 non-null object
IndustryField 9948 non-null object
Salary 9948 non-null object
Workplace 9948 non-null object
Workyear 9948 non-null object
Education 7533 non-null object
RecruitNumbers 9948 non-null object
ReleaseTime 9948 non-null object
Language 901 non-null object
Specialty 814 non-null object
PositionAdvantage 8288 non-null object
NNN 0 non-null float64
dtypes: float64(1), object(14)
memory usage: 1.1+ MB
可以了解到的信息: 目前的数据维度9948行X15列, Education, Language, Specialty, PositionAdvantage有不同程度的缺失(NNN是最后添加, 仅仅是用来补齐15元素), 14个python对象(1个浮点型)
3.2 描述性统计
由于我们所需信息的数据类型都是python对象, 故使用以下代码
#注意是大写的字母o
df.describe(include=['O'])
从以下信息(公司名称部分我没有截图)中可以得到:
职位名称中'数据分析师'最多, 多为民营公司, 公司规模150-500人最多, 行业领域金融/投资/证券最多, 薪资中6-8千/月最多, 大多对工作经验没有要求, 学历要求多为本科, 多数均招1人等信息.
职位名称的种类就有4758种, 他们都是我们本次分析的数据分析师岗位吗, 先来确认下:
zhaopin.JobTitle.unique()
array(['零基础免费培训金融外汇数据分析师', '数据分析师(周末双休+上班舒适)', '数据分析师', ...,
'数据分析实习(J10635)', '数据分析实习(J10691)', '数据分析实习(J10713)'], dtype=object)
这仅仅显示了职位名称中的一部分,而且还都符合要求, 换种思路先看20个
JobTitle = zhaopin.groupby('JobTitle', as_index=False).count()
JobTitle.JobTitle.head(20)
0 (AI)机器学习开发工程师讲师
1 (ID67391)美资公司数据分析
2 (ID67465)美资公司数据分析
3 (ID67674)500强法资汽车制造商数据分析专员(6个月)
4 (ID67897)知名500强法资公司招聘数据分析专员
5 (Senior)DataAnalyst
6 (免费培训)数据分析师+双休+底薪
7 (实习职位)BusinessDataAnalyst/业务数据分析
8 (急)人力销售经理
9 (提供食宿)银行客服+双休
10 (日语)股票数据分析员/EquityDataAnalyst-Japanese/
11 (越南语)股票数据分析员/EquityDataAnalyst-Vietnam
12 (跨境电商)产品专员/数据分析师
13 (韩语)股票数据分析员/EquityDataAnalyst-Korean
14 ***数据分析
15 -数据分析师助理/实习生
16 -数据分析师助理/统计专员+双休五险+住宿
17 -无销售不加班金融数据分析师月入10k
18 -金融数据分析师助理6k-1.5w
19 -金融数据分析师双休岗位分红
Name: JobTitle, dtype: object
可以看到还有机器学习开发讲师, 人力销售经理, 银行客服等其他无效数据.
现在我们对数据有了大致的认识, 下来我们开始数据预处理.
4. 数据预处理
4.1 数据清洗
数据清洗的目的是不让有错误或有问题的数据进入加工过程, 其主要内容包括: 重复值, 缺失值以及空值的处理
4.1.1 删除重复值
如果数据中存在重复记录, 而且重复数量较多时, 势必会对结果造成影响, 因此我们应当首先处理重复值.
#删除数据表中的重复记录, 并将删除后的数据表赋值给zhaopin
zhaopin = df.drop_duplicates(inplace = False)
zhaopin.shape
(8927, 15)
对比之前的数据, 重复记录1021条.
4.1.2 过滤无效数据
我们了解到职位名称中存在无效数据, 我们对其的处理方式是过滤掉.
#筛选名称中包含'数据'或'分析'或'Data'的职位
zhaopin = zhaopin[zhaopin.JobTitle.str.contains('.*?数据.*?|.*?分析.*?|.*?Data.*?')]
zhaopin.shape
(7959, 15)
4.1.3 缺失值处理
在pandas中缺失值为NaN或者NaT, 其处理方式有多种:
1. 利用均值等集中趋势度量填充
2. 利用统计模型计算出的值填充
3. 保留缺失值
4. 删除缺失值
#计算每个特征中缺失值个数
zhaopin.isnull().sum()
JobTitle 0
CompanyName 0
CompanyNature 0
CompanySize 0
IndustryField 0
Salary 0
Workplace 0
Workyear 0
Education 1740
RecruitNumbers 0
ReleaseTime 0
Language 7227
Specialty 7244
PositionAdvantage 1364
NNN 7959
dtype: int64
-- Education: 缺失值占比1740/7959 = 21.86%, 缺失很有可能是"不限学历", 我们就用"不限学历"填充
zhaopin.Education.fillna('不限学历', inplace=True)
-- Language: 缺失值占比7227/7959 = 90.80%, 缺失太多, 删除特征
-- Specialty: 缺失值占比7244/7959 = 91.02%, 同样缺失很多, 删除
zhaopin.drop(['Specialty','Language'], axis=1, inplace = True)
-- PositionAdvantage: 缺失占比1364/7959 = 17.14%, 选用众数中的第一个'五险一金'填充
zhaopin.PositionAdvantage.fillna(zhaopin.PositionAdvantage.mode()[0], inplace = True)
-- NNN: 没有任何意义, 直接删除
zhaopin.drop(["NNN"], axis=1, inplace = True)
最后, 检查缺失值是否处理完毕
zhaopin.isnull().sum()
JobTitle 0
CompanyName 0
CompanyNature 0
CompanySize 0
IndustryField 0
Salary 0
Workplace 0
Workyear 0
Education 0
RecruitNumbers 0
ReleaseTime 0
PositionAdvantage 0
dtype: int64
4.2 数据加工
由于现有的数据不能满足我们的分析需求, 因此需要对现有数据表进行分列, 计算等等操作.
需要处理的特征有: Salary, Workplace
1. Salary
将薪资分为最高薪资和最低薪资, 另外了解到薪资中单位有元/小时, 元/天, 万/月, 万/年, 千/月, 统一将其转化为千/月
import re
#将5种单元进行编号
zhaopin['Standard'] = np.where(zhaopin.Salary.str.contains('元.*?小时'), 0,
np.where(zhaopin.Salary.str.contains('元.*?天'), 1,
np.where(zhaopin.Salary.str.contains('千.*?月'), 2,
np.where(zhaopin.Salary.str.contains('万.*?月'), 3,
4))))
#用'-'将Salary分割为LowSalary和HighSalary
SalarySplit = zhaopin.Salary.str.split('-', expand = True)
zhaopin['LowSalary'], zhaopin['HighSalary'] = SalarySplit[0], SalarySplit[1]
#Salary中包含'以上', '以下'或者两者都不包含的进行编号
zhaopin['HighOrLow'] = np.where(zhaopin.LowSalary.str.contains('以.*?下'), 0,
np.where(zhaopin.LowSalary.str.contains('以.*?上'), 2,
1))
#匹配LowSalary中的数字, 并转为浮点型
Lower = zhaopin.LowSalary.apply(lambda x: re.search('(d+.?d*)', x).group(1)).astype(float)
#对LowSalary中HighOrLow为1的部分进行单位换算, 全部转为'千/月'
zhaopin.LowSalary = np.where(((zhaopin.Standard==0)&(zhaopin.HighOrLow==1)), Lower*8*21/1000,
np.where(((zhaopin.Standard==1)&(zhaopin.HighOrLow==1)), Lower*21/1000,
np.where(((zhaopin.Standard==2)&(zhaopin.HighOrLow==1)), Lower,
np.where(((zhaopin.Standard==3)&(zhaopin.HighOrLow==1)), Lower*10,
np.where(((zhaopin.Standard==4)&(zhaopin.HighOrLow==1)), Lower/12*10,
Lower)))))
#对HighSalary中的缺失值进行填充, 可以有效避免匹配出错.
zhaopin.HighSalary.fillna('0千/月', inplace =True)
#匹配HighSalary中的数字, 并转为浮点型
Higher = zhaopin.HighSalary.apply(lambda x: re.search('(d+.?d*).*?', str(x)).group(1)).astype(float)
#对HighSalary中HighOrLow为1的部分完成单位换算, 全部转为'千/月'
zhaopin.HighSalary = np.where(((zhaopin.Standard==0)&(zhaopin.HighOrLow==1)),zhaopin.LowSalary/21*26,
np.where(((zhaopin.Standard==1)&(zhaopin.HighOrLow==1)),zhaopin.LowSalary/21*26,
np.where(((zhaopin.Standard==2)&(zhaopin.HighOrLow==1)), Higher,
np.where(((zhaopin.Standard==3)&(zhaopin.HighOrLow==1)), Higher*10,
np.where(((zhaopin.Standard==4)&(zhaopin.HighOrLow==1)), Higher/12*10,
np.where(zhaopin.HighOrLow==0, zhaopin.LowSalary,
zhaopin.LowSalary))))))
#查看当HighOrLow为0时, Standard都有哪些, 输出为2, 4
zhaopin[zhaopin.HighOrLow==0].Standard.unique()
#完成HighOrLow为0时的单位换算
zhaopin.loc[(zhaopin.HighOrLow==0)&(zhaopin.Standard==2), 'LowSalary'] = zhaopin[(zhaopin.HighOrLow==0)&(zhaopin.Standard==2)].HighSalary.apply(lambda x: 0.8*x)
zhaopin.loc[(zhaopin.HighOrLow==0)&(zhaopin.Standard==4), 'HighSalary'] = zhaopin[(zhaopin.HighOrLow==0)&(zhaopin.Standard==4)].HighSalary.apply(lambda x: x/12*10)
zhaopin.loc[(zhaopin.HighOrLow==0)&(zhaopin.Standard==4), 'LowSalary'] = zhaopin[(zhaopin.HighOrLow==0)&(zhaopin.Standard==4)].HighSalary.apply(lambda x: 0.8*x)
#查看当HighOrLow为2时, Srandard有哪些, 输出为4
zhaopin[zhaopin.HighOrLow==2].Standard.unique()
#完成HighOrLow为2时的单位换算
zhaopin.loc[zhaopin.HighOrLow==2, 'LowSalary'] = zhaopin[zhaopin.HighOrLow==2].HighSalary.apply(lambda x: x/12*10)
zhaopin.loc[zhaopin.HighOrLow==2, 'HighSalary'] = zhaopin[zhaopin.HighOrLow==2].LowSalary.apply(lambda x: 1.2*x)
zhaopin.LowSalary , zhaopin.HighSalary = zhaopin.LowSalary.apply(lambda x: '%.1f'%x), zhaopin.HighSalary.apply(lambda x: '%.1f'%x)
2. Workplace
对工作地区进行统一
#查看工作地有哪些
zhaopin.Workplace.unique()
#查看工作地点名字中包括省的有哪些, 结果显示全部为xx省, 且其中不会出现市级地区名
zhaopin[zhaopin.Workplace.str.contains('省')].Workplace.unique()
#将地区统一到市级
zhaopin['Workplace'] = zhaopin.Workplace.str.split('-', expand=True)[0]
3. 删除重复多余信息
zhaopin.drop(['Salary','Standard', 'HighOrLow'], axis = 1, inplace = True)
到目前为止, 我们对数据处理完成了, 接下来就是分析了.
5. 可视化分析
5.1 企业类型
import matplotlib
import matplotlib.pyplot as plt
CompanyNature_Count = zhaopin.CompanyNature.value_counts()
#设置中文字体
font = {'family': 'SimHei'}
matplotlib.rc('font', **font)
fig = plt.figure(figsize = (8, 8))
#绘制饼图, 参数pctdistance表示饼图内部字体离中心距离, labeldistance则是label的距离, radius指饼图的半径
patches, l_text, p_text = plt.pie(CompanyNature_Count, autopct = '%.2f%%', pctdistance = 0.6, labels = CompanyNature_Count.index, labeldistance=1.1, radius = 1)
m , n= 0.02, 0.028
for t in l_text[7: 11]:
t.set_y(m)
m += 0.1
for p in p_text[7: 11]:
p.set_y(n)
n += 0.1
plt.title('数据分析岗位中各类型企业所占比例', fontsize=24)
可以看出招聘中主要以民营企业, 合资企业和上市公司为主.
5.2 企业规模
CompanySize_Count = zhaopin.CompanySize.value_counts()
index, bar_width= np.arange(len(CompanySize_Count)), 0.6
fig = plt.figure(figsize = (8, 6))
plt.barh(index*(-1)+bar_width, CompanySize_Count, tick_label = CompanySize_Count.index, height = bar_width)
#添加数据标签
for x,y in enumerate(CompanySize_Count):
plt.text(y+0.1, x*(-1)+bar_width, '%s'%y, va = 'center')
plt.title('数据分析岗位各公司规模总数分布条形图', fontsize = 24)
招聘数据分析岗位的公司规模主要以50-500人为主
5.3 地区
from pyecharts import Geo
from collections import Counter
#统计各地区出现次数, 并转换为元组的形式
data = Counter(place).most_common()
#生成地理坐标图
geo =Geo("数据分析岗位各地区需求量", title_color="#fff", title_pos="center", width=1200, height=600, background_color='#404a59')
attr, value =geo.cast(data)
#添加数据点
geo.add('', attr, value, visual_range=[0, 100],visual_text_color='#fff', symbol_size=5, is_visualmap=True, is_piecewise=True)
geo.show_config()
geo.render()
可以看出北上广深等经济相对发达的地区, 对于数据分析岗位的需求量大.
参考自: https://blog.csdn.net/qq_41841569/article/details/82811153?utm_source=blogxgwz1
5.4 学历和工作经验
fig, ax = plt.subplots(1, 2, figsize = (18, 8))
Education_Count = zhaopin.Education.value_counts()
Workyear_Count = zhaopin.Workyear.value_counts()
patches, l_text, p_text = ax[0].pie(Education_Count, autopct = '%.2f%%', labels = Education_Count.index )
m = -0.01
for t in l_text[6:]:
t.set_y(m)
m += 0.1
print(t)
for p in p_text[6:]:
p.set_y(m)
m += 0.1
ax[0].set_title('数据分析岗位各学历要求所占比例', fontsize = 24)
index, bar_width = np.arange(len(Workyear_Count)), 0.6
ax[1].barh(index*(-1) + bar_width, Workyear_Count, tick_label = Workyear_Count.index, height = bar_width)
ax[1].set_title('数据分析岗位工作经验要求', fontsize= 24)
Academic requirements mostly undergraduate, college-based work experience required no work experience requirements based, visible mainly for recruitment of fresh graduates.
5.4 wages
1. The relationship between wages and job requirements
fig = plt.figure(figsize = (9,7))
#转换类型为浮点型
zhaopin.LowSalary, zhaopin.HighSalary = zhaopin.LowSalary.astype(float), zhaopin.HighSalary.astype(float)
#分别求各地区平均最高薪资, 平均最低薪资
Salary = zhaopin.groupby('Workplace', as_index = False)['LowSalary', 'HighSalary'].mean()#分别求各地区的数据分析岗位数量,并降序排列
Workplace = zhaopin.groupby('Workplace', as_index= False)['JobTitle'].count().sort_values('JobTitle', ascending = False)#合并数据表
Workplace = pd.merge(Workplace, Salary, how = 'left', on = 'Workplace')#用前20名进行绘图
Workplace = Workplace.head(20)
plt.bar(Workplace.Workplace, Workplace.JobTitle, width = 0.8, alpha = 0.8)
plt.plot(Workplace.Workplace, Workplace.HighSalary*1000, '--',color = 'g', alpha = 0.9, label='平均最高薪资')
plt.plot(Workplace.Workplace, Workplace.LowSalary*1000, '-.',color = 'r', alpha = 0.9, label='平均最低薪资')
#添加数据标签
for x, y in enumerate(Workplace.HighSalary*1000):
plt.text(x, y, '%.0f'%y, ha = 'left', va='bottom')
for x, y in enumerate(Workplace.LowSalary*1000):
plt.text(x, y, '%.0f'%y, ha = 'right', va='bottom')
for x, y in enumerate(Workplace.JobTitle):
plt.text(x, y, '%s'%y, ha = 'center', va='bottom')
plt.legend()
plt.title('数据分析岗位需求量排名前20地区的薪资水平状况', fontsize = 20)
As can be seen, with the reduction in demand, wages also decreased.
2. Compensation and empirical relationship
#求出各工作经验对应的平均最高与平均最低薪资
Salary_Year = zhaopin.groupby('Workyear', as_index = False)['LowSalary', 'HighSalary'].mean()
#求平均薪资
Salary_Year['Salary'] = (Salary_Year.LowSalary.add(Salary_Year.HighSalary)).div(2)
#转换列, 得到想要的顺序
Salary_Year.loc[0], Salary_Year.loc[6] = Salary_Year.loc[6], Salary_Year.loc[0]
#绘制条形图
plt.barh(Salary_Year.Workyear, Salary_Year.Salary, height = 0.6)
for x, y in enumerate(Salary_Year.Salary):
plt.text(y+0.1,x, '%.2f'%y, va = 'center')
plt.title('各工作经验对应的平均薪资水平(单位:千/月)', fontsize = 20)
The more work experience, the higher the salary.
3. Salary and academic relations
#计算平均薪资
Salary_Education = zhaopin.groupby('Education', as_index = False)['LowSalary', 'HighSalary'].mean()
Salary_Education['Salary'] = Salary_Education.LowSalary.add(Salary_Education.HighSalary).div(2)
Salary_Education = Salary_Education.sort_values('Salary', ascending = True)
#绘制柱形图
plt.bar(Salary_Education.Education, Salary_Education.Salary, width = 0.6)
for x,y in enumerate(Salary_Education.Salary):
plt.text(x, y, '%.2f'%y, ha = 'center', va='bottom')
plt.title('各学历对应的平均工资水平(单位:千/月)', fontsize = 20)
The higher the education, the higher the corresponding salary levels
to sum up
1. The type of business data analysis jobs in private enterprises, joint ventures and listed companies based, enterprise-scale multi-50-500 people.
2. Data analysis of job qualifications required to undergraduate, college-based, experience no work experience in the majority, it is visible mainly for graduates.
3. The economy is relatively developed areas north of Guangzhou-Shenzhen, Hangzhou and other large demand for data analysis jobs, and higher wages in other regions; the higher the education, the more rich experience corresponding salary levels will be higher.