Analysis of employee turnover with the Kaggle HR dataset

Problem background

This article works through a Kaggle case study: employee turnover analysis. The data set contains factors that may affect turnover (nine variables such as performance rating, job, and salary level) together with a label indicating whether the employee has left.
The data is the hr-analytics data set from the official Kaggle website. (Readers who cannot download it can leave a comment.)
Data mining goals: identify the factors that drive employee turnover and build a model that predicts it.

Variable description

The following variable description table explains each variable in the data set.
[Table: variable descriptions]

Reading and displaying the data

1 Read data

# Employee turnover analysis
import pandas as pd
import matplotlib.pyplot as plt
# 1. Read the data
data = pd.read_csv('HR_comma_sep.csv')

2 Preliminary data exploration
data.info() gives the overall picture: 10 variables and 14,999 samples with no null values, so the data quality is high. data.shape returns the shape of the data frame, data.columns lists the column names, data.head(10) shows the first ten rows, and data.describe() summarizes the distribution of each numeric variable.
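
A consolidated sketch of these exploration calls (outputs are summarized in the comments; this block is not in the original post):

# Preliminary exploration, collected in one place
data.info()             # 14,999 rows, 10 columns, no nulls
print(data.shape)       # (14999, 10)
print(data.columns)     # the column names
print(data.head(10))    # the first ten rows
print(data.describe())  # summary statistics for each numeric column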

Descriptive statistics

Distribution of numeric variables - box plots

To avoid confusion, rename the original 'sales' column to 'jobs' and move the dependent variable to the first column.

# To avoid confusion, rename the original 'sales' column to 'jobs'
data.columns = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years', 'jobs', 'salary']
# For convenience, move the dependent variable 'left' to the first column
data = data.reindex(columns=['left', 'satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'promotion_last_5years', 'salary', 'jobs'])
# 3.1 Distribution of the numeric variables
plt.rcParams['font.sans-serif'] = ['simsun']  # default font (only needed for CJK labels)
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign from rendering as a box with a non-default font
plt.rcParams['boxplot.medianprops.color'] = 'red'
columns_number = ['satisfaction_level', 'last_evaluation', 'number_project',
                  'average_montly_hours', 'time_spend_company']  # numeric columns

titles = ['satisfaction', 'performance', 'projects', 'avg monthly hours', 'years at company']
fig = plt.figure(figsize=(10, 5), num=1)
for i, col in enumerate(columns_number):
    plt.subplot(1, 9, 2 * i + 1)  # odd slots of a 1x9 grid leave gaps between the five plots
    plt.boxplot(data[col], widths=0.5)
    plt.title(titles[i])
    plt.xlabel(col)
    plt.xticks([])
del i, col

[Figure: box plots of the five numeric variables]
Apart from years at the company, there are no outliers in satisfaction, performance, number of projects, or average monthly hours. Average satisfaction is above 0.5, half of the employees work on more than four projects, and half average more than 200 working hours a month, so overtime is fairly serious.
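
These medians can be checked directly; a quick sketch (not in the original post):

# Median of each numeric variable; number_project comes out at 4 and
# average_montly_hours at roughly 200 in this data set
print(data[columns_number].median())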

Distribution of categorical variables - pie charts
# Pie charts display class frequencies, so they pair naturally with value_counts()
fig = plt.figure(dpi=300, num=2)
data_temp = data.copy()  # a copy, so 1/0 can be relabeled Yes/No without touching the original
data_temp['Work_accident'] = ['Yes' if i == 1 else 'No' for i in data['Work_accident']]
data_temp['promotion_last_5years'] = ['Yes' if i == 1 else 'No' for i in data['promotion_last_5years']]
columns_class = ['Work_accident', 'promotion_last_5years', 'jobs', 'salary']
for i, col in enumerate(columns_class):
    count = data_temp[col].value_counts()
    plt.subplot(2, 2, i + 1)
    plt.pie(count, labels=count.index, startangle=90)
    plt.title(col)
del data_temp, i, col, count

[Figure: pie charts of the four categorical variables]
Few employees have had a work accident, and even fewer were promoted within the last five years. The sales department has the most employees, followed by the technical department, while management has the fewest. Nearly half of the employees are on low salaries, and only about 8% earn high salaries.
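
The shares behind the pie charts can be printed as proportions; a small sketch (not in the original post):

# value_counts(normalize=True) returns class proportions instead of counts
for col in ['Work_accident', 'promotion_last_5years', 'jobs', 'salary']:
    print(data[col].value_counts(normalize=True).round(3))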

Relationship between the numeric independent variables and the categorical dependent variable - box plots
fig = plt.figure(figsize=(13, 6), num=3)
for i, col in enumerate(columns_number):
    left_num = data.loc[data['left'] == 1, col]
    unleft_num = data.loc[data['left'] == 0, col]
    plt.subplot(1, 5, i + 1)
    plt.boxplot([unleft_num, left_num], positions=[0.1, 0.5], widths=0.3)
    plt.xticks(ticks=[0.1, 0.5], labels=['stayed', 'left'])
    plt.title(titles[i])
plt.show()
del i, col, left_num, unleft_num

[Figure: box plots of the numeric variables, split by whether the employee left]
The figure shows that employees who left were less satisfied with the company yet had higher performance; presumably the more capable people have more opportunities to change jobs. Their project counts are also more widely spread, and they worked longer hours, so overtime does not appear to help retention; employers should take note, since recruiting has its own costs. Finally, the employees who left had longer tenures.
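
Group medians by the target make this reading concrete; a sketch (not in the original post):

# Median of each numeric variable for stayers (left=0) vs leavers (left=1)
print(data.groupby('left')[columns_number].median())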

Relationship between the binary categorical independent variables and the dependent variable - percent stacked bars
fig = plt.figure(num=4)
for i, col in enumerate(['Work_accident', 'promotion_last_5years']):
    plt.subplot(1, 2, i + 1)
    # count the four combinations of the independent variable and 'left'
    left_y, left_n, unleft_y, unleft_n = 0, 0, 0, 0
    for j in range(len(data)):
        if data.loc[j, 'left'] == 1 and data.loc[j, col] == 1:
            left_y += 1
        elif data.loc[j, 'left'] == 1 and data.loc[j, col] == 0:
            left_n += 1
        elif data.loc[j, 'left'] == 0 and data.loc[j, col] == 1:
            unleft_y += 1
        else:
            unleft_n += 1
    # leave rate within each category (yes / no)
    rate = pd.Series([left_y / (left_y + unleft_y), left_n / (left_n + unleft_n)])
    x = [1, 1.2]
    plt.bar(x, height=rate, width=0.1, label='left')
    plt.bar(x, height=1 - rate, width=0.1, bottom=rate, label='stayed')
    plt.xlabel(col + '?')
    plt.xticks(ticks=[1, 1.2, 1.4, 1.6], labels=['Yes', 'No', '', ''])
    plt.legend(loc='upper right')
del i, j, col, left_y, left_n, unleft_y, unleft_n, rate, x

[Figure: percent stacked bars for Work_accident and promotion_last_5years]
Employees who have not had a work accident leave at a higher rate, and employees who were not promoted in the last five years leave at a far higher rate than those who were.
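
The same per-category leave rates fall out of pandas.crosstab in a couple of lines; a sketch of that alternative:

# normalize='index' turns each row of counts into proportions, so the
# left=1 column is exactly the leave rate within each category
for col in ['Work_accident', 'promotion_last_5years']:
    print(pd.crosstab(data[col], data['left'], normalize='index'))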

Relationship between the multi-category independent variables and the dependent variable - line charts
for col in ['jobs', 'salary']:
    rate = []
    for cla in data[col].value_counts().index:
        # count the employees in this category who left
        zipper = zip(data['left'] == 1, data[col] == cla)
        rate.append(len(data.loc[[all(i) for i in zipper], ]))
    # leave counts divided by category sizes give the turnover rates
    rate = pd.Series(rate).values / data[col].value_counts().values
    fig = plt.figure(figsize=(10, 5))
    plt.plot(range(0, 2 * len(rate), 2), rate, 'o-', label='left rate')
    plt.xticks(range(0, 2 * len(rate), 2), labels=data[col].value_counts().index)
    plt.grid()
    plt.xlabel(col)
    plt.legend(loc='best')
    plt.title('turnover rate by ' + col)
    plt.show()
del rate, col, cla, zipper

[Figures: turnover rate by department and by salary level]
Turnover rates differ considerably across departments: human resources has the highest rate and RandD the lowest. Salary tells a plain story: the higher the salary, the lower the turnover. As Jack Ma put it, employees leave for two reasons: the money is not enough, or the heart feels wronged.
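
Since 'left' is 0/1, a groupby mean gives these per-category turnover rates directly; a sketch (not in the original post):

# the mean of a 0/1 column is the proportion of ones, i.e. the turnover rate
for col in ['jobs', 'salary']:
    print(data.groupby(col)['left'].mean().sort_values(ascending=False))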

Data preprocessing

After this descriptive exploration we have a reasonable understanding of each variable and of its effect on turnover. Now we move on to modeling. The preprocessing before modeling consists of the following steps.

  • The analysis above shows that no outlier or missing-value treatment is needed.
  • Encode the categorical variables numerically; unordered multi-category variables are converted to one-hot encoding.
  • Normalize the data to remove the influence of differing scales.
  • Randomly split the data into a training set and a test set.
# Data transformation (encode the categorical variables)
import numpy as np
from sklearn import preprocessing, model_selection, tree, naive_bayes, svm, metrics
# Categorical variables come in two kinds. Ordered ones can be mapped to integers
# directly; doing that to unordered ones would invent a false order, so those are
# one-hot encoded instead.
# salary is ordered
salary_order = ['low', 'medium', 'high']  # make the ordering explicit
enc = preprocessing.OrdinalEncoder(categories=[salary_order])
data['salary'] = enc.fit_transform(data[['salary']]).ravel()
# jobs is unordered
enc = preprocessing.OneHotEncoder(categories=[list(data['jobs'].value_counts().index)])
jobs = enc.fit_transform(data[['jobs']]).toarray()
data.drop('jobs', inplace=True, axis=1)  # drop the original column in place
data = pd.merge(data, pd.DataFrame(jobs, columns=enc.categories_[0]), left_index=True, right_index=True)
# Min-max normalization
mmscaler = preprocessing.MinMaxScaler()
data.loc[:, ['number_project', 'average_montly_hours', 'time_spend_company']] = mmscaler.fit_transform(
    data.loc[:, ['number_project', 'average_montly_hours', 'time_spend_company']])
# Split the data
X = data.iloc[:, 1:]
y = data['left']
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0)
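
For reference, pandas can do the same encoding more concisely; a sketch under the same assumptions, rereading the raw CSV rather than the already-transformed data (the name df is just illustrative):

# salary is ordinal: map it to 0/1/2 explicitly
df = pd.read_csv('HR_comma_sep.csv').rename(columns={'sales': 'jobs'})
df['salary'] = df['salary'].map({'low': 0, 'medium': 1, 'high': 2})
# jobs is nominal: pd.get_dummies one-hot encodes it in one call
df = pd.get_dummies(df, columns=['jobs'])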

Model building

Here we use several models commonly applied to classification problems: a decision tree, naive Bayes, and a support vector machine.

# Train the models
# Decision tree
dt = tree.DecisionTreeClassifier()
dt = dt.fit(X_train, y_train)  # expects X of shape [n_samples, n_features]
dt_pre = dt.predict(X_test)
# Naive Bayes
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)
gnb_pre = gnb.predict(X_test)
# Support vector machine, with a small grid search over the kernel parameters
parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}]
clf = model_selection.GridSearchCV(svm.SVC(), parameters)
clf.fit(X_train, y_train)
svc = clf.best_estimator_  # the refitted best model, used in the evaluation below
svc_pre = svc.predict(X_test)

Model evaluation

Computation of common classification metrics
# Model evaluation
# 1. Common classification metrics
data_pre = pd.DataFrame({'dt': dt_pre, 'gnb': gnb_pre, 'svc': svc_pre})
Scores = np.empty((3, 4))  # one row per model: accuracy, precision, recall, f1
for i in range(3):
    y_pred = data_pre.iloc[:, i]
    # accuracy
    Scores[i, 0] = metrics.accuracy_score(y_test, y_pred)
    # precision
    Scores[i, 1] = metrics.precision_score(y_test, y_pred)
    # recall
    Scores[i, 2] = metrics.recall_score(y_test, y_pred)
    # f1 score
    Scores[i, 3] = metrics.f1_score(y_test, y_pred)
    # confusion matrix (true labels first, then predictions)
    print(metrics.confusion_matrix(y_test, y_pred))
model                  | accuracy   | precision  | recall     | f1_score
Decision tree          | 0.97644444 | 0.92909761 | 0.97206166 | 0.95009416
Naive Bayes            | 0.74       | 0.4625     | 0.78420039 | 0.58184417
Support vector machine | 0.80155556 | 0.69129288 | 0.25240848 | 0.36979534
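
scikit-learn can also print precision, recall, and F1 for both classes in one call; a sketch for the decision tree:

# classification_report summarizes precision/recall/F1 per class
print(metrics.classification_report(y_test, dt_pre, digits=4))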
ROC curves
# 2. ROC curves of the classifiers
# (metrics.plot_roc_curve is deprecated in newer scikit-learn;
#  metrics.RocCurveDisplay.from_estimator is the current equivalent)
fig = plt.figure(dpi=300)
dtcurve = metrics.plot_roc_curve(dt, X_test, y_test)
gnbcurve = metrics.plot_roc_curve(gnb, X_test, y_test, ax=dtcurve.ax_)
svccurve = metrics.plot_roc_curve(svc, X_test, y_test, ax=dtcurve.ax_)
plt.title('ROC curves of the models')
plt.show()

[Figure: ROC curves of the three models]
For employee turnover analysis, the decision tree performs best.
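
To attach a single number to each curve, the AUC can be computed directly; a sketch assuming the fitted models above (not in the original post):

for name, model in [('dt', dt), ('gnb', gnb), ('svc', svc)]:
    if hasattr(model, 'predict_proba'):
        score = model.predict_proba(X_test)[:, 1]
    else:
        score = model.decision_function(X_test)  # SVC fitted without probability=True
    print(name, metrics.roc_auc_score(y_test, score))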

Analysis of the models' advantages and disadvantages

1 The variable distributions were not tested and may well not be Gaussian, which limits the Gaussian naive Bayes model.
2 One-hot encoding the multi-category variable increases the dimensionality and makes the data sparse, which handicaps the support vector machine; a possible mitigation is sketched below.
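
One way to address the second point, sketched here rather than taken from the original post: standardize all features and refit the SVM (the hyperparameters below are illustrative, not tuned):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# StandardScaler puts every feature on a comparable scale, which RBF SVMs are sensitive to
svc_scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10, gamma=1e-3))
svc_scaled.fit(X_train, y_train)
print(svc_scaled.score(X_test, y_test))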

Source: blog.csdn.net/weixin_43705953/article/details/110863273