Predicting salaries for job-site postings with R

1. Determine the goal of the analysis: which factors influence salary.

Determining the variables:

The dependent variable: Salary

Qualitative independent variables: company category, company size, region, industry sector, education requirement, required software/tools.

Quantitative independent variable: experience (numeric).

Analysis goal: build a multiple linear regression model relating the independent variables to the dependent variable, estimate the model coefficients, and run significance tests on the coefficients to decide which independent variables affect the dependent variable. New values of the independent variables can then be plugged into the fitted model to obtain predictions.
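In R this workflow boils down to lm(), summary() and predict(). A minimal sketch with simulated data and hypothetical variable names (the real variables are constructed in section 2 below):

set.seed(1)
jobs = data.frame(salary     = rnorm(100, 8000, 2000),            # hypothetical example data
                  experience = sample(0:10, 100, replace = TRUE),
                  education  = sample(c("bachelor", "master"), 100, replace = TRUE),
                  region     = sample(c(0, 1), 100, replace = TRUE))
fit = lm(salary ~ experience + education + region, data = jobs)   # multiple linear regression
summary(fit)                          # coefficient estimates and t tests for significance
predict(fit, newdata = jobs[1:3, ])   # plug new predictor values into the fitted model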

2. Data preprocessing

(Organize the data into a format that can be fed directly into the modeling step.) First, look at the data structure.

1) Reading the data with the xlsx package is not recommended: it is slow when the file is large.

library(xlsx)
jobInfo2 = read.xlsx('jobinfo.xlsx', 1, encoding = 'UTF-8')
str(jobInfo2)    # view the data structure
head(jobInfo2)

2) The readxl package reads the file much faster.

library(readxl)
jobInfo = read_excel('jobinfo.xlsx')
str(jobInfo)     # view the data structure
head(jobInfo)    # view the first few rows

options(scipen = 200)   # turn off scientific notation
jobInfo = read_excel('jobinfo.xlsx')
str(jobInfo)     # view the data structure

1) Convert the minimum and maximum salary to numeric and compute the average salary, which is the dependent variable.

jobInfo$最低工资 = as.numeric(jobInfo$最低工资)
jobInfo$最高工资 = as.numeric(jobInfo$最高工资)
jobInfo$平均工资 = (jobInfo$最低工资 + jobInfo$最高工资) / 2

2) Process the region variable: split it into Beijing/Shanghai/Shenzhen versus everywhere else.

loc = which(jobInfo$地区 %in% c("北京", "上海", "深圳"))
loc_other = which(!jobInfo$地区 %in% c("北京", "上海", "深圳"))
jobInfo$地区[loc] = 1
jobInfo$地区[loc_other] = 0
jobInfo$地区 = as.numeric(jobInfo$地区)
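Assuming the same 地区 column and city spellings as above, the same 0/1 coding can be written in one line, since TRUE/FALSE coerces to 1/0:

jobInfo$地区 = as.numeric(jobInfo$地区 %in% c("北京", "上海", "深圳"))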

3) Convert company size and education into factor variables, which makes them easier to plot.

jobInfo$公司规模 = factor(jobInfo$公司规模,
                      levels = c("少于50人", "50-150人", "150-500人", "500-1000人",
                                 "1000-5000人", "5000-10000人", "10000人以上"))
levels(jobInfo$公司规模)[c(2, 3)] = c("50-500人", "50-500人")   # merge the 50-150 and 150-500 bands into one level
jobInfo$学历 = factor(jobInfo$学历,
                    levels = c("中专", "高中", "大专", "无", "本科", "硕士", "博士"))
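A quick check that the recoding behaved as intended (the two middle size bands should now be merged, and no unexpected NA levels should appear):

table(jobInfo$公司规模, useNA = "ifany")   # the 50-150 and 150-500 groups should now both count as 50-500人
table(jobInfo$学历, useNA = "ifany")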

4) Match the tools each posting requires the applicant to master.

The analysis tools to match are: "R", "SPSS", "Excel", "Python", "MATLAB", "Java", "SQL", "SAS", "Stata", "EViews", "Spark", "Hadoop".

software = as.data.frame(matrix(0, nrow = length(jobInfo$描述), ncol = 12))   # one row per posting, one 0/1 column per tool
colnames(software) = c("R", "SPSS", "Excel", "Python", "MATLAB", "Java", "SQL", "SAS", "Stata", "EViews", "Spark", "Hadoop")

library(jiebaR)    # Chinese word segmentation
mixseg = worker()  # default segmentation engine

for (i in 1:length(jobInfo$描述)) {
  subData = as.character(jobInfo$描述[i])
  fenci = mixseg[subData]
  
  R.identify = ("R" %in% fenci) | ("r" %in% fenci)
  SPSS.identify = ("spss" %in% fenci) | ("Spss" %in% fenci) | ("SPSS" %in% fenci)
  Excel.identify = ("excel" %in% fenci) | ("EXCEL" %in% fenci) | ("Excel" %in% fenci)
  Python.identify = ("Python" %in% fenci) | ("python" %in% fenci) | ("PYTHON" %in% fenci)
  MATLAB.identify = ("matlab" %in% fenci) | ("Matlab" %in% fenci) | ("MATLAB" %in% fenci)
  Java.identify = ("java" %in% fenci) | ("JAVA" %in% fenci) | ("Java" %in% fenci)
  SQL.identify = ("SQL" %in% fenci) | ("Sql" %in% fenci) | ("sql" %in% fenci)
  SAS.identify = ("SAS" %in% fenci) | ("Sas" %in% fenci) | ("sas" %in% fenci)
  Stata.identify = ("STATA" %in% fenci) | ("Stata" %in% fenci) | ("stata" %in% fenci)
  EViews.identify = ("EViews" %in% fenci) | ("EVIEWS" %in% fenci) | ("Eviews" %in% fenci) | ("eviews" %in% fenci) 
  Spark.identify = ("Spark" %in% fenci) | ("SPARK" %in% fenci) | ("spark" %in% fenci)
  Hadoop.identify = ("HADOOP" %in% fenci) | ("Hadoop" %in% fenci) | ("hadoop" %in% fenci)
  
  if (R.identify) software$R[i] = 1
  if (SPSS.identify)   software$SPSS[i] = 1
  if (Excel.identify)  software$Excel[i] = 1
  if (Python.identify) software$Python[i] = 1
  if (MATLAB.identify) software$MATLAB[i] = 1
  if (Java.identify)   software$Java[i] = 1
  if (SQL.identify)    software$SQL[i] = 1
  if (SAS.identify)    software$SAS[i] = 1
  if (Stata.identify)  software$Stata[i] = 1
  if (EViews.identify) software$EViews[i] = 1
  if (Spark.identify)  software$Spark[i] = 1
  if (Hadoop.identify) software$Hadoop[i] = 1
} 
jobInfo.new = cbind(jobInfo$平均工资, software)
colnames(jobInfo.new) = c("平均工资", colnames(software))
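An alternative to the segmentation loop, assuming jobInfo$描述 holds the raw posting text, is case-insensitive matching with grepl(). The word-boundary pattern keeps a one-letter keyword such as "R" from matching inside longer English words, although jiebaR segmentation as used above is more robust for mixed Chinese/English text:

tools = c("R", "SPSS", "Excel", "Python", "MATLAB", "Java", "SQL", "SAS",
          "Stata", "EViews", "Spark", "Hadoop")
software2 = as.data.frame(sapply(tools, function(tool) {
  as.integer(grepl(paste0("\\b", tool, "\\b"), jobInfo$描述,
                   ignore.case = TRUE, perl = TRUE))        # 1 if the tool is mentioned, else 0
}))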

# attach the remaining columns
jobInfo.new$地区 = jobInfo$地区
jobInfo.new$公司类别 = jobInfo$公司类别
jobInfo.new$公司规模 = jobInfo$公司规模
jobInfo.new$学历 = jobInfo$学历
jobInfo.new$经验 = jobInfo$经验
jobInfo.new$行业 = jobInfo$行业

table(jobInfo.new$公司类别)   # company categories with very few observations are removed
jobInfo.new = jobInfo.new[-which(jobInfo.new$公司类别 %in% c("事业单位", "非营利机构")), ]

colnames(jobInfo.new) = c('aveSalary', colnames(jobInfo.new)[2:13], "area", "compVar",
                          "compScale", "academic", "exp", "induCate")


# save the data set
write.csv(jobInfo.new, file = 'data analysis job recruitment.csv', row.names = F)
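The sections below work with a data frame called dat0, which is presumably this saved file read back in; a sketch assuming the same file name:

dat0 = read.csv('data analysis job recruitment.csv', stringsAsFactors = TRUE,
                encoding = 'UTF-8')   # categorical columns come back as factors
str(dat0)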

3. Data visualization

1) Descriptive analysis of the dependent variable: histogram of the average-salary distribution.

 

hist(dat0$aveSalary, xlab = "average salary (RMB/month)", ylab = "frequency", main = "",
     col = 'dodgerblue', xlim = c(1500, 11000), breaks = seq(0, 500000, by = 1500))

summary(dat0$aveSalary)


2) Boxplot of average salary by years of experience.

dat0$exp_level = cut(dat0$exp, breaks = c(-0.01, 3.99, 6, max(dat0$exp)))
dat0$exp_level = factor(dat0$exp_level, levels = levels(dat0$exp_level),
                        labels = c("experience: 0-3 years", "experience: 4-6 years", "experience: > 6 years"))

boxplot(aveSalary ~ exp_level, data = dat0, col = 'dodgerblue',
        ylab = "average salary (RMB/month)", ylim = c(0, 45000))
summary(lm(aveSalary ~ exp_level, data = dat0))

table(dat0$exp_level)
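The group summaries behind the boxplot can also be read off directly with tapply(), using the same exp_level grouping:

tapply(dat0$aveSalary, dat0$exp_level, median)   # median average salary in each experience band
tapply(dat0$aveSalary, dat0$exp_level, mean)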


3) Boxplot of average salary by education.

dat0$academic = factor(dat0$academic, levels = c("无", "中专", "高中", "大专", "本科", "硕士", "博士"))
dat0$compVar = factor(dat0$compVar, levels = c("民营公司", "创业公司", "国企", "合资", "上市公司", "外资"))
boxplot(aveSalary ~ academic, data = dat0, col = 'dodgerblue',
        ylab = 'average salary (RMB/month)', ylim = c(0, 45000))
summary(lm(aveSalary ~ academic, data = dat0))
table(dat0$academic)


4. Regression model

lm1 = lm(aveSalary~.,data = dat0)
summary(lm1)
lm2 = lm(aveSalary~.-induCate-exp_level,data = dat0)
summary(lm2)

par(mfrow = c(2,2))  # regression diagnostics
plot(lm2,which = c(1:4))


The QQ plot shows that the residuals are not normal (the points deviate from the 45° line), so we continue the analysis with the logarithm of the dependent variable.
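Comparing the raw and log-transformed salary distributions shows why the log transform is worth trying: typically right-skewed salary data becomes much more symmetric on the log scale.

par(mfrow = c(1, 2))
hist(dat0$aveSalary, main = "average salary", xlab = "RMB/month", col = 'dodgerblue')
hist(log(dat0$aveSalary), main = "log(average salary)", xlab = "log scale", col = 'dodgerblue')
par(mfrow = c(1, 1))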

 

# install.packages('rms')
library(rms)
vif(lm2)   # compute VIF; a value > 5 indicates strong collinearity (the R^2 from regressing that variable on the other predictors exceeds 80%)

# To remove the collinearity, the levels with large VIFs could be merged into a single baseline group:
# dat0$compVar = as.character(dat0$compVar)
# dat0[which(dat0$compVar %in% c("合资", "外资", "民营公司", "创业公司")), "compVar"] = "其他"   # merge joint ventures, foreign, private and start-up companies into "other"
# dat0$compVar = factor(dat0$compVar, levels = c("其他", "国企", "上市公司"))
#
# lm3 = lm(aveSalary ~ . - induCate - exp_level, data = dat0)
# summary(lm3)
# vif(lm3)
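As a quick check of the VIF > 5 rule of thumb used above: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors, so R_j^2 = 0.8 corresponds exactly to VIF = 5.

r2 = 0.8
1 / (1 - r2)   # VIF = 1 / (1 - R^2) = 5 when R^2 = 0.8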

# Address the non-normality with a log-linear model
lm4 = lm(log(aveSalary) ~ . - induCate - exp_level, data = dat0)
summary(lm4)

par(mfrow = c(2, 2))
plot(lm4, which = c(1:4))

# # handle outliers using Cook's distance
# cook = cooks.distance(lm4)
# cook = sort(cook, decreasing = T)
# cook_point = names(cook)[1]
# cook_delete = which(rownames(dat0) %in% cook_point)
# dat0 = dat0[-cook_delete,]
# 
# # check
# lmTest = lm(log(aveSalary)~.-induCate-exp_level,data = dat0)
# par(mfrow = c(2,2))
# plot(lmTest,which = c(1,4))

 

The diagnostic plots improve clearly after this change:


Determining the final model:

  •  Suppose we also want to examine an interaction between company size and area, so the candidate term compScale * area is added to the model. step() then derives the model term by term, always preferring the fit with the smaller AIC (the same comparison applies when the AIC values are negative).
  •  Stepwise regression is driven by the AIC information criterion: at each step a variable is added or dropped so that the resulting model has the smallest AIC. (A quick numerical check of the AIC formula used by step() follows right after this list.)
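For an lm fit, step() relies on extractAIC(), which computes AIC as n*log(RSS/n) + 2*edf. The starting value reported below can be reproduced approximately from the first step table (RSS about 1225.7, 36 estimated coefficients) together with the sample size implied by the final summary (7038 residual degrees of freedom plus 22 coefficients):

n   = 7038 + 22      # implied number of postings used in the regression
rss = 1225.7         # residual sum of squares of the starting model (from the step output below)
edf = 36             # intercept + 12 tools + area + 5 compVar + 5 compScale + 6 academic + exp + 5 interaction terms
n * log(rss / n) + 2 * edf   # about -12289.5, matching "Start:  AIC=-12289.46"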


dat1 = dat0[1:18]   # keep aveSalary plus the 17 predictors; drop induCate and exp_level
lm0 = lm(log(aveSalary) ~ . + compScale*area, data = dat1)
summary(step(lm0))  # stepwise selection, starting from the model that includes the compScale*area interaction

The step-by-step output below shows that, judged by AIC, the model without the compScale*area interaction is better: dropping the interaction in the very first step lowers the AIC.

Start:  AIC=-12289.46
log(aveSalary) ~ R + SPSS + Excel + Python + MATLAB + Java + 
    SQL + SAS + Stata + EViews + Spark + Hadoop + area + compVar + 
    compScale + academic + exp + compScale * area

                 Df Sum of Sq    RSS    AIC
- area:compScale  5     1.338 1227.1 -12292
- EViews          1     0.000 1225.7 -12292
- Spark           1     0.037 1225.8 -12291
- Stata           1     0.165 1225.9 -12290
- SPSS            1     0.237 1226.0 -12290
- MATLAB          1     0.272 1226.0 -12290
<none>                        1225.7 -12290
- Java            1     0.662 1226.4 -12288
- SAS             1     0.762 1226.5 -12287
- R               1     0.872 1226.6 -12286
- Python          1     1.555 1227.3 -12282
- compVar         5     3.479 1229.2 -12280
- Hadoop          1     6.249 1232.0 -12256
- SQL             1     9.494 1235.2 -12237
- Excel           1    22.307 1248.0 -12164
- academic        6   114.286 1340.0 -11672
- exp             1   214.853 1440.6 -11151

Step:  AIC=-12291.76
log(aveSalary) ~ R + SPSS + Excel + Python + MATLAB + Java + 
    SQL + SAS + Stata + EViews + Spark + Hadoop + area + compVar + 
    compScale + academic + exp

            Df Sum of Sq    RSS    AIC
- EViews     1     0.000 1227.1 -12294
- compScale  5     1.416 1228.5 -12294
- Spark      1     0.038 1227.1 -12294
- Stata      1     0.166 1227.2 -12293
- SPSS       1     0.245 1227.3 -12292
- MATLAB     1     0.256 1227.3 -12292
<none>                   1227.1 -12292
- Java       1     0.652 1227.7 -12290
- SAS        1     0.739 1227.8 -12290
- R          1     0.856 1227.9 -12289
- Python     1     1.569 1228.6 -12285
- compVar    5     3.531 1230.6 -12282
- Hadoop     1     6.216 1233.3 -12258
- SQL        1     9.359 1236.4 -12240
- Excel      1    22.587 1249.7 -12165
- academic   6   113.393 1340.5 -11680
- area       1   149.888 1377.0 -11480
- exp        1   215.650 1442.7 -11151

Step:  AIC=-12293.76
log(aveSalary) ~ R + SPSS + Excel + Python + MATLAB + Java + 
    SQL + SAS + Stata + Spark + Hadoop + area + compVar + compScale + 
    academic + exp

            Df Sum of Sq    RSS    AIC
- compScale  5     1.417 1228.5 -12296
- Spark      1     0.037 1227.1 -12296
- Stata      1     0.167 1227.2 -12295
- SPSS       1     0.246 1227.3 -12294
- MATLAB     1     0.257 1227.3 -12294
<none>                   1227.1 -12294
- Java       1     0.653 1227.7 -12292
- SAS        1     0.739 1227.8 -12292
- R          1     0.861 1227.9 -12291
- Python     1     1.569 1228.6 -12287
- compVar    5     3.531 1230.6 -12284
- Hadoop     1     6.216 1233.3 -12260
- SQL        1     9.360 1236.4 -12242
- Excel      1    22.597 1249.7 -12167
- academic   6   113.394 1340.5 -11682
- area       1   149.898 1377.0 -11482
- exp        1   215.652 1442.7 -11153

Step:  AIC=-12295.61
log(aveSalary) ~ R + SPSS + Excel + Python + MATLAB + Java + 
    SQL + SAS + Stata + Spark + Hadoop + area + compVar + academic + 
    exp

           Df Sum of Sq    RSS    AIC
- Spark     1     0.036 1228.5 -12297
- Stata     1     0.170 1228.7 -12297
- SPSS      1     0.261 1228.7 -12296
- MATLAB    1     0.298 1228.8 -12296
<none>                  1228.5 -12296
- Java      1     0.633 1229.1 -12294
- SAS       1     0.752 1229.2 -12293
- R         1     0.878 1229.3 -12293
- Python    1     1.547 1230.0 -12289
- compVar   5     3.779 1232.3 -12284
- Hadoop    1     6.288 1234.8 -12262
- SQL       1     9.517 1238.0 -12243
- Excel     1    22.306 1250.8 -12171
- academic  6   113.717 1342.2 -11683
- area      1   152.798 1381.3 -11470
- exp       1   217.467 1445.9 -11147

Step:  AIC=-12297.41
log(aveSalary) ~ R + SPSS + Excel + Python + MATLAB + Java + 
    SQL + SAS + Stata + Hadoop + area + compVar + academic + 
    exp

           Df Sum of Sq    RSS    AIC
- Stata     1     0.166 1228.7 -12298
- SPSS      1     0.256 1228.8 -12298
- MATLAB    1     0.297 1228.8 -12298
<none>                  1228.5 -12297
- Java      1     0.606 1229.1 -12296
- SAS       1     0.761 1229.3 -12295
- R         1     0.888 1229.4 -12294
- Python    1     1.520 1230.0 -12291
- compVar   5     3.779 1232.3 -12286
- Hadoop    1     8.237 1236.8 -12252
- SQL       1     9.549 1238.1 -12245
- Excel     1    22.302 1250.8 -12172
- academic  6   113.684 1342.2 -11685
- area      1   153.022 1381.5 -11471
- exp       1   217.431 1445.9 -11149

Step:  AIC=-12298.46
log(aveSalary) ~ R + SPSS + Excel + Python + MATLAB + Java + 
    SQL + SAS + Hadoop + area + compVar + academic + exp

           Df Sum of Sq    RSS    AIC
- SPSS      1     0.258 1228.9 -12299
<none>                  1228.7 -12298
- MATLAB    1     0.405 1229.1 -12298
- Java      1     0.615 1229.3 -12297
- SAS       1     0.715 1229.4 -12296
- R         1     0.859 1229.5 -12296
- Python    1     1.504 1230.2 -12292
- compVar   5     3.781 1232.5 -12287
- Hadoop    1     8.212 1236.9 -12253
- SQL       1     9.817 1238.5 -12244
- Excel     1    22.319 1251.0 -12173
- academic  6   113.730 1342.4 -11686
- area      1   152.949 1381.6 -11472
- exp       1   217.584 1446.3 -11149

Step:  AIC=-12298.97
log(aveSalary) ~ R + Excel + Python + MATLAB + Java + SQL + SAS + 
    Hadoop + area + compVar + academic + exp

           Df Sum of Sq    RSS    AIC
<none>                  1228.9 -12299
- MATLAB    1     0.385 1229.3 -12299
- Java      1     0.587 1229.5 -12298
- R         1     1.003 1229.9 -12295
- Python    1     1.495 1230.4 -12292
- SAS       1     1.854 1230.8 -12290
- compVar   5     3.763 1232.7 -12287
- Hadoop    1     8.280 1237.2 -12254
- SQL       1    10.189 1239.1 -12243
- Excel     1    22.096 1251.0 -12175
- academic  6   114.599 1343.5 -11682
- area      1   153.067 1382.0 -11472
- exp       1   217.601 1446.5 -11150

Call:
lm(formula = log(aveSalary) ~ R + Excel + Python + MATLAB + Java + 
    SQL + SAS + Hadoop + area + compVar + academic + exp, data = dat1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.72378 -0.28570 -0.04861  0.25334  2.14908 

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)      8.456762   0.017967 470.678 < 0.0000000000000002 ***
R                0.065522   0.027340   2.397             0.016576 *  
Excel           -0.143574   0.012763 -11.249 < 0.0000000000000002 ***
Python           0.085599   0.029258   2.926             0.003448 ** 
MATLAB          -0.056898   0.038301  -1.486             0.137439    
Java             0.058288   0.031793   1.833             0.066791 .  
SQL              0.145021   0.018985   7.639   0.0000000000000248 ***
SAS              0.078317   0.024033   3.259             0.001125 ** 
Hadoop           0.229420   0.033317   6.886   0.0000000000062357 ***
area                      0.394826   0.013335  29.607 < 0.0000000000000002 ***
compVar start-up          0.082482   0.044609   1.849             0.064501 .  
compVar state-owned      -0.026972   0.027142  -0.994             0.320391    
compVar joint venture     0.056369   0.016646   3.386             0.000712 ***
compVar listed company    0.058643   0.022498   2.607             0.009165 ** 
compVar foreign           0.005431   0.016148   0.336             0.736654    
academic secondary school -0.227767  0.036399  -6.257   0.0000000004143217 ***
academic high school     -0.248540   0.042443  -5.856   0.0000000049575319 ***
academic junior college  -0.149227   0.016084  -9.278 < 0.0000000000000002 ***
academic bachelor         0.108561   0.016581   6.547   0.0000000000627649 ***
academic master           0.269012   0.036317   7.407   0.0000000000001438 ***
academic doctorate        0.807996   0.127023   6.361   0.0000000002129521 ***
exp              0.099921   0.002831  35.301 < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4179 on 7038 degrees of freedom
Multiple R-squared:  0.3778,	Adjusted R-squared:  0.3759 
F-statistic: 203.5 on 21 and 7038 DF,  p-value: < 0.00000000000000022
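Because the response is log(aveSalary), a coefficient b translates into roughly a 100*(exp(b) - 1)% change in average salary, holding the other variables fixed. A few examples using estimates from the table above:

exp(0.099921) - 1    # each additional year of experience: about +10.5% average salary
exp(0.229420) - 1    # postings requiring Hadoop: about +25.8%
exp(-0.143574) - 1   # postings mentioning Excel: about -13.4%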

5. Model prediction

Predict the salary for three applicants with the following profiles.

1) Knows R and Python, bachelor's degree, no work experience; the company is a listed company in Shanghai with about 87 employees (the 50-500 size band).
2) Knows R, Java, SAS and Python, PhD, 7 years of work experience; the company is a start-up in Beijing in the 150-500 employee range (coded as the merged 50-500 level).
3) No formal degree, no work experience, none of the listed tools; a small state-owned enterprise with fewer than 50 employees.

new_data1 = matrix(c(1,0,0,1,0,0,0,0,0,0,0,0,1,"上市公司","50-500人","本科",0),1,17)
new_data2 = matrix(c(1,0,0,1,0,1,0,1,0,0,0,0,1,"创业公司","50-500人","博士",7),1,17)
new_data3 = matrix(c(0,0,0,0,0,0,0,0,0,0,0,0,0,"国企","少于50人","无",0),1,17)

new_data1 = as.data.frame(new_data1)
new_data2 = as.data.frame(new_data2)
new_data3 = as.data.frame(new_data3)

colnames(new_data1) = names(dat0)[2:18]
colnames(new_data2) = names(dat0)[2:18]
colnames(new_data3) = names(dat0)[2:18]

for (j in 1:13) {
  new_data1[,j] = as.numeric(as.character(new_data1[,j]))
  new_data2[,j] = as.numeric(as.character(new_data2[,j]))
  new_data3[,j] = as.numeric(as.character(new_data3[,j]))
}
new_data1$exp = as.numeric(as.character(new_data1$exp))
new_data2$exp = as.numeric(as.character(new_data2$exp))
new_data3$exp = as.numeric(as.character(new_data3$exp))

# predict, back-transforming from the log scale with exp()
exp(predict(lm0,new_data1))
exp(predict(lm0,new_data2))
exp(predict(lm0,new_data3))
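An equivalent, more direct way to build the new observations, assuming the same column names as dat0[2:18], is a data.frame call; numeric columns then stay numeric, so no as.numeric(as.character()) conversion is needed. For the first applicant:

new_data1 = data.frame(R = 1, SPSS = 0, Excel = 0, Python = 1, MATLAB = 0, Java = 0,
                       SQL = 0, SAS = 0, Stata = 0, EViews = 0, Spark = 0, Hadoop = 0,
                       area = 1, compVar = "上市公司", compScale = "50-500人",
                       academic = "本科", exp = 0)
exp(predict(lm0, new_data1))   # same prediction as above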

This yields the predicted average salary (RMB/month) for each of the three applicants.

