R语言Logist回归

Logist回归

Logistic回归又称Logistic回归分析，是一种广义的线性回归分析模型，常用于数据挖掘，疾病自动诊断，经济预测等领域。当通过一系列连续型和/或类别型预测变量来预测二值结果型变量时，Logistic回归是一个非常有用的工具。

研究数据

该数据集为AER包中的Affairs数据集，通过翻译可以发现该数据为“婚外情数据”，取自于1969年《今日心理》所做的一个非常有代表性的调查，变量名称解释如下：

affairs：一年以来婚外私通的频率
gender：性别
age：年龄
yearsmarried：婚龄
children:是否有小孩
Religious：宗教信仰程度（5分制，1表示完全反对，5表示非常信仰）
education：学历（逆向编码的戈登7种分类）
occupation：职业
rating：对婚姻的自我评分（5分制，1表示非常不幸福，5表示非常幸福）
统计信息如下

summary(Affairs)
    affairs          gender         age         yearsmarried    children  religiousness  
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000  
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000  
 Median : 0.000                Median :32.00   Median : 7.000             Median :3.000  
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116  
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000  
 Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000  
   education       occupation        rating      ynaffairs
 Min.   : 9.00   Min.   :1.000   Min.   :1.000   No :451  
 1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000   Yes:150  
 Median :16.00   Median :5.000   Median :4.000            
 Mean   :16.17   Mean   :4.195   Mean   :3.932            
 3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000            
 Max.   :20.00   Max.   :7.000   Max.   :5.000

Logist回归相关代码

1. 加载数据

 ###Logistic回归
#install.packages("AER")
> library(AER)
#import data and view data
> data(Affairs,package = "AER")
> head(Affairs)
   affairs gender age yearsmarried children religiousness education occupation rating
4        0   male  37        10.00       no             3        18          7      4
5        0 female  27         4.00       no             4        14          6      4
11       0 female  32        15.00      yes             1        12          1      4
16       0   male  57        15.00      yes             5        18          6      5
23       0   male  22         0.75       no             2        17          6      3
29       0 female  32         1.50       no             2        17          5      5
> summary(Affairs)
    affairs          gender         age         yearsmarried    children  religiousness  
 Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000  
 1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000  
 Median : 0.000                Median :32.00   Median : 7.000             Median :3.000  
 Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116  
 3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000  
 Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000  
   education       occupation        rating     
 Min.   : 9.00   Min.   :1.000   Min.   :1.000  
 1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
 Median :16.00   Median :5.000   Median :4.000  
 Mean   :16.17   Mean   :4.195   Mean   :3.932  
 3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
 Max.   :20.00   Max.   :7.000   Max.   :5.000  
> table(Affairs$affairs)

  0   1   2   3   7  12 
451  34  17  19  42  38

2. 将affairs转化为二值型因子ynaffair

 #translate sffairs(num) to ynaffair(binary type)
> Affairs$ynaffairs[Affairs$affairs >0 ] <- 1
> Affairs$ynaffairs[Affairs$affairs ==0 ] <- 0
> Affairs$ynaffairs <- factor(Affairs$ynaffairs,
 1.                             levels = c(0,1),
 2.                             labels = c("No","Yes"))
> table(Affairs$ynaffairs)

 No Yes 
451 150

3. 建模分析

#create a logist model
> names(Affairs)
 [1] "affairs"       "gender"        "age"           "yearsmarried"  "children"     
 [6] "religiousness" "education"     "occupation"    "rating"        "ynaffairs"    
> fit.full <- glm(ynaffairs ~ gender + age + yearsmarried + children + religiousness +
+                   education + occupation + rating, data = Affairs,family = binomial()) #binomial 二项分布
> summary(fit.full)

Call:
glm(formula = ynaffairs ~ gender + age + yearsmarried + children + 
    religiousness + education + occupation + rating, family = binomial(), 
    data = Affairs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5713  -0.7499  -0.5690  -0.2539   2.5191  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.37726    0.88776   1.551 0.120807    
gendermale     0.28029    0.23909   1.172 0.241083    
age           -0.04426    0.01825  -2.425 0.015301 *  
yearsmarried   0.09477    0.03221   2.942 0.003262 ** 
childrenyes    0.39767    0.29151   1.364 0.172508    
religiousness -0.32472    0.08975  -3.618 0.000297 ***
education      0.02105    0.05051   0.417 0.676851    
occupation     0.03092    0.07178   0.431 0.666630    
rating        -0.46845    0.09091  -5.153 2.56e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 609.51  on 592  degrees of freedom
AIC: 627.51

Number of Fisher Scoring iterations: 4

4. 删除不显著变量二次建模

# remove insignificant variables and create a new model
> fit.reduced <- glm(ynaffairs ~ age + yearsmarried + religiousness + rating,
+                    data = Affairs, family = binomial()) 
> summary(fit.reduced)

Call:
glm(formula = ynaffairs ~ age + yearsmarried + religiousness + 
    rating, family = binomial(), data = Affairs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6278  -0.7550  -0.5701  -0.2624   2.3998  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.93083    0.61032   3.164 0.001558 ** 
age           -0.03527    0.01736  -2.032 0.042127 *  
yearsmarried   0.10062    0.02921   3.445 0.000571 ***
religiousness -0.32902    0.08945  -3.678 0.000235 ***
rating        -0.46136    0.08884  -5.193 2.06e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 615.36  on 596  degrees of freedom
AIC: 625.36

Number of Fisher Scoring iterations: 4

5. 模型比较
使用anova()函数对它们进行比较，对于广义线性回归回归，可用卡方检验(Chisq)。

 # test model 
> anova(fit.reduced,fit.full,test = "Chisq") #Chisq 卡方检验
Analysis of Deviance Table

Model 1: ynaffairs ~ age + yearsmarried + religiousness + rating
Model 2: ynaffairs ~ gender + age + yearsmarried + children + religiousness + 
    education + occupation + rating
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       596     615.36                     
2       592     609.51  4   5.8474   0.2108

结果的卡方值不显著（p=0.2108），表明四个预测变量的新模型与九个完整预测变量的模型拟合程度一样好。

6. 模型参数解释

## model parameter interpretation
 # view regression coefficient
> coef(fit.reduced)
  (Intercept)           age  yearsmarried religiousness        rating 
   1.93083017   -0.03527112    0.10062274   -0.32902386   -0.46136144 
> # exponentiate for coefficient
> exp(coef(fit.reduced))
  (Intercept)           age  yearsmarried religiousness        rating 
    6.8952321     0.9653437     1.1058594     0.7196258     0.6304248

7. 评价预测变量对结果概率的影响

##create a virtual dataframe
> # variables of  age, yearsmarried, and religiousness are their mean value, the range of rating is from 1 to 5  
> testdata <- data.frame(rating =c(1:5),
+                        age = mean(Affairs$age),
+                        yearsmarried = mean(Affairs$yearsmarried),
+                        religiousness = mean(Affairs$religiousness))
> testdata
  rating      age yearsmarried religiousness
1      1 32.48752     8.177696      3.116473
2      2 32.48752     8.177696      3.116473
3      3 32.48752     8.177696      3.116473
4      4 32.48752     8.177696      3.116473
5      5 32.48752     8.177696      3.116473
> #then, using testdata to predict
> testdata$prob <- predict(fit.reduced,newdata = testdata,type="response")
> testdata
  rating      age yearsmarried religiousness      prob
1      1 32.48752     8.177696      3.116473 0.5302296
2      2 32.48752     8.177696      3.116473 0.4157377
3      3 32.48752     8.177696      3.116473 0.3096712
4      4 32.48752     8.177696      3.116473 0.2204547
5      5 32.48752     8.177696      3.116473 0.1513079
> # the effect of age
> testdata_age <- data.frame(rating = mean(Affairs$rating),
+                            age = seq(17,57,10),
+                            yearsmarried = mean(Affairs$yearsmarried),
+                            religiousness = mean(Affairs$religiousness))
> testdata_age$pro <- predict(fit.reduced,newdata = testdata_age,type = "response")
> testdata_age
   rating age yearsmarried religiousness       pro
1 3.93178  17     8.177696      3.116473 0.3350834
2 3.93178  27     8.177696      3.116473 0.2615373
3 3.93178  37     8.177696      3.116473 0.1992953
4 3.93178  47     8.177696      3.116473 0.1488796
5 3.93178  57     8.177696      3.116473 0.1094738

从结果中可以看到，当婚姻评分从1（很不幸福）变为5（非常幸福）时，婚外情概率从0.53降低到了0.51（假定年龄、婚龄和宗教信仰不变）；当其他变量不变时，年龄从17岁增加到57岁时，婚外情概率将从0.34降低到0.11.利用该方法可以探究每一个预测变量对结果概率的影响。

Logist回归

研究数据

Logist回归相关代码

猜你喜欢