Logist回归
Logistic回归又称Logistic回归分析,是一种广义的线性回归分析模型,常用于数据挖掘,疾病自动诊断,经济预测等领域。当通过一系列连续型和/或类别型预测变量来预测二值结果型变量时,Logistic回归是一个非常有用的工具。
研究数据
该数据集为AER包中的Affairs数据集,通过翻译可以发现该数据为“婚外情数据”,取自于1969年《今日心理》所做的一个非常有代表性的调查,变量名称解释如下:
- affairs:一年以来婚外私通的频率
- gender: 性别
- age:年龄
- yearsmarried:婚龄
- children:是否有小孩
- Religious: 宗教信仰程度(5分制,1表示完全反对,5表示非常信仰)
- education: 学历(逆向编码的戈登7种分类)
- occupation:职业
- rating:对婚姻的自我评分(5分制,1表示非常不幸福,5表示非常幸福)
统计信息如下
summary(Affairs)
affairs gender age yearsmarried children religiousness
Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171 Min. :1.000
1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430 1st Qu.:2.000
Median : 0.000 Median :32.00 Median : 7.000 Median :3.000
Mean : 1.456 Mean :32.49 Mean : 8.178 Mean :3.116
3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000 3rd Qu.:4.000
Max. :12.000 Max. :57.00 Max. :15.000 Max. :5.000
education occupation rating ynaffairs
Min. : 9.00 Min. :1.000 Min. :1.000 No :451
1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000 Yes:150
Median :16.00 Median :5.000 Median :4.000
Mean :16.17 Mean :4.195 Mean :3.932
3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
Max. :20.00 Max. :7.000 Max. :5.000
Logist回归相关代码
1. 加载数据
###Logistic回归
#install.packages("AER")
> library(AER)
#import data and view data
> data(Affairs,package = "AER")
> head(Affairs)
affairs gender age yearsmarried children religiousness education occupation rating
4 0 male 37 10.00 no 3 18 7 4
5 0 female 27 4.00 no 4 14 6 4
11 0 female 32 15.00 yes 1 12 1 4
16 0 male 57 15.00 yes 5 18 6 5
23 0 male 22 0.75 no 2 17 6 3
29 0 female 32 1.50 no 2 17 5 5
> summary(Affairs)
affairs gender age yearsmarried children religiousness
Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171 Min. :1.000
1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430 1st Qu.:2.000
Median : 0.000 Median :32.00 Median : 7.000 Median :3.000
Mean : 1.456 Mean :32.49 Mean : 8.178 Mean :3.116
3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000 3rd Qu.:4.000
Max. :12.000 Max. :57.00 Max. :15.000 Max. :5.000
education occupation rating
Min. : 9.00 Min. :1.000 Min. :1.000
1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
Median :16.00 Median :5.000 Median :4.000
Mean :16.17 Mean :4.195 Mean :3.932
3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
Max. :20.00 Max. :7.000 Max. :5.000
> table(Affairs$affairs)
0 1 2 3 7 12
451 34 17 19 42 38
2. 将affairs转化为二值型因子ynaffair
#translate sffairs(num) to ynaffair(binary type)
> Affairs$ynaffairs[Affairs$affairs >0 ] <- 1
> Affairs$ynaffairs[Affairs$affairs ==0 ] <- 0
> Affairs$ynaffairs <- factor(Affairs$ynaffairs,
1. levels = c(0,1),
2. labels = c("No","Yes"))
> table(Affairs$ynaffairs)
No Yes
451 150
3. 建模分析
#create a logist model
> names(Affairs)
[1] "affairs" "gender" "age" "yearsmarried" "children"
[6] "religiousness" "education" "occupation" "rating" "ynaffairs"
> fit.full <- glm(ynaffairs ~ gender + age + yearsmarried + children + religiousness +
+ education + occupation + rating, data = Affairs,family = binomial()) #binomial 二项分布
> summary(fit.full)
Call:
glm(formula = ynaffairs ~ gender + age + yearsmarried + children +
religiousness + education + occupation + rating, family = binomial(),
data = Affairs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5713 -0.7499 -0.5690 -0.2539 2.5191
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.37726 0.88776 1.551 0.120807
gendermale 0.28029 0.23909 1.172 0.241083
age -0.04426 0.01825 -2.425 0.015301 *
yearsmarried 0.09477 0.03221 2.942 0.003262 **
childrenyes 0.39767 0.29151 1.364 0.172508
religiousness -0.32472 0.08975 -3.618 0.000297 ***
education 0.02105 0.05051 0.417 0.676851
occupation 0.03092 0.07178 0.431 0.666630
rating -0.46845 0.09091 -5.153 2.56e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 609.51 on 592 degrees of freedom
AIC: 627.51
Number of Fisher Scoring iterations: 4
4. 删除不显著变量二次建模
# remove insignificant variables and create a new model
> fit.reduced <- glm(ynaffairs ~ age + yearsmarried + religiousness + rating,
+ data = Affairs, family = binomial())
> summary(fit.reduced)
Call:
glm(formula = ynaffairs ~ age + yearsmarried + religiousness +
rating, family = binomial(), data = Affairs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6278 -0.7550 -0.5701 -0.2624 2.3998
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93083 0.61032 3.164 0.001558 **
age -0.03527 0.01736 -2.032 0.042127 *
yearsmarried 0.10062 0.02921 3.445 0.000571 ***
religiousness -0.32902 0.08945 -3.678 0.000235 ***
rating -0.46136 0.08884 -5.193 2.06e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 615.36 on 596 degrees of freedom
AIC: 625.36
Number of Fisher Scoring iterations: 4
5. 模型比较
使用anova()函数对它们进行比较,对于广义线性回归回归,可用卡方检验(Chisq)。
# test model
> anova(fit.reduced,fit.full,test = "Chisq") #Chisq 卡方检验
Analysis of Deviance Table
Model 1: ynaffairs ~ age + yearsmarried + religiousness + rating
Model 2: ynaffairs ~ gender + age + yearsmarried + children + religiousness +
education + occupation + rating
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 596 615.36
2 592 609.51 4 5.8474 0.2108
结果的卡方值不显著(p=0.2108),表明四个预测变量的新模型与九个完整预测变量的模型拟合程度一样好。
6. 模型参数解释
## model parameter interpretation
# view regression coefficient
> coef(fit.reduced)
(Intercept) age yearsmarried religiousness rating
1.93083017 -0.03527112 0.10062274 -0.32902386 -0.46136144
> # exponentiate for coefficient
> exp(coef(fit.reduced))
(Intercept) age yearsmarried religiousness rating
6.8952321 0.9653437 1.1058594 0.7196258 0.6304248
7. 评价预测变量对结果概率的影响
##create a virtual dataframe
> # variables of age, yearsmarried, and religiousness are their mean value, the range of rating is from 1 to 5
> testdata <- data.frame(rating =c(1:5),
+ age = mean(Affairs$age),
+ yearsmarried = mean(Affairs$yearsmarried),
+ religiousness = mean(Affairs$religiousness))
> testdata
rating age yearsmarried religiousness
1 1 32.48752 8.177696 3.116473
2 2 32.48752 8.177696 3.116473
3 3 32.48752 8.177696 3.116473
4 4 32.48752 8.177696 3.116473
5 5 32.48752 8.177696 3.116473
> #then, using testdata to predict
> testdata$prob <- predict(fit.reduced,newdata = testdata,type="response")
> testdata
rating age yearsmarried religiousness prob
1 1 32.48752 8.177696 3.116473 0.5302296
2 2 32.48752 8.177696 3.116473 0.4157377
3 3 32.48752 8.177696 3.116473 0.3096712
4 4 32.48752 8.177696 3.116473 0.2204547
5 5 32.48752 8.177696 3.116473 0.1513079
> # the effect of age
> testdata_age <- data.frame(rating = mean(Affairs$rating),
+ age = seq(17,57,10),
+ yearsmarried = mean(Affairs$yearsmarried),
+ religiousness = mean(Affairs$religiousness))
> testdata_age$pro <- predict(fit.reduced,newdata = testdata_age,type = "response")
> testdata_age
rating age yearsmarried religiousness pro
1 3.93178 17 8.177696 3.116473 0.3350834
2 3.93178 27 8.177696 3.116473 0.2615373
3 3.93178 37 8.177696 3.116473 0.1992953
4 3.93178 47 8.177696 3.116473 0.1488796
5 3.93178 57 8.177696 3.116473 0.1094738
从结果中可以看到,当婚姻评分从1(很不幸福)变为5(非常幸福)时,婚外情概率从0.53降低到了0.51(假定年龄、婚龄和宗教信仰不变);当其他变量不变时,年龄从17岁增加到57岁时,婚外情概率将从0.34降低到0.11.利用该方法可以探究每一个预测变量对结果概率的影响。