Understanding dummy variables in logistic regression with categorical independent variables

  When doing logistic regression in R, I was a little confused by the model output whenever the independent variables included a categorical variable with more than two levels. Searching for relevant material, I found that many people have the same question, so I am writing down my own understanding.

  First, look at an example (data downloaded from http://freakonometrics.free.fr/db.txt):

> db <- read.table("db.txt",header=TRUE,sep=";")
> head(db)
  Y X1 X2 X3
1 1 3.297569 16.25411  B
2 1 6.418031 18.45130  D
3 1 5.279068 16.61806  B
4 1 5.539834 19.72158  C
5 1 4.123464 18.38634  C
6 1 7.778443 19.58338  C
> summary(db)
       Y               X1               X2        X3     
 Min.   :0.000   Min.   :-1.229   Min.   :10.93   A:197  
 1st Qu.:1.000   1st Qu.: 4.545   1st Qu.:17.98   B:206  
 Median :1.000   Median : 5.982   Median :20.00   C:196  
 Mean   :0.921   Mean   : 5.958   Mean   :19.94   D:197  
 3rd Qu.:1.000   3rd Qu.: 7.358   3rd Qu.:21.89   E:204  
 Max.   :1.000   Max.   :11.966   Max.   :28.71          

> reg <- glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals:
     Min        1Q    Median        3Q       Max  
-2.98017   0.09327   0.19106   0.37000   1.50646  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.45885    1.04646  -4.261 2.04e-05 ***
X1           0.51664    0.11178   4.622 3.80e-06 ***
X2           0.21008    0.07247   2.899 0.003745 ** 
X3B          1.74496    0.49952   3.493 0.000477 ***
X3C         -0.03470    0.35691  -0.097 0.922543    
X3D          0.08004    0.34916   0.229 0.818672    
X3E          2.21966    0.56475   3.930 8.48e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 552.64  on 999  degrees of freedom
Residual deviance: 397.69  on 993  degrees of freedom
AIC: 411.69

Number of Fisher Scoring iterations: 7

  Among the three independent variables in this dataset, X1 and X2 are continuous, while X3 is categorical with five levels (A, B, C, D, E). Looking at the regression output, X3 is represented differently from X1 and X2: four new variables, X3B, X3C, X3D, and X3E, appear, but there is no X3A. After consulting relevant material, I realized that logistic regression handles categorical and continuous variables differently.

  When a categorical independent variable has more than two categories, a set of dummy variables must be created to encode category membership. In general, the number of dummy variables is one less than the number of categories, and the omitted category serves as the reference category. In this example, A is the reference category, and X3B, X3C, X3D, and X3E are the four dummy variables. The choice of reference category is arbitrary; by default, R's logistic regression uses the first level of the factor as the reference. The regression model is then:

ln(p/(1-p)) = β0 + β1·X1 + β2·X2 + β3·X3B + β4·X3C + β5·X3D + β6·X3E

Each of the four dummy variables takes the value 1 or 0: when an observation's categorical variable belongs to a given category, that category's dummy variable is 1 and the remaining dummy variables are 0.
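This coding can be inspected directly with model.matrix. A minimal sketch with a toy factor carrying the same levels as X3 (not the actual db data):

```r
# A toy factor with the same five levels as X3. model.matrix shows the
# dummy (treatment) coding R applies by default: the first level, A, is
# the reference and gets no column; each other level gets a 0/1 column.
x <- factor(c("A", "B", "C", "D", "E"))
model.matrix(~ x)
```

Each row has a 1 in the column of its own category and 0 elsewhere; a row for category A has 0 in all four dummy columns.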

For example, when X3 = B in an observation (X1, X2, X3, Y), the dummy variable X3B = 1 and X3C, X3D, X3E are all 0. At this time:

ln(p_B/(1-p_B)) = β0 + β1·X1 + β2·X2 + β3
And when X3 = A in an observation (X1, X2, X3, Y), since A is the reference category, X3B, X3C, X3D, and X3E are all 0. At this time:

ln(p_A/(1-p_A)) = β0 + β1·X1 + β2·X2
Therefore, controlling for the other variables, i.e. taking two observations with identical X1 and X2 but with X3 equal to B and A respectively, subtracting the two equations above gives:

ln(odds_B) - ln(odds_A) = β3,   that is   odds(B/A) = odds_B / odds_A = exp(β3)

Here odds(B/A) is the odds ratio of category B relative to category A, that is, the ratio of the odds of the event for category B to the odds for category A. An odds ratio greater than 1 indicates that the odds of the event increase, i.e. the category has a positive effect on the probability of the event relative to the reference. For example, odds(B/A) > 1 means that, with X1 and X2 held fixed, an observation with X3 = B is more likely to have Y = 1 than one with X3 = A. (Wang Jichuan, Guo Zhigang. Logistic Regression Model: Method and Application [M]. Beijing: Higher Education Press)
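Concretely, exponentiating the X3B coefficient gives the odds ratio. A quick sketch using the estimate copied from the summary output above (not the fitted model object; with the fitted object one would use exp(coef(reg))):

```r
# The coefficient printed for X3B is the log odds ratio of category B
# versus the reference category A; exponentiating it gives odds(B/A).
b3 <- 1.74496                 # X3B estimate from the summary above
odds_ratio <- exp(b3)
round(odds_ratio, 2)          # roughly 5.73
```

So, holding X1 and X2 fixed, the odds of Y = 1 for an observation with X3 = B are about 5.7 times the odds for one with X3 = A.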

 

  Going back to the example at the beginning: from the results we can conclude that categories A, C, and D of X3 have essentially the same effect on Y (the X3C and X3D coefficients are small and not significant), while categories B and E significantly increase the probability that Y = 1 relative to A, C, and D. A quick check:

> db_a <- db[db$X3 == "A",]
> db_b <- db[db$X3 == "B",]
> db_c <- db[db$X3 == "C",]
> db_d <- db[db$X3 == "D",]
> db_e <- db[db$X3 == "E",]

> table(db_a$Y)

  0   1 
 25 172 
> table(db_b$Y)

  0   1 
  6 200 
> table(db_c$Y)

  0   1 
 21 175 
> table(db_d$Y)

  0   1 
 22 175 
> table(db_e$Y)

  0   1 
  5 199 

The results confirm that the proportion of observations with Y = 1 in groups B and E is indeed higher than in groups A, C, and D.
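The group proportions can be computed directly. A sketch using the counts copied from the tables above (with the data frame itself one could use tapply(db$Y, db$X3, mean)):

```r
# Counts of Y = 0 and Y = 1 per group, copied from the tables above.
counts0 <- c(A = 25, B = 6, C = 21, D = 22, E = 5)
counts1 <- c(A = 172, B = 200, C = 175, D = 175, E = 199)

# Crude proportion of Y = 1 in each group.
prop <- counts1 / (counts0 + counts1)
round(prop, 3)
```

Groups B and E sit near 0.97, while A, C, and D cluster around 0.87-0.89, matching the pattern of the regression coefficients.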

 

We can also choose the reference category ourselves:

> levels(db$X3)
[1] "A" "B" "C" "D" "E"
> db$X3 <- relevel(db$X3, "B")
> levels(db$X3)
[1] "B" "A" "C" "D" "E"

Then refit the same regression model:

> reg <- glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals:
     Min        1Q    Median        3Q       Max  
-2.98017   0.09327   0.19106   0.37000   1.50646  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.71389    1.07274  -2.530 0.011410 *  
X1           0.51664    0.11178   4.622  3.8e-06 ***
X2           0.21008    0.07247   2.899 0.003745 ** 
X3A         -1.74496    0.49952  -3.493 0.000477 ***
X3C         -1.77966    0.51002  -3.489 0.000484 ***
X3D         -1.66492    0.50365  -3.306 0.000947 ***
X3E          0.47470    0.66354   0.715 0.474364    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 552.64  on 999  degrees of freedom
Residual deviance: 397.69  on 993  degrees of freedom
AIC: 411.69

Number of Fisher Scoring iterations: 7
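Note that changing the reference category does not change the model; it only shifts the coefficients. Under reference B, each category's coefficient equals its coefficient under reference A minus the old X3B coefficient. A sketch using the estimates copied from the two summaries above:

```r
# X3 coefficients from the first fit, where A was the reference.
coef_A_ref <- c(B = 1.74496, C = -0.03470, D = 0.08004, E = 2.21966)

# Switching the reference to B subtracts B's coefficient from every
# category: B itself becomes 0 (the new reference), and the others
# match the second summary (X3A = -1.74496, X3C = -1.77966, ...).
coef_B_ref <- coef_A_ref - coef_A_ref[["B"]]
round(coef_B_ref, 5)
```

This is why X3E, strongly significant under reference A, is no longer significant under reference B: E differs from A, but barely from B.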

 

That covers the main content. For a more detailed treatment, see: Wang Jichuan, Guo Zhigang. Logistic Regression Model: Method and Application [M]. Beijing: Higher Education Press,

and this post: https://www.r-bloggers.com/logistic-regression-and-categorical-covariates/

 

Copyright statement: This is an original article by the blogger (http://www.cnblogs.com/Demo1589/p/8973731.html); please credit the source when reprinting.
