How to do LASSO regression in R

LASSO (least absolute shrinkage and selection operator) regression uses a penalty function to shrink the coefficients of the variables in a regression model, which helps prevent overfitting and mitigates severe collinearity. LASSO regression was first proposed by the statistician Robert Tibshirani and is now widely used in prediction modelling. It is often recommended that, when a model has many candidate variables but relatively few observations, a LASSO penalty should be the first thing to consider. Today we will look at how to build a predictive model in R using LASSO regression.
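For reference (not spelled out in the original post), the LASSO estimate for a linear model minimizes the usual squared-error loss plus an L1 penalty on the coefficients:

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

The tuning parameter λ controls the shrinkage: the larger λ is, the more coefficients are forced exactly to zero, which is what makes LASSO perform variable selection.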
First, we need to install R's glmnet package, developed by a Stanford team that includes Trevor Hastie and Robert Tibshirani, the inventor of LASSO regression.
Load the required packages, import the data (here, the SPSS breast cancer dataset we used previously), and delete the missing values:

library(glmnet)   # penalized regression (LASSO, ridge, elastic net)
library(foreign)  # read.spss() for importing SPSS files
bc <- read.spss("E:/r/Breast cancer survival agec.sav",
                use.value.labels=F, to.data.frame=T)
bc <- na.omit(bc)  # drop rows with missing values

Currently, the glmnet package only accepts data in matrix form; passing a data frame raises an error, so we must first convert the data into matrices. This step is important.

y <- as.matrix(bc[, 8])             # outcome (status)
x <- as.matrix(bc[, c(2:7, 9:11)])  # predictor matrix
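A note in passing: if some predictors are factors, a common alternative (not in the original post) is to build the design matrix with model.matrix(), which expands factors into dummy columns:

x <- model.matrix(~ ., data = bc[, c(2:7, 9:11)])[, -1]  # [, -1] drops the intercept column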

After conversion we have two matrices: y holds the outcome and x holds the predictors. Now we can
start building the model:

f1 <- glmnet(x, y, family="binomial", nlambda=100, alpha=1)
# alpha=1 fits a LASSO regression; alpha=0 would fit a ridge regression
# The family argument sets the type of regression model:
#   family="gaussian"    - univariate continuous outcome
#   family="mgaussian"   - multivariate continuous outcome
#   family="poisson"     - non-negative count outcome
#   family="binomial"    - binary outcome
#   family="multinomial" - categorical outcome
# Our outcome here is a binary variable, so we use binomial.
print(f1)  # display the fitted lambda path

From the printed path you can see that as λ decreases, the number of nonzero coefficients (Df) and the fraction of deviance explained (%Dev) both increase; the smallest λ in the path is 0.000233.

plot(f1, xvar="lambda", label=TRUE)  # coefficient paths against log(lambda)

The x-axis is the logarithm of λ and the y-axis shows the coefficient values. As λ increases, the coefficients shrink steadily, and some are driven exactly to zero (equivalent to removing those variables from the model).
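If you want to see the exact coefficients at any point along the path, coef() accepts an s argument giving the λ value (the 0.1 below is an arbitrary illustration, not a value from the post):

coef(f1, s = 0.1)  # coefficients at one arbitrary lambda on the path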

Now let's cross-validate.
First, though, we can check predictions on part of the dataset (this step is optional):

predict(f1, newx=x[2:5,], type = "response")  # predicted probabilities for rows 2-5
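Note that without an s argument, predict() returns one column per λ in the path; for binomial models you can also ask for class labels directly (the s value below is an arbitrary illustration):

predict(f1, newx = x[2:5, ], type = "class", s = 0.05)  # predicted class labels at one lambda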

Then we run cross-validation with glmnet's built-in cv.glmnet() function and plot the result:

cvfit <- cv.glmnet(x, y, family="binomial")  # family="binomial" because the outcome is binary
plot(cvfit)

The plot shows two dashed vertical lines: one marks the λ that minimizes the cross-validated error, and the other marks the largest λ whose error is within one standard error of that minimum.
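Because cv.glmnet() assigns folds at random, these two λ values change slightly between runs; a minimal sketch of making them reproducible (the seed and fold count here are arbitrary choices, not from the original post):

set.seed(1234)  # fix the random fold assignment so results are reproducible
cvfit <- cv.glmnet(x, y, family = "binomial",
                   nfolds = 10)  # 10-fold cross-validation (the default)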

cvfit$lambda.min  # lambda giving the minimum cross-validated error
cvfit$lambda.1se  # largest lambda within one standard error of that minimum

OK, we have these two values; now let's plug them back into the fitted model and take a look:

l.coef2 <- coef(cvfit$glmnet.fit, s=0.004174369, exact=F)  # coefficients at lambda.min
l.coef1 <- coef(cvfit$glmnet.fit, s=0.04272596, exact=F)   # coefficients at lambda.1se
l.coef1
l.coef2

We can see that the first model (at lambda.1se) retains no variables at all, while the second (at lambda.min) retains 5 variables, so we can only choose the second one.
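To pull out the names of the surviving variables programmatically (a small helper not in the original post, assuming l.coef2 from above):

co <- as.matrix(l.coef2)          # sparse coefficient matrix -> ordinary matrix
nz <- rownames(co)[co[, 1] != 0]  # names of the nonzero coefficients
setdiff(nz, "(Intercept)")        # drop the intercept, leaving the selected predictors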
We take these selected variables and fit a generalized linear model. (The time variable time is left out here purely for the demonstration; including it would be fine too.)

mod <- glm(status ~ age + pathsize + lnpos + pr, family="binomial", data=bc)
summary(mod)  # coefficients, standard errors and p-values

Three of the predictors come out significant. We can also compute the odds ratios (OR) and 95% CIs:
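A common way to get them in base R (a sketch; confint.default() gives Wald-type intervals):

exp(cbind(OR = coef(mod), confint.default(mod)))  # odds ratios with Wald 95% CIs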
OK, that completes the whole modelling process. Have you got the hang of it?

Source: blog.csdn.net/dege857/article/details/111693504