In practice: univariate Cox analysis first yields a large set of prognosis-associated genes, and Lasso feature screening then narrows them down to 5 genes.

2. Why do we need the Lasso + Cox survival analysis mode? When screening variables that affect patient prognosis, we usually first perform univariate Cox analysis to identify associated variables, and then build a multivariate model to confirm whether each association with survival is independent.
However, this approach does not account for multicollinearity between variables. Sometimes the hazard ratios obtained from univariate and multivariate Cox regression even contradict each other, which is a sign of model distortion caused by multicollinearity. Moreover, when the number of variables exceeds the sample size (for example, when screening genes or mutation sites that affect prognosis, the number of candidate variables may far exceed the number of samples), traditional variable-selection methods for Cox regression, such as stepwise, forward, and backward selection, are no longer applicable.
Therefore, when variables are multicollinear or the number of variables exceeds the sample size, Lasso
(least absolute shrinkage and selection operator) regression is used to screen the variables first, and a Cox regression model is then built on the selected variables to analyze their prognostic impact. This is the Lasso + Cox survival analysis mode.
3. What is the Lasso + Cox survival analysis mode? Lasso selects variables while estimating the model parameters, handles multicollinearity in regression analysis well, and yields an interpretable result. The Lasso algorithm applies an L1-norm shrinkage penalty: coefficients of variables that contribute little to the outcome are penalized, the coefficients of the less important variables are compressed exactly to zero, and only the coefficients of important variables remain non-zero, thereby reducing the number of covariates entering the Cox regression.
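For readers who prefer a formula, the penalized objective can be written roughly as follows (a sketch of what glmnet optimizes for family = 'cox', up to scaling constants):

$$\hat{\beta}(\lambda) = \arg\max_{\beta}\Big[\,\ell(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|\,\Big]$$

where $\ell(\beta)$ is the Cox log partial likelihood, $p$ is the number of candidate genes, and a larger $\lambda$ forces more coefficients to be exactly zero.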

5.7+ Bioinformatics article reproduction (5): Univariate Cox + Lasso screening of prognosis-related DEGs

load("G:/r/duqiang_IPF/GSE70866_metainformation_4_platforms/3_ipf_combined_cox_univariate_Adjuste_for_age_sex.RData")


head(cox_results)
rownames(cox_results)
cox_results2=cox_results %>% as.data.frame() %>% filter(p<0.05)

identical(colnames(exprSet),rownames(phe))

x=exprSet[rownames(exprSet) %in% rownames(cox_results2),]
x=t(x)
dim(x)
y=phe %>%select('time','event')
head(y)[1:4,1:2]
head(x)[1:4,1:4]

Model-building inputs: x and y

# Build the model: convert (time, event) into a survival object for glmnet
y = data.matrix(Surv(time = y$time,
                     event = y$event))
head(y)
head(x)[1:4, 1:5]

1 Model Construction

fit <- glmnet(x, y, family = 'cox')  # type.measure/nfolds belong to cv.glmnet, not glmnet
plot(fit, xvar = 'lambda', label = TRUE)  # lasso coefficient paths of the candidate DEGs

2 Ten-fold cross-validation to select the best lambda:

# 10-fold cross-validation to choose the best lambda
set.seed(007)
lasso_fit <- cv.glmnet(x, y, family = 'cox', type.measure = 'deviance', nfolds = 10)
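The cross-validation curve can be inspected before a lambda is chosen; a minimal sketch using the lasso_fit object created above:

plot(lasso_fit)  # partial-likelihood deviance along the lambda path; the dashed lines mark lambda.min and lambda.1se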

3 Extract the best lambda value (here the lambda corresponding to 1se is chosen):

# Extract the best lambda (here, the lambda given by the 1-SE rule)
lambda.1se <- lasso_fit$lambda.1se
lasso_fit$lambda.min
lambda.1se  #[1] 0.2617315
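lambda.1se gives the sparsest model whose cross-validated error is within one standard error of the minimum, while lambda.min keeps more genes. A quick way to compare the two choices (a sketch against the lasso_fit object above):

# Number of genes with a non-zero coefficient at each candidate lambda
sum(as.numeric(coef(lasso_fit, s = "lambda.min")) != 0)
sum(as.numeric(coef(lasso_fit, s = "lambda.1se")) != 0)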

4 Refit the model with the 1se lambda:

model_lasso_1se <- glmnet(x, y, family = 'cox',
                          lambda = lambda.1se)  # type.measure/nfolds are cv.glmnet arguments and are dropped here

5 Pull out the genes used in the model:

# Pull out the genes used in the model
gene_1se <- rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta) != 0]  # as.numeric() turns the sparse "." entries into 0
gene_1se  # 5 genes selected: "HS3ST1"  "MRVI1"   "TPST1"   "SOD3"    "S100A14"
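Besides the gene names, the corresponding non-zero coefficients can be pulled out in the same way; a short sketch, useful if a risk score is to be built downstream:

# Named vector of the non-zero lasso coefficients for the retained genes
coef_1se <- as.numeric(model_lasso_1se$beta)
names(coef_1se) <- rownames(model_lasso_1se$beta)
coef_1se[coef_1se != 0]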




library(dplyr)
library(survival)
library(glmnet)

#save(phe,phe_final_3,exprSet,cox_results,file = "G:/r/duqiang_IPF/GSE70866_metainformation_4_platforms/3_ipf_combined_cox_univariate_Adjuste_for_age_sex.RData")
load("G:/r/duqiang_IPF/GSE70866_metainformation_4_platforms/3_ipf_combined_cox_univariate_Adjuste_for_age_sex.RData")


head(cox_results)
rownames(cox_results)
cox_results2=cox_results %>% as.data.frame() %>% filter(p<0.05)

getElement(cox_results,"p")
cox_results['p']
head(cox_results)
dim(cox_results)

head(exprSet)
dim(exprSet)

dim(phe)
head(phe)

identical(colnames(exprSet),rownames(phe))

x=exprSet[rownames(exprSet) %in% rownames(cox_results2),]
x=t(x)
dim(x)
y = phe %>% select('time', 'event')
head(y)[1:4,1:2]
head(x)[1:4,1:4]



table(y$time==0)


# Convert OS from days to years (converting or not does not change the result)
if(1==1){
  y$time <- round(y$time/365, 5)  # unit: years, keep 5 decimals; time must not contain 0
  head(y)
}
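glmnet's Cox fitting does not tolerate zero or negative survival times (the comment above notes that time must not contain 0), so any zero-time samples flagged by the table() check should be removed or corrected before modelling; a minimal sketch, assuming such samples are simply dropped:

# Drop samples with non-positive follow-up time from both y and x
keep <- y$time > 0
y <- y[keep, , drop = FALSE]
x <- x[keep, , drop = FALSE]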



# Build the model: convert (time, event) into a survival object for glmnet
y = data.matrix(Surv(time = y$time,
                     event = y$event))
head(y)
head(x)[1:4, 1:5]
fit <- glmnet(x, y, family = 'cox')  # type.measure/nfolds belong to cv.glmnet, not glmnet
plot(fit, xvar = 'lambda', label = TRUE)  # lasso coefficient paths of the candidate DEGs
head(coef(fit))



# 10-fold cross-validation to choose the best lambda
set.seed(007)
lasso_fit <- cv.glmnet(x, y, family = 'cox', type.measure = 'deviance', nfolds = 10)

plot(lasso_fit)
lasso_fit
head(coef(lasso_fit))  # coef() on a cv.glmnet object defaults to s = "lambda.1se"

# Genes with non-zero coefficients at lambda.1se (a cv.glmnet object has no $beta slot)
coef_cv <- coef(lasso_fit, s = "lambda.1se")
rownames(coef_cv)[as.numeric(coef_cv) != 0]


# Extract the best lambda (here, the lambda given by the 1-SE rule)
lambda.1se <- lasso_fit$lambda.1se
lasso_fit$lambda.min
lambda.1se  #[1] 0.2617315


# Refit the model with the 1se lambda
model_lasso_1se <- glmnet(x, y, family = 'cox',
                          lambda = lambda.1se)  # type.measure/nfolds are cv.glmnet arguments and are dropped here
head(model_lasso_1se)
head(coef(model_lasso_1se))


# Pull out the genes used in the model
gene_1se <- rownames(model_lasso_1se$beta)[as.numeric(model_lasso_1se$beta) != 0]  # as.numeric() turns the sparse "." entries into 0
gene_1se  # 5 genes selected: "HS3ST1"  "MRVI1"   "TPST1"   "SOD3"    "S100A14"


1. Brief introduction of LASSO

As technology advances, so do the techniques for collecting data, and how to effectively mine useful information from data has attracted more and more attention. Statistical modelling is currently one of the most effective means of addressing this problem. When a model is first set up, people usually include as many independent variables as possible to minimize the bias caused by omitting important ones. In the actual modelling process, however, we then need to find the subset of independent variables that best explains the response variable, that is, model selection (also called variable selection or feature selection), in order to improve both the interpretability and the predictive accuracy of the model. Model selection is therefore an extremely important issue in statistical modelling.

The Lasso (least absolute shrinkage and selection operator, Tibshirani (1996)) method is a shrinkage estimator. It obtains a more parsimonious model by constructing a penalty function that shrinks some coefficients and sets others exactly to zero. It thus retains the advantages of subset selection and, as a biased estimator, is well suited to data with multicollinearity.

The basic idea of Lasso is to minimize the residual sum of squares under the constraint that the sum of the absolute values of the regression coefficients is less than a constant. This can force some regression coefficients to be exactly 0 and thereby yields an interpretable model. The lars package in R provides a Lasso implementation, and criteria such as AIC and BIC can be used to decide where to cut off the variable set, achieving dimensionality reduction. Studying Lasso therefore helps us apply it effectively to variable selection.
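In its original linear-regression form (Tibshirani 1996), the constrained problem described above can be written as:

$$\hat{\beta} = \arg\min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t,$$

where the tuning constant $t \ge 0$ controls the amount of shrinkage; the smaller $t$ is, the more coefficients are forced exactly to zero.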

To put it simply: factor screening in regression analysis traditionally relies on methods such as stepwise, forward, and backward selection. For data with serious collinearity problems, or where the number of variables exceeds the number of observations (for example in gene sequencing data, where the number of genes is far larger than the number of patients), these traditional methods are unsuitable. Lasso was developed to solve exactly this problem: it provides a new variable-screening algorithm that handles collinearity well. If the variables screened by ordinary methods seem unsatisfactory, or the variables you expected were not selected, this method is worth trying. The specific process is to first screen variables with Lasso in R, and then run Cox regression (or another regression analysis) on the selected variables, as sketched below.
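Following that workflow, the genes screened out by Lasso above can then be fed into a multivariate Cox model. A minimal sketch with the survival package, assuming phe still holds the time/event columns and x the samples-by-genes expression matrix prepared earlier:

library(survival)

# Multivariate Cox regression on the lasso-selected genes,
# matching samples between the expression matrix and the phenotype table
cox_data <- data.frame(phe[rownames(x), c("time", "event")],
                       x[, gene_1se, drop = FALSE])
multi_cox <- coxph(Surv(time, event) ~ ., data = cox_data)
summary(multi_cox)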

Origin blog.csdn.net/qq_52813185/article/details/127363232