R Language | Logistic Regression for Binary and Multiclass Classification

Binary Logistic Regression

First, here is the complete code for my logistic regression. For basic needs, you can simply modify the data and parameters and use it directly:

library(lattice)
library(ggplot2)
library(caret)
library(e1071)
library(foreign)
library(survival)
library(MASS)
library(nnet)
library(epiDisplay)
library(pROC)

# Data import
data<-read.csv('E:/TestData/Number10.2.csv',header = T)

# Collinearity diagnostics
XX<-cor(data[-1])
kappa(XX,exact = TRUE)  # You can also compute the condition number kappa(X): k < 100 means little collinearity; 100 < k < 1000 means considerable multicollinearity; k > 1000 means severe multicollinearity.

# Split into training and test sets
train_sub = sample(nrow(data),7.5/10*nrow(data))
train_data = data[train_sub,]
test_data = data[-train_sub,]

# Model building
model6<-glm(mort~.,data=train_data,family = binomial)
summary(model6)  # display the model summary

# Model prediction
pre_logistic<-as.numeric(predict(model6,newdata = test_data,type = "response")>0.5)

# Model checking
conMat4<-confusionMatrix(factor(pre_logistic),factor(test_data$mort),positive="1")
logistic.display(model6)  # output odds ratios (ORs)

# Plot the ROC curve
roc1<-roc(test_data$mort,pre_logistic,plot=TRUE, print.thres=TRUE, print.auc=TRUE,levels = c(0,1),direction = "<")

# Check whether the linearity assumption holds
pre<-predict(model6,newdata = test_data,type = "response")
par(mfrow=c(2,2))
scatter.smooth(test_data[,2],log(pre/(1-pre)),cex=0.5)
scatter.smooth(test_data[,3],log(pre/(1-pre)),cex=0.5)
scatter.smooth(test_data[,4],log(pre/(1-pre)),cex=0.5)
scatter.smooth(test_data[,5],log(pre/(1-pre)),cex=0.5)

Data Preparation

Clean the data according to your needs before importing it into R. In this example, the dependent variable is binary, representing whether the patient was discharged or died, and the purpose of the experiment is to identify factors associated with patient death. We therefore code death as 1 and discharge as 0.
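
For example, if the raw outcome were stored as text, a minimal recode sketch might look like this (the column name outcome and its values are hypothetical):

data$mort <- ifelse(data$outcome == "death", 1, 0)  # death = 1, discharge = 0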

R supports importing data in many formats. This example imports a csv file, for which you can use the read.csv() function or the read.table() function; to import .xlsx files, you can load the openxlsx package and use the read.xlsx() function. Readers who need this can look up the details themselves.
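
As a sketch, importing an .xlsx file with openxlsx would look roughly like this (the file path here is hypothetical):

library(openxlsx)
data <- read.xlsx('E:/TestData/Number10.2.xlsx', sheet = 1)  # sheet = 1 reads the first worksheet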

In this example, we first import the data:

data<-read.csv('E:/TestData/Number10.2.csv',header = T)

#header = T means the first row of the file is read as column names

The imported data looks as follows (because sensitive information may be involved, the header row is blurred here; the actual file does include a header row):

[Figure: preview of the imported data]

Before data analysis, we need to check whether there is multicollinearity among the independent variables:

data<-read.csv('E:/TestData/Number10.2.csv',header = T)
XX<-cor(data[-1])
kappa(XX,exact = TRUE)  # You can also compute the condition number kappa(X): k < 100 means little collinearity; 100 < k < 1000 means considerable multicollinearity; k > 1000 means severe multicollinearity.

In this example, the computed condition number k = 2.857281, indicating that collinearity among the independent variables is small and can be ignored.
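
As an alternative collinearity check (an addition, assuming the car package is installed), variance inflation factors can be computed on a fitted model:

library(car)
vif(glm(mort ~ ., data = data, family = binomial))  # a common rule of thumb: VIF > 10 suggests problematic collinearity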

Then we split the data into a training set and a test set at a ratio of 7.5 : 2.5:

train_sub = sample(nrow(data),7.5/10*nrow(data))
train_data = data[train_sub,]  # training set
test_data = data[-train_sub,]  # test set
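
Note that sample() draws a random subset, so the split changes on every run; to make it reproducible, you can fix the random seed first (a small addition, not in the original code):

set.seed(123)  # any fixed seed makes the random split repeatable
train_sub = sample(nrow(data),7.5/10*nrow(data))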

Model Building

In R, binary logistic regression can be fitted by calling the glm() function; the specific implementation is as follows:

model6<-glm(mort~.,data=train_data,family = binomial)

#When fitting the model you can also write the formula explicitly as mort~colname1+colname2+colname3; mort~. means all remaining columns are used as independent variables

summary(model6)  # display the model summary

The results are as follows:
[Figure: summary() output of the fitted model]
In the output above, we can see that all variables in the model have significant P values (if some independent variables are not significant at this step, you can consider removing them and rebuilding the model).
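
If you need to prune insignificant variables, one common approach (a sketch, not a step from the original workflow) is backward stepwise selection with stepAIC() from the MASS package loaded above:

model_step <- stepAIC(model6, direction = "backward")  # drop variables stepwise by AIC
summary(model_step)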

After the model is successfully built, we predict on the test set with it, classifying observations with probability > 0.5 as death and probability < 0.5 as discharge:

pre_logistic<-as.numeric(predict(model6,newdata = test_data,type = "response")>0.5)

Model Checking

To test the predictive performance of the model, we use the confusionMatrix() function from the caret package to output the confusion matrix of predicted vs. true values, along with the model's accuracy, sensitivity, and specificity:

conMat4<-confusionMatrix(factor(pre_logistic),factor(test_data$mort),positive="1") 
logistic.display(model6)  # output odds ratios (ORs)

#Note that confusionMatrix() requires its inputs to be factors. Here you can use the base factor() function, or as.factor(), to coerce the data to factor type, as shown in the code above.

The output results of the confusion matrix are shown as follows:
[Figure: confusion matrix output]
As can be seen above, the model's accuracy is 0.9681, sensitivity 0.80882, and specificity 0.98861. Overall, the sensitivity is low and the specificity is high. The likely reason is that the data are imbalanced, making the model insufficiently sensitive to the minority class. Oversampling, SMOTE balancing, and similar methods can be used to rebalance the data; readers who need them can look up the specifics themselves.
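
As a minimal rebalancing sketch (not the original author's code), caret's upSample() duplicates minority-class rows until the classes are even; this assumes mort is the first column of train_data, and SMOTE itself would need a separate package:

up_train <- upSample(x = train_data[,-1], y = factor(train_data$mort), yname = "mort")  # oversample the minority class
table(up_train$mort)  # class counts are now equal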

In addition to outputting the confusion matrix, the model's sensitivity and specificity can also be obtained by plotting the ROC curve:

roc1<-roc(test_data$mort,pre_logistic,plot=TRUE, print.thres=TRUE, print.auc=TRUE,levels = c(0,1),direction = "<")
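
Note that passing the hard 0/1 predictions to roc() yields a curve with only one real operating point; ROC curves are usually drawn from the predicted probabilities instead, e.g.:

pre_prob <- predict(model6,newdata = test_data,type = "response")  # probabilities rather than 0/1 labels
roc2 <- roc(test_data$mort, pre_prob, plot=TRUE, print.auc=TRUE, levels = c(0,1), direction = "<")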

The ROC curve is shown as follows:
[Figure: ROC curve]
In fact, from the figure above you can see that the AUC of my ROC curve differs from the accuracy reported with the confusion matrix. (As pointed out in the comments, AUC and accuracy are not the same metric; I didn't understand this at first, so it is entirely normal for the AUC and the confusion-matrix accuracy to disagree.)

At this point, the binary logistic regression model is complete. I went through a lot of trial and error while building it, and the model often fit poorly. I originally wanted to record the process in detail, but my writing skills are limited and some of the lessons learned were hard to put into words, so in the end I only wrote a rough outline. In practice, all kinds of hard-to-describe errors occur because of data problems or careless operation. For example, unordered categorical independent variables with more than 2 levels are best coded as dummy variables (in R, if you store such a variable as letters like A/B/C in the raw data, glm() will automatically code it as dummy variables with the first level as the reference; readers who need more can look it up themselves). Also, if an independent variable has a nonlinear relationship with the logit and the form of that relationship is unclear, you may need a generalized additive model, via the gam() function; if the nonlinear form is clear, transform it into a linear one and keep using glm(). For example, if linearity diagnostics show that an independent variable has a quadratic relationship, you can add its square as a new variable and refit the model (the linearity check is included in my overall code above; interested readers can try it themselves).
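
For the unclear-nonlinearity case, a generalized additive model sketch might look like this (using gam() from the mgcv package; x1 and x2 are hypothetical column names, and s() fits a smooth term):

library(mgcv)
gam_model <- gam(mort ~ s(x1) + s(x2), data = train_data, family = binomial)
summary(gam_model)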

The results of the linearity test are roughly as follows:
[Figure: scatter.smooth plots of each predictor against the logit]
As shown above, the third panel clearly shows a quadratic relationship, so the independent variable in panel 3 can be transformed into its square and a new model built.
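
A minimal sketch of the refit (x3 stands in for the panel-3 variable; the name is hypothetical):

model7 <- glm(mort ~ . + I(x3^2), data = train_data, family = binomial)  # add the squared term alongside the linear ones
summary(model7)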

Multiclass Logistic Regression

Having introduced binary regression, let's look at multiclass logistic regression. The R implementation is similar to the binary case. As before, the full code first:

library(lattice)
library(ggplot2)
library(caret)
library(e1071)
library(nnet)
library(pROC)

# Data preparation
data<-read.csv('D:/多分类逻辑回归/iris.csv',header = T)
train_sub = sample(nrow(data),7.5/10*nrow(data))
train_data = data[train_sub,]
test_data = data[-train_sub,]

# Build the multiclass model
train_data$class<-relevel(as.factor(train_data$class),ref = "Iris-setosa")  # choose the reference class (relevel the modeled variable itself so the choice takes effect)
mult.model<-multinom(class~A+B+C+D,data=train_data)
summary(mult.model)

# Significance test for the coefficients
z <- summary(mult.model)$coefficients/summary(mult.model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1))*2
p

# Relative risk ratios (exponentiated coefficients), interpreted like ORs
exp(coef(mult.model))

# Use head() to view the model's fitted values
# head(pp<-fitted(mult.model))

# Predict on the test set
pre_logistic<-predict(mult.model,newdata = test_data)

# Percentage of correct predictions
# table(test_data$class,pre_logistic)
# Multiclass confusion matrix
conMat4<-confusionMatrix(factor(pre_logistic),factor(test_data$class))

In this example, multiclass logistic regression uses the multinom() function from the nnet package. This function has two quirks: first, a reference class must be chosen; second, it does not report the significance of the coefficients (you have to compute it yourself, as in the code above). In multinomial logistic regression, assuming there are 3 classes, two binary models are built against the reference class. The model computes the probability that an observation belongs to each of the three classes, and the class with the highest probability is taken as the final prediction. (For a detailed tutorial, see: Multivariate Classification Detailed Tutorial)
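
To inspect the per-class probabilities behind those predicted labels, predict() on a multinom model also accepts type = "probs":

pre_probs <- predict(mult.model, newdata = test_data, type = "probs")  # one probability column per class
head(pre_probs)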

I won't explain the multiclass code in detail here. The sample code uses the classic iris dataset, which you can download yourself if needed: iris data set

Summary: my feeling is that things like algorithms and data analysis still have to be written by yourself; if you only watch or listen, it's still a mess when you try it on your own. So, if you get the chance, practice!
