Table of contents
Logistic regression
Linear Probability Model (LPM)
SPSS for logistic regression
Categorical independent variables
Poor prediction: add squared terms / interaction terms
How to determine the right model: cross-validation
Fisher linear discriminant analysis
Logistic regression
Logistic regression can be used for classification, that is, when the dependent variable is a categorical (0/1) variable:
Treat the fitted value y_hat as the probability that the event occurs: y_hat >= 0.5 means the event occurs (y = 1); y_hat < 0.5 means it does not (y = 0).
Linear Probability Model (LPM)
If you directly use the previous linear model for regression:
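(The formula referred to here is not reproduced in these notes; the standard form of the linear probability model, written with the usual regression notation rather than the original figure's symbols, is:)
y_i = β0 + β1·x_1i + β2·x_2i + … + βk·x_ki + u_i,  with y_i ∈ {0, 1},
where the fitted value y_hat is read as an estimate of P(y = 1 | x).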
Problems with the above model:
- Endogeneity: since y can only take the values 0/1, the disturbance term can only take two values, and both of them depend on x; the disturbance is therefore not independent of the regressors, which is an endogeneity problem.
- The fitted value y_hat is a probability (it is explained below why y_hat can be understood as "the probability that y = 1"). Logically it should lie between 0 and 1, but the linear model can produce predictions y_hat < 0 or y_hat > 1, which is unrealistic.
Therefore, a link function is needed to restrict the fitted value to lie between 0 and 1:
Why y_hat can be understood as "the probability that y = 1": model y as a two-point (Bernoulli) variable; then E(y | x) = 1·P(y = 1 | x) + 0·P(y = 0 | x) = P(y = 1 | x), so the fitted conditional mean y_hat estimates exactly the probability that y = 1.
How to choose the link function (two kinds of regression):
① Probit regression: the link is the cumulative distribution function (CDF) of the standard normal distribution, F(x) = Φ(x). ② Logistic regression: the link is the Sigmoid function, F(x) = e^x / (1 + e^x).
(The content in the red box is included in the paper)
f1 = @(x) normcdf(x);             % CDF of the standard normal distribution
fplot(f1, [-4, 4]);               % plot the anonymous function on [-4, 4]
hold on;
grid on;
f2 = @(x) exp(x)./(1 + exp(x));   % Sigmoid function (element-wise ./ so it evaluates on vectors)
fplot(f2, [-4, 4]);
legend('standard normal CDF', 'Sigmoid function', 'Location', 'SouthEast')
How to estimate the model after adding the link function (i.e., how to find the coefficients βi): by maximum likelihood estimation (not written out in full in the original notes).
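For reference (this part is not spelled out in the original notes): writing F for the chosen link function (the Sigmoid for logistic regression, the standard normal CDF for probit regression), the coefficients are the β that maximize the log-likelihood of the observed 0/1 outcomes,
ln L(β) = Σ_{i=1..n} [ y_i · ln F(x_i'β) + (1 − y_i) · ln(1 − F(x_i'β)) ],
which has no closed-form solution and is maximized numerically; statistical software (SPSS, MATLAB, etc.) does this automatically.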
After the unknown coefficients β have been estimated, how to predict which class a sample belongs to:
Substitute the known values xi of the sample to be classified into the model to obtain y_hat. If y_hat >= 0.5, take y = 1 and assign the sample to the class corresponding to y = 1 (principle: y_hat is understood as the probability that y = 1).
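A minimal MATLAB sketch of this fit-then-classify step (assumed, illustrative variable names: X is an n-by-k matrix of independent variables, y an n-by-1 vector of 0/1 labels, Xnew the samples to be classified; requires the Statistics and Machine Learning Toolbox):
b = glmfit(X, y, 'binomial', 'Link', 'logit');      % logistic regression, estimated by maximum likelihood
% b = glmfit(X, y, 'binomial', 'Link', 'probit');   % probit regression instead (then also use 'probit' in glmval)
yhat  = glmval(b, Xnew, 'logit');                   % predicted probability that y = 1 for each new sample
class = double(yhat >= 0.5);                        % assign class 1 when the probability is at least 0.5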
(The content in the red box is included in the paper)
SPSS for logistic regression
Steps
First convert the dependent variable (the class labels) into a categorical variable (dummy variable):
Create dummy variables in SPSS: Transform menu - Create Dummy Variables - enter a root name.
In the "Variable View" at the bottom, you can view variables, modify variable names, etc.
Run logistic regression in SPSS: Analyze - Regression - Binary Logistic; add the dependent variable and the covariates (independent variables); if there are categorical independent variables, click Categorical and move them to Categorical Covariates so that dummy variables are created; under Save, check Probabilities and Group membership in the Predicted Values group; stepwise regression can be chosen in the options; if the sample size is too small, bootstrap sampling can be used.
View the result tables: the Classification Table and the logistic regression coefficient table go into the paper.
Stepwise regression settings
Categorical independent variables
Add them to Categorical Covariates
Poor prediction: add squared terms / interaction terms
Add new columns containing the squared terms (in this example a squared term is added for every independent variable); a sketch of this step follows below:
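A small MATLAB sketch of building the extended set of independent variables (X is an assumed, illustrative name for the n-by-k matrix of original variables):
Xsq  = [X, X.^2];            % append a squared column for every independent variable
x1x2 = X(:,1) .* X(:,2);     % an interaction term between the first two variables
Xext = [X, X.^2, x1x2];      % original variables plus squared and interaction terms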
Overfitting phenomenon
How to determine the right model: cross-validation
Suppose we now have model 1 and model 2 (with the squared terms added); how to measure which one is more accurate:
Divide the data into a training group and a test group: use the training group to estimate the model, and use the test group to check the model's predictions. Randomly select some observations from the original data as the test group and see which model predicts more accurately. To eliminate the influence of chance, randomly draw several test groups, repeat the training and testing several times, and finally compute each model's average accuracy. This procedure is cross-validation (a sketch follows below).
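A minimal MATLAB sketch of such a cross-validation loop (assumed names: X is the matrix of independent variables, y the 0/1 labels; the 5-fold split and the 0.5 threshold are illustrative choices; requires the Statistics and Machine Learning Toolbox):
c   = cvpartition(y, 'KFold', 5);                               % 5 random training/test splits
acc = zeros(c.NumTestSets, 1);
for i = 1:c.NumTestSets
    tr = training(c, i);  te = test(c, i);                      % logical indices of the two groups
    b  = glmfit(X(tr, :), y(tr), 'binomial', 'Link', 'logit');  % estimate on the training group
    p  = glmval(b, X(te, :), 'logit');                          % predict on the test group
    acc(i) = mean((p >= 0.5) == y(te));                         % classification accuracy on this test group
end
mean(acc)                                                       % average accuracy of the model
Run the same loop for model 1 and for model 2 and keep whichever has the higher average accuracy.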
Fisher linear discriminant analysis
Idea: find a projection direction (separating hyperplane) so that the projections of samples from the same class are as close together (dense) as possible, while the projections of samples from different classes are as far apart as possible (see the sketch below).
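A minimal MATLAB sketch of the two-class Fisher direction, just to illustrate the idea (X1 and X2 are assumed, illustrative names for the sample matrices of the two classes, one observation per row; this is not the SPSS procedure itself):
mu1 = mean(X1)';  mu2 = mean(X2)';                 % class mean vectors
S1  = (X1 - mean(X1))' * (X1 - mean(X1));          % within-class scatter of class 1
S2  = (X2 - mean(X2))' * (X2 - mean(X2));          % within-class scatter of class 2
w   = (S1 + S2) \ (mu1 - mu2);                     % Fisher projection direction
z1  = X1 * w;  z2 = X2 * w;                        % one-dimensional projections of the two classes
thr = (mean(z1) + mean(z2)) / 2;                   % simple midpoint threshold for classifying a new point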
SPSS operations
Result:
Multi-classification problem
Generate numeric class codes for the dependent variable: in Excel, use Replace.
Fisher discriminant analysis
https://blog.csdn.net/z962013489/article/details/79918758
Just adjust the range of the dependent (grouping) variable:
Logistic regression
The difference between factors and covariates in SPSS:
Factor: a categorical variable, such as gender or education level.
Covariate: a continuous variable, such as area or weight.