Mathematical Modeling: 12 Classification Models

Table of contents

Logistic regression

Linear Probability Model (LPM)

Solving logistic regression in SPSS

Steps

Stepwise regression settings

When independent variables are categorical

Poor prediction: add squared/interaction terms

Overfitting

How to determine the right model: cross-validation

Fisher linear discriminant analysis

SPSS operations

Multi-classification problem

Fisher discriminant analysis

Logistic regression


Logistic regression

Logistic regression can be used for classification: when the dependent variable y is a categorical (0/1) variable, treat the fitted value y_hat as the probability that the event occurs; y_hat >= 0.5 is predicted as occurring (y = 1), and y_hat < 0.5 as not occurring (y = 0).

Linear Probability Model (LPM)

If we directly use the ordinary linear regression model y = β0 + β1x1 + … + βkxk + u (the linear probability model):

Problems with the above model :

  1. Since y can only take the values 0/1, the disturbance term u is related to the regressors, which causes an endogeneity problem.

  2. The predicted value y_hat is a probability (why y_hat can be understood as "the probability that y = 1 occurs" is explained below), so logically it should lie between 0 and 1; but the regression can produce the unrealistic values y_hat < 0 or y_hat > 1.

Therefore, a link function is needed to restrict the fitted value to between 0 and 1:

Why y_hat can be understood as "the probability that y = 1 occurs": y follows a two-point (Bernoulli) distribution, so E(y | x) = 1 · P(y = 1 | x) + 0 · P(y = 0 | x) = P(y = 1 | x); the fitted conditional mean is exactly that probability.

How to choose the link function (two kinds of regression):

① Probit regression: the cumulative distribution function (CDF) of the standard normal distribution. ② Logistic regression: the Sigmoid function.

Since the latter has an analytical expression (while the CDF of the standard normal distribution does not), the logistic model is more convenient to compute than the probit model.

(The content in the red box is included in the paper)

f1 = @(x) normcdf(x);          % CDF of the standard normal distribution
fplot(f1, [-4,4]);             % plot the anonymous function on [-4, 4]
hold on;
grid on;
f2 = @(x) exp(x)./(1+exp(x));  % sigmoid; note ./ for element-wise division
fplot(f2, [-4,4]);
legend('standard normal cdf','sigmoid function','location','SouthEast')

How to solve the model after the link function is added (i.e., how to estimate the coefficients βi): by maximum likelihood estimation (the derivation is not written out in full here).

After the unknown coefficients β in the model have been estimated, how to predict which class a new sample belongs to:

Substitute the known values xi of the sample into the model to obtain y_hat. If y_hat >= 0.5, predict y = 1 and assign the sample to the class corresponding to y = 1; otherwise predict y = 0. (Principle: interpret y_hat as the probability that y = 1 occurs.)
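The prediction rule above can be sketched in Python. This is a minimal illustration, not the notes' model: the coefficients `beta` below are assumed values, not estimated from any data.

```python
import numpy as np

# Hypothetical fitted coefficients (beta0, beta1) -- assumed for
# illustration only, not estimated from real data.
beta = np.array([-1.0, 2.0])

def sigmoid(z):
    """Logistic (sigmoid) link: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    """Return (y_hat, predicted class) for a single observation x."""
    z = beta[0] + beta[1] * x
    y_hat = sigmoid(z)           # interpreted as P(y = 1 | x)
    return y_hat, int(y_hat >= 0.5)

y_hat, label = predict(1.0)      # z = -1 + 2*1 = 1, sigmoid(1) ≈ 0.731
```

Because 0.731 >= 0.5, the sample is assigned to the class y = 1.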

(The content in the red box is included in the paper)

Solving logistic regression in SPSS

Steps

First convert the dependent variable (the class label) into a categorical (dummy) variable:

Create dummy variables in SPSS: Transform menu → Create Dummy Variables → enter a root name.

In the "Variable View" at the bottom you can inspect the variables, rename them, and so on.

SPSS logistic regression: Analyze → Regression → Binary Logistic; add the dependent variable and the covariates (independent variables); if some covariates are categorical, click Categorical → Categorical Covariates so SPSS creates dummy variables for them; under Save, tick Probabilities and Group membership in the Predicted Values group; stepwise regression can be selected via the Method setting; if the sample size is too small, Bootstrap resampling can be used.

Reading the output: the Classification Table and the table of logistic regression coefficients go into the paper.
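The Classification Table that SPSS reports is simply a confusion matrix of observed versus predicted classes at the 0.5 threshold. A minimal Python sketch with scikit-learn (the toy data and model below are assumptions for illustration, not the notes' data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy data, assumed for illustration: one feature, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
pred = model.predict(X)              # uses the 0.5 threshold by default
table = confusion_matrix(y, pred)    # rows: observed, columns: predicted
accuracy = np.trace(table) / table.sum()
```

The diagonal of `table` counts correctly classified samples, exactly as in SPSS's Classification Table; `model.coef_` and `model.intercept_` correspond to the coefficient table.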

Stepwise regression settings

When independent variables are categorical

Add them to Categorical Covariates.

Poor prediction: add squared/interaction terms

Add a new column for the squared term (in this example a squared term is added for every independent variable):
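Adding squared and interaction columns can also be done programmatically. A minimal Python sketch with scikit-learn's `PolynomialFeatures` (the two-variable toy matrix is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two independent variables; degree=2 adds x1^2, x2^2 and the
# interaction term x1*x2 as new columns.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_aug = poly.fit_transform(X)   # columns: x1, x2, x1^2, x1*x2, x2^2
```

The augmented matrix can then be fed to the same logistic regression as before.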

Overfitting phenomenon

How to determine the right model: cross-validation

Suppose we now have model 1 and model 2 (with squared terms added); how do we measure which one is more accurate?

Split the data into a training set and a test set: estimate the model on the training set, then check its predictions on the test set. Randomly pick a few observations from the original data as the test set and see which model predicts them more accurately. To eliminate the influence of chance, draw test sets at random several more times, repeat the training and testing, and finally compute each model's average accuracy. This procedure is cross-validation.
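The procedure above can be sketched in Python with scikit-learn's k-fold cross-validation (the toy data, whose true boundary involves a squared term, is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy data, assumed for illustration; the true class boundary is quadratic.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)

model1 = LogisticRegression(max_iter=1000)       # linear terms only
model2 = make_pipeline(PolynomialFeatures(2),    # + squared/interaction terms
                       LogisticRegression(max_iter=1000))

# 5-fold cross-validation: train on 4 folds, test on the held-out fold,
# repeat, and average the accuracies.
acc1 = cross_val_score(model1, X, y, cv=5).mean()
acc2 = cross_val_score(model2, X, y, cv=5).mean()
```

Comparing `acc1` and `acc2` tells us which model generalizes better; on this toy data the model with squared terms should win, since the true boundary is quadratic.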

Fisher linear discriminant analysis

Idea: find a direction (hyperplane) that separates the different classes of points: after projection onto it, the projections of same-class samples should be as close together (dense) as possible, while the projections of different-class samples should be as far apart as possible.

Principle: see "Machine Learning – Linear Classification 3 – Linear Discriminant Analysis (Fisher Discriminant Analysis) – Model Definition" (bilibili).

Core problem: find the linear coefficient vector ω
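The coefficient vector ω has a closed-form solution: ω ∝ Sw⁻¹(μ1 − μ0), where Sw is the within-class scatter matrix and μ0, μ1 are the class means. A minimal Python sketch (the two-class toy data below is an assumption for illustration):

```python
import numpy as np

# Toy 2-D data, assumed for illustration: two well-separated classes.
rng = np.random.default_rng(2)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))  # class 0
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))  # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter matrix Sw (sum of the two classes' scatter).
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
w = np.linalg.solve(Sw, mu1 - mu0)   # Fisher direction: w ∝ Sw^-1 (mu1 - mu0)

# Classify by projecting onto w and thresholding at the midpoint
# of the projected class means.
threshold = ((mu0 + mu1) / 2) @ w
acc = 0.5 * ((X0 @ w < threshold).mean() + (X1 @ w > threshold).mean())
```

Projections of class-1 samples land above the threshold and class-0 samples below it, which is exactly the "close within class, far between classes" criterion in one dimension.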

SPSS operations (Analyze → Classify → Discriminant)

Results:

Multi-classification problem

Generate numeric class codes for the dependent variable: in Excel, use Find & Replace.

Fisher discriminant analysis

https://blog.csdn.net/z962013489/article/details/79918758

Just adjust the range of the dependent variable:

Logistic regression

https://www.cnblogs.com/bonelee/p/8127411.html
https://blog.csdn.net/bitcarmanlee/article/details/82440853
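For the multi-class case, logistic regression generalizes via softmax (or one-vs-rest). A minimal Python sketch with scikit-learn, which handles the multi-class case automatically (the three-class toy data is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Three-class toy data, assumed for illustration: three Gaussian clusters.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(40, 2)) for c in centers])
y = np.repeat([0, 1, 2], 40)

clf = LogisticRegression(max_iter=1000).fit(X, y)
acc = clf.score(X, y)              # training accuracy
probs = clf.predict_proba(X[:1])   # per-class probabilities, sum to 1
```

Each sample is assigned to the class with the highest predicted probability, the multi-class analogue of the 0.5 threshold in the binary case.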

 

The difference between factors and covariates in SPSS:
Factor: a categorical variable, such as gender or education level.
Covariate: a continuous variable, such as area or weight.


Origin blog.csdn.net/m0_54625820/article/details/128699150