Mathematical Modeling--(6) Classification Model

Table of contents

1. Fruit classification problem

2. Logistic regression

3. Linear Probabilistic Model (LPM)

4. SPSS for logistic regression

5. Fisher linear discriminant analysis

6. Logistic regression multi-classification

7. Homework


1. Fruit classification problem

According to the attributes of the fruit, determine the type of the fruit.
mass: fruit weight

width: the width of the fruit

height: the height of the fruit

color_score: the color value of the fruit, range 0‐1

fruit_name: Fruit category.
The first 19 samples are apples and the following 19 are oranges. These 38 labeled samples are used to predict the fruit types of the last four (unlabeled) samples.

Before logistic regression can be applied, the data are preprocessed to generate dummy variables.


2. Logistic regression

Logistic regression is, at heart, still a form of regression analysis. When the dependent variable is categorical, we can interpret y as the probability that the event occurs: y ≥ 0.5 is predicted as occurrence, and y < 0.5 as non-occurrence.

3. Linear Probabilistic Model (LPM)

Following the idea above, a linear probability model (LPM) can be fitted by ordinary regression.

There are certain problems with linear probability models:

Question 1:
  Applying the linear probability model inevitably raises the question of whether the disturbance term ui is correlated with the independent variables. If it is, there is an endogeneity problem: the regression coefficient estimates are biased and inconsistent. Because yi in binary classification can only be 0 or 1, the disturbance term takes the following form:
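The formula referenced here was an image in the original; a standard reconstruction, assuming the usual LPM notation y_i = x_i'β + u_i:

```latex
% Linear probability model: y_i = x_i'\beta + u_i, with y_i \in \{0, 1\}
u_i =
\begin{cases}
1 - x_i'\beta, & \text{if } y_i = 1,\\[2pt]
-\,x_i'\beta,  & \text{if } y_i = 0.
\end{cases}
```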


Question 2:

The predicted value may turn out to be ŷi > 1 or ŷi < 0; since yi represents a probability, such predictions are meaningless.

The linear model is modified using the two-point (Bernoulli) distribution:

event        1    0
probability  p    1 − p

Given x, consider the two-point distribution of y.

F(x, β) is called the link function; it connects the explanatory variable x with the explained variable y. F(x, β) must map into the interval [0, 1], which guarantees 0 ≤ ŷ ≤ 1.


There are two common choices of link function:

Since the latter has a closed-form expression (while the cdf of the standard normal distribution does not), the logistic model is more convenient to compute than the probit model.
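The two link functions referred to here are, in standard notation (reconstructed; the original formulas were images):

```latex
% Probit model: link is the cdf of the standard normal distribution
F(x,\beta) = \Phi(x'\beta) = \int_{-\infty}^{x'\beta} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt

% Logistic model: link is the Sigmoid function, which has a closed form
F(x,\beta) = \frac{\exp(x'\beta)}{1 + \exp(x'\beta)}
```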

f1 = @(x) normcdf(x);           % cdf of the standard normal distribution
fplot(f1, [-4,4]);              % plot f1 on [-4, 4]
hold on;                        % keep the figure window open
grid on;                        % show grid lines
f2 = @(x) exp(x)./(1+exp(x));   % Sigmoid function (element-wise division for vector input)
fplot(f2, [-4,4]);              % plot f2 on [-4, 4]
legend('standard normal cdf','sigmoid function','location','SouthEast')


Solve:

Because the Sigmoid model is nonlinear in its coefficients, it is estimated by maximum likelihood estimation (MLE).
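In standard notation (reconstructed; the original formula was an image), the likelihood and its compact log form are:

```latex
L(\beta) = \prod_{i=1}^{n} F(x_i'\beta)^{\,y_i}\,\bigl[1 - F(x_i'\beta)\bigr]^{1-y_i}

\ln L(\beta) = \sum_{i=1}^{n} \Bigl\{ y_i \ln F(x_i'\beta) + (1-y_i)\ln\bigl[1 - F(x_i'\beta)\bigr] \Bigr\}
```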

 Written in a more compact form:

Finally, this non-linear maximization problem can be solved using numerical methods (gradient descent).
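As a concrete sketch of this step (not the article's SPSS workflow): a minimal pure-Python gradient-ascent MLE for logistic regression. The `fit_logistic` helper and the toy one-feature samples are illustrative assumptions, not from the original data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Maximize the log-likelihood by gradient ascent.
    X: list of feature lists (a leading 1 is added for the intercept)."""
    n, k = len(X), len(X[0]) + 1
    Xb = [[1.0] + row for row in X]          # add intercept column
    beta = [0.0] * k
    for _ in range(epochs):
        grad = [0.0] * k
        for xi, yi in zip(Xb, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(k):
                grad[j] += (yi - p) * xi[j]  # d lnL / d beta_j
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def predict(beta, x):
    """Classify by the 0.5 threshold on the fitted probability."""
    p = sigmoid(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))
    return 1 if p >= 0.5 else 0

# Toy data: one feature, class 1 tends to have larger values
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
beta = fit_logistic(X, y)
print(predict(beta, [0.8]), predict(beta, [3.8]))  # expect 0 then 1
```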

If ŷ ≥ 0.5, the prediction is y = 1; otherwise the prediction is y = 0.
 


4. SPSS for logistic regression

Example: the attributes and labels of some fruits are given; classify the unknown fruits according to the attribute characteristics of the known fruits. (Only part of the data is shown; the full data set contains 19 apples and 19 oranges.)

  • mass: fruit weight
  • width: the width of the fruit
  • height: the height of the fruit
  • color_score: the color value of the fruit, range 0‐1
  • fruit_name: fruit category

Data preprocessing: converting qualitative variables into quantitative variables

  A qualitative variable is one whose value is not a number but a given string, such as "sick" / "not sick". To analyze such data, the qualitative variable must be converted into a quantitative one. The conversion method is to generate a dummy variable whose value encodes one state of the attribute, e.g. 1 for sick and 0 for not sick.
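The conversion can be sketched in a few lines of Python (the `make_dummy` helper and the sample values are illustrative, not part of the SPSS/Excel workflow described below):

```python
def make_dummy(values, positive_label):
    """Encode a qualitative column as a 0/1 dummy variable:
    positive_label -> 1, every other value -> 0."""
    return [1 if v == positive_label else 0 for v in values]

fruit_name = ["apple", "apple", "orange", "orange", "apple"]
is_apple = make_dummy(fruit_name, "apple")
print(is_apple)  # [1, 1, 0, 0, 1]
```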

There are two ways to generate dummy variables:

Method 1: generate dummy variables in SPSS

 SPSS generates one dummy column per value of the qualitative attribute. For example, since this question asks whether the fruit is an apple or an orange, SPSS will create two columns (apple = 1 with orange = 0, and the reverse); only one of them is needed.

  If this function is not available in SPSS, it can be installed from the extension hub; if that fails, use the second method and generate the dummy variables by hand.

Method 2: generate dummy variables in Excel


Solve logistic regression:

Analyze => Regression => Binary Logistic => select the dependent and independent variables (for a qualitative variable, select the corresponding dummy variable) => click Save and check Probabilities and Group membership

 Result analysis:

  • Among the 19 apple samples, 14 were predicted to be apples, a prediction accuracy of 73.7%;
  • Among the 19 orange samples, 15 were predicted to be oranges, a prediction accuracy of 78.9%;
  • Over the entire sample, logistic regression achieved a prediction accuracy of 76.3%.

  • B is the estimated regression coefficient; "significance" in the output corresponds to the P value.
  • At the 95% confidence level, an attribute with a P value below 0.05 is significant.
  • Interpretation of the significance level: hypothesis testing is required to judge whether a regression result is good. The joint significance test sets H0: β1 = β2 = ... = βk = 0.
  • This tests whether the coefficients of all k independent variables are zero. If the P value is greater than 0.05, H0 cannot be rejected at the 95% confidence level, and the regression as a whole is meaningless. This is the criterion for the overall regression; a 90% confidence level may also be used. To check the significance of a single independent variable, simply check whether its significance value is below 0.05 (95% confidence level).
  • The table shows that width and height are significant at the 95% confidence level; at the 90% confidence level, color_score is also significant.

  • The first column is the predicted value ŷ, i.e. how likely the sample is to be an apple.
  • The second column is the classification result, 1 for apple and 0 for orange: a sample is assigned 1 when its ŷ is greater than 0.5 and 0 otherwise.

5. Fisher linear discriminant analysis

LDA (Linear Discriminant Analysis) is a classic linear discriminant method, also known as Fisher discriminant analysis. The idea is simple: given the training samples, project them onto a one-dimensional straight line so that projections of samples from the same class are as close together as possible, and projections of samples from different classes are as far apart as possible.

 

 The core problem: find the linear coefficient vector w.
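A minimal sketch of how w can be computed for two classes, assuming the classic two-class Fisher solution w = Sw⁻¹(m1 − m2) (within-class scatter matrix Sw, class means m1, m2); the fruit numbers below are made up for illustration:

```python
def mean_vec(X):
    """Column-wise mean of a list of 2-feature samples."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]

def scatter(X, m):
    """Class scatter matrix: sum of (x - m)(x - m)^T, here 2x2."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for row in X:
        d = [row[0] - m[0], row[1] - m[1]]
        for i in range(2):
            for j in range(2):
                S[i][j] += d[i] * d[j]
    return S

def fisher_w(X1, X2):
    """Fisher direction w = Sw^{-1} (m1 - m2) for two 2-feature classes."""
    m1, m2 = mean_vec(X1), mean_vec(X2)
    S1, S2 = scatter(X1, m1), scatter(X2, m2)
    Sw = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]   # invert the 2x2 Sw
    inv = [[ Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det,  Sw[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

# Toy "width/height" samples for two fruit classes (made-up numbers)
apples  = [[8.4, 7.3], [8.0, 7.1], [8.6, 7.5]]
oranges = [[7.1, 7.8], [7.3, 8.0], [6.9, 7.6]]
w = fisher_w(apples, oranges)
# Classify a new sample by projecting it: score = w . x,
# then compare the score with the projected class means.
```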

SPSS operation

 

Multi-category problem:

Now there are four types of fruit, and the mean values of the indicators are as follows:

 Question: how should fruits 60–67 be classified?

Fisher linear discriminant analysis in SPSS:

  • Steps: Analyze -> Classify -> Discriminant -> add the grouping variable (y) -> Define Range (the class codes) -> add the independent variables -> Statistics (Fisher's, Unstandardized) -> Classify (Summary table) -> Save (Predicted group membership, Probabilities of group membership)
  • Key outputs: the unstandardized coefficients (the linear coefficient vector w) and the classification results

6. Logistic regression multi-classification

Extend the link function from the Sigmoid function to the Softmax function.
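A minimal sketch of the Softmax link (the four class scores below are made-up illustrative values):

```python
import math

def softmax(z):
    """Softmax link: map K real scores to K probabilities summing to 1."""
    m = max(z)                           # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

scores = [2.0, 1.0, 0.5, 0.1]            # one score per fruit class
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # pick the highest probability
print(predicted_class)                     # prints 0
```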

 

 

 Result description:

 Returning to the data list, the probability of belonging to each category is output, and the category with the highest probability is the prediction.

7. Homework

 Reference answer:

Quantitative processing first:

 

 Import the data into SPSS:

Building a Multiple Regression Model

 

 

 View forecast results:

 


Origin blog.csdn.net/qq_58602552/article/details/130180313