This lecture will introduce classification models. For the classification model, we will introduce two classification algorithms, logistic regression (logistic regression) and Fisher linear discriminant analysis; for the multi-classification model, we will briefly introduce the operation steps of multi-class linear discriminant analysis and multi-class logistic regression in Spss. .
An example of classifying fruits in this question
Idea: Logistic regression to the original phenomenon
- set dummy variable y
- Regression is performed, and the estimated y-hat is closer to the dummy variable, then it is classified as that.
Eg: Let 1 apple, 2 oranges, if y is close to 1, it is an apple, and if it is close to 0, it is an orange
Data preprocessing to generate dummy variables
Independent variable mass weight, width fruit width, height fruit height, color_score color (0-1)
Dependent variable: fruit_name fruit name
Generate dummy variable operation: Transform -> Create dummy variable
3. Logistic regression:
4. Build a model:
It is not difficult to see that there is a correlation between u and x, so there is endogenousness, which leads to inaccurate data, so it needs to be improved.
A solution to endogeneity: the two-point distribution
How to get the connection function
These two formulas are obtained from the figure. Both models are consistent with x belonging to (-∞, +∞) and y belonging to (0, 1).
How to solve it?
Substitute the independent variable into the formula to get y and compare it with 0.5 (the comparison of 0.5 in this question is a fruit case)
Maximum likelihood estimation can estimate the rough B_hat and then deduce y_hat for the final prediction.
How is it used for classification?
Here we choose the second equation e^X/1+e^x
SPSS solves binary logistic regression:
Logistic regression coefficient table:
What if the independent variable has a categorical variable?
What about poor forecasts?
Negative impact:
Increasing the squared independent variable too much makes the fitted line fit the sample data exactly, causing the predicted data to not fit.
How to determine the appropriate model? (It not only makes the sample data consistent, but also makes the forecast data more reliable)
Here we remove three apples and oranges for comparison
Fisher's Linear Judgment Analysis
The core problem: find the coefficient vector w
SPSS operation:
Multi-category problem:
Fisher Judgment Multi-Classification
1. Set the number of categories
2. Summary table
3. Saving: predicting group members + group member probability
Fisher multi-classification discriminant results:
Logistic multi-class discrimination:
Spss operation:
Analysis -> Regression -> Multivariate Logistic
Statistics: Select the category and the rest to see if you need to choose
Save options: estimate response probabilities, predict classes.
result:
after class homework:
answer:
In order to facilitate multi-class classification, we need to customize the name of the category, such as 1 for Iris versicolor, 2 for Iris Iris, and 3 for Iris Virginia.
The blogger chose Logistic multivariate classification:
But in order to prevent the inaccuracy of sample data or prediction data, we divide the data into training group and test group, and finally get the classification results.
forecast result: