Regression Analysis (notes on "Who Says Rookies Can't Analyze Data")

Regression was originally a term from genetics, first proposed by the British biologist and statistician Francis Galton. While studying human height, he found that the children of tall parents tended to regress toward the population's average height, while the children of short parents regressed toward the average from the other direction.

Regression analysis

Regression analysis is a method for studying the quantitative relationship between independent and dependent variables. It builds a regression model between the dependent variable Y and the independent variable X in order to measure the influence of X on Y and to predict the trend of Y.

The connection between correlation analysis and regression analysis: both are methods for studying the relationship between two or more variables. In practice, regression analysis builds on correlation analysis; regression is worth carrying out only when the variables are actually correlated.

The differences between correlation analysis and regression analysis:

* Correlation analysis studies random variables and does not distinguish dependent from independent variables; regression analysis must define independent and dependent variables, where the independent variables are fixed, ordinary variables and the dependent variable is a random variable.
* Correlation analysis mainly describes how closely two variables are related; regression analysis not only reveals the degree of influence of X on Y but also supports prediction through the regression model.


Classification

Linear regression

  • simple linear regression
  • multiple linear regression

Note: Multiple Linear Regression is a linear regression model that includes two or more independent variables;
Multivariate Linear Regression is a linear regression model that includes two or more dependent variables.

Nonlinear regression

  • Logistic regression

Linear regression

Linear Regression Analysis Steps

According to the prediction target, determine the independent variable and dependent variable

Clarify the analysis objective and approach, then select the independent and dependent variables.

Draw a scatter plot to determine the regression model type

Draw a scatter plot to make a preliminary judgment about whether the variables are linearly correlated. At the same time, compute the correlation coefficient to judge the degree and direction of the correlation between the independent and dependent variables; this determines the type of regression model.
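As a sketch of this step (with made-up advertising-cost and sales figures, since no data accompanies the text), the correlation coefficient can be computed directly, for example in Python with NumPy:

```python
import numpy as np

# Hypothetical data: advertising cost (x) and sales (y)
x = np.array([10, 12, 15, 18, 20, 22, 25, 28])
y = np.array([50, 55, 63, 70, 78, 80, 90, 95])

# Pearson correlation coefficient r: degree and direction of the linear correlation
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```

A value of r close to +1 or -1 suggests a linear model is appropriate; a value near 0 suggests it is not.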

Estimating model parameters and building a regression model

Use the least squares method to estimate the model parameters and establish the regression model.

Validate the regression model

Run statistical significance tests on the model as a whole and on each parameter, gradually optimizing the model until the final regression model is established.

Using Regression Models to Make Predictions

Apply the model to new data to make predictions.

Interpretation of SPSS linear regression analysis results

SPSS outputs four result tables after a regression analysis completes. They are explained in turn below.

Linear regression model input/removal variables table


This table shows the independent and dependent variables of the regression model.
Removed variables are those independent variables that are not statistically significant.
There are 5 methods:

Input: force the selected independent variables into the regression model (the default)
Step: introduce the independent variables into the model one by one and run a statistical significance test each time, until no non-significant independent variable remains to be removed from the regression model
Remove: set conditions and directly eliminate some independent variables
Backward: according to the set conditions, remove one independent variable at a time until no more can be eliminated
Forward: according to the set conditions, add one independent variable at a time until no more can be included

In multiple linear regression it is recommended to use the [Step] method, also known as stepwise regression, which combines the [Backward] and [Forward] methods. Stepwise regression screens the independent variables by their contribution to the model and gradually eliminates those that are not statistically significant, until no insignificant independent variables remain in the regression model. This is an automatic model-optimization process.

Linear Regression Model Summary Table


R is the correlation coefficient r, indicating the degree and direction of the correlation between variables.

R square is the square of R and is called the coefficient of determination, also known as the goodness of fit. It indicates the percentage of the variation in the dependent variable that the fitted model can explain; the closer R square is to 1, the better the regression model fits.
For example, if R square equals 0.732, the independent variables in the regression model explain 73.2% of the variation in the dependent variable.

Simple linear regression mainly uses R square to measure the model's fit. The adjusted R square is mostly used in multiple linear regression, where it corrects for the inflated apparent fit caused simply by adding more independent variables; it measures how the fit changes as further independent variables are added while building a multiple linear regression model.

The last column is the standard error of the estimate. Its size reflects the accuracy of the model when predicting the dependent variable: the smaller the value, the better the fit. This metric is often used when comparing multiple regression models.
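The three summary-table quantities can be reproduced by hand. A sketch with hypothetical data (fitted by least squares):

```python
import numpy as np

# Hypothetical data and a least-squares fitted line
x = np.array([10, 12, 15, 18, 20, 22, 25, 28])
y = np.array([50, 55, 63, 70, 78, 80, 90, 95])
b1, b0 = np.polyfit(x, y, 1)           # slope, intercept
y_hat = b0 + b1 * x

n, k = len(y), 1                       # sample size, number of predictors
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares

r2 = 1 - sse / sst                                # R square
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # adjusted R square
std_err_est = np.sqrt(sse / (n - k - 1))          # standard error of the estimate
print(round(r2, 3), round(r2_adj, 3), round(std_err_est, 3))
```

Note that the adjusted R square is never larger than R square, and the gap widens as predictors are added without improving the fit.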

Linear regression ANOVA table


The ANOVA table judges the regression effect of the model through the F test: it tests whether the linear relationship between the dependent variable and all the independent variables is significant, that is, whether their relationship can be described by a linear model.

In this table we mainly look at the F statistic and its significance (P value). Using the F value directly requires looking up an F-distribution critical-value table, so in practice we simply compare the P value with the significance level α (0.01 or 0.05).
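A sketch of this F test, computing the statistic from the regression and residual sums of squares and obtaining the P value from the F distribution (hypothetical data; SciPy assumed available):

```python
import numpy as np
from scipy import stats

# Hypothetical data and a least-squares fitted line
x = np.array([10, 12, 15, 18, 20, 22, 25, 28])
y = np.array([50, 55, 63, 70, 78, 80, 90, 95])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

n, k = len(y), 1
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual sum of squares

# F = (SSR / k) / (SSE / (n - k - 1)), compared against F(k, n - k - 1)
f_stat = (ssr / k) / (sse / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)   # upper-tail probability

# Compare the P value with the significance level instead of looking up F tables
print(p_value < 0.05)
```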

Linear regression model regression coefficient table


This table is mainly used for the description of the regression model and the significance test (t test) of the regression coefficient.

The second column is the regression coefficient of the regression model equation, which is used to construct the regression equation.

The standardized coefficients measure the importance of each independent variable to the dependent variable; in this example, foot traffic contributes more to sales than advertising cost does.
The last column shows the significance of each independent variable: advertising cost is statistically significant and foot traffic is highly significant, so overall there is at least a significant linear relationship.
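For a simple regression, the coefficient, its t-test P value, and a standardized (beta) coefficient can be illustrated with SciPy's `linregress` (the data is hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical data: advertising cost (x) and sales (y)
x = np.array([10, 12, 15, 18, 20, 22, 25, 28])
y = np.array([50, 55, 63, 70, 78, 90, 90, 95])

res = stats.linregress(x, y)
# res.slope, res.intercept -> unstandardized coefficients for the equation
# res.pvalue              -> two-sided P value of the t test on the slope

# A standardized (beta) coefficient rescales the slope to unit variances,
# which makes coefficients comparable across independent variables
beta = res.slope * x.std(ddof=1) / y.std(ddof=1)
print(round(res.slope, 3), round(res.pvalue, 6), round(beta, 3))
```

In simple regression the beta coefficient coincides with the correlation coefficient r; in multiple regression the betas differ and are what the standardized-coefficient column reports.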

Prediction

Add the independent variable values to be predicted in the data view, then open the [Linear Regression] dialog box - [Save] - and check the [Unstandardized] check box under [Predicted Value]. After the calculation, the data view gains an additional column of predicted values named "PRE_1".

Automatic Linear Modeling

SPSS can automatically establish a regression model based on the data. This method is an improvement of the general linear model and can help users to establish a linear model with less input data.
[Analysis] - [Regression] - [Automatic Linear Modeling]
Features:
1. Both continuous and categorical variables can participate in automatic modeling as independent variables
2. The independent variables most important to the dependent variable can be selected automatically according to the characteristics of the data, abandoning unimportant or less important variables
3. Outliers and missing values can be handled automatically, and a series of charts is output to show the effect of the regression model

[Target] Item

Create a standard model: build a traditional model that uses the independent variables to predict the target. Standard models are generally easy to understand and fast to score.
Enhance model accuracy: use Boosting to build an ensemble, generating a sequence of models to obtain more accurate predictions.
Enhance model stability: use Bagging to build an ensemble, generating multiple models for more reliable predictions.
Create models for large datasets: build an ensemble by splitting the dataset into separate blocks; this method is mainly for large datasets and requires a connection to IBM SPSS Statistics Server.
The latter three take longer than the standard model to build and to score.

Model building method

1. Include all predictor variables: consider all variables during model construction without screening the independent variables.
2. Forward stepwise: introduce independent variables into the model one by one and run a statistical significance test each time, until no insignificant independent variable remains in the regression model.
3. Best subsets: use a statistical variable-selection algorithm to automatically screen for the best variables. This takes more computation than forward stepwise because the selection considers all variable combinations; with more than 10 variables it can take a long time.

Generally, after a model is established, its quality needs to be judged from the standpoint of statistical methodology. If there are multiple candidate variable combinations, multiple models can be built, and the criterion for comparing them is the information criterion.
Common information criteria in SPSS are:
* AIC (Akaike Information Criterion)
* AICC (corrected AIC)
* BIC (Bayesian Information Criterion)

Among them, the AICC criterion adapts AIC to small samples: it adjusts and corrects the AIC formula so that it works for any sample size, whereas plain AIC suits large samples, which makes AICC the more general criterion.
The smaller the value of the information criterion, the better the model.
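As an illustration of "smaller is better", AIC and BIC can be computed for competing models. The sketch below uses the Gaussian-error formulas up to an additive constant, so only differences between models fitted to the same data are meaningful (the data and models are hypothetical):

```python
import numpy as np

def aic_bic(y, y_hat, n_params):
    """Gaussian-error AIC/BIC up to an additive constant.
    n_params counts the estimated regression parameters."""
    n = len(y)
    sse = np.sum((np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2)
    aic = n * np.log(sse / n) + 2 * n_params
    bic = n * np.log(sse / n) + n_params * np.log(n)
    return aic, bic

# Hypothetical comparison: a fitted straight line vs. an intercept-only model
x = np.array([10, 12, 15, 18, 20, 22, 25, 28])
y = np.array([50, 55, 63, 70, 78, 80, 90, 95])
b1, b0 = np.polyfit(x, y, 1)

aic_line, bic_line = aic_bic(y, b0 + b1 * x, 2)
aic_mean, bic_mean = aic_bic(y, np.full(len(y), y.mean()), 1)
print(aic_line < aic_mean, bic_line < bic_mean)  # the line model wins on both
```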

In linear regression analysis, the data types of the independent variable and the dependent variable are both continuous variables, and the regression equation can be constructed through the linear relationship between the independent variable and the dependent variable. But when the dependent variable is a categorical variable, linear regression analysis is no longer applicable.

Logistic regression analysis

When the independent variable is continuous and the dependent variable is categorical, there is no linear relationship between them and linear analysis cannot be applied directly. Applying a logit (log-odds) transformation to the dependent variable converts the nonlinear problem into a linear one, so that the theory and methods of linear regression can be used to solve it.
Logistic regression is a statistical method for regression analysis where the dependent variable is a categorical variable, and it belongs to the probabilistic nonlinear regression.

Classification of Categorical Variables

1. Binary: the variable has only two categories, such as yes/no or occurrence/non-occurrence, so the corresponding dependent variable takes only the two categorical values 0 and 1.
2. Multi-category: the variable has multiple categories, such as high, medium, and low.

When the dependent variable is binary, the corresponding analysis is binary Logistic analysis; when the dependent variable is multi-category, it is multinomial Logistic analysis.

In model prediction, the value the model computes for the dependent variable is not 0 or 1 itself but the probability that the event occurs. If the probability is between 0.5 and 1, the classification value of the dependent variable is taken as 1 (yes / occurs), and vice versa.
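The 0.5 threshold rule can be sketched in a few lines (the linear-part value `z` is a hypothetical number, standing in for b0 + b1*x1 + ...):

```python
import math

def predict_class(z, threshold=0.5):
    """Logistic regression output: probability first, then class label."""
    p = 1 / (1 + math.exp(-z))          # probability that y = 1
    return p, 1 if p >= threshold else 0

# z is the linear part of the model, b0 + b1*x1 + ... (hypothetical values)
print(predict_class(2.0))   # large positive z -> probability near 1 -> class 1
print(predict_class(-1.5))  # negative z -> probability below 0.5 -> class 0
```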

Logistic regression equation
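The equation itself appears to have been an image in the original post. For reference, the standard form of the binary logistic regression equation is:

```latex
P(y = 1 \mid x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_k x_k)}},
\qquad
\ln\frac{P}{1-P} = b_0 + b_1 x_1 + \cdots + b_k x_k
```

The second form shows the logit transformation that turns the nonlinear probability model into a linear equation in the coefficients.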

The difference between Logistic regression analysis and linear regression analysis

| Linear regression | Logistic regression |
| --- | --- |
| The dependent variable is a continuous variable | The dependent variable is a categorical variable |
| The independent variable has a linear relationship with the dependent variable | The independent variable has a nonlinear relationship with the dependent variable |
| The dependent variable is normally distributed | The dependent variable follows a 0/1 distribution |
| Prediction results are continuous values | The prediction result is a probability value between 0 and 1 |

Logistic regression prediction

Save the model in XML format, then compute the model's predicted values via [Utilities] - [Scoring Wizard].


Source: blog.csdn.net/weixin_40575956/article/details/80103201