[AI underlying logic] - "Mathematical Waltz" linear regression

Everyone is familiar with the univariate linear regression model, so I will not go into details here. But when using machine learning packages in Python, you have surely seen output full of model evaluation statistics. In this chapter, we will walk through the main model evaluation concepts in regression analysis!

1. Analysis of Variance ANOVA

Analysis of variance (ANOVA) is a statistical technique used to determine how well different variables in a linear regression model explain the target variable. It does so by comparing the mean squared variation attributed to different sources in the model, in order to determine which variables explain the target variable to a greater extent. The following is a standard ANOVA table:

Source lists the three sources of variation: regression, residual, and total.
df represents degrees of freedom; the degrees of freedom are the number of independent data points that are free to vary when computing a statistic. They are the total degrees of freedom DFT, the regression degrees of freedom DFR, and the residual degrees of freedom DFE.
SS represents the sum of squares (Sum of Squares); a sum of squares describes the degree of variation in the data, that is, how far the values deviate from the mean. The three sums of squares SSR, SSE, and SST are defined below.
MS represents the mean sum of squares (Mean Sum of Squares); in statistics, a mean square is obtained by dividing a sum of squares by its degrees of freedom.
F represents the F-test statistic. The F test is a statistical test based on comparing variances, used to determine whether there are significant differences between two or more samples.
Significance represents the significance, i.e., the p-value of the F test.

1. Sum of squares: SST, SSE, SSR

The total sum of squares (Sum of Squares for Total, SST), also called TSS (total sum of squares), describes the sum of squared differences between all actual values and the overall mean, and is used to evaluate the degree of dispersion of the entire data set.

\text{SST}=\sum_{i=1}^n\Bigl(y^{(i)}-\overline{y}\Bigr)^2
The residual sum of squares (Sum of Squares for Error, SSE), also called RSS (residual sum of squares), reflects the part of the dependent variable that cannot be predicted by the independent variables, also known as the error term (the sum of squared differences between the actual values and the predicted values). It can be used to check the fit of the regression model and to judge whether there are outliers. In regression analysis, the best regression coefficients are often determined by minimizing the sum of squared residuals.

\mathrm{SSE}=\sum_{i=1}^n\Bigl(y^{(i)}-\hat{y}^{(i)}\Bigr)^2
The regression sum of squares (Sum of Squares for Regression, SSR), also called ESS (explained sum of squares), reflects the amount of data variation explained by the regression model (the sum of squared differences between the predicted values and the overall mean). It is used to evaluate the fit of the regression model and the degree of influence of the independent variables on the dependent variable.

\mathrm{SSR}=\sum_{i=1}^n\left(\hat{y}^{(i)}-\overline{y}\right)^2

The relationship between the three is SST = SSR + SSE; the essence of analysis of variance in linear regression is to decompose SST into SSE and SSR! This relationship can be viewed as the Pythagorean theorem of a right triangle:

\begin{aligned} &\text{SST}=\sum_{i=1}^{n}\left(y^{(i)}-\overline{y}\right)^{2}=\left\|\boldsymbol{y}-\overline{y}\boldsymbol{I}\right\|_{2}^{2} \\ &\text{SSR}=\sum_{i=1}^{n}\left(\hat{y}^{(i)}-\overline{y}\right)^{2}=\left\|\hat{\boldsymbol{y}}-\overline{y}\boldsymbol{I}\right\|_{2}^{2} \\ &\text{SSE}=\sum_{i=1}^{n}\left(y^{(i)}-\hat{y}^{(i)}\right)^{2}=\left\|\boldsymbol{y}-\hat{\boldsymbol{y}}\right\|_{2}^{2} \\ &\underbrace{\left\|\boldsymbol{y}-\overline{y}\boldsymbol{I}\right\|_{2}^{2}}_{\mathrm{SST}}=\underbrace{\left\|\hat{\boldsymbol{y}}-\overline{y}\boldsymbol{I}\right\|_{2}^{2}}_{\mathrm{SSR}}+\underbrace{\left\|\boldsymbol{y}-\hat{\boldsymbol{y}}\right\|_{2}^{2}}_{\mathrm{SSE}} \end{aligned}

That is: \left(\sqrt{\mathrm{SST}}\right)^2=\left(\sqrt{\mathrm{SSR}}\right)^2+\left(\sqrt{\mathrm{SSE}}\right)^2. Keep this right-triangle picture in mind before we analyze it further!
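To make the decomposition concrete, here is a minimal numpy sketch (synthetic data, not the book's example) that fits a one-variable OLS line and verifies SST = SSR + SSE:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)   # synthetic data

b1, b0 = np.polyfit(x, y, deg=1)       # slope and intercept by least squares
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
SSE = np.sum((y - y_hat) ** 2)         # residual sum of squares

print(SST, SSR + SSE)                  # the two agree up to floating-point error
```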

2. Degree of freedom DF

Degrees of freedom (DF) fill column 2 in the ANOVA table above: the total degrees of freedom DFT (degrees of freedom total), the regression degrees of freedom DFR (degrees of freedom regression), and the residual degrees of freedom DFE (degrees of freedom error). The relationship between the three is:

\mathrm{DFT}=n-1=\mathrm{DFR}+\mathrm{DFE}=\underbrace{\left(k-1\right)}_{\mathrm{DFR}}+\underbrace{\left(n-k\right)}_{\mathrm{DFE}}=\underbrace{D}_{\mathrm{DFR}}+\underbrace{\left(n-D-1\right)}_{\mathrm{DFE}}

Here n is the number of non-NaN samples participating in the regression, k is the number of regression model parameters including the intercept term, and D is the number of independent variables (explanatory variables), so k = D + 1 (the +1 is the constant term). For example, for univariate linear regression, D = 1 and k = 2. If the number of samples involved in modeling is n = 252, the degrees of freedom are:

\left\{\begin{aligned}\mathrm{DFT}&=252-1=251\\k&=D+1=2\\\mathrm{DFR}&=k-1=D=1\\\mathrm{DFE}&=n-k=n-D-1=252-2=250\end{aligned}\right.

3. MST, MSR, MSE, RMSE

①The mean square total (MST) is defined as follows; it is in fact the variance of the sample dependent variable y!

\mathrm{MST}=\mathrm{var}\big(Y\big)=\frac{\sum_{i=1}^{n}\big(y_{i}-\overline{y}\big)^{2}}{n-1}=\frac{\mathrm{SST}}{\mathrm{DFT}}

②Mean square regression (MSR) is:

\mathrm{MSR}={\frac{SSR}{DFR}}={\frac{SSR}{k-1}}={\frac{SSR}{D}}

③The mean squared error (MSE) is:

\mathrm{MSE}={\frac{SSE}{DFE}}={\frac{SSE}{n-k}}={\frac{SSE}{n-D-1}}

④The root mean square error (Root mean square error, RMSE) is the square root of MSE:

\mathrm{RMSE}=\sqrt{\mathrm{MSE}}=\sqrt{\frac{SSE}{DFE}}=\sqrt{\frac{SSE}{n-k}}=\sqrt{\frac{SSE}{n-D-1}}
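A short sketch of these mean squares, assuming the sums of squares and the counts n and D were obtained as in the previous snippet (the numbers here are placeholders):

```python
import numpy as np

# Placeholder values; in practice take them from the ANOVA decomposition above.
n, D = 252, 1
k = D + 1
SST, SSR = 1000.0, 800.0
SSE = SST - SSR

DFT, DFR, DFE = n - 1, k - 1, n - k

MST = SST / DFT          # variance of the dependent variable
MSR = SSR / DFR          # mean square regression
MSE = SSE / DFE          # mean squared error of the residuals
RMSE = np.sqrt(MSE)      # root mean square error
print(MST, MSR, MSE, RMSE)
```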

2. Goodness of Fit

After the regression model is created, it is natural to ask whether the model can explain the data well, that is, to examine how closely this regression line fits the observed values. This is the so-called goodness of fit. Visually, recall the triangle above in which the three sums of squares obey a Pythagorean-like relationship: the smaller the angle θ, the smaller its opposite side (the error), and the better the goodness of fit.

1. Coefficient of determination

The coefficient of determination (R2) is a statistic that quantitatively reflects the goodness of fit. The closer R2 is to 1, the better the goodness of fit; the closer R2 is to 0, the worse the goodness of fit is.

From a geometric point of view, R2 is the square of the cosine of the angle θ in that triangle:

R^{2}=\cos(\theta)^{2}

Using the Pythagorean theorem triangle above we can get:

R^{2}=\frac{\mathrm{SSR}}{\mathrm{SST}}=1-\frac{\mathrm{SSE}}{\mathrm{SST}}

Specifically, for univariate linear regression, the coefficient of determination is the square of the correlation coefficient between the dependent variable and the independent variable, and it is directly related to the slope coefficient b1:

R^{2}=\rho_{X,Y}^{2}=\left(b_{1}\frac{\sigma_{X}}{\sigma_{Y}}\right)^{2}

where b_{1}=\rho_{X,Y}\frac{\sigma_{Y}}{\sigma_{X}}

Therefore, the linear correlation coefficient ρ and the coefficient of determination R2 are both important indicators of the strength of the linear relationship between variables; they help us understand the ability of the independent variables to explain the dependent variable, evaluate the goodness of fit of the model, and select the best regression model.
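A quick numerical check of this identity, under the assumption of a univariate OLS fit on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 + 2.0 * x + rng.normal(size=200)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)

r2_from_ss = 1 - SSE / SST                    # R^2 = 1 - SSE/SST
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2   # squared Pearson correlation
print(r2_from_ss, r2_from_corr)               # identical for univariate linear regression
```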

2. Adjusted coefficient of determination

But the coefficient of determination R2 alone is not enough. For a multivariate linear model, R2 keeps increasing as the number of explanatory variables D increases. To account for this, we can use the adjusted coefficient of determination (adjusted R squared).

\begin{aligned}R_{\mathrm{adj}}^{2}&=1-\frac{\mathrm{MSE}}{\mathrm{MST}}\\&=1-\frac{\mathrm{SSE}/(n-k)}{\mathrm{SST}/(n-1)}\\&=1-\biggl(\frac{n-1}{n-k}\biggr)\frac{\mathrm{SSE}}{\mathrm{SST}}\\&=1-\biggl(\frac{n-1}{n-k}\biggr)\bigl(1-R^{2}\bigr)\\&=1-\biggl(\frac{n-1}{n-D-1}\biggr)\frac{\mathrm{SSE}}{\mathrm{SST}}\end{aligned}

When the number of independent variables D in the model increases, the adjusted R2 penalizes the extra complexity, avoiding the artificial inflation of the coefficient that comes merely from adding variables. Overfitting usually occurs when the model complexity is too high or the training data is too small. To avoid overfitting, the following methods can be adopted: increase the amount of training data, reduce the model complexity, adopt regularization techniques, and so on.
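A minimal sketch of the adjustment, assuming R2, n, and D are already known (the numbers are placeholders):

```python
def adjusted_r2(r2: float, n: int, D: int) -> float:
    """Adjusted R^2 = 1 - (n - 1) / (n - D - 1) * (1 - R^2)."""
    return 1 - (n - 1) / (n - D - 1) * (1 - r2)

print(adjusted_r2(r2=0.6873, n=252, D=1))  # placeholder values
```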

3. F test: Not all model parameters are 0

In linear regression, the F test is used to test whether the linear regression model parameters are jointly significant. By comparing the regression sum of squares with the residual sum of squares, it determines whether the model has significant explanatory power.

1. Test statistic

The F-test statistic is as follows:

\begin{aligned}F&=\frac{\mathrm{MSR}}{\mathrm{MSE}}=\frac{\mathrm{SSR}/(k-1)}{\mathrm{SSE}/(n-k)}=\frac{\mathrm{SSR}\left(n-k\right)}{\mathrm{SSE}\left(k-1\right)}\\&=\frac{\mathrm{SSR}/D}{\mathrm{SSE}/(n-D-1)}=\frac{\mathrm{SSR}\cdot\left(n-D-1\right)}{\mathrm{SSE}\cdot D}\sim F\left(k-1,n-k\right)\end{aligned}

2. Hypothesis testing

Hypothesis testing is a commonly used method in statistics for inferring, from sample data, whether a population parameter satisfies some hypothesis.
A hypothesis test usually involves two hypotheses: the null hypothesis and the alternative hypothesis.
The null hypothesis is the hypothesis assumed to hold in the experiment or survey; it is taken to be true by default.
The alternative hypothesis is the hypothesis we hope to establish when the null hypothesis does not hold.

By collecting sample data and using the probability distribution of the sample statistic derived from statistical principles, we can compute the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. If this probability is less than the preset significance level (such as 0.05), we can reject the null hypothesis and regard the alternative hypothesis as true. Conversely, if this probability is greater than the preset significance level, the null hypothesis cannot be rejected.

The F test is a one-tailed test; the null hypothesis H0 and the alternative hypothesis H1 are:

\begin{aligned}H_0:&\;b_1=b_2=\cdots=b_D=0\\H_1:&\;b_j\neq0\text{ for at least one }j\end{aligned}

Specifically, the null hypothesis of the F test is that all regression coefficients of the model (except the intercept) are equal to zero, that is, the independent variables have no significant impact on the dependent variable.
If the p-value of the F test is less than the chosen significance level, the null hypothesis can be rejected and the model is considered significant, that is, the independent variables have a significant impact on the dependent variable.

3. Critical value

The critical value Fα can be obtained from an F table according to the two degrees of freedom (k − 1 and n − k) and the significance level α, usually α = 0.05 or α = 0.01. Here 1 − α is the confidence level; it means that when the decision is made to accept the null hypothesis, there is a 95% or 99% chance that this decision is correct.

Compare the F value calculated from the statistic above with the critical value. If the following inequality holds:

F>F_{1-\alpha}\left(k-1,n-k\right)

then at this confidence level we reject the null hypothesis H0: the independent variable coefficients are not all zero at the same time, i.e., the regression is jointly significant. Conversely, if the null hypothesis H0 cannot be rejected, the independent variable coefficients are jointly non-significant, that is, all coefficients may be zero at the same time.

For example:

Given α = 0.01, F1−α(1, 250) = 6.7373. The calculated statistic F = 549.7 > 6.7373 shows that H0 can be rejected with high significance. The p-value can also be used: if the p-value is less than α, the null hypothesis H0 can be rejected.

p\text{-value}=\mathrm{P}\Big(F_{k-1,\,n-k}>F\Big)
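Putting the pieces together, here is a sketch of the F test for a univariate regression, using scipy for the critical value and p-value; the data are synthetic, and the critical value printed for α = 0.01 and F(1, 250) should be the 6.7373 quoted above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, D = 252, 1
k = D + 1
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

F = (SSR / (k - 1)) / (SSE / (n - k))         # MSR / MSE
F_crit = stats.f.ppf(1 - 0.01, k - 1, n - k)  # critical value at alpha = 0.01
p_value = stats.f.sf(F, k - 1, n - k)         # P(F(k-1, n-k) > F)

print(F, F_crit, p_value)   # reject H0 if F > F_crit, equivalently p_value < alpha
```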

4. t test: whether a certain regression coefficient is 0

In linear regression, the t test is mainly used to test whether the coefficient of a specific independent variable in the model is significant; it cannot judge whether the overall model is significant.

1. Test statistic

t-test statistic of b1:

t_{b1}=\frac{\hat{b}_{1}-b_{1,0}}{\mathrm{SE}\Big(\hat{b}_{1}\Big)}

where \hat{b}_1 is the coefficient estimated by ordinary least squares (OLS) linear regression and SE is its standard error:

\mathrm{SE}\left(\hat{b_1}\right)=\sqrt{\frac{\mathrm{MSE}}{\sum_{i=1}^n\left(x^{(i)}-\overline{x}\right)^2}}=\sqrt{\frac{\sum_{i=1}^n\left(e^{(i)}\right)^2}{\sum_{i=1}^n\left(x^{(i)}-\overline{x}\right)^2}}

where MSE is the residual mean squared error defined earlier, e^{(i)} are the residuals, and n is the number of (non-NaN) samples. The larger the standard error, the less reliable the estimate of the regression coefficient.
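A numpy sketch of this standard error and the resulting t statistic for H0: b1 = 0, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 252
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

MSE = np.sum(resid ** 2) / (n - 2)                   # SSE / DFE
se_b1 = np.sqrt(MSE / np.sum((x - x.mean()) ** 2))   # standard error of the slope
t_b1 = (b1 - 0) / se_b1                              # t statistic for H0: b1 = 0
print(b1, se_b1, t_b1)
```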

2. Hypothesis testing

For univariate linear regression, the null and alternative hypotheses of the t test are:

\begin{cases}H_0:b_1=b_{1,0}\\H_1:b_1\neq b_{1,0}\end{cases}

In the most common case b_{1,0} = 0, the null hypothesis is that the specific regression coefficient equals zero, that is, the independent variable has no significant impact on the dependent variable. If the p-value of the t test is less than the chosen significance level, the null hypothesis can be rejected and the coefficient of the independent variable is considered significantly non-zero, that is, the independent variable has a significant impact on the dependent variable.

3. Critical value

If the following inequality holds, accept the null hypothesis H0; otherwise, reject it. (Here T denotes the statistic t_{b1} defined above.)

-t_{1-\alpha/2,n-2}<T<t_{1-\alpha/2,n-2}

Specifically, if the null and alternative hypotheses are:

\begin{cases}H_0:b_1=0\\H_1:b_1\neq0\end{cases}

If the inequality above holds, we accept (fail to reject) the null hypothesis H0, that is, the regression coefficient is not statistically significant; in plain language, b1 may be 0, meaning there is no evidence of a linear relationship between the independent variable and the dependent variable. Otherwise, the null hypothesis H0 is rejected and the regression coefficient is statistically significant.

4. Intercept term coefficient

For linear regression of one variable, the hypothesis testing procedure for the intercept term coefficient b0 is similar to the above-mentioned slope term coefficient b1. t-test statistic for b0:

t_{b0}=\frac{\hat{b}_{0}-b_{0,0}}{\mathrm{SE}\Big(\hat{b}_{0}\Big)}

Similar to the definition above, where:

\mathrm{SE}\left(\hat{b_0}\right)=\sqrt{\frac{\sum_{i=1}^n\left(\varepsilon^{(i)}\right)^2}{n-2}\left[\frac1n+\frac{\overline{x}^2}{\sum_{i=1}^n\left(x^{(i)}-\overline{x}\right)^2}\right]}

For example:

The t-test statistic T follows a t distribution with n − 2 degrees of freedom. The t test used in this section is a two-tailed test.

In statistics, a two-tailed hypothesis test is one in which the rejection region lies in both tails of the distribution of the test statistic;
that is, the researcher does not know in advance whether the parameter or statistic is larger or smaller than the hypothesized value,
so both possibilities exist and the test must be carried out in both tails.

For example, given the significance level α = 0.05 and the degrees of freedom n − 2 = 252 − 2 = 250, the critical t value below can be obtained from a t table, or in Python with stats.t.ppf(1 - alpha/2, DFE):

t_{1-\alpha/2,n-2}=t_{0.975,250}=1.969498

Since the t-distribution is symmetric, we also have:

t_{\alpha/2,n-2}=t_{0.025,250}=-1.969498

If the calculated statistic is t_{b1} = 23.446, which is greater than 1.969498, then the t test of parameter b1 is significant at the α = 0.05 level; that is, H0: b1 = 0 is rejected and H1: b1 ≠ 0 is accepted. The larger the standard error of the regression coefficient, the less reliable the estimate of the regression coefficient.

Therefore, the 1 − α confidence interval of the slope coefficient b1 is as follows, meaning that the probability that the true b1 lies in this interval is 1 − α:

\hat{b}_1\pm t_{1-\alpha/2,n-2}\cdot\mathrm{SE}\!\left(\hat{b}_1\right)

Similarly, the 1 − α confidence interval of the intercept coefficient b0 is as follows, meaning that the probability that the true b0 lies in this interval is 1 − α:

\hat{b_0}\pm t_{1-\alpha/2,n-2}\cdot\mathrm{SE}\!\left(\hat{b_0}\right)
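The critical value and the coefficient intervals can be computed with scipy, as noted above; a sketch assuming the estimates and standard errors are already available (the b0/b1 numbers are placeholders):

```python
from scipy import stats

alpha, n = 0.05, 252
DFE = n - 2
t_crit = stats.t.ppf(1 - alpha / 2, DFE)   # 1.969498 for DFE = 250

# Placeholder estimates and standard errors for b0 and b1.
b0_hat, se_b0 = 1.02, 0.06
b1_hat, se_b1 = 0.81, 0.035

ci_b1 = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
ci_b0 = (b0_hat - t_crit * se_b0, b0_hat + t_crit * se_b0)
print(t_crit, ci_b1, ci_b0)
```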

5. Confidence interval, prediction interval

You have probably seen plots like the one below: the width of the band in the left panel represents the confidence interval of the fitted values of the linear regression, and the band in the right panel is the prediction interval of the predicted values.

Confidence interval (interval for the mean of the dependent variable)

In regression analysis, confidence intervals are used to evaluate the predictive precision of the regression model. Generally, the narrower the confidence interval around a fitted value, the more precise the model's estimate.

The 1 − α confidence interval of the fitted value \hat{y}^{(i)} is:

\hat{y}^{(i)}\pm t_{1-\alpha/2,n-2}\cdot\sqrt{\mathrm{MSE}}\cdot\sqrt{\frac1n+\frac{\left(x^{(i)}-\overline{x}\right)^2}{\sum_{k=1}^n\left(x^{(k)}-\overline{x}\right)^2}}

Width of confidence interval:

2\times\left\{t_{1-\alpha/2,n-2}\cdot\sqrt{\mathrm{MSE}}\cdot\sqrt{\frac1n+\frac{\left(x^{(i)}-\overline{x}\right)^2}{\sum_{k=1}^n\left(x^{(k)}-\overline{x}\right)^2}}\right\}

As \left|x^{(i)}-\overline{x}\right| increases, the width of the confidence interval increases; when x^{(i)}=\overline{x}, the confidence interval is narrowest. As the MSE (mean squared error) decreases, the width of the confidence interval decreases.

Prediction interval (an interval for a specific value of the dependent variable)

The prediction interval is, for a given value xp of the independent variable, an interval estimate for an individual value yp of the dependent variable:

\hat{y}_p\pm t_{1-\alpha/2,n-2}\cdot\sqrt{\mathrm{MSE}}\cdot\sqrt{1+\frac1n+\frac{\left(x_p-\overline{x}\right)^2}{\sum_{k=1}^n\left(x^{(k)}-\overline{x}\right)^2}}

Unlike the confidence interval of the fitted value, the prediction interval is wider because it includes two sources of error: the estimation error of the regression equation and the random error in the future observation.
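A sketch of both intervals for a univariate fit, following the two formulas above; the data are synthetic and x_p is a hypothetical new point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, alpha = 252, 0.05
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
MSE = np.sum(resid ** 2) / (n - 2)
Sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

x_p = 1.5                                   # hypothetical new value of the independent variable
y_p = b0 + b1 * x_p
leverage = 1 / n + (x_p - x.mean()) ** 2 / Sxx

half_ci = t_crit * np.sqrt(MSE) * np.sqrt(leverage)       # confidence interval (mean response)
half_pi = t_crit * np.sqrt(MSE) * np.sqrt(1 + leverage)   # prediction interval (individual value)
print((y_p - half_ci, y_p + half_ci), (y_p - half_pi, y_p + half_pi))
```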

6. Likelihood function and information criterion

1. Log-likelihood function: used in MLE

The likelihood function is a function of the parameters in a statistical model; it is used in maximum likelihood estimation (MLE) and represents the likelihood of the model parameters given the data. In OLS (ordinary least squares) linear regression, the residuals \varepsilon^{(i)}=y^{(i)}-\hat{y}^{(i)} are assumed to follow the normal distribution N(0, σ2), therefore:

\Pr\Big(\varepsilon^{(i)}\Big)=\frac{1}{\sigma\sqrt{2\pi}}\exp\Bigg(-\frac{\Big(y^{(i)}-\hat{y}^{(i)}\Big)^{2}}{2\sigma^{2}}\Bigg)

The likelihood function is:

L=\prod_{i=1}^n\mathrm{P}\left(\varepsilon^{(i)}\right)=\prod_{i=1}^n\left\{\frac1{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left(y^{(i)}-\hat{y}^{(i)}\right)^2}{2\sigma^2}\right)\right\}

We commonly use the log-likelihood function ln(L):

\ln\left(L\right)=\ln\left(\prod_{i=1}^{n}\mathrm{P}\left(\varepsilon^{\left(i\right)}\right)\right)=-\frac{n}{2}\cdot\ln\left(2\pi\sigma^{2}\right)-\frac{\mathrm{SSE}}{2\sigma^{2}}

where the maximum likelihood estimate of σ2 is:

\sigma^2=\frac{\mathrm{SSE}}{n}

Substituting back, we obtain:

\ln\left(L\right)=-\frac n2\cdot\ln\left(2\pi\sigma^2\right)-\frac n2
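A sketch of this Gaussian log-likelihood at the MLE, computed directly from SSE and n; for an OLS fit from a package such as statsmodels, this should match the reported log-likelihood, though that is an assumption worth checking on your own data:

```python
import numpy as np

def ols_loglikelihood(sse: float, n: int) -> float:
    """ln(L) = -n/2 * ln(2*pi*sigma2) - n/2, with sigma2 = SSE/n (the MLE)."""
    sigma2 = sse / n
    return -n / 2 * np.log(2 * np.pi * sigma2) - n / 2

print(ols_loglikelihood(sse=245.3, n=252))   # placeholder SSE
```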

2. Information criteria: model selection

AIC and BIC are information criteria commonly used for model selection in linear regression, i.e., for selecting the optimal model among candidate models.

①Akaike Information Criterion (AIC)

\mathrm{AIC}=2k-2\ln\left(L\right)

where k = D + 1 and L is the likelihood function. AIC rewards goodness of fit while trying to avoid overfitting; the 2k term is the penalty term.

②Bayesian Information Criterion (BIC), also called the Schwarz Information Criterion (SIC)

\mathrm{BIC}=k\cdot\ln(n)-2\ln(L)

where n is the number of samples. The penalty term k·ln(n) of BIC is larger than AIC's 2k (once n ≥ 8), so BIC favors simpler models.

Note: When using AIC and BIC for model selection, the model with the smallest AIC or BIC value should be selected. This means that smaller AIC or BIC values ​​indicate better model fit and smaller model complexity. However, there is no guarantee that the selected model is the optimal model. In practical applications, AIC and BIC should be used as guidance, and domain knowledge and experience should be combined to select the optimal model. At the same time, the assumptions and limitations of the model also need to be tested.
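A sketch that computes AIC and BIC from the log-likelihood above; here k counts the regression parameters as in the text (k = D + 1), although some packages also count σ², so treat the exact values as illustrative:

```python
import numpy as np

def aic_bic(loglik: float, k: int, n: int):
    """AIC = 2k - 2 ln(L);  BIC = k ln(n) - 2 ln(L)."""
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return aic, bic

# Placeholder values: among candidate models, pick the one with the smallest AIC/BIC.
print(aic_bic(loglik=-350.2, k=2, n=252))
```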

7. Others

1. Residual analysis: assumes normality of the residuals

Residual analysis assumes that the residuals follow a normal distribution with mean 0! Using the information provided by the residuals, we evaluate the regression model and check the data for possible problems. The basic idea of residual analysis is that if the regression model fits the data well, the residuals should be randomly distributed, with no obvious pattern or trend. Residual analysis can therefore provide information about the goodness of fit of the model.

Steps:

1. Plot the residuals. A residual plot is a scatter plot of the residuals of the observations against the fitted values.
If the residuals appear randomly distributed, with no obvious pattern or trend, the model may have a good fit.
2. Check the residual distribution. Plot a histogram or kernel density estimate of the residuals to see whether they are (approximately) normally distributed.
If the residual distribution is not normal, a transformation or other measures may be needed to improve the fit.
3. Check the residuals as a function of the independent variables. Plot the residuals against the independent variables (scatter plot or regression curve) to see whether
the residuals vary systematically with the independent variables.
If such a relationship exists, consider adding independent variables, applying variable transformations, or similar methods to improve the fit.

To test the normality of the residuals, the Omnibus normality test can be used. It uses the skewness S and kurtosis K of the residuals to test the null hypothesis that the residuals are normally distributed. The test statistic combines the squared (standardized) skewness and the squared excess kurtosis, and is referred to a χ2 (chi-squared) distribution.
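scipy exposes this as stats.normaltest (D'Agostino and Pearson's omnibus test); a sketch on stand-in residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
resid = rng.normal(size=252)              # stand-in for the regression residuals

stat, p_value = stats.normaltest(resid)   # chi-square statistic combining skewness and kurtosis
print(stat, p_value)                      # large p-value: cannot reject normality of the residuals
```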

2. Autocorrelation detection: Durbin-Watson

The Durbin-Watson test is used to detect autocorrelation in a sequence. In linear regression, autocorrelation analysis examines the correlation between the model's residuals and their lagged versions. When autocorrelation exists in the model, it may indicate that some important variables are missing, or that the time series data in the model has not been handled correctly. Autocorrelation can also be diagnosed from the residual plot: if the residuals show obvious patterns, such as periodic behavior or clustering in a certain region, autocorrelation may be present. In this case, the model can be improved by introducing more independent variables or by using time series analysis methods.

The Durbin-Watson test statistic is:

DW=\frac{\sum_{i=2}^n\left(\left(y^{(i)}-\hat{y}^{(i)}\right)-\left(y^{(i-1)}-\hat{y}^{(i-1)}\right)\right)^2}{\sum_{i=1}^n\left(y^{(i)}-\hat{y}^{(i)}\right)^2}

The formula essentially measures the difference between the residual sequence and the residual sequence lagged by one period. The DW value ranges from 0 to 4. When the DW value is very small (DW < 1), the sequence may have positive autocorrelation. When the DW value is large (DW > 3), the sequence may have negative autocorrelation. When the DW value is near 2 (1.5 < DW < 2.5), the sequence has no autocorrelation. For the remaining ranges, it is impossible to determine whether the series is autocorrelated.
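A direct implementation of the statistic; statsmodels also provides statsmodels.stats.stattools.durbin_watson, which should give the same value on the same residuals:

```python
import numpy as np

def durbin_watson(resid: np.ndarray) -> float:
    """DW = sum of squared first differences of the residuals / sum of squared residuals."""
    diff = np.diff(resid)
    return np.sum(diff ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(6)
resid = rng.normal(size=252)     # stand-in residuals; a value near 2 suggests no autocorrelation
print(durbin_watson(resid))
```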

3. Condition number: multicollinearity

In linear regression, the condition number is often used to test whether the design matrix X exhibits multicollinearity. Multicollinearity refers to the situation where there is a high degree of correlation or a linear relationship between the independent variables in a multiple regression model. Multicollinearity leads to unstable estimates of the regression coefficients, reduces the explanatory power of the model, and can even reduce the model's prediction accuracy. The role of the condition number is more obvious in multiple regression analysis.

Perform an eigenvalue decomposition of X^\mathrm{T}X to obtain the maximum eigenvalue λmax and the minimum eigenvalue λmin. The condition number is defined as the square root of their ratio:

\text{condition number}=\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}
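A sketch that computes this condition number for a hypothetical design matrix with an intercept column and two nearly collinear regressors; np.linalg.cond(X) gives the equivalent singular-value version:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 252
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept column

eigvals = np.linalg.eigvalsh(X.T @ X)         # eigenvalues of X^T X
cond_number = np.sqrt(eigvals.max() / eigvals.min())
print(cond_number, np.linalg.cond(X))         # the two agree; a large value signals multicollinearity
```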

8. Summary

These concepts are very important indicators in linear regression analysis and can help us evaluate the degree of fit, coefficient significance, predictive ability, and multicollinearity of the model. The formulas involved in these concepts may be more complicated, but you don't need to memorize them completely. Just understand what their purpose is and what quantities are roughly used to calculate them!

Analysis of variance can evaluate the overall goodness of fit of the model. The F test can be used to evaluate the overall significance of the linear model parameters, and the t test can evaluate the significance of a single coefficient.

Goodness of fit refers to the proportion of data variation that the model can explain, and is commonly measured by R2.

AIC and BIC are used for model selection, which can select the simplest and most explanatory model when the model fitting degree is similar.

Autocorrelation refers to the correlation between error terms and can be detected using the Durbin-Watson test.

Condition number is an indicator used to evaluate multicollinearity. If the condition number is too large, there may be serious multicollinearity problems.

Origin blog.csdn.net/weixin_51658186/article/details/134930675