Statistics (Jia Junping), Thinking Questions, Chapter 11: Univariate Linear Regression

1. Explain the meaning of correlation and explain the characteristics of correlation.

The uncertain quantitative relationship between variables is called correlation.

The characteristics of a correlation relationship: the value of one variable cannot be uniquely determined by the other; when the variable $x$ takes a certain value, the variable $y$ may take several different values. Because this relationship between the variables is uncertain, it cannot be described by a functional relationship.

2. What problems does the correlation analysis mainly solve?

Correlation analysis is the description and measurement of the linear relationship between two variables, and the problems it needs to solve include:

(1) Whether there is a relationship between the variables;
(2) If there is a relationship, what kind of relationship is it;
(3) How strong is the relationship between the variables;
(4) Whether the relationship between the variables reflected by the sample can represent the relationship between the population variables.

3. What are the basic assumptions in the correlation analysis?

When conducting correlation analysis, there are two main assumptions about the population:

(1) There is a linear relationship between the two variables;
(2) Both variables are random variables.

4. Briefly describe the nature of the correlation coefficient.

The correlation coefficient is a statistic that measures the strength of the linear relationship between two variables. If it is calculated from all the population data, it is called the population correlation coefficient, denoted $\rho$; if it is calculated from sample data, it is called the sample correlation coefficient, denoted $r$.

Properties of the correlation coefficient:

(1) The value of $r$ ranges from $-1$ to $+1$, i.e. $-1 \le r \le 1$. If $0 < r \le 1$, there is a positive linear correlation between $x$ and $y$; if $-1 \le r < 0$, there is a negative linear correlation; if $r = +1$, the positive linear correlation is perfect; if $r = -1$, the negative linear correlation is perfect. Thus when $|r| = 1$, the value of $y$ is completely determined by $x$ and the relationship between the two is functional; when $r = 0$, the value of $y$ has no linear association with $x$, i.e. there is no linear correlation between them.
(2) $r$ is symmetric. The correlation coefficient of $x$ with $y$, $r_{xy}$, equals the correlation coefficient of $y$ with $x$, $r_{yx}$, i.e. $r_{xy} = r_{yx}$.
(3) The value of $r$ is independent of the origin and scale of $x$ and $y$. Changing the data origin or the measurement scale of $x$ and $y$ does not change the value of $r$.
(4) $r$ is only a measure of the linear relationship between $x$ and $y$; it cannot be used to describe nonlinear relationships. That is, $r = 0$ only means that there is no linear correlation between the two variables; it does not mean there is no relationship at all, for there may be a nonlinear correlation between them. A strong nonlinear correlation between the variables can still yield $r = 0$. Therefore, when $r = 0$ or $r$ is very small, one cannot lightly conclude that the two variables are unrelated; a reasonable interpretation should be made in combination with a scatter plot.
(5) Although $r$ is a measure of the linear relationship between two variables, it does not imply that $x$ and $y$ necessarily have a causal relationship.
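Properties (1) and (2) can be illustrated with a minimal sketch of the sample correlation coefficient in Python (the data values here are made up for the example):

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                       # y = 2x exactly
print(pearson_r(x, y))                     # 1.0: perfect positive linear correlation
print(pearson_r(x, y[::-1]))               # -1.0: perfect negative linear correlation
print(pearson_r(x, y) == pearson_r(y, x))  # True: symmetry, r_xy = r_yx
```

Note that `pearson_r` returns exactly $\pm 1$ only because the toy data lie exactly on a line; real data give intermediate values.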

5. Why do we need to test the significance of the correlation coefficient?

In general, the population correlation coefficient $\rho$ is unknown, and the sample correlation coefficient $r$ is usually used as its approximate estimate. But since $r$ is calculated from sample data, it is affected by sampling fluctuation: different samples yield different values of $r$, so $r$ is a random variable. Whether the degree of correlation in the population can be explained on the basis of the sample correlation coefficient depends on the reliability of that coefficient, which must be examined through a significance test.

6. Briefly describe the steps of the significance test of the correlation coefficient.

The steps to test the significance of the correlation coefficient are as follows:

(1) State the hypotheses: $H_0\!: \rho = 0$; $H_1\!: \rho \neq 0$.
(2) Compute the test statistic:
$$t = |r|\sqrt{\frac{n-2}{1-r^2}} \sim t(n-2)$$
(3) Make a decision. For the given significance level $\alpha$ and degrees of freedom $df = n-2$, look up the critical value $t_{\alpha/2}(n-2)$ in the $t$ distribution table. If $|t| > t_{\alpha/2}$, reject the null hypothesis $H_0$, indicating that there is a significant linear relationship between the two variables in the population.
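The three steps can be sketched numerically. The sample values $r = 0.8$, $n = 10$ and the table value $t_{0.025}(8) = 2.306$ below are assumptions made only for this illustration:

```python
import math

def corr_t_stat(r, n):
    """Test statistic t = |r| * sqrt((n - 2) / (1 - r**2)), df = n - 2."""
    return abs(r) * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.8, 10            # hypothetical sample correlation and sample size
t = corr_t_stat(r, n)     # step (2): compute the statistic
t_crit = 2.306            # step (3): t_{0.025}(8), looked up in the t table
print(round(t, 4))        # 3.7712
print(t > t_crit)         # True -> reject H0: rho = 0
```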

7. Explain the meaning of regression model, regression equation, and estimated regression equation.

(1) Regression model: for two variables with a linear relationship, a linear equation can be used to express the relationship between them. The equation describing how the dependent variable $y$ depends on the independent variable $x$ and the error term $\varepsilon$ is called a regression model. A linear regression model involving only one independent variable can be expressed as:
$$y = \beta_0 + \beta_1 x + \varepsilon$$

(2) Regression equation: under the assumptions of the regression model, the expected value of $\varepsilon$ equals 0, so the expected value of $y$ is $E(y) = \beta_0 + \beta_1 x$; that is, the expected value of $y$ is a linear function of $x$. The equation describing how the expected value of the dependent variable $y$ depends on the independent variable $x$ is called the regression equation. The univariate linear regression equation has the form:
$$E(y) = \beta_0 + \beta_1 x$$

(3) Estimated regression equation: if the parameters $\beta_0$ and $\beta_1$ in the regression equation were known, then for a given value of $x$ the expected value of $y$ could be computed from $E(y) = \beta_0 + \beta_1 x$. But the population regression parameters $\beta_0$ and $\beta_1$ are unknown and must be estimated from sample data. Replacing the unknown parameters $\beta_0$ and $\beta_1$ with the sample statistics $\hat{\beta}_0$ and $\hat{\beta}_1$ yields the estimated regression equation; it is an estimate of the regression equation obtained from the sample data.
For univariate linear regression, the estimated regression equation has the form:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

where $\hat{\beta}_0$ is the intercept of the estimated regression line on the $y$-axis, and $\hat{\beta}_1$ is the slope of the line, i.e. the mean change in $y$ when $x$ changes by one unit.
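The distinction between the three can be seen in a small simulation: the model generates the data, the regression equation is its mean line, and the estimated equation is recovered from a sample. The true parameters $\beta_0 = 10$, $\beta_1 = 2$ and the noise level are assumptions made for this sketch:

```python
import random

# Assumed true model for the simulation: y = 10 + 2x + eps, eps ~ N(0, 1)
random.seed(42)
beta0, beta1 = 10.0, 2.0
x = [i / 10 for i in range(100)]                        # fixed x values
y = [beta0 + beta1 * xi + random.gauss(0, 1) for xi in x]

# Estimated regression equation: replace beta0, beta1 by sample estimates
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
print(round(b0, 2), round(b1, 2))   # close to the true parameters 10 and 2
```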

8. What are the basic assumptions in the linear regression model?

The basic assumptions in the unary linear regression model:

(1) The dependent variable $y$ and the independent variable $x$ have a linear relationship.
(2) In repeated sampling, the values of the independent variable $x$ are fixed, i.e. $x$ is assumed to be non-random.
(3) The error term $\varepsilon$ is a random variable with expected value 0, i.e. $E(\varepsilon) = 0$.
(4) For all values of $x$, the variance $\sigma^2$ of $\varepsilon$ is the same.
(5) The error term $\varepsilon$ is an independent, normally distributed random variable, i.e. $\varepsilon \sim N(0, \sigma^2)$.

9. Briefly describe the basic principle of parameter least squares estimation.

For the $i$-th value of $x$, the estimated regression equation can be expressed as:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

For $n$ pairs of observations of $x$ and $y$, many straight lines could be used to describe their relationship. The least squares method represents the relationship between the two variables by the line that lies closest to all the observation points: it determines the model parameters $\beta_0$ and $\beta_1$ by minimizing the sum of squared deviations between the observed values $y_i$ of the dependent variable and the estimated values $\hat{y}_i$.
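Minimizing $\sum(y_i - \hat{y}_i)^2$ yields the standard closed-form estimates $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$; a minimal sketch (the observations are invented for the example):

```python
def ols_fit(x, y):
    """Least squares estimates for y-hat = b0 + b1 * x.

    Minimizing sum((y_i - yhat_i)**2) gives b1 = S_xy / S_xx and
    b0 = ybar - b1 * xbar.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]       # made-up observations
b0, b1 = ols_fit(x, y)
print(round(b0, 2), round(b1, 2))    # 0.09 1.97
```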

10. Explain the meaning of the total sum of squares, regression sum of squares, and residual sum of squares, and explain the relationship between them.

(1) The total sum of squares (SST) is the sum of squared deviations of the actual observations $y_i$ from their mean $\bar{y}$; it decomposes as
$$SST = \sum(y_i - \bar{y})^2 = \sum(y_i - \hat{y}_i)^2 + \sum(\hat{y}_i - \bar{y})^2$$

(2) The regression sum of squares (SSR) is the sum of squared deviations of the fitted values $\hat{y}_i$ from the mean $\bar{y}$ of the actual observations, i.e. $SSR = \sum(\hat{y}_i - \bar{y})^2$. It reflects the part of the total variation in $y$ that is caused by the linear relationship between $x$ and $y$, i.e. the part of the variation in $y_i$ that can be explained by the regression line.

(3) The residual sum of squares (SSE) is the sum of squared deviations of the actual observations $y_i$ from the fitted values $\hat{y}_i$, i.e. $SSE = \sum(y_i - \hat{y}_i)^2$. It reflects the influence on the variation of $y$ of factors other than the linear influence of $x$, i.e. the part of the variation in $y_i$ that cannot be explained by the regression line. It is also called the error sum of squares.

(4) The relationship between the three: total sum of squares (SST) = regression sum of squares (SSR) + residual sum of squares (SSE).
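The identity SST = SSR + SSE can be checked numerically; this sketch fits a least squares line to made-up data and computes all three sums:

```python
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]       # made-up observations

# Fit the line by least squares
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx
yhat = [b0 + b1 * a for a in x]

sst = sum((yi - my) ** 2 for yi in y)                  # total sum of squares
ssr = sum((yh - my) ** 2 for yh in yhat)               # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual sum of squares
print(round(sst, 3), round(ssr, 3), round(sse, 3))     # 38.9 38.809 0.091
print(abs(sst - (ssr + sse)) < 1e-9)                   # True: SST = SSR + SSE
```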

11. Briefly describe the meaning and function of the coefficient of determination.

(1) The meaning of the coefficient of determination
The ratio of the regression sum of squares to the total sum of squares is called the coefficient of determination, denoted $R^2$. Its formula is
$$R^2 = \frac{SSR}{SST} = \frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

(2) The role of the coefficient of determination
The coefficient of determination $R^2$ measures the goodness of fit of the regression line to the observed data. If all observation points fall on the line, then the residual sum of squares $SSE = 0$ and $R^2 = 1$: the fit is perfect. If the variation of $y$ is unrelated to $x$, then $x$ does not help explain the variation of $y$ and $R^2 = 0$. Thus the value of $R^2$ ranges over $[0, 1]$. The closer $R^2$ is to 1, the larger the share of the regression sum of squares in the total sum of squares, the closer the regression line lies to the observation points, the larger the part of the variation in $y$ that is explained by changes in $x$, and the better the fit of the regression line; conversely, the closer $R^2$ is to 0, the worse the fit of the regression line.
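A minimal sketch of computing $R^2$ from the formula above (the observations and fitted values are invented for the example):

```python
def r_squared(y, yhat):
    """Coefficient of determination: R^2 = 1 - SSE / SST."""
    ybar = sum(y) / len(y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

y = [2.1, 3.9, 6.2, 7.8, 10.0]
yhat = [2.06, 4.03, 6.00, 7.97, 9.94]   # fitted values from a least squares line
print(round(r_squared(y, yhat), 4))     # 0.9977: the line fits almost perfectly
print(r_squared(y, y))                  # 1.0: all points on the line -> SSE = 0
```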

12. In regression analysis, what is the role of F test and t test?

The F test examines whether the linear relationship between the independent variable $x$ and the dependent variable $y$ is significant, i.e. whether their relationship can be represented by the linear model $y = \beta_0 + \beta_1 x + \varepsilon$; it is a test of the linear relationship.
The t test is to test whether the independent variable has a significant impact on the dependent variable, that is, the test of the regression coefficient.
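Both statistics can be sketched on made-up data using the standard formulas $F = (SSR/1)\,/\,(SSE/(n-2))$ and $t = \hat{\beta}_1 / s_{\hat{\beta}_1}$; with a single independent variable the two tests agree, since $F = t^2$:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]     # made-up observations
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
b0 = my - b1 * mx
yhat = [b0 + b1 * a for a in x]
ssr = sum((yh - my) ** 2 for yh in yhat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
mse = sse / (n - 2)                # mean square error

F = (ssr / 1) / mse                # F test: significance of the linear relationship
t = b1 / math.sqrt(mse / sxx)      # t test: significance of the regression coefficient
print(abs(F - t ** 2) < 1e-6)      # True: with one predictor, F = t^2
```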

13. Briefly explain the role of residual analysis in regression analysis.

The role of residual analysis in regression analysis is as follows:

(1) It is used to judge whether the assumptions about the model are established;
(2) It is used to identify outliers in the regression and observations that have a strong influence on the model.
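Both roles can be sketched on made-up data: least squares residuals sum to zero (a basic model check), and the observation with the largest standardized residual is the natural candidate for an outlier. The data and the use of standardized residuals are assumptions made for this illustration:

```python
import math

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.0, 10.1, 18.0]   # last point is a deliberate outlier

# Fit the line by least squares
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
b0 = my - b1 * mx

residuals = [yi - (b0 + b1 * a) for a, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)      # True: least squares residuals sum to 0

# Standardized residuals: residual divided by the residual standard error
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
std_res = [e / s for e in residuals]
suspect = max(range(n), key=lambda i: abs(std_res[i]))
print(suspect)                         # 5: the outlier has the largest residual
```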

Source: blog.csdn.net/J__aries/article/details/131317574