Regression analysis in R

Fitting regression models with lm()

Regression analysis is at the core of statistics. It is a broad concept, referring to methods that use one or more predictor variables (also called independent or explanatory variables) to predict a response variable (also called the dependent, criterion, or outcome variable).
In general, regression analysis can be used to identify the explanatory variables related to the response variable, to describe the form of the relationship between them, and to produce an equation for predicting the response variable from the explanatory variables.
Variations of regression analysis:
1. Simple linear - predicts a quantitative response variable from one quantitative explanatory variable
2. Polynomial - predicts a quantitative response variable from an nth-order polynomial in one quantitative explanatory variable
3. Multilevel - predicts a response variable for data with a hierarchical structure; also called hierarchical, nested, or mixed models
4. Multiple linear - predicts a quantitative response variable from two or more quantitative explanatory variables
5. Multivariate - predicts more than one response variable from one or more explanatory variables
6. Logistic - predicts a categorical response variable from one or more explanatory variables
7. Poisson - predicts a response variable representing counts (frequencies) from one or more explanatory variables
8. Cox proportional hazards - predicts the time to an event from one or more explanatory variables
9. Time series - models time-series data with correlated error terms
10. Nonlinear - predicts a quantitative response variable from one or more quantitative explanatory variables, where the model form is nonlinear
11. Nonparametric - predicts a quantitative response variable from one or more quantitative explanatory variables, where the model form is derived from the data rather than specified in advance
12. Robust - predicts a quantitative response variable from one or more quantitative explanatory variables, using methods resistant to the influence of strong outliers

Our focus in this chapter is ordinary least squares (OLS) regression, including simple linear regression, polynomial regression, and multiple linear regression.
To properly interpret the coefficients of an OLS model, the data must satisfy the following statistical assumptions:
1. Normality - for fixed values of the independent variables, the dependent variable is normally distributed
2. Independence - the Yi values are independent of each other
3. Linearity - the dependent variable is linearly related to the independent variables
4. Homoscedasticity - the variance of the dependent variable does not change across levels of the independent variables; also called constant variance
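Diagnostics are covered in detail later, but as a minimal sketch (using the women data set from the example below), the base plot() method for a fitted lm object draws four diagnostic plots commonly used to eyeball these assumptions:

fit <- lm(weight ~ height, data=women)   # the simple linear model fit below
par(mfrow=c(2,2))                        # arrange the four plots in a 2x2 grid
plot(fit)                                # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage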

8.2.1 Fitting a regression model with lm()
In R, lm() is the basic function for fitting linear models. Its format is

myfit <- lm(formula, data)

The formula parameter specifies the form of the model to fit, written as Y ~ X1 + X2 + X3 + ... + Xk, where the response variable is on the left of the ~ and the predictor variables are on the right.
The data parameter is a data frame containing the data used to fit the model.
The result is stored in a list object containing extensive information about the fitted model.

Symbols commonly used in R formulas:
~, the separator; the response variable goes on the left and the explanatory variables on the right
+, separates predictor variables
:, denotes an interaction between predictors. For example, to predict y from x, z, and the interaction of x and z, the code is y ~ x + z + x:z
*, shorthand for all possible interactions. For example, y ~ x * z * w expands to y ~ x + z + w + x:z + x:w + z:w + x:z:w
^, denotes interactions up to a specified degree. For example, y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w
., stands for all variables in the data frame other than the response variable. For example, if a data frame contains the variables x, y, z, and w, the code y ~ . expands to y ~ x + z + w
-, the minus sign removes a variable from the equation. For example, y ~ (x + z + w)^2 - x:w expands to y ~ x + z + w + x:z + z:w
-1, removes the intercept. For example, y ~ x - 1 fits a regression of y on x and forces the line through the origin
I(), interprets the contents of the parentheses arithmetically. For example, y ~ x + I((z + w)^2) expands to y ~ x + h, where h is a new variable equal to the square of the sum of z and w
function, mathematical functions can be used in formulas, e.g. log(y) ~ x + z + w
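As a small added illustration (using the built-in mtcars data set, which is not part of this section's examples), the * shorthand and its expanded form fit exactly the same model:

fit_star <- lm(mpg ~ hp * wt, data=mtcars)           # shorthand using *
fit_long <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)   # expanded equivalent
all.equal(coef(fit_star), coef(fit_long))            # TRUE: identical coefficients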

Other functions useful when working with a fitted linear model:
summary() - displays detailed results for the fitted model
coefficients() - lists the parameters (intercept and slopes) of the fitted model
confint() - provides confidence intervals for the model parameters (95% by default)
fitted() - lists the predicted values of the fitted model
residuals() - lists the residuals of the fitted model
anova() - generates an ANOVA table for one fitted model, or compares the ANOVA tables of two or more fitted models
vcov() - lists the covariance matrix of the model parameters
AIC() - outputs the Akaike information criterion
plot() - generates diagnostic plots for evaluating model fit
predict() - uses the fitted model to predict response values for a new data set
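A short self-contained sketch (added here for illustration, using the women data set from the example below) combines a few of these helpers to compare two nested models:

fit0 <- lm(weight ~ 1, data=women)        # intercept-only baseline model
fit1 <- lm(weight ~ height, data=women)   # model with height as predictor
anova(fit0, fit1)   # ANOVA table comparing the two nested models
AIC(fit0, fit1)     # the model with the lower AIC is preferred
vcov(fit1)          # covariance matrix of the parameter estimates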

When the regression model contains one dependent variable and one independent variable, it is called simple linear regression.
When there is only one predictor variable but the model also includes powers of that variable, it is called polynomial regression.
When there is more than one predictor variable, it is called multiple linear regression.

Example of simple linear regression:
the women data set in the base installation provides the heights and weights of 15 women aged 30 to 39;
we want to predict weight from height.
Fit a simple linear model with lm():

> fit <- lm(weight ~ height, data=women)
> fit

Call:
lm(formula = weight ~ height, data = women)

Coefficients:
(Intercept)       height  
     -87.52         3.45  

Use the summary() function to display more detailed results for the fitted model:

> summary(fit)

Call:
lm(formula = weight ~ height, data = women)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7333 -1.1333 -0.3833  0.7417  3.1167 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
height        3.45000    0.09114   37.85 1.09e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared:  0.991,	Adjusted R-squared:  0.9903 
F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14


From the output, the regression coefficient (3.45) is significantly different from 0 (p = 1.09e-14), meaning weight changes significantly with height.
The multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weight. It is also the square of the correlation between the actual and predicted values.
The residual standard error (1.53) can be thought of as the average error in predicting weight from height with this model.
The F statistic tests whether the predictor variables, taken together, predict the response variable above chance level. Since simple regression has only one predictor, the F test here is equivalent to the t test for the regression coefficient of height.
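As a quick check (not part of the original listing), the claim that R-squared equals the squared correlation between the actual and predicted values can be verified directly:

cor(women$weight, fitted(fit))^2   # ~0.991, matching the Multiple R-squared above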

> # Use coefficients() to list the parameters of the fitted model
> coefficients(fit)
(Intercept)      height 
  -87.51667     3.45000 
> # confint() provides confidence intervals for the model parameters (95% by default), showing how stable the estimates are
> confint(fit)
                  2.5 %     97.5 %
(Intercept) -100.342655 -74.690679
height         3.253112   3.646888
> # Output the women's weight data from the data frame
> women$weight
 [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> # List the fitted (predicted) values of the model for all the predictor values supplied, so predictions can be compared with the dependent variable
> fitted(fit)
       1        2        3        4        5        6        7        8        9 
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833 
      10       11       12       13       14       15 
143.6333 147.0833 150.5333 153.9833 157.4333 160.8833 
> # Use the fitted model to predict response values for a new data set
> predict(fit,newdata=data.frame(height=c(90,100)))
       1        2 
222.9833 257.4833 
> # List the residuals of the fitted model
> residuals(fit)
          1           2           3           4           5           6           7 
 2.41666667  0.96666667  0.51666667  0.06666667 -0.38333333 -0.83333333 -1.28333333 
          8           9          10          11          12          13          14 
-1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333  0.01666667  1.56666667 
         15 
 3.11666667 
> plot(women$height,women$weight,xlab="Height",ylab="weight")
> # Draw the regression line (note that abline() only draws straight lines; it cannot draw a curved regression line)
> abline(fit)

As can be seen from the figure, a curved line could improve the accuracy of the prediction, so it is worth trying polynomial regression.

[Figure: scatter plot of height vs. weight with the fitted regression line]

8.2.3 Polynomial regression
The figure from the previous example suggests that adding a quadratic term could improve the prediction accuracy of the regression.
Example:

> fit2 <- lm(weight ~ height + I(height^2), data=women)
> summary(fit2)

Call:
lm(formula = weight ~ height + I(height^2), data = women)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50941 -0.29611 -0.00941  0.28615  0.59706 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 261.87818   25.19677  10.393 2.36e-07 ***
height       -7.34832    0.77769  -9.449 6.58e-07 ***
I(height^2)   0.08306    0.00598  13.891 9.32e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared:  0.9995,	Adjusted R-squared:  0.9994 
F-statistic: 1.139e+04 on 2 and 12 DF,  p-value: < 2.2e-16

> # Output confidence intervals for the model parameters
> confint(fit2)
                   2.5 %       97.5 %
(Intercept) 206.97913605 316.77723111
height       -9.04276525  -5.65387341
I(height^2)   0.07003547   0.09609252
> # Output the parameters of the fitted model
> coefficients(fit2)
 (Intercept)       height  I(height^2) 
261.87818358  -7.34831933   0.08306399 
> women$weight
 [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> # Output the fitted (predicted) values of the model
> fitted(fit2)
       1        2        3        4        5        6        7        8        9 
115.1029 117.4731 120.0094 122.7118 125.5804 128.6151 131.8159 135.1828 138.7159 
      10       11       12       13       14       15 
142.4151 146.2804 150.3118 154.5094 158.8731 163.4029 
> # Output the residuals
> residuals(fit2)
           1            2            3            4            5            6 
-0.102941176 -0.473109244 -0.009405301  0.288170653  0.419618617  0.384938591 
           7            8            9           10           11           12 
 0.184130575 -0.182805430  0.284130575 -0.415061409 -0.280381383 -0.311829347 
          13           14           15 
-0.509405301  0.126890756  0.597058824 
> # Plot the data
> plot(women$height,women$weight,xlab="Height",ylab="Weight")
> # lines() adds a line to an existing plot; the call lines(x,y) is equivalent to plot(x,y,type="l")
> lines(women$height,fitted(fit2))

[Figure: scatter plot of height vs. weight with the fitted quadratic curve]
Note that a polynomial regression equation is still a linear model, because the equation is still a weighted sum of the predictors.
A truly nonlinear model has a form such as y = B0 + B1 * e^(x/B2);
a model such as y = log(x1) + sin(x2) is still considered linear, because it is linear in its parameters.
Note also that fitting terms of degree three or higher is rarely necessary.
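To see why, a hedged sketch (not in the original post): fit a cubic term and test whether it improves on the quadratic model fit2:

fit3 <- lm(weight ~ height + I(height^2) + I(height^3), data=women)
anova(fit2, fit3)   # if the cubic term is not significant, the extra degree is unnecessary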

The scatterplot() function in the car package makes it easy to generate a bivariate scatter plot.
scatterplot() provides the scatter plot together with a linear fit line and a smoothed (loess) curve, and also displays marginal box plots for each variable.
The spread=FALSE option suppresses spread and asymmetry information about the smoothed curve,
and the smoother.args=list(lty=2) option draws the loess fit as a dashed line.

library(car)
scatterplot(weight~height,data=women,spread=FALSE,smoother.args=list(lty=2),pch=19,
            main="Women Age 30-39",
            xlab="Height (inches)",
            ylab="Weight (lbs.)")

[Figure: enhanced scatter plot of women's height vs. weight produced by car::scatterplot()]

8.2.4 Multiple linear regression
When there is more than one predictor variable, simple linear regression becomes multiple linear regression.
We use the state.x77 data set from the base installation to explore the relationship between a state's murder rate and other factors, including population, illiteracy rate, average income, and number of frost days.

> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> names(states)
[1] "Murder"     "Population" "Illiteracy" "Income"     "Frost"     

In multiple regression analysis, the first step is to examine the correlations between the variables.
The cor() function provides the pairwise correlation coefficients, and the scatterplotMatrix() function in the car package generates a scatter plot matrix.
Example:

> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> names(states)
[1] "Murder"     "Population" "Illiteracy" "Income"     "Frost"     
> cor(states)
               Murder Population Illiteracy     Income      Frost
Murder      1.0000000  0.3436428  0.7029752 -0.2300776 -0.5388834
Population  0.3436428  1.0000000  0.1076224  0.2082276 -0.3321525
Illiteracy  0.7029752  0.1076224  1.0000000 -0.4370752 -0.6719470
Income     -0.2300776  0.2082276 -0.4370752  1.0000000  0.2262822
Frost      -0.5388834 -0.3321525 -0.6719470  0.2262822  1.0000000
> library(car)
> scatterplotMatrix(states,spread=FALSE,smoother.args=list(lty=2),main="Scatter Plot Matrix")

[Figure: scatter plot matrix of the states data]
By default, scatterplotMatrix() plots the pairwise scatter plots in the off-diagonal cells and adds smoothed and linear fit lines.
The diagonal cells contain a density plot and rug plot for each variable.
From the plot, the murder rate is bimodal, and each predictor variable is skewed to some degree. Murder rates rise with population and illiteracy, and fall with income and frost days.
Fit the multiple linear regression model with lm():

> states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> summary(fit)

Call:
lm(formula = Murder ~ ., data = states)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
Population  2.237e-04  9.052e-05   2.471   0.0173 *  
Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
Income      6.442e-05  6.837e-04   0.094   0.9253    
Frost       5.813e-04  1.005e-02   0.058   0.9541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567,	Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

> residuals(fit)
       Alabama         Alaska        Arizona       Arkansas     California 
    4.11179210     3.27433977    -1.68700264     0.26668056    -0.57424792 
      Colorado    Connecticut       Delaware        Florida        Georgia 
    1.68594493    -3.81042204     0.73768277     1.91178879     2.97838044 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
   -3.41984294     1.05927673     2.42954793     1.41893921    -2.02545720 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
   -0.09731294     1.68494109    -0.72117551    -2.00277259     2.21479548 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
   -4.15834611     3.72023253    -2.69149081     0.57035176     3.34806321 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
    0.74271628    -1.53684814     7.62104160    -1.39312273    -2.63613735 
    New Mexico       New York North Carolina   North Dakota           Ohio 
   -1.20643853    -0.54123176     0.89516295    -3.72716517     0.08408941 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
   -0.30323290    -0.35693643    -2.29005973    -4.79596633    -0.06492602 
  South Dakota      Tennessee          Texas           Utah        Vermont 
   -2.12674691     1.50235895    -1.17607560     0.17189817     1.32479323 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
    0.99906692    -0.54828949    -1.02808130    -2.53545933     2.70090366 
> confint(fit)
                    2.5 %       97.5 %
(Intercept) -6.552191e+00 9.0213182149
Population   4.136397e-05 0.0004059867
Illiteracy   2.381799e+00 5.9038743192
Income      -1.312611e-03 0.0014414600
Frost       -1.966781e-02 0.0208304170
> fitted(fit)
       Alabama         Alaska        Arizona       Arkansas     California 
     10.988208       8.025660       9.487003       9.833319      10.874248 
      Colorado    Connecticut       Delaware        Florida        Georgia 
      5.114055       6.910422       5.462317       8.788211      10.921620 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
      9.619843       4.240723       7.870452       5.681061       4.325457 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
      4.597313       8.915059      13.921176       4.702773       6.285205 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
      7.458346       7.379767       4.991491      11.929648       5.951937 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
      4.257284       4.436848       3.878958       4.693123       7.836137 
    New Mexico       New York North Carolina   North Dakota           Ohio 
     10.906439      11.441232      10.204837       5.127165       7.315911 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
      6.703233       4.556936       8.390060       7.195966      11.664926 
  South Dakota      Tennessee          Texas           Utah        Vermont 
      3.826747       9.497641      13.376076       4.328102       4.175207 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
      8.500933       4.848289       7.728081       5.535459       4.199096 

When there is more than one predictor variable, a regression coefficient is interpreted as the amount the dependent variable increases when that predictor increases by one unit, holding all other predictors constant.
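A small constructed check (not in the original post) makes this interpretation concrete with predict(): increasing Illiteracy by one unit while holding the other predictors at their means shifts the predicted murder rate by exactly the Illiteracy coefficient (about 4.14):

base  <- data.frame(Population=mean(states$Population),
                    Illiteracy=mean(states$Illiteracy),
                    Income=mean(states$Income),
                    Frost=mean(states$Frost))
plus1 <- transform(base, Illiteracy=Illiteracy + 1)   # same point, Illiteracy one unit higher
predict(fit, plus1) - predict(fit, base)              # equals coef(fit)["Illiteracy"]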

8.2.5 Multiple linear regression with interactions
If the interaction between two predictor variables is significant, the relationship between the response variable and one of the predictors depends on the level of the other predictor.

> fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)
> summary(fit)

Call:
lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
hp          -0.12010    0.02470  -4.863 4.04e-05 ***
wt          -8.21662    1.26971  -6.471 5.20e-07 ***
hp:wt        0.02785    0.00742   3.753 0.000811 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,	Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13


The effect() function in the effects package can be used to display interaction results.
The format is
plot(effect(term, mod, xlevels), multiline=TRUE)
where term is the quoted model term of interest, mod is the fitted model returned by lm(), and xlevels is a list fixing one of the predictors at several constant values. Each fixed value turns the fitted two-predictor equation into a simple one-predictor line, and the resulting lines are plotted together.

library(effects)
plot(effect("hp:wt", fit, xlevels=list(wt=c(2.2,3.2,4.2))), multiline=TRUE)

[Figure: mpg vs. hp effect plot for the hp:wt interaction at wt = 2.2, 3.2, 4.2]
The plot shows that the slope of the mpg-versus-hp line changes as wt changes, indicating that hp and wt interact in predicting mpg. If there were no interaction, the lines for different wt values would differ only in intercept, not in slope.
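To make this concrete (a small derivation added here from the coefficients in the summary above): the slope of mpg on hp at a given wt is the hp coefficient plus the hp:wt coefficient times wt, so the effect of horsepower weakens as weight increases:

b <- coef(fit)
b["hp"] + b["hp:wt"] * c(2.2, 3.2, 4.2)   # hp slopes at wt=2.2, 3.2, 4.2: about -0.059, -0.031, -0.003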
