Fitting regression models with lm()
Regression analysis, a broad concept at the core of statistics, uses one or more predictors (also known as independent or explanatory variables) to predict a response variable (also known as the dependent, criterion, or outcome variable).
Generally, regression analysis can be used to identify the explanatory variables that are related to the response variable, to describe the relationship between them, and to generate an equation that predicts the response variable from the explanatory variables.
Varieties of regression analysis:
1. Simple linear — predicts a quantitative response variable from one quantitative explanatory variable
2. Polynomial — predicts a quantitative response variable from one quantitative explanatory variable, modeling the relationship as an nth-order polynomial
3. Multilevel — predicts a response variable from data with a hierarchical structure; also called hierarchical, nested, or mixed models
4. Multiple linear — predicts a quantitative response variable from two or more quantitative explanatory variables
5. Multivariate — predicts more than one response variable from one or more explanatory variables
6. Logistic — predicts a categorical response variable from one or more explanatory variables
7. Poisson — predicts a response variable representing counts from one or more explanatory variables
8. Cox proportional hazards — predicts time to an event from one or more explanatory variables
9. Time series — models time-series data with correlated error terms
10. Nonlinear — predicts a quantitative response variable from one or more quantitative explanatory variables, where the form of the model is nonlinear
11. Nonparametric — predicts a quantitative response variable from one or more quantitative explanatory variables, with the form of the model derived from the data rather than specified in advance
12. Robust — predicts a quantitative response variable from one or more quantitative explanatory variables, using an approach resistant to the influence of strong outliers
Our focus in this chapter is ordinary least squares (OLS) regression, including simple linear regression, polynomial regression, and multiple linear regression.
To properly interpret the coefficients of an OLS model, the data must satisfy the following statistical assumptions:
1. Normality — for fixed values of the independent variables, the dependent variable is normally distributed
2. Independence — the Yi values are independent of each other
3. Linearity — the dependent variable is linearly related to the independent variables
4. Homoscedasticity — the variance of the dependent variable does not change across levels of the independent variables; also called constant variance
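These assumptions can be checked graphically once a model has been fit. As a quick sketch (a preview using the built-in women dataset that appears later in this section), R's plot() method for lm objects produces four standard diagnostic plots:

```r
# Sketch: standard diagnostic plots for eyeballing the OLS assumptions.
# Uses the built-in women dataset; plot() on an lm object is base R.
fit <- lm(weight ~ height, data = women)
par(mfrow = c(2, 2))   # arrange the four diagnostics in a 2x2 grid
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, leverage
```

The residuals-vs-fitted and scale-location panels speak to linearity and homoscedasticity, the Q-Q plot to normality.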
8.2.1 Fitting regression models with lm()
In R, the basic function for fitting a linear model is lm(), with the format
myfit <- lm(formula, data)
The formula parameter specifies the form of the model to fit; it takes the form Y ~ X1 + X2 + X3 + ... + Xk, with the response variable on the left of the ~ and the predictor variables on the right.
The data parameter is a data frame containing the data used to fit the model.
The result is stored in a list object containing extensive information about the fitted model.
Symbols commonly used in R formulas:
~ — separator; the response variable goes on the left and the explanatory variables on the right
+ — separates predictor variables
: — denotes an interaction between predictors. For example, to predict y from x, z, and the interaction of x and z, the code is y ~ x + z + x:z
* — shorthand for all possible interactions. The code y ~ x * z * w expands to y ~ x + z + w + x:z + x:w + z:w + x:z:w
^ — interactions up to a specified degree. The code y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w
. — a placeholder for all variables in the data frame other than the response. If a data frame contains the variables x, y, z, and w, the code y ~ . expands to y ~ x + z + w
- — a minus sign removes a variable from the equation. For example, y ~ (x + z + w)^2 - x:w expands to y ~ x + z + w + x:z + z:w
-1 — deletes the intercept. For example, y ~ x - 1 fits a regression of y on x and forces the line through the origin
I() — interprets the contents of the parentheses arithmetically. For example, y ~ x + I((z + w)^2) expands to y ~ x + h, where h is a new variable equal to the square of the sum of z and w
function — mathematical functions can be used inside formulas, for example log(y) ~ x + z + w
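As a quick check of the expansion rules above (a sketch using made-up data, not an example from the text), the * shorthand fits exactly the same model as the explicit main effects plus interaction:

```r
# Sketch: verify that y ~ x * z is shorthand for y ~ x + z + x:z.
set.seed(1)
d <- data.frame(x = rnorm(20), z = rnorm(20))
d$y <- 1 + 2 * d$x - d$z + 0.5 * d$x * d$z + rnorm(20)

fit_star <- lm(y ~ x * z, data = d)         # shorthand form
fit_long <- lm(y ~ x + z + x:z, data = d)   # explicit expansion

all.equal(coef(fit_star), coef(fit_long))   # TRUE: identical models
```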
Other functions useful when fitting linear models:
summary() — displays detailed results for the fitted model
coefficients() — lists the parameters (intercept and slopes) of the fitted model
confint() — provides confidence intervals for the model parameters (95% by default)
fitted() — lists the predicted values of the fitted model
residuals() — lists the residuals of the fitted model
anova() — generates an ANOVA table for one fitted model, or compares the ANOVA tables of two or more fitted models
vcov() — lists the covariance matrix of the model parameters
AIC() — prints the Akaike Information Criterion
plot() — generates diagnostic plots for evaluating the fit of a model
predict() — uses the fitted model to predict response values for a new dataset
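A brief sketch of the helpers that are not demonstrated later in this section (anova(), vcov(), AIC()), using the built-in women dataset and the two models fit below:

```r
# Sketch: comparing a linear and a quadratic model with anova()/AIC(),
# and inspecting the parameter covariance matrix with vcov().
fit1 <- lm(weight ~ height, data = women)
fit2 <- lm(weight ~ height + I(height^2), data = women)

vcov(fit1)         # covariance matrix of the parameter estimates
AIC(fit1)          # lower AIC indicates a better trade-off of fit vs complexity
AIC(fit2)          # the quadratic model wins here
anova(fit1, fit2)  # F test comparing the two nested models
```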
When a regression model contains one dependent variable and one independent variable, it is called simple linear regression.
When there is only one predictor variable but powers of that variable are included, it is called polynomial regression.
When there is more than one predictor variable, it is called multiple linear regression.
Example of simple linear regression
The built-in dataset women provides the heights and weights of 15 women aged 30 to 39.
We predict weight from height,
fitting a simple linear model with lm():
> fit <- lm(weight ~ height, data=women)
> fit
Call:
lm(formula = weight ~ height, data = women)
Coefficients:
(Intercept) height
-87.52 3.45
Use the summary() function to display more detailed results for the fitted model:
summary(fit)
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
From the output, the regression coefficient (3.45) is significantly different from 0 (p = 1.09e-14),
and the multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weight. It is also the square of the correlation between the actual and predicted values.
The residual standard error (1.53) can be thought of as the average error in predicting weight from height with this model.
The F statistic tests whether the predictor variables, taken together, predict the response variable above chance level. Since a simple regression has only one predictor, here the F test is equivalent to the t test for the regression coefficient of height.
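That equivalence is easy to verify: the F statistic reported by summary() equals the square of the t statistic for height (a sketch using the same fitted model):

```r
# Sketch: for a single predictor, F equals t squared.
fit <- lm(weight ~ height, data = women)
s <- summary(fit)
t_height <- s$coefficients["height", "t value"]  # t statistic for height
f_stat   <- s$fstatistic["value"]                # overall F statistic
all.equal(unname(f_stat), t_height^2)            # TRUE
```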
> # use coefficients() to list the parameters of the fitted model
> coefficients(fit)
(Intercept) height
-87.51667 3.45000
> # provide confidence intervals for the model parameters (95% by default); these indicate how stable the estimates are
> confint(fit)
2.5 % 97.5 %
(Intercept) -100.342655 -74.690679
height 3.253112 3.646888
> # print the women's weight values from the data frame
> women$weight
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> # list the fitted values of the model (the prediction for every observation used in fitting, so predicted and actual values can be compared)
> fitted(fit)
1 2 3 4 5 6 7 8 9
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333 140.1833
10 11 12 13 14 15
143.6333 147.0833 150.5333 153.9833 157.4333 160.8833
> # use the fitted model to predict the response for a new dataset
> predict(fit,newdata=data.frame(height=c(90,100)))
1 2
222.9833 257.4833
> # list the residuals of the fitted model
> residuals(fit)
1 2 3 4 5 6 7
2.41666667 0.96666667 0.51666667 0.06666667 -0.38333333 -0.83333333 -1.28333333
8 9 10 11 12 13 14
-1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333 0.01666667 1.56666667
15
3.11666667
> plot(women$height,women$weight,xlab="Height",ylab="weight")
> # draw the regression line (note that abline() draws straight lines only; it cannot draw curved fits)
> abline(fit)
As can be seen from the figure, a curved line might improve the accuracy of the prediction, so we can try polynomial regression.
8.2.3 Polynomial regression
From the previous figure it appears that adding a quadratic term may improve the prediction accuracy of the regression.
Example:
fit2 <- lm(weight ~ height + I(height^2), data=women)
> summary(fit2)
Call:
lm(formula = weight ~ height + I(height^2), data = women)
Residuals:
Min 1Q Median 3Q Max
-0.50941 -0.29611 -0.00941 0.28615 0.59706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 261.87818 25.19677 10.393 2.36e-07 ***
height -7.34832 0.77769 -9.449 6.58e-07 ***
I(height^2) 0.08306 0.00598 13.891 9.32e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3841 on 12 degrees of freedom
Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
F-statistic: 1.139e+04 on 2 and 12 DF, p-value: < 2.2e-16
> # print confidence intervals for the model parameters
> confint(fit2)
2.5 % 97.5 %
(Intercept) 206.97913605 316.77723111
height -9.04276525 -5.65387341
I(height^2) 0.07003547 0.09609252
> # print the parameters of the fitted model
> coefficients(fit2)
(Intercept) height I(height^2)
261.87818358 -7.34831933 0.08306399
> women$weight
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> # print the fitted values of the model
> fitted(fit2)
1 2 3 4 5 6 7 8 9
115.1029 117.4731 120.0094 122.7118 125.5804 128.6151 131.8159 135.1828 138.7159
10 11 12 13 14 15
142.4151 146.2804 150.3118 154.5094 158.8731 163.4029
> # print the residuals
> residuals(fit2)
1 2 3 4 5 6
-0.102941176 -0.473109244 -0.009405301 0.288170653 0.419618617 0.384938591
7 8 9 10 11 12
0.184130575 -0.182805430 0.284130575 -0.415061409 -0.280381383 -0.311829347
13 14 15
-0.509405301 0.126890756 0.597058824
> # draw the scatter plot
> plot(women$height,women$weight,xlab="Height",ylab="Weight")
> # lines() adds a line to an existing plot; the call lines(x, y) draws the same line that plot(x, y, type="l") would
> lines(women$height,fitted(fit2))
Note that a polynomial regression is still a linear regression model, because the equation is still a weighted sum of the predictors; it is linear in the parameters.
Only a model that is nonlinear in its parameters, such as y = B0 + B1 * e^(x/B2), is a nonlinear model;
a model like y = log(x1) + sin(x2) is still considered linear.
Note that polynomial terms of degree higher than cubic are rarely necessary.
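As a sketch of that point, a cubic term can be added with another I() call, but for the women data it buys almost nothing over the quadratic fit:

```r
# Sketch: cubic polynomial fit; the gain over the quadratic model is marginal.
fit2 <- lm(weight ~ height + I(height^2), data = women)
fit3 <- lm(weight ~ height + I(height^2) + I(height^3), data = women)

summary(fit2)$r.squared  # 0.9995, from the output above
summary(fit3)$r.squared  # only slightly higher
```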
The scatterplot() function in the car package can quickly generate a bivariate relationship plot.
scatterplot() draws a scatter plot with a linear fit line and a smoothed (loess) curve, and also displays a box plot for each variable in the margins.
The spread=FALSE option suppresses the spread and asymmetry information around the smoothed curve,
and the smoother.args=list(lty=2) option draws the loess fit as a dashed line.
library(car)
scatterplot(weight~height,data=women,spread=FALSE,smoother.args=list(lty=2),pch=19,
main="Women Age 30-39",
xlab="Height (inches)",
ylab="Weight (lbs.)")
8.2.4 Multiple linear regression
When there is more than one predictor variable, simple linear regression becomes multiple linear regression.
We use the state.x77 dataset in the base package to explore the relationship between a state's murder rate and other factors, including population, illiteracy rate, average income, and number of frost days.
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
names(states)
[1] "Murder" "Population" "Illiteracy" "Income" "Frost"
In multiple regression analysis, the first step is to examine the correlations among the variables.
The cor() function provides bivariate correlation coefficients, and the scatterplotMatrix() function in the car package generates a scatter plot matrix.
Example:
> cor(states)
Murder Population Illiteracy Income Frost
Murder 1.0000000 0.3436428 0.7029752 -0.2300776 -0.5388834
Population 0.3436428 1.0000000 0.1076224 0.2082276 -0.3321525
Illiteracy 0.7029752 0.1076224 1.0000000 -0.4370752 -0.6719470
Income -0.2300776 0.2082276 -0.4370752 1.0000000 0.2262822
Frost -0.5388834 -0.3321525 -0.6719470 0.2262822 1.0000000
> library(car)
> scatterplotMatrix(states,spread=FALSE,smoother.args=list(lty=2),main="Scatter Plot Matrix")
By default, scatterplotMatrix() plots the scatter plots between each pair of variables in the off-diagonal cells, adding smoothed and linear fit lines.
The diagonal cells contain a density plot and a rug plot for each variable.
The plot shows that the murder rate is bimodal and each predictor variable is skewed to some extent; the murder rate rises with population and illiteracy, and falls with income level and frost days.
Fit the multiple linear regression model with lm():
states <- as.data.frame(state.x77[,c("Murder","Population","Illiteracy","Income","Frost")])
> fit <- lm(Murder~.,data=states)
> summary(fit)
Call:
lm(formula = Murder ~ ., data = states)
Residuals:
Min 1Q Median 3Q Max
-4.7960 -1.6495 -0.0811 1.4815 7.6210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.235e+00 3.866e+00 0.319 0.7510
Population 2.237e-04 9.052e-05 2.471 0.0173 *
Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
Income 6.442e-05 6.837e-04 0.094 0.9253
Frost 5.813e-04 1.005e-02 0.058 0.9541
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
> residuals(fit)
Alabama Alaska Arizona Arkansas California
4.11179210 3.27433977 -1.68700264 0.26668056 -0.57424792
Colorado Connecticut Delaware Florida Georgia
1.68594493 -3.81042204 0.73768277 1.91178879 2.97838044
Hawaii Idaho Illinois Indiana Iowa
-3.41984294 1.05927673 2.42954793 1.41893921 -2.02545720
Kansas Kentucky Louisiana Maine Maryland
-0.09731294 1.68494109 -0.72117551 -2.00277259 2.21479548
Massachusetts Michigan Minnesota Mississippi Missouri
-4.15834611 3.72023253 -2.69149081 0.57035176 3.34806321
Montana Nebraska Nevada New Hampshire New Jersey
0.74271628 -1.53684814 7.62104160 -1.39312273 -2.63613735
New Mexico New York North Carolina North Dakota Ohio
-1.20643853 -0.54123176 0.89516295 -3.72716517 0.08408941
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
-0.30323290 -0.35693643 -2.29005973 -4.79596633 -0.06492602
South Dakota Tennessee Texas Utah Vermont
-2.12674691 1.50235895 -1.17607560 0.17189817 1.32479323
Virginia Washington West Virginia Wisconsin Wyoming
0.99906692 -0.54828949 -1.02808130 -2.53545933 2.70090366
> confint(fit)
2.5 % 97.5 %
(Intercept) -6.552191e+00 9.0213182149
Population 4.136397e-05 0.0004059867
Illiteracy 2.381799e+00 5.9038743192
Income -1.312611e-03 0.0014414600
Frost -1.966781e-02 0.0208304170
> fitted(fit)
Alabama Alaska Arizona Arkansas California
10.988208 8.025660 9.487003 9.833319 10.874248
Colorado Connecticut Delaware Florida Georgia
5.114055 6.910422 5.462317 8.788211 10.921620
Hawaii Idaho Illinois Indiana Iowa
9.619843 4.240723 7.870452 5.681061 4.325457
Kansas Kentucky Louisiana Maine Maryland
4.597313 8.915059 13.921176 4.702773 6.285205
Massachusetts Michigan Minnesota Mississippi Missouri
7.458346 7.379767 4.991491 11.929648 5.951937
Montana Nebraska Nevada New Hampshire New Jersey
4.257284 4.436848 3.878958 4.693123 7.836137
New Mexico New York North Carolina North Dakota Ohio
10.906439 11.441232 10.204837 5.127165 7.315911
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
6.703233 4.556936 8.390060 7.195966 11.664926
South Dakota Tennessee Texas Utah Vermont
3.826747 9.497641 13.376076 4.328102 4.175207
Virginia Washington West Virginia Wisconsin Wyoming
8.500933 4.848289 7.728081 5.535459 4.199096
When there is more than one predictor variable, a regression coefficient is interpreted as the expected change in the dependent variable for a one-unit increase in that predictor, holding all other predictors constant.
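That interpretation can be verified directly with predict(): two hypothetical states (the values below are made up for illustration) differing only by one unit of Illiteracy differ in predicted murder rate by exactly the Illiteracy coefficient:

```r
# Sketch: a one-unit change in one predictor, others held fixed, shifts the
# prediction by that predictor's coefficient. Values are illustrative only.
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ ., data = states)

base <- data.frame(Population = 4000, Illiteracy = 1.0,
                   Income = 4500, Frost = 100)
plus <- transform(base, Illiteracy = Illiteracy + 1)  # only Illiteracy changes

diff(predict(fit, rbind(base, plus)))  # equals coef(fit)["Illiteracy"]
```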
8.2.5 Multiple linear regression with interactions
If the interaction between two predictor variables is significant, the relationship between the response variable and one of the predictors depends on the level of the other predictor.
fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)
> summary(fit)
Call:
lm(formula = mpg ~ hp + wt + hp:wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.0632 -1.6491 -0.7362 1.4211 4.5513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
hp -0.12010 0.02470 -4.863 4.04e-05 ***
wt -8.21662 1.26971 -6.471 5.20e-07 ***
hp:wt 0.02785 0.00742 3.753 0.000811 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
The effect() function in the effects package can display the results of an interaction.
Its format is
plot(effect(term, mod, xlevels), multiline=TRUE)
Based on the model equation produced by lm(), fixing one of the predictor variables at several given values yields a set of simple linear equations in the other predictor, which are then plotted.
library(effects)
plot(effect("hp:wt",fit,,list(wt=c(2.2,3.2,4.2))),multiline=TRUE)
The plot shows that the slope of the relationship between hp and mpg changes as wt changes, indicating that hp and wt interact in predicting mpg. If there were no interaction, the lines for different values of wt would differ in intercept but have the same slope.
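The changing slopes in the plot can be computed directly from the coefficients: the simple slope of hp at a given wt is b_hp + b_hp:wt * wt (a sketch using the same wt values as the effect plot):

```r
# Sketch: simple slopes of hp at the wt values shown in the effect plot.
fit <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
b <- coef(fit)
slopes <- sapply(c(2.2, 3.2, 4.2), function(w) b["hp"] + b["hp:wt"] * w)
slopes  # the hp slope becomes less negative as wt increases
```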