Extending Linear Models: Non-Linearity

Sources:

require(knitr)
## Loading required package: knitr

Overview

Here we relax the linearity assumption of popular linear techniques. This is motivated by the simple truth that sometimes the linear assumption is simply a poor approximation. There are many ways we can approach this problem, some of which are addressed by reducing model complexity using regularization methods. However, those techniques still use a linear model, which can only be improved upon so far. This notebook focuses on extensions of linear models…

  • Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. Cubic regression uses the three variables X, X^2, and X^3 as predictors. This is a simple way to provide a non-linear fit to the data.

  • Step functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.

  • Regression splines are more flexible than polynomials and step functions, and are actually an extension of the two. They divide the range of X into K distinct regions. Within each region a polynomial function is fit to the data; however, the polynomials are constrained so that they join smoothly at the region boundaries, or knots. These can provide an extremely flexible fit.

  • Local regression is similar to regression splines, but the regions are allowed to overlap, and they do so in a smooth way.

  • Smoothing splines are similar to regression splines as well, but they result from minimizing a residual sum of squares criterion subject to a smoothness penalty.

  • Generalized additive models allow extension of the above methods to deal with multiple predictors.

Polynomial Regression

This is the most traditional way to extend the linear model. As we increase the degree d of the polynomial, polynomial regression can produce very non-linear curves while still estimating the coefficients using least squares. Generally, it is unusual to use d greater than 3 or 4, because beyond this we risk overfitting the model and generating too much variance. A degree-4 (d = 4) polynomial fit of wage on age is worked through in the code examples below.
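In symbols, for a single predictor X the degree-d polynomial regression model is

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_d x_i^d + \epsilon_i ,

which is still linear in the coefficients, so it can be fit with ordinary least squares.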

Step Functions

Using polynomial functions (above) of the features as predictors imposes a global structure on the non-linear function of X. Instead, we could use step functions to avoid such global structure. Here we break X into bins and fit a different constant in each bin. Essentially, we create the bins by selecting K cut-points in the range of X, and then construct K + 1 new variables, which behave like dummy variables. Unfortunately, unless there are natural breakpoints in the data, step functions risk missing real trends in the data. Despite this, they are used regularly in biostatistics and epidemiology; for example, 5-year age groups are often used as the bins. A step function fit on the same Wage dataset is shown in the code examples below.
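In symbols, for cut-points c_1 < c_2 < \dots < c_K we construct the indicator (dummy) variables

C_0(X) = I(X < c_1), \quad C_k(X) = I(c_k \le X < c_{k+1}) \ \text{for } k = 1, \dots, K-1, \quad C_K(X) = I(c_K \le X),

and then fit

y_i = \beta_0 + \beta_1 C_1(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i

by least squares; C_0 is omitted because it is redundant given the intercept.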

Regression Splines

Regression splines are one of many basis function approaches that extend polynomial and stepwise regression techniques. In fact, polynomial and stepwise regression are just specific cases of the basis function approach. The term basis function simply means that we apply a known transformation to X ahead of time, before fitting a linear model.
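In symbols, we choose fixed, known functions b_1, \dots, b_K and fit the model

y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_K b_K(x_i) + \epsilon_i

by ordinary least squares. Polynomial regression corresponds to b_j(x) = x^j, and step functions to b_j(x) = I(c_j \le x < c_{j+1}).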

For regression splines, instead of fitting a high-degree polynomial over the entire range of X, we fit several low-degree polynomials over different regions of X. This amounts to essentially fitting a polynomial regression model to subsets of the data. Each of these functions can be fit using least squares.

For example, an unconstrained piecewise cubic fit is typically very discontinuous at the knots, which looks unnatural for most data.

To solve this issue, a better solution is to employ constraints so that the fitted curve is continuous (and typically has continuous first and second derivatives) at the knots. Unfortunately, even when splines are continuous, they are susceptible to high variance at the outer range of the predictors, where there may be fewer observations. Thus, another type of constraint can be applied that forces the fit to be linear beyond the boundary knots. This is called a natural spline.
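One convenient way to build in these continuity constraints is the truncated power basis: a cubic spline with knots \xi_1, \dots, \xi_K can be written as

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 h(x_i, \xi_1) + \dots + \beta_{K+3} h(x_i, \xi_K) + \epsilon_i , \quad \text{where } h(x, \xi) = (x - \xi)^3_+ ,

that is, h(x, \xi) equals (x - \xi)^3 for x > \xi and 0 otherwise. Each truncated power term changes only the third derivative at its knot, so the fitted curve and its first two derivatives stay continuous.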

Choosing the Location and Number of Knots

One option would be to place more knots in regions where we think the function might vary most rapidly, and fewer knots where it is more stable. This can work well, but in practice it is more common to place knots in a uniform fashion. One way to do this is to specify a desired number of degrees of freedom and then have software automatically place the knots at uniform quantiles of the data. For example, if we specified 4 degrees of freedom for the Wage dataset, we would place 3 knots at the 25th, 50th and 75th percentiles of age.

To be clear, in this case there are actually 5 knots, including the two boundary knots. A cubic spline with 5 knots has 9 degrees of freedom, but the constraints imposed on a natural cubic spline (two at each boundary) reduce the model complexity by 4, leaving 5. One of these is the constant, which is absorbed into the intercept, so we count the fit as having 4 degrees of freedom.
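As a quick check, assuming the splines package and the ISLR Wage data used in the code examples below, ns() places the interior knots at exactly these percentiles when we request 4 degrees of freedom:

library(ISLR)
library(splines)
attr(ns(Wage$age, df=4), "knots")  # interior knots at the 25th, 50th and 75th percentiles of age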

So how many knots should we use? Or, equivalently, how many degrees of freedom should the spline contain? A simple option is to try several numbers of knots and see which produces the best-looking curve. A more objective approach is to use cross-validation. To do this we remove some portion of the data, fit a spline with a certain number of knots to the rest, and then use the fitted spline to make predictions on the held-out portion. We repeat this step multiple times and compute the overall cross-validated RSS. The entire process can be repeated with different numbers of knots K, and we then choose the value of K with the smallest cross-validated RSS; a sketch of this procedure in R is given below.
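A minimal sketch of that procedure, using the boot and splines packages and the Wage data from the code examples below (the range of degrees of freedom tried here is arbitrary):

library(ISLR); library(splines); library(boot)
set.seed(1)
cv.err <- rep(NA, 8)
for (d in 3:10) {                                 # natural splines with 3 to 10 degrees of freedom
  spline.fit <- glm(wage~ns(age, df=d), data=Wage)
  cv.err[d - 2] <- cv.glm(Wage, spline.fit, K=10)$delta[1]
}
which.min(cv.err) + 2                             # df with the smallest cross-validated error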

Compared to polynomial regression, splines can be shown to be much more stable, particularly near the tails of the data. A spline adds flexibility by adding knots while keeping the polynomial degree fixed, whereas a polynomial must use a high degree to achieve the same flexibility, which tends to produce wild behaviour near the boundaries.

Smoothing Splines

In the last section we discussed regression splines, which are created by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients. Smoothing splines are a different approach to creating a spline. Recall that our goal is to find some function that fits the observed data well, that is, one that minimizes the RSS. However, if there are no constraints on the function, we can always make the RSS zero by choosing a function that interpolates every data point exactly; the problem, obviously, is that such a function completely over-fits the data. What we actually need is a function that makes the RSS reasonably small but that is also smooth.

One way to do this is to use a tuning parameter lambda that penalizes variability (roughness) in the function across the entire range of the data; the precise criterion is given below. If lambda = 0, the penalty term has no effect, and the function will be jumpy and interpolate every value. As lambda goes to infinity, the function becomes perfectly smooth: a straight line (in fact, the linear least squares fit). The resulting function turns out to look very much like a natural cubic spline with a knot at every unique value of x_i, yet it is actually a shrunken version of that natural cubic spline, where lambda controls the amount of shrinkage.
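Formally, the smoothing spline is the function g that minimizes

\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt ,

where the first term is the RSS (rewarding fit to the data) and the second term penalizes roughness, with the tuning parameter \lambda \ge 0 playing the role described above.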

Choosing the Smoothing Parameter Lambda

We might guess that a spline with a knot at every unique data point has far too many degrees of freedom. However, lambda controls the roughness of the smoothing spline, and hence its effective degrees of freedom: as lambda increases from 0 to infinity, the effective degrees of freedom decrease from n to 2. Although a smoothing spline nominally has n parameters (and hence n degrees of freedom), these parameters are heavily constrained, or shrunk. The exact definition of effective degrees of freedom is rather technical and not covered here. In fitting a smoothing spline we do not select the number or location of the knots, but we do have another problem: we must choose the value of lambda. Again, we resort to cross-validation. It turns out that leave-one-out cross-validation (LOOCV) can be computed very efficiently for smoothing splines, regression splines, and other arbitrary basis functions, using the formula below.
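Specifically, the entire LOOCV error can be computed from a single fit on all n observations:

RSS_{cv}(\lambda) = \sum_{i=1}^{n} \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right]^2 ,

where \hat{g}_\lambda is the smoothing spline fit to all of the data and S_\lambda is the corresponding smoother ("hat") matrix.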

Smoothing splines can often be preferable to regression splines because they tend to create simpler models with comparable fit.

Local Regression

Local regression involves computing the fit at a target point x0 using only the nearby training observations. In essence, we fit many weighted least squares regressions to estimate the behaviour of the underlying data. To perform local regression there are a number of choices to be made.

* How to define the weighting function K?

* Whether to fit a linear, constant, or quadratic regression.

* What span s to use? The span controls the smoothness and flexibility of the fit: a large s leads to a more global (smoother) fit, while a small s gives a more local, wigglier fit. In practice, s is chosen using cross-validation.

Local regression can be generalized in several ways, which is particularly apparent in the multivariate setting. For example, we can fit a multiple linear regression model that is global in some variables but local in others; such varying coefficient models are a useful way of adapting a model to the most recently gathered data. Local regression is also effective when we want a fit that is local in a pair of variables rather than in a single dimension. Here we use two-dimensional neighbourhoods and fit a bivariate linear regression model using the observations near each target point in two-dimensional space. Theoretically the same approach extends to higher dimensions, yet in practice local regression performs poorly with more than 3 or 4 predictors, because there are generally very few observations close to x0 (the curse of dimensionality).
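To make the basic one-dimensional idea concrete, here is a minimal sketch of a single local fit at a target point x0, assuming the ISLR Wage data used in the code examples below. The helper local_fit() is purely illustrative; in practice we would use loess(), shown later.

library(ISLR)
# Weighted linear regression around x0 using tricube weights on the span*n nearest
# observations; returns the local prediction at x0.
local_fit <- function(x, y, x0, span = 0.5) {
  d <- abs(x - x0)
  h <- sort(d)[ceiling(span * length(x))]        # bandwidth: distance to the span*n-th nearest point
  w <- ifelse(d <= h, (1 - (d / h)^3)^3, 0)      # tricube weights, zero outside the neighbourhood
  fit <- lm(y ~ x, weights = w)                  # weighted least squares
  unname(predict(fit, data.frame(x = x0)))
}
local_fit(Wage$age, Wage$wage, x0 = 40)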

Generalized Additive Models

The examples above are explained from the perspective of fitting a model to the response Y with a single predictor X.

Here we explore the problem of flexibly predicting Y on the basis of several predictors. This is again an extension of the simple linear model.

GAMs provide a general framework for extending the linear model by allowing non-linear functions of each variable while maintaining additivity. In fact, most of the methods discussed above can easily be applied in this multivariate setting.

Fitting a GAM with smoothing splines is not quite as simple, because least squares cannot be used. Instead, an approach called backfitting is used. This method fits a model with multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed. This is convenient because each time we update a function, we simply apply that variable's fitting method to a partial residual.
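As a toy illustration of the backfitting idea on two of the Wage predictors (this is only a conceptual sketch with arbitrary smoothing choices, not what the gam library actually does internally):

library(ISLR)
y <- Wage$wage - mean(Wage$wage)                 # centred response
f.age <- rep(0, nrow(Wage))                      # current estimate of the age function
f.year <- rep(0, nrow(Wage))                     # current estimate of the year function
for (iter in 1:10) {
  # smooth age against the partial residual that removes the current year effect
  f.age <- predict(smooth.spline(Wage$age, y - f.year, df=5), Wage$age)$y
  f.age <- f.age - mean(f.age)                   # centre for identifiability
  # smooth year against the partial residual that removes the current age effect
  f.year <- predict(smooth.spline(Wage$year, y - f.age, df=4), Wage$year)$y
  f.year <- f.year - mean(f.year)
}

The categorical education term is omitted here to keep the sketch short; gam() handles mixed term types and convergence checks for us.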

Pros and Cons of GAMs

Pros

  • GAMs allow fitting non-linear functions to each predictor so that we can automatically model non-linear relationships that standard linear regression will miss. This means we don’t need to try out many different transformations on each variable individually.

  • The non-linear fits can potentially make more accurate predictions for the response Y.

  • Because the model is additive, we can still examine the effect of each predictor on Y while holding other variables fixed.

  • The smoothness of each function can be summarized via its effective degrees of freedom.

Cons

  • The main limitation is that the model is restricted to be additive, so important interactions can be missed. However, as with ordinary linear regression, we can manually add interactions by including additional predictors of the form Xj × Xk.

For fully general models, we have to look to even more flexible approaches, such as random forests and boosting.

Code Examples

Polynomial Regression and Step Functions

library(ISLR)
attach(Wage)

We can easily fit polynomial functions using poly(), specifying the variable and the degree of the polynomial. The function returns a matrix of orthogonal polynomials, which means each column is a linear combination of the variables age, age^2, age^3, and age^4. If we want the raw polynomial variables directly, we can specify raw=TRUE; this does not affect the fitted values, but it can be useful if we want to examine the coefficient estimates on the original scale.

fit = lm(wage~poly(age, 4), data=Wage)
kable(coef(summary(fit)))
                 Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)      111.7036      0.7287  153.2830    0.0000
poly(age, 4)1    447.0679     39.9148   11.2006    0.0000
poly(age, 4)2   -478.3158     39.9148  -11.9834    0.0000
poly(age, 4)3    125.5217     39.9148    3.1447    0.0017
poly(age, 4)4    -77.9112     39.9148   -1.9519    0.0510
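For instance, the orthogonal and raw parameterisations give different coefficients but identical fitted values (a quick check; the object name fit.raw is just for this comparison):

fit.raw <- lm(wage~poly(age, 4, raw=TRUE), data=Wage)
max(abs(fitted(fit) - fitted(fit.raw)))   # effectively zero: the same fitted curve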

Now let’s create a vector of ages at which we want predictions and call the predict() function. Lastly, we plot the data and the fitted degree-4 polynomial.

ageLims <- range(age)
age.grid <- seq(from=ageLims[1], to=ageLims[2])
pred <- predict(fit, newdata = list(age = age.grid), se=TRUE)
se.bands <- cbind(pred$fit + 2*pred$se.fit, pred$fit - 2*pred$se.fit)

plot(age, wage, xlim=ageLims, cex=.5, col="darkgrey")
title("Degree-4 Polynomial", outer=T)
lines(age.grid, pred$fit, lwd=2, col="blue")
matlines(age.grid, se.bands, lwd=2, col="blue", lty=3)

Plot: degree-4 polynomial fit of wage on age, with ±2 standard error bands.

In this simple example, we could use an ANOVA test, which performs analysis of variance to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex model M2 is required. Note that the models being compared must be nested.

fit.1=lm(wage~age,data=Wage)
fit.2=lm(wage~poly(age,2),data=Wage)
fit.3=lm(wage~poly(age,3),data=Wage)
fit.4=lm(wage~poly(age,4),data=Wage)
fit.5=lm(wage~poly(age,5),data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)    
## 1   2998 5022216                               
## 2   2997 4793430  1    228786 143.59 <2e-16 ***
## 3   2996 4777674  1     15756   9.89 0.0017 ** 
## 4   2995 4771604  1      6070   3.81 0.0510 .  
## 5   2994 4770322  1      1283   0.80 0.3697    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value comparing the linear M1 to the quadratic M2 is essentially zero, which tells us that a linear fit is not sufficient, and the p-value comparing M2 to the cubic M3 is very low (0.0017), so a quadratic fit is also insufficient. The p-value comparing M3 to M4 is about 5%, while the degree-5 model M5 appears unnecessary (p-value of 0.37). Thus a cubic or quartic model is likely sufficient to fit this data, with preference for the simpler model.
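Because poly() produces orthogonal polynomials, the same comparisons can also be read off the coefficient t-statistics of the largest model: the square of each polynomial term's t-statistic equals the corresponding F-statistic in the ANOVA table above. A quick check:

coef(summary(fit.5))[3, "t value"]^2   # ~143.6, the F value comparing fit.1 and fit.2 above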

We could also choose the polynomial degree using cross-validation.

library(boot)

set.seed(17)

cv.errors <- data.frame(degree=seq(1,5,1), error=rep(NA, 5))
for (i in 1:5) {  # loop through degree 1-5 polynomials
  glm.fit <- glm(wage~poly(age, i), data=Wage)
  cv.errors$error[i] <- cv.glm(Wage, glm.fit, K=10)$delta[1]
}
kable(cv.errors)
degree error
1 1677
2 1602
3 1596
4 1594
5 1596

Here the lowest cross-validated error is for the degree-4 polynomial, though we don't lose much ground choosing a degree-2 or degree-3 model. Next we consider predicting whether an individual earns more than $250,000 per year. Here we use the expression I(wage > 250) to create a Boolean response variable on the fly. Note that we should specify type='response' in the prediction to extract the probabilities.

fit <- glm(I(wage > 250)~poly(age, 4), data=Wage, family=binomial)
pred <- predict(fit, newdata=list(age=age.grid), se=TRUE, type='response')

However, confidence intervals computed this way would not be sensible, since some of the resulting probability bounds would be negative. For confidence intervals it makes more sense to compute the predictions and standard errors on the logit scale and then transform them.

pred <- predict(fit, newdata=list(age=age.grid), se=TRUE)
pfit <- exp(pred$fit) / (1+exp(pred$fit))  # convert from the logit scale to probabilities
se.bands.logit <- cbind(pred$fit + 2*pred$se.fit, pred$fit - 2*pred$se.fit)
se.bands <- exp(se.bands.logit) / (1+exp(se.bands.logit))

Plot it:

plot(age, I(wage>250), xlim=ageLims, type="n", ylim=c(0,.2))
points(jitter(age), I((wage>250)/5), cex=.5, pch="|", col="darkgrey")
lines(age.grid, pfit, lwd=2, col="blue")
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3)

Plot: estimated probability that wage > 250 as a function of age, with standard error bands.

Step Function

Here we use the cut() function to split the data into bins. We could also use quantile() to choose cutpoints, or specify our own cutpoints via the breaks argument of cut().

table(cut(age, 4))
## 
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
##         750        1399         779          72
fit <- lm(wage~cut(age, 4), data=Wage)
coef(summary(fit))
##                        Estimate Std. Error t value  Pr(>|t|)
## (Intercept)              94.158      1.476  63.790 0.000e+00
## cut(age, 4)(33.5,49]     24.053      1.829  13.148 1.982e-38
## cut(age, 4)(49,64.5]     23.665      2.068  11.443 1.041e-29
## cut(age, 4)(64.5,80.1]    7.641      4.987   1.532 1.256e-01
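We could also supply our own cutpoints through the breaks argument; the break values below are arbitrary, chosen only for illustration:

table(cut(age, breaks=c(17, 35, 50, 65, 81)))  # manual bins covering the observed age range
fit.manual <- lm(wage~cut(age, breaks=c(17, 35, 50, 65, 81)), data=Wage)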

Splines

Here we will make use of the splines library. The bs() function generates the entire matrix of basis functions for a spline with the specified set of knots. By default it produces a cubic spline.

library(splines)

fit <- lm(wage~bs(age, knots=c(25,40,60)), data=Wage)
pred <- predict(fit, newdata=list(age=age.grid), se=T)
plot(age, wage, col='gray')
lines(age.grid, pred$fit, lwd=2)
lines(age.grid, pred$fit+2*pred$se, lty='dashed')
lines(age.grid, pred$fit-2*pred$se, lty='dashed')

Plot: cubic regression spline fit of wage on age with knots at 25, 40, and 60, with standard error bands.

Since we are using a cubic spline with three knots, this produces a spline with six basis functions, which equates to 7 degrees of freedom (including the intercept). We can also use the df argument to specify a spline whose knots are placed at uniform quantiles of the data. Note the function also has a degree argument if we want to fit a spline of a different degree than the default cubic.

dim(bs(age, knots=c(25,40,60)))
## [1] 3000    6
dim(bs(age, df=6))
## [1] 3000    6
attr(bs(age, df=6), 'knots') # quantiles of `age`
##   25%   50%   75% 
## 33.75 42.00 51.00

To fit a natural spline we use the ns() function.

fit2 <- lm(wage~ns(age, df=4), data=Wage)
pred2 <- predict(fit2, newdata=list(age=age.grid), se=T)
# re-plot the data
plot(age, wage, col='gray')
lines(age.grid, pred$fit, lwd=2)
lines(age.grid, pred2$fit, col='red', lwd=2)

Plot: cubic spline (black) and natural spline (red) fits of wage on age.

Note how the natural spline doesn’t suffer from the increased variation at the tails.

We can also fit a smoothing spline with none other than smooth.spline(). Here we fit one spline with 16 degrees of freedom, and a second whose smoothness is chosen by cross-validation, which yields 6.8 effective degrees of freedom.

plot(age, wage, xlim=ageLims, col='gray')
title('Smoothing Spline')
fit <- smooth.spline(age, wage, df=16)
fit2 <- smooth.spline(age, wage, cv=TRUE)
## Warning: cross-validation with non-unique 'x' values seems doubtful
fit2$df
## [1] 6.795
lines(fit, col='red', lwd=2)
lines(fit2, col='blue', lwd=1)
legend('topright', legend=c('16 DF', '6.8 DF'), col=c('red','blue'), lty=1, lwd=2, cex=0.8)

Plot: smoothing splines with 16 (red) and 6.8 (blue) effective degrees of freedom.

Local Regression

To perform local regression we use the loess() function; the same smoother is also available through ggplot2's geom_smooth(). Here we fit loess curves with spans of 0.2 and 0.5: that is, each neighbourhood consists of 20% or 50% of the observations. The larger the span, the smoother the fit.

library(ggplot2)

ggplot(data=Wage, aes(x=age, y=wage)) +
  geom_point(color='gray') +
  geom_smooth(method='loess', span=0.2) +
  geom_smooth(method='loess', span=0.5, color='red') +
  theme_bw()

Plot: loess fits of wage on age with spans 0.2 and 0.5.
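The same fits can also be produced directly with loess() and plotted with base graphics, reusing age.grid from earlier (a sketch; the object names are illustrative):

fit.lo <- loess(wage~age, span=0.2, data=Wage)
fit.lo2 <- loess(wage~age, span=0.5, data=Wage)
plot(age, wage, xlim=ageLims, cex=.5, col='darkgrey')
lines(age.grid, predict(fit.lo, data.frame(age=age.grid)), col='blue', lwd=2)
lines(age.grid, predict(fit.lo2, data.frame(age=age.grid)), col='red', lwd=2)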

GAMs

Now we use a GAM to predict wage using a natural spline of age together with education. Since this is just a linear regression model with an appropriate choice of basis functions, we can simply use the lm() function.

gam1 <- lm(wage ~ ns(age, 5) + education, data=Wage)

To fit more complex splines, or other components that cannot be expressed as basis functions, we need to use the gam library in R. The s() function is used to indicate a smoothing spline.

library(gam)
## Loaded gam 1.09.1
gam2 <- gam(wage~s(year, 4) + s(age, 5) + education, data=Wage)

Plot these two models

par(mfrow=c(1, 3))
plot(gam2, se=TRUE, col='blue')

Plot: fitted gam2 smooth terms for year, age, and education.

plot.gam(gam1, se=TRUE, col='red')

Plot: fitted gam1 terms for age (natural spline) and education.

It looks like the behaviour of year is rather linear. We can fit a new model with a linear term for year and then use an ANOVA test to decide between the candidates.

gam3 <- gam(wage~year + s(age, 5) + education, data=Wage)
anova(gam1, gam3, gam2, test='F')
## Analysis of Variance Table
## 
## Model 1: wage ~ ns(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
##   Res.Df     RSS Df Sum of Sq    F  Pr(>F)    
## 1   2990 3712881                              
## 2   2989 3693842  1     19040 15.4 8.9e-05 ***
## 3   2986 3689770  3      4071  1.1    0.35    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that the addition of a linear year component is much better than a GAM without year. Yet there is no evidence that a non-linear function of year is needed (p-value of 0.35).

summary(gam2)
## 
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.43  -19.70   -3.33   14.17  213.48 
## 
## (Dispersion Parameter for gaussian family taken to be 1236)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29888 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##              Df  Sum Sq Mean Sq F value  Pr(>F)    
## s(year, 4)    1   27162   27162      22 2.9e-06 ***
## s(age, 5)     1  195338  195338     158 < 2e-16 ***
## education     4 1069726  267432     216 < 2e-16 ***
## Residuals  2986 3689770    1236                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)    
## (Intercept)                          
## s(year, 4)        3    1.1   0.35    
## s(age, 5)         4   32.4 <2e-16 ***
## education                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the summary of the model with a non-linear term for year, the large p-value for s(year, 4) under the ANOVA for nonparametric effects again confirms that a non-linear function of year does not contribute to the model.

Next we fit a local regression term as a building block in a GAM, using the lo() function.

gam.lo <- gam(wage ~ s(year, df=4) + lo(age, span=0.7) + education, data=Wage)
plot.gam(gam.lo, se=TRUE, col='green')

Plot: fitted gam.lo terms for year, age (local regression), and education.

We can also use the lo() function to include a local regression interaction between year and age in the GAM.

gam.lo.i <- gam(wage ~ lo(year, age, span=0.5) + education, data=Wage)

We can plot the resulting surface using the akima package.

library(akima)
plot(gam.lo.i)

Plot: two-dimensional local regression surface in year and age, plus the education term.
