[R] Language Learning Notes, Day 2: Linear Regression vs. CART Regression Trees, with a Comparison of Applications

1. Purpose: Predict house prices in the Boston area from housing information.

 

2. Data source: the paper "Hedonic housing prices and the demand for clean air". The dataset contains 506 observations and 16 variables, where each observation represents a census tract.

boston <- read.csv("boston.csv") # read the file
str(boston) # view the data structure

3. Variable description:

(1) town: the town each census tract belongs to

(2) LON: longitude of the census tract center

(3) LAT: latitude of the census tract center

(4) MEDV: median house value in each census tract (in units of $1,000)

(5) CRIM: per capita crime rate

(6) ZN: proportion of residential land zoned for large lots

(7) INDUS: proportion of land used for industrial purposes

(8) CHAS: 1 if the census tract borders the Charles River; 0 if it does not

(9) NOX: concentration of nitrogen oxides in the air (a measure of air pollution)

(10) RM: average number of rooms per house

(11) AGE: proportion of houses built before 1940

(12) DIS: distance from the census tract to downtown Boston

(13) RAD: index of accessibility to important highways (1 = closest; 24 = farthest)

(14) TAX: amount of tax per $10,000 of house value

(15) PTRATIO: pupil-teacher ratio in the town

 

View(boston) # view the data


4. Application and analysis

4.1 Data exploration

plot(boston$LON, boston$LAT) # plot the census tracts by longitude and latitude

 

points(boston$LON[boston$CHAS == 1], boston$LAT[boston$CHAS == 1], col = 'blue', pch = 19) # blue marks census tracts bordering the Charles River

 

summary(boston$NOX) # the mean NOX is about 0.55
points(boston$LON[boston$NOX >= 0.55], boston$LAT[boston$NOX >= 0.55], col = 'gray', pch = 19) # gray marks tracts with above-average NOX

 

summary(boston$MEDV) # the median of the tract-level median house prices is 21.2
points(boston$LON[boston$MEDV >= 21.2], boston$LAT[boston$MEDV >= 21.2], col = 'green', pch = 19) # green marks tracts whose median price is above the overall median

 

 

 

4.2 Building a linear regression model with longitude and latitude as the two independent variables

model1 <- lm(MEDV ~ LON + LAT, data = boston) # longitude and latitude as independent variables, median price as the dependent variable
summary(model1) # not a good model

 

 

Although the independent variable LON is highly significant, the model's adjusted R^2 is only 0.1036, so this linear regression model is not satisfactory.

plot(boston$LON, boston$LAT)
points(boston$LON[boston$MEDV >= 21.2], boston$LAT[boston$MEDV >= 21.2], col = 'green', pch = 19) # green marks tracts whose actual median price is above the overall median

 

points(boston$LON[model1$fitted.values >= 21.2], boston$LAT[model1$fitted.values >= 21.2], col = 'yellow', pch = 19) # yellow marks tracts whose median price the linear model predicts to be above the overall median
 

 

The figure above further confirms that the linear regression model is not ideal: it completely ignores the tracts in the right half of the plot. It also supports the observation that the independent variable LAT is not significant.

 

 

4.3 Building a CART regression tree with longitude and latitude as the two independent variables

library(rpart) # load the rpart package to build CART models
library(rpart.plot) # load the rpart.plot package to draw regression trees

tree1 <- rpart(MEDV ~ LAT + LON, data = boston) # no method = "class" needed, since this is a regression tree rather than a classification tree
prp(tree1, digits = 4) # draw the regression tree

plot(boston$LON, boston$LAT) # redraw the base map before adding points
points(boston$LON[boston$MEDV >= 21.2], boston$LAT[boston$MEDV >= 21.2], col = 'green', pch = 19) # green marks tracts whose actual median price is above the overall median

fitted <- predict(tree1) # use regression tree tree1 to predict the median price of each tract
points(boston$LON[fitted >= 21.2], boston$LAT[fitted >= 21.2], col = 'yellow', pch = 19) # yellow marks tracts predicted to be above the median

 

It is easy to see that, compared with the linear regression model, the regression tree tree1 is markedly more accurate.
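One way to make this comparison concrete (a sketch added here, not part of the original post) is to compare the in-sample sum of squared errors of the two models:

```r
# In-sample sum of squared errors (lower is better); assumes model1 and
# tree1 from the sections above are in the workspace
sse.lm   <- sum((boston$MEDV - model1$fitted.values)^2)
sse.tree <- sum((boston$MEDV - predict(tree1))^2)
sse.lm
sse.tree
```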

 

 

4.4 Building a linear regression model with all independent variables and optimizing it step by step

library(caTools) # load the caTools package to split the data into a 70% training set and a 30% test set
set.seed(123) # set the random seed
spl <- sample.split(boston$MEDV, SplitRatio = 0.7)
train <- subset(boston, spl == T) # training set
test <- subset(boston, spl == F) # test set
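As a quick sanity check (not in the original post), the two sets should contain roughly 70% and 30% of the 506 tracts:

```r
# Verify the split proportions; assumes train and test from above
nrow(train) # roughly 0.7 * 506
nrow(test)  # roughly 0.3 * 506
```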

  

lm1 <- lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train) # build a linear regression model on the training set
summary(lm1)

 

 

Here CRIM, CHAS, NOX, RM, AGE, DIS, RAD, TAX, and PTRATIO are significant variables, and the model's adjusted R^2 is 0.6525, a marked improvement over the earlier linear regression model.

lm1pred <- predict(lm1, newdata = test) # apply the model to the test set for prediction

library(forecast) # load the forecast package
accuracy(lm1pred, test$MEDV) # compute the model's accuracy

 

Next, remove the least significant independent variable one at a time (LAT, then INDUS, then LON) until all remaining variables are significant, yielding the following linear regression model:
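The intermediate elimination steps are not shown in the original post; a sketch of what they would look like, following the stated removal order (the names lm2 and lm3 are hypothetical):

```r
# Hypothetical intermediate steps of the backward elimination;
# assumes train from the split above
lm2 <- lm(MEDV ~ LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO,
          data = train) # after dropping LAT
summary(lm2)
lm3 <- lm(MEDV ~ LON + CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO,
          data = train) # after dropping INDUS; dropping LON next gives lm4 below
summary(lm3)
```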

lm4 <- lm(MEDV ~ CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train) # build the reduced linear regression model on the training set
summary(lm4)
lm4pred <- predict(lm4, newdata = test) # apply the model to the test set for prediction
accuracy(lm4pred, test$MEDV)

 

Compare the accuracy of the two models:

library(forecast)
accuracy(lm1pred, test$MEDV) # accuracy of the model with all independent variables (lm1)
accuracy(lm4pred, test$MEDV) # accuracy of the model after removing LAT, INDUS, and LON (lm4)

 

 

Comparing ME, RMSE, MAE, MPE, and MAPE, it is easy to see that lm4 is slightly more accurate than lm1.
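For reference (a sketch added here, not in the original post), the two headline metrics reported by accuracy() can also be computed by hand:

```r
# RMSE and MAE computed directly; assumes lm4pred and test from above
err  <- test$MEDV - lm4pred
rmse <- sqrt(mean(err^2)) # root mean squared error
mae  <- mean(abs(err))    # mean absolute error
rmse
mae
```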

 

 

4.5 Building a CART regression tree with all independent variables and optimizing it

tree3 <- rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train) # build a regression tree on the training set with all independent variables
prp(tree3) # draw the regression tree

 

 

treepred <- predict(tree3, newdata = test) # apply regression tree tree3 to the test set for prediction
accuracy(treepred, test$MEDV) # accuracy of regression tree tree3

 

 

 

Comparing ME, RMSE, MAE, MPE, MAPE, and the other metrics shows that the accuracy of regression tree tree3 is slightly lower than that of the linear regression models (lm1 and lm4).

 

Next, introduce the complexity parameter cp and use cross-validation to search for the best regression tree model:

library(caret)
library(lattice)
library(ggplot2)
library(e1071)
tr.control <- trainControl(method = 'cv', number = 10) # 10-fold cross-validation
cp.grid <- expand.grid(.cp = (0:10)*0.001) # candidate cp values from 0 to 0.01
tr <- train(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train, method = 'rpart', trControl = tr.control, tuneGrid = cp.grid)
tr # the best cp is 0.001

 

 

 

Build a new regression tree based on cp = 0.001:

best.tree <- tr$finalModel # extract the tree fitted with the best cp
prp(best.tree)

 

 

best.tree.pred <- predict(best.tree, newdata = test) # apply the new regression tree best.tree to the test set for prediction
accuracy(best.tree.pred, test$MEDV) # accuracy of regression tree best.tree

 

 

 

 

Although, after introducing cp and cross-validating, the regression tree best.tree is more accurate than tree3, its accuracy is still lower than that of the linear regression model lm4. A regression tree model is therefore not necessarily the better choice here.


Origin www.cnblogs.com/shanshant/p/11888456.html