1. Purpose: Use the attributes of each census tract to predict median house prices in the Boston area.
2. Data source: the dataset accompanying the paper "Hedonic housing prices and the demand for clean air", containing 506 observations and 16 variables, where each observation represents one census tract.
boston <- read.csv("boston.csv")  # read the file
str(boston)  # view the data structure
3. Variable descriptions:
(1) town: the town each census tract belongs to
(2) LON: longitude of the census tract's center
(3) LAT: latitude of the census tract's center
(4) MEDV: median house value in the census tract (in units of $1,000)
(5) CRIM: per-capita crime rate
(6) ZN: proportion of the land zoned for large residential lots
(7) INDUS: proportion of the land used for industrial purposes
(8) CHAS: 1 if the census tract borders the Charles River; 0 if it does not
(9) NOX: concentration of nitrogen oxides in the air (a measure of air pollution)
(10) RM: average number of rooms per house
(11) AGE: proportion of houses built before 1940
(12) DIS: distance from the census tract to downtown Boston
(13) RAD: index of proximity to important highways (1 = closest; 24 = farthest)
(14) TAX: property tax per $10,000 of house value
(15) PTRATIO: pupil-teacher ratio in the town
View(boston)  # view the data
4. Application and analysis
4.1 Data exploration
plot(boston$LON, boston$LAT)  # plot the census tracts by longitude and latitude
points(boston$LON[boston$CHAS == 1], boston$LAT[boston$CHAS == 1], col = 'blue', pch = 19)  # blue marks tracts that border the Charles River
summary(boston$NOX)  # the mean NOX level is about 0.55
points(boston$LON[boston$NOX >= 0.55], boston$LAT[boston$NOX >= 0.55], col = 'gray', pch = 19)  # gray marks tracts with above-average NOX
summary(boston$MEDV)  # the median of the tract median prices is 21.2
points(boston$LON[boston$MEDV >= 21.2], boston$LAT[boston$MEDV >= 21.2], col = 'green', pch = 19)  # green marks tracts whose median price is above the overall median
4.2 Building a linear regression model from longitude and latitude
model1 <- lm(MEDV ~ LON + LAT, data = boston)  # longitude and latitude as predictors, median price as the response
summary(model1)  # not a good model
Although the predictor LON is significant, the model's adjusted R^2 is only 0.1036, so this linear regression model is not satisfactory.
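The adjusted R^2 quoted above can also be read out of the fitted object directly rather than from the printed summary. A minimal sketch, run on a toy data frame (an assumption; the value 0.1036 itself comes from the boston.csv fit):

```r
# Extract adjusted R^2 programmatically from an lm fit.
# 'toy' is a synthetic stand-in (assumption), not boston.csv,
# so the value printed here differs from 0.1036.
set.seed(1)
toy <- data.frame(MEDV = rnorm(50), LON = rnorm(50), LAT = rnorm(50))
m <- lm(MEDV ~ LON + LAT, data = toy)
adj.r2 <- summary(m)$adj.r.squared  # the "Adjusted R-squared" field summary() prints
adj.r2
```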
plot(boston$LON, boston$LAT)
points(boston$LON[boston$MEDV >= 21.2], boston$LAT[boston$MEDV >= 21.2], col = 'green', pch = 19)  # green marks tracts whose actual median price is above the overall median
points(boston$LON[model1$fitted.values >= 21.2], boston$LAT[model1$fitted.values >= 21.2], col = 'yellow', pch = 19)  # yellow marks tracts the model predicts to be above the median
The plot above further confirms that the linear regression model is not satisfactory: it completely ignores the tracts in the right half of the map. It also supports the claim that the predictor LAT is not significant.
4.3 Building a CART regression tree from longitude and latitude
library(rpart)  # load the rpart package, used to build CART models
library(rpart.plot)  # load the rpart.plot package, used to plot trees
tree1 <- rpart(MEDV ~ LAT + LON, data = boston)  # no method = "class" here, since this is a regression tree rather than a classification tree
prp(tree1, digits = 4)  # plot the regression tree
plot(boston$LON, boston$LAT)  # redraw the map before overlaying points
points(boston$LON[boston$MEDV >= 21.2], boston$LAT[boston$MEDV >= 21.2], col = 'green', pch = 19)  # green marks tracts whose actual median price is above the overall median
fitted <- predict(tree1)  # predict each tract's median price with tree1
points(boston$LON[fitted >= 21.2], boston$LAT[fitted >= 21.2], col = 'yellow', pch = 19)  # yellow marks tracts predicted to be above the median
Compared with the linear regression model, the regression tree tree1 clearly matches the actual above-median tracts much more closely.
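One way to put a number on the visual comparison is the in-sample sum of squared errors (SSE) of each model. A sketch on synthetic stand-in data (an assumption; the document's own comparison uses boston.csv), where price depends on a location interaction that a tree can capture but a linear surface cannot:

```r
library(rpart)  # CART regression trees

# Synthetic stand-in data (assumption): price is high only in the
# upper-right quadrant, a pattern a linear surface cannot represent.
set.seed(2)
n <- 300
d <- data.frame(LON = runif(n), LAT = runif(n))
d$MEDV <- ifelse(d$LON > 0.5 & d$LAT > 0.5, 30, 20) + rnorm(n)

lin <- lm(MEDV ~ LON + LAT, data = d)
tr1 <- rpart(MEDV ~ LON + LAT, data = d)

sse.lin <- sum((d$MEDV - lin$fitted.values)^2)
sse.tree <- sum((d$MEDV - predict(tr1))^2)
c(linear = sse.lin, tree = sse.tree)  # the tree's SSE is far smaller on this data
```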
4.4 Building a linear regression model from all predictors and refining it
library(caTools)  # load the caTools package to split the data into a 70% training set and a 30% test set
set.seed(123)  # set the random seed
spl <- sample.split(boston$MEDV, SplitRatio = 0.7)
train <- subset(boston, spl == TRUE)  # training set
test <- subset(boston, spl == FALSE)  # test set
lm1 <- lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train)  # fit a linear regression model on the training set
summary(lm1)
CRIM, CHAS, NOX, RM, AGE, DIS, RAD, TAX and PTRATIO are significant, and the adjusted R^2 is 0.6525, a marked improvement over the earlier linear regression model.
lm1pred <- predict(lm1, newdata = test)  # apply the model to the test set to generate predictions
library(forecast)  # load the forecast package
accuracy(lm1pred, test$MEDV)  # compute the model's accuracy metrics
Next, remove the least significant predictor one at a time (LAT, then INDUS, then LON) until every remaining predictor is significant, which yields the following linear regression model:
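These intermediate steps can be written compactly with update(), dropping one term per step; the names lm2.toy and lm3.toy below are hypothetical (the text only defines lm1 and lm4). A sketch on a toy data frame with the same column names (an assumption; the real analysis runs on the train split of boston.csv):

```r
# Backward elimination with update(): remove one predictor per step.
# 'toy' is a synthetic stand-in (assumption) with boston.csv's column names,
# and the *.toy fits mirror the document's lm1 ... lm4.
set.seed(3)
n <- 120
toy <- as.data.frame(matrix(rnorm(n * 13), n, 13))
names(toy) <- c("LAT", "LON", "CRIM", "ZN", "INDUS", "CHAS", "NOX",
                "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO")
toy$MEDV <- rnorm(n)

lm1.toy <- lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
                AGE + DIS + RAD + TAX + PTRATIO, data = toy)
lm2.toy <- update(lm1.toy, . ~ . - LAT)    # step 1: drop LAT
lm3.toy <- update(lm2.toy, . ~ . - INDUS)  # step 2: drop INDUS
lm4.toy <- update(lm3.toy, . ~ . - LON)    # step 3: drop LON
formula(lm4.toy)  # same predictors as the document's lm4
```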
lm4 <- lm(MEDV ~ CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train)  # refit on the training set with LAT, INDUS and LON removed
summary(lm4)
lm4pred <- predict(lm4, newdata = test)  # apply lm4 to the test set
accuracy(lm4pred, test$MEDV)
Comparing the accuracy of the two models:
library(forecast)
accuracy(lm1pred, test$MEDV)  # accuracy of the model with all predictors (lm1)
accuracy(lm4pred, test$MEDV)  # accuracy after removing LAT, INDUS and LON (lm4)
Comparing ME, RMSE, MAE, MPE and MAPE shows that lm4 is slightly more accurate than lm1.
4.5 Building a CART regression tree from all predictors and refining it
tree3 <- rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train)
prp(tree3)
treepred <- predict(tree3, newdata = test)  # apply the regression tree tree3 to the test set
accuracy(treepred, test$MEDV)  # accuracy of tree3
Comparing ME, RMSE, MAE, MPE and MAPE shows that tree3 is slightly less accurate than the linear regression models (lm1 and lm4).
Next, introduce the complexity parameter cp and use cross-validation to search for the best regression tree:
library(caret)  # load caret for cross-validation
library(lattice)
library(ggplot2)
library(e1071)
tr.control <- trainControl(method = 'cv', number = 10)  # 10-fold cross-validation
cp.grid <- expand.grid(.cp = (0:10) * 0.001)  # candidate cp values from 0 to 0.01
tr <- train(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + TAX + PTRATIO, data = train, method = 'rpart', trControl = tr.control, tuneGrid = cp.grid)
tr  # the best cp is 0.001
Using cp = 0.001, build a new regression tree:
best.tree <- tr$finalModel
prp(best.tree)
best.tree.pred <- predict(best.tree, newdata = test)  # apply the new regression tree best.tree to the test set
accuracy(best.tree.pred, test$MEDV)  # accuracy of best.tree
Although cross-validating over cp makes best.tree more accurate than tree3, its accuracy is still below that of the linear regression model lm4. A regression tree is therefore not necessarily the better choice here.
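As a side note on the design: once cross-validation has settled on cp = 0.001, a tree at that complexity can also be grown directly through rpart's control argument, without going through caret. A sketch on toy one-dimensional data (an assumption; the tuned tree above comes from tr$finalModel):

```r
library(rpart)  # CART regression trees

# Grow trees at a fixed complexity parameter via rpart.control.
# 'toy' is synthetic stand-in data (assumption), not boston.csv.
set.seed(4)
toy <- data.frame(x = runif(200))
toy$y <- sin(6 * toy$x) + rnorm(200, sd = 0.1)

fit.small.cp <- rpart(y ~ x, data = toy, control = rpart.control(cp = 0.001))
fit.default  <- rpart(y ~ x, data = toy)  # default cp = 0.01 prunes more aggressively

# a smaller cp permits more splits, i.e. a larger tree
c(nodes.cp001 = nrow(fit.small.cp$frame), nodes.default = nrow(fit.default$frame))
```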