Getting Started with tf.keras (3): Predicting House Prices: Regression (boston_housing dataset)

Predicting House Prices (Regression)

The goal is to predict the median house price in a Boston suburb during the mid-1970s.

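The dataset contains 13 different features:

Per capita crime rate.
Proportion of residential land zoned for lots over 25,000 square feet.
Proportion of non-retail business acres per town.
Charles River dummy variable (1 if the tract bounds the Charles River; 0 otherwise).
Nitric oxides concentration (parts per 10 million).
Average number of rooms per dwelling.
Proportion of owner-occupied units built before 1940.
Weighted distances to five Boston employment centers.
Index of accessibility to radial highways.
Full-value property tax rate per $10,000.
Pupil-teacher ratio by town.
1000 * (Bk - 0.63) ** 2, where Bk is the proportion of Black residents by town.
Percentage of the population of lower economic status.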

Data Preprocessing and Network Architecture

API Notes

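df = pd.DataFrame(train_data, columns=column_names)  # A DataFrame is a table-like structure, similar to a database table, with both a row index and a column index.

print(df.head())  # A DataFrame can be thought of as a dict of Series that all share the same index.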
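
To make the "dict of Series" picture concrete, here is a minimal sketch. The two column names are taken from the dataset, but the values are made up purely for illustration.

```python
import pandas as pd

# Values are invented for illustration; only the column names come from the dataset.
rm = pd.Series([6.0, 5.5, 7.0], name='RM')
age = pd.Series([65.0, 80.0, 45.0], name='AGE')

# Building a DataFrame from a dict of Series aligns them on their shared index.
df = pd.DataFrame({'RM': rm, 'AGE': age})

print(df.head())         # the first rows, shown as a table
print(type(df['RM']))    # selecting one column gives back a pandas Series
print(df.loc[1, 'AGE'])  # row label 1, column 'AGE'
```

Selecting a column by name behaves like a dict lookup, while `df.loc` addresses a cell by row label and column label.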

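optimizer = tf.train.RMSPropOptimizer(0.001)  # RMSProp, an optimization method, here with learning rate 0.001; a detailed explanation of how it works will be added later. This is the TF 1.x API; in TensorFlow 2.x the corresponding class is tf.keras.optimizers.RMSprop.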
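
Until that explanation is written, the following NumPy sketch shows the textbook form of the RMSProp update: keep a running average of squared gradients and divide each step by its square root. It is an assumption-laden illustration, not TensorFlow's implementation, which has extra options (such as momentum) and may handle epsilon differently. The helper name `rmsprop_step` and the toy objective are hypothetical.

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One textbook RMSProp update (hypothetical helper, not the TF API)."""
    # Exponential moving average of the squared gradient, per parameter.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Scale the step by the root of that average, so parameters with
    # consistently large gradients take proportionally smaller steps.
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Toy objective f(w) = sum(w ** 2), whose gradient is 2 * w.
w = np.array([3.0, -2.0])
avg_sq = np.zeros_like(w)
for _ in range(2000):
    grad = 2.0 * w
    w, avg_sq = rmsprop_step(w, grad, avg_sq, lr=0.01)

print(w)  # both entries end up close to the minimum at 0
```

Because the step size is normalized by the recent gradient magnitude, each parameter moves at roughly the learning rate per step regardless of the raw gradient scale.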

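Early stopping guards against overfitting and wasted computation. patience is the number of consecutive epochs without improvement in the monitored quantity (here val_loss) after which training stops; it is not the interval between checks.

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=EPOCHS // 20)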

# Store training status
history = model.fit(train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=0, 
                    callbacks=[early_stop, PrintDot()])   # verbose controls how much progress output is shown (0 = silent)
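
The role of patience can be sketched in plain Python. This is a simplified model of the idea, assuming that "improvement" means a strictly lower validation loss; the function name `early_stopping_epoch` and the loss values are hypothetical, and the real Keras callback has more options (for example `min_delta`).

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Simplified sketch: stop once `patience` consecutive epochs have
    passed without a new best (strictly lower) validation loss.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses for illustration.
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4, 3.5, 3.6, 3.7]
print(early_stopping_epoch(losses, patience=3))
print(early_stopping_epoch(losses, patience=2))
```

With patience=3 the temporary rise at epochs 3 and 4 is tolerated because epoch 5 sets a new best; with patience=2 training would already stop at epoch 4.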

test_predictions = model.predict(test_data)
test_predictions = test_predictions.flatten(order='C')  # flatten the 2-D array into a 1-D array
# 'C' means to flatten in row-major order (C-style); this is the default.
# 'F' means to flatten in column-major order (Fortran-style).
# 'A' means to flatten in column-major order if the array is Fortran contiguous
# in memory, row-major order otherwise. 'K' means to flatten the array in the
# order the elements occur in memory.
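
A small NumPy sketch shows the difference between the two main orders, and why the choice does not matter for a prediction array of shape (n, 1). The numbers are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

row_major = a.flatten(order='C')  # walk along each row in turn
col_major = a.flatten(order='F')  # walk down each column in turn
print(row_major)
print(col_major)

# An (n, 1) array, the shape of predictions from a single-output model:
# with only one column, both orders give the same 1-D result.
preds = np.array([[21.5], [30.1], [18.7]])
flat = preds.flatten()
print(flat.shape)
```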

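Summary

Mean squared error ( $MSE$ ) is a common loss function for regression problems (it differs from the losses used for classification).
Likewise, the evaluation metrics used for regression differ from those used for classification ( $acc$ ). A common regression metric is mean absolute error ( $MAE$ ).
When the input features have values in different ranges, each feature should be scaled independently.
If there is not much training data, prefer a small network with few hidden layers to avoid overfitting.
Early stopping ( $keras.callbacks.EarlyStopping()$ ) is a practical technique for preventing overfitting. (It is checked at the end of each epoch.)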
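
The first three points can be illustrated with a few lines of NumPy: MSE and MAE computed by hand, and per-feature scaling that reuses the training-set mean and standard deviation for the test data. All numbers and the helper names `mse` and `mae` are made up for this sketch.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors quadratically."""
    return float(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average error in the label's own units."""
    return float(np.mean(np.abs(y_pred - y_true)))

# Made-up prices in thousands of dollars.
y_true = np.array([20.0, 30.0, 25.0])
y_pred = np.array([22.0, 27.0, 25.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))

# Two features on very different scales (made-up values).
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Statistics come from the training set only and are reused for the test set.
mean = train.mean(axis=0)
std = train.std(axis=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled)
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, so neither feature dominates simply because of its units.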

Code

main.py

import tensorflow as tf 
from tensorflow import keras
import numpy as np
import pandas as pd
from plot import plot_history
from matplotlib import pyplot as plt

'''
Data preprocessing
'''
boston_housing = keras.datasets.boston_housing

(train_data, train_labels), (test_data, test_labels) = boston_housing.load_data()

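# Shuffle the training set.
# Generate train_labels.shape random floats in [0, 1), then use argsort to get
# the indices that would sort them. Because the values are random, the result
# is a random permutation of the sample indices.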
order = np.argsort(np.random.random(train_labels.shape))
train_data = train_data[order]
train_labels = train_labels[order]

print("Training Set Size: {}".format(train_data.shape))
print("Testing Set Size: {}".format(test_data.shape))
print("First sample:\n", train_data[0])


# Use pandas to display the first few rows of the dataset as a neatly formatted table:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
                'TAX', 'PTRATIO', 'B', 'LSTAT']


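# A DataFrame is a table-like structure, similar to a database table, with both a row index and a column index.
# It can be thought of as a dict of Series that all share the same index.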
df = pd.DataFrame(train_data, columns=column_names)  
print(df.head())
# Next, look at the labels (in thousands of dollars)
print(train_labels[0:10])


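'''
Normalize the features

Although the model might converge without feature normalization, skipping it
makes training harder and makes the resulting model more dependent on the
units chosen for the inputs.
'''
# Per-column mean and standard deviation, computed on the training set only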
mean = train_data.mean(axis = 0)
std = train_data.std(axis=0)
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std
print(train_data[0]) # First training sample , normalized


'''
Build the model
'''
def build_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(64, activation=tf.nn.relu,
                     input_shape = (train_data.shape[1],)))
    model.add(keras.layers.Dense(64, activation=tf.nn.relu))
    model.add(keras.layers.Dense(1))
    # TF 1.x API; in TensorFlow 2.x the corresponding class is tf.keras.optimizers.RMSprop
    optimizer = tf.train.RMSPropOptimizer(0.001)

    model.compile(loss='mse',
                  optimizer = optimizer,
                  metrics=['mae'])  # mse: mean squared error; mae: mean absolute error
    return model


model = build_model()
model.summary()
# The first layer has 896 parameters: (13 + 1) * 64
# The second layer has 4160 parameters: (64 + 1) * 64
# The third layer has 65 parameters: (64 + 1) * 1


'''
Train the model
'''
EPOCHS = 500

# Display training progress by 
# printing a single dot for each completed epoch
class PrintDot(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        if epoch % 100 == 0:
            print('')  # start a new line every 100 epochs
        print('.',end='')

# The patience parameter is the number of epochs without improvement in
# val_loss to wait before stopping.
# This guards against overfitting and wasted computation.
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=EPOCHS // 20)


# Store training stats
history = model.fit(train_data, train_labels, epochs=EPOCHS,
                    validation_split=0.2, verbose=0, 
                    callbacks=[early_stop, PrintDot()])   # verbose controls how much progress output is shown (0 = silent)


'''
Plot the training history
'''
plot_history(history)


'''
Evaluate on the test set
'''
[loss, mae] = model.evaluate(test_data, test_labels, verbose=1)
print("\nTesting set Mean Abs Error: ${:7.2f}".format(mae * 1000))


'''
Predict
'''
test_predictions = model.predict(test_data)
test_predictions = test_predictions.flatten(order='C')
# 'C' means to flatten in row-major order (C-style); this is the default.
# 'F' means to flatten in column-major order (Fortran-style).
# 'A' means to flatten in column-major order if the array is Fortran contiguous
# in memory, row-major order otherwise. 'K' means to flatten the array in the
# order the elements occur in memory.

plt.figure()
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [1000$]')
plt.ylabel('Predictions [1000$]')
plt.axis('equal')
plt.xlim(plt.xlim())
plt.ylim(plt.ylim())
_ = plt.plot([-100, 100], [-100, 100])  # reference line y = x (perfect predictions)
plt.savefig('predictions_vs_true_values.png')

plt.figure()
error = test_predictions - test_labels
n, bins, patches = plt.hist(error, bins=50)  # split into 50 bins and count the errors falling in each
plt.xlabel("Prediction Error [1000$]")
_ = plt.ylabel("Count")
plt.savefig('prediction_error.png')

print(type(n),type(bins),type(patches))
print(n,bins)

plt.show()

plot.py

import matplotlib.pyplot as plt
import numpy as np

def plot_history(history):
    Dict = history.history
    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Abs Error [1000$]')
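    # The keys below match the TF 1.x history; newer tf.keras versions may name
    # them 'mae' and 'val_mae' when the model is compiled with metrics=['mae'].
    # The curves are MAE, not the MSE loss, so the labels say so.
    plt.plot(history.epoch, np.array(Dict['mean_absolute_error']),
             label='Train MAE')
    plt.plot(history.epoch, np.array(Dict['val_mean_absolute_error']),
             label='Val MAE')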
    plt.legend()
    plt.ylim([0, 5])
    plt.savefig('training_history.png')