Predicting housing prices: regression problems

Introduction

This example comes from the book "Deep Learning with Python" by François Chollet; this post is a brief summary of it.

For the complete code, please refer to: https://github.com/fchollet/deep-learning-with-python-notebooks

Code

Load data set

Note: The dataset is downloaded on the first run, which can be slow, so please be patient.

from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

Data standardization

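# Standardize the data
"""
Q: Why standardize the data?
A: The features have very different value ranges, which makes learning harder for a neural network.
   So we normalize each input feature by hand: subtract the feature's mean, then divide by its standard deviation.
   This brings the features into a comparable, narrow range.
   Because the transformation is linear, the relative differences between values of the same feature are preserved.
"""
mean = train_data.mean(axis=0)  # per-feature mean
train_data -= mean  # subtract the mean

std = train_data.std(axis=0)  # per-feature standard deviation
train_data /= std
print(train_data)

# Apply the same transformation to the test set, using the training mean and std
test_data -= mean
test_data /= std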
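To make the effect of this step concrete, here is a minimal sketch on made-up toy data (the arrays and the `standardize` helper are illustrative assumptions, not part of the original code). It shows that the training features end up with mean 0 and standard deviation 1, while the test set is scaled with the training statistics and so generally does not.

```python
import numpy as np

# Toy data (made up for illustration): two features on very different scales.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_test = np.array([[2.5, 250.0], [5.0, 500.0]])


def standardize(train, test):
    """Return standardized copies, using statistics computed on `train` only."""
    mean = train.mean(axis=0)  # per-feature mean
    std = train.std(axis=0)    # per-feature standard deviation
    return (train - mean) / std, (test - mean) / std


train_scaled, test_scaled = standardize(toy_train, toy_test)
print(train_scaled.mean(axis=0))  # close to 0 for each feature
print(train_scaled.std(axis=0))   # close to 1 for each feature
print(test_scaled)                # scaled with the training mean and std
```

Unlike the in-place `-=` and `/=` used above, this helper returns new arrays and leaves its inputs untouched.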

Build the network

# Build the network
from keras import models
from keras import layers

def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1], )))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
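    """
    Q: Why does the last layer have no activation function?
    A: Without an activation function, it is a linear layer.
       This is the typical setup for scalar regression (regression that predicts a single continuous value).
       Adding an activation function would constrain the output range.
       For example, with a sigmoid activation on the last layer, the network could only learn to predict values between 0 and 1.
       Here the last layer is purely linear, so the network can learn to predict values in any range.
    """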
    return model
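The point about the output range can be checked with plain arithmetic. The sketch below is framework-free, and its pre-activation values are made up for illustration: it compares a linear output with a sigmoid output. Since the targets in this dataset are house prices in thousands of dollars, a value such as 22.5 is reachable only by the linear output.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values arriving at the last layer before any activation.
pre_activations = [-5.0, 0.0, 3.0, 22.5]

linear_outputs = list(pre_activations)                   # no activation: unchanged
sigmoid_outputs = [sigmoid(z) for z in pre_activations]  # always between 0 and 1

for z, lin, sig in zip(pre_activations, linear_outputs, sigmoid_outputs):
    print(f"z={z:6.1f}  linear={lin:6.1f}  sigmoid={sig:.6f}")
```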

Use K-fold validation to train the model

**Tip:** Each fold trains for 500 epochs with silent mode on (verbose=0), so the program may print nothing for a long time. This does not mean it has hung; please wait patiently.

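# K-fold validation
"""
Q: Why do we need K-fold validation?
A: Because there is so little data.
   With a single train/validation split, the validation score depends heavily on which samples happen to land in the validation set.
   Different splits can give very different scores (high variance), which makes the estimate of generalization unreliable.
   K-fold validation reduces this variance by averaging over several splits.
"""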
import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_scores = []
all_mae_histories = []

for i in range(k):
    print('processing fold #', i)
    # Validation data: the i-th slice of the training set
    val_data = train_data[i * num_val_samples:(i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples:(i + 1) * num_val_samples]

    # Training data: everything outside the i-th slice
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    model = build_model()
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    # Alternative: record only the final validation score of each fold
    # val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    # all_scores.append(val_mae)
    # The key name depends on the Keras version: older releases report
    # 'val_mean_absolute_error' instead of 'val_mae', so check the keys first.
    print(history.history.keys())
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)
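The slicing logic in the loop can be examined on its own, without training anything. The following sketch uses a hypothetical helper, `k_fold_indices` (not part of the original code), that yields the index sets the loop would use. It assumes the same contiguous, unshuffled folds as above.

```python
import numpy as np


def k_fold_indices(num_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    for i in range(k):
        val_idx = indices[i * fold_size:(i + 1) * fold_size]
        train_idx = np.concatenate(
            [indices[:i * fold_size], indices[(i + 1) * fold_size:]])
        yield train_idx, val_idx


# Toy case: 10 samples and 4 folds, so each validation fold holds 10 // 4 = 2 samples.
folds = list(k_fold_indices(10, 4))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: validation={val_idx.tolist()}, training size={len(train_idx)}")
```

When the sample count is not divisible by k, as in this toy case, the leftover samples at the end are never used for validation; they appear only in the training portion of every fold.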

Draw charts and observe the training process

# Compute the per-epoch average of the validation MAE across the K folds
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
# Plot the validation scores
import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history)+1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
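As a sanity check on the averaging step, the sketch below uses made-up histories for two folds and three epochs. It shows that the list comprehension computes a per-epoch mean across folds, which is the same result as NumPy's `np.mean` with `axis=0` when every fold has the same number of epochs.

```python
import numpy as np

# Made-up validation MAE histories: 2 folds, 3 epochs each.
toy_histories = [
    [4.0, 3.0, 2.5],  # fold 0
    [5.0, 3.5, 2.0],  # fold 1
]
toy_epochs = 3

# Same pattern as in the post: average the i-th epoch over all folds.
avg_loop = [np.mean([h[i] for h in toy_histories]) for i in range(toy_epochs)]

# Equivalent vectorized form: mean over the fold axis.
avg_vectorized = np.mean(toy_histories, axis=0)

print(avg_loop)
print(avg_vectorized)
```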


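# Plot the validation scores again, dropping the first ten points
"""
Q: Why redraw the chart?
A: The vertical axis spans a wide range and the data is relatively noisy, so the trend is hard to see.

Q: How can the chart be improved?
A: Drop the first 10 data points, which are on a different scale from the rest of the curve.
   Replace each point with an exponential moving average of the previous points to get a smooth curve.
"""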
def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
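The post does not say how the epoch count for the final model is chosen. One plausible approach, sketched here as an assumption rather than as the author's method, is to take the epoch at which the smoothed validation MAE is lowest. The example repeats the smoothing function on a tiny made-up curve so that the arithmetic can be followed by hand; `num_dropped` is a hypothetical name for the number of leading points removed before smoothing.

```python
def smooth_curve(points, factor=0.9):
    """Exponential moving average: each point blends with the previous smoothed value."""
    smoothed = []
    for point in points:
        if smoothed:
            smoothed.append(smoothed[-1] * factor + point * (1 - factor))
        else:
            smoothed.append(point)
    return smoothed


# Made-up validation MAE values, one per epoch.
toy_mae = [3.0, 2.0, 4.0, 2.0]
num_dropped = 0  # would be 10 if the first ten points were dropped, as in the post

smoothed = smooth_curve(toy_mae, factor=0.5)
# Step by step: 3.0, then 3.0*0.5 + 2.0*0.5, then 2.5*0.5 + 4.0*0.5, and so on.
print(smoothed)

# Index of the smallest smoothed value, converted to a 1-based epoch number.
best_index = min(range(len(smoothed)), key=smoothed.__getitem__)
best_epoch = best_index + 1 + num_dropped
print(best_epoch)
```

A larger `factor` gives a smoother curve but makes it lag further behind the raw values, so the minimum it shows may sit a few epochs later than the true one.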


Train the final model

# Train the final model
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

# Print the final result
print(test_mae_score)
# Output from the original run: 2.509598970413208

Summary

  • Regression uses different loss functions than classification. A commonly used loss for regression is the mean squared error (MSE).
  • Evaluation metrics also differ: accuracy does not apply to regression. A common regression metric is the mean absolute error (MAE).
  • If the input features have different value ranges, preprocess first by scaling each feature independently.
  • When little data is available, K-fold validation gives a more reliable evaluation of the model.
  • When little training data is available, prefer a small network with few hidden layers (typically one or two) to avoid severe overfitting.
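To make the two quantities from the first two points concrete, here is a small worked example in plain Python. The targets and predictions are made up; the `mse` and `mae` helpers are illustrative and follow the standard definitions, not Keras internals.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices in thousands of dollars.
targets = [20.0, 30.0, 25.0]
predictions = [22.0, 27.0, 25.0]

# Errors are -2, 3 and 0, so MSE = (4 + 9 + 0) / 3 and MAE = (2 + 3 + 0) / 3.
print(mse(targets, predictions))
print(mae(targets, predictions))
```

MAE stays in the units of the target, so an MAE of about 2.5 on this dataset means predictions are off by roughly 2,500 dollars on average. MSE squares the errors and therefore penalizes large misses more heavily.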


Origin blog.csdn.net/qq_43580193/article/details/108138404