Basic Steps for Using TensorFlow: A Linear Regression Exercise

Preparation

Load the necessary libraries

from __future__ import print_function

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Load the dataset

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.cn/mledu-datasets/california_housing_train.csv", sep=",")

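Shuffle the data so that no pathological ordering affects the results (such an ordering could hurt the performance of stochastic gradient descent). Also scale median_house_value to units of thousands, so that the model can learn from this data more easily with learning rates in a commonly used range.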

california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
california_housing_dataframe["median_house_value"] /= 1000.0
california_housing_dataframe

The output looks like this:
[image: preview of the shuffled DataFrame]
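To see what the shuffle and rescale steps do, here is a minimal sketch on a toy DataFrame. The four rows are made-up values for illustration only, not the real housing data, and the fixed seed is there only to make the sketch repeatable.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (made-up values, for illustration only).
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0, 400.0],
    "median_house_value": [100000.0, 200000.0, 300000.0, 400000.0],
})

np.random.seed(0)  # fixed seed only so the sketch is repeatable

# Reorder the rows by a random permutation of the index.
shuffled = toy.reindex(np.random.permutation(toy.index))

# Express the label in thousands, as done for the real dataset.
shuffled["median_house_value"] /= 1000.0

# Each row keeps its own feature/label pair; only the row order changes.
print(shuffled)
```

Because reindex moves whole rows, the feature and label in each row stay together after shuffling.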

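Inspect the data

Before using the data, call california_housing_dataframe.describe() to get a quick summary of useful statistics for each column: count, mean, standard deviation, maximum, minimum, and various quantiles.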

california_housing_dataframe.describe()

The output looks like this:
[image: summary statistics table]

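Build the first model

The goal of this exercise is to predict median_house_value, using total_rooms as the input feature.
To train the model, we use the LinearRegressor interface provided by the TensorFlow Estimator API. This API takes care of much of the low-level model plumbing and provides convenient methods for model training, evaluation, and inference.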

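Define features and configure feature columns

To import training data into TensorFlow, you need to specify the type of data each feature contains. Two main types of data are used:

Categorical data: textual data. The housing dataset in this exercise contains no categorical features; typical examples would be a home's style or the words in a real-estate ad.
Numerical data: numeric data (integer or floating point).
The input feature here, total_rooms, is numerical. The code below extracts the total_rooms data from california_housing_dataframe and defines the feature column with numeric_column, which specifies that its data is numeric: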

# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

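Note: the total_rooms data is a one-dimensional array (a list of the total number of rooms for each block). This is the default shape for numeric_column, so we do not need to pass it as an argument.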

Define the target

Next, define the target, median_house_value, which can also be extracted from california_housing_dataframe:

# Define the label.
targets = california_housing_dataframe["median_house_value"]

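Configure the LinearRegressor

Next, configure a linear regression model with LinearRegressor and train it with GradientDescentOptimizer, which implements mini-batch stochastic gradient descent (SGD). The learning_rate argument controls the size of the gradient step.

Note: to be safe, gradient clipping can also be applied to the optimizer via clip_gradients_by_norm. Gradient clipping ensures that the magnitude of the gradients does not become too large during training, which can cause gradient descent to fail.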

# Use gradient descent as the optimizer for training the model,
# with a learning rate of 0.0000001.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Clip gradients to a maximum norm of 5.0.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
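To make the idea of gradient clipping concrete, here is a minimal NumPy sketch of clipping a single gradient vector by its L2 norm. The function name clip_by_norm and the sample vectors are assumptions for illustration; this is not the code TensorFlow runs internally, only the general technique.

```python
import numpy as np


def clip_by_norm(gradient, clip_norm):
    """Rescale `gradient` so that its L2 norm is at most `clip_norm`."""
    norm = np.linalg.norm(gradient)
    if norm <= clip_norm:
        # Small enough already: leave the gradient untouched.
        return gradient
    # Too large: keep the direction, shrink the length to clip_norm.
    return gradient * (clip_norm / norm)


small = clip_by_norm(np.array([3.0, 4.0]), 5.0)    # norm 5, unchanged
large = clip_by_norm(np.array([30.0, 40.0]), 5.0)  # norm 50, scaled down
print(small, np.linalg.norm(large))
```

Clipping changes only the length of an oversized gradient, not its direction, so the update still points downhill but cannot take an arbitrarily large step.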

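Define the input function

To import data into LinearRegressor, you need to define an input function that tells TensorFlow how to preprocess the data and how to batch, shuffle, and repeat it during model training.
First, convert the pandas feature data into a dict of NumPy arrays. Then use the TensorFlow Dataset API to build a Dataset object from the data, split it into batches of size batch_size, and repeat it for the specified number of epochs (num_epochs).

Note: if the default value num_epochs=None is passed to repeat(), the input data is repeated indefinitely.

Next, if shuffle is set to True, the data is shuffled so that it is passed to the model in random order during training. The buffer_size argument specifies the size of the dataset from which shuffle randomly samples.

Finally, the input function builds an iterator for the dataset and returns the next batch of data to the LinearRegressor.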
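The batching, repeating, and shuffling behaviour described above can be sketched without TensorFlow. The generator below is a NumPy-only analogue, not the Dataset-based input function itself: the name iterate_batches and the seed parameter are hypothetical, and as a simplification it shuffles individual examples once per epoch instead of sampling from a shuffle buffer.

```python
import numpy as np


def iterate_batches(features, targets, batch_size=1, shuffle=True,
                    num_epochs=None, seed=None):
    """Yield (feature_batch, target_batch) pairs.

    num_epochs=None repeats the data indefinitely.
    """
    features = np.asarray(features)
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        if shuffle:
            order = rng.permutation(len(targets))  # new random order each epoch
        else:
            order = np.arange(len(targets))
        # Walk through the epoch in chunks of batch_size.
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            yield features[idx], targets[idx]
        epoch += 1


# Five examples, batch size 2, two epochs, no shuffling.
batches = list(iterate_batches(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [10.0, 20.0, 30.0, 40.0, 50.0],
    batch_size=2, shuffle=False, num_epochs=2))
print(len(batches))
```

With five examples and a batch size of 2, each epoch ends with a smaller final batch. With num_epochs=None the generator never stops, so the caller decides how many batches to draw.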

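Train the model

Call train() on linear_regressor to train the model. Wrap my_input_fn in a lambda so that my_feature and targets can be passed in as arguments (see the TensorFlow input function tutorial for details). To start, train for 100 steps.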

_ = linear_regressor.train(
    input_fn = lambda:my_input_fn(my_feature, targets),
    steps=100
)
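To give a feel for what a training step is, here is a minimal NumPy sketch of mini-batch SGD for a one-feature linear model. Everything in it is an assumption for illustration: the synthetic data (y = 3x + 0.5), the learning rate, the batch size, and the number of steps. It is not what the Estimator does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3x + 0.5, no noise.
x = rng.uniform(0.0, 1.0, size=200)
y = 3.0 * x + 0.5


def mean_squared_error(w, b):
    """Mean squared error of the model w * x + b over the whole dataset."""
    return float(np.mean((w * x + b - y) ** 2))


w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 8
initial_loss = mean_squared_error(w, b)

for step in range(200):
    # One step: draw one mini-batch and apply one gradient update.
    idx = rng.integers(0, len(x), size=batch_size)
    error = w * x[idx] + b - y[idx]
    grad_w = 2.0 * np.mean(error * x[idx])
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b)
print(initial_loss, final_loss)
```

Each step sees only one batch, so the number of steps and the batch size together determine how many examples the model has processed.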

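Evaluate the model

Make one round of predictions on the training data to see how well the model fit that data during training.
Note: training error measures how well the model fits the training data, but it does not measure how well the model generalizes to new data.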
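One common way to put a number on training error is the root mean squared error (RMSE). Choosing RMSE here is an assumption, since the metric is not named above, and the predictions and targets below are hypothetical values in thousands of dollars.

```python
import math

import numpy as np


def root_mean_squared_error(predictions, targets):
    """Square root of the mean squared difference between two sequences."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return math.sqrt(np.mean((predictions - targets) ** 2))


# Hypothetical predictions and targets, in thousands of dollars.
rmse = root_mean_squared_error([210.0, 150.0, 305.0], [200.0, 160.0, 300.0])
print(rmse)
```

RMSE is in the same units as the target, so it can be compared directly with the range of median_house_value.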

…I'll stop copying here for now. To be continued.