Deep Learning with Python Notes (Part 1): Deep Learning Fundamentals

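A first look at a neural network

Let's look at a first concrete example of a neural network, which uses the Python library Keras to learn to classify handwritten digits.
MNIST is a dataset of 28 x 28 grayscale images of handwritten digits in 10 classes (0 through 9). You can think of "solving" MNIST as the "Hello World" of deep learning: it is what you do to verify that your algorithms work as expected.

Loading the MNIST dataset in Keras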

from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

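train_images and train_labels form the "training set", the data the model will learn from. The model will then be tested on the "test set", test_images and test_labels. The images are encoded as NumPy arrays, and the labels are simply an array of digits ranging from 0 to 9. There is a one-to-one correspondence between the images and the labels.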

The training data


>>> train_images.shape
(60000, 28, 28)
>>> len(train_labels)
60000
>>> train_labels
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

The test data

>>> test_images.shape
(10000, 28, 28)
>>> len(test_labels)
10000
>>> test_labels
array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)

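Our workflow is as follows: first, we feed the neural network the training data, train_images and train_labels. The network then learns to associate images with labels. Finally, we ask the network to produce predictions for test_images, and we verify whether these predictions match the labels in test_labels.

The network architecture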

from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))  # fully connected layer: 512 units, relu activation, input size 28 * 28
network.add(layers.Dense(10, activation='softmax'))  # output layer: returns probabilities for the 10 classes

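Here our network consists of a sequence of two Dense layers, which are densely connected (also called "fully connected") neural layers. The second (and last) layer is a 10-way "softmax" layer, which means it returns an array of 10 probability scores (summing to 1).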
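To make the softmax output concrete, here is a minimal NumPy sketch of the function itself. It is not how Keras implements the layer; the `softmax` helper and the `scores` values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Map a 1D array of raw scores to probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores for the 10 digit classes
scores = np.array([1.0, 2.0, 0.5, 0.1, 3.0, 0.0, -1.0, 0.3, 0.7, 1.5])
probs = softmax(scores)
predicted_class = int(np.argmax(probs))  # index of the most probable class
print(probs.sum(), predicted_class)
```

The class with the largest raw score always receives the largest probability, so taking the argmax of the softmax output gives the predicted digit.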

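To make our network ready for training, we need to pick three more things as part of the "compilation" step:

**1. A loss function:** how the network measures its performance on the training data, and thus how it is able to steer itself in the right direction.
**2. An optimizer:** the mechanism through which the network updates itself based on the data it sees and its loss function, for example SGD or RMSprop.
**3. Metrics to monitor:** for example accuracy, the fraction of images that are correctly classified.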
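As a rough illustration of what a loss and a metric compute, the sketch below evaluates categorical cross-entropy and accuracy by hand in NumPy. The helper names and the two sample predictions are assumptions for this example, not Keras internals.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the target."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# Two hypothetical samples over 3 classes
y_true = np.array([[1., 0., 0.], [0., 1., 0.]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
loss = categorical_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(loss, acc)
```

Both samples are classified correctly, yet the loss is not zero: the loss also rewards confidence, which is what gives the optimizer a useful signal.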

Compiling the network

network.compile(optimizer='rmsprop',
		loss='categorical_crossentropy',
		metrics=['accuracy'])

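Before training, we preprocess the data by reshaping it into the shape the network expects and scaling it so that all values lie in the [0, 1] interval.
Before preprocessing, the training images are stored in an array of shape (60000, 28, 28) of type uint8, with values in the [0, 255] interval. We transform it into a float32 array of shape (60000, 28 * 28) with values between 0 and 1.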

Preparing the image data

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

We also need to categorically encode the labels (one-hot encoding).

from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

We are now ready to train the network, which in Keras is done via a call to the network's fit method: we "fit" the model to its training data.

Training the network


>>> network.fit(train_images, train_labels, epochs=5, batch_size=128)
Epoch 1/5
60000/60000 [==============================] - 9s - loss: 0.2524 - acc: 0.9273
Epoch 2/5
51328/60000 [========================>.....] - ETA: 1s - loss: 0.1035 - acc: 0.9692

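We quickly reach an accuracy of 0.989 (that is, 98.9%) on the training data.

Evaluating the network

test_loss, test_acc = network.evaluate(test_images, test_labels)
print('test_acc:', test_acc)
test_acc: 0.9785

The test set accuracy is 97.8%, about one percentage point lower than the training set accuracy. This gap between training accuracy and test accuracy is an example of "overfitting": machine learning models tend to perform worse on new data than on their training data.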

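Tensors

Scalars (0D tensors)

A tensor that contains only one number is called a "scalar" (or "scalar tensor", 0-dimensional tensor, or 0D tensor). In NumPy, a float32 or float64 number is a scalar tensor (or scalar array). You can display the number of axes of a NumPy tensor via the ndim attribute; a scalar tensor has 0 axes (ndim == 0). The number of axes of a tensor is also called its rank.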


>>> import numpy as np
>>> x = np.array(12)
>>> x
array(12)
>>> x.ndim
0

Vectors (1D tensors)

An array of numbers is called a vector, or 1D tensor. A 1D tensor is said to have exactly one axis.


>>> x = np.array([12, 3, 6, 14])
>>> x
array([12, 3, 6, 14])
>>> x.ndim
1

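This vector has four entries, so it is called a "4-dimensional vector". Don't confuse a 4D vector with a 4D tensor! A 4D vector has only one axis and four dimensions along that axis, whereas a 4D tensor has four axes (and may have any number of dimensions along each axis).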

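Matrices (2D tensors)

An array of vectors is a matrix, or 2D tensor. A matrix has two axes (often referred to as "rows" and "columns"). You can visually interpret a matrix as a rectangular grid of numbers.


>>> x = np.array([[5, 78, 2, 34, 0],
...               [6, 79, 3, 35, 1],
...               [7, 80, 4, 36, 2]])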
>>> x.ndim
2

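The entries along the first axis are called "rows" and the entries along the second axis are called "columns". In the example above, [5, 78, 2, 34, 0] is the first row of x, and [5, 6, 7] is the first column.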

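3D tensors and higher-dimensional tensors

If you pack such matrices in a new array, you obtain a 3D tensor, which you can visually interpret as a cube of numbers:

>>> x = np.array([[[5, 78, 2, 34, 0],
...                [6, 79, 3, 35, 1],
...                [7, 80, 4, 36, 2]],
...               [[5, 78, 2, 34, 0],
...                [6, 79, 3, 35, 1],
...                [7, 80, 4, 36, 2]],
...               [[5, 78, 2, 34, 0],
...                [6, 79, 3, 35, 1],
...                [7, 80, 4, 36, 2]]])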
>>> x.ndim
3

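By packing 3D tensors in an array, you can create a 4D tensor, and so on. In deep learning you will generally manipulate tensors that are 0D to 4D, although you may go up to 5D if you process video data.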

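Key attributes of a tensor

A tensor is defined by three key attributes:
**1. The number of axes (rank).** For instance, a 3D tensor has 3 axes and a matrix has 2 axes. This is also called the tensor's ndim in Python libraries such as NumPy.
**2. The shape.** This is a tuple of integers that describes how many dimensions the tensor has along each axis. For instance, the matrix example above has shape (3, 5), and the 3D tensor example has shape (3, 3, 5). A vector has a shape with a single element, such as (5,), whereas a scalar has an empty shape, ().
**3. The data type,** usually called dtype in Python libraries. This is the type of the data contained in the tensor, for instance float32, uint8, or float64.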
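The three attributes can be inspected side by side. The sketch below uses small made-up arrays (the `examples` dictionary is only for illustration) rather than a real dataset.

```python
import numpy as np

examples = {
    "scalar": np.array(12),
    "vector": np.array([12, 3, 6, 14]),
    "matrix": np.array([[5, 78, 2, 34, 0],
                        [6, 79, 3, 35, 1],
                        [7, 80, 4, 36, 2]]),
    "images": np.zeros((4, 28, 28), dtype=np.uint8),  # stand-in for a batch of images
}

for name, t in examples.items():
    # ndim = number of axes, shape = size along each axis, dtype = element type
    print(name, t.ndim, t.shape, t.dtype)
```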

To make this more concrete, let's look back at the data we processed in the MNIST example:

from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
>>> print(train_images.ndim)
3
>>> print(train_images.shape)
(60000, 28, 28)
>>> print(train_images.dtype)
uint8

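So what we have here is a 3D tensor of 8-bit integers. More precisely, it is an array of 60,000 matrices of 28 x 28 integers. Each such matrix is a grayscale image, with coefficients between 0 and 255.

Let's display the digit at index 4 of this 3D tensor (that is, the fifth image) using Matplotlib, part of the standard scientific Python suite: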

digit = train_images[4]
import matplotlib.pyplot as plt
plt.imshow(digit, cmap=plt.cm.binary)
plt.show()


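Manipulating tensors in NumPy

Selecting specific elements in a tensor is called "tensor slicing". For example, the following selects digits #10 to #100 (#100 is not included) and puts them in an array of shape (90, 28, 28):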

>>> my_slice = train_images[10:100]
>>> print(my_slice.shape)
(90, 28, 28)

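Equivalently, you can specify a start index and a stop index for the slice along each tensor axis. Note that ":" selects the entire axis:

>>> my_slice = train_images[10:100, :, :] # equivalent to the above example
>>> my_slice.shape
(90, 28, 28)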
>>> my_slice = train_images[10:100, 0:28, 0:28] # also equivalent to the above example
>>> my_slice.shape
(90, 28, 28)

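In general, you can select between any two indices along each tensor axis. For instance, to select the 14 x 14 pixels in the bottom-right corner of all images, you would do this: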

my_slice = train_images[:, 14:, 14:]

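It is also possible to use negative indices. Much like negative indices in Python lists, they indicate a position relative to the end of the current axis. To crop the images to patches of 14 x 14 pixels centered in the middle, you would do this: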

my_slice = train_images[:, 7:-7, 7:-7]
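To check the resulting shapes without downloading MNIST, the slices can be tried on a stand-in array. The `images` array below is an assumption (100 blank images), not the real training data.

```python
import numpy as np

# Stand-in for train_images: 100 blank 28x28 "images" (assumed shape, not real data)
images = np.zeros((100, 28, 28), dtype=np.uint8)

first_batch = images[10:20]          # samples 10..19, full images
bottom_right = images[:, 14:, 14:]   # bottom-right 14x14 corner of every image
center = images[:, 7:-7, 7:-7]       # central 14x14 patch of every image

print(first_batch.shape, bottom_right.shape, center.shape)
```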

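Data batches

Deep learning models do not process an entire dataset at once; rather, they break the data into small batches. Concretely, here is one batch of MNIST digits, with a batch size of 128: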

batch = train_images[:128]
# and here's the next batch
batch = train_images[128:256]
# and the n-th batch:
batch = train_images[128 * n:128 * (n + 1)]

When considering such a batch tensor, the first axis (axis 0) is called the "batch axis" or "batch dimension".
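The n-th batch formula generalizes to a small generator. This is only a sketch: `iterate_batches` is a hypothetical helper, not a Keras function, and the data is a made-up array of 300 samples so that the last batch comes out smaller.

```python
import numpy as np

def iterate_batches(data, batch_size=128):
    """Yield consecutive slices of `data` along axis 0 (the batch axis)."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Stand-in data with 300 samples (assumed), so the last batch is smaller
data = np.zeros((300, 28, 28))
batch_shapes = [batch.shape for batch in iterate_batches(data)]
print(batch_shapes)  # [(128, 28, 28), (128, 28, 28), (44, 28, 28)]
```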

Real-world examples of data tensors

The data you will manipulate will almost always fall into one of the following categories:
**1. Vector data:** 2D tensors of shape (samples, features).
**2. Timeseries data or sequence data:** 3D tensors of shape (samples, timesteps, features).
**3. Images:** 4D tensors of shape (samples, width, height, channels) or (samples, channels, width, height).
**4. Video:** 5D tensors of shape (samples, frames, width, height, channels) or (samples, frames, channels, width, height).

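Vector data

In such a dataset, each single data point can be encoded as a vector, so a batch of data is encoded as a 2D tensor (that is, an array of vectors), where the first axis is the "samples axis" and the second axis is the "features axis". For example:

An actuarial dataset of people, where we consider each person's age, ID code, and income. Each person can be characterized as a vector of 3 values, so an entire dataset of 100,000 people can be stored in a 2D tensor of shape (100000, 3).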

Timeseries data or sequence data

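Whenever time (or the notion of sequence order) matters in your data, it makes sense to store it in a 3D tensor with an explicit time axis. Each sample can be encoded as a sequence of vectors (a 2D tensor), so a batch of data is encoded as a 3D tensor.

A dataset of stock prices. Every minute, we store the current price of the stock, the highest price in the past minute, and the lowest price in the past minute. Every minute is thus encoded as a 3-dimensional vector, an entire day of trading (390 minutes) is encoded as a 2D tensor of shape (390, 3), and 250 days' worth of data can be stored in a 3D tensor of shape (250, 390, 3). Here, each sample is one day's worth of data.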

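Image data

Images typically have 3 dimensions: width, height, and color depth. Although grayscale images (like our MNIST digits) have only a single color channel and could thus be stored in 2D tensors, by convention image tensors are always 3D, with a one-dimensional color channel for grayscale images.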

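Video data

Video data is one of the few types of real-world data for which you need 5D tensors. A video can be understood as a sequence of frames, each frame being a color image. Because each frame can be stored in a 3D tensor (width, height, color_depth), a sequence of frames can be stored in a 4D tensor (frames, width, height, color_depth), and thus a batch of different videos can be stored in a 5D tensor of shape (samples, frames, width, height, color_depth).

For instance, a 60-second, 256 x 144 YouTube video clip sampled at 4 frames per second has 240 frames. A batch of 4 such video clips would be stored in a tensor of shape (4, 240, 256, 144, 3). That is a total of 106,168,320 values! If the dtype of the tensor is float32, each value is stored in 32 bits, so the tensor takes up about 425 MB. Videos you encounter in real life are much lighter, because they are not stored in float32 and are typically compressed by a large factor (for example, in the MPEG format).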
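The arithmetic behind these figures can be checked in a few lines. The uint8 comparison is an added assumption to show why the dtype matters; it is not a claim about how any particular video is stored.

```python
import math

shape = (4, 240, 256, 144, 3)   # (samples, frames, width, height, color_depth)
num_values = math.prod(shape)   # total number of scalar entries
bytes_float32 = num_values * 4  # float32 uses 4 bytes per value
bytes_uint8 = num_values * 1    # uint8 uses 1 byte per value

print(num_values)               # 106168320
print(bytes_float32 / 1e6)      # about 424.7 (decimal megabytes)
print(bytes_uint8 / 1e6)        # about 106.2
```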

Tensor operations

A naive implementation of the relu operation:

def naive_relu(x):
    # x is a 2D NumPy tensor
    assert len(x.shape) == 2  # assert statement: fail early on bad input
    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x
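In practice the same element-wise operation is usually written as a single vectorized NumPy call. The sketch below compares it against an entry-by-entry computation on a small made-up input.

```python
import numpy as np

x = np.array([[-1.0, 2.0],
              [3.0, -4.0]])

# Element-wise relu in one vectorized call, with no explicit Python loops
vectorized = np.maximum(x, 0.)

# Reference result computed entry by entry, as in the naive version
looped = np.array([[max(v, 0.) for v in row] for row in x])
print(vectorized)
```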

The Python assert statement behaves as follows:

>>> assert 1==1
>>> assert 1==0
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    assert 1==0
AssertionError
>>> assert True
>>> assert False
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    assert False
AssertionError
>>> assert 3<2
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    assert 3<2
AssertionError

A naive implementation of adding a matrix and a vector:

def naive_add_matrix_and_vector(x, y):
    # x is a 2D NumPy tensor
    # y is a NumPy vector
    assert len(x.shape) == 2
    assert len(y.shape) == 1
    assert x.shape[1] == y.shape[0]
    x = x.copy()  # avoid overwriting the input tensor
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]
    return x
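NumPy performs this matrix-plus-vector addition automatically through broadcasting. The following sketch, on made-up inputs, shows the broadcast form next to an explicit version that materializes the repeated rows.

```python
import numpy as np

x = np.arange(6, dtype=np.float64).reshape((2, 3))  # shape (2, 3)
y = np.array([10., 20., 30.])                       # shape (3,)

# Broadcasting: y is virtually repeated along axis 0 to match x
z = x + y
# Explicit version of the same idea, materializing the repeated rows
z_explicit = x + np.tile(y, (x.shape[0], 1))
print(z)
```

Broadcasting avoids allocating the repeated copy of y, which matters when x is large.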

Applying the element-wise maximum operation to two tensors of different shapes via broadcasting:

import numpy as np
# x is a random tensor with shape (64, 3, 32, 10)
x = np.random.random((64, 3, 32, 10))
# y is a random tensor with shape (32, 10)
y = np.random.random((32, 10))
# The output z has shape (64, 3, 32, 10) like x
z = np.maximum(x, y)

Tensor dot product

import numpy as np
z = np.dot(x, y)

A naive matrix-vector dot product:

import numpy as np
def naive_matrix_vector_dot(x, y):
	# x is a Numpy matrix
	# y is a Numpy vector
	assert len(x.shape) == 2
	assert len(y.shape) == 1
	# The 1st dimension of x must be
	# the same as the 0th dimension of y!
	assert x.shape[1] == y.shape[0]
	# This operation returns a vector of 0s
	# with one entry per row of x
	z = np.zeros(x.shape[0])
	for i in range(x.shape[0]):
		for j in range(x.shape[1]):
			z[i] += x[i, j] * y[j]
	return z
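As a quick sanity check of the shape rule, here is the same matrix-vector product computed with `np.dot` on small made-up inputs, next to a row-by-row computation.

```python
import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.]])   # shape (2, 3)
y = np.array([1., 0., -1.])    # shape (3,)

z = np.dot(x, y)               # matrix-vector product, shape (2,)
# Each entry of z is the dot product of one row of x with y
z_rows = np.array([np.sum(row * y) for row in x])
print(z)
```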

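More generally, you can take the dot product of two matrices x and y (np.dot(x, y)) if and only if x.shape[1] == y.shape[0]. The result is a matrix of shape (x.shape[0], y.shape[1]), whose coefficients are the vector products between the rows of x and the columns of y.

Tensor reshaping

>>> x = np.array([[0., 1.],
...               [2., 3.],
...               [4., 5.]])
>>> print(x.shape)
(3, 2)
>>> x = x.reshape((6, 1))
>>> x
array([[ 0.],
       [ 1.],
       [ 2.],
       [ 3.],
       [ 4.],
       [ 5.]])
>>> x = x.reshape((2, 3))
>>> x
array([[ 0., 1., 2.],
       [ 3., 4., 5.]])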

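Anatomy of a neural network

As we saw in the previous sections, training a neural network revolves around the following objects:
1. Layers, which are combined into a network (or model).
2. The input data and corresponding targets.
3. The loss function, which defines the feedback signal used for learning.
4. The optimizer, which determines how learning proceeds.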

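Layers: the building blocks of deep learning

The fundamental data structure in neural networks is the "layer", introduced earlier. A layer is a data-processing module that takes as input one or more tensors and outputs one or more tensors. Some layers are stateless, but more frequently layers have a state: the layer's "weights", one or several tensors learned with stochastic gradient descent.

Different layers are appropriate for different tensor formats and different types of data processing. For instance, simple vector data, stored in 2D tensors of shape (samples, features), is often processed by "fully connected" layers (the Dense class in Keras). Sequence data, stored in 3D tensors of shape (samples, timesteps, features), is typically processed by "recurrent" layers such as an LSTM layer. Image data, stored in 4D tensors, is usually processed by 2D convolution layers (Conv2D).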

from keras import layers
# A dense layer with 32 output units
layer = layers.Dense(32, input_shape=(784,))

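This creates a layer that only accepts as input 2D tensors whose first dimension is 784 (the zero-th dimension, the batch dimension, is unspecified, so any value would be accepted). This layer returns a tensor whose first dimension has been transformed to 32.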


>>> layer.output_shape
(None, 32)

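Thus this layer can only be connected to a downstream layer that expects 32-dimensional vectors as its input. When using Keras you don't have to worry about compatibility, because the layers you add to a model are dynamically built to match the shape of the incoming layer. In the following example, the second layer receives no input_shape argument and instead infers its input shape from the output of the layer before it: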

from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(32, input_shape=(784,)))
model.add(layers.Dense(32))

A model defined with the Sequential class:

from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
model.add(layers.Dense(10, activation='softmax'))

The same model defined with the functional API:

input_tensor = layers.Input(shape=(784,))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = models.Model(inputs=input_tensor, outputs=output_tensor)

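Once the model architecture is defined, it does not matter whether you used a Sequential model or the functional API: all of the following steps are the same.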

from keras import optimizers
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
		loss='mse',
		metrics=['accuracy'])
model.fit(input_tensor, target_tensor, batch_size=128, epochs=10)

Classifying movie reviews: a binary classification example

Loading the IMDB dataset

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

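The argument num_words=10000 means that we only keep the top 10,000 most frequently occurring words in the training data. Rare words are discarded.

The variables train_data and test_data are lists of reviews, each review being a list of word indices (encoding a sequence of words). train_labels and test_labels are lists of 0s and 1s, where 0 stands for "negative" and 1 stands for "positive".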

>>> train_data[0]
[1, 14, 22, 16, ... 178, 32]
>>> train_labels[0]
1
>>> max([max(sequence) for sequence in train_data])
9999

We cannot feed lists of integers into a neural network; we have to turn the lists into tensors. Here we encode each list as a 10,000-dimensional vector of 0s and 1s.

import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results
# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)
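To see what this encoding produces, the function can be run on a toy input. The two "reviews" and the vocabulary size of 8 below are made up for illustration.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the listed word indices to 1
    return results

# Two hypothetical "reviews" over a toy vocabulary of 8 words
toy = [[1, 3, 3, 5], [0, 7]]
encoded = vectorize_sequences(toy, dimension=8)
print(encoded)
```

Note that the repeated index 3 in the first review still yields a single 1: this encoding records which words occur, not how often or in what order.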

An encoded sample:

>>> x_train[0]
array([ 0., 1., 1., ..., 0., 0., 0.])

Vectorizing the labels:

# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

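Compiling the model (the network itself, two 16-unit relu layers followed by a single sigmoid unit, is defined in full in the retraining snippet at the end of this section):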

model.compile(optimizer='rmsprop',
		loss='binary_crossentropy',
		metrics=['accuracy'])

Or, to configure the optimizer's parameters, pass an optimizer instance:

from keras import optimizers
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
		loss='binary_crossentropy',
		metrics=['accuracy'])

Using custom losses and metrics by passing function objects:

from keras import losses
from keras import metrics
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
		loss=losses.binary_crossentropy,
		metrics=[metrics.binary_accuracy])

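Validating the approach

To monitor the accuracy of the model during training on data it has not been trained on, we create a "validation set" by setting apart 10,000 samples from the original training data.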

x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

Training the model:

history = model.fit(partial_x_train,
		partial_y_train,epochs=20,
		batch_size=512,
		validation_data=(x_val, y_val))

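Note that the call to model.fit() returns a History object. This object has a member history, which is a dictionary containing data about everything that happened during training.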

>>> history_dict = history.history
>>> history_dict.keys()
[u'acc', u'loss', u'val_acc', u'val_loss']

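The dictionary contains 4 entries, one per metric monitored during training and validation. We can use them to plot the curves: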

import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()


plt.clf() # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


The network performs best around epoch 4, so we retrain a new network from scratch with epochs=4 and then evaluate it on the test data:

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)

>>> results
[0.2929924130630493, 0.88327999999999995]
>>> model.predict(x_test)
[[ 0.98006207]
[ 0.99758697]
[ 0.99975556]
...,
[ 0.82167041]
[ 0.02885115]
[ 0.65371346]]
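The sigmoid outputs are probabilities of the positive class, so turning them into class labels is a matter of thresholding. The sketch below uses made-up probabilities and labels with the same (samples, 1) layout; the 0.5 threshold is a common default, not something fixed by the model.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped like model.predict(x_test): (samples, 1)
probabilities = np.array([[0.98], [0.03], [0.65], [0.49]])
labels = np.array([1, 0, 0, 0])  # made-up ground truth

predicted = (probabilities[:, 0] > 0.5).astype(int)  # 1 = positive, 0 = negative
accuracy = float(np.mean(predicted == labels))
print(predicted, accuracy)
```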

Classifying newswires: a multiclass classification example

Loading the Reuters dataset

from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

As with the IMDB dataset, the argument num_words=10000 restricts the data to the 10,000 most frequently occurring words found in the data.

>>> len(train_data)
8982
>>> len(test_data)
2246

The data and labels look like this:

>>> train_data[10]
[1, 245, 273, 207, 156, 53, 74, 160, 26, 14, 46, 296, 26, 39, 74, 2979,
3554, 14, 46, 4689, 4329, 86, 61, 3499, 4795, 14, 61, 451, 4329, 17, 12]
>>> train_labels[10]
3

Preparing the data

import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results
# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)

One-hot encoding the labels

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results
# Our vectorized training labels
one_hot_train_labels = to_one_hot(train_labels)
# Our vectorized test labels
one_hot_test_labels = to_one_hot(test_labels)

One-hot encoding the labels, the Keras way

from keras.utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

Defining the model:

from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

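Two things are worth noting about this architecture:

We end the network with a Dense layer of size 46. This means that for each input sample, the network outputs a 46-dimensional vector. Each entry (each dimension) of this vector encodes a different output class.
The last layer uses a softmax activation, the same pattern you saw in the MNIST example. It means that the network outputs a probability distribution over the 46 different output classes: for every input sample, the network produces a 46-dimensional output vector, where output[i] is the probability that the sample belongs to class i. The 46 scores sum to 1.

Compiling the model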

model.compile(optimizer='rmsprop',
		loss='categorical_crossentropy',
		metrics=['accuracy'])

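Validating the approach: we set apart 1,000 samples from the training data as a validation set, then train for 20 epochs.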

x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]
history = model.fit(partial_x_train,
	partial_y_train,
	epochs=20,
	batch_size=512,
	validation_data=(x_val, y_val))

Plotting the training and validation curves:

import matplotlib.pyplot as plt
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()


plt.clf() # clear figure
acc = history.history['acc']
val_acc = history.history['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


Generating predictions on new data

predictions = model.predict(x_test)

>>> predictions[0].shape
(46,)
>>> np.sum(predictions[0])
1.0
>>> np.argmax(predictions[0])
4

A different way to handle the labels and the loss

Another way to encode the labels is to cast them as an integer tensor, like this:

y_train = np.array(train_labels)
y_test = np.array(test_labels)

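The only thing this changes is the choice of the loss function. The previous loss, categorical_crossentropy, expects the labels to follow a categorical (one-hot) encoding. With integer labels, we should use sparse_categorical_crossentropy instead.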

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['acc'])
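The two losses compute the same quantity and differ only in the label format they accept. The NumPy sketch below illustrates this on made-up predictions; the two helper functions are simplified stand-ins written for this example, not the Keras implementations.

```python
import numpy as np

def categorical_crossentropy(one_hot, probs):
    return -np.sum(one_hot * np.log(probs), axis=1)

def sparse_categorical_crossentropy(int_labels, probs):
    # Pick the predicted probability of the true class for each sample
    return -np.log(probs[np.arange(len(int_labels)), int_labels])

probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.25, 0.25]])   # hypothetical softmax outputs
int_labels = np.array([1, 0])
one_hot = np.eye(3)[int_labels]          # same labels, one-hot encoded

a = categorical_crossentropy(one_hot, probs)
b = sparse_categorical_crossentropy(int_labels, probs)
print(a, b)
```

The sparse form avoids building a (samples, classes) one-hot matrix, which saves memory when there are many classes.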

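Predicting house prices: a regression example

In the two previous examples we considered classification problems, where the goal was to predict a single discrete label for an input data point. Another common type of machine learning problem is "regression", which consists of predicting a continuous value instead of a discrete label. For instance, predicting the temperature tomorrow given meteorological data, or predicting the time that a software project will take to complete.

Loading the dataset: the Boston Housing Price dataset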

from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
>>> train_data.shape
(404, 13)
>>> test_data.shape
(102, 13)

As you can see, we have 404 training samples and 102 test samples, each with 13 features.

The training targets look like this:

>>> train_targets
[ 15.2, 42.3, 50. ... 19.4, 19.4, 29.1]

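Preparing the data
It would be problematic to feed a neural network values that all take wildly different ranges. The network might be able to adapt automatically to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice for dealing with such data is feature-wise normalization: for each feature in the input data (a column in the input data matrix), we subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has unit standard deviation. Note that the mean and standard deviation are computed on the training data only and then reused to normalize the test data.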

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std
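The effect of this normalization can be verified on synthetic data. In the sketch below the three feature scales are invented for illustration; the point is that the training statistics are reused for the test split.

```python
import numpy as np

# Hypothetical data: 3 features on very different scales
rng = np.random.default_rng(0)
train = rng.normal(loc=[0.5, 50., 5000.], scale=[0.1, 10., 1000.], size=(200, 3))
test = rng.normal(loc=[0.5, 50., 5000.], scale=[0.1, 10., 1000.], size=(50, 3))

mean = train.mean(axis=0)   # per-feature statistics from the training data only
std = train.std(axis=0)
train_norm = (train - mean) / std
test_norm = (test - mean) / std   # reuse the training statistics for the test data

print(train_norm.mean(axis=0), train_norm.std(axis=0))
```

The normalized test features will be close to, but not exactly, zero mean and unit variance, because they were scaled with statistics from a different sample.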

Building the model

from keras import models
from keras import layers
def build_model():
	# Because we will need to instantiate
	# the same model multiple times,
	# we use a function to construct it.
	model = models.Sequential()
	model.add(layers.Dense(64, activation='relu',
	input_shape=(train_data.shape[1],)))
	model.add(layers.Dense(64, activation='relu'))
	model.add(layers.Dense(1))
	model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
	return model

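Our network ends with a single unit and no activation (it is a linear layer). This is a typical setup for scalar regression (regression where we are trying to predict a single continuous value). Applying an activation function would constrain the range that the output can take; for instance, if we applied a sigmoid activation function to the last layer, the network could only learn to predict values between 0 and 1. Here, because the last layer is purely linear, the network is free to learn to predict values in any range.

Validating the approach using K-fold validation
To evaluate the network while we keep adjusting its parameters (such as the number of epochs used for training), we could simply split the data into a training set and a validation set, as we did in the previous examples. However, because we have so few data points, the validation set would end up being very small (about 100 examples). As a consequence, the validation scores might change a lot depending on which data points we chose for validation and which we chose for training: the validation scores would have a high variance with regard to the validation split, which would prevent us from reliably evaluating the model.

The best practice in such situations is to use K-fold cross-validation. It consists of splitting the available data into K partitions (typically K = 4 or 5), instantiating K identical models, and training each one on K - 1 partitions while evaluating on the remaining partition. The validation score for the model is then the average of the K validation scores obtained.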
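Before involving any model, the partitioning itself can be sketched with plain index arithmetic. `k_fold_slices` is a hypothetical helper written for this illustration; it only shows which samples land in the validation and training parts of each fold.

```python
def k_fold_slices(num_samples, k):
    """Return (validation_range, training_indices) for each of the k folds."""
    fold_size = num_samples // k
    folds = []
    for i in range(k):
        val = range(i * fold_size, (i + 1) * fold_size)
        train = [j for j in range(num_samples) if j not in val]
        folds.append((val, train))
    return folds

folds = k_fold_slices(num_samples=404, k=4)
for val, train in folds:
    print(len(val), len(train))  # 101 303 for each fold
```

Every sample is used for validation exactly once across the folds (when the sample count divides evenly by K), which is what makes the averaged score less sensitive to any single split.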

import numpy as np
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
	print('processing fold #', i)
	# Prepare the validation data: data from partition #i
	val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
	val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
	# Prepare the training data: data from all other partitions
	partial_train_data = np.concatenate(
		[train_data[:i * num_val_samples],
		train_data[(i + 1) * num_val_samples:]],
		axis=0)
	partial_train_targets = np.concatenate(
		[train_targets[:i * num_val_samples],
		train_targets[(i + 1) * num_val_samples:]],
		axis=0)
	# Build the Keras model (already compiled)
	model = build_model()
	# Train the model (in silent mode, verbose=0)
	model.fit(partial_train_data, partial_train_targets,
	epochs=num_epochs, batch_size=1, verbose=0)
	# Evaluate the model on the validation data
	val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
	all_scores.append(val_mae)

Running the snippet above with num_epochs = 100 yields the following results:

>>> all_scores
[2.588258957792037, 3.1289568449719116, 3.1856116051248984, 3.0763342615401386]
>>> np.mean(all_scores)
2.9947904173572462

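Let's try training the network for a bit longer: 500 epochs. To keep a record of how well the model does at each epoch, we modify the training loop to save the per-epoch validation score log.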

num_epochs = 500
all_mae_histories = []
for i in range(k):
	print('processing fold #', i)
	# Prepare the validation data: data from partition #i
	val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
	val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
	# Prepare the training data: data from all other partitions
	partial_train_data = np.concatenate(
		[train_data[:i * num_val_samples],
		train_data[(i + 1) * num_val_samples:]],
		axis=0)
	partial_train_targets = np.concatenate(
		[train_targets[:i * num_val_samples],
		train_targets[(i + 1) * num_val_samples:]],
		axis=0)
	# Build the Keras model (already compiled)
	model = build_model()
	# Train the model (in silent mode, verbose=0)
	history = model.fit(partial_train_data, partial_train_targets,
	validation_data=(val_data, val_targets),
	epochs=num_epochs, batch_size=1, verbose=0)
	mae_history = history.history['val_mean_absolute_error']
	all_mae_histories.append(mae_history)

We can then compute the average of the per-epoch MAE scores across all folds:

average_mae_history = [
	np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]

import matplotlib.pyplot as plt
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

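To make the plot easier to read, we do two things:
Omit the first 10 data points, which are on a different scale from the rest of the curve.
Replace each point with an exponential moving average of the previous points, to obtain a smooth curve.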

def smooth_curve(points, factor=0.9):
	smoothed_points = []
	for point in points:
		if smoothed_points:
			previous = smoothed_points[-1]
			smoothed_points.append(previous * factor + point * (1 - factor))
		else:
			smoothed_points.append(point)
	return smoothed_points
	
smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

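According to this plot, the validation MAE stops improving significantly after about 80 epochs. Past that point, we start overfitting.
Once we are done tuning the other parameters of the model (besides the number of epochs, we could also adjust the size of the hidden layers), we can train a final "production" model on all of the training data with the best parameters, and then look at its performance on the test data.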

# Get a fresh, compiled model.
model = build_model()
# Train it on the entirety of the data.
model.fit(train_data, train_targets,
epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

>>> test_mae_score
2.5532484335057877