TensorFlow Estimator official documentation: Feature columns

Feature column

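This document describes feature columns in detail. Think of feature columns as the intermediaries between raw data and Estimators. Feature columns are very rich, letting you transform a wide range of raw data into formats that Estimators can use, which makes experimentation easy.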

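In the Premade Estimators tutorial, we trained a tf.estimator.DNNClassifier to classify Iris flowers. That example used only numeric feature columns (tf.feature_column.numeric_column). Although numeric columns effectively represent the lengths and widths of petals and sepals, real-world datasets contain all kinds of features, many of which are non-numerical.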

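1. Input to a deep neural network

A deep neural network can operate only on numerical data, but the features we collect are not all numerical. Consider, for example, a product_class feature that can take one of the following three non-numerical values: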

kitchenware
electronics
sports

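Machine learning models generally represent categorical values as simple vectors in which a 1 represents the presence of a value and a 0 represents its absence. For example, when product_class is set to sports, a model would usually represent product_class as [0, 0, 1], meaning: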

0: kitchenware is absent.
0: electronics is absent.
1: sports is present.

So, although raw data can be numerical or categorical, a machine learning model represents all features as numbers.
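
The one-hot representation described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not TensorFlow code; the helper name one_hot and the constant PRODUCT_CLASSES are made up for this example.

```python
# Hypothetical helper (not part of TensorFlow): one-hot encode a category.
PRODUCT_CLASSES = ["kitchenware", "electronics", "sports"]


def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]


print(one_hot("sports", PRODUCT_CLASSES))  # [0, 0, 1]
```

Raising an error for unknown values is a choice made for this sketch; a real pipeline might instead reserve an extra slot for out-of-vocabulary inputs.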

2. Feature Columns

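As the following figure suggests, you specify the input to a model through the feature_columns argument of an Estimator. Feature columns bridge the input data (as returned by input_fn) and the model.
[Figure: feature columns bridge raw input data and the model]
To create feature columns, call functions from the tf.feature_column module. This document covers nine of the functions in that module. All of them return either a Categorical Column or a Dense Column object, except bucketized_column, which inherits from both classes.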

The sections below look at these functions in more detail.

2.1 Numeric column

The Iris classifier calls the tf.feature_column.numeric_column function for all input features:

SepalLength
SepalWidth
PetalLength
PetalWidth

tf.feature_column.numeric_column accepts several optional arguments. Called without them, it specifies a numerical value with the default data type, tf.float32:

import tensorflow as tf

# Defaults to a tf.float32 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")

To specify a non-default numerical data type, use the dtype argument:

# Represent a tf.float64 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength",
                                                          dtype=tf.float64)

By default, a numeric column represents a single value (a scalar). Use the shape argument to specify another shape:

# Represent a 10-element vector in which each cell contains a tf.float32.
vector_feature_column = tf.feature_column.numeric_column(key="Bowling",
                                                         shape=10)

# Represent a 10x5 matrix in which each cell contains a tf.float32.
matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix",
                                                         shape=[10,5])

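2.2 Bucketized column

Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. To do so, create a bucketized column with tf.feature_column.bucketized_column. For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:
[Figure: dividing year data into four buckets]
The model will represent the buckets as follows: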

Date range	Represented as…
< 1960	[1, 0, 0, 0]
>= 1960 but < 1980	[0, 1, 0, 0]
>= 1980 but < 2000	[0, 0, 1, 0]
>= 2000	[0, 0, 0, 1]

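Why would you want to split a number, a perfectly valid input to the model, into a categorical value? First, the categorization splits a single input number into a four-element vector, so the model can now learn four individual weights rather than just one; four weights make for a richer model than one. More importantly, bucketizing lets the model clearly distinguish between different year categories, since only one of the elements is set (1) and the other three are cleared (0). For example, when the year is used as a single input number, a linear model can learn only a linear relationship; with buckets, the model can learn more complex relationships.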

The following code demonstrates how to create a bucketized feature:

# First, convert the raw input to a numeric column.
numeric_feature_column = tf.feature_column.numeric_column("Year")

# Then, bucketize the numeric column on the boundaries [1960, 1980, 2000].
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column=numeric_feature_column,
    boundaries=[1960, 1980, 2000])

Note that specifying a three-element boundaries vector creates a four-element bucketized vector.
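
To make the boundary arithmetic concrete, here is a minimal plain-Python sketch of the bucketing rule from the table above. It does not call TensorFlow; the helper name bucketize is hypothetical, and the sketch assumes that a value equal to a boundary belongs to the bucket that starts at that boundary, as the ">=" entries in the table indicate.

```python
from bisect import bisect_right

BOUNDARIES = [1960, 1980, 2000]  # same boundaries as the example above


def bucketize(value, boundaries):
    """Return a one-hot list with len(boundaries) + 1 elements."""
    # bisect_right counts the boundaries that are <= value, so a value equal
    # to a boundary lands in the bucket that starts at that boundary.
    index = bisect_right(boundaries, value)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]


for year in (1955, 1960, 1999, 2000):
    print(year, bucketize(year, BOUNDARIES))
# 1955 [1, 0, 0, 0]
# 1960 [0, 1, 0, 0]
# 1999 [0, 0, 1, 0]
# 2000 [0, 0, 0, 1]
```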

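2.3 Categorical identity column

Categorical identity columns can be seen as a special case of bucketized columns. In a typical bucketized column, each bucket represents a range of values (for example, from 1960 to 1979). In a categorical identity column, each bucket represents a single, unique integer. For example, suppose you want to represent the integer range [0, 4), that is, the integers 0, 1, 2, or 3. In this case, the categorical identity mapping looks like this:
[Figure: categorical identity column mapping]

Note: the resulting encoding is one-hot, not a binary numerical encoding.
As with bucketized columns, a model can learn a separate weight for each class in a categorical identity column. For example, instead of using a string to represent product_class, let's represent each class with a unique integer value. That is: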

0="kitchenware"
1="electronics"
2="sports"

Call tf.feature_column.categorical_column_with_identity to implement a categorical identity column. For example:

# Create categorical output for an integer feature named "my_feature_b",
# The values of my_feature_b must be >= 0 and < num_buckets
identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key='my_feature_b',
    num_buckets=4) # Values [0, 4)

# In order for the preceding call to work, the input_fn() must return
# a dictionary containing 'my_feature_b' as a key. Furthermore, the values
# assigned to 'my_feature_b' must belong to the set [0, 4).
def input_fn():
    ...
    return ({ 'my_feature_a':[7, 9, 5, 2], 'my_feature_b':[3, 1, 2, 2] },
            [Label_values])

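2.4 Categorical vocabulary column

We cannot input strings directly to a model. Instead, we must first map strings to numeric or categorical values. Categorical vocabulary columns provide a good way to represent strings as a one-hot vector.
[Figure: mapping string values to vocabulary columns]
As you can see, categorical vocabulary columns are kind of an enum version of categorical identity columns. TensorFlow provides two different functions to create categorical vocabulary columns: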

tf.feature_column.categorical_column_with_vocabulary_list
tf.feature_column.categorical_column_with_vocabulary_file

categorical_column_with_vocabulary_list maps each string to an integer based on an explicit vocabulary list.

# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature by mapping the input to one of
# the elements in the vocabulary list.
vocabulary_feature_column = (
    tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature_name_from_input_fn,
        vocabulary_list=["kitchenware", "electronics", "sports"]))

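The preceding function is straightforward, but it has a significant drawback: there is far too much typing when the vocabulary list is long. For such cases, call tf.feature_column.categorical_column_with_vocabulary_file instead, which lets you place the vocabulary words in a separate file.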

# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature to our model by mapping the input to one of
# the elements in the vocabulary file
vocabulary_feature_column = (
    tf.feature_column.categorical_column_with_vocabulary_file(
        key=feature_name_from_input_fn,
        vocabulary_file="product_class.txt",
        vocabulary_size=3))

product_class.txt should contain one line for each vocabulary element. In our case:

kitchenware
electronics
sports
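
As a rough illustration of what a one-word-per-line vocabulary file encodes, the plain-Python sketch below reads such a file and maps each word to its line index. It does not use TensorFlow; load_vocabulary is a hypothetical helper, and the file is written to a temporary directory only so that the example is self-contained.

```python
import os
import tempfile


def load_vocabulary(path):
    """Read one vocabulary entry per line and map each entry to its line index."""
    with open(path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    return {word: index for index, word in enumerate(words)}


# Write the three-line example file to a temporary directory (illustration only).
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "product_class.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("kitchenware\nelectronics\nsports\n")
    vocabulary = load_vocabulary(vocab_path)

print(vocabulary)
```

How words missing from the file should be handled is left out of this sketch.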

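2.5 Hashed column

So far, we've worked with a small number of categories. When the number of categories becomes very large, however, it is not feasible to have an individual category for each vocabulary word or integer, because that would consume too much memory. For these cases, we can turn the question around and ask, "How many categories am I willing to have for my input?" In fact, the tf.feature_column.categorical_column_with_hash_bucket function lets you specify the number of categories. For this type of feature column, the model calculates a hash value of the input, then puts it into one of the hash_bucket_size categories using the modulo operator, as in the following pseudocode: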

# pseudocode
feature_id = hash(raw_feature) % hash_bucket_size

The code to create the categorical_column_with_hash_bucket might look something like this:

hashed_feature_column = (
    tf.feature_column.categorical_column_with_hash_bucket(
        key="some_feature",
        hash_bucket_size=100))  # The number of categories

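At this point, you might rightfully think: "This is crazy!" After all, we are forcing the different input values into a smaller set of categories. This means that two probably unrelated inputs will be mapped to the same category, and consequently mean the same thing to the neural network. The following figure illustrates this dilemma, showing that kitchenware and sports both get assigned to category (hash bucket) 12:
[Figure: kitchenware and sports both hash to bucket 12]
As with many counterintuitive phenomena in machine learning, it turns out that hashing often works well in practice. That's because hash categories provide the model with some separation. The model can use additional features to further separate kitchenware from sports.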
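
The hash-then-modulo pseudocode above can be turned into a small runnable sketch. TensorFlow uses its own hash function internally; here hashlib.md5 is only a deterministic stand-in, and the helper name hash_bucket and the category names are invented for the illustration.

```python
import hashlib


def hash_bucket(raw_feature, hash_bucket_size):
    """Map a string to an integer in [0, hash_bucket_size)."""
    # md5 is a stand-in hash chosen because it is stable across runs;
    # Python's built-in hash() is randomized per process for strings.
    digest = hashlib.md5(raw_feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


HASH_BUCKET_SIZE = 100
categories = [f"product_{i}" for i in range(1000)]  # made-up category names

buckets = {}
for name in categories:
    buckets.setdefault(hash_bucket(name, HASH_BUCKET_SIZE), []).append(name)

# 1000 inputs share at most 100 buckets, so collisions are unavoidable.
print(len(buckets), max(len(names) for names in buckets.values()))
```

With ten times more inputs than buckets, at least one bucket must hold ten or more inputs, which is exactly the kind of collision the paragraph above describes.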

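2.6 Crossed column

Combining multiple features into a single feature, better known as a feature cross, enables the model to learn separate weights for each combination of features.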

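More concretely, suppose we want our model to calculate real estate prices in Atlanta, GA. Real estate prices within this city vary greatly depending on location. Representing latitude and longitude as separate features isn't very useful in identifying real estate location dependencies; however, crossing latitude and longitude into a single feature can pinpoint locations. Suppose we represent Atlanta as a grid of 100x100 rectangular sections, identifying each of the 10,000 sections by a feature cross of latitude and longitude. This feature cross enables the model to train on pricing conditions related to each individual section, which is a much stronger signal than latitude and longitude alone.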

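The following figure shows our plan, with the latitude and longitude values for the corners of the city in red text:
[Figure: the Atlanta grid, with corner latitude and longitude values in red]
For the solution, we used a combination of the bucketized_column we looked at earlier with the tf.feature_column.crossed_column function.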

def make_dataset(latitude, longitude, labels):
    assert latitude.shape == longitude.shape == labels.shape

    features = {'latitude': latitude.flatten(),
                'longitude': longitude.flatten()}
    labels=labels.flatten()

    return tf.data.Dataset.from_tensor_slices((features, labels))


# Bucketize the latitude and longitude using the `edges`
latitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    list(atlanta.latitude.edges))

longitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'),
    list(atlanta.longitude.edges))

# Cross the bucketized columns, using 5000 hash bins.
crossed_lat_lon_fc = tf.feature_column.crossed_column(
    [latitude_bucket_fc, longitude_bucket_fc], 5000)

fc = [
    latitude_bucket_fc,
    longitude_bucket_fc,
    crossed_lat_lon_fc]

# Build and train the Estimator.
est = tf.estimator.LinearRegressor(fc, ...)

You may create a feature cross from either of the following:

Feature names; that is, names from the dict returned from input_fn.
Any categorical column except categorical_column_with_hash_bucket (since crossed_column hashes the input).

When the feature columns latitude_bucket_fc and longitude_bucket_fc are crossed, TensorFlow creates a (latitude_fc, longitude_fc) pair for each example. This produces a full grid of possibilities, as follows:

 (0,0),  (0,1)...  (0,99)
 (1,0),  (1,1)...  (1,99)
   ...     ...       ...
(99,0), (99,1)...(99, 99)

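To avoid building the complete, potentially huge table of inputs, crossed_column uses the hash_bucket_size argument to control the dimension of the crossed feature. The feature column assigns each example an index by running a hash function on the tuple of inputs, followed by a modulo operation with hash_bucket_size.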

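As discussed earlier, hashing followed by a modulo operation limits the number of categories, but can cause category collisions: multiple (latitude, longitude) feature crosses may end up in the same hash bucket. In practice, though, feature crosses still add significant value to the learning capability of your models.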
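
The effect of hashing a crossed pair into a limited number of bins can be sketched in plain Python. This is not how TensorFlow implements crossed_column internally; hashlib.md5 is a stand-in hash, crossed_index is a hypothetical helper, and the bucket indices are assumed to run from 0 to 99 in each dimension.

```python
import hashlib

GRID_SIZE = 100          # 100 x 100 grid, as in the text
HASH_BUCKET_SIZE = 5000  # same value as in the crossed_column call above


def crossed_index(lat_bucket, lon_bucket, hash_bucket_size):
    """Hash a (latitude bucket, longitude bucket) pair into a single index."""
    key = f"{lat_bucket}_x_{lon_bucket}".encode("utf-8")
    digest = hashlib.md5(key).digest()  # stand-in for TensorFlow's own hash
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


cells = [(lat, lon) for lat in range(GRID_SIZE) for lon in range(GRID_SIZE)]
indices = {cell: crossed_index(cell[0], cell[1], HASH_BUCKET_SIZE) for cell in cells}
used = set(indices.values())

# 10,000 grid cells are squeezed into at most 5,000 indices, so some cells
# necessarily share an index.
print(len(cells), len(used))
```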

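Somewhat counterintuitively, when creating feature crosses, you typically should still include the original (uncrossed) features in your model (as in the preceding code snippet). The independent latitude and longitude features help the model distinguish between examples where a hash collision has occurred in the crossed feature.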

2.7 Indicator and embedding columns

Indicator columns and embedding columns never work on features directly, but instead take categorical columns as input.

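When using an indicator column, we're telling TensorFlow to do exactly what we've seen in our categorical product_class example. That is, an indicator column treats each category as an element in a one-hot vector, where the matching category has value 1 and the rest have 0s:
[Figure: representing data in indicator columns]
Here's how you create an indicator column by calling tf.feature_column.indicator_column: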

categorical_column = ... # Create any type of categorical column.

# Represent the categorical column as an indicator column.
indicator_column = tf.feature_column.indicator_column(categorical_column)

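Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using indicator columns.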

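We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.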

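Let's look at an example comparing indicator and embedding columns. Suppose our input examples consist of different words from a limited vocabulary of only 81 words. Further suppose that the data set provides the following input words in 4 separate examples: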

“dog”
“spoon”
“scissors”
“guitar”

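In that case, the following figure illustrates the processing path for embedding columns or indicator columns.
[Figure: processing path for indicator columns and embedding columns]

An embedding column stores categorical data in a lower-dimensional vector than an indicator column. (We just placed random numbers into the embedding vectors; training determines the actual numbers.)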

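When an example is processed, one of the categorical_column_with... functions maps the example string to a numerical categorical value. For example, a function maps "spoon" to [32]. (The 32 comes from our imagination; the actual values depend on the mapping function.) You may then represent these numerical categorical values in either of the following two ways: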

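As an indicator column. A function converts each numeric categorical value into an 81-element vector (because our vocabulary consists of 81 words), placing a 1 in the index of the categorical value (0, 32, 79, 80) and a 0 in all the other positions.
As an embedding column. A function uses the numerical categorical values (0, 32, 79, 80) as indices to a lookup table. Each slot in that lookup table contains a 3-element vector.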

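How do the values in the embedding vectors magically get assigned? Actually, the assignments happen during training. That is, the model learns the best way to map your input numeric categorical values to the embedding vector values in order to solve your problem. Embedding columns increase your model's capabilities, since an embedding vector learns new relationships between categories from the training data.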
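
The two representations can be contrasted with a short plain-Python sketch. It only mimics the shapes involved: the lookup table is filled with random numbers (as in the figure), nothing is trained, and the helper names indicator and embed are hypothetical.

```python
import random

VOCABULARY_SIZE = 81
EMBEDDING_DIMENSION = 3
category_ids = [0, 32, 79, 80]  # the illustrative ids from the text


def indicator(category_id, size):
    """One-hot vector: 1 at category_id, 0 everywhere else."""
    return [1 if i == category_id else 0 for i in range(size)]


# Lookup table with one 3-element row per category. The values are random
# placeholders; in a real model they would be learned during training.
random.seed(0)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIMENSION)]
    for _ in range(VOCABULARY_SIZE)
]


def embed(category_id, table):
    """Use the category id as a row index into the lookup table."""
    return table[category_id]


for category_id in category_ids:
    one_hot = indicator(category_id, VOCABULARY_SIZE)
    vector = embed(category_id, embedding_table)
    print(category_id, len(one_hot), sum(one_hot), len(vector))
```

Each category costs 81 cells as an indicator vector but only 3 cells as an embedding vector, which is the size difference the text describes.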

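Why is the embedding vector size 3 in our example? The following "formula" provides a general rule of thumb about the number of embedding dimensions: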

embedding_dimensions =  number_of_categories**0.25

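That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3: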

3 =  81**0.25

Note that this is just a general guideline; you can set the number of embedding dimensions as you please.
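
For a feel of how the rule of thumb scales, here is a tiny plain-Python sketch. The function name suggested_embedding_dimension is made up, and rounding to the nearest integer (with a floor of 1) is an assumption of this sketch, since the rule itself does not say how to handle non-integer results.

```python
def suggested_embedding_dimension(number_of_categories):
    """Fourth root of the category count, rounded to the nearest integer."""
    # Rounding and the lower bound of 1 are assumptions made for this sketch.
    return max(1, round(number_of_categories ** 0.25))


for n in (81, 10_000, 1_000_000):
    print(n, suggested_embedding_dimension(n))
# 81 3
# 10000 10
# 1000000 32
```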

Call tf.feature_column.embedding_column to create an embedding_column, as the following snippet suggests:

categorical_column = ... # Create any categorical column

# Represent the categorical column as an embedding column.
# This means creating an embedding vector lookup table with one entry for each category.
embedding_column = tf.feature_column.embedding_column(
    categorical_column=categorical_column,
    dimension=dimension_of_embedding_vector)

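Embeddings are a significant topic within machine learning. This information is only meant to get you started using them as feature columns.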

3. Passing feature columns to Estimators

As the following list indicates, not all Estimators accept all types of feature columns in their feature_columns arguments:

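LinearClassifier and LinearRegressor: accept all types of feature columns.
DNNClassifier and DNNRegressor: accept only dense columns. Other column types must be wrapped in either an indicator_column or an embedding_column.
DNNLinearCombinedClassifier and DNNLinearCombinedRegressor:
- The linear_feature_columns argument accepts any feature column type.
- The dnn_feature_columns argument accepts only dense columns.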

4. Other sources

For more examples on feature columns, see the following:

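Low Level Introduction: demonstrates how to use feature_columns with TensorFlow's low-level APIs.
Wide and Wide & Deep Learning tutorials: solve a binary classification problem using feature_columns on a variety of input data types.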

To learn more about embeddings, see the following:

Deep Learning, NLP, and Representations (Chris Olah's blog)
TensorFlow Embedding Projector