TensorFlow Case Study Series (3): Classifying Data with the TensorFlow DNN Classifier

TensorFlow Convolutional Neural Network Case Study Series (1): Cat vs. Dog Recognition https://blog.csdn.net/duan_zhihua/article/details/81156693

TensorFlow Case Study Series (2): Natural Language Processing with TensorFlow + Word2Vec https://blog.csdn.net/duan_zhihua/article/details/81257323

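Steps:

Read in the iris training data set and the iris test data set.
Classify the data with tf.contrib.learn.DNNClassifier.
Train the model with classifier.fit.
Evaluate the classification accuracy.
Predict the classes of new samples.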

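The training data in iris_training.csv looks like this (excerpt). The first line is a header: the first number, 120, is the number of data rows, and the second number, 4, is the number of features. The remaining header fields name the three classes, which the last column of each data row encodes as 0, 1, or 2.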

120,4,setosa,versicolor,virginica
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
4.9,2.5,4.5,1.7,2
4.9,3.1,1.5,0.1,0
5.7,3.8,1.7,0.3,0
4.4,3.2,1.3,0.2,0
5.4,3.4,1.5,0.4,0
6.9,3.1,5.1,2.3,2
6.7,3.1,4.4,1.4,1
5.1,3.7,1.5,0.4,0
5.2,2.7,3.9,1.4,1
6.9,3.1,4.9,1.5,1
5.8,4.0,1.2,0.2,0
5.4,3.9,1.7,0.4,0
7.7,3.8,6.7,2.2,2
6.3,3.3,4.7,1.6,1
6.8,3.2,5.9,2.3,2
7.6,3.0,6.6,2.1,2

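The test data in iris_test.csv has the same layout (excerpt): the first line is a header, where the first number, 30, is the number of data rows and the second number, 4, is the number of features.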

30,4,setosa,versicolor,virginica
5.9,3.0,4.2,1.5,1
6.9,3.1,5.4,2.1,2
5.1,3.3,1.7,0.5,0
6.0,3.4,4.5,1.6,1
5.5,2.5,4.0,1.3,1
6.2,2.9,4.3,1.3,1
5.5,4.2,1.4,0.2,0
6.3,2.8,5.1,1.5,2
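
The header convention described above can be checked with a few lines of standard-library Python. This is a minimal sketch and not part of iris.py: the helper name parse_iris_header is hypothetical, and the sample text reuses the test-file header with the first three rows of the excerpt.

```python
import csv
import io


def parse_iris_header(csv_text):
    """Split iris-style CSV text into (n_samples, n_features, class_names, rows)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    n_samples = int(header[0])   # declared number of data rows
    n_features = int(header[1])  # number of feature columns per row
    class_names = header[2:]     # label i corresponds to class_names[i]
    rows = [row for row in reader if row]
    return n_samples, n_features, class_names, rows


# Header plus the first three data rows of the iris_test.csv excerpt.
SAMPLE = """30,4,setosa,versicolor,virginica
5.9,3.0,4.2,1.5,1
6.9,3.1,5.4,2.1,2
5.1,3.3,1.7,0.5,0
"""

n_samples, n_features, class_names, rows = parse_iris_header(SAMPLE)
print(n_samples, n_features, class_names)

# Every data row holds n_features values plus one trailing label.
rows_are_well_formed = all(len(row) == n_features + 1 for row in rows)
print(rows_are_well_formed)
```

In a complete file the number of data rows should equal the first header number, because the loader used later preallocates its arrays from that value.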

iris.py code (it targets TensorFlow 1.x; the tf.contrib module it relies on was removed in TensorFlow 2.0):

# -*- coding: utf-8 -*-

# Import the required modules.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
# In Python 3, urlopen lives in urllib.request (urllib.urlopen exists only in Python 2).
from urllib.request import urlopen

import numpy as np
import tensorflow as tf

# Data sets
IRIS_TRAINING = "iris_training.csv"
IRIS_TRAINING_URL = "http://download.tensorflow.org/data/iris_training.csv"

IRIS_TEST = "iris_test.csv"
IRIS_TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"

def main():
  # If the training and test sets aren't stored locally, download them.
  # urlopen(...).read() returns bytes, so the files are written in binary mode.
  if not os.path.exists(IRIS_TRAINING):
    raw = urlopen(IRIS_TRAINING_URL).read()
    with open(IRIS_TRAINING, "wb") as f:
      f.write(raw)

  if not os.path.exists(IRIS_TEST):
    raw = urlopen(IRIS_TEST_URL).read()
    with open(IRIS_TEST, "wb") as f:
      f.write(raw)

  # Load datasets.
  # np.int was only an alias of the built-in int and was removed in NumPy 1.24,
  # so the built-in int is used for the label dtype.
  training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
      filename=IRIS_TRAINING,
      target_dtype=int,
      features_dtype=np.float32)
  test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
      filename=IRIS_TEST,
      target_dtype=int,
      features_dtype=np.float32)

  # Specify that all features have real-value data
  feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

  # Build 3 layer DNN with 10, 20, 10 units respectively.
  classifier = tf.contrib.learn.DNNClassifier(
      feature_columns=feature_columns,
      hidden_units=[10, 20, 10],
      n_classes=3,
      model_dir="/tmp/iris_model")

  # Define the training inputs
  def get_train_inputs():
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)
    return x, y

  # Fit model.
  classifier.fit(input_fn=get_train_inputs, steps=2000)

  # Define the test inputs
  def get_test_inputs():
    x = tf.constant(test_set.data)
    y = tf.constant(test_set.target)
    return x, y

  # Evaluate accuracy.
  accuracy_score = classifier.evaluate(input_fn=get_test_inputs,
                                       steps=1)["accuracy"]
  print("\nTest Accuracy: {0:f}\n".format(accuracy_score))

  # Classify two new flower samples.
  def new_samples():
    return np.array(
      [[6.4, 3.2, 4.5, 1.5],
       [5.8, 3.1, 5.0, 1.7]], dtype=np.float32)

  predictions = list(classifier.predict(input_fn=new_samples))

  print(
      "New Samples, Class Predictions:    {}\n"
      .format(predictions))

if __name__ == "__main__":
    main()

Running the script produces the following output:

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into ./tmp/iris_model\model.ckpt.
INFO:tensorflow:loss = 1.27656, step = 1
INFO:tensorflow:global_step/sec: 1038.9
INFO:tensorflow:loss = 0.189959, step = 101 (0.097 sec)
INFO:tensorflow:global_step/sec: 1172.9
INFO:tensorflow:loss = 0.0868506, step = 201 (0.086 sec)
INFO:tensorflow:global_step/sec: 1133.76
INFO:tensorflow:loss = 0.0672543, step = 301 (0.089 sec)
INFO:tensorflow:global_step/sec: 1028.17
INFO:tensorflow:loss = 0.0576761, step = 401 (0.095 sec)
INFO:tensorflow:global_step/sec: 1061.02
INFO:tensorflow:loss = 0.051202, step = 501 (0.094 sec)
INFO:tensorflow:global_step/sec: 997.134
INFO:tensorflow:loss = 0.0487256, step = 601 (0.101 sec)
INFO:tensorflow:global_step/sec: 959.174
INFO:tensorflow:loss = 0.0473462, step = 701 (0.105 sec)
INFO:tensorflow:global_step/sec: 905.811
INFO:tensorflow:loss = 0.0449771, step = 801 (0.108 sec)
INFO:tensorflow:global_step/sec: 811.094
INFO:tensorflow:loss = 0.0430157, step = 901 (0.123 sec)
INFO:tensorflow:global_step/sec: 959.593
INFO:tensorflow:loss = 0.0419328, step = 1001 (0.105 sec)
INFO:tensorflow:global_step/sec: 1095.99
INFO:tensorflow:loss = 0.0405608, step = 1101 (0.090 sec)
INFO:tensorflow:global_step/sec: 1120.6
INFO:tensorflow:loss = 0.0401995, step = 1201 (0.090 sec)
INFO:tensorflow:global_step/sec: 997.35
INFO:tensorflow:loss = 0.0391046, step = 1301 (0.100 sec)
INFO:tensorflow:global_step/sec: 968.277
INFO:tensorflow:loss = 0.038355, step = 1401 (0.103 sec)
INFO:tensorflow:global_step/sec: 1159.71
INFO:tensorflow:loss = 0.0374698, step = 1501 (0.086 sec)
INFO:tensorflow:global_step/sec: 1173.35
INFO:tensorflow:loss = 0.0361673, step = 1601 (0.085 sec)
INFO:tensorflow:global_step/sec: 1216.27
INFO:tensorflow:loss = 0.0356515, step = 1701 (0.081 sec)
INFO:tensorflow:global_step/sec: 1231.28
INFO:tensorflow:loss = 0.0347832, step = 1801 (0.082 sec)
INFO:tensorflow:global_step/sec: 1146.37
INFO:tensorflow:loss = 0.034383, step = 1901 (0.086 sec)
INFO:tensorflow:Saving checkpoints for 2000 into ./tmp/iris_model\model.ckpt.
INFO:tensorflow:Loss for final step: 0.0331372.
WARNING:tensorflow:From g:\ProgramData\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\head.py:625: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2018-08-02-13:22:35
INFO:tensorflow:Restoring parameters from ./tmp/iris_model\model.ckpt-2000
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-08-02-13:22:35
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.966667, global_step = 2000, loss = 0.0624721

Test Accuracy: 0.966667

WARNING:tensorflow:From g:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:347: calling DNNClassifier.predict (from tensorflow.contrib.learn.python.learn.estimators.dnn) with outputs=None is deprecated and will be removed after 2017-03-01.
Instructions for updating:
Please switch to predict_classes, or set `outputs` argument.
INFO:tensorflow:Restoring parameters from ./tmp/iris_model\model.ckpt-2000
New Samples, Class Predictions:    [1, 2]
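
The predicted values are class indices. Assuming the indices follow the order of the class names in the CSV header (0 = setosa, 1 = versicolor, 2 = virginica), they can be turned into species names as sketched below; the variable names are illustrative only.

```python
# Class names in the order they appear in the CSV header.
CLASS_NAMES = ["setosa", "versicolor", "virginica"]

# Class indices printed by the run above.
predictions = [1, 2]

# Look up the species name for each predicted index.
predicted_names = [CLASS_NAMES[i] for i in predictions]
print(predicted_names)
```

The reported test accuracy can be read the same way: if the test file holds the 30 rows its header declares, 0.966667 corresponds to 29 of the 30 rows being classified correctly.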

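In iris.py, load_csv_with_header reads a data file, parses it, and returns data and target. The target (label) column is the last column of the data file, which is what the default target_column=-1 selects.

Source code of load_csv_with_header: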

@deprecated(None, 'Use tf.data instead.')
def load_csv_with_header(filename,
                         target_dtype,
                         features_dtype,
                         target_column=-1):
  """Load dataset from CSV file with a header row."""
  with gfile.Open(filename) as csv_file:
    data_file = csv.reader(csv_file)
    header = next(data_file)
    n_samples = int(header[0])
    n_features = int(header[1])
    data = np.zeros((n_samples, n_features), dtype=features_dtype)
    target = np.zeros((n_samples,), dtype=target_dtype)
    for i, row in enumerate(data_file):
      target[i] = np.asarray(row.pop(target_column), dtype=target_dtype)
      data[i] = np.asarray(row, dtype=features_dtype)

  return Dataset(data=data, target=target)
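
The key step in the loader is row.pop(target_column), which removes the label from the row and leaves only the features. The sketch below reproduces that step with plain lists and the standard csv module, under the assumption that labels are integers; split_features_and_target is a hypothetical helper, not a TensorFlow function, and the label-first sample data is invented for illustration.

```python
import csv
import io


def split_features_and_target(csv_text, target_column=-1):
    """Return (data, target) lists from iris-style CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    data, target = [], []
    for row in reader:
        if not row:
            continue
        # pop() removes the label, so the remaining items are the features.
        target.append(int(row.pop(target_column)))
        data.append([float(value) for value in row])
    return data, target


# Label in the last column, as in the iris files.
LABEL_LAST = "2,4,setosa,versicolor,virginica\n6.4,2.8,5.6,2.2,2\n5.0,2.3,3.3,1.0,1\n"
data, target = split_features_and_target(LABEL_LAST)
print(data, target)

# Hypothetical file with the label in the first column instead.
LABEL_FIRST = "2,3,negative,positive\n1,0.5,0.7,1.5\n0,0.1,0.2,0.3\n"
data_first, target_first = split_features_and_target(LABEL_FIRST, target_column=0)
print(data_first, target_first)
```

The same target_column argument exists on load_csv_with_header, so a file that stores its label in the first column can be loaded by passing target_column=0.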

Source code of real_valued_column:

def real_valued_column(column_name,
                       dimension=1,
                       default_value=None,
                       dtype=dtypes.float32,
                       normalizer=None):
  """Creates a `_RealValuedColumn` for dense numeric data.

  Args:
    column_name: A string defining real valued column name.
    dimension: An integer specifying dimension of the real valued column.
      The default is 1.
    default_value: A single value compatible with dtype or a list of values
      compatible with dtype which the column takes on during tf.Example parsing
      if data is missing. When dimension is not None, a default value of None
      will cause tf.parse_example to fail if an example does not contain this
      column. If a single value is provided, the same value will be applied as
      the default value for every dimension. If a list of values is provided,
      the length of the list should be equal to the value of `dimension`.
      Only scalar default value is supported in case dimension is not specified.
    dtype: defines the type of values. Default value is tf.float32. Must be a
      non-quantized, real integer or floating point type.
    normalizer: If not None, a function that can be used to normalize the value
      of the real valued column after default_value is applied for parsing.
      Normalizer function takes the input tensor as its argument, and returns
      the output tensor. (e.g. lambda x: (x - 3.0) / 4.2). Note that for
      variable length columns, the normalizer should expect an input_tensor of
      type `SparseTensor`.
  Returns:
    A _RealValuedColumn.
  Raises:
    TypeError: if dimension is not an int
    ValueError: if dimension is not a positive integer
    TypeError: if default_value is a list but its length is not equal to the
      value of `dimension`.
    TypeError: if default_value is not compatible with dtype.
    ValueError: if dtype is not convertible to tf.float32.
  """

  if dimension is None:
    raise TypeError("dimension must be an integer. Use the "
                    "_real_valued_var_len_column for variable length features."
                    "dimension: {}, column_name: {}".format(dimension,
                                                            column_name))
  if not isinstance(dimension, int):
    raise TypeError("dimension must be an integer. "
                    "dimension: {}, column_name: {}".format(dimension,
                                                            column_name))
  if dimension < 1:
    raise ValueError("dimension must be greater than 0. "
                     "dimension: {}, column_name: {}".format(dimension,
                                                             column_name))

  if not (dtype.is_integer or dtype.is_floating):
    raise ValueError("dtype must be convertible to float. "
                     "dtype: {}, column_name: {}".format(dtype, column_name))

  if default_value is None:
    return _RealValuedColumn(column_name, dimension, default_value, dtype,
                             normalizer)

  if isinstance(default_value, int):
    if dtype.is_integer:
      default_value = ([default_value for _ in range(dimension)] if dimension
                       else [default_value])
      return _RealValuedColumn(column_name, dimension, default_value, dtype,
                               normalizer)
    if dtype.is_floating:
      default_value = float(default_value)
      default_value = ([default_value for _ in range(dimension)] if dimension
                       else [default_value])
      return _RealValuedColumn(column_name, dimension, default_value, dtype,
                               normalizer)

  if isinstance(default_value, float):
    if dtype.is_floating and (not dtype.is_integer):
      default_value = ([default_value for _ in range(dimension)] if dimension
                       else [default_value])
      return _RealValuedColumn(column_name, dimension, default_value, dtype,
                               normalizer)

  if isinstance(default_value, list):
    if len(default_value) != dimension:
      raise ValueError(
          "The length of default_value must be equal to dimension. "
          "default_value: {}, dimension: {}, column_name: {}".format(
              default_value, dimension, column_name))
    # Check if the values in the list are all integers or are convertible to
    # floats.
    is_list_all_int = True
    is_list_all_float = True
    for v in default_value:
      if not isinstance(v, int):
        is_list_all_int = False
      if not (isinstance(v, float) or isinstance(v, int)):
        is_list_all_float = False
    if is_list_all_int:
      if dtype.is_integer:
        return _RealValuedColumn(column_name, dimension, default_value, dtype,
                                 normalizer)
      elif dtype.is_floating:
        default_value = [float(v) for v in default_value]
        return _RealValuedColumn(column_name, dimension, default_value, dtype,
                                 normalizer)
    if is_list_all_float:
      if dtype.is_floating and (not dtype.is_integer):
        default_value = [float(v) for v in default_value]
        return _RealValuedColumn(column_name, dimension, default_value, dtype,
                                 normalizer)

  raise TypeError("default_value must be compatible with dtype. "
                  "default_value: {}, dtype: {}, column_name: {}".format(
                      default_value, dtype, column_name))
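
Most of the function body deals with default_value. iris.py leaves it as None, so the first return statement is taken. For the other cases, the simplified sketch below mirrors the main rules in plain Python: a scalar default is repeated once per dimension, a list must match dimension in length, and integers are converted to floats for floating-point columns. The helper expand_default and its dtype_is_float flag are hypothetical stand-ins for the TensorFlow dtype checks, not part of the library.

```python
def expand_default(default_value, dimension, dtype_is_float=True):
    """Simplified model of how real_valued_column normalizes default_value."""
    if default_value is None:
        return None
    if isinstance(default_value, list):
        if len(default_value) != dimension:
            raise ValueError(
                "The length of default_value must be equal to dimension.")
        values = default_value
    elif isinstance(default_value, (int, float)):
        # A scalar default applies to every dimension.
        values = [default_value] * dimension
    else:
        raise TypeError("default_value must be compatible with dtype.")

    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("default_value must be compatible with dtype.")
    if dtype_is_float:
        # Integer defaults are accepted for float columns and converted.
        return [float(v) for v in values]
    if all(isinstance(v, int) for v in values):
        return list(values)
    # Float defaults are rejected for integer columns.
    raise TypeError("default_value must be compatible with dtype.")


scalar_default = expand_default(0, dimension=4)
list_default = expand_default([1, 2], dimension=2, dtype_is_float=False)
print(scalar_default, list_default)
```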

Source code from dnn.py (the DNNClassifier class docstring):

class DNNClassifier(estimator.Estimator):
  """A classifier for TensorFlow DNN models.

  Example:

  ```python
  sparse_feature_a = sparse_column_with_hash_bucket(...)
  sparse_feature_b = sparse_column_with_hash_bucket(...)

  sparse_feature_a_emb = embedding_column(sparse_id_column=sparse_feature_a,
                                          ...)
  sparse_feature_b_emb = embedding_column(sparse_id_column=sparse_feature_b,
                                          ...)

  estimator = DNNClassifier(
      feature_columns=[sparse_feature_a_emb, sparse_feature_b_emb],
      hidden_units=[1024, 512, 256])

  # Or estimator using the ProximalAdagradOptimizer optimizer with
  # regularization.
  estimator = DNNClassifier(
      feature_columns=[sparse_feature_a_emb, sparse_feature_b_emb],
      hidden_units=[1024, 512, 256],
      optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.001
      ))

  # Input builders
  def input_fn_train: # returns x, y (where y represents label's class index).
    pass
  estimator.fit(input_fn=input_fn_train)

  def input_fn_eval: # returns x, y (where y represents label's class index).
    pass
  estimator.evaluate(input_fn=input_fn_eval)
  def input_fn_predict: # returns x, None
  # predict_classes returns class indices.
  estimator.predict_classes(input_fn=input_fn_predict)
  ```

  If the user specifies `label_keys` in constructor, labels must be strings from
  the `label_keys` vocabulary. Example:

  ```python
  label_keys = ['label0', 'label1', 'label2']
  estimator = DNNClassifier(
      feature_columns=[sparse_feature_a_emb, sparse_feature_b_emb],
      hidden_units=[1024, 512, 256],
      label_keys=label_keys)

  def input_fn_train: # returns x, y (where y is one of label_keys).
    pass
  estimator.fit(input_fn=input_fn_train)

  def input_fn_eval: # returns x, y (where y is one of label_keys).
    pass
  estimator.evaluate(input_fn=input_fn_eval)
  def input_fn_predict: # returns x, None
  # predict_classes returns one of label_keys.
  estimator.predict_classes(input_fn=input_fn_predict)
  ```

  Input of `fit` and `evaluate` should have following features,
    otherwise there will be a `KeyError`:

  * if `weight_column_name` is not `None`, a feature with
     `key=weight_column_name` whose value is a `Tensor`.
  * for each `column` in `feature_columns`:
    - if `column` is a `SparseColumn`, a feature with `key=column.name`
      whose `value` is a `SparseTensor`.
    - if `column` is a `WeightedSparseColumn`, two features: the first with
      `key` the id column name, the second with `key` the weight column name.
      Both features' `value` must be a `SparseTensor`.
    - if `column` is a `RealValuedColumn`, a feature with `key=column.name`
      whose `value` is a `Tensor`.
  """
......

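Similarly, other data classification tasks can be handled either with classical machine learning algorithms or with deep learning. To try the deep learning classifier on your own data, store the data in the same file format as the iris data set and make small changes to the code above, such as the file names, the feature dimension, and n_classes.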
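
Producing a file in the iris layout takes only the standard csv module. The following is a minimal sketch under stated assumptions: write_iris_style_csv is a hypothetical helper, and the two-class, three-feature sample data is invented purely for illustration.

```python
import csv
import io


def write_iris_style_csv(fileobj, features, labels, class_names):
    """Write a header row, then one row per sample: features followed by the label."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n_samples = len(features)
    n_features = len(features[0]) if features else 0
    writer = csv.writer(fileobj, lineterminator="\n")
    # Header: row count, feature count, then the class names.
    writer.writerow([n_samples, n_features] + list(class_names))
    for row, label in zip(features, labels):
        writer.writerow(list(row) + [label])


# Write to an in-memory buffer; a real file opened with newline="" works the same way.
buffer = io.StringIO()
write_iris_style_csv(
    buffer,
    features=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    labels=[0, 1],
    class_names=["negative", "positive"],
)
print(buffer.getvalue())
```

For a file like this one, the script above would presumably need dimension=3 in real_valued_column and n_classes=2 in DNNClassifier, in addition to the new file names.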

"The great art to learn much is to undertake a little at a time." (John Locke)
