Digit Recognition - using the Keras high-level API to quickly build and optimize a network model

In "handwritten numeral recognition - to manually set up a fully connected layer" in one article, we constructed by machine learning basic formula out of a network model, there is no doubt that the implementation process is too complex - have to consider such as data type matching, gradient calculation, the accuracy of the statistics and other issues, but such practical understanding of machine learning is helpful. In most cases, we still hope to the more simple and more simply to build the network model, which also considered worthy of TensorFlow this powerful tool. In this section, or in the handwritten data sets MNIST for example, the use of keras before TensorFlow2.0 high-level API to reproduce the network.


I. Importing data and preprocessing

This process is similar to what was described in the previous section, so it will not be repeated here. One thing worth mentioning: to keep the program clean, the data type conversions are collected in a single preprocess function, which is then applied to the Dataset objects through their map method. The complete data import and preprocessing code is as follows:

import tensorflow as tf
from tensorflow.keras import datasets,optimizers,Sequential,metrics,layers

# Change the data type
def preprocess(x,y):
    x = tf.cast(x,dtype=tf.float32)/255-0.5
    x = tf.reshape(x,[-1,28*28])
    y = tf.one_hot(y, depth=10)
    y = tf.cast(y, dtype=tf.int32)
    return x,y

#60k 28*28
(train_x,train_y),(val_x,val_y) = datasets.mnist.load_data()

# Generate Dataset objects
train_db = tf.data.Dataset.from_tensor_slices((train_x,train_y)).shuffle(10000).batch(256)
val_db = tf.data.Dataset.from_tensor_slices((val_x,val_y)).shuffle(10000).batch(256)

# Preprocessing: apply preprocess to every batch
train_db = train_db.map(preprocess)
val_db = val_db.map(preprocess)
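
To confirm that the preprocessing behaves as expected, we can pull a single batch out of the Dataset and inspect it. This check is my own addition and is not part of the original post:

# Sanity check (not in the original post): take one batch and verify shapes and value range
sample_x, sample_y = next(iter(train_db))
print(sample_x.shape)    # (256, 784) -- flattened 28*28 images
print(sample_y.shape)    # (256, 10)  -- one-hot encoded labels
print(float(tf.reduce_min(sample_x)), float(tf.reduce_max(sample_x)))  # roughly -0.5 to 0.5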


II. Building the model

For fully connected layers, Keras provides the layers.Dense(units, activation) interface, with which a single layer can be created; Keras also provides the Sequential container, which stacks multiple layers into a network model. Among the Dense parameters, units determines the number of neurons the layer contains, and activation selects the activation function. As with the previous network, the flow through the network can be viewed as: input (784 units) -> layer1 (256 units) -> ReLU -> layer2 (128 units) -> ReLU -> output (10 units). Accordingly, three layers are defined in a Sequential container with ReLU specified as the activation, and when building the network the input_shape argument tells it the number of neurons in the input layer. The construction code is below; the network information can then be printed with the summary method.

# Network model
model = Sequential([
    layers.Dense(256,activation=tf.nn.relu),
    layers.Dense(128,activation=tf.nn.relu),
    layers.Dense(10),
])
#input_shape=(batch_size,input_dims)
model.build(input_shape=(None,28*28))
model.summary()
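
The summary call reports the trainable parameter count of each layer. As a sanity check, it can be worked out by hand: a Dense layer with n inputs and m units stores n*m weights plus m biases. The figures below are my own back-of-the-envelope calculation, not output copied from the original post:

# Expected parameter counts (hand calculation, assuming the layer sizes above)
params_layer1 = 784*256 + 256   # 200,960
params_layer2 = 256*128 + 128   #  32,896
params_layer3 = 128*10 + 10     #   1,290
print(params_layer1 + params_layer2 + params_layer3)   # 235,146 trainable parameters in total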


III. Training the model

The most important parts of training a model are updating the weights and tracking the accuracy. Keras provides a number of optimizers (Optimizer) for updating the weights. An optimizer is in fact a variant of the gradient descent algorithm, designed to alleviate the problem that traditional gradient descent may not converge to the global minimum. The previous section already touched on three of them. Here the optimizers are compared only briefly; a detailed comparison will have to wait for a dedicated post:

  1. SGD: in TensorFlow 2.0, SGD is actually a combined optimizer of stochastic gradient descent plus momentum. Stochastic gradient descent picks a random sample for each update and computes the gradient on it, so each step is much faster to compute, but the gradient is noisy; momentum accelerates the descent using accumulated historical gradient information, which on the one hand dilutes the noise sensitivity of stochastic gradient descent and on the other hand gives the update a kind of inertia. To be honest, this optimizer reminds me of PID control. SGD requires specifying the learning rate and the momentum size; in general, the momentum is set to 0.9.
  2. Adagrad: the adaptive gradient optimizer. "Adaptive gradient" means applying a different learning rate to each parameter according to how frequently it is updated. However, as the number of iterations grows, the learning rate becomes very small, and eventually the parameters can no longer be updated. Adagrad requires specifying the initial learning rate, the initial accumulator value, and an offset that keeps the denominator from being 0.
  3. Adadelta: an optimizer using adaptive deltas. It solves Adagrad's problem of the vanishing learning rate. Adadelta requires specifying the initial learning rate, the decay rate, and the offset that keeps the denominator from being 0. The decay rate is similar in spirit to momentum and is generally set to 0.9.
  4. RMSprop: similar to Adadelta.
  5. Adam: updates the parameters using estimates of the first and second moments of the gradient. It combines the advantages of Adadelta and RMSprop, and it is fair to say that Adam is the usual choice in deep learning. In TensorFlow, the Adam optimizer takes four parameters, but experience shows that its default values already give good results.

Given the above comparison, this article uses the Adam optimizer with its default parameters (a short sketch of how each of these optimizers is created in tf.keras is shown below).
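
As a concrete sketch (my own addition, with illustrative hyperparameter values rather than settings tuned for this experiment), this is roughly how each of the optimizers above is instantiated in tf.keras:

from tensorflow.keras import optimizers

# Illustrative instantiations only; the values shown are common choices, not tuned settings
sgd      = optimizers.SGD(learning_rate=0.01, momentum=0.9)     # stochastic gradient descent + momentum
adagrad  = optimizers.Adagrad(learning_rate=0.01,
                              initial_accumulator_value=0.1,
                              epsilon=1e-7)                     # per-parameter adaptive learning rate
adadelta = optimizers.Adadelta(learning_rate=0.001, rho=0.95,
                               epsilon=1e-7)                    # rho is the decay rate discussed above
rmsprop  = optimizers.RMSprop(learning_rate=0.001, rho=0.9)     # similar in spirit to Adadelta
adam     = optimizers.Adam()                                    # default parameters usually work well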

Besides gradient descent, we also need to consider how the loss is computed. Previously we used the mean squared error between the predicted probabilities and the true values, whose proper name is the Euclidean loss function. That was actually a mistake: the Euclidean loss is suited to binary classification, while multi-class classification should use the cross-entropy loss. For multi-class problems we instinctively want to normalize the output layer, so a softmax would be applied after the output layer before computing the cross-entropy. However, because softmax is computed with exponentials, if the outputs for the different classes differ greatly, the large probabilities become almost 1 after normalization and the small ones almost 0. To avoid this problem, the explicit softmax is usually removed, and instead from_logits=True is passed to the cross-entropy loss tf.losses.CategoricalCrossentropy.
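
A small example (my own, not from the original post) makes the from_logits point concrete: passing raw logits with from_logits=True lets the loss apply softmax internally in a numerically stable way, instead of us normalizing first.

# Raw network output (logits) and a one-hot label; values chosen only for illustration
logits = tf.constant([[2.0, 1.0, 0.1]])
y_true = tf.constant([[1.0, 0.0, 0.0]])

# Preferred: hand the logits straight to the loss
loss_a = tf.losses.CategoricalCrossentropy(from_logits=True)(y_true, logits)

# Equivalent in exact arithmetic, but the explicit softmax can misbehave numerically for extreme logits
probs  = tf.nn.softmax(logits)
loss_b = tf.losses.CategoricalCrossentropy(from_logits=False)(y_true, probs)

print(float(loss_a), float(loss_b))   # nearly identical values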

The loss function and the optimizer can both be specified through the compile method, which can also be given a metrics list so that the required statistics, such as accuracy, are computed automatically.

The training data and validation data are then passed in through the fit method. The code is as follows:

# Configure the Adam optimizer, the cross-entropy loss, and the metrics list
model.compile(optimizer=optimizers.Adam(),
                loss=tf.losses.CategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])
# Feed in the data: iterate over train_db for 10 epochs, computing accuracy on the validation set after every epoch
model.fit(train_db, epochs=10, validation_data=val_db, validation_freq=1)
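
Once fit has finished, the model can be evaluated and used for prediction. This is not in the original post, but it is a natural follow-up using only the standard Keras methods:

# evaluate() returns the loss followed by each metric declared in compile()
val_loss, val_acc = model.evaluate(val_db)
print('validation accuracy:', val_acc)

# predict() returns raw logits (the last Dense layer has no activation);
# apply softmax for probabilities and argmax for the predicted digit
sample_x, sample_y = next(iter(val_db))
pred_digits = tf.argmax(tf.nn.softmax(model.predict(sample_x)), axis=1)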


With the network model built above, the accuracy on train_db already exceeds 0.8 after the first epoch, and each epoch takes only about three seconds. After roughly 50 epochs, the accuracy climbs as high as 0.98!


IV. Complete code


import tensorflow as tf
from tensorflow.keras import datasets,optimizers,Sequential,metrics,layers

# Change the data type
def preprocess(x,y):
    x = tf.cast(x,dtype=tf.float32)/255-0.5
    x = tf.reshape(x,[-1,28*28])
    y = tf.one_hot(y, depth=10)
    y = tf.cast(y, dtype=tf.int32)
    return x,y

#60k 28*28
(train_x,train_y),(val_x,val_y) = datasets.mnist.load_data()

# Generate Dataset objects
train_db = tf.data.Dataset.from_tensor_slices((train_x,train_y)).shuffle(10000).batch(256)
val_db = tf.data.Dataset.from_tensor_slices((val_x,val_y)).shuffle(10000).batch(256)

# Preprocessing: apply preprocess to every batch
train_db = train_db.map(preprocess)
val_db = val_db.map(preprocess)

# Network model
model = Sequential([
    layers.Dense(256,activation=tf.nn.relu),
    layers.Dense(128,activation=tf.nn.relu),
    layers.Dense(10),
])
#input_shape=(batch_size,input_dims)
model.build(input_shape=(None,28*28))
model.summary()

# Configure the Adam optimizer, the cross-entropy loss, and the metrics list
model.compile(optimizer=optimizers.Adam(),
                loss=tf.losses.CategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])
# Feed in the data and train; validation accuracy is computed every validation_freq epochs
model.fit(train_db,epochs=100,validation_data=val_db,validation_freq=5)

 

Source: www.cnblogs.com/kensporger/p/12232495.html