Deep Learning Architecture - TensorFlow

Basic concepts of deep learning

  1. Artificial intelligence is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Its purpose is to enable computers to think like humans.
    Strong AI: aims to give machines the human ability to understand, learn and perform tasks.
    Narrow AI: software designed to automate specific tasks.
  2. Machine learning, in the broad sense, refers to methods that extract regularities from known data and use them to predict unknown data. It is a statistical learning approach in which machines such as robots and computers learn from large amounts of data to extract the desired information.
  3. Deep learning is a technology that uses deep artificial neural networks for automatic classification, prediction and learning.
  4. Artificial Intelligence > Machine Learning > Deep Learning

TensorFlow

TensorFlow is a powerful open-source software library for numerical computation, particularly well suited to large-scale machine learning.

  1. MNIST data reading, modeling, compilation, training, testing
import tensorflow as tf
# load the data
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# build the model
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)), # "flatten" the multi-dimensional input into 1D; only the first layer takes the input shape
  tf.keras.layers.Dense(128, activation='relu'), # the layer's output dimension (number of neurons) and its activation function
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])
# compile (optimizer, loss and metrics)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# train (training and validation data are different)
model.fit(x_train, y_train, epochs=5)
# evaluate
model.evaluate(x_test, y_test, verbose=2)

  2. Optimizers, tf.optimizers:
    SGD stochastic gradient descent optimizer
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', **kwargs)
learning rate, momentum, and whether to use Nesterov momentum

Adam optimizer
tf.keras.optimizers.Adam(learning_rate=0.001)

  3. Loss functions: mean squared error (MeanSquaredError), binary cross entropy (BinaryCrossentropy);
    sparse_categorical_crossentropy (the multi-class loss function for integer labels)
  4. The metrics evaluation functions
    "accuracy": the true value (y_) and the predicted value (y) are both plain numerical values;
    "categorical_accuracy": both y_ and y are given as one-hot codes / probability distributions;
    "sparse_categorical_accuracy": y_ is given as a numerical value, y is given as a one-hot code / probability distribution
  5. Data structure
    1) Rank: the number of dimensions
    2) Shape: the number of elements in each dimension
    3) Data types: bool, int, float, string, complex, truncated float, quantized int
    4) Variable and Constant: variables and constants
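A quick sketch of these concepts in TensorFlow 2 (the values here are illustrative):

import tensorflow as tf

t = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # a constant tensor
print(tf.rank(t))  # rank (number of dimensions): 2
print(t.shape)     # shape (number of elements per dimension): (2, 3)
print(t.dtype)     # data type: float32

v = tf.Variable(t)             # a Variable can be updated in place
v.assign_add(tf.ones([2, 3]))  # a Constant cannot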

Common functions

  • data type conversion


import tensorflow as tf
import tensorflow.compat.v1 as tf1
string = '12.3'
n1 = tf1.string_to_number(string, out_type=None, name=None)  # convert a string to a number
x = 12.3
d1 = tf1.to_double(x, name='ToDouble')  # convert to 64-bit float (float64)
f1 = tf1.to_float(x, name='ToFloat')    # convert to 32-bit float (float32)
i1 = tf1.to_int32(x, name='ToInt32')    # convert to 32-bit integer (int32)
i2 = tf1.to_int64(x, name='ToInt64')    # convert to 64-bit integer (int64)
a = [1.8, 2.2]
i3 = tf.cast(a, tf.int32)  # cast a (or a.values) to the given dtype
  • shape manipulation
t = [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]]]
tf.shape(t)  # the number of elements in each dimension
tf.size(t)   # the total number of elements
tf.rank(t)   # the number of dimensions
t = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(tf.shape(t))
t2 = tf.reshape(t, [3, 3])  # change the tensor's shape
print(tf.shape(t2))
t3 = tf.reshape(t, [3, -1])
print(t3)


  • Other operation functions
t = [2, 3]
# insert a dimension of size 1 into a tensor
t1 = tf.shape(tf.expand_dims(t, 0))   # [1, 2]
t2 = tf.shape(tf.expand_dims(t, 1))   # [2, 1]
t3 = tf.shape(tf.expand_dims(t, -1))  # [2, 1]
t4 = tf.ones([2, 3, 5])  # fill a tensor of this shape with ones
t6 = tf.shape(tf.expand_dims(t4, 2))  # [2, 3, 1, 5]
t = [[[1, 1, 1], [2, 2, 2]], [[3, 3, 3], [4, 4, 4]], [[5, 5, 5], [6, 6, 6]]]
# slice a tensor
t1 = tf.slice(t, [1, 0, 0], [1, 1, 3])
t2 = tf.slice(t, [1, 0, 0], [1, 2, 3])
t3 = tf.slice(t, [1, 0, 0], [2, 1, 3])

t = tf.ones([5, 30])
t1, t2, t3 = tf.split(t, 3, 1)  # split the tensor along a given dimension
print(tf.shape(t1))
print(tf.shape(t2))
print(tf.shape(t3))

t1 = [[1, 2, 3], [4, 5, 6]]
t2 = [[7, 8, 9], [10, 11, 12]]
t3 = tf.concat([t1, t2], 0)  # concatenate tensors along a given dimension
t4 = tf.concat([t1, t2], 1)

x = [1, 4]
y = [2, 5]
z = [3, 6]
t1 = tf.stack([x, y, z])          # stack along the first dimension: [[1 4] [2 5] [3 6]]
t2 = tf.stack([x, y, z], axis=1)  # axis=0 stacks along the first axis, axis=1 along the second: [[1 2 3] [4 5 6]]

t = [[[[ 0,  1,  2,  3], [ 4,  5,  6,  7], [ 8,  9, 10, 11]],
      [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]]  # shape = (1, 2, 3, 4)
# reverse a tensor along the given dimensions
t1 = tf.reverse(t, [3])
t2 = tf.reverse(t, [1])
t3 = tf.reverse(t, [2])  # [[[[ 8 9 10 11] [ 4 5 6 7] [ 0 1 2 3]] [[20 21 22 23] [16 17 18 19] [12 13 14 15]]]]

t = [[1, 2, 3], [4, 5, 6]]
t1 = tf.transpose(t)               # permute the tensor's dimensions
t2 = tf.transpose(t, perm=[1, 0])  # [[1 4] [2 5] [3 6]]

indices = [0, 1, 2]
depth = 3
t1 = tf.one_hot(indices, depth)
indices = [0, 2, -1, 1]
depth = 3
t2 = tf.one_hot(indices, depth, on_value=5.0, off_value=0.0, axis=-1)
indices = [[0, 2], [1, -1]]
depth = 3
t3 = tf.one_hot(indices, depth, on_value=1.0, off_value=0.0, axis=-1)
# [[[1. 0. 0.] [0. 0. 1.]]
#  [[0. 1. 0.] [0. 0. 0.]]]
t = [1, 1, 2, 4, 4, 4, 7, 8, 8]
y, idx = tf.unique(t)
print(y)    # the unique elements of t
print(idx)  # for each element of t, its index in y

tf.math.ceil([-1.7, -1.5, -0.2, 0.2, 1.5, 1.7, 2.0])  # element-wise ceiling
tf.gather(params, indices, axis=0)  # gather slices from the given axis of params according to indices
  • operations
tf.diag(diagonal)  # build a matrix from the main-diagonal elements
tf.trace(x, name=None)  # sum of the diagonal of a 2-D tensor
tf.matrix_determinant(input, name=None)  # determinant of a square matrix
tf.matrix_inverse(input, adjoint=None, name=None)  # matrix inverse
tf.matmul(a, b, transpose_a=False, transpose_b=False, a_is_sparse=False, b_is_sparse=False, name=None)  # matrix multiplication
tf.complex(real, imag, name=None)  # combine two real numbers into a complex number
tf.complex_abs(x, name=None)  # absolute value (modulus) of a complex number
tf.conj(input, name=None)  # complex conjugate
tf.imag(input, name=None)
tf.real(input, name=None)
tf.eye(num_rows, num_columns=None)  # identity matrix
tf.fill(dims, value, name=None)  # fill([2, 3], 9) ==> [[9, 9, 9], [9, 9, 9]]
tf.ones(shape, dtype=tf.dtypes.float32, name=None)

  • data generation
Generating random tensors:
tf.random_normal(shape, mean=0.0, stddev=1.0, dtype=tf.float32, seed=None)
Draws values from a normal distribution. shape: shape of the output tensor; mean: mean of the distribution; stddev: standard deviation; dtype: output type; seed: an integer random seed; once set, the same random numbers are generated every time.

tf.truncated_normal(shape, mean, stddev)
Generates truncated normal random numbers, restricted to the range [mean - 2*stddev, mean + 2*stddev].

tf.random_uniform(shape, minval=0, maxval=None, dtype=tf.float32, seed=None)
Generated values follow a uniform distribution on [minval, maxval).

tf.random_shuffle()  # randomly shuffles a tensor along its first dimension
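The names above are TensorFlow 1.x APIs; in TensorFlow 2 the equivalents live under tf.random. A minimal sketch, assuming TF 2.x:

import tensorflow as tf

n = tf.random.normal([2, 3], mean=0.0, stddev=1.0, seed=42)   # normal distribution
t = tf.random.truncated_normal([2, 3], mean=0.0, stddev=1.0)  # clipped to mean ± 2*stddev
u = tf.random.uniform([2, 3], minval=0, maxval=10)            # uniform on [minval, maxval)
s = tf.random.shuffle(tf.range(10))                           # shuffle along the first dimension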

Variables:
biases = tf.Variable(tf.zeros([200]), name="var")  # create a set of variables (not yet initialized)
const = tf.constant(1.0, name="constant")  # create a constant

Name scopes

A TensorFlow application may contain thousands of computation nodes. So many nodes gathered together are hard to analyze and may even be impossible to display with standard graph tools. An effective way to handle this is to give Ops/Tensors named ranges; in TensorFlow this mechanism is called a name scope.

tf.variable_scope() manages variable namespaces and allows variables with the same name to exist in different scopes. For a variable created inside tf.variable_scope, the scope name is prepended to the variable's .name, separated from the variable name by "/".

tf.get_variable(name="foo/bar", shape=[1]) retrieves a variable under a namespace by its scoped name. If the variable exists, the previously created one is reused; if it does not exist, a new one is created. It cannot create two variables with the same name.

PyTorch

Concept: PyTorch is a Python-based scientific computing package with two main purposes:

  1. As an alternative to Numpy, you can use the performance of the GPU for efficient calculations
  2. As a highly flexible and fast deep learning platform
Test:
torch.cuda.is_available()
x = torch.rand(5, 3)
print(x)

Basic elements: tensor (Tensor), variable (Variable), neural network module (nn.Module)

  • Tensor: the most basic element in PyTorch, equivalent to numpy.ndarray; Tensor is PyTorch's replacement for numpy.ndarray.
  • Variable: needed to build the computation graph when constructing a neural network.
  • A Variable wraps a tensor and stores its (changing) value.
  • A variable has 3 properties:
    • variable.data: the value of the tensor inside the variable
    • variable.grad: the gradient of the tensor inside the variable
    • variable.grad_fn: points to the Function object used for the gradient computation of backpropagation
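In current PyTorch, Variable has been merged into Tensor (a tensor created with requires_grad=True behaves the same way), so the three properties can be sketched like this:

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # build the computation graph
y.backward()        # backpropagate

print(x.data)     # the tensor's value: tensor([2., 3.])
print(x.grad)     # the gradient dy/dx = 2x: tensor([4., 6.])
print(y.grad_fn)  # the Function object used for backpropagation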

nn.Module

Neural network module nn.Module: The interface of the neural network. When defining your own neural network, you need to inherit the nn.Module class

  • (1) Linear module

torch.nn.Linear(in_features, out_features, bias=True)
The fully connected layer of a network. The input and output of a fully connected layer are two-dimensional tensors, and the input has shape [batch_size, size].
in_features: the size of each input sample (the second dimension of the input tensor).
out_features: the size of each output sample, i.e. the output has shape [batch_size, output_size]; it also equals the number of neurons in the fully connected layer.
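A minimal shape check (the sizes here are arbitrary):

import torch
import torch.nn as nn

fc = nn.Linear(in_features=64, out_features=10)
x = torch.randn(32, 64)  # [batch_size, size]
y = fc(x)
print(y.shape)           # torch.Size([32, 10]), i.e. [batch_size, output_size]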

  • (2) Conv2d module

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
kernel_size: the size of the convolution kernel; a tuple (H, W) gives an H×W kernel, while a single integer H gives an H×H kernel.
stride: the convolution stride, i.e. how far the kernel moves at each step.
padding: the amount of padding added to the input; the default padding_mode is zero-padding.
dilation: the dilation operation, controlling the spacing between kernel points; the default is 1.
groups: controls grouped convolution; the default is no grouping. (The input feature maps are split into groups, and each group is convolved separately.)
bias: whether to add a bias; if True, a learnable bias is added to the output. The default is True.


  • Linear and Conv2d can be connected: the output of Conv2d is a four-dimensional tensor, which can serve as the input of a fully connected layer after being reshaped into a two-dimensional tensor, as sketched below.
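A sketch of that connection (the layer sizes are made up for illustration):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2)
x = torch.randn(8, 1, 28, 28)     # [N, C, H, W]
out = conv(x)                     # four-dimensional: [8, 16, 28, 28]
flat = out.view(out.size(0), -1)  # two-dimensional: [8, 16*28*28]
fc = nn.Linear(16 * 28 * 28, 10)
y = fc(flat)                      # [8, 10]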


  • (3) Two-dimensional batch normalization module

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True)
num_features: the number of channels C of an input of shape (N, C, H, W)


  • (4) Maximum pooling module

torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1)
Input: (N,C,H_in,W_in)
Output: (N,C,H_out,W_out)


  • (5) Average pooling module:

torch.nn.AvgPool2d(kernel_size, stride=None, padding=0, ceil_mode=False)
ceil_mode: If True, use the ceil function instead of floor when calculating the output shape


  • (6) Adaptive average pooling module:

torch.nn.AdaptiveAvgPool2d(output_size)
output_size: the size of the output signal.
What makes adaptive pooling special is that the output tensor always has the given output_size: for an input of any size, the output can be specified as H×W, while the number of channels stays unchanged.
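A small sketch showing that inputs of different sizes map to the same output size:

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(output_size=(7, 7))
a = torch.randn(1, 64, 50, 30)
b = torch.randn(1, 64, 224, 224)
print(pool(a).shape)  # torch.Size([1, 64, 7, 7])
print(pool(b).shape)  # torch.Size([1, 64, 7, 7]); the channel count is unchanged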

nn.functional

  1. Location: torch.nn.functional
    Example: torch.nn.functional.adaptive_avg_pool2d(input, output_size)
  2. torch.nn.ReLU(inplace=False) inplace: whether to perform the operation in place, i.e. overwrite the input directly (like x += 1) instead of allocating a new tensor
  3. Sigmoid:
m = nn.Sigmoid() 
input = torch.randn(2) 
output = m(input) 
input, output 
(tensor([-0.8425, 0.7383]), tensor([0.3010, 0.6766]))

4. Tanh:

m = nn.Tanh() 
input = torch.randn(2) 
output = m(input) 
input, output 
(tensor([1.3372, 0.6170]), tensor([0.8710, 0.5490])) 

RNN

function: torch.nn.RNN(input_size, hidden_size, num_layers)

  • hidden_size: the number of neurons in the hidden layer, which is also the dimension of the output, since the RNN's output is the hidden state at each time step;
  • num_layers: the number of hidden layers;

The output of an RNN consists of two parts: the output value (output) and the hidden state h_n of the hidden layer at the last time step.

Forward prediction:

  • x: [seq_len, batch, feature_len]; all features are fed in at once, so there is no need to feed the current x_t at every step;
  • h0 is the tensor of the memory units of all layers at the initial moment (understood as the initial hidden state of each layer for each sequence);

Example: Input a piece of Chinese and output a piece of English. Each Chinese character is encoded with 100-dimensional data, the dimension of each hidden layer is 20, and there are 4 hidden layers. So input_size = 100, hidden_size = 20, num_layers = 4. Suppose the model has been trained, and now there is a sentence of length 10 as input, then seq_len = 10, batch_size = 1.


import torch
import torch.nn as nn
input_size = 100   # dimension of the input encoding
hidden_size = 20   # dimension of the hidden layers
num_layers = 4     # number of hidden layers
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
print("rnn:", rnn)
seq_len = 10       # sentence length
batch_size = 1

x = torch.randn(seq_len, batch_size, input_size)       # input data x
h0 = torch.zeros(num_layers, batch_size, hidden_size)  # initial hidden state h0

out, h = rnn(x, h0)  # outputs

print("out.shape:", out.shape)
print("h.shape:", h.shape)
  • RNNCell module:
    (1) Difference: nn.RNN feeds the features of all time steps into the network at once; nn.RNNCell processes the data of each time step of the sequence separately
    (2) forward prediction: h_t = forward(x_t, h_{t-1})

For example: to process 3 sentences, each containing 10 words, with each word represented by a 100-dimensional embedding vector, the tensor passed to nn.RNN has shape [10, 3, 100], while nn.RNNCell takes a tensor of shape [3, 100] and the computing unit is run 10 times, as sketched below.
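A sketch of this example with nn.RNNCell, feeding one time step at a time:

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=100, hidden_size=20)
x = torch.randn(10, 3, 100)  # [seq_len, batch, feature_len]
h = torch.zeros(3, 20)       # hidden state of one layer: [batch, hidden_len]
for xt in x:                 # run the computing unit once per time step (10 times)
    h = cell(xt, h)
print(h.shape)               # torch.Size([3, 20])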

  • LSTM module
    • function:torch.nn.LSTM(input_size, hidden_size, num_layers)
    • Input and output format: out, (h_t, c_t) = lstm(x, (h_t0, c_t0))
    • LSTMCell module: h_t, c_t = lstmcell(x_t, (h_t-1, c_t-1))
      • x_t: [batch, feature_len], the input at time t
      • h_t-1, c_t-1: [batch, hidden_len], the hidden unit and memory unit of this layer at time t−1

nn.LSTM feeds the features of all time steps into the network at once;
nn.LSTMCell processes the data of each time step separately, as sketched below.
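A minimal sketch of both interfaces (the sizes follow the earlier example):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=20, num_layers=4)
x = torch.randn(10, 3, 100)          # [seq_len, batch, feature_len]
h0 = torch.zeros(4, 3, 20)           # [num_layers, batch, hidden_size]
c0 = torch.zeros(4, 3, 20)
out, (h_t, c_t) = lstm(x, (h0, c0))  # the whole sequence at once
print(out.shape)                     # torch.Size([10, 3, 20])

cell = nn.LSTMCell(input_size=100, hidden_size=20)
h, c = torch.zeros(3, 20), torch.zeros(3, 20)
for xt in x:                         # one time step at a time
    h, c = cell(xt, (h, c))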


CNN

  • Feedforward neural networks learn mainly by error correction (e.g. the BP algorithm); computation is relatively slow and convergence is slow as well;
  • Feedback neural networks mainly use the Hebb learning rule and converge very quickly.

CNNs use partially connected layers; the three underlying ideas are locality, identity, and invariance.

convolutional layer

  • Neurons in the first layer connect only to the input-image pixels inside their receptive field, focusing on the low-level features of the image
  • Neurons in the second layer connect only to the first-layer neurons inside their receptive field, assembling higher-level features

Filters (convolution kernels): the parameters of the convolution operation are
1) the height, width and channel count (H, W, C) of the filter
2) the stride
3) the boundary padding


Note: the output of a filter applied to an image is called a feature map; all neurons in a feature map share the same parameters (weights and bias)

Basic structure (input layer + convolutional layer (feature extraction) + pooling layer (compressed feature) + fully connected layer (non-linear output))

  • Representation of input image: 3D Tensor [height, width, channels];
  • Minibatch representation: 4D Tensor [minibatch, height, width, channels];

The weight representation of the convolutional layer: 4D tensor [fh,fw,fn,fn′]

  • fh is the height of the current layer filter; fw is the width of the current layer filter; fn is the number of current layer filters; fn' is the number of feature maps of the previous layer

The bias item representation of the convolutional layer: 1D tensor [fn]

Computation example: a convolutional layer with 5×5 filters outputs 200 feature maps of size 150×100, with stride 1 and SAME padding.

Number of parameters: if the input is a 150×100 RGB image (three channels), the number of parameters is (5×5×3 + 1) × 200 = 15,200.

Each of the 200 feature maps contains 150×100 neurons, and each neuron computes a weighted sum of its 5×5×3 = 75 inputs, for a total of 75 × 150 × 100 × 200 = 225 million float multiplications.

If the feature maps are represented with 32-bit floats, the convolutional layer's output occupies 200 × 150 × 100 × 32 = 96 million bits (about 11.4 MB) of memory.
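The arithmetic can be checked directly (a sketch, using the numbers from the text):

fh, fw, c_in, n_maps = 5, 5, 3, 200  # filter size, input channels, feature maps
h_out, w_out = 150, 100

params = (fh * fw * c_in + 1) * n_maps             # weights plus one bias per filter
mults = (fh * fw * c_in) * h_out * w_out * n_maps  # float multiplications per image
mem_bits = n_maps * h_out * w_out * 32             # 32-bit floats

print(params)                # 15200
print(mults)                 # 225000000
print(mem_bits / 8 / 2**20)  # about 11.4 (MB, using 1 MB = 2^20 bytes)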

about padding

When writing code, note that padding has two modes, 'same' and 'valid':

padding='same' means pad the input; the amount of padding is computed from the kernel size so that the output size equals the input size.
padding='valid' means no padding (i.e. padding=0): only valid window positions are used. This is the default.

  • Padding amount = (b - 1) / 2,
    where b is the kernel size; this is why the kernel size is usually chosen to be odd

Examples:
Question 1. A feature map of size 5×5 passes through a 3×3 convolutional layer with stride = 1. If the output size is to equal the input size, what should the padding be?
Answer: padding = (3 - 1)/2 = 1, i.e. each side is padded with 1 layer.

Question 2. A feature map of size 224×224 passes through a 7×7 convolutional layer with stride = 2. For "same"-style padding, what should the padding be?
Answer: padding = (7 - 1)/2 = 3, i.e. each side is padded with 3 layers.

  • The purpose of padding='same' is to make the output size equal to the input size, but only when stride=1; if the stride is not 1, the output size will necessarily differ from the input size.

Put simply, the two padding modes mean either no padding at all ('valid') or enough padding to keep the output size equal to the input size ('same'). The padding value is not arbitrary: it is either padding=0 or padding=(b - 1)/2.
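A quick Keras check of the two modes (shapes only; the layer weights are untrained):

import tensorflow as tf

x = tf.random.normal([1, 5, 5, 1])
same = tf.keras.layers.Conv2D(1, kernel_size=3, strides=1, padding='same')(x)
valid = tf.keras.layers.Conv2D(1, kernel_size=3, strides=1, padding='valid')(x)
print(same.shape)   # (1, 5, 5, 1): output size equals input size
print(valid.shape)  # (1, 3, 3, 1): only valid window positions, 5 - 3 + 1 = 3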


Pooling layer:

1) Goal: subsample (i.e. downscale) the input image to reduce computation, memory usage and the number of parameters (reducing the risk of overfitting).
2) Features: each pooling neuron is connected to the outputs of neurons in the previous layer that lie within a small rectangular receptive field. A pooling neuron has no weights; it aggregates its inputs with an aggregation function such as max_pool() or avg_pool().
3) Functions: feature dimensionality reduction, which helps avoid overfitting; spatial invariance; fewer parameters, which lowers training difficulty.

4) Code for a pooling layer with a 2x2 kernel (Keras implementation):

keras.layers.MaxPool2D(pool_size=(2, 2), strides=None,  # strides defaults to pool_size
                       padding='valid', data_format=None)

keras.layers.AvgPool2D(pool_size=(2, 2), strides=None,  # strides defaults to pool_size
                       padding='valid', data_format=None)

keras.layers.GlobalAveragePooling2D(data_format=None)

  • Design a CNN network for MNIST processing, with the structure and parameters implemented below; add a Dropout with a coefficient of 0.25 after each MaxPooling.

import os
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
import numpy as np

(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train4D = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32')
X_test4D = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')

# normalization
X_train4D_Normalize = X_train4D / 255
X_test4D_Normalize = X_test4D / 255
# one-hot encode the labels
y_trainOnehot = to_categorical(y_train)
y_testOnehot = to_categorical(y_test)

# build the model
model = Sequential()
model.add(Conv2D(filters=16,kernel_size=(5, 5),padding='SAME', input_shape=(28, 28, 1),activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# second convolutional layer
model.add(Conv2D(filters=32,kernel_size=(5, 5), padding='SAME', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# third and fourth convolutional layers
model.add(Conv2D(filters=64, kernel_size=(5, 5), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters=128, kernel_size=(5, 5), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25)) 
# fully connected layers
model.add(Flatten()) 
model.add(Dense(128, activation='relu')) 
model.add(Dropout(0.25)) 

model.add(Dense(10, activation='softmax')) 
model.summary()
# compile the model
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
# train the model
train_history = model.fit(x=X_train4D_Normalize,y=y_trainOnehot,validation_split=0.2,batch_size=300,
 epochs=10,verbose=2)
# evaluate the model
model.evaluate(X_test4D_Normalize, y_testOnehot)[1] 
# predict (predict_classes was removed in newer Keras; argmax over predict gives the same result)
prediction = np.argmax(model.predict(X_test4D_Normalize), axis=1)

RNN

Concept: a type of neural network for processing sequence data, where the current output of a sequence depends on the previous outputs. Concretely, the network remembers previous information and applies it to the computation of the current output: the nodes between hidden layers are connected, and the input of a hidden layer includes not only the output of the input layer but also the hidden layer's own output from the previous time step.


Structure :

Parameters, 3 groups of weights:

  • U (computation between the input x and the hidden nodes)
  • W (hidden-state transition over time)
  • V (computation between the hidden nodes and the output node)
  • s_t is the state of the hidden layer at time t: s_t = f(U·x_t + W·s_{t−1})

Activation function: generally tanh or ReLU
o_t is the output at time t: o_t = softmax(V·s_t)


Classification:

  • According to the structure of the input and output sequences : One to One, One to Many (vector to sequence), Many to One (sequence to vector), Many to Many, Encoder-Decoder model (Seq2Seq model)
  • According to the internal structure : traditional RNN, LSTM, Bi-LSTM, GRU, Bi-GRU

Disadvantages of the simple RNN: it performs poorly when predicting long sequences, suffers from unstable gradients (prone to vanishing or exploding), and when processing long sequences it gradually forgets the first inputs of the sequence

  • Simple RNN
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
keras.layers.SimpleRNN(1, input_shape=[None, 1]) ])

optimizer = keras.optimizers.Adam(learning_rate=0.005)

model.compile(loss="mse", optimizer=optimizer)
history = model.fit(X_train, y_train, epochs=20,validation_data=(X_valid, y_valid))

model.evaluate(X_valid, y_valid)
y_pred = model.predict(X_valid)
  • Deep RNN
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
 keras.layers.SimpleRNN(20, return_sequences=True,input_shape=[None, 1]),
 keras.layers.SimpleRNN(20, return_sequences=True),
 keras.layers.SimpleRNN(1)
])
model.compile(loss="mse", optimizer="adam")
history = model.fit(X_train, y_train, epochs=20,validation_data=(X_valid, y_valid))
model.evaluate(X_valid, y_valid)
y_pred = model.predict(X_valid)
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
keras.layers.BatchNormalization(),  # batch normalization
keras.layers.SimpleRNN(20, return_sequences=True),
keras.layers.BatchNormalization(),
keras.layers.TimeDistributed(keras.layers.Dense(10))  # apply the Dense layer at every time step:
# each step predicts the next 10 values, i.e. at time 0 it predicts values 1 to 10, at time 1 values 2 to 11
])
model.summary()
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history = model.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))
model.evaluate(X_valid, Y_valid)

LSTM


  • The state of the LSTM cell: h[t] (short-term memory state), c[t] (long-term memory state)

c[t−1] first passes through the forget gate, which drops some memories; the addition operation then adds some new memories (selected by the input gate); the result is sent out directly as c[t] without further transformation. After the addition, a copy of the long-term state is also passed through the tanh activation function and filtered by the output gate, giving the short-term state h[t]. h[t] is also the cell's output y[t] for this time step.

The short-term memory state h[t−1] and the input vector x[t] pass through 4 different fully connected layers. The g[t] layer is the main layer (it stores the most important part in the long-term state; tanh is its activation function), while the other three fully connected layers (FC) are gate controllers (with the logistic function as activation).

Weight calculation
W_xi, W_xf, W_xo, W_xg are the weights of the four fully connected layers for the input vector x(t);
W_hi, W_hf, W_ho, W_hg are the weights of the four fully connected layers for h(t−1);
b_i, b_f, b_o, b_g are the four bias terms of the fully connected layers.

Number of weights in W_xi, W_xf, W_xo, W_xg (per gate):
input size × hidden size + hidden size × hidden size
Number of biases: hidden size

np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]),
keras.layers.LSTM(20, return_sequences=True),
keras.layers.TimeDistributed(keras.layers.Dense(10))
])

model.summary()
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history = model.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))
model.evaluate(X_valid, Y_valid)
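As a check of the weight-count formula above, the first LSTM layer of this model should have 4 × (1 × 20 + 20 × 20 + 20) = 1,760 parameters, matching what Keras reports; a small sketch:

from tensorflow import keras

model = keras.models.Sequential([keras.layers.LSTM(20, input_shape=[None, 1])])
# 4 fully connected layers (gates), each with input weights, recurrent weights and biases:
inp, hidden = 1, 20
print(4 * (inp * hidden + hidden * hidden + hidden))  # 1760
print(model.count_params())                           # 1760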

GRU

The long-term and short-term states are merged into a single vector h(t). One gate controller z(t) (the update gate) controls both the forget gate and the input gate:

  • if the gate controller outputs 1, the forget gate is open (= 1) and the input gate is closed (1 - 1 = 0);
  • if it outputs 0, the opposite happens. The output gate is removed, and the full state vector is output at every time step.

A second gate controller r(t) (the reset gate) controls which parts of the previous state are presented to the main layer g(t)


  • Note: although LSTM and GRU cells can handle much longer sequences than simple RNNs, they still have a fairly limited short-term memory.

One way to deal with this is to shorten the input sequences, for example with a 1D convolutional layer. A 1D convolutional layer slides several kernels across the sequence, and each kernel produces a 1D feature map. Each kernel learns a very short sequential pattern (no longer than the kernel size).

np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]),
keras.layers.GRU(20, return_sequences=True),
keras.layers.TimeDistributed(keras.layers.Dense(10))
])
model.summary()
model.compile(loss="mse", optimizer="adam", metrics=[last_time_step_mse])
history = model.fit(X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid))
  • An RNN classification task on the MNIST dataset
import torch
import torch.nn as nn
import torchvision.datasets as ds
import torchvision.transforms as transforms
from torch.autograd import Variable

sequence_length, input_size, hidden_size = 28, 28, 128
num_layers, num_classes, batch_size, num_epochs = 2, 10, 100, 2
learning_rate = 0.003

# load the MNIST dataset
train_dataset = ds.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
test_dataset = ds.MNIST(root='./data', train=False, transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# define a many-to-one bidirectional RNN model
class BiRNN(nn.Module):
   def __init__(self, input_size, hidden_size, num_layers, num_classes):
      super(BiRNN, self).__init__()
      self.hidden_size = hidden_size
      self.num_layers = num_layers
      self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                          batch_first=True, bidirectional=True)
      self.fc = nn.Linear(hidden_size * 2, num_classes)  # *2 because the LSTM is bidirectional

   def forward(self, x):
      h0 = Variable(torch.zeros(self.num_layers * 2, x.size(0), self.hidden_size))  # initial hidden state
      c0 = Variable(torch.zeros(self.num_layers * 2, x.size(0), self.hidden_size))  # initial cell state
      out, _ = self.lstm(x, (h0, c0))  # forward propagation
      out = self.fc(out[:, -1, :])     # use the hidden state of the last time step
      return out

# build the model
rnn = BiRNN(input_size, hidden_size, num_layers, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)

# train the model
for epoch in range(num_epochs):
   for i, (images, labels) in enumerate(train_loader):
      images = Variable(images.view(-1, sequence_length, input_size))
      labels = Variable(labels)
      optimizer.zero_grad()  # forward + backward + optimizer step
      outputs = rnn(images)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()
      if (i + 1) % 100 == 0:
         print('Epoch [%d/%d], Step [%d/%d], Loss: %.4f' % (epoch + 1, num_epochs,
               i + 1, len(train_dataset) // batch_size, loss.item()))
 
# test the model
correct = 0
total = 0
for images, labels in test_loader:
   images = Variable(images.view(-1, sequence_length, input_size))
   outputs = rnn(images)
   _, predicted = torch.max(outputs.data, 1)
   total += labels.size(0)
   correct += (predicted.cpu() == labels).sum()
print('Test Accuracy of the model on test images: %d %%' % (100 * correct / total))

AE

  1. Features
    The latent representations learned by autoencoders usually have much lower dimensionality than the input data, which makes them useful for dimensionality reduction.
    Autoencoders can also be applied to noise reduction.
    Autoencoders can serve as powerful feature detectors and can be used for unsupervised pre-training of deep neural networks.
    Some autoencoders are generative models that randomly generate new data closely resembling the training data.
  2. An autoencoder consists of two parts:
    a recognition network: an encoder that converts inputs into latent representations
    a generative network: a decoder that converts latent representations back into outputs
  3. The latent space representation of the data contains all the important information needed to represent the original data points. The representation must represent the features of the original data. In other words, the model learns the characteristics of the data and simplifies its representation, making it easier to analyze.
  4. The number of neurons in the output layer must equal the number of inputs. Autoencoders try to reconstruct the input, so the output is often called a reconstruction, and the loss function contains a reconstruction loss that penalizes the model when the reconstruction differs from the input.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

X_train = np.random.rand(100, 3)
encoder = Sequential([Dense(2, input_shape=[3])])
decoder = Sequential([Dense(3)])

autoencoder = Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer=SGD(learning_rate=0.1))

history = autoencoder.fit(X_train, X_train, epochs=20)
codings = encoder.predict(X_train)

Autoencoders are forced to learn the most important features of the input data and to drop unimportant ones. If an autoencoder uses only linear activations and the loss function is mean squared error (MSE), it ends up performing principal component analysis (PCA).

Deep Autoencoder

An autoencoder can have multiple hidden layers; it is then called a deep (or stacked) autoencoder.
Adding more layers helps the autoencoder learn more complex encodings, but the autoencoder must not be made too powerful:
if the encoder is too powerful, it will overfit. Such an autoencoder reconstructs the training data perfectly, but it learns no useful representation in the process and is unlikely to generalize well to new instances.

import utils.mnist_reader
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Reshape
from tensorflow.keras.optimizers import SGD

X_train, y_train = utils.mnist_reader.load_mnist('./data/', kind='train')
X_valid, y_valid = utils.mnist_reader.load_mnist('./data/', kind='t10k')
X_train = X_train.reshape(X_train.shape[0], 28, 28).astype('float32') / 255.0  # scale to [0, 1] to match the sigmoid output
X_valid = X_valid.reshape(X_valid.shape[0], 28, 28).astype('float32') / 255.0

stacked_encoder = Sequential([
    Flatten(input_shape=[28, 28]),
    Dense(100, activation="selu"),
    Dense(30, activation="selu"),
])

stacked_decoder = Sequential([
    Dense(100, activation="selu", input_shape=[30]),
    Dense(28 * 28, activation="sigmoid"),
    Reshape([28, 28])
])

stacked_ae = Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(loss="binary_crossentropy", optimizer=SGD(learning_rate=1.5))
history = stacked_ae.fit(X_train, X_train, epochs=10, validation_data=(X_valid, X_valid))

Convolutional Autoencoders

Convolutional autoencoders typically reduce the spatial dimensionality (i.e. height and width) of the input while increasing the depth (i.e. the number of feature maps). The decoder works in reverse, upscaling the image and shrinking the depth back; this can be achieved with transposed convolutional layers. It is also possible to combine upsampling layers with convolutional layers.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Reshape, Conv2D, MaxPool2D, Conv2DTranspose

conv_encoder = Sequential([
    Reshape([28, 28, 1], input_shape=[28, 28]),
    Conv2D(16, kernel_size=3, padding="same", activation="selu"),
    MaxPool2D(pool_size=2),
    Conv2D(32, kernel_size=3, padding="same", activation="selu"),
    MaxPool2D(pool_size=2),
    Conv2D(64, kernel_size=3, padding="same", activation="selu"),
    MaxPool2D(pool_size=2)
])

conv_decoder = Sequential([
    Conv2DTranspose(32, kernel_size=3, strides=2, padding="valid",
                    activation="selu", input_shape=[3, 3, 64]),
    Conv2DTranspose(16, kernel_size=3, strides=2, padding="same",
                    activation="selu"),
    Conv2DTranspose(1, kernel_size=3, strides=2, padding="same",
                    activation="sigmoid"),
    Reshape([28, 28])
])

conv_ae = Sequential([conv_encoder, conv_decoder])
  • Transposed convolution (deconvolution):
    A method used to enlarge an image, usually called transposed convolution or deconvolution.
    For example: input feature map 3×3; kernel = 3×3, stride = 2, padding = 1;
    output feature map: (3 − 1) × 2 − 2 × 1 + 3 = 5, i.e. 5×5 (a shape check follows below).
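A quick shape check of that formula, sketched here with PyTorch's ConvTranspose2d (the Keras counterpart is Conv2DTranspose):

import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                            kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 1, 3, 3)
print(deconv(x).shape)  # torch.Size([1, 1, 5, 5]): (3 - 1)*2 - 2*1 + 3 = 5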

  • Recurrent Autoencoders
    The encoder in a recurrent autoencoder is a sequence-to-vector RNN, and the decoder is a vector-to-sequence RNN.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

recurrent_encoder = Sequential([
	LSTM(100, return_sequences=True, input_shape=[None, 28]),
	LSTM(30)
])
recurrent_decoder = Sequential([
	RepeatVector(28, input_shape=[30]),  # repeat the input 28 times
	LSTM(100, return_sequences=True),
	TimeDistributed(Dense(28, activation="sigmoid"))  # apply the Dense layer to every time slice of the input
])
recurrent_ae = Sequential([recurrent_encoder, recurrent_decoder])
  • VAE (Variational Autoencoder)

Loss function:

  1. Reconstruction loss: pushes the autoencoder to reproduce its input.
  2. Latent loss: pushes the autoencoder to make the codings look as though they were sampled from a simple Gaussian distribution;
    it uses the KL divergence between the target (Gaussian) distribution and the actual distribution of the codings

A variational autoencoder does not directly produce a coding for a given input; instead, the encoder outputs a mean coding μ and a standard deviation σ, and the actual coding is sampled randomly from the Gaussian distribution with mean μ and standard deviation σ. The decoder then decodes the sampled coding normally.

  • Through this random sampling, a previously discrete latent space becomes a continuous, smooth latent space.
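A minimal sketch of that sampling step (the reparameterization trick), assuming an encoder that outputs both mu and log_var; the Sampling layer name here is illustrative:

import tensorflow as tf
from tensorflow import keras

class Sampling(keras.layers.Layer):
    # draw a coding from N(mu, sigma), given mu and log(sigma^2)
    def call(self, inputs):
        mu, log_var = inputs
        eps = tf.random.normal(tf.shape(log_var))  # noise from N(0, 1)
        return mu + tf.exp(log_var / 2) * eps      # mu + sigma * eps

The latent loss is then the KL divergence between N(mu, sigma) and the standard Gaussian, computed from mu and log_var.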

RL

Agent: the learner, i.e. the robot, or your code.
Environment: the environment, i.e. the game itself; OpenAI Gym provides many games, that is, many game environments.
Action: an action, e.g. moving up or down when playing Super Mario.
State: the state; every time the agent performs an action, the environment responds, returning a new state and a reward.
Reward: the score given according to the game's rules. The agent does not know in advance how scoring works; it keeps trying in order to understand the rules. For example, if moving up in some state earns points, then the next time it is in that state it will tend to move up.

In reinforcement learning, the agent (Agent) observes (Observation) the state (State) in the environment (Environment) and makes a decision (Action), and then it will get a reward (Reward).

  • The agent's goal is to learn how to act to maximize the expected reward .

The core interface of gym is the unified environment Env, which contains the following core methods:

1. reset(self): Reset the state of the environment and return the observation value observation.
2. step(self,action): Advance a time step and return [observation, reward, done, info]
3. render(self,mode='human', close=False): Redraw a frame of the environment. The default mode is generally more friendly, such as popping up a window.
4. info(dict): Diagnostic information for debugging.
5. observation(object): Returns an object of a specific environment, describing the observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the state of a board in a board game.
6. reward(float): Returns the total reward value obtained by the previous action.
7. done(boolean): Returns whether the environment should be reset. Most game tasks are divided into multiple stages (episode). When done=true, it means that this stage is over.

For the CartPole environment, each observation is a 1D Numpy vector containing four floats:
1. Horizontal position of the cart (centered at 0)
2. Velocity of the cart
3. Angle of the pole (vertical)
4. Angular velocity of the pole

CartPole-v0 example

env = gym.make('CartPole-v0') # 初始化环境
for i_episode in range(20): # 外层循环
	obs = env.reset()  返回观察值
	for t in range(100): # 内层循环
		env.render()  重绘环境的一帧。
		action = env.action_space.sample()  随机选取动作
		obs, re, done, info = env.step(action)
		print(obs)
		if done:
			print("done == True : %d" % (t+1))
			break
env.close()

deep reinforcement learning

Policy search

In reinforcement learning, the algorithm used by an agent to change its behavior is called a policy. For example, a policy could be a neural network that takes Observations as input and Actions as output.

Policy search methods:
1) Random policy: the policy can be any algorithm you can think of;
2) Trial search: search by trying, i.e. choose the best parameter combination through a large number of experiments;
3) Genetic algorithm: search for the optimal parameters with a genetic algorithm;
4) Policy Gradients (PG): use optimization techniques, evaluating the gradient of the reward with respect to the policy parameters and then adjusting the parameters by following the gradient toward higher rewards (gradient ascent);
5) Neural network policy: a neural network takes an observation as input and outputs the action to perform. It estimates a probability for each action, and an action is then chosen at random according to the estimated probabilities.

Q: Why choose random actions based on the probability given by the neural network instead of choosing the action with the highest score?
A: This method allows the agent to find a balance between exploring new behaviors and choosing feasible actions.

Evaluating Behavior: The Credit Assignment Problem

In reinforcement learning, the only way an agent gets guidance is through rewards, which are usually sparse and delayed.

The credit assignment problem : when an agent is rewarded, how does it know which actions to trust (or blame).

The usual strategy is to evaluate an action based on the sum of all the scores that follow it, applying a decay rate r at each step. If the decay rate is close to 0, future rewards count for little compared with immediate rewards; conversely, if the decay rate is close to 1, rewards far in the future count almost as much as immediate rewards.

If an agent decides to go right three times in a row and gets a +10 reward after the first step, 0 after the second step, and -50 after the third step, then with a decay rate of r = 0.8 the first action receives a score of 10 + r×0 + r²×(−50) = 10 + 0.8×0 + 0.64×(−50) = −22.
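The same computation as a small sketch:

def discounted_return(rewards, r=0.8):
    # sum of rewards, each decayed by r per step after the first action
    return sum(reward * r ** step for step, reward in enumerate(rewards))

print(discounted_return([10, 0, -50]))  # 10 + 0.8*0 + 0.64*(-50) = -22 (up to float rounding)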

Q-Learning

  1. The Markov chain
    contains N states; the transition probability from one state to another is fixed and does not depend on past states (the system has no memory): the future state distribution depends only on the present, not the past. The transition probabilities between all states can be collected in a matrix, called the transition matrix.
  2. Markov Decision Process (MDP):
    1) At each step, the agent can choose one of several possible actions, and the transition probability depends on the chosen action. Some state transitions return some reward (positive or negative), and the agent's goal is to find a policy that maximizes the reward over time.
    2) A method to estimate the optimal state value of any state s, denoted V*(s) (the Bellman optimality equation):
    V*(s) = max_a Σ_s' T(s,a,s') · [ R(s,a,s') + γ·V*(s') ]
  • T(s,a,s') is the transition probability from state s to state s' when the agent chooses action a.
  • R(s,a,s') is the reward obtained when the agent chooses action a from state s to state s'. γ is the discount rate.

Q value iterative algorithm:

An example of dynamic programming, which breaks down a complex problem into tractable subproblems that can be solved iteratively.

Principle : The state-action value (s, a) is called the Q value, denoted as Q(s, a), which will return the future reward expectation of performing the action in this state. The best Q value of state-action (s, a), denoted as Q*(s, a).

  • How it works: first initialize all the Q-value estimates to zero, then update them with the Q-value iteration algorithm:
    Q_{k+1}(s,a) ← Σ_s' T(s,a,s') · [ R(s,a,s') + γ·max_{a'} Q_k(s',a') ]

The Q-Learning algorithm is a Q-value iterative algorithm that adapts to situations where transition probabilities and rewards are unknown.

For each state-action pair (s, a), the algorithm keeps a running average of the rewards the agent receives upon leaving state s with action a, plus the rewards it expects to receive later. Since the target policy will act optimally, we take the maximum of the Q-value estimates for the next state.
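A generic sketch of one such update, with Q stored as a table and alpha as the learning rate of the running average (both names are illustrative):

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    # move Q[s][a] toward the observed reward plus the discounted best next value
    target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])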


Approximate Q-Learning
The main problem with Q-Learning is that it does not scale well to large (or even medium-sized) MDPs with many states and actions. Finding a function that approximates the Q-values with a manageable number of parameters is called approximate Q-Learning.

  • DNNs used to estimate Q-values are called Deep Q-Networks (DQNs). Approximate Q-Learning with a DQN is called Deep Q-Learning (DQL).

Q-learning example

A simple example illustrates how the Q-Learning algorithm executes. A building has 6 rooms, numbered 0-5, connected by doors. Room 5 is the space surrounding the building and can be viewed as one big room. The Agent is first placed in any room of the building, and its goal is to walk from that room to room 5.


  1. Initialization: initialize Q (an all-zero matrix), the reward matrix R, and the discount rate γ

  2. Q-Learning Algorithm

Q(s,a) = R(s,a) + γ·max_{a'} Q(s~,a')

Step 1 Given the parameter γ and the R matrix; set Q = 0.
Step 2 Repeat Steps 3 and 4.
Step 3 Randomly select an initial state s.
Step 4 Loop:
	1. If s is the goal state, end Step 4; otherwise perform the following steps.
	2. Choose an action a among all possible actions of the current state s.
	3. Using the chosen action a, obtain the next state s~.
	4. Compute Q(s,a) with the update formula.
	5. Set s = s~.
  1. Walkthrough

     Loop 1
     	Let the initial state be 1.
     	The state is not 5, so select one of state 1's possible actions from the R matrix: 3 or 5.
     	Select state 5; the next state is 5.
     	Compute Q(1,5) with the formula.
     	The Q matrix is updated:
     		the current state is now 5, so the loop ends after the next check.



	Loop 2
		Return to Step 3 and select a new initial state.
		Select state 3 as the initial state.
		Find state 3's possible actions: 1, 2, 4.
		Select state 1, which has 2 possible actions: 3 or 5.
		Compute Q(3,1) with the formula.
		The Q matrix is updated:
			the current state is now state 1; since state 1 is not the goal state, select one of its possible actions: 3 or 5.
		Select state 5, which has 3 possible actions: 1, 4 or 5.
		Compute Q(1,5).



	After continuing the loop for many more iterations, the matrix Q converges.
	
	Once Q is close to convergence, it can guide the Agent along the best path.
	For example, suppose the Agent starts from state 2; it can take the following path:
	1. In the Q matrix, the largest element of state 2 is 320, so select state 3;
	2. the largest element of state 3 is 400, at state 1 or 4; select state 4;
	3. the largest element of state 4 is 500, i.e. state 5: the goal is reached.
	So the best path is 2-3-4-5.
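A sketch of the whole example in NumPy, assuming the reward matrix of this classic six-room problem (-1 marks "no door"; reaching room 5 gives reward 100), which is what the initialization step above sets up:

import numpy as np

# R[s, a]: reward for moving from room s to room a (-1 = no door)
R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])
gamma = 0.8
Q = np.zeros_like(R, dtype=float)

rng = np.random.default_rng(0)
for _ in range(1000):                              # Step 2: repeat Steps 3-4
    s = rng.integers(6)                            # Step 3: random initial state
    while True:                                    # Step 4
        a = rng.choice(np.where(R[s] >= 0)[0])     # one possible action of s
        Q[s, a] = R[s, a] + gamma * Q[a].max()     # the update formula
        if a == 5:                                 # goal reached: episode ends
            break
        s = a

print(Q.round())  # converges to the values in the trace: Q[2,3]=320, Q[3,1]=400, Q[4,5]=500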

Source: blog.csdn.net/RandyHan/article/details/131014020