Week 2, Day 1 to Day 5

Meeting:

One: Divide the tasks: wyl takes the language model, and yyp and I take the acoustic model. I still want to understand the code of the ASRT system first, and then make modifications.

Two: The required packages are installed.

Three: ASRT can run.

Four: Training data; also the project documents, the PPT, and the content of the meeting.

Five: Read the code.

 

 

Example: using the Tsinghua University open speech dataset data_thchs30 (wav audio). Each utterance has a .wav file plus a matching transcript file, so the file counts below are twice the number of utterances.

train: 20000 / 2

dev: 1786 / 2

test: 4990 / 2

 

(1) Development of Artificial Neural Network

1. M-P neurons (McCulloch-Pitts neurons)

 

2. Activation function (response function)

Maps the input to an output of "0" (the neuron is inhibited) or "1" (the neuron is excited).

The sigmoid function is commonly used: it squashes an input that may vary over a wide range into the interval (0, 1).
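As a minimal NumPy sketch (not taken from the ASRT code), the sigmoid function and its squashing behaviour look like this:

```python
import numpy as np

def sigmoid(x):
    """Squash any real-valued input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-100.0, -1.0, 0.0, 1.0, 100.0])))
# ~0, 0.269, 0.5, 0.731, ~1: very negative inputs -> near 0 (inhibited),
# very positive inputs -> near 1 (excited)
```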

 

3. Perceptron and multilayer perceptron network

Single-layer perceptron: two layers of neurons (an input layer plus an output layer of M-P neurons); only the output layer consists of functional neurons.

Features:

It can only handle linearly separable problems (it cannot handle the XOR problem).

Impact: it set off the first wave of artificial neural networks, but because it could not deal with even the simple XOR problem, the wave soon "ebbed".

 

Multilayer perceptron: "multilayer" means ① a single hidden layer or ② multiple hidden layers. [Both the hidden layers and the output layer are functional neurons with activation functions.]

Features:

① Fully connected (each neuron is connected to every neuron in the previous layer; there are no connections within a layer or across non-adjacent layers, and each connection has its own weight)

② Feedforward neural network

Disadvantage: at the time it was hard to train, and there was no effective learning method for its parameters.

-------> So later [see the second wave of neural networks below] the BP algorithm was proposed, together with the sigmoid activation function, which made neural networks capable of solving nonlinear problems.

 

To solve the non-linear separability problem, it is necessary to consider the use of multilayer perceptrons.

The first to break this nonlinear curse was Geoffrey Hinton, the father of deep learning: in 1986 he introduced the BP algorithm (backpropagation), applicable to the multilayer perceptron (MLP), with the sigmoid activation function providing the nonlinear mapping. This effectively solved the problem of nonlinear classification and learning, and it set off the second wave of neural networks.

 

 

In 1989, Yann LeCun used the idea of backpropagation to invent a convolutional neural network, LeNet, applied it to digit recognition, and achieved good results.

Unfortunately, this wave lasted only until the mid-1990s and then "ebbed".

 

The reasons:

• The approach didn't scale to larger problems (the vanishing gradient problem, and insufficient data and computing power); it could not handle more complex tasks.

      Vanishing gradients: as the error gradient is propagated backward, it is almost 0 by the time it reaches the earlier layers, so those layers cannot be learned effectively. This discovery made the situation for NN research even worse.

      Not enough data: the Internet was not yet developed at the time.

      Insufficient computing power: no GPUs, and CPUs were not powerful enough.

• 1997: the LSTM (Long Short-Term Memory) model was invented. Although its sequence-modeling ability is outstanding, it arrived during the NN downturn and did not receive enough attention.

• The support vector machine (SVM) became the method of choice. In the mid-1990s, SVMs and other machine learning algorithms appeared and achieved very good results on many important tasks, once again drawing research attention away from neural networks.
 
 
 
 
 

Around 2006, Hinton once again declared that he knew how the brain works and proposed unsupervised pre-training and deep belief networks. With this strategy, people could train deeper networks than before, prompting "neural networks" to be rebranded as "deep learning".

This set off the third wave of neural networks.

 

The real breakthrough for neural networks came in 2012:

GPUs were used for training,

and a competition was held on the ImageNet image dataset (created at Stanford University).

According to the figure, as the number of GPUs increases and the network architecture improves, the error rate keeps dropping.

 

 

 

Now that the development history of artificial neural networks has been covered, before discussing the BP algorithm we need to answer:

  • How to build a neural network?
  • What does the learning (training) process look like?

  • How to update the parameters?

  • How to update the weights and biases efficiently?

 

 

(1) How to build a neural network?

The learning process of a neural network is the process of continually adjusting the learnable parameters so that they approach their optimal values.

  • Manually determined parameters (hyperparameters):

Number of hidden layers; number of neurons in each hidden layer; activation function of each neuron; loss function E; learning rate; batch size; number of epochs

  • Parameters to be learned (randomly initialized in the network at the start):

Weights; biases
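To make the two kinds of parameters concrete, here is a minimal Keras sketch (the layer sizes, activation, loss and learning rate are illustrative choices, not values from the ASRT project):

```python
import tensorflow as tf

# Manually chosen hyperparameters (illustrative values):
n_units = 128            # neurons per hidden layer
activation = "sigmoid"   # activation function of each neuron
learning_rate = 0.01     # learning rate
# batch size and number of epochs are passed to fit() later

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200,)),                 # input feature dimension (assumed)
    tf.keras.layers.Dense(n_units, activation=activation),   # hidden layer 1
    tf.keras.layers.Dense(n_units, activation=activation),   # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax"),         # 10 output classes (assumed)
])

# Loss function E and optimizer (gradient descent with the chosen learning rate)
model.compile(loss="categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
              metrics=["accuracy"])

# The parameters to be learned (weights and biases) are created and randomly
# initialized by Keras; model.summary() shows how many there are.
model.summary()
```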

 

 

 

 

(2) Learning or Training Process?

Prepare the data (split into a training set and a validation set). For example:

xi represents the input of the acoustic model (the feature matrix after data preprocessing), and yi represents the target output of the acoustic model (the pinyin sequence).

 

Illustration: suppose the training set contains 10000 wav files and the batch size is 100. Each step, 100 speech feature matrices are fed into the neural network, the loss over those 100 samples is computed, and the learnable parameters w and b are adjusted by backpropagation; that is one iteration. Then the next batch of 100 utterances is fed in and the process repeats. Once all 10000 utterances of the training set have been used, that counts as one round, i.e. one epoch. One round of training is obviously not enough!!! The validation set is used to check the training effect.

Shuffle the training set, take it batch by batch, and train for one epoch; then evaluate on the validation set.

Shuffle the training set, take it batch by batch, and train for one epoch; then evaluate on the validation set.
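In Keras, one way this loop might be expressed is the sketch below; `model`, `x_train`, `y_train`, `x_val`, `y_val` are assumed to exist, and the epoch count is an arbitrary choice:

```python
# Assumed: `model` is a compiled Keras model, x_train/y_train hold the 10000
# preprocessed feature matrices and labels, x_val/y_val is the validation set.
history = model.fit(
    x_train, y_train,
    batch_size=100,                   # 100 feature matrices per iteration -> 100 iterations per epoch
    epochs=20,                        # one pass over all 10000 samples = 1 epoch; one is not enough
    shuffle=True,                     # shuffle the training set before each epoch
    validation_data=(x_val, y_val),   # check the effect on the validation set after each epoch
)
```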

 

When to stop training:

1) Manually fix the number of epochs.

2) Monitor the loss on the validation set each round; if it has not decreased for several rounds, training can stop.
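Criterion 2) is early stopping; with Keras it could look like the following sketch (the patience value is an illustrative choice, and `model`, `x_train`, etc. are assumed as above):

```python
import tensorflow as tf

# Stop when the validation loss has not decreased for 5 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                               restore_best_weights=True)

model.fit(x_train, y_train,
          batch_size=100,
          epochs=1000,                      # upper bound; early stopping usually ends sooner
          validation_data=(x_val, y_val),
          callbacks=[early_stop])
```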

 

For example (with the 10000-sample training set above):

1) Stochastic Gradient Descent (SGD)

In SGD, the loss of a single sample is used to update the parameters at each iteration.

For this example, one epoch of training updates the weights and biases 10000 times.

(Feed one sample at a time, batch size = 1; the connection weights and biases are updated 10000 times per round.)

2) Batch Gradient Descent (BGD)

In BGD, the mean of the loss over ALL examples in the training set is used at each iteration.

For this example, one epoch of training updates the weights and biases once.

(Feed all samples at once, batch size = 10000; the connection weights and biases are updated once per round.)

3) Mini-batch Gradient Descent

Mini-batch gradient descent uses n samples (instead of 1 sample as in SGD) at each iteration.

For this example, one epoch of training updates the weights and biases 10000 / n times.

(Feed n samples at a time, batch size = n; the connection weights and biases are updated 10000 / n times per round.)
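A minimal NumPy sketch of the three variants on a toy linear-regression problem; setting `batch_size` to 1 gives SGD, to N gives BGD, and anything in between is mini-batch (all data and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000
X = rng.normal(size=(N, 3))                  # toy inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3     # toy targets (known weights + bias)

def train_one_epoch(w, b, batch_size, lr=0.01):
    """One epoch of (mini-)batch gradient descent; returns updated (w, b, n_updates)."""
    idx = rng.permutation(N)                 # shuffle the training set
    n_updates = 0
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w + b - y[batch]    # gradient of 0.5 * mean squared error
        grad_w = X[batch].T @ err / len(batch)
        grad_b = err.mean()
        w -= lr * grad_w                     # one parameter update = one iteration
        b -= lr * grad_b
        n_updates += 1
    return w, b, n_updates

w, b = np.zeros(3), 0.0
for name, bs in [("SGD", 1), ("BGD", N), ("mini-batch", 100)]:
    _, _, n = train_one_epoch(w.copy(), float(b), bs)
    print(f"{name:10s} batch_size={bs:5d} -> {n} updates per epoch")
# SGD: 10000 updates, BGD: 1 update, mini-batch (n=100): 100 updates per epoch
```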

 

 

 

 

(3) How to update the parameters?

Goal: minimize the loss function.

In practice the loss is a very complex function of the parameters, so gradient descent is usually used.

 

 

What is a gradient?

Along the gradient direction, the function increases fastest, so it is easy to find a maximum of the function.

Conversely, along the negative gradient direction (gradient descent), the function decreases fastest, so it is easy to find a minimum of the function.

θ = (θ0, θ1)ᵀ

θ(k+1) = θ(k) + λd, where the descent direction is d = −∇f(θ)

 

λ (step size): in neural networks it is called the "learning rate".

If the step size is too large, the update may overshoot the lowest point.

If the step size is too small, too many iterations may be needed.
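A tiny illustration of the update rule θ(k+1) = θ(k) − λ·∇f(θ) on the toy function f(θ) = θ², showing the effect of the step size (all values are made up):

```python
def gradient_descent(theta0, lr, steps=20):
    """Minimize f(theta) = theta**2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * 2 * theta   # theta(k+1) = theta(k) - lr * grad f(theta)
    return theta

print(gradient_descent(1.0, lr=0.1))   # reasonable step: converges toward 0
print(gradient_descent(1.0, lr=0.01))  # too small: still far from 0 after 20 steps
print(gradient_descent(1.0, lr=1.1))   # too large: overshoots the minimum and diverges
```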

 

4. BP algorithm (back propagation algorithm)

(4) How to update the weights and biases efficiently?

An example follows:

 

 

 

 

 

 

 

Compute from back to front (backpropagation):

1) First compute the residual (error term) of each neuron in the output layer.

2) Then compute the residuals of the hidden layer just before the output layer.

3) And so on, until all parameters are updated.
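A minimal NumPy sketch of these backward steps for a network with one hidden layer, sigmoid activations and a squared-error loss (the sizes and data are made up for illustration; this is the textbook scheme, not the ASRT code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))          # one input sample
t = np.array([[0.0, 1.0]])           # its target output

# Randomly initialized parameters to be learned
W1, b1 = rng.normal(size=(4, 3)), np.zeros((1, 3))   # input -> hidden
W2, b2 = rng.normal(size=(3, 2)), np.zeros((1, 2))   # hidden -> output

# Forward pass
h = sigmoid(x @ W1 + b1)             # hidden-layer activations
y = sigmoid(h @ W2 + b2)             # output-layer activations

# Backward pass (loss E = 0.5 * sum((y - t)**2))
delta_out = (y - t) * y * (1 - y)             # 1) residual of each output neuron
delta_hid = (delta_out @ W2.T) * h * (1 - h)  # 2) residual of the hidden layer

# 3) gradient-descent update of all weights and biases
lr = 0.5
W2 -= lr * h.T @ delta_out;  b2 -= lr * delta_out
W1 -= lr * x.T @ delta_hid;  b1 -= lr * delta_hid
```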

 

 

 

 

 

MLP: Multilayer Perceptron

 

The softmax layer is used as the output layer. (The activation function of the output layer uses the softmax function)

 

To sum up:

Sigmoid maps a real value into the interval (0, 1) (a variant can map into (−1, 1)), and can be used for binary classification.

Softmax maps a k-dimensional real vector (a1, a2, a3, a4, ...) to a vector (b1, b2, b3, b4, ...) in which each bi lies between 0 and 1 and the bi sum to 1; multi-class decisions can then be based on the sizes of the bi, for example taking the dimension with the largest value.


Source: CSDN blogger "trayfour", under the CC 4.0 BY-SA license; reposts should include the original link and this statement.
Original link: https://blog.csdn.net/u014422406/article/details/52805924
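A small NumPy illustration of the sigmoid/softmax contrast summarized above (the input vector is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    e = np.exp(a - a.max())      # subtract the max for numerical stability
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1, -1.0])
print(sigmoid(a))                # each value squashed independently into (0, 1)
b = softmax(a)
print(b, b.sum())                # a probability vector: values in (0, 1) that sum to 1
print(int(np.argmax(b)))         # multi-class decision: take the largest dimension -> 0
```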

 

 

 

Limitations of the multilayer perceptron (MLP):

Full connectivity ------> too many connection weights, requiring a lot of memory and computation.

How to reduce the number of parameters in the network? -------> CNN

(2) Deep learning

 

1. CNN

(1) Receptive field (local receptive field)

Each hidden-layer node is connected only to a local pixel region of the image, which greatly reduces the number of weight parameters that need to be trained.

 

 

(2) Parameter Sharing (weight sharing)

 

(3) Pooling (subsampling)

 

Dimensionality reduction

 

2. Typical CNN structure

(1) Input layer

Keeps the raw pixel values of the image.

The input here is a multi-channel (RGB) image:

[width × height × channels] = (here) [32 × 32 × 3]

(2) Convolution: single channel

 

Convolution kernel: weight w (weight sharing)

 

 

 

Convolution acts somewhat like feature extraction.

N × N image, F × F filter -----> output size = (N − F) / S + 1, giving a feature map.

S: stride (the step with which the filter slides).

Trainable parameters: the values in the filter (the weights of the convolution kernel); they are randomly initialized, so they must be learned.

Problem: the output size may come out fractional.

To avoid a fractional output size, zeros are padded around the border of the input channel.

How many rings of zeros to add? This is determined manually; it is a hyperparameter.
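A small helper sketching this size calculation; with P rings of zero padding the standard formula becomes (N − F + 2P) / S + 1 (the example numbers are arbitrary):

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output side length of an n x n input convolved with an f x f filter.

    With `padding` rings of zeros, the standard formula is (n - f + 2p) / s + 1.
    Raises if the result is not an integer (the "fractional output" problem above).
    """
    numer = n - f + 2 * padding
    if numer % stride != 0:
        raise ValueError("output size would be fractional; adjust padding or stride")
    return numer // stride + 1

print(conv_output_size(32, 5, stride=1, padding=0))  # 28
print(conv_output_size(32, 3, stride=1, padding=1))  # 32 (the original size is preserved)
print(conv_output_size(7, 3, stride=2, padding=1))   # 4
```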

 

(3) Convolution: three channels

The three per-channel convolution results are summed and the bias is added (in the figure: 4 + 1 + 4 + 1 = 10).

Passing the result through the activation function gives the activated feature map ---->

 

(4) Activation Function

1) Sigmoid

When it saturates, its gradient is close to 0, the gradient vanishes, and the learnable parameters barely change.

2) tanh(x)

An improvement on sigmoid: it compresses values into [−1, 1].

3) ReLU

A better activation function: it has no saturation region for positive inputs, so the gradient does not go to 0 there, which speeds up convergence.

 

 

Multiple sets of features can be extracted (by using multiple convolution kernels), producing multiple feature maps.

Each layer in a CNN consists of multiple maps, and each map consists of multiple neural units. All units of the same map share one convolution kernel (i.e. the same weights), and each kernel tends to represent one feature.

Each feature map has its own bias, so with M feature maps there are M biases to learn.

This adds one more hyperparameter: M, the number of convolution kernels.
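A quick sketch of the resulting parameter count: M kernels of size F × F over C input channels give M · (F·F·C + 1) trainable parameters, the +1 being each kernel's bias (the numbers below are illustrative):

```python
def conv_layer_params(m_kernels, f, in_channels):
    """Trainable parameters of a conv layer: M kernels of size F x F x C, plus M biases."""
    return m_kernels * (f * f * in_channels + 1)

# e.g. 32 kernels of size 3 x 3 over an RGB (3-channel) input:
print(conv_layer_params(32, 3, 3))    # 32 * (3*3*3 + 1) = 896
```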

 

 

(5) Pooling layer

Reduces the size of the feature maps.

Max pooling: takes the maximum value of each region.

Average pooling: takes the average value of each region.

No learnable parameters!!! There are no w and b here.

Pooling is applied to each feature map separately.
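A minimal NumPy sketch of 2 × 2 max and average pooling on a single feature map (the input values are arbitrary); note that there is nothing to learn here:

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """2 x 2 pooling with stride 2 on a single (H, W) feature map."""
    h, w = fmap.shape
    blocks = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)
print(pool2x2(fmap, "max"))   # [[4. 2.], [2. 8.]]
print(pool2x2(fmap, "mean"))  # [[2.5 1.], [1.25 6.5]]
```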

 

 

Pooling layer -----> convolutional layer

The pooling layer outputs many feature maps, each of which acts as an input channel of the next convolutional layer.

Assuming there are four input channels in total, one convolution kernel might use only 3 of them, and another convolution kernel might use only 2 of them.

 

 

 

 

 

After convolution, pooling is applied again.

After pooling, to connect to a fully connected layer, the pooled feature maps must be flattened (Flatten),

and the result is then fed into the following fully connected layer.

 

 

To sum up:

Convolutional layer: depth = the number of feature maps = the number of convolution kernels (it may differ from the number of input channels).

Pooling layer: the depth is the same as that of the convolutional layer, because pooling only needs to pool each feature map separately.

Fully connected layer:

(Softmax layer)

 

 

3. Handwritten digit recognition

32 Feature Map

Convolution kernel size 3 × 3

Activation function relu

  

Maximum pooling 2 × 2

 

Dropout: prevents overfitting by randomly ignoring some neurons with probability 0.25 during training.

 

Dense fully connected 128 neurons

num_classes: number of categories

 

Compile:

Loss: mean squared error was used earlier; cross-entropy is used more often (and is used here).

Optimizer: Adadelta, which adapts the learning rate dynamically; the core is still gradient descent.

Metric: accuracy
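These settings resemble the classic Keras MNIST CNN example; under that assumption, a sketch of the model described above could look like this:

```python
import tensorflow as tf

num_classes = 10   # number of digit categories

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),   # 32 feature maps, 3x3 kernels
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),          # 2x2 max pooling
    tf.keras.layers.Dropout(0.25),                           # drop neurons with probability 0.25
    tf.keras.layers.Flatten(),                               # flatten before the dense layer
    tf.keras.layers.Dense(128, activation="relu"),           # fully connected, 128 neurons
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(loss="categorical_crossentropy",   # cross-entropy rather than MSE
              optimizer="adadelta",              # Adadelta: adaptive learning rate
              metrics=["accuracy"])              # measured by accuracy
model.summary()
```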

 

 

(3) The content of the training platform

1. Convolutional neural network

 

Convolution: divided into narrow convolution, full convolution, and same convolution

Narrow convolution

Narrow convolution (valid convolution) can be understood literally: the generated feature map is smaller than the original image, and its stride is variable. If the stride is S, the original image is N1 × N1, and the convolution kernel is N2 × N2, then the image size after convolution is ((N1 − N2)/S + 1) × ((N1 − N2)/S + 1).

Same convolution

Same convolution means that the image after convolution is the same size as the original image. Its stride is fixed at 1, and in general padding must be used (zeros are filled around the border to keep the generated size unchanged).

Full convolution

Full convolution, also called deconvolution, expands each pixel of the original image through a convolution operation. As shown in the figure, the white blocks are the original image, the lighter blocks are the convolution kernel, and the darker blocks are the pixels being convolved. The deconvolution operation also requires padding of the original image, and the generated result is larger than the original image.

The stride of full convolution is also fixed at 1. If the original image is N1 × N1 and the convolution kernel is N2 × N2, the image size after convolution is (N1 + N2 − 1) × (N1 + N2 − 1).

Narrow convolution and same convolution are techniques commonly used in convolutional networks; full convolution is the opposite: it is used more in deconvolutional networks, which will be introduced in later chapters.
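A small sketch that checks the three output sizes with Keras layers, using Conv2DTranspose for the full/deconvolution case (the 28 × 28 single-channel input and 5 × 5 kernel are arbitrary choices):

```python
import tensorflow as tf

x = tf.zeros((1, 28, 28, 1))                     # N1 = 28, one channel, batch of 1

valid = tf.keras.layers.Conv2D(1, 5, strides=1, padding="valid")(x)
same = tf.keras.layers.Conv2D(1, 5, strides=1, padding="same")(x)
full = tf.keras.layers.Conv2DTranspose(1, 5, strides=1, padding="valid")(x)

print(valid.shape)  # (1, 24, 24, 1): (N1 - N2)/S + 1 = (28 - 5)/1 + 1 = 24
print(same.shape)   # (1, 28, 28, 1): zero padding keeps the original size
print(full.shape)   # (1, 32, 32, 1): N1 + N2 - 1 = 28 + 5 - 1 = 32
```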

 

 

 

Deconvolutional Neural Network

1. Deconvolution refers to reconstructing an unknown input from the measured output and a known process. In neural networks, the deconvolution process has no learning ability of its own: it is only used to visualize an already trained convolutional network model, and involves no learning or training.

Note: tf.nn.max_pool_with_argmax only supports GPU operation, so this method currently cannot be used on CPU-only machines.

 

 
