Meeting:
Divide the tasks: wyl takes the language model; yyp and I take the acoustic model. I still want to understand the code of the ASRT system first, and then make modifications!
Two: the feature package is installed
Three: ASRT can run
Four: training data, about project documents, ppt, and the content of the meeting
Five: look at the code
Example---Using Tsinghua University public speech data set data_thchs30 (wav audio)
train:20000 / 2
dev:1786 / 2
test:4990 / 2
(1) Development of Artificial Neural Network
1. MP neurons
2. Activation function (response function)
The ideal activation is a step function that maps the input value to the output "0" (neuron inhibited) or "1" (neuron excited).
In practice the Sigmoid function is commonly used: it squashes an input value that may vary over a large range into the interval (0, 1).
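A minimal sketch of the two activation functions just described. The threshold of 0 for the step function and the sample inputs are my own assumptions, chosen only for illustration.

```python
import numpy as np

def step(x):
    """Step activation: 1 (excited) if x >= 0, else 0 (inhibited). Threshold 0 is assumed."""
    return (np.asarray(x) >= 0).astype(float)

def sigmoid(x):
    """Smoothly squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])   # illustrative inputs
print(step(x))
print(sigmoid(x))
```

Unlike the step function, Sigmoid is differentiable everywhere, which is what gradient-based training needs.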
3. Perceptron and multilayer perceptron network
Single-layer perceptron: 2 layers of neurons (input layer + output layer of MP neurons) [only the output layer consists of functional neurons]
Features:
Can only handle linearly separable problems (the exclusive OR (XOR) problem cannot be handled)
Impact: It set off the first wave of artificial neural networks, but because it could not handle even the simple XOR problem, the tide "ebbed".
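To see the linear-separability limit concretely, here is a small sketch that trains a single-layer perceptron on AND (linearly separable) and on XOR (not separable). The function names, the learning rate of 1.0, and the epoch count are my own illustrative choices.

```python
import numpy as np

def train_perceptron(X, y, epochs=50, lr=1.0):
    """Classic perceptron rule: w += lr * (target - prediction) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (target - pred) * xi
            b += lr * (target - pred)
    return w, b

def accuracy(X, y, w, b):
    preds = (X @ w + b > 0).astype(int)
    return float(np.mean(preds == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

acc_and = accuracy(X, y_and, *train_perceptron(X, y_and))
acc_xor = accuracy(X, y_xor, *train_perceptron(X, y_xor))
print("AND accuracy:", acc_and)
print("XOR accuracy:", acc_xor)
```

AND is learned perfectly, while for XOR no single line can classify more than 3 of the 4 points correctly, however long the perceptron trains.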
Multilayer perceptron: ① a single hidden layer, or ② multiple hidden layers [hidden layers and the output layer are all functional neurons with activation functions]
Features:
① Fully connected (each neuron is connected to all neurons in the previous layer; there are no connections within a layer or across layers; each connection has its own weight)
② Feedforward neural network
Disadvantages: (feedforward) At the time it was difficult to train, because there was no effective method for learning the parameters.
------->So, later [see the second wave of neural networks below], the BP algorithm was proposed, using the Sigmoid activation function (which makes the neural network capable of solving nonlinear problems)
To solve problems that are not linearly separable, multilayer perceptrons are needed.
The first to break the nonlinearity curse was Geoffrey Hinton, the father of deep learning, who in 1986 (with Rumelhart and Williams) introduced the BP algorithm (back-propagation algorithm) for multilayer perceptrons (MLP), using the Sigmoid activation function for nonlinear mapping. This effectively solved the problem of nonlinear classification and learning, and it caused the second upsurge of neural networks.
In 1989, Yann LeCun used the idea of backpropagation to build a convolutional neural network, LeNet, and applied it to digit recognition with good results.
Unfortunately, this wave only lasted until the mid-1990s and then "ebbed".
the reason:
Around 2006, Hinton once again declared that he knew how the brain works, and proposed the ideas of unsupervised pre-training and deep belief networks. With this strategy, people could train deeper networks than before, which prompted "neural networks" to be rebranded as "deep learning."
This caused the third wave of neural networks.
The real breakthrough for neural networks came in 2012, when GPU training was used in the competition held on the ImageNet image dataset (produced by Stanford University).
According to the picture, as the number of GPUs increases and the network structure improves, the error rate decreases.
Now that the development history of artificial neural networks is covered, four questions come before the BP algorithm:
- How to build neural networks?
- Learning or training process?
- How to update parameters?
- How to efficiently update weights and biases?
(1) How to build neural networks?
The learning process of a neural network is to set the manually determined parameters so that the learned parameters keep approaching their optimal values.
- Manually determined parameters (hyperparameters):
Number of hidden layers; number of neurons in each hidden layer; activation function of each neuron; loss function E; learning rate; batch size; number of epochs
- Parameters to be learned: (initialized randomly in the network at the beginning)
Weights; biases
(2) Learning or Training Process?
Prepare the data (divided into a training set and a validation set), for example:
xi represents the input of the acoustic model (the feature matrix after data preprocessing), and yi represents the output of the acoustic model (a pinyin list)
Illustration: Suppose the training set has 10000 wav files and batch size = 100. Then 100 speech feature matrices are fed to the neural network at a time, the loss over these 100 is computed, and the learned parameters w and b are adjusted by backpropagation; this is one iteration. The next 100 utterances are then fed in and the process repeats. Once all 10000 utterances in the training set have been used, one round (one epoch) is complete. One epoch of training is obviously not enough!!! The validation set is used to check the training effect.
In every epoch: shuffle the training set, draw batches until one epoch has been trained, then evaluate on the validation set.
Training stop:
1) Manually fix the number of epochs
2) Check the loss on the validation set after each epoch. If it has not decreased for several epochs, training can stop.
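The loop described above can be sketched as follows. This is only an illustration under loud assumptions: a tiny linear model on synthetic data stands in for the acoustic model and the speech features, and the learning rate, `patience`, and `max_epochs` values are hypothetical choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: a noisy linear target instead of speech features.
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.3 + 0.01 * rng.normal(size=1000)
X_train, y_train = X[:800], y[:800]      # training set
X_val, y_val = X[800:], y[800:]          # validation set

def mse(w, b, X, y):
    return float(np.mean((X @ w + b - y) ** 2))

w, b = np.zeros(3), 0.0                  # parameters to be learned
batch_size, lr = 100, 0.1                # manually determined parameters
max_epochs, patience = 200, 5            # stopping rules 1) and 2)

best_val, bad_epochs, epochs_run = np.inf, 0, 0
for epoch in range(max_epochs):
    order = rng.permutation(len(X_train))            # shuffle the training set
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]        # one batch = one iteration
        err = X_train[idx] @ w + b - y_train[idx]
        w -= lr * 2 * X_train[idx].T @ err / len(idx)
        b -= lr * 2 * err.mean()
    epochs_run += 1
    val_loss = mse(w, b, X_val, y_val)               # check on the validation set
    if val_loss < best_val - 1e-9:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # no decrease for several epochs
            break

print("epochs run:", epochs_run, "best validation loss:", best_val)
```

With 800 training samples and batch size 100, each epoch here performs 8 iterations; the same structure applies to the 10000-utterance example with batch size 100.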
For example:
1) Stochastic Gradient Descent (SGD)
In SGD, the loss of 1 sample is used to update the parameters at each iteration.
For this example (batch size = 1), one epoch of training updates the weights and biases 10000 times.
2) Batch Gradient Descent (BGD)
In BGD, the mean of the loss over ALL examples in the training set is used at each iteration.
For this example (batch size = 10000), one epoch of training updates the weights and biases 1 time.
3) Mini-batch Gradient Descent
Mini-batch gradient descent uses n samples (instead of 1 sample as in SGD) at each iteration.
For this example (batch size = n), one epoch of training updates the weights and biases 10000 / n times.
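The three update counts can be checked with a few lines. The helper name `updates_per_epoch` is mine, and rounding up for a final partial batch is an assumption; the notes only cover batch sizes that divide the training set evenly.

```python
import math

def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of parameter updates in one epoch (a last partial batch counts as one update)."""
    return math.ceil(num_samples / batch_size)

num_samples = 10000
for name, bs in [("SGD", 1), ("BGD", num_samples), ("mini-batch, n = 100", 100)]:
    print(name, "-> batch size", bs, "->", updates_per_epoch(num_samples, bs), "updates")
```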
(3) How to update parameters?
Goal: minimize the loss function
The loss is in fact a very complex function, so gradient descent is often used.
What is a gradient?
The gradient points in the direction in which the function increases fastest, so moving along the gradient (gradient ascent) leads toward a maximum of the function.
That is: moving along the negative gradient (gradient descent) leads toward a minimum of the function.
Θ = (Θ0, Θ1)^T
Θ(k+1) = Θ(k) + λd [descent direction: d = -∇f(Θ(k))]
λ is the step size; it is called the "learning rate" in neural networks.
If the step size is too large, the update may overshoot the lowest point.
If the step size is too small, too many iterations may be needed.
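A small sketch of the update rule and of the effect of the step size. The quadratic function f, its minimum at (3, -1), and the three learning rates are hypothetical choices of mine, used only to make the behaviour visible.

```python
import numpy as np

TARGET = np.array([3.0, -1.0])           # assumed location of the minimum

def f(theta):
    """Toy loss: squared distance to TARGET."""
    return float(np.sum((theta - TARGET) ** 2))

def grad_f(theta):
    return 2.0 * (theta - TARGET)

def gradient_descent(lr, steps=50):
    theta = np.zeros(2)                  # Theta(0)
    for _ in range(steps):
        d = -grad_f(theta)               # descent direction: negative gradient
        theta = theta + lr * d           # Theta(k+1) = Theta(k) + lambda * d
    return theta

good = gradient_descent(lr=0.1)          # converges close to the minimum
tiny = gradient_descent(lr=0.001)        # too small: still far away after 50 steps
huge = gradient_descent(lr=1.1)          # too large: overshoots and diverges

for name, theta in [("lr=0.1", good), ("lr=0.001", tiny), ("lr=1.1", huge)]:
    print(name, "loss after 50 steps:", f(theta))
```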
4. BP algorithm (back propagation algorithm)
(4) How to efficiently update weights and biases?
Examples are as follows:
Calculate from back to front, starting at the output (backpropagation):
1) First find the residual (error term) of each neuron in the output layer
2) Then find the residuals of the hidden layer just before the output layer
3) And so on... until all parameters are updated
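The three steps above can be sketched for a tiny network. Everything specific here is an assumption of mine for illustration: a 2-3-1 layout, Sigmoid in both layers, the squared-error loss E = 0.5 * (output - y)^2, and random initial values. A finite-difference check is included to confirm that the residual-based gradient is correct.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)                    # hidden layer
    a2 = sigmoid(W2 @ a1 + b2)                   # output layer
    return a1, a2

def loss(params, x, y):
    _, a2 = forward(params, x)
    return 0.5 * float(np.sum((a2 - y) ** 2))

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    delta2 = (a2 - y) * a2 * (1 - a2)            # 1) residual of the output layer
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # 2) residual of the hidden layer
    # 3) gradients for W1, b1, W2, b2 follow directly from the residuals
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x, y = np.array([0.5, -1.0]), np.array([1.0])

grads = backprop(params, x, y)

# Finite-difference check on one weight, W1[0, 0]
eps = 1e-6
params[0][0, 0] += eps
loss_plus = loss(params, x, y)
params[0][0, 0] -= 2 * eps
loss_minus = loss(params, x, y)
params[0][0, 0] += eps                           # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)

# One gradient-descent update with an assumed learning rate
lr = 0.1
old_loss = loss(params, x, y)
updated = [p - lr * g for p, g in zip(params, grads)]
new_loss = loss(updated, x, y)
print(numeric, grads[0][0, 0], old_loss, new_loss)
```

The efficiency comes from reuse: the hidden-layer residual is computed from the output-layer residual, so one backward pass yields the gradients of all parameters.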
MLP: Multilayer Perceptron
The softmax layer is used as the output layer (the activation function of the output layer is the softmax function).
To sum up:
Sigmoid maps a real value to the interval (0, 1) (a rescaled variant such as tanh maps to (-1, 1)), which can be used for binary classification.
Softmax maps a k-dimensional real-valued vector (a1, a2, a3, a4...) to a vector (b1, b2, b3, b4...) in which each bi lies between 0 and 1 and all bi sum to 1. Multi-class classification can then be based on the size of the bi, for example by taking the dimension with the largest value.
————————————————
Copyright statement: This article is an original article by the CSDN blogger "trayfour" and follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement when reprinting.
Original link: https://blog.csdn.net/u014422406/article/details/52805924
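A minimal sketch of the softmax mapping summarized above. The input vector is an arbitrary example of mine; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(a):
    """Map a real-valued vector to values in (0, 1) that sum to 1."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())              # shift by the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0, 4.0])       # illustrative (a1, a2, a3, a4)
b = softmax(a)
predicted_class = int(np.argmax(b))      # take the dimension with the largest value
print(b, predicted_class)
```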
Limitations of the multilayer perceptron (MLP):
Full connection ------> too many connection weights, requiring a lot of memory and computation
How to reduce the number of parameters in the network? -------> CNN
(2) Deep learning
1. CNN
(1) Local receptive field
Each hidden layer node is only connected to a local pixel region of the image, which greatly reduces the number of weight parameters that need to be trained.
(2) Parameter sharing (weight sharing)
(3) Pooling (subsampling)
Dimensionality reduction
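To get a feel for how much local receptive fields and weight sharing save, here is a rough parameter count. The 32 × 32 × 3 input, the 100 hidden units, and the 100 kernels of size 5 × 5 are assumptions of mine chosen only for comparison.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Fully connected layer: one weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units

def conv_params(kernel_size: int, in_channels: int, n_kernels: int) -> int:
    """Convolutional layer: each kernel is shared across all positions, plus one bias per kernel."""
    return kernel_size * kernel_size * in_channels * n_kernels + n_kernels

n_inputs = 32 * 32 * 3                   # assumed RGB image, flattened
fc = dense_params(n_inputs, 100)         # 100 hidden units (hypothetical)
cv = conv_params(5, 3, 100)              # 100 kernels of 5 x 5 (hypothetical)
print("fully connected:", fc, "convolutional:", cv)
```

Under these assumptions the convolutional layer needs roughly 40 times fewer parameters, and the gap widens as the image grows because the kernel size does not depend on the image size.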
2. Typical CNN structure
(1) Input layer
Keeps the original pixel values of the image.
The input here is a multi-channel (RGB) image rather than a flat vector:
[width × height × channels] = (here) [32 × 32 × 3]
(2) Convolution, single channel
Convolution kernel: weights w (weight sharing)
It works somewhat like a feature extractor
N × N image, F × F filter -----> feature map with output size = (N - F) / S + 1
S: stride (how far the filter moves at each step)
Trainable parameters: the values in the filter (the convolution kernel weights); they are initialized randomly, so they need to be learned
Problem: the output size may not be an integer
To prevent a fractional output size, zeros are added around the border of the input channel (zero padding)
How many rings of zeros to add? This must be chosen manually, so it is a hyperparameter.
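The output-size rule and the sliding-window computation can be sketched as below. The symbol P for the number of zero rings and the extended formula (N - F + 2P) / S + 1 are the standard form with padding, not something written in these notes; the code also assumes a square image and a square kernel.

```python
import numpy as np

def conv_output_size(n, f, stride=1, pad=0):
    """(N - F + 2P) / S + 1; raises if the result would not be an integer."""
    span = n - f + 2 * pad
    if span % stride != 0:
        raise ValueError("output size is not an integer; adjust padding or stride")
    return span // stride + 1

def conv2d(image, kernel, stride=1, pad=0):
    """Single-channel convolution as used in CNNs (no kernel flip)."""
    n, f = image.shape[0], kernel.shape[0]
    out = conv_output_size(n, f, stride, pad)
    padded = np.pad(image, pad)                      # add `pad` rings of zeros
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            r, c = i * stride, j * stride
            patch = padded[r:r + f, c:c + f]
            result[i, j] = np.sum(patch * kernel)    # the same kernel at every position
    return result

image = np.arange(25, dtype=float).reshape(5, 5)     # illustrative 5 x 5 input
kernel = np.ones((3, 3))                             # illustrative 3 x 3 filter
print(conv2d(image, kernel).shape)                   # no padding
print(conv2d(image, kernel, pad=1).shape)            # one ring of zeros
```

For example, N = 6, F = 3, S = 2 gives (6 - 3) / 2 + 1 = 2.5, which is the fractional case; the function raises an error there instead of silently rounding.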
(3) Convolution, three channels
Sum of the three per-channel convolution results + bias = 4 + 1 + 4 + 1 = 10
The result passes through the activation function to give the activated feature map ---->
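A sketch of one output value of a three-channel convolution. The original figure is not available, so the patch and kernel values below are made up by me purely to reproduce the arithmetic 4 + 1 + 4 + 1 = 10; ReLU is used as an example activation.

```python
import numpy as np

def conv_multichannel_at(patch, kernel, bias):
    """patch and kernel have shape (channels, F, F); returns per-channel sums and the final value."""
    per_channel = np.sum(patch * kernel, axis=(1, 2))   # one convolution result per channel
    return per_channel, float(per_channel.sum() + bias)

def relu(x):
    return np.maximum(x, 0.0)

patch = np.ones((3, 2, 2))                               # hypothetical input patch, 3 channels
kernel = np.array([
    [[1, 1], [1, 1]],    # channel 0 contributes 4
    [[1, 0], [0, 0]],    # channel 1 contributes 1
    [[1, 1], [1, 1]],    # channel 2 contributes 4
], dtype=float)

per_channel, value = conv_multichannel_at(patch, kernel, bias=1.0)
activated = relu(value)
print(per_channel, value, activated)
```

Note that one kernel spans all input channels and still produces a single feature map with a single bias.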
(4) Activation Function
1) sigmoid
When saturated, the gradient is close to 0 (the gradient vanishes), so the learned parameters barely change.
2) tanh(x)
An improvement on sigmoid; the output is compressed to (-1, 1)
3) ReLU
A better activation function: there is no saturation region for positive inputs, so the gradient there never shrinks toward 0, which speeds up convergence.
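The saturation behaviour can be checked by evaluating the derivatives directly. The sample points -10, 0, and 10 are my own choice; the derivative of ReLU at exactly 0 is taken as 0 here by convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # at most 0.25, near 0 when saturated

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2         # at most 1, near 0 when saturated

def relu(x):
    return np.maximum(x, 0.0)

def d_relu(x):
    return (x > 0).astype(float)         # exactly 1 for every positive input

x = np.array([-10.0, 0.0, 10.0])
print("sigmoid'", d_sigmoid(x))
print("tanh'   ", d_tanh(x))
print("relu'   ", d_relu(x))
```

For large positive inputs the sigmoid and tanh gradients collapse toward 0 while the ReLU gradient stays at 1; for negative inputs, however, the ReLU gradient is 0.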
Multiple sets of features can be extracted (using multiple convolution kernels) to form multiple feature maps.
Each layer in a CNN is composed of multiple maps, and each map is composed of multiple neural units. All neural units of the same map share one convolution kernel (i.e., its weights), and a convolution kernel often represents one feature.
Each feature map has one bias, so M feature maps have M biases to learn.
One more hyperparameter: M (the number of convolution kernels)
(5) Pooling layer
Reduces the image size
Max pooling: take the maximum value of each region
Average pooling: take the average value of each region
No learned parameters!!! No w and b
Pooling is applied to each feature map separately
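A minimal sketch of max and average pooling on one feature map. It assumes non-overlapping windows (stride equal to the window size), and the 4 × 4 values are an arbitrary example of mine.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Pool one feature map over non-overlapping size x size regions; no w or b involved."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]   # drop rows/columns that do not fit
    blocks = trimmed.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fm = np.array([[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 10, 13, 14],
               [11, 12, 15, 16]], dtype=float)

print(pool2d(fm, mode="max"))
print(pool2d(fm, mode="mean"))
```

A layer with many feature maps would call this once per map, which is why the depth does not change.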
Pooling layer -----> Convolutional layer
The pooling layer contains many feature maps, and each one acts as an input channel of the next convolutional layer
Assuming a total of four input channels, one convolution kernel may use only 3 of them, and another convolution kernel may use only 2 of them
Pool again after the convolution
After pooling, if the next layer is fully connected, the pooled feature maps must first be flattened (Flatten)
and then fed into the fully connected layer that follows
To sum up:
Convolutional layer: depth = the number of feature maps = the number of convolution kernels (this is the number of output channels, and it may be greater than the number of input channels)
Pooling layer: the depth is the same as that of the preceding convolutional layer, because pooling is applied to each feature map separately
Fully connected layer:
(Softmax layer)
3. Handwritten digit recognition
32 feature maps
Convolution kernel size 3 × 3
Activation function: relu
Max pooling 2 × 2
Dropout: prevents over-fitting by randomly ignoring neurons with a probability of 0.25 during training
Dense: fully connected layer with 128 neurons
num_classes: number of categories
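To connect the layer list above with the earlier size formulas, here is a rough shape and parameter trace. Several things are assumptions on my part, since the notes do not state them: a 28 × 28 grayscale input, num_classes = 10, a single convolutional layer without padding, and a final dense layer with num_classes outputs.

```python
def conv_out(n: int, f: int, stride: int = 1) -> int:
    """Output side length of a convolution without padding: (N - F) / S + 1."""
    return (n - f) // stride + 1

num_classes = 10                         # assumption: digits 0-9
side, channels = 28, 1                   # assumption: 28 x 28 grayscale input

side = conv_out(side, 3)                 # 3 x 3 convolution -> 26
conv_params = 3 * 3 * channels * 32 + 32 # 32 kernels, one bias each
channels = 32                            # 32 feature maps
side = side // 2                         # 2 x 2 max pooling -> 13, no parameters
# Dropout changes neither shapes nor parameter counts.
flat = side * side * channels            # Flatten before the dense layer
dense_params = flat * 128 + 128          # Dense with 128 neurons
out_params = 128 * num_classes + num_classes
total_params = conv_params + dense_params + out_params

print("flattened size:", flat)
print("parameters:", conv_params, dense_params, out_params, total_params)
```

Under these assumptions almost all parameters sit in the dense layer, which matches the earlier point that full connection is what makes the parameter count large.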
Compile:
Loss: mean squared error was used earlier; cross entropy is now used more often
Optimizer: Adadelta, which adapts the learning rate dynamically; the core is still gradient descent
Metric: accuracy
(3) The content of the training platform
1. Convolutional neural network
Convolution: divided into narrow convolution, same convolution, and full convolution
Narrow convolution
Narrow convolution (valid convolution) can be understood literally: the generated feature map is smaller than the original image, and its stride is variable. If the sliding stride is S, the original image is N1×N1, and the convolution kernel is N2×N2, then the image size after convolution is ((N1-N2)/S+1) × ((N1-N2)/S+1).
Same convolution
Same convolution means that the image after convolution has the same size as the original image. The stride of same convolution is fixed at 1. In practice, padding must be used (the border is filled with 0 so that the output size remains unchanged).
Full convolution
Full convolution, also called deconvolution, applies the convolution operation so that every pixel in the original image is expanded. As shown in the figure, the white blocks are the original image, the lighter blocks are the convolution kernel, and the darker blocks are the pixels being convolved. During the deconvolution operation the original image also needs padding, and the result is larger than the original image.
The stride of full convolution is also fixed at 1. If the original image is N1×N1 and the convolution kernel is N2×N2, then the image size after convolution is (N1+N2-1) × (N1+N2-1).
Narrow convolution and same convolution are commonly used in convolutional networks, whereas full convolution is mostly used in deconvolution networks. Deconvolution networks will be introduced in later chapters.
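The three output sizes can be checked in one dimension with NumPy, whose `np.convolve` offers the modes "valid", "same", and "full" at stride 1. This is a 1-D stand-in for the 2-D case, and the signal and kernel values are arbitrary examples of mine.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # N1 = 5
kernel = np.array([1.0, 1.0, 1.0])               # N2 = 3

narrow = np.convolve(signal, kernel, mode="valid")   # length N1 - N2 + 1
same = np.convolve(signal, kernel, mode="same")      # length N1
full = np.convolve(signal, kernel, mode="full")      # length N1 + N2 - 1

for name, out in [("narrow", narrow), ("same", same), ("full", full)]:
    print(name, len(out), out)
```

With stride 1 the lengths are 3, 5, and 7, matching N1-N2+1, N1, and N1+N2-1 respectively.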
Deconvolutional Neural Network
1. Deconvolution refers to the process of reconstructing an unknown input from the measured output and a known input. In the neural-network usage described here, the deconvolution process does not learn anything: it is only used to visualize an already trained convolutional network model, and there is no learning or training process.
Note: tf.nn.max_pool_with_argmax only supports GPU operation, so at present it cannot be used on CPU-only machines.