[Deep Learning] Andrew Ng - Notes: Softmax layer, optimizing neural networks, Bayesian optimal error rate, changing the error metric, transfer learning

Softmax layer

Softmax extends logistic regression to classification with more than two classes

For the output layer L: Z^{[L]} = W^{[L]}A^{[L-1]} + b^{[L]}, with shape (4, 1) in this 4-class example

softmax function

a^{[L]} = \frac{t}{\sum_{j = 1}^{4}t_{j}}   (4, 1)

t = e^{Z^{[L]}}

a^{[L]}_{i} = \frac{t_{i}}{\sum_{j = 1}^{4}t_{j}}, the activation of each unit in the output layer, i.e. the probability of each class

 

 Softmax loss function

a^{[L]} = \hat{y}, and for example (1): a^{[L](1)} = \hat{y}^{(1)}

L(\hat{y},y) = -\sum_{j = 1}^{4}y_{j}\log\hat{y}_{j}

J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]}) = \frac{1}{m}\sum_{i = 1}^{m}L(\hat{y}^{(i)},y^{(i)})
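A minimal numpy sketch of the softmax activation and cross-entropy loss above, assuming a 4-class output layer; the example values are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax over the output layer: a_i = exp(z_i) / sum_j exp(z_j)."""
    t = np.exp(z - np.max(z))           # subtract max for numerical stability
    return t / np.sum(t)

def cross_entropy(y, y_hat):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j), with y one-hot."""
    return -np.sum(y * np.log(y_hat + 1e-12))

z = np.array([5.0, 2.0, -1.0, 3.0])     # Z^[L] for a 4-class example
a = softmax(z)                           # a^[L], a probability distribution
y = np.array([1.0, 0.0, 0.0, 0.0])       # true label (one-hot)

print(a, a.sum())                        # probabilities summing to 1
print(cross_entropy(y, a))               # per-example loss
```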

  • Softmax regression is suitable for classification problems. It uses the softmax operation to output the probability distribution of the categories.
  • Softmax regression is a single-layer neural network, and the number of outputs is equal to the number of categories in the classification problem.
  • Cross entropy is suitable for measuring the difference between two probability distributions.

Optimizing Neural Networks 

  • Collect more data
  • Collect more diverse training sets
  • Train longer with gradient descent
  • Try Adam instead of gradient descent
  • Try a larger network
  • Try a smaller network
  • Try dropout
  • Add an L2 regularization term (dropout, L2, and Adam are illustrated in the sketch after this list)
  • Adjust the network architecture (activation functions, hidden units, hidden layers)
  • Use appropriate evaluation metrics
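A minimal PyTorch sketch of three of the items above (dropout, an L2 penalty, Adam); the layer sizes, dropout rate, and weight_decay value are arbitrary assumptions, not values from the course.

```python
import torch
import torch.nn as nn

# A small fully connected network with dropout between hidden layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),         # "Try dropout"
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 4),           # 4-class output, paired with cross-entropy loss
)

# "Add an L2 regularization term": weight_decay adds an L2 penalty on the weights.
# "Try Adam instead of gradient descent": use the Adam optimizer rather than plain SGD.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()
```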

https://blog.csdn.net/m0_51933492/article/details/126540695?spm=1001.2014.3001.5502

  • Choose the dev (cross-validation) set and test set from the same distribution, one that reflects the data you expect to get in the future and care about doing well on; randomly assign that data to the dev set and test set
  • Set up the dev (cross-validation) set and test set so they come from the same distribution
  • Put a higher proportion of the data into the training set. The test set only needs to be large enough to give a confident estimate of final performance, which for large datasets can be far less than 30% of the overall data (a split sketch follows this list)
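A hedged sketch of such a split using scikit-learn's train_test_split; the 98/1/1 proportions and the random placeholder data are just an example for a large dataset, not a prescription from the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(100_000, 10)              # placeholder features
y = np.random.randint(0, 4, size=100_000)     # placeholder labels

# With a large dataset the dev and test sets can be far smaller than 30%:
# e.g. 98% train, 1% dev, 1% test, with dev/test drawn from the same distribution.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.02, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_test))
```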

Changing the error metric (penalizing unacceptable classification errors)

Increase the penalty weight on unacceptable misclassifications by a factor of 10:

Error = \frac{1}{m_{dev}}\sum_{i = 1}^{m_{dev}}w^{(i)}\,\mathbb{1}\{y_{pred}^{(i)}\neq y^{(i)}\}

(to keep the error between 0 and 1, normalize by \sum_{i}w^{(i)} instead of m_{dev})

w^{(i)} = \begin{cases} 1 & \text{if } x^{(i)} \text{ is an ordinary example} \\ 10 & \text{if } x^{(i)} \text{ is an example whose misclassification is unacceptable} \end{cases}
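A small numpy sketch of this weighted error metric; the labels and the rule deciding which examples get weight 10 are invented for illustration.

```python
import numpy as np

y_true   = np.array([1, 0, 1, 1, 0, 1])   # ground-truth labels on the dev set
y_pred   = np.array([1, 1, 1, 0, 0, 1])   # model predictions
critical = np.array([0, 1, 0, 0, 0, 0])   # 1 = example whose misclassification is unacceptable

# w^(i) = 10 for critical examples, 1 otherwise
w = np.where(critical == 1, 10.0, 1.0)

errors = (y_pred != y_true).astype(float)          # indicator 1{y_pred != y}
weighted_error = np.sum(w * errors) / np.sum(w)    # normalize by the sum of weights
print(weighted_error)
```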

If your metric and your dev/test sets fit well but the model still does not perform well in the real-world application, change the metric and/or the dev/test sets so they better match what the application actually needs.

Bayes optimal error rate (can be approximated by the best human-level error rate on the task)

Humans are good at complex natural perception tasks. The Bayes optimal error rate is the best error rate any classifier could theoretically achieve, and for such tasks the best human-level error rate is a good proxy for it. When your algorithm performs worse than humans, you can:

  • Get more labeled data from humans
  • Manually analyze the errors to gain insight: why does a human get these examples right when the algorithm gets them wrong?
  • Further analysis of bias/variance

The gap between the Bayes error rate and the training-set error rate is the avoidable bias, while the gap between the training-set error rate and the dev-set error rate is the variance. By comparing the training error with the Bayes error estimate on one side and with the dev error on the other, you can decide whether reducing bias or reducing variance is the more effective direction for tuning the algorithm.

If human-level error is being used as a proxy for the Bayes error rate, and progress becomes increasingly difficult as the algorithm approaches human level, then the human-level error rate in the accompanying figure should be (d), the experienced professor, i.e. the best available human performance.

  • For problems with a lot of noise in the data (e.g. speech recognition with noisy background audio), estimating the Bayes error rate is very useful for estimating avoidable bias and variance.
  • For tasks whose Bayes error rate is almost 0, the training-set error can simply be compared to 0% (rather than to the best human error rate on the task).
  • Look at the gap between the training error rate and the Bayes error estimate to see how much bias can still be avoided.
  • Look at the gap between the dev error rate and the training error rate to see how big the variance problem is, i.e. how much effort should go into making performance generalize from the training set to the dev set (the algorithm is not trained on the dev set). A small numeric example follows this list.
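A tiny worked example of the two gaps described above, using made-up error rates (not figures from the course):

```python
# Made-up error rates for illustration.
human_error = 0.01      # proxy for the Bayes error rate
train_error = 0.08
dev_error   = 0.10

avoidable_bias = train_error - human_error   # 0.07 -> bias is the bigger problem
variance       = dev_error - train_error     # 0.02 -> variance is the smaller problem

print(f"avoidable bias = {avoidable_bias:.2%}, variance = {variance:.2%}")
```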

 Two basic assumptions of supervised learning

  • The model can fit the training set well (low avoidable bias)
  • Performance on the training set generalizes well to the dev/test set (low variance)

Transfer learning

  • Apply knowledge or patterns learned in one field or task to a different but related field or problem.
  • Transfer learning refers to transferring the network structure and weights originally trained for task A to task B, where they can also achieve good results. It works because the features learned by the shallow layers of a convolutional neural network are general-purpose: when samples are insufficient, these general features can be transferred from another, already-trained network, saving training time and giving better recognition results.
  • Finetune (also written fine-tuning or finetuning) is an important concept in deep learning: take a network that has already been trained by someone else and adjust it for your own task.

Transfer from A to B when:

  • Task A and task B have the same type of input
  • Task A has more data than task B
  • Low-level features learned on A are useful for learning B

Model-based transfer:

That is, build a model with shared parameters. This is mainly used with neural networks, because a neural network's structure and weights can be transferred directly. Finetuning, the most classic example, is a good embodiment of model-parameter transfer.

 A. Why do we need a pre-trained network? In practical applications we usually do not train a neural network from scratch for a new task; doing so is very time-consuming. In particular, our training data is rarely as large as ImageNet, which is roughly what it takes to train a deep neural network with strong enough generalization ability, and even with that much data the cost of training from scratch is often unaffordable. So what can we do? Transfer learning tells us we can take a previously trained model and transfer it to our own task.

B. Why is finetune needed? Because the models trained by others may not be completely suitable for our own tasks. Maybe other people's training data and our data do not obey the same distribution; maybe other people's networks can do more things than our tasks; maybe other people's networks are more complex and our tasks are simpler. For example, if we want to train a neural network for binary classification of cat and dog images, then a very valuable reference is the neural network trained on CIFAR-100. But CIFAR-100 has 100 categories and we only need 2 categories. At this time, we need to fix the relevant layers of the original network and modify the output layer of the network for our own tasks to make the results more in line with our needs.

C. Advantages of Finetune: there is no need to train the network from scratch for a new task, saving time and cost; pre-trained models are usually trained on large datasets, which implicitly expands our training data and makes the model more robust, with better generalization ability; and Finetune is simple to implement, letting us focus only on our own task. In practice, few people train a neural network from scratch for a new task; Finetune is the usual choice.

Finetuning a deep network can save training time and improve accuracy, but it has an inherent shortcoming: it cannot handle the situation where the training data and test data come from different distributions, which is common in practice, because the basic assumption of finetuning is that training and test data obey the same distribution. In transfer learning this assumption often does not hold, so we need better methods for deep networks to complete transfer learning tasks. Borrowing from distribution-adaptation methods, many deep learning approaches add adaptation layers to align source-domain and target-domain data; adaptation brings the two data distributions closer and makes the network more effective.

From this analysis, deep-network adaptation has two main parts: first, deciding which layers can be adapted, which determines how much the network learns; second, choosing the adaptation method (the measurement criterion), which determines the network's generalization ability.

Basic recipe for designing a deep transfer network: decide which layers to adapt, add an adaptation metric (loss) to those layers, and finally finetune the network. A rough sketch follows.
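A rough PyTorch sketch of this recipe, using a linear-kernel MMD distance between source and target features on one adaptation layer; the architecture, loss weight, and the choice of MMD are assumptions for illustration, not the specific method of any particular paper.

```python
import torch
import torch.nn as nn

class AdaptNet(nn.Module):
    def __init__(self, in_dim=256, feat_dim=64, n_classes=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.adapt = nn.Linear(128, feat_dim)        # the adaptation layer
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        f = torch.relu(self.adapt(self.backbone(x)))
        return f, self.classifier(f)

def mmd_linear(f_src, f_tgt):
    """Linear-kernel MMD: squared distance between the feature means of the two domains."""
    delta = f_src.mean(dim=0) - f_tgt.mean(dim=0)
    return (delta * delta).sum()

model = AdaptNet()
ce = nn.CrossEntropyLoss()

x_src = torch.randn(32, 256); y_src = torch.randint(0, 4, (32,))   # labeled source batch
x_tgt = torch.randn(32, 256)                                       # unlabeled target batch

f_src, logits_src = model(x_src)
f_tgt, _ = model(x_tgt)

lam = 0.5                                      # assumed trade-off weight
loss = ce(logits_src, y_src) + lam * mmd_linear(f_src, f_tgt)
loss.backward()                                # classification loss + adaptation loss
```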

Relation-based transfer:

This method is rarely used. It mainly mines and reuses relationships between samples for analogical transfer: for example, a teacher lecturing to students can be analogized to a speaker presenting at a company meeting. This kind of method focuses on the relationships between samples in the source domain and the target domain.

If you want to build a computer vision application, rather than training the weights from scratch (from random initialization) it is usually better to download a network architecture whose weights someone else has already trained; you can often make progress quite quickly. Use that as pre-training and then transfer to the task you are interested in. The computer vision research community likes to post datasets such as ImageNet, MS COCO, and Pascal on the Internet, and you can download open-source weight parameters that took others weeks or even months to produce and use them as a good initialization for your own network. In other words, use transfer learning to transfer knowledge from these public datasets to your own problem.

For example, suppose you want to build a cat detector to recognize your own pet cat: Tigger, Misty, or neither (ignoring the case where both cats appear in the same picture). You probably do not have many images of Tigger and Misty, so your training set is small. You can download an open-source implementation of a neural network from the Internet, including both the code and the weights. Many trained networks are available; a network trained on ImageNet, for instance, distinguishes 1000 categories and therefore ends in a Softmax unit with 1000 possible outputs. Remove that Softmax layer and create your own Softmax unit that outputs the three categories Tigger, Misty, and neither.

Treat all the earlier layers as frozen: keep their parameters fixed and train only the parameters of your new Softmax layer with its three outputs. By reusing someone else's pretrained weights you are likely to get good performance even with a small dataset. Most deep learning frameworks support freezing layers, although the details depend on the framework.

If you have more data, freeze fewer layers and train more of them: instead of training only the Softmax unit, train a medium-sized network consisting of the later layers of the network you will eventually use. Finally, if you have a lot of data, use the open-source network and its weights only as an initialization and train the entire network.
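A minimal torchvision sketch of the workflow just described: load a network pretrained on ImageNet, freeze its layers, and replace the 1000-way output layer with a new 3-way head (Tigger / Misty / neither). The choice of ResNet-18 and the optimizer settings are assumptions, not prescriptions from the notes.

```python
import torch
import torch.nn as nn
from torchvision import models

# Download a network pretrained on ImageNet (1000 classes).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze all pretrained parameters.
for p in model.parameters():
    p.requires_grad = False

# Replace the 1000-way output layer with a new 3-way head: Tigger, Misty, neither.
model.fc = nn.Linear(model.fc.in_features, 3)    # the new layer is trainable by default

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# With more data, unfreeze some of the later layers (or the whole network) and
# use the pretrained weights only as initialization.
```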
Reference: Introduction to transfer learning in deep learning — fengbingchun's blog (CSDN)

Origin: blog.csdn.net/m0_51933492/article/details/126613606