Andrew Ng's Machine Learning Strategy course: class notes 1

There are many ways to improve a model, but how do we make sure not to waste time on methods that have little or no effect on system performance? How do we find the right direction for improvement? Topics:

  1. Orthogonalization
  2. Single number evaluation metric
  3. Satisficing and optimizing metrics
  4. Train / dev / test set splits
  5. Setting a baseline (human-level performance)
  6. Reducing bias / variance
  7. Error analysis
  8. Transfer learning
  9. Multi-task learning
  10. End-to-end learning

 

Options include: collect more data; build a more diverse training set (different poses, different orientations, more negative examples, ...); train longer or use a better optimization algorithm; try a bigger or smaller network; use dropout / L2 regularization; modify the network architecture (activation functions, number of hidden units, ...).

But if you choose the wrong one, you can waste a lot of time without improving the results at all.

Orthogonalization: each knob changes a single controlled variable along one dimension, and the dimensions do not interfere with one another. Getting a supervised learning system to work well usually means knowing which knob to turn for which effect.

Four things must go right, and each has its own knobs:

(1) Fit the training set well on the cost function; if not, train a bigger network or switch to a better optimization algorithm.

(2) Fit the dev set well on the cost function; if not, get a bigger training set.

(3) Fit the test set well on the cost function; if not, get a bigger dev set (doing well on the dev set but poorly on the test set usually means the system has overfit the dev set).

(4) Perform well in the real world; if not, change the dev set or the cost function (either the dev set distribution is set up incorrectly, or the cost function is not measuring the right thing).

Early stopping is not an orthogonal knob: it simultaneously affects how well you fit the training set and the dev-set performance.

 

Single number evaluation metric

With multiple evaluation metrics (e.g., precision and recall) that disagree, it is hard to choose between classifiers. In that case, combine them into a single number, e.g. the F1 score, the harmonic mean of precision P and recall R: F1 = 2PR / (P + R).
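To make the combination concrete, a minimal sketch (not from the notes; the numbers are illustrative) of ranking two classifiers by F1 when precision and recall disagree:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision P and recall R: F1 = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Classifier A has higher recall, B has higher precision;
# F1 collapses the trade-off into one number to rank by.
print(f1_score(0.95, 0.90))  # A: ~0.924
print(f1_score(0.98, 0.85))  # B: ~0.910  -> prefer A
```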

 

Satisficing and optimizing metrics

Sometimes it is not easy to combine everything you care about into a single numeric target. In that case it helps to split the metrics into satisficing and optimizing ones (e.g., accuracy as the optimizing metric, running time as the satisficing metric): rather than blending running time and accuracy into one overall score, maximize accuracy subject to a running-time requirement.

More generally, with N metrics it is often reasonable to pick the one it makes sense to optimize as the optimizing metric, push it as far as possible, and treat the remaining N-1 as satisficing metrics, i.e. constraints that merely have to be met (see the sketch below).
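A minimal sketch of that selection rule (all names and numbers are hypothetical): filter candidates by the satisficing metric, then pick the best on the optimizing metric:

```python
# Hypothetical candidates: (name, accuracy, runtime in ms).
candidates = [("A", 0.90, 80), ("B", 0.92, 95), ("C", 0.95, 1500)]

MAX_RUNTIME_MS = 100  # satisficing metric: must run within 100 ms

# Keep only models that satisfy the constraint, then maximize accuracy.
feasible = [c for c in candidates if c[2] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda c: c[1])
print(best)  # ('B', 0.92, 95): the most accurate model that is fast enough
```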

 

Train / dev / test set splits

The dev and test sets should come from the same distribution: randomly shuffle all the data (even if it was collected from different distributions) before splitting it into dev and test sets. The dev and test sets should reflect the data you expect to see in the future and want to do well on; together with the evaluation metric, they define the target you are aiming at. How you set up the training set then affects how quickly you can approach that target.

Roughly 1e2-1e4 examples: 70/30 (train/test) or 60/20/20 (train/dev/test).

More than 1e6 examples: 98/1/1, since 1% is already 10,000 examples.
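A sketch of those ratios in code, assuming all the data comes from the relevant distribution and can simply be shuffled:

```python
import numpy as np

def split_indices(n: int, seed: int = 0):
    """Shuffle indices, then split 60/20/20 for small datasets
    and 98/1/1 once n exceeds one million examples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    if n > 1_000_000:
        n_train, n_dev = int(0.98 * n), int(0.01 * n)
    else:
        n_train, n_dev = int(0.60 * n), int(0.20 * n)
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]

train, dev, test = split_indices(10_000)  # -> 6000 / 2000 / 2000
```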

 

Two steps in machine learning: 1. place the target (choose the dev set and evaluation metric); 2. aim and shoot at it: design and optimize the system to improve its score on that metric.

 

What if the data you can collect in bulk and the data from the actual application do not come from the same distribution, and the application-distribution data is a much smaller fraction of the overall dataset? How should you split train / dev / test?

For example, the training images you collect are high-definition web images, while in the application users upload low-resolution photos. The resulting dataset has 200k HD images and 10k low-resolution images. Two ways to split:

(1) Randomly shuffle all the images into train/dev/test. Benefit: the training, dev, and test sets all come from the same distribution, which is easier to manage. Drawback: the target you end up optimizing is not the data distribution you actually care about.

(2) Training set: 200k HD + 5k low-res; dev and test sets: 2.5k low-res each. Benefit: the target is now aimed at where you want to do well, the pictures you really care about. Drawback: the training set no longer comes from the same distribution as the dev and test sets. In the long run, though, this split tends to give better performance (see the sketch below).
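A sketch of split (2), with hypothetical file lists standing in for the two sources:

```python
import random

# Hypothetical stand-ins for the two data sources described above.
hd_images = [f"hd_{i}.jpg" for i in range(200_000)]   # web-crawled HD images
low_res   = [f"user_{i}.jpg" for i in range(10_000)]  # user-uploaded photos

random.seed(0)
random.shuffle(low_res)

# Aim the dev/test target at the user distribution:
train = hd_images + low_res[:5_000]   # 200k HD + 5k low-res
dev   = low_res[5_000:7_500]          # 2.5k low-res: defines the target
test  = low_res[7_500:]               # 2.5k low-res: same distribution as dev
```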

 

When to change the dev/test sets and the metric?

1. The evaluation metric no longer ranks algorithms correctly: for example, the classifier with the lower classification error is the one that lets pornographic images through, which is intolerable. Fix: add weights to the evaluation metric (give pornographic images a much higher weight; see the sketch after this list).

2. The metric performs well on the dev/test distribution, but the system does poorly on the data you actually care about in the application (e.g., training and testing on HD images while users upload low-resolution photos). Fix: change the dev/test sets so they reflect the real data.
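A sketch of the re-weighted metric from point 1 (the weight of 10 is illustrative): mistakes on pornographic images count ten times as much, and dividing by the total weight normalizes the result back to an error rate:

```python
import numpy as np

def weighted_error(y_pred, y_true, is_porn, w_porn: float = 10.0):
    """Classification error in which a mistake on a pornographic image
    counts w_porn times as much as an ordinary mistake."""
    y_pred, y_true, is_porn = map(np.asarray, (y_pred, y_true, is_porn))
    w = np.where(is_porn, w_porn, 1.0)
    return np.sum(w * (y_pred != y_true)) / np.sum(w)
```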

 

Human-level performance

Bayes error rate: the best possible error rate, which cannot be surpassed.

When defining human-level error, be clear about where your target is. If the goal is to show the system can surpass a single human (1% error), that may be enough to justify deploying it in some settings; but if human-level error is standing in for the Bayes error rate, then the error rate of an experienced team (0.5%) is the appropriate definition.

Once the target is placed, you can tune the trained model accordingly. E.g., with 8% training error and 10% dev error: if the target (human-level) error rate is 7%, the avoidable bias is 1% and the variance is 2%, so the next step is to reduce variance, say by trying regularization; if the target error rate is 1%, avoidable bias dominates, so focus on reducing bias, e.g. train a bigger network or run gradient descent a while longer.
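That arithmetic as a tiny helper, using the numbers from the example above:

```python
def diagnose(human_err, train_err, dev_err):
    """Avoidable bias = training error - human-level (proxy for Bayes) error;
    variance = dev error - training error. Work on whichever is larger."""
    bias = train_err - human_err
    variance = dev_err - train_err
    return bias, variance, ("variance" if variance > bias else "avoidable bias")

print(diagnose(0.07, 0.08, 0.10))  # bias ~0.01, variance ~0.02 -> reduce variance
print(diagnose(0.01, 0.08, 0.10))  # bias ~0.07, variance ~0.02 -> reduce bias
```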

 

Reducing avoidable bias: train a bigger model, use a better optimization algorithm, look for a new and better neural network architecture or better hyperparameters (activation functions, number of layers, number of hidden units), or try other models/architectures.

Reducing variance: collect more data, regularize (L2 regularization / dropout), try different neural network architectures, do hyperparameter search.

 

Error analysis:

Error analysis tells you whether a proposed improvement to the current system is likely to bring better performance, i.e. whether the time spent on it is justified (e.g., a cat classifier has a 10% error rate, some of its mistakes are dogs misclassified as cats, and you are considering work specifically on the dog problem).

Approach: collect a sample (say 100) of mislabeled dev-set examples and manually count how many are dogs. If only 5% are dogs, then even completely solving the dog problem would reduce the error rate at most from 10% to 9.5%; if 50% are dogs, solving it could cut the error to 5%, which may well be worth the time. The 9.5% or 5% figure is the ceiling on how much this change can improve performance.

A single error-analysis pass can evaluate several ideas in parallel (e.g., dogs, big cats, blurry images) by tallying each category, as in the sketch below.
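A sketch of such a tally (the tags and counts are illustrative; in the course this is done in a spreadsheet while hand-inspecting ~100 mislabeled dev examples, and one example may receive several tags):

```python
from collections import Counter

overall_error = 0.10  # current dev error of the cat classifier
n_inspected = 100

# Hypothetical tags recorded during manual inspection.
tags = ["dog"] * 8 + ["big_cat"] * 43 + ["blurry"] * 61 + ["other"] * 12

for category, count in Counter(tags).most_common():
    frac = count / n_inspected
    floor = overall_error * (1 - frac)  # best achievable error if fully fixed
    print(f"{category:8s} {frac:4.0%} of errors -> error floor {floor:.1%}")
```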

 

(1) Mislabeled examples

Training set: deep learning algorithms are quite robust to random labeling errors (labels flipped at random), but much less robust to systematic errors (images sharing a common feature that are consistently mislabeled).

Dev/test sets: run the same kind of error analysis to estimate the fraction of errors caused by mislabeled examples, and decide whether fixing the labels is worth the effort.

(2) When correcting mislabeled examples:

apply the same correction process to the dev and test sets at the same time, since the two must come from the same distribution;

check the examples the algorithm got right as well as the ones it got wrong;

applying the same process to the training set is less critical, because the training set is much larger.

 

Bias, variance, and data mismatch

If the training and dev sets come from the same distribution, it is obvious at a glance whether bias or variance needs more attention; but if they come from different distributions, how do you tell?

When the training and dev sets come from different distributions, the gap in dev error has two possible sources:

① the algorithm has seen the training data but never the dev data (variance);

② the dev data comes from a different distribution (data mismatch).

To find out which factor matters more, define a new subset, the training-dev set: randomly carved out of the training data, but never used during training. Error analysis then compares the training error, the training-dev error, and the dev error.

Examples: (1) 1% training error, 9% training-dev error, 10% dev error: the algorithm has a variance problem. (2) 1% training, 1.5% training-dev, 10% dev: variance is small; the problem is mostly data mismatch. (3) 10% training, 11% training-dev, 12% dev (with human-level error near 0%): high avoidable bias. (4) 10% training, 11% training-dev, 20% dev: high avoidable bias and a large data-mismatch problem (variance small).

(Human level vs. training error measures avoidable bias; training error vs. training-dev error measures variance; training-dev error vs. dev/test error measures data mismatch.)
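A sketch of that decomposition as code (errors are fractions; the example reproduces case (4) above):

```python
def decompose(human, train, train_dev, dev):
    """Decompose dev error using the three comparisons above."""
    return {
        "avoidable bias": train - human,   # human level vs. training error
        "variance": train_dev - train,     # training vs. training-dev error
        "data mismatch": dev - train_dev,  # training-dev vs. dev error
    }

print(decompose(human=0.0, train=0.10, train_dev=0.11, dev=0.20))
# -> avoidable bias ~0.10, variance ~0.01, data mismatch ~0.09
```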

How to address data mismatch:

① make the training data more similar to the dev set: artificial data synthesis (but beware of simulating only a small, easy corner of the space of all possible examples, and of over-relying on a single synthesized source, which the model can overfit to; see the audio sketch after this list);

② collect more data that resembles the dev and test sets.
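The synthesis example used in the course is speech: mix clean audio with background noise. A numpy sketch, assuming equal-sample-rate float arrays; note the warning above, since looping one short noise clip over many hours of speech is exactly the "small corner of the space" failure mode:

```python
import numpy as np

def synthesize(clean: np.ndarray, noise_bank: list, scale: float = 0.1,
               rng=np.random.default_rng(0)):
    """Mix clean speech with a randomly chosen noise clip.
    A large, varied noise_bank guards against overfitting
    to a single synthesized noise source."""
    noise = noise_bank[rng.integers(len(noise_bank))]
    n = min(len(clean), len(noise))
    return clean[:n] + scale * noise[:n]
```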

 

Transfer learning: a neural network takes knowledge learned on one task and applies it to another, separate task. (Swap in the input for the task being migrated to; if the new task's dataset is small, randomly re-initialize and retrain only the last layer or two, otherwise retrain all the layers. Training continues from the weights obtained on the original task, so the knowledge acquired there can keep helping on the current dataset.)
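A minimal Keras sketch of the small-new-dataset case (the base model and dataset choices are illustrative, not from the notes): keep the pre-trained layers frozen and retrain only a new, randomly initialized output layer:

```python
import tensorflow as tf

# Network pre-trained on the source task (ImageNet), minus its output layer.
base = tf.keras.applications.MobileNetV2(weights="imagenet",
                                         include_top=False, pooling="avg")
base.trainable = False  # small target dataset: freeze the transferred layers

# New output layer, randomly initialized for the target task.
model = tf.keras.Sequential([base,
                             tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# With a large target dataset, set base.trainable = True and retrain everything.
```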

 

When transfer learning makes sense:

① the two tasks have the same type of input;

② there is a lot of data for the source task but much less for the target task;

③ low-level features learned on the source task are likely to be useful for the target task.

 

Multi-task learning (used, e.g., for object detection in computer vision): learn several tasks simultaneously, training a single neural network to do several things at once, in the hope that each task helps the others. (E.g., recognizing several kinds of object in an image: the early layers of the network learn features shared across objects, so training one network to do 4 things usually performs better than training 4 completely independent networks that each do one thing.)
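A sketch of the multi-task loss this implies: a sum of per-task logistic losses, skipping entries whose label is unknown (marked -1 here), so partially labeled images can still be used. The shapes and the -1 convention are assumptions for illustration:

```python
import numpy as np

def multitask_loss(y_hat, y):
    """Sum of per-task binary cross-entropy losses, averaged over examples.
    y_hat: predicted probabilities, shape (m, n_tasks); y: 0/1 labels,
    with -1 marking 'not labeled' entries, which are excluded from the sum."""
    eps = 1e-12
    ce = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return np.sum(ce[y != -1]) / y.shape[0]

# One image labeled for pedestrian/car/traffic light, stop sign unknown:
y = np.array([[1, 0, -1, 1]])
y_hat = np.array([[0.9, 0.2, 0.5, 0.7]])
print(multitask_loss(y_hat, y))
```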

 

When multi-task learning makes sense:

① the tasks can benefit from shared low-level features;

② (usually, though not always) the amount of data per task is similar; if you care about one particular task and want a big performance gain on it, the other tasks combined should have much more data than that one task. That is, if you have 1,000 examples for a single task, the other tasks together should contribute far more than 1,000 examples, so that their knowledge can lift that task's performance;

③ you can train a large enough neural network. The only situation where multi-task learning performs worse than training separate networks is when the network is not big enough.

 

End-to-end learning: data-processing or learning systems used to involve pipelines with multiple stages; end-to-end deep learning ignores those stages and replaces them with a single neural network that does all of the processing (e.g., a speech pipeline includes feature extraction, finding phonemes, stringing phonemes into words, and so on; an end-to-end network takes the audio as input and directly outputs the transcribed text).

 

Benefits:

① it lets the data speak, capturing whatever statistics are actually in the data rather than being forced to reflect human preconceptions (e.g., a pipeline forces the system to find phonemes, which may be a linguist's invention rather than something that truly exists in the signal);

② less hand-designing of components is needed.

 

Drawbacks:

① it may need a large amount of data before the system works well; learning a direct x -> y mapping can require a lot of (x, y) pairs;

② it excludes potentially useful hand-designed components. Without a lot of data, the learning algorithm has no way to gain the insight that hand-designed knowledge would otherwise inject into a small training set.

 

When considering end-to-end deep learning, the key question is whether you have enough data to learn a function of the complexity needed to map x directly to y.

Source: www.cnblogs.com/fanzhongjie/p/11248023.html