Andrew Ng Machine Learning Notes (17) - Large-Scale Machine Learning

Chapter 18: Large-Scale Machine Learning

Learning with large datasets

This chapter describes algorithms that can process massive amounts of data.

Question: why use large datasets? One of the best ways to get a high-performance machine learning system is to take a low-bias learning algorithm and train it on a huge amount of data.

An example mentioned earlier is choosing between easily confused words, for example "For breakfast I ate __ eggs" (two, too, or to; the correct word here is "two"). As the figure below makes clear, as long as they are trained on a large amount of data, most algorithms do well on this task, and their accuracy keeps improving as the training set grows.

[Figure: accuracy of several learning algorithms vs. training set size; every curve rises as the data grows]

From these results we can conclude that in machine learning, "It's not who has the best algorithm that wins. It's who has the most data." In other words, the deciding factor is often not having the best algorithm but having the most training data.

But large datasets come with their own problem, namely computation. Suppose m equals one million training examples and we want to train a linear regression or logistic regression model with gradient descent, whose update is:

θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)

The gradient term sums over all m examples, so when m is one million every single update is very expensive. We would therefore like to find either an alternative algorithm or a more efficient way to compute this term. The rest of the chapter introduces two main methods for handling massive datasets: stochastic gradient descent and MapReduce.

Stochastic gradient descent

For many machine learning algorithms, such as linear regression, logistic regression, and neural networks, the derivation goes by proposing a cost function (an optimization objective) and then minimizing it with an algorithm such as gradient descent. When the training set is very large, however, the computation required by gradient descent becomes enormous. Below we discuss an improvement on gradient descent: stochastic gradient descent.

Consider the linear regression model mentioned earlier. Its hypothesis and cost function are:

h_θ(x) = Σ_{j=0}^{n} θ_j x_j

J_train(θ) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))^2

Gradient descent uses the following update (repeated for every j):

θ_j := θ_j - α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)
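To make the cost concrete, here is a minimal NumPy sketch of one batch update (the names here are illustrative, not from the course materials). Note that every single call scans all m rows of X, which is exactly what becomes prohibitive at scale.

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One batch gradient descent update for linear regression.

    X: (m, n) design matrix, y: (m,) targets, theta: (n,) parameters.
    Each call must touch ALL m examples, the bottleneck when m is huge.
    """
    m = X.shape[0]
    predictions = X @ theta                   # h_theta(x^(i)) for all i at once
    gradient = X.T @ (predictions - y) / m    # (1/m) * sum over all m examples
    return theta - alpha * gradient
```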

When the training set is large, each of these gradient updates is very slow and expensive. Let's look at a more efficient algorithm that can handle large datasets better.

Write the cost function in a slightly different way:

cost(θ, (x^(i), y^(i))) = (1/2) (h_θ(x^(i)) - y^(i))^2

This cost measures how well the hypothesis performs on a single example (x^(i), y^(i)), and the overall cost function is then:

J_train(θ) = (1/m) Σ_{i=1}^{m} cost(θ, (x^(i), y^(i)))

Applying this to the linear regression model, the stochastic gradient descent procedure is written as:

▷ Randomly shuffle the dataset (reorder all m training examples at random);

▷ Go through all the training examples, updating θ_j := θ_j - α (h_θ(x^(i)) - y^(i)) x_j^(i) for every j. This per-example term is in fact the partial derivative of cost(θ, (x^(i), y^(i))) with respect to θ_j.

So stochastic gradient descent actually scans through the training examples: it takes the first example (x^(1), y^(1)), performs a gradient step on the cost of that single example only, slightly modifying the parameters so they fit it a little better, then continues in the same way with the next example, and so on through the entire training set.
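As a rough Python sketch of the procedure just described (a toy illustration under the same linear regression setup, not the course's reference code), the shuffle and the per-example update look like this:

```python
import numpy as np

def sgd(theta, X, y, alpha=0.01, epochs=10):
    """Stochastic gradient descent for linear regression.

    Unlike batch gradient descent, each update uses ONE example,
    so the parameters start improving before the data is fully scanned.
    """
    m = X.shape[0]
    for _ in range(epochs):                    # typically 1-10 passes over the data
        order = np.random.permutation(m)       # step 1: randomly shuffle
        for i in order:                        # step 2: sweep through the examples
            error = X[i] @ theta - y[i]        # h_theta(x^(i)) - y^(i), a scalar
            theta = theta - alpha * error * X[i]   # update with this one example
    return theta
```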

The difference in stochastic gradient descent is that there is no need to sum the gradient terms of all m examples; each step only needs the gradient term of a single training example. Look at the path of the stochastic gradient descent iterations:

[Figure: contour plot of the cost function; stochastic gradient descent follows a meandering path toward the minimum]

Overall, the parameters do move in the direction of the global minimum, even though individual steps may wander. Compared with ordinary gradient descent (the red curve), stochastic gradient descent converges in a different way: rather than settling exactly at the global minimum, it ends up continuously hovering in some region around it.

Mini-Batch gradient descent

This section introduces Mini-Batch gradient descent, which can sometimes be even faster than stochastic gradient descent.

To summarize the three variants:

(1) Ordinary (batch) gradient descent: each iteration uses all m examples;

(2) Stochastic gradient descent: each iteration uses only one example;

(3) Mini-Batch gradient descent: in between the two above; each iteration uses b examples (b is a parameter called the Mini-Batch size, typically in the range 2-100).

For example, suppose b = 10. Take the next 10 examples (x^(i), y^(i)), ..., (x^(i+9), y^(i+9)) and perform the gradient update:

θ_j := θ_j - α (1/10) Σ_{k=i}^{i+9} (h_θ(x^(k)) - y^(k)) x_j^(k)

then advance i by 10 and continue in the same way. Written out in full, the algorithm is:

Say b = 10 and m = 1000.
Repeat {
  for i = 1, 11, 21, 31, ..., 991 {
    θ_j := θ_j - α (1/10) Σ_{k=i}^{i+9} (h_θ(x^(k)) - y^(k)) x_j^(k)  (for every j)
  }
}
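A minimal sketch of the same loop in Python (illustrative names; the final batch is simply allowed to be smaller if m is not a multiple of b):

```python
import numpy as np

def minibatch_gd(theta, X, y, alpha=0.01, b=10, epochs=10):
    """Mini-batch gradient descent: each update averages over b examples.

    The inner update is a small matrix-vector product, so a vectorized
    linear algebra library can make this faster than pure SGD.
    """
    m = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(m)
        for start in range(0, m, b):                 # i = 1, 1+b, 1+2b, ...
            batch = order[start:start + b]
            error = X[batch] @ theta - y[batch]          # shape (b,)
            gradient = X[batch].T @ error / len(batch)   # (1/b) * sum over batch
            theta = theta - alpha * gradient
    return theta
```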

One drawback of the Mini-Batch gradient descent algorithm is the extra parameter b, which may take some time to tune. But with a good vectorized implementation, it can sometimes run even faster than stochastic gradient descent.

Stochastic gradient descent convergence

This section covers how to make sure stochastic gradient descent converges properly, and how to tune the value of the learning rate α.

Recall that for ordinary (batch) gradient descent, the standard way to make sure it has converged is to plot the cost function against the number of iterations. For stochastic gradient descent, to check whether the algorithm is converging you can do the following:

▷ Use the per-example cost defined earlier:

cost(θ, (x^(i), y^(i))) = (1/2) (h_θ(x^(i)) - y^(i))^2

▷ While learning with stochastic gradient descent, right before updating the parameters with an example, compute how well the hypothesis is doing on that example (i.e., compute its cost);

▷ To check whether stochastic gradient descent is converging, every 1000 iterations plot the average of the costs computed in the previous step over the last 1000 examples. By watching these plots, you can check whether stochastic gradient descent is converging (a code sketch of this bookkeeping follows below).
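A sketch of that monitoring loop in Python (a single pass, with hypothetical helper names): each example's cost is recorded before the update, and one averaged point is emitted per window of 1000 examples.

```python
import numpy as np

def sgd_with_monitoring(theta, X, y, alpha=0.01, window=1000):
    """SGD that records the cost averaged over every `window` examples.

    The cost of each example is computed BEFORE updating theta on it,
    so the curve reflects how well the current parameters are doing.
    """
    recent_costs, averaged_costs = [], []
    for i in np.random.permutation(X.shape[0]):
        error = X[i] @ theta - y[i]
        recent_costs.append(0.5 * error ** 2)   # cost(theta, (x^(i), y^(i)))
        theta = theta - alpha * error * X[i]    # then take the SGD step
        if len(recent_costs) == window:
            averaged_costs.append(np.mean(recent_costs))  # one plot point
            recent_costs = []
    return theta, averaged_costs   # plot averaged_costs to check convergence
```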

Here are some examples of the plots you might see:

[Figure: the averaged cost decreases with the number of iterations and then flattens out]

If you get a plot like the one above, where the value of the cost function is decreasing, you can conclude that the learning algorithm is converging;

[Figure: a noisy averaged-cost curve (blue) with no visible downward trend]

If you get the plot above instead, the cost does not look like it is decreasing, and it seems the algorithm is not learning.

[Figure: the same run averaged over more examples (red curve) reveals a slow downward trend]

However, if you average over a larger number of examples (say 5000 rather than 1000), the result may look like the red curve in the figure above: the cost actually is decreasing, but averaging over too few examples made the curve so noisy that the downward trend was hidden;

[Figure: the averaged cost rises with the number of iterations]

If you get a curve like the one above that is clearly rising, that is a signal the algorithm is diverging; in that case you should use a smaller learning rate α.

So by drawing these plots you can recognize the various situations that may arise, and respond to each one with the appropriate measure.

Finally, a word about the learning rate α:

In the most typical application of stochastic gradient descent, the learning rate α is held constant, so the final result is parameters that hover near the global minimum, i.e., a value very close to the global minimum. If you want stochastic gradient descent to converge to the global minimum more tightly, you can let the value of α decrease gradually over time.

A typical method is to set α equal to a constant 1 divided by the number of iterations plus a constant 2:

α = const1 / (iterationNumber + const2)

Its drawback is that determining the values of these two constants takes some extra time, but if you can find them, the resulting effect is very good.
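As a tiny sketch of that schedule (const1 and const2 below are placeholder values, not recommendations):

```python
def decayed_alpha(iteration, const1=5.0, const2=50.0):
    """Learning rate alpha = const1 / (iteration + const2).

    Early iterations take large steps; later iterations take ever smaller
    ones, so the parameters can settle at the minimum instead of hovering.
    """
    return const1 / (iteration + const2)

# inside the SGD loop, replace the fixed alpha with:
#   alpha = decayed_alpha(t)
#   theta = theta - alpha * error * X[i]
```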

Online Learning

This section discusses a new large-scale machine learning setting: the online learning setting.

An online learning case:

Suppose you provide a shipping service: users come to your website asking you to ship a package from location A to location B. Your website quotes a price for shipping the package; at that price, users sometimes accept the service (y = 1) and sometimes do not (y = 0). You want a learning algorithm to help optimize the price quoted to users:

Suppose we have features describing the user (such as demographics), the package's origin and destination, and the price we offer. What we want to do is learn the probability that, given these features, the user will use our shipping service; using these probabilities, we can offer the right price to a new user.

Consider the logistic regression algorithm. Assuming the website runs continuously, the online learning algorithm does the following:

Repeat forever {
  Get the pair (x, y) corresponding to the current user;
  Update θ using just this example: θ_j := θ_j - α (h_θ(x) - y) x_j  (for every j);
  Discard the example and move on.
}
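A minimal Python sketch of that loop (the example_stream generator is a hypothetical stand-in for your website's live traffic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_learning(theta, example_stream, alpha=0.01):
    """Online logistic regression: learn from each user, then discard them.

    `example_stream` yields (x, y) pairs, e.g. (features of the user and
    the offered price, whether they bought shipping). No dataset is stored.
    """
    for x, y in example_stream:               # runs as long as users keep coming
        h = sigmoid(x @ theta)                # predicted p(y=1 | x; theta)
        theta = theta - alpha * (h - y) * x   # one-example gradient step
    return theta
```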

If you really run a large website with a continuous stream of users, this online learning algorithm is very applicable.

Another example of online learning:

This application is product search: we want to use a learning algorithm to learn how to return a good list of results to the user. Suppose you run a shop that sells phones, with a user interface where a user can log onto your site and type a search query such as "Android phone, 1080p camera". Suppose the shop has 100 kinds of phones and, because of the site's design, each query returns 10 suitable phones for the user to choose from. We want a learning algorithm to figure out which 10 of the 100 phones it should return to the user.

Here is one way to pose the problem:

▷ For each phone, given the user's search query, we can construct a feature vector x describing the phone's features, for example: how similar the phone is to the user's search, how many words of the query match the name of this phone, and so on.

▷ What we need to do is estimate the probability that the user clicks on a given phone's link. Define y = 1 to mean the user clicked the link and y = 0 to mean they did not; then, from the features x, predict the probability that the user clicks that specific link (the predicted click-through rate).

▷ If we can estimate the click-through rate of every phone, we can use it to show the user the 10 phones they are most likely to click on (see the sketch below).
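Putting the three steps together, here is a toy sketch of ranking the 100 phones by predicted click probability; phone_features is an assumed helper, not part of the course, that builds the feature vector x from a phone and a query.

```python
import numpy as np

def top_10_phones(theta, phones, query, phone_features):
    """Return the 10 phones with the highest predicted click-through rate.

    `phone_features(phone, query)` is a hypothetical feature builder
    (query similarity, number of matching words in the name, ...).
    """
    scores = []
    for phone in phones:                                   # 100 candidate phones
        x = phone_features(phone, query)
        p_click = 1.0 / (1.0 + np.exp(-(x @ theta)))       # p(y=1 | x; theta)
        scores.append((p_click, phone))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [phone for _, phone in scores[:10]]             # show these to the user
```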

This is the online learning mechanism. The algorithm we used is very similar to stochastic gradient descent; the only difference is that instead of using a fixed dataset, we get one example from a user, learn from that example, and then move on to the next. If your application has a continuous stream of data, the online learning mechanism is well worth considering.

MapReduce and data parallelism

In this section we discuss another idea that can be applied to large-scale machine learning, called MapReduce.

The MapReduce idea:

Consider ordinary batch gradient descent, and suppose we have the training set:

(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))

Under the MapReduce idea, the training set is divided into different subsets. Suppose m = 400 (a small number chosen for convenience of exposition; in real large-scale settings m might be 400 million) and there are 4 machines available to process the data.

The first machine uses the first quarter of the training set, (x^(1), y^(1)), ..., (x^(100), y^(100)), and computes the partial sum:

temp_j^(1) = Σ_{i=1}^{100} (h_θ(x^(i)) - y^(i)) x_j^(i)

The other machines then handle the rest of the training set in the same way:

temp_j^(2) = Σ_{i=101}^{200} (h_θ(x^(i)) - y^(i)) x_j^(i)
temp_j^(3) = Σ_{i=201}^{300} (h_θ(x^(i)) - y^(i)) x_j^(i)
temp_j^(4) = Σ_{i=301}^{400} (h_θ(x^(i)) - y^(i)) x_j^(i)

Each machine now does a quarter of the work, so the computation can run up to four times faster. After the machines finish computing their temp values, they send them to a central server that combines the results and performs the final parameter update:

θ_j := θ_j - α (1/400) (temp_j^(1) + temp_j^(2) + temp_j^(3) + temp_j^(4))

Below is a schematic diagram of MapReduce:

[Figure: the training set is split into 4 subsets, one per machine; each machine computes its partial sum, and a central server combines the four sums and updates θ]
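Here is a single-machine simulation of that split in Python (a sketch only: each call to the map step below stands in for one of the four machines, whereas a real deployment would farm these calls out to a cluster):

```python
import numpy as np

def partial_gradient_sum(theta, X_chunk, y_chunk):
    """Map step: one 'machine' sums over its slice of the training set."""
    return X_chunk.T @ (X_chunk @ theta - y_chunk)   # temp_j for this chunk

def mapreduce_gradient_step(theta, X, y, alpha, n_machines=4):
    """Reduce step: combine the partial sums and update the parameters."""
    chunks = zip(np.array_split(X, n_machines), np.array_split(y, n_machines))
    temps = [partial_gradient_sum(theta, Xc, yc) for Xc, yc in chunks]
    total = sum(temps)                               # temp^(1) + ... + temp^(4)
    return theta - alpha * total / X.shape[0]
```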

If you want to apply the MapReduce idea to some learning algorithm to speed it up through parallel computation on multiple machines, ask the key question: can the learning algorithm be expressed as a summation over the training set?

In fact, many learning algorithms can be expressed as computing sums of functions over the training set, and when running on large datasets the bulk of the computation is exactly these very large sums. So as long as a learning algorithm can be expressed as a summation over the training set, MapReduce can be used to scale it to very large datasets.

Consider the example below:

Suppose you want to use an advanced optimization algorithm such as L-BFGS or conjugate gradient to train a logistic regression learning algorithm. You need to supply code that computes the following two quantities:

(1) The optimization objective, the cost function:

J_train(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]

(2) The partial derivative terms required by the advanced optimization algorithm:

∂J_train(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)
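Both quantities are plain sums over the examples, which is what makes them map-reducible. A sketch of the two routines, written so they could be handed to an off-the-shelf optimizer such as scipy.optimize.minimize with method 'L-BFGS-B' (that pairing is my assumption, not something the course prescribes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J_train(theta): a sum over all m examples (map-reducible)."""
    h = sigmoid(X @ theta)
    m = X.shape[0]
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

def gradient(theta, X, y):
    """dJ/dtheta_j: also a sum over all m examples (map-reducible)."""
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

# e.g. scipy.optimize.minimize(cost, theta0, args=(X, y),
#                              jac=gradient, method='L-BFGS-B')
```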

As long as the learning algorithm can be represented as such summations over the training set, MapReduce can parallelize it over the training set, so that it can be applied to very large datasets.

This is how MapReduce parallelizes computation over the data and effectively improves efficiency. Hopefully, after studying this chapter, you have gained some understanding and command of methods for large-scale machine learning.


Origin: blog.csdn.net/linjpg/article/details/104434124