Andrew Ng Machine Learning 2: Model Representation and Understanding the Cost Function

Model representation

Let's start with an example: a data set of house prices in Portland, Oregon. Each sold house is plotted by its size against its selling price. Suppose your friend's house is 1250 square feet and he wants to sell it, and you want to help him estimate the selling price. You can build a model, for instance by fitting a straight line to the data; from that model's point of view, his house could sell for about $220,000. This is an example of a supervised learning algorithm, and the problem is also called a regression problem.

In this example, the set of known house sizes and their corresponding selling prices is called the training set. In later material, we will use $m$ to denote the number of training examples.

We will use the following notation to describe this regression problem:
$m$ represents the number of examples in the training set
$x$ represents the feature / input variable
$y$ represents the target / output variable
$(x, y)$ represents one example in the training set
$(x^{(i)}, y^{(i)})$ represents the $i$-th training example

Let's see how supervised learning works. We first feed the training set to a learning algorithm, and the algorithm's task is to output a function, usually denoted $h$, called the hypothesis function. Its role is to take the house size as the input variable $x$ and output the corresponding predicted value of $y$.

What we need to do next is determine this hypothesis function $h$. In the example above, we can assume that $h$ has the form

$$h_\theta(x) = \theta_0 + \theta_1 x$$
Since this function is linear and the problem is a regression problem, this model is called linear regression; and since the function has only one input variable $x$, the problem is also known as univariate (single-variable) linear regression.
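
As a minimal sketch (not part of the original course materials), the hypothesis above can be written as a small Python function. The parameter values below are hypothetical, chosen only so that a 1250-square-foot house predicts roughly 220 (prices measured in thousands of dollars), matching the estimate in the example.

```python
def hypothesis(x, theta0, theta1):
    """Univariate linear hypothesis: h_theta(x) = theta_0 + theta_1 * x."""
    return theta0 + theta1 * x

# Hypothetical parameter values chosen for illustration only;
# in practice they would be learned from the training set.
theta0, theta1 = 0.0, 0.176  # prices in thousands of dollars
print(hypothesis(1250, theta0, theta1))  # -> 220.0, i.e. about $220,000
```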

The cost function

Consider the training set for the example above, plotted in the figure; the number of training examples is $m = 47$.
Our hypothesis function is:
$h_\theta(x) = \theta_0 + \theta_1 x$
The $\theta_i$ are called the model parameters; different choices of $\theta_0$ and $\theta_1$ give different hypothesis functions.

Next we need to choose the $\theta_0$ and $\theta_1$ that make the hypothesis function's predictions as accurate as possible, i.e. that make the difference between the predicted values and the true values as small as possible. In other words, we want to minimize $h_\theta(x) - y$, or equivalently $(h_\theta(x) - y)^2$. We may not be able to make the error small for every single example, but we can make the total error over all examples $(x^{(i)}, y^{(i)})$ as small as possible, that is, minimize $\sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.

Similarly, we can just as well minimize $\frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$, since the factor $\frac{1}{2m}$ does not change which parameters achieve the minimum.
So we obtain the function

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i = 1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

We call this the cost function. There are of course many possible forms of cost function, but this squared-error cost is the one most commonly used for linear regression tasks. Because we want to make the cost function as small as possible, it is also known as the optimization objective.
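
As a quick sketch of how this cost is computed in practice (the tiny data set below is made up for illustration and is not the Portland housing data):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost: J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

# Made-up data: house sizes (sq ft) and prices (in $1000s).
x = np.array([1000.0, 1500.0, 2000.0])
y = np.array([200.0, 300.0, 400.0])

print(compute_cost(x, y, 0.0, 0.2))   # 0.0     -- this line fits the toy data exactly
print(compute_cost(x, y, 0.0, 0.15))  # ~3020.8 -- a worse fit gives a larger cost
```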

An intuitive understanding of the cost function Ⅰ

To get a more intuitive understanding of what the cost function is, suppose $\theta_0 = 0$. The hypothesis function $h_\theta(x) = \theta_1 x$ is then a straight line through the origin, and the cost function is $J(\theta_1) = \frac{1}{2m} \sum_{i = 1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.

So our goal now is to find a suitable $\theta_1$ that makes $J(\theta_1)$ as small as possible.

First, we must be clear that $h_\theta(x)$ is a function of $x$ (for a fixed $\theta_1$), while $J(\theta_1)$ is a function of $\theta_1$. Suppose our data set is $(1,1), (2,2), (3,3)$.
When we choose $\theta_1 = 1$, the hypothesis function $h_\theta(x) = x$ is a straight line that passes exactly through these three points. Computing each $\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ and summing gives $J(1) = 0$, as shown in the figure.
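
We can check this small calculation with a few lines of Python (a sketch assuming, as above, that $\theta_0 = 0$):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta1):
    """Cost with theta_0 fixed at 0: J(theta1) = 1/(2m) * sum((theta1*x_i - y_i)^2)."""
    m = len(x)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(1.0))  # 0.0     -- h_theta(x) = x passes through all three points
print(J(0.5))  # ~0.5833 -- a poorer fit gives a larger cost
```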

Similarly, choosing different values of $\theta_1$ gives different hypothesis functions $h_\theta(x)$ and hence different values of $J(\theta_1)$; plotting these points, we finally obtain the graph of $J(\theta_1)$.

Now, recall that the learning algorithm's optimization task is to choose the $\theta_1$ that minimizes $J(\theta_1)$. From the graph of $J(\theta_1)$ we can see that when $\theta_1 = 1$, $J(\theta_1)$ reaches its minimum on the curve, and the corresponding $h_\theta(x)$ also fits the data set best.
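
As a rough illustration of this idea (a brute-force sweep used only for intuition, not how the parameters are actually found), we can evaluate $J(\theta_1)$ over a grid of candidate values for the toy data set above and confirm that the minimum is at $\theta_1 = 1$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

# Evaluate J(theta1) on a grid of candidate values and pick the smallest.
thetas = np.linspace(-0.5, 2.5, 301)  # step of 0.01
costs = [np.sum((t * x - y) ** 2) / (2 * len(x)) for t in thetas]

best = thetas[int(np.argmin(costs))]
print(best)  # 1.0 -- the theta_1 whose line best fits the data
```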

An intuitive understanding of the cost function Ⅱ

This time we keep both $\theta_0$ and $\theta_1$. Suppose we take $\theta_0 = 50$ and $\theta_1 = 0.06$, giving $h_\theta(x) = 50 + 0.06x$; its graph is shown on the left, with the red ×'s marking our data set.

Last time, when we had only $\theta_1$, the resulting graph looked like the image shown on the right. This time, when we keep both $\theta_0$ and $\theta_1$, the picture of the cost function becomes more complicated.

It is a three-dimensional bowl-shaped surface, where the two horizontal axes represent $\theta_0$ and $\theta_1$ and the height of the surface is $J(\theta_0, \theta_1)$. In the following material, in order to display the image more clearly, we will no longer use the 3-D surface plot but will instead represent the cost function with a contour plot. An example contour plot is shown below on the right.

A contour plot (a concept from geography) is one in which every point on a given line has the same altitude; in other words, every point on the same line in the chart has the same value of $J(\theta_0, \theta_1)$, where $J(\theta_0, \theta_1)$ is the height on the graph. For example, the following three points all have the same value of $J(\theta_0, \theta_1)$.

In fact, if we imagine the three-dimensional surface corresponding to this figure, we can see that the lowest point of the surface, i.e. the minimum of $J(\theta_0, \theta_1)$, lies at the center of the innermost ellipse. Representing $J(\theta_0, \theta_1)$ with a contour plot therefore makes it easier to read.
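
The contour figures referred to here are from the original post and are not reproduced. As a rough sketch, a contour plot of $J(\theta_0, \theta_1)$ like the one described can be drawn with matplotlib on a made-up data set (the exact shape of the contours will of course differ from the Portland data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data set (sizes in sq ft, prices in $1000s).
x = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([200.0, 320.0, 390.0, 500.0])
m = len(x)

# Grid of (theta0, theta1) values over which to evaluate the cost.
theta0_vals = np.linspace(-200.0, 400.0, 100)
theta1_vals = np.linspace(-0.2, 0.5, 100)
T0, T1 = np.meshgrid(theta0_vals, theta1_vals)

# J(theta0, theta1) = 1/(2m) * sum_i (theta0 + theta1*x_i - y_i)^2 at each grid point.
J = np.zeros_like(T0)
for xi, yi in zip(x, y):
    J += (T0 + T1 * xi - yi) ** 2
J /= 2 * m

plt.contour(T0, T1, J, levels=30)
plt.xlabel(r"$\theta_0$")
plt.ylabel(r"$\theta_1$")
plt.title(r"Contours of $J(\theta_0, \theta_1)$")
plt.show()
```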
For example, the following figure contains a point ①, which corresponds to $\theta_0 = 800$ and $\theta_1 = -0.15$; the corresponding hypothesis function $h_\theta(x)$ is shown on the left. From the left-hand plot we can see that this hypothesis function does not fit the data well, and the cost value at ① is still very far from the minimum at ②.

Let's look at another example, in which the hypothesis function still does not fit the data well, but does a little better than the previous one.

This last example fits even better, but in fact its cost value is still not the smallest.

Through the examples above, we hope the significance of the cost function is clear: different values of $J(\theta_0, \theta_1)$ correspond to different hypothesis functions $h_\theta(x)$, and the closer a point's cost value is to the minimum of the cost function, the better the corresponding hypothesis fits the data.
