The cost function

 

1. What is the cost function?


Suppose the training samples are (x, y), the model is h, and the parameters are θ. Then h(θ) = θᵀx (θᵀ denotes the transpose of θ).

(1) In general, any function that measures the difference between the model's predicted value h(θ) and the true value y can be called a cost function C(θ). If there are multiple samples, the cost values of all the samples can be averaged and denoted J(θ). From this it is easy to derive the following properties of the cost function:

  • For each algorithm, the cost function is not unique;
  • The cost function is a function of the parameter θ;
  • The overall cost function J(θ) can be used to evaluate the quality of the model: the smaller the cost function, the better the model and its parameters fit the training samples (x, y);
  • J (θ) is a scalar;

(2) Once we have chosen the model h, all that remains is to train the model parameters θ. So when does training end? This, too, involves the cost function: since the cost function measures the quality of the model, our goal is naturally to obtain the best model (i.e., the one that best fits the training samples (x, y)). Training is therefore the process of changing the parameters θ so as to obtain a smaller J(θ). Ideally, when the cost function J reaches its minimum value, we obtain the optimal parameters θ, written as:

 

$$\min_{\theta} J(\theta)$$

 

For example, J(θ) = 0 indicates that the model fits the observed data perfectly, with no error.

(3) In the process of optimizing the parameters θ, the most commonly used method is gradient descent. The gradient here consists of the partial derivatives of the cost function J(θ) with respect to θ₁, θ₂, ..., θₙ. Since we need these partial derivatives, we obtain another property of the cost function:

  • When choosing a cost function, it is best to choose one that is differentiable with respect to the parameters θ (if the total differential exists, the partial derivatives must exist).

 

2. Common forms of the cost function


From the description above, a good cost function needs to satisfy two basic requirements: it can evaluate the accuracy of the model, and it is differentiable with respect to the parameters θ.

 

2.1 Mean squared error

In linear regression, the most commonly used cost function is the mean squared error (Mean Squared Error), which has the specific form:

 

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

 

m: number of training samples;

h_θ(x): the value of y predicted from x and the parameters θ;

y: the y value of the original training sample, i.e., the ground-truth answer;

The superscript (i): the i-th sample.
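For illustration, here is a minimal NumPy sketch of this cost for a linear hypothesis; the function name mse_cost and the design matrix X (with a leading column of ones for θ₀) are assumptions made for the example, not code from the original post:

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error cost J(theta) = 1/(2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)                        # number of training samples
    predictions = X @ theta           # h_theta(x^(i)) for every sample
    errors = predictions - y
    return (1.0 / (2 * m)) * np.sum(errors ** 2)

# Example with the samples (0,0), (1,1), (2,2), (4,4) used later in this post
X = np.array([[1, 0], [1, 1], [1, 2], [1, 4]], dtype=float)
y = np.array([0, 1, 2, 4], dtype=float)
print(mse_cost(np.array([0.0, 1.0]), X, y))  # theta = (0, 1) fits perfectly -> 0.0
print(mse_cost(np.array([0.0, 0.0]), X, y))  # theta = (0, 0) -> larger cost
```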

 

2.2 Cross entropy

In logistic regression, the most commonly used cost function is the cross entropy (Cross Entropy). Cross entropy is a common cost function that is also used in neural networks. The following is an interpretation of cross entropy from the book "Neural Networks and Deep Learning":

Cross entropy is a measure of "surprise" (translator's note: the original uses "surprise"). The goal of a neuron is to compute the function y, with y = y(x). But we have it compute the function a instead, with a = a(x). Suppose we regard a as the probability that y equals 1, and 1 − a as the probability that y equals 0. Then the cross entropy measures, on average, how "surprised" we are when we learn the true value of y. When the output is what we expect, our degree of "surprise" is low; when the output is not what we expect, our degree of "surprise" is high.

 

In 1948, Claude Elwood Shannon introduced the entropy of thermodynamics into information theory, so it is also called Shannon entropy (Shannon Entropy); it is the expectation of the Shannon information content (Shannon Information Content, SIC). The Shannon information content measures the size of uncertainty: a Shannon information content equal to 0 means that the occurrence of the event provides us with no new information, for example a deterministic event, whose probability of occurrence is 1 and whose occurrence causes no surprise at all; when an extremely unlikely event occurs, its Shannon information content is infinite, meaning it provides us with infinitely much new information and makes us infinitely surprised. More explanation can be found here.

 

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\left(y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right)\right]$$

 

The symbols have the same meanings as above.
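A minimal sketch of this cost for logistic regression with a sigmoid hypothesis, assuming the same design-matrix convention as before; the function name cross_entropy_cost and the clipping constant eps are illustrative choices, not from the original post:

```python
import numpy as np

def cross_entropy_cost(theta, X, y, eps=1e-12):
    """Logistic regression cost J(theta) = -1/m * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid hypothesis h_theta(x)
    h = np.clip(h, eps, 1 - eps)             # avoid taking log(0)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```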

 

2.3 Neural Network cost function

Having studied neural networks, we find that logistic regression is actually a special case of a neural network (a neural network with no hidden layers). The cost function used in neural networks is therefore very similar to the logistic regression cost function:

 

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K}\left(y_k^{(i)}\log\left(h_\theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right)\log\left(1 - \left(h_\theta(x^{(i)})\right)_k\right)\right)\right]$$

 

The extra layer of summation appears because the output of a neural network is generally not a single value; K represents the number of classes in multi-class classification.

For example, in digit recognition K = 10, i.e., there are 10 classes. For a particular sample, the output is a 10-dimensional column vector whose entries represent the predicted probabilities that the input digit is each of 0 through 9; the class with the largest probability is taken as the prediction, which in this example is the digit 9. The ideal prediction would look as follows (the probability of 9 is 1 and all the others are 0):

   0
   0
   0
   0
   0
   0
   0
   0
   0
   1

Comparing the predicted result with the ideal result, we can see that there are differences between corresponding elements of the two vectors, 10 pairs in total. This is what K means in the cost function: the differences for each of the K classes are accumulated (summed) together.
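To make the double summation concrete, here is a small NumPy sketch of this K-class cost for one sample whose target is the digit 9; the probability values in pred are made up for illustration and the function name is an assumption, not from the original post:

```python
import numpy as np

def nn_cross_entropy_cost(predictions, targets, eps=1e-12):
    """K-class cost J = -1/m * sum_i sum_k [y_k*log(h_k) + (1-y_k)*log(1-h_k)].
    predictions: (m, K) array of network outputs (h_theta(x))_k in (0, 1)
    targets:     (m, K) array of one-hot labels y_k
    """
    m = predictions.shape[0]
    h = np.clip(predictions, eps, 1 - eps)    # avoid log(0)
    return -(1.0 / m) * np.sum(targets * np.log(h) + (1 - targets) * np.log(1 - h))

# One sample, K = 10: predicted probabilities vs. the ideal one-hot target for digit 9
pred = np.array([[0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.90]])
target = np.zeros((1, 10))
target[0, 9] = 1.0
print(nn_cross_entropy_cost(pred, target))
```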

 

3. The cost function and the parameters


The cost function measures the difference between the model's predicted value h(θ) and the ground-truth answer y, so the overall cost function J is a function of h(θ) and y, i.e., J = f(h(θ), y). Since the y values of the training samples are given and h(θ) is determined by θ, it is ultimately the change in the parameters θ that changes J. Different θ give different predicted values h(θ), and hence different values of the cost function J. The chain of influence is:

 

θ → h(θ) → J(θ)

 

A change in θ causes a change in h(θ), which in turn changes the value of J(θ). To see the effect of the parameters on the cost function more intuitively, consider a simple example:

The training samples are {(0, 0), (1, 1), (2, 2), (4, 4)}, i.e., four training samples, where the first number of each pair is the x value and the second is the y value. These points obviously all lie on the line y = x. As shown below:

Figure 1: Different parameters fit different lines.

Since the constant term is zero, we can fix θ₀ = 0 and then take different values of θ₁ to obtain different fitted lines. When θ₁ = 0, the fitted line is y = 0, the blue line, which is farthest from the sample points and has the largest cost (error); when θ₁ = 1, the fitted line is y = x, the green line, which passes through every sample point, and the value of the cost function is zero.
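A small sketch (assumed code, with the samples above and θ₀ fixed at 0) that evaluates J(θ₁) for several values of θ₁, reproducing the behaviour just described:

```python
import numpy as np

# Samples from the example above; theta_0 is fixed at 0, so h(x) = theta_1 * x
x = np.array([0.0, 1.0, 2.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 4.0])

def J(theta_1):
    """Mean squared error cost for the one-parameter line h(x) = theta_1 * x."""
    m = len(y)
    return (1.0 / (2 * m)) * np.sum((theta_1 * x - y) ** 2)

for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta_1 = {t}: J = {J(t):.3f}")   # J is smallest (0) at theta_1 = 1
```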

Figure 2 shows how J(θ) changes as θ₁ varies:

Figure 2: The cost function J(θ) as a function of the parameter θ₁.

The influence of θ on the cost function can be observed directly from the figure: when θ₁ = 1, the cost function J(θ) attains its minimum. The first derivative of the linear regression cost function (mean squared error) has very nice properties, so the optimal θ can also be found algebraically, by solving directly for the point where the derivative of J(θ) equals 0 (the normal equation method).
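As a sketch of the normal equation approach on the same four samples, using the standard closed-form solution θ = (XᵀX)⁻¹Xᵀy (the code below is an illustrative assumption, not from the original post):

```python
import numpy as np

# Normal equation for linear regression: theta = (X^T X)^(-1) X^T y
X = np.array([[1, 0], [1, 1], [1, 2], [1, 4]], dtype=float)  # column of ones for theta_0
y = np.array([0, 1, 2, 4], dtype=float)

theta = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y
print(theta)                                 # approximately [0.0, 1.0], i.e. y = x
```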

 

4. The gradient of the cost function


In gradient descent, the gradient refers to the partial derivatives of the cost function with respect to each parameter. The partial derivatives determine the direction in which the parameters move during learning, and the learning rate (usually denoted α) determines the step size of each change. With the learning rate and the derivatives, the gradient descent algorithm (Gradient Descent Algorithm) can be used to update the parameters. The figure below illustrates the gradient descent process for a model with only two parameters.

 

 

 

As can be seen in the lower figure, the cost function J(θ) forms a surface over the parameters θ. Through any point (θ₀, θ₁, J(θ)) on this surface there are infinitely many tangent lines; among them, the one whose projection onto the bottom plane (the θ₀, θ₁ plane) makes the steepest angle points in the direction of the gradient at that point. Moving along this direction produces the greatest change in height (with respect to the z-axis, which here corresponds to the cost function J(θ)).
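A minimal sketch of the parameter update described above, using the standard rule θ := θ − α·∇J(θ) for the mean squared error cost; the function name, learning rate, and iteration count are illustrative assumptions, not from the original post:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha=0.1, iterations=1000):
    """Repeatedly apply theta_j := theta_j - alpha * dJ/dtheta_j for the MSE cost."""
    m = len(y)
    for _ in range(iterations):
        gradient = (1.0 / m) * X.T @ (X @ theta - y)  # partial derivatives of J(theta)
        theta = theta - alpha * gradient               # step against the gradient
    return theta

X = np.array([[1, 0], [1, 1], [1, 2], [1, 4]], dtype=float)
y = np.array([0, 1, 2, 4], dtype=float)
print(gradient_descent(X, y, np.zeros(2)))  # converges toward [0, 1], i.e. y = x
```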

 

4.1 Partial derivatives of the cost function with respect to the linear regression model parameters

 

Again take the two-parameter model as an example: each parameter has its own partial derivative, and each partial derivative aggregates the information of all the samples.
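For the mean squared error cost of section 2.1, the partial derivatives with respect to the two parameters are ∂J/∂θ₀ = (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) and ∂J/∂θ₁ = (1/m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)·x⁽ⁱ⁾. Below is a minimal sketch that computes them; the function name and variables are illustrative assumptions:

```python
import numpy as np

def mse_partial_derivatives(theta, x, y):
    """Partial derivatives of J(theta_0, theta_1) = 1/(2m) * sum((theta_0 + theta_1*x - y)^2)."""
    m = len(y)
    errors = theta[0] + theta[1] * x - y           # h_theta(x^(i)) - y^(i) for all samples
    d_theta0 = (1.0 / m) * np.sum(errors)          # dJ/dtheta_0
    d_theta1 = (1.0 / m) * np.sum(errors * x)      # dJ/dtheta_1
    return d_theta0, d_theta1
```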

 

 

 

4.3 Partial derivatives of the cost function with respect to the neural network model parameters

Numerical gradient checking (Numerical Gradient Checking). The idea of this method is to estimate the gradient numerically and use that estimate to test whether the derivative values we computed are really what we require.
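A minimal sketch of the idea, using the standard two-sided difference (J(θ + ε) − J(θ − ε)) / (2ε) to estimate each partial derivative; the function name and ε value are assumptions for illustration:

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-4):
    """Estimate dJ/dtheta_j with the two-sided difference (J(theta+e) - J(theta-e)) / (2*epsilon)."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[j] += epsilon
        theta_minus[j] -= epsilon
        grad[j] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    return grad

# Compare this estimate with the analytically computed gradient;
# the two should agree to several decimal places if the derivation is correct.
```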

 

 

 

 

