1. What is the cost function?
Suppose the training samples are (x, y), the model is h, and the parameter is θ, so that h(θ) = θ^T x (θ^T denotes the transpose of θ).
(1) Generally speaking, any function that measures the difference between the model's predicted value h(θ) and the true value y can be called a cost function C(θ). If there are multiple samples, the cost values of all samples can be averaged, denoted J(θ). From this, it is easy to derive the following properties of the cost function:
- For a given algorithm, the cost function is not unique;
- The cost function is a function of the parameters θ;
- The overall cost function J(θ) can be used to evaluate the quality of the model: the smaller the cost, the better the model's parameters fit the training samples (x, y);
- J(θ) is a scalar.
(2) Once we have chosen the model h, all that remains is to train the model's parameters θ. When does training end? This, too, is related to the cost function. Since the cost function measures the quality of the model, our goal is naturally to obtain the best model, that is, the one that best fits the training samples (x, y). Training is therefore the process of changing the parameters θ so as to obtain smaller and smaller values of J(θ). Ideally, when the cost function J reaches its minimum value, we obtain the optimal parameters θ*, written as:

θ* = arg min_θ J(θ)
For example, J(θ) = 0 means the model fits the observed data perfectly, with no error at all.
(3) In the process of optimizing the parameters θ, the most commonly used method is gradient descent. Here the gradient means the partial derivatives of the cost function J(θ) with respect to θ_1, θ_2, ..., θ_n. Since partial derivatives are required, we obtain another property of the cost function:
- The chosen cost function is best taken to be a differentiable function of the parameters θ (if the total differential exists, the partial derivatives must exist).
2. Common forms of the cost function
From the discussion above, a good cost function needs to satisfy two basic requirements: it can evaluate the accuracy of the model, and it is differentiable with respect to the parameters θ.
2.1 Mean squared error
In linear regression, the most commonly used cost function is the mean squared error (MSE), which takes the form:

J(θ) = (1/2m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

where:
- m: the number of training samples;
- h_θ(x): the prediction for input x with parameters θ;
- y: the value in the original training sample, i.e., the "standard answer";
- superscript (i): the i-th sample.
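As a minimal sketch of this formula (using NumPy; the data and function names here are illustrative, not from the original), the mean squared error for a linear hypothesis h_θ(x) = θ^T x can be computed as:

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error J(theta) = 1/(2m) * sum((h - y)^2)
    for the linear hypothesis h_theta(x) = theta^T x."""
    m = len(y)            # number of training samples
    h = X @ theta         # predictions h_theta(x^(i)) for all samples
    return np.sum((h - y) ** 2) / (2 * m)

# Toy data on the line y = 2x; X has a column of ones for the intercept theta_0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 2.0, 4.0])

print(mse_cost(np.array([0.0, 2.0]), X, y))  # perfect fit -> 0.0
print(mse_cost(np.array([0.0, 0.0]), X, y))  # h = 0 everywhere -> nonzero cost
```

Note that the cost is zero exactly when every prediction matches its sample, matching the J(θ) = 0 case described earlier.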
2.2 Cross entropy
In logistic regression, the most commonly used cost function is the cross entropy (Cross Entropy):

J(θ) = −(1/m) · Σ_{i=1}^{m} [ y^(i) · log(h_θ(x^(i))) + (1 − y^(i)) · log(1 − h_θ(x^(i))) ]

Cross entropy is a common cost function and is also used in neural networks. Here is the interpretation of cross entropy from the book *Neural Networks and Deep Learning*:
Cross entropy is a measure of "surprise" (translator's note: the original uses the word "surprise"). The neuron's goal is to compute the function y, with y = y(x), but instead it computes the function a, with a = a(x). Suppose we treat a as the probability that y = 1, and 1 − a as the probability that y = 0. Then the cross entropy measures, on average, how "surprised" we are when we learn the true value of y. When the output is what we expect, our "surprise" is low; when the output is not what we expect, our "surprise" is high.
In 1948, Claude Elwood Shannon introduced the notion of entropy from thermodynamics into information theory, so it is also called Shannon entropy (Shannon Entropy). It is the expectation of the Shannon information content (SIC). The Shannon information content measures the amount of uncertainty: if the Shannon information content of an event is 0, its occurrence provides us with no new information; for example, a deterministic event with probability 1 causes no surprise when it occurs. When an almost impossible event does occur, its Shannon information content is infinite, meaning it provides us with infinitely much new information and surprises us infinitely.
The symbols are as defined above.
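A small sketch of this cost for logistic regression (NumPy; the data here is made up for illustration), where a = h_θ(x) = sigmoid(θ^T x) is taken as the probability that y = 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(theta, X, y):
    """J(theta) = -1/m * sum(y*log(a) + (1-y)*log(1-a)),
    where a = sigmoid(theta^T x) is taken as P(y = 1 | x)."""
    m = len(y)
    a = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps)) / m

X = np.array([[1.0, -2.0], [1.0, 3.0]])  # first column: intercept term
y = np.array([0.0, 1.0])

# theta = 0 gives a = 0.5 for every sample, i.e. maximum "surprise" per bit:
print(cross_entropy_cost(np.array([0.0, 0.0]), X, y))  # ≈ 0.693 (= ln 2)
```

The cost shrinks as the model assigns high probability to the correct labels and grows without bound as it confidently predicts the wrong ones, matching the "surprise" interpretation above.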
2.3 Neural network cost function
After learning about neural networks, we find that logistic regression is actually a special case of a neural network (one with no hidden layers). The cost function used in neural networks is therefore very similar to that of logistic regression:

J(Θ) = −(1/m) · Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^(i) · log((h_Θ(x^(i)))_k) + (1 − y_k^(i)) · log(1 − (h_Θ(x^(i)))_k) ]
The reason there is an extra summation here is that the output of a neural network is generally not a single value; K denotes the number of classes in multi-class classification.
For example, in digit recognition K = 10, representing 10 classes. For a particular sample, the output is a 10-dimensional column vector whose entries are the predicted probabilities that the input digit is each of 0 through 9; the entry with the largest probability is taken as the prediction. Suppose the prediction here is 9; the ideal prediction would then be as follows (probability 1 for the digit 9, 0 for the others):
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]^T
Comparing the predicted result with the ideal result, we can see differences between the corresponding elements of the two vectors, 10 pairs in all. This is what the K in the cost function refers to: the differences for each class are accumulated.
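The per-class accumulation can be sketched for a single sample as follows (NumPy; the predicted probabilities are hypothetical, only the ideal vector for the digit 9 comes from the example above):

```python
import numpy as np

def multiclass_cross_entropy(a, y):
    """For one sample, accumulate the K per-class cross-entropy terms:
    -sum_k [ y_k*log(a_k) + (1-y_k)*log(1-a_k) ]."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))

# Ideal output for digit 9: probability 1 at index 9, 0 elsewhere.
y = np.zeros(10)
y[9] = 1.0

# Hypothetical network output: most of the probability mass on class 9.
a = np.array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.90])

print(multiclass_cross_entropy(a, y))  # small but nonzero cost
print(multiclass_cross_entropy(y, y))  # perfect prediction, cost ≈ 0
```

Every class contributes a term, so even small spurious probabilities on the wrong digits add to the cost.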
3. The cost function and the parameters
The cost function measures the difference between the model's predicted value h(θ) and the standard answer y, so the overall cost function J is a function of h(θ) and y, i.e., J = f(h(θ), y). Since the y of the training samples are given and h(θ) is determined by θ, in the end it is changing the parameters θ that changes J. Different θ give different predicted values h(θ) and therefore different values of the cost function J. The chain of influence is: a change in θ causes a change in h(θ), which in turn changes the value of J(θ). To see the effect of the parameters on the cost function more intuitively, consider a simple example:
Take the training samples {(0, 0), (1, 1), (2, 2), (4, 4)}, i.e., four training samples, where the first number of each pair is the value of x and the second is the value of y. These points obviously lie on the line y = x, as shown below:
Figure 1: Different parameters fit different lines
Since the constant term is 0, we can fix θ_0 = 0 and then try different values of θ_1 to obtain different fitted lines. When θ_1 = 0, the fitted line is y = 0 (the blue line), which lies farthest from the sample points, so its cost (error) is the largest. When θ_1 = 1, the fitted line is y = x (the green line), which passes through every sample point, so its cost is zero.
By varying θ_1 as in Figure 1, we can observe how J(θ) changes:
Figure 2: The cost function J(θ) as the parameter varies
From the figure we can directly observe the influence of θ on the cost function: J(θ) attains its minimum at θ_1 = 1. Because the cost function of the linear regression model (the mean squared error) has a very well-behaved first derivative, the optimal θ can also be found algebraically, by solving for the point where the derivative of J(θ) equals 0 (the normal equation method).
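The curve in Figure 2 can be reproduced numerically; here is a small sketch that sweeps θ_1 over the four training samples above, with θ_0 fixed at 0:

```python
import numpy as np

# Training samples {(0, 0), (1, 1), (2, 2), (4, 4)}
x = np.array([0.0, 1.0, 2.0, 4.0])
y = np.array([0.0, 1.0, 2.0, 4.0])

def J(theta1):
    """Mean squared error for the line h(x) = theta1 * x (theta0 = 0)."""
    m = len(x)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {t}: J = {J(t)}")
# J is smallest at theta1 = 1, where the line passes through every sample point.
```

The sweep confirms what the figure shows: the cost is largest at θ_1 = 0 (the blue line y = 0) and exactly zero at θ_1 = 1 (the green line y = x).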
4. The gradient of the cost function
The gradient in gradient descent refers to the partial derivatives of the cost function with respect to each parameter. The sign of each partial derivative determines the direction in which the corresponding parameter moves during learning, and the learning rate (usually denoted α) determines the step size. With the derivatives and the learning rate, the parameters can be updated using the gradient descent algorithm (Gradient Descent algorithm):

θ_j := θ_j − α · ∂J(θ)/∂θ_j

The diagram below illustrates this process for a model with only two parameters.
4.1 Partial derivatives of the cost function for the linear regression model parameters
Again taking two parameters as an example: each parameter has its own partial derivative, and each aggregates information from all the samples:

∂J(θ)/∂θ_j = (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
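Combining the gradient descent update rule with the partial derivatives of the mean squared error gives a minimal sketch on the four training samples from Section 3 (two parameters θ_0 and θ_1; the learning rate α = 0.1 and the iteration count are arbitrary illustrative choices):

```python
import numpy as np

# Same training samples as before; X gets a column of ones for theta_0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 4.0]])
y = np.array([0.0, 1.0, 2.0, 4.0])

theta = np.zeros(2)  # start from theta_0 = theta_1 = 0
alpha = 0.1          # learning rate
m = len(y)

for _ in range(1000):
    h = X @ theta                 # predictions h_theta(x^(i))
    grad = X.T @ (h - y) / m      # dJ/dtheta_j = 1/m * sum((h - y) * x_j)
    theta = theta - alpha * grad  # simultaneous update of both parameters

print(theta)  # converges toward [0, 1], i.e. the line y = x
```

Each iteration moves θ a small step against the gradient, so the cost J(θ) decreases until the parameters settle at the minimum found algebraically above.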
4.3 Partial derivatives of the cost function for the neural network model parameters