Mathematics in Machine Learning (1) - Regression, Gradient Descent

Regression and Gradient Descent:

   Mathematically, regression means that given a set of points, we fit a curve to them. If the curve is a straight line, it is called linear regression; if the curve is quadratic, it is called quadratic regression. There are many variants of regression, such as locally weighted regression, logistic regression, and so on, which will be discussed later.

   Let's use a very simple example to illustrate regression. This example appears in many places and in many open source packages, such as Weka. Suppose we want to build a house value evaluation system. The value of a house depends on many factors, such as area, number of rooms, location, orientation, and so on. These variables that affect the value of a house are called features (feature). Features are a very important concept in machine learning, and many papers are devoted to them. Here, for simplicity, let's assume that the value of a house is affected by only one variable: the area of the house.

   Suppose we have house sales data as follows:

   Area (m^2)   Sale price (10,000 yuan)
   123          250
   150          320
   87           160
   102          220
   …            …

   This table is similar to housing prices around the Fifth Ring Road of the imperial capital (Beijing). We can plot the data with the area of the house on the x-axis and the sale price of the house on the y-axis, as follows:

   [Figure: scatter plot of house area (x-axis) vs. sale price (y-axis)]
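As a rough illustration (my own sketch, not part of the original post), a plot like this can be produced from the four table rows with a few lines of Python, assuming numpy and matplotlib are installed:

```python
import numpy as np
import matplotlib.pyplot as plt

# The four example rows from the table above
area = np.array([123.0, 150.0, 87.0, 102.0])     # m^2
price = np.array([250.0, 320.0, 160.0, 220.0])   # 10,000 yuan

plt.scatter(area, price)
plt.xlabel("Area (m^2)")
plt.ylabel("Sale price (10,000 yuan)")
plt.show()
```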

   If a new house area comes along for which there is no record in the sales data, what should we do?

   We can fit these data with a curve as accurately as possible; then, when a new input arrives, we return the value of the corresponding point on the curve. If we fit with a straight line, it might look like the following:

    [Figure: straight-line fit through the data points, with the point to be predicted marked in green]

   The green points are the points we want to predict.

   First, let's define some concepts and commonly used notation; these may differ between machine learning books.

   House sales record table - the training set or training data; this is the input data of our process, generally denoted x

   House sale price - the output data, commonly denoted y

   The fitted function (also called the hypothesis or model), usually written as y = h(x)

   The number of entries in the training data (#training set); one training example consists of a pair of input data and output data

   The dimension of the input data (the number of features, #features), denoted n

   The following is a typical machine learning workflow. Given input data, the algorithm goes through a series of steps to obtain an estimated function; this function can then produce an estimate for new data it has never seen before. This is also called building a model, just like the linear regression function above.

 

   [Figure: workflow from the training set, through the learning algorithm, to the hypothesis h, which maps a new input x to an estimated y]

    We use x1, x2, ..., xn to denote the components of the feature, e.g. x1 = the area of the house, x2 = the orientation of the house, and so on. We can then define an estimation function:

h(x) = h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2

    Here θ is called a parameter; it adjusts the influence of each component of the feature, that is, whether the area of the house or the location of the house matters more. If we set x_0 = 1, the function can be written in vector form:

h_\theta(x) = \theta^T x = \sum_{i=0}^{n} \theta_i x_i
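To make the vector form concrete, here is a minimal Python sketch (my own illustration, assuming numpy); x already includes the x_0 = 1 entry, and the parameter values are made up:

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = theta^T x, where x already includes x0 = 1."""
    return float(np.dot(theta, x))

theta = np.array([10.0, 2.0])   # made-up values: theta0 (intercept), theta1 (weight on area)
x = np.array([1.0, 120.0])      # x0 = 1, x1 = area of the house in m^2
print(h(theta, x))              # 10 + 2 * 120 = 250.0
```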

    Our program also needs a way to evaluate whether a given θ is good, so we need to evaluate the h function we have built. Such a function is generally called a loss function or an error function; it describes how badly the h function fits the data. In what follows we call it the J function.

    Here we can define the following error function:

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

    This error function takes the sum of the squares of the differences between the estimate for x^(i) and the true value y^(i). The factor of 1/2 in front is there for convenience when differentiating: it cancels against the exponent, so the coefficient disappears.
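Here is a small sketch of this error function (my own illustration, assuming numpy), using the toy table data with a single feature plus the x_0 = 1 column; the candidate θ is an arbitrary guess:

```python
import numpy as np

def J(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2.

    X is an m x (n+1) matrix whose first column is all ones (x0 = 1);
    y is the length-m vector of observed prices.
    """
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

X = np.array([[1.0, 123.0],
              [1.0, 150.0],
              [1.0,  87.0],
              [1.0, 102.0]])
y = np.array([250.0, 320.0, 160.0, 220.0])
print(J(np.array([0.0, 2.0]), X, y))   # error of the crude guess "price = 2 * area"
```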

    There are many ways to adjust θ so that J(θ) reaches its minimum. One of them is the least squares method, which has a purely mathematical, closed-form solution. The derivation of the least squares formula is given in the last part of the Stanford Machine Learning open course and can be found in many machine learning and mathematics books. I will not cover the least squares method here; instead, let's talk about gradient descent.

The derivation of gradient descent can be found in the following article:

http://www.zhizhihu.com/html/y2011/3632.html

    The gradient descent method proceeds as follows:

    1) First assign a value to θ; it can be random, or θ can be initialized to a vector of all zeros.

    2) Change the value of θ so that J(θ) decreases along the direction of steepest descent (the negative gradient).

    For more clarity, the following diagram is given:

   [Figure: surface of the error function J(θ) over (θ0, θ1)]

   This is a graph of the relationship between the parameter vector θ and the error function J(θ). The red regions indicate where J(θ) takes relatively high values; what we want is to make the value of J(θ) as low as possible, i.e. to reach the dark blue region. θ0 and θ1 represent the two dimensions of the θ vector.

    The first step of the gradient descent method mentioned above is to give θ an initial value; suppose the randomly chosen initial value is the cross marked on the graph.

    Then we adjust θ along the direction of steepest descent, which makes J(θ) move toward lower values. As shown in the figure, the algorithm stops when θ has descended to a point from which it can descend no further.

   [Figure: the path of gradient descent from the initial point down to a minimum]

   Of course, the final point reached by gradient descent is not necessarily the global minimum; it may be a local minimum, as in the following case:

   [Figure: gradient descent started from a different initial point, converging to a local minimum]

   The picture above shows a local minimum reached after re-selecting the initial point. It appears that our algorithm can be heavily affected by the choice of initial point and can get stuck in a local minimum.

   Below I will use an example to describe the gradient descent process, by taking the partial derivative of our function J(θ). (If you don't follow the derivation, you can review your calculus.)

\frac{\partial J(\theta)}{\partial \theta_i}
  = \frac{\partial}{\partial \theta_i} \, \frac{1}{2} \left( h_\theta(x) - y \right)^2
  = \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_i} \left( h_\theta(x) - y \right)
  = \left( h_\theta(x) - y \right) x_i

(written here for a single training example)

    The following is the update rule: θi is decreased along the direction of steepest gradient. On the right-hand side, θi is the value before the update, the subtracted term is the amount of decrease along the gradient direction, and α is the step size, i.e. how far we move in the direction of gradient descent at each step.

\theta_i := \theta_i - \alpha \left( h_\theta(x) - y \right) x_i

    A very important point to note is that the gradient has a direction. For a vector θ, each dimensional component θi has its own gradient direction, and together they determine an overall direction. When updating, we simply move along the direction of steepest descent, and we will reach a minimum point, whether it is local or global.
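As an illustration of this update for a single training example (a sketch of my own, assuming numpy; the step size and the rescaling of the area are arbitrary choices), all components of θ can be updated at once because the factor (h_θ(x) - y) is shared:

```python
import numpy as np

def single_example_step(theta, x, y, alpha):
    """theta_i := theta_i - alpha * (h_theta(x) - y) * x_i, applied to every component i."""
    error = np.dot(theta, x) - y    # h_theta(x) - y
    return theta - alpha * error * x

theta = np.zeros(2)
x = np.array([1.0, 1.23])           # x0 = 1, area rescaled to hundreds of m^2
theta = single_example_step(theta, x, y=250.0, alpha=0.01)
print(theta)                        # one small step toward fitting this example
```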

    Step 2) can be described in more concise mathematical language as follows:

\theta := \theta - \alpha \nabla_\theta J(\theta)

    The inverted triangle (∇) denotes the gradient; written this way, the index i disappears. You can see that using vectors and matrices well really does simplify the mathematical description.
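Putting steps 1) and 2) together, here is a minimal batch gradient descent sketch (my own illustration, assuming numpy). The gradient of J(θ) = (1/2)||Xθ - y||^2 is X^T(Xθ - y); the area is rescaled to hundreds of m^2 so that a simple fixed step size converges, and the step size and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent: theta := theta - alpha * grad J(theta),
    with grad J(theta) = X^T (X theta - y) for J(theta) = 1/2 * ||X theta - y||^2."""
    theta = np.zeros(X.shape[1])           # step 1): initialize theta to all zeros
    for _ in range(iterations):            # step 2): move against the gradient
        gradient = X.T @ (X @ theta - y)
        theta = theta - alpha * gradient
    return theta

# Toy data from the table above; the second column is the area in hundreds of m^2
X = np.array([[1.0, 1.23],
              [1.0, 1.50],
              [1.0, 0.87],
              [1.0, 1.02]])
y = np.array([250.0, 320.0, 160.0, 220.0])
theta = gradient_descent(X, y)
print(theta)   # roughly [-40, 240]: price ≈ -40 + 240 * (area / 100)
```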

 

Reference: http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html

If there are any errors, please correct me.

Email: [email protected]
