Linear regression and feature normalization for machine learning



Linear regression is a regression analysis technique. Regression analysis is essentially a function estimation problem (function estimation includes parametric and nonparametric estimation): it looks for the relationship between the dependent variable and the independent variables. The dependent variable in regression analysis should be a continuous variable; if it is a discrete variable, the problem becomes a classification problem. Regression analysis is a supervised learning problem.

Here "linear" means a linear combination of first-order features: a straight line in two-dimensional space, a plane in three-dimensional space, and, extended to n-dimensional space, a generalized hyperplane.

For example, to predict the price of a house, features are extracted first. The choice of features affects the accuracy of the model. Take the height of the house and the area of the house: the area is without doubt an important factor in the house price, while the height is basically irrelevant.

In the figure below, four features are selected, among them area, number of floors, and construction time, and a training set {x^(i), y^(i)} is drawn from historical examples.

With this data, training is performed. A schematic diagram of supervised learning is shown below.

The training set is fed to the learning algorithm, which produces the model h. For new data x, the predicted value y is obtained directly from the model; in this example, the predicted price of a house is obtained from its size. In essence, this is finding regularities in historical data: things tend to develop in the direction they have developed many times before.

The calculation model follows. The only criterion is to minimize the empirical risk, that is, the purpose of training is to make the model's outputs on the training data as close as possible to the actual values y^(i). Assuming x_0 = 1, the loss function J(θ) is defined as follows, and we want it to be as small as possible. A J(θ) defined this way is a convex function in optimization theory, meaning it has a single global optimum, which can then be found with the gradient descent algorithm; the update form of gradient descent is also given.
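For reference, the standard forms of the hypothesis, the squared-error loss, and the gradient descent update for linear regression (with m training examples and learning rate α) are:

h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x \quad (x_0 = 1)

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}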

 

 

A specific form of gradient descent is given above; for details on gradient descent, see Gradient Descent Explained.
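As a concrete illustration (a minimal sketch, not from the original article), batch gradient descent for this loss in NumPy; the learning rate alpha, the iteration count, and the 1/m averaging of the gradient are conventional choices:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression with squared-error loss."""
    m = X.shape[0]
    X = np.hstack([np.ones((m, 1)), X])      # prepend x_0 = 1 for the intercept term
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = X @ theta - y                # h_theta(x^(i)) - y^(i) for every example
        theta -= alpha / m * (X.T @ error)   # step against the (averaged) gradient of J(theta)
    return theta

# Usage: theta = gradient_descent(X_train, y_train)
```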

Locally weighted regression

Sometimes the samples fluctuate noticeably; in that case locally weighted regression can be used. In the figure below, the red line is the result of locally weighted regression and the blue line is the result of ordinary polynomial regression; the blue line underfits somewhat.

The method of locally weighted regression is as follows: start from the loss function of linear (or polynomial) regression, then add a weight to each training example.
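A standard form of this weighted loss, with a bandwidth parameter τ that controls how quickly the weight w^(i) falls off with distance from the query point x, is:

J(\theta) = \sum_{i=1}^{m} w^{(i)} \left( \theta^T x^{(i)} - y^{(i)} \right)^2, \quad w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)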

Obviously, locally weighted regression re-fits the parameters every time a new sample is to be predicted, in order to achieve a better prediction. When the data set is large, the amount of computation is large and learning is very inefficient. Moreover, locally weighted regression does not necessarily avoid underfitting, because the fluctuating samples may simply be outliers or noise.
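A minimal NumPy sketch of this procedure, assuming a design matrix X whose first column is all ones, and solving the weighted normal equations directly for each query point rather than running gradient descent:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=1.0):
    """Locally weighted linear regression: re-fit theta for one query point and predict."""
    # Weight each training example by its closeness to the query point (bandwidth tau)
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```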

When solving models for linear regression, there are two issues to be aware of:
One is feature combination. For example, rather than using the length and width of a house as two separate features in the model, it is better to multiply them into a single area feature, so that part of the dimensionality reduction is already done during feature selection.
The second is feature normalization (Feature Scaling), which is also a problem that many machine learning models need to pay attention to.
For some models, scaling each dimension unevenly leaves the optimal solution equivalent to the original one; logistic regression is an example. For such models, normalization does not, in theory, change the optimal solution. However, since the solution is usually found with an iterative algorithm, if the shape of the objective function is too "flat" the iterations may converge very slowly or even fail to converge. Therefore, even for models with scaling invariance, it is best to normalize the data.

There are two advantages to normalization. The first is faster convergence: as shown in the figure below, x1 ranges over 0-2000 while x2 ranges over 1-5. With just these two features, the contours of the objective function form a narrow, elongated ellipse, so gradient descent zigzags along the direction perpendicular to the contour lines and the iteration is very slow. After normalization (the plot on the right), the iteration is much faster, which improves the convergence rate of the model.

 

The other advantage of normalization is improved accuracy, which matters noticeably for algorithms that involve distance calculations, for example those based on Euclidean distance. In the figure above, the range of x2 is much smaller than that of x1, so its effect on the result is much smaller than that of x1, which amounts to a loss of precision. Normalization is therefore necessary: it makes each feature contribute to the result on an equal footing.
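A small illustration of this effect (the feature values are hypothetical and chosen only to show the scale difference):

```python
import numpy as np

# Two houses described by [area, number of floors] -- hypothetical values
a = np.array([2104.0, 3.0])
b = np.array([1600.0, 1.0])

# Without scaling, the Euclidean distance is dominated by the area feature
print(np.linalg.norm(a - b))              # ~504.0; the difference in floors barely registers

# After min-max scaling both features to [0, 1], both contribute to the distance
lo, hi = np.array([1000.0, 1.0]), np.array([3000.0, 5.0])
a_s, b_s = (a - lo) / (hi - lo), (b - lo) / (hi - lo)
print(np.linalg.norm(a_s - b_s))          # ~0.56; the floor difference now matters
```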

There are three main types of normalization:

1) Linear normalization

x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)}

      This normalization method is suitable when the values are relatively concentrated. It has a flaw: if max and min are unstable, the normalized results are unstable as well, and so is their effect in later use. In practice, max and min can be replaced by empirical constant values.
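A minimal NumPy sketch of this linear (min-max) normalization, applied to a hypothetical feature column:

```python
import numpy as np

def min_max_scale(x):
    """Scale a feature column to [0, 1]: x' = (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

area = np.array([2104.0, 1600.0, 2400.0, 1416.0])   # hypothetical house areas
print(min_max_scale(area))                           # all values now lie in [0, 1]
```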

2) Standard deviation normalization

  The processed data conforms to the standard normal distribution, that is, the mean is 0 and the standard deviation is 1. The transformation function is:

x' = \frac{x - \mu}{\sigma}

  where μ is the mean of all sample data and σ is the standard deviation of all sample data.
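A corresponding sketch of standard deviation (z-score) normalization on a hypothetical feature column:

```python
import numpy as np

def z_score_scale(x):
    """Standardize a feature column to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

floors = np.array([3.0, 1.0, 2.0, 2.0])   # hypothetical feature values
print(z_score_scale(floors))               # mean ~0 and standard deviation ~1 after scaling
```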

3) Nonlinear normalization

     It is often used when the data spans a very large range, with some values large and some small. The original values are mapped through a mathematical function; the methods include log, exponential, tangent, and so on. The curve of the nonlinear function, for example log2(V) or log10(V), should be chosen according to the data distribution.
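One common variant of log-based nonlinear normalization, sketched on hypothetical widely spread values (each value's log is divided by the log of the maximum, mapping the data into (0, 1]):

```python
import numpy as np

prices = np.array([150_000.0, 320_000.0, 4_500_000.0])   # hypothetical, widely spread values
scaled = np.log10(prices) / np.log10(prices.max())        # x' = log10(x) / log10(max(x))
print(scaled)                                             # large gaps are compressed
```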

Reference: http://www.cnblogs.com/LBSer/p/4440590.html


