1. Overview of Linear Regression
Linear regression: an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (features) and a dependent variable (target value)
Features: the case of a single independent variable is called univariate regression; the case of more than one independent variable is called multiple regression
A linear model establishes the relationship between the feature values and the target value
2. Relationship Between the Features and the Target in Linear Regression
Linear regression models two main kinds of relationships: linear and nonlinear. Since at most a plane can be drawn for visualization, the examples below use one or two features
Linear relationship
The univariate and multivariate linear relationships are shown in the figures below
The relationship between a single feature and the target value is a straight line, and the relationship between two features and the target value is a plane
Nonlinear relationship
A nonlinear regression equation can be understood as, for example, $y = w_1 x_1 + w_2 x_2^2 + w_3 x_3^3 + b$, as shown in the figure below
3. Linear regression API
- sklearn.linear_model.LinearRegression()
- LinearRegression.coef_: regression coefficient
A simple usage example follows
from sklearn.linear_model import LinearRegression
x = [[80, 86], [82, 80], [85, 78], [90, 90], [86, 82], [82, 90], [78, 80], [92, 94]]  # [regular grade, exam grade]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]  # final grade
estimator = LinearRegression()  # instantiate the API
estimator.fit(x, y)  # train with the fit method
print('Coefficients:', estimator.coef_)  # weights of regular and exam grades, i.e. their proportions
print('Prediction:', estimator.predict([[90, 85]]))  # predict the final grade
------------------------------------------------------------------
Output:
Coefficients: [0.3 0.7]
Prediction: [86.5]
4. Derivatives of Common Functions
Four Arithmetic Operations of Derivatives
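The derivative tables themselves appear to be missing from this copy; the standard formulas such a section would list are:

```latex
% Derivatives of common functions
(C)' = 0 \qquad (x^n)' = n x^{n-1} \qquad (e^x)' = e^x \qquad (\ln x)' = \frac{1}{x}
(\sin x)' = \cos x \qquad (\cos x)' = -\sin x

% Four arithmetic operations of derivatives
(u \pm v)' = u' \pm v' \qquad (uv)' = u'v + uv' \qquad \left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}
```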
5. Loss function and optimization algorithm
5.1 Loss function
The total loss is defined as

$$J(w) = \sum_{i=1}^{m} \left(h(x_i) - y_i\right)^2$$

That is, the sum of the squares of the differences between each predicted value and the true value, where
- $y_i$ is the true value of the i-th training sample
- $h(x_i)$ is the prediction function applied to the feature values of the i-th training sample, that is, the predicted value
- This loss function is also known as the least-squares loss
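As a quick sketch (plain Python, names are illustrative), the total loss for a candidate weight vector can be computed directly from this definition:

```python
def total_loss(w, X, y):
    """Sum of squared errors J(w) = sum_i (h(x_i) - y_i)^2, with h(x) = w . x."""
    return sum((sum(wj * xj for wj, xj in zip(w, xi)) - yi) ** 2
               for xi, yi in zip(X, y))

# first three rows of the grade data from the sklearn example above
X = [[80, 86], [82, 80], [85, 78]]
y = [84.2, 80.6, 80.1]
print(total_loss([0.3, 0.7], X, y))  # the exact weights give essentially zero loss
```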
5.2 Optimization algorithm (optimization reduces total loss)
The purpose of optimization: find the value of w that minimizes the loss
There are two optimization algorithms commonly used in linear regression
1. Normal equation

$$w = (X^T X)^{-1} X^T y$$

It uses the inverse and transpose of a matrix to solve directly: X is the feature matrix and y is the target vector. Substituting X and y gives the best w in one step
Disadvantages: only suitable when the numbers of samples and features are small; when there are too many features, solving is too slow and may be infeasible
The derivation process of the normal equation
Convert the loss function into matrix notation:

$$J(w) = (Xw - y)^T (Xw - y)$$

where y is the vector of true values, X is the feature matrix, and w is the weight vector
To find the w that minimizes the loss (y and X are known), note that J(w) is a quadratic function of w, so the minimum is where the derivative is zero
The derivation is as follows

$$\frac{\partial J}{\partial w} = 2X^T(Xw - y) = 0 \;\Rightarrow\; X^T X w = X^T y \;\Rightarrow\; w = (X^T X)^{-1} X^T y$$

In the above derivation, X is a matrix with m rows and n columns and cannot itself be guaranteed to have an inverse; left-multiplying by $X^T$ turns it into the square matrix $X^T X$, which (when invertible) allows solving for w
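A minimal NumPy sketch of the normal equation, reusing the grade data from the sklearn example above (the closed-form solution recovers the same coefficients):

```python
import numpy as np

# feature matrix X (regular grade, exam grade) and target vector y (final grade)
X = np.array([[80, 86], [82, 80], [85, 78], [90, 90],
              [86, 82], [82, 90], [78, 80], [92, 94]], dtype=float)
y = np.array([84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4])

# w = (X^T X)^{-1} X^T y  -- solve() is preferred over forming an explicit inverse
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to [0.3 0.7], matching LinearRegression.coef_
```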
2. Gradient descent
Take descending a mountain as an example: from the current position, find the steepest downhill direction, walk a certain distance that way, then repeat the same procedure until successfully reaching the valley.
Gradient descent: a differentiable function represents the mountain, and the minimum of the function is the bottom. The fastest way down is to find the steepest direction at the current position and step down along it; for a function, this means computing the gradient at the current point and moving in the opposite direction of the gradient, which makes the function value decrease fastest, because the gradient points in the direction of the function's fastest change. Repeating this method (recomputing the gradient, stepping against it) eventually reaches a local minimum
- In a univariate function: the gradient is the derivative of the function, which represents the slope of the tangent at a given point
- In a multivariate function: the gradient is a vector whose direction indicates the direction of fastest increase of the function at a given point
- Since the direction of the gradient is the direction in which the function rises fastest at a given point, the opposite direction of the gradient is the direction in which the function drops fastest at that point
Univariate Function Gradient Descent
Suppose there is a univariate function: $J(\theta) = \theta^2$
- Derivative of the function: $J'(\theta) = 2\theta$
- Initialization, starting point: $\theta^0 = 1$
- Learning rate: $\alpha = 0.4$
The iterative calculation of gradient descent proceeds as

$$\theta^1 = \theta^0 - \alpha J'(\theta^0) = 1 - 0.4 \times 2 = 0.2$$
$$\theta^2 = 0.2 - 0.4 \times 0.4 = 0.04,\quad \theta^3 = 0.008,\quad \theta^4 = 0.0016$$

After four iterations the value is already close to the minimum at $\theta = 0$
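The univariate iteration can be sketched in a few lines of Python (the values are those of the worked example, $J(\theta)=\theta^2$, $\alpha=0.4$, starting at 1):

```python
def J_prime(theta):
    return 2 * theta  # derivative of J(theta) = theta^2

theta, alpha = 1.0, 0.4
for step in range(4):
    theta = theta - alpha * J_prime(theta)  # step against the gradient
    print(step + 1, theta)
# theta shrinks by a factor of 0.2 each step: 0.2, 0.04, 0.008, 0.0016
```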
Gradient Descent for Multivariate Functions
Suppose there is an objective function: $J(\theta) = \theta_1^2 + \theta_2^2$
We now compute the minimum of this function by gradient descent. By observation, the minimum is at the point (0, 0). Starting the algorithm from the initial point $\Theta^0 = (1, 3)$
The initial learning rate is: $\alpha = 0.1$
The gradient of the function is $\nabla J(\theta) = \langle 2\theta_1, 2\theta_2 \rangle$
Multiple iterations give

$$\Theta^1 = (1, 3) - 0.1 \cdot (2, 6) = (0.8, 2.4)$$
$$\Theta^2 = (0.64, 1.92),\quad \Theta^3 = (0.512, 1.536),\ \ldots$$

each step scaling both coordinates by 0.8, converging toward (0, 0)
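The multivariate case is the same update applied coordinate-wise; a short sketch with the example's starting point and learning rate:

```python
theta = [1.0, 3.0]  # initial point Theta^0 = (1, 3)
alpha = 0.1         # learning rate

for _ in range(100):
    grad = [2 * t for t in theta]                       # gradient of J = theta1^2 + theta2^2
    theta = [t - alpha * g for t, g in zip(theta, grad)]  # step against the gradient

print(theta)  # both coordinates shrink toward the minimum at (0, 0)
```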
Univariate and multivariate gradient descent share the same formula

$$\theta_{i+1} = \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$$

α is called the learning rate or step size in the gradient descent algorithm: it controls the distance of each step, ensuring that a step is not so large that it overshoots the lowest point, while also not so small that progress is too slow
The negative sign before the gradient means moving in the opposite direction of the gradient. Since the gradient points in the direction of fastest increase at a point, the direction of fastest decrease is naturally the negative gradient direction, hence the minus sign
Gradient descent | Normal equation
---|---
Requires choosing a learning rate | Not required
Requires iterative solving | Obtained in one computation
Usable with a large number of features | Requires computing $(X^T X)^{-1}$, time complexity $O(n^3)$
With an optimization algorithm such as gradient descent, regression gains the ability to "learn automatically"
(Animated visualization of the optimization process.)
Choosing a model
- Small-scale data: LinearRegression (does not address fitting problems) or ridge regression
- Large-scale data: SGDRegressor
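For large-scale data, SGDRegressor fits the same kind of linear model by stochastic gradient descent. A sketch on the grade data from earlier (feature scaling is added here because SGD is sensitive to feature magnitudes; `random_state=42` is an arbitrary choice for reproducibility):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x = [[80, 86], [82, 80], [85, 78], [90, 90], [86, 82], [82, 90], [78, 80], [92, 94]]
y = [84.2, 80.6, 80.1, 90, 83.2, 87.6, 79.4, 93.4]

# scale the features, then fit a linear model by stochastic gradient descent
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=42))
model.fit(x, y)
print(model.predict([[90, 85]]))  # comparable to LinearRegression's 86.5
```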