Python Machine Learning (4): Linear Algebra Review, Multiple Linear Regression, Polynomial Regression, the Normal Equation, and a Linear Regression Case Study

Review Linear Algebra

Matrix

A matrix can be understood as another representation of a two-dimensional array. Matrix A below has three rows and two columns, and matrix B has two rows and three columns. The elements of a matrix are accessed through subscripts (in code, subscripts start from 0 by default). $A_{ij}$ denotes the element in row $i$, column $j$.
[Figure: example matrices A (3×2) and B (2×3)]

Vector

A vector is a special matrix, one with only 1 column; C below is a vector with 4 rows and 1 column.
[Figure: example column vector C (4×1)]

Matrix and Scalar Operations

An operation between a scalar and a matrix applies the scalar to every element of the matrix. You can picture it as broadcasting: treat the scalar as a matrix of the same shape whose elements are all that scalar, then operate on corresponding positions element-wise.
[Figure: scalar operating on each element of a matrix]
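As a minimal NumPy sketch of this broadcasting idea (the matrix values here are made up for illustration):

```python
import numpy as np

# A 2x3 matrix; operating with a scalar applies to every element,
# as if the scalar were broadcast to a matrix of the same shape.
A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A + 10)   # adds 10 to every element
print(A * 2)    # doubles every element
```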

Matrix and Vector Operations

[Figure: matrix-vector multiplication]
An $n \times m$ matrix multiplied by an $m \times 1$ vector yields an $n \times 1$ vector.
Example:
For example, the size of a house affects its price, so size is used as the feature data.
Feature data: $\begin{bmatrix} 1\\ 2\\ 3 \end{bmatrix}$, with the linear relationship $h(x)=2x+1$. How can linear algebra represent the relationship between $h(x)$ and $x$?
Construct a feature matrix by adding a column whose entries are all 1, then construct a parameter vector: the 1 corresponds to $θ_0$ and the 2 corresponds to $θ_1$. $x$ is the feature data; there are $m$ samples, here $m=3$, corresponding to $x_1, x_2, x_3$. The column of 1s ensures the intercept does not interfere with the other coefficients.
[Figure: feature matrix multiplied by the parameter vector]
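A small NumPy sketch of this construction, using the toy feature data above:

```python
import numpy as np

# Feature data x = [1, 2, 3]; prepend a column of 1s for the intercept.
X = np.array([[1, 1],
              [1, 2],
              [1, 3]])
theta = np.array([1, 2])   # theta_0 = 1, theta_1 = 2, i.e. h(x) = 2x + 1

h = X @ theta              # matrix-vector product gives h(x) for all samples
print(h)                   # [3 5 7]
```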

Matrix and Matrix Operations

Addition and similar operations act on corresponding positions, so the shapes of the two matrices must be consistent; otherwise the operation is undefined. In particular, a matrix and a vector of different shapes cannot be added directly.
[Figure: element-wise matrix addition]
For multiplication, a $3 \times 2$ matrix times a $2 \times 3$ matrix can be computed by splitting the right-hand $2 \times 3$ matrix into 3 column vectors; the result is a $3 \times 3$ matrix.
[Figure: matrix-matrix multiplication, column by column]
An $n \times m$ matrix multiplied by an $m \times n$ matrix yields an $n \times n$ matrix.
Example:
Feature data: $\begin{bmatrix} 1\\ 2\\ 3 \end{bmatrix}$, with two linear relationships: $h(x)=2x+1$ and $h(x)=3x+2$. How can linear algebra represent the relationship between $h(x)$ and $x$?
[Figure: feature matrix multiplied by a parameter matrix]
With multiple $h(x)$ expressions, the 1 in the first equation corresponds to $θ_0$ and the 2 to $θ_1$; the 2 in the second equation corresponds to $θ_0'$ and the 3 to $θ_1'$. The equations are placed as columns of the parameter matrix in sequence.
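A NumPy sketch of the same construction with two hypotheses (parameter values taken from the two equations above):

```python
import numpy as np

# Same feature matrix with a column of 1s for the intercept.
X = np.array([[1, 1],
              [1, 2],
              [1, 3]])

# One column of parameters per hypothesis:
# column 0: h(x) = 2x + 1, column 1: h(x) = 3x + 2
Theta = np.array([[1, 2],
                  [2, 3]])

H = X @ Theta      # 3x2 result: one column of predictions per hypothesis
print(H)           # [[ 3  5], [ 5  8], [ 7 11]]
```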

Identity Matrix

Among the natural numbers, 1 times any number equals that number. The identity matrix plays the same role: its diagonal elements are 1 and all other elements are 0, and it has as many rows as columns, so it is a square matrix.
$I=\begin{bmatrix} 1&0&0\\ 0&1&0\\ 0&0&1 \end{bmatrix}$, and $A I = I A = A$.
An $m \times m$ identity matrix multiplied by an $m \times n$ matrix gives back the $m \times n$ matrix. Multiplication with the identity matrix satisfies the commutative law; matrix multiplication in general does not.

Transpose Matrix

Swap the rows and columns of a matrix.
[Figure: a matrix and its transpose]
$A_{ij}=A_{ji}^T$

Inverse Matrix

Only an $m \times m$ square matrix can have an inverse, and its determinant must be nonzero. The defining property of the inverse is $A A^{-1}=A^{-1} A=I$, analogous to $3 \cdot 3^{-1}=1$ for scalars.
[Figure: a matrix multiplied by its inverse gives the identity]
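A short NumPy sketch of these properties (the matrix A is an arbitrary invertible example):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])   # square matrix with nonzero determinant

I = np.eye(2)                # identity matrix
A_inv = np.linalg.inv(A)     # inverse; raises LinAlgError if singular

print(np.allclose(A @ I, A))              # A * I = A
print(np.allclose(A @ A_inv, np.eye(2)))  # A * A^-1 = I
```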

Multiple Linear Regression

Previously, house prices were predicted from a single feature, the size of the house. In reality, many other factors affect the price: for example, an apartment in Beijing costs more than one in a small city, a villa costs more than an apartment, and so on. Now more than one feature is considered. When multiple features are used to train the model, the variables go from $x_1$ alone to $x_1 \dots x_n$.
In the figure below, each record is one sample in the data set; the label is the target value (price), and the features are bedrooms, bathrooms, sqft_living, sqft_lot, and floors. The linear relationship is $h(x)=θ_0+θ_1x_1+θ_2x_2+...+θ_nx_n$.
[Figure: sample rows of the housing data set]
In the formula, take $x_0=1$ and construct the feature vector $x$ and the parameter vector $θ$. There are $m$ data samples, and the vector $x$ has length $n+1$.
[Figure: feature vector x and parameter vector θ]
Multiple (multivariate) linear regression means there are multiple parameters, feature quantities, or variables used to predict $h(x)$; the ultimate goal is to obtain the set of $θ$ values that minimizes the cost function.
Formula: $h(x)=θ^TX=θ_0x_0+θ_1x_1+...+θ_nx_n$
Parameters: $θ_0,θ_1,...,θ_n$
Cost function: $J(θ_0,θ_1,...,θ_n)=\frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})^2$
Goal: find $θ_0,θ_1,...,θ_n$ that minimize the cost function.
Gradient descent method: fix a number of iterations and repeatedly take the partial derivative of the cost with respect to $θ$, updating the $θ$ parameters until the cost function is minimized.
First take the partial derivative with respect to $θ_0$; $θ_0$ is the intercept. From $θ_1$ onward, each parameter is the coefficient of the corresponding feature $x_1, x_2, \dots$
[Figures: partial derivatives of $J$ with respect to $θ_0$, $θ_1$, and $θ_2$]
In general, $\frac{\partial J}{\partial θ_j}=\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})x_j^{(i)}$, with $x_0^{(i)}=1$ for the intercept term.

Gradient Descent Method for Multiple Linear Regression

A freight company delivers goods and wants a model that predicts total transportation time from historical data.
Transportation mileage and the number of deliveries are the features; total transportation time is the label (target value). The total transportation time is predicted from the mileage and the number of deliveries.
[Figures: loading the data, implementing gradient descent, and the fitted plane]
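Since the original code survives only as screenshots, here is a minimal sketch of batch gradient descent for two features; the freight numbers below are made-up stand-ins, not the article's actual data:

```python
import numpy as np

# Toy data standing in for the freight example (values are invented):
# columns are transport mileage and number of deliveries; label is total time.
X_raw = np.array([[100.0, 4], [50.0, 3], [100.0, 4],
                  [100.0, 2], [50.0, 2], [80.0, 2]])
y = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2])

m = len(y)
X = np.hstack([np.ones((m, 1)), X_raw])   # prepend x0 = 1 for the intercept
theta = np.zeros(X.shape[1])
alpha = 0.0001                            # learning rate
for _ in range(100000):                   # fixed number of iterations
    error = X @ theta - y                 # h(x) - y for every sample
    gradient = X.T @ error / m            # partial derivatives for all theta_j
    theta -= alpha * gradient             # simultaneous update

print(theta)   # [theta_0, theta_1 (mileage), theta_2 (deliveries)]
```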
The sample points are distributed above and below the fitted plane, so there is loss on both sides. The plane lies roughly in the middle of the data, fitting the two features, and predictions can be read from it.

sklearn implements multiple linear regression

[Figures: sklearn LinearRegression code and the fitted plane]
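A sketch of the sklearn version, again with the made-up stand-in data rather than the article's actual values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same invented freight data: mileage and number of deliveries -> total time.
X = np.array([[100.0, 4], [50.0, 3], [100.0, 4],
              [100.0, 2], [50.0, 2], [80.0, 2]])
y = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2])

model = LinearRegression()        # the intercept is added automatically
model.fit(X, y)

print(model.intercept_)           # theta_0
print(model.coef_)                # theta_1, theta_2
print(model.predict([[90.0, 3]])) # predicted total time for a new shipment
```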

When the cost function is minimized, the fitted plane is obtained. sklearn's LinearRegression is not built on gradient descent; it solves the least-squares problem directly, in the spirit of the normal equation.

Polynomial Regression

Suppose a house has two features, width and length, and the price-prediction equation is $h(x)=θ_0+θ_1 \cdot \text{width}+θ_2 \cdot \text{length}$. Using the area formula, the two features can be combined into one: $h(x)=θ_0+θ_1 \cdot \text{area}$.
The visualized graph is as follows:
[Figure: house price versus area]
With plain linear regression, an upward-sloping straight line is fitted: $h(x)=θ_0+θ_1x$. This line does not fit the data well, and the MSE (loss) is relatively large.
If a curve is used instead, the equation becomes $h(x)=θ_0+θ_1x+θ_2x^2$. The overall shape of the function is determined by its highest power. A quadratic is a parabola: the price would first rise as the area increases, but past the vertex it would fall as the area keeps growing, which does not happen within the same region.
To make the price keep rising along a curve, introduce a higher power: $h(x)=θ_0+θ_1x+θ_2x^2+θ_3x^3$.
This is a cubic equation in one variable; the high powers can be replaced with new features by substitution.
Let $x_1=(\text{size})$, $x_2=(\text{size})^2$, $x_3=(\text{size})^3$. Then
$h(x)=θ_0+θ_1x+θ_2x^2+θ_3x^3=θ_0+θ_1(\text{size})+θ_2(\text{size})^2+θ_3(\text{size})^3=θ_0+θ_1x_1+θ_2x_2+θ_3x_3$
The substitution converts the high-power polynomial into a multiple linear regression problem.

Polynomial Regression Case

The salaries for different position levels are given below. Build a level-to-salary prediction system that predicts Salary from a new level.
[Figures: the level/salary data and a plain linear fit]
The linear fit is not very good, so polynomial regression is used instead.
[Figures: polynomial regression fit to the level/salary data]
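A sketch of the polynomial approach with scikit-learn; the level/salary values below are hypothetical stand-ins for the table in the screenshots, and degree=4 is just one choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical level/salary pairs standing in for the figure's table.
X = np.arange(1, 11).reshape(-1, 1)            # position levels 1..10
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

poly = PolynomialFeatures(degree=4)            # adjust degree to change the fit
X_poly = poly.fit_transform(X)                 # columns [1, x, x^2, x^3, x^4]

model = LinearRegression().fit(X_poly, y)
print(model.predict(poly.transform([[6.5]])))  # salary for a new level
```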
The closeness of the fit can be changed by adjusting the degree.

Normal Equation Method

The gradient descent method minimizes the cost function $J(θ)$ to find $θ_0,θ_1,...,θ_n$, but it needs many iterations to converge to the global minimum.

The normal equation method also reaches the global minimum, but without an iterative algorithm: the optimal $θ$ is obtained in one step. For $J(θ)=aθ^2+bθ+c$, the graph is as shown below:
[Figure: parabola of $J(θ)$ with its minimum marked]
To find the $θ$ that minimizes the equation, note that at the lowest point the tangent is horizontal, i.e. the slope is 0. Taking the derivative at that point and setting it to 0 gives $2aθ+b=0$, so $θ=-\frac{b}{2a}$.
But $θ$ is usually a vector, and the cost function is $J(θ_0,θ_1,...,θ_n)=\frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})^2$
To solve with the normal equation method, take the partial derivative with respect to each $θ_j$, set the result to 0, and solve for the corresponding $θ$ values.

Normal Equation Case

When this was solved with gradient descent earlier, two nested for loops were needed, which made the code redundant.
$J(θ_0,θ_1,...,θ_n)=\frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})^2$
This data set has 4 training samples; price is the label $y$. There are 5 features: bedrooms, bathrooms, sqft_living, sqft_lot, and floors, corresponding to $x_1$ through $x_5$. With multiple feature columns, build a matrix and add an $x_0$ column whose values are all 1.
[Figure: the 4 training samples]
The resulting matrix is: $X=\begin{bmatrix} 1&3&1.00&1185&5650&1.0\\ 1&3&2.25&5270&7242&2.0\\ 1&2&1.00&770&10000&1.0\\ 1&4&3.00&1960&5000&1.0 \end{bmatrix}$
$X$ is the feature matrix. $y$ holds the true values, constructed as a label vector: $y=\begin{bmatrix} 221900\\ 538000\\ 180000\\ 604000 \end{bmatrix}$
$θ=\begin{bmatrix} θ_0\\ θ_1\\ θ_2\\ θ_3\\ θ_4\\ θ_5 \end{bmatrix}$
Using $X$, $y$, and $θ$, construct the cost function. First compute $Xθ$, which is the $h(x)$ part; the error is the true value minus the predicted value: $y-h(x)=y-Xθ$.
[Figure: computing the error vector $y-Xθ$]
The error obtained this way is a vector of individual errors. To get the sum of squared errors, multiply the transpose of the error vector by the error vector itself: with $error=(y-Xθ)$, the product $error^T \cdot error=(y-Xθ)^T(y-Xθ)$ is the sum of squares.
[Figure: squared-error form of the cost]
To find the lowest value, find the point where the slope is 0: take the derivative with respect to $θ$ and set it to 0. Multiplying an inverse matrix by its matrix gives the identity matrix, which yields the normal equation solution for $θ$:
$θ=(X^TX)^{-1}X^Ty$
[Figure: derivation of the normal equation]

Comparison of the Normal Equation Method and the Gradient Descent Method

Advantages of the normal equation method: when solving for $θ$, there is no learning rate to choose and no iteration; direct operations between matrices and vectors replace the earlier loop over every sample. Disadvantages: it must compute an inverse matrix, and not every matrix has one (the matrix must be square with nonzero determinant); and the matrix $X^TX$ to be inverted is $(n+1) \times (n+1)$, so when there are many features (large $n$) the computation is very expensive. With many feature variables, the normal equation method is not suitable.
Advantages of the gradient descent method: it does not need to compute an inverse matrix, and it remains usable with many features. Disadvantages: a learning rate must be chosen, and iteration is required.
If the number of features is large (say $n$ above 10,000), gradient descent is the better fit; when the feature count is relatively small, use the normal equation method.

NumPy Implementation of the Normal Equation

[Figures: NumPy code building X and y and solving for θ]
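A sketch of the normal equation in NumPy using the worked example above. Note that with 6 parameters but only 4 samples, $X^TX$ is singular, so the pseudo-inverse (np.linalg.pinv) stands in for the plain inverse:

```python
import numpy as np

# Feature matrix from the worked example: x0 = 1 plus the five features.
X = np.array([[1, 3, 1.00, 1185,  5650, 1.0],
              [1, 3, 2.25, 5270,  7242, 2.0],
              [1, 2, 1.00,  770, 10000, 1.0],
              [1, 4, 3.00, 1960,  5000, 1.0]])
y = np.array([221900, 538000, 180000, 604000])

# theta = (X^T X)^(-1) X^T y; pinv is used because X^T X is singular here,
# since there are more parameters than samples.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
print(X @ theta)   # predictions on the training samples
```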

Linear Regression Case

There is the following data: "kc_house_data.csv", we need to train the model on it.

Univariate Linear Regression Analysis

Inspect the data: 21,613 samples and 21 columns. 'id' is the serial number of the house, and 'date' is the time the record was collected; neither affects the price. 'price' is the label and the rest are features. For univariate linear regression, take one feature, such as the living area 'sqft_living'.
[Figures: data inspection, the univariate fit, and the fitted line over the scatter plot]
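A minimal sketch of the univariate fit, assuming kc_house_data.csv is available locally:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('kc_house_data.csv')

X = data[['sqft_living']]   # single feature: living area
y = data['price']           # label

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # theta_0 and theta_1
print(model.score(X, y))               # R^2 of the fit

# Predicted price for a 2000 sqft house.
print(model.predict(pd.DataFrame({'sqft_living': [2000]})))
```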

Multiple Linear Regression Analysis

There are multiple features and a single label, 'price'; the features jointly affect the price.
[Figure: overview of the feature columns]
Select the features with the greatest impact on price to train the model, and visualize them: the relationships between the number of bedrooms, the number of floors, the number of bathrooms, and the price. Since these counts are discrete rather than continuous, draw box plots.
[Figures: box plots of bedrooms, floors, and bathrooms against price]
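A sketch of the box plots with seaborn; the three columns chosen match the ones discussed above:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('kc_house_data.csv')  # same file as above

# One box plot per discrete feature against price.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['bedrooms', 'floors', 'bathrooms']):
    sns.boxplot(x=col, y='price', data=data, ax=ax)
    ax.set_title(f'{col} vs price')
plt.tight_layout()
plt.show()
```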
Analyzing the plots: at 11 and 33 bedrooms there is no meaningful price distribution, so those records are unnecessary; rows with bedroom counts above 6 can all be removed. Price is not in a perfectly linear relationship with some features, and the relationships among the features themselves also need to be considered.
[Figures: scatter plots of feature pairs]
Analyzing the plots: subplots 1 and 2 show little relationship, while subplots 3 and 4 do. As the living area increases, the lot area, the number of bedrooms, and the number of bathrooms all increase, so the living area is strongly correlated with them; the living area has little relationship with the number of floors. A heat map can plot these pairwise relationships.
[Figures: correlation heat map code and output]
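A sketch of the heat map with seaborn; the feature list passed to corr() is an assumption based on the columns discussed:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('kc_house_data.csv')

# Pairwise correlation between price and an assumed subset of features.
cols = ['price', 'sqft_living', 'bedrooms', 'bathrooms',
        'floors', 'grade', 'sqft_living15', 'sqft_lot']
corr = data[cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlGn')  # annotated coefficients
plt.show()
```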
In the heat map, a very light color (for example −0.4) indicates negative correlation; a deep green (for example 0.6) indicates a close positive correlation; a feature's correlation with itself is 1. The living area 'sqft_living' is closely related to price, bathrooms, grade, and sqft_living15. If feature selection keeps only the strongly positively correlated features and completely discards the weaker ones, the trained model will overfit.
When selecting features, pay attention to three things: first, the relationship between feature and label, generally choosing positively correlated features; second, keep some extra features to help prevent overfitting; third, eliminate features that are strongly correlated with each other, such as the housing area and the sqft_living15 area, otherwise there is a lot of redundant computation.
After the features are analyzed, the multiple linear regression model is built. Different feature choices give different training results.

Polynomial Handling

[Figure: polynomial regression code on the housing data]
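A sketch of the polynomial step with a train/test split; the feature list and degree=2 are assumptions, since the original code is only a screenshot:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv('kc_house_data.csv')

# Assumed feature subset; degree=2 keeps the expansion manageable.
features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors', 'grade']
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['price'], test_size=0.2, random_state=0)

poly = PolynomialFeatures(degree=2)
X_train_p = poly.fit_transform(X_train)
X_test_p = poly.transform(X_test)          # reuse the fitted expansion

model = LinearRegression().fit(X_train_p, y_train)
print(model.score(X_train_p, y_train))     # training R^2: relatively high
print(model.score(X_test_p, y_test))       # test R^2: lower -> sign of overfitting
```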
Polynomial training produces a higher score, but the score on the test set is not as high, indicating overfitting: real-world performance is weaker than the training results suggest.

Origin: blog.csdn.net/hwwaizs/article/details/131853488