Interpreting the Principles of Linear Models (Machine Learning Must-Read 01)

1. Linear regression

Linear regression is a supervised learning algorithm. Regression problems focus on the relationship between the dependent variable (the value to be predicted, which can be one or more) and one or more numerical independent variables (predictor variables).

  1. The value to be predicted: the target variable (target, y), a continuous value.
  2. Factors affecting the target variable: the independent variables (predictor variables), which can be continuous or discrete values.
  3. The relationship between the dependent variable and the independent variables: the model, which is what we want to solve for.

1.1 Simple Linear Regression

As mentioned earlier, an algorithm is, in essence, a formula. Simple linear regression is an algorithm together with its formula:

        $y = wx + b$

In this formula, y is the target variable (the value to be predicted), x is the factor that affects y, and w and b are the parameters of the formula, i.e., the model we want to find. In fact, b is the intercept and w is the slope. Clearly, once the model has been determined, the only unknown that affects the value of y is the single variable x. In other words, there is only one factor influencing y, which is why this is called simple linear regression.
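As a tiny illustration (w and b here are made-up numbers), the prediction is just the line evaluated at each x:

import numpy as np

w, b = 2, 1                    # hypothetical slope and intercept
x = np.array([0, 1, 2, 3])     # some input values
y_hat = w * x + b              # predictions on the line y = wx + b
print(y_hat)                   # [1 3 5 7]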

1.2 Optimal solution

  • Actual value: the true value, generally denoted by y.
  • Predicted value: obtained by substituting the known x into the formula together with the guessed parameters w and b; generally denoted by $\hat{y}$.
  • Error: the difference between the predicted value and the true value, generally denoted by $\varepsilon$.
  • Optimal solution: the model that makes the overall error as small as possible; the overall error is usually called the loss.
  • Loss: the overall error, computed through the loss function (see the sketch after this list).
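A minimal sketch of these terms with made-up data and guessed parameters w and b (all values are hypothetical); the squared-error form of the loss used here is the one derived later in Section 2:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.2, 4.9, 7.1, 9.0])   # actual values
w, b = 2.0, 1.0                      # guessed parameters
y_hat = w * x + b                    # predicted values
errors = y - y_hat                   # per-sample error ε
loss = 0.5 * np.sum(errors ** 2)     # overall error (loss), squared-error form
print(errors, loss)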

1.3 Multiple linear regression

In real life, there is often more than one factor that affects the result y. In that case x changes from one variable to n variables, $X_1, \dots, X_n$, and the formula of simple linear regression no longer applies. The formula of multiple linear regression is as follows:

        $\hat{y} = w_1X_1 + w_2X_2 + \dots + w_nX_n + b$

Expressed with vectors, where the intercept b is absorbed into W by appending a constant feature 1 to X:

        $\hat{y} = W^TX$
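A minimal NumPy sketch of this vector form (all numbers are made up): a constant column of 1s is appended to X so that b becomes the last entry of W, and every prediction is computed in one matrix product.

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                            # 3 samples, 2 features
W = np.array([0.5, -1.0, 2.0])                        # [w1, w2, b]
X_aug = np.concatenate([X, np.ones((3, 1))], axis=1)  # append the constant 1 column for b
y_hat = X_aug.dot(W)                                  # all predictions at once
print(y_hat)                                          # [ 0.5 -0.5 -1.5]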

2. Gaussian function

2.1 Normal distribution

Normal Distribution, also known as Gaussian Distribution, is very useful in practice because many natural phenomena and human behaviors approximately follow it: height, weight, IQ, measurement error and so on can all be described by a normal distribution. Many parameter estimation and hypothesis testing methods in statistics are based on the normality assumption. In statistical modeling, it is usually assumed that the error between each linear model's prediction and the correct value follows a normal distribution. Based on this assumption, the weights of a linear model can be estimated by finding the values that make those errors most likely, i.e., that minimize the error. This helps the model fit the data better and make predictions. Key features:

  1. Symmetry: The normal distribution is a symmetrical distribution with the mean, median and mode located at the center of the distribution, which is the peak of the distribution.

  2. Central Tendency: A normal distribution has a central tendency, with data points being more likely to be close to the mean, with the probability decreasing further away from the mean.

  3. Definition: The normal distribution is determined by two parameters, the mean (μ) and the variance (σ^2), which determine the center and dispersion of the distribution.

  4. Standard Normal Distribution: When the mean is 0 and the variance is 1, the normal distribution is called the Standard Normal Distribution. Probabilities for the standard normal distribution can be looked up in the standard normal distribution table.

  5. Classic bell-shaped curve: The probability density function of the normal distribution presents a typical bell-shaped curve, with the tails gradually decreasing on both sides and reaching a peak at the mean.

The probability density function (PDF) of the normal distribution is:

        $f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
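A minimal sketch that evaluates this density directly from the formula and plots the bell curve (the values of μ and σ are assumptions chosen for illustration):

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0.0, 1.0                                  # assumed mean and standard deviation
x = np.linspace(-4, 4, 200)
pdf = 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
plt.plot(x, pdf)                                      # the classic bell-shaped curve
plt.show()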

2.2 Error analysis

Assume that the errors of all samples are independent and oscillate up and down randomly; each oscillation can be treated as a random variable. When enough such random variables are superimposed, the resulting distribution obeys the normal distribution, the distribution of the "normal state", i.e., the Gaussian distribution. Its mean is some fixed value and its variance is some fixed value. We do not care about the variance for now, and we can always make the mean equal to zero because the model has an intercept b. We can therefore regard every error $\varepsilon_i$, $1 \le i \le n$, as independently distributed, obeying a Gaussian distribution with mean 0 and a constant variance. In machine learning, we assume that the error follows a normal distribution with a mean of 0 and a constant variance!

        $\varepsilon_i = |y_i - \hat{y}_i|$

The normal distribution formula:

        $f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

As the parameters μ and σ change, the probability distribution also changes. The next important step is to determine the total likelihood of the errors of a set of data. Because each error is assumed to obey a Gaussian distribution, and the intercept term can translate the overall distribution so that μ = 0, the likelihood of a single sample's error can be expressed as the value of its probability density function:

        $f(\varepsilon|\mu = 0,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(\varepsilon - 0)^2}{ 2\sigma^2}}$

Since the mean is 0, the μ term can be dropped and the error density simplifies to:

        $f(\varepsilon| 0,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\varepsilon ^2}{2\sigma^2}} $

2.3 Total likelihood of error

Because the errors are independent, the total likelihood is the product of the individual densities:

        $P = \prod\limits_{i = 0}^{n}f(\varepsilon_i|0,\sigma^2) = \prod\limits_{i = 0}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{\varepsilon_i ^2}{2\sigma^2}}$

According to the previous formula $\varepsilon_i = |y_i - W^Tx_i|$ the following formula can be derived:

        $P = \prod\limits_{i = 0}^{n}f(\varepsilon_i|0,\sigma^2) = \prod\limits_{i = 0}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i - W^Tx_i)^2}{2\sigma^2}}$

The only unknown in this formula is W, the vector of coefficients of the equation (which includes the intercept). If the expression above is viewed as a function, it is a function of the probability P with respect to W; all the remaining symbols are constants!

        $P_W= \prod\limits_{i = 0}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i - W^Tx_i)^2}{2\sigma^2}}$
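One practical reason for the logarithm introduced in the next step: each density value is below 1, so the product of many of them underflows to 0 in floating point, while the sum of their logarithms stays finite. A minimal sketch with simulated errors and an assumed σ:

import numpy as np

np.random.seed(0)
sigma = 2.0                                            # assumed standard deviation
eps = np.random.randn(2000) * sigma                    # simulated errors
pdf = 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-eps ** 2 / (2 * sigma ** 2))
print(np.prod(pdf))                                    # underflows to 0.0
print(np.sum(np.log(pdf)))                             # finite log-likelihood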

Taking the logarithm turns the product into a sum:

        $log_e(P_W) = log_e(\prod\limits_{i = 0}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i - W^Tx_i)^2}{2\sigma^2}})$

Simplify:

                        $=\sum\limits_{i = 0}^{n}log_e(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i - W^Tx_i)^2}{2\sigma^2}})$

                        $=\sum\limits_{i = 0}^{n}(log_e\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}(y_i - W^Tx_i)^2)$

The formula above is a transformation of the logarithm of the maximum likelihood, in which $\pi$ and $\sigma$ are constants and $(y_i - W^Tx_i)^2$ is always non-negative. The problem of maximizing the likelihood can therefore be transformed into the following minimization problem:

        $L(W) = \frac{1}{2}\sum\limits_{i = 0}^n(y^{(i)} - W^Tx^{(i)})^2$

L stands for Loss and denotes the loss function: the smaller the loss, the larger the maximum likelihood above.
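A minimal numerical check of this claim on made-up data, comparing two candidate weight vectors (true_W, W_bad, and the noise level sigma are assumptions of this sketch): the one with the smaller loss has the larger log-likelihood.

import numpy as np

np.random.seed(1)
X = np.concatenate([np.random.rand(50, 2), np.ones((50, 1))], axis=1)  # features plus intercept column
true_W = np.array([3.0, -2.0, 1.0])
y = X.dot(true_W) + np.random.randn(50) * 0.5                          # noisy targets
sigma = 0.5                                                            # assumed noise level

def loss(W):
    r = y - X.dot(W)
    return 0.5 * np.sum(r ** 2)                                        # L(W)

def log_likelihood(W):
    r = y - X.dot(W)
    return np.sum(-np.log(np.sqrt(2 * np.pi) * sigma) - r ** 2 / (2 * sigma ** 2))

W_bad = np.array([1.0, 1.0, 1.0])
print(loss(true_W) < loss(W_bad))                      # True: smaller loss ...
print(log_likelihood(true_W) > log_likelihood(W_bad))  # True: ... larger likelihood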

In some books the same formula is written using $J(\theta)$, where $\theta$ plays the role of W:

        $J(\theta) = \frac{1}{2}\sum\limits_{i = 1}^n(y^{(i)} - \theta^Tx^{(i)})^2$

                 $ = \frac{1}{2}\sum\limits_{i = 1}^n(\theta^Tx^{(i)} - y^{(i)})^2$

Further derivation:

        $J(\theta) = \frac{1}{2}\sum\limits_{i = 1}^n(h_{\theta}(x^{(i)}) - y^{(i)})^2$

where:

  $\hat{y} = h_{\theta}(X) = X\theta$ refers to all the data at once: X is a matrix containing multiple samples, so in the matrix multiplication it is placed in front;

  $\hat{y}_i = h_{\theta}(x^{(i)}) = \theta^Tx^{(i)}$ refers to the i-th sample, which is a vector, so one of the factors has to be transposed when multiplying.

Because there is a negative sign in front of the squared term in the log-likelihood, maximizing the total likelihood becomes minimizing the part after the negative sign. At this point we have derived the MSE loss function $J(\theta)$, and from the formula we can also see where the name MSE (mean squared error) comes from. Minimizing this formula is also known as the method of least squares!

        This least squares estimate rests on the assumption that the error obeys a normal distribution, so our linear regression can be called least-squares linear regression. If we instead assume the error obeys some other distribution and use that distribution's probability density function to derive the loss function, we obtain other members of the generalized linear family; for example, assuming the error obeys a Poisson distribution leads to Poisson regression. Logistic regression and Poisson regression are both types of generalized linear regression, so linear regression can also be regarded as a special case of generalized linear regression.

3. Normal equation

3.1 Least squares matrix representation

The least squares method transforms the problem of minimizing the error into a system of algebraic equations with a definite solution (the number of equations is exactly equal to the number of unknowns), from which the unknown parameters can be solved. This system of algebraic equations with a definite solution is called the normal equation of the least squares estimate. The formula is as follows:

        $\theta = (X^TX)^{-1}X^Ty$  or $W = (X^TX)^{-1}X^Ty$

The least squares formula is as follows:

        $J(\theta) = \frac{1}{2}\sum\limits_{i = 0}^n(h_{\theta}(x_i) - y_i)^2$

In matrix form:

        $J(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y)$
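A minimal sketch checking that the matrix form evaluates to the same number as the summation form (the data and θ below are random and purely illustrative):

import numpy as np

np.random.seed(2)
X = np.random.randn(10, 3)
y = np.random.randn(10)
theta = np.random.randn(3)

J_sum = 0.5 * np.sum((X.dot(theta) - y) ** 2)          # summation form
r = X.dot(theta) - y
J_matrix = 0.5 * r.dot(r)                              # (Xθ - y)^T (Xθ - y) / 2
print(np.isclose(J_sum, J_matrix))                     # True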

3.2 Matrix transpose and derivative formulas

3.2.1 Formula expansion

$J(\theta) = \frac{1}{2}(\theta^TX^TX\theta - \theta^TX^Ty -y^TX\theta + y^Ty)$

3.2.2 Take the derivative

$J'(\theta) = \frac{1}{2}(\theta^TX^TX\theta - \theta^TX^Ty -y^TX\theta + y^Ty)'$

3.2.3 Simplify the derivative and set it to zero

$J'(\theta) = X^T(X\theta - y)$. Setting the derivative $J'(\theta) = 0$ gives $X^TX\theta = X^Ty$.

3.2.4 Multiply both sides by the inverse matrix

$(X^TX)^{-1}X^TX\theta = (X^TX)^{-1}X^Ty$

This completes the derivation:

$\theta = (X^TX)^{-1}X^Ty$
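A minimal sketch verifying the result on random data: at θ = (XᵀX)⁻¹Xᵀy the derivative Xᵀ(Xθ − y) should vanish (up to floating-point error).

import numpy as np

np.random.seed(3)
X = np.random.randn(100, 4)
y = np.random.randn(100)

theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)  # normal equation
gradient = X.T.dot(X.dot(theta) - y)               # J'(θ) = X^T(Xθ - y)
print(np.allclose(gradient, 0))                    # True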

4. Calculation using code

4.1 Simple linear regression

        $y = wx + b$

4.1.1 Data generation

import numpy as np
import matplotlib.pyplot as plt

# Build X as a matrix (column vector)
X = np.linspace(0, 10, num=30).reshape(-1, 1)
# Randomly generate the slope and intercept
w = np.random.randint(1, 5, size=1)
b = np.random.randint(1, 10, size=1)
# Compute the target y from the linear equation and add "noise" so the data fluctuates
y = X * w + b + np.random.randn(30, 1)
print(f'w:{w}, b:{b}')  # e.g. w:[4], b:[1]
plt.scatter(X, y)  # draw a scatter plot

4.1.2 Solving the normal equation

Solution formula:

        $\theta = (X^TX)^{-1}X^Ty$  

np.concatenate is a function in the NumPy library for array splicing and concatenation:

plt.scatter(X, y)
# Rebuild X: append a constant column of 1s that acts as the feature for the intercept b,
# which makes the matrix computation convenient
"""np.concatenate is a NumPy function for joining arrays"""
X = np.concatenate([X, np.full(shape=(30, 1), fill_value=1)], axis=1)

# Solve with the normal equation
"""
1. np.linalg.inv is NumPy's linear algebra function for computing the inverse of a matrix
2. np.dot() performs matrix multiplication
3. The .T attribute returns the transpose of a matrix
4. .round(2) rounds numbers to the given number of decimal places
"""
a = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y).round(2)
print('Normal equation solution:', a)  # e.g. [[3.98] [3.88]]
plt.plot(X[:, 0], X.dot(a), color='green')
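As a small cross-check (not part of the original example, reusing X and y from the block above), np.linalg.lstsq solves the same least squares problem without explicitly forming the matrix inverse, which is generally more numerically stable; the rounded result should match a:

# Cross-check the normal-equation result with NumPy's least squares solver
a2, *_ = np.linalg.lstsq(X, y, rcond=None)
print('np.linalg.lstsq result:', a2.round(2).reshape(-1))  # should match a, e.g. [3.98 3.88]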

4.2 Multiple linear regression solution

        $y = w_1x_1 + w_2x_2 + b$

        $\theta = (X^TX)^{-1}X^Ty$

Solution formula: θ = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y).round(2)

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D  # for 3D plotting
# Generate the two features as matrices (column vectors)
x1 = np.random.randint(-150, 150, size=(300, 1))
x2 = np.random.randint(0, 300, size=(300, 1))
# Randomly generate the slopes and intercept
w = np.random.randint(1, 5, size=2)
b = np.random.randint(1, 10, size=1)
# Compute the target y from the two-variable linear equation and add "noise"
y = x1 * w[0] + x2 * w[1] + b + np.random.randn(300, 1)
fig = plt.figure(figsize=(15, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(x1, x2, y)  # 3D scatter plot
ax.view_init(elev=10, azim=-20)  # adjust the viewing angle
# Rebuild X: merge x1, x2 and a constant column of 1s whose coefficient is the intercept b
X = np.concatenate([x1, x2, np.full(shape=(300, 1), fill_value=1)], axis=1)
w = np.concatenate([w, b])
# Solve with the normal equation
θ = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y).round(2)
print('True slopes and intercept of the two-variable equation:', w)
print('Slopes and intercept solved by the normal equation:', θ.reshape(-1))
# Plot the fitted regression line using the solved slopes and intercept
# (new variable names so the target y above is not overwritten)
x_line = np.linspace(-150, 150, 100)
y_line = np.linspace(0, 300, 100)
z_line = x_line * θ[0] + y_line * θ[1] + θ[2]
ax.plot(x_line, y_line, z_line, color='red')
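To tie this back to the loss function derived earlier, the training loss $J(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y)$ of the fitted parameters can be evaluated directly (a small addition reusing X, y and θ from the block above):

# Value of the loss J(θ) at the parameters found by the normal equation
residual = X.dot(θ) - y
print('Training loss J(θ):', 0.5 * np.sum(residual ** 2))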


Source: blog.csdn.net/March_A/article/details/134090536