Principle and Implementation of Logistic Regression

1. The principle and implementation of Logistic Regression

These notes are taken from the book "Mathematics of Vernacular Machine Learning".

Logistic regression is used to solve binary classification problems

1.1 The principle of logistic regression

1.1.1 Sigmoid function

How does the sigmoid function work in neural networks? See my notes "machine learning and AI underlying logic" for details.
A complex nonlinear classification boundary can be approximated by many small line segments. The sigmoid function is the source of these small segments: different sigmoids can be superimposed to produce segments of arbitrary shape, and splicing many such segments together yields the final complex classification boundary. Expressing this process in an abstract way gives a neural network.

The sigmoid function maps the data $\boldsymbol{x}$ into the interval $[0,1]$. Since probabilities also lie in $[0,1]$, the output can be interpreted as a probability and compared with a threshold to decide which class the data belongs to.
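For reference, the sigmoid (logistic) function used throughout these notes, consistent with the `f(x)` defined in the code below, is:

$$
f_{\boldsymbol{\theta}}(\boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^{T}\boldsymbol{x}}}
$$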


$\boldsymbol{x}$ is the data, $y$ is the label (indicating which class), and $\boldsymbol{\theta}$ is an unknown parameter vector (initialized to some value and then optimized through training). $\boldsymbol{\theta}^{T}\boldsymbol{x} = 0$ is the classification boundary.
If the data is linearly separable, the classification boundary is a straight line.
If the data is linearly inseparable, the classification boundary is a curve.

If $\boldsymbol{\theta}^{T}\boldsymbol{x} < 0$, then $f_{\boldsymbol{\theta}}(\boldsymbol{x}) < 0.5$. Since $f_{\boldsymbol{\theta}}(\boldsymbol{x}) = P(y=1\mid\boldsymbol{x})$ lies in $[0, 0.5)$, equivalently $P(y=0\mid\boldsymbol{x}) > 0.5$ (the data is more likely to belong to class 0), the data is judged as belonging to class 0.
If $\boldsymbol{\theta}^{T}\boldsymbol{x} > 0$, then $f_{\boldsymbol{\theta}}(\boldsymbol{x}) > 0.5$. The probability $P(y=1\mid\boldsymbol{x})$ lies in $(0.5, 1]$, so the data is judged as belonging to class 1.
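In other words, the decision rule (implemented later by the `classify` function in the code) can be written as:

$$
\hat{y} =
\begin{cases}
1 & \big(f_{\boldsymbol{\theta}}(\boldsymbol{x}) \ge 0.5\big) \\
0 & \big(f_{\boldsymbol{\theta}}(\boldsymbol{x}) < 0.5\big)
\end{cases}
$$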

Classification boundary as a straight line
Classification boundary as a curve
Example:

The parameters are tuned through training to reach the optimal result.

1.1.2 Likelihood function

For binary classification, the label $y$ takes only the values 0 and 1, satisfying $P(y=0\mid\boldsymbol{x}) + P(y=1\mid\boldsymbol{x}) = 1$.
The ideal relationship between data and labels:
If the label is $y=0$, we want the probability that the data belongs to class 0, $P(y=0\mid\boldsymbol{x}) = 1 - f_{\boldsymbol{\theta}}(\boldsymbol{x})$, to be as large as possible.
If the label is $y=1$, we want the probability that the data belongs to class 1, $P(y=1\mid\boldsymbol{x}) = f_{\boldsymbol{\theta}}(\boldsymbol{x})$, to be as large as possible.
Multiplying together, over all training data, the probability that each example receives its correct label gives the joint probability that all the data are labeled correctly. This joint probability is the likelihood function, which serves as the objective function (cost function) of logistic regression.
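Written out (a standard form of the likelihood for 0/1 labels, consistent with the derivative given below), the likelihood function is:

$$
L(\boldsymbol{\theta}) = \prod_{i=1}^{n} P\big(y^{(i)}=1\mid\boldsymbol{x}^{(i)}\big)^{y^{(i)}}\, P\big(y^{(i)}=0\mid\boldsymbol{x}^{(i)}\big)^{1-y^{(i)}}
$$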
Maximum likelihood estimation finds the parameter values that maximize the likelihood function; minimizing the cost function is equivalent to maximizing the likelihood.
Next, we find the maximum value of the likelihood function.

First, take the logarithm of the likelihood function. Since the logarithm is a monotonically increasing function, it does not change where the likelihood attains its maximum, so maximizing the log-likelihood is equivalent to maximizing the likelihood itself.
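Using $f_{\boldsymbol{\theta}}(\boldsymbol{x}) = P(y=1\mid\boldsymbol{x})$, the log-likelihood (denoted $u$ below) takes the standard form:

$$
u = \log L(\boldsymbol{\theta}) = \sum_{i=1}^{n}\Big(y^{(i)}\log f_{\boldsymbol{\theta}}\big(\boldsymbol{x}^{(i)}\big) + \big(1-y^{(i)}\big)\log\big(1-f_{\boldsymbol{\theta}}\big(\boldsymbol{x}^{(i)}\big)\big)\Big)
$$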

Differentiate the above result with respect to each parameter $\theta_j$:


$$
\frac{\partial u}{\partial \theta_j}=\sum_{i=1}^{n}\big(y^{(i)}-f_{\boldsymbol{\theta}}(\boldsymbol{x}^{(i)})\big)x_j^{(i)}
$$
Use gradient descent to update the parameters iteratively and obtain the optimal parameters.
Parameter update expression:

To make the form match the parameter update expression of ordinary regression, pull a negative sign out of the brackets and write it before the learning rate.
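This gives the update rule actually used in the code below (where $\eta$ is the learning rate `ETA`):

$$
\theta_j := \theta_j - \eta \sum_{i=1}^{n}\big(f_{\boldsymbol{\theta}}\big(\boldsymbol{x}^{(i)}\big)-y^{(i)}\big)\,x_j^{(i)}
$$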

1.3 Implementation of Logistic Regression

Given the number of pixels in the horizontal and vertical directions of an image as input, determine whether the image is portrait or landscape.

Portrait image
Landscape image

1.3.1 Linearly Separable Data

$x_1$ is the number of horizontal pixels, $x_2$ is the number of vertical pixels, and $y$ is the label: $y=0$ means the image is portrait, $y=1$ means the image is landscape.

import numpy as np
import matplotlib.pyplot as plt

# Read in the training data
train = np.loadtxt('~/Downloads/sourcecode-cn/images2.csv', delimiter=',', skiprows=1)
train_x = train[:,0:2]
train_y = train[:,2]
# Initialize the parameters
theta = np.random.rand(3)
# Standardization (compute mean and standard deviation)
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

Standardize the variable: compute the mean $\mu$ and standard deviation $\sigma$ of all values of $x$, and standardize each value with the formula $z = \dfrac{x - \mu}{\sigma}$.

def standardize(x):
    return (x - mu) / sigma

train_z = standardize(train_x)
# Add x0 (the bias term)
def to_matrix(x):
    x0 = np.ones([x.shape[0], 1])
    return np.hstack([x0, x])

X = to_matrix(train_z)

Plot the standardized training data:

plt.plot(train_z[train_y == 1, 0], train_z[train_y == 1, 1], 'o')
plt.plot(train_z[train_y == 0, 0], train_z[train_y == 0, 1], 'x')
plt.show()

Linearly separable data

# Sigmoid function
def f(x):
    return 1 / (1 + np.exp(-np.dot(x, theta)))

# Classification function
def classify(x):
    return (f(x) >= 0.5).astype(int)  # Treat probabilities of 0.5 or more as class 1
# Learning rate
ETA = 1e-3

# Number of repetitions (epochs)
epoch = 5000

# Update count
count = 0

We set the number of repetitions (epochs) a little high, for example around 5000. In practical problems this value has to be found by trial and error, that is, by checking the accuracy during learning to decide how many repetitions are enough.

# Repeat learning
for _ in range(epoch):
    theta = theta - ETA * np.dot(f(X) - train_y, X)

    # Log output
    count += 1
    print('Iteration {}: theta = {}'.format(count, theta))

The parameter update expression in the loop body above is the matrix (vectorized) form of the update rule given earlier. For the derivation of the matrix form, please refer to my blog posts: The principle and implementation of polynomial regression, and Principles of Multiple Regression.
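As a sketch of that matrix form (with $X$ the design matrix, $\boldsymbol{y}$ the label vector, and $f_{\boldsymbol{\theta}}(X)$ applied row-wise), the loop body computes the full gradient in one step, written in the code as `np.dot(f(X) - train_y, X)`:

$$
\boldsymbol{\theta} := \boldsymbol{\theta} - \eta\, X^{T}\big(f_{\boldsymbol{\theta}}(X) - \boldsymbol{y}\big)
$$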

After learning is completed, the optimal parameter values are obtained, and substituting them into the classification boundary expression gives the decision boundary to plot.
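Concretely, solving the boundary $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ for $x_2$ gives the line evaluated by the plotting code below:

$$
x_2 = -\frac{\theta_0 + \theta_1 x_1}{\theta_2}
$$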

# Plot to confirm
x0 = np.linspace(-2, 2, 100)
plt.plot(train_z[train_y == 1, 0], train_z[train_y == 1, 1], 'o')
plt.plot(train_z[train_y == 0, 0], train_z[train_y == 0, 1], 'x')
plt.plot(x0, -(theta[0] + theta[1] * x0) / theta[2], linestyle='dashed')
plt.show()


Use the sigmoid function to verify the prediction results.
f(x) returns the probability that the image x is landscape.

The probability that image 1 is landscape is 91.7%, and the probability that image 2 is landscape is 2.9%.

Use the classification function to verify the prediction results.

Image 1 is class 1, i.e. a landscape image; image 2 is class 0, i.e. a portrait image.
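As a sketch, the verification described above might look like the following (the pixel sizes 200×100 and 100×200 are hypothetical examples, not necessarily the images behind the percentages quoted above):

# Hypothetical test images: [width, height] = [200, 100] (landscape) and [100, 200] (portrait)
test_x = np.array([[200, 100], [100, 200]])
print(f(to_matrix(standardize(test_x))))         # Probability that each image is landscape
print(classify(to_matrix(standardize(test_x))))  # Predicted class: 1 = landscape, 0 = portrait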

1.3.2 Linearly inseparable data

$x_1$ is the number of horizontal pixels, $x_2$ is the number of vertical pixels, and $y$ is the label: $y=0$ means the image is portrait, $y=1$ means the image is landscape.

import numpy as np
import matplotlib.pyplot as plt

# Read in the training data
train = np.loadtxt('~/Downloads/sourcecode-cn/data3.csv', delimiter=',', skiprows=1)
train_x = train[:,0:2]
train_y = train[:,2]
# Initialize the parameters
theta = np.random.rand(4)

# Standardization (compute mean and standard deviation)
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

def standardize(x):
    return (x - mu) / sigma

train_z = standardize(train_x)
plt.plot(train_z[train_y == 1, 0], train_z[train_y == 1, 1], 'o')
plt.plot(train_z[train_y == 0, 0], train_z[train_y == 0, 1], 'x')
plt.show()

# Add x0 (bias term) and x3 (= x1 squared)
def to_matrix(x):
    x0 = np.ones([x.shape[0], 1])
    x3 = x[:,0,np.newaxis] ** 2
    return np.hstack([x0, x, x3])

X = to_matrix(train_z)


Decision boundary expression: $\boldsymbol{\theta}^{T}\boldsymbol{x} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 = 0$

# Sigmoid function
def f(x):
    return 1 / (1 + np.exp(-np.dot(x, theta)))

# Classification function
def classify(x):
    return (f(x) >= 0.5).astype(int)
# Learning rate
ETA = 1e-3

# Number of repetitions (epochs)
epoch = 5000

# Update count
count = 0

We set the number of repetitions (epochs) a little high, for example around 5000. In practical problems this value has to be found by trial and error, that is, by checking the accuracy during learning to decide how many repetitions are enough.

# Repeat learning
for _ in range(epoch):
    theta = theta - ETA * np.dot(f(X) - train_y, X)

    # Log output
    count += 1
    print('Iteration {}: theta = {}'.format(count, theta))

The parameter update expression in the loop body above is again the matrix (vectorized) form of the update rule derived earlier, exactly as in the linearly separable case. For the derivation, please refer to my blog posts: The principle and implementation of polynomial regression, and Principles of Multiple Regression.

Iterative results

After learning is completed, the optimal parameter values are obtained, and substituting them into the classification boundary expression gives the curve to plot.
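Solving the boundary $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 = 0$ for $x_2$ gives the curve evaluated by the plotting code below:

$$
x_2 = -\frac{\theta_0 + \theta_1 x_1 + \theta_3 x_1^2}{\theta_2}
$$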

# Plot to confirm
x1 = np.linspace(-2, 2, 100)
x2 = -(theta[0] + theta[1] * x1 + theta[3] * x1 ** 2) / theta[2]
plt.plot(train_z[train_y == 1, 0], train_z[train_y == 1, 1], 'o')
plt.plot(train_z[train_y == 0, 0], train_z[train_y == 0, 1], 'x')
plt.plot(x1, x2, linestyle='dashed')
plt.show()

$x_1$ is the horizontal axis, $x_2$ is the vertical axis.

To verify the model, plot the number of iterations on the horizontal axis and the accuracy on the vertical axis.

# Initialize the parameters
theta = np.random.rand(4)
# History of accuracy values
accuracies = []
# Repeat learning
for _ in range(epoch):
    theta = theta - ETA * np.dot(f(X) - train_y, X)
    # Compute the current accuracy
    result = classify(X) == train_y
    accuracy = len(result[result == True]) / len(result)  # Model evaluation metric: accuracy
    accuracies.append(accuracy)

# Plot the accuracy
x = np.arange(len(accuracies))
plt.plot(x, accuracies)
plt.show()

As the figure shows, the accuracy approaches 1 as the number of iterations increases; it essentially reaches 1 at around 900 iterations, so the epoch count can be set to about 1000.
The process above uses (batch) gradient descent, which uses all the training data to optimize the objective function.
If stochastic gradient descent (which uses one training example at a time) is used instead, only the loop body in the learning step needs to be modified.
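Under stochastic gradient descent, one randomly chosen training example $k$ is used per update, so the update rule (the per-example version of the batch rule above, matching the loop body below) becomes:

$$
\theta_j := \theta_j - \eta\,\big(f_{\boldsymbol{\theta}}\big(\boldsymbol{x}^{(k)}\big) - y^{(k)}\big)\,x_j^{(k)}
$$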

# Repeat learning
for _ in range(epoch):
    p = np.random.permutation(X.shape[0])
    for x, y in zip(X[p,:], train_y[p]):
        theta = theta - ETA * (f(x) - y) * x


Reprinted from: blog.csdn.net/weixin_48524215/article/details/131350759