Logistic regression: principle and NumPy-based implementation


1. Logistic regression principle

Logistic regression is an algorithm that solves classification problems using regression ideas. It takes the value output by a linear regression model, applies a transformation to it, and converts it into a classification label.

The difference between classification problems and regression problems

Tasks can be divided into regression tasks and classification tasks according to the type of output. A classification problem outputs a discrete value: for example, when identifying cats and dogs, the output can only be "cat" or "dog". A regression problem outputs a continuous value: for example, predicting a person's weight from certain features, where weight is continuous.

Question to consider: the output of a linear model is a continuous value, so how can it be connected to a classification problem?

Find a monotonic, differentiable function that links the true label $y$ of the classification task to the prediction of the linear regression model. The output $z = \theta^T x + b$ of the linear model is real-valued; the log-odds function is used to convert the real value $z$ into a 0/1 label.

Log-odds function (sigmoid function)

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

[Figure: the S-shaped sigmoid curve, mapping real inputs into the interval $(0, 1)$]
$z = \theta^T x + b$ is the output of linear regression.
$x$ and $\theta$ are vectors of the same dimension: $\theta$ has one component for each feature of $x$.
If the output value of the sigmoid function is 0.5 or greater, the sample is judged to be the positive class; if it is below 0.5, it is judged to be the negative class.
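
As a minimal sketch (variable names are illustrative, not from the original post), the sigmoid squashes any real $z$ into $(0, 1)$, and thresholding at 0.5 yields the class label:

import numpy as np

def sigmoid(z):
    # maps any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])     # example linear-model outputs
p = sigmoid(z)                     # approx. [0.047, 0.5, 0.953]
labels = (p >= 0.5).astype(int)    # [0, 1, 1]
print(p, labels)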

Cost function

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

When $y=1$, the closer the output $h$ is to 1, the smaller the cost, and the further it is from 1, the larger the cost. When $y=0$, the closer $h$ is to 0, the smaller the cost, and the further it is from 0, the larger the cost.

Since $y$ only takes the values 0 or 1, the two cases can be combined into a single formula:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
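
To see the combined formula in action, here is a small sketch (the function name cross_entropy is mine) that averages the cost over a toy batch of predicted probabilities:

import numpy as np

def cross_entropy(h, y):
    # average logistic cost over m samples; h holds predicted probabilities
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1, 0, 1])
h = np.array([0.9, 0.1, 0.8])
print(cross_entropy(h, y))  # approx. 0.145: confident correct predictions give a low cost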

Parameter update: gradient descent

For a given parameter $\theta_j$, compute the partial derivative $g$ of the loss with respect to it, then update $\theta_j \leftarrow \theta_j - \alpha \cdot g$, where $\alpha$ is the learning rate, a manually set hyperparameter. If $\alpha$ is too large, the model may fail to converge; if it is too small, convergence is too slow, so an appropriate value must be chosen.
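
A minimal sketch of one such update step, with illustrative values assumed for the gradient and learning rate:

alpha = 0.1   # learning rate, a hand-tuned hyperparameter
theta = 0.5   # current value of one parameter
g = 0.8       # assumed partial derivative of the loss w.r.t. theta
theta = theta - alpha * g   # theta moves against the gradient, becoming 0.42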

Derivative of the sigmoid function

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
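
This identity can be checked numerically; the sketch below (my own check, not from the original post) compares a central finite difference against $\sigma(z)(1-\sigma(z))$ at an arbitrary point:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference estimate
analytic = sigmoid(z) * (1 - sigmoid(z))                     # closed-form derivative
print(abs(numeric - analytic) < 1e-9)                        # True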

Derivation of the gradient of the cost with respect to the parameters

For convenience of calculation, a component equal to 1 is prepended to $x$; the matching component of $\theta$ then represents the bias after multiplication. So $z = \theta^T x + b$ becomes $z = \theta^T x$: $x$ and $\theta$ each gain one extra component, and the product absorbs the bias value $b$, as sketched below.
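
A tiny illustration of this bias-absorption trick (values are made up):

import numpy as np

x = np.array([2.0, 5.0])             # original features
x_aug = np.r_[1.0, x]                # prepend a 1 to x
theta = np.array([0.3, -0.2, 0.4])   # theta[0] now plays the role of the bias b
z = theta.dot(x_aug)                 # equals theta[1:].dot(x) + theta[0] = 1.9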

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
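
In vectorized form, the result of this derivation can be computed in one line; this sketch (the function name gradient is mine, distinct from the loop-based code below) uses $\nabla J = \frac{1}{m} X^T (h - y)$:

import numpy as np

def gradient(thetas, X, y):
    # dJ/dtheta = (1/m) * X^T (h - y), the result of the derivation above
    h = 1.0 / (1.0 + np.exp(-X.dot(thetas)))
    return X.T.dot(h - y) / len(y)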

2. Code implementation

from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def sigmoid(x):
    # branch on the sign of x so np.exp never receives a large
    # positive argument, which would overflow
    if x > 0:
        return 1 / (1 + np.exp(-x))
    else:
        return np.exp(x) / (1 + np.exp(x))

# hypothesis function h(x) = sigmoid(theta^T x)
def fun_h(thetas, x):
    z = np.dot(thetas, x)
    h = sigmoid(z)
    return h

class model():
    def __init__(self, num_iters=20, lr=0.1):
        """
        :param num_iters: number of iterations
        :param lr: learning rate
        """
        self.num_iters = num_iters
        self.lr = lr

    def fit(self, X, Y):
        """
        :param X: training-set feature matrix
        :param Y: labels
        :return:
        """
        # initialize the parameter vector theta with zeros
        self.thetas = np.zeros(X.shape[1])
        m = len(X)
        for k in range(self.num_iters):
            for j in range(len(self.thetas)):
                d_thetaj = 0
                loss = 0
                for i in range(m):
                    h = fun_h(self.thetas, X[i])
                    loss += -(Y[i] * np.log(h) + (1 - Y[i]) * np.log(1 - h))
                    d_thetaj += (h - Y[i]) * X[i][j]
                loss /= m
                d_thetaj /= m
                self.thetas[j] -= self.lr * d_thetaj

            print("iter:%d,loss:%f" % (k, loss))

    # decide the class label for a single sample
    def predict(self, x):
        h = fun_h(self.thetas, x)
        if h >= 0.5:
            return 1
        else:
            return 0

    # compute accuracy on the test set
    def score(self, x_test, y_test):
        num_correct = 0
        for x, y in zip(x_test, y_test):
            y_hat = self.predict(x)
            if y_hat == y:
                num_correct += 1
        return num_correct / len(x_test)

if __name__ == '__main__':
    cancer = datasets.load_breast_cancer()

    data = cancer.data
    target = cancer.target

    # standardize the features
    std = StandardScaler()
    data = std.fit_transform(data)

    # prepend a column of ones so the first component of theta acts as the bias
    ones = np.ones((len(data), 1))
    data = np.c_[ones, data]

    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=0)

    clf = model(30, 0.1)
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    print("acc:", score)


Run output:
iter:0,loss:0.545435
iter:1,loss:0.458332
iter:2,loss:0.401712
iter:3,loss:0.361705
iter:4,loss:0.331730
iter:5,loss:0.308293
iter:6,loss:0.289367
iter:7,loss:0.273695
iter:8,loss:0.260458
iter:9,loss:0.249093
iter:10,loss:0.239204
iter:11,loss:0.230502
iter:12,loss:0.222772
iter:13,loss:0.215849
iter:14,loss:0.209604
iter:15,loss:0.203937
iter:16,loss:0.198765
iter:17,loss:0.194023
iter:18,loss:0.189656
iter:19,loss:0.185618
iter:20,loss:0.181872
iter:21,loss:0.178386
iter:22,loss:0.175131
iter:23,loss:0.172085
iter:24,loss:0.169227
iter:25,loss:0.166539
iter:26,loss:0.164005
iter:27,loss:0.161612
iter:28,loss:0.159348
iter:29,loss:0.157202
acc: 0.956140350877193

Process finished with exit code 0

The loss decreases steadily, and the final test-set accuracy reaches a solid 95.6%.
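
As an optional sanity check (not part of the original post), the same standardized data from the script above can be fed to sklearn's LogisticRegression, which should reach a comparable accuracy:

from sklearn.linear_model import LogisticRegression

clf_sk = LogisticRegression(max_iter=1000)
clf_sk.fit(x_train, y_train)
print("sklearn acc:", clf_sk.score(x_test, y_test))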

Summary

This article briefly introduced the principle of logistic regression and a NumPy implementation trained on the sklearn breast cancer dataset.

Reprinted from: blog.csdn.net/weixin_44599230/article/details/121442118