Andrew Ng's "Machine Learning" - Logistic regression code implementation

The data set and source files can be obtained from the GitHub project
link: https://github.com/Raymond-Yang-2001/AndrewNg-Machine-Learing-Homework

1. Sigmoid and binary classification

Unlike linear regression, logistic regression, despite its name, is usually used for classification. The only difference from linear regression is that a Sigmoid function is applied to the output of the linear model, so that the output represents a class probability rather than a predicted value. Why the Sigmoid function can play this role is explained next.

Sigmoid function

The mathematical expression of the Sigmoid function is as follows:
$$\sigma(x)=\frac{1}{1+e^{-x}}$$
The Sigmoid function is monotonically increasing. The larger $x$ is, the closer $\sigma(x)$ is to 1; the smaller $x$ is, the closer $\sigma(x)$ is to 0. When $x=0$, $\sigma(x)=0.5$. Its function curve is as follows:
[Figure: curve of the Sigmoid function]
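
As a quick numerical check of these properties, here is a minimal sketch. The input values in `xs` are illustrative choices, not taken from the original post.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


# Illustrative inputs (assumed for this sketch)
xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
ys = sigmoid(xs)

print(ys)
# The middle entry is sigmoid(0); the differences show the outputs increase with x
print(ys[2], np.all(np.diff(ys) > 0))
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and every output stays strictly between 0 and 1.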

Why can the Sigmoid function represent binary classification probability?

Here we first review the Bernoulli distribution.

A random variable $X$ follows a Bernoulli distribution with parameter $p$ $(0<p<1)$ if it takes the values 1 and 0 with probabilities $p$ and $1-p$, respectively. Its mean and variance are $E(X)=p$ and $D(X)=p(1-p)$. The number of successes in a single Bernoulli trial follows the Bernoulli distribution, where the parameter $p$ is the probability of success. The Bernoulli distribution is a discrete probability distribution and is the special case of the binomial distribution with $N=1$.
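
The mean and variance can be verified directly from the definition. The sketch below assumes an illustrative value `p = 0.3`; it computes the expectation and variance by summing over the two possible outcomes.

```python
import numpy as np

p = 0.3  # illustrative success probability (assumed for this sketch)
values = np.array([0.0, 1.0])   # possible outcomes of X
probs = np.array([1 - p, p])    # their probabilities

mean = np.sum(values * probs)               # E(X)
var = np.sum((values - mean) ** 2 * probs)  # D(X)

print(mean, var, p * (1 - p))
```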

For the Bernoulli distribution with success probability $\mu$, the probability of observing $x\in\{0,1\}$ is $p=\mu^{x}(1-\mu)^{(1-x)}$. Transforming this equation, we have
$$p=e^{x\ln{\mu}+(1-x)\ln{(1-\mu)}}=e^{x\ln{\frac{\mu}{1-\mu}}+\ln{(1-\mu)}}$$
Next, we use the exponential family of distributions to represent the Bernoulli distribution.

The exponential family of distributions is the most important parametric family of distributions in statistics.

The general parameterization of an exponential family distribution is:
$$p(y;\eta)=b(y)e^{\eta^{\top}T(y)-\alpha(\eta)}$$
where

  • $\eta$ is the natural parameter
  • $T(y)$ is the sufficient statistic of $y$
  • $\alpha(\eta)$ is the log partition function, which ensures that $\sum{p(y;\eta)}=1$

Comparing this with the Bernoulli form above (taking $T(y)=y$, $b(y)=1$ and $\alpha(\eta)=-\ln(1-\mu)$), we get $\eta=\ln{\frac{\mu}{1-\mu}}$, that is,
$$\mu=\frac{1}{1+e^{-\eta}}$$

It can be seen that the Sigmoid function maps the natural parameter $\eta$ to the probability $\mu$ of a Bernoulli distribution, which is why it can express a binary classification probability.
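
The relationship between the natural parameter and the Sigmoid function can be checked numerically. In the sketch below the values in `mu` are illustrative assumptions. It confirms that applying the Sigmoid to the log-odds recovers $\mu$, and that the exponential form reproduces the Bernoulli probability for $x=1$.

```python
import numpy as np

# Illustrative Bernoulli parameters (assumed for this sketch)
mu = np.array([0.1, 0.25, 0.5, 0.9])

eta = np.log(mu / (1 - mu))        # natural parameter (log-odds)
mu_back = 1 / (1 + np.exp(-eta))   # Sigmoid of the natural parameter

# Compare the direct Bernoulli probability with its exponential form for x = 1
x = 1
pmf_direct = mu ** x * (1 - mu) ** (1 - x)
pmf_expfam = np.exp(x * eta + np.log(1 - mu))

print(np.allclose(mu_back, mu), np.allclose(pmf_direct, pmf_expfam))
```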

2. Logistic regression

Like linear regression, logistic regression can also be solved with gradient-based optimization methods. The model of logistic regression is as follows:
$$h(\boldsymbol{x};\boldsymbol{\theta})=\sigma{(\boldsymbol{\theta x^{\top}})}$$
where $\boldsymbol{x}$ is the sample matrix of shape $(n,d+1)$ (a first column of all 1s is added to make the bias term easy to compute) and $\boldsymbol{\theta}$ is the parameter vector of shape $(1,d+1)$. The output has shape $(1,n)$.
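
The shapes involved can be illustrated with a small sketch. The sizes `n = 4` and `d = 2` and the random features are assumptions made only for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


n, d = 4, 2                                   # illustrative sizes (assumed)
rng = np.random.default_rng(0)
features = rng.normal(size=(n, d))            # raw samples, shape (n, d)

# Prepend a column of ones so the first parameter acts as the bias term
x = np.concatenate((np.ones((n, 1)), features), axis=1)   # (n, d+1)
theta = np.zeros((1, d + 1))                               # (1, d+1)

h = sigmoid(theta @ x.T)                                   # (1, n)
print(x.shape, theta.shape, h.shape)
```

With all parameters at zero, every predicted probability is 0.5, which is the starting point of training when the parameters are initialized to zero.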

Cross entropy loss function

In linear regression, we can use the mean square error (MSE) to measure the difference between the predicted value and the true value. This loss function, based on a numerical difference, is easy to understand. In classification tasks, however, MSE is no longer well suited to measuring classification differences. For this reason, we introduce a new loss function: cross-entropy loss.

To understand cross-entropy loss, we start with an important concept in information theory: KL divergence.

In classification tasks, whether multi-class or binary, the task can be seen as outputting a predicted distribution. For binary classification this is a Bernoulli distribution, and for multi-class classification it is a multinomial distribution. The better the classifier, the closer the output distribution should be to the target distribution. So how do we measure the similarity of two distributions? This is what KL divergence does.

Consider a $K$-class classification problem. Let the target distribution be $q(k|x)$ and the output distribution be $p(k|x)$. Each gives the probability that the sample belongs to the $k$-th class. The target distribution is a one-hot distribution: the probability of the true class is 1 and all other probabilities are 0. The KL divergence of the two distributions is written as:
$$KL(q||p)=\sum_{k=1}^{K}{q(k|x)\log{\frac{q(k|x)}{p(k|x)}}}$$
The closer the two distributions are, the smaller the KL divergence is. When $p=q$, the KL divergence is 0.

If we further expand the KL divergence, we get:
$$KL(q||p)=\sum_{k=1}^{K}{\left[q(k|x)\log{q(k|x)}-q(k|x)\log{p(k|x)}\right]}$$
The first half depends only on the distribution $q$. Since $q$ is a fixed target distribution, this part is a constant, and the KL divergence varies only with the second half, which is called the cross entropy:
$$CrossEntropy=-\sum_{k=1}^{K}q(k|x)\log{p(k|x)}$$
The closer the two distributions are, the smaller the cross entropy is; conversely, the further apart they are, the larger the cross entropy is.
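
The decomposition can be checked on a small example. The distributions `q` and `p` below are illustrative assumptions and are kept strictly positive so that every logarithm is defined (a one-hot `q` would need the convention $0\log 0=0$).

```python
import numpy as np


def kl_divergence(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return np.sum(q * np.log(q / p))


def cross_entropy(q, p):
    """Cross entropy of p relative to the target distribution q."""
    return -np.sum(q * np.log(p))


# Illustrative 3-class distributions (assumed for this sketch)
q = np.array([0.7, 0.2, 0.1])   # target
p = np.array([0.6, 0.3, 0.1])   # prediction

neg_entropy_q = np.sum(q * np.log(q))   # the part that depends only on q
kl = kl_divergence(q, p)
ce = cross_entropy(q, p)

# KL equals the q-only constant plus the cross entropy
print(kl, neg_entropy_q + ce, kl_divergence(q, q))
```

Because the first term does not depend on `p`, minimizing the cross entropy over `p` is the same as minimizing the KL divergence.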

In particular, for binary classification tasks there is the binary cross entropy (BCE):
$$BCE=-(y\log{\hat{y}}+(1-y)\log{(1-\hat{y})})$$
This is the special case $K=2$.
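
To make the $K=2$ correspondence concrete, here is a minimal sketch with an assumed label `y = 1` and an assumed prediction `y_hat = 0.8`. It treats the label and prediction as two-class distributions and compares the general cross entropy with the BCE formula.

```python
import numpy as np

y, y_hat = 1.0, 0.8   # illustrative label and predicted probability (assumed)

# Binary cross entropy written out directly
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The same quantity as a two-class cross entropy
q = np.array([y, 1 - y])           # target distribution over {1, 0}
p = np.array([y_hat, 1 - y_hat])   # predicted distribution over {1, 0}
ce = -np.sum(q * np.log(p))

print(bce, ce)
```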

Gradient

The gradient-descent update of logistic regression has the same form as that of linear regression:
$$\theta_{j}=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}{(h(x^{(i)};\theta)-y^{(i)})x_{j}^{(i)}}$$
Note that in logistic regression $h(\boldsymbol{x};\boldsymbol{\theta})=\sigma{(\boldsymbol{\theta x^{\top}})}$, whereas in linear regression there is no Sigmoid operation.

The specific derivation is as follows. For a single sample,
$$J(\boldsymbol{\theta})=-\left[y\log{p}+(1-y)\log{(1-p)}\right]$$
where $p=\sigma(\boldsymbol{\theta x})$.
By the chain rule:
$$\frac{\partial{J}}{\partial{\theta_{j}}}=\frac{\partial{J}}{\partial{p}}\frac{\partial{p}}{\partial{(\boldsymbol{\theta x})}}\frac{\partial{(\boldsymbol{\theta x})}}{\partial{\theta_{j}}}$$
where:

  • $\frac{\partial{J}}{\partial{p}}=-\frac{y}{p}+\frac{1-y}{1-p}$
  • $\frac{\partial{p}}{\partial{(\boldsymbol{\theta x})}}=\sigma(\boldsymbol{\theta x})(1-\sigma{(\boldsymbol{\theta x})})$
  • $\frac{\partial{(\boldsymbol{\theta x})}}{\partial{\theta_{j}}}=x_{j}$

Combining these three factors, we have
$$\begin{aligned}\frac{\partial{J}}{\partial{\theta_{j}}}&=\left[-\frac{y}{p}+\frac{1-y}{1-p}\right]\sigma(\boldsymbol{\theta x})(1-\sigma{(\boldsymbol{\theta x})})x_{j}\\&=\left[-\frac{y}{\sigma(\boldsymbol{\theta x})}+\frac{1-y}{1-\sigma(\boldsymbol{\theta x})}\right]\sigma(\boldsymbol{\theta x})(1-\sigma{(\boldsymbol{\theta x})})x_{j}\\&=\left[-y(1-\sigma(\boldsymbol{\theta x}))+(1-y)\sigma(\boldsymbol{\theta x})\right]x_{j}\\&=(\sigma(\boldsymbol{\theta x})-y)x_{j}\end{aligned}$$
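
The closed-form gradient $(\sigma(\boldsymbol{\theta x})-y)x_{j}$ can be sanity-checked against a finite-difference approximation. The sketch below is only a check under assumed values: `theta`, `x`, `y` and the step size `eps` are illustrative and do not come from the original post.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(theta, x, y):
    """Binary cross entropy of a single sample."""
    p = sigmoid(np.dot(theta, x))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


# Illustrative parameters and sample (assumed); x[0] = 1 plays the role of the bias input
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.7])
y = 1.0

# Closed-form gradient from the derivation
analytic = (sigmoid(np.dot(theta, x)) - y) * x

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (loss(theta + step, x, y) - loss(theta - step, x, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The two vectors should agree to several decimal places, which is a useful test before trusting a hand-derived gradient in training code.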

Overfitting and underfitting

In the field of machine learning, overfitting has always been an important problem. Overfitting refers to the situation where the learned model has low bias but excessive variance.

For example, suppose there is a large classification data set whose labels follow a certain distribution, and the goal of the model is to learn the mapping from samples to this distribution. Suppose we randomly divide this data set into 10 subsets and train one model on each. These ten models should each classify well on their own subset, and the differences between the models should not be large, because all the data comes from one data set. This means that both the bias and the variance of the models are relatively small.

Now consider the case where the "dog" samples in one of the subsets are all white dogs, so the model trained on it believes that all white animals are dogs. Although this model classifies well on its own "white dog" subset, it differs too much from the other models. This means that the bias is small but the variance is large, which is called overfitting.

Conversely, if the bias is large and the variance is small, that is, the models differ very little from one another but none of them classifies accurately, the situation is called underfitting.

Underfitting can be addressed through data enhancement or by simply increasing the number of iterations. For overfitting, we introduce a method called regularization.

Regularization

The most commonly used regularization method is to add a parameter penalty term to the loss function to keep the parameters from developing in the direction of overfitting. The general form is:
$$Regularized\ loss = Loss + \lambda(\theta)$$
In this code, we implement $L^{2}$ regularization.
The regularized loss function is as follows:
$$J\left(\theta\right)=\frac{1}{m}\sum\limits_{i=1}^{m}{\left[-{y}^{(i)}\log\left({h}_{\theta}\left({x}^{(i)}\right)\right)-\left(1-{y}^{(i)}\right)\log\left(1-{h}_{\theta}\left({x}^{(i)}\right)\right)\right]}+\frac{\lambda}{2m}\sum\limits_{j=1}^{n}{\theta_{j}^{2}}$$
Taking the gradient, we get the regularized gradient:
$$g(\theta)=\frac{1}{m}\sum_{i=1}^{m}{(h(x^{(i)};\theta)-y^{(i)})x_{j}^{(i)}}+\frac{\lambda}{m}\theta_{j}$$
For gradient descent with learning rate $\alpha$, this gives the following update:
$$\theta_{j}=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}{(h(x^{(i)};\theta)-y^{(i)})x_{j}^{(i)}}-\alpha\frac{\lambda}{m}\theta_{j}=\left(1-\alpha\frac{\lambda}{m}\right)\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}{(h(x^{(i)};\theta)-y^{(i)})x_{j}^{(i)}}$$
It can be seen that regularization shrinks the parameters at every step. The degree of shrinkage is determined by $\alpha\lambda/m$.
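
Starting from the regularized gradient $g(\theta)$, one gradient-descent step with learning rate $\alpha$ subtracts $\alpha\frac{\lambda}{m}\theta_{j}$ in addition to the usual data term, so each parameter is first scaled by $1-\alpha\lambda/m$. The sketch below checks this equivalence numerically. All values (`alpha`, `lam`, `m`, `theta`, and the stand-in `data_grad`) are hypothetical, and the bias term, which the class in Section 3 leaves unregularized, is ignored here.

```python
import numpy as np

alpha, lam, m = 0.1, 2.0, 50               # hypothetical learning rate, lambda, sample count
theta = np.array([0.4, -1.2, 3.0])         # hypothetical non-bias parameters
data_grad = np.array([0.05, -0.02, 0.1])   # stand-in for the unregularized gradient term

# Update written as "subtract the regularized gradient"
step_form = theta - alpha * (data_grad + lam / m * theta)

# The same update written as "shrink theta, then take the ordinary step"
decay_form = (1 - alpha * lam / m) * theta - alpha * data_grad

print(step_form)
print(decay_form)
```

This shrink-then-step view is why $L^{2}$ regularization is often described as weight decay.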
[Figure: effect of $L^{2}$ regularization on the parameters]
This figure illustrates the effect of $L^{2}$ regularization on the parameters. The dashed lines are the contours of the regularization term, and the solid lines are the contours of the unregularized loss function. The two reach a balance at $\tilde{w}$. Along the $w_{1}$ direction, the loss function does not change much when the parameter changes, but along the $w_{2}$ direction the change is much more dramatic. That is, compared with $w_{1}$, $w_{2}$ can reduce the loss function value much more significantly.

$L^{2}$ regularization better preserves the parameters along directions that significantly reduce the value of the loss function, and shrinks the parameters along directions in which the loss function changes little, because movement in those directions does not significantly affect the gradient.

3. Python code implementation

Here is the code for the Logistic regression class, which also implements data normalization and regularization:

import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def bce_loss(pred, target):
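    """
    Compute the mean binary cross-entropy loss
    :param pred: predicted probabilities
    :param target: ground truth
    :return: mean loss (scalar)
    """
    # Clip predictions away from 0 and 1 so that np.log never receives 0
    pred = np.clip(pred, 1e-12, 1 - 1e-12)
    return -np.mean(target * np.log(pred) + (1-target) * np.log(1-pred))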


class LogisticRegression:
    """
    Logistic regression class
    """

    def __init__(self, x, y, val_x, val_y, epoch=100, lr=0.1, normalize=True, regularize=None, scale=0, show=True):
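        """
        Initialization
        :param x: samples, (sample_number, dimension)
        :param y: labels, (sample_number, 1)
        :param val_x: validation samples
        :param val_y: validation labels
        :param epoch: number of training iterations
        :param lr: learning rate
        :param normalize: whether to standardize the features
        :param regularize: "L2" to enable L2 regularization, otherwise None
        :param scale: regularization coefficient (lambda)
        :param show: whether to print progress during training
        """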
        self.theta = None
        self.loss = []
        self.val_loss = []
        self.n = x.shape[0]
        self.d = x.shape[1]

        self.epoch = epoch
        self.lr = lr

        t = np.ones(shape=(self.n, 1))

        self.normalize = normalize

        if self.normalize:
            self.x_std = x.std(axis=0)
            self.x_mean = x.mean(axis=0)
            self.y_mean = y.mean(axis=0)
            self.y_std = y.std(axis=0)
            x = (x - self.x_mean) / self.x_std

        self.y = y
        self.x = np.concatenate((t, x), axis=1)

        # self.val_x = (val_x - val_x.mean(axis=0)) / val_x.std(axis=0)
        self.val_x = val_x
        self.val_y = val_y

        self.regularize = regularize
        self.scale = scale

        self.show = show

    def init_theta(self):
        """
        Initialize the parameters
        :return: theta (1, d+1)
        """
        self.theta = np.zeros(shape=(1, self.d + 1))

    def gradient_decent(self, pred):
        """
        Perform one gradient descent update
        """
        # error (n,1)
        error = pred - self.y
        # term (d+1, 1)
        term = np.matmul(self.x.T, error)
        # term (1,d+1)
        term = term.T

        if self.regularize == "L2":
            re = self.scale / self.n * self.theta[0, 1:]
            re = np.expand_dims(np.array(re), axis=0)
            re = np.concatenate((np.array([[0]]), re), axis=1)
            # re [0,...] (1,d+1)
            self.theta = self.theta - self.lr * (term / self.n + re)
        # update parameters
        else:
            self.theta = self.theta - self.lr * (term / self.n)

    def validation(self, x, y):
        if self.normalize:
            # Use the training-set statistics: theta is fitted on data scaled with them
            x = (x - self.x_mean) / self.x_std
        outputs = self.get_prob(x)
        curr_loss = bce_loss(outputs, y)
        if self.regularize == "L2":
            curr_loss += self.scale / self.n * np.sum(self.theta[0, 1:] ** 2)
        self.val_loss.append(curr_loss)
        predicted = np.expand_dims(np.where(outputs[:, 0] > 0.5, 1, 0), axis=1)
        count = np.sum(predicted == y)
        if self.show:
            print("Accuracy on Val set: {:.2f}%\tLoss on Val set: {:.4f}".format(count / y.shape[0] * 100, curr_loss))

    def test(self, x, y):
        outputs = self.get_prob(x)
        predicted = np.expand_dims(np.where(outputs[:, 0] > 0.5, 1, 0), axis=1)
        count = np.sum(predicted == y)
        # print("Accuracy on Test set: {:.2f}%".format(count / y.shape[0] * 100))
        # curr_loss = bce_loss(outputs, y)
        # if self.regularize == "L2":
        # curr_loss += self.scale / self.n * np.sum(self.theta[0, 1:] ** 2)
        return count / y.shape[0]  # , curr_loss

    def train(self):
        """
        Train the logistic regression model
        :return: parameter matrix theta (1,d+1); training loss sequence; validation loss sequence
        """
        self.init_theta()

        for i in range(self.epoch):
            # pred (1,n); theta (1,d+1); self.x.T (d+1, n)
            z = np.matmul(self.theta, self.x.T).T
            # pred (n,1)
            pred = sigmoid(z)
            curr_loss = bce_loss(pred, self.y)
            if self.regularize == "L2":
                curr_loss += self.scale / self.n * np.sum(self.theta[0, 1:] ** 2)
            self.loss.append(curr_loss)
            self.gradient_decent(pred)
            if self.show:
                print("Epoch: {}/{}, Train Loss: {:.4f}".format(i + 1, self.epoch, curr_loss))
            self.validation(self.val_x, self.val_y)

        if self.normalize:
            # Map theta back to the original (un-normalized) feature space:
            # w = theta[1:] / x_std, b = theta[0] - w . x_mean
            self.theta[0, 1:] = self.theta[0, 1:] / self.x_std
            self.theta[0, 0] = self.theta[0, 0] - np.dot(self.theta[0, 1:], self.x_mean)
        return self.theta, self.loss, self.val_loss

    def get_prob(self, x):
        """
        Predict class probabilities
        :param x: input samples (n,d)
        :return: predicted probabilities (n,1)
        """
        t = np.ones(shape=(x.shape[0], 1))
        x = np.concatenate((t, x), axis=1)
        pred = sigmoid(np.matmul(self.theta, x.T))
        return pred.T

    def get_inner_product(self, x):
        t = np.ones(shape=(x.shape[0], 1))
        x = np.concatenate((t, x), axis=1)
        return np.matmul(self.theta, x.T)

    def predict(self, x):
        prob = self.get_prob(x)
        return np.expand_dims(np.where(prob[:, 0] > 0.5, 1, 0), axis=1)

4. Single-dimensional and multi-dimensional logistic classification

Single-dimensional data classification

Data set visualization:
[Figure: data set visualization]
Split the data into a training set and a validation set.
Training set visualization:
[Figure: training set visualization]
Validation set visualization:
[Figure: validation set visualization]
Call the algorithm for classification.

from LogisticRegression import LogisticRegression

epochs = 5000
alpha = 0.01
logistic_reg = LogisticRegression(x=train_x,y=train_y_ex,val_x=val_x,val_y=val_y_ex,epoch=epochs,lr=alpha)
theta,train_loss,val_loss = logistic_reg.train()

Classification performance

Accuracy on Test Set: 80.00%
My F1 Score: 0.8571

sklearn library function verification

Sklearn Accuracy: 80.00%
Sklearn F1 Score: 0.8571
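
The post does not show how the accuracy and F1 score are computed. A minimal NumPy sketch of one way to compute them by hand is given below; the function name `accuracy_and_f1` and the toy labels are assumptions made for illustration and are not the author's code or data.

```python
import numpy as np


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels in {0, 1} (hypothetical helper)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()

    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives

    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Toy labels, made up for this sketch
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

Comparing a hand-written metric against a library implementation, as the post does with sklearn, is a simple way to catch mistakes in either the metric or the label shapes.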

Visualization of the decision boundary:
[Figure: decision boundary]

Visualization of the training process:
[Figure: training and validation loss curves]
The curves show obvious overfitting, which can be suppressed by adjusting the learning rate and the number of iterations. Regularization can also be applied.

Multidimensional data classification

Data set visualization:
[Figure: data set visualization]
Split the data into a training set and a validation set.
Training set:
[Figure: training set visualization]
Validation set:
[Figure: validation set visualization]

Do data enhancement (expand the data dimensions):
$$X=[x_{1}, x_{2}, x_{1}^{2}, x_{1}x_{2}, x_{2}^{2}, x_{1}^{3}, x_{1}^{2}x_{2},\cdots]$$
Here the features are expanded up to the sixth power.

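import numpy as np


def feature_mapping(x, degree):
    """Map two features (x1, x2) to all monomials x1^i * x2^j with 1 <= i + j <= degree."""
    features = []
    for i in range(degree + 1):
        for j in range(degree + 1 - i):
            if i == 0 and j == 0:
                continue  # skip the constant term; the model adds its own bias column
            features.append(np.power(x[:, 0], i) * np.power(x[:, 1], j))
    # Stack the monomials as columns: (n, number_of_monomials)
    return np.stack(features, axis=1)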

train_x_map = feature_mapping(train_x,degree=6)
val_x_map = feature_mapping(val_x,degree=6)
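
To see what the mapping produces, here is a small self-contained sketch. The helper `poly_features` is a hypothetical stand-in that follows the same loop order as `feature_mapping`, and the single sample `(2, 3)` is made up for illustration.

```python
import numpy as np


def poly_features(x, degree):
    """All monomials x1^i * x2^j with 1 <= i + j <= degree (hypothetical helper)."""
    cols = [
        x[:, 0] ** i * x[:, 1] ** j
        for i in range(degree + 1)
        for j in range(degree + 1 - i)
        if i + j > 0
    ]
    return np.stack(cols, axis=1)


x = np.array([[2.0, 3.0]])   # one made-up sample with x1 = 2, x2 = 3

# Degree 2 gives the columns x2, x2^2, x1, x1*x2, x1^2 in this loop order
mapped = poly_features(x, degree=2)
print(mapped)

# Degree 6 gives (6 + 1) * (6 + 2) / 2 - 1 = 27 features per sample
print(poly_features(x, degree=6).shape)
```

The column order follows the loops (power of the first feature outermost), so it differs from the order written in the formula, but the set of monomials is the same.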

The regularization parameter is 2

Accuracy on Test Set: 66.67%
My F1 Score: 0.5556
Sklearn Accuracy: 70.83%
Sklearn Val Loss: 0.3076
Sklearn F1 Score: 0.6667

Visualization of the decision boundary:
[Figure: decision boundary with regularization parameter 2]

The regularization parameter is 0

Visualization of the decision boundary:
[Figure: decision boundary with regularization parameter 0]
It can be seen that the decision boundary appears too "deformed", which is a prominent sign of overfitting.

Origin blog.csdn.net/d33332/article/details/128496284