Logistic regression

Preliminary knowledge

Logistic regression is mainly used for binary classification. It maps a score $z$ to a probability $p$ with the sigmoid function: $p = sigmoid(z) = \frac{1}{1+e^{-z}}$ Its gradient, used in backpropagation, is: $\frac{\partial p}{\partial z}=p(1-p)$
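
As a quick sanity check of the derivative above, the sketch below compares $p(1-p)$ with a central finite difference. The helper names (`sigmoid_grad`, `numeric_grad`) and the step size `eps` are illustrative choices, not part of the original article.

```python
import math

def sigmoid(z):
    """Sigmoid: p = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytical derivative: dp/dz = p * (1 - p)."""
    p = sigmoid(z)
    return p * (1.0 - p)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of df/dz (eps is an assumed step size)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Print the analytical and numerical derivatives side by side.
for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The two columns should agree to several decimal places; a larger gap would point to a mistake in the formula or the code.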

For the loss function we use binary cross-entropy (BCE, binary_cross_entropy). With $y$ as the label and $p$ as the predicted probability: $loss = -y\log(p) - (1-y)\log(1-p)$ Its gradient, used in backpropagation, is:
$\frac{\partial loss}{\partial p} = -\frac{y}{p}+\frac{1- y}{1-p}$
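
The BCE gradient can be checked the same way. This is a minimal sketch for a single sample; the function names (`bce`, `bce_grad`, `numeric_grad_p`) and the test points are assumptions made for illustration.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one sample with label y and probability p."""
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

def bce_grad(p, y):
    """Analytical gradient: dloss/dp = -y/p + (1-y)/(1-p)."""
    return -y / p + (1.0 - y) / (1.0 - p)

def numeric_grad_p(p, y, eps=1e-6):
    """Central finite-difference approximation of dloss/dp."""
    return (bce(p + eps, y) - bce(p - eps, y)) / (2.0 * eps)

# Compare both gradients at a few (p, y) pairs away from 0 and 1.
for p, y in [(0.9, 1.0), (0.2, 0.0), (0.3, 1.0)]:
    print(p, y, bce_grad(p, y), numeric_grad_p(p, y))
```

Note that the gradient blows up as $p$ approaches 0 or 1, which is why implementations usually clip $p$ before taking the logarithm.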

Implementation case

Here we implement a classifier with two fully connected layers (one hidden layer and one output layer) and no bias terms. Assuming the input is $x$, the label is $y$, and the weights of the two fully connected layers are $w_1$ and $w_2$, the output can be expressed as:
$h_1 = w_1x \\ h_2 = w_2h_1 \\ p = sigmoid(h_2)$ The loss between the probability $p$ and the label $y$ is then: $loss = BCE(p, y)$
Applying the chain rule gives the gradients of $w_1$ and $w_2$:
$\frac{\partial loss}{\partial w_2} = \frac{\partial loss}{\partial p}\frac{\partial p}{\partial h_2}\frac{\partial h_2}{\partial w_2}=(-\frac{y}{p}+\frac{1-y}{1-p})*p(1-p)*h_1$
$\frac{\partial loss}{\partial w_1} = \frac{\partial loss}{\partial p}\frac{\partial p}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial w_1}=(-\frac{y}{p}+\frac{1-y}{1-p})*p(1-p)*w_2*x$
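
The first two factors in both expressions multiply out to something much simpler, which is handy as a check on any implementation. The short derivation below uses the same scalar notation as the formulas above (transposes and batch sums are ignored here, as in the original formulas).

```latex
\begin{aligned}
\frac{\partial loss}{\partial h_2}
  &= \left(-\frac{y}{p}+\frac{1-y}{1-p}\right) p(1-p) \\
  &= -y(1-p) + (1-y)p \\
  &= p - y \\
\frac{\partial loss}{\partial w_2} &= (p-y)\,h_1 \\
\frac{\partial loss}{\partial w_1} &= (p-y)\,w_2\,x
\end{aligned}
```

In other words, for a sigmoid output with BCE loss, the gradient with respect to the pre-activation $h_2$ is just the prediction error $p - y$.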

1. numpy implementation

import numpy as np
# Training data: 1000 vectors, each with 5 dimensions (5 features).
x = np.random.randn(1000, 5) # (batch_size, in_channel)
# Labels: 1 if the sum of the 5 features is positive, otherwise 0.
y = (np.sum(x, axis=-1, keepdims=True)>0).astype(np.float32)

w1 = np.random.rand(5, 8) # (in_channel, hidden_channel)
w2 = np.random.rand(8, 1) # (hidden_channel, out_channel)

def sigmoid(z):
    return 1/(1+np.exp(-z))

lr = 0.001 # learning rate
for i in range(10):
    h1 = x.dot(w1)
    h2 = h1.dot(w2)
    p = np.clip(sigmoid(h2), 0.0001, 0.9999) # clip to avoid nan in the cross-entropy
    # Loss calculation: BCE loss
    loss = np.mean(-(y*np.log(p)+(1-y)*np.log(1-p)))
    # Accuracy calculation
    acc = np.mean((p>0.5)==y)
    print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss, acc))
    # Backpropagation: compute the gradient of every intermediate variable.
    # Note: the 1/N factor of the mean loss is omitted here, so the weight
    # gradients are summed over the batch rather than averaged.
    grad_p  = -y/p+(1-y)/(1-p)  # dloss/dp
    grad_h2 = grad_p*p*(1-p)    # dloss/dp * dp/dh2
    grad_w2 = h1.T.dot(grad_h2) # dloss/dp * dp/dh2 * dh2/dw2
    grad_h1 = grad_h2.dot(w2.T) # dloss/dp * dp/dh2 * dh2/dh1
    grad_w1 = x.T.dot(grad_h1)  # dloss/dp * dp/dh2 * dh2/dh1 * dh1/dw1
    # Parameter update
    w1 -= lr*grad_w1
    w2 -= lr*grad_w2

The output is as follows:

0--loss:0.1126 acc:0.9670
1--loss:0.1020 acc:0.9580
2--loss:0.1552 acc:0.9240
3--loss:0.2582 acc:0.9070
4--loss:0.2042 acc:0.9140
5--loss:0.0925 acc:0.9710
6--loss:0.0522 acc:0.9910
7--loss:0.0432 acc:0.9940
8--loss:0.0384 acc:0.9930
9--loss:0.0351 acc:0.9950

Using numpy, we manually implemented the calculation of the probability $p$ and the $loss$ with matrix multiplication, as well as the gradient of every intermediate variable. When calculating gradients, make sure that the gradient of each variable has the same shape as the variable itself. Once you know the gradient formulas and the chain rule, this is quite straightforward.
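
One way to gain confidence in hand-written gradients like these is a finite-difference check. The sketch below is an illustration on a small random batch, not part of the original code: the helper names (`forward_loss`, `analytic_grads`, `numeric_grad`), the batch size, the weight scale, and the seed are all assumptions. It uses the summed (not averaged) loss so that it matches the gradient formulas in the training loop, and it also asserts that each gradient has the same shape as its variable.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_loss(x, y, w1, w2):
    """Summed BCE loss of the two-layer, bias-free model."""
    p = sigmoid(x.dot(w1).dot(w2))
    return np.sum(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def analytic_grads(x, y, w1, w2):
    """Chain-rule gradients, written the same way as in the training loop."""
    h1 = x.dot(w1)
    p = sigmoid(h1.dot(w2))
    grad_h2 = (-y / p + (1 - y) / (1 - p)) * p * (1 - p)
    grad_w2 = h1.T.dot(grad_h2)
    grad_w1 = x.T.dot(grad_h2.dot(w2.T))
    return grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, perturbing one entry of w at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = f()
        w[idx] = old - eps
        down = f()
        w[idx] = old  # restore the original value
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 5))
y = (np.sum(x, axis=-1, keepdims=True) > 0).astype(np.float64)
w1 = rng.random((5, 8)) * 0.1  # small weights keep p away from 0 and 1
w2 = rng.random((8, 1)) * 0.1

grad_w1, grad_w2 = analytic_grads(x, y, w1, w2)
loss_fn = lambda: forward_loss(x, y, w1, w2)
num_w1 = numeric_grad(loss_fn, w1)
num_w2 = numeric_grad(loss_fn, w2)

# Each gradient must have the same shape as its variable.
assert grad_w1.shape == w1.shape and grad_w2.shape == w2.shape
# Largest absolute difference between analytical and numerical gradients.
print(np.max(np.abs(grad_w1 - num_w1)), np.max(np.abs(grad_w2 - num_w2)))
```

Both printed differences should be tiny compared with the gradient values themselves. The check is slow because it evaluates the loss twice per weight, so it is only meant for small test cases.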

2. tensorflow implementation

import numpy as np
import tensorflow as tf
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

# Note: this example uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
X = np.random.randn(1000, 5)
Y = (np.sum(X,axis=-1, keepdims=True)>0).astype(np.float32)

# Define a model with two fully connected layers; the last layer uses a sigmoid activation
model = tf.keras.models.Sequential(
    [tf.keras.layers.Dense(8, use_bias=False),
    tf.keras.layers.Dense(1, use_bias=False, activation='sigmoid')]
)

x = tf.placeholder(dtype=tf.float32, shape=[None, 5]) # input placeholder
y = tf.placeholder(dtype=tf.float32, shape=[None, 1]) # label placeholder
p = model(x) # inference: get the predicted probability p
# Compute the BCE loss
p = tf.clip_by_value(p,1e-7,1-1e-7)
loss_fn = tf.reduce_mean(-(y*tf.log(p)+(1-y)*tf.log(1-p)))
# Compute the accuracy
binary_p = tf.where(p>0.5, tf.ones_like(p), tf.zeros_like(p))
acc = tf.reduce_mean(tf.cast(tf.equal(binary_p, y), tf.float32))
# Define the Adam optimizer
optimizer = tf.train.AdamOptimizer(0.1).minimize(loss_fn)
# Train the model inside a session
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    for i in range(10):
        loss, accuracy, _ = sess.run([loss_fn, acc, optimizer],
                                     feed_dict={x: X, y: Y})
        print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss, accuracy))

The output is as follows:

0--loss:0.5949 acc:0.6660
1--loss:0.3835 acc:0.8600
2--loss:0.2731 acc:0.9260
3--loss:0.2061 acc:0.9540
4--loss:0.1606 acc:0.9710
5--loss:0.1279 acc:0.9810
6--loss:0.1046 acc:0.9880
7--loss:0.0888 acc:0.9910
8--loss:0.0782 acc:0.9910
9--loss:0.0701 acc:0.9900

As you can see, the tensorflow implementation is fairly cumbersome. The process is: define the input tensors $\rightarrow$ define the model $\rightarrow$ run inference to get the output tensor $\rightarrow$ define the loss $\rightarrow$ define the metrics $\rightarrow$ define the optimizer. Together these form a complete graph. We then open a session (sess), in which we can run the optimizer to train the model and also fetch the loss and metrics.

3. keras implementation

import numpy as np
import keras

x = np.random.randn(1000, 5)
y = (np.sum(x,axis=-1, keepdims=True)>0).astype(np.float32)

model = keras.models.Sequential(
    [keras.layers.Dense(8, use_bias=False),
    keras.layers.Dense(1, use_bias=False, activation='sigmoid')]
)
# Compile the model
model.compile(keras.optimizers.Adam(0.001), # use the Adam optimizer
              loss=keras.losses.binary_crossentropy, # use the built-in BCE loss
              metrics=['accuracy']) # use the built-in accuracy metric
model.fit(x,y, batch_size=16, epochs=10) # start training

The output is as follows:

Epoch 1/10
1000/1000 [==============================] - 1s 1ms/step - loss: 0.7465 - acc: 0.5730
Epoch 2/10
1000/1000 [==============================] - 0s 140us/step - loss: 0.5893 - acc: 0.6840
Epoch 3/10
1000/1000 [==============================] - 0s 148us/step - loss: 0.4777 - acc: 0.7640
Epoch 4/10
1000/1000 [==============================] - 0s 139us/step - loss: 0.3973 - acc: 0.8240
Epoch 5/10
1000/1000 [==============================] - 0s 137us/step - loss: 0.3374 - acc: 0.8770
Epoch 6/10
1000/1000 [==============================] - 0s 140us/step - loss: 0.2929 - acc: 0.9140
Epoch 7/10
1000/1000 [==============================] - 0s 134us/step - loss: 0.2586 - acc: 0.9440
Epoch 8/10
1000/1000 [==============================] - 0s 135us/step - loss: 0.2315 - acc: 0.9580
Epoch 9/10
1000/1000 [==============================] - 0s 139us/step - loss: 0.2100 - acc: 0.9680
Epoch 10/10
1000/1000 [==============================] - 0s 134us/step - loss: 0.1924 - acc: 0.9780

As you can see, keras is very concise. The reason is that keras deeply wraps the commonly used loss calculation, backpropagation, and parameter updates, so we only need to call the fit function to train a model. The numpy, pytorch, and tensorflow implementations, on the other hand, let us pay more attention to the details of model training.
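
To make concrete what a call like `model.fit(x, y, batch_size=16, epochs=10)` takes care of, here is a rough numpy sketch of a mini-batch training loop for the same two-layer model. It is not Keras's actual implementation: it uses plain gradient descent instead of Adam, and the function name `fit_numpy`, the weight initialization, the seed, and the learning rate are all assumptions made for illustration. The backward pass uses the fact that, for a sigmoid output with BCE loss, the gradient with respect to the pre-activation is $p - y$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_numpy(x, y, w1, w2, batch_size=16, epochs=10, lr=0.05, seed=0):
    """Mini-batch gradient descent; updates w1 and w2 in place and
    returns the mean training loss of each epoch."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    history = []
    for epoch in range(epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        epoch_loss = 0.0
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = x[idx], y[idx]
            # Forward pass
            h1 = xb.dot(w1)
            p = np.clip(sigmoid(h1.dot(w2)), 1e-7, 1 - 1e-7)
            epoch_loss += np.sum(-(yb * np.log(p) + (1 - yb) * np.log(1 - p)))
            # Backward pass: dloss/dh2 = p - y, averaged over the batch
            grad_h2 = (p - yb) / len(idx)
            grad_w2 = h1.T.dot(grad_h2)
            grad_w1 = xb.T.dot(grad_h2.dot(w2.T))
            # Parameter update
            w1 -= lr * grad_w1
            w2 -= lr * grad_w2
        history.append(epoch_loss / n)
    return history

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 5))
y = (np.sum(x, axis=-1, keepdims=True) > 0).astype(np.float64)
w1 = rng.standard_normal((5, 8)) * 0.1
w2 = rng.standard_normal((8, 1)) * 0.1

history = fit_numpy(x, y, w1, w2)
for epoch, loss in enumerate(history):
    print("epoch {} loss {:.4f}".format(epoch + 1, loss))
```

The shuffling, batching, loss bookkeeping, and update steps written out here are the parts that a high-level `fit` call hides from the user.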

4. pytorch implementation

import torch

x = torch.randn(1000, 5)
y = (torch.sum(x, dim=1, keepdim=True)>0).float()
# Define a Sequential model
model = torch.nn.Sequential(
    torch.nn.Linear(5, 8, bias=False),
    torch.nn.Linear(8, 1, bias=False),
    torch.nn.Sigmoid()
)
criterion = torch.nn.BCELoss() # define the loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.1) # define the optimizer
for i in range(10):
    p = model(x) # inference: get the probability p
    loss = criterion(p, y) # compute the loss
    acc = torch.mean(((p>0.5)==y).float()) # compute the accuracy
    print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss.item(), acc.item()))
    optimizer.zero_grad() # reset the gradients to zero
    loss.backward() # backpropagation, like computing every variable's gradient by hand in numpy
    optimizer.step() # parameter update, like updating w1 and w2 in numpy

The output is as follows:

0--loss:0.6549 acc:0.6260
1--loss:0.4986 acc:0.8080
2--loss:0.3646 acc:0.8780
3--loss:0.2605 acc:0.9350
4--loss:0.1877 acc:0.9620
5--loss:0.1393 acc:0.9720
6--loss:0.1077 acc:0.9790
7--loss:0.0864 acc:0.9860
8--loss:0.0711 acc:0.9890
9--loss:0.0599 acc:0.9950

The pytorch implementation is also simple and easy to understand. Compared with the graph mechanism of tensorflow, torch operations feel more like numpy and are easier to pick up. Compared with the deep encapsulation of keras, torch gives us more direct control over the gradient calculation and parameter update steps during training. I still prefer pytorch.

Summary

Apart from numpy, the other frameworks implement backpropagation and parameter updates internally, which makes training a model much more convenient. They also provide commonly used loss functions and evaluation metrics. However, an interviewer may ask you to hand-write the complete model training process, which requires a good understanding of every step.

Talk is cheap, show me the code.

This article implements the logistic regression classification algorithm in numpy, tensorflow, keras, and pytorch, covering gradient calculation and parameter updates in each.
