Logistic regression
Preliminary knowledge
Logistic regression is mainly used for binary classification. The predicted probability is:

$$p = \mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}}$$

Its gradient with respect to $z$ is:

$$\frac{\partial p}{\partial z} = p(1-p)$$

For the loss function we use binary cross-entropy (BCE, binary_cross_entropy). With label $y$ and predicted probability $p$:

$$loss = -y\log(p) - (1-y)\log(1-p)$$

Its gradient with respect to $p$ is:

$$\frac{\partial loss}{\partial p} = -\frac{y}{p} + \frac{1-y}{1-p}$$
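The two gradient formulas can be sanity-checked numerically. The snippet below is a minimal sketch, not part of the original article: it compares each formula against a central finite difference at one arbitrarily chosen point. The helper `numeric_derivative` and the test values `z = 0.3`, `y = 1.0` are assumptions made only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function p = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy for one sample with label y and probability p."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


def numeric_derivative(f, v, eps=1e-6):
    """Central finite difference, used only as a sanity check (hypothetical helper)."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)


z, y = 0.3, 1.0  # arbitrary test point and label (illustrative values)
p = sigmoid(z)

# dp/dz = p(1 - p)
dp_dz_formula = p * (1 - p)
dp_dz_numeric = numeric_derivative(sigmoid, z)

# dloss/dp = -y/p + (1 - y)/(1 - p)
dloss_dp_formula = -y / p + (1 - y) / (1 - p)
dloss_dp_numeric = numeric_derivative(lambda q: bce(q, y), p)

print("dp/dz matches:", abs(dp_dz_formula - dp_dz_numeric) < 1e-6)
print("dloss/dp matches:", abs(dloss_dp_formula - dloss_dp_numeric) < 1e-6)
```

A check like this is cheap to write and catches sign errors before they reach the training loop.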
Implementation case
Here we implement a classifier with two fully connected layers and no bias terms. Let the input be $x$, the label be $y$, and the weights of the two fully connected layers be $w_1$ and $w_2$. The output can then be expressed as:

$$h_1 = w_1 x \\ h_2 = w_2 h_1 \\ p = \mathrm{sigmoid}(h_2)$$

The loss between the probability $p$ and the label $y$ is:

$$loss = BCE(p, y)$$

Applying the chain rule gives the gradients of $w_1$ and $w_2$:

$$\frac{\partial loss}{\partial w_2} = \frac{\partial loss}{\partial p}\frac{\partial p}{\partial h_2}\frac{\partial h_2}{\partial w_2} = \left(-\frac{y}{p}+\frac{1-y}{1-p}\right) \cdot p(1-p) \cdot h_1$$

$$\frac{\partial loss}{\partial w_1} = \frac{\partial loss}{\partial p}\frac{\partial p}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial w_1} = \left(-\frac{y}{p}+\frac{1-y}{1-p}\right) \cdot p(1-p) \cdot w_2 \cdot x$$
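To make the chain rule concrete, here is a small scalar sketch, assuming every quantity is a single number. The values of `w1`, `w2`, `x` and `y` are illustrative and do not come from the article, and `forward_loss` is a hypothetical helper. It also shows that the first two factors, $\left(-\frac{y}{p}+\frac{1-y}{1-p}\right) \cdot p(1-p)$, simplify algebraically to $p - y$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_loss(w1, w2, x, y):
    """Scalar model: h1 = w1*x, h2 = w2*h1, p = sigmoid(h2), loss = BCE(p, y)."""
    p = sigmoid(w2 * (w1 * x))
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


# Illustrative scalar values, not taken from the article.
w1, w2, x, y = 0.5, -0.8, 1.5, 1.0

h1 = w1 * x
p = sigmoid(w2 * h1)

# Chain rule, factor by factor.
grad_p = -y / p + (1 - y) / (1 - p)  # dloss/dp
grad_h2 = grad_p * p * (1 - p)       # dloss/dh2
grad_w2 = grad_h2 * h1               # dloss/dw2
grad_w1 = grad_h2 * w2 * x           # dloss/dw1

# The product of the first two factors equals p - y.
simplification_holds = abs(grad_h2 - (p - y)) < 1e-12

# Central finite differences as an independent check.
eps = 1e-6
num_w2 = (forward_loss(w1, w2 + eps, x, y) - forward_loss(w1, w2 - eps, x, y)) / (2 * eps)
num_w1 = (forward_loss(w1 + eps, w2, x, y) - forward_loss(w1 - eps, w2, x, y)) / (2 * eps)

print("grad_h2 == p - y:", simplification_holds)
print("grad_w2 matches:", abs(grad_w2 - num_w2) < 1e-6)
print("grad_w1 matches:", abs(grad_w1 - num_w1) < 1e-6)
```

Using $p - y$ directly also avoids dividing by $p$ or $1-p$, which is why many implementations compute the gradient at $h_2$ in that form.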
1. numpy implementation
import numpy as np

# Training data: 1000 vectors, each 5-dimensional (5 features).
x = np.random.randn(1000, 5)  # (batch_size, in_channel)
# Labels: 1 if the sum of the 5 features is positive, otherwise 0.
y = (np.sum(x, axis=-1, keepdims=True) > 0).astype(np.float32)
w1 = np.random.rand(5, 8)  # (in_channel, hidden_channel)
w2 = np.random.rand(8, 1)  # (hidden_channel, out_channel)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.001  # learning rate
for i in range(10):
    h1 = x.dot(w1)
    h2 = h1.dot(w2)
    p = np.clip(sigmoid(h2), 0.0001, 0.9999)  # clip so the cross-entropy never produces nan
    # BCE loss
    loss = np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))
    # accuracy
    acc = np.mean((p > 0.5) == y)
    print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss, acc))
    # Backpropagation: compute the gradient of every intermediate variable.
    # Note: these are gradients of the per-sample losses summed over the batch
    # (no 1/N factor), so the step is batch_size times larger than for the mean loss.
    grad_p = -y / p + (1 - y) / (1 - p)  # dloss/dp
    grad_h2 = grad_p * p * (1 - p)       # dloss/dp * dp/dh2
    grad_w2 = h1.T.dot(grad_h2)          # dloss/dp * dp/dh2 * dh2/dw2
    grad_h1 = grad_h2.dot(w2.T)          # dloss/dp * dp/dh2 * dh2/dh1
    grad_w1 = x.T.dot(grad_h1)           # dloss/dp * dp/dh2 * dh2/dh1 * dh1/dw1
    # Parameter update
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
The output is as follows:
0--loss:0.1126 acc:0.9670
1--loss:0.1020 acc:0.9580
2--loss:0.1552 acc:0.9240
3--loss:0.2582 acc:0.9070
4--loss:0.2042 acc:0.9140
5--loss:0.0925 acc:0.9710
6--loss:0.0522 acc:0.9910
7--loss:0.0432 acc:0.9940
8--loss:0.0384 acc:0.9930
9--loss:0.0351 acc:0.9950
Using numpy, we manually implemented the computation of the probability $p$ through matrix multiplication, along with the gradient of the loss with respect to each intermediate variable. When computing gradients, take care that the gradient of each variable has the same shape as the variable itself. As long as we know the gradient formulas and the chain rule, this is still quite simple.
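The shape rule can be verified directly. The following is a minimal sketch that reuses the same layer sizes as the numpy implementation but with a batch of only 4 samples. The seed and batch size are assumptions chosen so the shapes are easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.standard_normal((4, 5))  # (batch_size, in_channel)
y = (x.sum(axis=-1, keepdims=True) > 0).astype(np.float64)
w1 = rng.random((5, 8))  # (in_channel, hidden_channel)
w2 = rng.random((8, 1))  # (hidden_channel, out_channel)

# Forward pass
h1 = x @ w1  # (4, 8)
h2 = h1 @ w2  # (4, 1)
p = np.clip(1 / (1 + np.exp(-h2)), 1e-4, 1 - 1e-4)  # (4, 1)

# Backward pass: each transpose is placed so the result matches the variable's shape.
grad_h2 = (-y / p + (1 - y) / (1 - p)) * p * (1 - p)  # (4, 1), same as h2
grad_w2 = h1.T @ grad_h2  # (8, 4) @ (4, 1) -> (8, 1), same as w2
grad_h1 = grad_h2 @ w2.T  # (4, 1) @ (1, 8) -> (4, 8), same as h1
grad_w1 = x.T @ grad_h1   # (5, 4) @ (4, 8) -> (5, 8), same as w1

for name, grad, var in [("w1", grad_w1, w1), ("w2", grad_w2, w2),
                        ("h1", grad_h1, h1), ("h2", grad_h2, h2)]:
    print(name, grad.shape, grad.shape == var.shape)
```

If a shape does not match, the usual cause is a missing or misplaced transpose in one of the matrix products.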
2. tensorflow implementation
# Note: this code uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
import numpy as np
import tensorflow as tf
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

X = np.random.randn(1000, 5)
Y = (np.sum(X, axis=-1, keepdims=True) > 0).astype(np.float32)

# Model with two fully connected layers; the last one uses a sigmoid activation.
model = tf.keras.models.Sequential(
    [tf.keras.layers.Dense(8, use_bias=False),
     tf.keras.layers.Dense(1, use_bias=False, activation='sigmoid')]
)
x = tf.placeholder(dtype=tf.float32, shape=[None, 5])  # input placeholder
y = tf.placeholder(dtype=tf.float32, shape=[None, 1])  # label placeholder
p = model(x)  # inference: get the predicted probability p

# BCE loss
p = tf.clip_by_value(p, 1e-7, 1 - 1e-7)
loss_fn = tf.reduce_mean(-(y * tf.log(p) + (1 - y) * tf.log(1 - p)))
# accuracy
binary_p = tf.where(p > 0.5, tf.ones_like(p), tf.zeros_like(p))
acc = tf.reduce_mean(tf.cast(tf.equal(binary_p, y), tf.float32))
# Adam optimizer
optimizer = tf.train.AdamOptimizer(0.1).minimize(loss_fn)

# Train the model inside a session.
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    for i in range(10):
        loss, accuracy, _ = sess.run([loss_fn, acc, optimizer],
                                     feed_dict={x: X, y: Y})
        print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss, accuracy))
The output is as follows:
0--loss:0.5949 acc:0.6660
1--loss:0.3835 acc:0.8600
2--loss:0.2731 acc:0.9260
3--loss:0.2061 acc:0.9540
4--loss:0.1606 acc:0.9710
5--loss:0.1279 acc:0.9810
6--loss:0.1046 acc:0.9880
7--loss:0.0888 acc:0.9910
8--loss:0.0782 acc:0.9910
9--loss:0.0701 acc:0.9900
As you can see, the tensorflow implementation is rather cumbersome. The process it follows is: input tensor definition → model definition → inference to get the output tensor → loss definition → metrics definition → optimizer definition. Together these form a complete graph. We then open a session (sess), in which we can run the optimizer to train the model and also fetch the loss and metrics.
3. keras implementation
import numpy as np
import keras

x = np.random.randn(1000, 5)
y = (np.sum(x, axis=-1, keepdims=True) > 0).astype(np.float32)

model = keras.models.Sequential(
    [keras.layers.Dense(8, use_bias=False),
     keras.layers.Dense(1, use_bias=False, activation='sigmoid')]
)
# Compile the model
model.compile(keras.optimizers.Adam(0.001),         # Adam optimizer
              loss=keras.losses.binary_crossentropy,  # built-in BCE loss
              metrics=['accuracy'])                   # built-in accuracy
model.fit(x, y, batch_size=16, epochs=10)  # start training
The output is as follows:
Epoch 1/10
1000/1000 [==============================] - 1s 1ms/step - loss: 0.7465 - acc: 0.5730
Epoch 2/10
1000/1000 [==============================] - 0s 140us/step - loss: 0.5893 - acc: 0.6840
Epoch 3/10
1000/1000 [==============================] - 0s 148us/step - loss: 0.4777 - acc: 0.7640
Epoch 4/10
1000/1000 [==============================] - 0s 139us/step - loss: 0.3973 - acc: 0.8240
Epoch 5/10
1000/1000 [==============================] - 0s 137us/step - loss: 0.3374 - acc: 0.8770
Epoch 6/10
1000/1000 [==============================] - 0s 140us/step - loss: 0.2929 - acc: 0.9140
Epoch 7/10
1000/1000 [==============================] - 0s 134us/step - loss: 0.2586 - acc: 0.9440
Epoch 8/10
1000/1000 [==============================] - 0s 135us/step - loss: 0.2315 - acc: 0.9580
Epoch 9/10
1000/1000 [==============================] - 0s 139us/step - loss: 0.2100 - acc: 0.9680
Epoch 10/10
1000/1000 [==============================] - 0s 134us/step - loss: 0.1924 - acc: 0.9780
As you can see, keras is very concise. The reason is that keras wraps the commonly used loss calculations, backpropagation, and parameter updates very deeply, so we only need to call the fit function to train a model. The numpy, torch and tensorflow implementations, on the other hand, expose more of the details of model training.
4. pytorch implementation
import torch

x = torch.randn(1000, 5)
y = (torch.sum(x, dim=1, keepdim=True) > 0).float()

# Define a Sequential model
model = torch.nn.Sequential(
    torch.nn.Linear(5, 8, bias=False),
    torch.nn.Linear(8, 1, bias=False),
    torch.nn.Sigmoid()
)
criterion = torch.nn.BCELoss()  # loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # optimizer

for i in range(10):
    p = model(x)            # inference: get the probability p
    loss = criterion(p, y)  # compute the loss
    acc = torch.mean(((p > 0.5) == y).float())  # compute the accuracy
    print("{}--loss:{:.4f} acc:{:.4f}".format(i, loss.item(), acc.item()))
    optimizer.zero_grad()   # reset the gradients
    loss.backward()         # backpropagation, like computing every variable's gradient in numpy
    optimizer.step()        # parameter update, like updating w1 and w2 in numpy
The output is as follows:
0--loss:0.6549 acc:0.6260
1--loss:0.4986 acc:0.8080
2--loss:0.3646 acc:0.8780
3--loss:0.2605 acc:0.9350
4--loss:0.1877 acc:0.9620
5--loss:0.1393 acc:0.9720
6--loss:0.1077 acc:0.9790
7--loss:0.0864 acc:0.9860
8--loss:0.0711 acc:0.9890
9--loss:0.0599 acc:0.9950
The pytorch implementation is also simple and easy to understand. Compared with the graph mechanism of tensorflow, torch operations feel more like numpy and are easier to pick up. Compared with the deep encapsulation of keras, torch gives us more direct control over the gradient calculation and parameter update steps during training. I still prefer pytorch.
Summary
Except for numpy, the frameworks implement backpropagation and parameter updates internally, which makes training a model much more convenient. They also provide the commonly used loss functions and evaluation metrics. However, an interviewer may ask you to hand-code the complete model training process, which requires a good understanding of each step.
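As a hand-coding exercise of that kind, the following is a minimal sketch of plain single-layer logistic regression trained with full-batch gradient descent. It is not code from the sections above: the function name `train_logistic_regression`, the zero initialization, the learning rate of 0.5, the 200 steps and the fixed seed are all assumptions chosen for illustration. It uses the simplified gradient $p - y$ and averages it over the batch.

```python
import numpy as np


def train_logistic_regression(x, y, lr=0.5, steps=200):
    """Full-batch gradient descent on the mean BCE loss for p = sigmoid(x @ w).

    Hypothetical helper for illustration; hyperparameters are assumptions.
    """
    w = np.zeros((x.shape[1], 1))
    losses = []
    for _ in range(steps):
        # Forward pass
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        p = np.clip(p, 1e-7, 1 - 1e-7)  # keep log() finite
        losses.append(float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))))
        # Backward pass: dloss/dz = p - y, averaged over the batch
        grad_w = x.T @ (p - y) / len(x)
        # Parameter update
        w -= lr * grad_w
    return w, losses


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.standard_normal((1000, 5))
y = (x.sum(axis=-1, keepdims=True) > 0).astype(np.float64)

w, losses = train_logistic_regression(x, y)
p = 1.0 / (1.0 + np.exp(-(x @ w)))
acc = float(np.mean((p > 0.5) == y))
print("first loss {:.4f}, last loss {:.4f}, acc {:.4f}".format(losses[0], losses[-1], acc))
```

The loop has the same three stages as every implementation in this article: forward pass, gradient computation, parameter update.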
Talk is cheap, show me the code.