Hands-on implementation of a deep learning framework, part 4: a cross-entropy loss function to support classification tasks

Code repository: https://github.com/brandonlyg/cute-dl

Goals

  1. Add a cross-entropy loss function so that the framework can support classification models.
  2. Build an MLP model and reach an accuracy of 91% on the MNIST classification task.

Implementing the cross-entropy loss function

Mathematical principles

Decomposing the cross-entropy loss function

        The cross-entropy loss function treats the model's output as the distribution of a discrete random variable. Let the output of the model be \(\hat{Y} = f(X)\), where \(f(X)\) denotes the model. \(\hat{Y}\) is an m × n matrix, as shown below:

\[\begin{bmatrix} \hat{y}_{11} & \hat{y}_{12} & ... & \hat{y}_{1n} \\ \hat{y}_{21} & \hat{y}_{22} & ... & \hat{y}_{2n} \\ ... & ... & ... & ... \\ \hat{y}_{m1} & \hat{y}_{m2} & ... & \hat{y}_{mn} \end{bmatrix} \]

        Let the i-th row of this matrix be \(\hat{y}_i \in \mathbb{R}^{1 \times n}\), and let its j-th element be \(\hat{y}_{ij}\).
        The cross-entropy loss function requires \(\hat{y}_i\) to have the following properties:

\[\begin{matrix} 0<=\hat{y}_{ij}<=1 & & (1)\\ \sum_{j=1}^{n} \hat{y}_{ij} = 1, & n=2,3,... & (2) \end{matrix} \]

        In particular, when n = 1 only the first property needs to be satisfied. We first consider the case n > 1; within it, n = 2 is equivalent to n = 1, and in engineering terms n = 1 can be regarded as an optimization of n = 2.
        If the model does not guarantee that its output has these properties, the loss function must first convert \(\hat{y}_i\) into a probability distribution \(\hat{p}_i\). The conversion function is defined as follows:

\[\begin{matrix} S_i = \sum_{j=1}^{n} e^{\hat{y}_{ij}}\\ \hat{p}_{ij} = \frac{e^{\hat{y}_{ij}}}{S_i} \end{matrix} \]

        The resulting \(\hat{p}_i\) satisfies the required properties. Since \(e^{\hat{y}_{ij}}\) is monotonically increasing, for any two values with \(\hat{y}_{ia} < \hat{y}_{ib}\) we have \(e^{\hat{y}_{ia}} < e^{\hat{y}_{ib}}\) and therefore \(\hat{p}_{ia} < \hat{p}_{ib}\). The conversion thus turns the model's outputs into probabilities while preserving their relative order.
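        As a quick illustration (values chosen arbitrarily), for \(\hat{y}_i = (1, 2, 3)\):

\[S_i = e^{1} + e^{2} + e^{3} \approx 30.19, \qquad \hat{p}_i \approx (0.090, 0.245, 0.665) \]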
        Let the category label of sample \(x_i\) be \(y_i \in \mathbb{R}^{1 \times n}\). If the true category of \(x_i\) is t, then \(y_i\) satisfies:

\[\begin{matrix} y_{ij} = 1 & if: j = t \\ y_{ij} = 0 & if: j \neq t \end{matrix} \]

        In other words, \(y_i\) is one-hot encoded. The cross-entropy loss function is defined as:

\[J_i = \frac{1}{m} \sum_{j=1}^{n} -y_{ij}ln(\hat{p}_{ij}) \]

        Each term of the loss function has the following properties, depending on the value of \(y_{ij}\):

\[\begin{matrix} -y_{ij}ln(\hat{p}_{ij}) \in [0, \infty), & if: y_{ij} = 1 \\ -y_{ij}ln(\hat{p}_{ij}) = 0, & if: y_{ij} = 0 \end{matrix} \]

        It can be seen that terms with \(y_{ij} = 0\) do not affect the value of the loss function, so they can be ignored during computation. For the terms with \(y_{ij} = 1\), the loss reaches its minimum value of 0 when \(\hat{p}_{ij} = y_{ij} = 1\).
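        Continuing the illustrative example above, if the true category of that sample is t = 3, then \(y_i = (0, 0, 1)\) and the sample's contribution to the loss is

\[J_i = \frac{1}{m}\left(-ln(0.665)\right) \approx \frac{0.408}{m} \]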

Gradient derivation

        According to the chain rule, the gradient of the loss function is:

\[\frac{\partial J_i}{\partial \hat{y}_{ij}} = \frac{\partial J_i}{\partial \hat{p}_{ij}} \frac{\partial \hat{p}_{ij}}{\partial \hat{y}_{ij}}, \quad (1) \]

        where:

\[\frac{\partial J_i}{\partial \hat{p}_{ij}} = \frac{1}{m} \frac{-y_{ij}}{\hat{p}_{ij}} \quad (2) \]

\[\frac{\partial \hat{p}_{ij}}{\partial \hat{y}_{ij}} = \frac{e^{\hat{y}_{ij}}S_i - e^{2\hat{y}_{ij}}}{S_i^2} = \frac{e^{\hat{y}_{ij}}}{S_i} - [\frac{e^{\hat{y}_{ij}}}{S_i}]^2 = \hat{p}_{ij} - (\hat{p}_{ij})^2 = \hat{p}_{ij}(1-\hat{p}_{ij}) \quad (3) \]

        Substituting (2), (3) into (1) gives:

\[\frac{\partial J_i}{\partial \hat{y}_{ij}} = \frac{1}{m} \frac{-y_{ij}}{\hat{p}_{ij}} \hat{p}_{ij}(1-\hat{p}_{ij}) = \frac{1}{m}(y_{ij}\hat{p}_{ij} -y_{ij}) \]

        When \(y_{ij} = 0\) the gradient is 0, so this case can be ignored; for the remaining terms \(y_{ij} = 1\), and the final gradient is:

\[\frac{\partial J_i}{\partial \hat{y}_{ij}} = \frac{1}{m}(\hat{p}_{ij} -y_{ij}) \]

        If the model's output is already a probability distribution, the loss function can omit the step that converts \(\hat{y}_{ij}\) into \(\hat{p}_{ij}\). In that case \(\hat{y}_{ij} = \hat{p}_{ij}\), and the final gradient becomes:

\[\frac{\partial J_i}{\partial \hat{y}_{ij}} = \frac{\partial J_i}{\partial \hat{p}_{ij}} = - \frac{y_{ij}}{m\hat{y}_{ij}} \]
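        The from-logits gradient \(\frac{1}{m}(\hat{p}_{ij} - y_{ij})\) derived above can be sanity-checked numerically. The standalone numpy sketch below (illustrative code only, not part of the framework; the helper names are made up) compares it with central finite differences of the loss:

import numpy as np

def softmax(y_hat):
    #row-wise softmax, shifted by the row max for numerical stability
    e = np.exp(y_hat - y_hat.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(y_true, y_hat):
    #J = (1/m) * sum_ij( -y_ij * ln(p_ij) )
    m = y_true.shape[0]
    return (-y_true * np.log(softmax(y_hat))).sum() / m

np.random.seed(0)
y_hat = np.random.randn(4, 5)                     #m=4 samples, n=5 classes
y_true = np.eye(5)[np.random.randint(0, 5, 4)]    #one-hot labels

analytic = (softmax(y_hat) - y_true) / y_true.shape[0]   #(p - y)/m

eps = 1e-6
numeric = np.zeros_like(y_hat)
for idx in np.ndindex(*y_hat.shape):
    d = np.zeros_like(y_hat)
    d[idx] = eps
    numeric[idx] = (ce_loss(y_true, y_hat + d) - ce_loss(y_true, y_hat - d)) / (2 * eps)

print(np.abs(analytic - numeric).max())   #expected to be vanishingly small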


Special case of the cross-entropy loss function: only two categories

        Now consider the case n = 1. Here \(\hat{y}_i \in \mathbb{R}^{1 \times 1}\) can be treated as a scalar.
        If the model's output is not already a probability, the loss function can be decomposed into:

\[\begin{matrix} \hat{p}_{i} = \frac{1}{1+e^{-\hat{y}_{i}}} \\ \\ J_i = \frac{1}{m}[-y_iln(\hat{p}_{i}) - (1-y_i)ln(1-\hat{p}_{i})] \end{matrix} \]

        The gradient of the loss function with respect to the output value is:

\[\frac{\partial J_i}{\partial \hat{p}_i} = \frac{1}{m}(-\frac{y_i}{\hat{p}_i} + \frac{1-y_i}{1 - \hat{p}_i}) = \frac{\hat{p}_i - y_i}{m\hat{p}_i(1-\hat{p}_i)}, \quad (1) \]

\[\frac{\partial \hat{p}_i}{\partial \hat{y}_i} = \frac{e^{-\hat{y}_{i}}}{(1+e^{-\hat{y}_{i}})^2} = \frac{1}{1+e^{-\hat{y}_{i}}} \frac{e^{-\hat{y}_{i}}}{1+e^{-\hat{y}_{i}}} = \hat{p}_{i}(1- \hat{p}_{i} ), \quad (2) \]

\[\frac{\partial J_i}{\partial \hat{y}_i} = \frac{\partial J_i}{\partial \hat{p}_i} \frac{\partial \hat{p}_i}{\partial \hat{y}_i}, \quad (3) \]

        Substituting (1), (2) into (3) gives:

\[\frac{\partial J_i}{\partial \hat{y}_i} = \frac{\hat{p}_i - y_i}{m\hat{p}_i(1-\hat{p}_i)} \hat{p}_{i}(1- \hat{p}_{i} ) = \frac{1}{m}(\hat{p}_i - y_i) \]

        If the model's output is already a probability, then:

\[\frac{\partial J_i}{\partial \hat{y}_i} = \frac{\partial J_i}{\partial \hat{p}_i} = \frac{\hat{y}_i - y_i}{m\hat{y}_i(1-\hat{y}_i)} \]
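        As in the multi-class case, the from-logits result \(\frac{1}{m}(\hat{p}_i - y_i)\) can be verified numerically with a standalone sketch (illustrative code only, not framework code):

import numpy as np

def sigmoid(y_hat):
    return 1.0 / (1.0 + np.exp(-y_hat))

def bce_loss(y_true, y_hat):
    #J = (1/m) * sum_i( -y_i*ln(p_i) - (1-y_i)*ln(1-p_i) )
    m = y_true.shape[0]
    p = sigmoid(y_hat)
    return (-y_true * np.log(p) - (1 - y_true) * np.log(1 - p)).sum() / m

np.random.seed(1)
y_hat = np.random.randn(6, 1)                              #m=6 samples, one output each
y_true = np.random.randint(0, 2, (6, 1)).astype(float)     #labels in {0, 1}

analytic = (sigmoid(y_hat) - y_true) / y_true.shape[0]     #(p - y)/m

eps = 1e-6
numeric = np.zeros_like(y_hat)
for idx in np.ndindex(*y_hat.shape):
    d = np.zeros_like(y_hat)
    d[idx] = eps
    numeric[idx] = (bce_loss(y_true, y_hat + d) - bce_loss(y_true, y_hat - d)) / (2 * eps)

print(np.abs(analytic - numeric).max())   #expected to be vanishingly small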


Implementation code

        The implementations of these two cross-entropy loss functions are in cutedl/losses.py. The general cross-entropy loss function is the class CategoricalCrossentropy; its main implementation is as follows:

  '''
  The input shape is (m, n)
  '''
  def __call__(self, y_true, y_pred):
      m = y_true.shape[0]
      #pdb.set_trace()
      if not self.__form_logists:
          #compute the loss
          loss = (-y_true*np.log(y_pred)).sum(axis=0)/m
          #compute the gradient
          self.__grad = -y_true/(m*y_pred)
          return loss.sum()

      m = y_true.shape[0]
      #convert the output into a probability distribution
      y_prob = dlmath.prob_distribution(y_pred)
      #pdb.set_trace()
      #compute the loss
      loss = (-y_true*np.log(y_prob)).sum(axis=0)/m
      #compute the gradient
      self.__grad  = (y_prob - y_true)/m

      return loss.sum()
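        Note that y_true must be one-hot encoded with the same (m, n) shape as y_pred. A small helper for building such labels from integer class indices (a hypothetical utility shown for illustration; the framework may provide its own):

import numpy as np

def to_one_hot(labels, num_classes):
    '''Convert integer class labels of shape (m,) into an (m, num_classes) one-hot matrix.'''
    one_hot = np.zeros((labels.shape[0], num_classes))
    one_hot[np.arange(labels.shape[0]), labels] = 1.0
    return one_hot

#e.g. three MNIST digit labels
print(to_one_hot(np.array([3, 0, 9]), 10).shape)   #(3, 10)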

        The prob_distribution function converts the model output into a probability distribution. It is implemented as follows:

def prob_distribution(x):
    #exponentiate each element
    expval = np.exp(x)
    #row-wise sums; the small constant guards against division by zero
    sum = expval.sum(axis=1).reshape(-1,1) + 1e-8

    prob_d = expval/sum

    return prob_d
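        One caveat: np.exp overflows for large logits. A common variant (shown here as a suggestion, not the framework's current code) subtracts the per-row maximum before exponentiating, which leaves the result mathematically unchanged:

import numpy as np

def prob_distribution_stable(x):
    #subtracting the row max does not change the softmax result,
    #but it keeps np.exp from overflowing on large inputs
    shifted = x - x.max(axis=1, keepdims=True)
    expval = np.exp(shifted)
    return expval / expval.sum(axis=1, keepdims=True)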

        The binary-classification cross-entropy loss function is the class BinaryCrossentropy; its main implementation is as follows:

'''
The input shape is (m, 1)
'''
def __call__(self, y_true, y_pred):
    #pdb.set_trace()
    m = y_true.shape[0]

    if not self.__form_logists:
        #compute the loss
        loss = (-y_true*np.log(y_pred)-(1-y_true)*np.log(1-y_pred))/m
        #compute the gradient
        self.__grad = (y_pred - y_true)/(m*y_pred*(1-y_pred))
        return loss.sum()

    #convert the output into probabilities
    y_prob = dlmath.sigmoid(y_pred)
    #compute the loss
    loss = (-y_true*np.log(y_prob) - (1-y_true)*np.log(1-y_prob))/m
    #compute the gradient
    self.__grad = (y_prob - y_true)/m

    return loss.sum()
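        When the probabilities reach exactly 0 or 1, np.log produces -inf and the gradients blow up. A common safeguard (a suggestion, not something the framework currently does) is to clip the probabilities before taking logarithms:

import numpy as np

def safe_log(p, eps=1e-12):
    #clip probabilities away from 0 and 1 before applying log
    return np.log(np.clip(p, eps, 1.0 - eps))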

Verifying on the MNIST dataset

        Now the MNIST classification task is used to verify the cross-entropy loss function. The code is in examples/mlp/mnist-recognize.py. Before running it, download the raw MNIST dataset into examples/datasets/ and decompress it. The dataset download link is https://pan.baidu.com/s/1CmYYLyLJ87M8wH2iQWrrFA, password: 1rgr.
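        For reference, the decompressed files use the standard MNIST IDX binary format. A minimal standalone loader is sketched below (assuming the usual file names under examples/datasets/; the example script relies on the framework's own dataset code, which may differ):

import numpy as np

def load_idx_images(path):
    #IDX image file: 16-byte header (magic, count, rows, cols), then uint8 pixels
    with open(path, 'rb') as f:
        data = f.read()
    count = int.from_bytes(data[4:8], 'big')
    rows = int.from_bytes(data[8:12], 'big')
    cols = int.from_bytes(data[12:16], 'big')
    pixels = np.frombuffer(data, dtype=np.uint8, offset=16)
    return pixels.reshape(count, rows*cols) / 255.0

def load_idx_labels(path):
    #IDX label file: 8-byte header (magic, count), then uint8 labels
    with open(path, 'rb') as f:
        data = f.read()
    return np.frombuffer(data, dtype=np.uint8, offset=8)

train_x = load_idx_images('examples/datasets/train-images-idx3-ubyte')
train_y = load_idx_labels('examples/datasets/train-labels-idx1-ubyte')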

        The code for training the model is as follows:

'''
Train the model
'''
def fit():
    inshape = ds_train.data.shape[1]
    model = Model([
                nn.Dense(10, inshape=inshape, activation='relu')
            ])
    model.assemble()

    sess = Session(model,
            loss=losses.CategoricalCrossentropy(),
            optimizer=optimizers.Fixed(0.001)
            )

    stop_fit = session.condition_callback(lambda :sess.stop_fit(), 'val_loss', 10)

    #pdb.set_trace()
    history = sess.fit(ds_train, 20000, val_epochs=5, val_data=ds_test,
                        listeners=[
                            stop_fit,
                            session.FitListener('val_end', callback=accuracy)
                        ]
                    )

    fit_report(history, report_path+"0.png")
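        The accuracy callback registered above is defined elsewhere in the example script. For this classification task it conceptually compares the argmax of the predicted distribution with the argmax of the one-hot labels, along these lines (an illustrative sketch, not the script's exact code):

import numpy as np

def compute_accuracy(y_true, y_pred):
    #fraction of samples whose predicted class matches the one-hot label
    return (np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)).mean()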

        Fitting report (the figure is saved to report_path + "0.png"):

        It can be seen that after about an hour (3699 s) and nearly 6 million training steps, the model reached an accuracy of 92%. The same model reaches 91% after roughly ten minutes of training in TensorFlow (CPU version). This shows that the cute-dl framework is fine in terms of task performance, but its training speed is poor.

Summary

        At this stage, the framework supports classification tasks, and the model's performance on the MNIST dataset meets expectations. The training speed, however, is not satisfactory.
        In the next stage, a learning-rate optimizer will be added so that models train faster without losing generalization ability.
