[NLP] NLP Interview Q&A: Activation Functions and Loss Functions

What are the advantages and disadvantages of the Sigmoid function?

Advantages:

  • Bounded output range: the output is mapped into (0, 1), so it can be interpreted as the probability of the positive class in binary classification and used in the output layer.
  • Easy to differentiate: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\).

Disadvantages:

  • The sigmoid saturates easily, and its gradient lies in (0, 0.25], which makes vanishing gradients likely during backpropagation (see the sketch below).
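
A minimal NumPy sketch (function names are my own) illustrating both points: the output lies in (0, 1), and the gradient never exceeds 0.25 and shrinks quickly once the input saturates.

```python
import numpy as np

def sigmoid(x):
    """Maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10, 10, 1001)
print(sigmoid_grad(x).max())          # 0.25, attained at x = 0
print(sigmoid_grad(np.array([8.0])))  # ~3e-4: the function has saturated
```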

What are the advantages and disadvantages of ReLU?

Advantages:

  • ReLU is a non-saturating activation (for positive inputs), providing relatively wide activation boundaries.
  • Its gradient takes only the two values 0 and 1, which effectively mitigates the vanishing-gradient problem.
  • Its one-sided suppression (zeroing negative inputs) gives the network the ability to produce sparse representations.

Disadvantages:

  • Training can lead to "dead" neurons. If a parameter update is inappropriate, a ReLU neuron in the first hidden layer may never be activated on any of the training data. Its gradient with respect to its own parameters is then always 0, so it can never be updated for the rest of training. This phenomenon is called the Dying ReLU problem (sketched below).
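
A small sketch (with hypothetical numbers) of the dying-ReLU scenario: once the pre-activation is negative on every sample, both the activation and its gradient are zero, so no update flows back through the neuron.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

# Hypothetical pre-activations of one neuron after a bad update:
# negative on every training sample.
pre_activations = np.array([-3.2, -0.7, -5.1, -1.4])
print(relu(pre_activations))        # all zeros -> the neuron never fires
print(relu_grad(pre_activations))   # all zeros -> no gradient reaches its weights
```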

What is cross-entropy?

  • Cross-entropy measures the distance between two probability distributions; it describes the difference between the predicted distribution and the true distribution.
  • Cross-entropy formula: \(H(p, q) = -\sum_x p(x)\, \log q(x)\), where \(x\) ranges over the classes of a sample.
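
A quick numeric illustration of the formula above (the distributions are made up): the further the predicted distribution q is from the true distribution p, the larger H(p, q).

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log(q(x)); eps avoids log(0)."""
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])        # true (one-hot) distribution
q_good = np.array([0.1, 0.8, 0.1])   # confident, correct prediction
q_bad  = np.array([0.6, 0.2, 0.2])   # mostly wrong prediction
print(cross_entropy(p, q_good))      # ~0.22
print(cross_entropy(p, q_bad))       # ~1.61 -> the distributions are further apart
```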

Why is the loss function for classification problems cross-entropy rather than MSE?

From a modeling point of view:

  • MSE is the negative log-likelihood of the conditional probability distribution under the assumption that the data follow a Gaussian distribution. It measures the Euclidean distance between two vectors.
  • CE is the negative log-likelihood of the conditional probability distribution under the assumption that the model outputs a multinomial distribution. It measures the difference between the predicted distribution and the true distribution.

From a gradient point of view:

  • Gradient of MSE: \(\frac{\partial L}{\partial \hat y_i} = 2(\hat y_i - y_i)\)
  • Gradient of CE: \(\frac{\partial L}{\partial \hat y_i} = -\frac{y_i}{\hat y_i}\)

Late in optimization, the MSE residuals tend to zero and the gradient becomes very small, which slows optimization down. With CE, only the correct-class component of the label is non-zero (the incorrect-class components are constantly 0), and its gradient \(-y_i/\hat y_i\) stays close to 1 in magnitude even as \(\hat y_i\) approaches 1, so optimization is faster. A small comparison follows.
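A small sketch with made-up predictions, using the gradient formulas above, showing why MSE provides almost no signal near the end of training while CE still does for the correct class.

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.005, 0.99, 0.005])   # late in training: prediction almost perfect

mse_grad = 2.0 * (y_pred - y_true)        # d/dy_hat of sum (y_hat - y)^2
ce_grad  = -y_true / y_pred               # d/dy_hat of -sum y * log(y_hat)

print(mse_grad)   # [ 0.01, -0.02,  0.01] -> almost no signal left
print(ce_grad)    # [-0. , -1.01, -0. ]   -> still a usable gradient on the correct class
```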

Intuitively:

  • MSE pays equal attention to the gap between the predicted and true probabilities of every class.
  • CE only cares about the predicted probability of the correct class.

In multi-class problems, what is the difference between using sigmoid and softmax as the activation function of the last layer?

  • Each output of the sigmoid function is independent; it cannot capture the correlation between classes.
  • Softmax normalizes the outputs, so an increase in one output must be accompanied by a decrease in the others. This is more consistent with the rules of probability and reflects the mutually exclusive relationship between classes.
  • If a sample can belong to multiple classes and each class is classified independently, sigmoid can be used as the activation for each output; for mutually exclusive classes, softmax should be used as the last-layer activation. A short comparison is sketched below.
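
A short comparison on the same (made-up) logits: sigmoid scores each class independently, while softmax couples the outputs into a single distribution that sums to 1.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])              # hypothetical scores for three classes

sigmoid = 1.0 / (1.0 + np.exp(-logits))         # each output independent, no coupling
softmax = np.exp(logits) / np.exp(logits).sum() # outputs coupled: they sum to 1

print(sigmoid, sigmoid.sum())   # [0.88 0.73 0.62], sum ~2.23 -> suits multi-label tasks
print(softmax, softmax.sum())   # [0.63 0.23 0.14], sum = 1.0 -> suits mutually exclusive classes
```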

Why does the LSTM use tanh and sigmoid as activation functions rather than ReLU?

In the LSTM, the sigmoid function serves as the gating function: its (0, 1) output range is what lets it control how much information passes through, so it cannot be replaced.

The purpose of ReLU is to alleviate the vanishing-gradient problem, but in the LSTM the residual-style connection of the cell state along the time dimension has already greatly reduced vanishing gradients.

On the other hand, tanh maps the output into the range (-1, 1), which makes the model easier to optimize.
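
A simplified single-step LSTM sketch (my own variable names, random weights, illustration only) showing why the ranges matter: the sigmoid gates f, i, o act as fractions in (0, 1) that scale how much information is kept, and tanh keeps the candidate and output values bounded in (-1, 1).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(c_prev, h_prev, x, W, b):
    """One LSTM time step (single sample); W maps [h_prev; x] to 4 gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1): fractions of information kept
    g = np.tanh(g)                                  # candidate values, bounded in (-1, 1)
    c = f * c_prev + i * g                          # additive cell-state update along time
    h = o * np.tanh(c)
    return c, h

# Tiny example with hidden size 2 and input size 3.
rng = np.random.default_rng(0)
H, X = 2, 3
W = rng.normal(size=(4 * H, H + X))
b = np.zeros(4 * H)
c, h = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=X), W, b)
print(c, h)
```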

Softmax backpropagation

For a multi-class problem, consider a single-layer neural-network classifier whose output-layer activation is softmax, with only the weight parameter \(W\), optimized with SGD. The input sample is \(x\) with label \(y\), the sample dimension is \(m\), and the number of classes is \(n\). Its forward-propagation and back-propagation equations are:

  • Forward propagation:

\[\begin{aligned} &z = Wx \\ &p_i = \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)} \\ &L(\hat{y}, y) = -\sum_{i=1}^n y_i \log p_i \end{aligned}\]

  • Back-propagation:

\[\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i} \]

\[\begin{cases} \frac{\partial p_i}{\partial z_j} = \frac{\exp(z_i)\sum_{k=1}^{n} \exp(z_k) - \exp(z_i)^2}{(\sum_{k=1}^{n} \exp(z_k))^2} = p_i(1-p_i) & , i = j\\ \frac{\partial p_i}{\partial z_j} = -\frac{\exp(z_j)\exp(z_i)}{(\sum_{k=1}^{n} \exp(z_k))^2} = -p_ip_j & , i \ne j \end{cases} \]

then

\[\begin{aligned} \frac{\partial L}{\partial z_i} &= \sum_{j=1}^n \frac{\partial L}{\partial p_j} \frac{\partial p_j}{\partial z_i}\\ &= - \frac{y_i}{p_i}p_i(1-p_i) - \sum_{j\ne i}\frac{y_j}{p_j}(-p_ip_j) \\ &= -y_i + p_iy_i + p_i\sum_{j\ne i}y_j \\ &= -y_i + p_i \sum_{j=1}^ny_j \\ &= p_i - y_i \end{aligned}\]

In matrix form: \(\frac{\partial L}{\partial z} = p - y\).
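
A small numerical check (made-up logits and a one-hot label) that the analytic result \(\frac{\partial L}{\partial z} = p - y\) matches a finite-difference gradient.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

# Analytic gradient from the derivation: dL/dz = p - y.
z = np.array([1.2, -0.3, 0.7, 2.1])
y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot label
analytic = softmax(z) - y

# Central finite-difference approximation of each component.
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(len(z))[i], y) - loss(z - eps * np.eye(len(z))[i], y)) / (2 * eps)
    for i in range(len(z))
])
print(np.max(np.abs(analytic - numeric)))   # tiny (~1e-10): the two gradients agree
```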
