Hung-yi Lee Machine Learning Course Notes

Machine learning models


Learning materials

https://datawhalechina.github.io/leeml-notes
Notes for Hung-yi Lee's machine learning class
https://www.bilibili.com/video/av59538266/?p=6
Video of the course
Problem types: regression (predicting a value), classification, and structured learning (e.g., machine translation, speech recognition, speech synthesis).
Learning scenarios: supervised, semi-supervised, unsupervised, and reinforcement learning (the correct answers are not given; only a score / feedback signal is).
Regression is where the learning rate is introduced. The method for finding the parameters: gradient descent.
The benefit of regularization: it lets you add more features without some of them getting weights so large that the model overfits.
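For concreteness, the usual L2-regularized regression loss (as in the course's regression lecture) has the form below; λ is the regularization weight, and larger λ prefers smaller, smoother weights:

$$L(w, b) = \sum_n \Big(\hat y^n - \big(b + \sum_i w_i x_i^n\big)\Big)^2 + \lambda \sum_i w_i^2$$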

P5: Where does the error come from?

Bias and variance.
A relatively simple model is not easily swayed by the particular training data (low variance), but its capacity is limited and it cannot take advantage of more features, so it underfits and stays relatively far from the true function (high bias).

Notes on gradient descent
  1. Adjust the learning rate. If it is too small, the loss decreases slowly; if it is too large, the loss may explode, or oscillate back and forth without decreasing. Even with many parameters, you can visualize how the loss changes with the number of parameter updates.
  2. Adapt the learning rate automatically. The learning rate should get smaller and smaller as learning progresses. Ideally, different parameters get different learning rates; one such method is Adagrad (a sketch follows this list):

    $$w^{t+1} \leftarrow w^t - \frac{\eta^t}{\sigma^t}\, g^t, \qquad \eta^t = \frac{\eta}{\sqrt{t+1}}, \qquad \sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$$
    w is the parameter, g^t is the partial derivative at step t, and σ^t is the root mean square of all past partial derivatives.
    RMS: sum the N squared values, divide by N, then take the square root.
    The simplified final formula (the 1/(t+1) factors cancel):
    $$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$$

  3. Faster gradient descent: stochastic gradient descent. The loss is computed on just one example, so each training example triggers one parameter update.
  4. Feature scaling: if different features have very different ranges, rescale the inputs so that the features have comparable distributions. This makes the contours of the loss look more like circles than elongated ellipses; since the parameters are updated perpendicular to the contour lines, descent on circular contours heads toward the minimum faster. Method (a sketch follows this list):

    $$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$
    (m_i is the mean of the i-th feature over all examples r, and σ_i its standard deviation.)

  5. The mathematical foundation of gradient descent: the Taylor expansion (Taylor's theorem). The guarantee of correctness rests on the premise that each step is small enough!
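A minimal numpy sketch of the simplified Adagrad update above, applied to a toy linear regression; the data, base learning rate, and the small eps term are illustrative assumptions, not from the notes:

```python
import numpy as np

# Toy data: y = 2x + 1 plus noise (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0
eta = 0.5                      # base learning rate (assumed)
acc_gw, acc_gb = 0.0, 0.0      # accumulated squared gradients
eps = 1e-8                     # avoids division by zero on the first step

for t in range(1000):
    y_hat = w * x + b
    # Gradients of the mean squared error.
    gw = (-2.0 * (y - y_hat) * x).mean()
    gb = (-2.0 * (y - y_hat)).mean()
    acc_gw += gw ** 2
    acc_gb += gb ** 2
    # Adagrad: divide by the root of the sum of past squared gradients.
    w -= eta / (np.sqrt(acc_gw) + eps) * gw
    b -= eta / (np.sqrt(acc_gb) + eps) * gb

print(w, b)  # should approach 2 and 1
```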
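And a sketch of the feature scaling formula, standardizing each feature to zero mean and unit variance (the toy matrix is illustrative):

```python
import numpy as np

def feature_scale(X):
    """Standardize each feature (column) to zero mean, unit variance."""
    m = X.mean(axis=0)        # m_i: mean of the i-th feature
    s = X.std(axis=0)         # sigma_i: std of the i-th feature
    return (X - m) / s

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
print(feature_scale(X))       # each column now has mean 0, std 1
```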

Classification:
  1. Binary classification: find a function whose output (greater than / less than 0) splits the two classes. The loss can be the number of misclassified examples in the training set. Methods for finding the best parameters: perceptron / SVM.
  2. Probabilistic (generative) classification: a Gaussian distribution has two parameters, μ and Σ; by adjusting them you can center the Gaussian on a class's feature distribution. The better the parameters, the higher the probability of the class's features under the Gaussian.

    $$f_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
    $$L(\mu,\Sigma) = \prod_n f_{\mu,\Sigma}(x^n)$$
    The maximum-likelihood solution is the sample mean and sample covariance. To prevent overfitting, all classes can share the same Σ (while keeping different μ); the optimal shared Σ is the weighted average of the per-class Σ's, weighted by class proportion, e.g. Σ = (N₁Σ¹ + N₂Σ²)/(N₁ + N₂).

  3. Sigmoid function: simplifying the probabilistic model above turns it into the sigmoid function σ(z), where z simplifies to the linear form w·x + b (once the shared Σ is used, the model is linear) → logistic regression.
  4. The loss function of logistic regression: it is the cross-entropy (a sketch follows this list).

    $$L(w,b) = \prod_n f_{w,b}(x^n)^{\hat y^n}\,\big(1 - f_{w,b}(x^n)\big)^{1-\hat y^n}$$
    Following the probabilistic model, find the w, b that maximize the probability of the training data. For binary classification, let the label ŷ be 1 for class C1 and 0 for class C2; then every example's probability can be written in the same form, and taking −ln turns the product into a sum:
    $$-\ln L(w,b) = \sum_n -\Big[\hat y^n \ln f_{w,b}(x^n) + (1-\hat y^n)\ln\big(1-f_{w,b}(x^n)\big)\Big]$$
    Each term is the cross-entropy between two Bernoulli distributions:
    $$H(p,q) = -\sum_x p(x)\ln q(x)$$
    The resulting loss can also be optimized with gradient descent. Take the partial derivative with respect to w_i (w_i is the i-th component of the weight vector, x_i the i-th input feature):
    $$\frac{\partial(-\ln L)}{\partial w_i} = \sum_n -\big(\hat y^n - f_{w,b}(x^n)\big)\,x_i^n$$
    The gradient descent update derived here has exactly the same form as the one for linear regression!

  5. Multi-class classification: the method is similar, using softmax (a sketch follows this list):

    $$y_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
    Its role: each output y_i is squeezed into [0, 1] (and the outputs sum to 1), while the gaps between large and small values are pulled further apart, i.e., strengthened. After softmax, the outputs can be read as the probabilities (scores) of belonging to each class.
    This method can be derived from two angles: one assumes the data are Gaussian and derives the softmax formula from the Gaussian distributions; the other derives it via maximum entropy.
    Its loss function is the cross-entropy:
    $$-\sum_i \hat y_i \ln y_i$$

  6. If the feature distribution cannot be separated into classes by a straight line, logistic regression cannot be applied directly; you can transform the features first.
  7. Let the machine do the feature transformation itself: cascade logistic regressions. A single logistic regression unit is called a neuron.
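A minimal numpy sketch of logistic regression trained with the cross-entropy gradient derived in item 4; the toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data (illustrative): label 1 for C1, 0 for C2.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
eta = 0.1

for _ in range(500):
    f = sigmoid(X @ w + b)
    # Cross-entropy gradient -(yhat^n - f(x^n)) x_i^n, averaged over examples.
    grad_w = -((y - f)[:, None] * X).mean(axis=0)
    grad_b = -(y - f).mean()
    w -= eta * grad_w
    b -= eta * grad_b

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(w, b, acc)
```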
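And a small sketch of softmax with its cross-entropy loss from item 5 (the scores and one-hot target are illustrative; the max subtraction is a standard numerical-stability trick not mentioned in the notes):

```python
import numpy as np

def softmax(z):
    """Squash scores into (0, 1) so they sum to 1; gaps get exaggerated."""
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat_onehot, y_prob):
    """Cross-entropy loss: -sum_i yhat_i * ln(y_i)."""
    return -(y_hat_onehot * np.log(y_prob)).sum()

z = np.array([3.0, 1.0, -3.0])      # class scores (illustrative)
y = softmax(z)
print(y)                            # large values pulled further from small ones
print(cross_entropy(np.array([1.0, 0.0, 0.0]), y))
```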

Deep learning

Ways of connecting the neurons:
  1. Fully connected feedforward network: almost the same as before; connecting the neurons in different ways yields different network structures. The function inside the neurons no longer has to be the sigmoid. The last layer is still a softmax, which produces the network's output.
  2. The loss is also as before: for a classification problem it is the cross-entropy between ŷ and y, summed over all examples in the training set.
  3. The parameters can again be found with gradient descent; the derivatives are computed with backpropagation (a sketch follows this list). Backpropagation:

    Backpropagation applies the chain rule, propagating derivatives from the last layer back toward the input layer. (C is the distance between y and ŷ, generally the cross-entropy; z is the linear output of a neuron.) Each weight's derivative factors as
    $$\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w}\cdot\frac{\partial C}{\partial z}$$
    where ∂z/∂w is simply the input a feeding that weight (the forward pass). If the layer right after is the output layer, ∂C/∂z can be computed directly:
    $$\frac{\partial C}{\partial z} = \frac{\partial y}{\partial z}\,\frac{\partial C}{\partial y}$$
    Otherwise it is computed recursively from the next layer's values:
    $$\frac{\partial C}{\partial z} = \sigma'(z)\sum_k w_k\,\frac{\partial C}{\partial z_k}$$
    (In actual operation this is of course computed in one backward sweep starting from the last layer.)
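A minimal numpy sketch of one forward/backward pass as described above, for a single example and one hidden sigmoid layer; sizes and values are illustrative, and squared error stands in for the cross-entropy to keep the derivatives short:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer, one training example (illustrative sizes/values).
rng = np.random.default_rng(0)
x = rng.standard_normal(3)          # input features
W1 = rng.standard_normal((4, 3))    # hidden-layer weights
W2 = rng.standard_normal((1, 4))    # output-layer weights
y_hat = np.array([1.0])             # target

# Forward pass: remember each z and each activation a.
z1 = W1 @ x
a1 = sigmoid(z1)
z2 = W2 @ a1
y = sigmoid(z2)

# Backward pass, output layer first: dC/dz2 = (dy/dz2) * (dC/dy).
dC_dy = 2.0 * (y - y_hat)           # squared error stands in for cross-entropy
dC_dz2 = sigmoid(z2) * (1 - sigmoid(z2)) * dC_dy
# Recursive step: dC/dz1 = sigma'(z1) * sum_k w_k * dC/dz_k.
dC_dz1 = sigmoid(z1) * (1 - sigmoid(z1)) * (W2.T @ dC_dz2)

# dC/dw = (input a) * dC/dz, the chain-rule factorization above.
grad_W2 = np.outer(dC_dz2, a1)
grad_W1 = np.outer(dC_dz1, x)
print(grad_W1.shape, grad_W2.shape)
```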

What is a fully connected layer?

Fully connected layers are not strictly necessary; they are frequently replaced by convolutions.
The fully connected (FC) layers act as the "classifier" of a convolutional neural network. If the convolution, pooling, and activation layers map the raw data into a feature space, then the fully connected layers map the learned "distributed representation" into the label space. In practice, a fully connected layer can be implemented with a convolution: an FC layer whose previous layer is also fully connected can be converted into a convolution with 1×1 kernels, and an FC layer whose previous layer is convolutional can be converted into a global convolution with h×w kernels, where h and w are the height and width of the previous layer's output (a sketch follows below).
The parameters of fully connected layers are redundant, but they preserve the representational power of the model when it is transferred. (https://www.zhihu.com/question/41037974)
https://blog.csdn.net/zfjBIT/article/details/88075569
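A small numpy check of the claim above that a fully connected layer over a c×h×w feature map equals a "global" convolution with kernels of the full h×w spatial size (all shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w, n_out = 8, 5, 5, 10
fmap = rng.standard_normal((c, h, w))        # previous conv layer's output

# FC view: flatten the feature map, multiply by an (n_out, c*h*w) matrix.
W_fc = rng.standard_normal((n_out, c * h * w))
out_fc = W_fc @ fmap.reshape(-1)

# Conv view: reshape the same weights into n_out kernels of size c x h x w;
# a "valid" convolution with a kernel as large as the input is one dot product.
kernels = W_fc.reshape(n_out, c, h, w)
out_conv = np.tensordot(kernels, fmap, axes=3)

print(np.allclose(out_fc, out_conv))         # True: the two are identical
```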

Receptive field?

It is the number of pixels of the original input image that one pixel of a feature map can represent (i.e., "see"). The early convolution layers have small receptive fields and capture local detail: each output pixel (activation value) is computed from only a small region of the input image. (A sketch of how to compute it follows below.)
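A small sketch of tracking the receptive field layer by layer with the standard recurrence r_out = r_in + (k − 1)·j, where k is the kernel size and j the cumulative stride; the layer stack here is an illustrative assumption:

```python
# Track receptive field r and cumulative stride j through a stack of layers.
# Recurrence: r_out = r_in + (kernel - 1) * j_in;  j_out = j_in * stride.
layers = [("conv3x3", 3, 1), ("conv3x3", 3, 1),
          ("maxpool2x2", 2, 2), ("conv3x3", 3, 1)]

r, j = 1, 1
for name, k, s in layers:
    r = r + (k - 1) * j
    j = j * s
    print(f"{name}: receptive field = {r} x {r}")
```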

Downsampling and upsampling

Pooling is downsampling; upsampling is the reverse operation. Effects of a pooling layer:

  • Dimensionality reduction: smaller model, faster computation
  • Lower chance of overfitting; more robust extracted features
  • Insensitivity to translation and rotation

There are two common kinds of pooling: max pooling and mean (average) pooling (a sketch follows below).
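A minimal numpy sketch of 2×2, stride-2 max and mean pooling (the input grid is illustrative):

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """2x2, stride-2 pooling over an (h, w) feature map (h, w even)."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))      # max pooling
    return blocks.mean(axis=(1, 3))         # mean (average) pooling

x = np.arange(16.0).reshape(4, 4)
print(pool2x2(x, "max"))    # 2x2 output, each value the max of a block
print(pool2x2(x, "mean"))   # downsampled by 2 in each dimension
```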

Whitening?

The purpose of whitening is to remove redundant information from the input data. For image training data, adjacent pixels are strongly correlated, so the raw input is redundant; whitening aims to reduce this redundancy.
The final image is shaped by many factors: ambient lighting intensity, object reflectance, the imaging camera, and so on. To obtain the information in the image that is invariant to these external influences, we whiten the image. (Image whitening can also be used to process over- or under-exposed images; the processing changes the image so that its mean pixel value is 0 and its variance is 1.) To remove these common factors, we transform the pixels to zero mean and unit variance: first compute the image's mean gray value and subtract it from each original pixel, then divide by the standard deviation. (A sketch follows below.)
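A minimal numpy sketch of the zero-mean, unit-variance processing described above (full whitening would additionally decorrelate the pixels; this sketch only standardizes, and the sample patch is illustrative):

```python
import numpy as np

def per_image_standardize(img):
    """Rescale an image to zero mean and unit variance, as described above."""
    img = img.astype(float)
    return (img - img.mean()) / img.std()

img = np.array([[200, 210], [190, 220]])    # an over-exposed patch (illustrative)
out = per_image_standardize(img)
print(out.mean(), out.std())                # ~0.0 and 1.0
```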

----------------
Source of the whitening section: CSDN blogger "Dean0Winchester", CC 4.0 BY-SA.
Original link: https://blog.csdn.net/qq_38906523/article/details/79853984
