Summary of logit, entropy, clustering and other knowledge

1. Linear evaluation freezes the backbone of the model and trains only the last fully connected layer.
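A minimal sketch of this setup, assuming a torchvision ResNet-18 backbone and an illustrative 10-class head (the class count and learning rate are made up for the example):

```python
# Linear evaluation sketch: freeze the backbone, train only a new fully connected head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)            # pretrained weights would normally be loaded here
for param in model.parameters():                 # freeze every backbone parameter
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)   # new trainable fully connected layer (10 classes, illustrative)

# Only the parameters of the new head are handed to the optimizer.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)
```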
2. End-to-end learning means that no additional processing is needed outside the model: from the original data at the input to the task result at the output, the entire training and prediction process is handled inside the model.
In practice, the industry definition of "end-to-end" is fairly loose; any of the following can call itself end-to-end:
① The input is the original data (no preprocessing of the raw data is required);
② The input is the original data and the output is the final required result;
③ Global optimization is performed, emphasizing that a single neural network model contains all of the sub-steps, so every sub-step can be optimized jointly and globally.
3. K-means clustering algorithm (see: Detailed Explanation of the Kmeans Clustering Algorithm)
As a representative unsupervised clustering algorithm, the main function of K-means is to automatically group similar samples into the same category. "Unsupervised" means that the input samples have no corresponding outputs or labels. Clustering attempts to divide the samples in the data set into several (usually disjoint) subsets; each subset is called a "cluster".
K-means is the most commonly used clustering algorithm. The main idea is: given a value of K and K initial cluster centers, assign each point (i.e., each data record) to the cluster represented by its nearest center; after all points have been assigned, recompute each cluster's center as the mean of the points in that cluster; then keep alternating between assigning points and updating centers until the cluster centers barely change or the specified number of iterations is reached.
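A minimal NumPy sketch of this assign-and-update loop (the toy data, K=2, and the iteration cap are illustrative, not part of the original article):

```python
# Minimal K-means sketch: assign points to the nearest center, then recompute centers.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # K initial cluster centers
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                    # stop when centers barely change
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two toy blobs
labels, centers = kmeans(X, k=2)
```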
4. The EM algorithm is an iterative optimization strategy. Each iteration of the method is divided into two steps, an expectation step (E step) and a maximization step (M step), which is why it is called the EM algorithm (Expectation-Maximization algorithm). It was originally designed to solve parameter estimation problems in the presence of missing data (including hidden variables).
Excerpted from Detailed Explanation of EM Algorithm
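As a sketch of the general form (with observed data $X$, hidden variables $Z$, and parameters $\theta$), the two steps at iteration $t$ can be written as:

E step: $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\left[\log p(X, Z \mid \theta)\right]$

M step: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

The two steps are repeated until the parameters (or the log-likelihood) stop changing appreciably.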
5. For a detailed explanation of the logit transformation, see What is Logit? ——Discrete Choice Model 3
In linear regression, the dependent variable ranges from negative infinity to positive infinity, while the quantity studied for a binary variable is usually a rate, whose value lies between 0 and 1, so linear regression cannot be applied directly. Through the logit transformation, the 0-1 range is mapped onto the range from negative infinity to positive infinity, so all the methods of linear regression can be used. This is the origin of logistic regression.
A very important feature of the logit is that it has no upper or lower limit, which brings great convenience to modeling.
(That is to say, after the logit transformation, a probability p in (0,1) is mapped onto the whole range from negative infinity to positive infinity.)
$$\operatorname{logit}(p) = \ln\frac{p}{1-p}, \qquad p \in (0,1) \;\mapsto\; \operatorname{logit}(p) \in (-\infty, +\infty)$$
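A small numerical check of this mapping using scipy (the sample probabilities are made up for the illustration):

```python
# logit maps (0, 1) to (-inf, +inf); expit (the logistic sigmoid) is its inverse.
import numpy as np
from scipy.special import logit, expit

p = np.array([0.001, 0.25, 0.5, 0.75, 0.999])
z = logit(p)                     # ln(p / (1 - p)): large negative ... large positive
print(z)                         # approximately [-6.9068 -1.0986  0.  1.0986  6.9068]
print(np.allclose(expit(z), p))  # True: the transformation is invertible
```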
6. KL divergence is a metric used to measure the similarity of two probability distributions. For details, see Machine Learning: Detailed Explanation of KL Divergence.
Any observation in the real world can be regarded as information and data. Generally speaking, we cannot obtain the data of the whole population; we can only obtain a partial sample of it. Based on that sample, we make an approximate estimate of the whole, while the whole itself has a true distribution (which we may never know).
The similarity, or degree of difference, between the approximately estimated probability distribution and the actual probability distribution of the data as a whole can then be expressed by the KL divergence.
Information content: the amount of information carried by a single outcome $x_i$:
$$I(x_i) = -\log p(x_i)$$
Information entropy: the average amount of information; equivalently, when a code based on P is used to encode samples drawn from P, it is the average number of bits required under the optimal encoding:
$$H(P) = -\sum_i p(x_i)\log p(x_i)$$
For a continuous distribution, the average information content (differential entropy) is:
$$H(P) = -\int p(x)\log p(x)\,dx$$
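As a quick numerical illustration (the two coin distributions below are made up for the example), the entropy of a fair coin is 1 bit, while a biased coin carries less:

```python
# Entropy in bits: H(P) = -sum p * log2(p)
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p = 0 contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))            # 1.0 bit  (fair coin)
print(entropy([0.9, 0.1]))            # ~0.469 bits (biased coin, more predictable)
```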

Cross entropy:
The average number of bits required to encode samples from Q using a code based on P:
$$H(Q, P) = -\sum_i q(x_i)\log p(x_i)$$
Relative entropy [KL divergence]:
The extra number of bits needed, on average, when a code based on P is used to express samples from Q instead of the optimal code based on Q:
$$D_{KL}(Q \,\|\, P) = H(Q, P) - H(Q) = \sum_i q(x_i)\log\frac{q(x_i)}{p(x_i)}$$
It is always non-negative and equals zero only when the two distributions are identical; note that it is not symmetric in P and Q.
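A small numerical sketch of these quantities (the two example distributions are invented; scipy's entropy with two arguments computes the KL divergence directly):

```python
# Cross entropy H(Q, P) and KL divergence D_KL(Q || P) for two toy distributions.
import numpy as np
from scipy.stats import entropy

q = np.array([0.7, 0.2, 0.1])   # "true" distribution of the samples
p = np.array([0.5, 0.3, 0.2])   # distribution the code is based on

H_q  = -np.sum(q * np.log2(q))              # entropy of Q
H_qp = -np.sum(q * np.log2(p))              # cross entropy H(Q, P)
kl   = np.sum(q * np.log2(q / p))           # D_KL(Q || P) = H(Q, P) - H(Q)

print(H_q, H_qp, kl)
print(np.isclose(kl, entropy(q, p, base=2)))  # True: matches scipy's KL divergence
```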
7. Markov chain
A Markov chain assumes that all of the information from the past is already contained in the current state. For example, in the number sequence 1 - 2 - 3 - 4 - 5 - 6, from the perspective of a Markov chain the state 6 depends only on 5 and has nothing to do with any earlier step of the process.
Example of a process that is not a Markov chain:
Only processes that satisfy the Markov property are Markov chain processes. Consider, for example, drawing balls from a bag without replacement:
Clearly, the probability of the current draw depends not only on the color of the ball drawn last time but also on the color of every ball drawn before it, so this process is not a Markov chain process.
If instead the balls are drawn with replacement, then the process is a Markov stochastic process (a small simulation sketch follows below).
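A minimal sketch of simulating a Markov chain, where the next state is sampled using only the current state (the two-state weather transition matrix is invented for the illustration):

```python
# Minimal Markov chain simulation: the next state depends only on the current state.
import numpy as np

states = ["sunny", "rainy"]                 # hypothetical two-state chain
P = np.array([[0.8, 0.2],                   # transition probabilities from "sunny"
              [0.4, 0.6]])                  # transition probabilities from "rainy"

rng = np.random.default_rng(0)
state = 0                                   # start in "sunny"
trajectory = [states[state]]
for _ in range(10):
    # Only the current row of P is consulted -- nothing about earlier states matters.
    state = rng.choice(len(states), p=P[state])
    trajectory.append(states[state])

print(" -> ".join(trajectory))
```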

Origin blog.csdn.net/weixin_44040169/article/details/127256618