Machine Learning Fundamentals (3)

0. Foreword

        This blog post covers supervised and unsupervised learning, classification and regression problems, generative and discriminative models, cross-validation, overfitting and underfitting, the bias-variance decomposition, regularization, and the curse of dimensionality.

1. Supervised learning and unsupervised learning

        According to whether the sample data has a label value, machine learning algorithms can be divided into supervised learning and unsupervised learning:

        In supervised learning, the sample data carries label values: a model is learned from the training samples and then used to predict and infer on new samples. Typical representatives of supervised learning are classification and regression problems.

        Unsupervised learning analyzes unlabeled samples to discover the structure or distribution of the sample set. Typical representatives are clustering, representation learning, and dimensionality reduction, all of which work on samples without label values.

2. Classification and regression problems

        In supervised learning, if the sample labels are integers (class indices), the prediction function is a mapping from a vector to an integer, and the task is called a classification problem. If the label values are continuous real numbers, it is called a regression problem, and the prediction function is a mapping from a vector to a real number.
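        The distinction is easy to see in code. Below is a minimal sketch (scikit-learn assumed available; the tiny dataset is made up purely for illustration) in which the same feature vectors are paired once with integer labels and once with real-valued labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # the same feature vectors for both tasks

y_class = np.array([0, 0, 1, 1])             # integer labels -> classification
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[1.5]]))                  # outputs a class index

y_reg = np.array([0.1, 0.9, 2.1, 3.2])       # real-valued labels -> regression
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[1.5]]))                  # outputs a real number
```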

3. Generative model and discriminative model

        Classification algorithms can be divided into discriminative models and generative models. Given a feature vector x and a label value y, a generative model models the joint probability p(x, y), while a discriminative model models the conditional probability p(y|x). In addition, classifiers that do not use a probabilistic model at all are also counted as discriminative models: they obtain the prediction function directly, without caring about the probability distribution of the samples.

        A discriminative model obtains the prediction function f(x) directly, or computes the probability value p(y|x) directly, as SVM, logistic regression, and softmax regression do. It cares only about the decision surface, not about the density of the probability distribution of the samples.

        A generative model computes p(x, y) or p(x|y). In general, it assumes that the samples of each class obey a certain probability distribution and models that distribution.

        Common generative models in machine learning include Bayesian classifiers, Gaussian mixture models, hidden Markov models, restricted Boltzmann machines, and generative adversarial networks. Typical discriminative models include decision trees, the kNN algorithm, artificial neural networks, support vector machines, logistic regression, and the AdaBoost algorithm.
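        As an illustration of the two styles on the same data (scikit-learn assumed available; the synthetic two-class dataset below is made up for this sketch), GaussianNB fits a class-conditional density p(x|y) per class and applies Bayes' rule, while LogisticRegression models p(y|x) directly:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, scale=1.0, size=(50, 2))   # class 0 samples
X1 = rng.normal(loc=+1.0, scale=1.0, size=(50, 2))   # class 1 samples
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

gen = GaussianNB().fit(X, y)           # generative: models class-conditional densities
disc = LogisticRegression().fit(X, y)  # discriminative: models the decision boundary

x_new = [[0.2, -0.3]]
print(gen.predict_proba(x_new), disc.predict_proba(x_new))
```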

4. Cross Validation

        Cross-validation is a technique for statistically estimating a model's accuracy. k-fold cross-validation randomly splits the sample set into k equal parts; in turn, k-1 parts are used to train the model and the remaining part is used to test its accuracy. The average of the k accuracy values is taken as the final accuracy.
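        A minimal sketch with scikit-learn (assumed available; the iris dataset and k = 5 are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split into k = 5 folds; train on 4 folds, test on the held-out fold, rotating it.
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # accuracy on each of the 5 folds
print(scores.mean())   # averaged accuracy reported as the final estimate
```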

5. Overfitting and underfitting

        Underfitting is also called under-learning. Its intuitive symptom is that the trained model performs poorly even on the training set and has not captured the underlying pattern of the data. The cause of underfitting is that the model itself is too simple: for example, the data is nonlinear but a linear model is used, or the number of features is too small to establish the correct mapping.

        Overfitting is also called over-learning. Its intuitive symptom is that the model performs well on the training set but poorly on the test set, i.e., its generalization performance is poor. The root cause of overfitting is that the training data contains sampling error, and the model fits this sampling error during training. Sampling error refers to the deviation between the sampled training set and the underlying data distribution.

        Possible reasons for overfitting are:

        ① The model itself is too complex and fits the noise in the training set. In this case, choose a simpler model or prune the existing one.

        ② The training samples are too few or not representative. In this case, increase the number of samples or their diversity.

        ③ Noise in the training samples interferes with training and the model fits this noise. In this case, remove the noisy data or use a model that is less sensitive to noise.
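        Both failure modes are easy to reproduce by fitting polynomials of different degrees to noisy samples of a quadratic function. The sketch below (scikit-learn assumed available; all numbers are purely illustrative) shows a degree-1 model underfitting and a degree-15 model overfitting:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=30)   # quadratic signal + noise

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_test = X_test.ravel() ** 2                          # noiseless ground truth

for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # degree 1: both errors large (underfitting); degree 15: small train error,
    # large test error (overfitting); degree 2: a reasonable fit.
    print(degree, round(train_err, 3), round(test_err, 3))
```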

6. Bias and variance decomposition

        The generalization error of a model can be decomposed into bias and variance. Bias is the error caused by the model itself, that is, the error introduced by wrong model assumptions; it is the gap between the expected value of the model's predictions and the true value.

        Variance is the error due to sensitivity to small fluctuations in the training set. It can be understood as the range over which the model's predictions vary, that is, how strongly the predictions fluctuate when the training set changes.

        The overall error of the model can be decomposed into the sum of the squared bias and the variance:
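        Written out for squared-error loss (standard notation assumed here: \hat{f}(x) is the model's prediction, f(x) the true function, and \sigma^2 the irreducible noise), the decomposition reads:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \sigma^2
$$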

        If the model is too simple, it generally has large bias and small variance; conversely, if the model is too complex, it has large variance but small bias.

7. Regularization

        To prevent overfitting, a penalty term can be added to the loss function to penalize complex models, forcing the model parameter values to be as small as possible and hence the model to be simpler. With the penalty term added, the loss function becomes:
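        A common form of the penalized objective (notation assumed here for illustration: L is the original loss, \theta the parameter vector, \lambda > 0 the regularization weight) is:

$$
\min_{\theta}\ \sum_{i=1}^{n} L\big(f(x_i;\theta),\, y_i\big) + \lambda\, R(\theta),
\qquad R(\theta)=\|\theta\|_2^2 \ \text{(L2, ridge)}
\quad \text{or} \quad R(\theta)=\|\theta\|_1 \ \text{(L1, LASSO)}
$$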

        Regularization is widely used in machine learning algorithms such as ridge regression, LASSO regression, logistic regression, and neural networks. Besides adding a regularization term directly, there are other ways to force the model to be simpler, such as decision-tree pruning, dropout in neural network training, and early stopping.
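        The effect of the penalty can be seen directly in the fitted coefficients. The sketch below (scikit-learn assumed available; the synthetic data and the alpha values, which play the role of lambda, are arbitrary choices for illustration) compares plain least squares with its L2-penalized (ridge) and L1-penalized (LASSO) versions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]                 # only 3 informative features
y = X @ w_true + rng.normal(scale=0.5, size=40)

ols = LinearRegression().fit(X, y)            # no penalty
ridge = Ridge(alpha=1.0).fit(X, y)            # L2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)            # L1 penalty: drives many weights to 0

print(np.round(ols.coef_, 2))
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
```

        Ridge shrinks all coefficients toward zero, while LASSO drives many of them exactly to zero, which is why the L1 penalty is also used for feature selection.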

8. The curse of dimensionality

        To increase the accuracy of an algorithm, more and more features are used. When the dimension of the feature vector is low, adding features can indeed improve accuracy; but once the dimension grows beyond a certain point, adding more features causes accuracy to drop. This problem is called the curse of dimensionality.
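        One way to observe the effect numerically is that distances between random points concentrate in high dimensions, so the "nearest" and "farthest" neighbors become almost equally far away. A small sketch (sample size and dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))                   # n random points in the unit cube
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point
    print(d, round(dists.min() / dists.max(), 3))  # ratio approaches 1 as d grows
```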
