People may have noticed a phenomenon in everyday life: we can easily tell Japanese, Koreans, and Thais apart by their appearance, but when facing British, Russians, and Germans, we find it hard to tell their faces apart. One reason is that Japan, South Korea, and Thailand are neighboring countries of ours, so we have more opportunities to observe ordinary people from these countries and to pick up their distinctive features. Another is that, setting aside factors such as clothing and makeup, similarity within our own race makes facial features easier to compare and distinguish, whereas differences among people of other races are harder to capture.

Based on a large number of observations, we can summarize the facial features of people from different countries, such as the Chinese with a moderate jaw, the Japanese with a long face and long nose, the Koreans with small eyes and high cheekbones, and the Thais with dark complexion. These characteristics are used to determine whether passerby A is from Japan or passerby B is from Korea, etc.

This is a simplified example of the human learning mechanism: extracting recurring regularities and patterns from a large number of phenomena. In artificial intelligence, this process is called machine learning. Formally, machine learning refers to a computer algorithm using experience to improve its performance on a specific task. From a methodological point of view, machine learning is the discipline in which computers build probabilistic and statistical models from data and use those models to predict and analyze data.

Machine learning starts from data and ends with data. The existing data has certain statistical characteristics, and the individual data points can be regarded as independent and identically distributed (i.i.d.) samples. Machine learning derives a model that describes all the data from the existing training data, and uses the model to make the best possible predictions for unknown test data.

In machine learning, data is not a numerical value in the usual sense but a description of certain properties of an object. The properties described are called attributes, and the values they take are called attribute values. Arranging the attribute values in a fixed order yields a vector, which is the data, also called an instance.

In the example at the beginning of the article, the typical attributes of East Asian facial features include skin color, eye size, nose length, and cheekbone height. The standard Chinese instance A is the combination of attribute values {light, large, short, low}, and the standard Korean instance B is the combination of attribute values {light, small, long, high}.

From the viewpoint of linear algebra, the different attributes of the data can be regarded as mutually independent, so each attribute represents a separate dimension, and together these dimensions span a feature space. Each instance can be viewed as a point in the feature space, that is, a feature vector. A feature vector in this sense is not an eigenvector (the vector associated with an eigenvalue of a matrix); it is simply a vector in the feature space. In a classification problem, the output is obtained by assigning the input to a class according to its feature vector.
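As a minimal sketch of how instances become points in a feature space, the two example instances can be encoded as numeric vectors. The 0/1 encoding, the names `ATTRIBUTES`, `ENCODING`, and `to_feature_vector`, and the reading of the skin-color value as "light" are illustrative assumptions rather than anything prescribed by the text.

```python
# Attribute order fixes which dimension each attribute occupies.
ATTRIBUTES = ["skin color", "eye size", "nose length", "cheekbone height"]

# Hypothetical encoding: each attribute value is mapped to 0 or 1.
ENCODING = {
    "skin color": {"light": 0, "dark": 1},
    "eye size": {"small": 0, "large": 1},
    "nose length": {"short": 0, "long": 1},
    "cheekbone height": {"low": 0, "high": 1},
}


def to_feature_vector(instance):
    """Turn an ordered tuple of attribute values into a numeric feature vector."""
    return [ENCODING[name][value] for name, value in zip(ATTRIBUTES, instance)]


instance_a = ("light", "large", "short", "low")   # instance A from the text
instance_b = ("light", "small", "long", "high")   # instance B from the text

vector_a = to_feature_vector(instance_a)
vector_b = to_feature_vector(instance_b)

# Number of dimensions in which the two points differ.
differing = sum(a != b for a, b in zip(vector_a, vector_b))
print(vector_a, vector_b, differing)
```

Once instances are vectors, notions such as distance between two instances become well defined, which is what a classifier operating on feature vectors relies on.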

In a real machine learning task, the form of the output may be more complicated. According to the different types of input and output, prediction problems can be divided into the following three categories:

- Classification problem: the output variable is a finite number of discrete variables, and when the number is 2, it is the simplest binary classification problem;
- Regression problem: both input and output variables are continuous variables;
- Labeling problem: both input and output variables are sequences of variables.

In reality, however, every person is unique, and looks vary widely within each country. So a Korean with thick eyebrows and big eyes could be mistaken for Chinese, and a Japanese with darker skin could be mistaken for Thai.

Machine learning faces the same problem of error. An algorithm can neither fit all training data perfectly nor make accurate predictions on all test data. Error performance is therefore one of the important metrics in machine learning.

**In machine learning, error is defined as the difference between the actual predicted output of the learner and the true output of the sample**. In classification problems, the commonly used error function is **the error rate**, that is, the proportion of misclassified samples among all samples.
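The error rate can be computed directly from its definition. The sketch below is illustrative: the function name `error_rate` and the five sample labels are made up for the example.

```python
def error_rate(predicted, actual):
    """Proportion of samples whose predicted class differs from the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    misclassified = sum(p != a for p, a in zip(predicted, actual))
    return misclassified / len(actual)


# Hypothetical labels: the first sample is misclassified, the other four are correct.
actual = ["Korean", "Chinese", "Japanese", "Thai", "Korean"]
predicted = ["Chinese", "Chinese", "Japanese", "Thai", "Korean"]

print(error_rate(predicted, actual))  # 1 of 5 samples is wrong
```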

**The error can be further divided into two categories: training error and test error**. The training error is the error of the learner on the training data set, also known as the empirical error; the test error is the error of the learner on new samples, also known as the generalization error.

The training error describes the correlation between the input attributes and the output classification, and can indicate whether a given problem is easy to learn. **The test error reflects the predictive ability of the learner on the unknown test data set, which is an important concept in machine learning.** Practical learners are those with low test error, that is, learners that perform well on new samples.

The learner uses known data to fit the real situation, so as to obtain a model that is as close as possible to the true model. To ensure that this model applies to all unknown data, we need to extract from the training data set, as far as possible, universal laws that hold for all data.

However, if too much attention is paid to the training error, that is, if the learned rule is pushed to match the training data exactly, the learner will mistake non-universal characteristics of the training samples for universal properties of all data. As a result, the generalization ability of the learner decreases.

In the previous example, someone who has met few foreigners and has never seen a Korean with double eyelids may form the wrong stereotype that "everyone with single eyelids is Korean". This is a typical case of overfitting: features of the training data are mistaken for features of the whole population.
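Overfitting can be illustrated with a learner that simply memorizes its training samples. The sketch below uses a 1-nearest-neighbor rule on a hand-made one-dimensional data set; the data, the assumed true rule (class 1 when x >= 5), and the single mislabeled training point are all assumptions chosen for illustration.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest neighbor)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def error_rate(train_x, train_y, xs, ys):
    """Error rate of the memorizing learner on the samples (xs, ys)."""
    wrong = sum(nearest_neighbor_predict(train_x, train_y, x) != y
                for x, y in zip(xs, ys))
    return wrong / len(xs)


# Assumed true rule: class 1 when x >= 5. The training label at x = 3 is noise.
train_x = [1, 2, 3, 4, 5, 6, 7, 8]
train_y = [0, 0, 1, 0, 1, 1, 1, 1]

# New samples labeled by the true rule.
test_x = [1.4, 2.6, 3.4, 4.4, 5.4, 7.6]
test_y = [0, 0, 0, 0, 1, 1]

train_error = error_rate(train_x, train_y, train_x, train_y)
test_error = error_rate(train_x, train_y, test_x, test_y)
print(train_error, test_error)
```

The learner reproduces every training label, including the noisy one, so its training error is zero, but the memorized noise is then applied to nearby new samples and shows up as test error.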

Underfitting means that the learner cannot fit the training data well, and the reason is that the learning ability is so weak that the basic properties of the training data are not learned. If the learner is not capable enough, it may even mistake images of chimpanzees for people, which is the consequence of underfitting.

Underfitting can be overcome by improving the learning algorithm, while overfitting cannot be avoided; its impact can only be minimized. Since the number of training samples is limited, a model with a finite number of parameters is enough to accommodate all training samples. However, the more parameters a model has, the less data can match the model accurately. When such a model is applied to an unlimited amount of unknown data, overfitting is inevitable.

In addition, the training samples themselves may contain noise, which introduces additional error into the model. Overall, test error and model complexity have a roughly parabolic (U-shaped) relationship. When the model complexity is low, the test error is high; as the model complexity increases, the test error gradually decreases and reaches a minimum; after that, as the model complexity keeps rising, the test error increases again, which corresponds to overfitting.
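The relationship between complexity and error can be explored with a small experiment. The sketch below is only an illustration under stated assumptions: the data is synthetic (a sine curve plus Gaussian noise), the model is piecewise constant on [0, 1), and the number of bins stands in for model complexity. None of these choices come from the text.

```python
import math
import random


def bin_index(x, n_bins):
    """Index of the bin that contains x, for x in [0, 1)."""
    return min(int(x * n_bins), n_bins - 1)


def fit_bins(xs, ys, n_bins):
    """Fit a piecewise-constant model: the mean training output in each bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = bin_index(x, n_bins)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback for bins without training points
    return [s / c if c else overall for s, c in zip(sums, counts)]


def mean_squared_error(model, xs, ys):
    """Average squared difference between model output and true output."""
    n_bins = len(model)
    return sum((model[bin_index(x, n_bins)] - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Synthetic samples: a sine curve observed with Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(40, rng)
test_x, test_y = make_data(400, rng)

bin_counts = [1, 2, 4, 8, 16, 32, 64]  # increasing model complexity
train_errors, test_errors = [], []
for n_bins in bin_counts:
    model = fit_bins(train_x, train_y, n_bins)
    train_errors.append(mean_squared_error(model, train_x, train_y))
    test_errors.append(mean_squared_error(model, test_x, test_y))
    print(n_bins, round(train_errors[-1], 3), round(test_errors[-1], 3))
```

Because each bin count refines the previous one, the training error can only go down as bins are added. The test error is the quantity to watch: in experiments of this kind it typically falls at first and rises again once the bins start fitting noise.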

To estimate the test error more accurately, a widely used method is cross-validation. The idea of cross-validation is to reuse the limited training samples: divide the data into several subsets, let different subsets serve in turn as the training set and the test set, and repeat training, testing, and model selection on this basis to obtain the best result.

If the training data set is divided into 10 subsets D1 to D10 for cross-validation, each model needs 10 rounds of training and testing. In the first round, the subsets D2 to D10 form the training set and the model is tested on D1; in the second round, the subsets D1 and D3 to D10 form the training set and the model is tested on D2. Continuing in this way until the model has been tested on all 10 subsets, its performance is the average of the 10 test results. Among the different models, the one with the smallest average test error is the optimal model.
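The rounds described above can be sketched in a few lines of plain Python. The helper names (`k_fold_splits`, `cross_validation_error`), the two candidate models, and the toy data are assumptions made for illustration; a real project would normally shuffle the data first and use an established library.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validation_error(train_fn, xs, ys, k=10):
    """Average test error rate of the model built by train_fn over k rounds."""
    round_errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        predict = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        wrong = sum(predict(xs[i]) != ys[i] for i in test_idx)
        round_errors.append(wrong / len(test_idx))
    return sum(round_errors) / len(round_errors)


def train_majority(xs, ys):
    """Candidate 1: always predict the most common training label (0 or 1)."""
    majority = 1 if 2 * sum(ys) > len(ys) else 0
    return lambda x: majority


def train_threshold(xs, ys):
    """Candidate 2: split at the midpoint between the two class means."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


# Toy data: 20 samples, class 1 when x >= 10. With k=10, each subset holds 2 samples.
xs = list(range(20))
ys = [1 if x >= 10 else 0 for x in xs]

candidates = {"majority": train_majority, "threshold": train_threshold}
cv_errors = {name: cross_validation_error(fn, xs, ys, k=10)
             for name, fn in candidates.items()}
best = min(cv_errors, key=cv_errors.get)  # smallest average test error wins
print(cv_errors, best)
```

Each sample is used for testing exactly once and for training in the other nine rounds, which is what lets a limited data set serve both purposes.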

An important engineering problem in machine learning is parameter tuning, because the parameter values of an algorithm have a significant impact on model performance. In neural networks and deep learning, parameter tuning becomes more important and more complex because of the large number of parameters. Suppose a neural network contains 1000 parameters, and each parameter has 10 possible values. For each training/test split, there are then 10^1000 models to examine. Parameter tuning therefore has to weigh performance against efficiency.
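The size of such a search space follows from simple counting: every parameter multiplies the number of combinations by its number of candidate values, so 1000 parameters with 10 values each give 10 multiplied by itself 1000 times. The sketch below checks this and enumerates a deliberately tiny grid; the parameter names `learning_rate` and `hidden_units` are hypothetical.

```python
import itertools
import math


def grid_size(values_per_parameter):
    """Number of parameter combinations in an exhaustive grid."""
    return math.prod(values_per_parameter)


# A tiny grid that can actually be enumerated (hypothetical parameter names).
small_grid = {"learning_rate": [0.01, 0.1], "hidden_units": [16, 32, 64]}
configs = [dict(zip(small_grid, combo))
           for combo in itertools.product(*small_grid.values())]
print(len(configs))  # 2 * 3 combinations

# The example from the text: 1000 parameters, 10 candidate values each.
huge = grid_size([10] * 1000)
print(len(str(huge)))  # number of decimal digits in the count
```

The count for the large example has more than a thousand digits, which is why exhaustive search is out of reach there and tuning has to settle for examining a small part of the space.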

According to whether the training data has label information, machine learning tasks can be divided into three categories: supervised learning, unsupervised learning and semi-supervised learning.

- Supervised learning: learning based on training data of known categories;
- Unsupervised learning: learning based on training data of unknown categories;
- Semi-supervised learning: learning using both known and unknown classes of training data.
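The three settings differ only in how much of the training data carries a label. As a rough sketch, with `None` standing for a missing label (a convention assumed here, not taken from the text):

```python
def learning_setting(labels):
    """Name the learning setting implied by the labels of the training data.

    `labels` holds one entry per training instance: a class label,
    or None when the category of that instance is unknown.
    """
    if not labels:
        raise ValueError("at least one training instance is required")
    n_labeled = sum(label is not None for label in labels)
    if n_labeled == len(labels):
        return "supervised"
    if n_labeled == 0:
        return "unsupervised"
    return "semi-supervised"


print(learning_setting(["Korean", "Japanese", "Thai"]))
print(learning_setting([None, None, None]))
print(learning_setting(["Korean", None, "Thai"]))
```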