Summary of basic concepts of machine learning

Machine learning extracts a model from data by computation. In other words, the input of machine learning is data, a learning algorithm processes that data and outputs a model, and we can later use this model to solve practical problems.


In machine learning, the input data is called a dataset, and the dataset is divided into a training set and a test set. The training set is used to train the model through the learning algorithm, while the test set is used to evaluate the model's performance. A dataset consists of samples, and each sample is composed of attributes, or features. For example, a dataset describing people might use the features height (cm), weight (kg), and face shape (1-round, 2-square, 3-oval):

[
    [180, 80, 1],
    [160, 60, 2],
    [170, 70, 3],
    ......
]
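As a minimal sketch (an assumed example, not from the original article), the dataset above can be represented as a list of samples in Python and split into a training set and a test set, here using scikit-learn's train_test_split:

# Each sample: [height (cm), weight (kg), face shape (1-round, 2-square, 3-oval)]
from sklearn.model_selection import train_test_split

data = [
    [180, 80, 1],
    [160, 60, 2],
    [170, 70, 3],
    [175, 75, 1],   # extra made-up samples so the split has something to work with
    [165, 55, 2],
]

# Hold out 20% of the samples as the test set, train on the remaining 80%
train_set, test_set = train_test_split(data, test_size=0.2, random_state=0)
print("training samples:", train_set)
print("test samples:", test_set)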


Depending on whether the dataset is labeled (the dataset above is unlabeled), machine learning can be divided into two categories: learning from labeled data is called supervised learning, and learning from unlabeled data is called unsupervised learning. Below is a labeled version of the dataset above, where each sample has been rated according to someone's personal preference (0-favorite, 1-average, 2-not favored); you can see that the label describes a particular sample:

[
    [180, 80, 1, 0],
    [160, 60, 2, 1],
    [170, 70, 3, 2],
    ......
]
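As a minimal sketch (an assumed example), the last column of the labeled dataset is the label, so supervised learning separates the features X from the labels y and fits a model to the (X, y) pairs; here a k-nearest-neighbors classifier from scikit-learn stands in for the learning algorithm:

from sklearn.neighbors import KNeighborsClassifier

labeled_data = [
    [180, 80, 1, 0],
    [160, 60, 2, 1],
    [170, 70, 3, 2],
]

X = [row[:-1] for row in labeled_data]   # features: height, weight, face shape
y = [row[-1] for row in labeled_data]    # labels: 0-favorite, 1-average, 2-not favored

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)                          # supervised learning: learn from labeled samples
print(model.predict([[172, 68, 3]]))     # predict the label of a new, unseen sample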


The purpose of machine learning is to use a learning algorithm and the training set to produce a model that also works on data outside the training set. In other words, the model should have a certain generalization ability, and the test set is used to evaluate that generalization ability. Therefore, the test set should generally consist of samples different from those in the training set.
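As a minimal sketch of this evaluation (an assumed example using scikit-learn's built-in iris dataset rather than the article's data), the model is trained only on the training set, and its accuracy on the held-out test set serves as an estimate of its generalization ability:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Keep 30% of the samples out of training entirely
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier().fit(X_train, y_train)   # train on the training set only
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))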


So, does fitting the training samples better and better always lead to better generalization? The answer is no. If the model learns the training samples too closely, over-fitting occurs: the machine learns the training samples so well that its generalization ability becomes weaker, like a student who memorizes the answers to practice questions by rote but cannot solve new ones. Conversely, if there are too few training samples, under-fitting occurs, like a student who has not done enough exercises to cover all the knowledge points. Therefore, the amount of training data, and how closely the model fits it, need to be considered carefully.
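A minimal sketch of this trade-off (an assumed example, using decision trees on scikit-learn's breast-cancer dataset): a tree of depth 1 tends to under-fit and scores poorly on both sets, while an unrestricted tree tends to over-fit, scoring almost perfectly on the training set but noticeably worse on the test set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, None):   # depth=1: too simple (under-fit); None: unlimited depth (over-fit)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")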


In practice, the problems most often handled by machine learning are classification and regression. Given a sample, classification analyzes its features and predicts which class the sample belongs to, while regression is mainly used to predict continuous values or trends. Classification and regression are supervised learning. Clustering is an unsupervised learning method: it groups the samples of an unlabeled training set according to the statistical structure (for example, the density) of the data.
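As a minimal sketch of clustering (an assumed example), KMeans from scikit-learn groups unlabeled samples purely from the structure of the data, without ever being shown a class label:

from sklearn.cluster import KMeans

# Unlabeled samples: [height (cm), weight (kg)]
X = [[180, 80], [178, 82], [160, 60], [162, 58], [170, 70], [171, 69]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assigned to each sample:", kmeans.labels_)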


For some knowledge, we cannot summarize the rules ourselves, but the information hidden in the data does not deceive us. What we need to do is use various algorithms to extract this tacit knowledge. In short, machine learning is a means by which we use data to discover and generalize knowledge that we can then apply to practical problems.


