Day 1: Machine Learning (ML) Basics

1. An introduction to machine learning

  • definition

  Tom Mitchell's definition of machine learning: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

  The Baidu Encyclopedia definition of machine learning: machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance.

  • Classification

  Supervised learning: the data set is labeled, that is, the answer for each given sample is known. Most of the models covered here belong to this category, including the k-nearest-neighbor algorithm, decision trees, Naive Bayes, logistic regression, support vector machines, etc.;

  Unsupervised learning: the opposite of supervised learning; the data set is completely unlabeled. The main assumption is that similar samples are generally close to each other in the data space, so samples can be grouped by distance calculations. This category includes clustering, the EM algorithm, etc.;

  Semi-supervised learning: generally aimed at problems where the amount of data is very large but labeled data is scarce, or labels are difficult and expensive to obtain; part of the training data is labeled and part is not;

  Reinforcement learning: learning driven by rewards; a reward (incentive) function lets the model continuously adjust its behavior according to the situations it encounters;

  • Related concepts

  Training set (training set/data) / training examples: the data set used for training, i.e., for producing the model or algorithm;

 

  Test set (testing set/data) / testing examples: the data set used to evaluate the learned model or algorithm;

 

  Feature vector (features / feature vector): the collection of attributes attached to an instance, usually represented as a vector;

 

  Label: the class label of an instance;

 

  Positive example;

 

  Negative example;
  • Deep Learning

  Deep learning is a newer field built on machine learning. It originated from neural network algorithms inspired by the structure of the human brain, with the "deep" referring to the depth of the model structure. As an extension of machine learning, deep learning is used in image processing and computer vision, natural language processing, and speech recognition.

  • Machine learning steps

  First split the data into a training set and a test set; then train the algorithm on the training set (its feature vectors and labels); finally, run the learned algorithm on the test set to evaluate it. A separate validation set may also be set aside for parameter tuning.
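The steps above can be sketched as follows. This is a minimal, hypothetical example: the toy data, the 80/20 split ratio, and the hand-rolled 1-nearest-neighbor "learner" are all illustrative choices, not part of the original text.

```python
import random

random.seed(0)
# Hypothetical toy data: 2-D points labeled by which side of x = 0.5 they fall on.
points = [(random.random(), random.random()) for _ in range(100)]
data = [(p, int(p[0] > 0.5)) for p in points]

# Step 1: split the data into a training set and a test set (80/20 here).
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Step 2: "training" a 1-nearest-neighbor classifier just means storing the training set.
def predict(x, train_set):
    nearest = min(train_set,
                  key=lambda s: (s[0][0] - x[0]) ** 2 + (s[0][1] - x[1]) ** 2)
    return nearest[1]

# Step 3: evaluate the learned model on the held-out test set.
accuracy = sum(predict(x, train) == y for x, y in test) / len(test)
```

Any real model would replace the `predict` function; the train/evaluate structure stays the same.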

 

2. Model evaluation and selection

  • Error-related concepts, overfitting

  Error rate: the ratio of the number of misclassified samples to the total number of samples.

  Accuracy: 1 - Error rate.

  Error: the difference between the learner's actual predicted output and the sample's true output;

  Training error: Also known as empirical error, the error of the learner on the training set;

  Generalization error: The error of the learner on new samples;

  Overfitting: when the learner fits the training samples "too well", it is likely to treat peculiarities of the training samples themselves as general properties that all potential samples possess, which degrades generalization performance. This phenomenon is called overfitting: the learning ability is so strong that characteristics specific to the training samples, which do not generalize, are also learned. Overfitting is unavoidable; all that can be done is to alleviate it or reduce its risk.

  Underfitting: the learner has failed to capture even the general properties of the training samples. Underfitting is relatively easy to overcome, for example by expanding branches in decision tree learning or increasing the number of training rounds in neural network learning.

  • Evaluation methods

Suppose we have a data set D = {(x1, y1), (x2, y2), ..., (xm, ym)}. By processing D appropriately, we produce a training set S and a test set T. Several common methods are introduced below.

  Hold-out method (holdout):

  It directly partitions the data set D into two mutually exclusive sets, one used as the training set S and the other as the test set T. After training a model on S, T is used to evaluate its test error as an estimate of the generalization error. The split between training and test sets should preserve the data distribution as consistently as possible.
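One common way to keep the data distribution consistent across the split is to divide each class separately (stratified sampling). A minimal sketch, assuming a hypothetical 100-sample data set with 30 positive and 70 negative samples and a 70/30 train/test split:

```python
import random
from collections import defaultdict

random.seed(0)
# Hypothetical data set: 30 positive (label 1) and 70 negative (label 0) samples.
D = [(i, 1) for i in range(30)] + [(i, 0) for i in range(30, 100)]

# Group samples by label, then split each class with the same ratio,
# so S and T keep D's 30%/70% label distribution.
by_label = defaultdict(list)
for sample in D:
    by_label[sample[1]].append(sample)

S, T = [], []  # training set S, test set T
for label, samples in by_label.items():
    random.shuffle(samples)
    cut = int(0.7 * len(samples))  # 70% of each class goes to S
    S.extend(samples[:cut])
    T.extend(samples[cut:])
```

Both S and T end up with 30% positive samples, matching D.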

  Cross-validation (cross validation):

  It first partitions the data set D into K mutually exclusive subsets of similar size, each subset Di preserving the data distribution as much as possible. Then, each time, the union of K-1 subsets is used as the training set and the remaining subset as the test set; this yields K pairs of training and test sets, allowing K rounds of training and testing, and the final result returned is the mean of the K test results. K is usually set to 10, which is called 10-fold cross-validation.
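A minimal 10-fold cross-validation sketch on hypothetical data; the "learner" here (a mean-value threshold) is a made-up stand-in for any real model:

```python
import random

random.seed(0)
# Hypothetical data set: 100 samples, label 1 if the feature exceeds 0.5.
D = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]

K = 10
random.shuffle(D)
folds = [D[i::K] for i in range(K)]  # K mutually exclusive, similar-sized subsets

accuracies = []
for i in range(K):
    test_set = folds[i]                                            # held-out subset
    train_set = [s for j in range(K) if j != i for s in folds[j]]  # union of K-1 subsets
    # Illustrative "learner": classify by thresholding at the training-set mean.
    threshold = sum(x for x, _ in train_set) / len(train_set)
    acc = sum((x > threshold) == bool(y) for x, y in test_set) / len(test_set)
    accuracies.append(acc)

# The final estimate is the mean of the K test results.
cv_accuracy = sum(accuracies) / K
```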

  Bootstrapping:

  Given a data set D containing m samples, draw one sample from D with replacement m times to obtain a new data set D1. About 36.8% of the samples in the initial data set D will not appear in D1 (since (1 - 1/m)^m approaches 1/e ≈ 0.368 as m grows); D1 can then be used as the training set and D - D1 as the test set.
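The 36.8% figure can be checked empirically. A sketch with a hypothetical index data set of m = 10000:

```python
import random

random.seed(0)
m = 10_000
D = range(m)  # hypothetical data set, represented by sample indices

# Draw m samples from D with replacement to form the bootstrap set D1.
D1 = [random.choice(D) for _ in range(m)]

# Samples never drawn form D - D1, the candidate test set.
out_of_bag = set(D) - set(D1)
fraction = len(out_of_bag) / m  # should be close to (1 - 1/m)^m ≈ 1/e ≈ 0.368
```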

  • Performance measures

The most commonly used performance measure for regression tasks is the mean squared error (MSE);
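A minimal sketch of the mean squared error, with made-up example values:

```python
# Mean squared error: the average of the squared differences between
# the true values and the predicted values.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# e.g. mse([3.0, 1.0, 2.0], [2.5, 0.0, 2.0]) = (0.25 + 1.0 + 0.0) / 3
```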

For classification problems, the main performance measures (derived from the confusion matrix) are precision, recall, and the F1 measure.
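These three measures can be computed directly from confusion-matrix counts. A sketch with hypothetical counts (tp/fp/fn are made up for illustration):

```python
# Precision, recall and F1 from confusion-matrix counts:
#   tp = true positives, fp = false positives, fn = false negatives.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
```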
