Semi-Supervised Learning (Machine Learning Notes)

Among neural networks, the various autoencoders are arguably the most effective unsupervised learning methods; the summary below, however, focuses on semi-supervised learning, following Machine Learning by Zhou Zhihua.


Earlier we covered supervised and unsupervised learning. Supervised learning refers to learning tasks whose training samples carry label information, e.g., the common classification and regression algorithms; unsupervised learning refers to tasks whose training samples carry no label information, e.g., clustering. In practice it is common to have a small number of labeled samples alongside many unlabeled ones. For example, web-page recommendation needs users to mark the pages they find interesting, but few users are willing to spend the time providing such marks. If we simply discard the unlabeled samples and apply traditional supervised learning, the training set is often too small: the learned model describes the overall distribution poorly, and generalization suffers. How, then, can we exploit the unlabeled samples?

A simple approach is to have experts label these unlabeled samples, but that comes with a huge human cost. Alternatively, we can first train a learner on the labeled set, use it to predict the unlabeled samples, pick out those with the highest uncertainty (lowest classification confidence), ask an expert to label them, add them to the training set, and retrain the learner. This sharply reduces labeling cost and is known as active learning: its goal is to achieve the best possible performance with as few expensive expert queries as possible.
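To make the loop concrete, here is a minimal Python sketch of uncertainty-sampling active learning (not from the book): `query_expert` is a hypothetical stand-in for the human annotator, and logistic regression is just an illustrative choice of learner.

```python
# Minimal pool-based active learning via uncertainty sampling (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X_labeled, y_labeled, X_pool, query_expert, n_rounds=10):
    for _ in range(n_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
        # Least-confident sample: smallest max class probability.
        idx = int(clf.predict_proba(X_pool).max(axis=1).argmin())
        # Ask the (hypothetical) expert for the true label, expand the training set.
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.append(y_labeled, query_expert(X_pool[idx]))
        X_pool = np.delete(X_pool, idx, axis=0)
    return LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
```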

Clearly, active learning must interact with the outside world (query/label), so in essence it is still supervised learning. In fact, although unlabeled samples contain no label information, they are drawn independently and identically from the same distribution as the labeled samples, so they carry information about the data distribution that greatly benefits training. Semi-supervised learning studies how to exploit the distribution information contained in unlabeled samples automatically, without external interaction: the training set contains both labeled and unlabeled samples.


Semi-supervised learning can be further divided into pure semi-supervised learning and transductive learning. The difference: the former assumes the unlabeled training data are not the data to be predicted, while the latter assumes the unlabeled data used during learning are exactly the data to be predicted. Active learning, pure semi-supervised learning, and transductive learning relate as shown below:

[Figure: active learning vs. pure semi-supervised learning vs. transductive learning]

1 Generative Methods

Generative methods are based on generative models: first model the joint distribution P(x, c), then derive P(c | x) from it. Such methods assume the samples obey some underlying distribution, so they require fully reliable prior knowledge. For example, the Bayesian classifier and Gaussian mixture clustering met earlier are both generative. Assume the overall distribution is a Gaussian mixture, i.e., a combination of several Gaussian components, with each Gaussian component (cluster) representing one class. The Gaussian mixture probability density function is:

$$p(x) = \sum_{i=1}^{N} \alpha_i \cdot p(x \mid \mu_i, \Sigma_i), \qquad \sum_{i=1}^{N} \alpha_i = 1$$

where $\mu_i$ and $\Sigma_i$ are the mean and covariance of the $i$-th Gaussian component and $\alpha_i$ is its mixing coefficient.

Without loss of generality, assume clusters correspond to classes in order, i.e., the i-th Gaussian mixture component corresponds to the i-th class. As in Gaussian mixture clustering, the main task is to estimate the parameters and mixing coefficients of the Gaussian components; the difference is that a labeled sample can no longer be assigned to an arbitrary cluster, but only to the cluster corresponding to its true class label. The log-likelihood over the labeled set $D_l$ and unlabeled set $D_u$ is:

$$LL(D_l \cup D_u) = \sum_{(x_j,\,y_j) \in D_l} \ln\Big(\alpha_{y_j}\, p(x_j \mid \mu_{y_j}, \Sigma_{y_j})\Big) + \sum_{x_j \in D_u} \ln\Big(\sum_{i=1}^{N} \alpha_i\, p(x_j \mid \mu_i, \Sigma_i)\Big)$$

Intuitively, semi-supervised learning based on the Gaussian mixture model organically fuses the core ideas of the Bayesian classifier and Gaussian mixture clustering, effectively exploiting the distribution information implicit in the unlabeled samples to estimate the parameters more accurately. As before, EM is summoned to solve it: first randomly initialize the parameters and mixing coefficients of the Gaussian components; in the E-step compute the posterior responsibilities (i.e., γ_ji, the probability that sample x_j belongs to cluster i; a labeled sample belongs directly to the cluster of its class); in the M-step maximize the likelihood (set the partial derivatives of LL(D) with respect to α, μ, and Σ to zero) and update the parameters, iterating until convergence.

[Figure: E-step and M-step update equations for the semi-supervised Gaussian mixture]

Once the iterative updates converge, a sample x to be predicted can be handled exactly as in a Bayesian classifier: compute the posterior probability that the sample belongs to each cluster (class) and pick the most probable one:

$$f(x) = \underset{j}{\arg\max}\; p(\Theta = j \mid x) = \underset{j}{\arg\max}\; \frac{\alpha_j\, p(x \mid \mu_j, \Sigma_j)}{\sum_{i=1}^{N} \alpha_i\, p(x \mid \mu_i, \Sigma_i)}$$
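To make the procedure concrete, here is a minimal sketch of the EM loop above for the semi-supervised Gaussian mixture, under this section's one-component-per-class assumption. The names (`Xl`, `yl`, `Xu`) are mine, `yl` is assumed to hold integer class indices 0..N-1 with every class represented, and only a small covariance ridge is added for stability; it is illustrative, not a robust implementation.

```python
# Semi-supervised Gaussian mixture via EM: labeled samples keep hard (one-hot)
# responsibilities, unlabeled samples get soft posteriors (illustrative sketch).
import numpy as np
from scipy.stats import multivariate_normal

def semi_gmm(Xl, yl, Xu, n_iter=100):
    N, d = int(yl.max()) + 1, Xl.shape[1]
    mu = np.array([Xl[yl == i].mean(axis=0) for i in range(N)])  # init from labels
    Sigma = np.array([np.eye(d) for _ in range(N)])
    alpha = np.full(N, 1.0 / N)
    Rl = np.eye(N)[yl]                      # fixed one-hot responsibilities
    for _ in range(n_iter):
        # E-step: posterior responsibilities for the unlabeled samples only.
        dens = np.stack([alpha[i] * multivariate_normal.pdf(Xu, mu[i], Sigma[i])
                         for i in range(N)], axis=1)
        Ru = dens / dens.sum(axis=1, keepdims=True)
        # M-step: pooled updates over hard (labeled) + soft (unlabeled) assignments.
        R, X = np.vstack([Rl, Ru]), np.vstack([Xl, Xu])
        Nk = R.sum(axis=0)
        mu = (R.T @ X) / Nk[:, None]
        for i in range(N):
            diff = X - mu[i]
            Sigma[i] = (R[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        alpha = Nk / len(X)
    return mu, Sigma, alpha

def predict(X, mu, Sigma, alpha):
    # Posterior over components = classes; pick the most probable one.
    dens = np.stack([alpha[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                     for i in range(len(alpha))], axis=1)
    return dens.argmax(axis=1)
```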

It can be seen that generative methods depend heavily on the assumed underlying data distribution: the assumed distribution must match the true one, otherwise exploiting the unlabeled samples will actually lead the learner further and further astray and reduce its generalization performance. Such methods therefore demand strong domain knowledge and distributional assumptions that are accurate enough.

2 Semi-Supervised SVM

In supervised learning, the SVM tries to find a separating hyperplane that maximizes the margin between the support vectors on its two sides, i.e., the maximum-margin idea. For semi-supervised learning, S3VM additionally requires the hyperplane to pass through a low-density region of the data. TSVM is the best-known semi-supervised support vector machine. Its core idea: try to find a suitable label assignment for the unlabeled samples such that the margin of the resulting separating hyperplane is maximized. TSVM solves this iteratively with a local-search strategy: first train an initial SVM on the labeled set, then use that learner to label the unlabeled samples so that every sample has a label; retrain the SVM on all these labeled samples, then keep searching for error-prone samples and adjusting their labels. The full algorithm is shown below:

[Figure: TSVM algorithm procedure]
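For flavor, a heavily simplified sketch of the TSVM idea using scikit-learn's `SVC`: it keeps only the initialize / pseudo-label / retrain skeleton and down-weights the pseudo-labeled samples, omitting the label-pair swapping and the gradual annealing of the unlabeled penalty C_u that the full algorithm performs.

```python
# Simplified TSVM-flavored self-labeling loop (not Joachims' full algorithm).
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(Xl, yl, Xu, n_rounds=5, Cl=1.0, Cu=0.1):
    clf = SVC(kernel="linear", C=Cl).fit(Xl, yl)   # initial SVM on labeled data
    yu = clf.predict(Xu)                           # pseudo-labels for unlabeled data
    for _ in range(n_rounds):
        X, y = np.vstack([Xl, Xu]), np.concatenate([yl, yu])
        # Trust pseudo-labeled samples less than truly labeled ones.
        w = np.concatenate([np.full(len(yl), Cl), np.full(len(yu), Cu)])
        clf = SVC(kernel="linear", C=1.0).fit(X, y, sample_weight=w)
        new_yu = clf.predict(Xu)
        if np.array_equal(new_yu, yu):             # label assignment stabilized
            break
        yu = new_yu
    return clf
```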

3 Disagreement-Based Methods

Disagreement-based methods exploit unlabeled samples through the disagreement/diversity among multiple learners, and co-training is a classic method of this kind. Co-training was originally designed for multi-view data, where each sample object has several attribute sets, each attribute set corresponding to one view. For example, movie data contain picture attributes and sound attributes, so the set of picture attributes corresponds to one view. First, two important properties of views:

Compatibility: the output spaces of learners trained on the individual views are consistent, e.g., both {good, bad} or {+1, -1}.
Complementarity: different views provide complementary information; in essence this embodies the idea of ensemble learning.

Co-training exploits precisely this "compatible complementarity" of multi-view data. The basic idea: first train an initial classifier on each view using the labeled samples; then let each classifier pick the unlabeled samples it classifies with the highest confidence, assign them pseudo-labels, and pass those pseudo-labeled samples to the other classifier to learn from, so the two classifiers help each other and improve together.

[Figure: co-training algorithm procedure]
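A minimal sketch of the co-training loop, assuming every sample comes in two feature views and using naive Bayes as the per-view learner (any classifier with calibrated confidences would do); the variable names are illustrative.

```python
# Co-training on two views: each classifier pseudo-labels its most confident
# pool sample and hands it to the other classifier (illustrative sketch).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1l, X2l, yl, X1u, X2u, n_rounds=10):
    T1, T2 = X1l.copy(), X2l.copy()        # per-view training features
    y1, y2 = yl.copy(), yl.copy()          # per-view training labels
    c1, c2 = GaussianNB().fit(T1, y1), GaussianNB().fit(T2, y2)
    for _ in range(n_rounds):
        if len(X1u) == 0:
            break
        # Each classifier picks the pool sample it is most confident about...
        i1 = int(c1.predict_proba(X1u).max(axis=1).argmax())
        i2 = int(c2.predict_proba(X2u).max(axis=1).argmax())
        # ...and the pseudo-labeled sample is given to the *other* classifier.
        T2 = np.vstack([T2, X2u[i1]])
        y2 = np.append(y2, c1.predict(X1u[i1:i1 + 1])[0])
        T1 = np.vstack([T1, X1u[i2]])
        y1 = np.append(y1, c2.predict(X2u[i2:i2 + 1])[0])
        keep = np.setdiff1d(np.arange(len(X1u)), [i1, i2])
        X1u, X2u = X1u[keep], X2u[keep]
        c1, c2 = GaussianNB().fit(T1, y1), GaussianNB().fit(T2, y2)
    return c1, c2
```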

4 Semi-Supervised Clustering

The methods above all use unlabeled samples to assist the training process of supervised learning, making learning more thorough and improving generalization; semi-supervised clustering instead uses available supervision to assist the clustering process. In general, the supervision comes in two types:

Must-link and cannot-link constraints: must-link means two samples must end up in the same cluster; cannot-link means they must not be in the same cluster.
Label information: a small number of samples carrying true labels.

Below are two semi-supervised clustering algorithms based on k-means: the first assumes the data set carries must-link/cannot-link constraints, the second assumes it contains a small number of labeled samples. The basic ideas of both are very simple. For k-means with constraints, when assigning a cluster to each sample during an iteration, check whether the current assignment satisfies the constraints; if not, assign the sample to the next-nearest cluster and check again, repeating until the constraints are satisfied, until all samples are assigned. The algorithm proceeds as shown below:

[Figure: constrained k-means algorithm procedure]
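A sketch of the constrained assignment rule just described; `must_link` and `cannot_link` are assumed to be lists of sample-index pairs, and the sketch simply raises an error when no feasible cluster exists.

```python
# Constrained k-means: assign each sample to the nearest cluster that
# violates no must-link / cannot-link constraint (illustrative sketch).
import numpy as np

def constrained_kmeans(X, k, must_link, cannot_link, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        labels[:] = -1
        for j in range(len(X)):
            order = np.argsort(((X[j] - centers) ** 2).sum(axis=1))
            for c in order:                       # nearest feasible cluster first
                ok = (all(labels[b] in (-1, c) for a, b in must_link if a == j)
                      and all(labels[a] in (-1, c) for a, b in must_link if b == j)
                      and all(labels[b] != c for a, b in cannot_link if a == j)
                      and all(labels[a] != c for a, b in cannot_link if b == j))
                if ok:
                    labels[j] = c
                    break
            if labels[j] == -1:
                raise ValueError(f"no feasible cluster for sample {j}")
        for c in range(k):                        # recompute cluster centers
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```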

For k-means with a small number of labeled samples (constrained seed k-means), the labeled samples can be used to initialize the cluster centers, and during assignment the labeled samples' cluster memberships are never changed: each is assigned directly to the cluster corresponding to its class. The procedure is as follows:

[Figure: constrained seed k-means algorithm procedure]
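And a matching sketch of the seeded variant, assuming every cluster has at least one labeled seed sample: the seeds fix the initial centers and their memberships never change.

```python
# Constrained seed k-means: labeled samples initialize the centers and keep
# their cluster memberships fixed throughout (illustrative sketch).
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, n_iter=100):
    labels = np.full(len(X), -1)
    labels[seed_idx] = seed_labels                 # supervision is never overridden
    centers = np.array([X[seed_idx][seed_labels == c].mean(axis=0)
                        for c in range(k)])        # one seed mean per cluster
    unlabeled = np.setdiff1d(np.arange(len(X)), seed_idx)
    for _ in range(n_iter):
        # Assign only the unlabeled samples to their nearest center.
        d = ((X[unlabeled, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels[unlabeled] = d.argmin(axis=1)
        for c in range(k):                         # recompute centers from all members
            centers[c] = X[labels == c].mean(axis=0)
    return labels, centers
```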

This concludes the introduction to semi-supervised learning. It is a very interesting topic: semi-supervised learning ties together many of the knowledge modules covered earlier, which speaks to the care that went into the book's arrangement. Revisiting my past studies alongside this new material, I found a few spots where my old understanding was muddled; the sillier my former self looks, the better the sign ~

 

Reprinted from: https://blog.csdn.net/weixin_41923961/article/details/82467633
