SACCOS paper reading notes: a semi-supervised classification framework for new class detection and concept drift adaptation

Introduction

This paper targets two strong assumptions commonly made by clustering-based concept drift detection methods: that instances of the same class lie relatively close together while instances of different classes lie far apart, and that when a new class emerges, its instances arrive continuously and in large numbers.

To address these problems, the paper proposes SACCOS, a semi-supervised adaptive classification framework for data streams that can predict labels under both concept drift and concept evolution.

The main contributions are:

  • Proposed a semi-supervised framework that uses graph-based clustering techniques to address both concept drift and concept evolution
  • Normalizes data instances online to unify their feature scales, and uses a cluster ensemble to detect new classes arriving within a short time
  • Conducted experiments to evaluate the framework

Background

This part explains concept evolution, which is essentially the emergence of new classes. Such emergence falls into two cases:

  • A class that has never appeared before: an instance of some class appears at time $t > 0$, but no instance of that class appeared during $0 < t' < t$, so the model has never been trained or updated on instances of this class. This is called the emergence of a new class.
  • A class that reappears: some class appears at $t > 0$ and also appeared during $0 < t' < t$, but the classifier's training and update data contain no instances of it; such a re-appearing class can likewise be treated as a new class.

Therefore, these two cases are unified formally: concept evolution occurs between times $t_0$ and $t_1$ when $P_{t_0}(y) \neq P_{t_1}(y)$.

Specific Method

Problem Description

Given initial training data $D=\{(x_i, y_i)\}_{i=1}^{m}$ with $y_i \in Y=\{1, 2, 3, ..., c\}$, and non-stationary stream data $S=\{(x'_t, y'_t)\}_{t=1}^{\infty}$ with $y'_t \in Y'=\{1, 2, ..., c, c+1, ..., c'\}$, meaning that new classes may appear in the stream. Suppose the classifier is trained on the stream segment between times $t$ and $t'$, whose label set is $Y_t^{t'} \subseteq Y'$. The problem is then: when an instance $x''$ with label $y''$ appears at time $t''$, determine whether $y''$ belongs to $Y_t^{t'}$. The main challenges of this problem are the occurrence of concept drift, the emergence of multiple new classes within a short period of time, and the limited availability of class labels.

Overview

In outline, the algorithm proceeds as follows (a control-flow sketch follows the list):

  1. First, SACCOS creates an initial set of clusters from instances with known class labels and trains a classifier to recognize the existing classes.
  2. Second, each new instance arriving on the stream is normalized online so that its features share a common scale.
  3. The normalized instance is then temporarily stored in a dynamic buffer W, and the cluster ensemble is used to check whether the instance is an outlier, i.e., whether it falls outside the boundaries of all clusters in the ensemble.
  4. If it is not an outlier, the classifier predicts its label; if it is, the instance is temporarily stored in the outlier buffer B.
  5. Once the number of instances in B exceeds a critical value, the emerging-class detection module clusters these outliers via mutual graph clustering.
  6. After clustering is complete, the existing classifier predicts a label for every instance in each outlier cluster and computes its confidence; the confidence of a cluster is the minimum confidence over all instances in it.
  7. If a cluster's confidence exceeds a threshold, its instances keep the pseudo-labels previously predicted by the existing classifier; otherwise, one instance is randomly sampled from the cluster, its true label is requested, and that label is propagated to predict the labels of the other instances in the cluster.
  8. Finally, SACCOS uses an online change point detection (CPD) mechanism to monitor the classifier's confidence when predicting labels for existing-class instances; when a change is detected (meaning concept drift has occurred), the classifier is replaced by a new one trained on the most recent instances observed along the stream.
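As the forward reference above promised, here is a minimal Python sketch of this control flow. It is not the authors' code: every name (`normalizer`, `ensemble`, `cpd`, the buffer names, the helper callables and their signatures) is an assumption made for illustration, and the concrete components are injected from outside.

```python
import random

def saccos_loop(stream, classifier, ensemble, normalizer, cpd,
                T_S, T_B, conf_threshold,
                cluster_outliers, request_true_label, propagate_label,
                retrain_classifier):
    """Hedged sketch of the SACCOS control flow (steps 2-8 above)."""
    S, B = [], []                                # temp buffer / outlier buffer
    for x in stream:
        S.append(x)
        if len(S) < T_S:                         # wait until the buffer fills
            continue
        normalizer.update(S)                     # step 2: online normalization
        for z in (normalizer.transform(v) for v in S):
            if ensemble.contains(z):             # step 3: inside some cluster?
                label, conf = classifier.predict(z)   # step 4: classify inlier
                if cpd.update(conf):             # step 8: change point => drift
                    classifier = retrain_classifier()
            else:
                B.append(z)                      # step 4: buffer the outlier
        S.clear()
        if len(B) > T_B:                         # step 5: enough outliers
            for cluster in cluster_outliers(B):  # mutual graph clustering
                confs = [classifier.predict(v)[1] for v in cluster]
                if min(confs) > conf_threshold:  # steps 6-7: trust pseudo-labels
                    pseudo = [classifier.predict(v)[0] for v in cluster]
                else:                            # step 7: query one true label
                    v0 = random.choice(cluster)
                    propagate_label(cluster, v0, request_true_label(v0))
            B.clear()
```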

A note on label propagation: it is a graph-based semi-supervised learning method. The basic idea is to use samples with known labels to infer labels for unknown samples. Labeled samples become labeled nodes and unlabeled samples become unlabeled nodes in a graph whose nodes represent samples and whose edges represent pairwise similarity, computed with some similarity measure such as Euclidean distance or cosine similarity. Labels are first propagated from labeled nodes to their neighbor nodes, the labels of unlabeled nodes are updated, and this process repeats until convergence; during propagation each node's label is influenced by its neighbors' labels, so it is continually revised until it settles into a stable state. The update is expressed as:
$$f_i=\frac{\sum_{j\in N_i}w_{ij}f_j}{\sum_{j \in N_i}w_{ij}}$$
That is, the label of node $i$ is the weighted average of its neighbors' labels.
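As an illustration of this update, here is a minimal label propagation routine. It is a generic sketch of the formula above, not the paper's implementation; the function name and input conventions (a dense similarity matrix `W`, `-1` marking unlabeled nodes) are assumptions.

```python
import numpy as np

def label_propagation(W, labels, n_classes, max_iter=100, tol=1e-6):
    """Propagate labels on a similarity graph until convergence.

    W       : (n, n) symmetric similarity matrix (the w_ij weights)
    labels  : length-n int array; class index for labeled nodes, -1 otherwise
    Returns : predicted class index for every node
    """
    labels = np.asarray(labels)
    n = len(labels)
    F = np.zeros((n, n_classes))                 # soft label matrix
    labeled = labels >= 0
    F[labeled, labels[labeled]] = 1.0            # one-hot seeds for known labels

    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                # guard isolated nodes
    P = W / row_sums                             # normalized weights w_ij / sum_j w_ij

    for _ in range(max_iter):
        F_new = P @ F                            # f_i = weighted neighbor average
        F_new[labeled] = F[labeled]              # clamp the known labels
        if np.abs(F_new - F).max() < tol:        # stop once labels are stable
            F = F_new
            break
        F = F_new
    return F.argmax(axis=1)
```

In SACCOS this is applied within a low-confidence outlier cluster: the single queried instance acts as the labeled node, and its true label spreads to the remaining instances in the cluster.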

Specific Framework
Initialization

In this stage, the instances are first normalized, and the labeled instances are used to train the classifier C. The normalized initial data is then used to generate a set of clusters, which are added to the cluster ensemble M, forming the initial classifier and cluster ensemble. The ensemble M has a size limit $T_M$.

After the warm-up period, each newly arriving data instance is saved in a temporary data buffer S of size $T_S$. Whenever S is full, SACCOS updates the normalization parameters, normalizes the instances in S, sends them to the other modules for further processing, and then clears S.

Online Normalization (Standardization)

Clustering mechanisms usually rely on a well-scaled feature space for measuring distances between instances; if features have unbounded ranges, distance-based algorithms break down. The data therefore usually needs to be standardized. However, streaming data typically comes from an unknown distribution, which makes offline normalization impractical, so online normalization is used instead.

Suppose $N$ instances have been observed so far, and $T_S$ new instances now arrive from S. The updates to the mean and variance are:

(Figure: incremental update equations for the running mean and variance.)

And because new samples have arrived, the centroid and radius of each cluster in the current ensemble should also change accordingly, that is:

(Figure: rescaling equations for cluster centroids and radii under the updated normalization parameters.)
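Since the paper's exact equations are in the figures above, here is a hedged sketch assuming the standard batch-incremental moment update (Chan et al.'s parallel mean/variance formula); the class name and interface are illustrative, not the paper's API.

```python
import numpy as np

class OnlineNormalizer:
    """Running per-feature mean/variance for z-score normalization.

    Minimal sketch assuming the standard batch-incremental moment update;
    the paper's exact update is shown in the figure above.
    """
    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.M2 = np.zeros(n_features)           # sum of squared deviations

    def update(self, batch):                     # batch: (T_S, n_features)
        batch = np.asarray(batch, dtype=float)
        m = len(batch)
        b_mean = batch.mean(axis=0)
        b_M2 = ((batch - b_mean) ** 2).sum(axis=0)
        delta = b_mean - self.mean
        total = self.n + m
        self.mean += delta * m / total           # combined mean
        self.M2 += b_M2 + delta ** 2 * self.n * m / total   # combined moments
        self.n = total

    def transform(self, x):
        std = np.sqrt(self.M2 / max(self.n, 1))
        std[std == 0] = 1.0                      # guard constant features
        return (np.asarray(x, dtype=float) - self.mean) / std
```

Whenever these parameters change, previously formed clusters live in a slightly different normalized space, which is what the centroid/radius rescaling above accounts for.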

Clustering

The algorithm used is mutual graph clustering (MGC), which has proven particularly effective at identifying meaningful clusters.

First define the following:

  • KNN-set: the set of the $k$ nearest neighbor instances of an instance $x_i$
  • Mutual graph $G(k)$: a graph in which any two adjacent nodes $x_1, x_2$ must each be in the other's KNN-set
  • Growth of mutual graphs: suppose $x_1 \in G_1(k)$ and $x_2 \in G_2(k)$; if $x_1$ and $x_2$ are in each other's KNN-set, they can be connected, merging $G_1$ and $G_2$ into a larger graph

Then the clustering corresponding to each graph can be expressed as:

(Equation: each cluster is the vertex set $V(G_l(k))$ of a mutual graph $G_l(k) \in G(k, D)$.)

Here $G(k, D)$ denotes the set of all mutual graphs of the data set D, and $V(G_l(k))$ denotes the vertex set of graph $G_l(k)$; since vertices represent instances, each vertex set forms a cluster.
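To make the definitions concrete, here is a simplified sketch of clustering by connected components of the mutual k-NN graph. It follows the definitions above but is not the paper's MGC implementation; it assumes `scipy` is available, that there are more than k instances, and that instances are distinct (so each point's nearest neighbor in the tree is itself).

```python
import numpy as np
from scipy.spatial import cKDTree

def mutual_graph_clusters(X, k):
    """Clusters as connected components of the mutual k-NN graph of X."""
    n = len(X)
    tree = cKDTree(X)
    # k+1 nearest neighbors; column 0 is the query point itself
    _, knn = tree.query(X, k=k + 1)
    knn_sets = [set(row[1:]) for row in knn]

    # union-find over mutual k-NN edges
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in knn_sets[i]:
            if i in knn_sets[j]:                 # mutual neighbors -> connect
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())               # each V(G_l(k)) as index list
```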

To improve the efficiency of parallel processing, the feature space of the data instances is partitioned into subspaces by a KD-Tree built on feature variance, with each subspace containing at most $N_m$ instances. Within each subspace, noise reduction is performed using the q-neighborhood noise coefficient (q-NNC) of each data instance: define $\tilde{D}_{q,nearest}(x)$ and $\tilde{D}_{q,farthest}(x)$ as the average distance from instance $x$ to its $q$ nearest neighbors and to its $q$ farthest neighbors, respectively. The q-NNC coefficient is then

$$q\text{-}NNC(x)=\frac{\tilde{D}_{q,nearest}(x)}{\tilde{D}_{q,farthest}(x)}$$

The closer this value is to 1, the more likely $x$ is a noise instance. The instances of the subspace are therefore sorted in descending order of q-NNC, and instances with coefficients greater than a threshold are considered noise and discarded. The threshold $T_{noise}$ is chosen to split the sorted list into two parts $L^l$ and $L^r$ whose standard deviations differ as little as possible:

$$T_{noise}=\arg\min \vert \sigma(L^r)-\sigma(L^l)\vert$$
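A small sketch of this noise-filtering step under the definitions above; the function name and the brute-force pairwise distance computation are illustrative assumptions, not the paper's implementation (it assumes more than q+1 instances per subspace).

```python
import numpy as np

def qnnc_filter(X, q):
    """Drop likely-noise instances via the q-NNC split described above."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self from neighbors
    d_near = np.sort(D, axis=1)[:, :q].mean(axis=1)
    # farthest q neighbors: largest finite distances in each row
    finite = np.sort(np.where(np.isinf(D), -np.inf, D), axis=1)
    d_far = finite[:, -q:].mean(axis=1)
    nnc = d_near / d_far                         # q-NNC, close to 1 => noisy

    order = np.argsort(-nnc)                     # descending by coefficient
    s = nnc[order]
    # split minimizing |sigma(right part) - sigma(left part)|
    best_i, best_gap = 1, np.inf
    for i in range(1, len(s) - 1):
        gap = abs(s[:i].std() - s[i:].std())
        if gap < best_gap:
            best_i, best_gap = i, gap
    noisy = set(order[:best_i].tolist())         # above-threshold = noise
    keep = [i for i in range(len(X)) if i not in noisy]
    return X[keep]
```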
Next, MGC is used to build multiple clusters. Each cluster forms a hypersphere whose centroid is the per-feature mean of its instances and whose radius is the average distance of the instances from the centroid. The specific algorithm is:

(Figure: pseudocode of the MGC clustering algorithm.)

The specific algorithm flow chart is:

(Figures: SACCOS algorithm flow charts.)

Origin: blog.csdn.net/StarandTiAmo/article/details/129264739