DocSCAN: Unsupervised Text Classification via Learning from Neighbors

Abstract

We introduce DocSCAN, a completely unsupervised text classification approach built on the Semantic Clustering by Adopting Nearest Neighbors (SCAN) algorithm. For each document, we obtain semantically informative vectors from a large pre-trained language model. We find that similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach then uses pairs of neighboring datapoints as a weak learning signal to automatically learn topic assignments. On three different text classification benchmarks, we improve on various unsupervised baselines by a large margin.

1 Introduction

“What is this about?” is the starting question in human and machine reading of text documents. While this question would invite a variety of answers for documents in general, there is a large set of corpora for which each document can be labeled as belonging to a single category or topic. Text classification is the task of automatically mapping texts into these categories. In the standard supervised setting (Vapnik, 2000), machine learning algorithms learn such a mapping from annotated examples. Annotating data is costly, however, and the resulting annotations are usually domain-specific. Unsupervised methods promise to reduce the number of labeled examples needed or to dispense with them altogether.

This paper builds on recent developments in the domain of unsupervised neighbor-based clustering of images, namely the SCAN algorithm: Semantic Clustering by Adopting Nearest Neighbors (Van Gansbeke et al., 2020). We adapt the algorithm to text classification and report strong experimental results on three text classification benchmarks. The intuition behind SCAN is that images often share the same label if their embeddings in some representation space are close to each other. Thus, we can leverage this regularity as a weakly supervised signal for training models. We encode a datapoint and its neighbors through a network whose output is produced by a classification layer. The model learns that it should assign similar output probabilities to a datapoint and each of its neighbors. In the ideal case, model output is consistent and one-hot, i.e., the model confidently assigns the same label to two neighboring datapoints.

Deep Transformer networks have led to rapid improvements in text classification and other natural language processing (NLP) tasks (see e.g. Yang et al., 2019). We draw on such models to obtain task-agnostic contextualized language representations. We use SBERT embeddings (Reimers and Gurevych, 2019), which have proven performance in a variety of downstream tasks, such as retrieving semantically similar documents and text clustering.

We show that, in this semantic space, neighboring documents do indeed tend to share the same class label, and we can use this proximity to build a dataset to which we apply our neighbor-based clustering objective. We find that training a model to exploit this regularity works well for text classification and outperforms a standard unsupervised baseline by a large margin. All code for DocSCAN is publicly available online.

2 Related Work

Unsupervised learning methods are ubiquitous in natural language processing and text classification. For a more general overview, we refer to surveys discussing the topic in extensive detail (see e.g. Feldman and Sanger, 2006; Grimmer and Stewart, 2013; Aggarwal and Zhai, 2012; Thangaraj and Sivakami, 2018; Li et al., 2021). One common approach for text classification is to represent documents as vectors and then apply any clustering algorithm to the vectors (Aggarwal and Zhai, 2012; Allahyari et al., 2017). The resulting clusters can be interpreted as the text classification results. A popular choice is the k-means algorithm, which learns cluster centroids that minimize the within-cluster sum of squared distances to centroids (see e.g. Jing et al., 2005; Guan et al., 2009; Balabantaray et al., 2015; Slamet et al., 2016; Song et al., 2016; Kwale, 2017). This methodology also has applications in social science research, where for example Demszky et al. (2019) classify tweets using this method. K-means can also be applied in an iterative manner (Rakib et al., 2020).

There exist more sophisticated methods for generating weak labels for unsupervised text classification. However, most of these methods rely on some sort of domain knowledge or heuristically generated labels. For example, Ratner et al. (2017) generate a correlation-based aggregate of different labeling functions to produce proxy labels. Yu et al. (2020) create weak labels via heuristics, and Meng et al. (2020) use seed words (most importantly the label name) and infer the text category assignment from a masked language modeling task and seed word overlap for each category. DocSCAN is not subject to any of these dependencies. Similarly to k-means, we only need the number of topics present in a dataset. Hence, we think it is well suited to be compared against k-means.

3 Method

In this work, we build on the SCAN algorithm (Van Gansbeke et al., 2020). It is based on the intuition that a datapoint and its nearest neighbors in (some reasonable) representation space often share the same class label. The algorithm consists of three stages: (1) learn representations via a self-learning task, (2) mine nearest neighbors and fine-tune a network on the weak signal that two neighbors share the same label, and (3) confidence-based self-labeling of the training data (which is omitted in this work). Our adaptation of SCAN to text classification, DocSCAN, works as follows. In Step 1, we need a document embedding method that serves as an analogue to SCAN's self-learning task for images. Textual entailment (Dagan et al., 2005) is an interesting pretraining task that yields transferable knowledge and generic language representations, as already shown by Conneau et al. (2018). Combining this pretraining task with large Transformer models (e.g., Devlin et al., 2019) has led to SBERT (Reimers and Gurevych, 2019): a network of BERT models fine-tuned on the Stanford Natural Language Inference corpus (Bowman et al., 2015). SBERT yields embeddings for short documents with proven performance across domains and for a variety of tasks, such as semantic search and clustering. For a given corpus, we apply SBERT and obtain a 768-dimensional dense vector for each document. We directly use the pre-trained SBERT model fine-tuned on top of the MPNet model (Song et al., 2020), which yields the best-performing embeddings (on average) across 14 sentence embedding tasks and 6 semantic search tasks.
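As a concrete illustration of Step 1, the embedding computation might look as follows with the sentence-transformers library. This is a minimal sketch under our assumptions: the checkpoint name all-mpnet-base-v2 is our guess for the MPNet-based SBERT model, and the documents list is placeholder data.

```python
# Minimal sketch of Step 1: embed documents with an MPNet-based SBERT model.
# Assumes the sentence-transformers package; "all-mpnet-base-v2" is an assumed
# checkpoint name, and `documents` stands in for a real corpus.
from sentence_transformers import SentenceTransformer

documents = [
    "The central bank raised interest rates again this quarter.",
    "The striker scored twice in the final minutes of the match.",
]

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(documents, convert_to_numpy=True)
print(embeddings.shape)  # (n_docs, 768)
```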

Step 2 is the mining of neighbors in the embedding space. We apply Faiss (Johnson et al., 2017) to compute Euclidean distances between all embedded document vectors. The retrieved neighbors of a reference datapoint are the documents with the smallest Euclidean distance to it.
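A sketch of this neighbor-mining step with the faiss library is shown below; the exact (brute-force) L2 index is one reasonable choice but not necessarily the paper's exact configuration, and embeddings is the array from the previous sketch.

```python
# Sketch of Step 2: mine the k nearest neighbors of each document with Faiss.
# `embeddings` is the (n_docs, 768) array from the embedding step.
import faiss
import numpy as np

k = 5
embeddings = embeddings.astype(np.float32)
index = faiss.IndexFlatL2(embeddings.shape[1])   # exact Euclidean-distance index
index.add(embeddings)

# Query with k + 1 because the nearest neighbor of each point is itself.
distances, indices = index.search(embeddings, k + 1)
neighbors = indices[:, 1:]                        # drop the self-match
print(neighbors.shape)                            # (n_docs, k)
```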

[Figure 1: Fraction of mined neighboring document pairs that share the same class label, for k = 1 and k = 5, on the three benchmark datasets.]

SCAN worked because images with proximate embeddings tended to share class labels. Is that the case with text? Figure 1 shows that the answer is yes: across three text classification benchmarks, neighboring document pairs do indeed often share the same label. The fraction of pairs sharing the same label at k = 1 is above 85% for all datasets examined. For k = 5, the resulting fraction of correct pairs (out of all mined pairs) is still higher than 75% in all cases. Furthermore, these frequencies of correct pairs for k = 5 are often higher than the frequency of correct pairs reported for images by Van Gansbeke et al. (2020).
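This sanity check can be reproduced in a few lines once gold labels are available for evaluation; in the sketch below, labels is a hypothetical array of gold topic labels (used only for measurement) and neighbors is the index array from the Faiss step.

```python
# Fraction of mined (datapoint, neighbor) pairs that share the gold label.
# `labels` is an (n_docs,) integer array of gold topic labels (evaluation only);
# `neighbors` is the (n_docs, k) index array from the Faiss step.
import numpy as np

labels = np.asarray(labels)
pair_agreement = (labels[:, None] == labels[neighbors]).mean()
print(f"fraction of neighbor pairs sharing a label: {pair_agreement:.3f}")
```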

Next, we describe the SCAN loss,

$$-\frac{1}{|D|} \sum_{x \in D} \sum_{k \in N_x} \log\big(f(x) \cdot f(k)\big) \;+\; \lambda \sum_{i \in C} p_i \log(p_i) \qquad (1)$$

which can be broken down as follows. The first part of Eq. (1) is the consistency loss. Our model f (parametrized by a neural net) computes a label distribution for a datapoint x from the dataset D and for each datapoint k in the set N_x of neighbors mined for x. We then simply compute the dot product (denoted by ·) between the output distribution (normalized by a softmax function) for our datapoint x and that of its neighbor k. This dot product is maximized if both model outputs are one-hot with all probability mass on the same entry in the respective vectors. The term enforces consistency because we want to assign the same label to a datapoint and all of its neighbors.

The second term is an auxiliary loss that provides regularization via entropy (scaled by a weight λ), so that the model is encouraged to spread probability mass across all clusters C, where p_i denotes the probability the model assigns to cluster i in C.

Without this entropy term, there exists a shortcut solution that collapses all examples into a single cluster. The entropy term ensures that the distribution of class labels resulting from applying DocSCAN tends to be roughly uniform. Thus, it works best for text classification tasks where the number of examples per class is balanced as well.
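A minimal PyTorch sketch of Eq. (1), under our reading of it, is given below. The entropy term is computed on batch-averaged cluster probabilities, as in the original SCAN formulation; the value of lambda_weight and the clamping constant are our own assumptions rather than settings reported in the paper.

```python
# Sketch of the SCAN loss (Eq. 1): consistency term + entropy regularizer.
# `logits_x` and `logits_nbr` are the classification-layer outputs for a batch of
# datapoints and for one mined neighbor of each datapoint, shape (batch, n_classes).
import torch
import torch.nn.functional as F

def scan_loss(logits_x, logits_nbr, lambda_weight=2.0, eps=1e-8):
    p_x = F.softmax(logits_x, dim=1)
    p_nbr = F.softmax(logits_nbr, dim=1)

    # Consistency: -log of the dot product between the two output distributions.
    dot = (p_x * p_nbr).sum(dim=1)
    consistency = -torch.log(dot.clamp(min=eps)).mean()

    # Entropy term on the batch-averaged cluster assignment; adding p_i log p_i
    # (a negative entropy) discourages collapsing everything into one cluster.
    p_mean = p_x.mean(dim=0)
    neg_entropy = (p_mean * torch.log(p_mean.clamp(min=eps))).sum()

    return consistency + lambda_weight * neg_entropy
```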

To summarize: we use SBERT to embed every datapoint in a given text classification dataset. We then mine the five nearest neighbors of every datapoint. This yields our weakly supervised training set. We fine-tune networks on neighboring datapoints using the SCAN loss. At test time, we compute f(x) for every datapoint x in the test set. We set the number of outcome classes equal to the number of classes in each considered dataset and use the Hungarian matching algorithm (Kuhn, 1955) to obtain the optimal cluster-to-label assignment.
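The cluster-to-label matching and the resulting accuracy can be computed with SciPy's implementation of the Hungarian algorithm; the sketch below assumes integer arrays y_true (gold labels) and y_pred (predicted cluster ids) of equal length, both taking values in 0..n_classes-1.

```python
# Sketch: map predicted cluster ids to gold labels with the Hungarian algorithm,
# then compute accuracy. `y_true` and `y_pred` are equal-length integer arrays.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_accuracy(y_true, y_pred, n_classes):
    # Confusion matrix: rows = predicted cluster, columns = gold label.
    cost = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        cost[p, t] += 1
    # Maximize matched counts by minimizing the negated counts.
    row_ind, col_ind = linear_sum_assignment(-cost)
    mapping = dict(zip(row_ind, col_ind))
    remapped = np.array([mapping[p] for p in y_pred])
    return (remapped == np.asarray(y_true)).mean()
```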

4 Experiments

We apply DocSCAN to three widely used but diverse text classification benchmarks: the 20NewsGroup data (Lang, 1995), the AG's news corpus (Zhang et al., 2015), and the DBPedia ontology dataset (Lehmann et al., 2015). We provide further dataset descriptions in Appendix Section A.

The main results are reported in Table 1. For all experiments, we report the mean accuracy over 10 runs on the test set (with different seeds), along with the 95% confidence interval. The columns correspond to the benchmark corpora. The rows correspond to the models, starting with a random baseline [1], two k-means baselines [2, 3], and the results obtained by DocSCAN in [4]. We also report a supervised learning baseline [5] and results taken from the related literature in [6].

Row [1] provides a sensible lower bound; row [5], analogously, provides a supervised upper bound for text classification performance. In the random draw [1], accuracy by construction converges to the average of the class proportions. The supervised model [5] is an SVM classifier applied to the same SBERT embeddings that serve as inputs to the k-means baseline and to DocSCAN. Predictably, the supervised baseline obtains strong accuracy on these benchmark classification tasks.

The industry workhorse for clustering is k-means, an algorithm for learning cluster centroids that minimize the within-cluster sum of squared distances to centroids. When applied to TF-IDF-weighted bag-of-n-grams features [2], k-means improves over the results obtained in [1]. When applied to SBERT vectors [3], we see large improvements over all previous experiments. First, these results suggest that k-means applied to reasonable document embeddings already yields satisfactory results for text classification. Second, they corroborate what we already saw in Figure 1, namely that neighbors in SBERT representation space carry information about text topic classes.
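For reference, the two k-means baselines can be sketched with scikit-learn as below; the vectorizer settings and the n_clusters value are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the k-means baselines: TF-IDF bag-of-n-grams features vs. SBERT vectors.
# `documents` is a list of raw texts, `embeddings` the SBERT matrix from Step 1.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

n_clusters = 4  # e.g., AG's news has 4 classes

# Baseline [2]: k-means on TF-IDF-weighted bag-of-n-grams features.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
X_tfidf = tfidf.fit_transform(documents)
clusters_tfidf = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_tfidf)

# Baseline [3]: k-means on SBERT embeddings.
clusters_sbert = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
```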

So what does DocSCAN add? We fine-tune a classification layer on the SBERT embeddings with the SCAN objective (Eq. 1) and k = 5 neighbors. We observe unambiguous and significant improvements over the already strong k-means baseline on all three datasets (as judged by the 95% confidence intervals). The smallest improvement (over 5 percentage points) is obtained on the 20 News dataset, which contains 20 classes. The largest gains are observed for AG news with 4 classes, suggesting that DocSCAN works best for text classification tasks with a smaller number of classes. Surprisingly, we do not find that the improvements correlate with the accuracy of neighboring pairs sharing the same label (see Figure 1), but rather with the number of classes in the dataset (see Table 2). In the case of the AG news data with only a few different classes, we find that DocSCAN approaches the performance of a supervised baseline using the same input features.

Finally, in [6] we show results from the related literature on unsupervised text classification. We find that DocSCAN performs comparably to other completely unsupervised methods: it obtains the best results on the 20 News dataset, comparable results on the AG news data, and slightly worse results than the related literature on the DBPedia data. However, we note that DocSCAN is a simple method with only hidden_dim × num_classes parameters, that is, exactly one classification layer, fine-tuned in a completely unsupervised manner using the SCAN loss. The results for DBPedia from Meng et al. (2020), in contrast, are obtained by fine-tuning whole language models using domain knowledge (seed words).

We show and discuss ablation experiments for DocSCAN in Appendix B. Specifically, we conduct experiments regarding the various hyperparameters of the algorithm and find that it is robust to such choices. Furthermore, we find that DocSCAN outperforms a k-means baseline over different input features in all settings. Given the findings derived from these experiments, we recommend default hyperparameters for applying DocSCAN.

5 Conclusion

In this work, we introduced DocSCAN for unsupervised text classification. Analogous to the recognizable object content of images, we find that a document and its close neighbors in embedding space often share the same class in terms of topical content. We show that this consistency can be used as a weak signal for fine-tuning text classifier models in an unsupervised fashion. We start with SBERT embeddings and fine-tune DocSCAN on three text classification benchmarks. We outperform a random baseline and two k-means baselines by a large margin. We discuss the influence of hyperparameters and input features for DocSCAN and recommend default parameters which we have observed to work well across our main results.

As with images, unsupervised learning with SCAN can be used for text classification. However, the method may not work as generically and should, for example, be limited to text classification on roughly balanced datasets (given that we use an entropy loss as an auxiliary objective). Still, this work points to the promise of further exploring unsupervised methods that exploit embedding geometry.
