Machine learning: traditional clustering, LDA, and deep learning clustering methods


Clustering has fewer application scenarios than classification, and because unsupervised algorithms are hard to evaluate, it is less often deployed in production environments, but it remains an important part of machine learning. Common application scenarios for text clustering include document tag generation and hot news discovery. In addition, when building text features, clustering can also be used to form low-dimensional representations of features.
This article introduces the text clustering task from three categories of methods: feature-based methods, latent semantic analysis, and deep learning.

1. Method based on text features

Feature representation: the feature representation of text in the clustering task is the same as in classification. Refer to the earlier article Natural language processing - introduction to text classification.

1. K-Means algorithm

The K-means algorithm is one of the most commonly used clustering algorithms. It is simple and fast to compute, but it requires that:

  1. The similarity between data points can be measured by Euclidean distance (the smaller the Euclidean distance, the more similar two points are)
  2. The number of clusters is specified in advance
The algorithm can be summarized as:
1. Initialize k objects as the initial cluster centers
2. Assign each object to the cluster center nearest to it
3. Recompute the cluster centers
Termination condition: no object is reassigned, no cluster center changes, or the sum of squared errors reaches a local minimum
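As a minimal runnable sketch (the example documents, the number of clusters, and the use of scikit-learn are assumptions here, not part of the original text), K-means can be applied directly to TF-IDF features:

# Minimal sketch: K-means on TF-IDF features of a few made-up documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the stock market fell sharply today",
    "investors worry about the stock market",
    "the football team won the championship",
    "a great match for the football team",
]

X = TfidfVectorizer().fit_transform(docs)                      # document-term TF-IDF matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)    # k is specified in advance
print(km.labels_)                                              # cluster id of each document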

Problems with the K-means algorithm

  1. K needs to be specified manually
  2. How to choose the initial cluster centers
  3. Outliers have a significant impact

For problem 1, you can simply analyze the data, or use structure-based methods (such as the average silhouette coefficient: the closer it is to 1, the better the clustering) or change-based methods (define a function of K and look for an extremum as K changes, expecting the extremum to occur at the correct K). For specific methods, refer to existing summaries; a small example is sketched below.
For problem 2, the initial cluster centers can be obtained by several random initializations, by choosing k points that are as far apart as possible, or by using hierarchical clustering to select K cluster centers and taking them as the initial points.
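For problem 1, here is a small sketch of the structure-based idea: run K-means for several values of K and compare the average silhouette coefficient (the synthetic data from make_blobs is used only for illustration):

# Sketch: choose K by the average silhouette coefficient on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # the K whose score is closest to 1 is preferred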

2. Mean shift algorithm

The mean shift algorithm is often used in scenarios such as target tracking in image recognition and data clustering. The algorithm does not need the number of clusters to be specified; the clusters are instead determined by a bandwidth parameter (the bandwidth can be estimated with a dedicated algorithm).
For a given set of samples, the algorithm first randomly selects a center point, then computes the mean of the offset vectors from the center point to all points within a certain range of it, obtaining a mean shift vector, and moves the center point to the position indicated by that vector. By repeating this movement, the center point gradually approaches the optimal (highest-density) position.

1. Randomly select a point from the data as the initial center.
2. Find all points within the bandwidth of this center; call this set M and consider these points to belong to cluster C.
3. Compute the vectors from the center to each element of M and sum them to obtain the shift vector.
4. Move the center along the direction of the shift vector; the distance moved is the magnitude of the shift vector.
5. Repeat steps 2, 3 and 4 until the magnitude of the shift vector falls below a set threshold, and record the center at that point.
6. Termination condition: every point has been assigned to a cluster.
7. Classification: for each point, take the cluster whose iterations visited it most frequently as the cluster of that point.

The advantage of the mean shift clustering algorithm is that it does not require the number of clusters to be specified.
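A hedged sketch of mean shift with scikit-learn (synthetic data; the quantile used to estimate the bandwidth is an arbitrary choice):

# Sketch: mean shift clustering with an estimated bandwidth; no cluster count is given.
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
bandwidth = estimate_bandwidth(X, quantile=0.2)    # bandwidth estimated from the data
ms = MeanShift(bandwidth=bandwidth).fit(X)
print(len(ms.cluster_centers_), "clusters found")  # the number of clusters is a result, not an input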

3. Hierarchical clustering

Hierarchical clustering divides the data set at different levels, forming a tree-shaped cluster structure. There are usually two approaches: "top-down" divisive splitting and "bottom-up" agglomerative merging.
The agglomerative (merging) algorithm calculates the similarity between two clusters of data points, merges the two most similar ones among all data points, and iterates this process repeatedly. Simply put, the agglomerative algorithm determines similarity by computing the distance between data points or clusters: the smaller the distance, the higher the similarity. Merging the two closest data points or clusters step by step produces a cluster tree. There are three common methods for computing the distance between two merged clusters: Single Linkage, Complete Linkage and Average Linkage.

How many clusters a data set should be grouped into usually depends on the granularity we care about. One advantage of hierarchical clustering over partition-based clustering is that it can display the clustering of the data set at different scales (levels). But it also has shortcomings: (1) the termination condition of the algorithm is vague and hard to state and control precisely; (2) once a merge or split is made, the hierarchy is generally not rebuilt to improve clustering performance.
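A short sketch of bottom-up (agglomerative) clustering, comparing the three linkage strategies mentioned above (synthetic data and a fixed number of clusters are assumptions for illustration):

# Sketch: agglomerative clustering with Single, Complete and Average Linkage.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for linkage in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, labels[:10])   # same data, possibly different merges per linkage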

4. Spectral clustering algorithm

Spectral clustering treats all data as points in space that can be connected by edges. The edge weight between two points far apart is low, while the edge weight between two points close together is high. The graph formed by all the data points is then cut so that the sum of the edge weights between different subgraphs is as low as possible and the sum of the edge weights within each subgraph is as high as possible, which achieves the goal of clustering. The algorithm involves undirected graphs from graph theory, linear algebra and matrix analysis.
For details, see the Spectral Clustering article.
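A rough sketch of spectral clustering with scikit-learn; the nearest-neighbors affinity used to build the graph is just one possible choice:

# Sketch: spectral clustering builds a similarity graph and cuts it into subgraphs.
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
print(sc.fit_predict(X)[:10])   # handles non-convex clusters that K-means struggles with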

5. DBSCAN density clustering algorithm

Density clustering does not evaluate the similarity between data points directly; instead it treats clusters as high-density regions separated by low-density regions.

We define a core sample as a sample that has at least min_samples other samples within a distance of eps; these samples are called the neighbors of the core sample, and the core sample therefore lies in a dense region of the vector space. A cluster is a set of core samples that can be built recursively: select a core sample, find the core samples among its neighbors, then find the core samples among the neighbors of those newly found core samples, and so on. A cluster also contains a set of non-core samples: samples that are neighbors of a core sample in the cluster but are not core samples themselves.

DBSCAN's definition of a cluster is very simple: a maximal set of density-connected samples, derived from the density-reachability relation, is one category or cluster of the final clustering.
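A minimal DBSCAN sketch; eps and min_samples are the two parameters described above, and the values here are only plausible choices for this toy data:

# Sketch: DBSCAN groups dense regions; low-density points are labeled -1 (noise).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks outliers; no cluster count was specified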

sklearn code

The sklearn documentation shows the clustering behaviour of the various clustering methods on different data sets. For specific usage and calls, see the documentation; the code there is described in detail, so it is not reproduced here.

2. Latent semantic analysis

Latent semantic analysis here refers to three types of models: LSA, PLSA and LDA. A simple way to understand them is that each document contains multiple topics, and each topic contains different words. See the referenced blog for details.
We first build a document-word matrix A, whose entries record the words that appear in each document (for example as counts or TF-IDF weights).
Matrix factorization: by decomposition we obtain A', and the difference between A' and A is small enough that A' can be regarded as A. Ut is the topic distribution of each document, and Vt is the distribution of words under each topic.
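A rough sketch of this decomposition with truncated SVD (LSA) in scikit-learn; the documents and the number of topics are placeholders:

# Sketch: LSA via truncated SVD of a document-term matrix A.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the stock market fell", "investors watch the market",
        "the team won the match", "a great football match"]
A = TfidfVectorizer().fit_transform(docs)            # document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topic = svd.fit_transform(A)                     # per-document topic weights (roughly Ut)
topic_word = svd.components_                         # per-topic word weights (roughly Vt)
print(doc_topic.shape, topic_word.shape)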
However, this decomposition is not very interpretable. PLSA instead models the process probabilistically: each document contains multiple topics, and each topic contains multiple words. Adding Bayesian priors to PLSA yields LDA, the topic model. In practical application scenarios you can simply use LDA directly.
LDA code:

import gensim  # assumes doc_term_matrix is a bag-of-words corpus and dictionary a gensim Dictionary

Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

Through LDA we obtain the probability that each document contains each topic. This can be used as a low-dimensional representation of the document for clustering, or the most probable topic can be taken directly as the clustering result. LDA is fairly simple to use, but it is very slow.
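As a hedged follow-up sketch (it assumes the ldamodel, doc_term_matrix and number of topics from above), the per-document topic distribution can be turned into a dense vector and clustered, or the most probable topic can be taken directly:

# Sketch: use the document-topic distribution as a low-dimensional representation.
import numpy as np
from sklearn.cluster import KMeans

doc_topics = np.array([
    [prob for _, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0)]
    for bow in doc_term_matrix
])                                                   # shape: (n_docs, num_topics)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(doc_topics)
argmax_labels = doc_topics.argmax(axis=1)            # or simply take the most probable topic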

3. Deep learning clustering

According to my intuitive understanding, direct clustering based on deep learning is actually not feasible, or at least not easy. (How would the loss function be set?)
The conventional approach is to reduce the dimensionality first and then cluster: use a neural network for dimensionality reduction, and then use traditional methods to cluster in the low-dimensional space.

Common networks used for dimensionality reduction

  • Autoencoder
    An autoencoder is a neural network with three layers: an input layer, a hidden (encoding) layer, and a decoding layer. The purpose of the network is to reconstruct its input so that the hidden layer learns a good representation of the input. If the trained model makes the output consistent with the input, then the small number of neurons in the middle is enough to represent the input data, which amounts to obtaining a compressed representation of it. On this basis there are denoising autoencoders, sparse autoencoders, variational autoencoders, and so on. We can use an autoencoder to obtain a suitable representation of the data and then cluster it (see the sketch after this list).
  • Restricted Boltzmann machine
    A restricted Boltzmann machine looks similar to an autoencoder, but it has two layers instead of three (in fact the differences run much deeper). For feature extraction it likewise takes its hidden layer as a new representation of the original data.
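A minimal sketch of the autoencoder route (PyTorch is assumed here; the random data, layer sizes and training length are all placeholders): train the network to reconstruct its input, then run K-means on the hidden-layer codes.

# Sketch: autoencoder for dimensionality reduction, then traditional clustering on the codes.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

X = torch.rand(500, 100)                        # placeholder for TF-IDF-like text features

encoder = nn.Sequential(nn.Linear(100, 10), nn.ReLU())
decoder = nn.Sequential(nn.Linear(10, 100))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for _ in range(200):                            # train to reconstruct the input
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

codes = encoder(X).detach().numpy()             # low-dimensional representation from the hidden layer
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codes)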

But this understanding of mine turned out to be wrong: in fact, deep learning can be used for clustering.

A deep learning clustering network can be divided into two parts. Part A: a neural network transforms the input into a new representation. Part B: clustering is then performed on the new representation.
Question: how is the loss function of deep learning clustering determined, and how is it trained? (If there is a trainable loss function, then at least the network can be trained.)
See the paper Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering; a rough description is given here.
If you follow the K-means clustering idea, then the loss function of the network should look like the K-means objective. Roughly, the joint objective in the paper has the form

min Σ_i [ ℓ( g(f(x_i)), x_i ) + (λ/2) · || f(x_i) − M s_i ||² ]

The first part is the loss of stage A in the paper (the reconstruction error). We mainly focus on the second part, that is,

(λ/2) · Σ_i || f(x_i) − M s_i ||²

Here f(x) is the low-dimensional representation of the input, M holds the coordinates of the K cluster centers, and s_i indicates which cluster each point belongs to. The key point is that s_i is a discrete variable, so the optimization is carried out by alternating training. Specifically:

  1. Fix M and s, optimize the network, and obtain a new representation of each point
  2. Assign each new point to a cluster (according to its distance to the cluster centers)
  3. Update the center point of each cluster

Compared with the original K-means, this only adds step 1: optimizing the network to obtain a new representation of the input (a sketch of the joint loss follows below).
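A rough, PyTorch-style sketch of step 1 above, computing the joint loss with M and s held fixed (all the tensors and the weight lam are placeholders for illustration, not the paper's actual code):

# Sketch: one network update on reconstruction loss + K-means loss with fixed M and s.
import torch
import torch.nn as nn

X = torch.rand(500, 100)                         # placeholder input features
encoder = nn.Sequential(nn.Linear(100, 10), nn.ReLU())
decoder = nn.Sequential(nn.Linear(10, 100))
M = torch.rand(3, 10)                            # K=3 cluster centers in the latent space (held fixed)
s = torch.randint(0, 3, (500,))                  # fixed cluster assignment of each sample
lam = 0.5                                        # trade-off weight between the two terms

z = encoder(X)                                   # f(x): low-dimensional representation
recon_loss = nn.functional.mse_loss(decoder(z), X)           # stage-A reconstruction loss
kmeans_loss = ((z - M[s]) ** 2).sum(dim=1).mean()            # ||f(x_i) - M s_i||^2 term
(recon_loss + lam * kmeans_loss).backward()      # step 1: only the network parameters receive gradients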

In fact, this is still different from the deep-learning-based clustering I had imagined: the clustering step itself is no different from the traditional approach.

To elaborate my view: the algorithm is mainly divided into two parts. Part A learns a low-dimensional representation (one more suitable for clustering?), and part B clusters the low-dimensional data. The two tasks are in fact still separate, but by optimizing the objective function jointly, the learned low-dimensional representation can become more suitable for clustering. The iterative process of clustering itself is still consistent with the traditional method.

Of course, because the setting here is unsupervised, we cannot only learn a low-dimensional representation that is good for clustering; the representation must also still describe the original input (generalization ability?), so the stage-A loss (such as the reconstruction error of the autoencoder) is added to the formula above. This amounts to joint learning of the two tasks.

For more details about clustering with deep learning, see the article Clustering with Deep Learning: Taxonomy and New Methods.

