Natural Language Processing (NLP) - GSDMM for Short Text Clustering

Table of contents

Series Article Directory

1. Introduction to Papers and Algorithms

1. Background and introduction of the paper

2. Movie grouping process simulates GSDMM clustering

3. Algorithm model and process

4. Paper results and analysis

2. GSDMM model reproduction (MGP process)

1. Core idea

2. Implementation process

3. Code testing and result analysis

3.1 Test code

3.2 Clustering

3. Experimental reproduction of the paper

1. Project import

1.1 Use jupyter directly

1.2 File conversion using Pycharm (method used in this experiment)

2. Code and Analysis

2.1 Definition of MovieGroupProcess class

2.2 Definition of classifier

2.3 Data import and setting

2.4 Clustering

2.5 K-means clustering

2.6 Statistical analysis

3. Code testing

3.1 GSDMM

3.2 K-means

3.3 Relevant data display

4. Analysis of statistical results

4.1 Advantages of GSDMM

4.2 Influence of parameters on clustering effect

4. Questions and Supplements

1. Problem

1.1 Problem description

1.2 Solutions

2. Dataset introduction

3. GSDMM and LDA

3.1 Introduction to LDA

3.2 Performance comparison

3.3 Conclusion

4. Model Supplement

5. References

Summary


Series Article Directory

This series of blog posts focuses on the concepts, principles and code practice of natural language processing (NLP) and does not include tedious mathematical derivations (if you have any questions, please raise them in the comment area or contact me directly by private message).

Chapter 1 Natural Language Processing NLP - GSDMM for Short Text Clustering


Synopsis

    This post introduces the GSDMM paper and algorithm for short text clustering, reproduces the GSDMM model and the paper's experiments, and analyzes the statistical results (dataset and Python code are attached).


Paper selection: A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering (GSDMM for short text clustering)

Experimental environment: PyCharm 2021 + scikit-learn 0.19.2 (to avoid the impact of the removal of linear_assignment)

1. Introduction to Papers and Algorithms

1. Background and introduction of the paper

    With the popularity of social media such as Twitter, short text clustering has become an increasingly important task. It is a challenging problem because short text data is sparse, high-dimensional and large in volume. In the paper, the authors propose a collapsed Gibbs sampling algorithm for the Dirichlet multinomial mixture model (abbreviated GSDMM).

    The authors found that GSDMM can automatically infer the number of clusters, achieves a good balance between completeness and consistency, and converges quickly. It also copes well with the sparse, high-dimensional nature of short texts and can obtain the representative words of each cluster. Extensive experiments show that GSDMM significantly outperforms three other clustering models (K-means, HAC, DMAFP) on the Google News dataset and the Tweet dataset.

2. Movie grouping process simulates GSDMM clustering

    GSDMM explains the clustering process with an analogy called the Movie Group Process (MGP). Short text clustering can be viewed as grouping students by the list of movies each student has watched: students within a group have watched similar movies, while the movie lists of students in different groups differ greatly. The correspondence between MGP and short text clustering is shown in Table 1, and the movie grouping process (MGP) itself is shown in Table 2:

Table 1 Analogy between MGP and short text clustering

MGP — short text clustering
All students — the data set (corpus)
Each student / each movie list — each document
The movies a student has watched / the movies on the list — the words in a document

Table 2 Movie grouping process

1. Predefine K groups and randomly assign the students to these K groups

2. For each student, reassign a group according to the following criteria:

   a. Prefer a group with more students

   b. Prefer a group whose students have movie lists more similar to his or her own

3. Repeat step 2 until the group assignments become stable

    Analysis: The completeness and consistency of GSDMM are reflected in criterion a and criterion b respectively. Criterion a makes the clusters more complete, i.e. it lets a group contain as many of the students that belong to it as possible, while criterion b makes the clusters more homogeneous, i.e. it keeps students with similar movie lists in the same group as much as possible.

    GSDMM reassigns each student's group through the conditional probability in Figure 1:

Figure 1 Conditional probability formula for reallocation

Tips: In the conditional probability formula above, the part in the orange dotted box (the left dotted box) corresponds to criterion a, and the part in the blue dotted box (the right dotted box) corresponds to criterion b.
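
As a textual stand-in for the formula in Figure 1, equation (3) of Yin and Wang (2014), which the score function in Part 2 implements, can be written in LaTeX as:

p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \;\propto\; \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left(n_{z,\neg d}^{w} + \beta + j - 1\right)}{\prod_{i=1}^{N_d} \left(n_{z,\neg d} + V\beta + i - 1\right)}

Here m_{z,\neg d} is the number of documents currently in cluster z (excluding d), n_{z,\neg d}^{w} the count of word w in cluster z, n_{z,\neg d} the total number of words in cluster z, N_d the length of document d, N_d^w the count of w in d, D the number of documents, V the vocabulary size, and K the number of clusters. The first factor corresponds to criterion a and the second to criterion b.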

3. Algorithm model and process

The probabilistic graphical model of DMM and its comparison with other models (LDA, BTM) are shown in Figure 2:

Figure 2 Comparison of probability maps of different algorithm models

    Analysis: The DMM model has one layer fewer than LDA. LDA first generates a topic distribution for each document, then generates a topic for each word, and finally draws each word from that topic's word distribution. DMM instead generates a single global topic distribution, assigns each document exactly one topic, and then draws all the words of the document from that topic's word distribution. BTM builds word pairs (biterms) from the words of the documents and then draws a topic for each word pair.
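
Since Figure 2 is only referenced and not reproduced here, the generative process of DMM can be summarized compactly as follows (a standard statement of the mixture model; θ is the global cluster distribution and φ_k the word distribution of cluster k):

\theta \sim \mathrm{Dirichlet}(\alpha), \qquad \phi_k \sim \mathrm{Dirichlet}(\beta), \quad k = 1, \dots, K

z_d \sim \mathrm{Multinomial}(\theta), \qquad w \sim \mathrm{Multinomial}(\phi_{z_d}) \ \text{for each word } w \text{ of document } d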

    The algorithm flow of GSDMM is shown in Table 3:

Table 3 GSDMM algorithm flow

    The actual process of GSDMM short text clustering is as follows:

1. In the initial stage, specify the number of document clusters as K. According to the paper, K is usually set larger than the true number of classes.

2. For each document d, assign d to a cluster z drawn from a multinomial distribution, and update the statistics of that cluster: the document count increases by 1, the word count increases by the number of words in d, and the per-word counts increase by the counts of the words in d.

3. After the classification is complete, iterate the following operations:

For each document d, record its current cluster label z1, remove d from cluster z1 and update z1's statistics, then draw a new cluster for d from the conditional probability distribution given all the other assignments (with d removed). This is a Gibbs sampling step: d is assigned a new label z2 and the related statistics are updated.

4. Paper results and analysis

In the paper, GSDMM, K-means, HAC, and DMAFP were run on the Google News dataset and the Tweet dataset, and five clustering metrics (NMI, H, C, ARI, AMI) were used to compare the clustering quality of the different models, as shown in Figure 3 (taking the Tweet dataset as an example):

Figure 3 Comparison of the effects of different clustering models under different clustering indicators

    Tips: For visualization purposes, the performance values are normalized separately for each evaluation metric.

    Analysis: Figure 3 shows that the performance of GSDMM under the different clustering metrics is better than that of the other three clustering models, which verifies the superiority of the algorithm for short text clustering.

    In addition, the paper also compares the clustering quality under different parameters. The impact of different K values and numbers of iterations on the results for the different datasets is shown in Figure 4, and the impact of different α and β values is shown in Figure 5:

Figure 4 The influence of different K values and numbers of iterations on the results for different datasets

Figure 5 Effect of different α and β on the results of different data sets

    Analysis: Figures 4 and 5 show that these four parameters have a strong influence on the clustering quality. For the Tweet dataset, when the initial number of clusters K is around 300, alpha is 0.1, beta is 0.08, and the number of iterations is about 20, the clustering result of GSDMM becomes stable and basically matches the true grouping.

    Supplement: The four parameters above are empirical parameters, and the optimal values differ from dataset to dataset (datasets can vary considerably). In practical applications, GSDMM clusters well when it is given good empirical parameters, which gives it high practical value.

2. GSDMM model reproduction (MGP process)

1. Core idea

    This part reproduces, with hand-written code, the movie-grouping simulation of GSDMM clustering described in Part 1 (2). The problem and conditions are as follows:

    At the beginning, all students are randomly assigned to K groups and each writes down a list of favorite movies. The professor then reads each student's list in turn, and after each list is read, that student's group number is updated so as to satisfy one or both of the following conditions:

1. The new group has more students

2. The students in the new group have movie lists more similar to the student's own

Repeat the above operations until the group numbers of all students no longer change; the resulting grouping is the clustering result we need.

    Implementation core: write a MovieGroupProcess class and pass the corresponding parameters to its constructor; the required parameters are shown in Table 4:

Table 4 Core parameters required for the MGP process

Parameter — role or definition
K — maximum number of clusters
alpha (α) — controls the probability that a student joins a currently empty group; when it is 0, no one will join an empty group
beta (β) — controls how strongly a student weighs similarity of interests; when it is low, a student prefers a group whose members have similar movie lists over a more popular group
n_iters — number of iterations
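
As a quick illustration of these parameters, here is a minimal usage sketch (toy data and illustrative values only; it assumes the MovieGroupProcess class defined in the next subsection):

# Minimal usage sketch of the Table 4 parameters (toy documents, illustrative parameter values)
mgp = MovieGroupProcess(K=8, alpha=0.1, beta=0.1, n_iters=30)
docs = [["red", "dog"], ["blue", "cat"], ["red", "dog", "house"]]  # tokenized toy documents
labels = mgp.fit(docs, vocab_size=5)  # 5 distinct words in the toy vocabulary; one cluster label per document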

2. Implementation process

    Source address: GitHub - rwalk/gsdmm: GSDMM: Short text clustering

    Implement the MGP process according to the algorithm flow and model in Part 1 and the core idea in Part 2 (1). The code is as follows:

from numpy.random import multinomial
from numpy import log, exp
from numpy import argmax
import json

class MovieGroupProcess:
    def __init__(self, K=8, alpha=0.1, beta=0.1, n_iters=30):

        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.n_iters = n_iters

        # slots for computed variables
        self.number_docs = None
        self.vocab_size = None
        self.cluster_doc_count = [0 for _ in range(K)]
        self.cluster_word_count = [0 for _ in range(K)]
        self.cluster_word_distribution = [{} for i in range(K)]

    @staticmethod
    def from_data(K, alpha, beta, D, vocab_size, cluster_doc_count, cluster_word_count, cluster_word_distribution):
        '''
        Reconstitute a MovieGroupProcess from previously fit data
        :param K:
        :param alpha:
        :param beta:
        :param D:
        :param vocab_size:
        :param cluster_doc_count:
        :param cluster_word_count:
        :param cluster_word_distribution:
        :return:
        '''
        mgp = MovieGroupProcess(K, alpha, beta, n_iters=30)
        mgp.number_docs = D
        mgp.vocab_size = vocab_size
        mgp.cluster_doc_count = cluster_doc_count
        mgp.cluster_word_count = cluster_word_count
        mgp.cluster_word_distribution = cluster_word_distribution
        return mgp

    @staticmethod
    def _sample(p):
        '''
        Sample with probability vector p from a multinomial distribution
        :param p: list
            List of probabilities representing probability vector for the multinomial distribution
        :return: int
            index of randomly selected output
        '''
        return [i for i, entry in enumerate(multinomial(1, p)) if entry != 0][0]

    def fit(self, docs, vocab_size):
        '''
        Cluster the input documents
        :param docs: list of list
            list of lists containing the unique token set of each document
        :param vocab_size: total vocabulary size
        :return: list of length len(doc)
            cluster label for each document
        '''
        alpha, beta, K, n_iters, V = self.alpha, self.beta, self.K, self.n_iters, vocab_size

        D = len(docs)
        self.number_docs = D
        self.vocab_size = vocab_size

        # unpack to easy var names
        m_z, n_z, n_z_w = self.cluster_doc_count, self.cluster_word_count, self.cluster_word_distribution
        cluster_count = K
        d_z = [None for i in range(len(docs))]

        # initialize the clusters
        for i, doc in enumerate(docs):

            # choose a random  initial cluster for the doc
            z = self._sample([1.0 / K for _ in range(K)])
            d_z[i] = z
            m_z[z] += 1
            n_z[z] += len(doc)

            for word in doc:
                if word not in n_z_w[z]:
                    n_z_w[z][word] = 0
                n_z_w[z][word] += 1

        for _iter in range(n_iters):
            total_transfers = 0

            for i, doc in enumerate(docs):

                # remove the doc from its current cluster
                z_old = d_z[i]

                m_z[z_old] -= 1
                n_z[z_old] -= len(doc)

                for word in doc:
                    n_z_w[z_old][word] -= 1

                    # compact dictionary to save space
                    if n_z_w[z_old][word] == 0:
                        del n_z_w[z_old][word]

                # draw sample from distribution to find new cluster
                p = self.score(doc)
                z_new = self._sample(p)

                # transfer doc to the new cluster
                if z_new != z_old:
                    total_transfers += 1

                d_z[i] = z_new
                m_z[z_new] += 1
                n_z[z_new] += len(doc)

                for word in doc:
                    if word not in n_z_w[z_new]:
                        n_z_w[z_new][word] = 0
                    n_z_w[z_new][word] += 1

            cluster_count_new = sum([1 for v in m_z if v > 0])
            print("In stage %d: transferred %d clusters with %d clusters populated" % (
            _iter, total_transfers, cluster_count_new))
            if total_transfers == 0 and cluster_count_new == cluster_count and _iter>25:
                print("Converged.  Breaking out.")
                break
            cluster_count = cluster_count_new
        self.cluster_word_distribution = n_z_w
        return d_z

    def score(self, doc):
        '''
        Score a document
        Implements formula (3) of Yin and Wang 2014.
        http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/KDD14-GSDMM.pdf
        :param doc: list[str]: The doc token stream
        :return: list[float]: A length K probability vector where each component represents
                              the probability of the document appearing in a particular cluster
        '''
        alpha, beta, K, V, D = self.alpha, self.beta, self.K, self.vocab_size, self.number_docs
        m_z, n_z, n_z_w = self.cluster_doc_count, self.cluster_word_count, self.cluster_word_distribution

        p = [0 for _ in range(K)]

        #  We break the formula into the following pieces
        #  p = N1*N2/(D1*D2) = exp(lN1 - lD1 + lN2 - lD2)
        #  lN1 = log(m_z[z] + alpha)
        #  lD1 = log(D - 1 + K*alpha)
        #  lN2 = log(product(n_z_w[w] + beta)) = sum(log(n_z_w[w] + beta))
        #  lD2 = log(product(n_z[d] + V*beta + i -1)) = sum(log(n_z[d] + V*beta + i -1))

        lD1 = log(D - 1 + K * alpha)
        doc_size = len(doc)
        for label in range(K):
            lN1 = log(m_z[label] + alpha)
            lN2 = 0
            lD2 = 0
            for word in doc:
                lN2 += log(n_z_w[label].get(word, 0) + beta)
            for j in range(1, doc_size +1):
                lD2 += log(n_z[label] + V * beta + j - 1)
            p[label] = exp(lN1 - lD1 + lN2 - lD2)

        # normalize the probability vector
        pnorm = sum(p)
        pnorm = pnorm if pnorm>0 else 1
        return [pp/pnorm for pp in p]

    def choose_best_label(self, doc):
        '''
        Choose the highest probability label for the input document
        :param doc: list[str]: The doc token stream
        :return:
        '''
        p = self.score(doc)
        return argmax(p),max(p)

    Tips: The most important function in the MovieGroupProcess class (the MGP process) is the fit function, which clusters the input documents.

3. Code testing and result analysis

3.1 Test code

    After defining the MovieGroupProcess class, write test code to verify the correctness and clustering quality of GSDMM. The test code is as follows (covering character, word and short text clustering):

from unittest import TestCase
from gsdmm.mgp import MovieGroupProcess
import numpy


class TestGSDMM(TestCase):
    '''This class tests the Panel data structures needed to support the RSK model'''

    def setUp(self):
        numpy.random.seed(47)

    def tearDown(self):
        numpy.random.seed(None)

    def compute_V(self, texts):
        V = set()
        for text in texts:
            for word in text:
                V.add(word)
        return len(V)

    def test_grades(self):

        grades = list(map(list, [
            "A",
            "A",
            "A",
            "B",
            "B",
            "B",
            "B",
            "C",
            "C",
            "C",
            "C",
            "C",
            "C",
            "C",
            "C",
            "C",
            "C",
            "D",
            "D",
            "F",
            "F",
            "P",
            "W"
        ]))

        grades = grades + grades + grades + grades + grades
        mgp = MovieGroupProcess(K=100, n_iters=100, alpha=0.001, beta=0.01)
        y = mgp.fit(grades, self.compute_V(grades))
        self.assertEqual(len(set(y)), 7)
        for words in mgp.cluster_word_distribution:
            self.assertTrue(len(words) in {0, 1}, "More than one grade ended up in a cluster!")

    def test_simple_example(self):
        # example from @spattanayak1

        docs = [['house',
                 'burning',
                 'need',
                 'fire',
                 'truck',
                 'ml',
                 'hindu',
                 'response',
                 'christian',
                 'conversion',
                 'alm']]

        mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)

        vocab = set(x for doc in docs for x in doc)
        n_terms = len(vocab)
        n_docs = len(docs)

        y = mgp.fit(docs, n_terms)

    def test_short_text(self):
        # there is no perfect segmentation of this text data:
        texts = [
            "where the red dog lives",
            "red dog lives in the house",
            "blue cat eats mice",
            "monkeys hate cat but love trees",
            "green cat eats mice",
            "orange elephant never forgets",
            "orange elephant must forget",
            "monkeys eat banana",
            "monkeys live in trees",
            "elephant",
            "cat",
            "dog",
            "monkeys"
        ]

        texts = [text.split() for text in texts]
        V = self.compute_V(texts)
        mgp = MovieGroupProcess(K=30, n_iters=100, alpha=0.2, beta=0.01)
        y = mgp.fit(texts, V)
        self.assertTrue(len(set(y)) < 10)
        self.assertTrue(len(set(y)) > 3)

3.2 Clustering

    Run the test code; the result is shown in Figure 6:

Figure 6 MGP process clustering results

    Analysis: As can be seen from Figure 6, the test code completed the MGP process in 3.109 s and clustered the data correctly, which verifies the correctness of the GSDMM reproduction. In Part 3, GSDMM is compared with other clustering models to verify its advantages more rigorously.

3. Experimental reproduction of the paper

    This part reproduces the experiments of the paper: the clustering quality of GSDMM and of other clustering models on the Tweet dataset is compared using metrics such as NMI and ACC, in order to become familiar with the actual short text clustering process and code of GSDMM and to verify the superiority of the algorithm. The code and dataset are available at:

GitHub - junyachen/GSDMM: A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering

1. Project import

    The original code obtained from GitHub is a Jupyter Notebook (.ipynb format); there are two ways to import and run it:

1.1 Use jupyter directly

    Open cmd and install Jupyter with the pip install notebook command; the process is shown in Figure 7. Then run jupyter notebook to start the server, upload the corresponding files and click Tweet_gsdmm.ipynb to enter the running interface, as shown in Figure 8:

Figure 7 jupyter installation

Figure 8 Jupyter Notebook running interface

1.2 File conversion using Pycharm (method used in this experiment)

    Open cmd, use the cd command to enter the project folder, convert the .ipynb files into .py files with the jupyter nbconvert --to script *.ipynb command, and then import the project into PyCharm.

2. Code and Analysis

    Open the project in PyCharm and analyze the code. The core parts are analyzed as follows:

2.1 Definition of MovieGroupProcess class

    This is the same MovieGroupProcess class definition as in Part 2: it defines the relevant parameters and the way the data is read, and the code is identical to the one above.

2.2 Definition of classifier

    The second part of the code defines the classifier, including initialization from the embedding vectors, the training, evaluation and prediction functions, and the split-train-evaluate routine. The code is as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
import numpy as np
import numpy


class Classifier(object):

    def __init__(self, vectors, clf):
        self.embeddings = vectors
        self.clf = TopKRanker(clf)
        self.binarizer = MultiLabelBinarizer(sparse_output=True)

    def train(self, X, Y, Y_all):
        self.binarizer.fit(Y_all)
        X_train = [self.embeddings[int(x)] for x in X]
        Y = self.binarizer.transform(Y)
        self.clf.fit(X_train, Y)

    def evaluate(self, X, Y):
        top_k_list = [len(l) for l in Y]
        Y_ = self.predict(X, top_k_list)
        Y = self.binarizer.transform(Y)
        averages = ["micro", "macro", "samples", "weighted"]
        results = {}
        for average in averages:
            results[average] = f1_score(Y, Y_, average=average)
        results['accuracy'] = accuracy_score(Y, Y_)
        return results

    def predict(self, X, top_k_list):
        X_ = numpy.asarray([self.embeddings[int(x)] for x in X])
        Y = self.clf.predict(X_, top_k_list=top_k_list)
        return Y

    def split_train_evaluate(self, X, Y, train_precent, seed=0):
        state = numpy.random.get_state()
        training_size = int(train_precent * len(X))
        numpy.random.seed(seed)
        shuffle_indices = numpy.random.permutation(numpy.arange(len(X)))
        X_train = [X[shuffle_indices[i]] for i in range(training_size)]
        Y_train = [Y[shuffle_indices[i]] for i in range(training_size)]
        X_test = [X[shuffle_indices[i]] for i in range(training_size, len(X))]
        Y_test = [Y[shuffle_indices[i]] for i in range(training_size, len(X))]
        self.train(X_train, Y_train, Y)
        numpy.random.set_state(state)
        return self.evaluate(X_test, Y_test)


from sklearn.multiclass import OneVsRestClassifier


class TopKRanker(OneVsRestClassifier):
    def predict(self, X, top_k_list):
        probs = numpy.asarray(super(TopKRanker, self).predict_proba(X))
        all_labels = []
        for i, k in enumerate(top_k_list):
            probs_ = probs[i, :]
            labels = self.classes_[probs_.argsort()[-k:]].tolist()
            probs_[:] = 0
            probs_[labels] = 1
            all_labels.append(probs_)
        return numpy.asarray(all_labels)

2.3 Data import and setting

    After setting the data_path of the dataset, the file is read with with open, and split is used for tokenization and related settings. The code is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from time import time
import re
import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

filename = "Tweet"
# filename = "T"
# filename = "S"
# filename = "TS"
data_path = "./data/" + filename
data = []
docs = []
total_doc_num = 2472
doc_num = 0
doc_label = np.zeros((total_doc_num,), dtype=int)
with open(data_path, 'r')as file:
    punc = '",:'
    lines = file.readlines()
    for line in lines:
        line = re.sub(r'[{}]+'.format(punc), '', line)

        data.append(" ".join(line.split(' ')[1:-2]))
        doc_label[doc_num] = int(line.split(' ')[-1].strip().replace("}", ""))
        docs.append(line.split(' ')[1:-2])

        doc_num += 1
file.close()

# In[4]:


uni_set = set()
for count in data:
    for j in count.split(" "):
        uni_set.add(j)

2.4 Clustering

    After the MovieGroupProcess class, the classifier and the data import are in place, cluster analysis can be performed. In this experiment ACC and NMI are used as clustering metrics: ACC is defined in the code, and NMI is computed with normalized_mutual_info_score from sklearn.metrics.cluster. The code for the clustering parameters (taking K=n_cluster, alpha=0.1, beta=beta, n_iters=10 as an example) and the clustering process (the fit function followed by choose_best_label) is as follows:

from sklearn.utils.linear_assignment_ import linear_assignment


# from scipy.optimize import linear_sum_assignment as linear_assignment  # with the 'as' clause, the function name used in the code does not need to change


def acc(y_true, y_pred):
    y_true = y_true.astype(np.int64)
    assert y_pred.size == y_true.size
    D = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1
    ind = linear_assignment(w.max() - w)
    return sum([w[i, j] for i, j in ind]) * 1.0 / y_pred.size


# In[6]:


n_cluster = 89
beta = 0.1
mgp = MovieGroupProcess(K=n_cluster, alpha=0.1, beta=beta, n_iters=10)
y = mgp.fit(docs, len(uni_set))

# In[7]:


mgp_doc_label = np.zeros((total_doc_num,), dtype=int)
for i in range(len(docs)):
    mgp_doc_label[i] = mgp.choose_best_label(docs[i])[0]

# In[8]:


print("NMI_GSDMM=", normalized_mutual_info_score(np.array(doc_label), np.array(mgp_doc_label)))
print("ACC_GSDMM=", acc(np.array(doc_label), np.array(mgp_doc_label)))

2.5 K-means clustering

    Re-import and prepare the data as above, and cluster the Tweet dataset with a different clustering model. Taking K-means as an example (imported with from sklearn.cluster import KMeans), the clustering and statistical analysis code is as follows:

filename = "./data/Tweet_dictionary.txt"
word_id = {}
with open(filename, 'r')as datafile:
    sents = datafile.readlines()
    for data in sents:
        word_id[data.split(" ")[0]] = int(data.split(" ")[1].strip())
datafile.close()

# In[10]:


beta = 0.001
topic_word_dis = np.zeros((n_cluster, len(uni_set)))
for i in range(len(mgp.cluster_word_distribution)):
    for j, keyword in enumerate(word_id):
        if keyword not in mgp.cluster_word_distribution[i]:
            topic_word_dis[i][j] = beta
        else:
            topic_word_dis[i][j] = mgp.cluster_word_distribution[i][keyword] + beta

# In[11]:


norm_topic_word_dis = topic_word_dis / np.sum(topic_word_dis, axis=1)[:, np.newaxis]

# In[12]:


import numpy as np

total_doc_num = 2472
doc_emb = np.zeros((total_doc_num, n_cluster))
doc_label = np.zeros((total_doc_num,), dtype=int)
import re

filename = "./data/Tweet"
doc_num = 0
with open(filename, 'r')as datafile:
    sents = datafile.readlines()
    punc = '",:'
    for data in sents:

        data = re.sub(r'[{}]+'.format(punc), '', data)
        raw_text = data.split(' ')[1:-2]

        doc_label[doc_num] = int(data.split(' ')[-1].strip().replace("}", ""))
        for data_i in raw_text:
            doc_emb[doc_num] += norm_topic_word_dis[:, word_id[data_i]]
        doc_emb[doc_num] /= len(raw_text)
        # doc_emb[doc_num]
        doc_num += 1
datafile.close()

# In[19]:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=89, max_iter=100).fit(doc_emb)

# In[14]:

from sklearn.metrics.cluster import normalized_mutual_info_score

# In[15]:

print("NMI_K_means=", normalized_mutual_info_score(np.array(doc_label), np.array(kmeans.labels_)))
print("ACC_K_means=", acc(np.array(doc_label), np.array(kmeans.labels_)))

# In[21]:

import warnings

warnings.filterwarnings('ignore')

avg_nmi = []
avg_acc = []
for i in range(20):
    kmeans = KMeans(n_clusters=89, max_iter=100).fit(doc_emb)
    avg_nmi.append(normalized_mutual_info_score(np.array(doc_label), np.array(kmeans.labels_)))
    avg_acc.append(acc(np.array(doc_label), np.array(kmeans.labels_)))

print("avg_nmi: ", np.mean(avg_nmi))
print("avg_acc: ", np.mean(avg_acc))

2.6 Statistical analysis

    To perform statistical analysis on the clustering results (avg_value, etc.), the classifier evaluation is run three times (micro-F1, macro-F1 and accuracy, with warnings suppressed via sys/warnings), and the document embeddings are then written to Tweet_gsdmm_emb.txt. The code is as follows:

Y = [str(i) for i in list(doc_label)]
X = [str(i) for i in list(np.arange(total_doc_num))]

# In[14]:


import sys

if not sys.warnoptions:
    import warnings

    warnings.simplefilter("ignore")
avg_value = []
for i in np.arange(0.1, 1.0, 0.1):
    # for i in np.arange(0.01, 0.11, 0.01):
    clf_ratio = np.round(i, 2)
    clf = Classifier(vectors=doc_emb, clf=LogisticRegression())
    avg_value.append(clf.split_train_evaluate(X, Y, clf_ratio)['micro'])
    print(clf_ratio, " ", clf.split_train_evaluate(X, Y, clf_ratio)['micro'])
print(np.mean(avg_value))

# In[15]:


import sys

if not sys.warnoptions:
    import warnings

    warnings.simplefilter("ignore")
avg_value = []
for i in np.arange(0.1, 1.0, 0.1):
    # for i in np.arange(0.01, 0.11, 0.01):
    clf_ratio = np.round(i, 2)
    clf = Classifier(vectors=doc_emb, clf=LogisticRegression())
    avg_value.append(clf.split_train_evaluate(X, Y, clf_ratio)['macro'])
    print(clf_ratio, " ", clf.split_train_evaluate(X, Y, clf_ratio)['macro'])
print(np.mean(avg_value))

# In[16]:


import sys

if not sys.warnoptions:
    import warnings

    warnings.simplefilter("ignore")
avg_value = []
for i in np.arange(0.1, 1.0, 0.1):
    # for i in np.arange(0.01, 0.11, 0.01):
    clf_ratio = np.round(i, 2)
    clf = Classifier(vectors=doc_emb, clf=LogisticRegression())
    avg_value.append(clf.split_train_evaluate(X, Y, clf_ratio)['accuracy'])
    print(clf_ratio, " ", clf.split_train_evaluate(X, Y, clf_ratio)['accuracy'])
print(np.mean(avg_value))

# In[17]:


storefile = 'Tweet_gsdmm_emb.txt'
sf = open(storefile, 'w')
for i in range(doc_emb.shape[0]):
    sf.write(str(i) + " " + " ".join([str(ele) for ele in doc_emb[i]]) + '\n')
sf.close()

3. Code testing

    After importing the project into PyCharm, modify the path to the Tweet dataset and run Tweet_gsdmm.py; the results are as follows:

3.1 GSDMM

    The clustering process of GSDMM and the test results of NMI and ACC are shown in Figure 9:

Figure 9 Clustering process and test results of GSDMM

    Analysis: As can be seen from Figure 9, GSDMM successfully clustered the Tweet dataset; in this test the NMI was 0.8699 and the ACC was 0.7997.

3.2 K-means

    The clustering and statistical analysis results of K-means are shown in Figure 10:

Figure 10 K-means clustering and statistical analysis results

    Analysis: As can be seen from Figure 10, K-means also clustered the Tweet dataset; in this test the NMI was 0.7951 and the ACC was 0.4951, with an average NMI of 0.7921 and an average ACC of 0.4920 over 20 runs, both lower than GSDMM, which again indicates the superiority of GSDMM.

3.3 Relevant data display

    After both GSDMM and K-means clustering have finished, the three classifier evaluations described in the code analysis are run; clf_ratio with its corresponding split-train-evaluate scores and the mean of avg_value are shown in Figure 11, and the embeddings are finally written to the Tweet_gsdmm_emb.txt file, part of which is shown in Figure 12:

 Figure 11 Relevant data display

Figure 12 Display of some files in Tweet_gsdmm_emb.txt

4. Analysis of statistical results

    To illustrate intuitively the superiority of GSDMM over the other models in the paper for short text clustering, and the influence of the parameters on the clustering quality, this part statistically analyzes the experimental data and visualizes it with plots.

4.1 Advantages of GSDMM

    The four clustering models GSDMM, K-means, HAC and DMAFP were used to cluster the Tweet dataset, and their NMI and ACC were recorded. Each experiment was run 20 times and the average was taken. The data are summarized in Table 5 and the comparison is shown in Figure 13:

Table 5 Summary of clustering index data of different models

Figure 13 Comparison of the effects of different clustering models

    Analysis: As can be seen from Table 5 and Figure 13, with NMI and ACC as clustering metrics, GSDMM clusters better than the other three models, which verifies the superiority of the algorithm for short text clustering.

4.2 Influence of parameters on clustering effect

    To explore the influence of different parameters on the clustering quality of GSDMM, the parameters were varied, experiments were run, and the data were recorded, summarized and analyzed.

    Taking the influence of β on the clustering quality as an example, K=8, alpha=0.1 and n_iters=30 are fixed, the value of beta is varied, and the change in NMI is observed (a sketch of the sweep appears after the analysis below). The data are summarized in Table 6 and the trend is shown in Figure 14:

Table 6 Data summary of the impact of β on clustering effect

Figure 14 The influence of parameter β on the clustering effect

     Analysis: It can be seen from the figure that the value of β has a strong influence on the clustering quality. As β increases, NMI first rises rapidly and then slowly decreases to a stable value. The clustering quality is best when β is around 0.1, which is consistent with the conclusion of the paper.
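
A minimal sketch of the sweep behind Table 6 is shown below (the beta grid here is illustrative, not the exact values used; docs, uni_set and doc_label are assumed to be the variables built in section 2.3):

# Hypothetical parameter sweep over beta with K, alpha and n_iters fixed (illustrative grid)
import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

for beta in (0.01, 0.05, 0.1, 0.3, 0.5):
    mgp = MovieGroupProcess(K=8, alpha=0.1, beta=beta, n_iters=30)
    mgp.fit(docs, len(uni_set))
    pred = np.array([mgp.choose_best_label(d)[0] for d in docs])
    print("beta =", beta, "NMI =", normalized_mutual_info_score(doc_label, pred))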

4. Questions and Supplements

1. Problem

1.1 Problem description

    During this experiment, a problem occurred while importing the project, as shown in Figure 15:

Figure 15 Project import problem

    According to the scikit-learn documentation, sklearn.utils.linear_assignment_ is deprecated and scipy.optimize.linear_sum_assignment is the official replacement, i.e. the original import should be replaced with from scipy.optimize import linear_sum_assignment as linear_assignment. However, after doing so the problem in Figure 16 appears:

Figure 16 Replace import new question

    Analysis: After replacing the import, the behaviour of the same-named function changes: linear_sum_assignment returns a tuple of two index arrays (row_ind, col_ind) instead of a single array of index pairs, so code that unpacks pairs (such as the acc function above) breaks.

1.2 Solutions

    Therefore, instead of changing the import, the problem is solved by downgrading scikit-learn (the helper was removed in version 0.23, so an earlier version works; 0.19.2 is used here). This can be done with a command such as pip install -i https://pypi.douban.com/simple scikit-learn==0.19.2, or by specifying the version in the IDE settings.

    After downgrading the version, the tests ran successfully and the results are detailed above.

2. Dataset introduction

    The main dataset used in this experiment is the Tweet dataset, which contains 2472 short text documents. Part of the dataset is shown in Figure 17:

Figure 17 Partial display of the Tweet dataset

3. GSDMM and LDA

    Both the MGP process and the clustering of the Tweet dataset (2472 documents) in this experiment involve short texts and small datasets. To explore how GSDMM behaves for long text clustering or large datasets, this part compares GSDMM with LDA.

    Reference Experiment: Short Text Topic Model: LDA and GSDMM - Zhihu (zhihu.com)

3.1 Introduction to LDA

    LDA is currently one of the most popular topic model algorithms. It assumes that each document (a tweet in our example) contains multiple topics, computes the contribution of each topic to the document, and requires the number of topics to be set in advance. The algorithm model is shown in Figure 18:

Figure 18 LDA algorithm model

3.2 Performance comparison

    To compare the performance of the two topic models GSDMM and LDA, we focus on three key aspects:

Ⅰ. Running time

    In the reference experiment, LDA can process a roughly 6M-document dataset in about 1 hour, while GSDMM can only process about 300K in the same time, which means that in terms of runtime LDA performs much better than GSDMM. For all practical purposes, GSDMM cannot handle very large datasets, at least not in one pass. Also, as of November 2022, GSDMM is not yet parallelizable.

Ⅱ. Coherence and consistency of modeling themes

    In the reference experiment, GSDMM models topics with better coherence and consistency than LDA; the LDA output varies greatly from run to run and is generally more chaotic.

Ⅲ. Topic coherence score

    Rationale supplement: The topic coherence score is an objective measure rooted in the distributional hypothesis of linguistics: words with similar meanings tend to appear in similar contexts. A topic is considered coherent if all or most of its words are closely related.

    In the reference experiment, GSDMM far outperforms LDA on topic coherence scores for the topics it outputs.
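
For readers who want to compute such a score themselves, a hedged sketch using gensim's CoherenceModel might look like the following (not part of the original experiment; the toy texts and topics are purely illustrative, and in practice the topics would be the top words of each GSDMM cluster taken from cluster_word_distribution):

# Hypothetical sketch: c_v topic coherence with gensim on toy tokenized documents
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

texts = [["red", "dog", "lives"], ["blue", "cat", "eats", "mice"], ["monkeys", "eat", "banana"]]
dictionary = Dictionary(texts)
topics = [["red", "dog"], ["cat", "mice"]]  # toy top-word lists, one per topic
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())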

3.3 Conclusion

    The advantages and disadvantages of LDA and GSDMM are shown in Figure 19:

Figure 19 Advantages and disadvantages of LDA and GSDMM

    Analysis: Based on the performance comparison in 3.2 and Figure 19, the relative advantages and disadvantages of LDA and GSDMM depend on the needs of the experimenter and on the size and type of the dataset.

    LDA is relatively simple and efficient, especially for long texts and large datasets. GSDMM is relatively complex and less efficient to implement, but it completes the task better (i.e. better coherence and consistency of the modeled topics and higher topic coherence scores), especially in short text clustering tasks.

4. Model Supplement

    In addition to the GSDMM model, many new models (traditional machine learning, deep learning) have been applied to short text classification in recent years, such as FastText, TextCNN, TextRNN, TextRNN + Attention, TextRCNN, etc., and have achieved good results. 

5. References

1.A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering

2.GitHub - rwalk/gsdmm: GSDMM: Short text clustering

3. Short text topic model: LDA and GSDMM - Zhihu (zhihu.com)


Summary

(1) GSDMM has the following advantages:

1. Can maintain a balance between completeness and consistency;

2. It can handle sparse and high-dimensional short texts well;

3. Compared with other clustering algorithms, its performance is more outstanding.

(2) Different parameters affect the clustering quality of GSDMM differently, and the optimal parameter values differ from dataset to dataset (datasets can vary considerably). In practical applications, GSDMM clusters well when given good empirical parameters, which gives it high practical value.

(3) Compared with commonly used topic and text models such as LDA, GSDMM is relatively complex and less efficient to implement, but it completes the task better (i.e. better coherence and consistency of the modeled topics and higher topic coherence scores), especially in short text clustering tasks.

(4) Different clustering models should be used for different data sets and different needs.
