Machine Learning Practice 11: Text Clustering Analysis Based on the K-means Algorithm, with Generation of Clustered Text Files

Hello everyone, I am Wei Xue AI. Today I will introduce Machine Learning Practice 11: text clustering analysis based on the K-means algorithm, including how to generate a file of the clustered texts. Text clustering analysis is a core task in NLP. By grouping similar text samples, it helps us discover patterns and structures hidden in text data.

In this project, I will use the K-means algorithm to implement text clustering analysis. K-means is a commonly used clustering algorithm that iteratively assigns samples to K clusters and determines the cluster division by minimizing the sum of squared errors of the samples within each cluster. By transforming text data into vector representations and clustering those vectors with K-means, we can group texts automatically.

Contents
1. Introduction
2. Basic knowledge of text clustering analysis
3. Design and implementation of text clustering analysis project
4. Text clustering analysis implementation code case
5. Advantages, disadvantages and challenges of text clustering analysis
6. The future development trend of text clustering analysis
7. Conclusion

1. Introduction

Text clustering analysis is a technique for classifying and organizing text data. It classifies similar texts into one category by discovering similarities and associations between texts. Text clustering is of great significance in practical applications and can help us understand the structure and content of large-scale text data, so as to discover the information and patterns hidden in it.

2. Basic knowledge of text clustering analysis

Text clustering refers to dividing a text dataset into several disjoint categories, so that the similarity between texts within the same category is high and the similarity between texts in different categories is low. Commonly used text clustering algorithms include the K-means algorithm and the hierarchical clustering algorithm. The K-means algorithm divides the text data into K clusters through iterative optimization, so that the texts within each cluster are similar to one another. The hierarchical clustering algorithm calculates the similarity between texts and gradually merges the most similar ones until a complete clustering tree is formed.
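To make the bottom-up merging concrete, here is a minimal sketch using scikit-learn's `AgglomerativeClustering` on TF-IDF vectors. The four pre-tokenized documents and the choice of two clusters are made up for illustration and are not part of the original project.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-tokenized documents (two finance, two sports)
docs = [
    "stock market rises",
    "stock market falls",
    "football match tonight",
    "football match result",
]

# Hierarchical clustering in scikit-learn expects a dense array
X = TfidfVectorizer().fit_transform(docs).toarray()

# Merge the most similar documents step by step until 2 clusters remain
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(hier_labels)
```

Unlike K-means, this approach needs no initial cluster centers, but it does not scale as easily to very large collections.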

In text clustering, text representation is a key issue. Commonly used text representation methods include the bag-of-words model and TF-IDF. The bag-of-words model represents a text as a vector in which each dimension holds the number of occurrences of a specific word in that text. TF-IDF additionally weights each word by its importance across the entire text collection, so that words appearing in many documents count for less.
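The difference between the two representations can be seen on a tiny example. The sketch below uses scikit-learn's `CountVectorizer` and `TfidfVectorizer`; the three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical, already-tokenized documents (tokens separated by spaces)
docs = [
    "stock market rises",
    "stock market falls",
    "football match tonight",
]

# Bag of words: each column counts how often one word occurs in a document
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)

# TF-IDF: words that occur in many documents receive a lower weight
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)

# Vocabulary ordered by column index
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocab)
print(bow_matrix.toarray())
print(tfidf_matrix.toarray().round(2))
```

In the first document, "rises" occurs in only one text and therefore gets a higher TF-IDF weight than "stock", which occurs in two, even though both have the same raw count.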

The mathematical principle of the K-means algorithm can be expressed by the following formula:

Consider a dataset $X=\{x_1, x_2, ..., x_n\}$ containing $n$ samples, where each sample $x_i$ is a $d$-dimensional vector $(x_{i1}, x_{i2}, ..., x_{id})$. The K-means algorithm aims to divide these samples into $K$ clusters, where each sample belongs to one and only one cluster.

First, we need to select $K$ initial cluster centers $\mu=\{\mu_1, \mu_2, ..., \mu_K\}$, where each cluster center is a $d$-dimensional vector.

Then, the iterative process of the algorithm is as follows:

  1. For each sample $x_i$, calculate its distance to each cluster center (usually the Euclidean distance, although other distance measures can be used), and assign it to the cluster of the nearest center.
  2. For each cluster, calculate the mean of all its samples as the new cluster center.
  3. Repeat steps 1 and 2 until a stopping condition is met (e.g., the maximum number of iterations is reached or the cluster centers no longer change significantly).
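The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in this article (which relies on scikit-learn); the function name `kmeans`, the parameters `tol` and `seed`, and the toy data are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means on an (n, d) array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each sample to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 3: stop once the centers no longer move significantly
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:
            break
    return labels, centers

# Hypothetical toy data: two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.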

The optimization goal of the K-means algorithm is to minimize the sum of the squared distances between all samples and their cluster centers, that is, to minimize the following objective function:
$$J = \sum_{i=1}^{n} \sum_{j=1}^{K} r_{ij} ||x_i - \mu_j||^2$$
where $r_{ij}$ is an indicator variable: $r_{ij}=1$ if sample $x_i$ belongs to cluster $j$, and $r_{ij}=0$ otherwise.
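The objective $J$ can be evaluated directly from this definition. The sketch below builds the indicator matrix explicitly and compares a sensible assignment with a poor one; the function name `objective` and the four sample points are assumptions for illustration.

```python
import numpy as np

def objective(X, centers, labels):
    """J = sum_i sum_j r_ij * ||x_i - mu_j||^2, with r_ij = 1 iff labels[i] == j."""
    n, k = len(X), len(centers)
    r = np.zeros((n, k))
    r[np.arange(n), labels] = 1.0  # one-hot indicator matrix r_ij
    # Squared Euclidean distance from every sample to every center, shape (n, k)
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dists).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

good = np.array([0, 0, 1, 1])  # each sample assigned to its nearest center
bad = np.array([0, 1, 0, 1])   # two samples assigned to the far center

J_good = objective(X, centers, good)
J_bad = objective(X, centers, bad)
print(J_good, J_bad)  # 4.0 204.0
```

Assigning every sample to its nearest center can only lower $J$ for fixed centers, which is why step 1 of the iteration never increases the objective.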

Through this iterative optimization process, the K-means algorithm continuously updates the cluster centers until it reaches a clustering result that (locally) minimizes the objective function $J$.

It should be noted that the K-means algorithm may converge to different local optimal solutions for different initial cluster center selections. To overcome this, multiple runs or other heuristics can be used to improve the clustering results.
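The effect of initialization can be checked empirically. The following sketch runs scikit-learn's `KMeans` several times with a single random initialization each, then once with `n_init=10`, which repeats the procedure and keeps the run with the lowest objective (`inertia_`). The synthetic three-blob data is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three Gaussian blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

# A single random initialization may end in a different local optimum each time
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)

# n_init=10 runs 10 initializations and keeps the one with the lowest J
best = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print([round(v, 2) for v in inertias], round(best.inertia_, 2))
```

Another common heuristic is the k-means++ initialization, which is the default `init` in scikit-learn and spreads the initial centers apart.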

3. Design and implementation of text clustering analysis project

A text clustering analysis project starts with data collection and preprocessing. Data can come from various sources such as news reports and social media, but it needs to be cleaned and denoised. The next step is text feature extraction and representation: the bag-of-words model or the TF-IDF method can be used to convert each text into a vector. Then a suitable clustering algorithm is selected and its parameters are tuned. Finally, the clustering results are evaluated and visualized so that they can be better understood and explained.
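As one possible way to carry out the tuning and evaluation steps, the sketch below tries several values of K and compares them with the silhouette score from scikit-learn (higher is better, range -1 to 1). The nine pre-tokenized documents and the candidate range for K are assumptions for illustration; the original project fixes K at 3.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical, already-tokenized documents on three topics
docs = [
    "stock market shares rise", "stock market shares fall",
    "investors buy stock shares",
    "football team wins match", "football team loses match",
    "coach praises football team",
    "new phone camera released", "phone battery camera review",
    "new phone screen review",
]

X = TfidfVectorizer().fit_transform(docs)

# Try several cluster counts and record the silhouette score of each
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette score needs no ground-truth labels, which makes it a convenient first check when no annotated data is available.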

4. Text clustering analysis implementation code case

Below is a concrete implementation. It uses Python with jieba for Chinese word segmentation and the scikit-learn library for TF-IDF and K-means, and clusters a news text dataset.

# -*- coding: utf-8 -*-
import csv
import jieba
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import os
import re
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

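# Segment Chinese text into space-separated words
def tokenize_text(text):
    return " ".join(jieba.cut(text))

# Remove punctuation (ASCII plus a few common Chinese marks)
def remove_punctuation(text):
    punctuation = '!"#,。、$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # re.escape keeps characters such as ']' and '-' literal inside the character class
    text = re.sub('[{}]+'.format(re.escape(punctuation)), '', text)
    return text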

# Convert the segmented texts into a TF-IDF matrix
def text_to_tfidf_matrix(texts):
    tokenized_texts = [tokenize_text(remove_punctuation(text)) for text in texts]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(tokenized_texts)
    return tfidf_matrix

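# Cluster the TF-IDF vectors with K-means
def cluster_texts(tfidf_matrix, n_clusters):
    # n_init=10 restarts from 10 initializations and keeps the best run;
    # random_state makes the result reproducible
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(tfidf_matrix)
    return kmeans.labels_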

# Save the clustering result to a new CSV file
def save_clusters_to_csv(filename, texts, labels):
    base_filename, ext = os.path.splitext(filename)
    output_filename = f"{base_filename}_clusters{ext}"
    with open(output_filename, "w", encoding="utf-8", newline="") as csvfile:
        csvwriter = csv.writer(csvfile)
        for text, label in zip(texts, labels):
            csvwriter.writerow([text, label])
    return output_filename

# Print the texts grouped by cluster
def print_cluster_result(texts, labels):
    clusters = {}
    for i, label in enumerate(labels):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append(texts[i])

    for label, text_list in clusters.items():
        print(f"Cluster {label}:")
        for text in text_list:
            print(f"  {text}")

def text_KMeans(filename, n_clusters):
    df = pd.read_csv(filename, encoding='utf-8')  # read the CSV file
    texts = df['text'].tolist()  # extract the 'text' column as a list

    print(df.iloc[:, [0, -1]])  # preview the first and last columns
    # Convert the texts into a TF-IDF matrix
    tfidf_matrix = text_to_tfidf_matrix(texts)
    # Run the clustering
    labels = cluster_texts(tfidf_matrix, n_clusters)

    # Store each text's cluster label in a new column
    df['cluster'] = labels

    # Write the clustered texts to the output file
    output = 'data_clustered.csv'
    df.to_csv(output, index=False, encoding='utf-8')

    return output, labels, tfidf_matrix

def pca_picture(labels, tfidf_matrix):
    # Reduce the TF-IDF vectors to 3 dimensions and store the result in a DataFrame
    pca = PCA(n_components=3)
    result = pca.fit_transform(tfidf_matrix.toarray())
    result_df = pd.DataFrame(result, columns=['Component1', 'Component2', 'Component3'])

    # Attach the cluster label of each text
    result_df['cluster'] = labels

    # Draw one 3D scatter series per cluster
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    colors = ['red', 'blue', 'green']
    # Iterate over the labels that actually occur, so any number of clusters works
    for i in sorted(result_df['cluster'].unique()):
        subset = result_df[result_df['cluster'] == i]
        ax.scatter(subset['Component1'], subset['Component2'], subset['Component3'],
                   color=colors[i % len(colors)], s=50, label=f"Cluster {i}")
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.set_zlabel("Component 3")
    ax.legend()
    plt.show()

if __name__ == "__main__":
    # Load the Chinese texts, cluster them, and plot the result
    filename = "data.csv"
    n_clusters = 3
    output, labels, tfidf_matrix = text_KMeans(filename, n_clusters)
    pca_picture(labels, tfidf_matrix)

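Running the script writes the clustered texts to data_clustered.csv, reduces the TF-IDF vectors to three dimensions with PCA, and displays a 3D scatter plot of the clusters:
[Figure: 3D PCA scatter plot of the clustered texts]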
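The script reads its input from a CSV file with a `text` column. For a quick trial without a real news dataset, a small input file can be generated as sketched below. The six sentences and the helper name `make_sample_csv` are made up for illustration; they are not the dataset used in this article.

```python
import os
import tempfile

import pandas as pd

def make_sample_csv(path):
    """Write a tiny, made-up dataset with the 'text' column the script reads."""
    texts = [
        "股市今天大幅上涨",      # finance
        "投资者关注股市行情",
        "足球比赛今晚开始",      # sports
        "球队赢得了足球比赛",
        "新款手机正式发布",      # technology
        "这款手机的相机很好",
    ]
    pd.DataFrame({"text": texts}).to_csv(path, index=False, encoding="utf-8")
    return path

# Written to a temporary directory here; pass "data.csv" to feed the script above
path = make_sample_csv(os.path.join(tempfile.mkdtemp(), "data.csv"))
df = pd.read_csv(path, encoding="utf-8")
print(df.shape)
```

With so few texts the clusters are only a smoke test; meaningful results need a much larger collection.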

5. Advantages, disadvantages and challenges of text clustering analysis

Text clustering analysis has the following advantages: it can provide insights to help us understand the structure and content of text data; it can realize automatic clustering and reduce manual intervention; it can efficiently process large-scale data and speed up analysis.

However, text clustering also has some disadvantages: since clustering is based on similarity, highly subjective text data may be grouped inaccurately; although clustering itself is unsupervised, evaluating and tuning it (for example, choosing the number of clusters) often benefits from labeled data, which may be difficult to obtain in some scenarios; dealing with noise and redundant information is also a challenge.

In addition, text clustering faces some challenges: the high-dimensionality problem, that is, when the dimensionality of the text features is high, the clustering results may be inaccurate or difficult to interpret; the semantic similarity problem, that is, because natural language is complex, the semantic similarity between texts is difficult to capture; and the category imbalance problem, that is, large differences in the number of text samples per category may degrade the clustering result.
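One common way to soften the high-dimensionality problem is to reduce the TF-IDF vectors before clustering. The sketch below uses `TruncatedSVD` (latent semantic analysis), a technique not used in the project code above; the six documents and the choice of two components are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical, already-tokenized documents on two topics
docs = [
    "stock market shares rise", "stock market shares fall",
    "investors buy stock shares",
    "football team wins match", "football team loses match",
    "coach praises football team",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Project the sparse TF-IDF vectors onto 2 latent dimensions, then
# re-normalize so that Euclidean distance behaves like cosine distance
lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))
reduced = lsa.fit_transform(tfidf)

lsa_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(tfidf.shape, reduced.shape, lsa_labels)
```

On real corpora the number of components is usually in the low hundreds rather than two.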

6. The future development trend of text clustering analysis

In the future, text clustering analysis may develop in directions such as improving text feature representation by combining deep learning methods and expanding into new application fields.

7. Conclusion

Text clustering analysis is an important technique that can help us understand and organize large-scale text data. By choosing appropriate algorithms and feature representation methods, and overcoming related challenges, we can obtain accurate and interpretable clustering results. With the continuous advancement of technology, text clustering analysis has broad application prospects in various fields.

Origin blog.csdn.net/weixin_42878111/article/details/131858535