NLP: news text analysis and mining based on the BERTopic tool

I. Introduction

Recently I have had some brief exposure to NLP, and I practiced using ChatGPT as an aid while learning it.

II. The specific process

(1) Preprocess the text and record the process.

Before using BERTopic for topic modeling, the text needs to be preprocessed. The specific preprocessing workflow is as follows.
1. Install the BERTopic library:
Install BERTopic into your Python environment with pip:

pip install bertopic

2. Load the dataset:
Data download address:
Link: https://pan.baidu.com/s/1e7u_7M3k19NMO8qwUlaTxA?pwd=eqqs
Extraction code: eqqs

Download the data and store it locally; the directory used in the code below is E:\AIStudy\WordSystem\new2016zh.

Load the dataset into a list with the following code (here the validation split news2016zh_valid.json is used):

import json
import os

dirPath = r'E:\AIStudy\WordSystem\new2016zh'
validPath = os.path.join(dirPath, 'news2016zh_valid.json')

datas = []
with open(validPath, 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for line in lines:
        data = json.loads(line)   # each line is one JSON-encoded news record
        datas.append(data)

3. Preprocess the data
Before topic modeling with BERTopic, the data needs to be preprocessed. In this workflow the spaCy library is used for preprocessing (lemmatization, stop-word and punctuation removal), so spaCy and a language model need to be installed.
spaCy and the en_core_web_sm model can be installed with the following commands:

pip install spacy
python -m spacy download en_core_web_sm

The specific data preprocessing procedure is as follows:

import spacy
from tqdm import tqdm

print('# 2. Preprocess the data')
nlp = spacy.load('en_core_web_sm')
texts = [doc['title'] for doc in datas]   # cluster on the news titles
processed_texts = []
qtar = tqdm(total=len(texts))             # progress bar
for text in texts:
    qtar.update(1)
    doc = nlp(text)
    # keep lemmas, drop stop words and punctuation
    processed_texts.append(' '.join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct]))
qtar.close()
print(len(processed_texts))

While the loop runs, the tqdm progress bar shows the preprocessing progress.

(2) Use a text clustering tool to cluster the news collection, and record the process and results.

I use the text clustering tool BERTopic to cluster the news collection. The process is as follows.
4. Record the processing steps
During preprocessing, the processing steps can be logged for later inspection:

import logging

logging.basicConfig(filename='preprocessing.log', level=logging.INFO)
for i, item in enumerate(datas):
    text = item['title']
    doc = nlp(text)
    processed_text = ' '.join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct])
    processed_texts.append(processed_text)
    # log the document index, the original text and the processed text
    logging.info(f'Processed document {i}: {text} -> {processed_text}')

5. Use BERTopic to build and train a topic model

from bertopic import BERTopic

print('# 3. Train the model')
model = BERTopic(language='english', calculate_probabilities=True)
topics, probabilities = model.fit_transform(processed_texts)
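
After training, it is worth inspecting what the model actually found. The following is a minimal sketch, assuming the model object created above; both calls are standard BERTopic methods:

# Overview: one row per topic with its size and top words (topic -1 is the outlier bucket)
print(model.get_topic_info().head(10))

# Top words and their c-TF-IDF weights for the largest regular topic
print(model.get_topic(0))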


6. Evaluate the model
The metrics provided by sklearn.metrics can be used to evaluate the model. Here the silhouette score is computed, using the topic-probability matrix as the feature space and the topic assignments as labels.

from sklearn.metrics import silhouette_score
silhouette_avg = silhouette_score(probabilities, topics)
print("Silhouette Score:", silhouette_avg)

7. Classify news headlines
Finally, use the trained model to classify news headlines. Assuming you have a new news headline, you can use the following code to categorize it into a topic:

new_title = '如何选择儿童摄影机构给宝宝拍照?'
new_processed_text = ' '.join([token.lemma_ for token in nlp(new_title) if not token.is_stop and not token.is_punct])
new_topic, new_prob = model.transform([new_processed_text])
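
To see which topic the headline was assigned to and what that topic is about, something like the following can be added (a sketch; new_topic is a list containing a single topic id):

predicted_id = new_topic[0]
print('Predicted topic id:', predicted_id)
if predicted_id != -1:                    # -1 means the headline was treated as an outlier
    print('Top words of this topic:', model.get_topic(predicted_id))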


(3) Manually observe the clustering results for simple tuning, and record the tuning process and results.

Simple tuning of the clustering results is based on the following guidelines:
1. Check clustering quality: First, check the quality of the clustering results. Metrics such as the Silhouette Score and the Calinski-Harabasz Index can be used for this. If the quality is poor, try adjusting the clustering parameters or increasing the amount of data.
2. Classify according to the clustering results: Group the results by topic; finding the articles under similar topics helps in further understanding and analyzing the clusters.
3. Adjust the clustering granularity: If the clusters are too coarse, try increasing the number of topics or adjusting the clustering parameters; if they are too fine, try reducing the number of topics or adjusting the parameters (see the sketch after this list).
4. Check the cluster labels: Check whether the cluster labels accurately describe the clusters. If not, consider editing them manually or using automatic label generation to produce more accurate labels.
5. Further analyze the clustering results: Analyzing the relevance of and distinctions between topics helps in understanding the internal structure of the text data. Visualization can be used to inspect the clusters, and text mining techniques can be used for keyword extraction and text relationship analysis.
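
As a concrete illustration of points 1 and 3, the clustering granularity can be controlled through BERTopic's constructor parameters or by reducing topics after training. A minimal sketch, assuming the processed_texts list built earlier; the parameter values are only examples, and in older BERTopic versions reduce_topics also expects the original topics:

from bertopic import BERTopic

# Coarser or finer clustering via constructor parameters (example values only)
model = BERTopic(language='chinese (simplified)',
                 calculate_probabilities=True,
                 nr_topics=20,            # cap the number of topics
                 min_topic_size=15)       # smaller value -> more, finer topics
topics, probabilities = model.fit_transform(processed_texts)

# Or merge similar topics after training instead of refitting from scratch
model.reduce_topics(processed_texts, nr_topics=10)
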
Applying these guidelines to the results above:
1. Increase the amount of data. For convenience of viewing the effect, I only used 5000 entries, which is a small amount of data.
2. Adjust the clustering parameters: calculate_probabilities=True, top_n_words=5, nr_topics=3, and the language changed to 'chinese (simplified)'.

As can be seen, the clustering quality improved by more than a factor of six.
3. Classify according to the clustering results and labels
calculate_similarity: used to specify whether to calculate the similarity between topics, the default is False.
similarity_threshold: Used to specify the similarity threshold between topics. When the similarity of two topics is higher than this threshold, the two topics will be merged into one topic. The default is 0.75.
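
For point 3 above, the assigned topic ids can also be used directly to group the original titles, as in the following sketch (texts and topics are the lists built earlier):

from collections import defaultdict

# Group the original titles by their assigned topic id
docs_by_topic = defaultdict(list)
for title, topic_id in zip(texts, topics):
    docs_by_topic[topic_id].append(title)

for topic_id, titles in sorted(docs_by_topic.items()):
    print(f'Topic {topic_id}: {len(titles)} documents, e.g. {titles[0]}')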

III. Experimental analysis and summary

(1) Learn the basic concepts and terminology of text processing, including text information extraction, text clustering and text summarization.
(2) Master the metrics commonly used in text processing, such as accuracy, recall and F1 score, which help measure algorithm performance.
(3) Become familiar with commonly used text processing tools and algorithms, such as the TextRank algorithm, summarization algorithms and the BERTopic algorithm.
(4) Master the basic process and methods of text clustering, including text preprocessing, clustering model training, clustering result evaluation and clustering result visualization.
(5) Understand how to extract keywords and summary sentences from text and use them to generate summaries of news topics.
This is good learning material for beginners: it makes it possible to get started quickly and master the basic techniques and tools of text processing, and the commonly used algorithms and tools it covers are also very helpful for further in-depth study and application of text processing technology.
At the same time, I also learned some common tuning methods, which allowed me to better understand and optimize the clustering results.

IV. Thinking questions

(1) For the definition of parent-child events, is the clustering granularity of the method used above too coarse or too fine? Consider possible optimizations.
1. The clustering granularity of the method used above is too coarse.
For the definition of parent-child events, the clustering granularity may be affected when the news collection is clustered with BERTopic. If the clustering is too coarse, different sub-events may be merged into the same parent event, making it impossible to distinguish the sub-events; if the clustering is too fine, the same parent event may be split into multiple sub-events, making the clustering results too granular.

To address this problem, after consulting some references, the following optimization methods can be adopted:

Adjust the number of topics: BERTopic's clustering granularity depends on the number of topics, so the granularity can be controlled by adjusting it. If the clustering is too coarse, try increasing the number of topics; if it is too fine, try reducing it. Note that the number of topics should not be set too small or too large, otherwise the clustering quality may suffer.

Adjust the clustering parameters: BERTopic provides parameters that influence the clustering granularity, such as a word-frequency threshold and a topic-similarity threshold. Adjusting these parameters controls the granularity of the clustering. Note that different parameter values may affect the results differently, so experiments are needed to determine the best values; one concrete option is sketched below.
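
One concrete way to adjust these parameters is to pass a custom HDBSCAN clusterer to BERTopic. The following is only a sketch; the parameter values are illustrative, not tuned:

from bertopic import BERTopic
from hdbscan import HDBSCAN

# Larger min_cluster_size -> fewer, coarser topics; smaller -> more, finer topics
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=5,
                        metric='euclidean', prediction_data=True)
model = BERTopic(hdbscan_model=hdbscan_model, calculate_probabilities=True)
topics, probabilities = model.fit_transform(processed_texts)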

Use hierarchical clustering: BERTopic uses a density-based clustering method (HDBSCAN by default), which may produce clusters that are too coarse. Hierarchical clustering can be used to stratify the results and obtain finer-grained clusters; it can be implemented with the AgglomerativeClustering class in the scikit-learn library, as sketched below.
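
Since the paragraph above mentions AgglomerativeClustering, here is a minimal sketch of hierarchically clustering sentence embeddings of the preprocessed texts; the embedding model name and the number of clusters are assumptions for illustration:

from sklearn.cluster import AgglomerativeClustering
from sentence_transformers import SentenceTransformer

# Embed the preprocessed titles, then cluster the embeddings hierarchically
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
embeddings = embedder.encode(processed_texts, show_progress_bar=True)

hier = AgglomerativeClustering(n_clusters=10, linkage='ward')
hier_labels = hier.fit_predict(embeddings)
print(len(set(hier_labels)), 'hierarchical clusters')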

Combine with manual labeling: If the clustering results are too coarse or too fine, manual labeling can be combined to optimize them. A representative subset of texts can be labeled manually, and the labels fed back into the clustering model to refine the results. The labeling can be done manually, semi-automatically, or through crowdsourcing (see the sketch below for attaching manual labels to topics).
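
If manual labels are produced this way, newer versions of BERTopic allow attaching them to topics as custom names; a small sketch, where the label texts themselves are made up for illustration:

# Attach human-curated names to selected topics (newer BERTopic versions)
model.set_topic_labels({0: 'Parenting / child photography', 1: 'Finance news'})
print(model.get_topic_info())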

V. Integrated code

The complete code is integrated below.

import json
import spacy
import os
from tqdm import tqdm
from bertopic import BERTopic
from sklearn.metrics import silhouette_score

# 1. Load the dataset
print('# 1. Load the dataset')
dirPath=r'E:\AIStudy\WordSystem\new2016zh'
validPath=os.path.join(dirPath,'news2016zh_valid.json')
datas=[]
with open(validPath, 'r',encoding='utf-8') as f:
    lines=f.readlines()
    for i in range(len(lines)):
        line=lines[i]
        data = json.loads(line)
        datas.append(data)
        if i>2000:
            break

# 2. Preprocess the data
print('# 2. Preprocess the data')
nlp = spacy.load('en_core_web_sm')
texts = [doc['title'] for doc in datas]
processed_texts = []
qtar=tqdm(total=len(texts))
for text in texts:
    qtar.update(1)
    doc = nlp(text)
    processed_texts.append(' '.join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct]))
qtar.close()
print(len(processed_texts))



# 3. Train the model
print('# 3. Train the model')

model = BERTopic(language='chinese (simplified)', calculate_probabilities=True, top_n_words=5, nr_topics=3)
topics, probabilities = model.fit_transform(processed_texts)

# 4. Evaluate the model
silhouette_avg = silhouette_score(probabilities, topics)
print("Silhouette Score:", silhouette_avg)

# Classify a new headline
new_title = '如何选择儿童摄影机构给宝宝拍照?'
new_processed_text = ' '.join([token.lemma_ for token in nlp(new_title) if not token.is_stop and not token.is_punct])
new_topic, new_prob = model.transform([new_processed_text])

print(new_processed_text,new_topic, new_prob)

Source: blog.csdn.net/qq_51116518/article/details/131199393