Automatic news summarization and recommendation system based on TF-IDF, TensorFlow, word clouds, and LDA - a deep learning application (with ipynb source code and training dataset)



Foreword

This project combines TF-IDF keyword extraction, word cloud visualization, LDA (Latent Dirichlet Allocation) topic modeling, and a text-to-speech component to implement a TensorFlow-based text summarization program.

First, TF-IDF (Term Frequency-Inverse Document Frequency) is used to extract keywords from the text. This identifies the most representative words and provides important input for the subsequent summarization.

Second, word cloud visualization displays the keywords graphically, helping users grasp the gist and focus of the text at a glance.

Next, an LDA model is trained. LDA is a topic modeling technique: it uncovers the hidden topic structure of the text, giving a clearer picture of how the content is distributed and related.

Finally, these techniques are combined into a TensorFlow-based text summarization program that automatically extracts the key information and topic structure of a text and generates a concise summary.

In addition, a text-to-speech component is integrated so that the generated summaries can be read aloud, improving the user experience and convenience of use.

By integrating these techniques, the project delivers a capable text summarization program that gives users a more convenient and intuitive way to understand and digest text.

Overall design

This part includes the overall structure diagram of the system and the system flow chart.

System overall structure diagram

The overall structure of the system is shown in the figure.

[Figure: system overall structure diagram]

System flow chart

The system flow is shown in the figure.

[Figure: system flow chart]

Operating environment

This section includes the Python environment and the TensorFlow environment.

Python environment

Python 3.6 or later is required. On Windows, it is recommended to install Anaconda to set up the Python environment; the download address is https://www.anaconda.com/ . Alternatively, you can run the code in a Linux environment inside a virtual machine.

TensorFlow environment

The installation method is as follows:

Method one

Open Anaconda Prompt and add the Tsinghua mirror channel:

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes

Create a Python 3.6 environment named tensorflow. Note that the Python version must be compatible with the TensorFlow version installed later, so choose an appropriate Python 3.x here.

conda create -n tensorflow python=3.6

Enter y whenever confirmation is requested. Then activate the tensorflow environment in Anaconda Prompt:

conda activate tensorflow

Install the CPU version of TensorFlow:

pip install --upgrade --ignore-installed tensorflow

The test code is as follows:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
# expected output: b'Hello, TensorFlow!'

The installation is complete.

Method two

Open Anaconda Navigator, go to Environments, and click Create. In the pop-up dialog, enter tensorflow, select an appropriate Python version, and create the environment. Then enter the tensorflow environment, click Not installed, and search for the required package (for example, tensorflow) in the search box; select it and click Apply at the bottom right. To test whether the installation succeeded, enter the following code in the Jupyter Notebook editor:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
# expected output: b'Hello, TensorFlow!'

If it prints Hello, TensorFlow!, the installation is successful.
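
Note that this test snippet uses the TensorFlow 1.x API. If a 2.x version ends up being installed instead (an assumption about your environment, not part of the original text), an equivalent check can be run through the compat module:

import tensorflow as tf
# On TensorFlow 2.x, tf.Session no longer exists; run the 1.x-style check via tf.compat.v1
tf.compat.v1.disable_eager_execution()
hello = tf.constant('Hello, TensorFlow!')
sess = tf.compat.v1.Session()
print(sess.run(hello))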

Module implementation

This project consists of six modules: data preprocessing, word cloud construction, keyword extraction, voice broadcast, LDA topic modeling, and model construction. The function and related code of each module are given below.

1. Data preprocessing

The data comes from THUCNews, a Chinese text dataset released by the NLP Laboratory of Tsinghua University; the download address is https://github.com/gaussic/text-classification-cnn-rnn . The subset used here contains 5,000 news texts divided into 10 candidate categories: finance, real estate, home furnishing, education, technology, fashion, current affairs, sports, games, and entertainment.
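
As a quick sanity check (not part of the original notebook), the file is expected to be tab-separated, with the category label first and the news content second; a minimal sketch to peek at the first record:

# Peek at the first record of cnews.val.txt (assumed format: "category<TAB>content")
with open("./cnews.val.txt", encoding="utf-8") as f:
    first_line = f.readline()
category, content = first_line.split("\t", 1)
print(category)        # one of the 10 category labels
print(content[:50])    # first 50 characters of the news text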

1) Import data

This is implemented in a Jupyter Notebook; the relevant code is as follows:

# Import the required packages
import pandas as pd
import numpy as np
# Read in the data
df_news = pd.read_table("./cnews.val.txt", names=["category", "content"])
df_news.head()

The data is read from the file; the two columns are the category and the content of each news item, as shown in the figure.

# Categories and total number of samples
df_news.category.unique()
df_news.content.shape
# For easier downstream processing, convert the original tabular column to a list
content_list = df_news.content.values.tolist()

[Figure: the first few rows of the news data]
The category-statistics code runs successfully, and its output is shown in the figure.

[Figure: output of the category and data-volume statistics]

2) Data cleaning

News text contains not only Chinese characters but also numbers, English characters, punctuation, and so on. Word segmentation is a key step in Chinese text analysis, and correct segmentation leads to better models. Unlike English and many other languages, Chinese words are written without spaces between them, so instead of splitting on whitespace we use the segmentation functions in the jieba library.

import jieba

# jieba word segmentation
content_fenci = []                      # list to hold the segmented texts
for line in content_list:
    text = jieba.lcut(line)             # segment each news item
    if len(text) > 1 and text != '\r':  # skip bare line-break entries
        content_fenci.append(text)      # store the segmentation result
# content_fenci[0]                      # one sample after segmentation
df_content = pd.DataFrame({'content': content_fenci})
df_content.head()

The result after word segmentation is shown in the figure.

[Figure: a sample of the word segmentation result]
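
The cleaning step below relies on a stopwords_list that is loaded from a stopword file; that loading step is not shown in the notebook excerpt. A minimal sketch, assuming a plain-text file with one stopword per line (the filename stopwords.txt is a placeholder):

# Load a stopword list from a plain-text file (one word per line).
# NOTE: "stopwords.txt" is a placeholder; use whatever stopword file the project ships with.
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords_list = [line.strip() for line in f if line.strip()]
print(len(stopwords_list), "stopwords loaded")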

# Remove stopwords (stopwords_list is loaded from a stopword file; see the sketch above)
def drop_stopwords(contents, stopwords):
    content_clean = []   # cleaned, segmented texts
    all_words = []       # every remaining word, kept for word-frequency statistics
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
            all_words.append(str(word))
        content_clean.append(line_clean)
    return content_clean, all_words

content_clean, all_words = drop_stopwords(content_fenci, stopwords_list)
df_clean = pd.DataFrame({'contents_clean': content_clean})
df_clean.head()

The result after cleaning is shown in the figure.

[Figure: the cleaned word segmentation result]

3) Word frequency statistics

Count how many times each word occurs in the text and sort the statistics by frequency, as shown in the figure.

[Figure: word frequency statistics]

The relevant code is as follows:

from collections import Counter

tf = Counter(all_words)  # word -> number of occurrences
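
As a quick usage example (not in the original notebook), Counter.most_common lists the highest-frequency words first:

# Top 10 most frequent words after stopword removal
for word, count in tf.most_common(10):
    print(word, count)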

2. Word cloud construction

A word cloud is a visual display of the keywords that appear most frequently in a text. It filters out large amounts of low-frequency, low-value information, so a viewer can grasp the gist of the text at a glance.

from imageio import imread          # imread is used to load the mask image
import wordcloud
import matplotlib.pyplot as plt

# Word cloud drawn over a background image used as a mask
# font is assumed to hold the path to a Chinese font file (e.g. SimHei); without it Chinese characters will not render
mask = imread('4.png')              # read the mask image
wc = wordcloud.WordCloud(font_path=font, mask=mask, background_color='white', scale=2)
# scale: enlarges the canvas by the given factor; 2 doubles both width and height
wc.generate_from_frequencies(tf)    # build the cloud from the word-frequency Counter
plt.imshow(wc)                      # display the word cloud
plt.axis('off')                     # hide the axes
plt.show()
wc.to_file('ciyun.jpg')             # save the word cloud to a file

3. Keyword extraction

TF-IDF is a statistical method: the importance of a word increases with the number of times it appears in a document, but decreases with how often it appears across the whole corpus. Keyword extraction is implemented with the TF-IDF algorithm as follows.

import jieba.analyse

index = 2
# print(df_clean['contents_clean'][index])
# Join the cleaned words back into a single string
content_S_str = "".join(content_clean[index])
print(content_list[index])
print('Keywords:')
print(" ".join(jieba.analyse.extract_tags(content_S_str, topK=10, withWeight=False)))

4. Voice broadcast

The keywords extracted above are converted to speech and read aloud with pyttsx3.

import pyttsx3
voice = pyttsx3.init()
voice.say(" ".join(jieba.analyse.extract_tags(content_S_str, topK=10, withWeight=False)))
print("Preparing the voice broadcast.....")
voice.runAndWait()
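
Optionally, the speaking rate and volume can be adjusted before calling say. A small usage example (not in the original code) using pyttsx3's setProperty interface:

# Adjust speech rate (words per minute) and volume (0.0 - 1.0) before speaking
voice.setProperty('rate', 150)
voice.setProperty('volume', 0.9)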

5. LDA topic model

LDA is a generative topic model for documents, also described as a three-layer Bayesian probability model containing words (w), topics (z), and documents (theta). Both the document-to-topic and topic-to-word distributions are multinomial, and the keywords of each topic are obtained from them. In practice, because the vocabulary is large while each document contains only a limited number of words, a dense matrix representation would waste memory, so gensim uses sparse vectors internally. First, a dictionary is built from the segmented and cleaned documents with dictionary = corpora.Dictionary(texts); then, based on that dictionary, each document is converted into a sparse bag-of-words vector.

import gensim
from gensim import corpora

def create_LDA(content_clean):
    # Build a dictionary from the text collection and report the number of features
    dictionary = corpora.Dictionary(content_clean)
    dic = len(dictionary.token2id)
    print('Number of dictionary features: %d' % dic)
    # Convert the segmented texts into a corpus of sparse bag-of-words vectors
    corpus = [dictionary.doc2bow(sentence) for sentence in content_clean]
    # Train the model; passes is the number of training passes over the corpus
    lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)
    print(lda.print_topic(1, topn=5))
    print('-----------')
    for topic in lda.print_topics(num_topics=10, num_words=5):
        print(topic[1])

create_LDA(content_clean)
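
As a follow-up usage example (not part of the original notebook), if create_LDA is modified to return the dictionary and the trained lda model, the topic mixture of a single article can be inspected with gensim's get_document_topics:

# Assumes create_LDA was changed to end with `return dictionary, lda`
dictionary, lda = create_LDA(content_clean)
bow = dictionary.doc2bow(content_clean[2])        # bag-of-words vector of one cleaned article
for topic_id, prob in lda.get_document_topics(bow):
    print(topic_id, round(prob, 3))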

6. Model construction

A naive Bayes classifier uses the prior probability of an object and Bayes' formula to compute the posterior probability, that is, the probability that the object belongs to each class, and then assigns the object to the class with the largest posterior probability. To encode the class labels, a mapping type is used: a mapping maps hashable keys to arbitrary objects and is a mutable object. Python's only standard mapping type is the dictionary, written with curly braces, in which every element is a key-value pair (key: value); the pairs are looked up by key rather than by position.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

df_train = pd.DataFrame({"content": content_clean, "label": df_news['category']})
# For easier computation, map the string labels to integers
# (a mapping type: a non-empty dictionary)
label_mapping = {"体育": 0, "娱乐": 1, "家居": 2, "房产": 3, "教育": 4,
                 "时尚": 5, "时政": 6, "游戏": 7, "科技": 8, "财经": 9}
df_train['label'] = df_train['label'].map(label_mapping)
# df_train.head()

# Convert each news item into a single string, because CountVectorizer and TfidfVectorizer take strings as input
def create_words(data):
    words = []
    for index in range(len(data)):
        try:
            words.append(' '.join(data[index]))
        except Exception:
            print(index)
    return words

# Split the data into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(df_train['content'].values,
                                                    df_train['label'].values,
                                                    random_state=0)
train_words = create_words(x_train)
test_words = create_words(x_test)

# Model training
# Method 1: CountVectorizer is a common count-based text feature extractor;
# for each training text it only considers how often each word occurs in that text
vec = CountVectorizer(analyzer='word', max_features=4000, lowercase=False)
vec.fit(train_words)
classifier = MultinomialNB()
classifier.fit(vec.transform(train_words), y_train)
print("Model accuracy:", classifier.score(vec.transform(test_words), y_test))

# Method 2: besides the frequency of a word in the current text, TfidfVectorizer also uses
# the inverse of the number of training texts containing the word; the more training
# texts there are, the more advantageous this featurization becomes
vectorizer = TfidfVectorizer(analyzer='word', max_features=40000, lowercase=False)
vectorizer.fit(train_words)
classifier.fit(vectorizer.transform(train_words), y_train)
print("Model accuracy:", classifier.score(vectorizer.transform(test_words), y_test))
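
As a short usage example (not in the original code), the trained classifier can label a single cleaned article; the TF-IDF vectorizer from the second method is reused, and inverse_mapping is a small helper added only for display:

# Map the numeric prediction back to the category name (helper introduced for illustration)
inverse_mapping = {v: k for k, v in label_mapping.items()}
sample = ' '.join(content_clean[2])                      # one cleaned article as a space-joined string
pred = classifier.predict(vectorizer.transform([sample]))[0]
print("Predicted category:", inverse_mapping[pred])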

System test

The word cloud is shown in Figure 1, the keyword extraction is shown in Figure 2, the LDA test results are shown in Figure 3, and the Bayesian results are shown in Figure 4.

[Figure 1: word cloud]

[Figure 2: keyword extraction]

[Figure 3: LDA results]

[Figure 4: Bayesian results]

Project source code download

See my blog resource download page for details


Other information download

If you want to continue learning about artificial intelligence learning routes and knowledge systems, you are welcome to read my other blog post, "Heavy | Complete artificial intelligence AI learning - basic knowledge learning route, all materials can be downloaded directly from the network disk without following any routines".
That post draws on the well-known open-source platform GitHub, AI technology platforms, and experts in related fields, including Datawhale, ApacheCN, AI Youdao, and Dr. Huang Haiguang. It collects about 100 GB of related materials, and I hope it helps all of you.
