Building your own dictionary based on the TF-IDF algorithm (text preprocessing combined with a keyword list)

TF-IDF custom dictionary library design and IDF statistics

What is TF-IDF?

Well, if you're reading this personal log, I'm guessing you already understand TF-IDF. Here's a quick recap anyway, mostly to pad the word count.
tf: term frequency, how often the word appears in the current text. The more it appears in a text, the more important it is to that text.
idf: inverse document frequency, based on how many of the n texts contain the word. The fewer texts it appears in, the more distinctive, and therefore the more important, it is.
tf * idf together gives the word's weight, which, as far as I understand, is the more scientific measure, haha.
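For reference, the standard formula combines the two like this:

tfidf(w, d) = tf(w, d) * log(N / n_w)

where N is the total number of texts and n_w is the number of texts containing the word w. The code later in this post uses the plain ratio N / n_w without the log, which gives the same ranking of words by IDF.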

Idea flow chart

To be honest I haven't polished this much, so I'm writing down my thoughts for everyone to judge whether they are reasonable. Below is a simple text-processing flowchart.

Flowchart: text data → remove spaces → word segmentation → remove stop words and punctuation → count every word that appears and the number of texts it appears in → compute per-word IDF → end

Basic text preprocessing in detail

This part mainly covers preprocessing Chinese text: removing spaces, word segmentation, removing stop words while keeping custom keywords, and removing punctuation. Of course, more experienced folks would probably also handle spelling errors, synonyms, and the like.

Remove spaces

Remove the spaces in the text.
Input: contents, a list of text strings.
Output: the text list with spaces removed.

def remove_blank_space(contents):
    # strip all spaces from every text in the list
    contents_new = map(lambda s: s.replace(' ', ''), contents)
    return list(contents_new)
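A quick made-up usage example:

remove_blank_space(['今天 天气 不错', 'hello world'])
# -> ['今天天气不错', 'helloworld']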
Word segmentation

Segment the text with jieba; the input is the text list output by the previous step.

import jieba

def cut_words(contents):
    # jieba.lcut already returns a list of tokens for each text
    cut_contents = map(lambda s: jieba.lcut(s), contents)
    return list(cut_contents)
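A quick usage example (the exact split depends on jieba's built-in dictionary):

cut_words(['我爱北京天安门'])
# e.g. [['我', '爱', '北京', '天安门']]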
Remove stop words

Here I use Baidu's stop-word list, plus some stop words of my own. I also add the idea of a keyword list, because I'm afraid that some words I actually need would otherwise get filtered out.

def drop_stopwords(contents):
    # Load the stop-word lists and the custom keyword list
    with open('data/stop_word_cn.txt', 'r', encoding='utf-8') as stop, \
         open('./data/stop_me.txt', 'r', encoding='utf-8') as stop_me, \
         open('./data/key_words.txt', 'r', encoding='utf-8') as key:
        # Split into general stop words / personal stop words / keywords
        stop_words = stop.read().split("\n")
        stop_me_words = stop_me.read().split("\n")
        key_words = key.read().split("\n")
    # Result to return
    contents_new = []
    # Process each segmented text
    for line in contents:
        line_clean = []
        for word in line:
            # Skip stop words, unless they are protected keywords
            if (word in stop_words or word in stop_me_words) and word not in key_words:
                continue
            # Keep only Chinese tokens (this also filters punctuation)
            if is_chinese_words(word):
                line_clean.append(word)
        contents_new.append(line_clean)
    return contents_new

The is_chinese_words function above checks whether a token is Chinese: Chinese tokens are kept and everything else is dropped, which mainly serves to filter out punctuation.
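The post doesn't include is_chinese_words itself, so here is a minimal sketch of how it could look, assuming we simply check every character against the basic CJK Unicode range:

def is_chinese_words(word):
    # A token counts as Chinese only if every character is a CJK ideograph;
    # this drops punctuation, digits and Latin tokens as a side effect.
    if not word:
        return False
    return all('\u4e00' <= ch <= '\u9fa5' for ch in word)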

Dictionary generation and IDF calculation

Dictionary generation ideas

I tried some existing dictionaries before, but none of them felt like a good fit, so I use the following approach.

  1. Count all the word-segmentation results.
  2. Compute each word's IDF and export it, then manually filter the words once more based on IDF.
Calculating the IDF and the dictionary

Collect the dictionary words and count the number of texts each word appears in. The input is the preprocessed text list from above; the outputs are the dictionary and the per-word text counts.

def deal_contents(contents):
    # Per-word count of how many texts contain the word (for IDF)
    word_counts = {}
    # The dictionary (vocabulary)
    dict = []
    for content in contents:
        # Words already counted for the current text
        idf_flag = []
        for word in content:
            # First time the word appears anywhere
            if word not in dict:
                dict.append(word)
                idf_flag.append(word)
                word_counts[word] = 1
            # First time the word appears in this text
            elif word not in idf_flag:
                idf_flag.append(word)
                word_counts[word] = word_counts[word] + 1
    return dict, word_counts
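A tiny made-up example to show what the two return values look like:

contents = [['苹果', '香蕉', '苹果'], ['香蕉', '橘子']]
dict, word_counts = deal_contents(contents)
# dict        -> ['苹果', '香蕉', '橘子']
# word_counts -> {'苹果': 1, '香蕉': 2, '橘子': 1}
# '苹果' appears twice in the first text but is only counted once for that text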

After the above step, we only need to apply the formula once more to get the IDF values. The inputs are size (the total number of texts), dict (the dictionary list), and word_counts (a dict mapping each word to the number of texts it appears in). The output is a pandas DataFrame, which is convenient for my later data processing and for exporting to CSV.

import pandas as pd

def calc_idf(size, dict, word_counts):
    idf = []
    for word in dict:
        # size is the total number of texts; word_counts[word] is the
        # number of texts the word appears in
        in_list = [word, size / word_counts[word]]
        idf.append(in_list)
    return pd.DataFrame(idf, columns=['word', 'idf'])
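Continuing the tiny example above with size = 2 texts, the resulting DataFrame would look roughly like:

calc_idf(2, dict, word_counts)
#   word  idf
# 0  苹果  2.0
# 1  香蕉  1.0
# 2  橘子  2.0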

After getting the result, I plan to go through it once more by hand according to the IDF values. It is time-consuming, but I think it is more effective.
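To tie the steps together, here is a minimal end-to-end sketch. The variable raw_contents and the output path data/word_idf.csv are my own placeholders, not part of the original code:

# raw_contents is assumed to be a list of raw Chinese text strings
contents = remove_blank_space(raw_contents)
contents = cut_words(contents)
contents = drop_stopwords(contents)

# build the dictionary and per-word text counts, then compute IDF
word_dict, word_counts = deal_contents(contents)
idf_df = calc_idf(len(contents), word_dict, word_counts)

# sort by IDF so the manual filtering pass is easier, then export to CSV
idf_df = idf_df.sort_values('idf')
idf_df.to_csv('data/word_idf.csv', index=False, encoding='utf-8')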

Corrections and discussion are welcome, and I hope the experts here can help a newbie like me along.

Origin blog.csdn.net/m0_47220500/article/details/105639827