Extracting article keywords based on TF-IDF algorithm

0. Foreword

The purpose of this article is to use the TF-IDF algorithm to extract keywords from an article. For background on TF-IDF, Ruan Yifeng has written a good introductory article:

The application of TF-IDF and cosine similarity (1): automatic extraction of keywords - Ruan Yifeng's blog

TF-IDF is a statistical method for assessing how important a word is to a document within a collection or corpus. (Baidu Encyclopedia)

TF (Term Frequency) is the number of times a word appears in an article, or that count normalized by the article's length. If a word appears many times in an article, it is probably an important word; stop words, of course, are not counted here.

IDF (Inverse Document Frequency) measures a word's "weight" across the document collection. On top of term frequency, if a word appears in only a few documents it is a relatively rare word and its IDF is large; if such a word also appears many times in a particular article, its weight in that article is correspondingly high. Conversely, the more common a word is across documents, the lower its IDF.

After computing the TF and IDF values, multiply the two to get TF-IDF. The higher a word's TF-IDF, the more important it is to the article, and the more likely it is to be one of the article's keywords.
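To make the definitions concrete, here is a minimal sketch of the classic TF-IDF formula (plain logarithmic IDF). The documents and tokens below are made-up examples, and note that scikit-learn's TfidfTransformer uses a smoothed, normalized variant by default, so its numbers will differ slightly.

import math

def tf_idf(word, doc, docs):
    # TF: how often the word appears in this document, normalized by length
    tf = doc.count(word) / float(len(doc))
    # IDF: log of (total documents / documents containing the word)
    containing = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / float(containing))
    return tf * idf

docs = [[u'雪诺', u'守夜人', u'长城'],
        [u'守夜人', u'异鬼', u'异鬼'],
        [u'雪诺', u'史塔克']]
# a word that is frequent here but rare elsewhere scores high
print tf_idf(u'异鬼', docs[1], docs)
# a word that appears in several documents scores low
print tf_idf(u'守夜人', docs[1], docs)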

Python's scikit-learn package provides an API for computing TF-IDF, and we use it here to do a simple extraction of article keywords.

The text data used here is volumes 1-5 of "A Song of Ice and Fire" (as an Ice and Fire fan, hahaha).

1. Data Collection

The text was crawled from an online reading site for the "A Song of Ice and Fire" novels. There are many such sites, so the specific one is not named here.

Crawling it is not difficult; after each chapter is crawled, it is written to a local file.
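For reference, here is a minimal crawling sketch using requests and BeautifulSoup. The URL and the id of the element holding the chapter text are placeholders, since the actual reading site is not named; adjust both to the structure of whatever site is used.

import codecs
import requests
from bs4 import BeautifulSoup

def save_chapter(url, path):
    # fetch the chapter page (placeholder URL; the real site will differ)
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')
    # the 'content' div is hypothetical: adjust to the real page structure
    text = soup.find('div', id='content').get_text()
    # write the chapter text to a local UTF-8 file
    with codecs.open(path, 'w', encoding='utf-8') as out:
        out.write(text)

save_chapter('http://example.com/asoiaf/chapter-1.html', u'chapter-1.txt')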

 

2. Document Word Segmentation

After crawling all the documents, we first need to extract all the words in each document in order to compute the TF and IDF values; Python's jieba library is used for the Chinese word segmentation.

The following code traverses every document in every folder and segments its text:

import os
import jieba

wordslist = []
titlelist = []
# traverse the folders (one per book)
for file in os.listdir('.'):
    if '.' not in file:
        # traverse the documents in this folder
        for f in os.listdir(file):
            # title: "folder--filename"
            # on Windows, append .decode('gbk', 'ignore').encode('utf-8') to work around encoding issues
            titlelist.append(file + '--' + f.split('.')[0])
            # read the document
            with open(file + '//' + f, 'r') as t:
                content = t.read().strip().replace('\n', '').replace(' ', '').replace('\t', '').replace('\r', '')
            # segment the words
            seg_list = jieba.cut(content, cut_all=True)
            result = ' '.join(seg_list)
            wordslist.append(result)

After segmentation, stop words need to be removed to improve extraction accuracy. First, a stop-word list is prepared.

# load the stop-word list (the file is assumed to be UTF-8 encoded)
stop_word = [line.rstrip().decode('utf-8') for line in open('chinese_stopword.txt')]

...
seg_list = jieba.cut(content, cut_all=True)
seg_list_after = []
# remove stop words (jieba.cut yields plain strings)
for seg in seg_list:
    if seg not in stop_word:
        seg_list_after.append(seg)
result = ' '.join(seg_list_after)
wordslist.append(result)

We can also add words of our own to jieba's dictionary to improve segmentation of names the default dictionary does not know, for example:

jieba.add_word(u'丹妮莉丝')

3. TF-IDF Implementation with scikit-learn

(If Anaconda is installed, scikit-learn is already included.)

TF-IDF weights are computed in scikit-learn mainly with the CountVectorizer class and the TfidfTransformer class.

The CountVectorizer class converts the texts into a term-frequency matrix: element word[i][j] of the matrix is the frequency of word j in document i.

fit_transform(raw_documents) learns the vocabulary dictionary and returns the term-document matrix; the elements of the matrix are word counts.

get_feature_names() returns the array mapping feature integer indices to feature names, that is, the list of all words in the vocabulary.

vectorizer = CountVectorizer()
word_frequence = vectorizer.fit_transform(wordslist)
words = vectorizer.get_feature_names()
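To see what the matrix looks like, here is a small toy example separate from the project code (the made-up tokens are two characters or longer, because CountVectorizer's default tokenizer ignores single-character tokens):

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [u'雪诺 守夜人 长城 守夜人', u'丹妮莉丝 异鬼 异鬼 异鬼']
cv = CountVectorizer()
counts = cv.fit_transform(toy_docs)
print cv.get_feature_names()   # the vocabulary, one entry per column
print counts.toarray()         # word[i][j]: count of word j in document i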

The TfidfTransformer class computes the TF-IDF value of each word.

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(word_frequence)
weight = tfidf.toarray()

Finally, output the top n words of each document in descending order of weight. The complete script is below.

import os
import numpy as np
import jieba
import jieba.posseg as pseg
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def titlelist():
    for file in os.listdir('.'):
        if '.' not in file:
            for f in os.listdir(file):
                yield (file + '--' + f.split('.')[0])  # on Windows, append .decode('gbk', 'ignore').encode('utf-8') to work around encoding issues

def wordslist():
    jieba.add_word(u'丹妮莉丝')
    # load the stop-word list (the file is assumed to be UTF-8 encoded)
    stop_word = [line.rstrip().decode('utf-8') for line in open('chinese_stopword.txt')]
    print len(stop_word)
    for file in os.listdir('.'):
        if '.' not in file:
            for f in os.listdir(file):
                with open(file + '//' + f) as t:
                    content = t.read().strip().replace('\n', '').replace(' ', '').replace('\t', '').replace('\r', '')
                    seg_list = pseg.cut(content)
                    seg_list_after = []
                    # remove stop words
                    for seg in seg_list:
                        if seg.word not in stop_word:
                            seg_list_after.append(seg.word)
                    result = ' '.join(seg_list_after)
                    yield result


if __name__ == "__main__":

    wordslist = list(wordslist())
    titlelist = list(titlelist())

    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(wordslist))

    words = vectorizer.get_feature_names()  # vocabulary: the keywords of all texts
    weight = tfidf.toarray()

    n = 5  # top five keywords per document
    for (title, w) in zip(titlelist, weight):
        print u'{}:'.format(title)
        # sort indices by descending weight
        loc = np.argsort(-w)
        for i in range(n):
            print u'-{}: {} {}'.format(str(i + 1), words[loc[i]], w[loc[i]])
        print '\n'

Running the program prints the top five keywords and their weights for each document.

4. Finally

References:

[1]. The application of TF-IDF and cosine similarity (1): automatic extraction of keywords - Ruan Yifeng's blog

[2]. Python Package Index

[3]. sklearn.feature_extraction.text.CountVectorizer - scikit-learn 0.18.1 documentation

Code GitHub: wzyonggege/tf-idf
