Chinese Word Segmentation and TF-IDF Feature Applications

Introduction

Chinese word segmentation and TF-IDF (Term Frequency-Inverse Document Frequency) are two fundamental techniques in natural language processing (NLP).

Chinese Word Segmentation
Because Chinese text has no obvious word separators, word segmentation is required. Common word segmentation algorithms are:

  1. Dictionary-based word segmentation: longest match algorithm, forward maximum match, reverse maximum match, etc.
  2. Statistical word segmentation: hidden Markov model (HMM), conditional random field (CRF), etc.

There are ready-made word segmentation libraries in Python, such as jieba.

import jieba

sentence = "我爱自然语言处理"
words = jieba.cut(sentence)
print(list(words))

TF-IDF
TF-IDF is used to measure the importance of a word in a document set.

  • TF (Term Frequency): how often a word appears in a document.
  • IDF (Inverse Document Frequency): a measure of how rare a word is across the whole document collection.

TF-IDF = TF × IDF
It is often used in text mining, information retrieval, etc.
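
To make the formula concrete, here is a minimal sketch (not from the original text) that computes TF-IDF by hand for a tiny corpus of already-segmented documents, using TF = count / document length and IDF = log10(N / document frequency); other IDF variants exist:

import math
from collections import Counter

# toy corpus: each document is a list of already-segmented words
docs = [["我", "爱", "自然", "语言", "处理"],
        ["自然", "语言", "处理", "很", "有趣"],
        ["我", "喜欢", "学习"]]

# document frequency: how many documents each word appears in
df = Counter(word for doc in docs for word in set(doc))
N = len(docs)

for i, doc in enumerate(docs):
    tf = Counter(doc)
    tfidf = {w: (c / len(doc)) * math.log10(N / df[w]) for w, c in tf.items()}
    print(i, sorted(tfidf.items(), key=lambda x: x[1], reverse=True))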

Application Example
Suppose you have a set of Chinese documents and you want to find the keywords in each document.

  1. First, Chinese word segmentation is performed on each document.
  2. Then compute TF-IDF and take the words with the highest TF-IDF values in each document as its keywords.

The Scikit-learn library in Python has a ready-made TF-IDF implementation.

from sklearn.feature_extraction.text import TfidfVectorizer

# 假设已经分词并用空格连接
docs = ["我 爱 自然 语言 处理", "自然 语言 处理 很 有趣", "我 喜欢 学习"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# 打印特征名和TF-IDF矩阵
print(vectorizer.get_feature_names_out())
print(tfidf.toarray())
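
Since TfidfVectorizer splits on whitespace by default, the segmentation and vectorization steps can also be combined by passing jieba as the tokenizer, so raw (unsegmented) Chinese text can be fed in directly. A minimal sketch, assuming a scikit-learn version recent enough to provide get_feature_names_out (as used above):

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

raw_docs = ["我爱自然语言处理", "自然语言处理很有趣", "我喜欢学习"]

# use jieba.lcut as the tokenizer; token_pattern=None avoids the unused-pattern warning
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
tfidf = vectorizer.fit_transform(raw_docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())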

Through Chinese word segmentation and TF-IDF, you can effectively convert text data into machine-understandable numerical data, and then use it in various NLP tasks, such as text classification, clustering, etc.

1. The basic task of NLP - word segmentation

Word segmentation task
Why talk about word segmentation?

  1. Word segmentation is a task that has been studied for a long time. By understanding the development of word segmentation algorithms, you can see the research history of NLP
  2. Word segmentation is a representative of a class of problems in NLP
  3. Word segmentation is very common, and many NLP tasks are based on word segmentation

2. Chinese word segmentation

2.1 Chinese Word Segmentation - Difficulties

Chinese word segmentation has some unique difficulties compared to space-delimited languages like English:

  1. No clear delimiters: Chinese text has no whitespace to separate words the way English does, so word boundaries are inherently ambiguous.
  2. Lexical ambiguity: the same character can appear in different words with different meanings. For example, 参 means "ginseng" in 海参 (sea cucumber) but "to take part" in 参加 (to participate).
  3. New word recognition: current events, the Internet, etc. constantly produce new words that may not be in the dictionary, so dictionary-based methods struggle to segment them correctly.
  4. Compound words and idioms: Chinese has a large number of compounds and fixed expressions, such as the idiom 心有余悸 ("to still feel fear afterwards"); splitting them into individual words changes their meaning.
  5. Multi-granularity segmentation: Chinese can be segmented at different granularities. For example, 北京大学 (Peking University) can be kept as one word or split into 北京 / 大学 (Beijing / university); different applications need different granularities.
  6. Dialects and regional terms: different regions and communities have their own terms and expressions, which adds complexity.
  7. Part-of-speech diversity: the same word can act as different parts of speech, and different positions in a sentence may call for different segmentations.
  8. Noise and errors: spelling mistakes, typos, and non-standard usage in the text can all hurt segmentation accuracy.

Due to the above factors, Chinese word segmentation usually uses complex algorithms based on machine learning or statistics, such as Conditional Random Field (CRF), Hidden Markov Model (HMM), etc., to improve the accuracy of word segmentation.

2.2 Chinese word segmentation - forward maximum matching

Forward Maximum Matching (FMM for short) is a dictionary-based Chinese word segmentation algorithm. This method starts from the first character of the text, tries to find the longest word, and then tokenizes it as a word. The specific process is as follows:

  1. Preparing the dictionary : You first need a pre-prepared dictionary containing all possible occurrences of words.
  2. Set the window size : The window size is usually set to the length of the longest word in the dictionary.
  3. Scan text :
  • Place the window at the far left of the text.
  • Take characters from the window and match them against words in the dictionary, trying to find the longest matching word.
  • If a match is found, treat the word as a tokenization result and move the window to the right by the corresponding number of characters.
  • If no match is found, take the first character in the window as a tokenization result and just move the window one character to the right.
  4. Repeat step 3 until the entire text has been scanned.

The advantage of this method is that it is simple to implement and fast in operation. But the shortcomings are also obvious: it cannot handle new words outside the dictionary well, and due to the lack of context information, it may lead to wrong segmentation of some ambiguous words.

2.2.1 Implementation Method 1

  1. Find the maximum word length in the vocabulary.
  2. Starting from the beginning of the string, take a window of that maximum length and check whether the text in the window is in the vocabulary.
  3. If it is in the vocabulary, cut at the word boundary, move the window start to that boundary, and repeat step 2.
  4. If it is not in the vocabulary, shrink the window by one character from the right edge and check again, until a match is found or only one character remains (which is then cut as a single character).

Cutting process (example: 北京大学生前来报到, with a maximum word length of 5):

北京大学生 → not in the dictionary, shrink the window from the right
北京大学 → in the dictionary, cut 北京大学
生前来报到 → shrink until 生前 matches, cut 生前
来报到 → shrink until only 来 remains, cut 来
报到 → in the dictionary, cut 报到

Result: 北京大学 / 生前 / 来 / 报到 (the intended reading is 北京 / 大学生 / 前来 / 报到, which illustrates the weakness of pure maximum matching).

#分词方法:最大正向切分的第一种实现方式

import re
import time

#加载词典
def load_word_dict(path):
    max_word_length = 0
    word_dict = {}  #用set也是可以的。用list会很慢
    with open(path, encoding="utf8") as f:
        for line in f:
            word = line.split()[0]
            word_dict[word] = 0
            max_word_length = max(max_word_length, len(word))
    return word_dict, max_word_length

#先确定最大词长度
#从长向短查找是否有匹配的词
#找到后移动窗口
def cut_method1(string, word_dict, max_len):
    words = []
    while string != '':
        lens = min(max_len, len(string))
        word = string[:lens]
        while word not in word_dict:
            if len(word) == 1:
                break
            word = word[:len(word) - 1]
        words.append(word)
        string = string[len(word):]
    return words

#cut_method是切割函数
#output_path是输出路径
def main(cut_method, input_path, output_path):
    word_dict, max_word_length = load_word_dict("dict.txt")
    writer = open(output_path, "w", encoding="utf8")
    start_time = time.time()
    with open(input_path, encoding="utf8") as f:
        for line in f:
            words = cut_method(line.strip(), word_dict, max_word_length)
            writer.write(" / ".join(words) + "\n")
    writer.close()
    print("耗时:", time.time() - start_time)
    return


string = "测试字符串"
word_dict, max_len = load_word_dict("dict.txt")
# print(cut_method1(string, word_dict, max_len))

main(cut_method1, "corpus.txt", "cut_method1_output.txt")

2.2.2 Implementation Method 2 Using Prefix Dictionary

  1. Search from front to back
  2. If the word in the window is a word prefix, continue to expand the window
  3. If the word in the window is not a word prefix, record the found word and move the window to a word boundary


#分词方法最大正向切分的第二种实现方式

import re
import time
import json

#加载词前缀词典
#用0和1来区分是前缀还是真词
#需要注意有的词的前缀也是真词,在记录时不要互相覆盖
def load_prefix_word_dict(path):
    prefix_dict = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            word = line.split()[0]
            for i in range(1, len(word)):
                if word[:i] not in prefix_dict: #不能用前缀覆盖词
                    prefix_dict[word[:i]] = 0  #前缀
            prefix_dict[word] = 1  #词
    return prefix_dict


#输入字符串和字典,返回词的列表
def cut_method2(string, prefix_dict):
    if string == "":
        return []
    words = []  # 准备用于放入切好的词
    start_index, end_index = 0, 1  #记录窗口的起始位置
    window = string[start_index:end_index] #从第一个字开始
    find_word = window  # 将第一个字先当做默认词
    while start_index < len(string):
        #窗口没有在词典里出现
        if window not in prefix_dict or end_index > len(string):
            words.append(find_word)  #记录找到的词
            start_index += len(find_word)  #更新起点的位置
            end_index = start_index + 1
            window = string[start_index:end_index]  #从新的位置开始一个字一个字向后找
            find_word = window
        #窗口是一个词
        elif prefix_dict[window] == 1:
            find_word = window  #查找到了一个词,还要在看有没有比他更长的词
            end_index += 1
            window = string[start_index:end_index]
        #窗口是一个前缀
        elif prefix_dict[window] == 0:
            end_index += 1
            window = string[start_index:end_index]
    #最后找到的window如果不在词典里,把单独的字加入切词结果
    if prefix_dict.get(window) != 1:
        words += list(window)
    else:
        words.append(window)
    return words


#cut_method是切割函数
#output_path是输出路径
def main(cut_method, input_path, output_path):
    word_dict = load_prefix_word_dict("dict.txt")
    writer = open(output_path, "w", encoding="utf8")
    start_time = time.time()
    with open(input_path, encoding="utf8") as f:
        for line in f:
            words = cut_method(line.strip(), word_dict)
            writer.write(" / ".join(words) + "\n")
    writer.close()
    print("耗时:", time.time() - start_time)
    return


string = "王羲之草书《平安帖》共有九行"
# string = "你到很多有钱人家里去看"
# string = "金鹏期货北京海鹰路营业部总经理陈旭指出"
# string = "伴随着优雅的西洋乐"
# string = "非常的幸运"
prefix_dict = load_prefix_word_dict("dict.txt")
# print(cut_method2(string, prefix_dict))
# print(json.dumps(prefix_dict, ensure_ascii=False, indent=2))
main(cut_method2, "corpus.txt", "cut_method2_output.txt")

2.3 Chinese word segmentation - reverse maximum matching

Reverse Maximum Matching (RMM) is a variant of Forward Maximum Matching (FMM), the main difference being the search direction. In RMM, the word segmentation process starts from the last character of the text and scans forward. The specific process is as follows:

  1. Prepare a dictionary : You need a pre-prepared dictionary that includes all the words you might use.
  2. Set the window size : The window size is usually set to the length of the longest word in the dictionary.
  3. Scan text :
  • Place the window at the far right of the text.
  • Take characters from the window and match them against words in the dictionary, trying to find the longest matching word.
  • If a match is found, treat the word as a tokenization result and move the window to the left by the corresponding number of characters.
  • If no match is found, take the last character in the window as a tokenization result and move the window one character to the left.
  4. Repeat step 3 until the entire text has been scanned.

Similar to forward max matching, the advantages of reverse max matching include simplicity to implement and fast running speed. But there are also disadvantages, such as not being able to handle new words out of the dictionary well, and may cause some ambiguous words to be incorrectly segmented due to lack of context information. Sometimes, people will compare the results of forward maximum matching and reverse maximum matching to further improve the accuracy of word segmentation.
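
For comparison with cut_method1 in section 2.2.1, a reverse maximum matching version can be sketched as follows (a minimal illustration reusing the same word_dict / max_len arguments; it is not code from the original text):

def cut_method_rmm(string, word_dict, max_len):
    words = []
    while string != '':
        lens = min(max_len, len(string))
        word = string[-lens:]            # take the window from the right end of the text
        while word not in word_dict:
            if len(word) == 1:
                break
            word = word[1:]              # shrink the window from its left edge
        words.append(word)
        string = string[:-len(word)]     # move the window to the left
    words.reverse()                      # restore the original word order
    return words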


2.4 Chinese word segmentation - two-way maximum matching

Bidirectional Maximum Matching (BMM) is a combination of forward maximum matching (FMM) and reverse maximum matching (RMM). This method uses two strategies for word segmentation respectively, and compares the results of the two. Specific steps are as follows:

  1. Prepare a dictionary: First you need a pre-prepared dictionary, including all the words that may be used.
  2. Word segmentation using forward maximum matching: starting from the first character of the text, apply the forward maximum matching algorithm for word segmentation.
  3. Word segmentation using reverse maximum matching: starting from the last character of the text, apply the reverse maximum matching algorithm for word segmentation.
  4. Compare the results of the two methods:
  • If the word segmentation results obtained by the two methods are the same, then this result is likely to be correct.
  • If the word segmentation results obtained by the two methods are different, the result with fewer words is usually selected, because in practical applications, it is generally believed that the fewer word segmentation results, the higher the accuracy.

The advantage of two-way maximum matching is that it combines two word segmentation methods, which can theoretically obtain more accurate word segmentation results. However, the computational cost of this method is relatively high, because two word segmentation algorithms need to be run and the results compared.

In addition, this method also cannot handle new words outside the dictionary or fixed collocations composed of multiple words well, nor can it solve the word segmentation ambiguity problem caused by the context. But in general, the bidirectional maximum matching algorithm is usually better than the single direction maximum matching algorithm in terms of accuracy.
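
A minimal sketch of the comparison step, assuming the cut_method1 function from section 2.2.1 and the cut_method_rmm sketch above (the tie-breaking rule used here, fewer words and then fewer single characters, is a common heuristic and is an assumption, not necessarily the exact rule intended in the original text):

def cut_bidirectional(string, word_dict, max_len):
    fmm = cut_method1(string, word_dict, max_len)     # forward result
    rmm = cut_method_rmm(string, word_dict, max_len)  # reverse result
    if fmm == rmm:
        return fmm
    # prefer fewer words; break ties by the number of single-character tokens
    def score(words):
        return (len(words), sum(1 for w in words if len(w) == 1))
    return min(fmm, rmm, key=score)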


2.5 Chinese word segmentation-jieba word segmentation

jieba is a popular Chinese word segmentation library for Python. It builds a directed acyclic graph (DAG) of possible word combinations from a prefix dictionary, uses dynamic programming to find the maximum-probability segmentation path, and falls back to an HMM-based model for words not in the dictionary. jieba supports multiple segmentation modes, including precise mode, full mode, and search engine mode, and also provides part-of-speech tagging and keyword extraction.

2.5.1 Basic usage

  1. Installation: install via pip:

pip install jieba

  2. Import the library and perform basic segmentation:
import jieba

sentence = "我来到北京清华大学"
seg_list = jieba.cut(sentence)
print(list(seg_list))

2.5.2 Word segmentation mode

  • Precise mode: tries to segment the sentence as accurately as possible, suitable for text analysis.
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print(list(seg_list))

  • Full mode: scans all possible words in the sentence.
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print(list(seg_list))

  • Search engine mode: on top of precise mode, long words are split again, which is useful for search engine indexing.
seg_list = jieba.cut_for_search("我来到北京清华大学")
print(list(seg_list))

2.5.3 Other functions

  • Add a custom dictionary :
jieba.load_userdict("userdict.txt")

  • Part of speech tagging :
import jieba.posseg as pseg
words = pseg.cut("我来到北京清华大学")
for word, flag in words:
    print(f"{word} {flag}")

  • Keyword extraction :
from jieba import analyse
tags = analyse.extract_tags(sentence, topK=5)
print(tags)


2.6 Disadvantages of the three methods

Common disadvantages of forward maximum matching, reverse maximum matching, and bidirectional maximum matching:

  1. They depend entirely on the vocabulary: without a vocabulary they cannot work at all, and if a required word is missing from the vocabulary the result will be wrong.
  2. The segmentation process pays no attention to the meaning of the whole sentence; it treats the sentence only as fragments to be matched.
  3. Typos in the text can cause a cascade of segmentation errors.
  4. Entity words that cannot be enumerated, such as personal names, cannot be handled effectively.

2.7 Chinese word segmentation - based on machine learning

Let's rethink: if we want to segment a sentence, what do we need to know?

  • 上海自来水来自海上 ("Shanghai tap water comes from the sea")
  • For each character, we want to know whether it is a word boundary


The problem is thus transformed into: for each character in the sentence, make a binary classification judgment. The positive class means the character is a word boundary in this sentence, and the negative class means it is not.

Label the data and train a model to make the above judgments; such a model can then be called a word segmentation model.

Code

#coding:utf8

import torch
import torch.nn as nn
import jieba
import numpy as np
import random
import json
from torch.utils.data import DataLoader

"""
基于pytorch的网络编写一个分词模型
我们使用jieba分词的结果作为训练数据
看看是否可以得到一个效果接近的神经网络模型
"""

class TorchModel(nn.Module):
    def __init__(self, input_dim, hidden_size, num_rnn_layers, vocab):
        super(TorchModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab) + 1, input_dim) #shape=(vocab_size, dim)
        self.rnn_layer = nn.RNN(input_size=input_dim,
                            hidden_size=hidden_size,
                            batch_first=True,
                            num_layers=num_rnn_layers,
                            )
        self.classify = nn.Linear(hidden_size, 2)
        self.loss_func = nn.CrossEntropyLoss(ignore_index=-100)

    #当输入真实标签,返回loss值;无真实标签,返回预测值
    def forward(self, x, y=None):
        x = self.embedding(x)  #input shape: (batch_size, sen_len), output shape:(batch_size, sen_len, input_dim)
        x, _ = self.rnn_layer(x)  #output shape:(batch_size, sen_len, hidden_size)
        y_pred = self.classify(x)   #output shape:(batch_size, sen_len, 2)
        if y is not None:
            # view(-1,2): (batch_size, sen_len, 2) ->  (batch_size * sen_len, 2)
            return self.loss_func(y_pred.view(-1, 2), y.view(-1))
        else:
            return y_pred

class Dataset:
    def __init__(self, corpus_path, vocab, max_length):
        self.vocab = vocab
        self.corpus_path = corpus_path
        self.max_length = max_length
        self.load()

    def load(self):
        self.data = []
        with open(self.corpus_path, encoding="utf8") as f:
            for line in f:
                sequence = sentence_to_sequence(line, self.vocab)
                label = sequence_to_label(line)
                sequence, label = self.padding(sequence, label)
                sequence = torch.LongTensor(sequence)
                label = torch.LongTensor(label)
                self.data.append([sequence, label])
                #使用部分数据做展示,使用全部数据训练时间会相应变长
                if len(self.data) > 10000:
                    break

    #将文本截断或补齐到固定长度
    def padding(self, sequence, label):
        sequence = sequence[:self.max_length]
        sequence += [0] * (self.max_length - len(sequence))
        label = label[:self.max_length]
        label += [-100] * (self.max_length - len(label))
        return sequence, label

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        return self.data[item]

#文本转化为数字序列,为embedding做准备
def sentence_to_sequence(sentence, vocab):
    sequence = [vocab.get(char, vocab['unk']) for char in sentence]
    return sequence

#基于结巴生成分级结果的标注
def sequence_to_label(sentence):
    words = jieba.lcut(sentence)
    label = [0] * len(sentence)
    pointer = 0
    for word in words:
        pointer += len(word)
        label[pointer - 1] = 1
    return label

#加载字表
def build_vocab(vocab_path):
    vocab = {}
    with open(vocab_path, "r", encoding="utf8") as f:
        for index, line in enumerate(f):
            char = line.strip()
            vocab[char] = index + 1   #每个字对应一个序号
    vocab['unk'] = len(vocab) + 1
    return vocab

#建立数据集
def build_dataset(corpus_path, vocab, max_length, batch_size):
    dataset = Dataset(corpus_path, vocab, max_length) #diy __len__ __getitem__
    data_loader = DataLoader(dataset, shuffle=True, batch_size=batch_size) #torch
    return data_loader


def main():
    epoch_num = 10        #训练轮数
    batch_size = 20       #每次训练样本个数
    char_dim = 50         #每个字的维度
    hidden_size = 100     #隐含层维度
    num_rnn_layers = 3    #rnn层数
    max_length = 20       #样本最大长度
    learning_rate = 1e-3  #学习率
    vocab_path = "chars.txt"  #字表文件路径
    corpus_path = "../corpus.txt"  #语料文件路径
    vocab = build_vocab(vocab_path)       #建立字表
    data_loader = build_dataset(corpus_path, vocab, max_length, batch_size)  #建立数据集
    model = TorchModel(char_dim, hidden_size, num_rnn_layers, vocab)   #建立模型
    optim = torch.optim.Adam(model.parameters(), lr=learning_rate)     #建立优化器
    #训练开始
    for epoch in range(epoch_num):
        model.train()
        watch_loss = []
        for x, y in data_loader:
            optim.zero_grad()    #梯度归零
            loss = model(x, y)   #计算loss
            loss.backward()      #计算梯度
            optim.step()         #更新权重
            watch_loss.append(loss.item())
        print("=========\n第%d轮平均loss:%f" % (epoch + 1, np.mean(watch_loss)))
    #保存模型
    torch.save(model.state_dict(), "model.pth")
    return

#最终预测
def predict(model_path, vocab_path, input_strings):
    #配置保持和训练时一致
    char_dim = 50  # 每个字的维度
    hidden_size = 100  # 隐含层维度
    num_rnn_layers = 3  # rnn层数
    vocab = build_vocab(vocab_path)       #建立字表
    model = TorchModel(char_dim, hidden_size, num_rnn_layers, vocab)   #建立模型
    model.load_state_dict(torch.load(model_path))   #加载训练好的模型权重
    model.eval()
    for input_string in input_strings:
        #逐条预测
        x = sentence_to_sequence(input_string, vocab)
        with torch.no_grad():
            result = model.forward(torch.LongTensor([x]))[0]
            result = torch.argmax(result, dim=-1)  #预测出的01序列
            #在预测为1的地方切分,将切分后文本打印出来
            for index, p in enumerate(result):
                if p == 1:
                    print(input_string[index], end=" ")
                else:
                    print(input_string[index], end="")
            print()



if __name__ == "__main__":
    # main()
    input_strings = ["同时国内有望出台新汽车刺激方案",
                     "沪胶后市有望延续强势",
                     "经过两个交易日的强势调整后",
                     "昨日上海天然橡胶期货价格再度大幅上扬"]
    predict("model.pth", "chars.txt", input_strings)


3. About word segmentation

At present, research on Chinese word segmentation is gradually declining, for the following reasons:

  1. In most cases, current word segmentation already works well, and there is not much room left for optimization.
  2. Even when a segmentation error occurs, the downstream task does not necessarily fail, so it is often not worth spending a lot of effort optimizing segmentation.
  3. With the rise of neural networks and pre-trained models, Chinese tasks increasingly no longer require word segmentation at all, and often work better without it.
  4. The problems that remain are genuinely hard to solve.

4. Summarize experience

  1. There are different algorithms for the same task
  2. Different implementations may have the same result, but with different efficiencies
  3. Different algorithms may have different results, but each has advantages and disadvantages
  4. Trading space for time is a common way to improve performance
  5. A combination of multiple algorithms may yield better results

5. New word discovery

New word discovery is an important task in natural language processing (NLP), especially for dynamically changing corpora or domain texts. Traditional word segmentation tools usually rely on pre-built dictionaries, which have limitations when dealing with unregistered words (words that are not in the dictionary). New word spotting algorithms are designed to automatically recognize such new words from large amounts of text.

  • Assuming there is no vocabulary, how to discover new words from text?
  • Over time, new words will appear and established vocabularies will become obsolete
  • Supplementary vocabulary facilitates downstream tasks

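One common recipe, and the one implemented in the code example below, scores every candidate character n-gram w using two signals:

  • Internal cohesion (pointwise mutual information): PMI(w) = log( p(w) / (p(c1) × p(c2) × ...) ) / len(w), where p(w) is the relative frequency of w among n-grams of the same length and c1, c2, ... are its characters. A high PMI means the characters occur together far more often than chance.
  • Boundary freedom (left/right entropy): H = -Σ p(c) × log p(c), computed over the characters c that appear immediately to the left (or right) of w. A real word can be preceded and followed by many different characters, so both entropies should be high.

The final score used below is PMI(w)² × min(left entropy, right entropy); candidates with the highest scores are proposed as new words.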

code example

import math
from collections import defaultdict

class NewWordDetect:
    def __init__(self, corpus_path):
        self.max_word_length = 5
        self.word_count = defaultdict(int)
        self.left_neighbor = defaultdict(dict)
        self.right_neighbor = defaultdict(dict)
        self.load_corpus(corpus_path)
        self.calc_pmi()
        self.calc_entropy()
        self.calc_word_values()


    #加载语料数据,并进行统计
    def load_corpus(self, path):
        with open(path, encoding="utf8") as f:
            for line in f:
                sentence = line.strip()
                for word_length in range(1, self.max_word_length):
                    self.ngram_count(sentence, word_length)
        return

    #按照窗口长度取词,并记录左邻右邻
    def ngram_count(self, sentence, word_length):
        for i in range(len(sentence) - word_length + 1):
            word = sentence[i:i + word_length]
            self.word_count[word] += 1
            if i - 1 >= 0:
                char = sentence[i - 1]
                self.left_neighbor[word][char] = self.left_neighbor[word].get(char, 0) + 1
            if i + word_length < len(sentence):
                char = sentence[i +word_length]
                self.right_neighbor[word][char] = self.right_neighbor[word].get(char, 0) + 1
        return

    #计算熵
    def calc_entropy_by_word_count_dict(self, word_count_dict):
        total = sum(word_count_dict.values())
        entropy = sum([-(c / total) * math.log((c / total), 10) for c in word_count_dict.values()])
        return entropy

    #计算左右熵
    def calc_entropy(self):
        self.word_left_entropy = {}
        self.word_right_entropy = {}
        for word, count_dict in self.left_neighbor.items():
            self.word_left_entropy[word] = self.calc_entropy_by_word_count_dict(count_dict)
        for word, count_dict in self.right_neighbor.items():
            self.word_right_entropy[word] = self.calc_entropy_by_word_count_dict(count_dict)


    #统计每种词长下的词总数
    def calc_total_count_by_length(self):
        self.word_count_by_length = defaultdict(int)
        for word, count in self.word_count.items():
            self.word_count_by_length[len(word)] += count
        return

    #计算互信息(pointwise mutual information)
    def calc_pmi(self):
        self.calc_total_count_by_length()
        self.pmi = {}
        for word, count in self.word_count.items():
            p_word = count / self.word_count_by_length[len(word)]
            p_chars = 1
            for char in word:
                p_chars *= self.word_count[char] / self.word_count_by_length[1]
            self.pmi[word] = math.log(p_word / p_chars, 10) / len(word)
        return

    def calc_word_values(self):
        self.word_values = {}
        for word in self.pmi:
            if len(word) < 2 or "," in word:
                continue
            pmi = self.pmi.get(word, 1e-3)
            le = self.word_left_entropy.get(word, 1e-3)
            re = self.word_right_entropy.get(word, 1e-3)
            self.word_values[word] = pmi ** 2 * min(le, re)

if __name__ == "__main__":
    nwd = NewWordDetect("sample_corpus.txt")
    # print(nwd.word_count)
    # print(nwd.left_neighbor)
    # print(nwd.right_neighbor)
    # print(nwd.pmi)
    # print(nwd.word_left_entropy)
    # print(nwd.word_right_entropy)
    value_sort = sorted([(word, count) for word, count in nwd.word_values.items()], key=lambda x:x[1], reverse=True)
    print([x for x, c in value_sort if len(x) == 2][:10])
    print([x for x, c in value_sort if len(x) == 3][:10])
    print([x for x, c in value_sort if len(x) == 4][:10])


From words to understanding

  • Once we have the ability to segment words, we need to use those words to understand the text
  • The first idea that comes to mind is to select the important words from a document

What is an important word?

  • If a word appears many times in one category of text (say, category A) but rarely in other categories, then it is an important (high-weight) word for category A.
  • Conversely, if a word appears across many domains, its importance to any single category is low.

How to describe this mathematically:

TF(t, d) = (number of times word t appears in document d) / (total number of words in d)
IDF(t) = log(total number of documents / number of documents containing t)

6. TF-IDF

6.1 TF-IDF calculation

  • TF·IDF = TF * IDF
  • Suppose there are four documents, and the words in the documents are replaced by letters
  • A: a b c d a b c d
  • B: b c b c b c
  • C: b d b d
  • D: d d d d d d d

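As a rough worked example, using IDF = log10(number of documents / number of documents containing the word) (the exact IDF variant in the original figure may differ): the word a appears only in document A, so IDF(a) = log10(4/1) ≈ 0.60; within A its TF is 2/8 = 0.25, giving TF·IDF ≈ 0.15, the highest weight in A. The word d appears in three of the four documents (A, C, D), so IDF(d) = log10(4/3) ≈ 0.12; even in document D, where its TF is 7/7 = 1, its TF·IDF is only about 0.12. So a is distinctive of A, while d is common across documents and gets a low weight everywhere.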

  • Each word gets a TF·IDF value with respect to each document (or category)
  • A high TF·IDF means the word is highly important to that document or field
  • A low TF·IDF means the opposite


6.2 Other versions of TFIDF

Many variants of the basic formula exist; they differ mainly in how the TF and IDF terms are normalized, for example using log(1 + tf) as the TF term, or a smoothed IDF such as log(N / (df + 1)), which is the form used in the code in section 7.

6.3 Algorithm Features

  1. The calculation of tf-idf is very dependent on the word segmentation results. If the word segmentation is wrong, the significance of the statistical value will be greatly reduced
  2. Each word, for each document, has a different tf-idf value, so tfidf cannot be discussed without data
  3. If there is only one text, tf-idf cannot be calculated
  4. Category data balance is important
  5. Easily affected by various special symbols, it is best to do some preprocessing

6.4 TFIDF application - search engine

  1. For all existing web pages (text), calculate the TFIDF value of words in each web page
  2. Word segmentation for an input query
  3. For each document D, compute the sum of the TF-IDF values in D of the words in the query, and use it as the relevance score between the query and the document

code example

import jieba
import math
import os
import json
from collections import defaultdict
from calculate_tfidf import calculate_tfidf, tf_idf_topk
"""
基于tfidf实现简单搜索引擎
"""

jieba.initialize()

#加载文档数据(可以想象成网页数据),计算每个网页的tfidf字典
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            corpus.append(document["title"] + "\n" + document["content"])
        tf_idf_dict = calculate_tfidf(corpus)
    return tf_idf_dict, corpus

def search_engine(query, tf_idf_dict, corpus, top=3):
    query_words = jieba.lcut(query)
    res = []
    for doc_id, tf_idf in tf_idf_dict.items():
        score = 0
        for word in query_words:
            score += tf_idf.get(word, 0)
        res.append([doc_id, score])
    res = sorted(res, reverse=True, key=lambda x:x[1])
    for i in range(top):
        doc_id = res[i][0]
        print(corpus[doc_id])
        print("--------------")
    return res

if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, corpus = load_data(path)
    while True:
        query = input("请输入您要搜索的内容:")
        search_engine(query, tf_idf_dict, corpus)

6.5 TFIDF Application - Text Summarization

  1. The keywords of each text are obtained by calculating the TFIDF value.
  2. Sentences containing many keywords are considered as key sentences.
  3. Select several key sentences as a summary of the text.

code example

import jieba
import math
import os
import random
import re
import json
from collections import defaultdict
from calculate_tfidf import calculate_tfidf, tf_idf_topk
"""
基于tfidf实现简单文本摘要
"""

jieba.initialize()

#加载文档数据(可以想象成网页数据),计算每个网页的tfidf字典
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            assert "\n" not in document["title"]
            assert "\n" not in document["content"]
            corpus.append(document["title"] + "\n" + document["content"])
        tf_idf_dict = calculate_tfidf(corpus)
    return tf_idf_dict, corpus

#计算每一篇文章的摘要
#输入该文章的tf_idf词典,和文章内容
#top为人为定义的选取的句子数量
#过滤掉一些正文太短的文章,因为正文太短在做摘要意义不大
def generate_document_abstract(document_tf_idf, document, top=3):
    sentences = re.split("?|!|。", document)
    #过滤掉正文在五句以内的文章
    if len(sentences) <= 5:
        return None
    result = []
    for index, sentence in enumerate(sentences):
        sentence_score = 0
        words = jieba.lcut(sentence)
        for word in words:
            sentence_score += document_tf_idf.get(word, 0)
        sentence_score /= (len(words) + 1)
        result.append([sentence_score, index])
    result = sorted(result, key=lambda x:x[0], reverse=True)
    #权重最高的可能依次是第10,第6,第3句,将他们调整为出现顺序比较合理,即3,6,10
    important_sentence_indexs = sorted([x[1] for x in result[:top]])
    return "。".join([sentences[index] for index in important_sentence_indexs])

#生成所有文章的摘要
def generate_abstract(tf_idf_dict, corpus):
    res = []
    for index, document_tf_idf in tf_idf_dict.items():
        title, content = corpus[index].split("\n")
        abstract = generate_document_abstract(document_tf_idf, content)
        if abstract is None:
            continue
        corpus[index] += "\n" + abstract
        res.append({"标题": title, "正文": content, "摘要": abstract})
    return res


if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, corpus = load_data(path)
    res = generate_abstract(tf_idf_dict, corpus)
    writer = open("abstract.json", "w", encoding="utf8")
    writer.write(json.dumps(res, ensure_ascii=False, indent=2))
    writer.close()

6.6 TFIDF application - text similarity calculation

  • After calculating tfidf for all texts, select the first n words with higher tfidf from each text to obtain a set S of words.
  • For each text D, calculate the word frequency of each word in S, and use it as a vector of the text.
  • By calculating the cosine value of the vector angle, the vector similarity is obtained as the similarity of the text
  • Cosine of the angle between two vectors: cos(θ) = (x · y) / (|x| |y|), i.e. the dot product divided by the product of the vector norms

code example

#coding:utf8
import jieba
import math
import os
import json
from collections import defaultdict
from calculate_tfidf import calculate_tfidf, tf_idf_topk

"""
基于tfidf实现文本相似度计算
"""

jieba.initialize()

#加载文档数据(可以想象成网页数据),计算每个网页的tfidf字典
#之后统计每篇文档重要在前10的词,统计出重要词词表
#重要词词表用于后续文本向量化
def load_data(file_path):
    corpus = []
    with open(file_path, encoding="utf8") as f:
        documents = json.loads(f.read())
        for document in documents:
            corpus.append(document["title"] + "\n" + document["content"])
    tf_idf_dict = calculate_tfidf(corpus)
    topk_words = tf_idf_topk(tf_idf_dict, top=5, print_word=False)
    vocab = set()
    for words in topk_words.values():
        for word, score in words:
            vocab.add(word)
    print("词表大小:", len(vocab))
    return tf_idf_dict, list(vocab), corpus


#passage是文本字符串
#vocab是词列表
#向量化的方式:计算每个重要词在文档中的出现频率
def doc_to_vec(passage, vocab):
    vector = [0] * len(vocab)
    passage_words = jieba.lcut(passage)
    for index, word in enumerate(vocab):
        vector[index] = passage_words.count(word) / len(passage_words)
    return vector

#先计算所有文档的向量
def calculate_corpus_vectors(corpus, vocab):
    corpus_vectors = [doc_to_vec(c, vocab) for c in corpus]
    return corpus_vectors

#计算向量余弦相似度
def cosine_similarity(vector1, vector2):
    x_dot_y = sum([x*y for x, y in zip(vector1, vector2)])
    sqrt_x = math.sqrt(sum([x ** 2 for x in vector1]))
    sqrt_y = math.sqrt(sum([x ** 2 for x in vector2]))
    if sqrt_x == 0 or sqrt_y == 0:
        return 0
    return x_dot_y / (sqrt_x * sqrt_y + 1e-7)


#输入一篇文本,寻找最相似文本
def search_most_similar_document(passage, corpus_vectors, vocab):
    input_vec = doc_to_vec(passage, vocab)
    result = []
    for index, vector in enumerate(corpus_vectors):
        score = cosine_similarity(input_vec, vector)
        result.append([index, score])
    result = sorted(result, reverse=True, key=lambda x:x[1])
    return result[:4]


if __name__ == "__main__":
    path = "news.json"
    tf_idf_dict, vocab, corpus = load_data(path)
    corpus_vectors = calculate_corpus_vectors(corpus, vocab)
    passage = "魔兽争霸"
    for corpus_index, score in search_most_similar_document(passage, corpus_vectors, vocab):
        print("相似文章:\n", corpus[corpus_index].strip())
        print("得分:", score)
        print("--------------")

6.7 Advantages of TFIDF

  1. Good interpretability: you can see the key words directly, and even when a prediction is wrong it is easy to find the reason
  2. Fast to compute: word segmentation itself takes the most time, and the rest is simple counting
  3. Little dependence on labeled data; part of the work can be done with unlabeled corpora
  4. It can be combined with many other algorithms, where it serves as word weights

6.8 Disadvantages of TFIDF

  1. Affected by the quality of word segmentation
  2. Words have no notion of semantic similarity to each other
  3. No word-order information (bag-of-words model)
  4. Limited range of capability; it cannot handle complex tasks such as machine translation or entity mining
  5. Sample imbalance can have a big impact on the results
  6. The distribution of samples within a class is not considered

7. Calculation and use of TFIDF

code example

import jieba
import math
import os
import json
from collections import defaultdict

"""
tfidf的计算和使用
"""

#统计tf和idf值
def build_tf_idf_dict(corpus):
    tf_dict = defaultdict(dict)  #key:文档序号,value:dict,文档中每个词出现的频率
    idf_dict = defaultdict(set)  #key:词, value:set,文档序号,最终用于计算每个词在多少篇文档中出现过
    for text_index, text_words in enumerate(corpus):
        for word in text_words:
            if word not in tf_dict[text_index]:
                tf_dict[text_index][word] = 0
            tf_dict[text_index][word] += 1
            idf_dict[word].add(text_index)
    idf_dict = dict([(key, len(value)) for key, value in idf_dict.items()])
    return tf_dict, idf_dict

#根据tf值和idf值计算tfidf
def calculate_tf_idf(tf_dict, idf_dict):
    tf_idf_dict = defaultdict(dict)
    for text_index, word_tf_count_dict in tf_dict.items():
        for word, tf_count in word_tf_count_dict.items():
            tf = tf_count / sum(word_tf_count_dict.values())
            #tf-idf = tf * log(D/(idf + 1))
            tf_idf_dict[text_index][word] = tf * math.log(len(tf_dict)/(idf_dict[word]+1))
    return tf_idf_dict

#输入语料 list of string
#["xxxxxxxxx", "xxxxxxxxxxxxxxxx", "xxxxxxxx"]
def calculate_tfidf(corpus):
    #先进行分词
    corpus = [jieba.lcut(text) for text in corpus]
    tf_dict, idf_dict = build_tf_idf_dict(corpus)
    tf_idf_dict = calculate_tf_idf(tf_dict, idf_dict)
    return tf_idf_dict

#根据tfidf字典,显示每个领域topK的关键词
def tf_idf_topk(tfidf_dict, paths=[], top=10, print_word=True):
    topk_dict = {}
    for text_index, text_tfidf_dict in tfidf_dict.items():
        word_list = sorted(text_tfidf_dict.items(), key=lambda x:x[1], reverse=True)
        topk_dict[text_index] = word_list[:top]
        if print_word:
            print(text_index, paths[text_index])
            for i in range(top):
                print(word_list[i])
            print("----------")
    return topk_dict

def main():
    dir_path = r"category_corpus/"
    corpus = []
    paths = []
    for path in os.listdir(dir_path):
        path = os.path.join(dir_path, path)
        if path.endswith("txt"):
            corpus.append(open(path, encoding="utf8").read())
            paths.append(os.path.basename(path))
    tf_idf_dict = calculate_tfidf(corpus)
    tf_idf_topk(tf_idf_dict, paths)

if __name__ == "__main__":
    main()

Origin: blog.csdn.net/m0_63260018/article/details/132484674