n-gram language model - text generation source code

  In natural language processing, the n-gram language model is a basic but powerful tool. It is widely used in text generation tasks, predicting the next word from the sequence of words that precede it. This blog discusses how to use an n-gram model for text generation, particularly on Chinese text, using jieba for word segmentation and the nltk library for model construction.

  I explained the principle of n-grams in my previous blog; see: n-gram language model - calculation and smoothing of sentence probability distribution.

Basic principles of n-gram model

  The n-gram model is based on a simple assumption: the occurrence of a word depends only on a limited number of the words preceding it. Depending on how many previous words are considered, the model can be a bigram model, a trigram model, and so on.

  Taking the bigram as an example, the probability of a word is approximated as depending only on the word immediately before it:

$$p(w_i \mid w_1, \dots, w_{i-1}) \approx p(w_i \mid w_{i-1})$$

  To make $p(w_i \mid w_{i-1})$ well defined when $i = 1$, we usually add a start tag (BOS) at the beginning of the sentence and an end tag (EOS) at the end, and include them in the probability calculation. For example, the probability of "Mark wrote a book" is computed as:

$$p(\text{Mark wrote a book}) = p(\text{Mark} \mid \text{BOS}) \cdot p(\text{wrote} \mid \text{Mark}) \cdot p(\text{a} \mid \text{wrote}) \cdot p(\text{book} \mid \text{a}) \cdot p(\text{EOS} \mid \text{book})$$

  To estimate $p(w_i \mid w_{i-1})$, we can simply count how often each word pair occurs in a given text and then normalize. If $c$ denotes the number of occurrences in that text, we can use the following formula:

$$p(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{\sum_{w} c(w_{i-1}, w)}$$

  The above formula is the maximum likelihood estimate (MLE). The same idea applies to higher-order n-gram models.
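
  As a minimal sketch of this MLE estimate (using a toy English corpus made up purely for illustration, not the Chinese corpus used later), the counts and the normalization can be computed directly:

from collections import Counter

# Toy corpus (hypothetical): each sentence wrapped with BOS/EOS markers
sentences = [
    ["BOS", "Mark", "wrote", "a", "book", "EOS"],
    ["BOS", "Mark", "wrote", "a", "letter", "EOS"],
]

bigram_counts = Counter()   # c(w_{i-1}, w_i)
history_counts = Counter()  # sum over w of c(w_{i-1}, w)
for sent in sentences:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        history_counts[w1] += 1

def mle_prob(prev, word):
    # p(word | prev) = c(prev, word) / sum_w c(prev, w)
    return bigram_counts[(prev, word)] / history_counts[prev]

print(mle_prob("wrote", "a"))  # 1.0: "wrote" is always followed by "a"
print(mle_prob("a", "book"))   # 0.5: "a" is followed by "book" half the time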

Text generation steps

1. Preparation and word segmentation

  Use jieba to segment the Chinese text. This is the first step when building an n-gram model for Chinese; the segmented tokens are then used to construct the model.
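
  A minimal sketch of this step is shown below; the filename ylk.txt matches the corpus file used in the full source code later, and any UTF-8 Chinese text file would work:

import jieba

# Read the raw Chinese corpus and segment it into tokens
text = open("ylk.txt", encoding="utf-8").read()
tokens = jieba.lcut(text)  # lcut returns a list instead of a generator
print(tokens[:10])         # inspect the first few tokens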

2. Build n-gram model

  Use nltk's ngrams function to create a sequence of bigrams from the word segmentation results. These bigrams are then used to construct a conditional frequency distribution object for subsequent text generation.
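
  For example, assuming tokens is the token list from the previous step, the bigrams and the conditional frequency distribution can be built like this:

import nltk
from nltk import ConditionalFreqDist

# Each bigram (w1, w2) is treated as a (condition, sample) pair:
# the condition is the previous word, the sample is the word that follows it
bi_grams = list(nltk.ngrams(tokens, 2))
cfd = ConditionalFreqDist(bi_grams)

# cfd[w] is a FreqDist over the words observed immediately after w
first_word = bi_grams[0][0]
print(cfd[first_word].most_common(5))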

3. Application of smoothing technology

  To handle word combinations that do not appear in the training data, the n-gram model needs smoothing. Lidstone smoothing and Laplace smoothing are two common methods: by adding a small non-zero value to every count, they avoid the zero-probability problem and make the model more robust.

For details on Lidstone and Laplace smoothing, see the previous blog: n-gram language model - calculation and smoothing of sentence probability distribution.
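
  As a small sketch (with a made-up frequency distribution), this is how the two estimators wrap a single FreqDist in nltk; bins fixes the assumed vocabulary size so that unseen words also get probability mass:

from nltk import FreqDist
from nltk.probability import LidstoneProbDist, LaplaceProbDist

fd = FreqDist(["书", "书", "信"])  # toy counts: "书" twice, "信" once

# Lidstone: add gamma (here 0.1) to every count
lidstone = LidstoneProbDist(fd, gamma=0.1, bins=fd.B() + 1)
# Laplace: add-one smoothing, i.e. Lidstone with gamma = 1
laplace = LaplaceProbDist(fd, bins=fd.B() + 1)

print(lidstone.prob("书"))    # slightly below the MLE value 2/3
print(laplace.prob("没见过"))  # an unseen word still gets non-zero probability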

4. Generate text

  Text generation starts from an initial word and repeatedly samples the next word according to the conditional (smoothed) distribution. This is repeated until the required number of words is reached or a stopping condition is met.

Source code

  The following is a complete example of generating text with Laplace- and Lidstone-smoothed bigram models:

import random

import jieba
import nltk
from nltk import ConditionalFreqDist
from nltk.probability import LidstoneProbDist, LaplaceProbDist

# Example corpus: read ylk.txt (a UTF-8 Chinese text file)
text = open("ylk.txt", encoding="utf-8").read()
# Segment the text with jieba
tokens = jieba.cut(text)
# Build bigrams from the token stream
bi_grams = list(nltk.ngrams(tokens, 2))
# Conditional frequency distribution: cfd[w] counts the words that follow w
cfd = ConditionalFreqDist(bi_grams)

# Lidstone smoothing (additive smoothing with gamma < 1)
lidstone_cfd = {
    condition: LidstoneProbDist(cfd[condition], gamma=0.1)
    for condition in cfd.conditions()
}

# Laplace smoothing (the special case gamma = 1)
laplace_cfd = {
    condition: LaplaceProbDist(cfd[condition])
    for condition in cfd.conditions()
}

def generate_text(initial_word, cfd, num_words=50):
    current_word = initial_word
    generated_text = [current_word]

    for _ in range(num_words - 1):
        # Stop if the current word was never seen as a bigram history
        if current_word not in cfd:
            break
        # Sample the next word according to the smoothed conditional distribution
        next_word = random.choices(
            population=list(cfd[current_word].samples()),
            weights=[cfd[current_word].prob(w) for w in cfd[current_word].samples()]
        )[0]
        generated_text.append(next_word)
        current_word = next_word

    return ''.join(generated_text)

# Example: generate 100 words starting from the protagonist's name "方锐"
print(generate_text("方锐", laplace_cfd, 100))
print(generate_text("方锐", lidstone_cfd, 100))

  For the corpus, I downloaded an online novel of more than 4 MB. Text is then generated starting from the protagonist's name; the results are shown below:

(Screenshot of the generated text output.)

  Subjectively, I feel the Lidstone-smoothed output reads better.

  Combining the n-gram model with smoothing makes it possible to generate text that roughly follows the patterns of the language. Although simple, this method is still useful in many application scenarios, especially when resources are limited.

  With the development of deep learning, far more powerful language models have emerged, and the n-gram model no longer feels well suited to text generation~

Origin blog.csdn.net/qq_43592352/article/details/134351200