Language Model

Introduction

A language model is a model used to predict the probability distribution of the next word or character in a sequence of text. It can capture certain aspects of language structure, such as grammar, sentence structure and contextual information. Traditional language models usually use N-gram methods or hidden Markov models, but these models often cannot capture long-distance dependencies and complex semantic information.

1. What is a language model

In layman's terms, a language model evaluates whether a sentence sounds "reasonable" or "human-like".
Mathematically, a language model is used to compute the probability of a sentence (or any text), for example:

P(today's weather is good) > P(today's bad weather)

2. The main purpose of the language model

2.1 Language Model - Speech Recognition

  • Speech Recognition: Sound -> Text
  • Sound is a wave
  • Divide the wave into frames by time, e.g. 25 ms per frame
  • Perform acoustic feature extraction to convert each frame into a vector
  • The acoustic feature vectors are fed into an acoustic model, which predicts phonemes
  • Phonemes are similar to pinyin, but with tones taken into account
  • A phoneme sequence corresponds to many possible text sequences; the language model selects the one with the highest probability of forming a sentence
  • Use beam search or Viterbi to decode
  • Speech recognition pipeline diagram (figure omitted); a minimal sketch of the language-model selection step follows
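
A minimal sketch of that last selection step, assuming the acoustic model and decoder have already produced several candidate transcripts. The candidate list and the scoring function below are illustrative stand-ins; any sentence-probability function (e.g. the n-gram model in section 4.2) could be plugged in.

def pick_best_transcript(candidates, sentence_prob):
    # score every candidate text sequence with the language model
    # and return the one with the highest probability of forming a sentence
    return max(candidates, key=sentence_prob)

# hypothetical candidates produced from the same phoneme sequence
candidates = ["the whether is nice today", "the weather is nice today"]
print(pick_best_transcript(candidates, lambda s: 0.002 if "weather" in s else 0.0001))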

2.2 Language Model - Handwriting Recognition


  • The recognition model converts the text in the image into candidate Chinese characters (usually in two steps: localization and recognition); the language model then selects the sequence with the highest probability of forming a sentence

2.3 Language Model - Input Method

  • The input is a pinyin sequence; each pinyin syllable has multiple candidate Chinese characters, and a high-probability sequence is selected according to the language model (see the sketch after this list)
  • The input method is a detail-heavy task: on top of the basic language-model algorithm, it must also handle common typing errors, common mispronunciations, pinyin abbreviations, mixed Chinese and English, symbol output, user habits, and so on
  • Handwriting input and voice input work the same way
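
A toy sketch of the input-method idea: every pinyin syllable maps to several candidate characters, and a bigram language model scores character transitions so that a high-probability sequence can be picked with beam search. The candidate table and bigram probabilities below are made up purely for illustration.

import math

def decode_pinyin(pinyin_seq, candidates, bigram_prob, beam_size=3):
    beams = [("", 0.0)]                      # (sequence so far, log-probability)
    for syllable in pinyin_seq:
        new_beams = []
        for seq, logp in beams:
            for char in candidates[syllable]:
                prev = seq[-1] if seq else "<sos>"
                p = bigram_prob.get((prev, char), 1e-6)   # crude smoothing for unseen pairs
                new_beams.append((seq + char, logp + math.log(p)))
        beams = sorted(new_beams, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

candidates = {"jin": ["今", "金"], "tian": ["天", "田"]}
bigram_prob = {("<sos>", "今"): 0.4, ("今", "天"): 0.5}
print(decode_pinyin(["jin", "tian"], candidates, bigram_prob))   # -> 今天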

3. Classification of language models

  1. Statistics-based language models
  • Count word frequencies, word order, and word co-occurrence in a corpus
  • Compute the relevant probabilities to obtain the language model
  • Representative: the N-gram language model
  2. Neural-network-based language models
  • Train a model with a chosen network structure on the corpus
  • Representatives: LSTM language models, BERT, etc.
  3. Autoregressive language models
  • At training time, predict the following text from the preceding text (or vice versa)
  • One-directional models that use context from only one side
  • Representatives: N-gram, ELMo
  4. Autoencoding language models
  • At training time, predict a token at any position in the sequence
  • Bidirectional models that absorb context from both sides
  • Representative: BERT

4. N-gram language model

The N-gram language model is a basic language model that predicts the probability of the next word or character based on the preceding N-1 words or characters. The model treats text as an ordered sequence of words or characters and assumes that the nth word depends only on the preceding N-1 words.

For example, in a bigram (2-gram) model, each word depends only on the word before it; the probability that "I eat" is followed by "apple" can be written as P(apple | I eat).

Advantages:

  1. Simple calculation: only need to count word frequency and conditional word frequency.
  2. Easy to implement: No complicated algorithms are required.

Disadvantages:

  1. Sparsity problem: As N increases, the storage space required by the model increases sharply, and most N-gram combinations may not exist in actual data.
  2. Context limitation: Only the context information of N-1 words can be captured.

Despite these limitations, N-gram models are still widely used in many application scenarios due to their simplicity and efficiency, such as spell checking, speech recognition, and machine translation.

How to calculate the sentence probability?

  • Use S for sentences and w for individual words or phrases
  • S = w1w2w3w4w5…wn
  • P(S) = P(w1,w2,w3,w4,w5…wn)
  • Sentence probability -> the probability that words w1 ... wn appear in this order
  • P(w1,w2,w3,…,wn) = P(w1)P(w2|w1)P(w3|w1,w2)…P(wn|w1,…,wn-1)

At the character level (treating each Chinese character as a token):

  • P(今天天气不错) = P(今) * P(天|今) * P(天|今天) * P(气|今天天) * P(不|今天天气) * P(错|今天天气不)

At the word level:

  • P(今天天气不错) = P(今天) * P(天气|今天) * P(不错|今天天气), i.e. P(today) * P(weather|today) * P(nice|today weather)

How do we calculate P(today)?

  • P(today) = Count(today) / Count(all words in the corpus)
  • P(weather | today) = Count(today weather) / Count(today)
  • P(nice | today weather) = Count(today weather nice) / Count(today weather)
  • "today weather" is a 2-gram (bigram)
  • "today weather nice" is a 3-gram (trigram)

Difficulty: Too many sentences!

  • For any language, the number of possible N-grams is far too large to enumerate, so the model needs to be simplified

Markov hypothesis

  • P(wn|w1,…,wn-1) ≈ P(wn|wn-3,wn-2,wn-1)
  • Assume that the probability of the nth word depends only on a limited number of words immediately before it
  • At the character level: P(今天天气不错) ≈ P(今) * P(天|今) * P(天|今天) * P(气|天天) * P(不|天气) * P(错|不)

Flaws of the Markov assumption:

  1. The factors that influence the nth word may appear far before it
    (long-distance dependency)

Example: I read a book about Markov's life
I watched a movie about Markov's life
I heard a story about Markov's life

  2. The factors that influence the nth word may appear after it
  3. The factors that influence the nth word may not appear in the text at all

However, a very effective model can still be obtained under the Markov assumption.

Corpus (four sentences, each segmented into three words):

The weather is nice today      -> today / weather / nice
The weather is nice tomorrow   -> tomorrow / weather / nice
The weather is bad today       -> today / weather / bad
It is sunny today              -> today / is / sunny

Counting gives:

P(today) = Count(today) / Count(all words) = 3 / 12 = 1/4
P(weather | today) = Count(today weather) / Count(today) = 2 / 3
P(nice | today weather) = Count(today weather nice) / Count(today weather) = 1 / 2
P(nice | weather) = Count(weather nice) / Count(weather) = 2 / 3

3-gram model:
P(today weather nice) = P(today) * P(weather|today) * P(nice|today weather) = 1/4 * 2/3 * 1/2 = 1/12

2-gram model:
P(today weather nice) = P(today) * P(weather|today) * P(nice|weather) = 1/4 * 2/3 * 2/3 = 1/9
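
For concreteness, here is a minimal sketch (using the English stand-ins above for the segmented words) that reproduces the 2-gram number:

from collections import Counter

corpus = [["today", "weather", "nice"],
          ["tomorrow", "weather", "nice"],
          ["today", "weather", "bad"],
          ["today", "is", "sunny"]]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
total = sum(unigrams.values())                       # 12 words in total

p_today = unigrams["today"] / total                  # 3 / 12 = 1/4
p_weather_given_today = bigrams[("today", "weather")] / unigrams["today"]   # 2 / 3
p_nice_given_weather = bigrams[("weather", "nice")] / unigrams["weather"]   # 2 / 3

# 2-gram sentence probability: 1/4 * 2/3 * 2/3 = 1/9
print(p_today * p_weather_given_today * p_nice_given_weather)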


Question: how do we assign a probability to a word or ngram that never appears in the corpus?
P(today weather terrible) = P(today) * P(weather|today) * P(terrible|weather)

  • Smoothing problem (smoothing)
  • Theoretically, the probability of a sentence formed by any combination of words should not be zero
  • How to assign probabilities to unseen words or ngrams is a smoothing problem
  • Also known as the discounting problem (discounting)

4.1 N-gram language model - smoothing method

  1. Back-off

When the trigram abc has not been seen, back off to the bigram bc:
P(c | ab) = P(c | b) * Bow(ab)
Bow(ab) is called the back-off weight of the bigram ab.
There are many ways to compute the back-off weight; it can even be set to a constant.
Back-off can be applied iteratively, e.g. for the sequence abcd:
P(d | abc) = P(d | bc) * Bow(abc)
P(d | bc) = P(d | c) * Bow(bc)
P(d | c) = P(d) * Bow(c)
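
A minimal standalone sketch of iterative back-off with explicit back-off weights; the toy probability and Bow tables are made up, and the class in section 4.2 below builds the same mechanism from real counts (with a fixed Bow).

prob = {"b_c": 0.2, "c": 0.1}          # known ngram probabilities, keyed like "b_c" for P(c | b)
bow = {"a_b": 0.5, "b": 0.4}           # back-off weights of the history prefixes

def backoff_prob(ngram):
    if ngram in prob:
        return prob[ngram]
    words = ngram.split("_")
    if len(words) == 1:
        return 1e-5                     # unseen unigram: small constant probability
    prefix = "_".join(words[:-1])       # history of the ngram, e.g. "a_b" for "a_b_c"
    shorter = "_".join(words[1:])       # drop the oldest word: "a_b_c" -> "b_c"
    return bow.get(prefix, 0.4) * backoff_prob(shorter)

print(backoff_prob("a_b_c"))            # Bow(ab) * P(c | b) = 0.5 * 0.2 = 0.1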

  2. Add-one smoothing

How do we handle a P(word) that does not exist? Use add-one (Laplace) smoothing:
For the unigram probability, P(word) = (Count(word) + 1) / (Count(total_words) + V),
where V is the vocabulary size.
The same idea also works for higher-order probabilities, e.g. P(wn | wn-1) = (Count(wn-1 wn) + 1) / (Count(wn-1) + V).
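
A small sketch of add-one smoothing, assuming `count` holds raw unigram counts, `bigram_count` raw bigram counts, and V is the vocabulary size:

def unigram_prob_add1(word, count, total_words, V):
    # unseen words get a small but non-zero probability
    return (count.get(word, 0) + 1) / (total_words + V)

def bigram_prob_add1(prev, word, bigram_count, count, V):
    # the same idea applied to a higher-order (bigram) probability
    return (bigram_count.get((prev, word), 0) + 1) / (count.get(prev, 0) + V)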

  3. Replacing low-frequency words with <unk>

Replace low-frequency words in the corpus with an <unk> token, and likewise replace unseen words encountered at prediction time with <unk>.
For example, 一语成谶 -> 一语成<unk>, scored as P(<unk> | 一语成).
This is a common way in NLP to handle out-of-vocabulary (OOV) words.

  4. Interpolation

Inspired by back-off smoothing, when computing a higher-order ngram probability, the lower-order ngram probabilities are taken into account at the same time, and the final value is their interpolation:
P(wn | wn-2 wn-1) = λ1 * P(wn | wn-2 wn-1) + λ2 * P(wn | wn-1) + λ3 * P(wn), with λ1 + λ2 + λ3 = 1

Practice has shown that this method improves results.
The λ values can be determined by tuning on a validation set.
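
A short sketch of interpolation for a trigram: mix the trigram, bigram and unigram estimates with weights that sum to 1. The lambda values and the probabilities in the usage line are arbitrary illustrations.

def interpolated_trigram_prob(p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9      # the weights must sum to 1
    return l3 * p3 + l2 * p2 + l1 * p1

# e.g. interpolating three estimates of P(nice | today, weather)
print(interpolated_trigram_prob(0.5, 0.67, 0.17))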

4.2 ngram code

import math
from collections import defaultdict


class NgramLanguageModel:
    def __init__(self, corpus=None, n=3):
        self.n = n
        self.sep = "_"     # separator used to join words into an ngram key; any symbol absent from the vocabulary works
        self.sos = "<sos>"    #start-of-sentence token
        self.eos = "<eos>"    #end-of-sentence token
        self.unk_prob = 1e-5  #small probability assigned to unknown words so OOV probability is not zero
        self.fix_backoff_prob = 0.4  #fixed back-off probability
        self.ngram_count_dict = dict((x + 1, defaultdict(int)) for x in range(n))
        self.ngram_count_prob_dict = dict((x + 1, defaultdict(int)) for x in range(n))
        self.ngram_count(corpus)
        self.calc_ngram_prob()

    #split the text into words, characters, or tokens
    def sentence_segment(self, sentence):
        return sentence.split()
        #return jieba.lcut(sentence)

    #count ngrams of every order
    def ngram_count(self, corpus):
        for sentence in corpus:
            word_lists = self.sentence_segment(sentence)
            word_lists = [self.sos] + word_lists + [self.eos]  #pad with start and end tokens
            for window_size in range(1, self.n + 1):           #scan the text with each window size
                for index, word in enumerate(word_lists):
                    #near the end the window is shorter than the specified size; skip those positions
                    if len(word_lists[index:index + window_size]) != window_size:
                        continue
                    #join the words with the separator to form the ngram key used for storage
                    ngram = self.sep.join(word_lists[index:index + window_size])
                    self.ngram_count_dict[window_size][ngram] += 1
        #total word count, used later for unigram probabilities
        self.ngram_count_dict[0] = sum(self.ngram_count_dict[1].values())
        return

    #compute ngram probabilities
    def calc_ngram_prob(self):
        for window_size in range(1, self.n + 1):
            for ngram, count in self.ngram_count_dict[window_size].items():
                if window_size > 1:
                    ngram_splits = ngram.split(self.sep)              #ngram        :a b c
                    ngram_prefix = self.sep.join(ngram_splits[:-1])   #ngram_prefix :a b
                    ngram_prefix_count = self.ngram_count_dict[window_size - 1][ngram_prefix] #Count(a,b)
                else:
                    ngram_prefix_count = self.ngram_count_dict[0]     #count(total word)
                # word = ngram_splits[-1]
                # self.ngram_count_prob_dict[word + "|" + ngram_prefix] = count / ngram_prefix_count
                self.ngram_count_prob_dict[window_size][ngram] = count / ngram_prefix_count
        return

    #look up an ngram probability, using back-off smoothing with a fixed back-off weight
    def get_ngram_prob(self, ngram):
        n = len(ngram.split(self.sep))
        if ngram in self.ngram_count_prob_dict[n]:
            #try a direct lookup first
            return self.ngram_count_prob_dict[n][ngram]
        elif n == 1:
            #a missing unigram means an out-of-vocabulary word; no further back-off
            return self.unk_prob
        else:
            #higher-order ngrams can back off to a shorter history
            ngram = self.sep.join(ngram.split(self.sep)[1:])
            return self.fix_backoff_prob * self.get_ngram_prob(ngram)


    #compute sentence perplexity using the back-off probabilities
    def calc_sentence_ppl(self, sentence):
        word_list = self.sentence_segment(sentence)
        word_list = [self.sos] + word_list + [self.eos]
        sentence_prob = 0
        for index, word in enumerate(word_list):
            ngram = self.sep.join(word_list[max(0, index - self.n + 1):index + 1])
            prob = self.get_ngram_prob(ngram)
            # print(ngram, prob)
            sentence_prob += math.log(prob, 2)   #accumulate log2 probabilities to match the base-2 exponent below
        return 2 ** (sentence_prob * (-1 / len(word_list)))



if __name__ == "__main__":
    corpus = open("sample.txt", encoding="utf8").readlines()
    lm = NgramLanguageModel(corpus, 3)
    print("词总数:", lm.ngram_count_dict[0])
    print(lm.ngram_count_prob_dict)
    print(lm.calc_sentence_ppl("e f g b d"))

4.3 Evaluation metrics for language models

  • Perplexity (PPL):
    PPL(S) = P(w1, w2, ..., wn) ^ (-1/n), i.e. the n-th root of 1 / P(w1, w2, ..., wn)
  • The PPL value is inversely related to the sentence probability

PPL is usually computed on reasonable target text. A low PPL means the model assigns the text a high probability of forming a sentence, i.e. the model considers the text highly plausible, which indicates a good language model.

  • An alternative form of PPL replaces the product of fractions with a sum of logarithms:
    PPL(S) = 2 ^ ( -(1/n) * Σ log2 P(wi | w1, ..., wi-1) )
  • The essence is the same: it is inversely related to the probability of forming a sentence (the small numeric check below shows the two forms agree)
  • Food for thought: the smaller the PPL, the better the language model. Is this conclusion correct?
  • Sentence probability is a relative value!
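
A small numeric check that the two PPL forms agree: the n-th root of 1 / P(sentence) equals 2 raised to the negative average log2 probability. The per-word probabilities are stand-in values.

import math

word_probs = [0.25, 0.66, 0.5]                 # stand-in values of P(w_i | history)
n = len(word_probs)

sentence_prob = math.prod(word_probs)
ppl_root = (1 / sentence_prob) ** (1 / n)

avg_log = sum(math.log(p, 2) for p in word_probs) / n
ppl_log = 2 ** (-avg_log)

print(ppl_root, ppl_log)                       # both are about 2.30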

4.4 Comparison of two types of language models

(Figure omitted: a comparison table of statistical vs. neural language models.)

5. Neural Network Language Model

  • Bengio et al., 2003
  • Similar to the ngram model: use the previous n words to predict the next word
  • The output is a probability distribution over the vocabulary
  • Word vectors are obtained as a by-product

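A minimal PyTorch sketch of such a feed-forward (Bengio-style) language model; the layer sizes and the random input are arbitrary illustrations, not the original paper's exact architecture:

import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, n_context=3, emb_dim=64, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # the word vectors fall out as a by-product
        self.hidden = nn.Linear(n_context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                 # context_ids: (batch, n_context)
        e = self.embedding(context_ids)             # (batch, n_context, emb_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))   # concatenate the context embeddings and transform
        return self.out(h)                          # logits over the vocabulary

logits = FeedForwardLM(vocab_size=1000)(torch.randint(0, 1000, (2, 3)))
print(logits.shape)                                 # torch.Size([2, 1000])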

  • As research developed, the hidden-layer structure became more and more complex
  • DNN -> CNN/RNN -> LSTM/GRU -> Transformer


  • Devlin et al., 2018: the birth of BERT
  • Main feature: instead of training the language model to predict the next word, it predicts words that are randomly masked out in the text
  • This approach is called MLM (masked language model); a rough sketch of how such training data is built follows
  • In fact, this idea was proposed long before and was not original to BERT
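
A rough sketch of building MLM training data: randomly replace a fraction of tokens with a mask id and keep the original ids as labels only at the masked positions. This is a simplified illustration, not BERT's exact recipe (the 80/10/10 replacement trick is omitted); `mask_id` and the -100 ignore label are assumed conventions.

import random

def mask_tokens(token_ids, mask_id, mask_ratio=0.15):
    inputs, labels = [], []
    for tid in token_ids:
        if random.random() < mask_ratio:
            inputs.append(mask_id)      # the model must predict the covered token
            labels.append(tid)
        else:
            inputs.append(tid)
            labels.append(-100)         # position ignored by the cross-entropy loss
    return inputs, labels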

The code:

#coding:utf8

import torch
import torch.nn as nn
import numpy as np
import math
import random
import os
import re
import matplotlib.pyplot as plt

"""
An RNN language model implemented with PyTorch
"""

class LanguageModel(nn.Module):
    def __init__(self, input_dim, vocab):
        super(LanguageModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab) + 1, input_dim)
        self.layer = nn.RNN(input_dim, input_dim, num_layers=2, batch_first=True)
        self.classify = nn.Linear(input_dim, len(vocab) + 1)
        self.dropout = nn.Dropout(0.1)
        self.loss = nn.functional.cross_entropy

    #return the loss when true labels are provided; otherwise return the predicted distribution
    def forward(self, x, y=None):
        x = self.embedding(x)  #output shape:(batch_size, sen_len, input_dim)
        x, _ = self.layer(x)      #output shape:(batch_size, sen_len, input_dim)
        x = x[:, -1, :]        #output shape:(batch_size, input_dim)
        x = self.dropout(x)
        y_pred = self.classify(x)   #output shape:(batch_size, len(vocab)+1)
        if y is not None:
            return self.loss(y_pred, y)
        else:
            return torch.softmax(y_pred, dim=-1)

#read the corpus to collect the character set
#and write it out as a vocabulary file
def build_vocab_from_corpus(path):
    vocab = set()
    with open(path, encoding="utf8") as f:
        for index, char in enumerate(f.read()):
            vocab.add(char)
    vocab.add("<UNK>") #增加一个unk token用来处理未登录词
    writer = open("vocab.txt", "w", encoding="utf8")
    for char in sorted(vocab):
        writer.write(char + "\n")
    return vocab

#load the vocabulary file
def build_vocab(vocab_path):
    vocab = {}
    with open(vocab_path, encoding="utf8") as f:
        for index, line in enumerate(f):
            char = line[:-1]        #strip the trailing newline
            vocab[char] = index + 1 #reserve index 0 for the pad token
        vocab["\n"] = 1
    return vocab

#load the corpus
def load_corpus(path):
    return open(path, encoding="utf8").read()

#randomly generate one training sample
#take a random window from the text: the first n characters are the input, the following character is the target
def build_sample(vocab, window_size, corpus):
    start = random.randint(0, len(corpus) - 1 - window_size)
    end = start + window_size
    window = corpus[start:end]
    target = corpus[end]
    # print(window, target)
    x = [vocab.get(word, vocab["<UNK>"]) for word in window]   #convert characters to indices
    y = vocab[target]
    return x, y

#build a dataset
#sample_length: number of samples to generate
#vocab: vocabulary
#window_size: length of each sample
#corpus: the corpus string
def build_dataset(sample_length, vocab, window_size, corpus):
    dataset_x = []
    dataset_y = []
    for i in range(sample_length):
        x, y = build_sample(vocab, window_size, corpus)
        dataset_x.append(x)
        dataset_y.append(y)
    return torch.LongTensor(dataset_x), torch.LongTensor(dataset_y)

#build the model
def build_model(vocab, char_dim):
    model = LanguageModel(char_dim, vocab)
    return model


#compute the perplexity of a text
def calc_perplexity(sentence, model, vocab, window_size):
    prob = 0
    model.eval()
    with torch.no_grad():
        for i in range(1, len(sentence)):
            start = max(0, i - window_size)
            window = sentence[start:i]
            x = [vocab.get(char, vocab["<UNK>"]) for char in window]
            x = torch.LongTensor([x])
            target = sentence[i]
            target_index = vocab.get(target, vocab["<UNK>"])
            if torch.cuda.is_available():
                x = x.cuda()
            pred_prob_distribute = model(x)[0]
            target_prob = pred_prob_distribute[target_index]
            prob += math.log(target_prob, 2)   #use base 2 to match the 2 ** exponent below
    return 2 ** (prob * ( -1 / len(sentence)))


def train(corpus_path, save_weight=True):
    epoch_num = 10        #number of training epochs
    batch_size = 128       #number of samples per batch
    train_sample = 10000   #total number of samples per epoch
    char_dim = 128        #embedding dimension of each character
    window_size = 6       #length of each sample text
    vocab = build_vocab("vocab.txt")       #build the vocabulary
    corpus = load_corpus(corpus_path)     #load the corpus
    model = build_model(vocab, char_dim)    #build the model
    if torch.cuda.is_available():
        model = model.cuda()
    optim = torch.optim.Adam(model.parameters(), lr=0.001)   #set up the optimizer
    for epoch in range(epoch_num):
        model.train()
        watch_loss = []
        for batch in range(int(train_sample / batch_size)):
            x, y = build_dataset(batch_size, vocab, window_size, corpus) #build one batch of training samples
            if torch.cuda.is_available():
                x, y = x.cuda(), y.cuda()
            optim.zero_grad()    #zero the gradients
            loss = model(x, y)   #compute the loss
            watch_loss.append(loss.item())
            loss.backward()      #backpropagate
            optim.step()         #update the weights
        print("=========\nEpoch %d average loss: %f" % (epoch + 1, np.mean(watch_loss)))
    if not save_weight:
        return
    else:
        base_name = os.path.basename(corpus_path).replace("txt", "pth")
        model_path = os.path.join("model", base_name)
        torch.save(model.state_dict(), model_path)
        return

#train on every corpus file under the corpus folder and save each trained model to the model folder, named after its file
def train_all():
    for path in os.listdir("corpus"):
        corpus_path = os.path.join("corpus", path)
        train(corpus_path)


if __name__ == "__main__":
    # build_vocab_from_corpus("corpus/all.txt")
    # train("corpus.txt", True)
    train_all()

6. Application of language model

6.1 Application of Language Model - Speaker Separation

  • Judging the speaker by the content of the speech
  • Commonly used in speech recognition systems to identify which role said what in recorded conversations
  • For example, in customer-service call recordings, deciding whether each utterance comes from the agent or the customer
  • Speakers can also be judged by their characteristic "accent" or style of wording, for example:
  • Translationese: Can you imagine that there are cockroaches in this unlucky house? This is really scary!
  • Hong Kong/Taiwan style: How can you be like this?
  • Northeastern (Dongbei) flavor: a colloquial sentence full of Dongbei dialect expressions
  • Essentially a text classification task
  1. For each category, train a language model on that category's corpus
  2. For a new input text, compute the sentence probability with every language model
  3. Select the category whose model gives the highest probability as the predicted category (see the sketch below)
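
A sketch of the whole procedure, reusing the NgramLanguageModel class from section 4.2 and choosing the category with the lowest perplexity (equivalently, the highest sentence probability). The corpus file names in the commented usage are hypothetical.

def classify(sentence, models):
    # models: {category_name: NgramLanguageModel trained on that category's corpus}
    # lower perplexity means the category's model finds the sentence more probable
    return min(models, key=lambda cat: models[cat].calc_sentence_ppl(sentence))

# models = {"agent":    NgramLanguageModel(open("agent.txt", encoding="utf8").readlines(), 3),
#           "customer": NgramLanguageModel(open("customer.txt", encoding="utf8").readlines(), 3)}
# print(classify("how can I help you today", models))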


  • Compared with general text classification models such as Naive Bayes, random forests, or neural networks:
  • Advantages:
  • The per-category models are independent of each other, so sample imbalance or sample errors in one category have no effect on the other models
  • New categories can be added at any time without affecting the existing ones
  • In terms of accuracy: generally no significant advantage
  • In terms of efficiency: generally lower than a unified classification model

6.2 Application of language model - text error correction

  • Correct errors in the text
  • For example, a sentence containing homophone typos (wrong characters written in place of those for "Tiananmen" and "monument") is corrected to:
  • I went to Tiananmen today to see the Monument to the People's Heroes
  • Errors are usually homophones, visually similar characters, and the like
  1. Build a confusion set for each character or word
  2. Compute the probability of the whole original sentence
  3. Replace a word in the sentence with each word from its confusion set and recompute the probability
  4. Select the highest-scoring candidate sentence; accept it only if its score exceeds the original sentence's score by a certain threshold
  5. Move on to the next word and repeat steps 3-4 until the end of the sentence (see the sketch below)
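
A sketch of the replace-and-rescore loop described above. Here `confusion` maps a word to its confusion set and `sentence_prob` is any language-model scorer; both are stand-ins, and the threshold is a tunable hyperparameter.

def correct(words, confusion, sentence_prob, threshold=1.5):
    words = list(words)
    for i, word in enumerate(words):
        base = sentence_prob(words)
        best_cand, best_score = word, base
        for cand in confusion.get(word, []):
            candidate_sentence = words[:i] + [cand] + words[i + 1:]
            score = sentence_prob(candidate_sentence)
            if score > best_score:
                best_cand, best_score = cand, score
        # replace only if the improvement over the original is large enough
        if best_score > base * threshold:
            words[i] = best_cand
    return words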


This approach has some drawbacks:

  1. It cannot handle extra or missing characters (insertions and deletions)
  2. The threshold is hard to set: if it is too high, nothing gets corrected; if it is too low, there are many replacements that may change the original meaning of the sentence
  3. The confusion-word lists are hard to make complete
  4. The domain of the language model affects the corrections
  5. Consecutive typos greatly increase the difficulty of error correction
  • Common industry practice:
  • Restrict corrections to a whitelist, i.e. only decide whether specific words need to be changed
  • For example, only for fragments pronounced "shang wu", decide whether to change them to "business", and ignore everything else
  • Deep learning models can tolerate typos, so error correction itself is becoming less important; it is generally done only for text that will be displayed

6.3 Application of language model - digital normalization

  • Convert the numeric parts of a text into a reader-friendly style
  • Often used to post-process the text output of a speech recognition system
  • For example:
  • Raw output (numbers written out as words): the coal inventory at Qinhuangdao Port suddenly surged in early November, from four point five four nine million tons to seven point seven three four million tons, breaking the record since nineteen ninety-nine
  • Normalized: The coal inventory at Qinhuangdao Port suddenly surged in early November, from 4.549 million tons to 7.734 million tons, breaking the record since 1999
  1. Find text whose numbers are already written in the desired format and use it as the training corpus
  2. Locate the numeric parts (in any form) with regular expressions
  3. Replace each numeric part with a token such as <Arabic numerals>, <Chinese-character numbers>, or <Chinese characters read digit by digit>, according to its format
  4. Train the language model on the text containing these tokens
  5. For newly input text, also locate the numeric parts with regular expressions, substitute each candidate token in turn, and compute the probability with the language model
  6. Select the token with the highest probability as the final number format, convert the number according to the corresponding rules, and put it back into the original text (see the sketch below)
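
A sketch of steps 2-3: find the numeric parts with a regular expression and replace them with a placeholder token before training or scoring with the language model. The regex and token name are illustrative assumptions.

import re

NUM_PATTERN = re.compile(r"\d[\d.,]*")

def tokenize_numbers(text, token="<Arabic numerals>"):
    # keep the matched spans so the numbers can be converted back after the
    # language model has chosen the best token for each position
    spans = [(m.start(), m.end(), m.group()) for m in NUM_PATTERN.finditer(text)]
    return NUM_PATTERN.sub(token, text), spans

print(tokenize_numbers("from 4.549 million tons to 7.734 million tons since 1999"))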


6.4 Application of language model - text marking


  • Essentially a sequence labeling task
  • Word segmentation, punctuation restoration, paragraph segmentation and similar tasks can all be handled in this way
  • Word segmentation or paragraph segmentation needs only one boundary token; for punctuation restoration, multiple separator tokens can be used to represent the different punctuation marks

7. Summary

  1. The core ability of the language model is to calculate the probability of forming a sentence. Relying on this ability, a large number of different types of NLP tasks can be completed.
  2. Statistics-based language models and neural-network-based language models each have their own usage scenarios. Generally speaking, the advantage of statistics-based models lies in decoding speed, while neural network models usually perform better.
  3. Evaluating the language model purely through PPL is limited, and it is better to conduct an overall evaluation through the downstream task effect.
  4. A deep understanding of an algorithm helps to discover more application methods.
  5. Seemingly simple (even wrong) assumptions can lead to meaningful results, and in fact, this is a common way to simplify problems.

Origin: blog.csdn.net/m0_63260018/article/details/132506240