DataWhale Team Learning Camp, Task 02-2

Language models and the dataset

Language model

A natural language text can be viewed as a discrete time series. Given a sequence of words w_1, w_2, \ldots, w_T of length T, the goal of a language model is to assess whether the sequence is reasonable, i.e. to compute the probability of the sequence:

P(w_1, w_2, \ldots, w_T)

In this section we introduce statistics-based language models, mainly the n-gram model. In later sections we will introduce language models based on neural networks.

Language model

Assume that each word in the sequence w_1, w_2, \ldots, w_T is generated in turn. Then we have

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})

For example, the probability of a text sequence containing four words is

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_1, w_2, w_3)

The parameters of the language model are the probabilities of words and the conditional probabilities of words given the preceding words. Suppose the training dataset is a large text corpus, such as all the entries of Wikipedia. The probability of a word can then be estimated from its relative frequency in the training dataset. For example, the probability of w_1 can be estimated as

\hat{P}(w_1) = \frac{n(w_1)}{n}

where n(w_1) is the number of texts in the corpus that have w_1 as their first word, and n is the total number of texts in the corpus.

Similarly, the conditional probability of w_2 given w_1 can be estimated as

\hat{P}(w_2 \mid w_1) = \frac{n(w_1, w_2)}{n(w_1)}

where n(w_1, w_2) is the number of texts in the corpus that have w_1 as the first word and w_2 as the second word.
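
As a minimal sketch (not from the original post; the toy corpus and variable names are illustrative assumptions), these relative-frequency estimates can be computed with simple counting:

from collections import Counter

# Toy corpus: each inner list is one tokenized text (an illustrative assumption).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "cat", "ran"],
]

n_texts = len(corpus)                                    # n: total number of texts
first_word_counts = Counter(text[0] for text in corpus)  # n(w1)
first_pair_counts = Counter((text[0], text[1]) for text in corpus if len(text) > 1)  # n(w1, w2)

# Estimate P(w1) = n(w1) / n
p_the = first_word_counts["the"] / n_texts
# Estimate P(w2 | w1) = n(w1, w2) / n(w1)
p_cat_given_the = first_pair_counts[("the", "cat")] / first_word_counts["the"]

print(p_the, p_cat_given_the)  # 2/3 and 1/2 for this toy corpus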

n-gram

As the sequence length grows, the complexity of computing and storing the probabilities of many words occurring together increases exponentially. The n-gram model simplifies this with the Markov assumption: the occurrence of a word depends only on the previous n words, i.e. an n-th order Markov chain (Markov chain of order n). For example, if n = 1, then P(w_3 \mid w_1, w_2) = P(w_3 \mid w_2). Based on an (n-1)-th order Markov chain, we can rewrite the language model as

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_{t-(n-1)}, \ldots, w_{t-1})

The above is also called an n-gram model: a probabilistic language model based on an (n-1)-th order Markov chain. For example, when n = 2, the probability of a text sequence containing four words can be rewritten as

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) P(w_4 \mid w_3)

When n is 1, 2, or 3, we call the model a unigram, bigram, or trigram model, respectively. For example, the probabilities of the length-4 sequence w_1, w_2, w_3, w_4 under the unigram, bigram, and trigram models are, respectively:

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2) P(w_3) P(w_4)
P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) P(w_4 \mid w_3)
P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_2, w_3)
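
As a rough sketch of how the bigram factorization is used in practice (an assumption, not code from the post; it uses the standard relative-frequency estimate over adjacent word pairs):

from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()  # toy training tokens (assumption)

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(sequence):
    # P(w1, ..., wT) ~= P(w1) * prod_t P(w_t | w_{t-1}) under the bigram model;
    # an unseen bigram makes the product zero, which hints at the sparsity issue below.
    prob = unigram_counts[sequence[0]] / len(tokens)
    for prev, curr in zip(sequence, sequence[1:]):
        prob *= bigram_counts[(prev, curr)] / unigram_counts[prev]
    return prob

print(bigram_prob(["the", "cat", "sat"]))  # 3/9 * 2/3 * 1/2 = 1/9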

When n is small, the n-gram model is often inaccurate. For example, under a unigram model the three-word sentences "you go first" and "you first go" have the same probability. However, when n is large, the n-gram model has to compute and store a huge number of word frequencies and multi-word adjacent frequencies.

Thoughts:
What are the possible shortcomings of the n-gram model?

1. The parameter space is too large (see the back-of-the-envelope count below).
2. The data is sparse.
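
To make the first point concrete, here is a back-of-the-envelope count (a sketch with an assumed vocabulary size V; the post gives no numbers): an n-gram model needs on the order of V^n conditional probabilities, and most of those n-grams never occur in any realistic corpus, which is exactly the sparsity problem in the second point.

V = 10000  # assumed vocabulary size
for n in (1, 2, 3):
    # number of possible n-grams, i.e. probabilities to estimate: 10^4, 10^8, 10^12
    print(n, V ** n)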

Language model dataset

Reading the dataset

with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
    corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[: 40])
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')  # replace line breaks with spaces
corpus_chars = corpus_chars[: 10000]  # keep only the first 10000 characters

63282
want to have a helicopter
want to fly to the universe with you
want to melt together with you
melt in the universe
I am every day every day...

Build character index

idx_to_char = list(set(corpus_chars))  # deduplicate to get the index-to-character mapping
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # character-to-index mapping
vocab_size = len(char_to_idx)
print(vocab_size)

corpus_indices = [char_to_idx[char] for char in corpus_chars]  # convert each character to its index, giving a sequence of indices
sample = corpus_indices[: 20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)

1027
chars: want to have a helicopter  want to fly to the universe with you  want to
indices: [1022, 648, 1025, 366, 208, 792, 199, 1022, 648, 641, 607, 625, 26, 155, 130, 5, 199, 1022, 648, 641]

We define the function load_data_jay_lyrics, which will be called directly in later sections.

def load_data_jay_lyrics():
    with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
        corpus_chars = f.read()
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    corpus_chars = corpus_chars[0:10000]
    idx_to_char = list(set(corpus_chars))
    char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
    vocab_size = len(char_to_idx)
    corpus_indices = [char_to_idx[char] for char in corpus_chars]
    return corpus_indices, char_to_idx, idx_to_char, vocab_size
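
A quick usage check (a sketch, assuming the same dataset path as above is available):

corpus_indices, char_to_idx, idx_to_char, vocab_size = load_data_jay_lyrics()
print(vocab_size)           # size of the character vocabulary
print(corpus_indices[:10])  # the first 10 characters, as indices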

Sampling time series data

During training, we need to randomly read mini-batches of samples and labels each time. Unlike the experimental data in the previous section, a sample of time series data usually consists of consecutive characters. Suppose the number of time steps is 5; then a sample sequence has 5 characters, e.g. "想", "要", "有", "直", "升" (the first five characters of "想要有直升机", "want to have a helicopter"). The label sequence consists of the character that follows each of these characters in the training set, i.e. "要", "有", "直", "升", "机". That is, X = "想要有直升" and Y = "要有直升机".

We now consider the sequence "想要有直升机，想要和你飞到宇宙去" ("want to have a helicopter, want to fly to the universe with you"). If the number of time steps is 5, the possible samples and labels are:

(figure: all possible (X, Y) sample/label pairs for this sequence)

As you can see, if the length of the sequence is T and the number of time steps is n, then there are T - n legal samples in total, but these samples overlap heavily, so we usually adopt more efficient sampling methods. There are two ways to sample time series data: random sampling and adjacent sampling.
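
As an illustrative sketch (not from the post), the following enumerates all T - n legal (sample, label) pairs for a toy index sequence:

def all_samples(seq, num_steps):
    # For a sequence of length T there are T - num_steps legal (X, Y) pairs,
    # since Y is simply X shifted one position to the right.
    return [(seq[i: i + num_steps], seq[i + 1: i + 1 + num_steps])
            for i in range(len(seq) - num_steps)]

for X, Y in all_samples(list(range(10)), num_steps=5):
    print(X, Y)  # 10 - 5 = 5 overlapping pairs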

Random sampling

The following code randomly samples a mini-batch of data each time. Here batch_size is the number of samples in each mini-batch and num_steps is the number of time steps in each sample. In random sampling, each sample is a sequence taken from an arbitrary position on the original sequence, and the positions of two adjacent random mini-batches on the original sequence are not necessarily adjacent.

import torch
import random

def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because, for a sequence of length n, X can contain at most its first n - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # floor division: number of non-overlapping samples
    example_indices = [i * num_steps for i in range(num_examples)]  # index in corpus_indices of each sample's first character
    random.shuffle(example_indices)

    def _data(i):
        # return the subsequence of length num_steps starting at index i
        return corpus_indices[i: i + num_steps]

    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(0, num_examples, batch_size):
        # pick batch_size random samples each time
        batch_indices = example_indices[i: i + batch_size]  # indices of the first character of each sample in this batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)

Let us test this function. We use the consecutive integers from 0 to 29 as an artificial sequence, set the batch size and the number of time steps to 2 and 6 respectively, and print the input X and label Y of each mini-batch read by random sampling.

my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X: tensor([[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17]])
Y: tensor([[ 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18]])

X: tensor([[ 0, 1, 2, 3, 4, 5],
[18, 19, 20, 21, 22, 23]])
Y: tensor([[ 1, 2, 3, 4, 5, 6],
[19, 20, 21, 22, 23, 24]])

Adjacent sampling

In adjacent sampling, two adjacent random mini-batches are adjacent in position on the original sequence.

def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length of the retained sequence
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, corpus_len // batch_size)
    batch_num = (indices.shape[1] - 1) // num_steps
    for i in range(batch_num):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

Under the same settings, print the input X and label Y of each mini-batch read by adjacent sampling. The positions of two adjacent random mini-batches are adjacent on the original sequence.

for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X: tensor([[ 0, 1, 2, 3, 4, 5],
[15, 16, 17, 18, 19, 20]])
Y: tensor([[ 1, 2, 3, 4, 5, 6],
[16, 17, 18, 19, 20, 21]])

X: tensor([[ 6, 7, 8, 9, 10, 11],
[21, 22, 23, 24, 25, 26]])
Y: tensor([[ 7, 8, 9, 10, 11, 12],
[22, 23, 24, 25, 26, 27]])
