AI: deep learning for text processing

This article, like another article published earlier, mentions the company BlueDot, which is committed to using artificial intelligence to protect people around the world from infectious diseases. Back when the epidemic had not yet drawn serious attention, BlueDot issued its warning a week in advance. How precious!

Their AI early-warning system uses deep learning to process text: it crawls hundreds of thousands of pieces of information on the web, including news and public statements, and runs natural language processing on them. Today let's talk about how deep learning handles text in a simple way.

Text, that is, a string, is a sequence of words or characters, and processing by word is the most common case (we will not consider Chinese here; understanding and processing Chinese is much more complex than English). What a computer does to text is essentially mathematics, and in essence statistics, so statistical models can solve many simple problems; the little sketch below shows the idea. Here we go.
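As a tiny illustration of this "processing text is statistics" idea (a sketch added here for illustration, with a made-up sample sentence, not anything from BlueDot's system), the simplest statistical model of a text is just a table of word frequencies:

from collections import Counter

text = 'the cat sat on the mat and the dog sat on the rug'
# The simplest statistics we can compute on text: how often each word occurs
word_counts = Counter(text.split())
print(word_counts.most_common(3))
# e.g. [('the', 4), ('sat', 2), ('on', 2)]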

Processing text data

As before, if the raw data is not already vectors that can be used for training, we need to vectorize it. There are several ways to vectorize text:

  • Split it into words
  • Split it into characters
  • Extract n-grams of words

Suppose I say, in Chinese, "I like to eat huo (fire) ...". Can you guess what comes next? For a 1-gram, the next character could be anything; it has nothing to do with the preceding text. For a 2-gram, the next character depends only on "huo", so it might be "chai", "yan", "ba" and so on, forming the words "match", "flame", "torch". For a 3-gram, which also sees "eat huo", the next character is much more likely to be "guo", forming "hot pot". Understood briefly this way, an n-gram depends on the previous n-1 words.
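To make the three splitting strategies concrete, here is a minimal sketch (added for illustration, using an English sample since we agreed to set Chinese aside; word_ngrams is a small helper written just for this example):

def word_ngrams(text, n):
    # All consecutive n-word sequences (n-grams) in the text
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = 'The cat sat on the mat'
print(sample.split())          # split into words
print(list(sample))            # split into characters
print(word_ngrams(sample, 2))  # ['The cat', 'cat sat', 'sat on', 'on the', 'the mat']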

Today we also get to fill a hole we dug earlier: we promised to introduce one-hot encoding, and now is the time.

one-hot encoding

import numpy as np

def one_hot():
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    token_index = {}
    # Split each sample into words and give every new word the next index
    for sample in samples:
        for word in sample.split():
            if word not in token_index:
                token_index[word] = len(token_index) + 1
    # {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
    print(token_index)

    # Encode each sample as a (max_length, vocabulary size + 1) one-hot tensor
    max_length = 8
    results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
    for i, sample in enumerate(samples):
        for j, word in list(enumerate(sample.split()))[:max_length]:
            index = token_index.get(word)
            results[i, j, index] = 1.

    print(results)

result

We can see that the data has a problem: "mat" and "homework" both came out with a trailing English period '.'. Do we need to write some virtuoso regular expression to match this inexplicable symbol? Of course not: Keras has a built-in method.

from keras.preprocessing.text import Tokenizer

def keras_one_hot():
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    # Keep only the 1000 most common words
    tokenizer = Tokenizer(num_words=1000)
    tokenizer.fit_on_texts(samples)
    # Each sample becomes a list of integer word indices
    sequences = tokenizer.texts_to_sequences(samples)
    print(sequences)
    # One-hot (binary) encoding of each sample
    one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
    print(one_hot_results)
    word_index = tokenizer.word_index
    print(word_index)
    print('Found %s unique tokens.' % len(word_index))

result

Here num_words, like max_length above, controls how many of the most common words are kept. Choosing it well can greatly reduce the computation needed for training, and sometimes even improve accuracy a little, so it deserves some attention. We can also see that in the resulting encoding most of the vector is 0; it is not compact, which leads to a large memory footprint. Not great. Are there other approaches? The answer is yes.
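To get a feel for how wasteful one-hot vectors are, here is a rough back-of-the-envelope sketch (the numbers are illustrative assumptions that echo the settings used later in this article, not measurements): with a 10000-word vocabulary, one-hot encoding a 100-word document needs a 100 x 10000 tensor that is almost entirely zeros, while an 8-dimensional embedding of the same document is over a thousand times smaller.

import numpy as np

vocab_size = 10000    # number of distinct words we keep
seq_len = 100         # words per document
embedding_dim = 8     # size of a dense embedding vector

# The same document as float32 tensors in the two representations
one_hot = np.zeros((seq_len, vocab_size), dtype='float32')
embedded = np.zeros((seq_len, embedding_dim), dtype='float32')

print('one-hot bytes per document:  ', one_hot.nbytes)   # 4000000
print('embedding bytes per document:', embedded.nbytes)  # 3200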

Word embeddings

Also called word vectors. A word embedding is usually a dense, low-dimensional vector (256, 512 or 1024 dimensions). So what exactly is a word embedding?

The theme of this article is processing text, and text carries semantics; for text with no semantics there is nothing we can do. But our previous approach is really just counting probabilities, a simple calculation with no understanding of meaning (or very little). In reality, though, "very good" and "great" mean roughly the same thing, and both are the opposite of "very bad". So when we convert these words into vectors, we want the first two vectors to be close together and both to be far from the third. With that in mind, the picture below should be quite easy to understand:

image
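As a toy illustration of "similar meanings should end up as nearby vectors" (the three vectors below are made up by hand for the example, and cosine_similarity is a helper written here; these are not real embeddings), cosine similarity is one common way to measure how close two word vectors are:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction, -1.0 means opposite directions
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration
very_good = np.array([0.9, 0.8, 0.1])
great = np.array([0.8, 0.9, 0.2])
very_bad = np.array([-0.9, -0.7, 0.1])

print(cosine_similarity(very_good, great))     # close to 1: similar meaning
print(cosine_similarity(very_good, very_bad))  # negative: opposite meaning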

Implementing this mapping directly yourself would be a bit difficult, but fortunately Keras simplifies the problem: Embedding is a built-in layer that does exactly this mapping. With this concept understood, let's look at the IMDB problem (sentiment prediction for movie reviews). The code is simple and, with no tuning to speak of, reaches about 75% accuracy:

from keras.datasets import imdb
from keras import preprocessing
from keras.layers import Embedding, Flatten, Dense
from keras.models import Sequential

def imdb_run():
    # Keep the 10000 most common words and only the first 20 words of each review
    max_features = 10000
    maxlen = 20
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
    # Pad / truncate every review to exactly maxlen integers
    x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
    model = Sequential()
    # Map each word index to an 8-dimensional dense vector
    model.add(Embedding(10000, 8, input_length=maxlen))
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    model.summary()
    history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

What if we have only a small amount of data? In the previous article, when dealing with images, the approach was to use a pre-trained network; here we use a similar method, pre-trained word embeddings. The two most popular word embeddings are GloVe and Word2Vec; we will introduce both properly when the time is right. Today we use GloVe, and the specific steps are explained in the comments of the code, which is placed at the end of this article. First, let's look at the result:

image

The model overfits quickly, and a validation accuracy of nearly 60% may not look impressive, but considering that there are only 200 training samples, this result is really quite good. Of course you may not believe that, so here are two comparison charts. The first is a model without pre-trained word embeddings:

image

Its validation accuracy is noticeably lower. Next, here is the result with a training set of 2000 samples:

image

The accuracy is much higher now. Chasing that number is not our goal, though; the goal is to show that word embeddings work, and we have achieved it. Now let's look at the code:

#!/usr/bin/env python3

import os
import time

import matplotlib.pyplot as plt
import numpy as np
from keras.layers import Embedding, Flatten, Dense
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

def deal():
    # Raw IMDB dataset, available at: http://mng.bz/0tIo
    imdb_dir = '/Users/renyuzhuo/Documents/PycharmProjects/Data/aclImdb'
    train_dir = os.path.join(imdb_dir, 'train')
    labels = []
    texts = []
    # Read in all the reviews and their labels
    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(train_dir, label_type)
        for fname in os.listdir(dir_name):
            if fname[-4:] == '.txt':
                f = open(os.path.join(dir_name, fname))
                texts.append(f.read())
                f.close()
                if label_type == 'neg':
                    labels.append(0)
                else:
                    labels.append(1)

    # Tokenize all the data
    # Keep at most 100 words per review
    maxlen = 100
    # Number of training samples
    training_samples = 200
    # Number of validation samples
    validation_samples = 10000
    # Keep only the 10000 most common words
    max_words = 10000
    # Tokenize the texts, as introduced earlier
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    # Convert the lists of integers into a tensor
    data = pad_sequences(sequences, maxlen=maxlen)
    labels = np.asarray(labels)
    print('Shape of data tensor:', data.shape)
    print('Shape of label tensor:', labels.shape)
    # Shuffle the data
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    data = data[indices]
    labels = labels[indices]
    # Split into a training set and a validation set
    x_train = data[:training_samples]
    y_train = labels[:training_samples]
    x_val = data[training_samples: training_samples + validation_samples]
    y_val = labels[training_samples: training_samples + validation_samples]

    # Pre-trained word embeddings, download from: https://nlp.stanford.edu/projects/glove
    glove_dir = '/Users/renyuzhuo/Documents/PycharmProjects/Data/glove.6B'
    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
    # Build an index mapping each word to its vector representation
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    print('Found %s word vectors.' % len(embeddings_index))

    # Build the embedding matrix
    embedding_dim = 100
    embedding_matrix = np.zeros((max_words, embedding_dim))
    for word, i in word_index.items():
        if i < max_words:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector

    # Build the model
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.summary()

    # Load GloVe into the Embedding layer and freeze it (make it non-trainable)
    model.layers[0].set_weights([embedding_matrix])
    model.layers[0].trainable = False

    # Train the model
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])
    history = model.fit(x_train, y_train,
                        epochs=10,
                        batch_size=32,
                        validation_data=(x_val, y_val))
    model.save_weights('pre_trained_glove_model.h5')

    # Plot the results
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.show()

    plt.figure()
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()


if __name__ == "__main__":
    time_start = time.time()
    deal()
    time_end = time.time()
    print('Time Used: ', time_end - time_start)

This article was first published on the WeChat public account: RAIS


Source: www.cnblogs.com/renyuzhuo/p/12422886.html