Text classification with recurrent neural networks

1. Concept

1.1 Recurrent neural networks

A Recurrent Neural Network (RNN) is a class of neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and connects all of its nodes (recurrent units) in a chain.

A convolutional network takes only the input data X; a recurrent neural network, in addition to the input X, also feeds the output of each step back in as part of the input of the next step, cycling like this with the same activation function and parameters at every step. At each step, the input x_t is multiplied by the coefficient matrix U to obtain the state s_t, which is then passed on to the next time step through the coefficient matrix W; this constitutes the forward propagation of the recurrent neural network.
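Written out, the recurrence is (a standard formulation; f is the activation function, usually tanh, and V maps the state to the output):

$$ s_t = f(U x_t + W s_{t-1}), \qquad o_t = g(V s_t) $$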


In backpropagation, we need the derivative of the loss function E with respect to the parameter W, and the chain rule gives the expansion below.
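This is the standard backpropagation-through-time expansion:

$$ \frac{\partial E_t}{\partial W} = \sum_{k=0}^{t} \frac{\partial E_t}{\partial o_t}\,\frac{\partial o_t}{\partial s_t}\left(\prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W} $$

The long product of Jacobians ∂s_j/∂s_{j-1} is what makes the gradient shrink or blow up as the distance t − k grows.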


Compared with a convolutional neural network, which maps one input to one output through the network, a recurrent neural network can realize one input to many outputs (generating image descriptions), many inputs to one output (text classification), and many inputs to many outputs (machine translation, video commentary).

The RNN uses the tanh activation function, whose output lies between -1 and 1. Since tanh′(x) = 1 − tanh²(x) ≤ 1, the product of Jacobians in the gradient formula above shrinks as it grows longer, so the gradient easily vanishes and steps far from the output contribute little to it.

Using the output of a lower layer as the input of the layer above stacks RNNs into a multi-layer network; information can likewise be passed between the upper layers, and residual connections can be used to counteract vanishing gradients in deep stacks.

1.2 Long short-term memory networks

Between propagation steps a plain RNN has only the single parameter matrix W, which is hard-pressed to describe a large number of complex information requirements. To solve this problem, the Long Short-Term Memory (LSTM) network is introduced. This network adds a selection mechanism: it selectively takes in and emits the information that is needed, and selectively forgets information that is not. The selection mechanism is realized through sigmoid gates: the output of the sigmoid function lies between 0 and 1, where 0 means forget, 1 means remember, and 0.5 means remember 50%.

The LSTM cell is organized around a cell state that runs through the whole sequence, controlled by three gates, described below.


The cell state is the hidden state carried between rounds of the computation: the current state C_t is obtained by multiplying the previous state element-wise with the forget-gate result, then adding the contribution admitted by the input gate.

The forget gate takes the previous round's output h_{t-1} and the current data x_t and selects what to forget, producing the forget result f_t.

The input gate: h_{t-1} and x_t pass through a sigmoid to give i_t, which is multiplied element-wise with the tanh candidate C̃_t to give this round's contribution to the state.

The output gate: h_{t-1} and x_t pass through a sigmoid to give o_t, which is multiplied element-wise with the tanh of the current state to produce this round's output h_t.
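In equation form, the standard LSTM cell matching the description above is (σ is the sigmoid function, ⊙ the element-wise product):

$$ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) $$
$$ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$
$$ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \qquad h_t = o_t \odot \tanh(C_t) $$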


To implement the LSTM network, first define the _generate_params function to generate the parameters required by each gate, then call it to define the parameters of the input gate, output gate, forget gate, and the intermediate tanh state. Each gate has three parameters: the weights for the input x and for h, and a bias value.

Then each round of the LSTM loop is computed. The input-gate computation multiplies the input embedded_input matrix by the input-gate parameter x_in, adds the product of h and its corresponding parameter h_in, adds the bias b_in, and passes the sum through a sigmoid to give the input-gate result.

Similarly, matrix multiplications and bias additions yield the results of the forget gate and the output gate. The intermediate tanh state is computed like the three gates, except that it passes through the tanh function at the end.

Multiply the previous state element-wise by the forget gate, add the input gate multiplied element-wise by the intermediate state, and the result is the current state.

Pass the current state through the tanh function and multiply element-wise by the output gate to get this round's output h.

After looping over all the input rounds, the final h is the output of the LSTM network.

# Implement the LSTM network by hand
# (uses the hyperparameters and embedded_inputs defined in section 2)
import math
import tensorflow as tf

# Generate the parameters needed by one gate of the cell
def _generate_params(x_size, h_size, b_size):
    x_w = tf.get_variable('x_weight', x_size)
    h_w = tf.get_variable('h_weight', h_size)
    bias = tf.get_variable('bias', b_size, initializer=tf.constant_initializer(0.0))
    return x_w, h_w, bias

scale = 1.0 / math.sqrt(embedding_size + lstm_nodes[-1]) / 3.0
lstm_init = tf.random_uniform_initializer(-scale, scale)
with tf.variable_scope('lstm_nn', initializer=lstm_init):
    # Input gate parameters
    with tf.variable_scope('input'):
        x_in, h_in, b_in = _generate_params(
            x_size=[embedding_size, lstm_nodes[0]],
            h_size=[lstm_nodes[0], lstm_nodes[0]],
            b_size=[1, lstm_nodes[0]]
        )
    # Output gate parameters
    with tf.variable_scope('output'):
        x_out, h_out, b_out = _generate_params(
            x_size=[embedding_size, lstm_nodes[0]],
            h_size=[lstm_nodes[0], lstm_nodes[0]],
            b_size=[1, lstm_nodes[0]]
        )
    # Forget gate parameters
    with tf.variable_scope('forget'):
        x_f, h_f, b_f = _generate_params(
            x_size=[embedding_size, lstm_nodes[0]],
            h_size=[lstm_nodes[0], lstm_nodes[0]],
            b_size=[1, lstm_nodes[0]]
        )
    # Intermediate (candidate) state parameters
    with tf.variable_scope('mid_state'):
        x_m, h_m, b_m = _generate_params(
            x_size=[embedding_size, lstm_nodes[0]],
            h_size=[lstm_nodes[0], lstm_nodes[0]],
            b_size=[1, lstm_nodes[0]]
        )

    # Two initial states: the cell state `state` and the previous output h
    state = tf.Variable(tf.zeros([batch_size, lstm_nodes[0]]), trainable=False)
    h = tf.Variable(tf.zeros([batch_size, lstm_nodes[0]]), trainable=False)
    # Unroll the LSTM loop: one iteration per input word
    for i in range(max_words):
        # Slice out this round's input; the second dimension of the
        # three-dimensional embedded_inputs array is the time step
        embedded_input = embedded_inputs[:, i, :]
        # Reshape the slice to two dimensions
        embedded_input = tf.reshape(embedded_input, [batch_size, embedding_size])
        # Forget gate
        forget_gate = tf.sigmoid(tf.matmul(embedded_input, x_f) + tf.matmul(h, h_f) + b_f)
        # Input gate
        input_gate = tf.sigmoid(tf.matmul(embedded_input, x_in) + tf.matmul(h, h_in) + b_in)
        # Output gate
        output_gate = tf.sigmoid(tf.matmul(embedded_input, x_out) + tf.matmul(h, h_out) + b_out)
        # Intermediate (candidate) state
        mid_state = tf.tanh(tf.matmul(embedded_input, x_m) + tf.matmul(h, h_m) + b_m)
        # Update the cell state and compute the output h
        state = state * forget_gate + input_gate * mid_state
        h = output_gate * tf.tanh(state)
    # After the last iteration, h is the LSTM's final output
    last_output = h

1.3 Text classification

The text classification problem is to analyze an input text string and output a judgment about it. A string cannot be fed to an RNN directly, so before input the text must be split into individual words and each word encoded as a vector, with one word entered per round; when the last word has been entered, the output is likewise a vector. An embedding maps each word to a vector, each of whose dimensions holds a floating-point value; these values are adjusted dynamically during training so that the embedding encoding relates to the word's meaning. In this way both the input and output of the network are vectors, and a final fully connected layer maps the output to the different classes.
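As a minimal sketch of the embedding idea (toy vocabulary and a random 3-dimensional table; in a real model the table is a trainable parameter):

import numpy as np

# Hypothetical vocabulary: each word gets an integer id
vocab = {'wait': 0, 'for': 1, 'the': 2}
# Toy embedding table: one vector per word; training adjusts these values
embedding = np.random.uniform(-1, 1, size=(len(vocab), 3))

# A sentence becomes a sequence of vectors by table lookup
sentence = ['wait', 'for', 'the']
vectors = embedding[[vocab[w] for w in sentence]]  # shape (3, 3)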

An unavoidable problem with the RNN is that the final output is dominated by the most recent inputs, while inputs far in the past may hardly affect the result: the information-bottleneck problem. To relieve it, the bidirectional LSTM is introduced. A bidirectional LSTM adds information propagation in the reverse direction, and every round produces an output; the per-round outputs are combined and then passed to the fully connected layer, as sketched below.
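A minimal TensorFlow 1.x sketch of such a bidirectional layer (shapes are illustrative; embedded_inputs is assumed to be a [batch, time, embedding] tensor like the one built in section 2.3):

# Forward and backward LSTM cells read the sequence in opposite directions
cell_fw = tf.contrib.rnn.BasicLSTMCell(64)
cell_bw = tf.contrib.rnn.BasicLSTMCell(64)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, embedded_inputs, dtype=tf.float32)
# Every step has an output from both directions; combine them per step
bi_output = tf.concat([out_fw, out_bw], axis=-1)  # [batch, time, 128]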

Another text classification model is HAN (Hierarchical Attention Network). The text is first divided into sentence and word levels; the input words are encoded and summed to give sentence encodings, and the sentence encodings are summed to give the final text encoding. The attention part means that before each level's encodings are accumulated, a weight is assigned to each one and the accumulation is taken according to those weights.
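At the word level, for example, the attention weighting can be written as follows (the formulation from the HAN paper; h_t are the word encodings, u_w a learned context vector, and s the resulting sentence encoding):

$$ u_t = \tanh(W h_t + b), \qquad \alpha_t = \frac{\exp(u_t^\top u_w)}{\sum_k \exp(u_k^\top u_w)}, \qquad s = \sum_t \alpha_t h_t $$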


Because input texts vary in length, a plain neural network cannot learn from them directly. One workaround is to pad all inputs to a common maximum length, after which a convolutional neural network can be used: this is TextCNN. The convolution in TextCNN is a multi-channel one-dimensional convolution; compared with two-dimensional convolution, one-dimensional means the kernel slides in only one direction. For example, one kernel position gives 1×1 + 5×2 + 2×2 + 4×3 + 3×3 + 3×4 = 48, the kernel then moves down one cell to give 45, and so on. For input texts of varying length, all of them are first padded into a six-channel embedding array, and a six-channel one-dimensional kernel is convolved from top to bottom to obtain a one-dimensional array, which then passes through a pooling layer and a fully connected layer to produce the output.
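A minimal TF 1.x sketch of the one-dimensional convolution over an embedded text (filter count and window size are illustrative; embedded_inputs as above):

# A window of 3 consecutive words slides along the time axis,
# producing 128 feature channels per position
conv = tf.layers.conv1d(embedded_inputs, filters=128, kernel_size=3,
                        activation=tf.nn.relu)
# Max-pool over time: keep each filter's strongest response
pooled = tf.reduce_max(conv, axis=1)                # [batch, 128]
cnn_logits = tf.layers.dense(pooled, num_classes)   # classification layer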


As can be seen, a CNN cannot perfectly handle serial inputs of different lengths, but it can process many words in parallel and is therefore efficient, while an RNN handles serial input better. Combining the advantages of the two gives the RCNN model: the input is first passed through a bidirectional RNN for feature extraction, a CNN extracts features further, a pooling layer then fuses the features of every step, and a fully connected layer performs the final classification.
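A sketch of that combination, reusing bi_output and embedded_inputs from the snippets above (layer sizes are again illustrative):

# Concatenate the bidirectional RNN features with the raw embeddings
rcnn_features = tf.concat([bi_output, embedded_inputs], axis=-1)
# Project each step's features, then max-pool them across all steps
projected = tf.layers.dense(rcnn_features, 128, activation=tf.tanh)
text_vector = tf.reduce_max(projected, axis=1)          # [batch, 128]
rcnn_logits = tf.layers.dense(text_vector, num_classes)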

Whatever the model, an embedding is needed to turn the input into vectors. When the vocabulary is too large, the embedding layer's parameters become too numerous, which not only hurts storage but also invites overfitting, so the embedding layer needs to be compressed. In the original encoding each input has its own parameter: for example, wait corresponds to the parameter x1, for to x2, and the to x3; with too many inputs the number of encoding parameters becomes very large. Instead, each input can be encoded with a pair of parameters, for example wait as (x1, x2), for as (x1, x3), and so on, so that n base parameters can distinguish on the order of n(n−1)/2 words. This parameter-sharing scheme, which greatly reduces the number of parameters, is called shared compression.

2. Text classification with TextRNN

2.1 Data preprocessing

The text classification dataset downloaded from the Internet is organized as follows: it is divided into test-set and training-set data; under the training set there are four folders, each folder is one class, each class contains 1000 txt files, and each file holds one text of that class.


Iterate over all the training-set files with os.walk and split each classified text into individual words with the jieba library, separated by spaces. Then prepend the class label, separated by a tab, and finally write the result to train_segment.txt:

import os
import jieba

# Split the sentences in each file into individual words via the jieba library
def segment_word(input_file, output_file):
    # Walk over every file in the training dataset
    for root, folders, files in os.walk(input_file):
        print('root:', root)
        for folder in folders:
            print('dir:', folder)
        for file in files:
            file_dir = os.path.join(root, file)
            with open(file_dir, 'rb') as in_file:
                # Read the text in the file
                sentence = in_file.read()
                # Split the sentence into individual words with jieba
                words = jieba.cut(sentence)
                # The last two characters of the folder path are the class name
                content = root[-2:] + '\t'
                # Strip spaces from each word and skip empty words
                for word in words:
                    word = word.strip(' ')
                    if word != '':
                        content += word + ' '
            # Append a newline and write the text to the output file
            content += '\n'
            with open(output_file, 'a') as outfile:
                outfile.write(content.strip(' '))

Each line of the resulting train_segment.txt has the form `class<TAB>word word word ...`.

Since words that occur only rarely are not statistically meaningful, they need to be excluded, so the get_list() method counts the frequency of each word. Python's built-in dictionary type makes this easy, in the format {"word": frequency}, where frequency records the number of occurrences of word. A newly seen word is added to the dictionary as a new entry; otherwise its frequency value is incremented by 1.

# Count how often each word occurs
def get_list(segment_file, out_file):
    # A dictionary holds the frequency of every word
    word_dict = {}
    with open(segment_file, 'r') as seg_file:
        lines = seg_file.readlines()
        # Iterate over every line of the file
        for line in lines:
            line = line.strip('\r\n')
            # Split the line on spaces and tally each word
            for word in line.split(' '):
                # If the word has not been seen before, create an entry set to 0
                word_dict.setdefault(word, 0)
                # Increment the count for this word
                word_dict[word] += 1
        # Sort the dictionary items by count (the item at index 1), descending
        sorted_list = sorted(word_dict.items(), key=lambda d: d[1], reverse=True)
        with open(out_file, 'w') as outfile:
            # Write each sorted entry to the file
            for item in sorted_list:
                outfile.write('%s\t%d\n' % (item[0], item[1]))

The statistics are written one `word<TAB>frequency` pair per line, most frequent first.

2.2 Data reading

Words cannot be used directly for learning; they must first be converted to embedding encodings. Based on the frequency list just generated, each word is numbered in order from front to back, and words whose frequency falls below the threshold are excluded. The Word_list class builds the word objects for the training and test data, implementing the word encoding in the class constructor __init__(). The class method sentence2id converts a segmented sentence into the corresponding array of ids, with words missing from the word list mapped to -1, and get_size() reports the vocabulary size (used later to size the embedding table).

Before defining the class, first specify some hyperparameters for subsequent use:

import math
import numpy as np
import tensorflow as tf

# Hyperparameters
embedding_size = 32  # length of each word vector
max_words = 10  # maximum number of words per sentence
lstm_layers = 2  # number of LSTM layers
lstm_nodes = [64, 64]  # number of nodes in each LSTM layer
fc_nodes = 64  # number of nodes in the fully connected layer
batch_size = 100  # number of samples per batch
lstm_grads = 1.0  # gradient-clipping norm for the LSTM network
learning_rate = 0.001  # learning rate
word_threshold = 10  # frequency threshold; rarer words are dropped
num_classes = 4  # there are 4 final classes


class Word_list:
    def __init__(self, filename):
        # A dictionary maps each retained word to its id
        self._word_dic = {}
        with open(filename, 'r', encoding='GB2312', errors='ignore') as f:
            lines = f.readlines()
        for line in lines:
            word, freq = line.strip('\r\n').split('\t')
            freq = int(freq)
            # Skip words whose frequency is below the threshold
            if freq < word_threshold:
                continue
            # Every word in the list is unique, so it is added in order:
            # the next word's id is simply the current length of word_dic
            word_id = len(self._word_dic)
            self._word_dic[word] = word_id

    def sentence2id(self, sentence):
        # Map a space-separated sentence to the ids in word_dic; missing words give -1
        sentence_id = [self._word_dic.get(word, -1)
                       for word in sentence.split()]
        return sentence_id

    def get_size(self):
        # Vocabulary size; used later to size the embedding table
        return len(self._word_dic)


train_list = Word_list(train_list_dir)
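As a quick sanity check (hypothetical words and ids; any word missing from the vocabulary comes back as -1):

print(train_list.sentence2id('今天 比赛 精彩'))  # e.g. [205, 13, -1]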

Define the TextData class to read and manage the data: the __init__() function reads the train_segment.txt file just produced, splits the class label and the sentence on the tab character, and converts both the class and the sentence into numeric ids. If a sentence has more words than the maximum threshold, the excess is truncated; if fewer, it is padded with -1. The class defines _shuffle_data() to shuffle the data, next_batch() to return data and labels batch by batch, and get_size() to return the vocabulary size.

class TextData:
    def __init__(self, segment_file, word_list):
        self.inputs = []
        self.labels = []
        # A dictionary manages the text classes
        self.label_dic = {'体育': 0, '校园': 1, '女性': 2, '出版': 3}
        self.index = 0

        with open(segment_file, 'r') as f:
            lines = f.readlines()
            for line in lines:
                # Each line is tab-separated: class first, then the sentence
                label, content = line.strip('\r\n').split('\t')[0:2]
                # Convert the class to a numeric id
                label_id = self.label_dic.get(label)
                # Convert the sentence to an array of word ids
                content_id = word_list.sentence2id(content)
                # If the sentence exceeds the maximum length, keep the first max_words ids
                content_id = content_id[0:max_words]
                # Otherwise pad with -1 up to max_words
                padding_num = max_words - len(content_id)
                content_id = content_id + [-1 for i in range(padding_num)]
                self.inputs.append(content_id)
                self.labels.append(label_id)
        self.inputs = np.asarray(self.inputs, dtype=np.int32)
        self.labels = np.asarray(self.labels, dtype=np.int32)
        # Vocabulary size, taken from the word list; it sizes the embedding table
        self.content_size = word_list.get_size()
        self._shuffle_data()

    # Shuffle the data as (input, label) pairs
    def _shuffle_data(self):
        r_index = np.random.permutation(len(self.inputs))
        self.inputs = self.inputs[r_index]
        self.labels = self.labels[r_index]

    # Return one batch of data
    def next_batch(self, batch_size):
        # The batch ends at the current index plus the batch size
        end_index = self.index + batch_size
        # If the end index exceeds the number of samples, reshuffle and start over
        if end_index > len(self.inputs):
            self._shuffle_data()
            self.index = 0
            end_index = batch_size
        # Return one batch of data by index
        batch_inputs = self.inputs[self.index:end_index]
        batch_labels = self.labels[self.index:end_index]
        self.index = end_index
        return batch_inputs, batch_labels

    # Get the vocabulary size
    def get_size(self):
        return self.content_size

# Training dataset object
train_set = TextData(train_segment_dir, train_list)
# print(train_set.next_batch(10))
# Vocabulary size of the training data
train_list_size = train_set.get_size()

2.3 Building the computational graph model

Define the function create_model to build the computational graph model. First define the model's input placeholders: the input text inputs, the output labels outputs, and the dropout keep ratio keep_prob.

First build the embedding layer, which extracts the encodings of the inputs and splices them into a matrix; for example, for the input [1, 8, 3] it extracts embedding[1], embedding[8] and embedding[3] and splices them into a matrix.

Next, the LSTM network is built. Here a two-layer network is used, with the number of nodes per layer given by the lstm_nodes[] array defined earlier. Each cell is constructed with tf.contrib.rnn.BasicLSTMCell and then wrapped in a dropout operation. The cells are merged into one LSTM network with tf.contrib.rnn.MultiRNNCell, and embedded_inputs is fed to it through tf.nn.dynamic_rnn to obtain the output rnn_output. This is a three-dimensional array whose second dimension is the number of time steps; only the last step's result is kept, i.e. the slice at index -1.

Next, build the fully connected layer with the tf.layers.dense function; after a dropout operation, its output is mapped to the classes by a second dense layer with num_classes units, giving the estimated logits.

From these, evaluation values such as the loss and accuracy can be computed: the cross-entropy loss between the predicted logits and the label values outputs, then the predicted class via argmax, and from that the accuracy.

Next, define the training method, applying gradient clipping to the gradients to keep them from exploding.

Finally, the placeholders, the evaluation values such as the loss, and the other training parameters are returned as tuples to the caller.

# Build the computational graph model
def create_model(list_size, num_classes):
    # Input and output placeholders
    inputs = tf.placeholder(tf.int32, (batch_size, max_words))
    outputs = tf.placeholder(tf.int32, (batch_size,))
    # Dropout keep ratio
    keep_prob = tf.placeholder(tf.float32, name='keep_rate')
    # Counts the total number of training steps
    global_steps = tf.Variable(0, name='global_steps', trainable=False)

    # Convert the inputs to embedding encodings
    with tf.variable_scope('embedding',
                           initializer=tf.random_uniform_initializer(-1.0, 1.0)):
        embeddings = tf.get_variable('embedding', [list_size, embedding_size], tf.float32)
        # Extract the embedding rows for the given word ids
        embedded_inputs = tf.nn.embedding_lookup(embeddings, inputs)

    # Implement the LSTM network
    scale = 1.0 / math.sqrt(embedding_size + lstm_nodes[-1]) / 3.0
    lstm_init = tf.random_uniform_initializer(-scale, scale)
    with tf.variable_scope('lstm_nn', initializer=lstm_init):
        # Build a two-layer LSTM with lstm_nodes[i] nodes per layer
        cells = []
        for i in range(lstm_layers):
            cell = tf.contrib.rnn.BasicLSTMCell(lstm_nodes[i], state_is_tuple=True)
            # Apply dropout to each cell's output
            cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
            cells.append(cell)
        # Merge the two LSTM cells
        cell = tf.contrib.rnn.MultiRNNCell(cells)
        # Feed embedded_inputs to the RNN for training
        initial_state = cell.zero_state(batch_size, tf.float32)
        # rnn_output: [batch_size, num_timesteps, lstm_nodes[-1]]
        rnn_output, _ = tf.nn.dynamic_rnn(cell, embedded_inputs, initial_state=initial_state)
        last_output = rnn_output[:, -1, :]

    # Build the fully connected layer
    fc_init = tf.uniform_unit_scaling_initializer(factor=1.0)
    with tf.variable_scope('fc', initializer=fc_init):
        fc1 = tf.layers.dense(last_output, fc_nodes, activation=tf.nn.relu, name='fc1')
        fc1_drop = tf.contrib.layers.dropout(fc1, keep_prob)
        logits = tf.layers.dense(fc1_drop, num_classes, name='fc2')

    # Evaluation metrics
    with tf.variable_scope('metrics'):
        # Loss value
        softmax_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=outputs)
        loss = tf.reduce_mean(softmax_loss)
        # Predicted class: index of the maximum along dimension 1,
        # e.g. [1,1,5,3,2] argmax => 2
        y_pred = tf.argmax(tf.nn.softmax(logits), 1, output_type=tf.int32)
        # Accuracy
        correct_prediction = tf.equal(outputs, y_pred)
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # Training method
    with tf.variable_scope('train_op'):
        train_var = tf.trainable_variables()
        # Clip the gradients to keep them from exploding
        grads, _ = tf.clip_by_global_norm(tf.gradients(loss, train_var), clip_norm=lstm_grads)
        # Apply the clipped gradients to the variables
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        train_op = optimizer.apply_gradients(zip(grads, train_var), global_steps)

    # Return the results as tuples
    return ((inputs, outputs, keep_prob),
            (loss, accuracy),
            (train_op, global_steps))

# Call the build function and unpack the returned parameters
placeholders, metrics, others = create_model(train_list_size, num_classes)
inputs, outputs, keep_prob = placeholders
loss, accuracy = metrics
train_op, global_steps = others

2.4 Training

Run the computational graph model in a Session: fetch training-set data from train_set batch by batch to fill the placeholders, run sess.run, and retrieve intermediate values such as the loss and accuracy to print:

# Training
init_op = tf.global_variables_initializer()
train_keep_prob = 0.8       # dropout keep ratio for training
train_steps = 10000

with tf.Session() as sess:
    sess.run(init_op)

    for i in range(train_steps):
        # Fetch one batch of training data
        batch_inputs, batch_labels = train_set.next_batch(batch_size)
        # Run the computational graph
        res = sess.run([loss, accuracy, train_op, global_steps],
                       feed_dict={inputs: batch_inputs, outputs: batch_labels,
                                  keep_prob: train_keep_prob})
        loss_val, acc_val, _, g_step_val = res
        if g_step_val % 20 == 0:
            print('Step %d: loss %3.3f, accuracy %3.5f' % (g_step_val, loss_val, acc_val))

After 10,000 training steps on my dataset, the accuracy on the training set hovered around 90%.


Source code and related data files: https://github.com/SuperTory/MachineLearning/tree/master/TextRNN


Origin: blog.csdn.net/theVicTory/article/details/101017006