Lesson 09: RNN Sequence Models and Their Variants LSTM and GRU

To process sequence data, we start with the N-gram language model, then focus on RNN, and finally use the RNN variants LSTM and GRU for hands-on text classification.

The N-gram Language Model

In previous lessons we learned that the traditional way to process a natural language sentence is to treat it as a bag of words (Bag-of-Words, BoW), ignoring the order of the words, for example when using the Naive Bayes algorithm for spam recognition or text classification. Sometimes this causes no problem in Chinese, because some sentences can still be understood even when their characters are scrambled, for example:

T: Research shows that the order of Chinese characters does not necessarily affect the reading, for example, when you read this sentence.

F: Resaerch sohws taht the odrer of Cinhese chraacters deos not necessarliy afefct raeding, for exmaple wehn you raed tihs sentence. (the T sentence with characters swapped, mirroring the Chinese original, which remains readable)

But sometimes this is not the case: once the word order is scrambled, the meaning of the sentence becomes completely different, for example:

T: I like to eat barbecue.

F: Barbecue likes to eat me. (the same words in a different order)

So, is there a model that takes the order of the words in a sentence into account? Yes, and the N-gram language model is one of them.

The N-gram language model is a language model (Language Model, LM): a probability-based discriminative model whose input is a sentence (an ordered sequence of words) and whose output is the probability of that sentence, that is, the joint probability of these words.

The idea behind the N-gram language model is that we generally need to know the current word together with the words before it, because the occurrence of each word in a sentence is not independent. For example, if the first word is "air", there is a high probability that the next word is "very" and the word after that is "fresh". Much like human association, the more information the N-gram model knows, the more accurate its results are.
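
A compact way to write this is the standard n-gram factorization: the joint probability of a sentence is decomposed with the chain rule and then approximated with a Markov assumption, so that each word depends only on the previous n-1 words:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

Each conditional probability is estimated from n-gram counts in the training corpus.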

In the previous lesson on text classification, we used sklearn's bag-of-words model. We can also try extracting 2-gram and 3-gram statistical features, enlarging the vocabulary to obtain more features.

This is controlled by the ngram_range parameter, as follows:

    from sklearn.feature_extraction.text import CountVectorizer

    vec = CountVectorizer(
        analyzer='word',      # tokenize at the word level
        ngram_range=(1, 4),   # use n-grams of size 1 up to 4
        max_features=20000,   # keep only the 20000 most frequent n-grams
    )
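
As a quick usage sketch (the pre-tokenized sentences below are made up for illustration, they are not part of the lesson's data):

    # fit the vectorizer on a few made-up sentences and inspect the result
    corpus = [
        "the air is very fresh",
        "the air is very cold",
        "i like to eat barbecue",
    ]
    X = vec.fit_transform(corpus)   # document-term matrix over 1- to 4-grams
    print(len(vec.vocabulary_))     # number of distinct n-gram features kept
    print(X.shape)                  # (number of documents, number of features)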

Thus, the N-gram model is mainly used in natural language processing fields such as part-of-speech tagging, spam message classification, word segmentation, machine translation, and speech recognition.

However, the N-gram model is not perfect; it has the following advantages and disadvantages:

  • Advantage: it contains all the information that the first N-1 words can provide, and these words place a strong constraint on the probability of the current word.

  • Disadvantages: very large-scale training text is needed to determine the model parameters, and when N is large the parameter space becomes too large, so in practice N is usually 1, 2, or 3. Sparse data also causes a data smoothing problem, which is mainly solved with Laplace smoothing, interpolation, and back-off (a small sketch of Laplace smoothing follows this list).
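
As a minimal illustration of the Laplace (add-one) smoothing mentioned above, for a bigram model with $c(\cdot)$ denoting corpus counts and $|V|$ the vocabulary size:

$$P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + |V|}$$

so that bigrams never seen in training still receive a small non-zero probability.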

Therefore, building on the strengths and weaknesses of the N-gram model, its evolution, NNLM (Neural Network based Language Model), was born.

NNLM was proposed by Bengio in 2003. It is a very simple model consisting of four layers: an input layer, an embedding layer, a hidden layer, and an output layer. The model structure is shown below (image from Baidu):

[Figure: NNLM model structure]

NNLM takes a word sequence of length N as input and outputs the category of the next word. First, the input is the index sequence of the word sequence. For example, if the index of the word "I" in the dictionary (of size |V|) is 10, the index of "am" is 23, and the index of "Xiao Ming" is 65, then the index sequence of the sentence "I am Xiao Ming" is 10, 23, 65. The embedding layer is a |V|×K matrix; rows 10, 23, and 65 are looked up and stacked into a 3×K matrix, which is the output of the embedding layer. The hidden layer takes the concatenated output of the embedding layer as input and applies the tanh activation, and the result is finally fed to the softmax output layer, which outputs the probabilities.
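
A minimal Keras sketch of this four-layer structure is given below; the vocabulary size, embedding dimension, context length, and hidden width are assumed values for illustration, not the settings from Bengio's paper:

    # a rough NNLM-style sketch (all sizes are assumptions for illustration)
    from keras.models import Sequential
    from keras.layers import Embedding, Flatten, Dense

    V = 10000   # assumed vocabulary size |V|
    K = 100     # assumed embedding dimension
    N = 3       # assumed number of input words

    nnlm = Sequential()
    nnlm.add(Embedding(V, K, input_length=N))   # embedding layer: a |V| x K lookup table
    nnlm.add(Flatten())                         # concatenate the N embedding vectors
    nnlm.add(Dense(128, activation='tanh'))     # hidden layer with tanh activation
    nnlm.add(Dense(V, activation='softmax'))    # output layer: probability of the next word
    nnlm.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    nnlm.summary()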

NNLM's biggest drawbacks are that it has many parameters and trains slowly, it requires a fixed length N (which is not very flexible), and it cannot take advantage of the complete history.

Therefore, to address the problems of NNLM, Mikolov proposed RNNLM in 2010 (read the related papers if you are interested). Its structure essentially replaces the hidden layer in NNLM with an RNN. The benefits include fewer model parameters, faster training, the ability to accept inputs of arbitrary length, and the use of the complete history. At the same time, introducing RNN means that other RNN variants such as LSTM, BLSTM, and GRU can also be used, enabling richer optimizations for sequence modeling.

So far, starting from the bag-of-words model, we have introduced the N-gram language model and its improved models NNLM and RNNLM. The rest of this lesson starts from RNN and looks at how its variants, the LSTM and GRU models, handle this kind of sequence data.

Principles of RNN and Its Variants LSTM and GRU

RNN: Born for Sequence Data

RNN stands for recurrent neural network, so called because this kind of network has "memory". It is mainly applied in natural language processing (NLP) and speech. Concretely, the network memorizes earlier information and applies it to the computation of the current output: the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.

In theory, RNN can process sequences of any length, but because this network structure suffers from the "vanishing gradient" problem, practical remedies include gradient clipping (Clipping Gradient) and LSTM (Long Short-Term Memory).
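
As a small sketch of what gradient clipping looks like in Keras (the clipping threshold here is an assumed example value):

    # cap the norm of each gradient update during training
    from keras import optimizers

    clipped_rmsprop = optimizers.RMSprop(clipnorm=1.0)   # assumed threshold
    # pass it to model.compile(..., optimizer=clipped_rmsprop) instead of the string 'rmsprop'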

The figure below shows a simple, classic RNN structure:

[Figure: classic RNN structure]

RNN contains input units (Input Units), whose input set is denoted $\{x_0, x_1, \ldots, x_t, \ldots\}$; the output set of the output units (Output Units) is denoted $\{y_0, y_1, \ldots, y_t, \ldots\}$; RNN also contains hidden units (Hidden Units), whose output set we denote $\{h_0, h_1, \ldots, h_t, \ldots\}$, and these hidden units do most of the important work.
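
For reference, a common textbook formulation of the recurrence computed by these hidden units (the weight names are conventional, not taken from the figure) is:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = \operatorname{softmax}(W_{hy} h_t + b_y)$$

Each hidden state depends on both the current input and the previous hidden state, which is exactly the "memory" described above.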

LSTM Structure

LSTM was proposed by Hochreiter & Schmidhuber in 1997 and has since become a standard form of RNN. It is used to solve the "long-term dependency" problem of the RNN model mentioned above.

[Figure: LSTM unit structure]

LSTM controls the state and output at different time steps through three "gate" structures. A so-called "gate" is a fully connected neural network with a Sigmoid activation function together with an element-wise multiplication. The Sigmoid activation outputs a value between 0 and 1 that represents how much of the current information can pass through the "gate": 0 means nothing can pass, and 1 means everything can pass. Among them, the "forget gate" and the "input gate" are the core of the LSTM unit. Below we analyze the three "gate" structures in detail; a sketch of their standard equations follows the list.

  • Forget gate: lets the LSTM "forget" previously useless information. It decides which information will be forgotten based on the input $x_t$ at the current time step, the state $C_{t-1}$ of the previous time step, and the output $h_{t-1}$ of the previous time step.

  • Input gate: lets the LSTM decide which information in the current input will be kept. After the LSTM uses the forget gate to "forget" part of the information, it needs to retain the newest memory from the current input. The input gate decides, based on the current input $x_t$, the previous state $C_{t-1}$, and the previous output $h_{t-1}$, which information will enter the current state $C_t$; the model needs to memorize this newest information.

  • Output gate: after obtaining the newest state $C_t$, the LSTM combines the previous output $h_{t-1}$ and the current input $x_t$ to decide the output of the current time step.
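
As mentioned above, here is the common textbook formulation of the three gates, with conventional notation: $\sigma$ is the Sigmoid function, $\odot$ is element-wise multiplication, and $[h_{t-1}, x_t]$ denotes concatenation:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \qquad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \qquad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t)$$

The forget gate $f_t$ scales the old state, the input gate $i_t$ admits the new candidate memory $\tilde{C}_t$, and the output gate $o_t$ controls how much of the state becomes the output $h_t$.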

GRU Structure

GRU (Gated Recurrent Unit) is a newer RNN architecture proposed in 2014; it is a simplified version of LSTM. Below is a structural comparison of LSTM and GRU (image from the web):

[Figure: comparison of LSTM and GRU structures]

With hyperparameters tuned for both, GRU is said to perform about as well as LSTM, but it has roughly one third fewer parameters and is less prone to overfitting. If you find that a model trained with LSTM overfits badly, try GRU.
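
For comparison with the LSTM equations above, a common formulation of the GRU (conventional notation, biases omitted) merges the forget and input gates into a single update gate $z_t$ and adds a reset gate $r_t$:

$$z_t = \sigma(W_z [h_{t-1}, x_t]), \qquad r_t = \sigma(W_r [h_{t-1}, x_t]), \qquad \tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

There is no separate cell state, which is where the parameter savings come from.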

Hands-on: LSTM and GRU Text Classification with Keras

A lot has been covered above, but there is much more to learn about RNN, such as bidirectional RNN, which you will need to study on your own. Now let's get hands-on with text classification based on LSTM and GRU.

This time we use Keras to build and train the models quickly, and the dataset is still the judicial data used in Lesson 06.

The whole process includes:

  1. Load the corpus
  2. Word segmentation and stop-word removal
  3. Data preprocessing
  4. Classification with LSTM
  5. Classification with GRU

Step 1: import the data processing libraries and load the stop words and corpus:

    # import packages
    import random
    import jieba
    import pandas as pd

    # load the stop words
    stopwords = pd.read_csv('stopwords.txt', index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
    stopwords = stopwords['stopword'].values

    # load the corpus
    laogong_df = pd.read_csv('beilaogongda.csv', encoding='utf-8', sep=',')
    laopo_df = pd.read_csv('beilaopoda.csv', encoding='utf-8', sep=',')
    erzi_df = pd.read_csv('beierzida.csv', encoding='utf-8', sep=',')
    nver_df = pd.read_csv('beinverda.csv', encoding='utf-8', sep=',')

    # drop NaN rows from the corpus
    laogong_df.dropna(inplace=True)
    laopo_df.dropna(inplace=True)
    erzi_df.dropna(inplace=True)
    nver_df.dropna(inplace=True)

    # convert to lists
    laogong = laogong_df.segment.values.tolist()
    laopo = laopo_df.segment.values.tolist()
    erzi = erzi_df.segment.values.tolist()
    nver = nver_df.segment.values.tolist()

Step 2: word segmentation and stop-word removal:

    # define the segmentation and labeling function preprocess_text
    # content_lines is the list converted above
    # sentences is an empty list used to store the labeled data
    # category is the class label
    def preprocess_text(content_lines, sentences, category):
        for line in content_lines:
            try:
                segs = jieba.lcut(line)
                segs = [v for v in segs if not str(v).isdigit()]          # remove digits
                segs = list(filter(lambda x: x.strip(), segs))            # remove whitespace-only tokens
                segs = list(filter(lambda x: len(x) > 1, segs))           # remove single-character tokens
                segs = list(filter(lambda x: x not in stopwords, segs))   # remove stop words
                sentences.append((" ".join(segs), category))              # attach the label
            except Exception:
                print(line)
                continue

    # call the function to generate the training data
    sentences = []
    preprocess_text(laogong, sentences, 0)
    preprocess_text(laopo, sentences, 1)
    preprocess_text(erzi, sentences, 2)
    preprocess_text(nver, sentences, 3)

Step 3: shuffle the data so that it is evenly distributed, then get the lists of features and labels:

    # shuffle the data to get a more reliable training set
    random.shuffle(sentences)

    # print the first 10 records to the console for a quick look
    for sentence in sentences[:10]:
        print(sentence[0], sentence[1])

    # all features and the corresponding labels
    all_texts = [sentence[0] for sentence in sentences]
    all_labels = [sentence[1] for sentence in sentences]

Step 4: classify the data with LSTM:

    # import the required modules
    import numpy as np
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical
    from keras.layers import Dense, Input, Flatten, Dropout
    from keras.layers import LSTM, Embedding, GRU
    from keras.models import Sequential

    # predefined variables
    MAX_SEQUENCE_LENGTH = 100   # maximum sequence length
    EMBEDDING_DIM = 200         # embedding dimension
    VALIDATION_SPLIT = 0.16     # validation set ratio
    TEST_SPLIT = 0.2            # test set ratio

    # tokenize and pad the text sequences with Keras
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(all_texts)
    sequences = tokenizer.texts_to_sequences(all_texts)
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    labels = to_categorical(np.asarray(all_labels))
    print('Shape of data tensor:', data.shape)
    print('Shape of label tensor:', labels.shape)

    # split the data
    p1 = int(len(data) * (1 - VALIDATION_SPLIT - TEST_SPLIT))
    p2 = int(len(data) * (1 - TEST_SPLIT))
    x_train = data[:p1]
    y_train = labels[:p1]
    x_val = data[p1:p2]
    y_val = labels[p1:p2]
    x_test = data[p2:]
    y_test = labels[p2:]

    # build the LSTM model
    model = Sequential()
    model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
    model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(labels.shape[1], activation='softmax'))
    model.summary()

    # compile the model
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
    print(model.metrics_names)

    # train, save, and evaluate the model
    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)
    model.save('lstm.h5')
    print(model.evaluate(x_test, y_test))

The training results are:

[Figure: LSTM training output]

Step 5: classify the text with GRU. The code above is the complete LSTM text classification; to use GRU instead, only the model-building part needs to change:

    model = Sequential()
    model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
    model.add(GRU(200, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(labels.shape[1], activation='softmax'))
    model.summary()
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
    print(model.metrics_names)
    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)
    model.save('gru.h5')   # save under a separate name so the LSTM model is not overwritten
    print(model.evaluate(x_test, y_test))

The training results:

[Figure: GRU training output]
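
As a small follow-up sketch that is not part of the original lesson, this is one way the saved model could be used to classify a new sentence; the example sentence is made up, and the preprocessing roughly mirrors the training pipeline above:

    # load a saved model and classify one new (made-up) sentence
    from keras.models import load_model

    clf = load_model('lstm.h5')   # or 'gru.h5'
    text = "被告人将其妻子打伤"     # hypothetical input sentence
    segs = [w for w in jieba.lcut(text) if len(w) > 1 and w not in stopwords]
    seq = tokenizer.texts_to_sequences([" ".join(segs)])
    seq = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
    probs = clf.predict(seq)
    print(probs)                       # class probabilities
    print(np.argmax(probs, axis=1))    # predicted class index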

Summary

Starting from the bag-of-words model, this lesson introduced the N-gram language model and its improved models NNLM and RNNLM, then went through RNN and its variants LSTM and GRU to understand how they handle this kind of sequence data, and ended with a hands-on Chinese text classification based on LSTM and GRU.


Origin www.cnblogs.com/chen8023miss/p/11977245.html