Dive into Deep Learning - Text Sentiment Classification with an RNN

        Text classification is a common task in natural language processing: it transforms a variable-length text sequence into a category of the text. This section focuses on one of its sub-problems: text sentiment classification, i.e. analyzing the emotion expressed in a piece of text. This problem is also called sentiment analysis and has wide-ranging applications. For example, we can analyze user reviews of a product to gauge user satisfaction, or analyze users' sentiment about market conditions to help predict future trends.

       Here we apply pre-trained word vectors and a bidirectional recurrent neural network with multiple hidden layers to determine whether a variable-length text sequence expresses positive or negative sentiment.

1. Import packages and modules

import collections
import os
import random
import tarfile
import torch
from torch import nn
import torchtext.vocab as Vocab
import torch.utils.data as Data

import sys
sys.path.append("..")
import d2lzh_pytorch as d2l

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

DATA_ROOT = "./Datasets"

2. Read the data

We use Stanford's Large Movie Review Dataset (IMDb) as the data set for text sentiment classification. It is split into a training set and a test set, each containing 25,000 movie reviews downloaded from IMDb. In each set, the numbers of reviews labeled "positive" and "negative" are equal. Download and extract the data into the Datasets folder, then read the training and test sets. Each sample is a review together with its label: 1 for "positive", 0 for "negative".

from tqdm import tqdm

def read_imdb(folder='train', data_root="./Datasets/aclImdb"):
    data = []
    for label in ['pos', 'neg']:
        folder_name = os.path.join(data_root, folder, label)
        for file in tqdm(os.listdir(folder_name)):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '').lower()
                data.append([review, 1 if label == 'pos' else 0])
    random.shuffle(data)
    return data

train_data, test_data = read_imdb('train'), read_imdb('test')
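
As a quick sanity check, we can peek at a couple of the shuffled samples; each entry of the returned list is a [review, label] pair, so something along these lines should work:

# Print a truncated review and its label for a few samples.
for review, label in train_data[:3]:
    print('label:', label, 'review:', review[:60])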

3. Data preprocessing

We define the get_tokenized_imdb function, which uses the simplest method: tokenization by splitting on spaces.

def get_tokenized_imdb(data):
    """
    data: list of [string, label]
    """
    def tokenizer(text):
        return [tok.lower() for tok in text.split(' ')]
    return [tokenizer(review) for review, _ in data]

Now we can create a vocabulary from the tokenized training data set. Here we filter out words that appear fewer than 5 times.

def get_vocab_imdb(data):
    tokenized_data = get_tokenized_imdb(data)
    counter = collections.Counter([tk for st in tokenized_data for tk in st])
    return Vocab.Vocab(counter, min_freq=5)

vocab = get_vocab_imdb(train_data)
'# words in vocab:', len(vocab)  # ('# words in vocab:', 46151)
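
The resulting vocabulary exposes the usual torchtext lookup tables, stoi (token to index) and itos (index to token); a tiny sketch of how a token round-trips through them:

# Token -> index and index -> token lookups on the built vocabulary.
idx = vocab.stoi['movie']    # index of the token 'movie' (assuming it survived the min_freq filter)
print(idx, vocab.itos[idx])  # maps the index back to 'movie'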

Because the reviews have different lengths, they cannot be directly combined into mini-batches. We define the preprocess_imdb function to tokenize each review, convert the tokens into word indices using the vocabulary, and then truncate each review or pad it with 0s so that its length becomes a fixed 500.

def preprocess_imdb(data, vocab):
    max_l = 500  # truncate or pad each review with 0s so that its length becomes 500

    def pad(x):
        return x[:max_l] if len(x) > max_l else x + [0] * (max_l - len(x))

    tokenized_data = get_tokenized_imdb(data)
    features = torch.tensor([pad([vocab.stoi[word] for word in words]) for words in tokenized_data])
    labels = torch.tensor([score for _, score in data])
    return features, labels
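
A quick shape check on the output of preprocess_imdb; given the 25,000 reviews per set described above, the expected shapes are noted in the comments:

# features is (number of reviews, 500) word indices; labels is (number of reviews,).
features, labels = preprocess_imdb(train_data, vocab)
print(features.shape, labels.shape)  # expected: torch.Size([25000, 500]) torch.Size([25000])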

4. Create a data iterator

Now we create the data iterators. Each iteration returns a mini-batch of data.

batch_size = 64
train_set = Data.TensorDataset(*preprocess_imdb(train_data, vocab))
test_set = Data.TensorDataset(*preprocess_imdb(test_data, vocab))
train_iter = Data.DataLoader(train_set, batch_size, shuffle=True)
test_iter = Data.DataLoader(test_set, batch_size)
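
To confirm what a mini-batch looks like, we can pull a single batch from train_iter (a small inspection sketch):

# Each batch is a pair of tensors: X of shape (batch_size, 500) and y of shape (batch_size,).
for X, y in train_iter:
    print('X', X.shape, 'y', y.shape)
    break
print('#batches:', len(train_iter))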

5. The RNN model

In this model, each word first obtains a feature vector from the embedding layer. Then we use a bidirectional recurrent neural network to further encode the feature sequence and obtain the sequence information. Finally, the encoded sequence information is transformed into the output by a fully connected layer. Concretely, we concatenate the hidden states of the bidirectional LSTM at the initial and final time steps and pass this as the representation of the feature sequence to the output layer for classification. In the BiRNN class implemented below, the Embedding instance is the embedding layer, the LSTM instance is the hidden layer that encodes the sequence, and the Linear instance is the output layer that produces the classification result.

class BiRNN(nn.Module):
    def __init__(self, vocab, embed_size, num_hiddens, num_layers):
        super(BiRNN, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embed_size)  # embedding layer

        # set bidirectional to True to obtain a bidirectional recurrent neural network
        self.encoder = nn.LSTM(input_size=embed_size,
                               hidden_size=num_hiddens,
                               num_layers=num_layers,
                               bidirectional=True)  # hidden layer
        self.decoder = nn.Linear(4 * num_hiddens, 2)  # hidden states of the initial and final time steps are the input to the fully connected output layer

    def forward(self, inputs):
        # inputs has shape (batch size, number of words). Because the LSTM requires the sequence
        # length (seq_len) as the first dimension, the input is permuted before extracting word
        # features; the output has shape (number of words, batch size, word vector dimension).
        embeddings = self.embedding(inputs.permute(1, 0))
        # rnn.LSTM is passed only the input embeddings, so it returns only the hidden states of the
        # last hidden layer at each time step.
        # outputs has shape (number of words, batch size, 2 * number of hidden units)
        outputs, _ = self.encoder(embeddings)  # output, (h, c)
        # Concatenate the hidden states of the initial and final time steps as input to the fully
        # connected layer. Its shape is (batch size, 4 * number of hidden units).
        encoding = torch.cat((outputs[0], outputs[-1]), -1)
        outs = self.decoder(encoding)
        return outs

Create a bidirectional recurrent neural network with two hidden layers.

embed_size, num_hiddens, num_layers = 100, 100, 2
net = BiRNN(vocab, embed_size, num_hiddens, num_layers)
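
Before training, a quick sanity check on the untrained network: feeding it a batch of word indices should yield two scores per review, one for each class (a hedged sketch; the dummy tensor below is purely illustrative):

# Dummy forward pass: a batch of 4 "reviews", each of 500 random word indices.
dummy = torch.randint(0, len(vocab), (4, 500))
print(net(dummy).shape)  # expected: torch.Size([4, 2])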

6. Load pre-trained word vectors

Because the training data set for sentiment classification is not very large, we directly use word vectors pre-trained on a much larger corpus as the feature vector of each word in order to counter overfitting. Here, we load a 100-dimensional GloVe word vector for each word in the vocabulary vocab, and then use these word vectors as the feature vectors of the words in each review. Note that the dimension of the pre-trained word vectors must match the embedding layer output size embed_size of the model created above. In addition, we no longer update these word vectors during training.

glove_vocab = Vocab.GloVe(name='6B', dim=100, cache=os.path.join(DATA_ROOT, "glove"))

def load_pretrained_embedding(words, pretrained_vocab):
    """Extract the word vectors corresponding to words from a pre-trained vocab."""
    embed = torch.zeros(len(words), pretrained_vocab.vectors[0].shape[0])  # initialized to 0
    oov_count = 0  # out of vocabulary
    for i, word in enumerate(words):
        try:
            idx = pretrained_vocab.stoi[word]
            embed[i, :] = pretrained_vocab.vectors[idx]
        except KeyError:
            oov_count += 1
    if oov_count > 0:
        print("There are %d oov words." % oov_count)
    return embed

net.embedding.weight.data.copy_(load_pretrained_embedding(vocab.itos, glove_vocab))
net.embedding.weight.requires_grad = False  # loaded directly from pre-trained vectors, so no need to update it
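
To verify that the embedding layer is indeed frozen (which is why the optimizer below filters parameters on requires_grad), something like the following count can be used:

# Count trainable vs. frozen parameters; the embedding weight should be the only frozen one.
trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in net.parameters() if not p.requires_grad)
print('trainable:', trainable, 'frozen (embedding):', frozen)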

For the pre-trained word vectors, the example below uses 50-dimensional GloVe vectors pre-trained on a Wikipedia subset. The first time a pre-trained word vector instance is created, the corresponding vectors are automatically downloaded to the folder specified by cache (.vector_cache by default), so an internet connection is required.

cache_dir = "./Datasets/glove"
glove = Vocab.GloVe(name='6B', dim=50, cache=cache_dir)

# ./Datasets/glove/glove.6B.zip: 862MB [40:57, 351kB/s]                                
# 100%|█████████▉| 399022/400000 [00:30<00:00, 24860.36it/s]

7. Train and evaluate the model

lr, num_epochs = 0.01, 5
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=lr)
loss = nn.CrossEntropyLoss()
d2l.train(train_iter, test_iter, net, loss, optimizer, device, num_epochs)
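
d2l.train comes from the accompanying d2lzh_pytorch helper module. If that module is not at hand, a minimal stand-in with the same call signature might look like the sketch below (a simplified loop, not the original helper):

def train_sketch(train_iter, test_iter, net, loss, optimizer, device, num_epochs):
    """Minimal training loop: one pass over train_iter per epoch, then report test accuracy."""
    net = net.to(device)
    for epoch in range(num_epochs):
        net.train()
        total_loss, total_correct, n = 0.0, 0, 0
        for X, y in train_iter:
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            total_loss += l.item() * y.shape[0]
            total_correct += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        net.eval()
        with torch.no_grad():
            test_correct, test_n = 0, 0
            for X, y in test_iter:
                X, y = X.to(device), y.to(device)
                test_correct += (net(X).argmax(dim=1) == y).sum().item()
                test_n += y.shape[0]
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, total_loss / n, total_correct / n, test_correct / test_n))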

8. Define a prediction function

def predict_sentiment(net, vocab, sentence):
    """sentence is a list of words"""
    device = list(net.parameters())[0].device
    sentence = torch.tensor([vocab.stoi[word] for word in sentence], device=device)
    label = torch.argmax(net(sentence.view((1, -1))), dim=1)
    return 'positive' if label.item() == 1 else 'negative'

Make a prediction:

predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'great'])  # positive
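
By the same token, a review containing a clearly negative word should come out as negative (the actual output of course depends on the trained model):

predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'bad'])  # negative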


Source: www.cnblogs.com/harbin-ho/p/12003246.html