CNN sentiment classification


The data and code for this CNN text classification example can be found via the linked repository.

The data used in the experiment are Dangdang book reviews; see the following examples:


Negative: I don't like the person, let alone the text; the ancient rhyme is a bit too much, hypocritical!
Positive: This book is really good, it should be given five stars, but unfortunately the rating can't be modified. It has the author's personal style and profile, and you can learn a lot of health knowledge. Can the author publish another book? If such a good book comes out, I will definitely buy it. Thanks to Dangdang's enthusiastic customers for recommending, I can only buy these good books, and I also thank Dangdang. I wish good-hearted people a happy family and good health! ~

(1) Load data and labels

import codecs
import numpy as np

def load_data_and_labels():
  """
  Loads polarity data from files, splits the data into words and generates labels.
  Returns split sentences and labels.
  """
  # Load data from files
  positive_examples = list(codecs.open("./data/chinese/pos.txt", "r", "utf-8").readlines())
  positive_examples = [s.strip() for s in positive_examples]
  negative_examples = list(codecs.open("./data/chinese/neg.txt", "r", "utf-8").readlines())
  negative_examples = [s.strip() for s in negative_examples]
  # Split by words (for the Chinese reviews, each sentence is split into characters)
  x_text = positive_examples + negative_examples
  # x_text = [clean_str(sent) for sent in x_text]  # only needed for English data
  x_text = [list(s) for s in x_text]

  # Generate one-hot labels: positive = [0, 1], negative = [1, 0]
  positive_labels = [[0, 1] for _ in positive_examples]
  negative_labels = [[1, 0] for _ in negative_examples]
  y = np.concatenate([positive_labels, negative_labels], 0)
  return [x_text, y]
This function loads the positive and negative examples from file, combines them, and segments each sentence, so x_text is a two-dimensional list that stores the tokens of every review. The corresponding labels are combined as well. Since the two labels correspond to the two neurons of the binary classifier's output layer, they are one-hot encoded as [0, 1] and [1, 0], and y is returned together with x_text.
Note that f.readlines() already returns a list whose elements are the lines of text (str type, ending with "\n"), so the outer list() conversion is not strictly necessary; s.strip() removes the trailing newline and surrounding whitespace from each sentence.
For English data, after removing the newline each sentence also needs some cleanup (done in the clean_str() function) to separate punctuation and contractions, due to the issue mentioned earlier. The simplest way to tokenize an English str is to split on spaces, so we only need to insert a space around every part that should become its own token and then call split(" ") on the whole string to finish the tokenization. For the Chinese reviews used here, each sentence is simply split into characters with list(s).
The generation of labels follows the same pattern.
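For English data, a clean_str() along the following lines would do the job. This is only an illustrative sketch; the exact regex rules are assumptions, not the function used in the original code:

import re

def clean_str(s):
  # Hypothetical sketch: split common contractions, pad punctuation with
  # spaces, collapse whitespace, so that s.split(" ") yields one token each.
  s = re.sub(r"n't", " n't", s)            # "don't" -> "do n't"
  s = re.sub(r"([.,!?()])", r" \1 ", s)    # "good!" -> "good ! "
  s = re.sub(r"\s{2,}", " ", s)            # collapse repeated spaces
  return s.strip().lower()

# Example: clean_str("I don't like it!").split(" ")
# -> ['i', 'do', "n't", 'like', 'it', '!']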

(2) Pad sentences

def pad_sentences(sentences, padding_word="<PAD/>"):
  """
  Pads all sentences to the same length. The length is defined by the longest sentence.
  Returns padded sentences.
  """
  sequence_length = max(len(x) for x in sentences)
  padded_sentences = []
  for i in range(len(sentences)):
    sentence = sentences[i]
    num_padding = sequence_length - len(sentence)
    new_sentence = sentence + [padding_word] * num_padding
    padded_sentences.append(new_sentence)
  return padded_sentences

Why pad the sentences?
Because input_x in the TextCNN model corresponds to a tf.placeholder, which is a tensor whose shape is fixed in advance, e.g. [batch, sequence_len]; the rows of a tensor cannot have different lengths. We therefore find the length of the longest sentence in the whole dataset and append padding words to every shorter sentence, so that all input sentences have the same length.
Since load_data_and_labels() returns a two-dimensional list storing the tokens of each sentence, pad_sentences() returns the data in the same form; the only difference is that padding words may have been appended to the end of each sentence list.
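A toy run might look like this (the token strings are made up purely for illustration):

sentences = [["好", "书"], ["不", "喜", "欢", "这", "本", "书"]]
padded = pad_sentences(sentences)
# padded[0] -> ['好', '书', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>']
# padded[1] -> ['不', '喜', '欢', '这', '本', '书']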

(3) Build vocabulary

def build_vocab(sentences):
  """
  Builds a vocabulary mapping from word to index based on the sentences.
  Returns vocabulary mapping and inverse vocabulary mapping.
  """
  # Build vocabulary
  word_counts = Counter(itertools.chain(*sentences))
  # Mapping from index to word
  vocabulary_inv = [x[0] for x in word_counts.most_common()]
  # Mapping from word to index
  vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
  return [vocabulary, vocabulary_inv]
Here, we use Counter from the collections module to do the word-frequency counting, for example:

from collections import Counter

sentence = ["i", "love", "mom", "mom", "mom", "me", "loves", "me"]
word_counts = Counter(sentence)
print word_counts.most_common()
vocabulary_inv = [x[0] for x in word_counts.most_common()]
print vocabulary_inv
vocabulary_inv = list(sorted(vocabulary_inv))
print vocabulary_inv
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
print vocabulary

Output:
[('mom', 3), ('me', 2), ('i', 1), ('love', 1), ('loves', 1)]
['mom', 'me', 'i', 'love', 'loves']
['i', 'love', 'loves', 'me', 'mom']
{'i': 0, 'me': 3, 'love': 1, 'mom': 4, 'loves': 2}
Counter accepts an iterable, but here we have many separate sentence lists. How can we iterate efficiently over all the words contained in all of these sentence word lists?
This is what itertools.chain(*iterables) is for:

it takes multiple iterators as arguments and returns a single iterator that yields the contents of all the argument iterators as if they came from a single sequence.
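A minimal illustration with two toy sentence lists (made up for demonstration):

import itertools

sentences = [["this", "book", "is", "good"], ["not", "my", "taste"]]
all_words = list(itertools.chain(*sentences))
print(all_words)
# ['this', 'book', 'is', 'good', 'not', 'my', 'taste']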

From this we obtain word_counts, the word-frequency statistics over the entire dataset.
To create the vocabulary dict we only need the first element of each (word, count) pair in word_counts, i.e. the word itself (Counter effectively deduplicates the words for us). The vocabulary does not have to follow word frequency; it can instead be built in lexicographical order by applying sorted() to the word list, so that words with smaller initial characters come first (the build_vocab() code above keeps the frequency order, while the small example shows the sorted variant).
Then a dict is created that stores the index corresponding to each word; this is the vocabulary variable.

(4) Build input data


def build_input_data(sentences, labels, vocabulary):
  """
  Maps sentences and labels to vectors based on a vocabulary.
  """
  # x is the index matrix: vocabulary[word] looks up the index of each word
  x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
  y = np.array(labels)
  return [x, y]
From the previous functions we have a two-dimensional list of all tokenized (and padded) sentences, the labels corresponding to the sentences, and a vocabulary dictionary for looking up the index of each word.
But think about it: at this point the sentences still store word strings, which takes a lot of memory when the dataset is large. It is better to store the index of each word instead, since an int takes much less space.
So the newly built vocabulary is used to look up every word in the two-dimensional sentence list, producing a two-dimensional list of word indices, which is finally converted into a two-dimensional numpy array.
Because the corresponding labels are already two-dimensional lists of 0s and 1s, they can be converted into an array directly.
After the conversion, the arrays can be fed directly to the CNN as inputs and labels.
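A tiny illustration of the mapping, with made-up tokens and indices:

sentences = [["good", "book", "<PAD/>"], ["not", "my", "taste"]]
labels = [[0, 1], [1, 0]]
vocabulary = {"<PAD/>": 0, "good": 1, "book": 2, "not": 3, "my": 4, "taste": 5}
x, y = build_input_data(sentences, labels, vocabulary)
# x -> array([[1, 2, 0],
#             [3, 4, 5]])
# y -> array([[0, 1],
#             [1, 0]])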

(5) Load data

def load_data():
  """
  Loads and preprocesses data for the dataset.
  Returns input vectors, labels, vocabulary, and inverse vocabulary.
  """
  # Load and preprocess data
  sentences, labels = load_data_and_labels()
  sentences_padded = pad_sentences(sentences)
  vocabulary, vocabulary_inv = build_vocab(sentences_padded)
  x, y = build_input_data(sentences_padded, labels, vocabulary)
  return [x, y, vocabulary, vocabulary_inv]


Finally, load_data() ties the above processing functions together (a small usage sketch follows the list):

1. First load the raw data from the text files and store it temporarily as a list of sentences; clean each sentence (clean_str for English data) and tokenize it, giving a two-dimensional list sentences whose basic unit is the word; the labels are [0, 1] and [1, 0].
2. Find the maximum sentence length and pad the sentences that are shorter.
3. Build the vocabulary from the data (by frequency or in lexicographical order) and obtain the word corresponding to each word index.
4. Convert the two-dimensional str-typed sentences into int-typed index sentences, and return two-dimensional numpy arrays to be used later as the model's inputs and labels.
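A quick sanity check might look like this (the exact shapes depend on the dataset files):

x, y, vocabulary, vocabulary_inv = load_data()
print(x.shape)          # (num_reviews, max_sentence_length)
print(y.shape)          # (num_reviews, 2)
print(len(vocabulary))  # vocabulary size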

(6) Generate batches

def batch_iter(data, batch_size, num_epochs):
  """
  Generates a batch iterator for a dataset.
  """
  data = np.array(data)
  data_size = len(data)
  # (data_size - 1) // batch_size + 1 so the last, possibly smaller, batch is never empty
  num_batches_per_epoch = int((data_size - 1) / batch_size) + 1
  for epoch in range(num_epochs):
    # Shuffle the data at each epoch
    shuffle_indices = np.random.permutation(np.arange(data_size))
    shuffled_data = data[shuffle_indices]
    for batch_num in range(num_batches_per_epoch):
      start_index = batch_num * batch_size
      end_index = min((batch_num + 1) * batch_size, data_size)
      yield shuffled_data[start_index:end_index]

This function lets you create batches = batch_iter(...) once for the whole training procedure and then simply loop over it, operating on one batch of data at a time:

batches = batch_iter(...)

for batch in batches:
  # process this batch
  ...

yield
yield is a bit like return, except that it returns a generator. To grasp the essence of yield, you must understand its main point: when you call the function, the code you wrote inside it does not actually run; the call just returns a generator object. A bit tricky :-)

Your code then runs each time the for loop pulls the next value from the generator.

Now comes the hardest part of the explanation:
when the for loop uses the generator for the first time, the function body starts running from the beginning until it reaches the first yield and produces the first value. Each subsequent iteration resumes the body, runs until the next yield, and returns the next value, and so on until no value is produced any more. The generator is considered exhausted once the function body finishes without yielding again, either because the loop inside it has reached its end or because an if/else condition is no longer satisfied.
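A minimal illustration of this behaviour:

def count_up_to(n):
  # Nothing inside this body runs when count_up_to(3) is called
  i = 0
  while i < n:
    yield i    # pause here and hand one value back to the for loop
    i += 1

gen = count_up_to(3)   # just returns a generator object
for value in gen:
  print(value)         # prints 0, 1, 2; the body resumes after each yield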


