[NLP] Sentiment analysis with an LSTM in TensorFlow

Natural Language Sentiment Analysis

Human natural language is rich in emotional color: it expresses emotions (such as sadness or happiness), moods (such as burnout or depression), and preferences (such as liking or disliking). Automatically analyzing these emotional tendencies with machines not only helps companies understand how consumers feel about their products, providing a basis for product improvement, but also helps them analyze the attitudes of business partners and make better business decisions.

We can frame sentiment analysis as a classification problem: given a text input, the machine analyzes, processes, and generalizes over the text and automatically outputs a sentiment label.

Common sentiment categories

  • Positive: expresses positive emotions, such as joy, happiness, surprise, expectation, etc.
  • Negative: expresses negative emotions, such as sadness, grief, anger, panic, etc.
  • Other: indicates other types of emotions

Sentiment analysis with a deep neural network

A common approach is to first convert each word into a vector, then combine the word vectors of a sentence into a single sentence vector, and use this sentence vector as the input to the sentiment classifier.

The simplest method is to sum and average the embedding vectors of all the words in a sentence and use the resulting averaged embedding vector as the representation of the entire sentence. However, this approach faces some problems:

  • Variable-length sentences: sentences in natural language vary in length, but most neural networks expect inputs to be tensors of equal length.
  • Combined semantics: averaging embedding vectors loses semantic information such as word order. For example, 我喜欢你 (I like you) and 你喜欢我 (you like me) produce exactly the same averaged vector, yet their meanings are clearly different, as illustrated in the sketch below.
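A tiny sketch (using NumPy and a made-up two-dimensional embedding table, chosen purely for illustration) shows that the averaged vectors of the two sentences are identical:

import numpy as np

# toy embedding table with hypothetical values, only to demonstrate the problem
emb = {"我": np.array([1.0, 0.0]),
       "喜欢": np.array([0.0, 1.0]),
       "你": np.array([1.0, 1.0])}

s1 = ["我", "喜欢", "你"]  # "I like you"
s2 = ["你", "喜欢", "我"]  # "you like me"

v1 = np.mean([emb[w] for w in s1], axis=0)
v2 = np.mean([emb[w] for w in s2], axis=0)
print(v1, v2)  # both print [0.6667 0.6667]: the word order is lost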

Handling variable-length sentences
By setting max_seq_len, we control the maximum text length that the neural network can handle. If a sentence is too short, we pad it; if it is too long, we truncate it.

(figure: padding and truncating sentences to max_seq_len)

  • For sentences longer than max_seq_len, we usually truncate them so that they fit into a fixed-length tensor. Truncation is a subtle process: sometimes it is better to keep the first part of the sentence, and sometimes the opposite. Other truncation strategies also exist; interested readers can consult the relevant literature, which will not be repeated here.
  • For sentences shorter than max_seq_len, we generally pad them with a special token, a process called padding. For the sentence "I, love, artificial, intelligence" and max_seq_len=6, there are two possible padding methods (see the sketch after this list):
    • Front padding: "[pad], [pad], I, love, artificial, intelligence"
    • Back padding: "I, love, artificial, intelligence, [pad], [pad]"
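As a small sketch of both operations, assuming the sentences have already been converted to integer word ids and that id 1 stands for the [pad] token (as in the code later in this article), tf.keras.preprocessing.sequence.pad_sequences handles truncation and padding in one call:

from tensorflow.keras.preprocessing.sequence import pad_sequences

max_seq_len = 6
sentences = [[4, 9, 7, 12],                # shorter than max_seq_len -> padded
             [4, 9, 7, 12, 5, 8, 3, 2]]    # longer than max_seq_len -> truncated

# padding='pre' corresponds to front padding, padding='post' to back padding
padded = pad_sequences(sentences, maxlen=max_seq_len,
                       padding='post', truncating='post', value=1)
print(padded)
# [[ 4  9  7 12  1  1]
#  [ 4  9  7 12  5  8]]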

Recurrent Neural Network (RNN)

The RNN is a commonly used sequence model that can model natural language sentences and other time-series signals. The RNN network structure is shown in the figure:

(figure: RNN network structure)

The sentence is split into individual words, and each word is represented by a vector. As shown in the figure, the first word ("I") is fed into the network, which updates its internal memory; this memory is passed on to the next cell and combined with the word vector of the next time step. The process repeats until the last cell, which outputs a vector representing the whole sentence, as sketched in the example below.
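As a minimal sketch (all sizes are arbitrary and only for illustration), a Keras SimpleRNN reads a batch of embedded word sequences one time step at a time and returns the memory of the last step:

import tensorflow as tf

batch_size, seq_len, embed_dim, hidden_dim = 2, 5, 8, 16

# a batch of 2 sentences, each with 5 word vectors of dimension 8
word_vectors = tf.random.normal((batch_size, seq_len, embed_dim))

rnn = tf.keras.layers.SimpleRNN(hidden_dim)  # returns only the final hidden state
sentence_vector = rnn(word_vectors)
print(sentence_vector.shape)                 # (2, 16)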

Long Short Term Memory Network (LSTM)

An RNN can process time-series data and natural language, but it has a problem: its memory capacity is limited, so it cannot retain enough useful information when sentences become long. The LSTM solves this problem: it can learn to selectively remember and forget, keeping the useful parts of earlier context and discarding the rest.

(figure: LSTM gate structure)

  • Input gate: controls how much of the current input signal is written into the memory.
  • Forget gate: controls how much of the past memory is retained.
  • Output gate: controls how much of the memory is finally output.
  • Cell state: the internal memory that is carried across time steps.

(figure: LSTM cell computation)
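The formula images are not reproduced here; for reference, a standard textbook formulation of the LSTM gates (an assumption about what the original figures contained) is:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(output)}
\end{aligned}
$$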

Implementing sentiment analysis with an LSTM

With the help of a long short-term memory network, we can perform the sentiment analysis task quite easily, as shown below. For each sentence, we first turn it into a fixed-length sequence by truncating and padding. The LSTM then reads each sentence from left to right, one word at a time. After the whole sentence has been read, we take the last output memory of the LSTM as the semantic representation of the entire sentence and feed this vector directly into a classification layer. This completes the neural network modeling of the sentiment analysis problem.

(figure: LSTM-based sentiment analysis model)

Full code

1. Load corpus data

"""
 * Created with PyCharm
 * Author: 阿光
 * Date: 2022/1/13
 * Time: 23:29
 * Description:
"""
import random
import re
import tarfile

import numpy as np
import requests


def download():
    corpus_url = "https://dataset.bj.bcebos.com/imdb%2FaclImdb_v1.tar.gz"
    web_request = requests.get(corpus_url)
    corpus = web_request.content

    with open("./aclImdb_v1.tar.gz", "wb") as f:
        f.write(corpus)


# download()


# Read (sentence, label) pairs for the requested split (train/test) from the IMDB tarball
def load_imdb(is_training):
    data_set = []
    for label in ["pos", "neg"]:
        with tarfile.open("./aclImdb_v1.tar.gz") as tarf:
            path_pattern = r"aclImdb/train/" + label + r"/.*\.txt$" if is_training \
                else r"aclImdb/test/" + label + r"/.*\.txt$"
            path_pattern = re.compile(path_pattern)
            tf = tarf.next()
            while tf is not None:
                if bool(path_pattern.match(tf.name)):
                    sentence = tarf.extractfile(tf).read().decode()
                    sentence_label = 0 if label == 'neg' else 1
                    data_set.append((sentence, sentence_label))
                tf = tarf.next()
    return data_set


# Lowercase each sentence and split it into words on spaces
def data_preprocess(corpus):
    data_set = []
    for sentence, sentence_label in corpus:
        sentence = sentence.strip().lower()
        sentence = sentence.split(" ")
        data_set.append((sentence, sentence_label))
    return data_set


# Build the vocabulary: count word frequencies and map each word to an integer id
def build_dict(corpus):
    word_freq_dict = dict()
    for sentence, _ in corpus:
        for word in sentence:
            if word not in word_freq_dict:
                word_freq_dict[word] = 0
            word_freq_dict[word] += 1
    word_freq_dict = sorted(word_freq_dict.items(), key=lambda x: x[1], reverse=True)
    word2id_dict = dict()
    word2id_freq = dict()
    word2id_dict['[oov]'] = 0
    word2id_freq[0] = 1e10
    word2id_dict['[pad]'] = 1
    word2id_freq[1] = 1e10
    for word, freq in word_freq_dict:
        word2id_dict[word] = len(word2id_dict)
        word2id_freq[word2id_dict[word]] = freq
    return word2id_freq, word2id_dict


# Convert the corpus into sequences of word ids
def convert_corpus_to_id(corpus, word2id_dict):
    data_set = []
    for sentence, sentence_label in corpus:
        sentence = [word2id_dict[word] if word in word2id_dict \
                        else word2id_dict['[oov]'] for word in sentence]
        data_set.append((sentence, sentence_label))
    return data_set


# Generator that yields a new batch on each call, for training or prediction
def build_batch(word2id_dict, corpus, batch_size, epoch_num, max_seq_len, shuffle=True):
    sentence_batch = []
    sentence_label_batch = []
    for _ in range(epoch_num):
        if shuffle:
            random.shuffle(corpus)
        for sentence, sentence_label in corpus:
            sentence_sample = sentence[:min(max_seq_len, len(sentence))]
            if len(sentence_sample) < max_seq_len:
                for _ in range(max_seq_len - len(sentence_sample)):
                    sentence_sample.append(word2id_dict['[pad]'])
            sentence_batch.append(sentence_sample)
            sentence_label_batch.append([sentence_label])
            if len(sentence_batch) == batch_size:
                yield np.array(sentence_batch).astype("int64"), np.array(sentence_label_batch).astype("int64")
                sentence_batch = []
                sentence_label_batch = []
    # yield the final, possibly smaller, leftover batch
    if len(sentence_batch) > 0:
        yield np.array(sentence_batch).astype("int64"), np.array(sentence_label_batch).astype("int64")


def get_data():
    train_corpus = load_imdb(True)
    test_corpus = load_imdb(False)
    train_corpus = data_preprocess(train_corpus)
    test_corpus = data_preprocess(test_corpus)
    word2id_freq, word2id_dict = build_dict(train_corpus)
    vocab_size = len(word2id_freq)
    train_corpus = convert_corpus_to_id(train_corpus, word2id_dict)
    test_corpus = convert_corpus_to_id(test_corpus, word2id_dict)
    train_datasets = build_batch(word2id_dict,
                                 train_corpus[:1000], batch_size=64, epoch_num=64, max_seq_len=30)
    return train_datasets
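A quick sanity check of the data pipeline (assuming aclImdb_v1.tar.gz has already been downloaded into the working directory), guarded so that it does not run when this file is imported as a module:

if __name__ == "__main__":
    batches = get_data()
    x, y = next(batches)
    print(x.shape, y.shape)  # expected: (64, 30) word-id batch and (64, 1) labels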

2. Define the LSTM network model

"""
 * Created with PyCharm
 * Author: 阿光
 * Date: 2022/1/13
 * Time: 23:45
 * Description:
"""
from tensorflow import keras
from tensorflow import nn
from tensorflow.keras.layers import Embedding, LSTM, Dense


class Model(keras.Model):
    def __init__(self):
        super(Model, self).__init__()
        # input_dim should match the vocabulary size built above (the author's value is hard-coded here)
        self.embedding = Embedding(input_dim=252173,
                                   output_dim=256)
        self.lstm = LSTM(128)                      # keep only the last hidden state
        self.fc = Dense(2, activation=nn.softmax)  # two classes: negative / positive

    def call(self, inputs):
        x = self.embedding(inputs)
        x = self.lstm(x)
        x = self.fc(x)
        return x

3. Train the model

"""
 * Created with PyCharm
 * Author: 阿光
 * Date: 2022/1/13
 * Time: 23:57
 * Description:
"""
import tensorflow as tf
from tensorflow.keras import Input

import lstm
from model import Model

models = Model()
models.build(input_shape=(None, 30))  # batches of max_seq_len=30 word ids
models.call(Input(shape=(30,)))
models.summary()

train_datasets = lstm.get_data()

models.compile(optimizer='adam',
               # the Dense layer already applies softmax, so from_logits must be False
               loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
               metrics=['accuracy'])

# Path for saving the weights
checkpoint_path = "./weight/cp.ckpt"

# Callback used to save the weights
save_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                   save_best_only=True,
                                                   save_weights_only=True,
                                                   monitor='loss',
                                                   verbose=1)

history = models.fit(train_datasets,
                     epochs=5,
                     callbacks=[save_callback])
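As a rough follow-up sketch (not part of the original script), the saved weights can be restored and used to classify a batch produced by the same pipeline; the label convention from above applies (0 = negative, 1 = positive):

models.load_weights(checkpoint_path)
x_batch, _ = next(lstm.get_data())
probs = models.predict(x_batch)
print(probs.argmax(axis=-1))  # predicted class per review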


Original article: https://blog.csdn.net/m0_47256162/article/details/122500197