[Natural Language Processing (NLP) Practical Combat] LSTM network implements Chinese text sentiment analysis (hands-on and super detailed teaching)

Table of contents

Introduction

1. Overview of all files
    1. Chinese stop-word list (hit_stopwords.txt)
    2. The data dataset (files extracted from chinese_text_cnn-master.zip)

2. Install the dependencies

3. Data preprocessing (data_set.py)
    train.txt - training set after stop-word removal
    test.txt - test set after stop-word removal

4. Model training and saving (main.py)
    1. LSTM model construction
    2. main.py code display
    3. Model saving
    4. Training results

5. LSTM model test (test.py)
    1. Test result
    2. Test result

6. Complete code display
    1. data_set.py
    2. main.py
    3. test.py


Introduction:

In today's digital age, people generate massive amounts of text data on social media, review platforms, and other forms of online communication. This data carries rich emotional information, making it a valuable resource for understanding user attitudes, market trends, and even broader social sentiment. Advances in natural language processing (NLP) have given us powerful tools that make automated sentiment analysis practical. Within this field, long short-term memory (LSTM) networks have become an important technique for sentiment analysis tasks because of their ability to capture long-distance dependencies in text sequences.

This blog will teach you, step by step, how to implement Chinese text sentiment analysis with an LSTM network. We will start with data preprocessing and gradually build an end-to-end sentiment analysis model. Through detailed steps and sample code, you will gain a hands-on understanding of how to process Chinese text data and how to build, train, and evaluate an LSTM model.

1. Overview of all files:

1. The Chinese stop-word list (hit_stopwords.txt) comes from:

Project directory preview - stopwords - GitCode

2. The data dataset consists of the files extracted from chinese_text_cnn-master.zip. Click the link to open the GitHub page, click Code, then Download ZIP to download it.

2. Install the dependencies:

pip install torch   # for building the LSTM model
pip install gensim  # for training Chinese word vectors
pip install numpy   # for data cleaning and preprocessing
pip install pandas
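
A quick sanity check that all four libraries installed correctly (a minimal sketch; the printed version numbers will vary with your environment):

# If any of these imports fails, the corresponding package is not installed correctly
import torch
import gensim
import numpy
import pandas

print(torch.__version__, gensim.__version__, numpy.__version__, pandas.__version__)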

3. Data preprocessing (data_set.py):

# -*- coding: utf-8 -*-
# @Time : 2023/11/15 10:52
# @Author :Muzi
# @File : data_set.py
# @Software: PyCharm
import pandas as pd
import jieba


# Read the TSV data: the label sits in the second column, the text in the last column
def load_tsv(file_path):
    data = pd.read_csv(file_path, sep='\t')
    data_x = data.iloc[:, -1]
    data_y = data.iloc[:, 1]
    return data_x, data_y

# Load the stop-word list
with open('./hit_stopwords.txt', 'r', encoding='UTF8') as f:
    stop_words = [word.strip() for word in f.readlines()]
    print('Stop words loaded successfully')

# Remove stop words; build new lists instead of calling remove() while iterating,
# which would skip words
def drop_stopword(datas):
    return [[word for word in data if word not in stop_words] for data in datas]

# Save the tokenized sentences: one sentence per line, words separated by commas
def save_data(datax, path):
    with open(path, 'w', encoding="UTF8") as f:
        for lines in datax:
            f.write(','.join(str(line) for line in lines))
            f.write('\n')

# Keep the heavy work under the __main__ guard (note the double underscores;
# '__main' would never match) so importing load_tsv from main.py does not re-run it
if __name__ == '__main__':
    train_x, train_y = load_tsv("./data/train.tsv")
    test_x, test_y = load_tsv("./data/test.tsv")
    # Tokenize with jieba
    train_x = [list(jieba.cut(x)) for x in train_x]
    test_x = [list(jieba.cut(x)) for x in test_x]

    train_x = drop_stopword(train_x)
    test_x = drop_stopword(test_x)

    save_data(train_x, './train.txt')
    save_data(test_x, './test.txt')
    print('Saved successfully')

train.txt - training set file after removing stop words:

 

test.txt - test set file after removing stop words:
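
Both files store one review per line, with the segmented words joined by commas. The dataset consists of car reviews, so a line might look like the following (a hypothetical example, not an actual line from the file):

空间,大,外观,好看,动力,充足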

4. Model training and saving (main.py)

1. LSTM model construction:

Different datasets call for different classification schemes. The dataset used here poses a binary classification problem: each review is either positive or negative.

# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])  # take the last output of the sequence
        return output

# Instantiate the model
input_size = word2vec_model.vector_size
hidden_size = 50  # adjust the hidden-layer size as needed
output_size = 2   # output size depends on the task; two classes here

model = LSTMModel(input_size, hidden_size, output_size)
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)
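
Because each review is averaged into a single Word2Vec vector, the model sees sequences of length 1. A minimal shape check with a dummy batch (assuming the LSTMModel class above and vector_size=100; the batch size of 4 is arbitrary):

import torch

dummy = torch.randn(4, 1, 100)         # (batch, seq_len, input_size)
logits = LSTMModel(100, 50, 2)(dummy)  # -> (batch, output_size)
print(logits.shape)                    # torch.Size([4, 2])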

2. main.py code display:

# -*- coding: utf-8 -*-
# @Time : 2023/11/13 20:31
# @Author :Muzi
# @File : main.py
# @Software: PyCharm
import torch
from torch import nn
from gensim.models import Word2Vec
import numpy as np
from data_set import load_tsv
from torch.utils.data import DataLoader, TensorDataset


# Read a preprocessed file: one sentence per line, words separated by commas
def load_txt(path):
    with open(path, 'r', encoding='utf-8') as f:
        data = [[line.strip()] for line in f.readlines()]
        return data

train_x = load_txt('train.txt')
test_x = load_txt('test.txt')
train = train_x + test_x
# Split each line back into its word list so Word2Vec trains on words, not characters
X_all = [x[0].split(',') for x in train]

_, train_y = load_tsv("./data/train.tsv")
_, test_y = load_tsv("./data/test.tsv")
# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=X_all, vector_size=100, window=5, min_count=1, workers=4)

# Convert a sentence into a vector by averaging its Word2Vec word vectors
def text_to_vector(text):
    words = text.split(',') if isinstance(text, str) else text
    vector = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    return sum(vector) / len(vector) if vector else np.zeros(word2vec_model.vector_size)

X_train_w2v = [[text_to_vector(text)] for line in train_x for text in line]
X_test_w2v = [[text_to_vector(text)] for line in test_x for text in line]

# Convert the sentence vectors into PyTorch tensors
X_train_array = np.array(X_train_w2v, dtype=np.float32)
X_train_tensor = torch.Tensor(X_train_array)
X_test_array = np.array(X_test_w2v, dtype=np.float32)
X_test_tensor = torch.Tensor(X_test_array)
# Wrap the tensors with DataLoader
train_dataset = TensorDataset(X_train_tensor, torch.LongTensor(train_y))
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataset = TensorDataset(X_test_tensor, torch.LongTensor(test_y))
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=True)

# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])  # take the last output of the sequence
        return output

# Instantiate the model
input_size = word2vec_model.vector_size
hidden_size = 50  # adjust the hidden-layer size as needed
output_size = 2   # output size depends on the task; two classes here

model = LSTMModel(input_size, hidden_size, output_size)
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)

if __name__ == "__main__":
    # Train the model
    num_epochs = 10
    log_interval = 100  # log every 100 batches
    loss_min = float('inf')
    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            outputs = model(data)
            loss = criterion(outputs, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % log_interval == 0:
                print('Epoch [{}/{}], Batch [{}/{}], Loss: {:.4f}'.format(
                    epoch + 1, num_epochs, batch_idx, len(train_loader), loss.item()))
            # Save the best model (lowest batch loss so far)
            if loss.item() < loss_min:
                loss_min = loss.item()
                torch.save(model, 'model.pth')

    # Evaluate the model
    with torch.no_grad():
        model.eval()
        correct = 0
        total = 0
        for data, target in test_loader:
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

        accuracy = correct / total
        print('Test Accuracy: {:.2%}'.format(accuracy))

3. Model saving

Inside the training loop, the model is saved whenever the current batch loss reaches a new minimum (a validation metric would be a more reliable selection criterion, but batch loss keeps the example simple):

# Save the best model (this sits inside the training loop)
if loss.item() < loss_min:
    loss_min = loss.item()
    torch.save(model, 'model.pth')
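
Note that torch.save(model, ...) pickles the entire module, which ties the checkpoint to this exact class definition. A common alternative, shown here only as a sketch (the file name model_state.pth is illustrative), is to save just the weights:

# Save only the parameters
torch.save(model.state_dict(), 'model_state.pth')

# To load: rebuild the architecture first, then restore the weights
model = LSTMModel(input_size, hidden_size, output_size)
model.load_state_dict(torch.load('model_state.pth'))
model.eval()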

4. Training results 

5. LSTM model test (test.py):

# -*- coding: utf-8 -*-
# @Time : 2023/11/15 15:53
# @Author :Muzi
# @File : test.py
# @Software: PyCharm
import torch
import jieba
from torch import nn
from gensim.models import Word2Vec
import numpy as np

# The model class must be defined here so that torch.load can unpickle the saved model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])  # take the last output of the sequence
        return output

# Read a preprocessed file: one sentence per line, words separated by commas
def load_txt(path):
    with open(path, 'r', encoding='utf-8') as f:
        data = [[line.strip()] for line in f.readlines()]
        return data

# Remove stop words from a word list
def drop_stopword(datas):
    with open('./hit_stopwords.txt', 'r', encoding='UTF8') as f:
        stop_words = [word.strip() for word in f.readlines()]
    datas = [x for x in datas if x not in stop_words]
    return datas

# Tokenize the input text and remove stop words
def preprocess_text(text):
    text = list(jieba.cut(text))
    text = drop_stopword(text)
    return text

# Convert a word list into a vector by averaging its Word2Vec word vectors.
# Note: this retrains Word2Vec on every call; see the note below the listing.
def text_to_vector(text):
    train_x = load_txt('train.txt')
    test_x = load_txt('test.txt')
    train = train_x + test_x
    # Split each line back into its word list so Word2Vec trains on words, not characters
    X_all = [x[0].split(',') for x in train]
    # Train the Word2Vec model
    word2vec_model = Word2Vec(sentences=X_all, vector_size=100, window=5, min_count=1, workers=4)
    vector = [word2vec_model.wv[word] for word in text if word in word2vec_model.wv]
    return sum(vector) / len(vector) if vector else np.zeros(word2vec_model.vector_size)

if __name__ == '__main__':
    # input_text = "这个车完全就是垃圾,又热又耗油"
    input_text = "这个车我开了好几年,还是不错的"
    label = {1: "正面情绪", 0: "负面情绪"}  # 1: positive, 0: negative
    model = torch.load('model.pth')
    # Preprocess the input
    input_data = preprocess_text(input_text)
    # Make sure the input vector matches the model's dimensions and dtype
    input_data = [[text_to_vector(input_data)]]
    input_array = np.array(input_data, dtype=np.float32)
    input_tensor = torch.Tensor(input_array)
    # Feed the input to the model
    with torch.no_grad():
        output = model(input_tensor)
    predicted_class = label[torch.argmax(output).item()]
    print(f"predicted_text: {input_text}")
    print(f"predicted class: {predicted_class}")

1. Test result:

2. Test result:

6. Complete code display:

1. data_set.py

import pandas as pd
import jieba

# Read the TSV data: the label sits in the second column, the text in the last column
def load_tsv(file_path):
    data = pd.read_csv(file_path, sep='\t')
    data_x = data.iloc[:, -1]
    data_y = data.iloc[:, 1]
    return data_x, data_y

# Load the stop-word list
with open('./hit_stopwords.txt', 'r', encoding='UTF8') as f:
    stop_words = [word.strip() for word in f.readlines()]
    print('Stop words loaded successfully')

# Remove stop words; build new lists instead of removing items while iterating
def drop_stopword(datas):
    return [[word for word in data if word not in stop_words] for data in datas]

# Save the tokenized sentences: one sentence per line, words separated by commas
def save_data(datax, path):
    with open(path, 'w', encoding="UTF8") as f:
        for lines in datax:
            f.write(','.join(str(line) for line in lines))
            f.write('\n')

if __name__ == '__main__':
    train_x, train_y = load_tsv("./data/train.tsv")
    test_x, test_y = load_tsv("./data/test.tsv")
    # Tokenize with jieba
    train_x = [list(jieba.cut(x)) for x in train_x]
    test_x = [list(jieba.cut(x)) for x in test_x]
    train_x = drop_stopword(train_x)
    test_x = drop_stopword(test_x)
    save_data(train_x, './train.txt')
    save_data(test_x, './test.txt')
    print('Saved successfully')

2. main.py

import torch
from torch import nn
from gensim.models import Word2Vec
import numpy as np
from data_set import load_tsv
from torch.utils.data import DataLoader, TensorDataset


# Read a preprocessed file: one sentence per line, words separated by commas
def load_txt(path):
    with open(path, 'r', encoding='utf-8') as f:
        data = [[line.strip()] for line in f.readlines()]
        return data

train_x = load_txt('train.txt')
test_x = load_txt('test.txt')
train = train_x + test_x
# Split each line back into its word list so Word2Vec trains on words, not characters
X_all = [x[0].split(',') for x in train]

_, train_y = load_tsv("./data/train.tsv")
_, test_y = load_tsv("./data/test.tsv")
# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=X_all, vector_size=100, window=5, min_count=1, workers=4)

# Convert a sentence into a vector by averaging its Word2Vec word vectors
def text_to_vector(text):
    words = text.split(',') if isinstance(text, str) else text
    vector = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    return sum(vector) / len(vector) if vector else np.zeros(word2vec_model.vector_size)

X_train_w2v = [[text_to_vector(text)] for line in train_x for text in line]
X_test_w2v = [[text_to_vector(text)] for line in test_x for text in line]

# Convert the sentence vectors into PyTorch tensors
X_train_array = np.array(X_train_w2v, dtype=np.float32)
X_train_tensor = torch.Tensor(X_train_array)
X_test_array = np.array(X_test_w2v, dtype=np.float32)
X_test_tensor = torch.Tensor(X_test_array)
# Wrap the tensors with DataLoader
train_dataset = TensorDataset(X_train_tensor, torch.LongTensor(train_y))
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataset = TensorDataset(X_test_tensor, torch.LongTensor(test_y))
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=True)

# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])  # take the last output of the sequence
        return output

# Instantiate the model
input_size = word2vec_model.vector_size
hidden_size = 50  # adjust the hidden-layer size as needed
output_size = 2   # output size depends on the task; two classes here

model = LSTMModel(input_size, hidden_size, output_size)
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()  # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)

if __name__ == "__main__":
    # Train the model
    num_epochs = 10
    log_interval = 100  # log every 100 batches
    loss_min = float('inf')
    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            outputs = model(data)
            loss = criterion(outputs, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % log_interval == 0:
                print('Epoch [{}/{}], Batch [{}/{}], Loss: {:.4f}'.format(
                    epoch + 1, num_epochs, batch_idx, len(train_loader), loss.item()))
            # Save the best model (lowest batch loss so far)
            if loss.item() < loss_min:
                loss_min = loss.item()
                torch.save(model, 'model.pth')

    # Evaluate the model
    with torch.no_grad():
        model.eval()
        correct = 0
        total = 0
        for data, target in test_loader:
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

        accuracy = correct / total
        print('Test Accuracy: {:.2%}'.format(accuracy))

3. test.py

import torch
import jieba
from torch import nn
from gensim.models import Word2Vec
import numpy as np

# The model class must be defined here so that torch.load can unpickle the saved model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])  # take the last output of the sequence
        return output

# Read a preprocessed file: one sentence per line, words separated by commas
def load_txt(path):
    with open(path, 'r', encoding='utf-8') as f:
        data = [[line.strip()] for line in f.readlines()]
        return data

# Remove stop words from a word list
def drop_stopword(datas):
    with open('./hit_stopwords.txt', 'r', encoding='UTF8') as f:
        stop_words = [word.strip() for word in f.readlines()]
    datas = [x for x in datas if x not in stop_words]
    return datas

# Tokenize the input text and remove stop words
def preprocess_text(text):
    text = list(jieba.cut(text))
    text = drop_stopword(text)
    return text

# Convert a word list into a vector by averaging its Word2Vec word vectors.
# Note: this retrains Word2Vec on every call (see the note in section 5).
def text_to_vector(text):
    train_x = load_txt('train.txt')
    test_x = load_txt('test.txt')
    train = train_x + test_x
    # Split each line back into its word list so Word2Vec trains on words, not characters
    X_all = [x[0].split(',') for x in train]
    # Train the Word2Vec model
    word2vec_model = Word2Vec(sentences=X_all, vector_size=100, window=5, min_count=1, workers=4)
    vector = [word2vec_model.wv[word] for word in text if word in word2vec_model.wv]
    return sum(vector) / len(vector) if vector else np.zeros(word2vec_model.vector_size)

if __name__ == '__main__':
    input_text = "这个车完全就是垃圾,又热又耗油"
    # input_text = "这个车我开了好几年,还是不错的"
    label = {1: "正面情绪", 0: "负面情绪"}  # 1: positive, 0: negative
    model = torch.load('model.pth')
    # Preprocess the input
    input_data = preprocess_text(input_text)
    # Make sure the input vector matches the model's dimensions and dtype
    input_data = [[text_to_vector(input_data)]]
    input_array = np.array(input_data, dtype=np.float32)
    input_tensor = torch.Tensor(input_array)
    # Feed the input to the model
    with torch.no_grad():
        output = model(input_tensor)
    # This is just a simple example
    predicted_class = label[torch.argmax(output).item()]
    print(f"predicted_text: {input_text}")
    print(f"predicted class: {predicted_class}")

Origin blog.csdn.net/m0_74053536/article/details/134379831