Writing My Own Deep Learning Framework ANNbox [a high-fidelity tensorflow clone]__03 Text Sentiment Analysis

Article list
1. Writing My Own Deep Learning Framework ANNbox [a high-fidelity tensorflow clone]__01 Implementing fully connected layers.
2. Writing My Own Deep Learning Framework ANNbox [a high-fidelity tensorflow clone]__02 Implementing different optimization methods.
3. Writing My Own Deep Learning Framework ANNbox [a high-fidelity tensorflow clone]__03 Text sentiment analysis.
4. Writing My Own Deep Learning Framework ANNbox [a high-fidelity tensorflow clone]__04 Writing convolutional neural networks: implementing AlexNet.
5. Writing My Own Deep Learning Framework ANNbox [a high-fidelity tensorflow clone]__05 Writing recurrent neural networks: implementing RNN (01).

Writing My Own Deep Learning Framework__03 Text Sentiment Analysis

The goal is to train, from the prepared text data, a model that predicts whether a piece of text carries positive or negative sentiment.

1 Method 1: Vectorizing the input sentences with word counts

1.1 Text input processing

First, collect all words appearing in any sentence into a set; this gives the "word -> index" mapping. Then, for each sentence, count how many times each word occurs and store that count at the word's index, using 0 for words that do not occur. For example, if the corpus is "She is happy and she is fine" and "She is sad", the set of all words in the corpus is {'she', 'is', 'happy', 'sad', 'and', 'fine'}, and the input vector for "She is happy and she is fine" is [2, 2, 1, 0, 1, 1].
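For concreteness, here is a minimal standalone sketch of this counting scheme (the names vocab, word2idx and vectorize are only for this illustration; the actual implementation used for training appears in section 1.4):

vocab = ['she', 'is', 'happy', 'sad', 'and', 'fine']
word2idx = {w: i for i, w in enumerate(vocab)}  # word -> index

def vectorize(sentence):
    """Count how many times each vocabulary word occurs in the sentence."""
    vec = [0] * len(vocab)
    for w in sentence.lower().split(' '):
        if w in word2idx:
            vec[word2idx[w]] += 1
    return vec

print(vectorize('She is happy and she is fine'))  # [2, 2, 1, 0, 1, 1]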

1.2 Label processing

Positive: 1, Negative: 0 (in the implementation below, a review labeled 'positive' is encoded as 1 and anything else as 0, matching the sigmoid output).

1.3 Neural network model

The network has a single hidden layer whose activation is the identity function; the output layer uses a sigmoid activation, and the loss is the squared-error (MSE) loss. Taking the vector for "She is happy and she is fine" as input, the model looks like this:

(Figure: network structure for this example)
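In formulas (my own notation, matching the variable names used in the code below):

hidden1 = X @ W1 + b1                 (hidden layer, identity activation)
out = sigmoid(hidden1 @ W2 + b2)      (output layer)
cost = mean((y - out)**2)             (squared-error loss)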

1.4 Implementation

# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import sys
import random
import miniFlow as mf
import pandas as pd
class data_input():
    def __init__(self, reviews, labels, hidden_nodes=10):
        """参数:reviews(dataFrame), 用于训练 | labels(dataFrame), 用于训练"""
        np.random.seed(1)
        self.pre_process_data(reviews, labels)

    def pre_process_data(self, reviews, labels):
        """预处理数据,统计reviews中出现的所有单词,并且生成word2index"""
        # 统计reviews中出现的所有单词,
        review_vocab = set()
        for review in reviews.values:
            word = review[0].split(' ')
            review_vocab.update(word)

        self.review_vocab = list(review_vocab)

        # collect every label that appears in labels (here there are only two: 'positive' and 'negative')
        label_vocab = set()
        for label in labels.values:
            label_vocab.add(label[0])
        self.label_vocab = list(label_vocab)

        # build word2idx, assigning each word an index ("house number")
        self.word2idx = dict()
        for idx, word in enumerate(self.review_vocab):
            self.word2idx[word] = idx

    def update_input_layer(self, reviews, labels):
        """对review进行数字化处理,统计其中单词出现次数,并将结果存放到self.layer_0中,也就是输入层"""
        inputs = np.zeros((len(reviews), len(self.review_vocab)))
        for ind in range(len(reviews)):
            for word in reviews.iloc[ind,0].split(' '):
                if word.lower() in self.word2idx:
                    idx = self.word2idx[word.lower()]
                    # use the word count as the input
                    inputs[ind,idx] += 1
#                    inputs[ind,idx] = 1

        labels_ = np.zeros((len(labels), 1))
        for ind in range(len(labels)):
            if(labels.iloc[ind,0]=='positive'):
                labels_[ind] = 1
        return inputs,labels_
# load data
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
data = data_input(reviews,labels)
inputs_test, labels_test = data.update_input_layer(reviews[-5000:-1],labels[-5000:-1])
## create the model
in_units = len(data.review_vocab)
h1_units = 10
o_units = 1
random.seed(1)
W1=mf.variable(value=np.random.normal( 0.0, in_units**-0.5, (in_units, h1_units)), name='W1')
b1=mf.variable(value=np.zeros(h1_units), name='b1')
W2=mf.variable(value=np.random.normal( 0.0, h1_units**-0.5, (h1_units, o_units)), name='W2')
b2=mf.variable(value=np.zeros(o_units), name='b2')
X, y = mf.placeholder(name='X'), mf.placeholder(name='y')
hidden1=mf.Linear(X, W1, b1,name='linear1')
out=mf.Sigmoid(mf.Linear(hidden1, W2, b2,name='linear2'),name='out')
# Define loss and optimizer
cost = mf.MSE(y, out,name='cost')
epochs = 2
m = len(reviews) - 5000
batch_size = 250
#learning_rate=2e-4
learning_rate=2e-3
Momentum_rate=0.95
steps_per_epoch = m // batch_size
#train_step = mf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
train_step = mf.train.MomentumOptimizer(learning_rate,Momentum_rate).minimize(cost)
print("Total number of examples = {}".format(m))
mf.global_variables_initializer().run()
loss_list = []
acc_train_list = []
acc_test_list = []
# Train
for i in range(epochs):
    loss = 0
    for j in range(steps_per_epoch):
        inputs_train, labels_train = data.update_input_layer(reviews[j*batch_size:(j+1)*batch_size],labels[j*batch_size:(j+1)*batch_size])
        feed_dict = {X: inputs_train, y: labels_train}
        graph = train_step.run(feed_dict)
        loss = np.mean(graph[-1].value)
        loss_list.append(loss)
        acc_train = np.mean(((graph[-2].value>0.5).astype(int).reshape(-1,1) == labels_train).astype(int))
        acc_train_list.append(acc_train)
        acc_test = np.mean(((out.run({X: inputs_test, y: labels_test})>0.5).astype(int).reshape(-1,1) == labels_test).astype(int))
        acc_test_list.append(acc_test)
        sys.stdout.write("\rprocess: {}/{}, loss:{:.5f}, acc_train:{:.2f}, acc_test:{:.2f}".format(j, steps_per_epoch, loss, acc_train, acc_test))
    plt.figure()
#    plt.subplot(211)
    plt.plot(range(len(loss_list)),loss_list,label=u'loss')
#    plt.subplot(212)
    plt.plot(range(len(loss_list)),acc_train_list,label=u'acc_train')
    plt.plot(range(len(loss_list)),acc_test_list,label=u'acc_test')
    plt.ylim([0,1])
    plt.legend()
    plt.show()

1.5 Classification results

process: 79/80, loss:0.15171, acc_train:0.81, acc_test:0.80

(Figure: loss and training/test accuracy curves)

The accuracy is about 80%, somewhat better than random guessing.

2 Method 2: Improving accuracy by reducing input noise

2.1 Problems with Method 1

The input in Method 1 contains many words that carry no information about the output; call them noise, e.g. the, an, of, on, and so on. These are neutral words: they do not help decide positive vs. negative, yet they occur so frequently that, when counting words, they inevitably outnumber the emotionally charged ones. When these neutral words dominate the input, they appear many times both when label = 1 (positive) and when label = 0 (negative), so are they positive words or negative words? The network gets confused during training. Let's look at how frequent these noise words actually are.

import numpy as np
import sys
import time
import pandas as pd
# load data
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
from collections import Counter
positive_counter = Counter()
negative_counter = Counter()
total_counter = Counter()
for review, label in zip( reviews.values, labels.values ):
    word = review[0].split(' ')
    if label[0] == 'positive':
        positive_counter.update(word)
    elif label[0] == 'negative':
        negative_counter.update(word)
    total_counter.update(word)
positive_counter.most_common()[:30]
negative_counter.most_common()[:30]

Result:
positive_counter.most_common()[:30]
Out[5]:
[('', 550468),
('the', 173324),
('.', 159654),
('and', 89722),
('a', 83688),
('of', 76855),
('to', 66746),
('is', 57245),
('in', 50215),
('br', 49235),
('it', 48025),
('i', 40743),
('that', 35630),
('this', 35080),
('s', 33815),
('as', 26308),
('with', 23247),
('for', 22416),
('was', 21917),
('film', 20937),
('but', 20822),
('movie', 19074),
('his', 17227),
('on', 17008),
('you', 16681),
('he', 16282),
('are', 14807),
('not', 14272),
('t', 13720),
('one', 13655)]

negative_counter.most_common()[:30]
Out[6]:
[('', 561462),
('.', 167538),
('the', 163389),
('a', 79321),
('and', 74385),
('of', 69009),
('to', 68974),
('br', 52637),
('is', 50083),
('it', 48327),
('i', 46880),
('in', 43753),
('this', 40920),
('that', 37615),
('s', 31546),
('was', 26291),
('movie', 24965),
('for', 21927),
('but', 21781),
('with', 20878),
('as', 20625),
('t', 20361),
('film', 19218),
('you', 17549),
('on', 17192),
('not', 16354),
('have', 15144),
('are', 14623),
('be', 14541),
('he', 13856)]

From the statistics of positive_counter and negative_counter, the most frequent words are all noise. The genuinely useful words, such as happy, funny, horrible, appear far less often, yet they are the ones that really matter.

2.2 Input sentence preprocessing

Since neutral words are so numerous, we can first try to reduce their weight. A simple way: no longer use word counts as the input, but use word presence instead (only 0 or 1, where 0 means the word does not appear in the review and 1 means it does). The change is trivial: in update_input_layer(), replace += with =, as sketched below.
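Concretely, the only line that changes inside update_input_layer() is the one that fills the input matrix (sketch; the full code follows in 2.3):

# Method 1 used the word count as the input:
#     inputs[ind, idx] += 1
# Method 2 only records whether the word appears:
inputs[ind, idx] = 1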

2.3 Implementation

# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import sys
import random
import miniFlow as mf
import pandas as pd
class data_input():
    def __init__(self, reviews, labels, hidden_nodes=10):
        """参数:reviews(dataFrame), 用于训练 | labels(dataFrame), 用于训练"""
        np.random.seed(1)
        self.pre_process_data(reviews, labels)

    def pre_process_data(self, reviews, labels):
        """预处理数据,统计reviews中出现的所有单词,并且生成word2index"""
        # 统计reviews中出现的所有单词,
        review_vocab = set()
        for review in reviews.values:
            word = review[0].split(' ')
            review_vocab.update(word)

        self.review_vocab = list(review_vocab)

        # collect every label that appears in labels (here there are only two: 'positive' and 'negative')
        label_vocab = set()
        for label in labels.values:
            label_vocab.add(label[0])
        self.label_vocab = list(label_vocab)

        # build word2idx, assigning each word an index ("house number")
        self.word2idx = dict()
        for idx, word in enumerate(self.review_vocab):
            self.word2idx[word] = idx

    def update_input_layer(self, reviews, labels):
        """对review进行数字化处理,统计其中单词出现次数,并将结果存放到self.layer_0中,也就是输入层"""
        inputs = np.zeros((len(reviews), len(self.review_vocab)))
        for ind in range(len(reviews)):
            for word in reviews.iloc[ind,0].split(' '):
                if word.lower() in self.word2idx:
                    idx = self.word2idx[word.lower()]
                    # Method 1 counted occurrences; here only word presence (0/1) is recorded
#                    inputs[ind,idx] += 1
                    inputs[ind,idx] = 1

        labels_ = np.zeros((len(labels), 1))
        for ind in range(len(labels)):
            if(labels.iloc[ind,0]=='positive'):
                labels_[ind] = 1
        return inputs,labels_
# load data
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
data = data_input(reviews,labels)
inputs_test, labels_test = data.update_input_layer(reviews[-5000:-1],labels[-5000:-1])
## create the model
in_units = len(data.review_vocab)
h1_units = 10
o_units = 1
random.seed(1)
W1=mf.variable(value=np.random.normal( 0.0, in_units**-0.5, (in_units, h1_units)), name='W1')
b1=mf.variable(value=np.zeros(h1_units), name='b1')
W2=mf.variable(value=np.random.normal( 0.0, h1_units**-0.5, (h1_units, o_units)), name='W2')
b2=mf.variable(value=np.zeros(o_units), name='b2')
X, y = mf.placeholder(name='X'), mf.placeholder(name='y')
hidden1=mf.Linear(X, W1, b1,name='linear1')
out=mf.Sigmoid(mf.Linear(hidden1, W2, b2,name='linear2'),name='out')
# Define loss and optimizer
cost = mf.MSE(y, out,name='cost')
epochs = 2
m = len(reviews) - 5000
batch_size = 250
learning_rate=2e-2
Momentum_rate=0.95
steps_per_epoch = m // batch_size
#train_step = mf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
train_step = mf.train.MomentumOptimizer(learning_rate,Momentum_rate).minimize(cost)
print("Total number of examples = {}".format(m))
mf.global_variables_initializer().run()
loss_list = []
acc_train_list = []
acc_test_list = []
# Train
for i in range(epochs):
    loss = 0
    for j in range(steps_per_epoch):
        inputs_train, labels_train = data.update_input_layer(reviews[j*batch_size:(j+1)*batch_size],labels[j*batch_size:(j+1)*batch_size])
        feed_dict = {X: inputs_train, y: labels_train}
        graph = train_step.run(feed_dict)
        loss = np.mean(graph[-1].value)
        loss_list.append(loss)
        acc_train = np.mean(((graph[-2].value>0.5).astype(int).reshape(-1,1) == labels_train).astype(int))
        acc_train_list.append(acc_train)
        acc_test = np.mean(((out.run({X: inputs_test, y: labels_test})>0.5).astype(int).reshape(-1,1) == labels_test).astype(int))
        acc_test_list.append(acc_test)
        sys.stdout.write("\rprocess: {}/{}, loss:{:.5f}, acc_train:{:.2f}, acc_test:{:.2f}".format(j, steps_per_epoch, loss, acc_train, acc_test))
    plt.figure()
#    plt.subplot(211)
    plt.plot(range(len(loss_list)),loss_list,label=u'loss')
#    plt.subplot(212)
    plt.plot(range(len(loss_list)),acc_train_list,label=u'acc_train')
    plt.plot(range(len(loss_list)),acc_test_list,label=u'acc_test')
    plt.ylim([0,1])
    plt.legend()
    plt.show()

2.4 Classification results

process: 159/160, loss:0.08832, acc_train:0.88, acc_test:0.86
(Figure: loss and training/test accuracy curves)

The accuracy now exceeds 80% (about 86% on the test set).

3 Method 3: Faster training and higher accuracy by removing neutral and rare words from the input vector

3.1 Idea of the method

(Figure: idea of Method 3)
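Since the figure is not preserved, here is the idea in words, as implemented in the code of section 3.2: score every sufficiently frequent word by how strongly it leans positive or negative, and keep only polarized, reasonably frequent words in the vocabulary. A sketch of the scoring and filtering rule (pseudocode; min_count, polarity_cutoff and the thresholds match the code below):

# only computed for words with total_counts[word] > 100
ratio = positive_counts[word] / (negative_counts[word] + 1.0)
score = log(ratio) if ratio > 1 else -log(1 / (ratio + 0.01))
# a word enters the vocabulary only if it is frequent enough and,
# when it has a score, only if it is polarized enough
keep = total_counts[word] > min_count and (no score computed or abs(score) >= polarity_cutoff)

Neutral words (score near 0) and rare words are dropped, which both removes noise and shrinks the input dimension, hence the speed-up.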

3.2 Implementation

# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import sys
import random
import miniFlow as mf
import pandas as pd
from collections import Counter
class data_input():
    def __init__(self, reviews, labels, hidden_nodes=10):
        """参数:reviews(dataFrame), 用于训练 | labels(dataFrame), 用于训练"""
        np.random.seed(1)
        self.pre_process_data(reviews, labels)

    def pre_process_data(self, reviews, labels):
        """预处理数据,统计reviews中出现的所有单词,并且生成word2index"""
        # 统计reviews中出现的所有单词,
#        review_vocab = set()
#        for review in reviews.values:
#            word = review[0].split(' ')
#            review_vocab.update(word)
#        self.review_vocab = list(review_vocab)
        # collect the words appearing in reviews while removing the influence of neutral words
        positive_counts = Counter()
        negative_counts = Counter()
        total_counts = Counter()
        for review, label in zip( reviews.values, labels.values ):
            word = review[0].split(' ')
            if label[0] == 'positive':
                positive_counts.update(word)
            elif label[0] == 'negative':
                negative_counts.update(word)
            total_counts.update(word)
        positive_counts.most_common()[:30]
        negative_counts.most_common()[:30]

        # ratio of each word's occurrences in positive vs. negative reviews
        pos_neg_ratios = Counter()
        for term,cnt in list(total_counts.most_common()):
            if(cnt > 100):
                pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
                pos_neg_ratios[term] = pos_neg_ratio

        for word,ratio in pos_neg_ratios.most_common():
            if(ratio > 1):
                pos_neg_ratios[word] = np.log(ratio)
            else:
                pos_neg_ratios[word] = -np.log((1 / (ratio + 0.01)))

        min_count = 50
        polarity_cutoff = 0.1
        review_vocab = set()
        for review in reviews.values:
            for word in review[0].split(" "):
                ## New for Project 6: only add words that occur at least min_count times
                #                     and for words with pos/neg ratios, only add words
                #                     that meet the polarity_cutoff
                if(total_counts[word] > min_count):
                    if(word in pos_neg_ratios.keys()):
                        if((pos_neg_ratios[word] >= polarity_cutoff) or (pos_neg_ratios[word] <= -polarity_cutoff)):
                            review_vocab.add(word)
                    else:
                        review_vocab.add(word)
        self.review_vocab = list(review_vocab)

        # collect every label that appears in labels (here there are only two: 'positive' and 'negative')
        label_vocab = set()
        for label in labels.values:
            label_vocab.add(label[0])
        self.label_vocab = list(label_vocab)

        # build word2idx, assigning each word an index ("house number")
        self.word2idx = dict()
        for idx, word in enumerate(self.review_vocab):
            self.word2idx[word] = idx

    def update_input_layer(self, reviews, labels):
        """对review进行数字化处理,统计其中单词出现次数,并将结果存放到self.layer_0中,也就是输入层"""
        inputs = np.zeros((len(reviews), len(self.review_vocab)))
        for ind in range(len(reviews)):
            for word in reviews.iloc[ind,0].split(' '):
                if word.lower() in self.word2idx:
                    idx = self.word2idx[word.lower()]
                    # Method 1 counted occurrences; here only word presence (0/1) is recorded
#                    inputs[ind,idx] += 1
                    inputs[ind,idx] = 1

        labels_ = np.zeros((len(labels), 1))
        for ind in range(len(labels)):
            if(labels.iloc[ind,0]=='positive'):
                labels_[ind] = 1
        return inputs,labels_
# load data
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
data = data_input(reviews,labels)
inputs_test, labels_test = data.update_input_layer(reviews[-5000:-1],labels[-5000:-1])
## create the model
in_units = len(data.review_vocab)
h1_units = 10
o_units = 1
random.seed(1)
W1=mf.variable(value=np.random.normal( 0.0, in_units**-0.5, (in_units, h1_units)), name='W1')
b1=mf.variable(value=np.zeros(h1_units), name='b1')
W2=mf.variable(value=np.random.normal( 0.0, h1_units**-0.5, (h1_units, o_units)), name='W2')
b2=mf.variable(value=np.zeros(o_units), name='b2')
X, y = mf.placeholder(name='X'), mf.placeholder(name='y')
hidden1=mf.Linear(X, W1, b1,name='linear1')
out=mf.Sigmoid(mf.Linear(hidden1, W2, b2,name='linear2'),name='out')
# Define loss and optimizer
cost = mf.MSE(y, out,name='cost')
epochs = 2
m = len(reviews) - 5000
batch_size = 250
learning_rate=2e-2
Momentum_rate=0.95
steps_per_epoch = m // batch_size
#train_step = mf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
train_step = mf.train.MomentumOptimizer(learning_rate,Momentum_rate).minimize(cost)
print("Total number of examples = {}".format(m))
mf.global_variables_initializer().run()
loss_list = []
acc_train_list = []
acc_test_list = []
# Train
for i in range(epochs):
    loss = 0
    for j in range(steps_per_epoch):
        inputs_train, labels_train = data.update_input_layer(reviews[j*batch_size:(j+1)*batch_size],labels[j*batch_size:(j+1)*batch_size])
        feed_dict = {X: inputs_train, y: labels_train}
        graph = train_step.run(feed_dict)
        loss = np.mean(graph[-1].value)
        loss_list.append(loss)
        acc_train = np.mean(((graph[-2].value>0.5).astype(int).reshape(-1,1) == labels_train).astype(int))
        acc_train_list.append(acc_train)
        acc_test = np.mean(((out.run({X: inputs_test, y: labels_test})>0.5).astype(int).reshape(-1,1) == labels_test).astype(int))
        acc_test_list.append(acc_test)
        sys.stdout.write("\rprocess: {}/{}, loss:{:.5f}, acc_train:{:.2f}, acc_test:{:.2f}".format(j, steps_per_epoch, loss, acc_train, acc_test))
    plt.figure()
#    plt.subplot(211)
    plt.plot(range(len(loss_list)),loss_list,label=u'loss')
#    plt.subplot(212)
    plt.plot(range(len(loss_list)),acc_train_list,label=u'acc_train')
    plt.plot(range(len(loss_list)),acc_test_list,label=u'acc_test')
    plt.ylim([0,1])
    plt.legend()
    plt.show()

3.3 Classification results

The accuracy improvement is modest, but training is much faster:
process: 79/80, loss:0.10845, acc_train:0.85, acc_test:0.86

(Figure: loss and training/test accuracy curves)

4 Finding the words with the most similar meaning

These are simply the words whose first-hidden-layer weight vectors have the largest projection (dot product) onto the weight vector of the query word.
from collections import Counter

def get_most_similar_words(focus):
    """Return the words whose hidden-layer weight vectors are most similar to that of `focus`."""
    most_similar = Counter()
    for word in data.word2idx.keys():
        most_similar[word] = np.dot(W1.value[data.word2idx[word]], W1.value[data.word2idx[focus]])
    return most_similar.most_common()

print(get_most_similar_words('excellent')[:10])

Out[39]:
[('great', 0.1152234330418324), ('excellent', 0.080198714194045187), ('best', 0.073144738228254028), ('love', 0.066772701417992214), ('wonderful', 0.060141522399788996), ('well', 0.059674210384868261), ('perfect', 0.056961287292946916), ('loved', 0.054479527232679389), ('amazing', 0.049861357145510987), ('still', 0.046883315279415624)]

Code files: https://pan.baidu.com/s/1QIE0a6NdJOEKO7_pKtNTOQ

Reposted from blog.csdn.net/drilistbox/article/details/80816950