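05 Deep-Learning-Based Text Classification, Part 2
Approach 3: Word2Vec + deep learning classifier
- Word2Vec is a more advanced kind of word vector; classification is then done by building a deep learning classifier on top of it.
- The network architecture for the classifier can be TextCNN, TextRNN, or BiLSTM.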
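Text representation methods, Part 3
Word2Vec
What is Word2Vec
word2vec was proposed by Tomas Mikolov et al. (Efficient Estimation of Word Representations in Vector Space, ICLR, 2013). It projects every word into a K-dimensional vector space, so each word can be represented by a K-dimensional vector. Google open-sourced the toolkit in 2013.
It is simple and efficient, and is widely used in NLP tasks to train word vectors.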
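It includes:
- Two training models (Skip-gram, CBOW)
  CBOW predicts a word from its context (one or more surrounding words).
  Skip-gram predicts the surrounding context words from a given word.
- Two acceleration methods (Hierarchical Softmax, Negative Sampling)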
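To make the difference between the two models concrete, here is a minimal sketch of how training examples could be built from one tokenized sentence. The function name `training_pairs`, the `window` parameter, and the sample sentence are assumptions for illustration; real implementations such as gensim add subsampling and dynamic windows on top of this.

```python
def training_pairs(tokens, window=2):
    """Build CBOW and Skip-gram training examples from one token list.

    CBOW example:      (context_words, center_word)  -> predict center from context
    Skip-gram example: (center_word, context_word)   -> predict context from center
    """
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:  # a one-word sentence has no context
            continue
        cbow.append((context, center))
        skipgram.extend((center, c) for c in context)
    return cbow, skipgram


# Hypothetical sample sentence
tokens = "the quick brown fox jumps".split()
cbow, skipgram = training_pairs(tokens, window=1)
print(cbow[1])       # context words on the left, the word to predict on the right
print(skipgram[:3])  # (input word, context word to predict)
```

CBOW produces one example per position, while Skip-gram produces one example per (center, context) pair, which is why Skip-gram sees more training examples for the same text.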
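Why Word2vec is needed
1. Traditional word representation: one-hot representation
Drawbacks: the word vectors are mutually independent, so they cannot express semantic similarity; the high-dimensional sparse representation can also lead to the curse of dimensionality.
2. Distributed representation: word embedding
Through training, each word is represented as a real-valued vector of a fixed dimension K.
word2vec uses a simplified model that speeds up training, which made word embeddings practical.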
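The contrast between the two representations can be checked with cosine similarity. The sketch below is illustrative only: the three-word vocabulary is hypothetical, and the 2-dimensional dense vectors are hand-picked toy values, not trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


vocab = ["king", "queen", "apple"]  # hypothetical vocabulary

# One-hot: dimension equals vocabulary size, exactly one non-zero entry per word
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Dense: fixed dimension K=2; toy values chosen by hand for illustration
dense = {"king": [0.9, 0.1], "queen": [0.8, 0.2], "apple": [0.1, 0.9]}

print(cosine(one_hot["king"], one_hot["queen"]))  # distinct one-hot vectors are orthogonal
print(cosine(dense["king"], dense["queen"]))
print(cosine(dense["king"], dense["apple"]))
```

Any two distinct one-hot vectors have similarity zero, so no pair of words is ever "closer" than another; dense vectors can place related words near each other.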
The two Word2vec models
CBOW (Continuous Bag-of-Words Model) and Skip-gram (Continuous Skip-gram Model)
The two Word2vec training techniques
Negative Sampling
Hierarchical Softmax
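Both techniques avoid computing a full softmax over the vocabulary. As a rough sketch under simplifying assumptions: negative sampling draws "noise" words from the unigram distribution raised to the 0.75 power, and hierarchical softmax places words in a Huffman tree so that frequent words get shorter paths. The function names and the toy corpus below are hypothetical, and the actual training updates are omitted.

```python
import heapq
import itertools
from collections import Counter


def noise_distribution(counts, power=0.75):
    """Negative sampling: P(w) is proportional to count(w) ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


def huffman_depths(counts):
    """Hierarchical softmax: depth of each word in a Huffman tree built from counts.

    The depth is the number of binary decisions needed to reach the word.
    """
    tie = itertools.count()  # unique tie-breaker so the heap never compares word lists
    heap = [(c, next(tie), [w]) for w, c in counts.items()]
    heapq.heapify(heap)
    depth = {w: 0 for w in counts}
    while len(heap) > 1:
        c1, _, words1 = heapq.heappop(heap)
        c2, _, words2 = heapq.heappop(heap)
        for w in words1 + words2:  # every word under the merged node moves one level down
            depth[w] += 1
        heapq.heappush(heap, (c1 + c2, next(tie), words1 + words2))
    return depth


# Hypothetical toy corpus
counts = Counter("the cat sat on the mat the end".split())
probs = noise_distribution(counts)
depths = huffman_depths(counts)
print(probs["the"], probs["cat"])
print(depths)
```

The 0.75 exponent flattens the distribution, so rare words are sampled as negatives somewhat more often than their raw frequency would suggest.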
import logging
import random
import numpy as np
import torch
logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')
# set seed
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
# split data to 10 fold
fold_num = 10
data_file = '../data/train_set.csv'
import pandas as pd
def all_data2fold(fold_num, num=10000):
    """Split the first `num` samples into `fold_num` equally sized, label-stratified folds."""
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    texts = f['text'].tolist()[:num]
    labels = f['label'].tolist()[:num]

    total = len(labels)

    # shuffle all samples once
    index = list(range(total))
    np.random.shuffle(index)

    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])

    # group sample indices by label
    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)

    # spread each label's samples evenly over the folds
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        batch_size = int(len(data) / fold_num)
        other = len(data) - batch_size * fold_num
        offset = 0  # running start index, so consecutive folds never overlap
        for i in range(fold_num):
            # the first `other` folds take one extra sample of this label
            cur_batch_size = batch_size + 1 if i < other else batch_size
            all_index[i].extend(data[offset:offset + cur_batch_size])
            offset += cur_batch_size

    # even out the fold sizes: oversized folds donate, undersized folds borrow
    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    for fold in range(fold_num):
        fold_size = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]

        if fold_size > batch_size:
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += fold_size - batch_size
        elif fold_size < batch_size:
            end = start + batch_size - fold_size
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        else:
            fold_texts = texts
            fold_labels = labels

        assert batch_size == len(fold_labels)

        # shuffle within the fold
        index = list(range(batch_size))
        np.random.shuffle(index)

        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)

    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

    return fold_data
fold_data = all_data2fold(10)
# build train data for word2vec: folds 0..8 are used for training, fold 9 is held out
fold_id = 9
train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
logging.info('Total %d docs.', len(train_texts))
logging.info('Start training...')
from gensim.models.word2vec import Word2Vec
num_features = 100 # Word vector dimensionality
num_workers = 8 # Number of threads to run in parallel
# each document is a space-separated token string; Word2Vec expects lists of tokens
train_texts = list(map(lambda x: list(x.split()), train_texts))
# `size` is the gensim 3.x argument name; gensim >= 4.0 renamed it to `vector_size`
model = Word2Vec(train_texts, workers=num_workers, size=num_features)
# L2-normalize the vectors in place (deprecated in gensim 4.x)
model.init_sims(replace=True)
# save model
model.save("./word2vec.bin")
# load model
model = Word2Vec.load("./word2vec.bin")
# convert format
model.wv.save_word2vec_format('./word2vec.txt', binary=False)
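The text file written by `save_word2vec_format(..., binary=False)` starts with a header line `vocab_size dim`, followed by one `word v1 v2 ...` line per word. Below is a minimal sketch of reading that format back without gensim, for example to fill an embedding matrix later. The function name `read_word2vec_text` and the sample tokens and values are hypothetical.

```python
import io


def read_word2vec_text(handle):
    """Parse the word2vec text format from an open text file handle.

    Returns (vectors, dim), where vectors maps each word to a list of floats.
    """
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:  # skip blank or malformed lines
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header promises %d words, found %d" % (vocab_size, len(vectors)))
    return vectors, dim


# Hypothetical two-word file with 3-dimensional vectors
sample = io.StringIO("2 3\n3750 0.1 0.2 0.3\n648 -0.4 0.5 0.6\n")
vectors, dim = read_word2vec_text(sample)
print(dim, sorted(vectors))
```

For the real file, the same function could be called with `open('./word2vec.txt', encoding='utf-8')` as the handle.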
My own machine can't handle this, and DSW keeps disconnecting...
TextCNN
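TextCNN uses a CNN (convolutional neural network) to extract text features. Convolution kernels of different sizes each extract n-gram features; the resulting feature maps go through max pooling, which keeps the largest value of each map, and the pooled values are concatenated into a single vector that represents the text.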
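The description above could be sketched in PyTorch roughly as follows. This is an illustrative sketch, not the competition baseline: the class name, the hyperparameter defaults, the padding id 0, and `num_classes=14` are all placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=100,
                 kernel_sizes=(2, 3, 4), num_classes=14):
        super().__init__()
        # id 0 is assumed to be the padding token
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # one convolution per kernel size; a kernel of size k looks at k-grams
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, embed_dim, seq_len), the layout Conv1d expects
        x = self.embedding(token_ids).transpose(1, 2)
        # max over the time axis keeps the strongest response of each filter
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)  # text representation
        return self.fc(features)             # class logits


model = TextCNN(vocab_size=50)
logits = model(torch.randint(1, 50, (4, 20)))  # 4 random documents of 20 tokens
print(logits.shape)
```

The embedding layer here is randomly initialized; in the pipeline described in this post it would be initialized from the trained word2vec vectors instead.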
TextRNN
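TextRNN uses an RNN (recurrent neural network) to extract text features. Text is inherently a sequence, and LSTMs are naturally suited to modeling sequence data. TextRNN feeds the word vector of each word in the sentence, in order, into a two-layer bidirectional LSTM, then concatenates the hidden states at the last valid position of each direction into a single vector that represents the text.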
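A rough PyTorch sketch of this idea is shown below. It is an assumption-laden illustration rather than the baseline implementation: the class name, hyperparameter defaults, padding id 0, and `num_classes=14` are placeholders, and "last valid position" is interpreted as the final state each direction reaches when padding is skipped via `pack_padded_sequence`.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class TextRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_size=128, num_classes=14):
        super().__init__()
        # id 0 is assumed to be the padding token
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids, lengths):
        embedded = self.embedding(token_ids)
        # packing makes the LSTM stop at each document's true length
        packed = pack_padded_sequence(embedded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n is ordered (layer0 fwd, layer0 bwd, layer1 fwd, layer1 bwd);
        # take the top layer's final state from each direction
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(features)


model = TextRNN(vocab_size=50)
model.eval()
token_ids = torch.randint(1, 50, (3, 12))
lengths = torch.tensor([12, 7, 3])  # true lengths; positions beyond them are padding
with torch.no_grad():
    logits = model(token_ids, lengths)
print(logits.shape)
```

Because of the packing step, whatever sits in the padded positions does not influence the output for that document.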