Natural language processing: text preprocessing (Part 2) (tensor representation, text data analysis, text feature processing, etc.)

1. Text tensor representation method

1. What is a text tensor representation

A piece of text is represented as a tensor: each word is generally represented as a vector, called a word vector, and the word vectors are then stacked in order into a matrix that represents the whole text.

For example:

["人生", "该", "如何", "起头"]

==>

# 每个词对应矩阵中的一个向量
[[1.32, 4.32, 0.32, 5.2],
 [3.1, 5.43, 0.34, 3.2],
 [3.21, 5.32, 2, 4.32],
 [2.54, 7.32, 5.12, 9.54]]

2. The role of text tensor representation:

Representing text in tensor (matrix) form allows language text to be used as input to computer programs for a subsequent series of parsing and processing steps.

3. The method of text tensor representation:

  • one-hot encoding
  • Word2vec
  • Word Embedding

4. One-hot word vector

4.1 What is one-hot word vector representation

Also known as one-hot encoding: each word is represented as a vector with n elements, in which exactly one element is 1 and all the others are 0. Different words have the 1 in different positions, and n is the total number of distinct words in the corpus.

For example:

["改变", "要", "如何", "起手"]`
==>

[[1, 0, 0, 0],
 [0, 1, 0, 0],
 [0, 0, 1, 0],
 [0, 0, 0, 1]]

4.2 One-hot encoding implementation

  • Perform one-hot encoding:
# 导入用于对象保存与加载的joblib
# 注: 新版scikit-learn已移除sklearn.externals.joblib, 直接使用独立的joblib包即可
import joblib
# 导入keras中的词汇映射器Tokenizer
from keras.preprocessing.text import Tokenizer
# 假定vocab为语料集所有不同词汇集合
vocab = {"周杰伦", "陈奕迅", "王力宏", "李宗盛", "吴亦凡", "鹿晗"}
# 实例化一个词汇映射器对象
t = Tokenizer(num_words=None, char_level=False)
# 使用映射器拟合现有文本数据
t.fit_on_texts(vocab)

for token in vocab:
    zero_list = [0]*len(vocab)
    # 使用映射器转化现有文本数据, 每个词汇对应从1开始的自然数
    # 返回样式如: [[2]], 取出其中的数字需要使用[0][0]
    token_index = t.texts_to_sequences([token])[0][0] - 1
    zero_list[token_index] = 1
    print(token, "的one-hot编码为:", zero_list)

# 使用joblib工具保存映射器, 以便之后使用
tokenizer_path = "./Tokenizer"
joblib.dump(t, tokenizer_path)

Output effect:

鹿晗 的one-hot编码为: [1, 0, 0, 0, 0, 0]
王力宏 的one-hot编码为: [0, 1, 0, 0, 0, 0]
李宗盛 的one-hot编码为: [0, 0, 1, 0, 0, 0]
陈奕迅 的one-hot编码为: [0, 0, 0, 1, 0, 0]
周杰伦 的one-hot编码为: [0, 0, 0, 0, 1, 0]
吴亦凡 的one-hot编码为: [0, 0, 0, 0, 0, 1]

# 同时在当前目录生成Tokenizer文件, 以便之后使用
  • Using the one-hot encoder:
# 导入用于对象保存与加载的joblib
# import joblib
# 加载之前保存的Tokenizer, 实例化一个t对象
t = joblib.load(tokenizer_path)

# 编码token为"李宗盛"
token = "李宗盛"
# 使用t获得token_index
token_index = t.texts_to_sequences([token])[0][0] - 1
# 初始化一个zero_list
zero_list = [0]*len(vocab)
# 令zero_List的对应索引为1
zero_list[token_index] = 1
print(token, "的one-hot编码为:", zero_list) 

Output effect:

李宗盛 的one-hot编码为: [1, 0, 0, 0, 0, 0]

4.3 Advantages and disadvantages of one-hot encoding

  • Advantages: simple operation, easy to understand.
  • Disadvantages: the connections between words are completely lost, and on a large corpus each vector becomes as long as the vocabulary, consuming a large amount of memory.

Explanation:
Because of these obvious disadvantages, one-hot encoding is used less and less; it has been replaced by the dense vector representations we will study next, word2vec and word embedding.

5. Word2Vec

Detailed theory: https://blog.csdn.net/v_JULY_v/article/details/102708459

5.1 What is word2vec

Word2vec is a popular unsupervised training method that represents words as vectors. It builds a neural network model and uses the network parameters as the vector representations of the words. It includes two training modes: CBOW and skipgram.

(1) CBOW (Continuous bag of words) mode

Given a training corpus, a window of a certain length is selected as the object of study, and the context words are used to predict the target word.

[Figure: CBOW example — a window of size 9 slides over the text and the 4 words before and after predict the target word]

Analysis:
The window size in the figure is 9, and the target vocabulary is predicted using 4 words before and after.

Word2vec process description in CBOW mode:

Assume our training corpus contains only one sentence, "Hope can set you free" (may you grow freely), and the window size is 3. The model's first training sample therefore comes from "Hope can set"; because this is CBOW mode, Hope and set are used as input and can as output. During training, Hope, can, and set are all represented by their one-hot encodings. As shown in the figure, each one-hot encoded context word is multiplied by the shared transformation matrix (the 3x5 parameter matrix, where 3 is the final word-vector dimension), and the results are added together to obtain the context representation matrix (3x1).
[Figure: each one-hot context word is multiplied by the shared 3x5 parameter matrix and the results are summed into a 3x1 context representation]

Next, the context representation matrix is multiplied by another transformation matrix (a 5x3 parameter matrix; the input transformation matrices of all context words share parameters) to obtain a 5x1 result matrix. This result is compared with the real target, i.e. the one-hot encoding matrix (5x1) of can, to compute the loss, and the network parameters are then updated, completing one training iteration.

[Figure: the 3x1 context representation is multiplied by the 5x3 parameter matrix and compared with the 5x1 one-hot target to compute the loss]

Finally, the window slides backward through the text and the parameters keep being updated until the whole corpus has been traversed, yielding the final transformation matrix (3x5). Multiplying this transformation matrix by the one-hot encoding (5x1) of any word gives a 3x1 matrix, which is that word's word2vec representation.
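To make the matrix shapes above concrete, here is a minimal numpy sketch of a single CBOW training step on the toy sentence "Hope can set you free", using the same 3x5 / 5x3 shapes. It is an illustrative sketch of mine (summing the context vectors and using a full softmax), not the author's code or an optimized word2vec implementation:

import numpy as np

# toy vocabulary: "Hope can set you free" -> indices 0..4
vocab = ["Hope", "can", "set", "you", "free"]
V, D = len(vocab), 3                      # vocabulary size 5, word-vector dimension 3

rng = np.random.default_rng(0)
W_in = rng.normal(size=(D, V)) * 0.1      # 3x5 input transformation matrix (shared by context words)
W_out = rng.normal(size=(V, D)) * 0.1     # 5x3 output transformation matrix

def one_hot(i):
    v = np.zeros((V, 1))
    v[i] = 1.0
    return v

# first window "Hope can set": context = Hope, set; target = can
context_idx, target_idx = [0, 2], 1

h = sum(W_in @ one_hot(i) for i in context_idx)   # 3x1 context representation
scores = W_out @ h                                # 5x1 scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()     # softmax
loss = -np.log(probs[target_idx, 0])              # cross-entropy against the one-hot target "can"

# one SGD step (standard softmax / cross-entropy gradients)
d_scores = probs - one_hot(target_idx)            # 5x1
d_h = W_out.T @ d_scores                          # 3x1 gradient flowing back to the context representation
lr = 0.1
W_out -= lr * (d_scores @ h.T)                    # update the 5x3 output matrix
for i in context_idx:                             # each context word updates its column of the 3x5 matrix
    W_in[:, i] -= lr * d_h[:, 0]

print("loss on the window 'Hope can set':", float(loss))
# after training, column i of W_in (3x1) is the word2vec representation of word i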

(2) skipgram mode

Given a training corpus, a window of a certain length is selected as the object of study, and the target word is used to predict the context words.

[Figure: skipgram example — a window of size 9 slides over the text and the target word predicts the 4 words before and after]

Analysis:
The window size in the figure is 9, and the target vocabulary is used to predict the four words before and after.

Word2vec process description in skipgram mode:

  • Assume again that our training corpus contains only the sentence "Hope can set you free" (may you grow freely) and that the window size is 3, so the first training sample comes from "Hope can set". Because this is skipgram mode, can is used as input and Hope and set as output. During training, Hope, can, and set are all represented by their one-hot encodings. As shown in the figure, the one-hot encoding of can is multiplied by the transformation matrix (the 3x5 parameter matrix, where 3 is the final word-vector dimension) to obtain the target word representation matrix (3x1).

  • Next, the target word representation matrix is multiplied by the output transformation matrices (5x3 parameter matrices) to obtain 5x1 result matrices, which are compared with the one-hot encoding matrices (5x1) of Hope and set to compute the loss; the network parameters are then updated, completing one training iteration.

[Figure: the 3x1 target word representation is multiplied by the 5x3 parameter matrices and compared with the one-hot context words to compute the loss]

  • Finally, the window slides backward through the text and the parameters keep being updated until the whole corpus has been traversed; the final transformation matrix (3x5) is the parameter matrix we keep. Multiplying it by the one-hot encoding (5x1) of any word gives a 3x1 matrix, which is that word's word2vec representation. (A brief library-level sketch of both modes follows below.)
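In practice these updates are handled by a library. As a point of comparison with the fasttext workflow in the next section, here is a small hedged sketch using gensim (my choice for illustration, not a tool used elsewhere in this article); the sg flag switches between the two modes:

# minimal sketch with gensim (gensim >= 4.0 uses vector_size; older versions use size)
from gensim.models import Word2Vec

# a tiny toy corpus: a list of tokenized sentences
sentences = [["hope", "can", "set", "you", "free"],
             ["the", "quick", "brown", "fox", "jumps"]]

# sg=0 -> CBOW (context predicts target), sg=1 -> skipgram (target predicts context)
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0, epochs=10)
sg_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1, epochs=10)

print(cbow_model.wv["hope"].shape)        # (100,) -- the word2vec vector of "hope"
print(sg_model.wv.most_similar("hope"))   # nearest neighbours by cosine similarity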

5.2 Use the fasttext tool to train and use word2vec

Step 1: Obtain training data
Step 2: Train word vector
Step 3: Model hyperparameter setting
Step 4: Model effect test
Step 5: Save and reload the model

  • Step 1: Get training data
# 在这里, 我们将研究英语维基百科的部分网页信息, 它的大小在300M左右
# 这些语料已经被准备好, 我们可以通过Matt Mahoney的网站下载.
# 首先创建一个存储数据的文件夹data
$ mkdir data
# 使用wget下载数据的zip压缩包, 它将存储在data目录中
$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
# 使用unzip解压, 如果你的服务器中还没有unzip命令, 请使用: yum install unzip -y
# 解压后在data目录下会出现enwik9的文件夹
$ unzip data/enwik9.zip -d data

View raw data:

$ head -10 data/enwik9


# 原始数据将输出很多包含XML/HTML格式的内容, 这些内容并不是我们需要的
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.6alpha</generator>
    <case>first-letter</case>
      <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />

Raw data processing:

# 使用wikifil.pl文件处理脚本来清除XML/HTML格式的内容
# 注: wikifil.pl文件已为大家提供
$ perl wikifil.pl data/enwik9 > data/fil9

View the preprocessed data:

# 查看前80个字符
head -c 80 data/fil9

# 输出结果为由空格分割的单词
 anarchism originated as a term of abuse first used against early working class
  • Step 2: Training word vectors
# 代码运行在python解释器中
# 导入fasttext
>>> import fasttext
# 使用fasttext的train_unsupervised(无监督训练方法)进行词向量的训练
# 它的参数是数据集的持久化文件路径'data/fil9'
>>> model = fasttext.train_unsupervised('data/fil9')


# 有效训练词汇量为124M, 共218316个单词
Read 124M words
Number of words:  218316
Number of labels: 0
Progress: 100.0% words/sec/thread:   53996 lr:  0.000000 loss:  0.734999 ETA:   0h 0m

View the word vector corresponding to the word:

# 通过get_word_vector方法来获得指定词汇的词向量
>>> model.get_word_vector("the")

array([-0.03087516,  0.09221972,  0.17660329,  0.17308897,  0.12863874,
        0.13912526, -0.09851588,  0.00739991,  0.37038437, -0.00845221,
        ...
       -0.21184735, -0.05048715, -0.34571868,  0.23765688,  0.23726143],
      dtype=float32)
  • Step 3: Model hyperparameter setting
# 在训练词向量过程中, 我们可以设定很多常用超参数来调节我们的模型效果, 如:
# 无监督训练模式: 'skipgram' 或者 'cbow', 默认为'skipgram', 在实践中,skipgram模式在利用子词方面比cbow更好.
# 词嵌入维度dim: 默认为100, 但随着语料库的增大, 词嵌入的维度往往也要更大.
# 数据循环次数epoch: 默认为5, 但当你的数据集足够大, 可能不需要那么多次.
# 学习率lr: 默认为0.05, 根据经验, 建议选择[0.01,1]范围内.
# 使用的线程数thread: 默认为12个线程, 一般建议和你的cpu核数相同.

>>> model = fasttext.train_unsupervised('data/fil9', "cbow", dim=300, epoch=1, lr=0.1, thread=8)

Read 124M words
Number of words:  218316
Number of labels: 0
Progress: 100.0% words/sec/thread:   49523 lr:  0.000000 avg.loss:  1.777205 ETA:   0h 0m 0s
  • Step 4: Model Effect Test
# 检查单词向量质量的一种简单方法就是查看其邻近单词, 通过我们主观来判断这些邻近单词是否与目标单词相关来粗略评定模型效果好坏.

# 查找"运动"的邻近单词, 我们可以发现"体育网", "运动汽车", "运动服"等. 
>>> model.get_nearest_neighbors('sports')

[(0.8414610624313354, 'sportsnet'), (0.8134572505950928, 'sport'), (0.8100415468215942, 'sportscars'), (0.8021156787872314, 'sportsground'), (0.7889881134033203, 'sportswomen'), (0.7863013744354248, 'sportsplex'), (0.7786710262298584, 'sporty'), (0.7696356177330017, 'sportscar'), (0.7619683146476746, 'sportswear'), (0.7600985765457153, 'sportin')]


# 查找"音乐"的邻近单词, 我们可以发现与音乐有关的词汇.
>>> model.get_nearest_neighbors('music')

[(0.8908010125160217, 'emusic'), (0.8464668393135071, 'musicmoz'), (0.8444250822067261, 'musics'), (0.8113634586334229, 'allmusic'), (0.8106718063354492, 'musices'), (0.8049437999725342, 'musicam'), (0.8004694581031799, 'musicom'), (0.7952923774719238, 'muchmusic'), (0.7852965593338013, 'musicweb'), (0.7767147421836853, 'musico')]

# 查找"小狗"的邻近单词, 我们可以发现与小狗有关的词汇.
>>> model.get_nearest_neighbors('dog')

[(0.8456876873970032, 'catdog'), (0.7480780482292175, 'dogcow'), (0.7289096117019653, 'sleddog'), (0.7269964218139648, 'hotdog'), (0.7114801406860352, 'sheepdog'), (0.6947550773620605, 'dogo'), (0.6897546648979187, 'bodog'), (0.6621081829071045, 'maddog'), (0.6605004072189331, 'dogs'), (0.6398137211799622, 'dogpile')]
  • Step 5: Save and reload the model
# 使用save_model保存模型
>>> model.save_model("fil9.bin")

# 使用fasttext.load_model加载模型
>>> model = fasttext.load_model("fil9.bin")
>>> model.get_word_vector("the")

array([-0.03087516,  0.09221972,  0.17660329,  0.17308897,  0.12863874,
        0.13912526, -0.09851588,  0.00739991,  0.37038437, -0.00845221,
        ...
       -0.21184735, -0.05048715, -0.34571868,  0.23765688,  0.23726143],
      dtype=float32)

6. Word Embedding

6.1 What is word embedding

Word embedding maps vocabulary, by some learned method, into a vector space of a specified dimension (generally a higher-dimensional space).

  • In the broad sense, word embedding includes all dense word-vector representation methods; word2vec, which we studied above, can be regarded as a kind of word embedding.
  • In the narrow sense, word embedding refers to an embedding layer added to a neural network, whose embedding matrix (the parameters of the embedding layer) is produced while the whole network is trained. This embedding matrix is made up of the vector representations of all the input words used during training.

A hands-on example of word embedding in the narrow sense: https://blog.csdn.net/sinat_28015305/article/details/109344923
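As a concrete illustration of the narrow sense described above, here is a minimal PyTorch sketch of an embedding layer; the vocabulary size, dimension and word indices are made-up values for illustration only:

import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 8           # assumed values, just for illustration
embedding = nn.Embedding(vocab_size, embed_dim)

# a "sentence" of 4 word indices (as produced by some vocabulary mapping)
word_ids = torch.tensor([[1, 34, 21, 7]])
vectors = embedding(word_ids)             # shape (1, 4, 8): one dense vector per word
print(vectors.shape)

# embedding.weight is the embedding matrix mentioned above; it is a trainable
# parameter updated together with the rest of the network during training
print(embedding.weight.shape)             # torch.Size([1000, 8])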

6.2 Visual analysis of word embedding

Visualize the embedded word vectors by using tensorboard.

# 导入torch和tensorboard的摘要写入方法
import torch
import json
import fileinput
from torch.utils.tensorboard import SummaryWriter
# 实例化一个摘要写入对象
writer = SummaryWriter()

# 随机初始化一个100x50的矩阵, 认为它是我们已经得到的词嵌入矩阵
# 代表100个词汇, 每个词汇被表示成50维的向量
embedded = torch.randn(100, 50)

# 导入事先准备好的100个中文词汇文件, 形成meta列表原始词汇
meta = list(map(lambda x: x.strip(), fileinput.FileInput("./vocab100.csv")))
writer.add_embedding(embedded, metadata=meta)
writer.close()

Start the tensorboard service in the terminal:

$ tensorboard --logdir runs --host 0.0.0.0


# 通过http://0.0.0.0:6006访问浏览器可视化页面

The browser view is shown below; the nearest-neighbor panel on the right can be used to check the effect:

[Figure: tensorboard embedding projector showing the 100 word vectors]

2. Text data analysis

2.1 The role of text data analysis:

Text data analysis can effectively help us understand the data corpus, quickly check out possible problems in the corpus, and guide the selection of some hyperparameters in the model training process.

2.2 Several commonly used text data analysis methods:

  • Label Quantity Distribution
  • Sentence length distribution
  • Word frequency statistics and keyword cloud

Explanation:
We will explain several commonly used text data analysis methods based on real Chinese hotel review corpus.

  • Chinese hotel review corpus:
    A two-class Chinese sentiment analysis corpus, stored in the "./cn_data" directory.
    train.tsv is the training set and dev.tsv is the validation set; the two files have the same format.

  • train.tsv data style:

sentence    label
早餐不好,服务不到位,晚餐无西餐,早餐晚餐相同,房间条件不好,餐厅不分吸烟区.房间不分有无烟房.    0
去的时候 ,酒店大厅和餐厅在装修,感觉大厅有点挤.由于餐厅装修本来该享受的早饭,也没有享受(他们是8点开始每个房间送,但是我时间来不及了)不过前台服务员态度好!    1
有很长时间没有在西藏大厦住了,以前去北京在这里住的较多。这次住进来发现换了液晶电视,但网络不是很好,他们自己说是收费的原因造成的。其它还好。  1
非常好的地理位置,住的是豪华海景房,打开窗户就可以看见栈桥和海景。记得很早以前也住过,现在重新装修了。总的来说比较满意,以后还会住   1
交通很方便,房间小了一点,但是干净整洁,很有香港的特色,性价比较高,推荐一下哦 1
酒店的装修比较陈旧,房间的隔音,主要是卫生间的隔音非常差,只能算是一般的    0
酒店有点旧,房间比较小,但酒店的位子不错,就在海边,可以直接去游泳。8楼的海景打开窗户就是海。如果想住在热闹的地带,这里不是一个很好的选择,不过威海城市真的比较小,打车还是相当便宜的。晚上酒店门口出租车比较少。   1
位置很好,走路到文庙、清凉寺5分钟都用不了,周边公交车很多很方便,就是出租车不太爱去(老城区路窄爱堵车),因为是老宾馆所以设施要陈旧些,    1
酒店设备一般,套房里卧室的不能上网,要到客厅去。    0
  • train.tsv data format description:
    The data in train.tsv has two columns. The first column is the review text carrying emotional polarity; the second column is 0 or 1, indicating whether each text is negative or positive, where 0 is negative and 1 is positive.

2.3 Obtain the label number distribution of training set and verification set

# 导入必备工具包
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# 设置显示风格
plt.style.use('fivethirtyeight') 

# 分别读取训练tsv和验证tsv
train_data = pd.read_csv("./cn_data/train.tsv", sep="\t")
valid_data = pd.read_csv("./cn_data/dev.tsv", sep="\t")


# 获得训练数据标签数量分布
sns.countplot("label", data=train_data)
plt.title("train_data")
plt.show()


# 获取验证数据标签数量分布
sns.countplot("label", data=valid_data)
plt.title("valid_data")
plt.show()
  • The distribution of the number of labels in the training set:
    [Figure: bar chart of label counts in the training set]

  • The distribution of the number of labels in the validation set:
    [Figure: bar chart of label counts in the validation set]

Analysis:
In deep-learning model evaluation we generally use accuracy (ACC) as the metric. If we want the ACC baseline to sit at about 50%, the ratio of positive to negative samples needs to stay around 1:1; otherwise the necessary data augmentation or down-sampling must be carried out. In the figures above, the positive and negative samples of the training and validation sets are slightly unbalanced, so some data augmentation could be applied.
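To quantify the balance instead of only eyeballing the bar charts, the label ratio can also be printed directly; this is a small addition of mine using the same train_data and valid_data frames:

# proportion of each label; values close to 0.5 / 0.5 mean the ~50% ACC baseline holds
print(train_data["label"].value_counts(normalize=True))
print(valid_data["label"].value_counts(normalize=True))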

2.4 Obtain the sentence length distribution of training set and verification set

# 在训练数据中添加新的句子长度列, 每个元素的值都是对应的句子列的长度
train_data["sentence_length"] = list(map(lambda x: len(x), train_data["sentence"]))

# 绘制句子长度列的数量分布图
sns.countplot("sentence_length", data=train_data)
# 主要关注count长度分布的纵坐标, 不需要绘制横坐标, 横坐标范围通过dist图进行查看
plt.xticks([])
plt.show()

# 绘制dist长度分布图
sns.distplot(train_data["sentence_length"])

# 主要关注dist长度分布横坐标, 不需要绘制纵坐标
plt.yticks([])
plt.show()


# 在验证数据中添加新的句子长度列, 每个元素的值都是对应的句子列的长度
valid_data["sentence_length"] = list(map(lambda x: len(x), valid_data["sentence"]))

# 绘制句子长度列的数量分布图
sns.countplot("sentence_length", data=valid_data)

# 主要关注count长度分布的纵坐标, 不需要绘制横坐标, 横坐标范围通过dist图进行查看
plt.xticks([])
plt.show()

# 绘制dist长度分布图
sns.distplot(valid_data["sentence_length"])

# 主要关注dist长度分布横坐标, 不需要绘制纵坐标
plt.yticks([])
plt.show()
  • Sentence length distribution in the training set:
    [Figure: count plot of sentence lengths in the training set]

[Figure: dist plot of sentence lengths in the training set]

  • Verification set sentence length distribution:
    [Figure: count plot of sentence lengths in the validation set]

[Figure: dist plot of sentence lengths in the validation set]

Analysis:
By drawing the sentence-length distributions we learn the range in which most sentence lengths in the corpus fall. Because the model input must be a fixed-size tensor, a reasonable length range plays a key guiding role in the later sentence truncation and padding (to a standard length). In the figures above, most sentence lengths fall roughly between 20 and 250.
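Instead of reading the range off the plots only by eye, the covering length later used as cutlen (Section 3.4) can be estimated from the new sentence_length column; this helper is my own small sketch, not part of the original code:

import numpy as np

# sentence length that covers about 90% of the corpus -- a reasonable starting
# point for the truncation length (cutlen) used in Section 3.4
print(np.percentile(train_data["sentence_length"], 90))
print(np.percentile(valid_data["sentence_length"], 90))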

2.5 Obtain the positive and negative sample length scatter distribution of the training set and the verification set

# 绘制训练集长度分布的散点图
sns.stripplot(y='sentence_length',x='label',data=train_data)
plt.show()

# 绘制验证集长度分布的散点图
sns.stripplot(y='sentence_length',x='label',data=valid_data)
plt.show()
  • The length scatter distribution of positive and negative samples on the training set:
    [Figure: scatter plot of sentence length by label on the training set]

  • The length scatter distribution of positive and negative samples on the validation set:
    [Figure: scatter plot of sentence length by label on the validation set]

Analysis:
Looking at the scatter plots of positive and negative sample lengths lets us locate outliers effectively and review the corpus more precisely. In the figure above, an outlier appears among the positive samples of the training set, with a sentence length of nearly 3500 characters; it needs manual review.
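To pull out the suspicious points seen in the scatter plots for manual review, a simple filter on the length column is enough; the 1000-character threshold below is only an assumed cut-off of mine:

# list unusually long positive training samples for manual review
outliers = train_data[(train_data["label"] == 1) & (train_data["sentence_length"] > 1000)]
print(outliers[["sentence_length", "sentence"]].head())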

2.6 Obtain the statistics of the total number of different words in the training set and the verification set

# 导入jieba用于分词
# 导入chain方法用于扁平化列表
import jieba
from itertools import chain

# 进行训练集的句子进行分词, 并统计出不同词汇的总数
train_vocab = set(chain(*map(lambda x: jieba.lcut(x), train_data["sentence"])))
print("训练集共包含不同词汇总数为:", len(train_vocab))

# 进行验证集的句子进行分词, 并统计出不同词汇的总数
valid_vocab = set(chain(*map(lambda x: jieba.lcut(x), valid_data["sentence"])))
print("训练集共包含不同词汇总数为:", len(valid_vocab))

Output effect:

训练集共包含不同词汇总数为: 12147
验证集共包含不同词汇总数为: 6857

2.7 Obtain the high-frequency adjective word cloud of positive and negative samples on the training set

# 使用jieba中的词性标注功能
import jieba.posseg as pseg

def get_a_list(text):
    """用于获取形容词列表"""
    # 使用jieba的词性标注方法切分文本,获得具有词性属性flag和词汇属性word的对象, 
    # 从而判断flag是否为形容词,来返回对应的词汇
    r = []
    for g in pseg.lcut(text):
        if g.flag == "a":
            r.append(g.word)
    return r

# 导入绘制词云的工具包
from wordcloud import WordCloud

def get_word_cloud(keywords_list):
    # 实例化绘制词云的类, 其中参数font_path是字体路径, 为了能够显示中文, 
    # max_words指词云图像最多显示多少个词, background_color为背景颜色 
    wordcloud = WordCloud(font_path="./SimHei.ttf", max_words=100, background_color="white")
    # 将传入的列表转化成词云生成器需要的字符串形式
    keywords_string = " ".join(keywords_list)
    # 生成词云
    wordcloud.generate(keywords_string)

    # 绘制图像并显示
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# 获得训练集上正样本
p_train_data = train_data[train_data["label"]==1]["sentence"]

# 对正样本的每个句子的形容词
train_p_a_vocab = chain(*map(lambda x: get_a_list(x), p_train_data))
#print(train_p_n_vocab)

# 获得训练集上负样本
n_train_data = train_data[train_data["label"]==0]["sentence"]

# 获取负样本的每个句子的形容词
train_n_a_vocab = chain(*map(lambda x: get_a_list(x), n_train_data))

# 调用绘制词云函数
get_word_cloud(train_p_a_vocab)
get_word_cloud(train_n_a_vocab)
  • Training set positive sample adjective word cloud:
    [Figure: word cloud of high-frequency adjectives in training-set positive samples]

  • Training set negative sample adjective word cloud:
    [Figure: word cloud of high-frequency adjectives in training-set negative samples]

2.8 Obtain the adjective word cloud of positive and negative samples on the validation set

# 获得验证集上正样本
p_valid_data = valid_data[valid_data["label"]==1]["sentence"]

# 对正样本的每个句子的形容词
valid_p_a_vocab = chain(*map(lambda x: get_a_list(x), p_valid_data))
#print(train_p_n_vocab)

# 获得验证集上负样本
n_valid_data = valid_data[valid_data["label"]==0]["sentence"]

# 获取负样本的每个句子的形容词
valid_n_a_vocab = chain(*map(lambda x: get_a_list(x), n_valid_data))

# 调用绘制词云函数
get_word_cloud(valid_p_a_vocab)
get_word_cloud(valid_n_a_vocab)
  • Verification set positive sample adjective word cloud:
    [Figure: word cloud of high-frequency adjectives in validation-set positive samples]

  • Verification set negative sample adjective word cloud:
    [Figure: word cloud of high-frequency adjectives in validation-set negative samples]

Analysis:
From the word clouds of high-frequency adjectives we can make a rough judgement about the quality of the current corpus, and manually review and correct words that contradict the corpus label, so that most of the corpus meets the training standard. In the figures above, the positive samples are mostly commendatory words and the negative samples mostly derogatory ones, which basically meets the requirement; still, commendatory words such as "convenient" also appear in the negative-sample word cloud, so a manual review can be carried out.

3. Text feature processing

3.1 The role of text feature processing

Text feature processing includes adding universal text features to the corpus, such as n-gram features, and performing the necessary processing on the corpus after these features are added, such as length specification. These steps bring important text features into model training and improve the model's evaluation metrics.

3.2 Common text feature processing methods:

  • Add n-gram features
  • Text Length Specification

3.3 n-gram features

(1) What are n-gram features

Given a text sequence, the co-occurrence of n adjacent words or characters constitutes an n-gram feature. The most commonly used are bi-gram and tri-gram features, corresponding to n = 2 and n = 3 respectively.

For example:

Suppose we are given the tokenized list: ["是谁", "敲动", "我心"]

and the corresponding numeric mapping list: [1, 34, 21]

Each number in the mapping list can be regarded as a vocabulary feature.

In addition, we can treat the fact that "是谁" and "敲动" appear together and adjacent as a feature and add it to the sequence.

Suppose the number 1000 represents "是谁" and "敲动" appearing together and adjacent.

The numeric mapping list then becomes a feature list containing a 2-gram feature: [1, 34, 21, 1000]

Here, "是谁"/"敲动" appearing together and adjacent is one of the bi-gram features.

"敲动" and "我心" also co-occur and are adjacent, so they form a bi-gram feature as well.

Suppose 1001 represents "敲动" and "我心" appearing together and adjacent.

Then the original numeric mapping list [1, 34, 21], after the bi-gram features are added, becomes [1, 34, 21, 1000, 1001]

(2) Extract n-gram features

# 一般n-gram中的n取2或者3, 这里取2为例
ngram_range = 2

def create_ngram_set(input_list):
    """
    description: 从数值列表中提取所有的n-gram特征
    :param input_list: 输入的数值列表, 可以看作是词汇映射后的列表, 
                       里面每个数字的取值范围为[1, 25000]
    :return: n-gram特征组成的集合

    eg:
    >>> create_ngram_set([1, 4, 9, 4, 1, 4])
    {(4, 9), (4, 1), (1, 4), (9, 4)}
    """ 
    return set(zip(*[input_list[i:] for i in range(ngram_range)]))

Call it:

input_list = [1, 3, 2, 1, 5, 3]
res = create_ngram_set(input_list)
print(res)

Output effect:

# 该输入列表的所有bi-gram特征
{(3, 2), (1, 3), (2, 1), (1, 5), (5, 3)}
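To connect this with the worked example above (where the bi-gram "是谁/敲动" was assigned the new feature id 1000), the sketch below assigns a fresh id to every extracted bi-gram and appends it to the original sequence; it reuses create_ngram_set from above, and the starting id 1000 is just an assumed offset above the existing vocabulary range:

def add_ngram_features(input_list, start_index=1000):
    """Assign a new id to each bi-gram of input_list and append it to the sequence (illustrative sketch)."""
    token_indice = {}
    new_list = list(input_list)
    for ngram in sorted(create_ngram_set(input_list)):
        if ngram not in token_indice:
            token_indice[ngram] = start_index + len(token_indice)
        new_list.append(token_indice[ngram])
    return new_list

print(add_ngram_features([1, 34, 21]))
# -> [1, 34, 21, 1000, 1001], the bi-gram-augmented list from the example above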

3.4 Text length specification and its function

The input of a typical model must be a matrix of equal-sized rows, so before entering the model the length of every mapped text sequence needs to be standardized. Based on the sentence-length distribution analysis, a reasonable length that covers most of the text is chosen; longer texts are truncated to it and shorter texts are padded (generally with the number 0). This process is text length specification.

Implementation of text length specification:

from keras.preprocessing import sequence

# cutlen根据数据分析中句子长度分布,覆盖90%左右语料的最短长度.
# 这里假定cutlen为10
cutlen = 10

def padding(x_train):
    """
    description: 对输入文本张量进行长度规范
    :param x_train: 文本的张量表示, 形如: [[1, 32, 32, 61], [2, 54, 21, 7, 19]]
    :return: 进行截断补齐后的文本张量表示 
    """
    # 使用sequence.pad_sequences即可完成
    return sequence.pad_sequences(x_train, cutlen)

Call it:

# 假定x_train里面有两条文本, 一条长度大于10, 一条小于10
x_train = [[1, 23, 5, 32, 55, 63, 2, 21, 78, 32, 23, 1],
           [2, 32, 1, 23, 1]]

res = padding(x_train)
print(res)

Output effect:

[[ 5 32 55 63  2 21 78 32 23  1]
 [ 0  0  0  0  0  2 32  1 23  1]]

4. Text Data Augmentation

1. Common text data augmentation methods:

back-translation data augmentation

2. What is the back-translation data augmentation method

Back-translation is currently one of the better augmentation methods for text data. Generally, based on the Google Translate interface, the text is translated into another language (usually a less common one) and then translated back into the original language. The result can be regarded as new corpus carrying the same label as the original, and adding it to the original data set augments the data.

  • Advantages of back-translation augmentation:
    easy to operate, and the new corpus is of fairly high quality.
  • Problem with back-translation augmentation:
    for short texts, the new corpus may have a high repetition rate with the original corpus and therefore fail to effectively enlarge the feature space of the samples.
  • Solution to the high repetition rate:
    chain translations through several languages, e.g. Chinese -> Korean -> Japanese -> English -> Chinese. Experience suggests using at most three consecutive translations; more hops lead to low efficiency, semantic distortion, and other problems.

Back-translated data augmentation implementation:

# 假设取两条已经存在的正样本和两条负样本
# 将基于这四条样本产生新的同标签的四条样本
p_sample1 = "酒店设施非常不错"
p_sample2 = "这家价格很便宜"
n_sample1 = "拖鞋都发霉了, 太差了"
n_sample2 = "电视不好用, 没有看到足球"

# 导入google翻译接口工具
from googletrans import Translator
# 实例化翻译对象
translator = Translator()
# 进行第一次批量翻译, 翻译目标是韩语
translations = translator.translate([p_sample1, p_sample2, n_sample1, n_sample2], dest='ko')
# 获得翻译后的结果
ko_res = list(map(lambda x: x.text, translations))
# 打印结果
print("中间翻译结果:")
print(ko_res)


# 最后在翻译回中文, 完成回译全部流程
translations = translator.translate(ko_res, dest='zh-cn')
cn_res = list(map(lambda x: x.text, translations))
print("回译得到的增强数据:")
print(cn_res)

Output effect:

中间翻译结果:
['호텔 시설은 아주 좋다', '이 가격은 매우 저렴합니다', '슬리퍼 곰팡이가 핀이다, 나쁜', 'TV가 잘 작동하지 않습니다, 나는 축구를 볼 수 없습니다']
回译得到的增强数据:
['酒店设施都非常好', '这个价格是非常实惠', '拖鞋都发霉了,坏', '电视不工作,我不能去看足球']
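For the multi-language chain suggested above (e.g. Chinese -> Korean -> Japanese -> English -> Chinese), the same interface can be applied hop by hop. This is only a sketch of mine, assuming the googletrans Translator behaves as in the example above (the free interface is not always reliable):

def back_translate(samples, langs=('ko', 'ja', 'en'), final='zh-cn'):
    """Translate through each intermediate language in turn, then back to the original language."""
    texts = list(samples)
    for lang in list(langs) + [final]:
        translations = translator.translate(texts, dest=lang)
        texts = [t.text for t in translations]
    return texts

print(back_translate([p_sample1, n_sample1]))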

Appendix

jieba part-of-speech tag table:

- a 形容词 (adjective)
    - ad 副形词 (adjective used as adverb)
    - ag 形容词性语素 (adjectival morpheme)
    - an 名形词 (adjective used as noun)
- b 区别词 (distinguishing word)
- c 连词 (conjunction)
- d 副词 (adverb)
    - df
    - dg 副语素 (adverbial morpheme)
- e 叹词 (interjection)
- f 方位词 (locative word)
- g 语素 (morpheme)
- h 前接成分 (prefix element)
- i 成语 (idiom)
- j 简称略称 (abbreviation)
- k 后接成分 (suffix element)
- l 习用语 (idiomatic phrase)
- m 数词 (numeral)
    - mg
    - mq 数量词 (numeral plus measure word)
- n 名词 (noun)
    - ng 名词性语素 (nominal morpheme)
    - nr 人名 (person name)
    - nrfg
    - nrt
    - ns 地名 (place name)
    - nt 机构团体名 (organization name)
    - nz 其他专名 (other proper noun)
- o 拟声词 (onomatopoeia)
- p 介词 (preposition)
- q 量词 (measure word)
- r 代词 (pronoun)
    - rg 代词性语素 (pronominal morpheme)
    - rr 人称代词 (personal pronoun)
    - rz 指示代词 (demonstrative pronoun)
- s 处所词 (place word)
- t 时间词 (time word)
    - tg 时语素 (temporal morpheme)
- u 助词 (particle)
    - ud 结构助词 得 (structural particle 得)
    - ug 时态助词 (aspectual particle)
    - uj 结构助词 的 (structural particle 的)
    - ul 时态助词 了 (aspectual particle 了)
    - uv 结构助词 地 (structural particle 地)
    - uz 时态助词 着 (aspectual particle 着)
- v 动词 (verb)
    - vd 副动词 (verb used as adverb)
    - vg 动词性语素 (verbal morpheme)
    - vi 不及物动词 (intransitive verb)
    - vn 名动词 (verb used as noun)
    - vq
- x 非语素词 (non-morpheme word)
- y 语气词 (modal particle)
- z 状态词 (status word)
    - zg

hanlp part-of-speech tag table:

【Proper Noun——NR,专有名词】

【Temporal Noun——NT,时间名词】

【Localizer——LC,定位词】如“内”,“左右”

【Pronoun——PN,代词】

【Determiner——DT,限定词】如“这”,“全体”

【Cardinal Number——CD,数词】

【Ordinal Number——OD,次序词】如“第三十一”

【Measure word——M,单位词】如“杯”

【Verb:VA,VC,VE,VV,动词】

【Adverb:AD,副词】如“近”,“极大”

【Preposition:P,介词】如“随着”

【Subordinating conjunctions:CS,从属连词】

【Conjunctions:CC,连词】如“和”

【Particle:DEC,DEG,DEV,DER,AS,SP,ETC,MSP,小品词】如“的话”

【Interjections:IJ,感叹词】如“哈”

【onomatopoeia:ON,拟声词】如“哗啦啦”

【Other Noun-modifier:JJ】如“发稿/JJ 时间/NN”

【Punctuation:PU,标点符号】

【Foreign word:FW,外国词语】如“OK”
