LightSide Workflow and Error Analysis

I. Workflow

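1. Data sets:

20% - development (Dev) data - qualitative analysis, feature space design, error analysis

70% - cross-validation (CV) data - running the experiments

10% - final test data - testing the model trained on the cross-validation set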
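The 20/70/10 split above can be sketched in a few lines of Python. This is only an illustration: LightSide does not do this split for you, and the function name `split_dataset` and the fixed `seed` are assumptions made for the example.

```python
import random

def split_dataset(rows, seed=0):
    """Shuffle rows, then split them 20% / 70% / 10% (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    n_dev = n * 20 // 100                  # development data
    n_cv = n * 70 // 100                   # cross-validation data
    dev = shuffled[:n_dev]
    cv = shuffled[n_dev:n_dev + n_cv]
    test = shuffled[n_dev + n_cv:]         # whatever remains is the final test data
    return dev, cv, test

dev, cv, test = split_dataset(range(100))
print(len(dev), len(cv), len(test))  # 20 70 10
```

The three parts never overlap, so the final test data stays unseen until the very end.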

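2. Experimental procedure:

① Analyze the Dev data qualitatively.
② Extract features from the CV data set.
③ Train a baseline model on the CV data set and evaluate it on the same set by cross-validation.
④ Train a new model on the CV data set and test it on the Dev data set.
⑤ Examine the evaluation results on the Dev data set and perform error analysis.
⑥ Use the error analysis to come up with new ideas for the model's features.
⑦ Extract the new features from the CV data set.
⑧ Train a new model with the new features and evaluate it on the CV data set. If the result is not satisfactory, return to step ④.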

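3. Basic text feature extraction:

A text is represented as a vector in which each element corresponds to one word. This is the bag-of-words method.

Each element is set according to whether the text contains the corresponding word: 1 if the text contains cheese, 1 if it contains cows, and so on, and 0 otherwise.

However, "cheese make cows" and "cows make cheese" then produce the same vector, so this method loses all information about word order.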
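A minimal sketch of the bag-of-words idea shows the loss of word order directly. The function `bag_of_words` and the three-word vocabulary are made up for the example and are not LightSide's implementation.

```python
def bag_of_words(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

vocabulary = ["cheese", "cows", "make"]  # illustrative vocabulary
a = bag_of_words("cheese make cows", vocabulary)
b = bag_of_words("cows make cheese", vocabulary)
print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True
```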

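Take "because the cost of healthcare is just outta sight crazy" as an example.

The feature options have the following meanings:

Unigrams - split into single words: because, the, cost, ...

Bigrams - split into pairs of adjacent words: because the, the cost, cost of, of healthcare, healthcare is, ...

Trigrams - split into triples of adjacent words: because the cost, the cost of, cost of healthcare, ...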
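Unigrams, bigrams, and trigrams are all the same sliding-window operation with a different window size. The sketch below assumes simple whitespace tokenization, and the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "because the cost of healthcare is just outta sight crazy".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])   # ['because the', 'the cost', 'cost of']
print(trigrams[:2])  # ['because the cost', 'the cost of']
```

A sentence of 10 words yields 10 unigrams, 9 bigrams, and 8 trigrams.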

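POS Bigrams - tag every word with its part of speech, e.g. "the(DT) cost(NN) of(IN) healthcare(NN)", then split the tag sequence into adjacent pairs: DT NN, NN IN, IN NN, ... For languages other than English this is of little use, because the tags overlap heavily.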

Reference for the part-of-speech tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

http://www.cis.uni-muenchen.de/~schmid/tools/Tree Tagger/data/Penn-Treebank-Tagset.pdf (this resource could not be found)

Word/POS Pairs - pair each word with its tag, e.g. from "the(DT) cost(NN) of(IN) healthcare(NN)" every word-tag combination becomes one feature: the DT, cost NN, of IN, ...
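Both POS-based features can be derived from one tagged token list. In the sketch below the tags are written by hand following the Penn Treebank tagset, in which a preposition such as "of" is tagged IN. A real pipeline would get them from a tagger, and the variable names are illustrative only.

```python
# Hand-tagged tokens (assumption: Penn Treebank tags, no tagger is run here).
tagged = [("the", "DT"), ("cost", "NN"), ("of", "IN"), ("healthcare", "NN")]

# POS bigrams: adjacent pairs of tags, ignoring the words themselves.
tags = [tag for _, tag in tagged]
pos_bigrams = [f"{first} {second}" for first, second in zip(tags, tags[1:])]

# Word/POS pairs: each word combined with its own tag.
word_pos_pairs = [f"{word} {tag}" for word, tag in tagged]

print(pos_bigrams)     # ['DT NN', 'NN IN', 'IN NN']
print(word_pos_pairs)  # ['the DT', 'cost NN', 'of IN', 'healthcare NN']
```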

Line Length - the length of the text in words: the cost of healthcare = 4

Contains Non-Stopwords - stopwords are function words that carry no content, such as it. The phrase the cost of healthcare contains two non-stopwords: cost and healthcare.

Count Occurrences - the value of a word feature is the number of times it occurs (0 if it is absent), rather than the default of 1 if it occurs and 0 otherwise.
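The difference between the default binary values and occurrence counts can be seen in a small sketch. The function `unigram_features` and its `count_occurrences` flag are hypothetical names chosen to mirror the option, not LightSide's API.

```python
from collections import Counter

def unigram_features(text, vocabulary, count_occurrences=False):
    """Return one value per vocabulary word: a count, or 1/0 by default."""
    counts = Counter(text.lower().split())
    if count_occurrences:
        return [counts[word] for word in vocabulary]            # 0 if absent
    return [1 if counts[word] > 0 else 0 for word in vocabulary]

vocabulary = ["the", "cost", "cheese"]  # illustrative vocabulary
text = "the cost of the healthcare"
binary = unigram_features(text, vocabulary)
counted = unigram_features(text, vocabulary, count_occurrences=True)
print(binary, counted)  # [1, 1, 0] [2, 1, 0]
```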

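Include Punctuation - whether punctuation marks are included as features

Remove Stopwords - removes stopwords, e.g. the and of are removed from the cost of healthcare

Stem N-Grams - reduces English words to their stems by stripping inflectional endings such as -ed, -s, and -ing: healthcare costs → healthcare cost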
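The length, stopword, and stemming options can be illustrated together. Everything in this sketch is an assumption made for the example: the stopword set is a tiny hand-picked list, and `crude_stem` is a deliberately naive suffix stripper, not the stemmer LightSide actually uses.

```python
STOPWORDS = {"the", "of", "it", "is", "a", "an"}  # illustrative list only

def line_length(tokens):
    """Length of the text in words."""
    return len(tokens)

def non_stopwords(tokens):
    """Tokens that remain after stopwords are removed."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def crude_stem(word):
    """Strip one common English ending, keeping at least 3 characters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = "the cost of healthcare".split()
print(line_length(tokens))    # 4
print(non_stopwords(tokens))  # ['cost', 'healthcare']
print([crude_stem(w) for w in "healthcare costs".split()])  # ['healthcare', 'cost']
```

A real stemmer handles many more cases (for example doubled consonants and irregular forms) than this three-suffix rule.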