The previous chapter visualized the sentence lengths, categories, and characters of the news data. In this chapter we use machine learning to classify the texts.
1. Word Vectors
The source data consists of anonymized characters, which cannot be fed into a model directly: each character needs a mathematical representation, i.e. it must be mapped to a word vector. What is a word vector? It is simply a representation of a word in vector form. There are two main families: the one-hot encoding used in traditional machine learning, and the word-embedding techniques used in deep learning. We start with the traditional machine-learning representations.
1.1 One-Hot
Let the vocabulary size be n (the vocabulary contains n words). If a word sits at position k in the vocabulary, we build an n-dimensional vector whose k-th component is 1 and all other components are 0. This is one-hot encoding.
Here is a character-level example.
Sentence 1: 我 爱 北 京 天 安 门
Sentence 2: 我 喜 欢 上 海
The vocabulary is {我 爱 北 京 天 安 门 喜 欢 上 海}, which contains 11 characters.
With the vocabulary in hand, each sentence can be represented as a binary presence vector over the 11 characters (the element-wise maximum of the one-hot vectors of its characters):
Sentence 1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
Sentence 2: [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Drawbacks:
- When there are many distinct tokens, the feature space becomes very large.
- Word order is ignored.
- The vectors are sparse, which invites the curse of dimensionality.
1.2 TF-IDF
TF-IDF stands for term frequency–inverse document frequency. It is a weighting scheme widely used in information retrieval and text mining. The weight is a statistical measure: a term's importance grows in proportion to how often it appears in a document, but shrinks with how frequently it appears across the corpus.

Term frequency (TF) is the number of times a given term appears in a document. It is usually normalized (typically divided by the document's total number of terms) to avoid a bias toward long documents:

TF(t, d) = count(t, d) / |d|

Inverse document frequency (IDF) is the logarithm of the total number of documents N divided by the number of documents containing the term, df(t). The fewer documents contain term t, the larger its IDF, and the better the term discriminates between classes:

IDF(t) = log( N / (1 + df(t)) )

Note: the 1 in the denominator avoids division by zero.

The product of the two gives the final weight:

TF-IDF(t, d) = TF(t, d) × IDF(t)
Drawbacks:
- Word order is still ignored.
- It depends heavily on the corpus: a high-quality corpus that matches the target text is required.
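The formulas above can be checked with a small hand computation. This is a minimal sketch on a hypothetical three-document toy corpus, using the smoothed IDF with the +1 in the denominator as noted above:

```python
import math

# Hypothetical toy corpus: 3 documents of space-separated characters.
docs = [
    "我 爱 北 京".split(),
    "我 喜 欢 上 海".split(),
    "北 京 上 海".split(),
]

def tf(term, doc):
    # Term frequency normalized by document length, as described above.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF with +1 in the denominator to avoid division by zero.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + df))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "我" appears in 2 of 3 docs, so IDF = log(3 / 3) = 0 and the weight vanishes:
print(tf_idf("我", docs[0], docs))  # 0.0
# "爱" appears in only 1 doc, so it keeps a positive weight:
print(tf_idf("爱", docs[0], docs))
```

This shows the intended behavior: a character common to most documents carries no discriminative weight, while a rarer character does.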
2. Model Selection
This time only a linear classifier, RidgeClassifier, is used. A random forest was also tried, but it performed poorly.
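As a quick sanity check of the model choice, here is a minimal sketch of fitting RidgeClassifier and scoring with macro F1, the same metric used below. The dataset is synthetic (make_classification standing in for TF-IDF features), not the competition data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Hypothetical toy data: 300 samples, 50 features, 4 classes.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=666)

clf = RidgeClassifier(random_state=666)
clf.fit(X[:225], y[:225])            # first 225 rows as "train"
val_pred = clf.predict(X[225:])      # last 75 rows as "validation"
print(f1_score(y[225:], val_pred, average='macro'))
```

RidgeClassifier casts classification as L2-regularized regression on ±1 targets (one-vs-rest for multiclass), which trains quickly on high-dimensional sparse text features.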
3. Reference Code
Load the libraries:
import time
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.utils import shuffle
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
3.1 One-Hot + RidgeClassifier
Tune max_features.
def onehot_ridgeclassifier(nrows, train_num, max_features):
    start_time = time.time()
    train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=nrows)
    # Shuffle so the train/validation split is random.
    train_df = shuffle(train_df, random_state=666)
    # Note: CountVectorizer produces term counts (pass binary=True for strict
    # 0/1 features), and it is fit on all rows, so train and validation
    # share one vocabulary.
    vectorizer = CountVectorizer(max_features=max_features)
    train_text = vectorizer.fit_transform(train_df['text'])
    clf = RidgeClassifier(random_state=666)
    clf.fit(train_text[:train_num], train_df['label'].values[:train_num])
    train_pred = clf.predict(train_text[:train_num])
    val_pred = clf.predict(train_text[train_num:])
    print('One-Hot+RidgeClassifier Train f1_score: {}'.format(
        f1_score(train_df['label'].values[:train_num], train_pred, average='macro')))
    print('One-Hot+RidgeClassifier Val f1_score: {}'.format(
        f1_score(train_df['label'].values[train_num:], val_pred, average='macro')))
    train_time = time.time()
    print('Train time: {:.2f}s'.format(train_time - start_time))
    # Predict on the test set and save the submission.
    test_df = pd.read_csv('../input/test_a.csv')
    test_text = vectorizer.transform(test_df['text'])
    test_pred = clf.predict(test_text)
    test_pred = pd.DataFrame(test_pred, columns=['label'])
    test_pred.to_csv('../input/test_bagofwords_ridgeclassifier.csv', index=False)
    print('Test predict saved.')
    end_time = time.time()
    print('Predict time:{:.2f}s'.format(end_time - train_time))
if __name__ == '__main__':
    # nrows = 200000
    # train_num = int(nrows * 0.7)
    # max_features = 3000
    """
    One-Hot+RidgeClassifier Train f1_score: 0.8325002267944408
    One-Hot+RidgeClassifier Val f1_score: 0.8175875672165276
    Train time: 685.49s
    Test predict saved.
    Predict time:32.44s
    """
    nrows = 200000
    train_num = int(nrows * 0.7)
    max_features = 4000
    """
    One-Hot+RidgeClassifier Train f1_score: 0.8377852607681573
    One-Hot+RidgeClassifier Val f1_score: 0.8178684044527644
    Train time: 1058.56s
    Test predict saved.
    Predict time:31.95s
    """
    onehot_ridgeclassifier(nrows, train_num, max_features)
One-Hot+RidgeClassifier Val f1_score: 0.812
3.2 TF-IDF + RidgeClassifier
Tune max_features and ngram_range.
def tfidf_ridgeclassifier(nrows, train_num, max_features, ngram_range):
    start_time = time.time()
    train_df = pd.read_csv('../input/train_set.csv', sep='\t', nrows=nrows)
    # Shuffle so the train/validation split is random.
    train_df = shuffle(train_df, random_state=666)
    # TF-IDF features; the vectorizer is fit on all rows, so train and
    # validation share one vocabulary.
    tfidf = TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)
    train_text = tfidf.fit_transform(train_df['text'])
    clf = RidgeClassifier(random_state=666)
    clf.fit(train_text[:train_num], train_df['label'].values[:train_num])
    train_pred = clf.predict(train_text[:train_num])
    val_pred = clf.predict(train_text[train_num:])
    print('Tf-Idf+RidgeClassifier Train f1_score: {}'.format(
        f1_score(train_df['label'].values[:train_num], train_pred, average='macro')))
    print('Tf-Idf+RidgeClassifier Val f1_score: {}'.format(
        f1_score(train_df['label'].values[train_num:], val_pred, average='macro')))
    train_time = time.time()
    print('Train time: {:.2f}s'.format(train_time - start_time))
    # Predict on the test set and save the submission.
    test_df = pd.read_csv('../input/test_a.csv')
    test_text = tfidf.transform(test_df['text'])
    test_pred = clf.predict(test_text)
    test_pred = pd.DataFrame(test_pred, columns=['label'])
    test_pred.to_csv('../input/test_tfidf_ridgeclassifier.csv', index=False)
    print('Test predict saved.')
    end_time = time.time()
    print('Predict time:{:.2f}s'.format(end_time - train_time))
if __name__ == '__main__':
    # nrows = 200000
    # train_num = int(nrows * 0.7)
    # max_features = 3000
    # ngram_range = (1, 3)
    """
    Tf-Idf+RidgeClassifier Train f1_score: 0.903158570543211
    Tf-Idf+RidgeClassifier Val f1_score: 0.8941037520383751
    Train time: 743.38s
    Test predict saved.
    Predict time:105.46s
    """
    # nrows = 200000
    # train_num = int(nrows * 0.7)
    # max_features = 4000
    # ngram_range = (1, 3)
    """
    Tf-Idf+RidgeClassifier Train f1_score: 0.9123200043177631
    Tf-Idf+RidgeClassifier Val f1_score: 0.9017549150589862
    Train time: 800.63s
    Test predict saved.
    Predict time:110.57s
    """
    nrows = 200000
    train_num = int(nrows * 0.7)
    max_features = 4000
    ngram_range = (1, 2)
    """
    Tf-Idf+RidgeClassifier Train f1_score: 0.9138842752839392
    Tf-Idf+RidgeClassifier Val f1_score: 0.9029483134740949
    Train time: 476.29s
    Test predict saved.
    Predict time:68.93s
    """
    tfidf_ridgeclassifier(nrows, train_num, max_features, ngram_range)
Tf-Idf+RidgeClassifier Val f1_score: 0.903