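This chapter covers: text classification with bag-of-words (BOW) feature extraction, n-grams, an enlarged feature dimension, and a naive Bayes classifier.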

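Empirical check:

After adding n-grams and enlarging the feature set, the average k-fold accuracy improved only marginally (by about 0.0004):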
kfold accuracy = 0.8662404092071612
->
kfold accuracy = 0.8666879795396419

We can make the features richer, for example by adding 2-gram and 3-gram count features and by enlarging the vocabulary.

1-gram: ['我', '爱', '自然语言', '处理']

2-gram: ['我爱', '爱自然语言', '自然语言处理']

3-gram: ['我爱自然语言', '爱自然语言处理']
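
The n-gram lists above can be reproduced with a sliding window over the token list. The following is a minimal sketch, not the code used in this chapter: the helper name `ngrams` is made up for illustration, and it joins tokens without a separator to match the example above (CountVectorizer joins them with a space instead).

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, each joined into one string."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["我", "爱", "自然语言", "处理"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so four tokens give three 2-grams and two 3-grams, and any n larger than k gives an empty list.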

Using CountVectorizer with n-grams

from sklearn.feature_extraction.text import CountVectorizer

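texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer(
    analyzer='word',
    max_features=4000,
    ngram_range=(1,4)  # use n-grams of size 1 through 4 (both bounds inclusive)
)  # create the bag-of-words vectorizer
cv_fit=cv.fit_transform(texts)
# Input is a list of strings, one string per document; each string is already
# segmented into space-separated words.

# Column names of the sparse matrix above, i.e. the feature terms.
# Note: scikit-learn 1.2 removed get_feature_names(); use get_feature_names_out() there.
print(cv.get_feature_names())
print(cv_fit.toarray())  # dense array view of the sparse count matrix
# For comparison, with ngram_range=(1,1) the two prints would show:
#['bird', 'cat', 'dog', 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents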

['bird', 'cat', 'cat cat', 'cat fish', 'dog', 'dog cat', 'dog cat cat', 'dog cat fish', 'fish', 'fish bird']
[[0 1 0 1 1 1 0 1 1 0]
 [0 2 1 0 1 1 1 0 0 0]
 [1 0 0 0 0 0 0 0 1 1]
 [1 0 0 0 0 0 0 0 0 0]]
[2 3 1 1 2 2 1 1 2 1]

# Vocabulary: maps each term to its column index (indices follow sorted term order)
cv.vocabulary_

{'dog': 4,
 'cat': 1,
 'fish': 8,
 'dog cat': 5,
 'cat fish': 3,
 'dog cat fish': 7,
 'cat cat': 2,
 'dog cat cat': 6,
 'bird': 0,
 'fish bird': 9}
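
To see where these columns come from, here is a pure-Python sketch that rebuilds the same vocabulary and count matrix for the toy corpus. It is only an approximation of what CountVectorizer does: the helper `extract_ngrams` is hypothetical, and the sketch assumes plain whitespace tokenization, whereas CountVectorizer's default tokenizer also lowercases text and drops single-character tokens.

```python
from collections import Counter

def extract_ngrams(text, min_n, max_n):
    """Split on whitespace and return every n-gram with min_n <= n <= max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
doc_counts = [Counter(extract_ngrams(t, 1, 4)) for t in texts]

# Column order: all distinct terms, sorted alphabetically.
vocabulary = {term: idx for idx, term in enumerate(sorted(set().union(*doc_counts)))}

# Dense document-term matrix: one row per document, one column per term.
matrix = [[counts.get(term, 0) for term in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Summing each column of `matrix` gives the per-term totals printed earlier.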

# Build the term -> total count mapping from the BOW vocabulary
word = cv.get_feature_names()
freq = cv_fit.toarray().sum(axis = 0)
print(word)
print(freq)
word_freqs = dict(zip(word,freq))
print(word_freqs)
# Sort the (term, count) pairs by count, descending
word_freqs = sorted(word_freqs.items(),key=lambda d:d[1],reverse=True)
print(word_freqs)

['bird', 'cat', 'cat cat', 'cat fish', 'dog', 'dog cat', 'dog cat cat', 'dog cat fish', 'fish', 'fish bird']
[2 3 1 1 2 2 1 1 2 1]
{'bird': 2, 'cat': 3, 'cat cat': 1, 'cat fish': 1, 'dog': 2, 'dog cat': 2, 'dog cat cat': 1, 'dog cat fish': 1, 'fish': 2, 'fish bird': 1}
[('cat', 3), ('bird', 2), ('dog', 2), ('dog cat', 2), ('fish', 2), ('cat cat', 1), ('cat fish', 1), ('dog cat cat', 1), ('dog cat fish', 1), ('fish bird', 1)]

# Reading the first output line, (0, 7) 1: document 0, the term at vocabulary index 7 ('dog cat fish'), count 1
print(cv_fit)

  (0, 7)	1
  (0, 3)	1
  (0, 5)	1
  (0, 8)	1
  (0, 1)	1
  (0, 4)	1
  (1, 6)	1
  (1, 2)	1
  (1, 5)	1
  (1, 1)	2
  (1, 4)	1
  (2, 9)	1
  (2, 0)	1
  (2, 8)	1
  (3, 0)	1
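
Each printed line is a (document index, column index) pair followed by a count, and only non-zero entries are stored. As a small sketch of how to read this format, the snippet below maps the entries of the first two documents back to their terms. The function name `describe` is hypothetical; the triplets and feature names are copied from the outputs above.

```python
# (row, column, count) triplets copied from the first two documents above.
triplets = [
    (0, 7, 1), (0, 3, 1), (0, 5, 1), (0, 8, 1), (0, 1, 1), (0, 4, 1),
    (1, 6, 1), (1, 2, 1), (1, 5, 1), (1, 1, 2), (1, 4, 1),
]
feature_names = ['bird', 'cat', 'cat cat', 'cat fish', 'dog', 'dog cat',
                 'dog cat cat', 'dog cat fish', 'fish', 'fish bird']

def describe(triplets, feature_names):
    """Group the non-zero entries by document as {term: count} dicts."""
    docs = {}
    for row, col, count in triplets:
        docs.setdefault(row, {})[feature_names[col]] = count
    return docs

decoded = describe(triplets, feature_names)
print(decoded[0])
print(decoded[1])
```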

Import libraries

from sklearn.model_selection import train_test_split

Load the data

sentences = []
with open("../data/news.csv", 'r', encoding='utf8') as f:
    for line in f:
        # Each line holds space-separated words followed by the label as the last field.
        splits = line.split(' ')
        feat = splits[:-1]
        label = splits[-1]
        sentences.append((" ".join(feat), label.strip()))

sentences[:2]

[('另一边 舞王 韩庚 跟随 欢乐 起舞 八十年代 迪斯科 舞步 轮番上阵 场面 精彩 歌之夜 敬请期待 浙江 卫视 2017 周五 00 畅意 100% 乳酸菌 饮品 独家 冠名 二十四 小时 第二季 水手 欢乐 出发',
  'entertainment'),
 ('三是 改变 割裂 状况 建立 一体化 防御 体系', 'technology')]
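
The loading step relies on one convention: the label is the last space-separated field of each line. Here is a sketch of that parsing step in isolation, assuming this line layout (inferred from the sample output above). The helper `parse_line` and the sample line are made up for illustration.

```python
def parse_line(line):
    """Split one line into (space-joined words, label); the label is the last field."""
    text, _, label = line.rstrip("\n").rpartition(" ")
    return text, label.strip()

sample = "三是 改变 割裂 状况 建立 一体化 防御 体系 technology\n"
parsed = parse_line(sample)
print(parsed)
```

Using `rpartition` splits only at the last space, so the text never has to be split and re-joined.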

Split into training and test sets

Pay close attention to the input data format.

X, y = zip(*sentences)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)
print('X_train len=',len(X_train))
print('X_test len=',len(X_test))
print(X_train[:2])
print(y_train[:2])

X_train len= 82110
X_test len= 27370
['依托 腾讯 强大 技术 平台 视频 直播 经验 海外 用户 需求 技术 团队 联合 第三方 厂商 海外 用户 体验 优化 地区 打开 直播 缓冲 时间 两秒 以内 国内 用户 相差无几 成功 保障 全球 2145 在线 人群 观看 4K VR 5.1 声道 环绕声 传送 画面 稳定 流畅', '光圈 倒闭 直播 面临 生死战']
['technology', 'entertainment']

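Bag-of-words feature extraction

X_train and y_train have the following formats, respectively:

['音乐大师播出', '设计公司承担四届奥运场馆设计']
['entertainment', 'sports']

Note: the input is a list of strings, one string per document, and each string must already be segmented into words.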

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    analyzer='word',
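    max_features=20000,  # keep at most the 20000 most frequent terms in the vocabulary
    min_df=100,  # ignore terms that appear in fewer than 100 documents
    ngram_range=(1,4)  # use n-grams of size 1 through 4 (both bounds inclusive)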
)
vec.fit_transform(X_train)

def get_features(x):
    return vec.transform(x)
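
The two pruning parameters act on different statistics: `min_df` filters by document frequency (how many documents contain a term), while `max_features` keeps the terms with the highest total count. The following pure-Python sketch illustrates that logic on a toy corpus. The function `select_vocabulary` is hypothetical, and its alphabetical tie-breaking is an assumption that may differ from scikit-learn's.

```python
from collections import Counter

def select_vocabulary(docs, min_df=1, max_features=None):
    """Pick terms by document frequency, then keep the most frequent ones.

    docs is a list of token lists. A term survives if it occurs in at least
    min_df documents; if max_features is set, only the max_features terms
    with the highest total count are kept.
    """
    doc_freq = Counter()    # number of documents containing each term
    term_freq = Counter()   # total occurrences of each term
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Sort by descending total count; ties are broken alphabetically.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = [["dog", "cat", "fish"], ["dog", "cat", "cat"], ["fish", "bird"], ["bird", "emu"]]
print(select_vocabulary(docs, min_df=2))
print(select_vocabulary(docs, min_df=1, max_features=1))
```

With `min_df=2` the term "emu" is dropped because it occurs in a single document; with `max_features=1` only "cat", the term with the highest total count, remains.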

Let's look at what information bag-of-words extraction gives us.

print(vec.get_feature_names()[:20])  # column names of the sparse matrix, i.e. the feature terms (the vocabulary)
print('X_train=', X_train[:2])
print('y_train=', y_train[:2])
words_vec = vec.transform(X_train)  # sparse matrix, [n_samples, n_features]
print(words_vec[:10].toarray())  # dense array view of the first 10 rows of the count matrix

['00', '10', '100', '1000', '11', '12', '120', '13', '14', '15', '150', '1500', '16', '17', '19', '1997', '20', '200', '2000', '2002']
X_train= ['依托 腾讯 强大 技术 平台 视频 直播 经验 海外 用户 需求 技术 团队 联合 第三方 厂商 海外 用户 体验 优化 地区 打开 直播 缓冲 时间 两秒 以内 国内 用户 相差无几 成功 保障 全球 2145 在线 人群 观看 4K VR 5.1 声道 环绕声 传送 画面 稳定 流畅', '光圈 倒闭 直播 面临 生死战']
y_train= ['technology', 'entertainment']
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

# Term -> count statistics
word = vec.get_feature_names()  # terms in the vocabulary
print(word[-10:-1])

# Total count of each term over the training set
freq = words_vec.toarray().sum(axis = 0)
print(freq[:10])

# <word,count>
word_freqs = dict(zip(word,freq))
# Sort the (term, count) pairs by count, descending
word_freqs = sorted(word_freqs.items(),key=lambda d:d[1],reverse=True)
print('word_freqs size = ',len(word_freqs))
print(word_freqs[:10])

['黄磊', '黄金', '黎明', '黑客', '黑色', '黑马', '黑龙江', '默契', '鼓励']
[443 315 132 405 192 193 210 174 148 216]
word_freqs size =  3884
[('中国', 18494), ('比赛', 7962), ('电影', 7883), ('发展', 7626), ('用户', 6486), ('技术', 6161), ('市场', 6135), ('汽车', 6072), ('平台', 5891), ('北京', 5478)]

# Vocabulary, sorted by column index in descending order
vocab_dict = dict(vec.vocabulary_)
vocab_dict_results = sorted(vocab_dict.items(), key=lambda d: d[1],reverse=True) 
print(vocab_dict_results[:5])

[('龙舟', 3883), ('鼓励', 3882), ('默契', 3881), ('黑龙江', 3880), ('黑马', 3879)]

# Save the vocabulary to a file (explicit UTF-8, since the terms are Chinese)
with open('../data/bow_vocab_ngram_20000.txt','w',encoding='utf8') as f:
    for vocab in vocab_dict_results:
        text = "{}|{}".format(vocab[0],vocab[1])
        f.write(text+"\n")

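Model training

We build a Chinese text classifier with naive Bayes. Given enough data that covers the categories well, naive Bayes usually reaches quite good accuracy on this task.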

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(vec.transform(X_train),y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
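
To make the classifier less of a black box, here is a minimal from-scratch sketch of multinomial naive Bayes with add-alpha (Laplace) smoothing, matching the `alpha=1.0` default shown above. It is an illustration only, not scikit-learn's implementation: the class `TinyMultinomialNB` is hypothetical, it works on token lists rather than sparse matrices, and the four training documents are invented toy data.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal multinomial naive Bayes over token lists, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.token_counts = defaultdict(Counter)     # token counts per class
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log P(class) + sum of log P(token | class); unseen tokens are skipped
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:
                    score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy data, for illustration only.
train_docs = [["比赛", "联赛", "主力"], ["电影", "导演", "票房"], ["联赛", "冠军"], ["电影", "主演"]]
train_labels = ["sports", "entertainment", "sports", "entertainment"]
nb = TinyMultinomialNB().fit(train_docs, train_labels)
print(nb.predict(["联赛", "主力"]))
print(nb.predict(["电影", "票房"]))
```

Working in log space avoids numeric underflow when a document has many tokens, and the smoothing term keeps a token that never appeared in a class from driving that class's probability to zero.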

Test-set accuracy

accuracy = clf.score(vec.transform(X_test), y_test)
print('accuracy = ',accuracy)

accuracy =  0.8760321519912313

print('X_train size:',len(X_train))
print('X_test size:',len(X_test))

X_train size: 82110
X_test size: 27370

Cross-validation accuracy

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np
import warnings
warnings.filterwarnings('ignore')
def stratifiedkfold_cv(x, y, clf_class, shuffle=True, n_folds=5, **kwargs):
    # sklearn.cross_validation was removed in scikit-learn 0.20; the
    # model_selection API takes n_splits and yields indices from split().
    stratifiedk_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    # Copy the labels: y[:] on a NumPy array is a view, so writing predictions
    # into it would overwrite the true labels used to train later folds.
    y_pred = y.copy()
    for train_index, test_index in stratifiedk_fold.split(x, y):
        X_train, X_test = x[train_index], x[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)  # fresh classifier for every fold
        clf.fit(X_train,y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
NB = MultinomialNB
y_pred = stratifiedkfold_cv(vec.transform(X),np.array(y),NB)
accuracy = accuracy_score(y, y_pred)
print('kfold accuracy = ',accuracy)

kfold accuracy =  0.8666879795396419
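
Stratified k-fold keeps the class proportions roughly equal in every fold, and each sample is predicted exactly once by a model that never saw it during training. The sketch below shows both ideas in plain Python. The names `stratified_folds` and `cross_val_predict` are hypothetical here, and unlike the scikit-learn call above the sketch does not shuffle, which is a deliberate simplification.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=5):
    """Assign each sample index to a fold so every fold has a similar class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

def cross_val_predict(fit_predict, samples, labels, n_folds=5):
    """Out-of-fold prediction: each sample is predicted by a model that never saw it."""
    predictions = list(labels)   # a real copy; the true labels stay untouched
    for test_idx in stratified_folds(labels, n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        preds = fit_predict([samples[i] for i in train_idx],
                            [labels[i] for i in train_idx],
                            [samples[i] for i in test_idx])
        for i, p in zip(test_idx, preds):
            predictions[i] = p
    return predictions

labels = ["a"] * 10 + ["b"] * 5
folds = stratified_folds(labels, n_folds=5)
print([sorted(labels[i] for i in fold) for fold in folds])
```

Here `fit_predict` is any callable that takes training samples, training labels, and test samples and returns one predicted label per test sample.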

Saving the model

import pickle
with open('../model/tf_model.pkl','wb') as f:
    pickle.dump(clf,f)
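
Only the classifier is pickled here, while prediction also needs the fitted vectorizer behind `get_features`. If prediction runs in a separate process, one option is to store both objects in a single file. This is a sketch under that assumption: the helpers `save_bundle` and `load_bundle` are hypothetical, and plain dicts stand in for the fitted `vec` and `clf` so that the snippet runs on its own.

```python
import os
import pickle
import tempfile

def save_bundle(path, vectorizer, model):
    """Pickle the vectorizer and the classifier together in one file."""
    with open(path, "wb") as f:
        pickle.dump({"vectorizer": vectorizer, "model": model}, f)

def load_bundle(path):
    """Return (vectorizer, model) from a file written by save_bundle."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["vectorizer"], bundle["model"]

# Stand-ins for the fitted objects; in the walkthrough these would be `vec` and `clf`.
fake_vectorizer = {"dog": 0, "cat": 1}
fake_model = {"classes": ["car", "sports"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bundle.pkl")
    save_bundle(path, fake_vectorizer, fake_model)
    restored_vectorizer, restored_model = load_bundle(path)

print(restored_vectorizer, restored_model)
```

Keeping the two objects together guards against loading a classifier with a vectorizer whose vocabulary does not match the one it was trained on.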

Online prediction

import warnings
warnings.filterwarnings('ignore')
# Load the stop words
with open('../data/stopwords.txt', encoding='utf8') as f:
    stopwords = [stopword.strip() for stopword in f.readlines()]
print(stopwords[:10])

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']

# Load the model
import pickle
tf_model = '../model/tf_model.pkl'
with open(tf_model,'rb') as f:
    model = pickle.load(f)

Prediction example 1: car category

Source: Toutiao, https://www.toutiao.com/a6714271125473346055/

import jieba
text = "奥迪A3、宝马1系和奔驰A级一直纠缠不休的三个冤家"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['奥迪', 'A3', '宝马', '奔驰', '纠缠', '不休', '三个', '冤家']
car

Prediction example 2: military category

Source: Toutiao news, https://www.toutiao.com/a6714188329937535496/

import jieba
text = "谁说文物只能躺在博物馆，想买一架梦想中的战斗机开着兜风吗？"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['文物', '只能', '博物馆', '一架', '梦想', '战斗机', '开着', '兜风']
military

Prediction example 3: entertainment category

The headline is copied from Toutiao (https://www.toutiao.com/a6689675139333751299/) and used as the input.

import jieba
text = "陈晓旭：从完美林黛玉到身家过亿后剃度出家，她戏里戏外都是传奇"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['陈晓旭', '完美', '林黛玉', '身家', '亿后', '剃度', '出家', '戏里', '戏外', '传奇']
entertainment

Prediction example 4: sports category

Source: Toutiao, https://www.toutiao.com/a6714266792253981192/

import jieba
text = "男女有别！国乒主力参加马来西亚T2联赛 男队站着吃自助女队吃桌餐"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['男女有别', '国乒', '主力', '参加', '马来西亚', 'T2', '联赛', '男队', '自助', '女队', '桌餐']
sports

Prediction example 5: technology category

import jieba
text = "摩托罗拉One Macro将是最新一款Android One智能手机"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
print(model.predict(feat)[0])

words =  ['摩托罗拉', 'One', 'Macro', '最新', '一款', 'Android', 'One', '智能手机']
technology
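
All five examples repeat the same steps: segment the text, drop short tokens and stop words, join with spaces, vectorize, and predict. These steps could be wrapped in one helper, sketched below. The function `predict_category` and the stub `KeywordModel` are hypothetical; the stubs stand in for jieba and the trained model so that the snippet runs without either.

```python
def predict_category(text, tokenize, stopwords, featurize, model):
    """Tokenize, drop short tokens and stop words, vectorize, and classify one text."""
    words = [w for w in tokenize(text) if len(w) >= 2 and w not in stopwords]
    features = featurize([" ".join(words)])
    return words, model.predict(features)[0]

# Stand-in so the sketch runs without jieba or a trained classifier.
class KeywordModel:
    def predict(self, rows):
        return ["technology" if "Android" in row else "other" for row in rows]

words, label = predict_category(
    "Motorola One Macro is an Android One phone",
    tokenize=str.split,
    stopwords={"is", "an"},
    featurize=lambda rows: rows,   # identity: the stub model reads raw strings
    model=KeywordModel(),
)
print(words, label)
```

With the objects from this chapter the call would look like `predict_category(text, jieba.lcut, stopwords, get_features, model)`. Converting `stopwords` to a set first makes each membership test constant-time instead of a scan over the list.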
