A first look at short-text classification: jieba + bag-of-words + TF-IDF + SVM

Short-text classification starts with preprocessing the text: word segmentation, stop-word removal, and text vectorization.

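1. Word segmentation: jieba is used here because it is simple to use. It offers three segmentation modes:

Precise mode: splits the sentence as accurately as possible; suited to text analysis.
Full mode: scans out every word the sentence could contain; fast, but cannot resolve ambiguity.
Search-engine mode: building on precise mode, splits long words again to improve recall.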

import jieba

text = "我来到北京清华大学"

# Full mode
seg_list = jieba.cut(text, cut_all=True)
print("[Full mode]:", "/ ".join(seg_list))

# Precise mode
seg_list = jieba.cut(text, cut_all=False)
print("[Precise mode]:", "/ ".join(seg_list))

# Precise mode is the default
seg_list = jieba.cut(text)
print("[Default mode]:", "/ ".join(seg_list))

# New-word recognition: "杭研" is not in the dictionary,
# but the Viterbi algorithm still recognizes it
seg_list = jieba.cut("他来到了网易杭研大厦")
print("[New-word recognition]:", "/ ".join(seg_list))

# Search-engine mode
seg_list = jieba.cut_for_search(text)
print("[Search-engine mode]:", "/ ".join(seg_list))

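2. Stop-word removal: this simply means loading a stop-word list you have prepared in advance and checking whether each token in the data appears in it; tokens that do are dropped.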
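A minimal sketch of that filtering step. The stop-word set and the token list below are made up for illustration (the tokens are hard-coded so the snippet runs without jieba), and `remove_stopwords` is a hypothetical helper name, not part of jieba.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Hypothetical stop-word list. In practice it would usually be loaded from a
# file with one word per line, for example:
#   with open("stopwords.txt", encoding="utf-8") as f:
#       stopwords = {line.strip() for line in f if line.strip()}
stopwords = {"的", "了", "在", "是"}

# Tokens of the kind a segmenter might return (hard-coded assumption).
tokens = ["我", "来到", "了", "北京", "的", "清华大学"]

filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Keeping the stop words in a `set` makes each membership check constant time, which matters once the list grows to a few thousand entries.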

3. Bag-of-words model + TF-IDF algorithm https://blog.csdn.net/ACM_hades/article/details/93085783

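(1) Bag-of-words model: a way of representing text data when modeling text with machine learning algorithms. Machine learning algorithms cannot work on raw text directly; the text must first be converted to numbers, specifically to vectors of numbers.

(2) The bag-of-words model turns a passage or a document into a vector. It ignores the order of the words in a sentence and considers only how many times each word in the vocabulary occurs in it. Specifically, it converts every passage or document into a vector as long as the vocabulary, where each element stores the number of occurrences of the word at that position.

(3) Since a document usually contains only a small fraction of the vocabulary, most words never occur in it and are marked 0, so most elements of the vector are 0.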

from sklearn.feature_extraction.text import CountVectorizer

# A small corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary (word: index):", vectorizer.vocabulary_)
print("Sentence vectors:")
print(X.toarray())  # each element is the number of times the word occurs

Output:

Vocabulary (word: index): {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
Sentence vectors:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
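To see where those numbers come from, here is a from-scratch sketch of the same counting using only the standard library. It assumes CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, vocabulary indexed in sorted order); the helper name `tokenize` is made up for this illustration.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is every distinct token, indexed in alphabetical order.
vocabulary = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocabulary)}

# One count vector per document, as long as the vocabulary.
vectors = []
for doc in tokenized:
    counts = Counter(doc)  # missing words count as 0
    vectors.append([counts[tok] for tok in vocabulary])

print(index)
for row in vectors:
    print(row)
```

One thing worth noting for Chinese text: with this default token pattern, single-character tokens are dropped, so segmented Chinese input may need a different `token_pattern`.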

The ngram_range parameter of CountVectorizer can be set to build different n-gram models; the default is ngram_range=(1, 1). For example:

Unigrams: "the", "weather", "is", "sweet".

Bigrams: "the weather", "weather is", "is sweet".
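A short sketch of ngram_range in use. The input sentence "the weather is sweet" is inferred from the n-grams listed above, and the variable names are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["the weather is sweet"]

# (1, 1) keeps single words only, (2, 2) keeps word pairs only,
# and (1, 2) keeps both.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(sentence)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(sentence)
both = CountVectorizer(ngram_range=(1, 2)).fit(sentence)

print(sorted(unigrams.vocabulary_))
print(sorted(bigrams.vocabulary_))
print(len(both.vocabulary_))
```

Widening the range brings back some word-order information, at the cost of a larger and sparser vocabulary.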

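TF-IDF algorithm: term frequency-inverse document frequency (TF-IDF) is a weighting technique commonly used in information retrieval and data mining, often to extract the keywords of an article. The algorithm is simple and efficient, and industry often uses it for the initial cleaning of text data.

TF-IDF has two parts. Term frequency (TF) is how often each word occurs in a given document. Inverse document frequency (IDF) measures how important a word is in general; it is inversely related to how common the word is, and is computed by dividing the total number of documents in the corpus by the number of documents containing the word, then taking the logarithm of the quotient.

TF-IDF = TF * IDF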
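Written out as formulas, the definition above might look like the following. The symbols are my own notation, and normalizing TF by the document length is one common convention, assumed here rather than stated in the text: $n_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus $D$, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{df}(t) = \left|\{\, d \in D : t \in d \,\}\right|

\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
```

A word that occurs in every document has $\mathrm{df}(t) = N$, so its IDF is $\log 1 = 0$ and its TF-IDF score is 0 no matter how often it occurs.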

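4. Combining bag-of-words with TF-IDF

The BOW model has several drawbacks:

1. It ignores the order of the words.

2. It cannot reflect the keywords of a sentence.

Take the sentence "John likes to play football, Mary likes too":

If its vocabulary is: ['football', 'john', 'likes', 'mary', 'play', 'to', 'too']

then the word vector is: [1 1 2 1 1 1 1]

Picking this sentence's keyword with the BOW model gives "likes", because its value is the largest, but the keyword should clearly be "football".

TF-IDF addresses the problem in this example: replace each word's frequency with its TF-IDF score, so that words common across the whole corpus are down-weighted.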
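The effect can be sketched with the standard library alone. The three-document corpus below is made up around the example sentence, and the score uses the plain definition (TF as count divided by document length, IDF as the log of total documents over documents containing the word), not any particular library's variant.

```python
import math
from collections import Counter

# Made-up corpus; the first document is the example sentence.
docs = [
    "john likes to play football mary likes too",
    "john likes to play chess mary likes chess too",
    "mary likes to read and john likes to cook",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(tok for doc in tokenized for tok in set(doc))


def tf_idf(doc):
    """Plain TF-IDF scores for one tokenized document."""
    counts = Counter(doc)
    total = len(doc)
    return {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}


counts = Counter(tokenized[0])
scores = tf_idf(tokenized[0])

top_by_count = counts.most_common(1)[0][0]
top_by_tfidf = max(scores, key=scores.get)
print(top_by_count, top_by_tfidf)
```

In this toy corpus "likes" appears in every document, so its IDF is 0 and it drops out, while "football" appears in only one document and gets the highest score. With a different corpus the ranking could of course come out differently.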

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()

# Print floats with 2 decimal places
np.set_printoptions(precision=2)

X = vectorizer.fit_transform(corpus)

# Vocabulary
print(vectorizer.vocabulary_)
print(X.toarray())

Output:

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0.   0.47 0.58 0.38 0.   0.   0.38 0.   0.38]
 [0.   0.69 0.   0.28 0.   0.54 0.28 0.   0.28]
 [0.51 0.   0.   0.27 0.51 0.   0.27 0.51 0.27]
 [0.   0.47 0.58 0.38 0.   0.   0.38 0.   0.38]]

Original article: https://blog.csdn.net/ACM_hades/article/details/93085783
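The numbers printed by TfidfVectorizer do not match the plain log(N / df) formula, because its default settings use a smoothed IDF, ln((1 + N) / (1 + df)) + 1, with the raw count as TF, and then scale each row to unit L2 length. The sketch below redoes the first row by hand under those defaults; the counts and document frequencies are read off the four-sentence corpus, and the variable names are my own.

```python
import math

n_docs = 4

# term -> (count in the first sentence, number of sentences containing it)
first_doc = {
    "document": (1, 3),
    "first": (1, 2),
    "is": (1, 4),
    "the": (1, 4),
    "this": (1, 4),
}


def smooth_idf(df):
    # Smoothed IDF: acts as if one extra document contained every term.
    return math.log((1 + n_docs) / (1 + df)) + 1


# Raw weight = count * smoothed IDF.
raw = {t: tf * smooth_idf(df) for t, (tf, df) in first_doc.items()}

# Scale the row to unit Euclidean length.
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: round(v / norm, 2) for t, v in raw.items()}
print(weights)
```

This gives 0.47 for "document", 0.58 for "first" and 0.38 for "is", "the" and "this", matching the first row of the matrix. Thanks to the "+ 1", words that occur in every document keep a small non-zero weight instead of dropping to 0.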

5. SVM model
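As a rough sketch of how the pieces could fit together, the snippet below feeds TF-IDF vectors into a linear SVM. Everything specific in it is an assumption: the toy texts and labels are made up, the texts are taken to be already segmented (for example by jieba) with stop words removed and tokens joined by spaces, and LinearSVC is just one of several SVM classes in scikit-learn that could be used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data: pre-segmented short texts, tokens joined by spaces.
train_texts = [
    "球队 赢得 比赛 冠军",
    "球员 比赛 进球 精彩",
    "教练 球队 训练 比赛",
    "手机 发布 新款 芯片",
    "电脑 芯片 性能 提升",
    "手机 系统 更新 发布",
]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

# token_pattern is widened so that single-character Chinese words are kept;
# the default pattern only keeps tokens of two or more characters.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

# Classify two unseen (also pre-segmented) short texts.
predictions = model.predict(["球队 比赛 进球", "新款 手机 芯片"])
print(list(predictions))
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary and IDF weights learned from the training texts are reused unchanged at prediction time. A real dataset would also need a held-out test set to judge accuracy.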
