在进行文本分类之前,需要对文本进行预处理。中文文本和英文文本预处理的方式有所差别。
(1)英文文本预处理
文本预处理过程大致分为以下几点:
1、英文缩写替换
预处理过程中需要把英文缩写进行替换,如it’s和it is是等价的,won’t和will not也是等价的,等等。
text = "The story loses its bite in a last-minute happy ending that's even less plausible than the rest of the picture ."
text.replace("that's", "that is")
‘The story loses its bite in a last-minute happy ending that is even less plausible than the rest of the picture .’
2、转换为小写字母
文本中英文有大小写之分,例如"Like"和"like",这两个字符串是不同的,但在统计单词时希望把它们认为是相同的单词。
text = "This is a film well worth seeing , talking and singing heads and all ."
text.lower()
‘this is a film well worth seeing , talking and singing heads and all .’
3、删除标点符号、数字及其它特殊字符
标点符号、数字及其它特殊字符对文本分类不起作用,把它们删除可以减少特征的维度。一般用正则化方法删除这些字符。
import re
text = "disney has always been hit-or-miss when bringing beloved kids' books to the screen . . . tuck everlasting is a little of both ."
text = re.sub("[^a-zA-Z]", " ", text)
# 删除多余的空格
' '.join(text.split())
‘disney has always been hit or miss when bringing beloved kids books to the screen tuck everlasting is a little of both’
4、分词
英文文本的分词和中文文本的分词方法不同,英文文本分词方法可以根据所提供的文本进行选择,如果文本中单词和标点符号或者其它字符是以空格隔开的,例如"a little of both .",那么可以直接使用split()方法;如果文本中单词和标点符号没有用空格隔开,例如"a little of both.",可以使用nltk库中的word_tokenize()方法。nltk库安装也比较简单,在windows下,用pip install nltk进行安装即可。
# 单词和标点符号用空格隔开
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness ."
text.split()
[‘part’, ‘of’, ‘the’, ‘charm’, ‘of’, ‘satin’, ‘rouge’, ‘is’, ‘that’, ‘it’, ‘avoids’, ‘the’, ‘obvious’, ‘with’, ‘humour’, ‘and’, ‘lightness’, ‘.’]
# 单词和标点符号没有用空格隔开
from nltk.tokenize import word_tokenize
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness."
word_tokenize(text)
[‘part’, ‘of’, ‘the’, ‘charm’, ‘of’, ‘satin’, ‘rouge’, ‘is’, ‘that’, ‘it’, ‘avoids’, ‘the’, ‘obvious’, ‘with’, ‘humour’, ‘and’, ‘lightness’, ‘.’]
5、拼写检查
由于英文文本中可能有拼写错误,因此一般需要进行拼写检查。如果确信分析的文本没有拼写错误,可以略去此步。拼写检查,一般用pyenchant类库完成。
from enchant.checker import SpellChecker
chkr = SpellChecker("en_US")
chkr.set_text("Many peope likee to watch in the Name of People.")
for err in chkr:
print("ERROR:", err.word)
ERROR: peope
ERROR: likee
6、词干提取和词形还原
词干提取(stemming)和词型还原(lemmatization)是英文文本预处理的特色。两者其实有共同点,即都是要找到词的原始形式。只不过词干提取(stemming)会更加激进一点,它在寻找词干的时候可以会得到不是词的词干。比如"imaging"的词干可能得到的是"imag", 并不是一个词。而词形还原则保守一些,它一般只对能够还原成一个正确的词的词进行处理。在nltk中,做词干提取的方法有PorterStemmer,LancasterStemmer和SnowballStemmer。推荐使用SnowballStemmer。这个类可以处理很多种语言,当然,除了中文。
from nltk.stem.porter import PorterStemmer
stem_porter = PorterStemmer()
stem_porter.stem('countries') # 输出countri
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("countries") # 输出countri
from nltk.stem.wordnet import WordNetLemmatizer
stem_wordnet = WordNetLemmatizer()
stem_wordnet.lemmatize('countries') # 输出country
7、删除停用词
停用词对文本分类没什么影响,比如’a’,‘is’,‘of’。nltk库中自带英文停用词。
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
text = "part of the charm of satin rouge is that it avoids the obvious with humour and lightness"
words = [w for w in text.split() if w not in stop_words]
' '.join(words)
‘part charm satin rouge avoids obvious humour lightness’
下面我使用传统的机器学习方法来实现简单的英文文本分类任务,使用情感分类数据集,是个二分类数据集,标签为positive或者negative,数据集可以从网上下载或者发给你。分类任务过程如下:首先进行文本预处理,然后提取TFIDF特征,最后利用传统的机器学习方法逻辑回归对文本进行分类。我假设文本中没有拼写错误,另外模型简单,分类效果一般,需要进一步优化。有什么不对或者遗漏之处,望各位不吝指教,谢谢。
import numpy as np
import os
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
import time
import warnings
warnings.filterwarnings('ignore')
path_neg = './polarity_data/rt-polarity.neg'
path_pos = './polarity_data/rt-polarity.pos'
# 文本预处理
def text_preprocessing(path):
# 读取数据
text = []
with open(path, 'r', encoding='utf-8', errors='ignore') as f:
for line in f.readlines():
text.append(line.strip())
# 英文缩写替换
text_abbreviation = []
for item in text:
item = item.lower().replace("it's", "it is").replace("i'm", "i am").replace("he's", "he is").replace("she's", "she is")\
.replace("we're", "we are").replace("they're", "they are").replace("you're", "you are").replace("that's", "that is")\
.replace("this's", "this is").replace("can't", "can not").replace("don't", "do not").replace("doesn't", "does not")\
.replace("we've", "we have").replace("i've", " i have").replace("isn't", "is not").replace("won't", "will not")\
.replace("hasn't", "has not").replace("wasn't", "was not").replace("weren't", "were not").replace("let's", "let us")
text_abbreviation.append(item)
# 删除标点符号、数字等其他字符
text_clear_str = []
for item in text_abbreviation:
item = re.sub("[^a-zA-Z]", " ", item)
text_clear_str.append(' '.join(item.split()))
text_clear_str_stem_del_stopwords = []
stem_porter = PorterStemmer() # 词形归一化
stop_words = stopwords.words("english") # 停用词
# 分词、词形归一化、删除停用词
for item in text_clear_str:
words_token = word_tokenize(item) # 分词
words = [stem_porter.stem(w) for w in words_token if w not in stop_words]
text_clear_str_stem_del_stopwords.append(' '.join(words))
return text_clear_str_stem_del_stopwords
start_time1 = time.clock()
text_neg = text_preprocessing(path_neg)
text_pos = text_preprocessing(path_pos)
end_time1 = time.clock()
print("the time of text preprocessing is %.2f s" % (end_time1 - start_time1))
text = text_neg + text_pos
# 特征提取
def features_extraction(text):
vectors = TfidfVectorizer()
features = vectors.fit_transform(text).todense()
return features
start_time2 = time.clock()
features = features_extraction(text)
end_time2 = time.clock()
print("the time of features extracting is %.2f s" % (end_time2 - start_time2))
print('.'*40)
m = len(text_neg)
n = len(text_pos)
# 标签
labels = np.vstack((np.zeros((m, 1)), np.ones((n, 1))))
# 数据集
data = np.hstack((features, labels))
# 数据集随机打乱
np.random.shuffle(data)
# 样本特征
features = data[:, :-1]
# 样本标签
labels = data[:, -1]
print(features.shape)
print(labels.shape)
# 训练集、测试集划分
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
# 逻辑回归
start_time3 = time.clock()
lr = LogisticRegression().fit(x_train, y_train)
end_time3 = time.clock()
# 训练时间
print("the time of LogisicRegression training is %.2f s" % (end_time3 - start_time3))
y_pred = lr.predict(x_test)
print("the f1_score of LogisicRegression is %.4f" % f1_score(y_test, y_pred))
(2)中文文本预处理
1、删除标点符号、数字及其它特殊字符
import re
text = '最后是12月12日《篮球先锋报》的新闻报道“湖”涂开营。到底是保罗还是霍华德,\
湖人转了一圈之后终于发现,三巨头不切实际,而霍华德也比保罗更符合湖人构建王朝的要求。\
于是,湖人放弃追逐保罗,又把奥多姆送去小牛,目的就是抢魔兽。这一次,他们能如愿吗?标签:$LOTOzf$'
text = re.sub("[0-9《》“”‘’。、,?!——$¥#@%……&*^()a-zA-Z<>;:/]", "", text)
print(text)
‘最后是月日篮球先锋报的新闻报道湖涂开营到底是保罗还是霍华德湖人转了一圈之后终于发现三巨头不切实际而霍华德也比保罗更符合湖人构建王朝的要求于是湖人放弃追逐保罗又把奥多姆送去小牛目的就是抢魔兽这一次他们能如愿吗标签’
2、jieba分词
可使用 jieba.cut 和 jieba.cut_for_search 方法进行分词,两者所返回的结构都是一个可迭代的 generator,可使用 for 循环来获得分词后得到的每一个词语(unicode),或者直接使用 jieba.lcut 以及 jieba.lcut_for_search 直接返回 list。其中:jieba.cut 和 jieba.lcut 接受 3 个参数:
sentence:需要分词的字符串(unicode 或 UTF-8 字符串、GBK 字符串)
cut_all 参数:是否使用全模式,默认值为 False
HMM 参数:用来控制是否使用 HMM 模型,默认值为 True
jieba.cut_for_search 和 jieba.lcut_for_search 接受 2 个参数:
sentence:需要分词的字符串(unicode 或 UTF-8 字符串、GBK 字符串)
HMM 参数:用来控制是否使用 HMM 模型,默认值为 True
尽量不要使用 GBK 字符串,可能无法预料地错误解码成 UTF-8
jieba 是目前最好的 Python 中文分词组件,它有以下三种分词模式:
2.1、精准模式
import jieba
sentence= '他来到北京大学参加暑期夏令营'
# 精准模式
print(list(jieba.cut(sentence, cut_all=False)))
[‘他’, ‘来到’, ‘北京大学’, ‘参加’, ‘暑期’, ‘夏令营’]
2.2、全模式
sentence= '他来到北京大学参加暑期夏令营'
# 全模式
print(list(jieba.cut(sentence, cut_all=True)))
[‘他’, ‘来到’, ‘北京’, ‘北京大学’, ‘大学’, ‘参加’, ‘暑期’, ‘夏令’, ‘夏令营’]
2.3、搜索引擎模式
sentence = '他毕业于北京大学机电系,后来在一机部上海电器科学研究所工作'
# 搜索引擎模式
print(list(jieba.cut_for_search(sentence)))
[‘他’, ‘毕业’, ‘于’, ‘北京’, ‘大学’, ‘北京大学’, ‘机电’, ‘系’, ‘,’, ‘后来’, ‘在’, ‘一机部’, ‘上海’, ‘电器’, ‘科学’, ‘研究’, ‘研究所’, ‘工作’]
3、删除停用词
首先需要在网上下载中文的停用词表或者自己制作停用词表,也可以下载好停用词表,然后加上一些其它的停用词。
import jieba
import re
# 停用词
stop_words = ['于', '后来', '在']
sentence = '他毕业于北京大学机电系后来在一机部上海电器科学研究所工作'
# 分词
sent_cut = list(jieba.cut(sentence))
# 删除停用词
text = [w for w in sent_cut if w not in stop_words]
print('分词:', sent_cut)
print('删除停用词:', text)
分词: [‘他’, ‘毕业’, ‘于’, ‘北京大学’, ‘机电’, ‘系’, ‘后来’, ‘在’, ‘一机部’, ‘上海’, ‘电器’, ‘科学’, ‘研究所’, ‘工作’]
删除停用词: [‘毕业’, ‘北京大学’, ‘机电’, ‘系’, ‘一机部’, ‘上海’, ‘电器’, ‘科学’, ‘研究所’, ‘工作’]
详细代码可以在我的github上查看https://github.com/LeungFe/Text-preprocessing