Text Recognition (Natural Language Processing, NLP)

Speech recognition

Speech -----------------------> Text ---------------------> Semantics

NLTK - Natural Language Toolkit

Tokenization

import nltk.tokenize as tk
tk.sent_tokenize(text) -> list of sentences
tk.word_tokenize(text) -> list of words
tokenizer = tk.WordPunctTokenizer() -> slightly different (splits "'s" into "'" and "s")
tokenizer.tokenize(text) -> list of words
Code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
# The sentence/word tokenizers need the 'punkt' models:
# import nltk; nltk.download('punkt')
doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
# Split the document into sentences
tokens = tk.sent_tokenize(doc, language='english')
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
# Split the document into words ("Let's" -> "Let", "'s")
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
# WordPunctTokenizer splits on punctuation ("Let's" -> "Let", "'", "s")
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)

Stemming

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
Note: the extracted stem is not necessarily a word; it may be only a fragment of one.
pt.PorterStemmer() -> Porter stemmer, relatively lenient
lc.LancasterStemmer() -> Lancaster stemmer, relatively strict
sb.SnowballStemmer(language) -> Snowball stemmer, somewhere in between
stemmer.stem(word) -> stem
Code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))

Lemmatization

Nouns: plural -> singular
Verbs: participle -> base form

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem as ns
# The WordNet lemmatizer needs the WordNet corpus:
# import nltk; nltk.download('wordnet')
words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    # Lemmatize each word first as a noun, then as a verb
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))

Bag of Words

Similar words tend to appear in sentences with similar meanings. Following the idea that similar inputs map to similar outputs, count how many times each word of the vocabulary occurs in each sample; from the statistical patterns in those counts, similar sentences can be found, which is how a chatbot, for example, can choose its reply.

The brown dog is running. The black dog is in the black room. Running in the room is forbidden.

1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden

     the  brown  dog  is  running  black  in  room  forbidden
1      1      1    1   1        1      0   0     0          0
2      2      0    1   1        0      2   1     1          0
3      1      0    0   1        1      0   1     1          1

Code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
# Split the document into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
# Count vectorizer: one row per sentence, one column per vocabulary word
cv = ft.CountVectorizer()
# fit_transform returns a sparse matrix; toarray() makes it dense
bow = cv.fit_transform(sentences).toarray()
print(bow)
# The vocabulary behind the columns
# (on scikit-learn >= 1.2 use cv.get_feature_names_out() instead)
words = cv.get_feature_names()
print(words)

Term Frequency (TF)

Term frequency is the bag-of-words matrix normalized row by row: dividing the word counts gathered in the bag of words by the total count in each sample gives the frequency with which each word occurs.

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
# Feature extraction: word counts per sentence (bag of words)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
# (on scikit-learn >= 1.2 use cv.get_feature_names_out() instead)
words = cv.get_feature_names()
print(words)
# Term frequency: L1-normalize each row so counts become frequencies
tf = sp.normalize(bow, norm='l1')
print(tf)

Document Frequency (DF)

For each word in the vocabulary, the document frequency is the number of samples containing that word divided by the total number of samples. The rarer the word, the lower its document frequency, and the more that rarity contributes to characterizing the documents it appears in.
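
For example, DF can be computed directly on the three example sentences. A minimal sketch; the whitespace split is only for illustration:

# Document frequency: the fraction of samples that contain each word
sentences = ['The brown dog is running',
             'The black dog is in the black room',
             'Running in the room is forbidden']
samples = [set(s.lower().split()) for s in sentences]
vocab = sorted(set().union(*samples))
for word in vocab:
    df = sum(word in sample for sample in samples) / len(samples)
    print('%10s %.2f' % (word, df))
# 'the' appears in 3/3 sentences -> DF = 1.00 (common)
# 'brown' appears in 1/3 sentences -> DF = 0.33 (rare)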

Inverse Document Frequency (IDF)

The higher the inverse document frequency, the lower the document frequency and the rarer the word, so the more that word contributes to identifiability.
Higher term frequency ----------------------------------------------> greater contribution to semantic expressiveness
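
For reference, scikit-learn's TfidfTransformer (used below) applies a smoothed IDF by default. A minimal sketch of that formula on the example's bag-of-words matrix (columns in CountVectorizer's alphabetical order: black, brown, dog, forbidden, in, is, room, running, the):

import numpy as np

# Bag-of-words counts for the three example sentences, one row per sentence
bow = np.array([[0, 1, 1, 0, 0, 1, 0, 1, 1],
                [2, 0, 1, 0, 1, 1, 1, 0, 2],
                [0, 0, 0, 1, 1, 1, 1, 1, 1]])
n_samples = bow.shape[0]
# Number of sentences containing each word (document frequency as a count)
df = np.count_nonzero(bow, axis=0)
# Smoothed IDF as used by TfidfTransformer(smooth_idf=True), its default:
# rare words get a larger IDF than common ones
idf = np.log((1 + n_samples) / (1 + df)) + 1
print(idf)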

Term Frequency-Inverse Document Frequency (TF-IDF)

Term frequency multiplied by inverse document frequency gives a combined measure of how much a word contributes to a sample's semantic expressiveness and identifiability.

Each element of the term-frequency matrix is multiplied by the corresponding word's inverse document frequency; the larger the value, the greater that word's contribution to the sample's meaning. A learning model can then be built from these per-word contributions.
Code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
# Feature extraction: count how many times each word occurs in each sentence
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
# The vocabulary behind the columns
# (on scikit-learn >= 1.2 use cv.get_feature_names_out() instead)
words = cv.get_feature_names()
print(words)
tt = ft.TfidfTransformer()
# Convert the count matrix to TF-IDF values
tfidf = tt.fit_transform(bow).toarray()
print(tfidf)
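
The result above can also be reproduced by hand, which makes "term frequency times inverse document frequency" concrete. A minimal sketch, assuming TfidfTransformer's defaults (smooth_idf=True, norm='l2', sublinear_tf=False); it should reproduce the transformer's output:

import numpy as np
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp

sentences = ['The brown dog is running.',
             'The black dog is in the black room.',
             'Running in the room is forbidden.']
bow = ft.CountVectorizer().fit_transform(sentences).toarray()
# Smoothed IDF (TfidfTransformer's default smooth_idf=True)
n_samples = bow.shape[0]
df = np.count_nonzero(bow, axis=0)
idf = np.log((1 + n_samples) / (1 + df)) + 1
# Counts times IDF, then L2-normalize each row (norm='l2' default)
tfidf_manual = sp.normalize(bow * idf, norm='l2')
# Compare with TfidfTransformer's own output
tfidf = ft.TfidfTransformer().fit_transform(bow).toarray()
print(np.allclose(tfidf_manual, tfidf))   # expected: True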

Sentiment Analysis Based on Multinomial Naive Bayes

Multinomial naive Bayes classifier
Through supervised learning, key words are associated with sentiment labels; the words of an unseen sentence are then matched against them to judge whether its sentiment is positive or negative.
Sentiment analysis
NLTK's classifiers take each sample as a {feature: value} dictionary rather than a matrix row, so a feature matrix is converted row by row (a sketch of this conversion follows the example):
A B C
1 2 3 -> {'A': 1, 'C': 3, 'B': 2}
4 5 6 -> {'C': 6, 'A': 4, 'B': 5}
7 8 9 …
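
A minimal sketch of that conversion (the names A, B, C and the matrix are just the illustrative values above):

# Convert each row of a feature matrix into an NLTK-style feature dict
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
names = ['A', 'B', 'C']
samples = [dict(zip(names, row)) for row in matrix]
print(samples)
# [{'A': 1, 'B': 2, 'C': 3}, {'A': 4, 'B': 5, 'C': 6}, {'A': 7, 'B': 8, 'C': 9}]
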
Code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu
# The movie reviews corpus must be available:
# import nltk; nltk.download('movie_reviews')
# List of (word dict, label) pairs for the positive reviews
pdata = []
# Built-in data: positive movie reviews
fileids = nc.movie_reviews.fileids('pos')
for fileid in fileids:
    # nc.movie_reviews.words() returns the tokenized words of one review
    words = nc.movie_reviews.words(fileid)
    # Build a dict marking every word that occurs in the review
    sample = {}
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
# List of (word dict, label) pairs for the negative reviews
ndata = []
# Built-in data: negative movie reviews
fileids = nc.movie_reviews.fileids('neg')
for fileid in fileids:
    words = nc.movie_reviews.words(fileid)
    sample = {}
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
# Split into training and test sets (no cross-validation here)
pnumb, nnumb = int(0.8 * len(pdata)), int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# Train NLTK's naive Bayes classification model
model = cf.NaiveBayesClassifier.train(train_data)
# Evaluate the model's accuracy on the test set
ac = cu.accuracy(model, test_data)
print('%.2f%%' % round(ac * 100, 2))
# Most informative features
tops = model.most_informative_features()
for top in tops[:5]:
    print(top[0])
reviews = [
    'It is an amazing movie.',
    'This is a dull movie. I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.']
sents, probs = [], []
# Build the word dict for each review; here the text is simply split
# on whitespace instead of using a tokenizer
for review in reviews:
    words = review.split()
    sample = {}
    for word in words:
        sample[word] = True
    # Probability distribution over the labels for this sample
    pcls = model.prob_classify(sample)
    # Most likely label
    sent = pcls.max()
    # Probability (confidence) of that label
    prob = pcls.prob(sent)
    sents.append(sent)
    probs.append(prob)
for review, sent, prob in zip(
        reviews, sents, probs):
    print(review, '->', sent, '%.2f%%' % round(
        prob * 100, 2))

Topic Extraction

Code: topic.py
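
topic.py itself is not included here. As a stand-in, a minimal sketch of topic extraction with scikit-learn's LatentDirichletAllocation on the earlier example sentences (the sentence list and the choice of 2 topics are illustrative assumptions):

# -*- coding: utf-8 -*-
import sklearn.feature_extraction.text as ft
import sklearn.decomposition as dc

sentences = ['The brown dog is running.',
             'The black dog is in the black room.',
             'Running in the room is forbidden.']
# Bag-of-words counts, as in the earlier examples
cv = ft.CountVectorizer(stop_words='english')
bow = cv.fit_transform(sentences)
words = cv.get_feature_names_out()
# Fit an LDA model with 2 topics (an arbitrary choice for this sketch)
lda = dc.LatentDirichletAllocation(n_components=2, random_state=7)
lda.fit(bow)
# Print the top words of each topic
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]
    print('topic %d:' % i, [words[j] for j in top])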

Text classification: a statistics-based classifier is usually chosen for training, because natural language exhibits clear statistical regularities.

Code: doc.py
1 2 3 4 5 6
2 3 0 0 1 4
0 4 1 1 2 2
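
doc.py is not included either. A minimal sketch of statistics-based text classification, assuming scikit-learn's 20 newsgroups data (downloaded on first use), TF-IDF features and a multinomial naive Bayes classifier; the category list is an arbitrary choice:

# -*- coding: utf-8 -*-
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb

categories = ['rec.sport.baseball', 'sci.space']
train = sd.fetch_20newsgroups(subset='train', categories=categories,
                              shuffle=True, random_state=7)
test = sd.fetch_20newsgroups(subset='test', categories=categories,
                             shuffle=True, random_state=7)
# TF-IDF features on top of bag-of-words counts
cv = ft.CountVectorizer()
tt = ft.TfidfTransformer()
train_x = tt.fit_transform(cv.fit_transform(train.data))
test_x = tt.transform(cv.transform(test.data))
# Multinomial naive Bayes classifier
model = nb.MultinomialNB()
model.fit(train_x, train.target)
print('accuracy: %.2f' % model.score(test_x, test.target))
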
Gender Identification

Code: gndr.py
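
gndr.py is not included. A minimal sketch of gender identification from first names, assuming NLTK's names corpus and a naive Bayes classifier over last-letter features; the suffix length and the 80/20 split are arbitrary choices:

# -*- coding: utf-8 -*-
import random
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu
# The names corpus must be available:
# import nltk; nltk.download('names')

# Label every name in the corpus with its gender
names = ([(name, 'male') for name in nc.names.words('male.txt')] +
         [(name, 'female') for name in nc.names.words('female.txt')])
random.seed(7)
random.shuffle(names)
# Use the last two letters of each name as the only feature
data = [({'suffix': name[-2:].lower()}, gender) for name, gender in names]
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]
model = cf.NaiveBayesClassifier.train(train_data)
print('accuracy: %.2f' % cu.accuracy(model, test_data))
print(model.classify({'suffix': 'na'}))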


Reposted from blog.csdn.net/weixin_36179862/article/details/85093962