Python's jieba Library (Example: Text Word-Frequency Counting)

1. Overview of the jieba library

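jieba is a widely used third-party library for Chinese word segmentation.

- Chinese text is written without spaces between words, so it has to be segmented to obtain individual words.

- jieba is not part of the standard library and must be installed separately.

- jieba provides three segmentation modes; for the simplest use you only need to learn one function.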

2. Installing jieba

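(At the cmd command line) pip install jieba or easy_install jieba (easy_install is deprecated, so pip is the better choice; the log below is from an easy_install run)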

C:\Users\lenovo>easy_install jieba
Searching for jieba
Reading https://pypi.python.org/simple/jieba/
Downloading https://files.pythonhosted.org/packages/71/46/c6f9179f73b818d5827202ad1c4a94e371a29473b7f043b736b4dab6b8cd/jieba-0.39.zip#sha256=de385e48582a4862e55a9167334d0fbe91d479026e5dac40e59e22c08b8e883e
Best match: jieba 0.39
Processing jieba-0.39.zip
Writing C:\Users\lenovo\AppData\Local\Temp\easy_install-o02rlo5j\jieba-0.39\setup.cfg
Running jieba-0.39\setup.py -q bdist_egg --dist-dir C:\Users\lenovo\AppData\Local\Temp\easy_install-o02rlo5j\jieba-0.39\egg-dist-tmp-9zp6cf8i
zip_safe flag not set; analyzing archive contents...
jieba.__pycache__._compat.cpython-37: module references __file__
jieba.analyse.__pycache__.tfidf.cpython-37: module references __file__
creating d:\python37\lib\site-packages\jieba-0.39-py3.7.egg
Extracting jieba-0.39-py3.7.egg to d:\python37\lib\site-packages
Adding jieba 0.39 to easy-install.pth file

Installed d:\python37\lib\site-packages\jieba-0.39-py3.7.egg
Processing dependencies for jieba
Finished processing dependencies for jieba

3. How jieba segmentation works

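(1) jieba segmentation relies on a Chinese lexicon

- It uses a Chinese lexicon to estimate how likely adjacent characters are to belong together.

- Characters with a high probability of belonging together are grouped into words, which gives the segmentation result.

- Besides segmenting text, users can also add their own custom words.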
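The lexicon-and-probability idea above can be illustrated with a toy sketch. This is not jieba's implementation: the dictionary `TOY_FREQ` and all of its frequencies are made up for illustration, and the function `segment` is a hypothetical helper. It picks the split whose words have the highest combined probability, using dynamic programming from the end of the string.

```python
import math

# Hypothetical toy lexicon: word -> made-up frequency (NOT jieba's real dictionary).
TOY_FREQ = {
    "中国": 50, "国是": 1, "是": 80, "一个": 60, "一": 30, "个": 30,
    "伟大": 20, "的": 100, "国家": 40, "中": 10, "国": 10,
    "伟": 1, "大": 20, "家": 10,
}
TOTAL = sum(TOY_FREQ.values())
MAX_WORD_LEN = max(len(w) for w in TOY_FREQ)


def segment(text):
    """Return the split of `text` with the highest total log-probability."""
    n = len(text)
    # best[i] = (best log-probability of text[i:], end index of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            word = text[i:j]
            # Unknown single characters get a tiny frequency so a split always exists;
            # unknown multi-character strings are not allowed as words.
            freq = TOY_FREQ.get(word, 1e-3 if j == i + 1 else 0)
            if freq > 0:
                candidates.append((math.log(freq / TOTAL) + best[j][0], j))
        best[i] = max(candidates)
    # Follow the recorded end indices to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


print(segment("中国是一个伟大的国家"))
```

With these made-up frequencies, "中国" followed by "是" scores higher than "中" followed by "国是", so the rarer word "国是" is not chosen. Adding a custom word corresponds to adding an entry to the lexicon.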

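(2) The three segmentation modes

- Precise mode: splits the text exactly, with no redundant words.

- Full mode: scans out every possible word in the text, so the result contains redundancy.

- Search-engine mode: starts from the precise-mode result and splits long words again into shorter ones, which suits search-engine indexing.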

4. Common functions

Function	Description	Example
jieba.lcut(s)	Precise mode; returns the segmentation result as a list	>>> jieba.lcut("中国是一个伟大的国家") ['中国', '是', '一个', '伟大', '的', '国家']
jieba.lcut(s,cut_all=True)	Full mode; returns the segmentation result as a list, with redundancy	>>> jieba.lcut("中国是一个伟大的国家",cut_all=True) ['中国', '国是', '一个', '伟大', '的', '国家']
jieba.lcut_for_search(s)	Search-engine mode; returns the segmentation result as a list, with redundancy	>>> jieba.lcut_for_search("中华人民共和国是一个伟大的国家！") ['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '一个', '伟大', '的', '国家', '！']
jieba.add_word(w)	Adds the new word w to the segmentation dictionary	>>> jieba.add_word("蟒蛇语言")

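5. Example: text word-frequency counting

- English text: Hamlet, word-frequency analysis

https://python123.io/resources/pye/hamlet.txt

- Chinese text: Romance of the Three Kingdoms (《三国演义》), character analysis

https://python123.io/resources/pye/threekingdoms.txt

(1) Hamlet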

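# CalHamletV1.py
def getText():
    # The r prefix makes this a raw string; without it the backslashes in the
    # Windows path are read as escape sequences and the open() call fails.
    with open(r"C:\Users\lenovo\Desktop\hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # convert all English letters to lowercase
    for ch in '!#$%^&*()_"+./<>=;:,-~`?@[]{}\\|':
        txt = txt.replace(ch, " ")  # replace punctuation with spaces
    return txt

hamletTxt = getText()  # read the file and normalize the text
words = hamletTxt.split()  # split on whitespace (the default) into a list
counts = {}  # dictionary mapping word -> count
for word in words:
    counts[word] = counts.get(word, 0) + 1  # get() returns the current count, or 0 for a new word
items = list(counts.items())  # convert to a list of (word, count) pairs
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))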



Output:
the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
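The same counting step can also be written with `collections.Counter` from the standard library. The sketch below is an alternative, not the code that produced the numbers above: the helper name `top_words` and the sample sentence are made up, and it extracts words with a regular expression instead of replacing a fixed list of punctuation marks, so its counts on hamlet.txt may differ slightly.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most common words in `text` as (word, count) pairs."""
    # Lowercase first, then keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text standing in for the contents of hamlet.txt.
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter.most_common(n)` replaces the manual conversion to a list, the sort, and the `range(10)` loop.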

(2) Romance of the Three Kingdoms

import jieba
# The text file is not UTF-8, so the encoding is given explicitly
txt = open(r"C:\Users\lenovo\Desktop\三国演义.txt","r",encoding="gb18030").read()
words = jieba.lcut(txt)  # precise-mode segmentation
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)  # sort by count, highest first
for i in range(15):
    word,count = items[i]
    print("{0:<10}{1:>5}".format(word,count))

Output:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\lenovo\AppData\Local\Temp\jieba.cache
Loading model cost 2.325 seconds.
Prefix dict has been built succesfully.
曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358

import jieba
txt = open(r"C:\Users\lenovo\Desktop\三国演义.txt","r",encoding="gb18030").read()
# Frequent words that are not character names
excludes = {"将军","却说","荆州","二人","不可","不能","如此","左右"}
words = jieba.lcut(txt)
counts = {}
# Count words, merging different names for the same character into one
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    counts.pop(word, None)  # remove the word; no error if it is absent
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
for i in range(10):
    word,count = items[i]
    print("{0:<10}{1:>5}".format(word,count))

Output (from a run in which the 关羽, 刘备 and 曹操 lines mistakenly used == instead of =, so those aliases were not merged; the code above will give different counts):
孔明         1391
曹操          963
张飞          366
商议          353
如何          344
主公          338
军士          320
吕布          303
军马          297
赵云          282
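As the list of aliases grows, the elif chain can be replaced by a lookup table. The sketch below is one possible way to do this: the names `ALIASES`, `EXCLUDES` and `count_characters` are made up, and the short `tokens` list is a hypothetical stand-in for the result of jieba.lcut(txt), so the example runs without jieba or the novel.

```python
from collections import Counter

# Alias -> canonical name, copied from the elif chain above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Frequent words that are not character names.
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "左右"}


def count_characters(words):
    """Count multi-character words, merging aliases and skipping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1 or word in EXCLUDES:
            continue
        counts[ALIASES.get(word, word)] += 1
    return counts


# Hypothetical pre-segmented tokens standing in for jieba.lcut(txt).
tokens = ["却说", "玄德", "与", "关公", "张飞", "见", "孔明", "孔明曰", "丞相", "曹操"]
print(count_characters(tokens).most_common())
```

Skipping excluded words during counting also avoids the separate deletion loop, and adding a new alias or excluded word becomes a one-line change to the tables.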

Extensions: government work reports, research papers, news reports, word clouds, and so on.

When opening the file, I ran into this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte

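Text encoding and decoding problems come up often in Python, and the error above is one of the most common decoding errors. The fixes below also apply when the message names 'gbk' instead of 'utf-8'.
(1) First, specify the encoding when opening the file, e.g. open('1.txt', encoding='gbk').
(2) If (1) does not solve it, the text may contain special characters outside the gbk range; try the broader 'gb18030', e.g. open('1.txt', encoding='gb18030').
(3) If (2) still fails, the file contains bytes that even 'gb18030' cannot decode; they can be skipped with errors='ignore', e.g. open('1.txt', encoding='gb18030', errors='ignore'). The skipped bytes are dropped silently.
(4) Another common fix is to read the raw bytes and decode them yourself: open('1.txt', 'rb').read().decode('gb18030', 'ignore'). The file must be opened in binary mode ('rb'), because in Python 3 a text-mode read() returns a str, which has no decode() method.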
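Steps (1) to (3) can be combined into a small helper that tries several encodings in order. This is a minimal sketch under stated assumptions: the function name `read_text_with_fallback` and the default encoding order are made up, and the demo writes a temporary GBK-encoded file as a stand-in for the novel.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "gbk", "gb18030")):
    """Try each encoding in turn and return (text, encoding_used)."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: drop undecodable bytes, which loses data.
    last = encodings[-1]
    return raw.decode(last, errors="ignore"), last + " (errors ignored)"


# Demo with a temporary GBK-encoded file (hypothetical stand-in for the novel).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("三国演义".encode("gbk"))
text, used = read_text_with_fallback(tmp.name)
print(text, used)
os.remove(tmp.name)
```

UTF-8 is tried first because it is strict and rejects most non-UTF-8 input. A successful gbk or gb18030 decode does not prove the guess is right, so it is worth checking the returned text by eye.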