Segmenting Chinese documents with jieba in Python and writing the results to an Excel file

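Input

This article works with 2,000 positive-review txt documents and 2,000 negative-review txt documents for a product on JD.com (京东), 4,000 txt documents in total.

The content of a positive-review txt document looks like this:

1 钢琴漆，很滑很亮。2 LED宽屏，看起来很爽3 按键很舒服4 活动赠品多

The content of a negative-review txt document looks like this:

送货上门后发现电脑显示器的两边有缝隙；成型塑料表面凹凸不平。做工很差，，，，，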

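Output

The first output is the result of running jieba word segmentation on the 4,000 txt documents.

For the positive-review document shown in the Input section, the segmentation result is: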

　　钢琴漆很滑很亮 LED 宽屏很爽按键舒服活动赠品

For the negative-review document shown in the Input section, the segmentation result is:

　　送货上门发现电脑显示器两边缝隙成型塑料表面凹凸不平做工很差


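The segmentation results of the 2,000 positive-review and 2,000 negative-review txt documents are then written to Excel files. Each segmentation result is paired with a label (1 for a positive review, 0 for a negative review), as illustrated below:

Figure: segmentation results of the positive-review txt documents

Figure: segmentation results of the negative-review txt documents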
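
The labeling scheme above (1 for positive, 0 for negative) can also be applied to both folders in a single pass. The following is a minimal sketch, not the code used in this article: the function name `collect_labeled_rows` is hypothetical, and the `pos`/`neg` subfolder layout and the gb2312 encoding are assumptions based on the paths and `open` call in the script further down. Segmentation is left out so that the labeling step stands on its own.

```python
import os

# Assumed layout: <base_dir>/pos/*.txt and <base_dir>/neg/*.txt
LABELS = {"pos": 1, "neg": 0}


def collect_labeled_rows(base_dir, encoding="gb2312"):
    """Return [text, label] rows for every .txt file under pos/ and neg/."""
    rows = []
    for folder, label in LABELS.items():
        folder_path = os.path.join(base_dir, folder)
        for name in sorted(os.listdir(folder_path)):
            path = os.path.join(folder_path, name)
            if not (os.path.isfile(path) and name.endswith(".txt")):
                continue
            # errors="ignore" skips bytes that are not valid in the encoding
            with open(path, encoding=encoding, errors="ignore") as f:
                rows.append([f.read().strip(), label])
    return rows
```

Each returned row has the same shape as a row of the Excel sheet: the review text first, the label second.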

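Tools

The tools used in this article are Anaconda, PyCharm, Python, the jieba Chinese word segmentation library, the xlwt package (used by the script to write the .xls file), and a stop-word file downloaded from the internet.

Approach

jieba segments the Chinese text of each txt document into words; the stop words are then removed from the result, and what remains is written to an Excel file.

Python implementation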
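
The stop-word removal step can be sketched independently of jieba. This is a minimal illustration under stated assumptions rather than the article's own code: `load_stopwords` and `filter_tokens` are hypothetical helper names, the stop-word file is assumed to hold one word per line in UTF-8, and a hand-written token list stands in for the output of the segmenter.

```python
def load_stopwords(path, encoding="utf-8"):
    """Read one stop word per line into a set for fast membership tests."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


def filter_tokens(tokens, stopwords):
    """Join the tokens that are neither stop words nor whitespace."""
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return " ".join(kept)


# Hand-written tokens stand in for the output of jieba.cut(text)
tokens = ["做工", "很", "差", "，", "，"]
print(filter_tokens(tokens, {"，", "很"}))  # prints: 做工 差
```

With jieba installed, the generator returned by `jieba.cut(text)` can be passed in directly as `tokens`. Using a set rather than a list keeps each stop-word lookup at constant time, which matters when thousands of documents are processed.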

import os

import jieba
from xlwt import Workbook


# Load the stop-word file into a set (one stop word per line)
def stopwordslist():
    with open('stopwords.txt', encoding='UTF-8') as f:
        return {line.strip() for line in f}


# Load the stop words once instead of on every call to seg_depart
stopwords = stopwordslist()


# Segment a document string with jieba and drop the stop words
def seg_depart(sentence):
    print('sentence:{}'.format(sentence))
    # jieba.cut returns a generator of tokens
    sentence_depart = jieba.cut(sentence.strip())
    # Keep the tokens that are neither stop words nor whitespace
    words = [w for w in sentence_depart if w not in stopwords and w.strip()]
    # Join the remaining tokens with single spaces
    outstr = ' '.join(words)
    print('outstr:{}'.format(outstr))
    return outstr


# Directory holding the txt documents
# (for the negative reviews, switch to the 'neg' path, label 0 and 'neg_fenci_excel.xls')
# mypath = 'F:\\Jingdong_4000\\neg\\'
mypath = 'F:\\Jingdong_4000\\pos\\'
myfiles = os.listdir(mypath)

# Names of the .txt files in the directory
fileList = []
for f in myfiles:
    if os.path.isfile(os.path.join(mypath, f)) and os.path.splitext(f)[1] == '.txt':
        fileList.append(f)

# Rows to write to the Excel file;
# each row is a list holding the segmentation result and the label
excellist = []
for ff in fileList:
    with open(os.path.join(mypath, ff), 'r', encoding='gb2312', errors='ignore') as f:
        sourceInLines = f.readlines()

    # Concatenate the lines of the document into one string
    # (named text so that the built-in str is not shadowed)
    text = ''
    for line in sourceInLines:
        text += line
        text = text.strip()

    # Segment the text
    rowList = [seg_depart(text).strip()]

    # Append the matching label: 0 (negative) or 1 (positive)
    # rowList.append(0)
    rowList.append(1)

    excellist.append(rowList)

# Set up the Excel sheet and its header row
book = Workbook()
sheet1 = book.add_sheet('Sheet1')
row0 = ['review', 'label']

for i in range(len(row0)):
    sheet1.write(0, i, row0[i])

# Outer loop: one Excel row per document; inner loop: the columns of that row
for i, li in enumerate(excellist):
    print('i:{}, li:{}'.format(i, li))
    for j, lj in enumerate(li):
        sheet1.write(i + 1, j, lj)

# Save the data to the Excel file
# book.save('neg_fenci_excel.xls')
book.save('pos_fenci_excel.xls')

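Result of running the code

Running the code produces an Excel file like the ones described in the Output section. As the commented-out lines in the script indicate, it is run once per class: once with the positive-review path, label 1 and 'pos_fenci_excel.xls', and once with the negative-review path, label 0 and 'neg_fenci_excel.xls'.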