Using the Python Chinese sentiment analysis library SnowNLP

When you are unwilling to settle, you are improving; when you are suffering, you are growing.

1. Introduction to SnowNLP

SnowNLP is a library written in Python that makes it easy to process Chinese text. It was inspired by TextBlob. Since most natural language processing libraries target English, the author wrote one that is convenient for Chinese. Unlike TextBlob, it does not rely on NLTK: all algorithms are implemented from scratch, and some pre-trained dictionaries are bundled. Note that the library works on unicode text, so decode your input to unicode before passing it in.

SnowNLP GitHub address: https://github.com/isnowfy/snownlp

# Install
pip install snownlp -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
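The unicode note above mainly matters on Python 2; on Python 3, str is already unicode, but byte strings still have to be decoded before they are handed to SnowNLP. A minimal sketch:

from snownlp import SnowNLP

# Python 3 strings are already unicode and can be passed in directly
s = SnowNLP(u'这个姑娘真好看')

# bytes (for example, text read from a file in binary mode) must be decoded first
raw = u'这个姑娘真好看'.encode('utf-8')   # stand-in for bytes read from elsewhere
s = SnowNLP(raw.decode('utf-8'))
print(s.words)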

2. Characteristics of SnowNLP

  • Chinese word segmentation (character-based generative model)
  • Part-of-speech tagging (TnT 3-gram hidden Markov model)
  • Sentiment analysis (the official documentation does not describe the exact principle; accuracy is higher on shopping reviews, mainly because the built-in corpus consists largely of shopping reviews)
  • Text classification (Naive Bayes)
  • Conversion to pinyin (maximum matching over a trie)
  • Traditional-to-simplified conversion (maximum matching over a trie)
  • Keyword extraction (TextRank algorithm)
  • Text summarization (TextRank algorithm)
  • tf, idf
  • Tokenization (splitting into sentences)
  • Text similarity (BM25)

3. Basic use of the SnowNLP library

from snownlp import SnowNLP

word = u'这个姑娘真好看'
s = SnowNLP(word)
print(s.words)        # word segmentation
print(list(s.tags))   # part-of-speech tagging
print(s.sentiments)   # sentiment score
print(s.pinyin)       # pinyin
print(SnowNLP(u'蒹葭蒼蒼,白露為霜。所謂伊人,在水一方。').han)  # traditional to simplified conversion

The output is as follows:
['这个', '姑娘', '真', '好看']
[('这个', 'r'), ('姑娘', 'n'), ('真', 'd'), ('好看', 'a')]
0.9002381975487243
['zhe', 'ge', 'gu', 'niang', 'zhen', 'hao', 'kan']
蒹葭苍苍,白露为霜。所谓伊人,在水一方。
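The third value above (0.9002...) is the sentiment score: s.sentiments is the probability that the text is positive, so values close to 1 indicate positive sentiment and values close to 0 negative sentiment. A quick sketch (the exact numbers depend on the default corpus and are only indicative):

from snownlp import SnowNLP

print(SnowNLP(u'这个姑娘真好看').sentiments)   # close to 1: positive
print(SnowNLP(u'这个东西太难用了').sentiments)  # expected to be much lower: negative
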
from snownlp import SnowNLP

text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,
所以它与语言学的研究有着密切的联系,但又有重要的区别。
自然语言处理并不是一般地研究自然语言,
而在于研制能有效地实现自然语言通信的计算机系统,
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)

print(s.keywords(limit=3))        # keyword extraction
print('--------------------------------')
summary = s.summary(limit=4)      # text summarization
for i in summary:
    print(i)

print('--------------------------------')

print(s.sentences)        # sentence splitting

The output is as follows:
['语言', '自然', '计算机']
--------------------------------
因而它是计算机科学的一部分
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向
自然语言处理是一门融语言学、计算机科学、数学于一体的科学
所以它与语言学的研究有着密切的联系
--------------------------------
['自然语言处理是计算机科学领域与人工智能领域中的一个重要方向', '它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法', '自然语言处理是一门融语言学、计算机科学、数学于一体的科学', '因此', '这一领域的研究将涉及自然语言', '即人们日常使用的语言', '所以它与语言学的研究有着密切的联系', '但又有重要的区别', '自然语言处理并不是一般地研究自然语言', '而在于研制能有效地实现自然语言通信的计算机系统', '特别是其中的软件系统', '因而它是计算机科学的一部分']

Process finished with exit code 0
# Evaluating how important a word is to a text
# TF-IDF is a statistical method for evaluating how important a word is to one document in a collection or corpus.
# The higher a word's term frequency (TF), the more important it is; but meaningless words such as "的" and "你" have very high frequency while carrying almost no information, so judging importance by term frequency alone is inaccurate. That is why IDF is added.
# The main idea of IDF: the fewer documents contain a term t (i.e. the smaller n is), the larger its IDF and the more important the term.
# Only by combining TF and IDF can the importance of a word to a text be evaluated accurately.

from snownlp import SnowNLP


s = SnowNLP([[u'这篇', u'文章', u'写得', u'不错'],
             [u'那篇', u'论文', u'好'],
             [u'这个', u'东西', u'好吃']])
print(s.tf)     # tf stands for Term Frequency
print('---------------------------------------------------')
print(s.idf)    # idf stands for Inverse Document Frequency
print('-----------------------------------------------------')
# text similarity (BM25)
print(s.sim([u'文章']))
print(s.sim([u'好']))

The output is as follows:
[{'这篇': 1, '文章': 1, '写得': 1, '不错': 1}, {'那篇': 1, '论文': 1, '好': 1}, {'这个': 1, '东西': 1, '好吃': 1}]
---------------------------------------------------
{'这篇': 0.5108256237659907, '文章': 0.5108256237659907, '写得': 0.5108256237659907, '不错': 0.5108256237659907, '那篇': 0.5108256237659907, '论文': 0.5108256237659907, '好': 0.5108256237659907, '这个': 0.5108256237659907, '东西': 0.5108256237659907, '好吃': 0.5108256237659907}
-----------------------------------------------------
[0.4686473612532025, 0, 0]
[0, 0.5348959411162205, 0]
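All the idf values above are identical because every word occurs in exactly one of the three documents. They are consistent with a BM25-style smoothed IDF, log((N - n + 0.5) / (n + 0.5)), where N is the total number of documents and n the number of documents containing the word; a quick check:

import math

N = 3   # documents in the toy corpus
n = 1   # every word appears in exactly one document
print(math.log((N - n + 0.5) / (n + 0.5)))   # 0.5108256237659907, matching the idf output above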

# About training
# Training is currently supported for word segmentation, part-of-speech tagging and sentiment analysis; all of them ship with the library's original training files. Taking word segmentation as an example (its code lives under snownlp/seg):

from snownlp import seg

seg.train('data.txt')     # 'data.txt' is your segmentation training corpus
seg.save('seg.marshal')
# sentiment analysis is trained in the same way:
#   from snownlp import sentiment
#   sentiment.train('neg.txt', 'pos.txt')
#   sentiment.save('sentiment.marshal')

# The trained model is saved as seg.marshal; afterwards, point data_path in snownlp/seg/__init__.py at the newly trained file.

4. NLP test

1. Getting the data

URL: https://item.jd.com/100000499657.html#none
Crawl some positive, neutral and negative reviews from the page and save them to three txt files.
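A minimal sketch of the expected layout, with placeholder reviews standing in for the crawled data: each txt file holds one review per line, and the file names must be 好评, 中评 and 差评, because the file name (without the extension) becomes the label in the next step and is later compared against the predicted sentiment.

from pathlib import Path

# hypothetical example reviews standing in for the crawled data
reviews_by_label = {
    '好评': ['手机很好用,快递也很快'],
    '中评': ['还行吧,不算惊喜'],
    '差评': ['用了两天就坏了,很失望'],
}

for label, reviews in reviews_by_label.items():
    # one review per line; the file name (= label) is what the later steps rely on
    Path(f'{label}.txt').write_text('\n'.join(reviews), encoding='utf-8')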

2. Processing the data

from pathlib import Path
import pandas as pd

# collect the txt files (positive / neutral / negative review data) under the directory
p = Path(r'D:\python\pycharm2020\program\数据分析\中文情感分析')
review_txt = list(p.glob('**/*.txt'))
all_data = pd.DataFrame()
for item in review_txt:
    emotion = item.stem     # file name without the extension, used as the label
    with Path(item).open(mode='r', encoding='utf-8') as f:   # assumes the files are UTF-8 encoded
        con = f.read().split('\n')
    data = pd.DataFrame({'评论内容': con, '标签': [emotion] * len(con)})
    all_data = pd.concat([all_data, data])   # DataFrame.append was removed in pandas 2.0

all_data.to_excel('评论数据.xlsx', index=False)

3. NLP testing

from snownlp import SnowNLP
import pandas as pd
import re

df = pd.read_excel('评论数据.xlsx')
content = df['评论内容']
# remove useless characters and keep only the Chinese text
content = [' '.join(re.findall('[\u4e00-\u9fa5]+', item, re.S)) for item in content]
# give every review a sentiment score
scores = [SnowNLP(i).sentiments for i in content]
emotions = []
# map the scores to positive / neutral / negative labels
for i in scores:
    if i >= 0.75:
        emotions.append('好评')
    elif 0.45 <= i < 0.75:
        emotions.append('中评')
    else:
        emotions.append('差评')

df['情感分数'] = scores
df['情感'] = emotions
df.to_excel('NLP测试后数据.xlsx')
import pandas as pd

# compute the prediction accuracy
df = pd.read_excel('NLP测试后数据.xlsx')
# compare the labels derived from the SnowNLP sentiment scores and thresholds (好评 / 中评 / 差评) with the actual labels
data = df[df['标签'] == df['情感']]
print('准确率为:{:.3%}'.format(len(data) / len(df)))

The output is as follows:
准确率为:72.292%

Process finished with exit code 0
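To see where the errors come from, the accuracy can also be broken down by the original label; a small pandas sketch on the same file:

import pandas as pd

df = pd.read_excel('NLP测试后数据.xlsx')
df['正确'] = df['标签'] == df['情感']
# the mean of the boolean column gives the accuracy for each original label
print(df.groupby('标签')['正确'].mean())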
The accuracy is acceptable but not very high. Possible reasons:

  • This is just an exercise to get familiar with the basic use of the SnowNLP library, so sentiment is judged purely by the score thresholds; no domain-specific corpus was built. Building a corpus for this domain and replacing the default one would raise the accuracy considerably, so the corpus is critical (see the sketch after this list). For serious text mining, it is recommended to build your own corpus.
  • For reviews under this product, the boundary between neutral and negative reviews is rather blurred. Each review keeps the label assigned when it was crawled (i.e. which kind of review page it came from), so without manual checking there is a considerable amount of labeling error.
  • The text preprocessing only filters out non-Chinese characters and keeps the Chinese.
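
A minimal sketch of what "replacing the default corpus" could look like, assuming neg_reviews.txt and pos_reviews.txt are review files you collected yourself with one review per line (the file names are hypothetical):

from snownlp import sentiment

sentiment.train('neg_reviews.txt', 'pos_reviews.txt')   # hypothetical file names
sentiment.save('sentiment.marshal')
# then point data_path in snownlp/sentiment/__init__.py at the newly saved file,
# and subsequent SnowNLP(...).sentiments calls will use the new model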

Original post: blog.csdn.net/fyfugoyfa/article/details/108431045