Using Python to segment chat records and find out what users are most concerned about

background:

Recently a department head assigned a task: from the chat records stored in a data table, work out what users are most concerned about (i.e., which words appear most frequently in the questions users ask during consultations), so that the business can later make targeted adjustments, change its marketing strategy, and so on.


A sample of the chat records:

Hello there

I'd like to apply for a Ph.D. in the United States.

She graduates this year and is preparing to apply for 2020, intending to apply for finance or business programs.

We are prepared.

There are a few Chinese students there. The programs are considered fairly easy.

Not yet. Ph.D. admissions are fiercely competitive, so we intend to apply to several schools.

Then let's use QQ. 1111111

Thank you

2222222

My undergraduate GPA is average, around 3.4.

211

Meng

student

City line basis

Okay

Sorry for the trouble; I'll contact you on QQ first.

I may not be able to answer the phone.

Yup

Ok

Thank you


Ideas:

Use the jieba module with a custom dictionary to segment each chat record (i.e., Chinese word segmentation), store every resulting word in an intermediate table, and finally summarize the results from that intermediate table. Although jieba can recognize new words on its own, its built-in dictionary may not segment domain-specific terms particularly well; a custom keyword dictionary guarantees higher accuracy during segmentation.


Source:


cat userdict.txt

留学

出国

研究生

英国

美国

(The entries mean "study abroad", "go abroad", "postgraduate", "United Kingdom", and "United States"; a jieba user dictionary must contain the actual Chinese words.)


cat fenci_dictionary.py

import jieba
import pymysql

db = pymysql.connect(host='xx.xx.xx.xx', user='xxx', passwd='xxx',
                     db='dbname', charset='utf8', connect_timeout=30)
cursor = db.cursor()

sql = 'SELECT msg FROM tablename WHERE msg_type="g" LIMIT 50'
cursor.execute(sql)
results = cursor.fetchall()

# Load the custom user dictionary once, before segmenting.
jieba.load_userdict('userdict.txt')

for row in results:
    msg = row[0]
    # Segment the message and store each word in the intermediate table.
    for word in jieba.cut(msg):
        sql1 = 'INSERT INTO test.tmp_fenci_statistic(keywords) VALUES (%s)'
        try:
            # Parameterized query: the driver escapes the word safely.
            cursor.execute(sql1, (word,))
            db.commit()
        except pymysql.Error:
            db.rollback()

db.close()
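The script above only fills the intermediate table; the summary step described in the approach, counting how often each word appears, still has to be run against it. A minimal sketch, assuming the same test.tmp_fenci_statistic table (the single-character filter is a hypothetical way to drop punctuation and particles left over from segmentation):

import pymysql

db = pymysql.connect(host='xx.xx.xx.xx', user='xxx', passwd='xxx',
                     db='dbname', charset='utf8', connect_timeout=30)
cursor = db.cursor()

# Count each keyword and list the most frequent ones first.
cursor.execute(
    'SELECT keywords, COUNT(*) AS cnt '
    'FROM test.tmp_fenci_statistic '
    'WHERE CHAR_LENGTH(keywords) > 1 '   # hypothetical: skip 1-char tokens
    'GROUP BY keywords '
    'ORDER BY cnt DESC '
    'LIMIT 20')
for keyword, cnt in cursor.fetchall():
    print(keyword, cnt)

db.close()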


About jieba:


Installing the jieba tokenizer (it is simply a Python module):

pip3 install jieba


Adding a custom dictionary to jieba:

Although jieba can recognize new words on its own, if the built-in dictionary is missing domain-specific terms, or its segmentation of certain domain keywords is not satisfactory, you can define your own keyword dictionary to guarantee higher accuracy during segmentation.


Syntax:

jieba.load_userdict(filename)    # filename is the path to the custom dictionary


Dictionary format:

One word per line. Each line can contain up to three space-separated parts: 1) the word, 2) its frequency, 3) its part-of-speech tag; parts 2 and 3 may be omitted.


Example:

cat userdict.txt

留学

出国

研究生

英国

美国
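An entry may also carry a frequency and a part-of-speech tag, e.g. "研究生 10 n". To apply the dictionary, load it before segmenting; a minimal sketch (the consultation sentence is made up for illustration):

import jieba

# Load the custom dictionary shown above so that its entries are
# treated as known words during segmentation.
jieba.load_userdict('userdict.txt')

# A made-up consultation sentence for illustration.
sentence = '我想出国留学,准备申请美国的研究生'
print('/'.join(jieba.cut(sentence)))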


Aside:

jieba also supports three segmentation modes: accurate mode, full mode, and search-engine mode. None of them is absolutely better than the others; what matters is whether a mode suits your analysis. If you are interested in the details, look them up yourself (e.g. on Baidu). A minimal sketch of the three modes follows, using the sample sentence from jieba's own documentation:
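import jieba

s = '我来到北京清华大学'

# Accurate mode (the default): the single most reasonable segmentation.
print('/'.join(jieba.cut(s)))

# Full mode: every word the dictionary can find, overlaps included.
print('/'.join(jieba.cut(s, cut_all=True)))

# Search-engine mode: accurate mode plus extra splits of long words.
print('/'.join(jieba.cut_for_search(s)))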


Origin: blog.51cto.com/20131104/2437545