background:
A department head recently assigned a task: from the chat records stored in a data table, find out which topics users are most concerned about (i.e., which words appear most frequently in consultation chats), so that the business can adjust its operations and marketing strategy accordingly.
A sample chat follows:
Hello there
I'd like to apply for a Ph.D. in the United States.
She graduates this year and is preparing to apply for 2020 entry, intending to apply for finance or business programs.
We are prepared.
There are a few Chinese ones, but they are fairly weak.
Not yet. Ph.D. competition is fierce, so we plan to include some safety schools.
Here is my QQ then: 1111111
Thank you
2222222
My undergraduate GPA is average, around 3.4.
A 211 university
Meng
A student
City line basis
okay
Sorry to trouble you, but please contact me on QQ first.
I may not be able to answer the phone.
Yup
Ok
Thank you
Ideas:
Use the jieba module with a custom dictionary to segment each chat message into words (Chinese word segmentation), store each resulting word in an intermediate table, and finally summarize that intermediate table. Although jieba can recognize new words on its own, its built-in dictionary may not segment domain-specific terms satisfactorily; using a custom keyword dictionary ensures higher segmentation accuracy.
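The flow above (segment each message, then count word frequencies) can be sketched with Python's standard library. Here `collections.Counter` does the counting, and the hand-segmented token lists stand in for what `jieba.cut` would return on real chat messages with the custom dictionary loaded:

```python
from collections import Counter

# Hypothetical output of jieba.cut for three chat messages;
# in the real pipeline these tokens come from jieba.
segmented_messages = [
    ["留学", "美国", "博士"],
    ["申请", "留学"],
    ["留学", "研究生"],
]

# Count every token across all messages.
counts = Counter(tok for msg in segmented_messages for tok in msg)

# The most frequently asked-about keywords come first.
print(counts.most_common(2))
```

In the actual script the counting is done in MySQL instead, by inserting each token into an intermediate table and aggregating later, but the logic is the same.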
Source:
cat userdict.txt
留学
出国
研究生
英国
美国
(The entries mean: study abroad, go abroad, postgraduate, United Kingdom, United States.)
cat fenci_dictionary.py
import jieba
import pymysql

# Load the custom dictionary once, before any segmentation.
jieba.load_userdict('userdict.txt')

db = pymysql.connect(host='xx.xx.xx.xx', user='xxx', passwd='xxx',
                     db='dbname', charset='utf8', connect_timeout=30)
cursor = db.cursor()

sql = 'SELECT msg FROM tablename WHERE msg_type="g" LIMIT 50'
cursor.execute(sql)
results = cursor.fetchall()

for row in results:
    msg = row[0]
    for word in jieba.cut(msg):
        # Parameterized query avoids quoting bugs and SQL injection.
        sql1 = 'INSERT INTO test.tmp_fenci_statistic(keywords) VALUES (%s)'
        try:
            cursor.execute(sql1, (word,))
            db.commit()
        except Exception:
            db.rollback()

db.close()
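Once the intermediate table is filled, the summary is a simple GROUP BY over the keywords column. A minimal sketch of that final step, using an in-memory SQLite database as a stand-in for the MySQL table `test.tmp_fenci_statistic` (the table name and sample rows are illustrative):

```python
import sqlite3

# In-memory SQLite table shaped like test.tmp_fenci_statistic.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tmp_fenci_statistic (keywords TEXT)")
cur.executemany(
    "INSERT INTO tmp_fenci_statistic (keywords) VALUES (?)",
    [("留学",), ("美国",), ("留学",), ("研究生",), ("留学",)],
)

# Aggregate: most frequently asked-about keywords first.
cur.execute(
    "SELECT keywords, COUNT(*) AS cnt "
    "FROM tmp_fenci_statistic GROUP BY keywords ORDER BY cnt DESC"
)
rows = cur.fetchall()
for word, cnt in rows:
    print(word, cnt)
conn.close()
```

Against MySQL the same GROUP BY query (with `%s` placeholders and pymysql) produces the final frequency report the department head asked for.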
About jieba:
Installing the jieba tokenizer (it is just a Python module):
pip3 install jieba
Adding a custom dictionary to jieba:
Although jieba can recognize new words on its own, if the built-in dictionary lacks terms from a particular domain, or segments certain domain keywords unsatisfactorily, we can define our own keyword dictionary to guarantee higher accuracy during segmentation.
Syntax:
jieba.load_userdict(filename)  # filename is the path to the custom dictionary
Dictionary format:
One word per line. Each line can contain up to three space-separated fields: 1. the word; 2. its frequency; 3. its part of speech. Fields 2 and 3 may be omitted.
Example:
cat userdict.txt
留学
出国
研究生
英国
美国
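With the optional frequency and part-of-speech fields described above, entries would look like the following (the frequencies and POS tags here are illustrative, not taken from the original file):

```text
留学 5 n
出国 3 v
英国 3 ns
```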
Aside:
jieba also supports precise mode, full mode, and search-engine mode segmentation. None of these modes is absolutely better than the others; what matters is which one suits your analysis. Interested readers can look them up on Baidu.