Python in practice: a surprisingly simple profanity and advertisement detector for a game chat system


A chat feature is almost mandatory in games, but it brings a problem: the world channel can become very chaotic, full of sensitive words or messages the game publisher doesn't want to see. Our company already supports player reports and back-office monitoring; today we will implement this kind of monitoring ourselves.

1. Requirements analysis:

I'm not very strong at deep learning. I have experimented with reinforcement learning before, but the results were not particularly satisfying, so I'll look into a simpler approach.

This classification task actually has ready-made solutions; spam filtering, for instance, is the same kind of problem. Although there are several approaches, I chose the simplest one, Naive Bayes classification, mainly as an exploration.

Because most of our game text is Chinese, we need Chinese word segmentation. For example, a sentence like "I am a handsome guy" needs to be split into individual words before we can classify it.

2. Algorithm principle:

The Naive Bayes algorithm judges the category of a new sample from the conditional probabilities of the sample's features in the existing data set. It makes two assumptions: (1) the features are independent of each other, and (2) every feature is equally important. You can also think of it as using past frequencies to estimate, given the current features, the probability of each category. I won't reproduce the formulas here; you can easily look up the exact math, and an intuitive understanding is enough for this article.

Use the right algorithm at the right time.
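To make the idea concrete, here is a toy sketch of Naive Bayes classification. The word counts below are invented purely for illustration, not taken from any real corpus:

```python
# Toy Naive Bayes: hypothetical per-category word counts (invented numbers)
counts = {
    'ad': {'discount': 8, 'free': 6, 'hello': 1},
    'chat': {'discount': 1, 'free': 2, 'hello': 9},
}

def predict(words):
    """Pick the category maximizing P(category) * product of P(word|category),
    with add-one (Laplace) smoothing for words unseen in a category."""
    vocab = {w for c in counts.values() for w in c}
    total = sum(sum(c.values()) for c in counts.values())
    best, best_p = None, 0.0
    for cat, wc in counts.items():
        cat_total = sum(wc.values())
        p = cat_total / total  # prior P(category)
        for w in words:
            p *= (wc.get(w, 0) + 1) / (cat_total + len(vocab))  # likelihood
        if p > best_p:
            best, best_p = cat, p
    return best

print(predict(['free', 'discount']))  # → 'ad'
print(predict(['hello']))             # → 'chat'
```

This is exactly the computation NLTK's `NaiveBayesClassifier` performs for us later, just written out by hand on two made-up categories.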

Jieba's segmentation principle: jieba is a probabilistic language-model segmenter. Its task is to find, among all candidate results produced by full segmentation, the segmentation scheme S that maximizes P(S).


As you can see, jieba ships with a built-in phrase dictionary, and segmentation splits the text using these phrases as the basic units.
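To make the "maximize P(S)" idea concrete, here is a toy sketch. This is not jieba's actual implementation (which builds a DAG over a large dictionary), and the word-frequency table is invented for illustration:

```python
# Toy probabilistic segmenter: choose the split S maximizing P(S),
# where P(S) is the product of (made-up) word frequencies.
freq = {'北京': 0.3, '天安门': 0.2, '北': 0.05, '京': 0.05,
        '天': 0.05, '安': 0.05, '门': 0.05}

def best_split(text):
    # best[i] = (probability, segmentation) for the prefix text[:i]
    best = {0: (1.0, [])}
    for i in range(1, len(text) + 1):
        for j in range(i):
            word = text[j:i]
            if word in freq and j in best:
                p = best[j][0] * freq[word]
                if i not in best or p > best[i][0]:
                    best[i] = (p, best[j][1] + [word])
    return best.get(len(text), (0.0, []))[1]

print(best_split('北京天安门'))  # → ['北京', '天安门']
```

Because the multi-character phrases have higher frequencies than the single characters, the dynamic program prefers the phrase-level split, which is the same reason jieba splits on its built-in phrases.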

Note: I have only briefly introduced the principles behind these two techniques. A thorough explanation would take a long article of its own; there is plenty of material online, so find an explanation that works for you. Getting it working first is fine.

3. Technical analysis

The best-known Chinese word-segmentation package is jieba. Whether it is actually the best, I don't know, but its popularity presumably has a reason, so let's start with it. There is no need to dig deep into jieba's internals right now; solve the problem first, and study the details later when you actually run into issues. That learning pattern is the most efficient.

Because I have recently been working on speech-related things, an expert recommended the nltk library to me. From the material I've read, it seems to be a well-known and very powerful library for natural language processing. Here I mainly use its classification algorithm, so I don't need to worry about the implementation details or reinvent the wheel; my own wheel probably wouldn't be as good anyway.

Python really is great: there's a package, a wheel, for everything.

Installation commands:

pip install jieba
pip install nltk

Run the two commands above; once they finish, the packages are installed and we can happily start testing.

"""
#Author: 香菜
@time: 2021/8/5 0005 下午 10:26
"""
import jieba
 
if __name__ == '__main__':
   result = " | ".join(jieba.cut("我爱北京天安门,very happy"))
   print(result)

Look at the segmentation result: it's excellent. The professionals really know what they're doing.


4. Source code

With the simple test done, we can see that everything we need is basically in place, so let's get straight into the code.

1. Load the initial text corpora.

2. Strip punctuation from the text.

3. Extract features from the text.

4. Train on the data set to produce a model (the prediction model).

5. Test newly input sentences.

#!/usr/bin/env python
# encoding: utf-8
import re

import jieba
from nltk.classify import NaiveBayesClassifier

"""
#Author: 香菜
@time: 2021/8/5 0005 下午 9:29
"""
# Keep only letters, digits, and CJK characters; strip everything else
rule = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5]")
def delComa(text):
    return rule.sub('', text)

def loadData(fileName):
    # Read the corpus, strip punctuation, and segment it into a word list
    with open(fileName, "r", encoding='utf-8') as f:
        text1 = f.read()
    text1 = delComa(text1)
    return list(jieba.cut(text1))

# Feature extraction: mark every word in the sample as present
def word_feats(words):
    return dict([(word, True) for word in words])

if __name__ == '__main__':
    adResult = loadData(r"ad.txt")
    yellowResult = loadData(r"yellow.txt")
    # One training sample per word, labelled 'ad' or 'ye'
    ad_features = [(word_feats([w]), 'ad') for w in adResult]
    yellow_features = [(word_feats([w]), 'ye') for w in yellowResult]
    train_set = ad_features + yellow_features
    # Train the classifier
    classifier = NaiveBayesClassifier.train(train_set)

    # Classify a new sentence
    sentence = input("Enter a sentence: ")
    sentence = delComa(sentence)
    print("\n")
    words = list(jieba.cut(sentence))
    print(words)
    # Count how many words fall into each category
    ad = 0
    yellow = 0
    for word in words:
        classResult = classifier.classify(word_feats([word]))
        if classResult == 'ad':
            ad += 1
        if classResult == 'ye':
            yellow += 1
    # Show the proportions
    x = ad / len(words)
    y = yellow / len(words)
    print('Probability of advertisement: %.2f%%' % (x * 100))
    print('Probability of profanity: %.2f%%' % (y * 100))

Run it and check the output: the segmented word list is printed first, followed by the estimated probabilities for each category.

All resources can be downloaded from: download.csdn.net/download/pe…

5. Extensions

1. The data source can be changed; already-monitored data stored in a database can be loaded for training.

2. More categories can be added to make things easier for customer service, for example ads, profanity, suggestions to the official team, and so on, defined according to business needs.

3. High-probability results can be forwarded to other systems for automatic handling, speeding up problem resolution.

4. Player reports can be used to grow the data set.

5. The same idea can be applied to sensitive-word handling: provide a sensitive-word dictionary, then match against it to detect offenders.

6. It can be packaged as a web service that calls back into the game.

7. The model can learn while predicting: cases that customer service handles manually can be labelled and added straight into the data set, so the model keeps learning.
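Extension idea 5 above can be sketched very simply once chat text is segmented into words. The word list here is invented for illustration; in practice you would load it from a maintained dictionary file or database:

```python
# Minimal dictionary-based sensitive-word detection.
# The SENSITIVE set below is a hypothetical example list.
SENSITIVE = {'scam', 'cheat'}

def find_sensitive(words):
    """Return the sensitive words found in a segmented message."""
    return [w for w in words if w in SENSITIVE]

hits = find_sensitive(['this', 'game', 'is', 'a', 'scam'])
print(hits)  # → ['scam']
```

A set lookup is O(1) per word, so this scales well even with a large dictionary, and it can run alongside the classifier as a hard filter for known-bad terms.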

6. Problems encountered

1. Punctuation: if punctuation marks are not removed, they get matched as features too, which makes no sense.

2. Encoding: the file came back as binary at first; it took quite a while to sort out the encoding.

3. For the technology choice, I originally wanted to use deep learning, and I read through some solutions, but training on my computer is too slow, so I chose this approach to practice first.

4. The code is simple, but explaining the techniques is hard. The code was finished long ago, yet this article still took a whole weekend to write.

7. Summary:

When you run into a problem, look for a technical solution; once you know the solution, implement it; when you hit a bug, dig into it. Keep thinking things over and the answers will come. Every attempt is a good opportunity to learn.


Origin juejin.im/post/7022108393947004941