SPAM, Bayesian and Chinese 4 - Integrating the Bayesian Algorithm into CakePHP

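The previous post covered several open-source implementations of the Bayesian algorithm. This post shows how to integrate one of them, b8, into CakePHP.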

Downloading and installing b8

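Download the latest version from the b8 site and extract it into the vendors directory, so that the main file sits at vendors/b8/b8.php;
Open vendors/b8/etc/config_b8 in a text editor and set databaseType to mysql;
Open vendors/b8/etc/config_storage in a text editor, set tableName to the name of the table that will store the keywords, and set createDB to TRUE. Note that b8 creates this table on its first run, after which you must set createDB back to FALSE;
Open vendors/b8/lexer/shared_functions.php in a text editor and comment out the code on line 38 (inside echoError()); otherwise b8 prints its error messages straight into your Cake application. That output is still useful while debugging, of course.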

Writing a wrapper component for b8

To let your Cake application call b8, you need to write a component. Create spam_shield.php in controllers/components/ and add the following code:

class SpamShieldComponent extends Object {

    /**
     * b8 instance
     */
    var $b8;

    /**
     * Standard rating
     *
     * Comments with ratings higher than this one will be considered SPAM
     */
    var $standardRating = 0.7;

    /**
     * Text to be classified
     */
    var $text;

    /**
     * Rating of the text
     */
    var $rating;

    /**
     * Component startup callback: creates the b8 instance
     *
     * @date 2009-1-20
     */
    function startup(&$controller) {
        // register the Comment model to get the DBO resource link
        $comment = ClassRegistry::init('Comment');

        // import b8 and create an instance
        App::import('Vendor', 'b8/b8');
        $this->b8 = new b8($comment->getDBOResourceLink());

        // set the standard rating, keeping the default if nothing is configured
        $this->standardRating = Configure::read('LT.bayesRating') ? Configure::read('LT.bayesRating') : $this->standardRating;
    }

    /**
     * Set the text to be classified
     *
     * @param $text String the text to be classified
     * @date 2009-1-20
     */
    function set($text) {
        $this->text = $text;
    }

    /**
     * Get the Bayesian rating
     *
     * @date 2009-1-20
     */
    function rate() {
        // get the Bayes rating, store it, and return it so validate() can compare it
        $this->rating = $this->b8->classify($this->text);
        return $this->rating;
    }

    /**
     * Validate a message based on the rating, return true if it's NOT SPAM
     *
     * @date 2009-1-20
     */
    function validate() {
        return $this->rate() < $this->standardRating;
    }

    /**
     * Learn a SPAM or a HAM
     *
     * @date 2009-1-20
     */
    function learn($mode) {
        $this->b8->learn($this->text, $mode);
    }

    /**
     * Unlearn a SPAM or a HAM
     *
     * @date 2009-1-20
     */
    function unlearn($mode) {
        $this->b8->unlearn($this->text, $mode);
    }

}

A few notes:

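$standardRating is the cutoff. If the Bayesian probability is higher than this value, the comment is considered spam; otherwise it is ham. I set it to 0.7; adjust it to suit your own situation;
Configure::read('LT.bayesRating') fetches that cutoff dynamically from the application's runtime configuration. This is how I do it; you may not need it, so adapt it slightly or leave it unchanged as appropriate;
Comment refers to the model for comments;
b8 needs a database handle to operate on its table, which is why startup() contains the line $this->b8 = new b8($comment->getDBOResourceLink()). The getDBOResourceLink() method it calls is covered next.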
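
The cutoff logic described in these notes can be tried in isolation. The following is a minimal sketch, not part of the original component: resolveThreshold() and isHam() are hypothetical helper names, and the $configured argument stands in for whatever Configure::read('LT.bayesRating') would return in the real application.

```php
<?php
// Hypothetical helper: use the configured cutoff if one is set,
// otherwise fall back to the default (0.7 in the article).
function resolveThreshold($configured, $default = 0.7) {
    // A missing or empty setting (null, false, 0) falls back to the
    // default, mirroring the ternary used in startup().
    return $configured ? $configured : $default;
}

// Hypothetical helper: a comment counts as ham only when its rating
// is strictly below the cutoff, as in validate().
function isHam($rating, $threshold) {
    return $rating < $threshold;
}

$threshold = resolveThreshold(null);   // nothing configured, so the default applies
var_dump(isHam(0.35, $threshold));     // a low rating
var_dump(isHam(0.92, $threshold));     // a high rating
```

Because the comparison is a strict less-than, a rating exactly equal to the cutoff is treated as spam.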

Passing the database handle to b8

Add the following code to models/comment.php:

/**
 * get the resource link of MySQL connection
 */
public function getDBOResourceLink() {
    return $this->getDataSource()->connection;
}

That completes the preparation; we can finally use the Bayesian algorithm to classify comments.

Classifying comments with b8

In controllers/comments_controller.php, first load SpamShieldComponent:

var $components = array('SpamShield');

Then, in the add() method, do the following:

//set data for Bayesian validation
$this->SpamShield->set($this->data['Comment']['body']);

//validate the comment with Bayesian
if(!$this->SpamShield->validate()) {
    //set the status
    $this->data['Comment']['status'] = 'spam';

    //save
    $this->Comment->save($this->data);

    //learn it
    $this->SpamShield->learn("spam");

    //render
    $this->renderView('unmoderated');
    return;
}

//it's a normal post
$this->data['Comment']['status'] = 'published';

//save for publish
$this->Comment->save($this->data);

//learn it
$this->SpamShield->learn("ham");

With this in place, b8 classifies each comment and learns from it automatically as it arrives, and you are largely insulated from spam!

A reminder: after the first run, don't forget to set createDB, mentioned earlier, back to FALSE.
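
The component also wraps unlearn(), which the add() flow never calls. One plausible use, sketched here under assumptions, is correcting a misclassification: when a moderator flips a comment's status, the text is unlearned from the old category and learned into the new one. The RecordingFilter class and the reclassify() function are hypothetical stand-ins so that the example runs without b8 or a database; in the real application the same calls would go through $this->SpamShield.

```php
<?php
// Hypothetical stand-in for SpamShieldComponent: it only records the
// calls it receives, so the example runs without b8 or MySQL.
class RecordingFilter {
    public $text = '';
    public $calls = array();

    function set($text) {
        $this->text = $text;
    }

    function learn($mode) {
        $this->calls[] = 'learn:' . $mode;
    }

    function unlearn($mode) {
        $this->calls[] = 'unlearn:' . $mode;
    }
}

// Hypothetical helper: move a text from one category to the other.
// $from and $to are expected to be "spam" or "ham", as in the article.
function reclassify($filter, $text, $from, $to) {
    if ($from === $to) {
        return false; // nothing to correct
    }
    $filter->set($text);
    $filter->unlearn($from); // undo the wrong training first
    $filter->learn($to);     // then train with the corrected category
    return true;
}

$filter = new RecordingFilter();
reclassify($filter, 'Cheap watches, click here', 'ham', 'spam');
print_r($filter->calls);
```

Unlearning before relearning keeps the old, wrong category from continuing to count toward future ratings for the same text.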

http://dingyu.me/blog/spam-bayesian-chinese-4
