Basic Site-Search Optimization Based on Sphider, a Lightweight PHP Search Engine

Reposted from: https://blog.csdn.net/chijiaodaxie/article/details/48714373

Basic Site-Search Optimization

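1>. Overview:
A site search engine is, as the name suggests, a search engine for the information within one website. As the web has grown, the website has become the most important public face of a company or organization. Every day, large numbers of potential customers, partners, investors and analysts visit a company's site, and the impression it leaves directly shapes their opinion of the company. According to an IDC survey, when users arrive at a site and cannot quickly find the information they need, 50% of them leave immediately, and 60% of those never come back, which means the company permanently loses 30% of its potential customers.
Admittedly, I have not verified these figures, but they do show that the quality and accuracy of on-site search results matter a great deal to the user experience.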

Note: every "search engine" mentioned below refers specifically to site search.

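2>. What a good search engine should offer:
Besides a prominent, good-looking search box, the most important quality of a good site search is returning what the user is looking for quickly and accurately. A few extra features also improve the experience:
1. Autocomplete: it reduces mistyped queries and also lets us recommend products and product categories;
2. Auto-correction: compared with "no results found", showing some results always reduces visitor bounce. It is a double-edged sword, though: if the suggested words are of poor quality, the search looks unprofessional;
3. Related searches: synonym-based content recommendations give visitors search ideas they had not thought of, widen coverage, and can also increase clicks;
4. Result filtering, or searching within results: gives users a more precise search experience;
5. Sort order: if results carry several attributes, such as download count, click count or rating on the form site, sorting by them lets users find what they care about near the top;
6. Advanced search, and so on.
I will not list them all here.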

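3>. Core techniques of a search engine:
The techniques involved include: word segmentation (fortunately we do not run a Chinese-language site), page crawling and analysis (full-text retrieval), index building, search matching and ranking algorithms, and statistics, association and recommendation algorithms over search keywords.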

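4>. Common approaches to site search:

  1. Use the API of a large commercial search engine:

    For example the APIs of Google, Yahoo, or Baidu in China

    Pros: simple and convenient: register an account and call the API;
    Cons: 1. You cannot see the ranking mechanism, cannot control how results are displayed, and cannot easily tune it;
    2. The free versions show ads, which hurts the experience.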

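  2. Build it yourself:

    2.1) SQL LIKE queries:
    Simple to implement, but the search string has to match exactly or nothing is returned, and results for multi-keyword searches are poor;

    2.2) Tokenization-based search:
    There are several open-source projects. In the Java world the best known is Lucene, which has an excellent reputation and many other projects built on top of it. It handles large data volumes, is very fast, and can be integrated directly into a Java project. But it is written in Java, and the search first had to be embedded in the form site, so I reluctantly gave it up;
    The second open-source project I reviewed was Sphinx, written in C++ and also fairly mainstream. It is fast, its index is relatively large, and its search precision is not as good as Lucene's. I tried the precompiled exe and it really is fast; reportedly it can search hundreds of millions of records in milliseconds, with index builds taking hours. We may adopt it later;

    What both have in common, though, is that however impressive they are, they are large projects that expose a packaged interface for us to call. Changing their internal algorithms could mean tracing through a lot of code. Given the time constraints, I chose Sphider, a lightweight search framework. Its entire codebase is under 300 KB, probably ten thousand lines at most, and it is built on MySQL and PHP. After reading it I was delighted: this was exactly what we were looking for. Tracing through its code once gives a rough overall picture of how a search engine works. The search shown below ports and slightly adapts its search functionality: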

step 1>
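Obtain the source data: to search page content we would need a crawler that fetches the pages to be searched and stores them in the database. But our HTML converted from PDF is not very tidy, and the vast majority of search keywords are cat or post names, so we skipped this step and use the database fields directly as the source data;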

step 2>
Tokenize, extract keywords and build the index; code in SearchindexController.class.php
1) New tables: a keywords table, several keyword_post tables, several keyword_cat tables
2) Entry points: indexAllPost(), indexAllCat()
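
The tables listed above could be laid out roughly as follows. This is only a sketch: the table and column names are taken from the indexing code in this post, while the column types, the indexes, the `demo_` prefix, the shard counts and the helper name `build_search_schema()` are assumptions made for illustration, not the author's actual schema.

```php
<?php
/**
 * Build CREATE TABLE statements for the keyword table and its sharded relation tables.
 * Column types and indexes are assumptions; only the names come from the indexing code.
 */
function build_search_schema($prefix, $postShards, $catShards)
{
    $sql = array();
    // One row per distinct keyword (the post states that keyword is unique).
    $sql[] = "CREATE TABLE {$prefix}search_keywords ("
           . "keyword_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, "
           . "keyword VARCHAR(30) NOT NULL UNIQUE, "
           . "post_word_frequency INT UNSIGNED NOT NULL DEFAULT 0, "
           . "cat_word_frequency INT UNSIGNED NOT NULL DEFAULT 0)";
    // Keyword-to-post relations, split over several tables to keep each one small.
    for ($i = 0; $i < $postShards; $i++) {
        $sql[] = "CREATE TABLE {$prefix}search_post_keyword{$i} ("
               . "post_id INT UNSIGNED NOT NULL, "
               . "keyword_id INT UNSIGNED NOT NULL, "
               . "weight INT UNSIGNED NOT NULL, "
               . "cat_id INT UNSIGNED NOT NULL, "
               . "KEY idx_keyword (keyword_id), KEY idx_post (post_id))";
    }
    // Keyword-to-category relations, sharded the same way.
    for ($i = 0; $i < $catShards; $i++) {
        $sql[] = "CREATE TABLE {$prefix}search_cat_keyword{$i} ("
               . "cat_id INT UNSIGNED NOT NULL, "
               . "keyword_id INT UNSIGNED NOT NULL, "
               . "weight INT UNSIGNED NOT NULL, "
               . "parent_cat_id INT UNSIGNED NOT NULL, "
               . "KEY idx_keyword (keyword_id), KEY idx_cat (cat_id))";
    }
    return $sql;
}

// Usage: 4 post shards and 2 category shards (hypothetical values).
$statements = build_search_schema('demo_', 4, 2);
```

The 30-character keyword column mirrors the `strlen($word) <= 30` check in the indexing code; everything else about the types is a guess.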

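Note: tokenization is slow. Later I stepped outside the ThinkPHP framework and wrote a faster version in plain SQL; indexing 400,000 records takes about 4-5 minutes. That is still far behind Sphinx and similar tools. I will publish it when I get the chance.
Code logic:
->indexPost() & indexCat(): fetch the source data
->unique_word_array(): tokenize each record using several rules (separators, ignored words, stemming)
->compute the weight (description and similar fields are auto-generated and meaningless, so the weight is simply the number of times the keyword occurs in the source data)
->save_post_keywords(): insert into the keywords table (keyword is unique)
->save_post_keywords(): then insert into the relation tables (delete_post_keywords_relation() first deletes this post's rows from the relation tables; spreading the data over several tables eases the load on any single table)
At this point, tokenization is done!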
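
The pipeline above can be sketched in a few lines of plain PHP. This is a simplified illustration and not the controller from this post: it skips stemming, uses a whitelist regular expression instead of the original punctuation blacklist, and `count_keywords()`, `keyword_shard()` and the default parameter values are hypothetical names and values.

```php
<?php
/**
 * Split a name into keywords and count how often each one occurs.
 * Words must be longer than $minLength; counts are capped at $upperBound.
 */
function count_keywords($text, array $stopWords, $minLength = 2, $upperBound = 100)
{
    $text = strtolower($text);
    // Replace everything except letters, digits, apostrophes and hyphens with a space.
    $text = preg_replace("/[^a-z0-9'-]+/", ' ', $text);
    $counts = array();
    foreach (explode(' ', trim($text)) as $word) {
        // Drop leading and trailing apostrophes and hyphens.
        $word = trim($word, "'-");
        if (strlen($word) <= $minLength || in_array($word, $stopWords, true)) {
            continue;
        }
        $current = isset($counts[$word]) ? $counts[$word] : 0;
        $counts[$word] = min($current + 1, $upperBound);
    }
    return $counts;
}

/**
 * Pick the relation table for a keyword:
 * first hex digit of its md5 hash, modulo the number of tables.
 */
function keyword_shard($word, $tableCount)
{
    return hexdec(substr(md5($word), 0, 1)) % $tableCount;
}

// Usage: count the keywords of one post name and find each keyword's table.
$counts = count_keywords('Tax form: the tax return form', array('the'));
foreach ($counts as $word => $count) {
    $shard = keyword_shard($word, 4);
    // A real indexer would now insert ($word, $count) into relation table $shard.
}
```

Because only one hex digit of the hash is used, this scheme can address at most 16 relation tables.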

<?php
namespace Admin\Controller;
use Admin\Controller\CommonController;
/**
 * @author chijiaodaxie
 */
class SearchindexController extends CommonController {
    //only indexAllPost() and indexAllCat() are public entry points
    private $keywords_array = array();
    public function indexAllPost($reindex = 0){
        set_time_limit(0);
        $this->keywords_array = $this->get_all_keyword();
        // dump($this->keywords_array);
        $post_db = D('Post');
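        if($reindex){
            echo "Reindexing all posts at full speed<br/><br/>";
            $posts = $post_db->field(array('postid', 'name', 'catid'))->where(array('status'=>4))->select();
        }else{
            echo "Incremental indexing: indexing newly added posts<br/><br/>";
            $posts = $post_db->field(array('postid', 'name', 'catid'))->where(array(/*'status'=>4, */'indexed'=>0))->select();
        }
        if(!$posts){
            // select() returns no array when nothing matches; avoid iterating over it
            $posts = array();
        }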
        $post_ids = array();
        $failed_ids = array();
        foreach($posts as $post){
            $res = $this->indexPost($post['postid'], $post['name'], $post['catid']);
            if($res){
                $post_ids[] = $post['postid'];
            }else{
                $failed_ids[] = $post['postid'];
            }
        }
        if ($post_ids){
            // mark only the successfully indexed posts; an empty list would produce invalid SQL ("in ()")
            $post_id_str = implode(",", $post_ids);
            $data['indexed'] = 1;
            $post_db->where('postid in ('.$post_id_str.')')->save($data);
        }
        if ($failed_ids){
            echo count($failed_ids)." posts were not indexed successfully: <br/><br/>Hint: check for character-encoding problems<br/><br/>";
            echo "Success ids: (".implode(", ", $post_ids).")<br/><br/>";
            echo "Failed ids:    (".implode(", ", $failed_ids).")";
        }else{
            echo "Success ids: (".implode(", ", $post_ids).")<br/><br/>";
            echo "index success!!";
        }
    }
    private function indexPost($postid, $postname, $catid){
        $keywords = $this->unique_word_array($postname);
        $this->delete_post_keywords_relation($postid);
        $res = $this->save_post_keywords($keywords, $postid, $catid);
        if(!$res){
            return false;
        }
        return true;
    }
    public function indexAllCat($reindex = 0){
        set_time_limit(0);
        $this->keywords_array = $this->get_all_keyword();
        $cat_db = D('Category');
        if($reindex){
            echo "Reindexing all cats at full speed<br/><br/>";
            $cats = $cat_db->field(array('catid', 'catname', 'parentid'))->where(array('disabled'=>0, 'ismenu'=>1))->select();
        }else{
            echo "Incremental indexing: indexing newly added cats<br/><br/>";
            $cats = $cat_db->field(array('catid', 'catname', 'parentid'))->where(array('disabled'=>0, 'ismenu'=>1, 'indexed'=>0))->select();
        }
        if(!$cats){
            // select() returns no array when nothing matches; avoid iterating over it
            $cats = array();
        }
        $cat_ids = array();
        $failed_ids = array();
        foreach($cats as $cat){
            $res = $this->indexCat($cat['catid'], $cat['catname'], $cat['parentid']);
            if($res){
                $cat_ids[] = $cat['catid'];
            }else{
                $failed_ids[] = $cat['catid'];
            }
        }
        if ($cat_ids){
            // mark only the successfully indexed cats; an empty list would produce invalid SQL ("in ()")
            $cat_id_str = implode(",", $cat_ids);
            $data['indexed'] = 1;
            $cat_db->where('catid in ('.$cat_id_str.')')->save($data);
        }
        if ($failed_ids){
            echo count($failed_ids)." cats were not indexed successfully: <br/><br/>Hint: check for character-encoding problems<br/><br/>";
            echo "Success ids: (".implode(", ", $cat_ids).")<br/><br/>";
            echo "Failed ids:    (".implode(", ", $failed_ids).")";
        }else{
            echo "index success!!<br/><br/>";
            echo "Success ids: (".implode(", ", $cat_ids).")";
        }
    }
    private function indexCat($catid, $catname, $parentid){
        $keywords = $this->unique_word_array($catname);
        $this->delete_cat_keywords_relation($catid);
        $res = $this->save_cat_keywords($keywords, $catid, $parentid);
        if(!$res){
            return false;
        }
        return true;
    }
    private function unique_word_array($str){
        if(is_array($str) && !empty($str)){
            $str = implode(" ", $str);
        }
        $str = strtolower($str);
        $str = preg_replace("/&nbsp;/", " ", $str);
        $str = preg_replace("/[\*\^\+\?\\\.\[\]\^\$\|\{\)\(\}~!\"\/@#£$%&=`´;><:,]+/", " ", $str);
        $str = preg_replace('/\s+/', ' ', $str);
        $arr = explode(" ", $str);
        $min_word_length = C('MIN_WORD_LENGTH');
        $word_upper_bound = C('WORD_UPPER_BOUND');
        $index_numbers = C('INDEX_NUMBER');
        $stem_words = C('STEM_WORDS');
        $common = $this->get_common_word();
        if ($stem_words == 1) {
            $stem_word = new \Common\Plugin\Stem();
            $newarr = array();
            foreach ($arr as $val) {
                $newarr[] = $stem_word->stem($val);
            }
            $arr = $newarr;
        }
        sort($arr);
        reset($arr);
        $newarr = array();
        $i = 0;
        $counter = 1;
        $element = current($arr);
        if ($index_numbers == 1) {
            $pattern = "/[a-z0-9]+/";
        } else {
            $pattern = "/[a-z]+/";
        }
        $regs = array();
        for ($n = 0; $n < sizeof($arr); $n ++) {
            //check if word is long enough, contains alphabetic characters and is not a common word
            //to eliminate/count multiple instance of words
            $next_in_arr = next($arr);
            if ($next_in_arr != $element) {
                // $element = rtrim($element, ".,");
                if (preg_match("/^(-|\\\')(.*)/", $element, $regs))
                    $element = $regs[2];
                if (preg_match("/(.*)(\\\'|-|\'s|\')$/", $element, $regs))
                    $element = $regs[1];
                if (strlen($element) > $min_word_length && preg_match($pattern, $this->remove_accents($element)) && (@ $common[$element] <> 1)) {
                    $newarr[$i][1] = $element;
                    $newarr[$i][2] = $counter;
                    $element = current($arr);
                    $i ++;
                    $counter = 1;
                } else {
                    $element = $next_in_arr;
                    $counter = 1;
                }
            } else {
                if ($counter < $word_upper_bound)
                    $counter ++;
            }
        }
        // var_dump($newarr);
        return $newarr;
    }
    /*
    * save the keywords to the post relation tables
    */
    private function save_post_keywords($keywords, $post_id, $cat_id){
        $table_num = C('POST_KEYWORDS_NUM');
        $keywords_db = M('search_keywords');
        $inserts = array();
        foreach($keywords as $keyword){
            $word = $keyword[1];
            $weight = $keyword[2];
            if (strlen($word) > 30) {
                continue;
            }
            // pick the relation table from the first hex digit of the md5 hash
            $wordmd5 = (int)(hexdec(substr(md5($word), 0, 1)))%$table_num;
            if (isset($this->keywords_array[$word])) {
                // known keyword: just bump its frequency
                $keyword_id = $this->keywords_array[$word];
                $keywords_db->where(array('keyword'=>$word))->setInc('post_word_frequency', 1);
            } else {
                $keyword_id = $keywords_db->add(array('keyword'=>$word, 'post_word_frequency'=>1));
                if (!$keyword_id) {
                    // the insert failed, most likely because the keyword already exists (keyword is unique)
                    $keywords_db->where(array('keyword'=>$word))->setInc('post_word_frequency', 1);
                    $thisword = $keywords_db->where(array('keyword'=>$word))->find();
                    if (!$thisword) {
                        return false;
                    }
                    $keyword_id = $thisword['keyword_id'];
                }
                $this->keywords_array[$word] = $keyword_id;
            }
            $inserts[$wordmd5][] = array('post_id'=>$post_id, 'keyword_id'=>$keyword_id, 'weight'=>$weight * 10, 'cat_id'=>$cat_id);
        }
        // shard indexes run from 0 to $table_num - 1
        for ($i = 0; $i < $table_num; $i++) {
            if (!empty($inserts[$i])) {
                $post_keyword_db = M('search_post_keyword'.$i);
                $res = $post_keyword_db->addAll($inserts[$i]);
                if(!$res){
                    return false;
                }
            }
        }
        return true;
    }
    /*
    * save the keywords to the cat relation tables
    */
    private function save_cat_keywords($keywords, $cat_id, $parent_id){
        $table_num = C('CAT_KEYWORDS_NUM');
        $keywords_db = M('search_keywords');
        $inserts = array();
        foreach($keywords as $keyword){
            $word = $keyword[1];
            $weight = $keyword[2];
            if (strlen($word) > 30) {
                continue;
            }
            // pick the relation table from the first hex digit of the md5 hash
            $wordmd5 = (int)(hexdec(substr(md5($word), 0, 1)))%$table_num;
            if (isset($this->keywords_array[$word])) {
                // known keyword: just bump its frequency
                $keyword_id = $this->keywords_array[$word];
                $keywords_db->where(array('keyword'=>$word))->setInc('cat_word_frequency', 1);
            } else {
                $keyword_id = $keywords_db->add(array('keyword'=>$word, 'cat_word_frequency'=>1));
                if (!$keyword_id) {
                    // the insert failed, most likely because the keyword already exists (keyword is unique)
                    $keywords_db->where(array('keyword'=>$word))->setInc('cat_word_frequency', 1);
                    $thisword = $keywords_db->where(array('keyword'=>$word))->find();
                    if (!$thisword) {
                        return false;
                    }
                    $keyword_id = $thisword['keyword_id'];
                }
                $this->keywords_array[$word] = $keyword_id;
            }
            $inserts[$wordmd5][] = array('cat_id'=>$cat_id, 'keyword_id'=>$keyword_id, 'weight'=>$weight * 10, 'parent_cat_id'=>$parent_id);
        }
        // $inserts is keyed by the numeric shard index; only the table suffix is hexadecimal
        for ($i = 0; $i < $table_num; $i++) {
            if (!empty($inserts[$i])) {
                $cat_keyword_db = M('search_cat_keyword'.dechex($i));
                $res = $cat_keyword_db->addAll($inserts[$i]);
                if(!$res){
                    return false;
                }
            }
        }
        return true;
    }
    /*
    * before indexing, delete any relations that may already exist
    */
    private function delete_post_keywords_relation($postid){
        $table_num = C('POST_KEYWORDS_NUM');
        for($i=0; $i<$table_num; $i++){
            // use the decimal shard index, matching save_post_keywords() and the search queries
            $db_name = 'search_post_keyword'.$i;
            $relation_db = M($db_name);
            $relation_db->where(array('post_id'=>$postid))->delete();
        }
    }
    /*
    * before indexing, delete any relations that may already exist
    */
    private function delete_cat_keywords_relation($catid){
        $table_num = C('CAT_KEYWORDS_NUM');
        for($i=0; $i<$table_num; $i++){
            $char = dechex($i);
            $relation_db = M('search_cat_keyword'.$char);
            $relation_db->where(array('cat_id'=>$catid))->delete();
        }
    }
    /*
    * get all keywords that already exist
    */
    private function get_all_keyword(){
        $keywords_array = array();
        $keywords_db = M('search_keywords');
        $keywords = $keywords_db->select();
        // dump($keywords);
        if($keywords){
            foreach($keywords as $keyword){
                $keywords_array[$keyword['keyword']] = $keyword['keyword_id'];
            }
        }
        return $keywords_array;
    }
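    /*
    * replace accented characters with plain ASCII letters
    * note: the three-argument strtr() works byte by byte, so this assumes a single-byte encoding such as ISO-8859-1
    */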
    private function remove_accents($string) {
        return (strtr($string, "ÀÁÂÃÄÅÆàáâãäåæÒÓÔÕÕÖØòóôõöøÈÉÊËèéêëðÇçÐÌÍÎÏìíîïÙÚÛÜùúûüÑñÞßÿý",
                      "aaaaaaaaaaaaaaoooooooooooooeeeeeeeeecceiiiiiiiiuuuuuuuunntsyy"));
    }
    /*
    * common words that should be ignored
    */
    private function get_common_word(){
        $common = array();
        $lines = @file('common.txt');
        if (is_array($lines)) {
            // each() was removed in PHP 8; foreach behaves the same here
            foreach ($lines as $word) {
                $common[trim($word)] = 1;
            }
        }
        return $common;
    }
}

step 3>
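Search; code in searchModel.class.php
New table: search_query_log
Entry point: getPostByQueryString()

Code logic:
->makeboollist(): tokenize the query (apart from a few special requirements, the rules must match those used to tokenize the source data, which is what keeps the search accurate; special terms such as excluded words and quoted phrases are extracted separately)
->search(): run a search for each class of terms:
1) phrases are matched with SQL LIKE; single words and excluded words are looked up by joining the relation tables with the keywords table;
2) decide whether terms are combined with AND or OR, merge the results and compute their weights, keeping an eye on time complexity;
3) if nothing is found, move on to the suggestion step, which uses soundex() and levenshtein();
4) if there are results, decide whether cat control applies and use the corresponding ranking algorithm;
if there are no results, show the default content;
5) fetch the corresponding content by postid;
6) once the search is done, write a log entry (its content helps later statistics, analysis and tuning);
7) render the results in the dynamic front end...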
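
The suggestion step in 3) can be sketched on its own. The code below is a minimal illustration under stated assumptions: it uses PHP's `soundex()` in place of MySQL's `SOUNDEX()` (the two can produce different codes), takes the known keywords as a plain array instead of querying the keywords table, and `suggest_keyword()` is a hypothetical helper name. The distance limit of 4 matches the check in the search code of this post.

```php
<?php
/**
 * Suggest a replacement for a possibly misspelled word.
 * $knownKeywords stands in for the rows of the keywords table (assumption).
 * Returns the closest keyword that sounds alike, or null when there is none.
 */
function suggest_keyword($word, array $knownKeywords, $maxDistance = 4)
{
    $word = strtolower($word);
    $best = null;
    $bestDistance = $maxDistance; // only distances strictly below this are accepted
    foreach ($knownKeywords as $candidate) {
        // Consider only keywords that sound like the query word.
        if (soundex($candidate) !== soundex($word)) {
            continue;
        }
        // Among those, keep the one with the smallest edit distance.
        $distance = levenshtein($candidate, $word);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $candidate;
        }
    }
    // A word that is already a known keyword needs no suggestion.
    return ($best !== null && $best !== $word) ? $best : null;
}

// Usage with a hypothetical keyword list.
$keywords = array('spider', 'speaker', 'invoice');
$suggestion = suggest_keyword('sphider', $keywords);
```

Filtering by sound first keeps the number of edit-distance computations small, which matters when the keyword table is large.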

<?php
namespace Home\Model;
use Think\Model;
/**
*@Author chijiaodaxie
*/
class SearchModel extends Model{
    private $entities = array(
        "&amp" => "&",
        "&apos" => "'",
        "&THORN;"  => "Þ",
        "&szlig;"  => "ß",
        "&agrave;" => "à",
        "&aacute;" => "á",
        "&acirc;"  => "â",
        "&atilde;" => "ã",
        "&auml;"   => "ä",
        "&aring;"  => "å",
        "&aelig;"  => "æ",
        "&ccedil;" => "ç",
        "&egrave;" => "è",
        "&eacute;" => "é",
        "&ecirc;"  => "ê",
        "&euml;"   => "ë",
        "&igrave;" => "ì",
        "&iacute;" => "í",
        "&icirc;"  => "î",
        "&iuml;"   => "ï",
        "&eth;"    => "ð",
        "&ntilde;" => "ñ",
        "&ograve;" => "ò",
        "&oacute;" => "ó",
        "&ocirc;"  => "ô",
        "&otilde;" => "õ",
        "&ouml;"   => "ö",
        "&oslash;" => "ø",
        "&ugrave;" => "ù",
        "&uacute;" => "ú",
        "&ucirc;"  => "û",
        "&uuml;"   => "ü",
        "&yacute;" => "ý",
        "&thorn;"  => "þ",
        "&yuml;"   => "ÿ",
        "&THORN;"  => "Þ",
        "&szlig;"  => "ß",
        "&Agrave;" => "à",
        "&Aacute;" => "á",
        "&Acirc;"  => "â",
        "&Atilde;" => "ã",
        "&Auml;"   => "ä",
        "&Aring;"  => "å",
        "&Aelig;"  => "æ",
        "&Ccedil;" => "ç",
        "&Egrave;" => "è",
        "&Eacute;" => "é",
        "&Ecirc;"  => "ê",
        "&Euml;"   => "ë",
        "&Igrave;" => "ì",
        "&Iacute;" => "í",
        "&Icirc;"  => "î",
        "&Iuml;"   => "ï",
        "&ETH;"    => "ð",
        "&Ntilde;" => "ñ",
        "&Ograve;" => "ò",
        "&Oacute;" => "ó",
        "&Ocirc;"  => "ô",
        "&Otilde;" => "õ",
        "&Ouml;"   => "ö",
        "&Oslash;" => "ø",
        "&Ugrave;" => "ù",
        "&Uacute;" => "ú",
        "&Ucirc;"  => "û",
        "&Uuml;"   => "ü",
        "&Yacute;" => "ý",
        "&Yhorn;"  => "þ",
        "&Yuml;"   => "ÿ"
        );
    public function getPostByQueryString($query, $page_num, $pagesize){
        $starttime = $this->getmicrotime();
        if (substr_count($query,'"')==1){
           $query=str_replace('"','',$query);
        }
        $words = $this->makeboollist($query);
        // dump($words);
        $data = $this->search($words, $page_num, $pagesize);
        // dump($data);
        if(isset($data['did_you_mean'])){
            // dump($data['did_you_mean']);
            $words['hilight'] = $words['+'] = array_values($data['did_you_mean']);
            // dump($words);
            // dump($words['hilight']);
            $data_suggest = $this->search($words, $page_num, $pagesize);
            $did_you_mean_b=$query;
            $did_you_mean=$query;
            // each() was removed in PHP 8; foreach behaves the same here
            foreach ($data['did_you_mean'] as $key => $val) {
                if ($key != $val && !stristr("<font color=#D54955><b>", $key) && !stristr("</b></font>", $key)) {
                    $did_you_mean_b = str_replace($key, "<font color=#D54955><b>$val</b></font>", $did_you_mean_b);
                    $did_you_mean = str_replace($key, "$val", $did_you_mean);
                }
            }
            // url-encode the corrected query so spaces and special characters survive in the link
            $a_href = "<a href=\"/search?q=".urlencode($did_you_mean)."\">";
            $data = $data_suggest;
            $data['did_you_mean'] = $a_href.$did_you_mean_b."</a>";
            // dump($data['did_you_mean']);
            $data['results_suggest'] = $data_suggest['results'];
            $data['results'] = 0;
        }
        $time = $this->getmicrotime() - $starttime;
        $data['time'] = $time;
        // dump($data);
        return $data;
    }
    public function makeboollist($query){
        //convert HTML entities to characters
        $stem_words = C('STEM_WORDS');
        // each() was removed in PHP 8; foreach behaves the same here
        foreach ($this->entities as $entity => $char){
            $query = preg_replace("/".$entity."/i", $char, $query);
        }
        $query = preg_replace("/&quot;/i", "\"", $query);
        $query = trim($query);
        $returnWords = array();
        //get all phrases
        $regs = array();
        // dump($query);
        while (preg_match("/([-]?)\"([^\"]+)\"/", $query, $regs)) {
            if ($regs[1] == '') {
                $returnWords['+s'][] = $regs[2];
                $returnWords['hilight'][] = $regs[2];
            } else {
                $returnWords['-s'][] = $regs[2];
            }
            $query = str_replace($regs[0], "", $query);
        }
        $query=str_replace('"','',$query);
        $query = preg_replace("/[\*\^\+\?\\\.\[\]\^\$\|\{\)\(\}~!\"\/@#£$%&=`´;><:,-]+/", " ", $query);
        $query = strtolower(preg_replace("/[ ]+/", " ", $query));
//      $query = remove_accents($query);
        $query = trim($query);
        $words = explode(' ', $query);
        if (!$query) {
            $limit = 0;
        } else {
            $limit = count($words);
        }
        $k = 0;
        //get all words (both include and exlude)
        $includeWords = array();
        while ($k < $limit) {
            if (substr($words[$k], 0, 1) == '+') {
                $includeWords[] = substr($words[$k], 1);
                $returnWords['hilight'][] = substr($words[$k], 1);
                if (!($this->ignoreWord(substr($words[$k], 1)))) {
                    if ($stem_words == 1) {
                        $stem_word = new \Common\Plugin\Stem();
                        $word = $stem_word->stem(substr($words[$k], 1));
                        if($word != substr($words[$k], 1)){
                            $returnWords['hilight'][] = $word;
                            $includeWords[] = $word;
                        }
                    }
                }
            } else if (substr($words[$k], 0, 1) == '-') {
                $returnWords['-'][] = substr($words[$k], 1);
                if ($stem_words == 1){
                    $stem_word = new \Common\Plugin\Stem();
                    $word = $stem_word->stem(substr($words[$k], 1));
                    if ($word != substr($words[$k], 1)){
                        $returnWords['-'][] = $word;
                    }
                }
            } else {
                $includeWords[] = $words[$k];
                $returnWords['hilight'][] = $words[$k];
                if (!($this->ignoreWord($words[$k]))) {
                    if ($stem_words == 1) {
                        $stem_word = new \Common\Plugin\Stem();
                        $word = $stem_word->stem($words[$k]);
                        if ($word != $words[$k]){
                            $returnWords['hilight'][] = $word;
                            $includeWords[] = $word;
                        }
                    }
                }
            }
            $k++;
        }
        // add words from phrases to includes
        if (isset($returnWords['+s'])) {
            foreach ($returnWords['+s'] as $phrase) {
                $phrase = strtolower(preg_replace("/[ ]+/", " ", $phrase));
                $phrase = trim($phrase);
                $temparr = explode(' ', $phrase);
                foreach ($temparr as $w){
                    $includeWords[] = $w;
                    $returnWords['hilight'][] = $w;
                    if (!($this->ignoreWord($w))) {
                        if ($stem_words == 1) {
                            $stem_word = new \Common\Plugin\Stem();
                            $word = $stem_word->stem($w);
                            if ($word != $w){
                                $includeWords[] = $word;
                                // highlight the stem as well, as the single-word branches do
                                $returnWords['hilight'][] = $word;
                            }
                        }
                    }
                }
            }
        }
        foreach ($includeWords as $word) {
            if($word){
                if ($this->ignoreWord($word)) {
                    $returnWords['ignore'][] = $word;
                } else {
                    $returnWords['+'][] = $word;
                }
            }
        }
        return $returnWords;
    }
    //need search two
    public function search($searchstr, $start = 1, $per_page = 15, $type = "or"){
        $db_host = C('DB_HOST');
        $db_name = C('DB_NAME');
        $db_user = C('DB_USER');
        $db_paw = C('DB_PWD');
        $conn = new \mysqli($db_host, $db_user, $db_paw, $db_name);
        // $conn = new \mysqli('localhost', 'root', 'moma');
        $merge_show_results = C('MERGE_SHOW_RESULT');
        $mysql_table_prefix = C('DB_PREFIX');
        $did_you_mean_enabled = C('DID_YOU_MEAN_ENABLED');
        $stem_words = C('STEM_WORDS');
        $table_num = C('POST_KEYWORDS_NUM');
        $possible_to_find = 1;
        //find all sites that should not be included in the result
        if (empty($searchstr['+'])){
            // nothing to search for; empty() also covers a missing '+' key
            $conn->close();
            return null;
        }
        $wordarray = isset($searchstr['-']) ? $searchstr['-'] : array();
        $notlist = array();
        $not_words = 0;
        while ($not_words < count($wordarray)){
            // if ($stem_words == 1) {
            //  $searchword = addslashes(stem($wordarray[$not_words]));
            // } else {
            //  $searchword = addslashes($wordarray[$not_words]);
            // }
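            $rawword = $wordarray[$not_words];
            // hash the raw word exactly as the indexer does, then escape it for SQL
            $wordmd5 = (int)(hexdec(substr(md5($rawword), 0, 1)))%$table_num;
            $searchword = $conn->real_escape_string($rawword);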
            $query1 = "SELECT post_id from ".$mysql_table_prefix."search_post_keyword$wordmd5, ".$mysql_table_prefix."search_keywords where ".$mysql_table_prefix."search_post_keyword$wordmd5.keyword_id= ".$mysql_table_prefix."search_keywords.keyword_id and keyword='$searchword'";
            $result = $conn->query($query1);
            while($row = $result->fetch_row()){
                $notlist[$not_words]['id'][$row[0]] = 1;
            }
            $result->close();
            $not_words++;
        }
        // echo "notlist: ";
        // dump($notlist);
        //find all sites containing the search phrase
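        $wordarray = isset($searchstr['+s']) ? $searchstr['+s'] : array();
        $phraselist = array();   // per-phrase sets of matching post ids
        $phraseweight = array(); // extra weight for posts that match a phrase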
        $phrase_words = 0;
        while ($phrase_words < count($wordarray)) {
        // dump($wordarray);
            $searchword = $conn->real_escape_string($wordarray[$phrase_words]);
            // dump($searchword);
            // match the phrase at the start of the name or after a space
            $query1 = "SELECT postid from ".$mysql_table_prefix."post where name like '$searchword%' or name like '% $searchword%'";
            $result = $conn->query($query1);
            $num_rows = $result->num_rows;
                    // dump($num_rows);
            if ($num_rows == 0) {
                if($type != "or"){
                    $possible_to_find = 0;
                    break;
                }
            }
            while ($row = $result->fetch_row()) {
                $phraselist[$phrase_words]['id'][$row[0]] = 1;
                if(isset($phraseweight['id'][$row[0]])){
                    $phraseweight['id'][$row[0]] += 50;
                }else{
                    $phraseweight['id'][$row[0]] = 50;                    
                }
            }
            $result->close();
            $phrase_words++;
        }
        // echo "phraselist: ";
        // dump($phraselist);
        //find all sites that include the search word       
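        $wordarray = $searchstr['+'];
        $linklist = array();  // per-word result lists; stays empty when no keyword matches
        $post_cats = array(); // post id => category id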
        $words = 0;
        // dump($wordarray);
        $starttime = $this->getmicrotime();
        while (($words < count($wordarray)) && $possible_to_find == 1) {
            // if ($stem_words == 1) {
            //  $searchword = addslashes(stem($wordarray[$words]));
            // } else {
            //  $searchword = addslashes($wordarray[$words]);
            // }
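            $rawword = $wordarray[$words];
            // hash the raw word exactly as the indexer does, then escape it for SQL
            $wordmd5 = (int)(hexdec(substr(md5($rawword), 0, 1)))%$table_num;
            $searchword = $conn->real_escape_string($rawword);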
            $query1 = "SELECT distinct post_id, weight, cat_id from ".$mysql_table_prefix."search_post_keyword$wordmd5, ".$mysql_table_prefix."search_keywords where ".$mysql_table_prefix."search_post_keyword$wordmd5.keyword_id= ".$mysql_table_prefix."search_keywords.keyword_id and keyword='$searchword' order by weight desc";
            $result = $conn->query($query1);
            $num_rows = $result->num_rows;
            // dump($num_rows);
            if ($num_rows == 0) {
                if ($type != "or") {
                    $possible_to_find = 0;
                    break;
                }
            }
            if ($type == "or") {
                $indx = 0;
            } else {
                $indx = $words;
            }
            while ($row = $result->fetch_row()) {    
                $linklist[$indx]['id'][] = $row[0];
                $post_cats[$row[0]] = $row[2];
                if(isset($linklist[$indx]['weight'][$row[0]])){
                    $linklist_match_times[$indx][$row[0]] += 1;
                    $match_times = $linklist_match_times[$indx][$row[0]];
                    $linklist[$indx]['weight'][$row[0]] = $linklist[$indx]['weight'][$row[0]] * ($match_times)/($match_times - 1) + $row[1] * $match_times; //note: mysqli returns these values as strings, which PHP coerces to numbers
                }else{
                    $linklist_match_times[$indx][$row[0]] = 1;
                    $linklist[$indx]['weight'][$row[0]] = $row[1];
                }
            }
            $result->close();
            $words++;
        }
        // echo "linklist: ";
        // dump($linklist);
        if ($type == "or") {
            $words = 1;
        }
        $result_array_full = array();
        if ($possible_to_find !=0) {
            if ($words == 1 && $not_words == 0) { //if there is only one search word, we already have the result (category filtering was not ported, so $category is not checked)
                $result_array_full = isset($linklist[0]['weight']) ? $linklist[0]['weight'] : array();
            } else { 
            //otherwise build an intersection of all the results
                $j= 1;
                $min = 0;
                //find the leastmatch keyword
                while ($j < $words) {
                    if (count($linklist[$min]['id']) > count($linklist[$j]['id'])) {
                        $min = $j;
                    }
                    $j++;
                }
                $j = 0;
                $temp_array = isset($linklist[$min]['id']) ? $linklist[$min]['id'] : array();
                $count = 0;
                while ($j < count($temp_array)){
                    $k = 0; //and word counter
                    $n = 0; //not word counter
                    $o = 0; //phrase word counter
                    $weight = 0;//	$weight = 1;
                    $break = 0;
                    while ($k < $words && $break== 0) {
                        if ($linklist[$k]['weight'][$temp_array[$j]] > 0) {
                            $weight = $weight + $linklist[$k]['weight'][$temp_array[$j]];
                        } else {
                            $break = 1;
                        }
                        $k++;
                    }
                    while ($n < $not_words && $break== 0) {
                        if ($notlist[$n]['id'][$temp_array[$j]] > 0) {
                            $break = 1;
                        }
                        $n++;
                    }               
                    while ($o < $phrase_words && $break== 0) {
                        if ($phraselist[$o]['id'][$temp_array[$j]] != 1) {
                            $break = 1;
                        }
                        $o++;
                    }
                    // if ($break== 0 && $category > 0 && $category_list[$temp_array[$j]] != 1) {
                    //  $break = 1;
                    // }
                    if ($break == 0) {
                        $result_array_full[$temp_array[$j]] = $weight;
                        $count ++;
                    }
                    $j++;
                }
            }
        }
        foreach($result_array_full as $phrase_postid=>$result_array_add_phrase){
            // echo "aaaa";
            if (isset($phraseweight['id'][$phrase_postid])){
                $result_array_full[$phrase_postid] += $phraseweight['id'][$phrase_postid];
            }
        }
        $end = $this->getmicrotime() - $starttime;
        // echo $end;
        // echo "result_full: ";
        // dump($result_array_full);
        //nothing matched: the query may be misspelled, so look for similar keywords
        $res = array();
        if ((empty($result_array_full) || $possible_to_find == 0) && empty($linklist) && $did_you_mean_enabled == 1) {
            $near_words = array();
            foreach ($searchstr['+'] as $word) {
                $escaped_word = $conn->real_escape_string($word);
                $result = $conn->query("select keyword from ".$mysql_table_prefix."search_keywords where soundex(keyword) = soundex('$escaped_word')");
                $max_distance = 100;
                $near_word ="";
                while ($row=$result->fetch_row()) {
                    // keep the closest keyword whose edit distance is below 4
                    $distance = levenshtein($row[0], $word);
                    if ($distance < $max_distance && $distance <4) {
                        $max_distance = $distance;
                        $near_word = $row[0];
                    }
                }
                $result->close();
                if ($near_word != "" && $word != $near_word) {
                    $near_words[$word] = $near_word;
                }
            }
            if ($near_words) {
                $res['did_you_mean'] = $near_words;
            }
            $res['results'] = 0;
            $conn->close();
            return $res;
        }
        if (empty($result_array_full)) {
            $res['results'] = 0;
            $conn->close();
            return $res;
        }
        arsort($result_array_full); // highest weight first
        if ($merge_show_results) {
            // Keep the best post of each category, plus at most one runner-up
            // that is shown right after it.
            foreach ($result_array_full as $key => $value) {
                if (!isset($post_cats_to_show[$post_cats[$key]])) {
                    $result_array_temp[$key] = $value;
                    $post_cats_to_show[$post_cats[$key]] = 1;
                } else if ($post_cats_to_show[$post_cats[$key]] == 1) {
                    $post_cats_to_show[$post_cats[$key]] = array($key => $value);
                }
            }
        } else {
            $result_array_temp = $result_array_full;
        }
        foreach ($result_array_temp as $key => $value) {
            $result_array[$key] = $value;
            if (isset($post_cats_to_show[$post_cats[$key]]) && $post_cats_to_show[$post_cats[$key]] != 1) {
                // The stored array holds exactly one runner-up: post id => weight.
                foreach ($post_cats_to_show[$post_cats[$key]] as $k => $v) {
                    $result_array[$k] = $v;
                }
            }
        }
        $results = count($result_array);
        $keys = array_keys($result_array);
        $maxweight = $result_array[$keys[0]];
        // Collect the ids of the posts on the requested page ($start is the 1-based page number).
        $in = array_slice($keys, ($start - 1) * $per_page, $per_page);
        if (empty($in)) {
            $res['results'] = $results;
            $conn->close();
            return $res;
        }
        $inlist = implode(",", array_map('intval', $in));
        $query1 = "SELECT distinct postid, title, name, slug, catid, description , thumbpath, filepages, vote, votetimes, viewtimes, downloadtimes FROM ".$mysql_table_prefix."post WHERE postid in ($inlist)";
        $result = $conn->query($query1);
        $i = 0;
        $data = array();
        while ($row = $result->fetch_row()) {
            $data[$i]['postid'] = $row[0];
            $data[$i]['title'] = $row[1];
            $data[$i]['name'] = $row[2];
            // color_name is the name with every matched search word wrapped in highlight markup.
            $data[$i]['color_name'] = $data[$i]['name'];
            foreach ($searchstr['hilight'] as $word) {
                // Skip words that occur inside the highlight markup itself (e.g. "font" or "b"),
                // otherwise a later pass would corrupt the tags inserted by an earlier one.
                if (!stristr("<font color=#D54955><b>", $word) && !stristr("</b></font>", $word)) {
                    // preg_quote() keeps characters such as "+" or "(" in the word from being read as regex syntax.
                    $data[$i]['color_name'] = preg_replace("/(".preg_quote($word, "/").")/i", "<font color=#D54955><b>$1</b></font>", $data[$i]['color_name']);
                }
            }
            $data[$i]['slug'] = $row[3];
            $data[$i]['catid'] = $row[4];
            $data[$i]['description'] = $row[5];
            $data[$i]['thumbpath'] = $row[6];
            $data[$i]['filepages'] = $row[7];
            $data[$i]['vote'] = $row[8];
            $data[$i]['votetimes'] = $row[9];
            $data[$i]['viewtimes'] = $row[10];
            $data[$i]['downloadtimes'] = $row[11];
            $data[$i]['weight'] = $result_array[$row[0]];
            $dom_result = $conn->query("select catname from ".$mysql_table_prefix."category where catid='".$post_cats[$row[0]]."'");
            $dom_row = $dom_result->fetch_row();
            $data[$i]['catname'] = $dom_row[0];
            $i++;
        }
        $result->close();
        if ($merge_show_results) {
            $this->sort_with_domains($data);
        } else {
            // cmp() is a method of this class, so it is passed as a callable; the rows to sort are in $data.
            usort($data, array($this, 'cmp'));
        }
        $res['data'] = $data;
        $res['maxweight'] = $maxweight;
        $res['results'] = $results;
        $conn->close();
        return $res;
    }
    /*
    * Swap into position $start the row that should be shown next: the first
    * remaining row whose catid equals $catid if there is one, otherwise the
    * remaining row with the highest weight.
    */
    public function swap_max (&$arr, $start, $catid) {
        $pos  = $start;
        $maxweight = $arr[$pos]['weight'];
        for  ($i = $start; $i< count($arr); $i++) {
            // Keep posts that share a catid next to each other
            if ($arr[$i]['catid'] == $catid) {
                $pos = $i;
                $maxweight = $arr[$i]['weight'];
                break;
            }
            if ($arr[$i]['weight'] > $maxweight) {
                $pos = $i;
                $maxweight = $arr[$i]['weight'];
            }
        }
        $temp = $arr[$start];
        $arr[$start] = $arr[$pos];
        $arr[$pos] = $temp;
    }
    /*
    * Selection sort by weight that also pulls posts of the same category together.
    */
    public function sort_with_domains (&$arr) {
        $catid = -1;
        for  ($i = 0; $i < count($arr)-1; $i++) {
            $this->swap_max($arr, $i, $catid);
            $catid = $arr[$i]['catid'];
        }
    }
    /*
    * usort() callback: higher weight first.
    */
    public function cmp($a, $b) {
        if ($a['weight'] == $b['weight']) {
            return 0;
        }
        return ($a['weight'] > $b['weight']) ? -1 : 1;
    }
    /*
    * Return 1 if the query word should be ignored (too short, no indexable
    * characters, or listed in common.txt), 0 otherwise.
    */
    public function ignoreWord($query_word) {
        $common = array();
        $lines = @file('common.txt');
        if (is_array($lines)) {
            foreach ($lines as $word) {
                $common[trim($word)] = 1;
            }
        }
        $min_word_length = C('MIN_WORD_LENGTH');
        $index_numbers = C('INDEX_NUMBER');
        if ($index_numbers == 1) {
            $pattern = "[a-z0-9]+";
        } else {
            $pattern = "[a-z]+";
        }
        if (strlen($query_word) < $min_word_length || (!preg_match("/".$pattern."/i", $this->remove_accents($query_word))) || isset($common[$query_word])) {
            return 1;
        } else {
            return 0;
        }
    }
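    /*
    * Replace accented Latin letters with their plain ASCII equivalents.
    * Note: strtr() with two string arguments maps byte by byte, so this table
    * is only correct for single-byte (e.g. ISO-8859-1) strings and source files.
    */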
    public function remove_accents($string){
        return (strtr($string, "ÀÁÂÃÄÅÆàáâãäåæÒÓÔÕÕÖØòóôõöøÈÉÊËèéêëðÇçÐÌÍÎÏìíîïÙÚÛÜùúûüÑñÞßÿý",
                      "aaaaaaaaaaaaaaoooooooooooooeeeeeeeeecceiiiiiiiiuuuuuuuunntsyy"));
    }
    public function getmicrotime(){
        // microtime(true) returns the current Unix timestamp as a float, in seconds
        return microtime(true);
    }
    public function insert_log($query, $result_num, $time, $suggest_word, $suggest_num){
        $db_host = C('DB_HOST');
        $db_name = C('DB_NAME');
        $db_user = C('DB_USER');
        $db_paw = C('DB_PWD');
        $mysql_table_prefix = C('DB_PREFIX');
        $conn = new \mysqli($db_host, $db_user, $db_paw, $db_name);
        // Bind the values instead of concatenating them: the query string is user input.
        // The timestamp is formatted in PHP; inside the SQL text, date('Y-m-d H:i:s')
        // would be evaluated by MySQL's DATE() function rather than by PHP.
        $searchtime = date('Y-m-d H:i:s');
        $stmt = $conn->prepare("insert into ".$mysql_table_prefix."search_log (querystring, searchtime, elapsed, result_num, suggest_word, suggest_num) values (?, ?, ?, ?, ?, ?)");
        if ($stmt) {
            $stmt->bind_param('ssssss', $query, $searchtime, $time, $result_num, $suggest_word, $suggest_num);
            $stmt->execute();
            $stmt->close();
        }
        $conn->close();
    }
    public function get_common_search_data($pagenum = 1, $pagesize = 15, $limit = 75, $order = "downloadtimes"){
        $db_host = C('DB_HOST');
        $db_name = C('DB_NAME');
        $db_user = C('DB_USER');
        $db_paw = C('DB_PWD');
        $mysql_table_prefix = C('DB_PREFIX');
        $conn = new \mysqli($db_host, $db_user, $db_paw, $db_name);
        // Column names cannot be bound as parameters, so only known sort columns are accepted.
        // This whitelist is an assumption; adjust it to the columns that should be sortable.
        $allowed_orders = array('downloadtimes', 'viewtimes', 'vote', 'votetimes', 'postid');
        if (!in_array($order, $allowed_orders, true)) {
            $order = 'downloadtimes';
        }
        $pagesize = max(1, (int)$pagesize);
        $offset = max(0, ((int)$pagenum - 1) * $pagesize);
        $sql = "SELECT distinct postid, title, name, slug, catid, description , thumbpath, filepages, vote, votetimes, viewtimes, downloadtimes FROM ".$mysql_table_prefix."post order by ".$order." DESC limit ".$offset.", ".$pagesize;
        $result = $conn->query($sql);
        $data = array();
        while ($row = $result->fetch_assoc()) {
            // No search words to highlight here, so color_name is just the plain name.
            $row['color_name'] = $row['name'];
            $data[] = $row;
        }
        $result->close();
        $conn->close();
        return $data;
    }

}
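
The spelling-suggestion branch of the class above ("did you mean") is easier to follow when it is pulled out of the database code. The following is only a minimal sketch under a few assumptions: `suggest_keyword()` and its `$max_edits` parameter are hypothetical names, the candidate keywords are passed in as a plain array instead of being read from the `search_keywords` table, and PHP's `soundex()` stands in for MySQL's `SOUNDEX()` (the two are not guaranteed to produce identical codes).

```php
<?php
/**
 * Pick a spelling suggestion for $word from a list of known keywords.
 * Hypothetical helper, not part of the class above.
 */
function suggest_keyword($word, array $keywords, $max_edits = 3)
{
    $best = '';
    $best_distance = $max_edits + 1;
    foreach ($keywords as $candidate) {
        // Pre-filter: only keywords that sound like the query word are considered.
        if (soundex($candidate) !== soundex($word)) {
            continue;
        }
        // Keep the candidate with the smallest edit distance (at most $max_edits).
        $distance = levenshtein($candidate, $word);
        if ($distance < $best_distance) {
            $best_distance = $distance;
            $best = $candidate;
        }
    }
    // An exact match is not a suggestion.
    return ($best !== '' && $best !== $word) ? $best : null;
}

echo suggest_keyword('serch', array('search', 'sphinx', 'lucene')), "\n"; // prints "search"
```

The sound-alike filter keeps the candidate set small before the more expensive edit-distance comparison runs, which is the same division of labour as in the SQL query plus `levenshtein()` loop above.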

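That is roughly the whole process, and saying more would just be noise. Once the corresponding tables are created and the class above is called the right way, it should run correctly. Everything else is in the code...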
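
One detail of the search code that is worth isolating is the highlighting of matched words in the post name. The sketch below is an alternative, not the author's method: `highlight_words()` is a hypothetical helper that escapes every word with `preg_quote()` and then highlights all of them in a single pass, so the markup inserted for one word can never be matched again by another word.

```php
<?php
/**
 * Wrap every occurrence of the given words in $open/$close markup.
 * Hypothetical helper, not part of the class above.
 */
function highlight_words($text, array $words, $open = '<b>', $close = '</b>')
{
    $quoted = array();
    foreach ($words as $word) {
        if ($word !== '') {
            // Escape regex metacharacters so the word is matched literally.
            $quoted[] = preg_quote($word, '/');
        }
    }
    if (!$quoted) {
        return $text;
    }
    // One alternation, one pass: inserted markup is never rescanned.
    $pattern = '/(' . implode('|', $quoted) . ')/i';
    return preg_replace_callback($pattern, function ($m) use ($open, $close) {
        return $open . $m[1] . $close;
    }, $text);
}

echo highlight_words('C++ guide', array('c++')), "\n"; // prints "<b>C++</b> guide"
```

Because the text is scanned only once, words such as "font" or "b" need no special treatment here, whereas a word-by-word loop has to guard against them explicitly.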

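Appendix, stemming:
Stem.class.php implements the Porter stemmer. Open-source implementations are easy to find online, so it is not reproduced here.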
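
Besides stemming, the class above relies on `remove_accents()` when it decides whether a word is indexable. With two string arguments `strtr()` translates byte by byte, which only works for single-byte encodings. If the site stores UTF-8 text, a variant along the following lines may be safer. It is a sketch with assumptions: `remove_accents_utf8()` is a hypothetical name, the file is assumed to be saved as UTF-8, and the table is a shortened, illustrative subset of the original mapping.

```php
<?php
/**
 * UTF-8 safe variant of remove_accents(): with an array argument strtr()
 * replaces whole substrings, so multi-byte characters are handled correctly.
 * Hypothetical helper; the mapping is an illustrative subset.
 */
function remove_accents_utf8($string)
{
    static $map = array(
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'ç' => 'c', 'ñ' => 'n',
    );
    return strtr($string, $map);
}

echo remove_accents_utf8('café crème'), "\n"; // prints "cafe creme"
```

Characters that are missing from the table pass through unchanged, so the map can be extended step by step as new accented letters show up in the data.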

Reposted from blog.csdn.net/ahaotata/article/details/84819867