Installing the smartcn Chinese analysis plugin for Elasticsearch

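Elasticsearch's default analyzer handles Chinese poorly: it splits Chinese text into individual characters.

Here we introduce the smartcn plugin. It is the officially recommended option, developed by the Chinese Academy of Sciences, and it covers most needs.

The other option is the IK analyzer. If you need a custom dictionary, use IK instead. Home page: https://github.com/medcl/elasticsearch-analysis-ik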

smartcn is easy to install: just use the elasticsearch-plugin command in Elasticsearch's bin directory.

First change into the Elasticsearch bin directory,

then run sh elasticsearch-plugin install analysis-smartcn

-> Downloading analysis-smartcn from elastic

[=================================================] 100%

-> Installed analysis-smartcn

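The plugin is downloaded and installed automatically.

(Note: in a cluster of, say, 3 nodes, every node needs the plugin. In practice you usually set up one node completely and then clone it into the other nodes, which is easier.)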

After installation, the plugins directory contains a new analysis-smartcn folder.

After installing, restart Elasticsearch.
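Because every node needs the plugin, it can be worth confirming after the restart that none was missed. The sketch below is one possible check, not part of the original setup: it assumes you have fetched the plain-text output of GET /_cat/plugins (columns: node name, component, version) and that you know the expected node names. The class name, the node names and the version string in the sample are made up for illustration, and node names containing spaces are not handled.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PluginCheck {

    /**
     * Returns the expected node names that do not list the given plugin.
     * catPluginsOutput is the plain-text response of GET /_cat/plugins,
     * where each line has the columns: node name, component, version.
     */
    public static List<String> nodesMissingPlugin(String catPluginsOutput,
                                                  List<String> expectedNodes,
                                                  String plugin) {
        Set<String> nodesWithPlugin = new HashSet<>();
        for (String line : catPluginsOutput.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 2 && cols[1].equals(plugin)) {
                nodesWithPlugin.add(cols[0]);
            }
        }
        // Nodes without any plugin do not appear in the output at all,
        // so compare against the list of nodes we expect to exist.
        List<String> missing = new ArrayList<>();
        for (String node : expectedNodes) {
            if (!nodesWithPlugin.contains(node)) {
                missing.add(node);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical output for a 3-node cluster; names and version are made up.
        String sample = "node-1 analysis-smartcn 5.5.2\n"
                      + "node-2 analysis-smartcn 5.5.2\n";
        System.out.println(nodesMissingPlugin(sample,
                List.of("node-1", "node-2", "node-3"), "analysis-smartcn"));
        // prints [node-3]
    }
}
```

An empty list means every expected node reports the plugin.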

Now let's test it.

POST http://192.168.1.111:9200/_analyze/

{"analyzer":"standard","text":"我是中国人"}

This runs the standard analyzer.

Result:

QQ鎴浘20180116173626.jpg

Every Chinese character comes back as its own token,

which is not what we want.

Now let's try smartcn:

{"analyzer":"smartcn","text":"我是中国人"}

Result:

QQ鎴浘20180116173736.jpg

This time 中国 comes back as a single token.
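The same two _analyze calls can also be issued from Java. The following is a minimal sketch that uses the JDK's built-in java.net.http client (Java 11+) rather than any Elasticsearch client library. The class name AnalyzeDemo and its methods are invented for this example, and the JSON escaping only covers backslashes and double quotes, so treat it as a starting point.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class AnalyzeDemo {

    /** Minimal JSON string escaping: backslash and double quote only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds the _analyze request body, e.g. {"analyzer":"smartcn","text":"..."}. */
    public static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer)
                + "\",\"text\":\"" + escape(text) + "\"}";
    }

    /** POSTs the body to baseUrl + "/_analyze" and returns the raw JSON response. */
    public static String analyze(String baseUrl, String analyzer, String text) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text), StandardCharsets.UTF_8))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String text = "我是中国人";
        if (args.length == 0) {
            // No server given: only show the request bodies that would be sent.
            System.out.println(body("standard", text));
            System.out.println(body("smartcn", text));
            return;
        }
        // args[0] is the base URL, e.g. http://192.168.1.111:9200 as used in this post.
        System.out.println(analyze(args[0], "standard", text));
        System.out.println(analyze(args[0], "smartcn", text));
    }
}
```

Comparing the two responses side by side shows the per-character tokens from standard next to the word-level tokens from smartcn.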

Next we create a new index, film2,

and specify the smartcn analyzer in its mapping:

POST http://192.168.1.111:9200/film2/_mapping/dongzuo/

{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "smartcn"
    },
    "publishDate": {
      "type": "date"
    },
    "content": {
      "type": "text",
      "analyzer": "smartcn"
    },
    "director": {
      "type": "keyword"
    },
    "price": {
      "type": "float"
    }
  }
}

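Then run the data-loading code from earlier against the new index.

The earlier film index now holds data analyzed with the standard analyzer, where Chinese is split one character at a time, while film2 uses smartcn and is tokenized according to its built-in Chinese vocabulary.

Now let's run an analyzed search from Java.

First define a static constant:

private static final String ANALYZER="smartcn";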

 
    /**
     * Analyzed match query on a single field.
     * @throws Exception
     */
    @Test
    public void search() throws Exception {
        SearchRequestBuilder srb = client.prepareSearch("film2").setTypes("dongzuo");
        SearchResponse sr = srb.setQuery(QueryBuilders.matchQuery("title", "星球狼").analyzer(ANALYZER))
                .setFetchSource(new String[]{"title", "price"}, null) // return only title and price
                .execute()
                .actionGet();
        SearchHits hits = sr.getHits();
        for (SearchHit hit : hits) {
            System.out.println(hit.getSourceAsString());
        }
    }

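With the Chinese analyzer specified, the search keywords are analyzed into tokens first and then searched. If no analyzer is specified, the standard analyzer is used by default.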
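To make "analyze the keywords first, then search" concrete, here is a toy sketch in plain Java. It is not how Elasticsearch or smartcn work internally. The per-character analyzer stands in for the standard analyzer's behaviour on Chinese, the dictionary analyzer is a simple greedy longest-match stand-in (smartcn's real algorithm is different), and the titles and dictionary are made up. It illustrates only that a match query hits any document sharing at least one token with the analyzed query.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class MatchSketch {

    /** Stand-in for the standard analyzer on Chinese text: one token per character. */
    public static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    /** Toy dictionary analyzer: greedy longest match, single characters as fallback. */
    public static Function<String, List<String>> dictionaryAnalyzer(Set<String> words) {
        return text -> {
            List<String> tokens = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                int end = i + 1; // fall back to a single character
                for (int j = text.length(); j > i + 1; j--) {
                    if (words.contains(text.substring(i, j))) {
                        end = j;
                        break;
                    }
                }
                tokens.add(text.substring(i, end));
                i = end;
            }
            return tokens;
        };
    }

    /** A document matches when it shares at least one token with the analyzed query. */
    public static List<String> match(List<String> docs, String query,
                                     Function<String, List<String>> analyzer) {
        Set<String> queryTokens = new HashSet<>(analyzer.apply(query));
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            for (String token : analyzer.apply(doc)) {
                if (queryTokens.contains(token)) {
                    hits.add(doc);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("星球大战", "球星传奇", "战狼"); // made-up titles
        Set<String> dictionary = Set.of("星球");                      // made-up dictionary
        System.out.println(match(titles, "星球狼", MatchSketch::perCharacter));
        // prints [星球大战, 球星传奇, 战狼]
        System.out.println(match(titles, "星球狼", dictionaryAnalyzer(dictionary)));
        // prints [星球大战, 战狼]
    }
}
```

With per-character tokens, 球星传奇 matches only because it happens to contain 球 and 星. Once 星球 is kept as one token, that accidental hit disappears.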

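Next, multi-field queries. A Baidu search, for example, looks not only at the title but also at the content, so two fields are involved.

For this we use multiMatchQuery. Here is the Java code: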

 
    /**
     * Analyzed match query across multiple fields.
     * @throws Exception
     */
    @Test
    public void search2() throws Exception {
        SearchRequestBuilder srb = client.prepareSearch("film2").setTypes("dongzuo");
        SearchResponse sr = srb.setQuery(QueryBuilders.multiMatchQuery("非洲星球", "title", "content").analyzer(ANALYZER))
                .setFetchSource(new String[]{"title", "price"}, null) // return only title and price
                .execute()
                .actionGet();
        SearchHits hits = sr.getHits();
        for (SearchHit hit : hits) {
            System.out.println(hit.getSourceAsString());
        }
    }
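For comparison, the multi-field search above corresponds roughly to a multi_match request body in the REST Query DSL. The sketch below builds that body by hand in plain Java so it can be inspected or POSTed to http://192.168.1.111:9200/film2/dongzuo/_search with any HTTP tool. The class name MultiMatchBody and its helper methods are invented for this example, and the escaping is minimal (backslash and double quote only).

```java
import java.util.List;
import java.util.stream.Collectors;

public class MultiMatchBody {

    /** Quotes a value as a JSON string (minimal: escapes only backslash and double quote). */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Renders a list of strings as a JSON array, e.g. ["title","content"]. */
    static String array(List<String> values) {
        return values.stream()
                .map(MultiMatchBody::quote)
                .collect(Collectors.joining(",", "[", "]"));
    }

    /** Builds a search body with a multi_match query and a _source filter. */
    public static String build(String text, List<String> fields,
                               String analyzer, List<String> sourceFields) {
        return "{\"query\":{\"multi_match\":{"
                + "\"query\":" + quote(text) + ","
                + "\"fields\":" + array(fields) + ","
                + "\"analyzer\":" + quote(analyzer)
                + "}},\"_source\":" + array(sourceFields) + "}";
    }

    public static void main(String[] args) {
        // Same keywords, fields and analyzer as the search2() test above.
        System.out.println(build("非洲星球",
                List.of("title", "content"), "smartcn", List.of("title", "price")));
    }
}
```

The fields array plays the role of the field-name arguments to multiMatchQuery, and _source mirrors the includes passed to setFetchSource.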
