Internet slang evolves by the day. How can we push newly coined hot words (or other specific terms) into our search index in real time?

First, a quick test with ik:
```shell
curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d ' 成龙原名陈港生 '
```

This returns:

```json
{
  "tokens" : [
    { "token" : "成龙", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 0 },
    { "token" : "原名", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 1 },
    { "token" : "陈",   "start_offset" : 5, "end_offset" : 6, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "港",   "start_offset" : 6, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 },
    { "token" : "生",   "start_offset" : 7, "end_offset" : 8, "type" : "CN_CHAR", "position" : 4 }
  ]
}
```

ik's main dictionary does not contain the word "陈港生", so it was split into single characters.
Now let's configure it. Edit IK's configuration file at `ES directory/plugins/ik/config/ik/IKAnalyzer.cfg.xml` as follows:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Local extension dictionaries -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Local extension stop-word dictionary -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Remote extension dictionary -->
    <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry>
    <!-- Remote extension stop-word dictionary -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
```

I use a remote extension dictionary here because another program can push updates to it without restarting ES, which is very convenient; extending the vocabulary through local files requires an ES restart. That said, the custom `mydict.dic` dictionary is also easy to use: one word per line, just add your own.
Since it is a remote dictionary, it must be an accessible URL. It can be a dynamic page or a plain txt file, as long as the output is UTF-8 encoded.

Contents of hotWords.php:
```php
<?php
$s = <<<'EOF'
陈港生
元楼
蓝瘦
EOF;
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"');
echo $s;
```
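If PHP is not an option, the same endpoint can be sketched in Java with the JDK's built-in `com.sun.net.httpserver` (the class name `DictServer`, the path `/hotWords`, and the header values below are illustrative; ik only cares that the headers change when the word list changes):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class DictServer {

    // One hot word per line with a trailing newline: the format ik expects from the remote URL.
    static String buildDictionary(List<String> words) {
        return String.join("\n", words) + "\n";
    }

    public static void main(String[] args) throws Exception {
        byte[] body = buildDictionary(List.of("陈港生", "元楼", "蓝瘦"))
                .getBytes(StandardCharsets.UTF_8);

        // Port 0 lets the OS pick a free port; a real deployment would use a fixed one.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/hotWords", exchange -> {
            // ik re-fetches when either of these two headers differs from the previous poll.
            // A content hash works as a change marker; a real server would send an HTTP date.
            String marker = Integer.toHexString(java.util.Arrays.hashCode(body));
            exchange.getResponseHeaders().add("Last-Modified", marker);
            exchange.getResponseHeaders().add("ETag", marker);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; charset=UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();

        // Fetch once to show the endpoint works, then shut down.
        URL url = new URL("http://localhost:" + server.getAddress().getPort() + "/hotWords");
        try (InputStream in = url.openStream()) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        server.stop(0);
    }
}
```

To use it with ik, point `remote_ext_dict` at the server's address instead of the PHP URL.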
ik looks at two response headers, Last-Modified and ETag; if either one changes, an update is triggered. ik polls the remote dictionary once per minute.
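The trigger rule can be captured as a small pure function (a sketch only; the names are illustrative, not ik's actual internals):

```java
public class DictReloadCheck {

    /**
     * ik re-fetches the remote dictionary when either the Last-Modified
     * or the ETag header differs from the value seen on the previous poll.
     */
    static boolean shouldReload(String prevLastModified, String prevEtag,
                                String newLastModified, String newEtag) {
        return !prevLastModified.equals(newLastModified) || !prevEtag.equals(newEtag);
    }

    public static void main(String[] args) {
        System.out.println(shouldReload("t1", "e1", "t1", "e1")); // nothing changed -> false
        System.out.println(shouldReload("t1", "e1", "t2", "e1")); // Last-Modified changed -> true
    }
}
```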
Restart Elasticsearch and check the startup log: the three words have been loaded.
```
[2016-10-31 15:08:57,749][INFO ][ik-analyzer ] 陈港生
[2016-10-31 15:08:57,749][INFO ][ik-analyzer ] 元楼
[2016-10-31 15:08:57,749][INFO ][ik-analyzer ] 蓝瘦
```
Now test again by re-running the earlier request. It returns:

```json
... }, {
  "token" : "陈港生",
  "start_offset" : 5,
  "end_offset" : 8,
  "type" : "CN_WORD",
  "position" : 2
}, { ...
```

The ik tokenizer now recognizes "陈港生" as a single word.
Java server-side implementation: endpoints for loading the extension dictionary, adding extension words, and refreshing the dictionary.

```xml
<!-- Remote extension dictionary -->
<entry key="remote_ext_dict">http://ip:port/es/dic/loadExtDict</entry>
```
```java
@RestController
@RequestMapping("/es/dic")
public class DicController {

    private static final Logger logger = LoggerFactory.getLogger(DicController.class);

    @Autowired
    private DictRedis dictRedis;

    private static final String EXT_DICT_PATH = "E:\\ext_dict.txt";

    /**
     * Load the extension dictionary.
     */
    @RequestMapping(value = "/loadExtDict")
    public void loadExtDict(HttpServletResponse response) {
        logger.info("extDict get start");
        // Count the fetches so that every node in the cluster receives the new words.
        long count = dictRedis.incr(RedisKeyConstants.ES_EXT_DICT_FLUSH);
        if (count > getEsClusterNodesNum()) {
            return;
        }
        String result = FileUtil.read(EXT_DICT_PATH);
        if (StringUtils.isEmpty(result)) {
            return;
        }
        try {
            response.setHeader("Last-Modified", TimeUtil.currentTimeHllDT().toString());
            response.setHeader("ETag", TimeUtil.currentTimeHllDT().toString());
            response.setContentType("text/plain; charset=UTF-8");
            PrintWriter out = response.getWriter();
            out.write(result);
            out.flush();
        } catch (IOException e) {
            logger.error("DicController loadExtDict exception", e);
        }
        logger.info("extDict get end, result:{}", result);
    }

    /**
     * Refresh the extension dictionary: delete the Redis counter so every node re-fetches.
     */
    @RequestMapping(value = "/extDictFlush")
    public String extDictFlush() {
        String result = "ok";
        try {
            dictRedis.del(RedisKeyConstants.ES_EXT_DICT_FLUSH);
        } catch (Exception e) {
            result = e.getMessage();
        }
        return result;
    }

    /**
     * Add extension words; separate multiple words with commas (",").
     */
    @RequestMapping(value = "/addExtDict")
    public String addExtDict(String dict) {
        String result = "ok";
        if (StringUtils.isEmpty(dict)) {
            return "the word to add must not be empty";
        }
        StringBuilder sb = new StringBuilder();
        String[] dicts = dict.split(",");
        for (String str : dicts) {
            sb.append("\n").append(str);
        }
        boolean flag = FileUtil.write(EXT_DICT_PATH, sb.toString());
        if (flag) {
            extDictFlush();
        } else {
            result = "fail";
        }
        return result;
    }

    /**
     * Get the number of nodes in the ES cluster; defaults to 10 if it cannot be determined.
     */
    private int getEsClusterNodesNum() {
        int num = 10;
        String esAddress = PropertyConfigurer.getString("es.address",
                "http://172.16.32.69:9300,http://172.16.32.48:9300");
        List<String> clusterNodes = Arrays.asList(esAddress.split(","));
        if (clusterNodes != null && clusterNodes.size() != 0) {
            num = clusterNodes.size();
        }
        return num;
    }
}
```
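The interplay between the Redis counter and `extDictFlush` is the non-obvious part: each ES node polls once a minute, the counter gates the response so the dictionary is served at most once per node, and a flush resets the counter so all nodes re-fetch. This can be sketched in isolation (an `AtomicLong` stands in for Redis `INCR`; the class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class DictFlushGate {

    // Stand-in for the Redis counter in the controller (AtomicLong instead of Redis INCR).
    private final AtomicLong fetchCount = new AtomicLong();
    private final int clusterSize;

    DictFlushGate(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** Serve the dictionary only until every node in the cluster has fetched it once. */
    boolean shouldServe() {
        return fetchCount.incrementAndGet() <= clusterSize;
    }

    /** What extDictFlush does: reset the counter so all nodes re-fetch on their next poll. */
    void flush() {
        fetchCount.set(0);
    }

    public static void main(String[] args) {
        DictFlushGate gate = new DictFlushGate(3);
        System.out.println(gate.shouldServe()); // node 1 -> true
        System.out.println(gate.shouldServe()); // node 2 -> true
        System.out.println(gate.shouldServe()); // node 3 -> true
        System.out.println(gate.shouldServe()); // extra poll -> false
        gate.flush();
        System.out.println(gate.shouldServe()); // after flush -> true again
    }
}
```

One consequence of this design: after all nodes have fetched, `loadExtDict` returns an empty response until the next flush, so the headers stop changing and ik keeps its current vocabulary.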
A file read/write utility class:
```java
public class FileUtil {

    private static final Logger logger = LoggerFactory.getLogger(FileUtil.class);

    /**
     * Read a file as UTF-8 text.
     */
    public static String read(String path) {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = null;
        try {
            BufferedInputStream fis = new BufferedInputStream(new FileInputStream(new File(path)));
            // Read the file through a 512-byte buffer.
            reader = new BufferedReader(new InputStreamReader(fis, "utf-8"), 512);
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } catch (Exception e) {
            logger.error("FileUtil read exception", e);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return sb.toString();
    }

    /**
     * Append content to a file. Write as UTF-8 so the dictionary stays readable by ik.
     */
    public static boolean write(String path, String content) {
        boolean flag = true;
        BufferedWriter out = null;
        try {
            // The second FileOutputStream argument enables append mode.
            out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(new File(path), true), "utf-8"));
            out.write(content);
        } catch (IOException e) {
            flag = false;
            logger.error("FileUtil write exception", e);
        } finally {
            try {
                if (out != null) {
                    out.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return flag;
    }
}
```
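On Java 8 and later, the same read/append behavior can be written far more compactly with `java.nio.file` (a self-contained sketch, not the author's class; the name `NioFileUtil` is made up):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioFileUtil {

    // Read the whole dictionary file as UTF-8; return an empty string on failure.
    static String read(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return "";
        }
    }

    // Append content as UTF-8, creating the file if it does not yet exist.
    static boolean append(Path path, String content) {
        try {
            Files.write(path, content.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("ext_dict", ".txt");
        append(p, "陈港生\n");
        append(p, "蓝瘦\n");
        System.out.print(read(p)); // prints the two appended words
        Files.delete(p);
    }
}
```

Streams are closed automatically, the encoding is explicit on both the read and the write path, and there is no finally-block boilerplate.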