A brief look at how ES works, its mechanisms, and how IK tokenization works

1. What mechanisms does ES's distributed architecture provide?

1. Primary and replica
Every primary shard has copies called replica shards.
A primary shard cannot be placed on the same node as its own replica shards.
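The placement rule above can be sketched as a simple check. This is a toy model for illustration only, not Elasticsearch's actual allocator: the class `ShardAllocationSketch`, its methods, and the node names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; Elasticsearch's real allocation logic is far richer.
class ShardAllocationSketch {
    // node name -> ids of the shards that already have a copy on that node
    private final Map<String, List<Integer>> copiesByNode = new HashMap<>();

    ShardAllocationSketch(List<String> nodes) {
        for (String node : nodes) {
            copiesByNode.put(node, new ArrayList<>());
        }
    }

    // A copy of a shard may only go to a known node that holds no other copy of it.
    boolean canAllocate(String node, int shardId) {
        List<Integer> copies = copiesByNode.get(node);
        return copies != null && !copies.contains(shardId);
    }

    // Places the copy if the rule allows it and reports whether it was placed.
    boolean allocate(String node, int shardId) {
        if (!canAllocate(node, shardId)) {
            return false;
        }
        copiesByNode.get(node).add(shardId);
        return true;
    }

    public static void main(String[] args) {
        ShardAllocationSketch cluster =
                new ShardAllocationSketch(Arrays.asList("node-1", "node-2"));
        System.out.println(cluster.allocate("node-1", 0)); // primary of shard 0
        System.out.println(cluster.allocate("node-1", 0)); // replica on the same node: rejected
        System.out.println(cluster.allocate("node-2", 0)); // replica on another node
    }
}
```

With only one node available, the replica in this model simply stays unplaced.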

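2. Fault tolerance
ES relies on an election mechanism.
When the master node goes down, a new master node is elected, and replica shards are promoted to primary for the primaries that were lost.
When the failed node restarts, its data is recovered.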
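The promotion step can be illustrated with a small sketch. It covers only replica promotion for a single shard and leaves master election out. `ShardCopy` and `ReplicaPromotionSketch` are hypothetical names, and picking the first surviving replica is an assumption made for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// One copy of a shard, living on one node (hypothetical model).
class ShardCopy {
    final String node;
    boolean primary;

    ShardCopy(String node, boolean primary) {
        this.node = node;
        this.primary = primary;
    }
}

class ReplicaPromotionSketch {
    // Drops the copies that lived on the failed node. If the primary was among
    // them, the first surviving replica is promoted. Returns the node that holds
    // the primary afterwards, or null if no copy of the shard survived.
    static String handleNodeFailure(List<ShardCopy> copies, String failedNode) {
        copies.removeIf(copy -> copy.node.equals(failedNode));
        for (ShardCopy copy : copies) {
            if (copy.primary) {
                return copy.node; // the primary was not affected
            }
        }
        if (copies.isEmpty()) {
            return null;
        }
        copies.get(0).primary = true; // promote a replica
        return copies.get(0).node;
    }

    public static void main(String[] args) {
        List<ShardCopy> shard0 = new ArrayList<>();
        shard0.add(new ShardCopy("node-1", true));
        shard0.add(new ShardCopy("node-2", false));
        System.out.println(handleNodeFailure(shard0, "node-1")); // node-2
    }
}
```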

2. How IK tokenization works

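The IK tokenizer works in three steps:
1. Building the dictionary trie (Trie Tree), that is, loading the existing dictionary into an in-memory structure
2. Matching words against the dictionary, that is, segmenting the text
3. Resolving ambiguity, that is, deciding which of several possible segmentations is the most reasonable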

2.1 Building the dictionary trie

class DictSegment implements Comparable<DictSegment>{  
  
    // Shared character table that stores Chinese characters  
    private static final Map<Character , Character> charMap = new HashMap<Character , Character>(16 , 0.95f);  
    // Upper limit on the array size  
    private static final int ARRAY_LENGTH_LIMIT = 3;  
  
      
    // Map-based storage for child nodes  
    private Map<Character , DictSegment> childrenMap;  
    // Array-based storage for child nodes  
    private DictSegment[] childrenArray;  
  
  
    // Character stored on the current node  
    private Character nodeChar;  
    // Number of segments stored under the current node  
    // storeSize <= ARRAY_LENGTH_LIMIT: array storage; storeSize > ARRAY_LENGTH_LIMIT: Map storage  
    private int storeSize = 0;  
    // State of the current DictSegment, 0 by default; 1 means the path from the root to this node forms a word  
    private int nodeState = 0;    
    ……

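The ARRAY_LENGTH_LIMIT threshold decides how a node stores its children: a node with few children keeps them in an array, and a node with more children keeps them in a HashMap.
With array storage, a child is found by binary search.
With HashMap storage, a child is found by direct lookup on its character. In both cases, words are written into the dictionary trie through recursive calls to fillSegment.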
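The switch between the two storage forms can be sketched as follows. This is a simplified stand-in written for illustration, not IK's DictSegment: `HybridNode`, `ARRAY_LIMIT`, and the `child` method are hypothetical names, and the migration logic is an assumption based on the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie node that keeps few children in a sorted array
// and moves them into a HashMap once the limit is exceeded.
class HybridNode implements Comparable<HybridNode> {
    static final int ARRAY_LIMIT = 3;

    final char ch;
    private HybridNode[] childArray = new HybridNode[ARRAY_LIMIT];
    private Map<Character, HybridNode> childMap; // null while the array is in use
    private int size = 0;

    HybridNode(char ch) {
        this.ch = ch;
    }

    @Override
    public int compareTo(HybridNode other) {
        return Character.compare(ch, other.ch);
    }

    // Finds the child for c; creates it when create is true and it is missing.
    HybridNode child(char c, boolean create) {
        if (childMap != null) { // large node: direct lookup by key
            HybridNode found = childMap.get(c);
            if (found == null && create) {
                found = new HybridNode(c);
                childMap.put(c, found);
                size++;
            }
            return found;
        }
        // small node: binary search over the sorted prefix of the array
        int pos = Arrays.binarySearch(childArray, 0, size, new HybridNode(c));
        if (pos >= 0) {
            return childArray[pos];
        }
        if (!create) {
            return null;
        }
        HybridNode created = new HybridNode(c);
        if (size < ARRAY_LIMIT) {
            childArray[size++] = created;
            Arrays.sort(childArray, 0, size); // keep the array searchable
        } else {
            childMap = new HashMap<>(); // limit exceeded: migrate to the map
            for (int i = 0; i < size; i++) {
                childMap.put(childArray[i].ch, childArray[i]);
            }
            childMap.put(c, created);
            size++;
            childArray = null;
        }
        return created;
    }

    boolean usesMap() {
        return childMap != null;
    }

    int size() {
        return size;
    }
}
```

The array keeps small nodes compact, while the map avoids repeated sorting and searching on nodes with many children.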

private synchronized void fillSegment(char[] charArray , int begin , int length , int enabled){  
 
     ……       
    // Search the current node's storage for the segment matching keyChar; create it if it does not exist  
    DictSegment ds = lookforSegment(keyChar , enabled);  
    if(ds != null){  
        // Process the segment that corresponds to keyChar  
        if(length > 1){  
            // The word has not been fully added to the dictionary trie yet  
            ds.fillSegment(charArray, begin + 1, length - 1 , enabled);  
        }else if (length == 1){  
            // This is the last char of the word: set the node state to enabled  
            // enabled=1 marks a complete word; enabled=0 masks the word in the dictionary  
            ds.nodeState = enabled;  
        }  
    }   
}
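To make the recursion easier to follow, here is a minimal self-contained trie that fills and queries words in the same spirit. It is a sketch, not IK's code: `TrieSketch`, `isWord`, and `isPrefix` are hypothetical names, children are always kept in a HashMap, and the choice not to create nodes when `enabled` is 0 is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplified trie; each node maps a character to a child node.
class TrieSketch {
    private final Map<Character, TrieSketch> children = new HashMap<>();
    // 1 means the path from the root to this node spells a complete word
    private int nodeState = 0;

    // Writes chars[begin .. begin+length) into the trie, one character per call.
    void fill(char[] chars, int begin, int length, int enabled) {
        if (length <= 0) {
            return;
        }
        char key = chars[begin];
        TrieSketch child = children.get(key);
        if (child == null) {
            if (enabled == 0) {
                return; // nothing stored on this path, so nothing to mask
            }
            child = new TrieSketch();
            children.put(key, child);
        }
        if (length > 1) {
            child.fill(chars, begin + 1, length - 1, enabled); // more characters to go
        } else {
            child.nodeState = enabled; // last character: mark or mask the word
        }
    }

    // Follows the characters of s from this node; null if the path does not exist.
    private TrieSketch walk(String s) {
        TrieSketch node = this;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }

    boolean isWord(String s) {
        TrieSketch node = walk(s);
        return node != null && node.nodeState == 1;
    }

    boolean isPrefix(String s) {
        TrieSketch node = walk(s);
        return node != null && !node.children.isEmpty();
    }

    public static void main(String[] args) {
        TrieSketch root = new TrieSketch();
        root.fill("张三".toCharArray(), 0, 2, 1);
        System.out.println(root.isWord("张三"));  // true
        System.out.println(root.isWord("张"));    // false
        System.out.println(root.isPrefix("张"));  // true
    }
}
```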

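2.2 Segmentation

There are two segmentation modes:
1. Non-smart mode
IK outputs every word it can match.
2. Smart mode
IK uses its internal rules to output the single segmentation it considers most reasonable, which is where ambiguity resolution comes in.

For example, for the input 张三说的确实在理:
Smart mode: 张三 | 说的 | 确实 | 在理
Non-smart mode: 张三 | 三 | 说的 | 的确 | 的 | 确实 | 实在 | 在理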
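The non-smart output above can be reproduced with a brute-force sketch that reports every dictionary word found in the text. This is not how IK scans (IK walks its dictionary trie). The class `FullSegmentationSketch`, the eight-word dictionary, and the ordering rule (by start position, longer words first) are assumptions chosen to match the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FullSegmentationSketch {
    // Returns every substring of text that is in dict, ordered by start
    // position and, for the same start, from the longest match to the shortest.
    static List<String> allMatches(String text, Set<String> dict, int maxWordLen) {
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int longest = Math.min(maxWordLen, text.length() - start);
            for (int len = longest; len >= 1; len--) {
                String candidate = text.substring(start, start + len);
                if (dict.contains(candidate)) {
                    matches.add(candidate);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed dictionary: exactly the words that appear in the example output.
        Set<String> dict = new HashSet<>(Arrays.asList(
                "张三", "三", "说的", "的确", "的", "确实", "实在", "在理"));
        List<String> words = allMatches("张三说的确实在理", dict, 2);
        // prints: 张三 | 三 | 说的 | 的确 | 的 | 确实 | 实在 | 在理
        System.out.println(String.join(" | ", words));
    }
}
```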

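By default IK uses three sub-segmenters:
LetterSegmenter (letter segmenter), CN_QuantifierSegmenter (quantifier segmenter), and CJKSegmenter (Chinese/Japanese/Korean segmenter). Text passes through these three segmenters in turn. The analysis here focuses on CJKSegmenter, whose core is its analyze method.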

public void analyze(AnalyzeContext context) {  
    …….  
          
        // Process the hits already held in tmpHits first  
        if(!this.tmpHits.isEmpty()){  
            // Process the queue of pending word segments  
            Hit[] tmpArray = this.tmpHits.toArray(new Hit[this.tmpHits.size()]);  
            for(Hit hit : tmpArray){  
                hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor() , hit);  
                if(hit.isMatch()){  
                    // Output the current word  
                    Lexeme newLexeme = new Lexeme(context.getBufferOffset() , hit.getBegin() , context.getCursor() - hit.getBegin() + 1 , Lexeme.TYPE_CNWORD);  
                    context.addLexeme(newLexeme);  
                      
                    if(!hit.isPrefix()){// Not a word prefix: the hit needs no further matching, remove it  
                        this.tmpHits.remove(hit);  
                    }  
                      
                }else if(hit.isUnmatch()){  
                    // The hit is not a word, remove it  
                    this.tmpHits.remove(hit);  
                }                     
            }  
        }             
          
        //*********************************  
        // Then try a single-character match at the current cursor position  
        Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);  
        if(singleCharHit.isMatch()){// The first character is a word by itself  
            // Output the current word  
            Lexeme newLexeme = new Lexeme(context.getBufferOffset() , context.getCursor() , 1 , Lexeme.TYPE_CNWORD);  
            context.addLexeme(newLexeme);  
 
            // It is also a word prefix  
            if(singleCharHit.isPrefix()){  
                // Prefix match: put it into the hit list  
                this.tmpHits.add(singleCharHit);  
            }  
        }else if(singleCharHit.isPrefix()){// The first character is a word prefix  
            // Prefix match: put it into the hit list  
            this.tmpHits.add(singleCharHit);  
        }  
   ……  
}

2.3 Ambiguity resolution

IKArbitrator (the ambiguity arbitrator) is the main class that handles ambiguity.
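As a rough illustration of what such an arbitrator has to decide, the sketch below picks one non-overlapping set of words from overlapping candidates. The rule set is an assumption made for this example (cover as many characters as possible, then prefer fewer words) and is much simpler than whatever IKArbitrator actually compares. `Span` and `ArbitrationSketch` are hypothetical names, and the search is exhaustive, which is only reasonable for a handful of candidates.

```java
import java.util.ArrayList;
import java.util.List;

// A candidate word: start offset and length within the text (hypothetical type).
class Span {
    final int begin;
    final int length;

    Span(int begin, int length) {
        this.begin = begin;
        this.length = length;
    }

    int end() {
        return begin + length;
    }
}

class ArbitrationSketch {
    // Picks the non-overlapping subset that covers the most characters;
    // among equally covering subsets, the one with fewer words wins.
    static List<Span> best(List<Span> candidates) {
        List<Span> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.length, a.length));
        return search(sorted, 0, 0);
    }

    // Tries both skipping and taking sorted[index]; minBegin is the first free offset.
    private static List<Span> search(List<Span> sorted, int index, int minBegin) {
        if (index == sorted.size()) {
            return new ArrayList<>();
        }
        List<Span> best = search(sorted, index + 1, minBegin); // skip this span
        Span span = sorted.get(index);
        if (span.begin >= minBegin) { // no overlap with the spans already taken
            List<Span> taken = search(sorted, index + 1, span.end());
            taken.add(0, span);
            if (better(taken, best)) {
                best = taken;
            }
        }
        return best;
    }

    static int covered(List<Span> path) {
        int total = 0;
        for (Span span : path) {
            total += span.length;
        }
        return total;
    }

    private static boolean better(List<Span> a, List<Span> b) {
        int coveredA = covered(a);
        int coveredB = covered(b);
        return coveredA != coveredB ? coveredA > coveredB : a.size() < b.size();
    }

    static String render(String text, List<Span> path) {
        List<String> words = new ArrayList<>();
        for (Span span : path) {
            words.add(text.substring(span.begin, span.end()));
        }
        return String.join(" | ", words);
    }

    public static void main(String[] args) {
        String text = "张三说的确实在理";
        List<Span> candidates = new ArrayList<>();
        int[][] raw = {{0, 2}, {1, 1}, {2, 2}, {3, 2}, {3, 1}, {4, 2}, {5, 2}, {6, 2}};
        for (int[] r : raw) {
            candidates.add(new Span(r[0], r[1]));
        }
        // prints: 张三 | 说的 | 确实 | 在理
        System.out.println(render(text, best(candidates)));
    }
}
```

With the candidates from the non-smart example, this simplified rule happens to arrive at the same result as the smart-mode output shown earlier.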
