Modifying Whoosh to support HanLP Chinese word segmentation

First, two Python source files inside Whoosh need changes.

analyzers.py gets one added line:

from whoosh.analysis.tokenizers import ChineseTokenizer
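
Note that step 2 below imports ChineseAnalyzer, whose definition the post never shows. Presumably analyzers.py also gains a small analyzer built on the new tokenizer. A hypothetical sketch, mirroring how StandardAnalyzer is composed in that same file (StopFilter is already imported there):

# Hypothetical addition to analyzers.py -- not shown in the original post.
# Composes the new tokenizer with Whoosh's stop-word filter; minsize=1 keeps
# single-character Chinese words.
def ChineseAnalyzer(stoplist=None, minsize=1):
    return ChineseTokenizer() | StopFilter(stoplist=stoplist, minsize=minsize)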

Then a ChineseTokenizer class is appended at the bottom of tokenizers.py:


import re

from pyhanlp import HanLP  # pip install pyhanlp

# Token and Tokenizer are already available in whoosh/analysis/tokenizers.py
class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=False, removestops=True,
                 start_pos=0, start_char=0, mode="", **kwargs):
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        token_pos = start_pos  # position counter for the tokens actually yielded
        char_cursor = 0        # search cursor so repeated words get correct offsets
        for wf in HanLP.segment(value):
            # each segment stringifies as "word/pos", e.g. "商品/n"
            w, f = str(wf).strip().rsplit('/', 1)
            # keep noun-like entities: any POS tag containing 'n' (except the
            # pseudo tags 'begin'/'end'), plus 'rr' (pronoun) and 't' (time word)
            if (re.search('n', f) and f != 'begin' and f != 'end') or f == 'rr' or f == 't':
                t.original = t.text = w
                t.boost = 1.0
                t.stopped = False
                if positions:
                    t.pos = token_pos  # token index, not a character offset
                    token_pos += 1
                if chars:
                    idx = value.find(w, char_cursor)
                    t.startchar = start_char + idx
                    t.endchar = t.startchar + len(w)
                    char_cursor = idx + len(w)
                yield t
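
The rsplit('/', 1) above relies on pyhanlp's default behaviour: each element returned by HanLP.segment() stringifies as the word and its part-of-speech tag joined by a slash. A quick check, assuming the pyhanlp package is installed:

from pyhanlp import HanLP

for term in HanLP.segment('商品和服务'):
    print(term)  # prints items like 商品/n, 和/cc, 服务/vn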

How to use it once the files are modified

To make Whoosh use HanLP segmentation:

step1 >>> Put the modified tokenizers.py and analyzers.py into site-packages/whoosh/analysis of your Python environment

(for example, mine is anaconda3/lib/python3.7/site-packages/whoosh/analysis).
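
If you are not sure where that directory lives in your environment, Python can locate it:

import os
import whoosh.analysis

# prints the analysis directory the two modified files must go into
print(os.path.dirname(whoosh.analysis.__file__))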

step2 >>> Then import the analyzer like this:
from whoosh.analysis import ChineseAnalyzer
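
From there it plugs into a schema like any other Whoosh analyzer. A minimal end-to-end sketch (the index directory, field names, and sample text are made up for illustration):

import os

from whoosh.analysis import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(path=ID(stored=True),
                content=TEXT(stored=True, analyzer=ChineseAnalyzer()))

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(path="/doc1", content="商品和服务")
writer.commit()

# the query string is segmented by the same analyzer before matching
with ix.searcher() as searcher:
    results = searcher.search(QueryParser("content", ix.schema).parse("商品"))
    for hit in results:
        print(hit["path"], hit["content"])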


Reposted from www.cnblogs.com/like1tree/p/12973268.html