LTP segmentation post-processing: a forced-segmentation module

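This module addresses a common question: I am using a segmentation (POS-tagging) dictionary, so why are some dictionary words still split apart (or not tagged with the part of speech given in the dictionary)?
The official explanation: LTP's segmentation (POS-tagging) module does not use a dictionary-matching strategy. An external dictionary is fed to the machine-learning algorithm as a feature, so there is no guarantee that every word is segmented (tagged) the way the dictionary specifies. If you need segmentation (tagging) to follow the dictionary exactly, you can try post-processing the results.
So I wrote a post-processing module that merges words that were split apart. It also merges correctly when a word that must not be split appears several times in the same sentence.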

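The main idea is as follows:

Build a forced-segmentation dictionary.
Generate regular expressions from the forced-segmentation dictionary to speed up dictionary lookup.
If the sentence to be segmented contains a word from the forced-segmentation dictionary, merge the pieces that word was split into; otherwise skip the sentence.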
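The second step can be sketched as follows. This is a minimal illustration in Python 3 (plain `str` instead of UTF-8 byte strings), not the module's actual code; the names `compile_chunks` and `find_first` are hypothetical, and sorting the words longest-first and escaping them with `re.escape` are assumptions added here.

```python
import re


def compile_chunks(words, chunk_size=60):
    """Compile dictionary words into alternation patterns, chunk_size words each."""
    # Drop empty entries: an empty alternative would match every sentence.
    words = [w for w in words if w]
    # Longest first, so a longer word is tried before its own prefix within a chunk.
    words.sort(key=len, reverse=True)
    patterns = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        # re.escape keeps characters such as '.' or '+' literal.
        body = '|'.join(re.escape(w) for w in chunk)
        patterns.append(re.compile('(?:' + body + ')'))
    return patterns


def find_first(patterns, sentence):
    """Return the first dictionary word found in sentence, or None."""
    for pattern in patterns:
        match = pattern.search(sentence)
        if match:
            return match.group()
    return None
```

The patterns would be compiled once at start-up and reused for every sentence, which is where the speed-up over scanning the dictionary word by word comes from.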

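I tested the efficiency of three lookup methods:

Joining the words into a regular expression of the form (?:钢铁是怎样炼成的|钢铁侠), compiling it, and then calling search is the fastest. (Compiling ahead of time costs some time, but it can be done during initialization, so it only happens once.)
Iterating over the dictionary and testing with if word in sentence: comes second.
Iterating over the dictionary and testing with if sentence.find(word)>-1: is the slowest.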

The code of the forced-segmentation module (Python 2) is as follows:

import re


class ForceSegmentor(object):
    def __init__(self):
        self.forcelist = []
        self.compilelist = []

    def load(self, filepath):
        # Read the forced-segmentation dictionary, one word per line.
        # Lines containing '#' are treated as comments; blank lines are skipped.
        with open(filepath, 'r') as file:
            for line in file:
                if '#' in line or not line.strip():
                    continue
                self.forcelist.append(ForceSegmentorItem(line))
        # Compile the words into regular expressions, xlen words per pattern.
        self.compilelist = []
        xlen = 60
        for start in range(0, len(self.forcelist), xlen):
            chunk = self.forcelist[start:start + xlen]
            # Decode first, then escape, so regex metacharacters stay literal.
            texts = [re.escape(item.get_text().decode('utf8')) for item in chunk]
            comstr = u'(?:' + u'|'.join(texts) + u')'
            self.compilelist.append(re.compile(comstr))

    def find_in_dict(self, sentence):
        # sentence is a UTF-8 byte string; the compiled patterns are unicode
        de_sentence = sentence.decode('utf8')
        for compilestr in self.compilelist:
            result = compilestr.search(de_sentence)
            if result:
                # Found a dictionary word contained in the sentence
                return result.group().encode('utf8')
        return None
    
    def merge(self, sentence, words):
        # Some words cannot be segmented correctly through the custom segmentation
        # dictionary alone; this method merges the pieces of a forced-dictionary word.
        # Only the first dictionary word returned by find_in_dict is merged.
        # Example: "《钢铁是怎样炼成的》的主演是谁电影《钢铁是怎样炼成的》的主演是谁" yields
        # "《 钢铁是怎样炼成的 》 的 主演 是 谁 电影 《 钢铁是怎样炼成的 》 的 主演 是 谁"
        result = words
        found_word = self.find_in_dict(sentence)
        if found_word:
            # start index -> number of tokens to merge; the same word may occur
            # several times in the sentence and be split differently each time
            spans = {}
            index_start = -1
            strm = ''
            for i, word in enumerate(words):
                if index_start != -1:
                    strm += word
                    if strm == found_word:
                        # Complete match
                        spans[index_start] = i - index_start + 1
                        index_start = -1
                        strm = ''
                        continue
                    if found_word.startswith(strm):
                        continue
                    # The joined tokens are not a prefix of the word: start over,
                    # letting the current token begin a new match
                    index_start = -1
                    strm = ''
                if word and word != found_word and found_word.startswith(word):
                    index_start = i
                    strm = word
            result = []
            i = 0
            while i < len(words):
                if i in spans:
                    result.append(found_word)
                    i += spans[i]
                else:
                    result.append(words[i])
                    i += 1
        return result

class ForceSegmentorItem(object):
    def __init__(self, line):
        # Drop the line ending (including '\r') and surrounding whitespace
        self.text = line.strip()

    def get_text(self):
        return self.text
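
The merge method above handles only the first dictionary word that find_in_dict returns. As a possible extension, the sketch below merges every dictionary word in one pass by comparing regex match positions with token boundaries. It is an alternative written for illustration in Python 3, not part of the module; `merge_all` is a hypothetical name, and it assumes the tokens concatenate back to the original sentence.

```python
import re


def merge_all(words, pattern):
    """Merge adjacent tokens that together spell a dictionary word.

    words   -- list of str tokens whose concatenation is the sentence
    pattern -- compiled regex matching any (non-empty) dictionary word
    """
    sentence = ''.join(words)
    # Map the character offset where each token starts to the token index.
    start_to_index = {}
    pos = 0
    for i, w in enumerate(words):
        start_to_index[pos] = i
        pos += len(w)
    start_to_index[pos] = len(words)  # end-of-sentence boundary

    result = []
    i = 0  # index of the next token still to be copied
    for m in pattern.finditer(sentence):
        first = start_to_index.get(m.start())
        last = start_to_index.get(m.end())
        # Merge only when the match begins and ends on token boundaries.
        if first is None or last is None or first < i:
            continue
        result.extend(words[i:first])
        result.append(m.group())
        i = last
    result.extend(words[i:])
    return result
```

A match that starts or ends in the middle of a token is left untouched, so the function never splits a token the segmenter produced.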

Usage:

    sentence = new_sentence.encode('utf8')
    ws = self.segmentor.segment(sentence)
    words = list(ws)  # result of the default LTP segmentation
    forceSegmentor = ForceSegmentor()
    forceSegmentor.load('../init_data/cws.lex')
    words = forceSegmentor.merge(sentence, words)  # result after forced merging

Appendix: code comparing dictionary lookup efficiency:

    def find_word(self):
        import datetime
        import re
        wlist = []
        # Test with a large dictionary: forcelist itself holds 70 words,
        # and wlist repeats it 10000 times
        for i in range(10000):
            wlist.extend(self.forcelist)
        sentence = '《钢铁是怎样炼成的》的主演是谁电影《钢铁是怎样炼成的》的主演是谁'

        de_sentence = sentence.decode('utf8')
        begin = datetime.datetime.now()
        a = 0
        for fi in wlist:
            fw = fi.get_text()
            if sentence.find(fw) > -1:
                a += 1

        end = datetime.datetime.now()
        k = end - begin
        print 'find method - matches: %s, time: %s' % (a, k.total_seconds())

        begin = datetime.datetime.now()
        a = 0
        for fi in wlist:
            fw = fi.get_text()
            if fw in sentence:
                a += 1

        end = datetime.datetime.now()
        k = end - begin
        print 'in method - matches: %s, time: %s' % (a, k.total_seconds())

        compilelist = []
        y = 0
        # Number of words per compiled pattern; tuning it to the dictionary
        # size can improve efficiency further
        xlen = 60
        stop = False
        while not stop:
            comstr = '(?:'
            for x in range(xlen):
                z = y * xlen + x
                if z > len(wlist) - 1:
                    stop = True
                    break
                if x > 0:
                    comstr += '|'
                comstr += wlist[z].get_text()
            comstr += ')'
            compilelist.append(re.compile(comstr.decode('utf8')))
            y += 1

        begin = datetime.datetime.now()
        a = 0
        for compilestr in compilelist:
            result = compilestr.search(de_sentence)
            if result:
                g = result.group()
                a += 1

        end = datetime.datetime.now()
        k = end - begin
        print 'regex method - matches: %s, time: %s' % (a, k.total_seconds())

Results (time in seconds):

find method - matches: 10000, time: 0.169297
in method - matches: 10000, time: 0.094882
regex method - matches: 10000, time: 0.00471
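
For readers who want to repeat the comparison on Python 3, a rough sketch using the standard `timeit` module follows. The dictionary here is hypothetical (two real entries padded with filler words that do not occur in the sentence), so the absolute numbers will differ from those above. Note that the regex variant only answers whether any dictionary word is present, while the other two count every word found.

```python
import re
import timeit

sentence = '《钢铁是怎样炼成的》的主演是谁'
# Hypothetical dictionary: two real entries padded with filler words.
words = ['钢铁是怎样炼成的', '钢铁侠'] + ['词%d' % i for i in range(58)]
pattern = re.compile('(?:' + '|'.join(re.escape(w) for w in words) + ')')


def by_find():
    # Count dictionary words located with str.find.
    return sum(1 for w in words if sentence.find(w) > -1)


def by_in():
    # Count dictionary words located with the in operator.
    return sum(1 for w in words if w in sentence)


def by_regex():
    # One search over the precompiled alternation.
    return pattern.search(sentence) is not None


for fn in (by_find, by_in, by_regex):
    seconds = timeit.timeit(fn, number=2000)
    print('%s: %.4f s' % (fn.__name__, seconds))
```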
