Word segmentation using the reverse maximum matching method (Python)

Enterprise 2023-06-21 13:47:31

[Figure: flow chart of the reverse maximum matching procedure]

# Reverse maximum matching
class IMM(object):
    def __init__(self, dic_path):
        self.dictionary = set()  # set of dictionary words
        self.maximum = 0  # maximum match length (length of the longest word)
        with open(dic_path, 'r', encoding='utf-8') as f:  # open the corpus file at the given path
            for line in f:
                line = line.strip()  # strip leading and trailing whitespace
                if not line:
                    continue
                self.dictionary.add(line)  # add each corpus entry to the set
                if len(line) > self.maximum:
                    self.maximum = len(line)  # track the longest entry seen so far

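    def cut(self, text):
        result = []
        index = len(text)  # current end position; moves from the end of the text to the start
        while index > 0:
            word = None
            for size in range(self.maximum, 0, -1):  # try window sizes from the maximum length down to 1
                if index - size < 0:
                    continue  # window would start before the beginning of the text
                piece = text[(index - size):index]  # slice backwards from the current end position
                if piece in self.dictionary:
                    word = piece
                    result.append(word)
                    index -= size
                    break
            if word is None:
                # no dictionary word ends here: skip this character (it is left out of the result)
                index -= 1
        return result[::-1]  # words were collected back to front, so reverse them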

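if __name__ == '__main__':
    data_path = ""  # set this to the path of your dictionary file (one word per line)

    text = '待切分文本'  # placeholder: the text to be segmented
    tokenizer = IMM(data_path)  # pass the variable, not the string literal 'data_path'
    print(tokenizer.cut(text))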
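
To see the matching loop at work without a dictionary file, here is a minimal sketch of the same procedure as a standalone function that takes an in-memory set of words. The function name `reverse_max_match`, the `keep_unmatched` flag, and the tiny sample word set are assumptions made for illustration; they are not part of the original post. With `keep_unmatched=False` the function mirrors the `cut` method above, which leaves out any character that no dictionary word ends on; with `keep_unmatched=True` such characters are kept as single-character tokens.

```python
def reverse_max_match(text, dictionary, keep_unmatched=True):
    """Segment text by scanning from the end and taking the longest dictionary word each time."""
    # Longest word length bounds the window size (0 if the dictionary is empty).
    max_len = max((len(w) for w in dictionary), default=0)
    tokens = []
    index = len(text)  # end position of the current window
    while index > 0:
        # Try the largest window that fits, then shrink it one character at a time.
        for size in range(min(max_len, index), 0, -1):
            piece = text[index - size:index]
            if piece in dictionary:
                tokens.append(piece)
                index -= size
                break
        else:
            # No word ends at this position: optionally keep the single character.
            if keep_unmatched:
                tokens.append(text[index - 1])
            index -= 1
    return tokens[::-1]  # tokens were collected back to front


# Hypothetical sample dictionary, chosen only to illustrate the scan direction.
sample_words = {"研究", "研究生", "生命", "的", "起源"}
print(reverse_max_match("研究生命的起源", sample_words))
# → ['研究', '生命', '的', '起源']
```

Because the scan starts from the end of the sentence, "生命" is matched before the scan reaches "研究生", so this example comes out as 研究 / 生命 / 的 / 起源 even though "研究生" is in the word set.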

Note: the corpus (dictionary file) is not provided here; you need to find one yourself.
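
The loader in `__init__` treats every non-empty line of the file as one dictionary word, so the file is expected to hold one word per line. The sketch below shows that format with a tiny made-up word list written to a temporary file. The helper names `load_dictionary` and `write_sample_dictionary` and the `first_field_only` option are assumptions for illustration. The option covers word lists that carry extra columns after the word (the dictionary shipped with jieba, for example, has a frequency and a part-of-speech tag on each line); without it, the whole line, columns included, would be stored as a single "word".

```python
import os
import tempfile


def load_dictionary(path, first_field_only=False):
    """Load a word list (one entry per line) into a set and return it with the longest entry length."""
    words = set()
    longest = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if first_field_only:
                # Assumption: extra whitespace-separated columns follow the word.
                line = line.split()[0]
            words.add(line)
            longest = max(longest, len(line))
    return words, longest


def write_sample_dictionary():
    """Write a tiny made-up word list to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("研究\n研究生\n\n生命\n的\n起源\n")
    return path


sample_path = write_sample_dictionary()
try:
    words, longest = load_dictionary(sample_path)
    print(sorted(words), longest)
finally:
    os.remove(sample_path)  # clean up the temporary file
```

The returned set and length correspond to `self.dictionary` and `self.maximum` in the class above.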

Origin blog.csdn.net/m0_52051577/article/details/124039543
