Sentence splitting for Chinese and Japanese text in Python 3

1. Background

I have been working on parallel corpus alignment and word alignment recently, and alignment requires splitting the text into sentences first.
Punctuation can itself be a useful feature in some cases, so the goal here is to split as accurately as possible while keeping it.
The main issues to consider are:

  • Keeping the delimiters
  • Sentences inside quotation marks
  • Several punctuation marks in a row
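To make these three issues concrete, here is a small sketch of what a naive re.split() does. The sample sentence is made up for illustration and is not from the original corpus.

```python
import re

text = '他说:“你好。我来了。”然后走了!!真的吗?'

# 1. A plain character class discards the delimiters entirely.
dropped = re.split('[。!?]', text)

# 2. A capturing group keeps them, but as separate list items.
#    The quoted speech is still cut in two, and the doubled "!!"
#    produces an empty string between the two marks.
kept = re.split('([。!?])', text)

print(dropped)
print(kept)
```

In both results the first piece ends inside the quotation marks, which is exactly the case the rest of this post tries to avoid.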

Once I decided not to split inside quotation marks, a small trick makes the approach straightforward:
save each quoted or bracketed span as a whole to a queue and put a placeholder in its position.
After splitting, substitute the saved spans back in order.
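The placeholder trick described above can be sketched in isolation as follows. This is a minimal illustration, not the full function: the helper names protect and restore are hypothetical, and only two kinds of paired marks are covered to keep it short.

```python
import re

PLACEHOLDER = '$PACK$'  # assumed marker; it must not occur in the input text
# Only two kinds of paired marks, to keep the sketch short.
PAIRED = re.compile(r'“.+?”|「.+?」')


def protect(text):
    """Swap each paired span for PLACEHOLDER; return the masked text and the spans in order."""
    spans = PAIRED.findall(text)
    return PAIRED.sub(lambda m: PLACEHOLDER, text), spans


def restore(text, spans):
    """Put the saved spans back, first placeholder first."""
    queue = iter(spans)
    return re.sub(re.escape(PLACEHOLDER), lambda m: next(queue), text)


masked, saved = protect('他说:“你好。我来了。”然后走了。')
print(masked)                  # 他说:$PACK$然后走了。
print(saved)                   # ['“你好。我来了。”']
print(restore(masked, saved))  # the original sentence again
```

Because the masked text no longer contains the punctuation inside the quotation marks, any splitting done between protect and restore cannot cut through a quoted span.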

2. Code

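Note that the split point here is a zero-width regex match. On Python versions before 3.7, re.split() cannot split on a pattern that only matches the empty string and raises a ValueError, so the code below finds each split point with re.search() instead.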
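As an illustration of the note above, the following sketch cuts a string at zero-width boundaries without calling re.split(), so it does not depend on the Python version. The helper name split_at_boundaries is hypothetical, and the terminator set (full-width and ASCII marks) is an assumption.

```python
import re

# Zero-width boundary: just after a terminator, and not just before another one.
BOUNDARY = re.compile(r'(?<=[。!?!?])(?![。!?!?])')


def split_at_boundaries(text):
    """Cut text at every BOUNDARY match, keeping the punctuation with its sentence."""
    pieces, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.start()
        if end > start:
            pieces.append(text[start:end])
            start = end
    if start < len(text):
        pieces.append(text[start:])  # trailing text without a terminator
    return pieces


print(split_at_boundaries('今天下雨了。你带伞了吗?!没有'))
# ['今天下雨了。', '你带伞了吗?!', '没有']
```

The negative lookahead is what keeps a run such as "?!" together: a boundary is only reported after the last mark in the run.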

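import re


def my_split(string):
    """
    Treat each quoted or bracketed span as a single unit: save it to a
    queue, replace it with a placeholder, and swap it back after splitting.
    Ellipses are not handled for now.
    TODO: consider splitting around quoted speech as well,
    e.g. ‘xxx:“xxx。”xx,xxxx。’ could be split further.
    """
    SPLIT_SIGN = '%%%%'  # the input string must not contain this separator

    # Placeholder that stands in for each protected span
    SIGN = '$PACK$'
    search_pattern = re.compile(re.escape(SIGN))
    # No inner capturing groups, so findall() returns whole matches.
    # The parentheses and square brackets here are the full-width forms.
    pack_pattern = re.compile(
        r'“.+?”|(.+?)|《.+?》|〈.+?〉|[.+?]|【.+?】'
        r'|‘.+?’|「.+?」|『.+?』|".+?"|\'.+?\''
    )
    pack_queue = re.findall(pack_pattern, string)
    string = re.sub(pack_pattern, lambda m: SIGN, string)

    # Zero-width boundary: right after a run of sentence-ending punctuation
    pattern = re.compile(r'(?<=[。?!?!])(?![。?!?!])')
    result = []
    while string != '':
        s = re.search(pattern, string)
        if s is None:
            result.append(string)
            break
        loc = s.span()[0]
        result.append(string[:loc])
        string = string[loc:]

    result_string = SPLIT_SIGN.join(result)
    # Restore the protected spans in their original order
    while pack_queue:
        pack = pack_queue.pop(0)
        loc = re.search(search_pattern, result_string).span()
        result_string = result_string[:loc[0]] + pack + result_string[loc[1]:]

    return result_string.split(SPLIT_SIGN)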

References

Implementing Chinese sentence splitting in Python
GitHub repository (I left the clumsy first attempt in place. It reminds me of an algorithm problem I have solved before, but I can't remember which one.)
