Python's str.split() and re.split()

Suppose we have the following string:

my_string = "我是谁?我在哪儿?!不管了。不管了。"

str.split()

We want to cut the text into several pieces at the punctuation marks. First, let's look at Python's built-in split() method.

# e.g.1
string_list = my_string.split('。')
print(string_list)
>>> ['我是谁?我在哪儿?!不管了', '不管了', '']

As you can see, split('。') divides the string at each full-width period and returns a list.
The drawback of split() is that it only accepts a single delimiter. For example, suppose we want to split the string at both the Chinese period and the Chinese question mark:

# e.g.2
string_list = my_string.split('?。')
print(string_list)
>>> ['我是谁?我在哪儿?!不管了。不管了。']

string_list = my_string.split('?!')
print(string_list)
>>> ['我是谁?我在哪儿', '不管了。不管了。']

As you can see, split() treats '?。' as one whole separator string, so it simply cannot split on several punctuation marks at once.
P.S. As in e.g.1, if the string ends with the separator we split on, the last element of the returned list will be an empty string ''. In a matching task this is fatal, so it must be avoided. Two methods are given here:

# Method 1: before splitting, strip() the string to remove separators
# at the beginning and end
my_string = my_string.strip(" \n\u3000!!??。")
# strip() takes a set of characters (not a pattern), so simply list every character to remove
# Method 2: after splitting, filter the empty strings out of the list returned by split
# filter_data(): given a list of strings [str1, str2, str3, ...], it filters out
# empty strings '' and whitespace-only strings such as '\n', and returns the new list
def not_break(sen):
    return sen != '\n' and sen != '\u3000' and sen != '' and not sen.isspace()
def filter_data(ini_data):
    # ini_data is a list of sentences (strings), e.g. the list returned by split
    new_data = list(filter(not_break, [data.strip() for data in ini_data]))
    return new_data
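
A quick usage sketch of Method 2 (this combination is implied above but not spelled out in the original), applied to the original, unstripped string:

my_string = "我是谁?我在哪儿?!不管了。不管了。"
string_list = filter_data(my_string.split('。'))
print(string_list)
>>> ['我是谁?我在哪儿?!不管了', '不管了']
# the trailing empty string from e.g.1 has been filtered out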

The second method is the one I recommend.

re.split()

To split a sentence on more than one delimiter, we have to turn to Python's more powerful regular-expression module.
First import the re library, and continue with the same string as the example:

import re
'''
Signature: re.split(pattern, string, maxsplit=0, flags=0)
pattern: the delimiter, given as a regular expression (str)
string: the original string (str)
maxsplit: the maximum number of splits (0 means no limit)
flags: optional regex flags, e.g. flags=re.IGNORECASE for case-insensitive matching
'''
#e.g.3
my_string = "我是谁?我在哪儿?!不管了。不管了。"
string_list = re.split("。", my_string)
>>> ['我是谁?我在哪儿?!不管了', '不管了', '']
# with a single delimiter, re.split() behaves exactly like str.split() (so we might as well use re.split() from now on; it is more powerful anyway)
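
The maxsplit and flags parameters described above are not exercised in e.g.3; for completeness, two quick illustrations:

# maxsplit limits how many split points are used
print(re.split("。", my_string, maxsplit=1))
>>> ['我是谁?我在哪儿?!不管了', '不管了。']
# flags passes regex options, e.g. case-insensitive matching
print(re.split("a", "bAnAnA", flags=re.IGNORECASE))
>>> ['b', 'n', 'n', '']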

string_list = filter_data(re.split(r"[。|!]", my_string))
print(string_list)
>>> ['我是谁?我在哪儿?', '不管了', '不管了']
# and this is where the extra power shows

'''A note on r"[。|!]"
1) The 'r' prefix makes the pattern a raw string, so Python does not try to
   interpret backslashes as escape sequences before the regex engine sees them.
2) Multiple delimiters can be written either as a character class "[。!]", where
   every character between the brackets is a delimiter, or as an alternation
   "(。|!|?)"; the difference between the two bracket forms is shown in the next example.
3) '|' only means "or" outside square brackets; inside "[]" it is just another
   literal character, so "[。|!]" also happens to split on a literal '|'.
'''
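
A small demonstration of point 3, splitting a string that actually contains a literal '|':

print(re.split(r"[。|!]", "a|b。c"))
>>> ['a', 'b', 'c']
# inside [], the '|' is matched as an ordinary character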
string_list = filter_data(re.split(r"(。|!|?)", my_string))
print(string_list)
>>> ['我是谁', '?', '我在哪儿', '?', '!', '不管了', '。', '不管了', '。']
# this form keeps the delimiters (thanks to the capture group), which makes it easy to reconstruct the original text
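
For example, because nothing is thrown away here (no filter_data), the pieces join straight back into the original string:

parts = re.split(r"(。|!|?)", my_string)
print("".join(parts) == my_string)
>>> True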

Splitting a string on all Chinese and English punctuation

For some "idiot" of the operation, a reference library is a more comfortable way

from zhon.hanzi import punctuation as chinese_punctuation  # Chinese punctuation marks
import string
english_punctuation = string.punctuation  # English punctuation marks

chi_punc = '|'.join([c for c in chinese_punctuation])
eng_punc = '|'.join([c for c in english_punctuation])
punc = chi_punc + eng_punc
>>> punc: '"|#|$|%|&|'|(|)|*|+|,|-|/|:|;|<|=|>|@|[|\|]|^|_|`|{|||}|~|⦅|⦆|「|」|、|\u3000|、|〃|〈|〉|《|》|「|」|『|』|【|】|〔|〕|〖|〗|〘|〙|〚|〛|〜|〝|〞|〟|〰|〾|〿|–|—|‘|’|‛|“|”|„|‟|…|‧|﹏|﹑|﹔|·|!|?|。|。!|"|#|$|%|&|\'|(|)|*|+|,|-|.|/|:|;|<|=|>|?|@|[|\\|]|^|_|`|{|||}|~'
# Note: the run of '|' characters inside punc (from the literal '|' in the punctuation sets) can make the split degenerate into one character per piece, so we drop it by hand; if '|' itself really needs to be a delimiter too, that needs separate handling
punc = punc[:-6]+punc[-4:]
my_string = "sen1。sen2.sen3?sen4“”sen5,.,"
my_stringList = filter_data(re.split("[" + punc + "]", my_string))
print(my_stringList)

# Result
>>> ['sen1', 'sen2', 'sen3', 'sen4', 'sen5']
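
As an aside (not part of the original recipe), re.escape() can build the same character class without the manual '|'-joining and slicing above; a minimal sketch, reusing filter_data() from earlier:

import re
import string
from zhon.hanzi import punctuation as chinese_punctuation

# escape every character so regex metacharacters such as [, ], \ and | are treated literally inside the class
all_punc = re.escape(chinese_punctuation + string.punctuation)
my_string = "sen1。sen2.sen3?sen4“”sen5,.,"
print(filter_data(re.split("[" + all_punc + "]", my_string)))
>>> ['sen1', 'sen2', 'sen3', 'sen4', 'sen5']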

This solves the problem completely and is worth keeping in mind.

Reproduced from: https://www.jianshu.com/p/eb5610fc2c4b

Origin: blog.csdn.net/weixin_33881041/article/details/91074298