nltk: Natural Language Processing with Python, Part 2

The tokenizers covered in the previous part all use predefined rules.

If you want to tokenize text according to your own rules, you can use a regular-expression tokenizer.

1. The RegexpTokenizer class

from nltk.tokenize import RegexpTokenizer

text = " I won't just survive, Oh, you will see me thrive. Can't write my story,I'm beyond the archetype."

# Instantiate RegexpTokenizer; by default it applies the pattern with re.findall()
regexp_tokenizer = RegexpTokenizer(pattern=r"\w+")
# With gaps=True, the pattern is instead applied with re.split()
regexp_tokenizer1 = RegexpTokenizer(r"[\s,'\.]", gaps=True)
print(regexp_tokenizer.tokenize(text))
# ['I', 'won', 't', 'just', 'survive', 'Oh', 'you', 'will', 'see', 'me', 'thrive', 'Can', 't', 'write', 'my', 'story', 'I', 'm', 'beyond', 'the', 'archetype']
print(regexp_tokenizer1.tokenize(text))
# ['I', 'won', 't', 'just', 'survive', 'Oh', 'you', 'will', 'see', 'me', 'thrive', 'Can', 't', 'write', 'my', 'story', 'I', 'm', 'beyond', 'the', 'archetype']
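Notice that both tokenizers above break contractions such as "won't" into "won" and "t", because the apostrophe matches neither `\w` nor survives the split. A pattern of your own can keep contractions together; the pattern below is just one illustrative choice, not an NLTK built-in:

```python
from nltk.tokenize import RegexpTokenizer

text = " I won't just survive, Oh, you will see me thrive."
# Hypothetical pattern: a run of word characters, optionally followed by
# an apostrophe plus more word characters, so "won't" stays one token.
tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?")
print(tokenizer.tokenize(text))
# ['I', "won't", 'just', 'survive', 'Oh', 'you', 'will', 'see', 'me', 'thrive']
```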

2. The regexp_tokenize function

# This function instantiates a RegexpTokenizer internally and calls its tokenize() method; its usage and behavior are the same as RegexpTokenizer's

from nltk import regexp_tokenize

text = " I won't just survive, Oh, you will see me thrive. Can't write my story,I'm beyond the archetype."
print(regexp_tokenize(text, r"[\s,'\.]", gaps=True))
# ['I', 'won', 't', 'just', 'survive', 'Oh', 'you', 'will', 'see', 'me', 'thrive', 'Can', 't', 'write', 'my', 'story', 'I', 'm', 'beyond', 'the', 'archetype']
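Since regexp_tokenize is just a convenience wrapper, the function call and the class-based call should produce identical results for the same pattern. A quick check:

```python
from nltk import regexp_tokenize
from nltk.tokenize import RegexpTokenizer

text = "I won't just survive"
pattern = r"\w+"

# Both paths run the same underlying tokenizer, so the results match.
func_result = regexp_tokenize(text, pattern)
class_result = RegexpTokenizer(pattern).tokenize(text)
print(func_result == class_result)
```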

3. BlanklineTokenizer

# BlanklineTokenizer is a subclass of RegexpTokenizer, preset with the pattern r'\s*\n\s*\n\s*' and gaps=True, so it splits text on blank lines

# There is also a corresponding blankline_tokenize() function

from nltk.tokenize import BlanklineTokenizer

text = " I won't just survive, Oh, you will see me thrive.\n\n Can't write my story,I'm beyond the archetype."
# Instantiate the tokenizer and split the text
print(BlanklineTokenizer().tokenize(text))
# [" I won't just survive, Oh, you will see me thrive.", "Can't write my story,I'm beyond the archetype."]

4. WhitespaceTokenizer

# WhitespaceTokenizer is likewise a subclass of RegexpTokenizer, preset with the pattern r'\s+' and gaps=True, so it splits text on runs of whitespace

from nltk import WhitespaceTokenizer

text = " I won't just survive, Oh, you will see me thrive. Can't write my story,I'm beyond the archetype."

print(WhitespaceTokenizer().tokenize(text))
# ['I', "won't", 'just', 'survive,', 'Oh,', 'you', 'will', 'see', 'me', 'thrive.', "Can't", 'write', 'my', "story,I'm", 'beyond', 'the', 'archetype.']
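Because WhitespaceTokenizer is a gap-based RegexpTokenizer, it also supports span_tokenize(), which yields (start, end) character offsets into the original string instead of the token strings themselves. This is useful when you need to map tokens back to their positions:

```python
from nltk.tokenize import WhitespaceTokenizer

text = "I won't just survive"
tokenizer = WhitespaceTokenizer()
# Each span is a (start, end) offset pair into the original text
spans = list(tokenizer.span_tokenize(text))
print(spans)
print([text[start:end] for start, end in spans])
```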

# There are also some simpler string-based tokenizers: StringTokenizer, which splits on a fixed string, and its subclasses such as SpaceTokenizer (splits on single spaces) and TabTokenizer (splits on tabs); LineTokenizer splits text into lines
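The string-based tokenizers listed above are thin wrappers around str.split(), so unlike WhitespaceTokenizer they do not merge consecutive delimiters; note the empty token that SpaceTokenizer produces for a double space:

```python
from nltk.tokenize import SpaceTokenizer, TabTokenizer, LineTokenizer

# SpaceTokenizer splits on every single space, keeping empty tokens
print(SpaceTokenizer().tokenize("a b  c"))
# ['a', 'b', '', 'c']

# TabTokenizer splits on tab characters
print(TabTokenizer().tokenize("a\tb\tc"))
# ['a', 'b', 'c']

# LineTokenizer splits on newlines, discarding blank lines by default
print(LineTokenizer().tokenize("line1\nline2"))
# ['line1', 'line2']
```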


Reprinted from blog.csdn.net/qq_41864652/article/details/81505768