This article covers the following word segmentation tools:
- Word segmentation with jieba
- Word segmentation with pyltp
- Word segmentation with pkuseg
- Tokenization with nltk
Normally, NLP pipelines cannot process a complete paragraph or sentence in one piece, so the first step is usually sentence splitting and word segmentation. Several word segmentation methods are introduced below.
1. Word segmentation with jieba
You can refer to the article I wrote before: https://blog.csdn.net/TFATS/article/details/108810284
2. Word segmentation with pyltp
You can refer to the article I wrote before: https://blog.csdn.net/TFATS/article/details/108511408
3. Word segmentation with pkuseg
You can refer to the article I wrote before: https://blog.csdn.net/TFATS/article/details/108851344
4. Tokenization with nltk
The nltk toolkit is generally used to tokenize English text. Only the tokenize method is introduced here; for more detailed usage, see: https://www.cnblogs.com/chen8023miss/p/11458571.html
http://www.pythontip.com/blog/post/10012/
Note: you may run into some problems when installing nltk; you can refer to the article I shared before: https://blog.csdn.net/TFATS/article/details/108519904
from nltk import word_tokenize

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."
sents = [sent1, sent2]
# word_tokenize already returns a list of tokens, so no inner comprehension is needed
texts = [word_tokenize(sent) for sent in sents]
# ------ output------
[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]