Natural language processing (1): Chinese word segmentation

Copyright: Nanmu's blog, https://blog.csdn.net/Godsolve/article/details/90522525


There are many ways to segment Chinese text. Today we use six segmentation tools, Jieba, SnowNLP, NLTK, THULAC (from the THUNLP lab), NLPIR, and Stanford, to segment a given Chinese text.

1. Jieba segmentation

Jieba ("stutter") is a Chinese word segmentation tool. It is easy to install and use, and it supports three segmentation modes:

  • Precise mode: tries to cut the sentence as accurately as possible; suitable for text analysis.
  • Full mode: scans out all the words in the sentence that could form words; very fast, but it cannot resolve ambiguity.
  • Search engine mode: re-segments long words on top of precise mode to improve recall; suitable for search engine segmentation.

In this experiment I use all three modes to segment the Chinese document provided in the experiment files.
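For reference, a minimal sketch (not the author's exact script) of how the three modes are invoked in Python; the sample sentence below is only an illustration, not the experiment file:

    # Minimal sketch: jieba's three segmentation modes on a sample sentence
    import jieba

    text = "神丹牌土鸡蛋"  # hypothetical sample, standing in for the experiment text

    print("Precise mode:", "/".join(jieba.cut(text)))                # default mode
    print("Full mode:   ", "/".join(jieba.cut(text, cut_all=True)))  # all possible words
    print("Search mode: ", "/".join(jieba.cut_for_search(text)))     # re-cut long words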

The segmentation results are as follows:
[Screenshot: jieba segmentation results in the three modes]
It can be seen that some words are not segmented well, for example "神丹牌" (Shendan brand) and "土鸡蛋" (free-range eggs), so a custom dictionary is needed.
A custom dictionary lets developers supply their own dictionary containing words that are not in jieba's built-in lexicon. Although jieba can recognize new words on its own, adding them yourself guarantees higher accuracy.

Usage: jieba.load_userdict(file_name)
file_name is a file-like object or the path to the custom dictionary file.

The dictionary has the same format as dict.txt: one word per line. Each line has three parts, the word, its frequency (optional), and its part of speech (optional), separated by spaces; the order must not be reversed. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.

When the frequency is omitted, jieba automatically computes a frequency that ensures the word can be segmented out.
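A minimal sketch of loading a custom dictionary, assuming a hypothetical file userdict.txt that contains the missing words:

    # Hypothetical userdict.txt (UTF-8), one entry per line: word [frequency] [POS]
    #   神丹牌 n
    #   土鸡蛋 n
    import jieba

    jieba.load_userdict("userdict.txt")       # path or file-like object
    print("/".join(jieba.cut("神丹牌土鸡蛋")))  # the custom words now stay intact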
Custom dictionary:
[Screenshot: the custom dictionary file]
Output after loading the custom dictionary:
[Screenshot: segmentation results with the custom dictionary]

2. SnowNLP segmentation

SnowNLP is a library written in Python for working with Chinese text conveniently. It offers a dozen or so features, such as Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, and text similarity. This time I only need the simple segmentation feature. The result of segmenting the Chinese text is:
[Screenshot: SnowNLP segmentation results]
As you can see, just like jieba before the custom dictionary was added, uncommon words such as "神丹牌" (Shendan brand) and "土鸡蛋" (free-range eggs) are not segmented accurately.
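A minimal sketch of the SnowNLP call used for segmentation (the sample sentence is again just an illustration):

    # Minimal sketch: tokenizing a sample sentence with SnowNLP
    from snownlp import SnowNLP

    s = SnowNLP("神丹牌土鸡蛋")  # hypothetical sample sentence
    print(s.words)               # list of segmented words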

3. NLTK segmentation

NLTK is an efficient platform for building Python programs that process natural language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
Installing NLTK itself is very simple: a pip install is enough. But after installing it, choosing which of its many data packages to install becomes another problem, because different functions need different packages. My choice was to download everything, since I did not know which packages the functions I was going to use actually depend on.
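For reference, a minimal sketch, assuming only the punkt tokenizer models are needed instead of the full download (note that NLTK's tokenizer is designed for space-delimited languages such as English):

    # Minimal sketch: downloading tokenizer data and tokenizing with NLTK
    import nltk
    nltk.download("punkt")                  # nltk.download("all") fetches everything, as above

    from nltk.tokenize import word_tokenize
    print(word_tokenize("This is a sample sentence."))  # illustrative English sample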

The result of segmenting the text with NLTK is:
[Screenshot: NLTK tokenization results]

4. THULAC segmentation

THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab of Tsinghua University. It provides Chinese word segmentation and part-of-speech tagging, and is characterized by strong capability, high accuracy, and relatively fast speed.
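A minimal sketch of calling it through the thulac Python package (the sample sentence is an illustration):

    # Minimal sketch: segmentation with the thulac package
    import thulac

    thu = thulac.thulac(seg_only=True)         # seg_only=True: segmentation without POS tags
    print(thu.cut("神丹牌土鸡蛋", text=True))   # text=True returns a space-separated string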

The segmentation result with thulac is:
[Screenshot: THULAC segmentation results]
It can be seen that this library achieves higher segmentation accuracy than the previous Chinese segmentation tools (including jieba without the custom dictionary); uncommon words such as "土鸡蛋" (free-range eggs) and "神丹牌" (Shendan brand) are also segmented correctly.

5. NLPIR segmentation

When using NLPIR, check whether the license has expired; the fix is to download a new license file to renew the authorization.
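A minimal sketch, assuming NLPIR is called through the pynlpir Python wrapper (other bindings exist; the sample sentence is an illustration):

    # Minimal sketch: segmentation plus POS tagging via the pynlpir wrapper.
    # If the license has expired, open() will fail; download a fresh license
    # file to renew the authorization, as noted above.
    import pynlpir

    pynlpir.open()
    print(pynlpir.segment("神丹牌土鸡蛋"))   # list of (word, part of speech) pairs
    pynlpir.close()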

After getting it set up, the segmentation result is:
[Screenshot: NLPIR segmentation results]
It not only segments the text but also performs part-of-speech tagging along the way, and the segmentation result is not bad either.

6. Stanford segmentation

Stanford's segmentation tool requires downloading some rather large packages before it can be used, so I ran into a few problems along the way, but all of them could be solved by searching on Baidu.
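A minimal sketch, assuming the stanfordcorenlp Python wrapper and a locally downloaded CoreNLP package with the Chinese models; the path is a placeholder:

    # Minimal sketch: Chinese segmentation via the stanfordcorenlp wrapper
    from stanfordcorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(r"path/to/stanford-corenlp-full", lang="zh")  # placeholder path
    print(nlp.word_tokenize("神丹牌土鸡蛋"))
    nlp.close()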

The segmentation result is:
[Screenshot: Stanford segmentation results]

7. Conclusion

Each segmentation tool has its own characteristics; under different conditions, the results also differ.

I compared five Chinese segmentation tools: Jieba, SnowNLP, thulac (from Tsinghua University's NLP and Computational Social Science Lab), StanfordCoreNLP, and pyltp (HIT's LTP Cloud). The environment was Win10 with Anaconda (Python 3.7).

Only Thulac's result stood out as different, and StanfordCoreNLP consumed a lot of memory and CPU while running. Trying another sentence, "这本书很不错" ("This book is quite good"), jieba could not separate "本", while all the other tools segmented it completely, although StanfordCoreNLP still used a lot of memory and CPU.
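As a quick illustration of that single-sentence check, a minimal sketch with jieba (the other tools follow the same pattern as the snippets above):

    # Minimal sketch: the single-sentence comparison with jieba.
    # Per the author, jieba keeps "本" attached here while the other tools split it off.
    import jieba

    print(jieba.lcut("这本书很不错"))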

There is too much code and text to upload conveniently; if you need it, please click here to download.

Appendix: recommended segmentation tools

Chinese segmentation tools

  1. Jieba
  2. SnowNLP
  3. THULAC
  4. NLPIR
  5. StanfordCoreNLP
  6. HanLP

English segmentation tools

  1. nltk
  2. Spacy
  3. StanfordCoreNLP

WeChat official account

You are also welcome to follow my WeChat official account, 南木午后茶 (Nanmu Afternoon Tea).
