Several word segmentation methods in HanLP Chinese natural language processing

Natural language processing is of extraordinary significance to both big data and artificial intelligence, which have exploded in recent years. So what is natural language processing? Before I was exposed to big data, I had only heard of natural language processing while studying computing. Books define or describe natural language processing in overly specialized terms. Put plainly, natural language processing uses various methods and technologies to translate human language into a language that machines can understand.

Human languages are numerous, and computer technology originated abroad, so natural language processing has basically always revolved around English. Chinese natural language processing, of course, translates Chinese into instructions that machines can recognize and understand. Everyone knows the breadth and depth of Chinese, and it is precisely this breadth and depth that makes translating Chinese into machine instructions so difficult! At least for a long time, this is the problem Chinese natural language processing has faced.

I believe that many friends engaged in program development know of or are familiar with HanLP Chinese natural language processing. HanLP is developed under the auspices of Dakuai Search and is an important part of the Dakuai DKhadoop big data integrated development framework. Below is a brief introduction to HanLP's Chinese word segmentation methods.

The word segmentation methods in HanLP Chinese natural language processing include standard word segmentation, NLP word segmentation, index word segmentation, N-shortest path word segmentation, CRF word segmentation, and extremely fast dictionary word segmentation. These methods are described below.

Standard word segmentation:

There is a series of "out of the box" static tokenizers in HanLP, whose class names all end with Tokenizer. HanLP.segment is in fact a wrapper around StandardTokenizer.segment.
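As a minimal sketch of how this is typically called (assuming HanLP 1.x, where HanLP and Term live under the com.hankcs.hanlp packages; the statements belong inside a main method, and the sample sentence is illustrative):

    import com.hankcs.hanlp.HanLP;
    import com.hankcs.hanlp.seg.common.Term;
    import java.util.List;

    // HanLP.segment delegates to StandardTokenizer.segment under the hood
    List<Term> termList = HanLP.segment("商品和服务");  // "goods and services"
    System.out.println(termList);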

NLP word segmentation:

    // The demo sentence is Chinese for: "Professor Zong Chengqing of the Institute of
    // Computing Technology, CAS, is teaching a natural language processing course"
    List<Term> termList = NLPTokenizer.segment("中国科学院计算技术研究所的宗成庆教授正在教授自然语言处理课程");
    System.out.println(termList);

The NLP tokenizer NLPTokenizer performs part-of-speech tagging and all kinds of named entity recognition.

Index word segmentation:

The index tokenizer IndexTokenizer is a tokenizer for search engines; it exhaustively segments long words into the shorter words they contain. In addition, the offset of each word in the text can be obtained through term.offset.
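A minimal sketch of index tokenization with offsets (again assuming HanLP 1.x, with IndexTokenizer in com.hankcs.hanlp.tokenizer and the statements inside a main method; the sample word is illustrative):

    import com.hankcs.hanlp.seg.common.Term;
    import com.hankcs.hanlp.tokenizer.IndexTokenizer;
    import java.util.List;

    // "主副食品" ("staple and non-staple foods") is fully split into its sub-words
    List<Term> termList = IndexTokenizer.segment("主副食品");
    for (Term term : termList)
    {
        // term.offset is the start position of the word in the original text
        System.out.println(term + " [" + term.offset + ":" + (term.offset + term.word.length()) + "]");
    }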

N-shortest path word segmentation:

The N-shortest-path tokenizer NShortSegment is slower than the shortest-path tokenizer, but its results are slightly better and its named entity recognition is stronger.

In general scenarios, the accuracy of the shortest-path tokenizer is sufficient, and it is several times faster than the N-shortest-path tokenizer, so please choose as appropriate.
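For a side-by-side sketch of the two (assuming HanLP 1.x, where NShortSegment and DijkstraSegment, the shortest-path implementation, live under com.hankcs.hanlp.seg; the recognizer switches and sample sentence are illustrative):

    import com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment;
    import com.hankcs.hanlp.seg.NShort.NShortSegment;
    import com.hankcs.hanlp.seg.Segment;

    // N-shortest-path segmenter: slower, but stronger named entity recognition
    Segment nShortSegment = new NShortSegment().enableCustomDictionary(false)
            .enablePlaceRecognize(true).enableOrganizationRecognize(true);
    // Shortest-path (Dijkstra) segmenter: several times faster
    Segment shortestSegment = new DijkstraSegment().enableCustomDictionary(false)
            .enablePlaceRecognize(true).enableOrganizationRecognize(true);
    // A news-style sentence containing person, place, and organization names
    String sentence = "今天,刘志军案的关键人物,山西女商人丁书苗在市二中院出庭受审";
    System.out.println("N-shortest: " + nShortSegment.seg(sentence));
    System.out.println("Shortest:   " + shortestSegment.seg(sentence));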

CRF word segmentation:

CRF word segmentation is good at recognizing new words, but it cannot make use of a custom dictionary.
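A minimal sketch of CRF tokenization (assuming a HanLP 1.x release that ships CRFSegment in com.hankcs.hanlp.seg.CRF together with its model data; newer releases move CRF support to other analyzer classes):

    import com.hankcs.hanlp.seg.CRF.CRFSegment;
    import com.hankcs.hanlp.seg.Segment;
    import com.hankcs.hanlp.seg.common.Term;
    import java.util.List;

    Segment segment = new CRFSegment();
    segment.enablePartOfSpeechTagging(true);  // also ask the CRF model for POS tags
    // "你看过穆赫兰道吗" ("Have you seen Mulholland Drive?") contains an
    // out-of-vocabulary movie title, which CRF tends to handle well
    List<Term> termList = segment.seg("你看过穆赫兰道吗");
    System.out.println(termList);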

Extremely fast dictionary word segmentation:

Extremely fast word segmentation performs longest matching against the dictionary; it is extremely fast, with average accuracy.

On an i7 it runs at about 20 million characters per second.
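A minimal sketch (assuming HanLP 1.x, with SpeedTokenizer in com.hankcs.hanlp.tokenizer; the ShowTermNature switch and sample text are illustrative):

    import com.hankcs.hanlp.HanLP;
    import com.hankcs.hanlp.tokenizer.SpeedTokenizer;

    HanLP.Config.ShowTermNature = false;  // pure dictionary matching yields no meaningful POS tags
    // "Jiangxi's Poyang Lake dries up; China's largest freshwater lake becomes grassland"
    String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
    System.out.println(SpeedTokenizer.segment(text));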

The information above is not very comprehensive and will be supplemented in the future!
