Chinese Natural Language Processing (NLP) (1): the Python jieba module

1. Installing jieba

  Install it directly from a cmd window with pip: pip install jieba

2. Introduction to jieba

  jieba is a good Chinese word segmentation component. It supports three segmentation modes (precise mode, full mode, and search engine mode) and custom dictionaries. Custom dictionaries matter in specialized fields, where domain-specific words sometimes have to be added to the dictionary to improve segmentation quality. jieba also supports segmenting traditional Chinese text.

3. jieba's three segmentation modes and usage examples

  The main method of the jieba segmentation module is jieba.cut(). Precise mode and full mode differ mainly in the cut_all parameter passed to it.

  (1) Precise mode: cuts the text as accurately as possible; suitable for text analysis

  Set cut_all=False in jieba.cut()

  (2) Full mode: extracts every word that can be formed from the text. It is faster, but cannot resolve ambiguity

  Set cut_all=True in jieba.cut()

  (3) Search engine mode: on top of precise mode, long words are segmented again; suitable for search engines

  Call the jieba.cut_for_search() method

  These descriptions alone do not show much, so let's test the three modes:

import jieba

# jieba expects Chinese input; the sentence is shown here in English translation
text = 'Beijing University of Posts and Telecommunications under the Ministry of Education, and the Ministry of Industry and Information Technology, were the first batch of "211 Project" national key universities'

# Full mode: every word that can be formed, overlaps included
try_words = jieba.cut(text, cut_all=True)
print('Full mode segmentation results: ' + ', '.join(try_words))

# Precise mode (cut_all=False is also the default)
try_words = jieba.cut(text, cut_all=False)
print('Precise mode segmentation results: ' + ', '.join(try_words))

# Search engine mode: precise mode plus re-segmentation of long words
try_words = jieba.cut_for_search(text)
print('Search engine mode segmentation results: ' + ', '.join(try_words))

  Running the code gives:

  Full mode segmentation results: Beijing, Beijing Posts and Telecommunications, Beijing University of Posts and Telecommunications, Posts and Telecommunications, Posts and Telecommunications University, university, college, yes, education, the Ministry of Education, directly under ,,, industry, and, information, information technology, department, build ,,, the first, carried out ,, 211 ,,, construction project, the national focus, university

  Precise mode segmentation results: Beijing University of Posts and Telecommunications, is the Ministry of Education, directly under ,,, industry, and information technology, the Ministry of build ,,, first performed, "211 Project", the construction of the country Key University

  Search engine mode segmentation results: Beijing, telecommunications, Radio and TV University, Beijing University of Posts and Telecommunications, it is education, education department, directly under ,,, industry, and, information, information technology, ministry, build ,,, first performed , "211 project", the construction of the country, focusing on university

  From these results we can see that full mode extracts every word in the text. Word windows are allowed to overlap and to contain one another, so a word can appear several times in the output, but this mode can also produce ambiguous words. Precise mode cuts the text into non-overlapping pieces and gives priority to long words, so no words are repeated and ambiguity is much rarer. Its drawback is that the word window can be too large, so that some key words never appear on their own in the result. In short, which of the two modes to use depends on the scenario.

  It is worth mentioning that "the Ministry of Industry and Information Technology" in the example should also be a word of its own, but none of the three modes found it. It has to be added to the dictionary manually, which is covered in a later section.

  Next, let's try a text in which ambiguity can arise:

text2 = 'Nanjing Yangtze Bridge is a bridge'

try_words = jieba.cut(text2, cut_all=True)
print('Full mode segmentation results: ' + ', '.join(try_words))

try_words = jieba.cut(text2, cut_all=False)
print('Precise mode segmentation results: ' + ', '.join(try_words))

try_words = jieba.cut_for_search(text2)
print('Search engine mode segmentation results: ' + ', '.join(try_words))

  The text in this example is "Nanjing Yangtze River Bridge is a bridge". The result we hope for is the single word "Nanjing Yangtze River Bridge" or, failing that, the two words "Nanjing" and "Yangtze River Bridge". Running the code gives:

  Full mode segmentation results: Nanjing, Nanjing, Beijing City, the mayor, the Yangtze River, the Yangtze River Bridge, the bridge, is a, bridge

  Precise mode segmentation results: Nanjing, Yangtze River Bridge, is a, bridge

  Search engine mode segmentation results: Nanjing, Beijing city of Nanjing, the Yangtze River Bridge, the Yangtze River Bridge, is a, bridge

  The result: none of the three modes produced the segmentation we most wanted ("Nanjing Yangtze River Bridge"). Instead, ambiguous words ("mayor", "Beijing City") appeared in the full mode and search engine mode results, and they clearly have nothing to do with our context. This happens to show a defect of full mode and search engine mode: they are prone to producing ambiguous words regardless of context, while precise mode does so far less often.
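  To make the ambiguity concrete, here is a small sketch of my own that lists every way a string can be split entirely into dictionary words. It does not use jieba: TOY_DICT and all_segmentations are hypothetical names, the dictionary entries are made up for illustration, and the Chinese string is my assumption for the phrase rendered above as "Nanjing Yangtze River Bridge".

```python
# Toy dictionary for illustration only; it is NOT jieba's real dictionary.
TOY_DICT = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥", "江"}


def all_segmentations(text, dictionary):
    """Return every way to split `text` entirely into dictionary words."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this word and segment whatever remains after it.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


for seg in all_segmentations("南京市长江大桥", TOY_DICT):
    print(" / ".join(seg))
# 南京 / 市长 / 江 / 大桥
# 南京市 / 长江 / 大桥
# 南京市 / 长江大桥
```

  The first line of output is the "Nanjing / mayor / ..." reading: as far as the dictionary is concerned it is just as valid as the others, which is why a mode that ignores context can surface it.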

  As for why this happens, after consulting some related material I believe the reasons are as follows:

  Chinese word segmentation methods fall roughly into three categories: methods based on dictionary (word list) matching, methods based on word frequency statistics, and methods based on knowledge and understanding. (There is also character-by-character traversal, but no matter how short the text is or how large the dictionary is, the dictionary has to be traversed once for every character. That is too inefficient, so it is rarely adopted.)

  Chinese segmentation modules generally use the dictionary-based method. The strategy full mode uses is probably this (let n be the length of the longest word in the dictionary): starting from the first character of the text, read the next 1, 2, ..., n characters and look each of them up in the dictionary; whenever one matches, take it out. This extracts every possible word in the text, including overlapping ones. Correspondingly, the strategy of precise mode may be to extend the word window: among the candidate words of length 1 to n that start at the same position, output only the longest.

  (The "reasons" above are just my personal view from using each segmentation mode and are not necessarily correct. They have no effect on the project; after all, I have not studied the source code specifically...)
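  The two strategies guessed at above can be sketched in a few lines. This is only a toy version under my own assumptions, not how jieba is actually implemented: TOY_DICT, full_scan and longest_match are hypothetical names, and the dictionary holds a handful of made-up entries.

```python
# Toy dictionary for illustration only; it is NOT jieba's real dictionary.
TOY_DICT = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}


def full_scan(text, dictionary):
    """Guessed 'full mode': from every position, try lengths 1..n and
    keep every piece found in the dictionary (overlaps allowed)."""
    max_len = max(len(word) for word in dictionary)  # n in the text above
    found = []
    for start in range(len(text)):
        longest = min(max_len, len(text) - start)
        for length in range(1, longest + 1):
            piece = text[start:start + length]
            if piece in dictionary:
                found.append(piece)
    return found


def longest_match(text, dictionary):
    """Guessed 'precise mode': at each position keep only the longest
    dictionary word, then continue right after it (no overlaps)."""
    max_len = max(len(word) for word in dictionary)
    words = []
    start = 0
    while start < len(text):
        longest = min(max_len, len(text) - start)
        for length in range(longest, 0, -1):
            piece = text[start:start + length]
            # A single character is always accepted as a fallback.
            if length == 1 or piece in dictionary:
                words.append(piece)
                start += length
                break
    return words


print(full_scan("南京市长江大桥", TOY_DICT))
# ['南京', '南京市', '市长', '长江', '长江大桥', '大桥']
print(longest_match("南京市长江大桥", TOY_DICT))
# ['南京市', '长江大桥']
```

  Even in this toy version the full scan turns up the out-of-context word "市长" (mayor), while the longest-match pass does not, which matches the behaviour seen in the results above.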

4. Adding a custom dictionary to jieba

  Chinese word segmentation is often used in a specific context, and that context may call for its own dictionary entries. For example, "Nanjing Yangtze River Bridge" belongs in a dictionary of "landmarks" or the like. If it is split into "Nanjing" and "Yangtze River Bridge", the result may not fully reflect the focus of the original text (after all, there are many things in Nanjing, and Nanjing has more than one Yangtze River bridge). This is when we need to add custom words to improve segmentation quality.

jieba.add_word('Industry and Information Technology')

  Use the jieba module's add_word() method to add a new word to its dictionary. After adding, the segmentation results are as follows:

  Full mode segmentation results: Beijing, Beijing Posts and Telecommunications, Beijing University of Posts and Telecommunications, Posts and Telecommunications, Posts and Telecommunications University, university, college, yes, education, the Ministry of Education, directly under ,,, Industrial, Industry and Information Technology, Information, Information Technology, Ministry, a total of ,,, first built, be ,, 211 ,,, engineering construction, the country, the focus of the University

  Precise mode segmentation results: Beijing University of Posts and Telecommunications, is the Ministry of Education, directly under ,,, industry and information technology Ministry, a total of built ,,, first performed, "211 project", the construction of the country, the focus of the University

  Search engine mode segmentation results: Beijing, telecommunications, Radio and TV University, Beijing University of Posts and Telecommunications, is education, education department, directly under ,,, industry, information, information technology, industry and information technology, to build ,,, first performed, "211 project", the construction of the country, focusing on university

  Full mode segmentation results: Nanjing, Nanjing, Nanjing Yangtze River Bridge, Beijing City, the mayor, the Yangtze River, the Yangtze River Bridge, the bridge, is a, Bridge

  Precise mode segmentation results: Nanjing Yangtze River Bridge, is a, Bridge

  Search engine mode segmentation results: Nanjing, Beijing City, the mayor, the Yangtze River Bridge, Nanjing, Nanjing Yangtze River Bridge, is a, bridge

  The main change is that full mode and search engine mode simply add the new word to their output, while in precise mode the new word replaces the words that are its "subsets" (my tentative name for them). This is partly consistent with the assumption above.
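  When there are many words to add, calling add_word() one by one gets tedious. jieba can also load them from a user dictionary file with jieba.load_userdict(): a UTF-8 text file with one word per line, optionally followed by a frequency and a part-of-speech tag, separated by spaces. Below is a minimal sketch of preparing and loading such a file. The helper names (format_entry, write_user_dict, segment_with_user_dict), the frequency 5, the tags and the two Chinese words are my own assumptions for illustration.

```python
import os
import tempfile


def format_entry(word, freq=None, tag=None):
    """Build one user-dictionary line: word, optional frequency, optional tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)


def write_user_dict(entries, path):
    """Write (word, freq, tag) tuples to a UTF-8 file, one entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            f.write(format_entry(word, freq, tag) + "\n")


def segment_with_user_dict(text, dict_path):
    """Load the user dictionary into jieba, then cut in precise mode."""
    import jieba  # imported here so the helpers above work without jieba
    jieba.load_userdict(dict_path)
    return jieba.lcut(text)


# Hypothetical entries: the words, the frequency and the tags are assumptions.
entries = [("工业和信息化部", None, "nt"), ("南京市长江大桥", 5, "ns")]
dict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
write_user_dict(entries, dict_path)

with open(dict_path, encoding="utf-8") as f:
    print(f.read())
```

  With jieba installed, segment_with_user_dict(some_chinese_text, dict_path) then segments with the extra words available. I have not listed an expected result, because it depends on the jieba version and its built-in dictionary.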

 

  That is roughly what I got done on the first day. I actually learned this material a few months ago and only sorted it out today. I hope I can keep this up.


Origin www.cnblogs.com/aLieb/p/11129774.html