Kuromoji: A Japanese Tokenizer for Java

Kuromoji is an open source, lightweight Japanese word segmentation (morphological analysis) toolkit written in Java. It was donated to the ASF and is built into Lucene and Solr as the default Japanese tokenizer (the default Chinese tokenizer is smartcn). It does not depend on Lucene or Solr, however, and can be used on its own. It uses the Viterbi algorithm and, by default, the IPA dictionary (IPADIC).
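
To give a feel for what Viterbi-based segmentation means, here is a minimal sketch of a lowest-cost path search over a toy dictionary. It is not Kuromoji's implementation: the class name, the romanized words, and all cost values are invented for illustration, and the sketch uses per-word costs only, whereas real analyzers of this kind also add a connection cost between adjacent tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ViterbiSketch {
    // Toy dictionary: word -> cost (lower is better). All values are invented.
    static final Map<String, Integer> DICT = new HashMap<>();
    static {
        DICT.put("nihon", 300);
        DICT.put("keizai", 300);
        DICT.put("shimbun", 300);
        DICT.put("nihonkeizaishimbun", 500);
        DICT.put("de", 100);
    }

    // Cost charged for a single character that is not in the dictionary.
    static final int UNKNOWN_COST = 1000;

    static List<String> segment(String text) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i]: lowest cost of segmenting text[0, i)
        int[] prev = new int[n + 1]; // prev[i]: start index of the last word on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue; // position not reachable
                }
                Integer cost = DICT.get(text.substring(start, end));
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown candidates are limited to one character
                    }
                    cost = UNKNOWN_COST;
                }
                int total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    prev[end] = start;
                }
            }
        }

        // Walk back from the end of the text along the cheapest path.
        List<String> words = new ArrayList<>();
        for (int i = n; i > 0; i = prev[i]) {
            words.add(text.substring(prev[i], i));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("nihonkeizaishimbunde")); // [nihonkeizaishimbun, de]
    }
}
```

With these toy costs the single compound entry (500) is cheaper than its three parts (900), so the compound wins. Making long entries more expensive would tip the search toward the split form instead.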

Other well-known options include lucene-gosen (http://code.google.com/p/lucene-gosen/) and Rosette (http://www.basistech.jp/base-linguistics/japanese/), which is used by major search sites in Japan such as Google, Amazon, and Rakuten. Rosette is a commercial product and supports many languages, including Chinese, Japanese, Korean, and English.

http://www.atilika.org/

Version: kuromoji-0.7.7.jar

(1) Word segmentation in two lines of code

    // Requires: java.util.List, org.atilika.kuromoji.Token, org.atilika.kuromoji.Tokenizer
    Tokenizer tokenizer = Tokenizer.builder().build();
    List<Token> tokens = tokenizer.tokenize(word); // word: the Japanese text to segment


Inspecting the tokens produced by segmentation:

    // Requires: java.util.Arrays
    for (Token token : tokens) {
        System.out.println("==================================================");
        System.out.println("allFeatures : " + token.getAllFeatures());
        System.out.println("partOfSpeech : " + token.getPartOfSpeech());
        System.out.println("position : " + token.getPosition());
        System.out.println("reading : " + token.getReading());
        System.out.println("surfaceForm : " + token.getSurfaceForm());
        System.out.println("allFeaturesArray : " + Arrays.asList(token.getAllFeaturesArray()));
        // isKnown(), isUnknown() and isUser() are instance methods, so call them on token
        System.out.println("in the dictionary? : " + token.isKnown());
        System.out.println("unknown word? : " + token.isUnknown());
        System.out.println("user-defined? : " + token.isUser());
    }



(2) Three word segmentation modes

    // Requires: org.atilika.kuromoji.Tokenizer.Builder, org.atilika.kuromoji.Tokenizer.Mode
    // The original post used a Japanese sentence here; it is shown in English translation.
    String word = "I read an article about Mobage in the Nihon Keizai Shimbun.";
    Builder builder = Tokenizer.builder();

    // Normal (the builder's default mode)
    Tokenizer normal = builder.build();
    List<Token> tokensNormal = normal.tokenize(word);
    disp(tokensNormal); // disp: helper (not shown in the post) that prints the tokens

    // Search
    builder.mode(Mode.SEARCH);
    Tokenizer search = builder.build();
    List<Token> tokensSearch = search.tokenize(word);
    disp(tokensSearch);

    // Extended
    builder.mode(Mode.EXTENDED);
    Tokenizer extended = builder.build();
    List<Token> tokensExtended = extended.tokenize(word);
    disp(tokensExtended);

 

Output (one line per mode: Normal, Search, Extended):
Nihon Keizai Shimbun | In | Mobage | | Articles |
Japan | Economy | Newspapers | In | Mobage | | Articles |
Japan | Economy | Newspapers | In | Mo | Ba | Ge | |
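
The `disp` helper called in the code above is not shown in the post. A minimal sketch of what it might look like follows. This is an assumption based on the " | " separated output, not the author's code. It is written generically, with the class name `DispSketch` and the `surface` parameter invented for illustration, so that it compiles without the Kuromoji jar.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class DispSketch {
    // Builds one output line: each token's surface form followed by " | ".
    static <T> String join(List<T> tokens, Function<T, String> surface) {
        StringBuilder sb = new StringBuilder();
        for (T token : tokens) {
            sb.append(surface.apply(token)).append(" | ");
        }
        return sb.toString();
    }

    // Prints the joined line to standard output.
    static <T> void disp(List<T> tokens, Function<T, String> surface) {
        System.out.println(join(tokens, surface));
    }

    public static void main(String[] args) {
        // Plain strings stand in for tokens here.
        disp(Arrays.asList("Japan", "Economy", "Newspapers"), s -> s);
    }
}
```

With Kuromoji on the classpath, the same helper could be called as `disp(tokensNormal, Token::getSurfaceForm)`.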



(3) Custom dictionary 

    // Use a custom dictionary loaded from the classpath
    InputStream is = UserDictSample.class.getClassLoader().getResourceAsStream("resources/userdict_ja.txt");

    Builder builder = Tokenizer.builder();
    builder.userDictionary(is);
    Tokenizer userTokenizer = builder.build();

    // word: the Japanese text to segment
    List<Token> tokens2 = userTokenizer.tokenize(word);

    StringBuilder sb2 = new StringBuilder();
    for (Token token : tokens2) {
        sb2.append(token.getSurfaceForm()).append(" | ");
    }
    System.out.println(sb2.toString());



Output:
Kisenosato | Kisenosato | Hiroshi | 
Kisenosato | Hiroshi |



resources/userdict_ja.txt: 

# Word, segmentation (words separated by spaces), reading, part of speech
Kisenosato Yutaka, Kisenosato Yutaka, custom personal name
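
The file layout described above (lines starting with `#` are comments, then four comma-separated fields, with the segmentation separated by spaces) can be illustrated with a small stand-alone parser. This is only a sketch of the format, not Kuromoji's own loader. The class and field names are invented, the sample entry is hypothetical and romanized, and it assumes the reading field is space-separated in the same way as the segmentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UserDictSketch {
    // One dictionary line: word, segmentation, reading, part of speech.
    static final class Entry {
        final String word;
        final List<String> segments;
        final List<String> readings;
        final String partOfSpeech;

        Entry(String word, List<String> segments, List<String> readings, String partOfSpeech) {
            this.word = word;
            this.segments = segments;
            this.readings = readings;
            this.partOfSpeech = partOfSpeech;
        }
    }

    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = line.split(",", -1);
            if (fields.length != 4) {
                throw new IllegalArgumentException("expected 4 fields: " + line);
            }
            entries.add(new Entry(fields[0].trim(), splitOnSpaces(fields[1]),
                    splitOnSpaces(fields[2]), fields[3].trim()));
        }
        return entries;
    }

    // Splits a field on runs of whitespace.
    static List<String> splitOnSpaces(String field) {
        return Arrays.asList(field.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Hypothetical entry, romanized for readability.
        List<Entry> entries = parse(Arrays.asList(
                "# word, segmentation, reading, part of speech",
                "kisenosatoyutaka,kisenosato yutaka,KISENOSATO YUTAKA,custom personal name"));
        Entry e = entries.get(0);
        System.out.println(e.word + " -> " + e.segments + " (" + e.partOfSpeech + ")");
    }
}
```

Each entry tells the tokenizer that the word in the first field should come out as the segments listed in the second field, which is why the custom tokenizer produces fewer, longer tokens for that name.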



(4) Converting kanji to katakana readings

    // The original post used a Japanese (kanji) string here; it is shown in English translation.
    String word = "Tokyo License Bureau";

    Builder builder = Tokenizer.builder();
    builder.mode(Mode.NORMAL);
    Tokenizer tokenizer = builder.build();
    List<Token> tokens = tokenizer.tokenize(word);

    StringBuilder sb = new StringBuilder();
    for (Token token : tokens) {
        // getReading() gives the token's reading in katakana
        sb.append(token.getReading()).append(" | ");
    }
    System.out.println(sb.toString());

 

Article from: http://rensanning.iteye.com/blog/2008575
