Kuromoji is an open-source, lightweight Japanese morphological analyzer (word segmenter) written in Java. It was donated to the ASF and is built into Lucene and Solr as the default Japanese tokenizer (the default Chinese tokenizer is smartcn). It does not depend on Lucene or Solr, however, and can be used on its own. It segments text with the Viterbi algorithm and uses the IPA dictionary by default.
Other well-known tools include lucene-gosen (http://code.google.com/p/lucene-gosen/) and Rosette (http://www.basistech.jp/base-linguistics/japanese/), which is used by major search engines in Japan such as Google, Amazon, and Rakuten. Rosette is a commercial product and supports many languages, including Chinese, Japanese, Korean, and English.
Official site: http://www.atilika.org/
Version: kuromoji-0.7.7.jar
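If you build with Maven, the jar can be pulled in with a dependency along these lines. The group/artifact coordinates and the Atilika repository URL below are assumptions for this version; verify them against the official site:

```xml
<!-- Atilika's own repository (assumed URL) -->
<repositories>
  <repository>
    <id>atilika</id>
    <url>http://www.atilika.org/nexus/content/repositories/atilika</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.atilika.kuromoji</groupId>
    <artifactId>kuromoji</artifactId>
    <version>0.7.7</version>
  </dependency>
</dependencies>
```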
(1) Word segmentation in 2 lines of code
- Tokenizer tokenizer = Tokenizer.builder().build();
- List<Token> tokens = tokenizer.tokenize(word);
Inspecting each Token after segmentation:
- for (Token token : tokens) {
- System.out.println("==================================================");
- System.out.println("allFeatures : " + token.getAllFeatures());
- System.out.println("partOfSpeech : " + token.getPartOfSpeech());
- System.out.println("position : " + token.getPosition());
- System.out.println("reading : " + token.getReading());
- System.out.println("surfaceFrom : " + token.getSurfaceForm());
- System.out.println("allFeaturesArray : " + Arrays.asList(token.getAllFeaturesArray()));
- System.out.println("known word (in dictionary)? : " + token.isKnown());
- System.out.println("unknown word? : " + token.isUnknown());
- System.out.println("user-defined? : " + token.isUser());
- }
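getAllFeatures() returns the IPA dictionary features for a token as a single comma-separated string (part of speech, POS subcategories, conjugation type and form, base form, reading, pronunciation). A plain-Java sketch of pulling individual fields out of such a string; the sample feature string below is an assumption for illustration, in real code it comes from token.getAllFeatures():

```java
public class FeatureDemo {
    public static void main(String[] args) {
        // Sample feature string in IPA dictionary field order (assumed example):
        // POS, POS subcat 1-3, conjugation type, conjugation form,
        // base form, reading, pronunciation
        String allFeatures = "名詞,固有名詞,組織,*,*,*,日本経済新聞,ニホンケイザイシンブン,ニホンケーザイシンブン";

        String[] features = allFeatures.split(",");
        String partOfSpeech = features[0]; // 名詞 (noun)
        String baseForm = features[6];     // dictionary (base) form
        String reading = features[7];      // katakana reading

        System.out.println(partOfSpeech + " / " + baseForm + " / " + reading);
    }
}
```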
(2) Three word segmentation modes
- String word = "日本経済新聞でモバゲーの記事を読んだ。"; // "I read an article about Mobage in the Nihon Keizai Shimbun."
- Builder builder = Tokenizer.builder();
- // Normal
- Tokenizer normal = builder.build();
- List<Token> tokensNormal = normal.tokenize(word);
- disp(tokensNormal);
- // Search
- builder.mode(Mode.SEARCH);
- Tokenizer search = builder.build();
- List<Token> tokensSearch = search.tokenize(word);
- disp(tokensSearch);
- // Extends
- builder.mode(Mode.EXTENDED);
- Tokenizer extended = builder.build();
- List<Token> tokensExtended = extended.tokenize(word);
- disp(tokensExtended);
Search mode: 日本 | 経済 | 新聞 | で | モバゲー | の | 記事 | を | 読ん | だ | 。 |
Extended mode: 日本 | 経済 | 新聞 | で | モ | バ | ゲ | ー | の | 記事 | を | 読ん | だ | 。 |
(3) Custom dictionary
- // use a custom user dictionary
- InputStream is = UserDictSample.class.getClassLoader().getResourceAsStream("resources/userdict_ja.txt");
- Builder builder = Tokenizer.builder();
- builder.userDictionary(is);
- Tokenizer userTokenizer = builder.build();
- String word = "稀勢の里寛"; // input for this example (inferred from the output below)
- List<Token> tokens2 = userTokenizer.tokenize(word);
- StringBuilder sb2 = new StringBuilder();
- for (Token token : tokens2) {
- sb2.append(token.getSurfaceForm() + " | ");
- }
- System.out.println(sb2.toString());
稀勢の里 | 寛 |
resources/userdict_ja.txt (format: surface form, segmentation, readings, part of speech):
稀勢の里寛,稀勢の里 寛,キセノサト ユタカ,カスタム人名
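Kuromoji's user dictionary format is plain CSV: surface form, space-separated segmentation, space-separated readings, and a part-of-speech tag, where the segmentation and readings must have the same number of pieces. A small plain-Java sketch (no kuromoji dependency) that checks whether an entry is well-formed; the helper name isValidEntry is my own for illustration:

```java
public class UserDictCheck {
    /** Returns true if the CSV line looks like a well-formed user-dictionary entry. */
    static boolean isValidEntry(String line) {
        String[] fields = line.split(",");
        if (fields.length != 4) return false;        // surface, segmentation, readings, POS
        String[] segments = fields[1].trim().split("\\s+");
        String[] readings = fields[2].trim().split("\\s+");
        return segments.length == readings.length;   // one reading per segment
    }

    public static void main(String[] args) {
        // well-formed: 2 segments, 2 readings
        System.out.println(isValidEntry("稀勢の里寛,稀勢の里 寛,キセノサト ユタカ,カスタム人名"));
        // malformed: 2 segments but only 1 reading
        System.out.println(isValidEntry("稀勢の里寛,稀勢の里 寛,キセノサト,カスタム人名"));
    }
}
```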
(4) Kanji to Katakana
- String word = "Tokyo License Bureau";
- Builder builder = Tokenizer.builder();
- builder.mode(Mode.NORMAL);
- Tokenizer tokenizer = builder.build();
- List<Token> tokens = tokenizer.tokenize(word);
- StringBuilder sb = new StringBuilder();
- for (Token token : tokens) {
- sb.append(token.getReading() + " | ");
- }
- System.out.println(sb.toString());
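getReading() returns readings in katakana. If hiragana is needed instead (for example to build furigana), the main katakana block can be mapped to hiragana with a fixed Unicode code-point offset. This post-processing sketch is not part of kuromoji itself:

```java
public class KanaUtil {
    /** Convert characters in the main katakana block (ァ..ヶ, U+30A1..U+30F6) to hiragana. */
    static String katakanaToHiragana(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c >= 'ァ' && c <= 'ヶ') {
                sb.append((char) (c - 0x60)); // hiragana block sits 0x60 below katakana
            } else {
                sb.append(c); // leave ー, punctuation, kanji, etc. unchanged
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(katakanaToHiragana("トウキョウ")); // とうきょう
    }
}
```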
Article from: http://rensanning.iteye.com/blog/2008575