Java Chinese word segmentation component - word

Keywords: Java Chinese word segmentation component, word


word homepage: https://github.com/ysc/word



word is a Chinese word segmentation component implemented in Java. It provides a variety of dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It can accurately recognize English, numbers, and quantities such as dates and times, and can identify unregistered words such as person names, place names, and organization names. Lucene, Solr, and ElasticSearch plugins are also provided.

How to use the word component:


  1. Quick experience
  Run the script demo-word.bat in the project root directory to quickly try out the segmentation effect
  Usage: command [text] [input] [output]
  The optional values of command are: demo, text, file
  demo
  text Yang Shangchuan is the author of the APDPlat application-level product development platform
  file d:/text.txt d:/word.txt
  exit

  2. Segment text
  Remove stop words: List<Word> words = WordSegmenter.seg("Yang Shangchuan is the author of the APDPlat application-level product development platform");
  Keep stop words: List<Word> words = WordSegmenter.segWithStopWords("Yang Shangchuan is the author of the APDPlat application-level product development platform");
  System.out.println(words);


  Output:
  Remove stop words: [Yang Shangchuan, apdplat, application level, product, development platform, author]
  Keep stop words: [Yang Shangchuan, yes, apdplat, application level, product, development platform, of, author]
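
  Putting the API calls above together, a minimal self-contained program might look like this (a sketch; the import paths assume the component's org.apdplat.word package layout and should be verified against the version in use):

  import java.util.List;
  import org.apdplat.word.WordSegmenter;
  import org.apdplat.word.segmentation.Word;

  public class SegDemo {
      public static void main(String[] args) {
          // segment and remove stop words
          List<Word> words = WordSegmenter.seg("Yang Shangchuan is the author of the APDPlat application-level product development platform");
          System.out.println(words);
          // segment and keep stop words
          System.out.println(WordSegmenter.segWithStopWords("Yang Shangchuan is the author of the APDPlat application-level product development platform"));
      }
  }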

  3. Segment a file
  String input = "d:/text.txt";
  String output = "d:/word.txt";
  Remove stop words: WordSegmenter.seg(new File(input), new File(output));
  Keep stop words: WordSegmenter.segWithStopWords(new File(input), new File(output));

  4. Custom configuration file
  The default configuration file is word.conf under the classpath, packaged in word-x.x.jar.
  The custom configuration file is word.local.conf under the classpath and needs to be provided by the user.
  If a custom configuration item has the same name as a default one, the custom value overrides the default.
  The configuration file encoding is UTF-8.

  5. Custom user dictionary
  The custom user dictionary consists of one or more folders or files; absolute or relative paths can be used.
  The user dictionary consists of multiple dictionary files, encoded in UTF-8.
  A dictionary file is a plain text file, with one word per line.
  The path can be specified through system properties or the configuration file; multiple paths are separated by commas.
  Dictionary files under the classpath need the prefix classpath: before the relative path, for example: classpath:dic.txt


  There are three ways to specify the path:
  Method 1: programmatically (highest priority):
  WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");
  DictionaryFactory.reload(); // reload the dictionary after changing the path
  Method 2: Java virtual machine startup parameter (medium priority):
  java -Ddic.path=classpath:dic.txt,d:/custom_dic
  Method 3: configuration file (low priority):
  use the file word.local.conf under the classpath to specify the configuration:
  dic.path=classpath:dic.txt,d:/custom_dic

  If no path is specified, the dic.txt dictionary file under the classpath is used by default.

  6. Custom stop words
  Usage is similar to the custom user dictionary. The configuration item is:
  stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic
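
  By analogy with the dictionary path in step 5, the stop-word path can presumably also be set programmatically (a sketch; whether DictionaryFactory.reload() also reloads stop words should be verified against the version in use):

  WordConfTools.set("stopwords.path", "classpath:stopwords.txt,d:/custom_stopwords_dic");
  DictionaryFactory.reload(); // assumption: reload also picks up the changed stop-word path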

  7. Automatic detection of dictionary changes
  Changes to the custom user dictionary and the custom stop-word dictionary are detected automatically.
  This covers files and folders under the classpath, as well as absolute and relative paths outside the classpath, for example:
  classpath:dic.txt, classpath:custom_dic_dir,
  d:/dic_more.txt, d:/DIC_DIR, D:/DIC2_DIR, my_dic_dir, my_dic_file.txt


  classpath:stopwords.txt, classpath:custom_stopwords_dic_dir,
  d:/stopwords_more.txt, d:/STOPWORDS_DIR, d:/STOPWORDS2_DIR, stopwords_dir, remove.txt


  8. Explicitly specify the word segmentation algorithm
  When segmenting text, you can explicitly specify a particular segmentation algorithm, for example:
  WordSegmenter.seg("APDPlat application-level product development platform", SegmentationAlgorithm.BidirectionalMaximumMatching);


  The optional types of SegmentationAlgorithm are:
  Forward Maximum Matching Algorithm: MaximumMatching
  Reverse Maximum Matching Algorithm: ReverseMaximumMatching
  Forward Minimum Matching Algorithm: MinimumMatching
  Reverse Minimum Matching Algorithm: ReverseMinimumMatching
  Bidirectional Maximum Matching Algorithm: BidirectionalMaximumMatching
  Bidirectional Minimum Matching Algorithm: BidirectionalMinimumMatching
  Bidirectional Maximum-Minimum Matching Algorithm: BidirectionalMaximumMinimumMatching
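
  To compare these algorithms on the same sentence, the enum values can be iterated; a minimal sketch, assuming SegmentationAlgorithm is a standard Java enum as the API above suggests:

  for (SegmentationAlgorithm algorithm : SegmentationAlgorithm.values()) {
      // segment the same text with each algorithm and print the result side by side
      System.out.println(algorithm + " : " + WordSegmenter.seg("APDPlat application-level product development platform", algorithm));
  }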

  9. Evaluate the word segmentation effect
  Run the script evaluation.bat in the project root directory to evaluate the segmentation effect.
  The test text used for evaluation has 2,533,709 lines and 28,374,490 characters in total.
  The evaluation results are located in the target/evaluation directory:
  corpus-text.txt is the manually annotated, pre-segmented text, with words separated by spaces
  test-text.txt is the test text, obtained by splitting corpus-text.txt into multiple lines at punctuation marks
  standard-text.txt is the manually annotated text corresponding to the test text, serving as the standard for correct segmentation
  result-text-***.txt (*** is the name of a segmentation algorithm) is the segmentation result
  perfect-result-***.txt (*** is the name of a segmentation algorithm) is the part of the result that exactly matches the manual annotation standard
  wrong-result-***.txt (*** is the name of a segmentation algorithm) is the part of the result that differs from the manual annotation standard


Lucene plugin:

  1. Construct a word analyzer ChineseWordAnalyzer
  Analyzer analyzer = new ChineseWordAnalyzer();

  2. Use the word analyzer to segment text
  TokenStream tokenStream = analyzer.tokenStream("text", "Yang Shangchuan is the author of the APDPlat application-level product development platform");
  tokenStream.reset(); // required in Lucene 4.x before the first incrementToken() call
  while(tokenStream.incrementToken()){
      CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
      OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
      System.out.println(charTermAttribute.toString()+" "+offsetAttribute.startOffset());
  }
  tokenStream.close();


  3. Use the word analyzer to create a Lucene index
  Directory directory = new RAMDirectory();
  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);
  IndexWriter indexWriter = new IndexWriter(directory, config);
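
  Before the index can be queried, documents must be added; a minimal sketch using the standard Lucene 4.x document API (the field name "text" is chosen here to match the query example in step 4):

  Document doc = new Document();
  doc.add(new TextField("text", "Yang Shangchuan is the author of the APDPlat application-level product development platform", Field.Store.YES));
  indexWriter.addDocument(doc);
  indexWriter.close(); // commit and close so the index is visible to readers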





  4. Use the word analyzer to query the Lucene index
  IndexReader indexReader = DirectoryReader.open(directory); // the IndexWriter must have committed or closed first
  IndexSearcher indexSearcher = new IndexSearcher(indexReader);
  QueryParser queryParser = new QueryParser(Version.LUCENE_47, "text", analyzer);
  Query query = queryParser.parse("text:Yang Shangchuan");
  TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);
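
  The matched documents can then be read back from the searcher; a minimal sketch using the standard Lucene 4.x API, assuming the "text" field was stored as in the indexing sketch above:

  for (ScoreDoc scoreDoc : docs.scoreDocs) {
      Document hit = indexSearcher.doc(scoreDoc.doc); // fetch the stored document for each hit
      System.out.println(scoreDoc.score + " : " + hit.get("text"));
  }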




Solr plugin:


  1. Generate the binary jar of the word segmentation component
  Execute mvn clean install to generate the component jar at target/word-1.0.jar


  2. Create the directory solr-4.7.1/example/solr/lib, and copy the target/word-1.0.jar file to the lib directory


  3. Configure the schema to specify the tokenizer
  In the file solr-4.7.1/example/solr/collection1/conf/schema.xml, replace all occurrences of
  <tokenizer class="solr.WhitespaceTokenizerFactory"/> and
  <tokenizer class="solr.StandardTokenizerFactory"/>
  with
  <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
  and remove all filter tags
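
  For orientation, a complete fieldType definition after the replacement might look like this (a hypothetical schema.xml fragment; adapt the fieldType name to the schema in use):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
    </analyzer>
  </fieldType>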

  4. If you need to use a specific word segmentation algorithm:
  <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"/>
  The optional values of segAlgorithm are:
  Forward Maximum Matching Algorithm: MaximumMatching
  Reverse Maximum Matching Algorithm: ReverseMaximumMatching
  Forward Minimum Matching Algorithm: MinimumMatching
  Reverse Minimum Matching Algorithm: ReverseMinimumMatching
  Bidirectional maximum matching algorithm: BidirectionalMaximumMatching
  Bidirectional minimum matching algorithm: BidirectionalMinimumMatching
  Bidirectional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching
  If not specified, the bidirectional maximum matching algorithm BidirectionalMaximumMatching is used by default.

  5. If you need to specify a specific configuration file:
  <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"
  conf="C:/solr-4.7.0/example/solr/nutch/conf/word.local.conf"/>
  For the configurable content in the word.local.conf file, see the word.conf file in word-1.0.jar.
  If not specified, the default configuration file word.conf inside word-1.0.jar is used.




ElasticSearch plugin:


  1. Execute the command: mvn clean install dependency:copy-dependencies


  2. Create a directory elasticsearch-1.1.0/plugins/word


  3. Copy the Chinese word segmentation library file target/word-1.0.jar and the dependent logging library files
  target/dependency/slf4j-api-1.6.4.jar
  target/dependency/logback-core-0.9.28.jar
  target/dependency/logback-classic-0.9.28.jar
  to the newly created word directory


  4. Modify the file elasticsearch-1.1.0/config/elasticsearch.yml and add the following configuration:
  index.analysis.analyzer.default.type : "word"
  index.analysis.tokenizer.default.type : "word"


  5. Start ElasticSearch and test the effect by visiting the following URL in a browser:
  http://localhost:9200/_analyze?analyzer=word&text=Yang Shangchuan is the author of the APDPlat application-level product development platform


  6. Custom configuration
  Extract the configuration file word.conf from word-1.0.jar, rename it to word.local.conf, and put it in the elasticsearch-1.1.0/plugins/word directory


  7. Specify the word segmentation algorithm
  Modify the file elasticsearch-1.1.0/config/elasticsearch.yml and add the following configuration:
  index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"
  index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"


  Here segAlgorithm can specify the following values:
  Forward maximum matching algorithm: MaximumMatching
  Reverse maximum matching algorithm: ReverseMaximumMatching
  Forward minimum matching algorithm: MinimumMatching
  Reverse minimum matching algorithm: ReverseMinimumMatching
  Bidirectional maximum matching algorithm: BidirectionalMaximumMatching
  Bidirectional minimum matching algorithm: BidirectionalMinimumMatching
  Bidirectional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching
  If not specified, the bidirectional maximum matching algorithm BidirectionalMaximumMatching is used by default.

Word vector:

  The context words of each word are counted from a large-scale corpus, and the vector composed of these context words is used to represent the word.
  The similarity between words can then be obtained by computing the similarity of their word vectors.
  The underlying assumption is that the more similar the contexts of two words are, the more similar the two words themselves are.
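
  For intuition, the vector comparison can be sketched as cosine similarity over sparse context-count vectors (an illustrative sketch, not the component's actual implementation):

  import java.util.Map;

  public class CosineSimilarity {
      // cosine similarity between two sparse context-count vectors,
      // where each map sends a context word to its co-occurrence count
      public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
          double dot = 0, normA = 0, normB = 0;
          for (Map.Entry<String, Integer> e : a.entrySet()) {
              Integer v = b.get(e.getKey());
              if (v != null) dot += e.getValue() * v; // shared context words contribute to the dot product
              normA += e.getValue() * e.getValue();
          }
          for (int v : b.values()) normB += (double) v * v;
          return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }
  }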

  Run the script demo-word-vector-corpus.bat in the project root directory to try the effect on the word project's own corpus.


  If you have your own text content, you can use the script demo-word-vector-file.bat to segment the text, build word vectors, and calculate similarity.
