word: a Java Chinese word segmentation component
Homepage: https://github.com/ysc/word
word is a Chinese word segmentation component implemented in Java. It provides a variety of dictionary-based segmentation algorithms and uses an n-gram model to disambiguate. It can accurately identify English words, numbers, and quantity expressions such as dates and times, and it can recognize unregistered words such as person names, place names, and organization names. Lucene, Solr, and ElasticSearch plugins are also provided.
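The "dictionary matching plus n-gram disambiguation" idea can be sketched with a toy bigram scorer. The counts and candidate splits below are made up for illustration and are not the component's actual model:

```java
import java.util.*;

// Toy bigram disambiguation: given several candidate segmentations,
// prefer the one whose adjacent word pairs are most frequent according
// to a (here, hand-made) bigram count table.
public class BigramDisambiguator {
    // Hand-made bigram counts standing in for statistics from a corpus.
    static final Map<String, Integer> BIGRAMS = Map.of(
            "development|platform", 9,
            "product|development", 7,
            "develop|ment", 0);

    // Score a segmentation by summing the counts of its adjacent word pairs.
    static int score(List<String> words) {
        int s = 0;
        for (int i = 0; i + 1 < words.size(); i++)
            s += BIGRAMS.getOrDefault(words.get(i) + "|" + words.get(i + 1), 0);
        return s;
    }

    // Pick the candidate segmentation with the highest bigram score.
    static List<String> best(List<List<String>> candidates) {
        return Collections.max(candidates, Comparator.comparingInt(BigramDisambiguator::score));
    }

    public static void main(String[] args) {
        List<String> a = List.of("product", "development", "platform");
        List<String> b = List.of("product", "develop", "ment", "platform");
        System.out.println(best(List.of(a, b))); // the higher-scoring split wins
    }
}
```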
How to use word segmentation:
1. Quick experience
Run the script demo-word.bat in the project root directory to quickly try out the segmentation effect.
Usage: command [text] [input] [output]
The optional values of command are: demo, text, file
demo
text Yang Shangchuan is the author of the APDPlat application-level product development platform
file d:/text.txt d:/word.txt
exit
2. Segment text
Remove stop words: List<Word> words = WordSegmenter.seg("Yang Shangchuan is the author of the APDPlat application-level product development platform");
Keep stop words: List<Word> words = WordSegmenter.segWithStopWords("Yang Shangchuan is the author of the APDPlat application-level product development platform");
System.out.println(words);
output:
Remove stop words: [Yang Shangchuan, apdplat, application level, product, development platform, author]
Keep stop words: [Yang Shangchuan, yes, apdplat, application level, product, development platform, of, author]
3. Segment a file
String input = "d:/text.txt";
String output = "d:/word.txt";
Remove stop words: WordSegmenter.seg(new File(input), new File(output));
Keep stop words: WordSegmenter.segWithStopWords(new File(input), new File(output));
4. Custom configuration file
The default configuration file is word.conf under the classpath, packaged inside word-x.x.jar.
The custom configuration file is word.local.conf under the classpath and must be provided by the user.
If a custom configuration item has the same name as a default one, the custom value overrides the default.
The configuration file encoding is UTF-8.
5. Custom user thesaurus
The user thesaurus consists of one or more dictionary files or folders; absolute or relative paths can be used.
Dictionary files are UTF-8 encoded text files in which each line represents one word.
The path can be specified through system properties or configuration files; multiple paths are separated by commas.
Dictionary files under the classpath need the prefix classpath: before the relative path.
There are three ways to specify the path:
Method 1: specify programmatically (highest priority):
WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");
DictionaryFactory.reload(); // reload the dictionary after changing the dictionary path
Method 2: specify via a Java virtual machine startup parameter (medium priority):
java -Ddic.path=classpath:dic.txt,d:/custom_dic
Method 3: specify in the configuration file (lowest priority):
Use the file word.local.conf under the classpath to specify the configuration information:
dic.path=classpath:dic.txt,d:/custom_dic
If not specified, the dic.txt dictionary file in the classpath is used by default.
6. Custom stop words
Usage is similar to the custom user thesaurus. The configuration item is:
stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic
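Putting the dictionary and stop-word items together, a minimal word.local.conf might look like this (the d:/ paths are illustrative):

```
# word.local.conf - overrides entries of word.conf on the classpath (UTF-8)
# Multiple paths are separated by commas; classpath entries use the classpath: prefix.
dic.path=classpath:dic.txt,d:/custom_dic
stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic
```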
7. Automatic detection of thesaurus changes
Changes to the custom user thesaurus and the custom stop-word thesaurus can be detected automatically.
This includes files and folders under the classpath, and absolute and relative paths outside the classpath, such as:
classpath:dic.txt,classpath:custom_dic_dir,d:/dic_more.txt,d:/DIC_DIR,D:/DIC2_DIR,my_dic_dir,my_dic_file.txt
classpath:stopwords.txt,classpath:custom_stopwords_dic_dir,d:/stopwords_more.txt,d:/STOPWORDS_DIR,d:/STOPWORDS2_DIR,stopwords_dir,remove.txt
8. Explicitly specify the word segmentation algorithm
When segmenting text, you can explicitly specify a particular segmentation algorithm, for example:
WordSegmenter.seg("APDPlat application-level product development platform", SegmentationAlgorithm.BidirectionalMaximumMatching);
The optional types of SegmentationAlgorithm are:
Forward Maximum Matching Algorithm: MaximumMatching
Reverse Maximum Matching Algorithm: ReverseMaximumMatching
Forward Minimum Matching Algorithm: MinimumMatching
Reverse Minimum Matching Algorithm: ReverseMinimumMatching
Bidirectional Maximum Matching Algorithm: BidirectionalMaximumMatching
Bidirectional Minimum Matching Algorithm: BidirectionalMinimumMatching
Bidirectional Maximum and Minimum Matching Algorithm: BidirectionalMaximumMinimumMatching
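To see what this family of algorithms does, here is a minimal self-contained sketch of forward maximum matching (the MaximumMatching strategy) over a toy dictionary; the component applies the same idea against its full dictionary:

```java
import java.util.*;

// Minimal forward maximum matching: at each position, take the longest
// dictionary word that matches; fall back to a single character.
public class ForwardMaximumMatching {
    static List<String> seg(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = text.substring(i, i + 1); // single-character fallback
            for (int j = end; j > i + 1; j--) {      // try the longest candidate first
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { match = cand; break; }
            }
            words.add(match);
            i += match.length();
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("ab", "abc", "bcd", "d");
        System.out.println(seg("abcd", dict, 3)); // prints [abc, d]
    }
}
```

Reverse maximum matching scans from the end of the text instead, and the bidirectional variants run both directions and keep the better result.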
9. Evaluation of segmentation effect
Run the script evaluation.bat in the project root directory to evaluate the segmentation effect.
The test text used for evaluation has 2,533,709 lines and a total of 28,374,490 characters.
The evaluation results are located in the target/evaluation directory:
corpus-text.txt is the manually annotated, pre-segmented text, with words separated by spaces
test-text.txt is the test text, produced by splitting corpus-text.txt into multiple lines at punctuation marks
standard-text.txt is the manually annotated text corresponding to the test text, serving as the standard for correct segmentation
result-text-***.txt, where *** is the name of a segmentation algorithm, is that algorithm's segmentation result
perfect-result-***.txt, where *** is the name of a segmentation algorithm, is the part of the segmentation result that exactly matches the manual annotation standard
wrong-result-***.txt, where *** is the name of a segmentation algorithm, is the part of the segmentation result that differs from the manual annotation standard
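The perfect-result/wrong-result split amounts to an exact-match comparison per line, which can be sketched like this (a toy scorer, not the actual logic of evaluation.bat):

```java
import java.util.*;

// Toy scorer: compare each segmented line with the standard line and
// report the fraction of lines that match exactly.
public class SegEvaluation {
    static double perfectRate(List<String> results, List<String> standard) {
        int perfect = 0;
        for (int i = 0; i < results.size(); i++)
            if (results.get(i).equals(standard.get(i))) perfect++;
        return results.isEmpty() ? 0 : (double) perfect / results.size();
    }

    public static void main(String[] args) {
        List<String> standard = List.of("a b c", "d e", "f g h");
        List<String> results  = List.of("a b c", "d e", "f gh");
        System.out.println(perfectRate(results, standard)); // 2 of 3 lines match
    }
}
```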
Lucene plugin:
1. Construct a word analyzer ChineseWordAnalyzer
Analyzer analyzer = new ChineseWordAnalyzer();
2. Use the word analyzer to segment text
TokenStream tokenStream = analyzer.tokenStream("text", "Yang Shangchuan is the author of the APDPlat application-level product development platform");
tokenStream.reset(); // required before incrementToken() in Lucene 4.x
while(tokenStream.incrementToken()){
    CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    System.out.println(charTermAttribute.toString()+" "+offsetAttribute.startOffset());
}
tokenStream.close();
3. Use word analyzer to create Lucene index
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);
4. Use the word analyzer to query the Lucene index
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(directory)); // commit or close the IndexWriter first so the documents are visible
QueryParser queryParser = new QueryParser(Version.LUCENE_47, "text", analyzer);
Query query = queryParser.parse("text:Yang Shangchuan");
TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);
Solr plugin:
1. Generate the binary jar of the word segmentation component
Execute mvn clean install to generate the component jar target/word-1.0.jar
2. Create the directory solr-4.7.1/example/solr/lib and copy the target/word-1.0.jar file into it
3. Configure the schema to specify the tokenizer
In the file solr-4.7.1/example/solr/collection1/conf/schema.xml, replace all occurrences of
<tokenizer class="solr.WhitespaceTokenizerFactory"/> and
<tokenizer class="solr.StandardTokenizerFactory"/>
with
<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
and remove all filter tags
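For reference, a complete fieldType entry using this tokenizer might look like the following (the fieldType name text_word is illustrative):

```xml
<fieldType name="text_word" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
  </analyzer>
</fieldType>
```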
4. If you need to use a specific word segmentation algorithm:
<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"/>
The optional values of segAlgorithm are:
Forward Maximum Matching Algorithm: MaximumMatching
Reverse Maximum Matching Algorithm: ReverseMaximumMatching
Forward Minimum Matching Algorithm: MinimumMatching
Reverse Minimum Matching Algorithm: ReverseMinimumMatching
Bidirectional maximum matching algorithm: BidirectionalMaximumMatching
Bidirectional minimum matching algorithm: BidirectionalMinimumMatching
Bidirectional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching
If not specified, the bidirectional maximum matching algorithm BidirectionalMaximumMatching is used by default.
5. If you need to specify a specific configuration file:
<tokenizer class=" org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"
conf="C:/solr-4.7.0/example/solr/nutch/conf/word.local.conf"/>
For the configurable content in the word.local.conf file, see the word.conf file in word-1.0.jar.
If not specified, use the default configuration file, the word.conf file located in word-1.0.jar
ElasticSearch plugin:
1. Execute the command: mvn clean install dependency:copy-dependencies
2. Create the directory elasticsearch-1.1.0/plugins/word
3. Copy the Chinese word segmentation library file target/word-1.0.jar and the dependent logging library files
target/dependency/slf4j-api-1.6.4.jar
target/dependency/logback-core-0.9.28.jar
target/dependency/logback-classic-0.9.28.jar
to the newly created word directory
4. Modify the file elasticsearch-1.1.0/config/elasticsearch.yml and add the following configuration:
index.analysis.analyzer.default.type : "word"
index.analysis.tokenizer.default.type : "word"
5. Start ElasticSearch and test the effect by visiting the following URL in a Chrome browser:
http://localhost:9200/_analyze?analyzer=word&text=Yang Shangchuan is the author of the APDPlat application-level product development platform
6. Custom configuration
Extract the configuration file word.conf from word-1.0.jar, rename it to word.local.conf, and put it in the elasticsearch-1.1.0/plugins/word directory
7. Specify the word segmentation algorithm
Modify the file elasticsearch-1.1.0/config/elasticsearch.yml and add the following configuration:
index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"
index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"
Here segAlgorithm can specify the following values:
Forward maximum matching algorithm: MaximumMatching
Reverse maximum matching algorithm: ReverseMaximumMatching
Forward minimum matching algorithm: MinimumMatching
Reverse minimum matching algorithm: ReverseMinimumMatching
Bidirectional maximum matching algorithm: BidirectionalMaximumMatching
Bidirectional minimum matching algorithm: BidirectionalMinimumMatching
Bidirectional maximum and minimum matching algorithm: BidirectionalMaximumMinimumMatching
If not specified, the bidirectional maximum matching algorithm is used by default: BidirectionalMaximumMatching
Word vectors:
The context-related words of each word are counted from a large-scale corpus, and the vector composed of these context words is used to represent the word.
The similarity of words can then be obtained by computing the similarity of their word vectors.
The similarity assumption rests on the premise that the more similar the contexts of two words are, the more similar the words themselves are.
Run the script demo-word-vector-corpus.bat in the project root directory to try the effect on the corpus bundled with the word project.
If you have your own text content, you can use the script demo-word-vector-file.bat to segment the text, build word vectors, and compute similarity.
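The context-counting idea above can be sketched as building sparse count vectors and comparing them with cosine similarity (a toy illustration, not the component's actual implementation):

```java
import java.util.*;

// Build context-count vectors from a toy corpus and compare two words
// by the cosine similarity of their vectors.
public class WordVectorDemo {
    // Count the words appearing within `window` positions of `target`.
    static Map<String, Integer> contextVector(List<String> corpus, String target, int window) {
        Map<String, Integer> v = new HashMap<>();
        for (int i = 0; i < corpus.size(); i++) {
            if (!corpus.get(i).equals(target)) continue;
            for (int j = Math.max(0, i - window); j <= Math.min(corpus.size() - 1, i + window); j++)
                if (j != i) v.merge(corpus.get(j), 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("the", "cat", "sat", "the", "dog", "sat");
        Map<String, Integer> cat = contextVector(corpus, "cat", 1);
        Map<String, Integer> dog = contextVector(corpus, "dog", 1);
        System.out.println(cosine(cat, dog)); // identical contexts give similarity 1
    }
}
```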