Training an OpenNLP language detection model on a corpus

The project needed some NLP technology. I first used NLTK, but its support for certain foreign languages is limited, and as a Python library it was inconvenient to integrate with the project. I later found OpenNLP, which has comparatively better support for some Asian languages, so I have been using it recently and wanted to take careful notes on training with OpenNLP. The motivation is that the project ran into text mixing Chinese and Japanese characters, and when the text to be detected is too short, the detection result easily goes wrong. Without further ado, let's get to the topic.
Let's start with the documentation. The official documentation is well organized: find the "Language Detector" section, then look under "Training". Following the documentation, we find that the corpus only needs to conform to the following format:

Notes:
1. The corpus is a plain-text file with one sample per line: the first column is the ISO-639-3 language code, the second is a tab character, and the third is the sample text.
2. For long texts, do not artificially insert line breaks.
3. The training corpus must contain samples for more than one language, otherwise training will fail with an error.
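To make the format concrete, here is a small sketch that writes a few lines in the required layout (ISO-639-3 code, a tab, then the sample text on a single line). The file name and the sample sentences are my own illustration, not from the OpenNLP documentation:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CorpusWriter {
    public static void main(String[] args) throws IOException {
        // One sample per line: ISO-639-3 code, a tab, then the text (no embedded newlines)
        List<String> lines = List.of(
            "cmn\t这是一个中文的例句。",
            "jpn\tこれは日本語の例文です。",
            "eng\tThis is an example sentence in English."
        );
        Files.write(Path.of("corpus.txt"), lines, StandardCharsets.UTF_8);
        System.out.println("wrote " + lines.size() + " samples");
    }
}
```

Note that the file is written as UTF-8, which is also the charset the training code below reads with.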

With the corpus file above, we can train a model for the languages we need with a few simple lines of code:

import java.io.File;
import java.nio.charset.StandardCharsets;
import opennlp.tools.langdetect.*;
import opennlp.tools.ml.perceptron.PerceptronTrainer;
import opennlp.tools.util.*;
import opennlp.tools.util.model.ModelUtil;

// Read the corpus file line by line as UTF-8 and parse each line into a sample
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("corpus.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
ObjectStream<LanguageSample> sampleStream = new LanguageDetectorSampleStream(lineStream);

// Use the perceptron algorithm with no feature cutoff
TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
params.put(TrainingParameters.ALGORITHM_PARAM, PerceptronTrainer.PERCEPTRON_VALUE);
params.put(TrainingParameters.CUTOFF_PARAM, 0);

LanguageDetectorFactory factory = new LanguageDetectorFactory();

// Train the model and serialize it to disk
LanguageDetectorModel model = LanguageDetectorME.train(sampleStream, params, factory);
model.serialize(new File("langdetect.bin"));

Finally, run it, and a langdetect.bin model file will be generated locally; later you can simply load it in your program.
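Loading and using the trained model can be sketched roughly as follows, based on the standard OpenNLP langdetect API; the input sentence is just an illustration, and it assumes langdetect.bin from the training step is present:

```java
import java.io.File;
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class DetectExample {
    public static void main(String[] args) throws Exception {
        // Load the serialized model produced by the training step
        LanguageDetectorModel model = new LanguageDetectorModel(new File("langdetect.bin"));
        LanguageDetector detector = new LanguageDetectorME(model);

        // predictLanguage returns the single best language with a confidence score
        Language best = detector.predictLanguage("これは日本語の例文です。");
        System.out.println(best.getLang() + " " + best.getConfidence());
    }
}
```

Keep in mind the caveat from the introduction: on very short inputs the confidence drops and the predicted language can easily be wrong, especially for text mixing Chinese and Japanese characters.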


Origin blog.51cto.com/biyusheng/2464894