[NLP] Using the OpenNLP Sentence Detector

Table of Contents

Sentence Detector

Model training

Sentence detection

Sentence Detector

The OpenNLP sentence detector decides whether a punctuation character marks the end of a sentence. In this sense, a sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The first and last sentences are exceptions to this rule: the first non-whitespace character is assumed to be the beginning of a sentence, and the last non-whitespace character is assumed to be the end of a sentence.
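To make the definition concrete, here is a minimal sketch that treats every '.', '!' or '?' as a sentence end and trims the whitespace around each piece. It is not OpenNLP's implementation; the class NaiveSentenceSplit and its split method are hypothetical names used only for illustration. It shows the candidate boundaries that a trained model then has to accept or reject.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSentenceSplit {

    // Hypothetical helper: every '.', '!' or '?' is treated as a sentence end.
    // A trained detector would instead decide for each mark whether it really ends a sentence.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                // Longest whitespace-trimmed sequence up to and including the mark
                String candidate = text.substring(start, i + 1).trim();
                if (!candidate.isEmpty()) {
                    sentences.add(candidate);
                }
                start = i + 1;
            }
        }
        // The last non-whitespace character is assumed to end the final sentence
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("  Mr. Smith arrived. He left early  ")) {
            System.out.println(s);
        }
    }
}
```

With this input the naive rule cuts after "Mr.", which is exactly the kind of punctuation mark a trained model is meant to reject as a sentence end.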

Sentence detection is usually done before the text is tokenized, but it is also possible to tokenize first and let the sentence detector process the already tokenized text. The OpenNLP sentence detector cannot identify sentence boundaries based on the content of a sentence. A prominent example is the first sentence of an article, where the title is mistakenly identified as the first part of the first sentence. Most components in OpenNLP expect input that is already segmented into sentences.
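One possible workaround for the title problem, assuming the title sits on its own line and no sentence spans a line break, is to split the text on line breaks first and run the detector on each block separately. The sketch below is only an illustration of that idea: TitleAwareSplit and detect are hypothetical names, and the detector is passed in as a plain Function so the example runs without a model. With OpenNLP, a method reference such as sentenceDetector::sentDetect could be passed instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TitleAwareSplit {

    // Splits on line breaks first, then hands each non-empty block to the detector.
    // 'detector' is a stand-in for any function that maps text to sentences.
    public static List<String> detect(String text, Function<String, String[]> detector) {
        List<String> result = new ArrayList<>();
        for (String block : text.split("\\R+")) {
            String trimmed = block.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            for (String sentence : detector.apply(trimmed)) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Trivial stand-in detector that returns each block unchanged
        Function<String, String[]> identity = block -> new String[] { block };
        String article = "Sentence detection with OpenNLP\n\nThe detector needs a model.";
        detect(article, identity).forEach(System.out::println);
    }
}
```

Because the title never reaches the detector together with the body text, it cannot be merged into the first sentence. The trade-off is that hard-wrapped text would be cut at every line break.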

The sentence detector takes a piece of text as input and outputs one sentence per line.

Model training

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceDetectorTrain {

    public static void main(String[] args) throws IOException {
        String rootDir = System.getProperty("user.dir") + File.separator;

        String fileResourcesDir = rootDir + "resources" + File.separator;
        String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;

        // Path to the training data
        String filePath = fileResourcesDir + "sentenceDetector.txt";
        // Path where the trained model is saved
        String modelPath = modelResourcesDir + "da-sent-my.bin";

        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File(filePath));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        SentenceDetectorFactory sentenceFactory = new SentenceDetectorFactory();

        SentenceModel model;
        // Read the training data line by line; closing the sample stream also closes the line stream
        try (ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream)) {
            model = SentenceDetectorME.train("en", sampleStream, sentenceFactory, TrainingParameters.defaultParams());
        }

        // Save the model; try-with-resources flushes and closes the buffered stream
        try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelPath))) {
            model.serialize(modelOut);
        }
    }
}
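The training program reads resources/sentenceDetector.txt, and the OpenNLP training format for the sentence detector is one sentence per line, with an empty line marking a document boundary. As a small sketch of what such a file looks like, the snippet below writes a tiny sample. The class WriteTrainingSample, its write method and the sample sentences are made up for illustration, and a usable model needs far more sentences than this.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class WriteTrainingSample {

    // Writes a tiny sample in the one-sentence-per-line format into the given directory.
    public static Path write(Path dir) throws IOException {
        List<String> lines = Arrays.asList(
                "The first document starts here.",
                "It has a second sentence.",
                "", // empty line: document boundary
                "A new document begins after the empty line.");
        Files.createDirectories(dir);
        return Files.write(dir.resolve("sentenceDetector.txt"), lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Same location the training program reads from
        Path out = write(Paths.get(System.getProperty("user.dir"), "resources"));
        System.out.println("Wrote " + out);
    }
}
```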

 

Sentence detection

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectorPredit {

    public static void main(String[] args) throws IOException {
        String rootDir = System.getProperty("user.dir") + File.separator;
        String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;
        // Model file to load (note: the training example saves its model as da-sent-my.bin)
        String modelPath = modelResourcesDir + "da-sent.bin";

        // Load the model; the input stream is closed once the model has been read
        SentenceModel model;
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            model = new SentenceModel(modelIn);
        }

        // Instantiate the detector from the model
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

        // Detect sentences
        String[] sentences = sentenceDetector.sentDetect("First sentence. Second sentence. ");
        for (String str : sentences) {
            System.out.println(str);
        }
    }
}

 

Origin blog.csdn.net/henku449141932/article/details/111190629