[NLP] OpenNLP named entity recognition (NameFinder) use

table of Contents

 

Name Finder

Model training

Name recognition


Name Finder

The named finder can detect named entities and numbers in the text. To be able to detect entities, the named finder needs a model. The model depends on the language and entity type it is trained on. The OpenNLP project provides many pre-trained name lookup models, which have been trained on various free corpora. They can be downloaded on our model download page. To find the name in the original text, the text must be divided into tags and sentences.

By default, there is one sentence per line in the input training data, and the sentence is the word segmented by the tokenizer. Name entities are marked with Span; a blank line encountered in the input indicates the end of the document. The official recommendation is to train a model with at least 15,000 sentences. Such as:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .

Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

OpenNLP defines a default feature generation, which is used when no custom feature generation is specified. Users can customize the feature generator through API.

 

Model training

import java.io.BufferedOutputStream;

import java.io.File;

import java.io.FileOutputStream;

import java.io.IOException;

import java.io.OutputStream;

import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;

import opennlp.tools.namefind.NameSample;

import opennlp.tools.namefind.NameSampleDataStream;

import opennlp.tools.namefind.TokenNameFinderFactory;

import opennlp.tools.namefind.TokenNameFinderModel;

import opennlp.tools.util.InputStreamFactory;

import opennlp.tools.util.MarkableFileInputStreamFactory;

import opennlp.tools.util.ObjectStream;

import opennlp.tools.util.PlainTextByLineStream;

import opennlp.tools.util.TrainingParameters;

public class NameFinderTrain {

    public static void main(String[] args) throws IOException {

       // TODO Auto-generated method stub

       String rootDir = System.getProperty("user.dir") + File.separator;

       String fileResourcesDir = rootDir + "resources" + File.separator;

       String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;      

       //训练数据的路径

        String filePath = fileResourcesDir + "naneFinder.txt";

       //训练后模型的保存路径

        String modelPath = modelResourcesDir + "en-ner-person-my.bin";

         
       //按行读取数据

       InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File(filePath));

       ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);      

       //按行读取数据

        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

         TokenNameFinderFactory factory =new TokenNameFinderFactory();

       //训练模型

       TokenNameFinderModel model =NameFinderME.train("en","person", sampleStream,  TrainingParameters.defaultParams(),  factory);        

       //保存模型

       FileOutputStream fos=new FileOutputStream(new File(modelPath));

        OutputStream modelOut = new BufferedOutputStream(fos);

        model.serialize(modelOut);


//评估模型

        TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME(model));

        evaluator.evaluate(sampleStream);



        FMeasure result = evaluator.getFMeasure();

        System.out.println(result.toString());

    }

}

 

Name recognition

import java.io.File;

import java.io.FileInputStream;

import java.io.IOException;

import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;

import opennlp.tools.namefind.TokenNameFinderModel;

import opennlp.tools.util.Span;



public class NameFinderPredit {



    public static void main(String[] args) throws IOException {

       // TODO Auto-generated method stub

       String rootDir = System.getProperty("user.dir") + File.separator;

      

       String fileResourcesDir = rootDir + "resources" + File.separator;

       String modelResourcesDir = rootDir + "opennlpmodel" + File.separator;

      

       //String filePath = fileResourcesDir + "sentenceDetector.txt";

       String modelPath = modelResourcesDir + "en-ner-person.bin";



       InputStream modelIn = new FileInputStream(modelPath) ;

       //加载模型

       TokenNameFinderModel model = new TokenNameFinderModel(modelIn);

       //实例化模型

       NameFinderME nameFinder  = new NameFinderME(model);

      

       String tokens[] = new String[]{

               "Vinken",

               "is",

               "61",

               "years",

               "old",

               "Pierre",

               ".",

               "Pierre",

               };

       //命名检测

//Span 保存表示命名实体在tokens中的位置

       Span[] nameFinds= nameFinder.find(tokens);

      

       for(Span str:nameFinds){

       System.out.println("type:"+str.getType()+";Tostring:"+str.toString()+";length:"+nameFinds.length+"start:"+str.getStart()+";end:"+str.getEnd()+";name:"+tokens[str.getStart()]);

       }

    }

}

Output:

type:person;Tostring:[5..6) person;length:2start:5;end:6;name:Pierre

type:person;Tostring:[7..8) person;length:2start:7;end:8;name:Pierre

Guess you like

Origin blog.csdn.net/henku449141932/article/details/111319164