Using Stanford CoreNLP

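Stanford CoreNLP is a natural language processing toolkit from Stanford University that now supports several languages. It runs on Java, so Java must be installed on the machine. At the time of writing, the latest version is 3.9.1. I will skip installation and focus on the toolkit's architecture and basic usage.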

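1. Annotations and Annotators
These two kinds of classes form the basic architecture of CoreNLP. An Annotation is the data structure that carries all input and output in the toolkit, so our own text (usually a String) has to be wrapped in an Annotation before CoreNLP can use it. Annotators are the functional classes: each task, such as tokenization or sentence splitting, has its own annotator class, which takes an Annotation as input and writes its results back into it as further annotations.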
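To make the division of labour concrete, here is a toy model of the pattern in plain Java. It is only a sketch: ToyAnnotator, ToyTokenizer, ToyCounter and the string keys are names I made up for illustration, and none of them belong to CoreNLP. The point is that every annotator receives the same document object and adds its own result to it, and that a later annotator can rely on what an earlier one stored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the annotation/annotator pattern. None of these types belong to CoreNLP.
class ToyPipelineSketch {

    /** A stand-in for an annotator: reads the shared document and adds its own result. */
    interface ToyAnnotator {
        void annotate(Map<String, Object> document);
    }

    /** Splits the raw text on whitespace and stores the tokens under "tokens". */
    static final class ToyTokenizer implements ToyAnnotator {
        @Override
        public void annotate(Map<String, Object> document) {
            String text = (String) document.get("text");
            document.put("tokens", Arrays.asList(text.trim().split("\\s+")));
        }
    }

    /** Relies on "tokens" being present; stores the token count under "tokenCount". */
    static final class ToyCounter implements ToyAnnotator {
        @Override
        public void annotate(Map<String, Object> document) {
            List<?> tokens = (List<?>) document.get("tokens");
            document.put("tokenCount", tokens.size());
        }
    }

    /** Wraps the text in a document and runs every annotator in the given order. */
    static Map<String, Object> run(String text, List<ToyAnnotator> annotators) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("text", text);
        for (ToyAnnotator annotator : annotators) {
            annotator.annotate(document);
        }
        return document;
    }

    public static void main(String[] args) {
        List<ToyAnnotator> annotators = new ArrayList<>();
        annotators.add(new ToyTokenizer());
        annotators.add(new ToyCounter());
        Map<String, Object> document = run("Stanford CoreNLP is a toolkit", annotators);
        System.out.println(document.get("tokens"));
        System.out.println(document.get("tokenCount"));
    }
}
```

Swapping the two annotators in the list would make ToyCounter fail, because the tokens it needs would not be there yet. The order in which annotators run matters for the same reason in a real pipeline.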

2. Setting up the Annotators
When a text needs several kinds of processing, we first declare which annotators to use, like this:

// set up pipeline properties
Properties props = new Properties();
// set the list of annotators to run
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");

As the code shows, the annotators to run are specified through a Properties object. The available annotators are listed in the CoreNLP documentation; pick the ones your task requires.
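If the set of annotators changes from run to run, it can be convenient to build the comma-separated value from a list rather than typing it by hand. The following is a minimal sketch using only java.util; the class and method names (AnnotatorPropsSketch, buildProps) are my own and not part of CoreNLP.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of CoreNLP: builds the "annotators" property from a list.
class AnnotatorPropsSketch {

    /** Returns Properties whose "annotators" value is the given names joined by commas. */
    static Properties buildProps(List<String> annotators) {
        Properties props = new Properties();
        props.setProperty("annotators", String.join(",", annotators));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps(Arrays.asList("tokenize", "ssplit", "pos", "lemma"));
        // Prints the value that would be handed to the pipeline
        System.out.println(props.getProperty("annotators"));
    }
}
```

The resulting Properties object is an ordinary java.util.Properties, so it can be used wherever the hand-written one above is used.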

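3. Creating an Annotation
As noted above, the Annotation is CoreNLP's basic data structure for passing data around, so our text must be converted to an Annotation before processing. The conversion is simple: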

// read some text in the text variable
String text = "...";

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

The Annotation created here acts as a container for the whole document: each annotator that runs later stores its results in it.

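4. Processing with the StanfordCoreNLP class
With the properties set and the Annotation created, it is time to use the StanfordCoreNLP class. This class is the entry point to CoreNLP: all settings are passed to it, and it runs the processing. For example: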

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

pipeline.annotate(document);

StanfordCoreNLP is initialized with a Properties instance and then takes an Annotation as the argument to annotate(). Putting the snippets above together gives:

import edu.stanford.nlp.pipeline.*;
import java.util.*;

public class BasicPipelineExample {

    public static void main(String[] args) {

        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "...";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

    }

}

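As you can see, CoreNLP is fairly easy to use: with StanfordCoreNLP as the entry point, you set the properties, create an Annotation, and start processing. How do we get the results afterwards? Everything the pipeline produced can be read from the document variable.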

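5. Extracting the results
The results produced by the annotators are stored as annotations, mainly in two types: CoreMap and CoreLabel. A CoreMap is a map, that is, a set of key-value pairs. In the annotator list mentioned earlier, each annotator is shown together with the annotations it generates:
(Figure: the annotator table, with a GENERATED ANNOTATIONS column)
The classes in the GENERATED ANNOTATIONS column are the key types of the CoreMap, so extracting a result only requires knowing which annotation class the relevant annotator generates. Take the ssplit annotator as an example: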

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for(CoreMap sentence: sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);
  }

  // this is the parse tree of the current sentence
  Tree tree = sentence.get(TreeAnnotation.class);

  // this is the Stanford dependency graph of the current sentence
  SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph = 
  document.get(CorefChainAnnotation.class);

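SentencesAnnotation.class is the annotation class generated by the ssplit (sentence splitting) annotator, so passing it as the key returns the result of ssplit. A CoreLabel is a special kind of CoreMap with some extra token-specific methods (for example, direct conversion to a String); underneath it is still a map. From a sentence's CoreMap we can also get the parse tree and the dependency graph: in the code above, Tree is the parse tree and SemanticGraph is the dependency graph, and both can be printed directly: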

for (CoreMap sentence : sentences) {
   // this is the parse tree of the current sentence
   Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
   System.out.println("Parse tree:");
   System.out.println(tree.toString());

   // this is the Stanford dependency graph of the current sentence
   SemanticGraph dependencies = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
   System.out.println("Dependencies:");
   System.out.println(dependencies.toString());
}
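It may help to see why a call such as document.get(SentencesAnnotation.class) can return a List<CoreMap> without a cast. The sketch below imitates the idea with a tiny class-keyed map in plain Java. All names in it (ToyKey, ToyTextKey, ToyTokensKey, ToyTypedMap) are invented for illustration and this is not CoreNLP's implementation; it only shows the general technique of letting the key class declare the type of its value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of a map keyed by classes. None of these types belong to CoreNLP.
class TypedMapSketch {

    /** A key type; the type parameter V is the type of the value stored under the key. */
    interface ToyKey<V> { }

    // Hypothetical keys, one class per kind of result.
    static final class ToyTextKey implements ToyKey<String> { }
    static final class ToyTokensKey implements ToyKey<List<String>> { }

    /** A map from key classes to values; get() returns the type declared by the key. */
    static final class ToyTypedMap {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <V> void set(Class<? extends ToyKey<V>> key, V value) {
            values.put(key, value);
        }

        @SuppressWarnings("unchecked")
        <V> V get(Class<? extends ToyKey<V>> key) {
            // Safe as long as values are only stored through set()
            return (V) values.get(key);
        }
    }

    public static void main(String[] args) {
        ToyTypedMap sentence = new ToyTypedMap();
        sentence.set(ToyTextKey.class, "hello world");
        sentence.set(ToyTokensKey.class, Arrays.asList("hello", "world"));

        // No casts needed: the compiler infers the value type from the key class
        String text = sentence.get(ToyTextKey.class);
        List<String> tokens = sentence.get(ToyTokensKey.class);
        System.out.println(text);
        System.out.println(tokens);
    }
}
```

Asking for a key that was never set simply returns null here, which is worth keeping in mind when reading a result whose annotator was not included in the pipeline.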

Some annotators depend on others, so when setting the annotators property make sure every required annotator is included. The dependencies between annotators are listed in the CoreNLP documentation.
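A small check before building the pipeline can catch a missing dependency early. The sketch below is an assumption-laden illustration: the REQUIRES table is hand-written by me, covers only three annotators, and should be verified against the official dependency table; the class and method names are hypothetical and not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-flight check for the "annotators" property; not part of CoreNLP.
class AnnotatorOrderCheckSketch {

    // Illustrative, incomplete requirements; verify against the official dependency table.
    static final Map<String, List<String>> REQUIRES = new HashMap<>();
    static {
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
        REQUIRES.put("lemma", Arrays.asList("tokenize", "ssplit", "pos"));
        REQUIRES.put("ner", Arrays.asList("tokenize", "ssplit", "pos", "lemma"));
    }

    /** Returns one message per requirement that does not appear earlier in the list. */
    static List<String> findProblems(String annotatorsProperty) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String raw : annotatorsProperty.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            List<String> required = REQUIRES.getOrDefault(name, Collections.<String>emptyList());
            for (String dependency : required) {
                if (!seen.contains(dependency)) {
                    problems.add(name + " requires " + dependency);
                }
            }
            seen.add(name);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(findProblems("tokenize, ssplit, pos, lemma, ner"));
        System.out.println(findProblems("tokenize, ner"));
    }
}
```

The check only looks at annotators listed in the table, so an empty result means "no problem found among the known entries", not a guarantee that the pipeline will start.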
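The steps above are the complete procedure for English. CoreNLP also offers a simpler API; it is more intuitive but less flexible, so I do not cover it here. See the Simple API page in the CoreNLP documentation if you need it.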

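6. Processing Chinese
The procedure above targets English, but Chinese is handled the same way. Chinese requires an additional models jar, available at (http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar); add it to the project alongside the CoreNLP jars.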

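The code for Chinese is the same as well. The only difference is the configuration: instead of setting the annotators property by hand, we normally use the defaults in the StanfordCoreNLP-chinese.properties file inside the Chinese jar: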

# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?！？]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE

# parse
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz

# depparse
depparse.model    = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
depparse.language = chinese

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false

# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
kbp.language = zh
kbp.model = none

# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz

The default configuration is usually sufficient. It is used as follows:

import edu.stanford.nlp.pipeline.*;

public class nlp_Chinese_demo {
    public static void main(String[] args) {
        String props="StanfordCoreNLP-chinese.properties";
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document;
        // alternatively, load the text from a file
        //document = new Annotation(IOUtils.slurpFileNoExceptions(file));
        document = new Annotation("欢迎使用斯坦福大学自然语言处理工具包！");


        pipeline.annotate(document);
        pipeline.prettyPrint(document, System.out);
    }
}
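When the defaults are almost right but one entry needs changing (for example, running fewer annotators to save time), one possible approach is to load the properties into a java.util.Properties object, override the entry, and pass that object to the StanfordCoreNLP constructor that takes Properties, as in the English example. The sketch below shows only the load-and-override step with the standard library. To stay self-contained it parses a short inline string holding two lines copied from the default file; in a real project the text would come from the properties file itself. The class and method names are my own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of CoreNLP: loads properties text and overrides one entry.
class ChinesePropsOverrideSketch {

    /** Parses properties text, then replaces the "annotators" entry with the given value. */
    static Properties loadWithAnnotators(String propertiesText, String annotators) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked just in case
            throw new UncheckedIOException(e);
        }
        props.setProperty("annotators", annotators);
        return props;
    }

    public static void main(String[] args) {
        // Two lines taken from the default Chinese configuration, used here as sample input
        String defaults = String.join("\n",
                "annotators = tokenize, ssplit, pos, lemma, ner, parse, coref",
                "tokenize.language = zh");
        Properties props = loadWithAnnotators(defaults, "tokenize, ssplit, pos");
        System.out.println(props.getProperty("annotators"));
        System.out.println(props.getProperty("tokenize.language"));
    }
}
```

Entries that are not overridden, such as tokenize.language, keep their default values, which is the reason for loading the file first instead of building the Properties from scratch.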

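7. Closing remarks
I have only just started with natural language processing, so there are many concepts I do not yet fully understand. Corrections are welcome. Thank you!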

Reference: https://stanfordnlp.github.io/CoreNLP/api.html
