Chinese Word Segmentation with Stanford CoreNLP

This one comes from Stanford, so a moment of reverence first... Official site: https://stanfordnlp.github.io/CoreNLP/index.html

Stanford CoreNLP is available on Maven Central

So it can be pulled in directly as a Gradle dependency. For each language, you pick the matching model jar via its classifier. The models artifact is the foundation for all the other language models and handles English by default, so it must always be included. Since we want to process Chinese, we also need models-chinese.

However, the models and models-chinese jars are large and slow to download (readers confident in their bandwidth can ignore the "however"). So I downloaded them separately with Xunlei (迅雷) and pulled them in as local files.

// Apply the java plugin to add support for Java
apply plugin: 'java'

// In this section you declare where to find the dependencies of your project
repositories {
    // You can declare any Maven/Ivy/file repository here.
    maven {
        url "http://maven.aliyun.com/nexus/content/groups/public"
    }
    jcenter()
}

// In this section you declare the dependencies for your production and test code
dependencies {
    // https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp
    compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0'
    // Local model jars (downloaded manually); alternatively, comment these out
    // and use the classifier lines below to pull them from Maven instead
    compile files('lib/stanford-corenlp-3.8.0-models.jar')
    compile files('lib/stanford-chinese-corenlp-2017-06-09-models.jar')
    //compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0', classifier:'models'
    //compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0', classifier:'models-chinese'
    testCompile 'junit:junit:4.12'
}

Next, instantiate StanfordCoreNLP from a configuration and then process (annotate) the text. The models-chinese jar ships a configuration file for Chinese: StanfordCoreNLP-chinese.properties. Its name can be passed directly when constructing StanfordCoreNLP; here it is loaded through a Properties object instead, to make the configuration easier to modify.

For the full source, see the official demo StanfordCoreNlpDemo.java; the code below only adapts it for Chinese. Chinese processing needs a fair amount of memory, so set the JVM options: -Xms512M -Xmx4096M

    // Load the Chinese configuration from the classpath
    Properties props = new Properties();
    props.load(StanfordCoreNlpDemo.class.getClassLoader().getResourceAsStream("StanfordCoreNLP-chinese.properties"));
    //props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    //StanfordCoreNLP pipeline = new StanfordCoreNLP();
    // Initialize an Annotation with some text to be annotated. The text is the argument to the constructor.
    Annotation annotation;
    if (args.length > 0) {
      annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
    } else {
      annotation = new Annotation(" 循环经济是人类社会发展的必然选择,包装废弃物资源化是循环经济的要求。"
      		+ "包装废弃物资源化是一项系统工程,应从企业、区域和社会三个层面上进行,"
      		+ "因此,产生了三种包装废弃物资源化模式,即基于清洁生产、生态工业园区和基于社会层面的包装废弃物资源化模式。");
    }

    // run all the selected Annotators on this text
    pipeline.annotate(annotation);
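Once annotate returns, the segmentation result lives on the token annotations. Below is a minimal self-contained sketch of reading the words back (assuming stanford-corenlp 3.8.0 and both model jars from the Gradle setup above are on the classpath; the class name ChineseSegmentDemo is my own, not from the official demo):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.ArrayList;
import java.util.List;

public class ChineseSegmentDemo {
  // Segment a piece of Chinese text and return the words in order.
  public static List<String> segment(String text) {
    // The properties file bundled in models-chinese configures the
    // Chinese segmenter, POS tagger, parser, etc.
    StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);

    List<String> words = new ArrayList<>();
    // Sentences first, then tokens; CoreLabel.word() is the segmented word
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        words.add(token.word());
      }
    }
    return words;
  }

  public static void main(String[] args) {
    System.out.println(String.join(" | ", segment("循环经济是人类社会发展的必然选择。")));
  }
}
```

Segmentation never drops characters, so joining the returned words reproduces the input text with word boundaries inserted.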

StanfordCoreNLP is built from components (annotators), which are what we list in the annotators property. They have dependencies on one another: https://stanfordnlp.github.io/CoreNLP/dependencies.html

Some of the annotators are listed here; see https://stanfordnlp.github.io/CoreNLP/annotators.html for the full list.

  1. tokenize (Tokenization)
  2. ssplit (Sentence Splitting)
  3. pos (Part-of-Speech Tagging)
  4. lemma (Lemmatization)
  5. ner (Named Entity Recognition)
  6. parse (Constituency Parsing)
  7. depparse (Dependency Parsing)
  8. dcoref (Coreference Resolution)
  9. natlog (Natural Logic Polarity)
  10. openie (Open Information Extraction, https://nlp.stanford.edu/software/openie.html)
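The annotators line in a properties file must respect that dependency ordering; as a rough sketch (the bundled StanfordCoreNLP-chinese.properties already lists a suitable pipeline, so this is illustrative only):

```properties
# Each annotator may only appear after the ones it depends on, e.g.
# pos needs tokenize and ssplit; dcoref needs pos, lemma, ner and parse
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
```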

Reposted from my.oschina.net/amhuman/blog/1616224