Chinese word segmentation using Stanford CoreNLP

Produced by Stanford — respect where it's due. Official website: https://stanfordnlp.github.io/CoreNLP/index.html

Stanford CoreNLP is available on Maven Central

So you can declare the Gradle dependencies directly, selecting the model jar for each language via the classifier. The base models artifact is the foundation for the other language models: it handles English by default and must always be included. Since we want to process Chinese, we also need models-chinese.

However, the models and models-chinese jars are very large and slow to download (readers confident in their network speed can ignore this "however"). So I downloaded them with a download manager (Xunlei) instead and imported them as local files.

// Apply the java plugin to add support for Java
apply plugin: 'java'

// In this section you declare where to find the dependencies of your project
repositories {
    // You can declare any Maven/Ivy/file repository here.
    maven {
        url "http://maven.aliyun.com/nexus/content/groups/public"
    }
    jcenter()
}

// In this section you declare the dependencies for your production and test code
dependencies {
    // https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp
    compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0'
    compile files('lib/stanford-corenlp-3.8.0-models.jar')
    compile files('lib/stanford-chinese-corenlp-2017-06-09-models.jar')
    //compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0', classifier:'models'
    //compile group: 'edu.stanford.nlp', name: 'stanford-corenlp', version: '3.8.0', classifier:'models-chinese'
    testCompile 'junit:junit:4.12'
}

After that, instantiate StanfordCoreNLP with the configuration and process (annotate) the text. The models-chinese package ships a configuration file for Chinese: StanfordCoreNLP-chinese.properties. Its name can be passed directly to the StanfordCoreNLP constructor; here it is loaded through a Properties object instead, to make the configuration easier to modify.

For the full source code, see the official demo StanfordCoreNlpDemo.java; only a few modifications for Chinese processing are made here. Since Chinese processing requires a relatively large amount of memory, configure the JVM parameters: -Xms512M -Xmx4096M
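One way to pass these JVM flags when launching through Gradle is the application plugin; the setup below (plugin choice and main class name) is my assumption, not part of the original build file:

```groovy
// build.gradle: give the demo JVM a larger heap
apply plugin: 'application'

mainClassName = 'StanfordCoreNlpDemo'
applicationDefaultJvmArgs = ['-Xms512M', '-Xmx4096M']
```

With that in place, `gradle run` starts the demo with the larger heap.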

    // Add in sentiment
    Properties props = new Properties();
    props.load(StanfordCoreNlpDemo.class.getClassLoader().getResourceAsStream("StanfordCoreNLP-chinese.properties"));
    //props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    //StanfordCoreNLP pipeline = new StanfordCoreNLP();
    // Initialize an Annotation with some text to be annotated. The text is the argument to the constructor.
    Annotation annotation;
    if (args.length > 0) {
      annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
    } else {
      annotation = new Annotation(" 循环经济是人类社会发展的必然选择,包装废弃物资源化是循环经济的要求。"
      		+ "包装废弃物资源化是一项系统工程,应从企业、区域和社会三个层面上进行,"
      		+ "因此,产生了三种包装废弃物资源化模式,即基于清洁生产、生态工业园区和基于社会层面的包装废弃物资源化模式。");
    }

    // run all the selected Annotators on this text
    pipeline.annotate(annotation);
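Once annotate has run, the segmented words can be read back from the annotation's sentences and tokens. A minimal sketch, assuming the Chinese properties file is on the classpath as above; the class and helper names here are mine, not from the official demo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class ChineseSegmenterSketch {

    // Run the pipeline on the text and collect the segmented words.
    static List<String> segment(StanfordCoreNLP pipeline, String text) {
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        List<String> words = new ArrayList<>();
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                words.add(token.word());
            }
        }
        return words;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.load(ChineseSegmenterSketch.class.getClassLoader()
                .getResourceAsStream("StanfordCoreNLP-chinese.properties"));
        // tokenize + ssplit are enough for plain word segmentation
        props.setProperty("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(segment(pipeline, "循环经济是人类社会发展的必然选择。"));
    }
}
```

Restricting the annotators to tokenize and ssplit keeps startup fast when only segmentation is needed; the full annotator list from the properties file can be kept for downstream tasks such as POS tagging or NER.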

The annotators we configure in the properties are the individual components of StanfordCoreNLP, and there are dependencies between them: https://stanfordnlp.github.io/CoreNLP/dependencies.html

Here is a list of some annotators; see https://stanfordnlp.github.io/CoreNLP/annotators.html for details.

  1. tokenize (Tokenization, word segmentation)
  2. ssplit (Sentence Splitting)
  3. pos (Part-of-Speech Tagging)
  4. lemma (Lemmatization)
  5. ner (Named Entity Recognition)
  6. parse (Constituency Parsing)
  7. depparse (Dependency Parsing)
  8. dcoref (Coreference Resolution)
  9. natlog (Natural Logic Polarity)
  10. openie (Open Information Extraction, https://nlp.stanford.edu/software/openie.html)
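Because of these dependencies, the annotators property must list an annotator's prerequisites before the annotator itself. A sketch of what that means in practice (the exact prerequisite chain for each annotator is documented at the dependencies page above):

```properties
# depparse needs tokenization, sentence splitting and POS tags first,
# so a dependency-parsing pipeline is configured as:
annotators = tokenize, ssplit, pos, depparse
```

Listing depparse alone, without the three annotators before it, would fail at pipeline construction time.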
