NLP: Get familiar with open source NLP tools such as NLTK and HanLP, and search for, download, and explore corpora such as PKU, CoreNLP, LTP, MSR, AS, and CITYU.

Table of contents

1. NLTK

2. HanLP

3. PKU

4. CoreNLP

5. LTP

6. MSR


1. NLTK

        NLTK (Natural Language Toolkit) is an open source natural language processing library for Python. It provides a large amount of preprocessed text data and corpora, as well as commonly used text processing algorithms and NLP tools, such as word segmentation (tokenization), part-of-speech tagging, named entity recognition, and sentiment analysis. Here is an example of part-of-speech tagging using NLTK:

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# The tokenizer and tagger models must be downloaded once.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)  # split the sentence into word tokens
tags = pos_tag(tokens)            # attach a Penn Treebank tag to each token

print(tags)

# Output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
NLTK also bundles many corpora for study and research; a short access sketch follows the list. These corpora include:
  • Gutenberg Corpus: A selection of literary texts from the Project Gutenberg free e-book archive.
  • Brown Corpus: About one million words in 500 samples spanning different genres, used to study language variation in natural language processing.
  • Reuters Corpus: Contains 10,788 news documents in 90 topic categories.
  • Movie Reviews Corpus: Contains 2,000 movie review texts, each labeled positive or negative.
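The corpora above are fetched once with nltk.download() and then exposed as reader objects under nltk.corpus; a minimal access sketch (the corpus identifiers are the standard NLTK ones):

import nltk
from nltk.corpus import brown, movie_reviews

# Each corpus is downloaded once into the local NLTK data directory.
nltk.download('brown')
nltk.download('movie_reviews')

print(brown.categories()[:5])      # genre labels of the Brown Corpus
print(len(brown.words()))          # total token count
print(movie_reviews.categories())  # ['neg', 'pos']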

2. HanLP

        HanLP is an open source natural language processing toolkit for Chinese, originally developed and maintained by He Han (hankcs). It supports Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, keyword extraction, and other functions. The following is an example of Chinese word segmentation using HanLP through the pyhanlp Python wrapper:

from pyhanlp import *  # pyhanlp fetches the HanLP jar and data files on first import

text = "自然语言处理是一项重要的人工智能技术。"

# Build a segmenter with the user-defined dictionary disabled.
segmenter = HanLP.newSegment().enableCustomDictionary(False)
words = segmenter.seg(text)  # returns a list of Term objects (word + part of speech)

for word in words:
    print(word.word)

# Output: 自然语言 处理 是 一项 重要 的 人工智能 技术 。
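HanLP exposes higher-level utilities through the same entry point; for instance, TextRank-based keyword extraction via HanLP.extractKeyword, sketched below (the second argument is the number of keywords to return):

from pyhanlp import *

text = "自然语言处理是一项重要的人工智能技术。"

# Extract the top 3 keywords with HanLP's TextRank implementation.
keywords = HanLP.extractKeyword(text, 3)
print(keywords)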
Corpora commonly used with HanLP include:
  • Chinese News Corpus: Contains more than 3.5 million news texts.
  • Chinese Chat Corpus: Contains more than 5 million instant-messaging texts.
  • People's Daily Corpus: Contains People's Daily texts from 1964 to 2018.

3. PKU

        The PKU corpus is a widely used Chinese natural language processing corpus from Peking University, containing a large amount of text data annotated for tasks such as Chinese word segmentation, part-of-speech tagging, and named entity recognition. Peking University also publishes the pkuseg toolkit for Chinese word segmentation. The following is an example of Chinese word segmentation using pkuseg:
import pkuseg

text = "自然语言处理是一项重要的人工智能技术。"

seg = pkuseg.pkuseg()  # load the default pretrained model
words = seg.cut(text)  # returns a list of words

print(words)

# Output: ['自然语言', '处理', '是', '一项', '重要', '的', '人工智能', '技术', '。']
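pkuseg can also return part-of-speech tags together with the segments; a minimal sketch (postag=True switches cut() to (word, tag) pairs, and the tagging model is fetched on first use):

import pkuseg

text = "自然语言处理是一项重要的人工智能技术。"

# postag=True makes cut() return (word, tag) pairs instead of bare words.
seg = pkuseg.pkuseg(postag=True)
print(seg.cut(text))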
The PKU corpus includes:
  • PKU People's Daily Chinese Corpus: Contains People's Daily texts from 1998 to 2010, annotated with part-of-speech tags, named entities, etc.
  • News Corpus: Contains more than 10 million news texts covering a span of more than 20 years.

4. CoreNLP

        CoreNLP is an open source natural language processing toolkit developed by the Stanford Natural Language Processing Group. It supports multiple languages, including English, Chinese, and Arabic, and can perform tasks such as tokenization, part-of-speech tagging, named entity recognition, syntactic parsing, and sentiment analysis. The following is an example of English tokenization using CoreNLP through the pycorenlp client (a CoreNLP server must be running locally):
from pycorenlp import StanfordCoreNLP

# Assumes a CoreNLP server is already running, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
nlp = StanfordCoreNLP('http://localhost:9000')

text = "The quick brown fox jumps over the lazy dog."
# 'ssplit' is needed so the JSON output is grouped into sentences.
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit',
    'outputFormat': 'json'
})

tokens = [token['word'] for sentence in output['sentences'] for token in sentence['tokens']]

print(tokens)

# Output: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
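The same annotate() call covers richer pipelines by listing more annotators; a sketch of part-of-speech tagging against the same server (annotator names are the standard CoreNLP ones):

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')

output = nlp.annotate("The quick brown fox jumps over the lazy dog.", properties={
    'annotators': 'tokenize,ssplit,pos',
    'outputFormat': 'json'
})

# Pair each token with its Penn Treebank part-of-speech tag.
pairs = [(t['word'], t['pos']) for s in output['sentences'] for t in s['tokens']]
print(pairs)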
Corpora commonly used with CoreNLP include (a quick look at the Penn Treebank sample shipped with NLTK follows the list):
  • Penn Treebank Corpus: Contains various types of syntactically annotated English text, used for research on parsing and other natural language processing tasks.
  • OntoNotes Corpus: Contains text data in multiple languages, used to study tasks such as named entity recognition and semantic role labeling.
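NLTK ships a small sample of the Penn Treebank, which is the quickest way to inspect its annotation style; a minimal sketch:

import nltk
from nltk.corpus import treebank

nltk.download('treebank')  # fetch the sample once

print(treebank.words()[:10])        # raw tokens
print(treebank.tagged_words()[:5])  # (word, tag) pairs
print(treebank.parsed_sents()[0])   # first parse tree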

5. LTP

        LTP (Language Technology Platform) is an open source natural language processing toolkit for Chinese developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. It supports tasks such as Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling. The following is an example of Chinese word segmentation using the pyltp wrapper (a pretrained model file must be downloaded first):

from pyltp import Segmentor

segmentor = Segmentor()
# Load the pretrained segmentation model (cws.model from the LTP model package).
segmentor.load("/path/to/your/model")
text = "自然语言处理是一项重要的人工智能技术。"
words = segmentor.segment(text)

print(list(words))   # segment() returns a native vector; list() makes it printable
segmentor.release()  # free the model

# Output: ['自然语言', '处理', '是', '一项', '重要', '的', '人工智能', '技术', '。']
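Part-of-speech tagging follows the same load/run/release pattern with pyltp's Postagger; a minimal sketch (the model path is a placeholder for pos.model from the same LTP model package):

from pyltp import Postagger

postagger = Postagger()
postagger.load("/path/to/your/pos.model")  # placeholder path to the POS model

# Tag the words produced by the segmenter above.
words = ['自然语言', '处理', '是', '一项', '重要', '的', '人工智能', '技术', '。']
postags = postagger.postag(words)

print(list(zip(words, postags)))
postagger.release()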
Corpora commonly used with LTP include:
  • SIGHAN2005 Corpus: Contains various types of Chinese text data, used to study Chinese word segmentation, part-of-speech tagging, and other tasks.
  • CTB 5.1 Corpus (Penn Chinese Treebank 5.1): Syntactically annotated Chinese sentences, used for research on parsing and other natural language processing tasks.

6. MSR

        MSR usually refers to the Chinese word segmentation corpus released by Microsoft Research Asia as part of the SIGHAN 2005 Bakeoff, rather than a standalone toolkit: plain text files in which every word is separated by whitespace, widely used to train and evaluate Chinese word segmenters. There is no official msr Python package, so the sketch below simply reads the gold-segmented training file (the local path is an assumption about where the downloaded icwb2-data package sits):
# A minimal sketch, assuming the SIGHAN 2005 Bakeoff package (icwb2-data)
# has been downloaded and unpacked next to this script.
path = "icwb2-data/training/msr_training.utf8"

with open(path, encoding="utf-8") as f:
    first_line = f.readline().strip()

# Words in the gold file are separated by whitespace.
words = first_line.split()
print(words[:10])
MSR corpus resources include (a simple segmentation-evaluation sketch follows the list):
  • MSR Chinese Word Segmentation Corpus: Gold-segmented Chinese sentences used to study Chinese word segmentation and related tasks.
  • MSRA Entity Recognition Corpus: Contains a large amount of entity-annotated data for research on tasks such as named entity recognition.
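Because the MSR corpus is mainly used for scoring segmenters, here is a small, generic precision/recall/F1 sketch comparing a predicted segmentation against the gold one (plain Python, no corpus-specific API assumed):

def to_spans(words):
    # Convert a word list into (start, end) character spans.
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

gold = ['自然语言', '处理', '是', '一项', '重要', '的', '人工智能', '技术', '。']
pred = ['自然', '语言', '处理', '是', '一项', '重要', '的', '人工智能', '技术', '。']

g, p = to_spans(gold), to_spans(pred)
correct = len(g & p)
precision = correct / len(p)
recall = correct / len(g)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")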
