Introduction to the Penn Treebank dataset + basic grammar of syntactic parsing + syntactic analysis basics + summary and download of commonly used public NLP datasets

Introduction to the Penn Treebank Dataset

The Penn Treebank (PTB) is a corpus commonly used in NLP. Penn Treebank is the name of the project that annotated the corpus. The annotations include [part-of-speech tagging] and [syntactic analysis].

  • Text source: Wall Street Journal articles from 1989
  • Corpus size: about 1 million words across 2,499 articles
  • Corpus price: $1,500 to $1,700

With NLTK, the corpus can be used for:

  1. tokenizing (word segmentation)
  2. tagging (part-of-speech tagging)
  3. chunking
  4. parsing (syntactic analysis)
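The four tasks above can be sketched with NLTK as follows. This is a minimal illustration, not a full pipeline: the sentence, its hand-written tags, the NP chunk grammar, and the bracketed tree are all made up for the example, and it only uses NLTK components that need no downloaded data (so no trained tagger or parser is involved).

```python
from nltk import RegexpParser, Tree
from nltk.tag import str2tuple
from nltk.tokenize import TreebankWordTokenizer

# 1. Tokenizing: split raw text using Penn Treebank tokenization conventions.
sentence = "The old board elected a new chairman."  # made-up example sentence
tokens = TreebankWordTokenizer().tokenize(sentence)

# 2. Tagging: read hand-written "word/TAG" annotations (not a trained tagger).
tagged_text = "The/DT old/JJ board/NN elected/VBD a/DT new/JJ chairman/NN ./."
tagged = [str2tuple(item) for item in tagged_text.split()]

# 3. Chunking: group tagged tokens into noun phrases with a toy grammar.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
chunk_tree = chunker.parse(tagged)
noun_phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in chunk_tree.subtrees(lambda t: t.label() == "NP")
]

# 4. Parsing: load a bracketed tree written by hand in Treebank style.
tree = Tree.fromstring(
    "(S (NP (DT The) (JJ old) (NN board))"
    " (VP (VBD elected) (NP (DT a) (JJ new) (NN chairman)))"
    " (. .))"
)

print(tokens)
print(tagged)
print(noun_phrases)
print(tree)
```

Here the tags and the tree are supplied by hand, which mirrors how a treebank is used: the annotations already exist, and the tools read them rather than predict them.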
"""
The treebank sample directory contains four kinds of files: raw, tagged, parsed, and combined. The four sample types are shown below:
"""
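As a rough illustration of how these views relate, the sketch below starts from one bracketed tree that carries both structure and part-of-speech tags and derives a tagged line and a raw line from it. It uses only the Python standard library. The sentence is made up, the helper names `parse_bracketed` and `tagged_words` are hypothetical, and the real sample files have their own layout details that this does not reproduce.

```python
import re


def parse_bracketed(text):
    """Parse a Treebank-style bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read_node()


def tagged_words(node):
    """Collect (word, tag) pairs from the nodes that directly dominate a word."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Made-up example of a tree with both syntactic structure and POS tags.
combined = (
    "(S (NP (DT The) (JJ old) (NN board))"
    " (VP (VBD elected) (NP (DT a) (JJ new) (NN chairman)))"
    " (. .))"
)

pairs = tagged_words(parse_bracketed(combined))
tagged_line = " ".join(f"{word}/{tag}" for word, tag in pairs)
raw_line = " ".join(word for word, _ in pairs)

print(tagged_line)
print(raw_line)
```

The point of the sketch is that the tree with tags is the richest view: the tagged and raw forms can be recovered from it, but not the other way around.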

Origin blog.csdn.net/weixin_42782150/article/details/127447013