Introduction to the Penn Treebank dataset + basic grammar of parsed syntax trees + syntactic analysis basics + summary and downloads of commonly used public NLP datasets
Introduction to the Penn Treebank Dataset
Penn Treebank (PTB) is a corpus commonly used in NLP. The name comes from the Penn Treebank project, which annotated the corpus. The annotations include [part-of-speech tagging] and [syntactic analysis].
- Text source: Wall Street Journal articles from 1989
- Corpus size: about 1M words, 2,499 articles in total
- Corpus price: $1,500 to $1,700
It is used with NLTK tools for the following tasks:
- tokenizing (word segmentation)
- tagging (part-of-speech tagging)
- chunking
- parsing (syntactic analysis)
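To make these four tasks concrete, here is a minimal sketch in plain Python (standard library only, not NLTK itself) that reads one tree written in Penn Treebank bracket style and recovers the tokens, the (word, tag) pairs, the noun-phrase chunks, and the full parse. The sentence in `SAMPLE` and the helper names `read_tree`, `tagged_words`, and `phrases` are made up for illustration and are not taken from the corpus or from any library.

```python
# Hypothetical sentence in Penn Treebank bracket style (not copied from the corpus).
SAMPLE = "(S (NP (DT The) (NN market)) (VP (VBD fell) (ADVP (RB sharply))) (. .))"


def read_tree(text):
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        node = [tokens[pos]]  # phrase label (S, NP, ...) or POS tag (DT, NN, ...)
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())      # nested constituent
            else:
                node.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip the closing ')'
        return node

    return parse()


def tagged_words(node):
    """Yield (word, tag) pairs from the nodes directly above the words, left to right."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        yield (children[0], label)
    else:
        for child in children:
            if isinstance(child, list):
                yield from tagged_words(child)


def phrases(node, label):
    """Yield the words of every constituent carrying the given label (e.g. NP chunks)."""
    if node[0] == label:
        yield [word for word, _ in tagged_words(node)]
    for child in node[1:]:
        if isinstance(child, list):
            yield from phrases(child, label)


tree = read_tree(SAMPLE)              # parsing: the full syntax tree
tagged = list(tagged_words(tree))     # tagging: (word, POS tag) pairs
words = [word for word, _ in tagged]  # tokenizing: the word sequence
chunks = list(phrases(tree, "NP"))    # chunking: noun-phrase groups

print(words)
print(tagged)
print(chunks)
```

In NLTK the same kind of bracketed string can be loaded with `nltk.Tree.fromstring`, so the hand-written reader above is only meant to show what the annotation contains.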
"""
The treebank sample directory contains four kinds of files: raw, tagged, parsed, and combined. Examples of the four types are shown below:
"""