Chinese Word Segmentation: Article Index and Data Resource Sharing

Yesterday, I shared an article contributed by Le Yuquan on the AINLP public account: "Those Things About Word Segmentation". Some readers commented that it left them wanting more. Thinking it over, I realized I have accumulated quite a few articles about Chinese word segmentation on my natural language processing blog. Apart from deep-learning-based segmentation, which I have not yet covered, the "classical" machine-learning-era approaches are all represented: from dictionary-based segmentation (the maximum matching method) to statistics-based methods (HMM, maximum entropy models, conditional random fields (CRF)), as well as Mecab and NLTK Chinese word segmentation. Looking back, some of these articles are about 10 years old and now seem a little immature, so they may not be suitable for reposting on the official account. Instead, here is an index; interested readers can find them on the blog, and most of them come with code for reference.
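To give a flavor of the dictionary-based approach mentioned above, here is a minimal sketch of forward maximum matching (FMM). The toy dictionary and sentence are my own illustration, not taken from the original articles:

```python
# A minimal forward maximum matching (FMM) segmenter: at each position,
# greedily match the longest dictionary word. The dictionary below is a
# toy example for illustration only.

def forward_max_match(text, dictionary, max_len=4):
    """Segment text by greedily matching the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found;
        # fall back to a single character if nothing matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Classic tricky example: "研究生命的起源" (studying the origin of life).
toy_dict = {"研究", "研究生", "生命", "起源", "的"}
print(forward_max_match("研究生命的起源", toy_dict))
# FMM greedily picks "研究生" first, illustrating the method's known weakness.
```

The greedy choice of "研究生" over "研究 / 生命" is exactly the kind of error that motivates the statistical methods indexed below.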

Introduction to Chinese Word Segmentation Series

  • Introduction to Chinese Word Segmentation: The Maximum Matching Method

  • Introduction to Chinese Word Segmentation: Maximum Matching Method Extension 1

  • Introduction to Chinese Word Segmentation: Maximum Matching Method Extension 2

  • Getting Started with Chinese Word Segmentation

  • Resources for Getting Started with Chinese Word Segmentation

  • Literature for Getting Started with Chinese Word Segmentation

  • Chinese word segmentation method based on character tagging

  • Introduction to Chinese Word Segmentation: Character Tagging 1

  • Introduction to Chinese Word Segmentation: Character Tagging 2

  • Introduction to Chinese Word Segmentation: Character Tagging 3

  • Introduction to Chinese Word Segmentation: Character Tagging 4

  • Introduction to Chinese word segmentation full text document

Two documents translated from Japanese by rickjin, very helpful:

  • Darts: Double-Array Trie System, translated documentation

  • Documentation of the Japanese tokenizer Mecab, translated


Articles on Chinese word segmentation contributed by other readers of the 52nlp blog; thanks to all of them:

  • Beginner report (1): Implementing a maximum matching word segmentation algorithm

  • Beginner report (2): Implementing a 1-gram word segmentation algorithm

  • Beginner report (3): Understanding the CRF Chinese word segmentation decoding process

  • Itenyh version - Chinese word segmentation with HMM 1: Preface

  • Itenyh version - Chinese word segmentation with HMM 2: Model preparation

  • Itenyh version - Chinese word segmentation with HMM 3: The cost of the forward algorithm and the Viterbi algorithm

  • Itenyh version - Chinese word segmentation with HMM 4: A pure-HMM segmenter

  • Itenyh version - Chinese word segmentation with HMM 5: A hybrid segmenter
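The HMM series above casts segmentation as character tagging with B/M/E/S states (word begin, middle, end, single) decoded by Viterbi. Here is a minimal sketch of that idea; the log probabilities below are hand-set toy values for illustration, not trained parameters, and the function name is my own:

```python
# States: B (word begin), M (middle), E (end), S (single-character word).
STATES = "BMES"

def viterbi_segment(text, start_p, trans_p, emit_p):
    """Viterbi decoding over B/M/E/S tags (log probs), cutting words at E/S."""
    # Initialize with start probabilities; unseen emissions get a low default.
    V = [{s: start_p[s] + emit_p[s].get(text[0], -10.0) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in text[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            # Pick the best previous state; missing transitions are near-forbidden.
            prob, prev = max(
                (V[-2][p] + trans_p[p].get(s, -100.0) + emit_p[s].get(ch, -10.0), p)
                for p in STATES
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    tags = path[max(STATES, key=lambda s: V[-1][s])]
    # Cut the text wherever a word ends (tag E or S).
    words, start = [], 0
    for i, t in enumerate(tags):
        if t in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])
    return words

# Toy parameters (log probabilities, invented for this example).
start_p = {"B": -0.7, "S": -0.7, "M": -100.0, "E": -100.0}
trans_p = {
    "B": {"M": -1.0, "E": -0.3},
    "M": {"M": -1.0, "E": -0.3},
    "E": {"B": -0.7, "S": -0.7},
    "S": {"B": -0.7, "S": -0.7},
}
emit_p = {
    "B": {"北": -0.5},
    "M": {},
    "E": {"京": -0.5},
    "S": {"我": -0.5, "爱": -0.5},
}
print(viterbi_segment("我爱北京", start_p, trans_p, emit_p))
# The decoded tags S S B E yield the segmentation 我 / 爱 / 北京.
```

A real segmenter would estimate `start_p`, `trans_p`, and `emit_p` from a tagged corpus; the articles in the series above walk through exactly that.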


Finally, a few more words on Chinese word segmentation data resources. Chinese word segmentation has been studied for a long time and there are many methods, but in my practical experience, good lexicon resources may matter even more. So here is a collection of segmentation-related resources, including the full-text PDF on the character-tagging approach to Chinese word segmentation, as well as lexicon resources shared by others on the web. Interested readers can follow AINLP and reply "fenci" to get them:




Origin blog.51cto.com/15060464/2678518