A Language Modeling Approach to Predicting Reading Difficulty-paper

Volume: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004
Authors: Kevyn Collins-Thompson | James P Callan
Year: 2004
Venues: NAACL | HLT

Data (not publicly available):
550 English documents, 12 grade levels, 448,715 tokens, 17,928 types, drawn from a variety of topics

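1 Introduction
Formula-based methods ~ linear regression models
Our statistical model ~
1) Captures finer-grained features of each word ~ accuracy stays high even on short passages, including those with fewer than 10 words
2) A statistical method yields a probability distribution over grade levels, not just a single prediction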
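
To illustrate point 2, here is a minimal sketch of how per-grade scores can be turned into a distribution instead of a single label. The function name, the uniform prior over grades, and the log-likelihood values are all assumptions made for this example; they are not taken from the paper.

```python
import math

def posterior_over_grades(log_likelihoods):
    """Normalize per-grade log-likelihoods into a probability distribution.

    Assumes a uniform prior over grades (an assumption of this sketch).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_likelihoods.values())
    weights = {g: math.exp(ll - top) for g, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Hypothetical log-likelihoods of one passage under three grade models.
scores = {1: -42.0, 2: -40.5, 3: -41.0}
dist = posterior_over_grades(scores)

# The single prediction is just the mode of the distribution.
best_grade = max(dist, key=dist.get)
```

The full distribution shows how confident the model is: here grades 2 and 3 are close, which a single predicted label would hide.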

2 Description of Web Corpus
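A token is defined as any single occurrence of a word
A type is defined as a distinct word string, counted once no matter how many times it occurs
Data: 550 English documents, 12 grade levels, 448,715 tokens, 17,928 types, drawn from a variety of topics
Our hypothesis: even when texts differ in topic, patterns of word usage are clearly related to text difficulty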
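
The token/type distinction above can be sketched in a few lines. The lowercasing and the regular expression used to split words are assumptions of this example; these notes do not record how the paper tokenized its corpus.

```python
import re

def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text."""
    # Assumed tokenization: lowercase, then runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Every occurrence is a token; each distinct string is one type.
    return len(tokens), len(set(tokens))

n_tokens, n_types = count_tokens_and_types("The cat saw the other cat.")
```

In this sentence "the" and "cat" each occur twice, so there are 6 tokens but only 4 types.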

3 Related Work
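Earlier readability measures relied on two main factors:
1) the familiarity of semantic units (words or phrases)
2) the complexity of syntax
The most commonly used are 'vocabulary-based measures':
These use a word list to estimate semantic difficulty, rather than the number of syllables in a word. The following measures each estimate semantic difficulty with some kind of word list: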
the Lexile measure (Stenner et al., 1988)
the Revised Dale-Chall formula (Chall and Dale,1995)
the Fry Short Passage measure (Fry, 1990).
--Lexile (version 1.0) uses the Carroll-Davies-Richman corpus of 86,741 types (Carroll et al., 1971);
--Dale-Chall uses the Dale 3000 word list;
--Fry's Short Passage Measure uses Dale & O'Rourke's 'The Living Word Vocabulary' of 43,000 types (Dale and O'Rourke, 1981)
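
The core idea behind a word-list measure can be sketched as the share of words in a passage that fall outside a list of familiar words. This is only an illustration of the general idea, not the actual Lexile, Dale-Chall, or Fry computation; the tiny FAMILIAR set and the function name are hypothetical.

```python
import re

# Hypothetical stand-in for a real familiar-word list such as the Dale 3000.
FAMILIAR = {"the", "cat", "sat", "on", "a", "mat", "and", "saw", "dog"}

def unfamiliar_fraction(text, word_list=FAMILIAR):
    """Fraction of tokens in `text` that are not on the word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    unfamiliar = [t for t in tokens if t not in word_list]
    return len(unfamiliar) / len(tokens)

easy = unfamiliar_fraction("The cat sat on a mat.")
hard = unfamiliar_fraction("The feline reclined on a mat.")
```

A higher fraction suggests harder vocabulary. Real measures combine a feature like this with others inside a fitted formula.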

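Compared with Si and Callan (2001), the earliest and only previous work to use a language modeling approach:
2001: covered a single topic (science) with 3 difficulty levels and a Bayesian classifier; it did not analyze feature selection methods, so it is unclear whether the classifier conflated topic prediction with difficulty prediction
Ours: no topic restriction, 12 difficulty levels, and a larger training set. We also use a Bayesian classifier, but the classes are not treated as independent: we use a mixture of grade models, which greatly improves accuracy. We do not include sentence length as a syntactic component. We also test feature selection and how well the model generalizes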
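
A rough sketch of the kind of classifier described above: one unigram model per grade, scored Bayes-style, with each grade's word probabilities blended with those of its neighboring grades so that the classes are not independent. Add-one smoothing and the fixed neighbor_weight are stand-ins chosen for this example; these notes do not record the paper's actual smoothing or mixture weights, and the toy training texts are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def train(docs_by_grade):
    """Map {grade: [text, ...]} to {grade: Counter of word counts}."""
    return {g: Counter(t for d in docs for t in tokenize(d))
            for g, docs in docs_by_grade.items()}

def word_prob(word, counts, vocab_size):
    # Add-one (Laplace) smoothing, a stand-in for the paper's scheme.
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)

def mixed_prob(word, grade, models, vocab_size, neighbor_weight=0.25):
    # Blend a grade's estimate with its adjacent grades (hypothetical weights).
    p = word_prob(word, models[grade], vocab_size)
    neighbors = [g for g in (grade - 1, grade + 1) if g in models]
    if not neighbors:
        return p
    share = neighbor_weight / len(neighbors)
    blended = sum(share * word_prob(word, models[g], vocab_size)
                  for g in neighbors)
    return (1.0 - neighbor_weight) * p + blended

def predict(text, models):
    """Return (best grade, {grade: log-likelihood}) under a uniform prior."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    scores = {g: sum(math.log(mixed_prob(w, g, models, vocab_size))
                     for w in tokenize(text))
              for g in models}
    return max(scores, key=scores.get), scores

# Invented toy training data, three grades only.
models = train({
    1: ["the cat sat on the mat", "the dog ran"],
    2: ["the cat observed the garden quietly"],
    3: ["the committee evaluated the economic proposal"],
})
easy_grade, _ = predict("the dog sat", models)
hard_grade, _ = predict("the committee evaluated the proposal", models)
```

Because the model scores individual words rather than sentence-level statistics, it can still produce a prediction for passages of only a few words.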

4 The Smoothed Unigram Model
