Reading Notes on "Mining Text Data": Chapter 1, An Introduction to Text Mining

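This is a thick English-language e-book on text mining. With a big English tome like this it is easy to forget things as you read, so I am writing these notes.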

1. An Introduction to Text Mining

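1.1 Introduction
Three questions about text mining:
a. What are the main algorithmic models? How do they differ from other kinds of data mining?
b. What tools and techniques are available? (Models are the abstract side; techniques are the concrete side.)
c. What are the key application areas?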

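Characteristics of text mining:
a. Text data is high-dimensional and sparse.
b. Text data can be analyzed at multiple levels, such as words, sentences, documents, and document collections.
Semantic representations of text are useful, e.g. those obtained through named entity recognition (NER).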
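
To make the "high-dimensional and sparse" point concrete, here is a minimal sketch that is not taken from the book. It assumes a made-up three-document corpus and naive whitespace tokenization, builds a bag-of-words representation, and measures how many entries of the equivalent dense document-term matrix would be zero.

```python
from collections import Counter

# Toy corpus; these documents are invented for illustration only.
docs = [
    "text mining finds patterns in text",
    "clustering groups similar documents",
    "topic models summarize large document collections",
]

# Naive whitespace tokenization (an assumption; real tokenizers do more).
tokenized = [doc.split() for doc in docs]

# Each distinct term is one dimension of the vector space.
vocab = sorted({tok for toks in tokenized for tok in toks})

# Sparse representation: keep only the non-zero term counts per document.
sparse_vectors = [Counter(toks) for toks in tokenized]

# Fraction of zero entries in the equivalent dense document-term matrix.
nonzero = sum(len(vec) for vec in sparse_vectors)
total = len(docs) * len(vocab)
sparsity = 1 - nonzero / total

print(f"dimensions: {len(vocab)}, non-zero: {nonzero}, sparsity: {sparsity:.2f}")
```

Even in this tiny corpus most entries are zero. With a realistic vocabulary of tens of thousands of terms, each document touches only a small fraction of the dimensions, which is why sparse structures are the usual choice.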

1.2 Algorithms
This section introduces the topics covered by text mining and the algorithms for each.
a. Information Extraction from Text Data:
   Information Extraction is one of the key problems of text mining, which serves as a starting
   point for many text mining algorithms.

b. Text Summarization:
   Another common function needed in many text mining applications is to summarize the text documents.

c. Unsupervised Learning Methods from Text Data:
   The two main unsupervised learning methods commonly used in the context of text data are clustering and topic modeling.

d. LSI and Dimensionality Reduction for Text Mining:
   Representing the underlying data in a compressed format for indexing and retrieval.
   In that respect it is somewhat similar to Text Summarization.

e. Supervised Learning Methods for Text Data:

f. Transfer Learning with Text Data:
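   Where it is useful: For example, labeled English documents are copious and easy to find. On the other hand, it is much
   harder to obtain labeled Chinese documents. English resources such as entity databases are so open that there is indeed a big opportunity to transfer them to Chinese.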

g. Probabilistic Techniques for Text Mining:

h. Mining Text Streams:
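   Text data arrives as a stream, much like an audio stream, and has to be processed continuously on-line; traditional off-line batch processing no longer applies.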
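
The difference between on-line and off-line processing can be sketched as follows. This is my own minimal illustration, not an algorithm from the book: `running_top_terms` and `toy_stream` are hypothetical names, and the generator only stands in for a live feed. Each document is seen once, the running term counts are updated, and the document itself is never stored.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def running_top_terms(stream: Iterable[str], k: int = 2) -> Iterator[List[Tuple[str, int]]]:
    """Yield the current top-k terms after each incoming document."""
    counts: Counter = Counter()  # the only state kept between documents
    for doc in stream:
        # Update the counts incrementally; the raw document is then dropped.
        counts.update(doc.lower().split())
        yield counts.most_common(k)


def toy_stream() -> Iterator[str]:
    # Hypothetical stand-in for a live feed (news wire, social media, ...).
    yield "storm warning issued"
    yield "storm hits coast"
    yield "coast storm damage"


snapshots = list(running_top_terms(toy_stream(), k=2))
for step, top in enumerate(snapshots, start=1):
    print(step, top)
```

A batch approach would need the whole collection before producing any result. Here a result is available after every document, at the cost of keeping one counter whose size grows with the vocabulary.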

i. Cross-Lingual Mining of Text Data:

j. Text Mining in Multimedia Networks:

k. Text Mining in Social Media:

l. Opinion Mining from Text Data:
   This is the most common application.

m. Text Mining from Biomedical Data:
   This is an application in a specialized domain.

1.3 Future Directions
a. Scalable and robust methods for natural language understanding:
   Many current NLP methods are hard to scale to multiple domains, and supervised learning demands too much training data.

b. Domain adaptation and transfer learning
   This, too, addresses the lack of training data in supervised learning.

c. Contextual analysis of text data:

d. Parallel text mining:
