面向机器学习的特征工程 三、文本数据: 展开、过滤和分块

来源:ApacheCN《面向机器学习的特征工程》翻译项目

译者:@kkejili

校对:@HeYun

如果让你来设计一个算法来分析以下段落,你会怎么做?

Emma knocked on the door. No answer. She knocked again and waited. There was a large maple tree next to the house. Emma looked up the tree and saw a giant raven perched at the treetop. Under the afternoon sun, the raven gleamed magnificently. Its beak was hard and pointed, its claws sharp and strong. It looked regal and imposing. It reigned the tree it stood on. The raven was looking straight at Emma with its beady black eyes. Emma felt slightly intimidated. She took a step back from the door and tentatively said, “hello?” 

该段包含很多信息。我们知道它谈到了到一个名叫Emma的人和一只乌鸦。这里有一座房子和一棵树,艾玛正想进屋,却看到了乌鸦。这只华丽的乌鸦注意到艾玛,她有点害怕,但正在尝试交流。

那么,这些信息的哪些部分是我们应该提取的显着特征?首先,提取主要角色艾玛和乌鸦的名字似乎是个好主意。接下来,注意房子,门和树的布置可能也很好。关于乌鸦的描述呢?Emma的行为呢,敲门,退后一步,打招呼呢?

本章介绍文本特征工程的基础知识。我们从词袋(bags of words)开始,这是基于字数统计的最简单的文本功能。一个非常相关的变换是 tf-idf,它本质上是一种特征缩放技术。它将被我在(下一篇)章节进行全面讨论。本章首先讨论文本特征提取,然后讨论如何过滤和清洗这些特征。

阅读全文

猜你喜欢

转载自blog.csdn.net/wizardforcel/article/details/80759144