[Translation] Information Platforms and the Rise of the Data Scientist

Original title: Information Platforms and the Rise of the Data Scientist

Original author: Jeff Hammerbacher

/*

       About the author:

       Jeff Hammerbacher is a data scientist, co-founder of Cloudera, and former head of the Facebook Data team.

       Before co-founding Cloudera, he led the Data team at Facebook and was an Entrepreneur in Residence at Accel Partners.

       He holds a bachelor's degree in mathematics from Harvard University.

*/

Facebook's Information Platform and Business Intelligence

Our first attempt at an offline repository of information involved a Python script for farming queries out to Facebook’s tier of MySQL servers and a daemon process, written in C++, for processing our event logs in real time. When the scripts worked as planned, we collected about 10 gigabytes a day. I later learned that this aspect of our system is commonly termed the “ETL” process, for “Extract, Transform, and Load.”
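
To make this concrete, here is a minimal sketch of the "extract" half of such a script: fan a query out to a tier of MySQL shards and append the rows to a daily dump file. The shard names, table layout, credentials, and the pymysql client library are illustrative assumptions rather than the actual Facebook tooling.

    # Hedged sketch of a nightly extract step; every name here is illustrative.
    import csv
    import pymysql

    SHARDS = ["db%03d.example.com" % i for i in range(1, 4)]  # hypothetical shard tier
    QUERY = "SELECT user_id, action, created_at FROM activity WHERE created_at >= %s"

    def extract(day, outfile="activity_dump.csv"):
        with open(outfile, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["user_id", "action", "created_at"])
            for host in SHARDS:
                conn = pymysql.connect(host=host, user="etl", password="***", db="prod")
                try:
                    with conn.cursor() as cur:
                        cur.execute(QUERY, (day,))   # pull one day of activity from this shard
                        for row in cur.fetchall():
                            writer.writerow(row)     # the "transform" here is trivial
                finally:
                    conn.close()

    if __name__ == "__main__":
        extract("2006-01-01")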

Once our Python scripts and C++ daemon had siphoned the data from Facebook’s source systems, we stuffed the data into a MySQL database for offline querying. We also had some scripts and queries that ran over the data once it landed in MySQL to aggregate it into more useful representations. It turns out that this offline database for decision support is better known as a “Data Warehouse.”

Finally, we had a simple PHP script to pull data from the offline MySQL database and display summaries of the information we had collected to internal users. For the first time, we were able to answer some important questions about the impact of certain site features on user activity. Early analyses looked at maximizing growth through several channels: the layout of the default page for logged-out users, the source of invitations, and the design of the email contact importer. In addition to analyses, we started to build simple products using historical data, including an internal project to aggregate features of sponsored group members that proved popular with brand advertisers.

I didn’t realize it at the time, but with our ETL framework, Data Warehouse, and internal dashboard, we had built a simple “Business Intelligence” system.

A Business Intelligence System

In a 1958 paper in the IBM Systems Journal, Hans Peter Luhn describes a system for “selective dissemination” of documents to “action points” based on the “interest profiles” of the individual action points. The author demonstrates shocking prescience. The title of the paper is “A Business Intelligence System,” and it appears to be the first use of the term “Business Intelligence” in its modern context.

In addition to the dissemination of information in real time, the system was to allow for “information retrieval”—search—to be conducted over the entire document collection. Luhn’s emphasis on action points focuses the role of information processing on goal completion. In other words, it’s not enough to just collect and aggregate data; an organization must improve its capacity to complete critical tasks because of the insights gleaned from the data. He also proposes “reporters” to periodically sift the data and selectively move information to action points as needed.

The field of Business Intelligence has evolved over the five decades since Luhn’s paper was published, and the term has come to be more closely associated with the management of structured data. Today, a typical business intelligence system consists of an ETL framework pulling data on a regular basis from an array of data sources into a Data Warehouse, on top of which sits a Business Intelligence tool used by business analysts to generate reports for internal consumption. How did we go from Luhn’s vision to the current state of affairs?

E. F. Codd first proposed the relational model for data in 1970, and IBM had a working prototype of a relational database management system (RDBMS) by the mid-1970s. Building user-facing applications was greatly facilitated by the RDBMS, and by the early 1980s, their use was proliferating.

In 1983, Teradata sold the first relational database designed specifically for decision support to Wells Fargo. A few years later, in 1986, Ralph Kimball founded Red Brick Systems to build databases for the same market. Solutions were developed using Teradata and Red Brick’s offerings, but it was not until 1991 that the first canonical text on data warehousing was published.

Bill Inmon’s Building the Data Warehouse (Wiley) is a coherent treatise on data warehouse design and includes detailed recipes and best practices for building data warehouses. Inmon advocates constructing an enterprise data model after careful study of existing data sources and business goals.

In 1995, as Inmon’s book grew in popularity and data warehouses proliferated inside enterprise data centers, The Data Warehouse Institute (TDWI) was formed. TDWI holds conferences and seminars and remains a critical force in articulating and spreading knowledge about data warehousing. That same year, data warehousing gained currency in academic circles when Stanford University launched its WHIPS research initiative.

A challenge to the Inmon orthodoxy came in 1996 when Ralph Kimball published The Data Warehouse Toolkit (Wiley). Kimball advocated a different route to data warehouse nirvana, beginning by throwing out the enterprise data model. Instead, Kimball argued that different business units should build their own data “marts,” which could then be connected with a “bus.” Further, instead of using a normalized data model, Kimball advocated the use of dimensional modeling, in which the relational data model was manhandled a bit to fit the particular workload seen by many data warehouse implementations.

As data warehouses grow over time, it is often the case that business analysts would like to manipulate a small subset of data quickly. Often this subset of data is parameterized by a few “dimensions.” Building on these observations, the CUBE operator was introduced in 1997 by a group of Microsoft researchers, including Jim Gray. The new operator enabled fast querying of small, multidimensional data sets.
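
The idea is easy to see in miniature: CUBE produces one aggregate for every subset of the chosen dimensions, with the omitted dimensions rolled up. The toy Python below mimics GROUP BY CUBE(country, channel) over a handful of made-up rows.

    # Toy illustration of the CUBE operator; the data and dimensions are invented.
    from itertools import combinations
    from collections import defaultdict

    rows = [
        {"country": "US", "channel": "invite", "signups": 10},
        {"country": "US", "channel": "search", "signups": 4},
        {"country": "NO", "channel": "invite", "signups": 7},
    ]
    dims = ("country", "channel")

    cube = defaultdict(int)
    for r in rows:
        for k in range(len(dims) + 1):
            for subset in combinations(dims, k):
                # dimensions not in the subset are rolled up to "ALL"
                key = tuple(r[d] if d in subset else "ALL" for d in dims)
                cube[key] += r["signups"]

    for key, total in sorted(cube.items()):
        print(key, total)   # ('ALL', 'ALL') 21 is the grand total; ('US', 'ALL') 14 rolls up channel; etc.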

Both dimensional modeling and the CUBE operator were indications that, despite its success for building user-facing applications, the relational model might not be best for constructing an Information Platform. Further, the document and the action point, not the table, were at the core of Luhn’s proposal for a business intelligence system. On the other hand, an entire generation of engineers had significant expertise in building systems for relational data processing.

With a bit of history at our back, let’s return to the challenges at Facebook.

The Death and Rebirth of a Data Warehouse

At Facebook, we were constantly loading more data into, and running more queries over, our MySQL data warehouse. Having only run queries over the databases that served the live site, we were all surprised at how long a query could run in our data warehouse. After some discussion with seasoned data warehousing veterans, I realized that it was normal to have queries running for hours and sometimes days, due to query complexity, massive data volumes, or both.

One day, as our database was nearing a terabyte in size, the mysqld daemon process came to a sudden halt. After some time spent on diagnostics, we tried to restart the database. Upon initiating the restart operation, we went home for the day.

When I returned to work the next morning, the database was still recovering. To get a consistent view of data that’s being modified by many clients, a database server maintains a persistent list of all edits called the “redo log” or the “write-ahead log.” If the database server is unceremoniously killed and restarted, it will reread the recent edits from the redo log to get back up to speed. Given the size of our data warehouse, the MySQL database had quite a bit of recovery to catch up on. It was three days before we had a working data warehouse again.
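
The mechanism is simple to sketch, even though a real redo log such as InnoDB's is far more involved: every edit is made durable in the log before it is applied, and recovery replays the log to rebuild state, which is why our recovery time grew with the volume of recent writes. A toy version in Python:

    # Minimal write-ahead-log sketch; real database redo logs are much more involved.
    import json
    import os

    LOG = "redo.log"

    def apply_edit(state, edit):
        state[edit["key"]] = edit["value"]

    def write(state, key, value):
        edit = {"key": key, "value": value}
        with open(LOG, "a") as f:
            f.write(json.dumps(edit) + "\n")   # durable first...
            f.flush()
            os.fsync(f.fileno())
        apply_edit(state, edit)                # ...then applied in memory

    def recover():
        state = {}
        if os.path.exists(LOG):
            with open(LOG) as f:
                for line in f:                 # replaying a huge log is what took days
                    apply_edit(state, json.loads(line))
        return state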

We made the decision at that point to move our data warehouse to Oracle, whose database software had better support for managing large data sets. We also purchased some expensive high-density storage and a powerful Sun server to run the new data warehouse.

During the transfer of our processes from MySQL to Oracle, I came to appreciate the differences between supposedly standard relational database implementations. The bulk import and export facilities of each database used completely different mechanisms. Further, the dialect of SQL supported by each was different enough to force us to rewrite many of our queries. Even worse, the Python client library for Oracle was unofficial and a bit buggy, so we had to contact the developer directly.

After a few weeks of elbow grease, we had the scripts rewritten to work on the new Oracle platform. Our nightly processes were running without problems, and we were excited to try out some of the tools from the Oracle ecosystem. In particular, Oracle had an ETL tool called Oracle Warehouse Builder (OWB) that we hoped could replace our handwritten Python scripts. Unfortunately, the software did not expect the sheer number of data sources we had to support: at the time, Facebook had tens of thousands of MySQL databases from which we collected data each night. Not even Oracle could help us tackle our scaling challenges on the ETL side, but we were happy to have a running data warehouse with a few terabytes of data.

And then we turned on clickstream logging: our first full day sent 400 gigabytes of unstructured data rushing over the bow of our Oracle database. Once again, we cast a skeptical eye on our data warehouse.

Beyond the Data Warehouse

According to IDC, the digital universe will expand to 1,800 exabytes by 2011. The vast majority of that data will not be managed by relational databases. There’s an urgent need for data management systems that can extract information from unstructured data in concert with structured data, but there is little consensus on the way forward.

Natural language data in particular is abundant, rich with information, and poorly managed by a data warehouse. To manage natural language and other unstructured data, often captured in document repositories and voice recordings, organizations have looked beyond the offerings of data warehouse vendors to various new fields, including one known as enterprise search.

While most search companies built tools for navigating the collection of hyperlinked documents known as the World Wide Web, a few enterprise search companies chose to focus on managing internal document collections. Autonomy Corporation, founded in 1996 by Cambridge University researchers, leveraged Bayesian inference algorithms to facilitate the location of important documents.

Fast Search and Transfer (FAST) was founded in 1997 in Norway with more straightforward keyword search and ranking at the heart of its technology.

Two years later, Endeca was founded with a focus on navigating document collections using structured metadata, a technique known as “faceted search.” Google, seeing an opportunity to leverage its expertise in the search domain, introduced an enterprise search appliance in 2000.

In a few short years, enterprise search has grown into a multibillion-dollar market segment that is almost totally separate from the data warehouse market. Endeca has some tools for more traditional business intelligence, and some database vendors have worked to introduce text mining capabilities into their systems, but a complete, integrated solution for structured and unstructured enterprise data management remains unrealized.

Both enterprise search and data warehousing are technical solutions to the larger problem of leveraging the information resources of an organization to improve performance. As far back as 1944, MIT professor Kurt Lewin proposed “action research” as a framework that uses “a spiral of steps, each of which is composed of a circle of planning, action, and fact-finding about the result of the action.”

A more modern approach to the same problem can be found in Peter Senge’s “Learning Organization” concept, detailed in his book The Fifth Discipline (Broadway Business).

Both management theories rely heavily upon an organization’s ability to adapt its actions after reflecting upon information collected from previous actions. From this perspective, an Information Platform is the infrastructure required by a Learning Organization to ingest, process, and generate the information necessary for implementing the action research spiral.

Having now looked at structured and unstructured data management, let’s get back to the Facebook story.

On the first day of logging the Facebook clickstream, more than 400 gigabytes of data was collected. The load, index, and aggregation processes for this data set really taxed the Oracle data warehouse. Even after significant tuning, we were unable to aggregate a day of clickstream data in less than 24 hours. It was clear we’d need to aggregate our log files outside of the database and store only the summary information for later querying.
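
A sketch of that approach, with an assumed tab-separated log layout: scan the raw clickstream once, keep only per-day, per-page counts, and bulk-load the small summary table instead of the raw events.

    # Sketch of pre-aggregating clickstream logs outside the database.
    # The log field layout is an assumption made for the example.
    from collections import Counter

    def summarize(log_path):
        counts = Counter()
        with open(log_path) as f:
            for line in f:
                # assumed tab-separated fields: timestamp, user_id, page, ...
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 3:
                    continue
                day, page = parts[0][:10], parts[2]
                counts[(day, page)] += 1
        return counts

    def write_summary(counts, out_path="clickstream_summary.tsv"):
        with open(out_path, "w") as out:
            for (day, page), n in sorted(counts.items()):
                out.write("%s\t%s\t%d\n" % (day, page, n))  # small rows to bulk-load later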

Luckily, a top engineer from a large web property had recently joined our team and had experience processing clickstream data at web scale. In just a few weeks, he built a parallelized log processing system called Cheetah that was able to process a day of clickstream data in two hours. There was much rejoicing.

Despite our success, Cheetah had some drawbacks: first, after processing the clickstream data, the raw data was stored in archival storage and could not be queried again.

In addition, Cheetah pulled the clickstream data from a shared NetApp filer with limited read bandwidth. The “schema” for each logfile was embedded in the processing scripts rather than stored in a format that could be queried.

We did not collect progress information, and we scheduled Cheetah jobs using a basic Unix utility called cron, so no sophisticated load-sharing logic could be applied. Most importantly, however, Cheetah was not open source.

We had a small team and could not afford the resources required to develop, maintain, and train new users to use our proprietary system.

The Apache Hadoop project, started in late 2005 by Doug Cutting and Mike Cafarella, was a top candidate to replace Cheetah. Named after the stuffed elephant of Doug’s son, the Hadoop project aimed to implement Google’s distributed file system and MapReduce technologies under the Apache 2.0 license. Yahoo! hired Doug Cutting in January 2006 and devoted significant engineering resources to developing Hadoop.

In April 2006, the software was able to sort 1.9 terabytes in 47 hours using 188 servers. Although Hadoop’s design improved on Cheetah’s in several areas, the software was too slow for our needs at that time. By April 2008, however, Hadoop was able to sort 1 terabyte in 209 seconds using 910 servers. With the improved performance numbers in hand, I was able to convince our operations team to stick three 500-gigabyte SATA drives in the back of 60 unused web servers, and we went forward with our first Hadoop cluster at Facebook.

Initially, we started streaming a subset of our logs into both Hadoop and Cheetah. The enhanced programmability of Hadoop coupled with the ability to query the historical data led to some interesting projects. One application involved scoring all directed pairs of interacting users on Facebook to determine their affinity; this score could then be used for search and News Feed ranking. After some time, we migrated all Cheetah workflows to Hadoop and retired the old system. Later, the transactional database collection processes were moved to Hadoop as well.
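
A Hadoop Streaming-style sketch of that kind of job is shown below: the mapper emits one weighted record per directed pair of interacting users, and the reducer sums the weights into an affinity score. The log fields and per-interaction weights are assumptions made for the example; a script like this would be wired up with the standard hadoop-streaming.jar -mapper/-reducer options.

    # Hadoop Streaming-style sketch of a directed-pair affinity job; fields and weights are illustrative.
    import sys

    WEIGHTS = {"message": 3.0, "comment": 2.0, "poke": 0.5}  # illustrative weights

    def mapper():
        for line in sys.stdin:
            # assumed tab-separated fields: actor_id, target_id, interaction_type
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            actor, target, kind = parts[:3]
            print("%s:%s\t%f" % (actor, target, WEIGHTS.get(kind, 1.0)))

    def reducer():
        current, total = None, 0.0
        for line in sys.stdin:          # Hadoop delivers keys to the reducer in sorted order
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print("%s\t%f" % (current, total))
                current, total = key, 0.0
            total += float(value)
        if current is not None:
            print("%s\t%f" % (current, total))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()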

With Hadoop, our infrastructure was able to accommodate unstructured and structured data analysis at a massive scale. As the platform grew to hundreds of terabytes and thousands of jobs per day, we learned that new applications could be built and new questions could be answered simply because of the scale at which we were now able to store and retrieve data.

When Facebook opened registration to all users, the user population grew at disproportionately rapid rates in some countries. At the time, however, we were not able to perform granular analyses of clickstream data broken out by country. Once our Hadoop cluster was up, we were able to reconstruct how Facebook had grown rapidly in places such as Canada and Norway by loading all of our historical access logs into Hadoop and writing a few simple MapReduce jobs.

Every day, millions of semi-public conversations occur on the walls of Facebook users. One internal estimate put the size of the wall post corpus at 10 times the size of the blogosphere! Before Hadoop, however, the contents of those conversations remained inaccessible for data analysis.

In 2007, a summer intern with a strong interest in linguistics and statistics, Roddy Lindsay, joined the Data team. Using Hadoop, Roddy was able to single-handedly construct a powerful trend analysis system called Lexicon that continues to process terabytes of wall post data every night; you can see the results for yourself at https://facebook.com/lexicon.

Having the data from disparate systems stored in a single repository proved critical for the construction of a reputation scoring system for Facebook applications.

Soon after the launch of the Facebook Platform in May of 2007, our users were inundated with requests to add applications. We quickly realized that we would need a tool to separate the useful applications from those the users perceived as spam.

Using data collected from the API servers, user profiles, and activity data from the site itself, we were able to construct a model for scoring applications that allowed us to allocate invitations to the applications deemed most useful to users.

The Unreasonable Effectiveness of Data

In a recent paper, a trio of Google researchers distilled what they have learned from trying to solve some of machine learning’s most difficult challenges.

When discussing the problems of speech recognition and machine translation, they state that, “invariably, simple models and a lot of data trump more elaborate models based on less data.”

 I don’t intend to debate their findings; certainly there are domains where elaborate models are successful. Yet based on their experiences, there does exist a wide class of problems for which more data and simple models are better.

At Facebook, Hadoop was our tool for exploiting the unreasonable effectiveness of data. For example, when we were translating the site into other languages, we tried to target users who spoke a specific language to enlist their help in the translation task.

One of our Data Scientists, Cameron Marlow, crawled all of Wikipedia and built character trigram frequency counts per language. Using these frequency counts, he built a simple classifier that could look at a set of wall posts authored by a user and determine his spoken language.
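
A much-simplified version of that classifier is easy to write down: build per-language trigram counts from sample text (Wikipedia, in the real system) and score a wall post against each language's trigram distribution. The training snippets below are stand-ins, not the actual corpus.

    # Simplified character-trigram language classifier; training text is a stand-in for the real corpus.
    import math
    from collections import Counter

    def trigrams(text):
        text = " " + text.lower() + " "
        return [text[i:i + 3] for i in range(len(text) - 2)]

    def train(samples):
        # samples: {language: example text}
        return {lang: Counter(trigrams(text)) for lang, text in samples.items()}

    def classify(models, post):
        best_lang, best_score = None, float("-inf")
        for lang, counts in models.items():
            total = sum(counts.values())
            # add-one smoothing so unseen trigrams don't zero out the score
            score = sum(math.log((counts[t] + 1) / (total + 1)) for t in trigrams(post))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

    models = train({"en": "the quick brown fox jumps over the lazy dog",
                    "no": "den raske brune reven hopper over den late hunden"})
    print(classify(models, "over the lazy dog"))   # -> 'en'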

Using this classifier, we were able to actively recruit users into our translation program in a targeted fashion. Both Facebook and Google use natural language data in many applications; see Chapter 14 of this book for Peter Norvig’s exploration of the topic.

The observations from Google point to a third line of evolution for modern business intelligence systems: in addition to managing structured and unstructured data in a single system, they must scale to store enough data to enable the “simple models, lots of data” approach to machine learning.

New Tools and Applied Research

Most of the early users of the Hadoop cluster at Facebook were engineers with a taste for new technologies. To make the information accessible to a larger fraction of the organization, we built a framework for data warehousing on top of Hadoop called Hive.

Hive includes a SQL-like query language with facilities for embedding MapReduce logic, as well as table partitioning, sampling, and the ability to handle arbitrarily serialized data.
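
For illustration, the snippet below shows the flavor of HiveQL the text is describing, wrapped in a small Python helper that shells out to the hive command-line client. The table, partition column, and script names are invented; only the general features (date-partitioned tables, sampling, and TRANSFORM for embedding custom MapReduce scripts) are the point, and the statements are a best-effort sketch rather than tested queries.

    # Illustrative HiveQL run through the hive CLI; table and script names are assumptions.
    import subprocess

    HIVEQL = """
    -- daily clickstream rolled up from a date-partitioned table
    SELECT page, COUNT(1) AS views
    FROM clickstream
    WHERE ds = '2008-06-01'
    GROUP BY page;

    -- peek at a sample bucket instead of the full partition
    SELECT * FROM clickstream TABLESAMPLE(BUCKET 1 OUT OF 32) WHERE ds = '2008-06-01' LIMIT 100;

    -- embed a custom MapReduce script over the rows
    SELECT TRANSFORM(user_id, page) USING 'python score_page.py' AS (user_id, score)
    FROM clickstream WHERE ds = '2008-06-01';
    """

    def run_hive(script):
        subprocess.run(["hive", "-e", script], check=True)

    if __name__ == "__main__":
        run_hive(HIVEQL)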

The last feature was critical, as the data collected into Hadoop was constantly evolving in structure; allowing users to specify their own serialization format allowed us to pass the problem of specifying structure for the data to those responsible for loading the data into Hive.

In addition, a simple UI for constructing Hive queries, called HiPal, was built. Using the new tools, non-engineers from marketing, product management, sales, and customer service were able to author queries over terabytes of data.

After several months of internal use, Hive was contributed back to Hadoop as an official subproject under the Apache 2.0 license and continues to be actively developed.

In addition to Hive, we built a portal for sharing charts and graphs called Argus (inspired by IBM’s work on the Many Eyes project), a workflow management system called Databee, a framework for writing MapReduce scripts in Python called PyHive, and a storage system for serving structured data to end users called Cassandra (now available as open source in the Apache Incubator).

As the new systems stabilized, we ended up with multiple tiers of data managed by a single Hadoop cluster. All data from the enterprise, including application logs, transactional databases, and web crawls, was regularly collected in raw form into the Hadoop distributed file system (HDFS).

Thousands of nightly Databee processes would then transform some of this data into a structured form and place it into the directory of HDFS managed by Hive. Further aggregations were performed in Hive to generate reports served by Argus.

Additionally, within HDFS, individual engineers maintained “sandboxes” under their home directories against which prototype jobs could be run.

At its current capacity, the cluster holds nearly 2.5 PB of data, and new data is added at a rate of 15 TB per day. Over 3,000 MapReduce jobs are run every day, processing 55 terabytes of data. To accommodate the different priorities of jobs that are run on the cluster, we built a job scheduler to perform fair sharing of resources over multiple queues.
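
The scheduling idea can be sketched in a few lines: each queue is assigned a share of the cluster, and the next job is always launched from the queue that is furthest below its share. The queue names and shares below are made up, and the real scheduler handled far more (preemption, per-user limits, and so on).

    # Toy fair-share picker; queue names and shares are invented for the example.
    def pick_queue(queues):
        """queues: {name: {"share": float, "running": int, "pending": int}}"""
        candidates = {n: q for n, q in queues.items() if q["pending"] > 0}
        if not candidates:
            return None
        # lowest running/share ratio = most underserved queue
        return min(candidates, key=lambda n: candidates[n]["running"] / candidates[n]["share"])

    queues = {
        "reporting": {"share": 0.5, "running": 10, "pending": 3},
        "adhoc":     {"share": 0.3, "running": 2,  "pending": 8},
        "research":  {"share": 0.2, "running": 1,  "pending": 5},
    }
    print(pick_queue(queues))   # -> 'research' (1/0.2 = 5 vs 2/0.3 ≈ 6.7 vs 10/0.5 = 20)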

In addition to powering internal and external reports, a/b testing pipelines, and many different data-intensive products and services, Facebook’s Hadoop cluster enabled some interesting applied research projects.

One longitudinal study conducted by Data Scientists Itamar Rosenn and Cameron Marlow set out to determine what factors were most critical in predicting long-term user engagement.

We used our platform to select a sample of users, trim outliers, and generate a large number of features for use in several least-angle regressions against different measures of engagement. Some features we were able to generate using Hadoop included various measures of friend network density and user categories based on profile features.
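
As a rough illustration of that analysis pattern, the sketch below assembles a per-user feature matrix, trims outliers, and fits a least-angle regression against a synthetic engagement measure. The feature construction, synthetic data, and the use of scikit-learn's Lars estimator are illustrative; the original study predates that library.

    # Illustrative least-angle regression over synthetic engagement data.
    import numpy as np
    from sklearn.linear_model import Lars

    # hypothetical per-user features: friend-network density, profile completeness, ...
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    engagement = X @ np.array([0.8, 0.0, 0.3, 0.0, 0.1]) + rng.normal(scale=0.5, size=1000)

    # crude outlier trim on the response before fitting
    keep = np.abs(engagement - engagement.mean()) < 3 * engagement.std()
    model = Lars(n_nonzero_coefs=3).fit(X[keep], engagement[keep])

    print(model.coef_)   # which features the LARS path picked up first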

Another internal study to understand what motivates content contribution from new users was written up in the paper “Feed Me: Motivating Newcomer Contribution in Social Network Sites,” published at the 2009 CHI conference.

A more recent study from the Facebook Data team looks at how information flows through the Facebook social graph; the study is titled “Gesundheit! Modeling Contagion through Facebook News Feed,” and has been accepted for the 2009 ICWSM conference.

Every day, evidence is collected, hypotheses are tested, applications are built, and new insights are generated using the shared Information Platform at Facebook. Outside of Facebook, similar systems were being constructed in parallel.

MAD Skills and Cosmos

In “MAD Skills: New Analysis Practices for Big Data,” a paper from the 2009 VLDB conference, the analysis environment at Fox Interactive Media (FIM) is described in detail.

Using a combination of Hadoop and the Greenplum database system, the team at FIM has built a familiar platform for data processing in isolation from our work at Facebook.

The paper’s title refers to three tenets of the FIM platform: Magnetic, Agile, and Deep. “Magnetic” refers to the desire to store all data from the enterprise, not just the structured data that fits into the enterprise data model.

Along the same lines, an “Agile” platform should handle schema evolution gracefully, enabling analysts to work with data immediately and evolve the data model as needed. “Deep” refers to the practice of performing more complex statistical analyses over data.

In the FIM environment, data is separated into staging, production, reporting, and sandbox schemas within a single Greenplum database, quite similar to the multiple tiers inside of Hadoop at Facebook described earlier.

Separately, Microsoft has published details of its data management stack. In papers titled “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks” and “SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets,” Microsoft describes an information platform remarkably similar to the one we had built at Facebook.

 Its infrastructure includes a distributed file system called Cosmos and a system for parallel data processing called Dryad; it has even invented a SQL-like query language called SCOPE.

Three teams working with three separate technology stacks have evolved similar platforms for processing large amounts of data. What’s going on here? By decoupling the requirements of specifying structure from the ability to store data and innovating on APIs for data retrieval, the storage systems of large web properties are starting to look less like databases and more like data spaces.

Information Platforms as Dataspaces

Anecdotally, similar petabyte-scale platforms exist at companies such as Yahoo!, Quantcast, and Last.fm. These platforms are not quite data warehouses, as they’re frequently not using a relational database or any traditional data warehouse modeling techniques.

They’re not quite enterprise search systems, as only some of the data is indexed and they expose far richer APIs. And they’re often used for building products and services in addition to traditional data analysis workloads.

Similar to the brain and the library, these shared platforms for data processing serve as the locus of their organization’s efforts to ingest, process, and generate information, and with luck, they hasten their organization’s pace of learning from empirical data.

In the database community, there has been some work to transition the research agenda from purely relational data management to a more catholic system for storage and querying of large data sets called a “dataspace.”

In “From Databases to Dataspaces: A New Abstraction for Information Management” (http://www.eecs.berkeley.edu/~franklin/Papers/dataspaceSR.pdf), the authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system’s understanding of the data.

I’d contend that the Information Platforms we’ve described are real-world examples of dataspaces: single storage systems for managing petabytes of structured and unstructured data from all parts of an organization that expose a variety of data access APIs for engineering, analysis, and reporting.

Given the proliferation of these systems in industry, I’m hopeful that the database community continues to explore the theoretical foundations and practical implications of dataspaces.

An Information Platform is the critical infrastructure component for building a Learning Organization. The most critical human component for accelerating the learning process and making use of the Information Platform is taking the shape of a new role: the Data Scientist.

The Data Scientist

In a recent interview, Hal Varian, Google’s chief economist, highlighted the need for employees able to extract information from the Information Platforms described earlier. As Varian puts it, “find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.”

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team.

The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion.

 To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

In the financial services domain, large data stores of past market activity are built to serve as the proving ground for complex new models developed by the Data Scientists of their domain, known as Quants. Outside of industry, I’ve found that grad students in many scientific domains are playing the role of the Data Scientist.

One of our hires for the Facebook Data team came from a bioinformatics lab where he was building data pipelines and performing offline data analysis of a similar kind. The well-known Large Hadron Collider at CERN generates reams of data that are collected and pored over by graduate students looking for breakthroughs.

Recent books such as Davenport and Harris’s Competing on Analytics (Harvard Business School Press, 2007), Baker’s The Numerati (Houghton Mifflin Harcourt, 2008), and Ayres’s Super Crunchers (Bantam, 2008) have emphasized the critical role of the Data Scientist across industries in enabling an organization to improve over time based on the information it collects.

In conjunction with the research community’s investigation of dataspaces, further definition for the role of the Data Scientist is needed over the coming years. By better articulating the role, we’ll be able to construct training curricula, formulate promotion hierarchies, organize conferences, write books, and fill in all of the other trappings of a recognized profession.

In the process, the pool of available Data Scientists will expand to meet the growing need for expert pilots for the rapidly proliferating Information Platforms, further speeding the learning process across all organizations.

Conclusion

When faced with the challenge of building an Information Platform at Facebook, I found it helpful to look at how others had attempted to solve the same problem across time and problem domains.

As an engineer, my initial approach was directed by available technologies and appears myopic in hindsight. The biggest challenge was keeping focused on the larger problem of building the infrastructure and human components of a Learning Organization rather than specific technical systems, such as data warehouses or enterprise search systems.


I’m certain that the hardware and software employed to build an Information Platform will evolve rapidly, and the skills required of a Data Scientist will change at the same rate.

Staying focused on the goal of making the learning process move faster will benefit both organizations and science. The future belongs to the Data Scientist!

 
