DataOps: A New Discipline and the Next Step in Data Governance

Preface

I started using Hadoop in 2008, in my first job at Ask.com, when the company's expensive Oracle cluster couldn't handle the ever-increasing analytics workload and we had to switch to Hadoop. In my second job, as a data engineer at Twitter, I was on the front line, participating in and driving how data enables and empowers almost all of Twitter's products. Since 2008 I have witnessed the power of data (which I prefer to call simply "data" rather than "big data") and how it has transformed the world. The transformation is anything but slight, as you can see from the articles about how Cambridge Analytica influenced the 2016 US election.

However, more than 10 years after the buzzword "big data" emerged, big data seems to be put to real use by only a limited number of companies. Almost all unicorns in Silicon Valley use big data extensively to drive their success. Companies in China like BAT have also mastered the art of big data, and we have decacorns like Toutiao built primarily on big data technologies. There are many jokes about how hard big data is to use, but it is a sad truth that, for most companies, big data is either still a buzzword or too hard to implement.

Luckily, a new discipline is on the rise, and it is the key to unlocking the power of data for ordinary companies. DataOps, whose name is an obvious echo of DevOps and whose role for data is similar to the role of DevOps for software development, is how data engineers want to simplify the use of data and make being data-driven a reality.

Today we will briefly introduce DataOps and explain why it is important for every company that wants to derive real value from its data.

What is DataOps

The definition of DataOps on Wikipedia is:

DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics.

The Wikipedia page for DataOps was only created in February 2017, which says a lot about how young this discipline is. The definition of DataOps will surely evolve over time, but its key goal is very clear: to improve the quality and reduce the cycle time of data analytics.

Simply put, DataOps makes data analytics easier. It won't make data analytics easy, since good analytics still needs a lot of work: a deep understanding of the relationship between the data and the business, good discipline in using the data, and a data-driven culture in the company. However, it will vastly improve the efficiency with which people use data and lower the barrier to using it. Companies can start using their data much faster, much earlier, and much better, with much less cost and risk.

Problems DataOps Solves

Most applications of big data can be categorized as either AI (artificial intelligence) or BI (business intelligence). AI here has a broader meaning than its canonical definition; it includes machine learning, data mining, and other technologies that derive previously unknown knowledge from data. BI applies mostly statistical methods to summarize massive data into simpler data points for a human to understand. Simply put, AI runs various algorithms on data to compute something new, while BI counts numbers that people can understand.
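
The AI/BI distinction above can be made concrete with a toy sketch in pure Python (the purchase data here is made up purely for illustration): BI summarizes raw events into human-readable counts, while AI, in the broad sense used here, derives something previously unknown, such as a trend fitted by least squares.

```python
from collections import Counter

# Raw event log: (day, product) purchase records -- illustrative data only.
events = [(1, "a"), (1, "b"), (2, "a"), (2, "a"), (3, "a"), (3, "b"), (3, "a")]

# BI: count purchases per product -- a simple summary a human can read.
per_product = Counter(product for _, product in events)

# "AI" in the broad sense: fit a linear trend of daily volume with
# ordinary least squares and forecast the next day's purchases.
daily = Counter(day for day, _ in events)
xs, ys = list(daily.keys()), list(daily.values())
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
forecast = slope * 4 + intercept  # predicted volume for day 4
```

Both halves are only a few lines; as the following sections argue, the difficulty lies not in this code but in everything around it.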

Writing an AI / ML program is not hard: you could set up a TensorFlow face recognition program in a few hours. Using Matlab to plot some data, or even Excel, is not hard either. The problem is that, to actually use the results in production to support user-facing products, or to decide your company's fate based on these magical results, you need a lot more than that manual work.
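
A TensorFlow face recognizer is out of scope here, but a toy stand-in makes the same point: a complete 1-nearest-neighbor classifier fits in a dozen lines of pure Python. Everything the rest of this article calls "plumbing" (ingestion, scheduling, monitoring, deployment) is absent, which is exactly the gap DataOps addresses.

```python
import math

def classify(train, point):
    """Return the label of the training sample closest to `point`."""
    _, label = min(
        (math.dist(features, point), label) for features, label in train
    )
    return label

# Tiny, made-up training set: 2-D feature vectors with labels.
train = [((0.0, 0.0), "cat"), ((1.0, 1.0), "dog"), ((0.9, 0.8), "dog")]
print(classify(train, (0.1, 0.2)))  # prints "cat"
```

The model code is the easy 5%; turning it into a reliable, monitored, production service is the other 95%.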

One survey by Dimensional Research found that the following problems are most difficult for companies that want to implement big data applications:

Ensuring data quality;
Keeping costs contained;
Satisfying business needs and expectations;
Quantifying the value of big data projects;
Difficulty finding people with big data expertise;
Fixing performance and configuration problems;
Selecting the right data framework;
Not having enough technical resources;
Maintaining operational reliability;
Big data projects taking longer than expected;
Having too many technologies or vendors to manage;
Opening up data access to more consumers;
Difficulty creating actionable information;
Difficulty with problem resolution and debugging.

Another study, by Google data analysts, found that for most machine learning projects, only 5% of the time is spent writing ML code. The other 95% is spent setting up the infrastructure needed to run that code.

From both studies, we can easily see that much of the hard work is not actually writing the code. Preparing the whole infrastructure and running the code efficiently in production is costly and risky.

In the Google study, they cited my former colleagues Jimmy Lin and Dmitriy Ryaboy from the Twitter Analytics team, who observed that much of this work can be described as "plumbing". DataOps, in effect, makes the plumbing much easier and more efficient.

DataOps Goals

DataOps aims to reduce the whole analytics cycle time. Therefore, from setting up the infrastructure to applying the results, it usually needs to achieve the following goals:

  • Deployment: including the basic infrastructure and the applications. Provisioning a new system environment should be fast and easy, regardless of the underlying hardware infrastructure, and deploying a new application should take seconds instead of hours or days;
  • Operation: scalability, availability, monitoring, recovery, and reliability of the system and the applications. Users should not have to worry about operations and can focus on business logic;
  • Governance: security, quality, and integrity of data, including auditing and access control. All data is managed in a cohesive and controlled manner in a secure environment that supports multi-tenancy;
  • Usability: users should be able to choose the tools they want to use on the data and easily run them as needed. Support for different analytics / ML / AI frameworks should be integrated into the system;
  • Production readiness: it should be easy to turn analytics programs into production pipelines with scheduling and data monitoring; building a production-ready data pipeline from data ingestion through analytics to data consumption should be easy and managed by the system.
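
One way to picture these goals is a declarative application spec: the user states *what* to run, and the platform owns deployment, operation, governance, and scheduling. The sketch below is hypothetical (the `DataApp` class and its fields are illustrative inventions, not any real platform's API):

```python
from dataclasses import dataclass, field

@dataclass
class DataApp:
    name: str
    image: str                      # containerized application (Deployment)
    schedule: str = "@daily"        # system-managed scheduling (Production readiness)
    replicas: int = 1               # scaling handled by the platform (Operation)
    readers: list = field(default_factory=list)  # access control (Governance)

    def validate(self):
        """Minimal checks a platform might run before deploying the app."""
        if not self.name or not self.image:
            raise ValueError("name and image are required")
        if self.replicas < 1:
            raise ValueError("replicas must be >= 1")
        return True

app = DataApp(name="daily-report", image="analytics/report:1.0",
              readers=["bi-team"])
assert app.validate()
```

The point of the design is that nothing in the spec says *how* to provision machines, schedule runs, or enforce access; those concerns belong to the platform, not the analyst.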

In short, it is similar to the DevOps methodology: the path from writing the code to deploying it to production, including scheduling and monitoring, should be handled by the same person and follow a system-managed protocol. And just as DevOps provides many standard CI, deployment, and monitoring tools to enable fast delivery, standardizing the big data components means newcomers can quickly set up a production-ready big data platform and put their data to use.

DataOps Methodologies

The main methodologies of DataOps are still evolving rapidly. Companies like Facebook and Twitter usually have their data platform teams handle the operations and achieve these goals. However, their approaches are mostly integrated with their existing ops infrastructure and thus usually not applicable to others. We can learn from their successes and build a general big data platform that can be implemented easily by any company.

To build the general platform DataOps calls for, we think the following technologies are required:

  • Cloud architecture: we have to use a cloud-based infrastructure to support resource management, scalability, and operational efficiency;
  • Containers: containers are critical in the implementation of DevOps, and their role in resource isolation and in providing a consistent dev/test/ops environment is just as critical in implementing a data platform;
  • Real-time and stream processing: real-time and streaming processing is becoming more and more important in a data-driven platform and should be a first-class citizen of a modern data platform;
  • Multiple analysis engines: MapReduce is the traditional distributed processing framework, but frameworks like Spark and TensorFlow are used more and more widely every day and should be integrated;
  • Integrated application and data management: application and data management, including life-cycle management, scheduling, monitoring, and logging support, is essential for a production data platform. Regular DevOps practices can be applied to application management, but extra effort is needed for data management and for the interactions between applications and data;
  • Multi-tenancy and security: data security is perhaps the most important issue in a data project: if the data can't be secured, it mostly can't be used at all. The platform should provide a secure environment in which everyone can use the data, with every operation authorized, authenticated, and audited;
  • Dev and ops tools: the platform should provide efficient tools for data scientists to analyze the data and produce analytics programs, for data engineers to operate the production data pipelines, and for other people to consume the data and results.
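
The "integrated application and data management" item can be sketched in miniature: a pipeline runner where each stage's outcome is tracked and logged by the system rather than by ad-hoc scripts. All names here are illustrative, and a real platform would add retries, alerting, and data lineage on top:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_pipeline(stages, data):
    """Run the stages in order, recording each stage's outcome."""
    status = {}
    for name, fn in stages:
        try:
            data = fn(data)
            status[name] = "ok"
            log.info("stage %s succeeded", name)
        except Exception as exc:
            status[name] = f"failed: {exc}"
            log.error("stage %s failed: %s", name, exc)
            break  # stop here; a real platform would alert and retry
    return data, status

# A toy ingest -> transform -> publish pipeline over made-up data.
stages = [
    ("ingest", lambda d: [int(x) for x in d]),
    ("transform", lambda d: [x * 2 for x in d]),
    ("publish", lambda d: {"records": len(d), "values": d}),
]
result, status = run_pipeline(stages, ["1", "2", "3"])
```

Because every stage reports through the same mechanism, the platform, not the author of each script, owns monitoring and failure handling.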

Our Thoughts

The current big data technologies are powerful, but they are still too hard for ordinary people to use. Implementing a production-ready data platform is still a daunting task. Even in companies that have already started the journey, data platform teams spend most of their time doing what others have already done, reinventing the wheel.

Some companies (Qubole, Datameer, BlueData, etc.) have recognized these problems and started to approach a general solution in different ways. Some use container-based solutions, and some build their platforms from a Hadoop-centric point of view.

We (LinkTime Cloud) are also working on a new-generation data platform. Instead of treating big data applications as special cases, we use a proven distributed operating system (Apache Mesos) and containerize most big data applications to run on Mesos in a uniform manner. This approach allows us to standardize the management of applications and data, providing the key technologies and achieving the goals described above.

We believe that in the near future, instead of just installing Hadoop and running some Hive queries, companies will use integrated big data platforms to build real big data pipelines and applications quickly and easily.

About the author:
Dr. Peng Feng (彭锋), CEO and co-founder of LinkTime Cloud (智领云科技);
a veteran expert in large-scale distributed systems and big data infrastructure;
former big data architect and big data platform technical lead at Twitter;
former Director of Engineering at Ask.com.
Source: https://www.linktimecloud.com/posts/1925
