The evolution of data architecture

The rise of big data technology lets organizations use their business data more flexibly and efficiently, extract more value from it, and apply the results of data analysis and mining to decision-making, marketing, management, and other areas. Inevitably, though, as more and more new technologies are introduced, an enterprise's big data management platform may end up being assembled from a large number of open-source components.

01 Traditional data architecture

As shown in Figure 1-1, the defining feature of the traditional monolithic data architecture (Monolithic Architecture) is centralized data storage. An enterprise may run many internal systems, such as web business systems, order systems, CRM systems, ERP systems, and monitoring systems. The transactional data of these systems is stored mainly in a centralized relational database (DBMS), and most such architectures are divided into a computing layer and a storage layer.
The storage layer handles data access for the enterprise's internal systems and is what ultimately guarantees data consistency. This data reflects the current state of the business, such as the order volume of the order system, the number of active users of a website, and changes in each user's transaction amounts. All update operations go through the same database.
▲ Figure 1-1 Traditional data architecture
A monolithic architecture is efficient in its early days, but as time passes and the business grows, the system becomes large and increasingly difficult to maintain and upgrade. The database is the only authoritative data source, and every application must query it to get the data it needs, so a change or failure in the database affects the entire business system.
Later, with the advent of the microservices architecture (Microservices Architecture), companies began to adopt microservices as the architecture for their business systems. The core idea is that an application is composed of a number of small, mutually independent microservices. Each service runs in its own process and can be developed and released independently of the others. Different services can be built on different technology stacks according to their business needs, and each can focus on a limited set of business functions.
▲ Figure 1-2 Microservices architecture
As shown in Figure 1-2, a microservices architecture splits the system into independent service modules, each with its own database. This model solves the problem of scaling the business system, but it also brings a new problem: business transaction data is scattered across different systems, which makes centralized data management difficult.
For internal applications such as data analysis or data mining, data must be extracted from the different databases and periodically synchronized into a data warehouse. Extract, transform, and load (ETL) steps are then run in the warehouse to build different data marts and applications, which are provided to the business systems.
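The periodic extract, transform, load flow described above can be sketched with three in-memory SQLite databases. This is only an illustration: the table names, columns, and the `run_etl` helper are assumptions made up for the example, and a real pipeline would run on a schedule against production stores.

```python
import sqlite3

# Hypothetical service databases: each microservice owns its own store.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (user_id INTEGER, amount_cents INTEGER)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)",
                      [(1, 1250), (1, 800), (2, 4000)])

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (user_id INTEGER, region TEXT)")
crm_db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1, "north"), (2, "south")])

# Hypothetical warehouse holding one data mart table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total_cents INTEGER)")


def run_etl():
    """One ETL cycle: pull from both services, join, aggregate, load."""
    # Extract: read the raw rows from each service database.
    orders = orders_db.execute("SELECT user_id, amount_cents FROM orders").fetchall()
    regions = dict(crm_db.execute("SELECT user_id, region FROM customers").fetchall())

    # Transform: join orders to regions and sum the amounts per region.
    totals = {}
    for user_id, amount_cents in orders:
        region = regions.get(user_id, "unknown")
        totals[region] = totals.get(region, 0) + amount_cents

    # Load: replace the data mart contents in a single transaction.
    with warehouse:
        warehouse.execute("DELETE FROM sales_by_region")
        warehouse.executemany("INSERT INTO sales_by_region VALUES (?, ?)",
                              sorted(totals.items()))
    return totals


run_etl()
mart = warehouse.execute(
    "SELECT region, total_cents FROM sales_by_region ORDER BY region").fetchall()
```

The data mart is only as fresh as the last `run_etl` call, which is the latency limitation that the later architectures try to remove.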

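02 Big data architecture

At first, data warehouses were built mainly on relational databases such as Oracle and MySQL. As enterprise data volumes grew, however, relational databases could no longer support the storage and analysis of large-scale datasets, so more and more enterprises began building enterprise-grade big data platforms on Hadoop.
At the same time, the many SQL-on-Hadoop solutions that appeared made it simple and efficient to build different kinds of data applications on Hadoop, for example using Apache Hive for ETL processing and Apache Impala for real-time interactive queries.
The rise of big data technology lets enterprises use their business data more flexibly and efficiently, extract more value from it, and apply the results of data analysis and mining to decision-making, marketing, management, and other areas. Inevitably, though, as more and more new technologies are introduced, an enterprise's big data management platform may end up being assembled from a large number of open-source components.
For example, when building an enterprise data warehouse, data is usually synchronized periodically from the business systems to the big data platform and, after a series of ETL transformations, ends up in applications such as data marts. Some applications are time-sensitive, however: real-time report statistics, for instance, must display results with very low latency. To handle these different kinds of data processing, the industry proposed the Lambda architecture.
As shown in Figure 1-3, the big data platform contains a Batch Layer for batch computing and a Speed Layer for real-time computing, integrating batch and stream computing in one platform, for example using Hadoop MapReduce to process batch data and Apache Storm to process real-time data.
This architecture solves the problem of supporting different computation types to some extent, but too many frameworks make the platform overly complex and expensive to operate. Managing different kinds of computing frameworks within a single resource management platform is also very difficult. In short, the Lambda architecture is an effective solution for building big data applications, but it is not the ideal one.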
▲ Figure 1-3 Big data Lambda architecture
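A toy version of the Lambda architecture helps show where its complexity comes from. The sketch below is plain Python rather than MapReduce or Storm, and the names `batch_layer`, `SpeedLayer`, and `serving_layer` are hypothetical. Note that the page-view counting logic has to be written twice, once per layer.

```python
from collections import Counter


def batch_layer(master_dataset):
    """Batch Layer: recompute the whole view from all historical events."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed Layer: update a small real-time view one event at a time."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1


def serving_layer(batch_view, realtime_view, page):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]


# Events already absorbed by the last batch run.
master_dataset = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
batch_view = batch_layer(master_dataset)

# Events that arrived after the last batch run.
speed = SpeedLayer()
for event in [{"page": "/home"}, {"page": "/pay"}]:
    speed.on_event(event)

home_views = serving_layer(batch_view, speed.realtime_view, "/home")
pay_views = serving_layer(batch_view, speed.realtime_view, "/pay")
```

In a real deployment the two layers run on different frameworks, so every change to the business logic has to be made and kept consistent in both places.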
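Later, Apache Spark, a distributed in-memory processing framework, appeared and proposed processing streaming data by splitting it into micro-batches, which makes it possible to do both batch and stream computing within a single framework.
However, because Spark itself is built on a batch processing model, it cannot handle native data streams perfectly or efficiently, so its support for stream computing is relatively weak. In essence, Spark can be seen as an upgrade and optimization of the Hadoop architecture.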
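The micro-batch idea can be illustrated without Spark itself. In this minimal sketch, `micro_batch_sums` is a hypothetical helper that groups events by arrival time into fixed intervals and emits one result per interval, so no result is available before its batch closes.

```python
def micro_batch_sums(events, batch_interval):
    """Group (arrival_time, value) pairs into fixed-length batches.

    Returns (emit_time, batch_sum) pairs, where emit_time is the end of
    the batch interval, the earliest moment the result can be produced.
    """
    batches = {}
    for arrival_time, value in events:
        index = int(arrival_time // batch_interval)
        batches.setdefault(index, []).append(value)
    return [((index + 1) * batch_interval, sum(batches[index]))
            for index in sorted(batches)]


# Arrival times in seconds, with a 1-second batch interval.
events = [(0.1, 5), (0.7, 3), (1.2, 4), (2.9, 1)]
batch_results = micro_batch_sums(events, batch_interval=1.0)
```

The event that arrives at 0.1 s is not reflected in any output until 1.0 s, which is why a micro-batch engine's latency cannot drop below its batch interval.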

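03 Stateful stream computing architecture

Data is, in essence, produced as a series of real events. The architectures described above all depart from this to some degree: they process business data only after a certain delay and then derive accurate statistics from it.
In practice, given the limitations of stream computing technology, it is hard to compute on data as it is produced and emit statistical results directly, because doing so places very high demands on the system, which must deliver high performance, high throughput, and low latency all at once.
The stateful stream computing architecture (shown in Figure 1-4) meets this need to some extent. Working on real-time streaming data, the enterprise maintains the state of every computation, where state means the intermediate results produced during computation. Each time new data enters the streaming system, it is processed on top of the intermediate state, and the system ultimately produces correct statistics.
The biggest advantage of stateful computing is that the raw data does not have to be fetched again from external storage for a full recomputation, which can be very expensive. Seen another way, users no longer have to schedule and coordinate various batch tools to pull statistics out of the data warehouse and then write them back to storage. All of this can be done with stream computing, which greatly reduces the system's dependence on other frameworks and cuts both the time lost during computation and the hardware storage required.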
▲ Figure 1-4 Stateful computing architecture
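If both approaches produce the same results, but real-time computing delivers them in a very short time while batch computing takes a while, most users will likely prefer stateful stream processing for big data workloads.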
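The difference between updating state and recomputing from raw data can be made concrete with a small sketch. The class `RunningStats` and the function `recompute_from_scratch` are hypothetical names, and the state here is just a Python dictionary rather than a fault-tolerant store.

```python
class RunningStats:
    """Stateful operator: keeps (count, total) per user as intermediate state."""

    def __init__(self):
        self.state = {}

    def process(self, user, amount):
        # Only the previous state and the new event are touched.
        count, total = self.state.get(user, (0, 0))
        self.state[user] = (count + 1, total + amount)
        return self.state[user]


def recompute_from_scratch(history, user):
    """Stateless alternative: rescan every stored event for the user."""
    amounts = [amount for event_user, amount in history if event_user == user]
    return len(amounts), sum(amounts)


events = [("alice", 30), ("bob", 10), ("alice", 20)]

stats = RunningStats()
for user, amount in events:
    latest = stats.process(user, amount)

alice_state = stats.state["alice"]
alice_recomputed = recompute_from_scratch(events, "alice")
```

Both paths give the same answer, but `process` does constant work per event while `recompute_from_scratch` reads the full history on every query.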

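04 Why Flink

Stateful stream computing looks set to become the architectural pattern enterprises use to build data platforms, and judging from the community today, Apache Flink is the only framework that meets this need. By implementing the Google Dataflow stream computing model, Flink provides a real-time stream computing framework that combines high throughput, low latency, and high performance.
Flink also supports highly fault-tolerant state management, which prevents state from being lost when the system fails during computation. It periodically persists state using Checkpoints, a distributed snapshot technique, so correct results can be computed even after downtime or failures.
Flink has an advanced architectural design, many excellent features, and a complete set of programming interfaces, and each release adds new features. Queryable State, for example, lets users remotely fetch the state of a streaming job directly, so data can be queried from a Flink streaming application without first being written to a database. Real-time interactive query workloads can read the latest results straight from Flink's state.
In the future, Flink will be more than a real-time stream processing framework and may well become a real-time state storage engine, letting more users benefit from stateful computing.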
Flink's specific advantages are as follows.

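1. Simultaneous high throughput, low latency, and high performance

Flink is currently the only distributed stream processing framework in the open-source community that combines high throughput, low latency, and high performance. Apache Spark delivers only high throughput and high performance, mainly because Spark Streaming cannot guarantee low latency, while the stream computing framework Apache Storm offers low latency and high performance but cannot meet high-throughput requirements. Achieving all three goals is very important for a distributed stream computing framework.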

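2. Support for event time (Event Time)

Window computation plays a central role in stream computing, but most frameworks today compute windows using system time (Process Time), that is, the current time of the host machine when the event reaches the framework for processing.
Flink supports window computation based on event time (Event Time) semantics, that is, the time at which the event was produced. With this event-driven mechanism, the stream system can compute accurate results even when events arrive out of order, preserving the order in which events originally occurred and minimizing the influence of network transmission or hardware.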
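The effect of the two time semantics can be seen by assigning the same events to one-minute windows twice. This is a plain-Python sketch, not Flink code: the field names `event_time` and `arrival_time` and the helper `tumbling_window_counts` are assumptions for the example, and it leaves out watermarks, which Flink uses to decide when an event-time window is complete.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # window length in seconds


def tumbling_window_counts(events, time_field):
    """Count events per fixed window, keyed by the window's start time."""
    counts = defaultdict(int)
    for event in events:
        timestamp = event[time_field]
        window_start = timestamp - timestamp % WINDOW_SIZE
        counts[window_start] += 1
    return dict(counts)


events = [
    {"event_time": 10, "arrival_time": 12},
    {"event_time": 70, "arrival_time": 71},
    # Produced in the first minute but delayed in transit until the second.
    {"event_time": 50, "arrival_time": 75},
]

by_event_time = tumbling_window_counts(events, "event_time")
by_processing_time = tumbling_window_counts(events, "arrival_time")
```

With event time the delayed record is still counted in the minute it was produced, whereas processing time shifts it into the next window and skews both counts.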

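3. Support for stateful computing

Flink provides built-in state management. State means the intermediate results of an operator, kept in memory or in the file system during stream computation. When the next event enters the operator, the current result is computed from the intermediate results in the previous state, so there is no need to recompute statistics from all the raw data each time. This greatly improves system performance and reduces the resources consumed during computation.
Stateful computing plays a very important role in streaming scenarios with large data volumes and complex computation logic.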

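4. Highly flexible window (Window) operations

In stream processing applications, data is continuous, so aggregations over a bounded range of the stream must be computed through windows, for example counting how many users clicked a given web page in the past minute. In that case we must define a window that collects the data from the most recent minute and then computes over the data in that window.
Flink divides window operations into types such as Time, Count, Session, and Data-driven windows. Windows can be customized with flexible trigger conditions to support complex streaming patterns, and users can define different trigger mechanisms to meet different needs.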
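To make two of these window types concrete, here is a small sketch of count windows and session windows over an in-memory list. The helpers `count_windows` and `session_windows` are hypothetical and only approximate the semantics: the count window emits full windows only, and a session closes once the gap to the next event reaches the configured length.

```python
def count_windows(values, size):
    """Count window: emit a window every `size` elements (full windows only)."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]


def session_windows(timestamps, gap):
    """Session window: start a new session after `gap` or more of inactivity."""
    sessions = []
    for timestamp in sorted(timestamps):
        if sessions and timestamp - sessions[-1][-1] < gap:
            sessions[-1].append(timestamp)   # still within the current session
        else:
            sessions.append([timestamp])     # inactivity gap: open a new session
    return sessions


count_result = count_windows([1, 2, 3, 4, 5], size=2)
session_result = session_windows([1, 3, 4, 20, 22, 40], gap=5)
```

The count window is driven purely by the number of elements, while the session window's boundaries depend on the data itself, which is why its size cannot be known in advance.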

5. Fault tolerance based on lightweight distributed snapshots (Snapshot)

Flink can run distributed across thousands of nodes. It breaks a large computing job into small computing processes and distributes the resulting tasks to parallel nodes for processing. During execution, it can automatically detect data inconsistencies caused by errors in event processing, for example node failures, network transmission problems, or restarts of the computing service caused by user upgrades or fixes.
In these cases, Flink uses Checkpoints, based on distributed snapshot technology, to persist the state of the running job. If a task stops abnormally, Flink can automatically recover it from the Checkpoints, ensuring the consistency of the data during processing.
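The recovery idea can be sketched in a single process: snapshot the operator state together with the source position, and after a failure roll both back and replay. The `CountingJob` class below is hypothetical and much simpler than Flink's mechanism, which coordinates snapshots across many parallel tasks. It assumes the source can be re-read from a saved offset.

```python
import copy


class CountingJob:
    """Word-count job that periodically snapshots its state and source offset."""

    def __init__(self, source, checkpoint_interval):
        self.source = source
        self.checkpoint_interval = checkpoint_interval
        self.offset = 0
        self.counts = {}
        self.checkpoint = {"offset": 0, "counts": {}}

    def take_checkpoint(self):
        # State and read position are saved together so they stay consistent.
        self.checkpoint = {"offset": self.offset,
                           "counts": copy.deepcopy(self.counts)}

    def restore(self):
        self.offset = self.checkpoint["offset"]
        self.counts = copy.deepcopy(self.checkpoint["counts"])

    def run(self, fail_at=None):
        while self.offset < len(self.source):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated node failure")
            word = self.source[self.offset]
            self.counts[word] = self.counts.get(word, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_interval == 0:
                self.take_checkpoint()
        return self.counts


source = ["a", "b", "a", "c", "a"]
job = CountingJob(source, checkpoint_interval=2)
try:
    job.run(fail_at=3)          # fails after processing three events
except RuntimeError:
    job.restore()               # roll back to the checkpoint taken at offset 2
recovered = job.run()           # replay from offset 2 to the end
```

The third event was processed before the failure but after the last checkpoint, so it is replayed. Because the counts were rolled back as well, it is still counted only once.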

6. Independent memory management built on the JVM

Memory management is an important consideration for every computing framework, and for compute-intensive workloads in particular, how data is managed in memory is critical. Flink implements its own memory management mechanism to minimize the impact of JVM GC on the system.
In addition, Flink uses serialization and deserialization to store all data objects in memory in binary form. This reduces the size of the stored data and uses memory more efficiently, lowering the risk of GC-induced performance degradation or task failures. As a result, Flink tends to be more stable than other distributed processing frameworks, and issues such as JVM GC are less likely to disrupt the whole application.
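The general technique of keeping records as packed bytes in one preallocated buffer, instead of as many small objects, can be shown with Python's `struct` module. This is an analogy only and not Flink's implementation: the `BinaryRecordBuffer` class and its record layout (a 64-bit id plus a 64-bit float) are assumptions for the example.

```python
import struct

# Fixed binary layout: little-endian int64 user id followed by float64 amount.
RECORD = struct.Struct("<qd")


class BinaryRecordBuffer:
    """Stores records serialized in a single preallocated byte segment."""

    def __init__(self, capacity):
        self.segment = bytearray(capacity * RECORD.size)
        self.count = 0

    def append(self, user_id, amount):
        # Serialize straight into the segment; no per-record object is kept.
        RECORD.pack_into(self.segment, self.count * RECORD.size, user_id, amount)
        self.count += 1

    def get(self, index):
        # Deserialize a single record on demand.
        return RECORD.unpack_from(self.segment, index * RECORD.size)

    def total_amount(self):
        return sum(self.get(i)[1] for i in range(self.count))


buffer = BinaryRecordBuffer(capacity=3)
buffer.append(1, 12.5)
buffer.append(2, 7.5)
buffer.append(3, 30.0)

second_record = buffer.get(1)
segment_bytes = len(buffer.segment)
total = buffer.total_amount()
```

Three records occupy one 48-byte buffer that the garbage collector sees as a single object, which is the property that keeps collection pauses small when the record count grows.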

7. Save Points (savepoints)

For streaming applications that run 24/7 with data arriving continuously, stopping the application for a period of time can cause data loss or inaccurate results, for example during a cluster version upgrade or downtime for operations and maintenance.
Notably, Flink's Save Points technology stores a snapshot of the job's execution state on a storage medium. When the job is restarted, it can restore its original computing state directly from the saved Save Points and continue running from the state it was in before the shutdown. Save Points let users operate and maintain real-time streaming applications more easily.
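A planned stop and restart can be sketched as writing the job state to a file and reading it back when the job starts again. The functions `run_job`, `write_savepoint`, and `read_savepoint` and the JSON file format are hypothetical. Flink's own savepoints are triggered through its tooling and use its own format.

```python
import json
import os
import tempfile


def run_job(events, state, stop_after=None):
    """Sum events starting at state['offset']; optionally stop early."""
    while state["offset"] < len(events):
        if stop_after is not None and state["offset"] >= stop_after:
            break  # planned shutdown, for example for an upgrade
        state["total"] += events[state["offset"]]
        state["offset"] += 1
    return state


def write_savepoint(state, path):
    with open(path, "w") as handle:
        json.dump(state, handle)


def read_savepoint(path):
    with open(path) as handle:
        return json.load(handle)


events = [5, 3, 8, 2]
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "savepoint.json")

    # First run: process two events, then stop for maintenance.
    stopped = run_job(events, {"offset": 0, "total": 0}, stop_after=2)
    write_savepoint(stopped, path)

    # Restart: load the saved state and continue where the job left off.
    resumed = run_job(events, read_savepoint(path))
```

Unlike an automatic checkpoint, the snapshot here is taken deliberately by the operator, and the restarted job may be a new version of the code as long as it can read the saved state.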



Origin blog.51cto.com/14163835/2421391