Flink's Past and Present

Why Flink?

People arrive at a correct understanding of something by drawing valid conclusions from evidence, and the most effective way to reach such conclusions is often to analyze events along the track on which they occur.

Many systems generate continuous streams of events: moving automobiles emit GPS signals, financial transactions flow between institutions, mobile base stations exchange signals with busy smartphones, network traffic is measured, machines write logs, and industrial sensors and wearable devices take readings, among many others. If we can analyze these massive data streams efficiently, our understanding of the systems they describe becomes clearer and faster. In short, streaming data more realistically reflects the way we live.

It is therefore natural to want to collect data as a flow of events and to process it that way. Until now, however, this has not been standard practice in the industry. Stream processing is not a new concept, but it has been a highly specialized and challenging technology. In fact, most common enterprise data architectures still assume that data is a finite set with a beginning and an end. A major reason for this assumption is that storage and processing systems built around finite data sets are relatively simple to design. Doing so, however, artificially imposes limits on scenarios that are naturally streaming.

We are eager to process data as streams, but doing so is difficult, and as large-scale data appears in every industry it becomes harder still. Part of the difficulty is almost a problem of physics: in a large-scale distributed system, our knowledge of data consistency and of the order of events is necessarily limited. As methods and technologies evolve, we try to keep these limits from jeopardizing business and operational goals.

Against this background, Apache Flink (hereafter Flink) came into being. Born as open source software in a public community, Flink processes large volumes of streaming data, and handles batch processing with the same technology.

Throughout Flink's development, its developers focused on avoiding the compromises in efficiency or ease of use that other stream processing approaches had made.

This book discusses the potential benefits of stream processing, to help you decide whether a streaming approach fits your business goals. Some of the data sources and application scenarios may surprise you. The book will also help you understand how Flink's technology overcomes the difficulties that stream processing faces.

This chapter describes what people want to gain from analyzing streaming data, and the difficulties of doing so at scale. It is an introductory look at Flink, and at how people, including those running production environments, typically use it.

1.1 The Consequences of Not Doing Streaming Well

Who needs to process streaming data? The first examples that come to mind are people working with sensor measurements or financial transactions, for whom stream processing is clearly useful. But streaming data sources are far more widespread; two common examples are clickstream data, which reflects user behavior on websites, and machine logs from private data centers. In fact, streaming sources are everywhere, yet getting data from continuous events has not meant being able to use it beyond batch computation. Today, new technologies for large-scale stream processing are changing this situation.

If handling large data streams is such a long-standing problem, why bother building better stream processing systems now? Before introducing the new architectures and technologies that support streaming, let's consider what happens when streaming data is not handled well.

1.1.1 Retail and Marketing

In modern retail, website traffic stands in for sales, and a site receives large volumes of click data continuously. Handling data of this magnitude with conventional techniques is difficult. Building a batch system merely to process a data stream1 is very challenging: the result is likely to be a large, complex system, and the traditional approach brings problems such as data loss, delays, and erroneous aggregate results. How helpful can such results be to the business?

1 In this book, "data stream" refers to a continuous stream of data; "streaming data" refers to the data within such a stream. - Translator's note

Suppose you report quarterly sales figures to your CEO. You certainly do not want to have to issue corrections afterward because the numbers were computed from inaccurate data. If click data is not handled well, website traffic is likely to be miscounted, which in turn makes advertising and promotion performance figures inaccurate.

Air passenger services face the same challenge: airlines need to process large amounts of data from many sources quickly and accurately. When a passenger checks in, for example, the booking data must be verified, and baggage handling, flight status, and billing information must all be consulted. Without solid stream processing technology, data at this scale is hard to keep error-free. In recent years, three of the four major US airlines have suffered large service outages, attributable to failures in large-scale real-time data processing.

Of course, many related problems (such as avoiding double-booking a hotel room or a concert ticket) can generally be solved with careful database operations, but such operations are costly in money and in effort. The cost soars as data volume grows, and in some cases the database becomes extremely slow to respond. The resulting lack of flexibility slows development, and large, complex projects or systems become hard to change. Processing streaming data in large systems while effectively controlling cost and maintaining consistency is very difficult.

Fortunately, modern stream processors can often solve these problems in new ways that make real-time processing of massive data much cheaper. Stream processing has also inspired new applications, such as systems that make recommendations in real time based on what a customer is currently buying, suggesting other goods they might need. This does not mean that stream processors replace databases (far from it); rather, stream processors offer a better solution for workloads that databases handle poorly. This also frees the database from having to participate in real-time analysis of the current state of the business. Chapter 2, which introduces stream processing architecture, explains this change in more depth.

1.1.2 The Internet of Things

The Internet of Things (IoT) is a field where streaming data is widely applied. In IoT, low-latency data transmission and processing, together with accurate analysis, are often critical. Sensors in all kinds of devices take measurements frequently and transmit them to data centers as streams. There, real-time or near-real-time applications update dashboards, run machine learning models, issue warnings, and provide feedback for many different services.

Transportation also illustrates the importance of stream processing. Advanced train systems, for example, rely on sensor measurements transmitted from the track to the train and from sensors along the train, while reports are sent back to the control center at the same time. The measurements include the train's speed and position and the conditions of the track and its surroundings. If this streaming data is not processed correctly, the resulting adjustments and warnings cannot keep up with events, and the system cannot react to dangerous situations in time to avoid accidents.

Another example is the "smart," or connected, car, which transmits data back to the manufacturer over the mobile network. In some countries (the Nordic countries, France, and the United Kingdom, with the United States just beginning), connected cars can even send information to insurance companies, and when the car is close enough, the information can be transmitted over a radio-frequency link for analysis by the service station. In addition, some smartphone applications let millions of drivers share real-time traffic information.

Figure 1-1: Many cases require considering the timeliness of data, including the use of IoT data in transportation. Sharing real-time traffic information among millions of drivers relies on timely, reasonable, and accurate analysis of streaming data (source: © 2016 Friedman)

IoT also affects utilities. Utility companies have begun installing smart meters to replace the old meters that had to be read manually each month. Smart meters report to the electric company at regular intervals (for example, every 15 minutes), and some companies are experimenting with measurements every 30 seconds. This shift to smart meters produces a great deal of streaming data and opens up many potential benefits. One advantage is the ability to use machine learning models to detect anomalies such as equipment failure or electricity theft. Without high-throughput, low-latency, and accurate processing of streaming data, these new goals cannot be achieved.

Other IoT projects also suffer when stream processing is done poorly. Large equipment, such as wind turbines, pumps, and drilling rigs, relies on analysis of sensor measurements for early fault warnings. Failing to handle the streaming data from these devices can carry a high price, and can even lead to catastrophic consequences.

1.1.3 Telecommunications

The telecommunications industry is a special case: it makes wide use of streaming event data across system boundaries, for a variety of purposes. A telecom company that cannot process streaming data cannot shift traffic to other base stations before a peak appears at one of its cells, and cannot respond quickly during outages. Anomaly detection over data streams, such as detecting equipment failures or dropped calls, is essential to the industry.

1.1.4 Banking and Finance

The potential problems that poor stream processing brings to banking and finance are especially significant. Banks do not want their retail customers' transactions delayed, or account balances mis-stated because of errors. There used to be a phrase, "banker's hours," referring to banks closing early in the afternoon so that settlement could be completed and accounts computed accurately before the next business day. That batch-oriented business model has disappeared. Today, transactions and reports must be produced quickly and accurately; some newer banks even offer real-time push notifications and mobile banking available anytime, anywhere. In a globalized economy, the ability to provide service around the clock is increasingly important.

So what happens to a financial institution whose applications lack the sensitivity to detect anomalous user behavior in real time? Credit card fraud detection requires timely monitoring and feedback, and detecting anomalous logins can reveal phishing attacks in time to avoid huge losses.

 In many situations, people want to apply real-time or low-latency stream processing to time-sensitive data, provided that the stream processing itself is accurate and efficient.

1.2 Goals of Processing Continuous Event Data

Very low-latency processing is not the only advantage of stream processing. Beyond low latency and high throughput, people want stream processing to handle interruptions. An excellent stream processor can restart after a system crash and still produce accurate results; in other words, it is fault tolerant and can guarantee exactly-once2 processing.

2 For an explanation of exactly-once, see Section 5.1. - Editor's note

At the same time, the technology that achieves this level of fault tolerance should not carry much overhead when no failures occur. It should be able to trace events in the correct order based on the time at which they happened (rather than on arbitrarily set processing intervals). The system must also be easy to operate and maintain, whether developers are writing code or correcting errors. Just as important, the results the system produces must be consistent with the order in which events actually happened: for example, it should be able to handle out-of-order event streams (an unfortunate but unavoidable fact) and to replay streams of data accurately (which is useful for auditing and debugging).
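The contrast between event time (when something happened) and arrival order can be illustrated with a small Python sketch. This is a toy model, not Flink's implementation; the function name and window size are invented for illustration:

```python
def event_time_counts(events, window_size):
    """Count events per tumbling event-time window. Each event is
    assigned by the timestamp it carries, not by arrival order, so
    an out-of-order stream yields the same counts as a sorted one."""
    counts = {}
    for timestamp, _value in events:
        window_start = (timestamp // window_size) * window_size
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts

# The event stamped 12 arrives last, out of order, yet it is still
# counted in the [10, 20) window where it belongs.
out_of_order = [(3, "a"), (17, "b"), (25, "c"), (12, "d")]
print(event_time_counts(out_of_order, 10))
# {0: 1, 10: 2, 20: 1}
```

A system that instead grouped events by arrival time would put the late event in the wrong window, which is exactly the kind of inconsistency described above.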

1.3 The Evolution of Stream Processing Technology

Processing continuous real-time data separately from bounded batch data makes each system simpler to build, but this approach leaves the complexity of managing two systems to their users: the application development team and the DevOps team must use and manage both systems themselves.

To cope with this situation, some users developed their own stream processing systems. In the open source world, the Apache Storm project (hereafter Storm) was a streaming pioneer. Storm was first developed by Nathan Marz and a team at the startup BackType (later acquired by Twitter), and was subsequently accepted into the Apache Software Foundation. Storm provides low-latency stream processing, but that real-time capability comes at a price: high throughput is hard to achieve, and Storm cannot deliver the level of correctness that is often required. In other words, it does not guarantee exactly-once semantics, and even at the correctness level it can guarantee, its overhead is considerable.

An Overview of the Lambda Architecture: Advantages and Limitations

The demand for low-cost, large-scale processing encouraged the use of distributed file systems such as HDFS and batch computing systems (MapReduce jobs) for batch data. But such systems struggle to achieve low latency. Real-time stream processing technology built on Storm helped address the latency problem, but not perfectly. One reason is that Storm does not support exactly-once semantics and therefore cannot guarantee the accuracy of state data; it also lacks support for processing based on event time. Users with these requirements had to add the functionality in their own application code.

Later, a hybrid analysis method emerged that combined the two approaches, aiming to guarantee both low latency and correctness. This method, called the Lambda architecture, produces accurate, if somewhat delayed, results through batch MapReduce jobs, while Storm presents preliminary results for the most recent data in the meantime.

The Lambda architecture was an effective framework for building big data applications, but it is not good enough. For example, in a Lambda system based on MapReduce and HDFS, there is a window of up to several hours during which inaccurate results caused by failed real-time tasks persist. The Lambda architecture also requires the same business logic to be programmed twice, against two different APIs (application programming interfaces): once for the batch system and once for the streaming system. Two code bases for the same business problem, each with its own kinds of bugs, make such a system very hard to maintain.

 

 To compute a result that depends on multiple streaming events, data must be retained from one event to the next. This retained data is called the state of the computation. Consistent state is essential for accurate results, and the ability to keep updating state accurately after a failure or interruption is key to fault tolerance.

Maintaining low latency and high throughput in a fault-tolerant stream processing system is very difficult, but the need for guaranteed, accurate state inspired an alternative: dividing the continuous stream of events into a series of small batch jobs. If the batches are cut small enough (so-called micro-batches), the computation can come close to true stream processing. There is always some latency, so it can never be fully real-time, but for simple applications a delay of a few seconds or even sub-seconds is achievable. This is the approach taken by Apache Spark Streaming (hereafter Spark Streaming), which runs on the Spark batch engine.
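The micro-batch idea can be sketched in a few lines of Python. This toy model cuts by record count, whereas a real engine such as Spark Streaming cuts by time interval; the function name is invented here:

```python
def micro_batches(stream, batch_size):
    """Cut a continuous stream into small fixed-size batches, the way
    a micro-batching engine approximates stream processing."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # hand a complete small batch to the batch engine
            batch = []
    if batch:  # flush any final partial batch
        yield batch

print(list(micro_batches(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

Each yielded batch can then be processed (and, on failure, simply re-run) by an ordinary batch engine, which is the source of both the approach's fault-tolerance story and its latency floor.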

More importantly, the micro-batch approach can achieve exactly-once semantics and thereby protect state consistency: if a micro-batch job fails, it can simply be run again. This is easier than in continuous stream processing. Trident, an extension of Storm, layers micro-batch computation over Storm's underlying stream processing engine to achieve exactly-once semantics, but it pays a high price in latency.

However, simulating stream processing with intermittent batch jobs entangles development with operations. The time at which an intermittent batch job must finish is tightly coupled to the time at which its data arrives, and any delay can lead to inconsistent (or wrong) results. The underlying problem with this technique is that time is controlled entirely by the part of the system that generates the small batch jobs. Spark Streaming and some other stream processing frameworks mitigate this drawback to a degree, but cannot avoid it entirely. In addition, the approach gives a poor user experience, especially for latency-sensitive jobs, and a great deal of effort goes into writing code just to achieve adequate performance.

To obtain the desired functionality, people kept improving existing processors (Storm's Trident, for example, began as an attempt to overcome Storm's limitations). When an existing processor falls short, the application developer must face and solve all the resulting consequences. With the micro-batch approach, for example, people usually want event data divided according to when events actually happened, yet the processor can only divide it according to batch-job time (the recovery interval). This lack of flexibility and expressiveness slows development and raises the cost of operations and maintenance.

And so Flink appeared. This data processor avoids the drawbacks above, provides many sought-after capabilities, and processes data efficiently according to continuous events. Some of Flink's features are shown in Figure 1-2.

Figure 1-2: One of Flink's advantages is its set of important stream-computation capabilities, which other projects can achieve only at a price. Storm, for example, achieves low latency but, at the time of writing, cannot achieve high throughput and cannot accurately maintain processing state in case of failure; Spark Streaming achieves high throughput and fault tolerance through micro-batching, but at the cost of low latency and true real-time processing; it also cannot match windows to event time naturally, and its expressiveness is limited

Like Storm and Spark Streaming, other stream processing technologies offer some useful capabilities, but none is as complete as Flink. For example, Apache Samza (hereafter Samza), an early open source stream processor, not only fails to provide exactly-once semantics but also offers only a low-level API. Likewise, Apache Apex provides some of the same capabilities as Flink, but not all of them (for example, it offers only a low-level API, and supports neither event time nor batch computation). And none of these projects has an open source community comparable in scale to Flink's.

Let's now look at what Flink is and how it came to be.

1.4 A First Look at Flink

The top of Flink's home page3 states the project's philosophy: "Apache Flink is an open source stream processing framework for distributed, high-performance, always-available, and accurate stream processing applications." Flink not only provides real-time computation with both high throughput and exactly-once semantics, it also handles batch processing, which surprises many people. You need not trade one for the other: Flink implements both capabilities with a single technology.

3 http://flink.apache.org

How was this top-level Apache project born? Flink has its origins in the Stratosphere project, a research effort conducted from 2010 to 2014 by three universities in Berlin together with several other universities in Europe. By then the project had already attracted a sizable community, in part through appearances at a number of public developer conferences, such as Berlin Buzzwords in Berlin and NoSQL Matters in Cologne. This strong community base was one reason the project was well suited for incubation under the Apache Software Foundation.

In April 2014, the Stratosphere code was forked and donated to the Apache Software Foundation; the initial members of the incubating project were the core developers of the Stratosphere system. Soon afterward, many of the founding members left the university and started a company to commercialize Flink, naming it data Artisans. During incubation, the project's name was changed to avoid a clash with an unrelated project. The name Flink was chosen to highlight the distinctiveness of this stream processor: in German, flink means fast and nimble. A colorful squirrel was adopted as the logo, not only because squirrels are fast and nimble, but also because the squirrels of Berlin have a charming reddish-brown color.

Figure 1-3: Left: a red squirrel of Berlin with charming ears. Right: the Flink squirrel logo, with a charming tail whose colors echo those of the Apache Software Foundation logo. An Apache-style squirrel!

The project completed incubation quickly, and in December 2014 Flink became a top-level project of the Apache Software Foundation. As one of the foundation's five largest big data projects, Flink has more than 200 developers worldwide and numerous production deployments at a range of companies, some of them Global 500 firms. At the time of writing, 34 Flink meetups had been held in cities around the world, with roughly 12,000 members, and Flink speakers had presented at many big data conferences. In October 2015, the first Flink Forward conference was held in Berlin.

Batch Processing and Stream Processing

How does Flink achieve batch processing and stream processing at the same time? The answer is that Flink treats batch processing (that is, the processing of finite, static data) as a special case of stream processing.
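The idea of batch as a special case of streaming can be sketched in plain Python. This is a toy illustration, not Flink's engine; the function and values are invented here. The streaming operator never needs to know whether its input is bounded:

```python
def running_sums(stream):
    """A streaming operator: emit an updated sum for every record.
    It works identically on bounded and unbounded inputs."""
    total = 0
    for value in stream:
        total += value
        yield total

# An unbounded source would keep yielding results forever; a bounded
# "batch" input simply runs the stream to its end, and the last
# emitted value is the batch result.
batch = [1, 2, 3, 4]
results = list(running_sums(batch))
print(results)       # [1, 3, 6, 10]
print(results[-1])   # 10, the batch answer
```

In this view, a batch job is just a stream that happens to end, which is why one engine can serve both workloads.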

Flink's core computational construct is the Flink runtime shown in Figure 1-4, a distributed system that accepts streaming dataflow programs and executes them in a fault-tolerant manner on one or more machines. The Flink runtime can run on a cluster as a YARN (Yet Another Resource Negotiator) application, on a Mesos cluster, or on a single machine (which is very useful for debugging Flink applications).

Figure 1-4: The core components of the Flink stack. Notably, Flink offers a stream-oriented interface (the DataStream API) and a batch-oriented interface (the DataSet API), so it can do both stream processing and batch processing. Flink's libraries cover machine learning (FlinkML), complex event processing (CEP), and graph processing (Gelly), as well as Table APIs for both stream and batch processing

Programs accepted directly by the Flink runtime are powerful, but such programs are verbose and laborious to write. For this reason, Flink provides APIs layered on top of the runtime that make it easier to write streaming programs: the DataStream API for stream processing and the DataSet API for batch processing. It is worth noting that although the Flink runtime is stream-based, the DataSet API was developed before the DataStream API, because the industrial demand for unbounded stream processing was not large when Flink was born.

The DataStream API supports fluent analysis of unbounded data streams and can be used from Java or Scala. Developers work with a data structure called DataStream, which represents a distributed data stream that never ends.

Flink's distributed nature shows in its ability to run on hundreds or thousands of machines: it divides a large computation into many small parts, with each machine executing one part. Flink automatically ensures that the computation continues through machine failures and other errors, and that it can be deliberately re-executed after a bug fix or version upgrade. This frees developers from worrying about failure. At bottom, Flink uses fault-tolerant data streams, which lets developers analyze data that is continuously produced and never ends (that is, stream processing).

 Flink solves many problems for the developer, such as guaranteeing exactly-once semantics and providing event-time-based data windows. Developers no longer need to solve these problems at the application layer, which greatly reduces the chance of bugs.

Because engineers no longer spend their application-coding time working around such problems, their time is put to full use, and the whole team benefits. The gains are not limited to shorter development time: with the added flexibility, the overall quality of the team's work improves, and operations become easier and more efficient. Flink gives applications good performance in production. Although relatively new, Flink is already used in production, as the next section describes in more detail.

1.5 Flink in Production

This chapter set out to explore why one would choose Flink. A good way to answer that question is to hear developers who use Flink in production explain why they chose it and how they use it.

1.5.1 Bouygues Telecom

Bouygues Telecom is the third-largest mobile operator in France and part of the Bouygues Group, a Global 500 company. Bouygues Telecom uses Flink for real-time event processing, continuously analyzing billions of messages per day. In a June 2015 post on the data Artisans blog, Mohamed Amine Abdessemed4 described Bouygues Telecom's goals and why Flink could meet them.

4 He is responsible for technical systems at Bouygues Telecom. - Editor's note

…In the end, Bouygues Telecom chose Flink because it supports true stream processing, in both its high-level APIs and its underlying execution engine, which met our requirements for programmability and low latency. In addition, with Flink we got our system into production extremely quickly, which no other solution could match. As a result, we had more people free to develop new business logic.

Mohamed Amine Abdessemed also gave a talk at the Flink Forward conference in October 2015. Bouygues Telecom wanted to give its engineers real-time feedback about the customer experience: what is happening across the company's network worldwide, and how the network is trending in terms of its evolution and operation.

To achieve this goal, the team built a system to analyze network equipment logs, defining indicators of customer experience quality. The system handles 2 billion events per day with a required end-to-end latency of under 200 milliseconds (including message publication by the transport layer and data processing in Flink), all on a small cluster of only 10 nodes with 1 GB of memory each. Bouygues Telecom also wanted the partially processed data to be reusable, so that a variety of business intelligence analyses could be served without interfering with one another.

The company planned to use Flink's stream processing to transform and mine the data. The processed data is pushed back into the message transport system, so that it can be consumed by different users.

Compared with the alternatives, such as processing the data before it enters the message queue, or splitting the work among multiple applications that each consume the same queue, Flink's approach was a better fit.

Using Flink's stream processing, Bouygues Telecom accomplished both data processing and data movement, meeting its latency requirement with high reliability, high availability, and ease of use. Flink greatly eased debugging, even supporting a switch to local execution for that purpose. It also supports program visualization, which helps in understanding how a program runs. In addition, Flink's APIs appealed to many developers and data scientists. In his post, Mohamed Amine Abdessemed also mentioned that other teams at Bouygues Telecom were using Flink to solve different problems.

1.5.2 Other Examples

King

King's games are enormously popular; at almost any moment, somewhere in the world, someone is playing one of its online games. A leader in online entertainment, the company says it has developed more than 200 games offered in more than 200 countries and regions.

King's engineers wrote in a blog post: "With over 300 million monthly unique users and 30 billion events received every day from the different games and systems, any stream analytics on this scale of data is a real technical challenge. It is therefore crucial for us to develop tools for our data analysts that can handle these massive data streams while keeping maximal flexibility for their applications."

The system King built with Flink lets its data analysts work with massive streams of data in real time. Flink's maturity impressed them: even in an application environment as complex as King's, Flink provides good support.

Zalando

As Europe's leading online fashion platform, Zalando has more than 16 million customers worldwide. The company's website describes its organization as "multiple small, agile, autonomous teams" (in other words, the company uses a microservices architecture).

A stream processing architecture provides good support for microservices. Flink's stream processing capabilities therefore fit this way of working, in particular supporting business process monitoring and continuous ETL5.

5 ETL stands for Extract, Transform, and Load. - Editor's note

Otto Group

The Otto Group is the world's second-largest B2C (business-to-consumer) online retailer, and Europe's largest B2C online retailer in fashion and lifestyle.

When its business intelligence division first evaluated open source stream processing platforms, it found none that met its requirements, and so it decided to develop its own stream processing engine. After trying Flink, however, the division found that it met all of its stream processing needs, including crowd-sourced user-agent identification and the identification of search sessions.

ResearchGate

Measured by active users, ResearchGate is the largest academic social network. It has used Flink since 2014 as a primary tool in its data infrastructure, for both batch and stream processing.

Alibaba Group

Alibaba, the huge e-commerce group, provides platforms for buyers and sellers. Its online recommendations are produced by Blink, a system based on Flink. One benefit of using a true stream processing engine such as Flink is that purchases a user makes during the day can inform recommendations that same day. This matters especially on special dates (holidays) when user activity is unusually high, and it is one of the advantages of efficient stream processing over batch processing.

1.6 Where Flink Fits

This chapter opened with the question "Why Flink?" Behind it lies a larger question: "Why streaming data?" This chapter has explained some of the reasons: in many situations, we need to observe and analyze the data produced by continuous events. Streaming data is less special than it is natural; we simply lacked stream processing capability before, and had to do something special, such as accumulating streaming data into batches, before we could use it for large-scale computation at all. Using streaming data is not new; what is new is the technology that lets us use it at scale, flexibly, naturally, and at low cost.

Flink is not the only stream processing tool. A variety of emerging technologies are being developed and improved to meet stream processing needs. Obviously, any team chooses a technology for many reasons, including the skills its members already have. But Flink's strengths, its ease of use, and the range of benefits it brings make it very attractive, and its growing, highly active community suggests it is worth a try. You may find that the question "Why Flink?" becomes "Why not Flink?"

Before diving into how Flink works, Chapter 2 looks at how to design a data architecture that gains the full benefit of stream processing, and at the many advantages a stream processing architecture brings.

