Big Data Technology: Alibaba, Baidu, and Tencent All Chose Flink. What Exactly Is So Magical About It?

 

 

From media feed recommendations to the real-time data dashboards of shopping festivals, real-time computing is now used in many everyday and workplace scenarios. As businesses grow rapidly, the demand for real-time computing keeps increasing.

 

The open source big data ecosystem offers a variety of real-time computing engines, such as Storm, Samza, and Flink, but only Spark and Flink support both stream and batch processing in a single engine. Many companies have migrated, or are migrating, their computing tasks from the older Storm system to Flink, and Tencent is one of them.

 

Tencent's real-time computing team provides efficient, stable, and easy-to-use real-time data services for the business. Its peak data ingestion reached 210 million records per second, it ingests 17 trillion records per day, its data grows by 3 PB per day, and it performs 20 trillion real-time computations per day.

 

Its early real-time computing platform was built on Storm, but as the business expanded and its needs grew, the original platform ran into many problems and some of Storm's flaws were gradually exposed. Against this background, Tencent's real-time computing team chose Flink to replace Storm as its new-generation real-time stream computing engine, deeply optimized the community version of Flink, and built Oceanus on top of it: a one-stop visual real-time computing platform that integrates development, testing, deployment, and operations.

 

Storm vs Flink

Why did Tencent turn to Flink? Let's compare the two.

Storm

Storm is a free, open source distributed stream processing framework that offers low latency, fault tolerance, and high availability. It makes it easy to process unbounded streams of data reliably, and it is a good choice for real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL.

 

A Storm topology is designed as a directed acyclic graph (DAG). The edges of the graph are called streams: unbounded sequences of tuples, created and processed in parallel in a distributed fashion, that carry data from one node to another. The graph has two kinds of nodes: spouts, which are the sources of streams in a topology, and bolts, where all processing in the topology is done. A topology is similar to a Hadoop MapReduce job, with one crucial difference: a Storm topology runs forever unless you kill it, whereas a MapReduce job must eventually finish.
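
The spout and bolt model can be sketched in a few lines of plain Python. This is only an illustration of the shape of the graph, not Storm's API: the names `sentence_spout`, `split_bolt`, `count_bolt`, and `build_topology` are made up for this example, and Python generators stand in for distributed streams.

```python
from collections import Counter
from itertools import cycle, islice

def sentence_spout():
    """Spout: an unbounded source of tuples (here, sentences repeated forever)."""
    sentences = ["storm runs forever", "flink handles streams"]
    # cycle() never ends, just as a topology never ends on its own.
    yield from cycle(sentences)

def split_bolt(stream):
    """Bolt: turn each sentence tuple into one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: emit a running (word, count) pair for every incoming word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]

def build_topology():
    # Wire the DAG: spout -> split bolt -> count bolt.
    return count_bolt(split_bolt(sentence_spout()))

if __name__ == "__main__":
    # The pipeline would run forever; islice plays the role of "killing" it.
    for word, count in islice(build_topology(), 7):
        print(word, count)
```

Unlike a MapReduce job, nothing in this pipeline terminates by itself; the caller has to stop consuming it.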

 

Key Features:

Very wide range of use cases: it can be used for stream processing, continuous computation, distributed RPC, and more

Scalable: to scale a topology, all you have to do is add machines and increase the topology's parallelism settings

No data loss: a real-time system must offer strong guarantees that data is processed successfully, and Storm guarantees that every message will be processed

Fault-tolerant: if a failure occurs while a computation is running, Storm reassigns tasks as needed. Storm ensures that a computation can run forever (or until you stop it)

Programming language independent: Storm topologies and processing components can be defined in any language, so Storm is accessible to almost anyone

 

Disadvantages:

Stateless: users have to manage state themselves

No advanced features such as event-time processing, aggregation, windowing, sessions, or watermarks
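
To make "manage state themselves" concrete, here is a minimal sketch of what that burden looks like. It does not use Storm's API: `CountingBolt` is a hypothetical class, and a plain dict stands in for whatever external store (a database or key-value service) a real job would use.

```python
import json

class CountingBolt:
    """Hypothetical bolt-like component that has to manage its own state."""

    def __init__(self, store, key="word-counts"):
        self.store = store  # stand-in for an external store
        self.key = key
        # Recovery is the user's job: reload whatever was last saved.
        self.counts = json.loads(store.get(key, "{}"))

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        return word, self.counts[word]

    def save(self):
        # Persisting is also the user's job. Anything processed after the
        # last save is lost if the worker crashes.
        self.store[self.key] = json.dumps(self.counts)

if __name__ == "__main__":
    store = {}
    bolt = CountingBolt(store)
    for w in ["a", "b", "a"]:
        bolt.process(w)
    bolt.save()
    bolt.process("a")                # processed but never saved
    restarted = CountingBolt(store)  # simulate a worker restart
    print(restarted.counts)          # the unsaved update is gone
```

Deciding when to call `save`, and how to avoid double counting when messages are replayed after a restart, is left entirely to the user in this model.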

Flink

Flink is an open source distributed processing engine and framework for both stream and batch data processing, offering high throughput, low latency, high scalability, and fault tolerance.

Its runtime executes arbitrary dataflow programs in a data-parallel and pipelined manner, and this pipelined runtime can execute both batch and stream processing programs. In addition, Flink's runtime natively supports the execution of iterative algorithms.

 

Main features:

Unified stream and batch: a streaming-first runtime that supports both batch processing and data streaming programs

Elegant: elegant, fluent APIs in Java and Scala

High throughput and low latency: a runtime that supports very high throughput and low event latency at the same time

Tolerates delayed, late, and out-of-order data: event-time processing handles data that arrives out of order, late, or delayed

Flexible: very flexible window definitions

Fault tolerance: provides fault tolerance that can restore a streaming application to a consistent state

Back pressure: natural back pressure in streaming programs
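
The event-time, window, and late-data features above can be illustrated with a small simulation. This is a simplified sketch in plain Python, not Flink's API and not its exact firing rule: the function name `tumbling_event_time_counts` and its parameters are invented for this example. It assumes integer timestamps, fixed-size (tumbling) windows, and a watermark that trails the largest timestamp seen so far by a fixed amount.

```python
from collections import defaultdict

def tumbling_event_time_counts(events, window_size, max_out_of_orderness):
    """Count events per key in tumbling event-time windows.

    events: iterable of (event_time, key) in arrival order, possibly out of order.
    Returns (results, late): results is a list of (window_start, key, count) in
    firing order; late holds events that arrived after their window had fired.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # start -> key -> count
    results, late = [], []
    watermark = float("-inf")
    max_seen = float("-inf")

    def fire_ready():
        # A window fires once the watermark has passed its end.
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for start in ready:
            for key, count in sorted(open_windows.pop(start).items()):
                results.append((start, key, count))

    for event_time, key in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            late.append((event_time, key))  # its window has already fired
            continue
        open_windows[start][key] += 1
        max_seen = max(max_seen, event_time)
        # The watermark trails the newest timestamp to wait for stragglers.
        watermark = max_seen - max_out_of_orderness
        fire_ready()

    watermark = float("inf")  # end of input: flush the remaining windows
    fire_ready()
    return results, late

if __name__ == "__main__":
    events = [(1, "a"), (4, "a"), (12, "a"), (7, "a"), (25, "b"), (8, "a")]
    results, late = tumbling_event_time_counts(events, 10, 5)
    print(results)
    print(late)
```

In this run the event with timestamp 7 arrives after the one with timestamp 12 but is still counted in window 0, because the watermark has not yet passed that window's end. The event with timestamp 8 arrives after window 0 has fired and is reported as late.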

 

Disadvantages:

The community is not as strong as Spark's, but it is growing fast

Far more popular for stream processing than for batch processing

 

See:

https://flink.apache.org/flink-architecture.html

https://github.com/apache/flink

 

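Which companies are attracted to Flink?

A market research report at the end of last year showed that Flink was the "fastest" growing engine in the open source big data ecosystem in 2018, up 125% compared with 2017. Many companies around the world now use Flink. Amazon's Amazon Kinesis Data Analytics, for example, is a fully managed cloud service for stream processing that uses Flink in part to power its Java application capabilities. eBay's monitoring platform is powered by Flink and evaluates thousands of customizable alert rules on metric and log streams. Uber, Yelp, and Capital One are also Flink users.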

 

Many companies in China also use Flink. While researching this article, we found that some of them migrated to Flink from Storm, such as Tencent, mentioned above, and also:

 

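Alibaba: Alibaba began experimenting with Flink in 2015. Because Flink was new and still somewhat immature at the time, Alibaba maintained Blink, an internal real-time computing platform built on Flink, to meet the needs of its very large-scale business. Blink was officially open sourced on January 28 of this year. Before that, Alibaba used JStorm. Much like Blink, JStorm is Alibaba's own take on an existing engine: a rewrite of Storm in Java instead of Clojure, with many optimizations over the original. JStorm is also one of Alibaba's star open source projects.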

 

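ByteDance: Many of ByteDance's services once ran on the JStorm computing engine, but problems such as having too many clusters became apparent. Because Flink could solve these problems and was compatible with JStorm, ByteDance migrated its JStorm jobs to Flink.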

 

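Youzan: Real-time computing at Youzan followed the same path as at most internet companies: from Storm in the early days, to JStorm and Spark, and then to Flink. The first Storm application went into use inside Youzan in 2014, Youzan adopted Spark in 2016, and in 2018 it added support for the Flink engine to its real-time platform.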

 

Ele.me: Ele.me's real-time computing platform also evolved from Storm to Spark, and later, as the platform developed, the company chose to embrace Flink.

 

Suning: Like Ele.me, Suning has seen its real-time computing platform evolve from Storm to Spark and then to Flink since 2014.

 

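Meituan: Meituan deployed Storm when it first built its real-time computing system. As business demand for real-time data surged, Storm could no longer keep up. After an evaluation, Meituan found that Flink's throughput was significantly better than Storm's and switched engines.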

 

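Vipshop: Vipshop's real-time computing platform does not currently use a single framework; Storm, Spark, and Flink run side by side. Storm has the most jobs, but the focus is gradually shifting to Flink.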

 

Besides the companies mentioned above, Flink users also include Baidu, Ctrip, and Didi.


Origin blog.csdn.net/huasdsadsa/article/details/94584986