From Storm to Flink, Youzan's five-year real-time computing efficiency improvement practice

The development of real-time computing at Youzan

In terms of technology stack, our choices match those of most Internet companies: from the early Storm to JStorm, then Spark Streaming, and most recently the rising Flink. In terms of development, real-time computing at Youzan has gone through two main stages, an initial stage and a platformization stage. The sections below walk through that history along the timeline in the figure below.

image

2.1 Initial stage

The defining characteristics of the initial stage were the lack of overall planning for real-time computing and the absence of platform-level tools for task management, monitoring, and alerting. Users logged in to the AG servers and submitted tasks to the online clusters directly from the command line, which fell well short of users' usability expectations. Even so, a large number of internal real-time computing scenarios accumulated during this stage.

 2.1.1 Storm debut

In early 2014, the first Storm application went live inside Youzan. The initial scenario was to decouple real-time event statistics from business logic: the Storm application listened to MySQL binlog update events, performed real-time computation on them, and wrote the results to MySQL or a Redis cache for online systems to consume. Patterns like this won recognition from business developers, and Storm gradually came to support a large number of business scenarios.
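The binlog-driven pattern above can be sketched as a toy pipeline: consume binlog events, aggregate in real time, and write the result to a cache for online reads. The event shape and the in-memory "cache" here are illustrative assumptions, not Youzan's actual schema.

```python
from collections import defaultdict

def process_binlog_events(events):
    """Count order inserts per shop; the returned dict stands in for Redis/MySQL."""
    cache = defaultdict(int)
    for ev in events:
        # Only react to new rows in the (hypothetical) orders table.
        if ev["table"] == "orders" and ev["op"] == "insert":
            cache[f"order_count:{ev['row']['shop_id']}"] += 1
    return dict(cache)

events = [
    {"table": "orders", "op": "insert", "row": {"shop_id": 1}},
    {"table": "orders", "op": "insert", "row": {"shop_id": 1}},
    {"table": "orders", "op": "insert", "row": {"shop_id": 2}},
    {"table": "users",  "op": "update", "row": {"user_id": 9}},  # ignored
]
print(process_binlog_events(events))  # {'order_count:1': 2, 'order_count:2': 1}
```

In the real deployment a Storm topology performed this loop continuously against the binlog stream; the sketch only shows the data flow, not the distributed execution.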

In the early days, users logged in to a group of AG servers in the online environment and submitted tasks to the Storm cluster through the Storm client. Over more than two years, nearly a hundred real-time applications accumulated on Storm this way. Storm also exposed problems, mainly around system throughput: in scenarios with huge throughput but low sensitivity to latency, it proved inadequate.

 2.1.2 Introducing Spark Streaming

At the end of 2016, as the Spark stack matured and the Storm engine's clear disadvantage in throughput and performance relative to Spark Streaming became apparent, some business teams began trying the new streaming engine. Since Youzan already had offline computing experience and a large number of Spark jobs, Spark Streaming was the natural first choice. Starting with the real-time applications of the business log system and the event-tracking log system, a large number of business teams gradually came on board. As with Storm, after a business team finished developing a real-time computing task, it submitted the task to the big-data Yarn cluster via a group of AG servers and the Spark client.

The initial stage lasted a long time. By roughly the end of 2017, the deployment of real-time computing at Youzan was as shown in the following figure:

 2.1.3 Summary

This architecture posed no major problems while business volume was small, but as the number of tasks grew, several operations and maintenance problems surfaced, mainly in the following areas:

  1. Lack of a business management mechanism. As the cluster operator, the big data platform group had difficulty knowing which business owned each real-time task running on the cluster. This made it impossible to efficiently notify the business side about cluster availability issues or upcoming changes and upgrades, and communication costs were high;

  2. Monitoring and alerting for Storm and Spark Streaming were implemented separately and remained at the tooling stage. For availability, many business teams built their own custom monitoring and alerting tools, causing a lot of duplicated effort and hurting development efficiency;

  3. No isolation of computing resources. Resource management was coarse, with no isolation between offline and real-time systems; early offline tasks and Spark Streaming tasks ran on the same set of Yarn resources. During the early-morning peak of offline tasks, even though the Yarn layer had CapacityScheduler queue isolation, the HDFS layer shared physical machines, so interference at the network card and disk I/O level was unavoidable, causing significant delays in real-time tasks in the early hours of the morning;

  4. Lack of flexible resource scheduling. Users started real-time tasks through the AG servers, with the cluster resources used by a task hard-coded in its startup script. This approach badly hurt system availability: when the Yarn resource pool hosting real-time computing failed, it was difficult to fail real-time tasks over to another cluster.
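The CapacityScheduler queue isolation mentioned in point 3 is configured in Hadoop's capacity-scheduler.xml; a minimal sketch with hypothetical queue names and capacities is shown below. Note that, as described above, this caps CPU and memory per queue at the Yarn layer but cannot isolate the shared NIC and disk I/O at the HDFS layer.

```xml
<configuration>
  <!-- Hypothetical split of one Yarn cluster into offline and realtime queues -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>offline,realtime</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.offline.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.realtime.capacity</name>
    <value>30</value>
  </property>
  <!-- Prevent offline bursts from borrowing the realtime queue's share -->
  <property>
    <name>yarn.scheduler.capacity.root.realtime.maximum-capacity</name>
    <value>30</value>
  </property>
</configuration>
```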

In general, there is a lack of a unified real-time computing platform to manage all aspects of real-time computing.

2.2 Platformization stage

 2.2.1 Building a real-time computing platform

Following on from the previous section, and facing the four problems above, the preliminary requirements for a real-time computing platform were as follows:

  1. Business management functions: mainly recording the relevant information about each real-time application and keeping track of its business contact person;

  2. Task-level monitoring: automatic alerts on task failure, user-defined alerts based on metrics such as latency and throughput, traffic trend dashboards, and related functions;

  3. Proper cluster planning: building an independent Yarn compute cluster for real-time applications, to prevent offline and real-time tasks from interfering with each other;

  4. Task-level switching between compute clusters, so that when a cluster fails, tasks can easily be migrated to another cluster as a temporary workaround.
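Requirement 2, user-defined alerts on task metrics, boils down to evaluating threshold rules against the latest metric snapshot. A minimal sketch (the rule shape, metric names, and thresholds are illustrative assumptions, not the platform's actual design):

```python
def check_alerts(metrics, rules):
    """Return a message for every rule whose metric violates its threshold."""
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is None:
            continue  # metric not reported yet
        if rule["op"] == ">" and value > rule["threshold"]:
            fired.append(f"{rule['metric']}={value} exceeds {rule['threshold']}")
        elif rule["op"] == "<" and value < rule["threshold"]:
            fired.append(f"{rule['metric']}={value} below {rule['threshold']}")
    return fired

metrics = {"delay_seconds": 120, "throughput_rps": 300}
rules = [
    {"metric": "delay_seconds", "op": ">", "threshold": 60},   # latency alert
    {"metric": "throughput_rps", "op": "<", "threshold": 100}, # throughput alert
]
print(check_alerts(metrics, rules))  # ['delay_seconds=120 exceeds 60']
```

A real platform would run such checks periodically per task and route fired alerts to the task's business contact recorded by requirement 1.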

So at the beginning of 2018 we set up the project and started phase one of the real-time platform. As a first attempt we supported only Spark Streaming real-time computing tasks, and completed the migration of all Spark Streaming tasks in a fairly short time. After two months of trial operation, our control over the business side was noticeably stronger. We then added support for Storm tasks and migrated all Storm real-time computing tasks as well. The AG servers were all taken offline, and business teams no longer needed to log in to servers to submit tasks.

By mid-2018, real-time tasks on two computing engines, Storm and Spark Streaming, were running at Youzan and could satisfy most business needs. But each engine had its own problems, Storm's throughput limitations chief among them, and with Flink maturing rapidly, choosing between it and Spark Streaming was not straightforward. We weighed the decision mainly from the following perspectives:

  1. Latency: Flink wins. Spark Streaming is essentially a micro-batch computing framework, so its processing latency is roughly the batch interval, usually at the second level; in Youzan's high-throughput scenarios the batch interval is generally around 15 seconds;

  2. Throughput: in our tests under identical conditions, Flink's throughput was slightly lower than Spark Streaming's, but the gap was small. On state storage support, however, Flink wins: for large state, Flink can store it directly in the compute node's local memory or in RocksDB, making full use of physical resources;

  3. SQL support: we surveyed the SQL features of the then-latest stable versions of both frameworks, and found that Flink also has a clear advantage in SQL support, mainly in supporting more syntax;

  4. API flexibility: Flink's real-time computing API is friendlier.
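The latency difference in point 1 can be made concrete with a toy model: in a micro-batch engine, an event must wait for its batch to close before processing even starts, so the wait averages half the batch interval and can reach the full interval; a per-event engine can start processing immediately. The numbers below are illustrative.

```python
def micro_batch_wait(arrival, batch_interval):
    """Seconds an event arriving at time `arrival` waits for its batch to close."""
    batch_end = ((arrival // batch_interval) + 1) * batch_interval
    return batch_end - arrival

# With a ~15 s batch interval like the one described above, an event arriving
# 1 s into a batch waits 14 s before the batch is even handed to the engine;
# a per-event engine such as Flink would begin processing it at once.
print(micro_batch_wait(arrival=16, batch_interval=15))  # 14
print(micro_batch_wait(arrival=29, batch_interval=15))  # 1
```

This wait is in addition to the actual processing time, which is why micro-batch latency is "generally the same as the batch interval" as stated in point 1.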

For the reasons above, Youzan added support for the Flink engine to the real-time platform. After the Flink engine was integrated, the deployment of real-time computing at Youzan was as shown in the following figure:

 2.2.2 New challenges

With the above in place, we could basically provide a stable and reliable real-time computing service. The next problem to stand out was the development efficiency of business teams. A typical onboarding flow involved the following steps:

  1. Getting familiar with the SDK of the specific real-time computing framework, which takes about half a day the first time;

  2. Requesting the task's upstream and downstream resources, such as message queues and online resources like Redis/MySQL/HBase, usually a few hours;

  3. Developing and testing the real-time task, generally one to three days depending on complexity;

  4. For complex real-time tasks, code quality is hard to guarantee, and the platform group cannot review code for every business team. As a result, poorly written applications that pass small-traffic tests in the test environment are frequently released to production, where they cause all kinds of problems.

All told, the whole flow took at least two to three days, and the efficiency of onboarding real-time applications gradually became the most pressing problem. After a great deal of research into this problem, we finally settled on two directions for real-time computing:

  1. SQL-izing real-time tasks;

  2. For common real-time data analysis scenarios, introducing another technology stack to cover the simple cases.

2.2.2.1 SQL-izing real-time tasks

SQL-izing real-time tasks can greatly reduce business development costs and shorten the time it takes to get a real-time task into production. At Youzan, SQL-ization of real-time tasks is built on the Flink engine and is currently under construction. Our current plan is to first support the following:

  1. Stream-to-stream real-time task development on Kafka streams

  2. Stream-to-storage SQL task development based on an HBase sink

  3. UDF support

Support for SQL-ized real-time tasks is currently in progress.
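As a rough illustration of what SQL-ization buys, the sketch below uses sqlite3 to stand in for the streaming SQL engine (Flink SQL in the text above): the business logic becomes one SQL statement plus a registered UDF instead of engine-specific code. The table, columns, and UDF are invented for the example and do not reflect Youzan's actual tasks.

```python
import sqlite3

def run_sql_task(events):
    """Run a 'SQL-ized task': aggregate click events per page, applying a UDF."""
    conn = sqlite3.connect(":memory:")
    # UDF support (item 3 above): mask raw user ids before they leave the task.
    conn.create_function("mask_user", 1, lambda uid: f"u_{uid % 1000}")
    conn.execute("CREATE TABLE clicks (user_id INTEGER, page TEXT)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", events)
    # The whole "job" is this SQL statement; no framework SDK knowledge needed.
    return conn.execute(
        "SELECT page, COUNT(*), mask_user(MIN(user_id)) "
        "FROM clicks GROUP BY page ORDER BY page"
    ).fetchall()

events = [(1001, "home"), (1002, "home"), (1003, "cart")]
print(run_sql_task(events))  # [('cart', 1, 'u_3'), ('home', 2, 'u_1')]
```

In the real plan the same SQL-plus-UDF shape would read from Kafka and write to another Kafka topic or an HBase sink (items 1 and 2), with Flink executing the query continuously rather than over a finite table.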

2.2.2.2 Introducing a real-time OLAP engine

Observing the business, we found that real-time applications have a large demand for uv and pv statistics across different dimensions, and that the pattern is relatively fixed. For this kind of demand, we focused on storage engines that support real-time data updates and real-time OLAP queries.
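The fixed pattern described above, pv/uv per dimension, can be sketched as a pre-aggregating rollup of the kind an OLAP engine maintains; the event fields here are illustrative assumptions.

```python
from collections import defaultdict

def rollup(events, dimension):
    """Pre-aggregate pv (event count) and uv (distinct users) per dimension value."""
    pv = defaultdict(int)
    uv = defaultdict(set)  # a real engine would use an approximate sketch (e.g. HLL)
    for ev in events:
        key = ev[dimension]
        pv[key] += 1
        uv[key].add(ev["user_id"])
    return {k: {"pv": pv[k], "uv": len(uv[k])} for k in pv}

events = [
    {"user_id": "a", "page": "home"},
    {"user_id": "a", "page": "home"},
    {"user_id": "b", "page": "home"},
    {"user_id": "b", "page": "cart"},
]
print(rollup(events, "page"))
# {'home': {'pv': 3, 'uv': 2}, 'cart': {'pv': 1, 'uv': 1}}
```

Because the pattern is this regular, a generic engine that ingests events and answers such rollup queries can cover the whole class of requirements without per-task development, which is what motivated the OLAP-engine investigation below.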

We mainly investigated two technology stacks, Kudu and Druid. The former is a distributed columnar storage engine implemented in C++ that handles OLAP queries efficiently and also supports detail-level data queries; the latter is a pre-aggregating OLAP query engine for event data implemented in Java.

After weighing operation and maintenance costs, integration with our current technology stack, query performance, and supported scenarios, we finally chose Druid.

At present, the overall technical architecture of real-time computing at Youzan is as follows:

image

Future plans

The first item to land is SQL-ization of real-time tasks, raising the share of business scenarios that SQL tasks can cover (the target is 70%), thereby empowering the business by improving development efficiency.

Once SQL-ization of real-time tasks is initially complete, reusing streaming data becomes the highest-ROI efficiency measure, so the preliminary plan is to start building a real-time data warehouse. Its preliminary design is shown in the following figure:

image

Of course, a complete real-time data warehouse is never that simple. It requires not only that the real-time computing infrastructure reach a certain level of platformization, but also the construction of supporting components such as real-time metadata management and real-time data quality management. The road ahead is long.

Summary

Real-time computing at Youzan advances in response to business needs; at each stage, the technical direction is adjusted toward whatever currently offers the highest return on investment. This article skips the technical details and instead traces the development of real-time computing along a timeline. Given the author's limited knowledge, mistakes are inevitable in places, and colleagues are welcome to point them out.


Origin blog.51cto.com/15060462/2679170