Real-time statistics on massive push-message data with Flink

Background

Message reports are produced mainly by statistical tasks over pushed messages: for example, the total number of messages an APP issues for delivery, the number successfully delivered to phones, and how many users click the APP's notification and open the APP. Through message reports we can see at a glance how a push message circulates, its delivery success rate, its click rate, and so on.

As a push-message service provider, we compute statistics over the data from different dimensions and generate message reports, so that we can better understand each day's push activity. The number of messages we push daily is huge, reaching the scale of tens of billions, and the offline statistics system we had been using could no longer meet business needs. As business requirements kept growing, we chose Flink as the data processing engine to provide real-time statistics over this massive message data.

This article describes the main reasons we chose Flink, Flink's important features, and how we optimized the real-time computation.

Offline computing platform architecture

In the early message reporting system, we computed offline: the compute engine was mainly Spark, raw data was stored in HDFS, and aggregated data was stored in Solr, HBase, and MySQL:

[Figure: offline computing platform architecture]

At query time, we first filter by conditions; there are three main query dimensions:

  1. appId
  2. send time
  3. taskGroupName

Based on these dimensions we first query for the list of taskIds, then query HBase by taskId to obtain the corresponding results, i.e. the delivery, display, and click metrics. When we considered moving to real-time statistics, a series of difficulties emerged:

  1. The raw data volume is huge, reaching tens of billions of records per day, so high throughput must be supported;
  2. Real-time queries must be supported;
  3. Multiple streams of data need to be joined;
  4. Data accuracy and completeness must be guaranteed.

Why Flink

What is Flink

Flink is a distributed processing engine for streaming and batch data. It is implemented mainly in Java, and its development currently relies mainly on contributions from the open-source community.

Flink's main target scenario is streaming data. Flink grew out of a research project at the Technical University of Berlin; in 2014 it was accepted into the Apache Incubator and quickly became an ASF (Apache Software Foundation) top-level project.

Comparing the options

Before implementing real-time statistics for message reports, we had considered Spark Streaming as our real-time compute engine. After comparing Spark Streaming, Storm, and Flink on a number of points, however, we decided to use Flink as the compute engine:
[Figure: comparison of Spark Streaming, Storm, and Flink]

For the business pain points above, Flink meets our needs as follows:

  1. Flink moves data through the pipeline in a pipelined fashion, which allows it to achieve high throughput.

  2. Flink is streaming in the true sense, with lower latency, which meets the real-time requirement of our message-report statistics.

  3. Flink offers powerful window functions that support incremental aggregation of data; at the same time, join operations can be performed on the data within a window.

  4. Our message reports involve settlement amounts, so no errors are tolerated; Flink's own exactly-once mechanism guarantees that we neither consume data twice nor lose any.

Flink's important features

Let's look at some of Flink's important features in detail, along with the principles behind them:

1) Low latency and high throughput

The main reason Flink is so fast is its stream-processing model.

Flink uses the Dataflow model, which differs from the Lambda architecture. A Dataflow is a pure graph of nodes; a node can run a batch computation, a stream computation, or a machine-learning algorithm. Stream data flows between the nodes, and each node's processing function is applied to the data in real time. Nodes are connected via Netty with keepalive between them, and the network buffers provide natural backpressure.

After logical and physical optimization, the Dataflow's logical topology and its runtime physical topology are roughly the same. This is a pure streaming design, theoretically optimal in both latency and throughput.

Put simply, as soon as a record is processed it is serialized into a buffer and immediately shipped over the network to the next node, which processes it right away.
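
To make the pipelining concrete, here is a minimal, hypothetical Flink DataStream job in Java (the element values and job name are made up; it only illustrates the record-at-a-time pipeline described above):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Records flow through the operator pipeline one at a time: each
        // operator forwards its output buffer to the next node over the
        // network as soon as the record has been processed.
        env.fromElements("display", "click", "click")
           .map(String::toUpperCase)
           .filter(s -> s.startsWith("C"))
           .print();

        env.execute("pipeline-sketch");
    }
}
```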

2) Checkpoints

Flink supports exactly-once semantics via distributed checkpoint snapshots.

A consistent distributed snapshot is an algorithm designed by Chandy and Lamport in 1985; it captures the current state of a distributed system without losing information and without recording duplicates.

Flink uses a variant of the Chandy-Lamport algorithm: it periodically snapshots the state of the running streaming topology and stores the snapshots in persistent storage (for example, HDFS or an in-memory file system). How often checkpoints are taken is configurable.
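
As a rough sketch (the interval and HDFS path below are illustrative values, not from the article), this configuration looks like the following in Flink's Java API:

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot the running topology every 60 seconds (example interval).
        env.enableCheckpointing(60_000L);

        // Exactly-once is the default mode; set it explicitly for clarity.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Persist snapshots to durable storage such as HDFS (hypothetical path).
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
    }
}
```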
[Figure: Flink's distributed checkpoint mechanism]

3) Backpressure

Backpressure exists to cope with short-term spikes in the data rate.

Older versions of Spark Streaming implemented backpressure by limiting the maximum consumption rate. For the Receiver-based approach, the spark.streaming.receiver.maxRate parameter limits the maximum number of records each receiver may receive per second.

For the Direct approach, the spark.streaming.kafka.maxRatePerPartition parameter limits the maximum number of records read from each Kafka partition per batch.

But this is very inconvenient: before going live, the cluster has to be stress-tested just to decide suitable values for these parameters.
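
For reference, a sketch of how those two limits are set (the app name and numbers are hypothetical example values):

```java
import org.apache.spark.SparkConf;

public class RateLimitSketch {
    public static void main(String[] args) {
        // Illustrative values only; in practice they had to be found by stress tests.
        SparkConf conf = new SparkConf()
                .setAppName("report-stats")
                // Receiver-based approach: max records per receiver per second.
                .set("spark.streaming.receiver.maxRate", "10000")
                // Direct approach: max records per Kafka partition per batch.
                .set("spark.streaming.kafka.maxRatePerPartition", "10000");
    }
}
```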

Flink's runtime building blocks are operators and streams. Each operator consumes an intermediate stream, applies a transformation to it, and produces a new stream.

The best analogy for this mechanism: Flink uses efficient distributed blocking queues as bounded buffers. Just like a standard Java blocking queue connecting processing threads, once the queue reaches capacity, a relatively slow receiver slows the sender down.
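
The analogy can be made concrete with a few lines of plain Java; this is purely illustrative and says nothing about Flink's actual internals:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackpressureAnalogy {
    public static void main(String[] args) {
        // Bounded buffer of capacity 10, like a bounded buffer between operators.
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(10);

        // Fast producer: put() blocks once the queue is full, so the slow
        // consumer naturally throttles the sender -- that is backpressure.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; ; i++) {
                    buffer.put(i); // blocks when the buffer is full
                }
            } catch (InterruptedException ignored) { }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Integer record = buffer.take();
                    Thread.sleep(100); // slow receiver
                    System.out.println("processed " + record);
                }
            } catch (InterruptedException ignored) { }
        });

        producer.start();
        consumer.start();
    }
}
```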

Real-time computation of message reports

After the optimization, the architecture was upgraded as follows:

[Figure: upgraded real-time computing architecture]

As the figure shows, we made the following optimizations:

  1. Flink replaced the previous Spark jobs for real-time computation of the message reports;
  2. ES replaced the previous Solr.

For real-time computation with Flink, we focused on the following four aspects:

  1. Exactly-once, which guarantees that data is consumed only once
  2. State management capabilities
  3. Powerful time windows
  4. Unified stream and batch processing

To implement our real-time report statistics, we rely mainly on Flink's incremental aggregation.

First, we set Event Time as the window's time type, which guarantees that only the current day's data is counted; and since we update the day's message report incrementally every minute, we use a 1-minute time window.

Then we use .aggregate(AggregateFunction af, WindowFunction wf) to perform the incremental aggregation: the AggregateFunction pre-aggregates the data, reducing the pressure on state storage. Afterwards, we write the incrementally aggregated data to ES and HBase.
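
A minimal sketch of this pattern in Flink's Java API. The MsgEvent type, the field names, and the simple count aggregate are all hypothetical; the sketch only illustrates the .aggregate(AggregateFunction, WindowFunction) call described above, and the real report logic and the ES/HBase sinks are omitted:

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class IncrementalAggSketch {

    // Hypothetical input record: one push-message event with its taskId.
    public static class MsgEvent {
        public String taskId;
        public long eventTime;
    }

    // Incremental aggregation: the accumulator is a running count, so the
    // state kept per key/window is a single Long rather than all raw events.
    public static class CountAgg implements AggregateFunction<MsgEvent, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(MsgEvent e, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    // The WindowFunction attaches key and window metadata to the pre-aggregated count.
    public static class ReportFunc implements WindowFunction<Long, String, String, TimeWindow> {
        @Override
        public void apply(String taskId, TimeWindow window,
                          Iterable<Long> counts, Collector<String> out) {
            out.collect(taskId + " @ " + window.getEnd() + " -> " + counts.iterator().next());
        }
    }

    // events must be an event-time stream with timestamps/watermarks already assigned.
    public static DataStream<String> buildReport(DataStream<MsgEvent> events) {
        return events
                .keyBy(e -> e.taskId)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .aggregate(new CountAgg(), new ReportFunc());
        // The resulting stream is then written to ES and HBase (sinks omitted).
    }
}
```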

The flow is as follows:

[Figure: incremental aggregation and write-out flow]

At query time, we query by dimensions such as taskId and date: we first get the set of taskIds from ES, then query HBase by taskId to obtain the statistical results.
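
A hedged sketch of that two-step lookup (the index, table, and column names are hypothetical, and client setup is omitted):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.util.ArrayList;
import java.util.List;

public class ReportQuerySketch {

    // Step 1: filter in ES by dimensions (appId, date, ...) to collect taskIds.
    static List<String> findTaskIds(RestHighLevelClient es, String appId, String date)
            throws Exception {
        SearchRequest req = new SearchRequest("message_report"); // hypothetical index
        req.source(new SearchSourceBuilder().query(
                QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termQuery("appId", appId))
                        .filter(QueryBuilders.termQuery("date", date))));
        SearchResponse resp = es.search(req, RequestOptions.DEFAULT);
        List<String> taskIds = new ArrayList<>();
        for (SearchHit hit : resp.getHits().getHits()) {
            taskIds.add((String) hit.getSourceAsMap().get("taskId"));
        }
        return taskIds;
    }

    // Step 2: point-get the aggregated metrics from HBase, keyed by taskId.
    static long getClicks(Connection hbase, String taskId) throws Exception {
        try (Table table = hbase.getTable(TableName.valueOf("report"))) { // hypothetical table
            Result row = table.get(new Get(Bytes.toBytes(taskId)));
            return Bytes.toLong(row.getValue(Bytes.toBytes("m"), Bytes.toBytes("click")));
        }
    }
}
```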

Summary

By using Flink, we achieved real-time statistics over push-message data and can now view deliveries, impressions, clicks, and other metrics in real time; meanwhile, with the help of Flink's powerful state management, the stability of the service has also gained a degree of protection. Next, we will continue to optimize message push and introduce Flink into other business lines, to meet the real-time requirements of more demanding business scenarios.
