An introduction to stream data processing

Source: https://www.dazhuanlan.com/2019/08/25/5d625f4bb2308/


Apache Flink is a distributed stream computing engine on which developers can quickly implement streaming computations. Flink originated in 2009 as the 'Stratosphere' research project at the Technical University of Berlin. In April 2014 it became an incubator project of the Apache Software Foundation, and eight months later it was promoted to a top-level Apache project. Today more than 250 individuals have contributed to Flink. Stream computing technology is being rapidly adopted by start-ups and enterprises because it is more effective for software development, system architecture, and business analysis. This article compares the traditional data processing architecture with the streaming architecture to highlight the characteristics of stream computing.

Traditional data analysis methods

Analysis on traditional IT infrastructure is complex: business applications run on different operating systems, and their data lands in different databases. As a result, analysis cannot be delivered in time, and large multi-table join queries are either unsupported or perform very poorly.

The traditional alternative to analyzing the operational databases directly is the data warehouse. The process of populating the warehouse is called ETL (extract-transform-load); it includes data validation, value normalization, encoding, schema conversion, and duplicate removal. ETL is a complicated process that usually requires professional skills. A fatal drawback is that the data warehouse is only updated periodically; why this is fatal will be explained in detail a little later. The following is an ETL architecture diagram:
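The ETL steps listed above can be sketched in a few lines. This is a minimal illustration, not production code; the record shape (`id`, `country`, `amount`) is invented for the example.

```python
# A minimal ETL sketch: validate, normalize, de-duplicate, then load.
# The record shape (id, country, amount) is invented for illustration.

def extract():
    # In a real pipeline these rows would come from operational databases.
    return [
        {"id": 1, "country": "us", "amount": "10.5"},
        {"id": 2, "country": "DE", "amount": "7"},
        {"id": 1, "country": "US", "amount": "10.5"},   # duplicate of id 1
        {"id": 3, "country": None, "amount": "3.2"},    # fails validation
    ]

def transform(rows):
    seen, out = set(), []
    for row in rows:
        if row["country"] is None:                   # data validation
            continue
        row["country"] = row["country"].upper()      # value normalization
        row["amount"] = float(row["amount"])         # schema/type conversion
        if row["id"] in seen:                        # duplicate removal
            continue
        seen.add(row["id"])
        out.append(row)
    return out

def load(rows, warehouse):
    warehouse.extend(rows)   # stand-in for the periodic warehouse load

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

In a real deployment each of these steps runs as a scheduled batch job, which is exactly the source of the periodic-update delay discussed above.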

This model served for a very long time, with an analytical database acting as the data warehouse. Over the last ten years, data processing frameworks such as MapReduce, HDFS, and HBase have been built on the same pattern. The design has the following disadvantages. First, keeping such a large system running smoothly is a challenge: configuring the cluster, developing the ETL tasks, and managing task scheduling each present significant difficulties. Second, the architecture introduces large data-processing delays: the time between data being generated in the business systems and analysis results being available is usually several hours, sometimes more than a day. Moreover, such a platform can only batch-process data after the fact; it cannot process events at the time they occur.

Data analysis evolution

Previously, a delay of a few hours or a day in data analysis may have been acceptable. Now, however, business requirements for data freshness keep increasing. Product recommendation systems and system monitoring, for example, must collect and process data in real time and take the appropriate action based on the results.

Stream computing can meet these requirements: results are available within seconds of the data being generated. Following the structure of the ETL diagram above, a stream computing architecture looks like this:

A CDC (change data capture) service captures changes in the system of record, for example by collecting the database Binlog. Common message queues include Kafka, Metaq, and TT; note that a message queue does not in general guarantee a total ordering across the whole stream. A streaming job derives the required metrics and writes the results to a KV database (such as HBase); finally, the results are displayed on a dashboard or pushed to applications to make the appropriate recommendations. The first advantage of stream processing is low latency, since it requires neither a load step nor periodic batch processing of the data. And because ingestion and processing happen within a single framework, a stream processing system needs no separate ingestion process or task scheduler, which makes it more stable than a batch system.
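To make the pipeline concrete, here is a toy simulation of that flow. The event shape and the per-user order count are invented for illustration; the lists stand in for the message queue and the KV store.

```python
# Toy version of the CDC -> message queue -> streaming stats -> KV store pipeline.
# Each change event updates the derived metric immediately; there is no batch load step.

from collections import defaultdict

queue = [  # stands in for a message queue fed by a CDC service
    {"table": "orders", "op": "insert", "user": "a"},
    {"table": "orders", "op": "insert", "user": "b"},
    {"table": "orders", "op": "insert", "user": "a"},
]

kv_store = defaultdict(int)  # stands in for HBase or another KV database

def process(event):
    # Derived metric: number of orders per user, updated per event.
    if event["table"] == "orders" and event["op"] == "insert":
        kv_store[event["user"]] += 1

for event in queue:
    process(event)

print(dict(kv_store))  # a dashboard or application reads the metric from here
```

The point of the sketch is the shape of the dataflow: the metric is current as soon as the last event is processed, rather than after the next scheduled batch run.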

Streaming data analysis

Some applications require very low latency that batch processing cannot deliver; stream processing fits these scenarios well. Such applications include:

  • Anomaly detection, for example detecting network attacks;
  • Real-time recommendation, for example recommending products based on a user's behavior over the last few minutes;
  • Pattern recognition or complex event processing, for example credit card fraud detection;
  • Online ETL, continuously transforming and loading data as it is generated;
  • Applications of emerging technologies, such as the Internet of Things.

However, distributed stream processing is not only a point solution for the scenarios above. It also provides a scalable data architecture for data applications: systems communicate through well-defined interfaces and remain independent of each other. The diagram is as follows:

In the diagram, logs from web services are collected in real time and written to a message queue. A stateful stream processing application ingests the data, processes it, and writes the results to a database or to another message queue, from which they are displayed by visualization tools.

Besides the advantages already mentioned, this architecture has further benefits. Communicating through a persistent message queue gives stateful applications the following advantages:

  • Multiple applications can read and write the same data stream; this guarantees that all applications consume consistent data in a consistent order;
  • An application can re-consume the data persisted in the message queue, which is very useful for A/B testing after fixing a bug;
  • A stateful stream processing job persists its state, making recovery after a failure easy;
  • This architecture separates reads from writes: data collection is append-only and therefore has very good write performance, while downstream readers also get very good read performance;
  • Finally, this architecture is easy to scale, because both the message queue and the stream processor are distributed systems.

This solution gives the business something resembling an OLTP database, but with the properties listed above.
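The replay and read/write-separation properties listed above can be illustrated with an append-only log and per-consumer offsets. This is a deliberate simplification of how queues such as Kafka track consumption, with invented record names.

```python
# An append-only log with independent consumer offsets: writers only append,
# and each reader keeps its own position, so one reader can rewind and replay
# from offset 0 (e.g. for A/B testing after a bug fix) without affecting others.

class Log:
    def __init__(self):
        self.records = []          # append-only: very good write performance

    def append(self, record):
        self.records.append(record)

class Consumer:
    def __init__(self, log):
        self.log, self.offset = log, 0

    def poll(self):
        out = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return out

    def rewind(self):
        self.offset = 0            # re-consume the persisted data

log = Log()
for r in ("r1", "r2", "r3"):
    log.append(r)

c1, c2 = Consumer(log), Consumer(log)
first = c1.poll()      # c1 reads everything
second = c2.poll()     # c2 reads the same data, in the same order
c1.rewind()
replayed = c1.poll()   # the same records again, untouched by c2
```

Because readers never mutate the log, adding another consumer costs the writers nothing, which is the essence of the read/write separation described above.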

The evolution of open source stream processing

Stream processing is not a new technology; the first research prototypes and commercial products date back to the 1990s. Recent stream processing technology, however, has grown mainly out of open source software. Today, distributed open source stream processing engines power business-critical applications in many industries, such as online retail, social media, telecommunications, gaming, and banking. Open source software is the main driver of this trend, for two reasons. First, anyone can use and improve open source software. Second, thanks to the efforts of the open source community, scalable stream processing technology is maturing and evolving rapidly. The Apache Software Foundation alone hosts more than ten projects related to stream processing. New stream processing projects keep entering the open source community, challenging the state of the art with new features and capabilities, and many of those features are in turn adopted and absorbed by other stream processing frameworks. Moreover, users of open source software can request or contribute missing features to support their scenarios. In this way the open source community keeps improving these projects and keeps pushing stream processing forward. We will briefly review the history of stream processing and then look ahead.

The first open source distributed stream processing engines to see wide use focused on millisecond-level latency and guaranteed that events would not be lost when the system failed. These systems provided low-level APIs and did not guarantee accurate, consistent results for streaming applications, because results depended on the timing and order in which events arrived. Moreover, although events were not lost on failure, the same event could be processed more than once.

In contrast to batch processing, which is accurate but high-latency, the first open source stream processors traded accuracy for low latency: they delivered low-latency results that contained some error. The system design that combines the two is called the Lambda architecture, shown below:

The Lambda architecture augments a traditional batch system with a low-latency stream processor in a speed layer. Data arriving at a Lambda architecture is read by the stream processor and also written to batch storage such as HDFS. The stream processor computes approximate results in near real time and writes them into a speed table. The batch processor periodically processes the data accumulated in batch storage; the accurate results are written into a batch table and the corresponding approximate results are deleted from the speed table. The serving layer answers applications by merging the approximate results from the speed table with the accurate results from the batch table. The goal of the Lambda architecture is to mitigate the high latency of batch systems, but the approach has obvious drawbacks. First, two processing systems with different semantics must be implemented and maintained separately. Second, the stream processor computes an estimate rather than an accurate result. Third, the Lambda architecture is hard to set up and maintain: the stream processor, batch processor, speed store, batch store, data ingestion, and batch job scheduler all have to be operated.
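The serving layer's merge step can be sketched as follows. The table contents are invented for illustration; the rule is simply that an accurate batch result, once available, overrides the speed layer's estimate for the same key.

```python
# Lambda-architecture serving layer: merge accurate batch results with
# approximate speed-layer results, preferring the batch value when present.

speed_table = {"2019-08-24": 1010, "2019-08-25": 480}   # near-real-time estimates
batch_table = {"2019-08-24": 1000}                      # accurate, but delayed

def serve(key):
    if key in batch_table:           # accurate result is available
        return batch_table[key]
    return speed_table.get(key)      # fall back to the estimate

print(serve("2019-08-24"))  # accurate batch value
print(serve("2019-08-25"))  # estimate; the batch job has not caught up yet
```

Even this toy version shows the maintenance cost: two tables, two write paths, and merge logic that must agree with both.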

Second-generation stream processors improved on the first generation's accuracy by implementing exactly-once semantics, in which each event affects the result exactly once. They also offered higher-level APIs and made large improvements in throughput and failure recovery. However, they still did not solve the out-of-order problem: results still depended on the order in which the data was consumed.
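One common way to obtain an exactly-once effect on top of at-least-once delivery is to de-duplicate by event id before updating state. This is a toy version of that idea, not how any particular engine implements it; the event ids and values are invented.

```python
# Exactly-once effect on top of at-least-once delivery: the same event may
# arrive twice after a failure, but it changes the state only once,
# keyed by its event id.

delivered = [  # at-least-once: event e2 is redelivered after a "failure"
    {"id": "e1", "value": 5},
    {"id": "e2", "value": 3},
    {"id": "e2", "value": 3},   # redelivery of e2
]

processed_ids, total = set(), 0
for event in delivered:
    if event["id"] in processed_ids:
        continue                  # drop the duplicate
    processed_ids.add(event["id"])
    total += event["value"]

print(total)  # 8, not 11: each event is counted exactly once
```

Production engines achieve the same effect with checkpointed state rather than an ever-growing id set, but the observable guarantee is the same.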

With third-generation stream processors, results no longer depend on the order in which data is consumed, so accurate results can be computed even on out-of-order input. Another improvement is a tunable throughput-latency trade-off. This generation of stream processors made the Lambda architecture obsolete. With that background, let us look at a third-generation stream processing engine: Flink.
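The key idea behind order-independent results is to group events by the timestamp embedded in the event itself (event time) rather than by arrival order. A minimal sketch, with invented timestamps and a tumbling window of one second:

```python
# Event-time aggregation: results depend on each event's own timestamp,
# not on the order in which events arrive, so out-of-order input is fine.

from collections import defaultdict

def windowed_counts(events, window_ms):
    counts = defaultdict(int)
    for ts, _value in events:
        counts[ts // window_ms] += 1   # assign to a window by event time
    return dict(counts)

in_order = [(1000, "a"), (1500, "b"), (2500, "c")]
out_of_order = [(2500, "c"), (1000, "a"), (1500, "b")]

# Both arrival orders produce the same per-window counts.
assert windowed_counts(in_order, 1000) == windowed_counts(out_of_order, 1000)
print(windowed_counts(in_order, 1000))
```

Real engines additionally need watermarks to decide when a window can be closed; this sketch omits that and only shows why the final counts do not depend on arrival order.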

Apache Flink is a distributed stream computing engine with the characteristics of a third-generation engine. It provides accurate, high-throughput, low-latency processing at large scale. Some of Flink's features are listed below; for the details, see the Flink website: https://flink.apache.org/.

  • Flink provides three notions of time: event time, ingestion time, and processing time;
  • Flink implements both exactly-once and at-least-once processing semantics;
  • Flink performs very well: throughput in the millions of events per second with millisecond-level latency, and Flink applications can scale to more than a thousand cores;
  • Flink offers high-level APIs for common tasks such as window operations, as well as low-level APIs to satisfy custom needs;
  • Flink has a rich connector ecosystem covering common big data platforms such as Kafka, JDBC, and HDFS;
  • Flink can run 24/7 in high-availability mode, can be deployed on YARN or Apache Mesos, recovers quickly from failures, and can dynamically scale jobs;
  • Flink can update application code or migrate a job between versions without losing application state;
  • Flink exposes metrics for system and application monitoring, helping users identify problems;
  • Last but not least, Flink is a mature stream processor.

Beyond these features, Flink's APIs are very friendly, and a Flink application can run inside a single JVM in an IDE, which is very useful for debugging Flink jobs during development.

Next we deploy a local cluster and run our first Flink job, so that we can experience these features directly. The example aggregates randomly generated temperature readings. The environment requires a Unix-like system and Java 7; if you are on Windows, install a Linux virtual machine or Cygwin.

1. Download the Apache Flink 1.1.3 binary distribution (for Hadoop 2.7 and Scala 2.11) from flink.apache.org.

2. Extract the archive with the following command:

tar xvfz flink-1.1.3-bin-hadoop27-scala_2.11.tgz

3. Start Flink in local mode:

cd flink-1.1.3
./bin/start-local.sh

4. Open http://localhost:8081 in a browser to view the monitoring page. You will see some statistics showing that the local Flink cluster is running: a single TaskManager (Flink's worker process) is connected, and a single task slot (the unit of resources provided by a TaskManager) is available.

5. Download the example job:

wget https://streaming-with-flink.github.io/examples/download/examples-scala.jar

Note: you can also package the project into a jar yourself by following the README.

6. Submit the example job to the local cluster:

./bin/flink run -c io.github.streamingwithflink.AverageSensorReadings examples-scala.jar

7. Open the web monitoring page and you will see a running job. Click the job link to view its monitoring details.

8. By default, the Apache Flink cluster writes the job's output to its out files. You can view the output with the following command:

tail -f ./log/flink-&lt;user&gt;-jobmanager-&lt;hostname&gt;.out

You will see messages similar to the following in the terminal:

SensorReading(sensor_2,1480005737000,18.832819812267438)
SensorReading(sensor_5,1480005737000,52.416477673987856)
SensorReading(sensor_3,1480005737000,50.83979980099426)
SensorReading(sensor_4,1480005737000,-17.783076985394775)

9. At this point you have a running stream processing job. To stop it, cancel the job from the monitoring page or the command line, then shut down the local cluster:

./bin/stop-local.sh

You now know how to deploy and run Flink. The book below covers Flink in much more depth.

If you find the book useful, please support the author by buying a copy.

References:
"Stream Processing with Apache Flink"


Origin www.cnblogs.com/petewell/p/11408899.html