Comparative Analysis of the Apache Streaming Frameworks Flink, Spark Streaming, and Storm (Part 2)

This article is published by NetEase Cloud.

 

This article continues from the previous installment, Apache streaming framework Flink, Spark Streaming, Storm comparative analysis (1).

 

2. Spark Streaming Architecture and Feature Analysis

 

2.1 Basic Architecture

The Spark Streaming architecture is built on top of Spark Core.

 

Spark Streaming breaks stream computation down into a series of short batch jobs. The batch engine is Spark itself: the input data of Spark Streaming is divided into chunks of data (a Discretized Stream, or DStream) according to the batch size (for example, 1 second), and each chunk is converted into an RDD (Resilient Distributed Dataset) in Spark. The transformation operations on the DStream in Spark Streaming then become transformation operations on RDDs in Spark, and the RDDs produced by these operations are kept in memory as intermediate results. The overall stream computation can accumulate these intermediate results or write them to external storage, depending on the business requirements.

In short, Spark Streaming divides the real-time input stream into chunks of a fixed time slice Δt (such as 1 second), treats each chunk of data as an RDD, and uses RDD operations to process each small chunk. Each batch generates a Spark job, and these jobs are submitted to the cluster and run in batches. Running each such job is no different from running an ordinary Spark job.
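To make this concrete, here is a minimal sketch of the micro-batch model in Scala, assuming a TCP text source on localhost:9999 (host, port, and application name are placeholders): the StreamingContext is created with a 1-second batch interval, each interval's data becomes an RDD, and the output operation triggers one Spark job per batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    // Batch interval of 1 second: the input stream is cut into 1-second chunks,
    // each chunk becomes an RDD, and each batch runs as an ordinary Spark job.
    val conf = new SparkConf().setAppName("MicroBatchWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // A DStream is a sequence of RDDs, one per batch interval.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // output operation: triggers one Spark job per batch

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()
  }
}
```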

 

JobScheduler

 

Responsible for job scheduling

JobScheduler is the center of all job scheduling in Spark Streaming. Starting the JobScheduler starts the ReceiverTracker and the JobGenerator. Starting the ReceiverTracker causes the Receivers running on the Executors to start and receive data, and the ReceiverTracker records the metadata of the data received by the Receivers. Starting the JobGenerator causes the DStreamGraph to be used, once every BatchDuration, to generate an RDD graph and the corresponding Jobs. The JobScheduler then uses its thread pool to submit the encapsulated JobSet objects (the batch time, the Jobs, and the metadata of the source data). The business logic encapsulated in each Job triggers the action on the last RDD, and the job is actually scheduled by the DAGScheduler to execute on the Spark cluster.

 

JobGenerator

 

Responsible for job generation

Driven by a timer, it generates an RDD DAG at each batch interval according to the dependencies of the DStreams.

 

ReceiverTracker

 

Responsible for the reception, management, and distribution of data

When the ReceiverTracker starts a Receiver, it does so through a ReceiverSupervisor, whose implementation is ReceiverSupervisorImpl. When the ReceiverSupervisor starts, it starts the Receiver, which continuously receives data and converts it into Blocks via the BlockGenerator. A timer keeps persisting the Block data through the BlockManager or the WAL. After the data has been stored, ReceiverSupervisorImpl reports the metadata of the stored blocks to the ReceiverTracker, or more precisely to the RPC endpoint ReceiverTrackerEndpoint inside the ReceiverTracker.

 

2.2 YARN-based Architecture Analysis

The figure above shows the cluster mode of Spark on YARN. After Spark on YARN starts, the driver inside the Spark AppMaster (the driver, mainly the StreamingContext object, is started in the AM) submits the Receiver as a Task to a Spark Executor. Once the Receiver starts, it receives input data, generates data blocks, and notifies the Spark AppMaster; the Spark AppMaster then generates the corresponding jobs from those data blocks and submits the jobs' Tasks to idle Spark Executors for execution. The thick blue arrow in the figure shows the data stream being processed: the input stream can come from disk, the network, HDFS, and so on, and the output can go to HDFS, databases, and so on. Comparing the cluster modes of Flink and Spark Streaming, in both cases a component inside the AM (the JM in Flink, the Driver in Spark Streaming) handles task allocation and scheduling, while other containers carry out task execution (the TM in Flink, the Executor in Spark Streaming). The difference is that in Spark Streaming every batch has to go back to the driver for rescheduling, so its latency is much higher than Flink's.

Implementation

Figure 2.1 Spark Streaming program converted to DStream Graph

Figure 2.2 DStream Graph converted to RDD Graph

Each step of Spark Core processing is based on RDDs, and there are dependencies between RDDs. The RDD DAG in the figure below contains 3 actions, which trigger 3 jobs; the RDDs depend on each other from bottom to top, and the jobs generated from the RDDs are what actually get executed. As can be seen from the DStream Graph, the logic of a DStream is basically the same as that of an RDD: it is built on RDDs and adds a time dependency. The RDD DAG can be called the spatial dimension, which means the whole of Spark Streaming adds a time dimension on top of it, so it can also be described as a space-time dimension. A program written with Spark Streaming is very similar to a Spark program: RDDs (Resilient Distributed Datasets) provide interfaces such as map, reduce, and filter to batch-process data, while in Spark Streaming you operate on the interfaces provided by DStreams (the sequences of RDDs that represent the data stream), which are similar to the interfaces provided by RDDs.

 

Spark Streaming converts the DStream operations in the program into a DStream Graph. As Figure 2.1 shows, for each time slice the DStream Graph generates an RDD Graph; for each output operation (such as print, foreach, etc.), Spark Streaming creates a Spark action; and for each Spark action, Spark Streaming generates a corresponding Spark job and hands it to the JobScheduler. The JobScheduler maintains a jobs queue in which these Spark jobs are stored, and it submits them to the Spark Scheduler, which schedules the tasks to run on the corresponding Spark Executors, so that each batch ultimately executes as an ordinary Spark job.
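As a small illustration of that mapping (reusing the hypothetical counts DStream from the sketch above; the HDFS path is a placeholder), each output operation below becomes one Spark action, so two jobs are generated and queued for every batch interval:

```scala
// counts: DStream[(String, Int)] from the word-count sketch above.
// Two output operations on the same DStream: at every batch interval each one
// becomes a Spark action, so two Spark jobs are generated per batch and handed
// to the JobScheduler's queue.
counts.print()                                      // job 1 per batch
counts.saveAsTextFiles("hdfs:///tmp/wordcounts")    // job 2 per batch (placeholder path)
```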

Figure 2.3 DAG of RDD generation in the time dimension

The Y-axis is the operations on the RDDs, whose dependencies make up the logic of the whole job; the X-axis is time. As time passes, a job instance is generated at every fixed time interval (the batch interval) and then run on the cluster.

 

Code

The source-code interpretation of Spark Streaming above is based on Spark 1.5; the basic architecture has not changed much since.

 

2.3 Component Stack

Spark Streaming supports obtaining data from a variety of sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets. After ingesting data from a source, high-level functions such as map, reduce, join, and window can be used to implement complex processing. Finally, the results can be written to file systems, databases, and live dashboards. On the basis of "One Stack to rule them all", the other Spark sub-frameworks, such as machine learning and graph computation, can also be applied to streaming data.
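As a sketch of one of these high-level operators (again reusing the hypothetical counts DStream and 1-second batch interval from the earlier example), a 30-second sliding window recomputed every 10 seconds looks like this; both durations must be multiples of the batch interval:

```scala
import org.apache.spark.streaming.Seconds

// counts: DStream[(String, Int)] from the word-count sketch above.
// reduceByKeyAndWindow keeps a 30-second window of word counts and
// recomputes it every 10 seconds by merging the per-batch counts.
val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // merge counts inside the window
  Seconds(30),                 // window length
  Seconds(10))                 // slide interval

windowedCounts.print()
```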

 

2.4 Characteristic Analysis

 

Throughput and Latency

Spark can currently scale linearly to 100 nodes (4 cores per node) on EC2 and process a 6 GB/s data volume (60M records/s) with a latency of a few seconds; its throughput is also 2 to 5 times that of the popular Storm. Figure 4 shows a test done by Berkeley using the WordCount and Grep use cases; in that test, the per-node throughput of Spark Streaming was 670k records/s, while Storm's was 115k records/s.

Spark Streaming decomposes stream computation into multiple Spark jobs, and the processing of each batch of data goes through Spark's DAG decomposition and task-set scheduling. Its minimum batch size is on the order of 0.5 to 2 seconds (Storm's current minimum latency is around 100 ms), so Spark Streaming can satisfy all quasi-real-time streaming scenarios except those with extremely demanding latency requirements (such as high-frequency real-time trading).

 

exactly-once semantics

It provides comparatively stable support for exactly-once semantics.

 

Backpressure support

Spark Streaming introduced a backpressure mechanism in v1.5, which dynamically controls the data reception rate to match the cluster's data processing capacity.

 

How does Spark Streaming implement backpressure?

Simply put, a backpressure mechanism has to adjust the rate at which the system receives or processes data; since the processing rate cannot easily be changed, the only option is to estimate the rate at which the system is currently processing data and adjust the reception rate to match it.
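A configuration sketch of how this is usually enabled (the property names are the standard Spark Streaming settings available from v1.5 onward; the numeric limits are placeholder values):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BackpressureExample")
  // Let Spark Streaming estimate the recent processing rate and throttle
  // the receivers (or the Kafka direct stream) to match it.
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional hard upper bounds, used before the rate estimator has data.
  .set("spark.streaming.receiver.maxRate", "10000")            // records/s per receiver
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")   // records/s per Kafka partition
```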

 

How does Flink handle backpressure?

Strictly speaking, Flink does not need a backpressure mechanism, because the rate at which the system receives data and the rate at which it processes data are naturally matched. The system can only receive data if the receiving task has free buffers available, and data can only keep flowing downstream if the downstream tasks also have free buffers available. Therefore, the system never accepts more data than it can handle.

It can be seen that it is Spark's micro-batch model that forces it to introduce a separate backpressure mechanism.

 

Back pressure and high loads

Backpressure typically arises when a short-term load spike causes the system to receive data much faster than it can process it.

However, the maximum load the system can sustain is determined by its data processing capability. The backpressure mechanism does not improve the system's processing capability; it only adjusts the rate at which the system receives data when the load exceeds that capability.

 

fault tolerance

The driver and executors save their state using a write-ahead log (WAL), combined with the fault tolerance provided by the RDD's own lineage mechanism.
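A sketch of how these pieces are typically wired together (the checkpoint directory and application name are placeholders; the checkpoint path should live on HDFS or another fault-tolerant file system):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

val conf = new SparkConf()
  .setAppName("FaultTolerantStreaming")
  // Persist received data to a write-ahead log before acknowledging it.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // metadata + periodic RDD checkpoints
  // ... build the DStream graph here ...
  ssc
}

// After a driver failure, rebuild the context from the checkpoint;
// otherwise create a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```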

 

API and Class Libraries

Spark 2.0 introduces Structured Streaming, which unifies the SQL and Streaming APIs. With DataFrame as the unified entry point, a streaming job can be written like an ordinary batch program or even as plain SQL, which makes programming easy.
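A minimal sketch of this unified API (the canonical streaming word count, assuming a socket source on localhost:9999): the streaming query is written with the same DataFrame/Dataset operators as a batch job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Unbounded DataFrame backed by a socket source (host/port are placeholders).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// Exactly the same operators you would use in a batch program.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the running counts to the console.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```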

Broad integration

 

In addition to reading from HDFS, Flume, Kafka, Twitter, and ZeroMQ, we can also define our own data sources. It supports running on YARN, Standalone, and EC2, ensures high availability through ZooKeeper and HDFS, and the processing results can be written directly to HDFS.
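A sketch of what "defining a data source ourselves" looks like, following the pattern of Spark Streaming's custom receiver API (the class name, host, and port are hypothetical): the receiver pushes records into Spark Streaming with store(), and is plugged in via receiverStream just like a built-in source.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical custom receiver that reads text lines from a TCP socket.
class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns immediately.
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = ()   // the receiving thread exits once isStopped is true

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)                  // hand each record to Spark Streaming
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Used like any built-in source:
// val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
```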

deployability

It only depends on a Java environment, as long as the application can load the Spark-related JAR packages.

 

3. Storm Architecture and Feature Analysis

3.1 Basic Architecture

 

A Storm cluster adopts a master-slave architecture: the master node is Nimbus, the slave nodes are Supervisors, and scheduling-related information is stored in a ZooKeeper cluster. The architecture is as follows:

 

 

Nimbus

The master node of a Storm cluster. It is responsible for distributing user code and assigning the Tasks of a Topology's components (Spouts/Bolts) to Worker processes on specific Supervisor nodes.

 

Supervisor

The slave node of a Storm cluster. It is responsible for managing the startup and termination of each Worker process running on that Supervisor node. Through the supervisor.slots.ports configuration item in Storm's configuration file, you can specify the maximum number of slots allowed on a Supervisor. Each slot is uniquely identified by a port number, and a port number corresponds to one Worker process (if that Worker process is started).

 

ZooKeeper

ZooKeeper coordinates Nimbus and the Supervisors. If a Supervisor cannot run its part of a Topology because of a failure, Nimbus detects this immediately and reassigns the Topology to other available Supervisors.

 

Runtime architecture

Runtime process

 

1) The client submits the Topology to Nimbus.

2) Nimbus creates a local directory for the Topology, computes the Tasks from the Topology's configuration, assigns the Tasks, and creates an assignments node on ZooKeeper that stores the mapping between Tasks and Workers on Supervisor nodes; it also creates a taskbeats node on ZooKeeper to monitor the Tasks' heartbeats, and then starts the Topology.

3) The Supervisor fetches its assigned Tasks from ZooKeeper and starts multiple Workers. Each Worker spawns its Tasks, one thread per Task, and initializes the connections between Tasks according to the Topology information; communication between Tasks is handled by ZeroMQ. At that point the whole Topology is up and running.

3.2 YARN-based Architecture

To develop an application on YARN, you usually only need to develop two components: the client and the ApplicationMaster. The client's main job is to submit the application to YARN and to interact with YARN and the ApplicationMaster to carry out instructions from the user, while the ApplicationMaster is responsible for requesting resources from YARN and communicating with the NodeManagers to start the tasks.

 

Storm can run on YARN without modifying any Storm source code. The easiest way to achieve this is to run Storm's various service components (including Nimbus and the Supervisors) as separate tasks on YARN, with ZooKeeper running as a shared service on a few nodes outside the YARN cluster.

 

1) Submit Storm Application to YARN RM through YARN-Storm Client;

2) The RM allocates resources for the YARN-Storm ApplicationMaster and runs it on a node (the one that will host Nimbus);

3) YARN-Storm ApplicationMaster starts Nimbus and UI services within itself;

4) The YARN-Storm ApplicationMaster requests resources from the RM according to the user's configuration and starts the Supervisor service in the allocated Containers;

 

3.3 Component Stack

 

3.4 Characteristic Analysis

 

Simple programming model.

Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
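A minimal sketch of that programming model in Scala over Storm's Java API (assuming the org.apache.storm packages of Storm 1.x/2.x; the built-in TestWordSpout is used as a stand-in data source, and the topology name and parallelism values are placeholders):

```scala
import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

// A bolt that upper-cases each incoming word and emits it downstream.
class UpperCaseBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getString(0).toUpperCase))

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object SimpleTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout, 2)    // data source (Spout)
    builder.setBolt("upper", new UpperCaseBolt, 4)
           .shuffleGrouping("words")                   // wire the bolt to the spout

    val conf = new Config
    conf.setNumWorkers(2)                              // Worker processes on the Supervisors
    // Submit to the Nimbus configured in storm.yaml.
    StormSubmitter.submitTopology("simple-topology", conf, builder.createTopology())
  }
}
```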

 

Servicing

A service-style framework that supports hot deployment and bringing applications online or offline instantly.

 

Various programming languages are available

You can use various programming languages on top of Storm. Clojure, Java, Ruby, and Python are supported by default; to add support for another language, you only need to implement a simple Storm communication protocol.

 

fault tolerance

Storm manages the failure of worker processes and nodes.

 

Horizontal expansion

Computations are performed in parallel across multiple threads, processes, and servers.

Reliable message handling

Storm guarantees that each message is fully processed at least once; when a task fails, it takes care of retrying the message from the message source.
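A sketch of how a bolt participates in this at-least-once guarantee (assuming the Storm 2.x signature of BaseRichBolt.prepare; class and field names are illustrative): output tuples are anchored to the input tuple, the input is acked only after successful processing, and failing it triggers a replay from the spout.

```scala
import java.util.{Map => JMap}

import org.apache.storm.task.{OutputCollector, TopologyContext}
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.topology.base.BaseRichBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

class ReliableBolt extends BaseRichBolt {
  private var collector: OutputCollector = _

  override def prepare(topoConf: JMap[String, AnyRef], context: TopologyContext,
                       collector: OutputCollector): Unit = {
    this.collector = collector
  }

  override def execute(input: Tuple): Unit = {
    try {
      // Anchored emit: the new tuple is tied to the input tuple in the tuple tree.
      collector.emit(input, new Values(input.getString(0)))
      collector.ack(input)          // mark the input tuple as fully processed
    } catch {
      case _: Exception =>
        collector.fail(input)       // ask the spout to replay this tuple
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}
```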

 

fast

The design of the system ensures that messages can be processed quickly, using ZeroMQ as its underlying message queue.

 

local mode

Storm has a "local mode" that fully simulates a Storm cluster in-process, which allows for rapid development and unit testing.

 

deployability

Storm relies on ZooKeeper for maintaining task state, so ZooKeeper must be deployed first.

4. Comparative Analysis of the Three Frameworks

 

Comparative analysis

If the latency requirements are not strict, Spark Streaming is recommended: it has a rich set of high-level APIs, is easy to use, connects naturally with the other components in the Spark ecosystem, has high throughput, is easy to deploy, offers a more informative UI, and has a highly active community that responds quickly when problems arise. It is especially suitable for streaming ETL, and Spark's development momentum is plain to see, so its performance and features can be expected to keep improving.

 

If the latency requirements are strict, Flink is worth trying. Flink is currently a very popular streaming system: it adopts a native stream-processing model that guarantees low latency, it is fairly complete in terms of API and fault tolerance, it is relatively simple to use and easy to deploy, and its development momentum keeps improving, so the community should respond to issues fairly quickly.

 

Personally, I am more optimistic about Flink: thanks to its native stream-processing model it delivers good performance while guaranteeing low latency, it is becoming easier and easier to use, and its community keeps developing.

 


 
