[Big Data] Detailed Explanation of Flink (1): Basics


1. What is Flink?


Flink is a highly available, high-performance distributed computing engine centered on streams. It features unified stream and batch processing, high throughput, low latency, fault tolerance, and large-scale complex computation, and it provides data distribution and communication over data streams.

2. Can you explain in detail the concepts of data streams, stream-batch unification, and fault tolerance?

Data stream: All generated data naturally carries the notion of time. Arranging events in chronological order forms an event stream, also known as a data stream.

Stream-batch unification:

First, we need to understand what bounded data and unbounded data are:

[Figure: bounded stream vs. unbounded stream]

  • Bounded data: a data stream within a fixed time range, with a beginning and an end. Once complete, it no longer changes. Bounded data is generally handled with batch processing, shown as the bounded stream in the figure above.
  • Unbounded data: a continuously generated data stream that is infinite, with a beginning but no end. Unbounded data is generally handled with stream processing, shown as the unbounded stream in the figure above.

Flink's design is stream-centric: a batch is simply a special case of a stream, so Flink handles both unbounded and bounded data well. Flink provides precise time control and a stateful computation mechanism, allowing it to easily process unbounded data streams, and it provides windows for processing bounded data streams. This is why it is described as unifying stream and batch processing.

Fault tolerance: In a distributed system, hardware failures, process exceptions, application exceptions, network failures, and other faults are ubiquitous. After a failure occurs, the Flink engine must not only restart the application but also ensure that its internal state remains consistent, so that processing can resume from the correct point.

Flink provides cluster-level fault tolerance and application-level fault tolerance:

  • Cluster-level fault tolerance: Flink integrates closely with cluster managers such as YARN and Kubernetes. When a process dies, a new process is automatically started to take over its work. Flink also provides high availability to eliminate single points of failure.
  • Application-level fault tolerance: Flink uses lightweight distributed snapshots, i.e., checkpoints (checkpoint), to achieve reliable fault tolerance.

Using checkpoints, Flink provides exactly-once semantics at the framework level, ensuring that each record affects the application's state only once and is neither duplicated nor lost. Combined with transactional sinks, this extends to end-to-end consistency, so that even after a failure the data is written out only once.
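
As a minimal sketch (the interval and timeout values here are arbitrary placeholders), a job enables exactly-once checkpointing like this:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Take a checkpoint every 10 seconds with exactly-once guarantees
env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);

// Abort a checkpoint that does not complete within 60 seconds
env.getCheckpointConfig().setCheckpointTimeout(60000L);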

3. What is the difference between Flink and Spark Streaming?

The biggest difference between Flink and Spark Streaming is that Flink is a true real-time processing engine: it is event-driven and stream-centric. Spark Streaming, by contrast, actually processes a sequence of small-batch RDD collections, a micro-batch (Micro-Batch) model that is batch-centric.

Below we describe the main differences between the two frameworks:

1. Architecture model

The main roles of Spark Streaming at runtime include:

  • Master / YARN ApplicationMaster: cluster and resource management;
  • Worker / NodeManager: work nodes;
  • Driver: task scheduling; Executor: task execution.

[Figure: Spark Streaming runtime roles]

Flink at runtime mainly includes: the client (Client), job management (JobManager), and task management (TaskManager).

[Figure: Flink runtime roles]
2. Task scheduling

Spark Streaming continuously generates small batches of data and builds a directed acyclic graph (DAG); to do so, it creates a DStreamGraph and a JobScheduler in turn.

[Figure: Spark Streaming task scheduling]
Flink generates a StreamGraph from the user-submitted code, optimizes it into a JobGraph, and submits that to the JobManager. The JobManager generates an ExecutionGraph from the JobGraph; the ExecutionGraph is the core data structure of Flink scheduling and is deployed to the TaskManagers as concrete Tasks for execution.

[Figure: StreamGraph → JobGraph → ExecutionGraph transformation]
3. Time Mechanism

Spark Streaming supports only a limited time mechanism: processing time.

Flink supports three notions of time for stream processing programs: event time (EventTime), ingestion time (IngestionTime), and processing time (ProcessingTime). It also supports the watermark mechanism for handling late data.
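
As an illustration, event time with watermarks can be configured via the WatermarkStrategy API available since Flink 1.11. This is a sketch assuming a hypothetical event type MyEvent with a public long timestamp field; the 5-second bound is an arbitrary placeholder:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Assume `events` is a DataStream<MyEvent>, where MyEvent is a user-defined POJO
DataStream<MyEvent> withTimestamps = events.assignTimestampsAndWatermarks(
        WatermarkStrategy
                // tolerate events arriving up to 5 seconds out of order
                .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                // tell Flink where to find each event's timestamp
                .withTimestampAssigner((event, recordTimestamp) -> event.timestamp));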

[Figure: EventTime, IngestionTime, and ProcessingTime in a Flink pipeline]
4. Fault tolerance mechanism

  • For a Spark Streaming task we can enable checkpointing; if a failure occurs and the job restarts, it resumes from the last checkpoint. However, this only prevents data loss: records may be processed repeatedly, so it cannot achieve exactly-once processing semantics.
  • Flink solves this problem by combining checkpoints with a two-phase commit protocol.

4. What does Flink's architecture include?

The Flink architecture is divided into two parts: the technical architecture and the operating architecture.

5. Briefly introduce the technical architecture of Flink.

The following figure shows the Flink technical architecture:

[Figure: Flink technical architecture]
As a distributed computing engine unifying stream and batch processing, Flink must offer developers an API layer. It must also interact with external data storage, which requires connectors. Once job development and testing are complete, the job must be submitted to a cluster for execution, which requires a deployment layer, and operators must be able to manage and monitor it. On top of this, Flink provides graph computation, machine learning, SQL, and similar capabilities through an application framework layer.

6. Introduce the operating architecture of Flink in detail.

The following figure shows the Flink operating architecture:

[Figure: Flink operating (Master-Slave) architecture]
The Flink cluster adopts a Master-Slave architecture. The Master role is the JobManager, responsible for cluster and job management; the Slave role is the TaskManager, responsible for executing computation tasks. Flink also provides a client (Client) for managing the cluster and submitting tasks. The JobManager and TaskManager are the cluster's processes.

  • Client: the Flink client is a CLI tool provided by Flink for submitting Flink jobs to a Flink cluster; it is responsible for building the StreamGraph (stream graph) and JobGraph (job graph) on the client side.
  • JobManager: the JobManager decomposes the application submitted by the client into subtasks according to the configured parallelism and requests the needed computing resources from the ResourceManager. Once resources are available, the JobManager distributes the tasks to the TaskManagers for execution. It is also responsible for application fault tolerance and for tracking job execution status, recovering the job when an exception is detected.
  • TaskManager: the TaskManager receives the subtasks distributed by the JobManager and, according to its own resources, manages subtask lifecycle stages such as start, stop, destruction, and failure recovery. A Flink program must have at least one TaskManager.

7. Introduce the parallelism of Flink.

When a Flink program executes, it is mapped to a Streaming Dataflow. A Streaming Dataflow consists of a set of Streams and Transformation Operators; it starts with one or more Source operators and ends with one or more Sink operators.

Flink programs are inherently parallel and distributed. During execution, a stream contains one or more stream partitions, and each operator consists of one or more operator subtasks. Operator subtasks are independent of one another and run in different threads, possibly even on different machines or containers.

The number of subtasks of an operator is that operator's parallelism. Different operators in the same program may have different degrees of parallelism.

[Figure: parallel dataflow with stream partitions and operator subtasks]
A Stream can be divided into multiple stream partitions (Stream Partitions), and an Operator can likewise be divided into multiple Operator Subtasks. In the figure above, Source is split into Source1 and Source2, which are the Operator Subtasks of Source. Each Operator Subtask executes independently in its own thread. The parallelism of an Operator equals its number of Operator Subtasks.

The parallelism of Source in the figure above is 2. The parallelism of a Stream equals the parallelism of the Operator that produces it. Data passing between two operators follows one of two modes:

  • One-to-One mode: when data passes between two operators in this mode, the number of partitions and the ordering of records are preserved. For example, from Source1 to Map1 in the figure above, the partitioning of the Source and the order in which elements are processed within each partition are retained.
  • Redistributing mode: this mode changes the partitioning of the data. Each operator subtask sends data to different target subtasks depending on the selected transformation; for example, keyBy() repartitions by the hash of the key, broadcast() replicates records to all downstream partitions, and rebalance() redistributes data round-robin.

8. How to set the parallelism of Flink?

In the actual production environment, we can set the degree of parallelism from four different levels:

  • Operator level (Operator Level)
  • Execution environment level (Execution Environment Level)
  • Client level (Client Level)
  • System level (System Level)

Note the priority order: operator level > execution environment level > client level > system level.
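
As a minimal sketch of the first two levels (the parallelism values are arbitrary placeholders): the system level is typically set via parallelism.default in flink-conf.yaml, the client level via the -p flag of the flink run command, and the remaining two levels in code:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Execution environment level: the default for every operator in this job
env.setParallelism(4);

// Operator level: overrides the environment default for this operator only
env.socketTextStream("localhost", 9000)
        .map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        })
        .setParallelism(2)
        .print();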

9. Do you understand the Flink programming model?

Flink applications are mainly composed of three parts: source, transformation, and sink. These streaming dataflows form a directed graph that starts with one or more sources and ends with one or more sinks.

[Figure: Flink programming model: source → transformation → sink]

10. What are DataStream and Transformation in a Flink job?

A Flink job has two basic building blocks: the data stream (DataStream) and the transformation (Transformation).

A DataStream is a logical concept that provides developers with an API interface; a Transformation is an abstraction of processing behavior, covering reading, computing, and writing data. The DataStream API calls in a Flink job therefore actually build one or more data processing pipelines (Pipelines) composed of Transformations.
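
As a small illustrative sketch of this idea, each DataStream API call below appends a Transformation to the pipeline; the dataflow is only built and run when execute() is called:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements("a", "b", "c")            // adds a source Transformation
        .map(String::toUpperCase)          // adds a map Transformation
        .filter(value -> !value.isEmpty()) // adds a filter Transformation
        .print();                          // adds a sink, ending the pipeline

env.execute("pipeline sketch");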

The correspondence between DataStream API calls and Transformations is shown below:

[Figure: mapping from DataStream API calls to Transformations]

11. Do you understand Flink's partition strategy?

In Flink, data partitions are called Partitions. In essence, distributed computing means splitting a job into subtasks and assigning different portions of the data to different tasks for computation.

In distributed storage, the concept of a Partition is to divide the data set into blocks, each stored on a different machine. Similarly, a distributed computing engine must also split the data and hand it to tasks on different physical nodes for computation.

StreamPartitioner is Flink's abstract interface for data stream partitioning. It determines how the data stream is distributed at runtime, dividing the data among Tasks so that each Task computes part of the stream. All data partitioners implement the ChannelSelector interface, which defines the channel selection (load balancing) behavior.

// Definition of the ChannelSelector interface
public interface ChannelSelector<T extends IOReadableWritable> {

    // Number of downstream channels to choose from
    void setup(int numberOfChannels);

    // Routing method: selects the channel for a record
    int selectChannel(T record);

    // Whether to broadcast to all downstream channels
    boolean isBroadcast();
}

As this interface shows, each partitioner knows the number of downstream channels. This number is fixed for a given job run and will not change unless the job's parallelism is modified.

Flink currently provides 8 partitioning strategy implementations, organized as follows:
[Figure: Flink partitioner class hierarchy]
(1) GlobalPartitioner

The data will be distributed to the first instance of the downstream operator for processing.
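
Analogous to the usage snippets for the strategies below, the corresponding DataStream call is global(); note that it concentrates all records on a single downstream subtask:

dataStream.global();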

(2) ForwardPartitioner

On the API level, ForwardPartitioner is applied to DataStream to generate a new DataStream.

This partitioner is rather special: it is used to forward data between upstream and downstream operators in the same OperatorChain. The data is effectively passed straight through to the downstream operator, which requires the upstream and downstream to have the same parallelism.
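
The corresponding DataStream call is forward(); it is also the default when upstream and downstream parallelism match and no other partitioner is specified:

dataStream.forward();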

(3) ShufflePartitioner

Partitions elements randomly, ensuring that downstream tasks obtain data evenly. The usage code is as follows:

dataStream.shuffle();

(4) RebalancePartitioner

Distributes elements round-robin, ensuring that downstream tasks obtain data evenly and avoiding data skew. The usage code is as follows:

dataStream.rebalance();

(5) RescalePartitioner

Partitions based on the number of upstream and downstream tasks, selecting a downstream task round-robin. For example, if the upstream has 2 Sources and the downstream has 6 Maps, each Source is assigned 3 fixed downstream Maps and never writes data to partitions not assigned to it. This differs from ShufflePartitioner and RebalancePartitioner, which write to all downstream partitions.
[Figure: RescalePartitioner data distribution]

The usage code is as follows:

dataStream.rescale();

(6) BroadcastPartitioner

Broadcasts each record to all partitions: for N downstream partitions, the data is copied N times, one copy per partition. The usage code is as follows:

dataStream.broadcast();

(7) KeyGroupStreamPartitioner

At the API level, KeyGroupStreamPartitioner is applied to KeyedStream to generate a new KeyedStream.

The KeyedStream is partitioned by key-group index: records are routed to downstream operator instances according to the hash value of the key. This partitioner is not intended for direct use by users.

When constructing its Transformation, KeyedStream uses the key-group partitioning form by default, which is what supports job rescaling at the lower layer.
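
This partitioner is applied implicitly whenever keyBy() is called; a small sketch, assuming a stream of Tuple2<String, Integer> records keyed by their first field:

// keyBy() installs the KeyGroupStreamPartitioner under the hood
KeyedStream<Tuple2<String, Integer>, String> keyed = dataStream.keyBy(record -> record.f0);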

(8) CustomPartitionerWrapper

A user-defined partitioner. Users implement the Partitioner interface to define their own partitioning logic, for example:

static class CustomPartitioner implements Partitioner<String> {

    @Override
    public int partition(String key, int numPartitions) {
        // Route each key to a fixed partition number
        switch (key) {
            case "1":
                return 1;
            case "2":
                return 2;
            case "3":
                return 3;
            default:
                return 4;
        }
    }
}
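
A hypothetical usage sketch (here simply using each record itself as the partitioning key) wires the partitioner in via partitionCustom():

// Apply the custom partitioner; the lambda selects the key from each record
DataStream<String> partitioned =
        dataStream.partitionCustom(new CustomPartitioner(), value -> value);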

12. Describe the steps involved in executing a Flink WordCount program.

It mainly includes the following steps:

  • Obtain the execution environment StreamExecutionEnvironment;
  • Connect a data source (source);
  • Apply transformation operations such as map(), flatMap(), keyBy(), sum();
  • Output via a sink, such as print();
  • Call execute to run the job.

To provide an example:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // Determine the socket port number
        int port;
        try {
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port");
        } catch (Exception e) {
            System.err.println("No port parameter specified, using default 9000");
            port = 9000;
        }
        // Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Connect to the socket to read the input data
        DataStreamSource<String> text = env.socketTextStream("10.192.12.106", port, "\n");
        // Process the data
        DataStream<WordWithCount> windowCount = text.flatMap(new FlatMapFunction<String, WordWithCount>() {
            public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
                String[] splits = value.split("\\s");
                for (String word : splits) {
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        })// flatten: turn each line's words into <word, count> records
                .keyBy("word")// group records with the same word
                .timeWindow(Time.seconds(2), Time.seconds(1))// window size and slide interval
                .sum("count");
        // Print the result to the console
        windowCount.print()
                .setParallelism(1);// use a parallelism of 1
        // Note: Flink is lazily evaluated, so execute() must be called for the code above to run
        env.execute("streaming word count");
    }

    /**
     * Holds a word and the number of times it occurs
     */
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {
        }

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }
}

13. What are the commonly used operators in Flink?

Commonly used operators fall into two groups:

(1) Data reading, the starting point of a Flink stream computing application. Commonly used operators are:

  • Read from memory: fromElements
  • Read from a file: readTextFile
  • Read from a socket: socketTextStream
  • Custom input: createInput

(2) Data processing operators, mainly used in the transformation stage. Commonly used operators include:

  • Single input, single output: Map
  • Single input, multiple output: FlatMap
  • Filtering: Filter
  • Grouping: KeyBy
  • Aggregation: Reduce
  • Windowing: Window
  • Connecting streams: Connect
  • Splitting streams: Split
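
As a minimal sketch combining several of these operators (the input values and keying logic are arbitrary placeholders):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements(1, 2, 3, 4, 5, 6)   // read from memory
        .filter(n -> n % 2 == 0)     // Filter: keep even numbers -> 2, 4, 6
        .map(n -> n * 10)            // Map: 20, 40, 60
        .keyBy(n -> n % 20)          // KeyBy: group by a derived key
        .reduce((a, b) -> a + b)     // Reduce: sum values within each group
        .print();

env.execute("operator sketch");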

