[Big Data] Detailed Explanation of Flink (5): Core Part IV



45. Do you understand the Flink broadcast mechanism?

[Figure: Flink broadcast variables]
As the figure shows, a broadcast variable is a shared, read-only variable. It is stored in the TaskManager's memory, so it should not be too large. Once a data set has been broadcast, every Task on a node can read it, and each node holds only one copy. Without broadcasting, every Task would keep its own copy of the data set, which wastes memory.
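The sharing described above can be pictured with a minimal plain-Java analogy. This is not Flink code. The JVM plays the TaskManager, each thread plays a Task, and an immutable map plays the broadcast data set. The class name BroadcastSketch and the sample COUNTRY_NAMES table are hypothetical. In Flink's DataSet API, the corresponding calls are withBroadcastSet(...) on the consuming operator and getRuntimeContext().getBroadcastVariable(...) inside a rich function; those calls are not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java analogy, not Flink API: the JVM stands in for one TaskManager,
// each thread for one task, and the map for a broadcast data set.
final class BroadcastSketch {
    // Exactly one read-only copy per "TaskManager" (here: per JVM).
    static final Map<String, String> COUNTRY_NAMES =
            Map.of("CN", "China", "DE", "Germany", "US", "United States");

    // What a task does with the broadcast data: enrich its own records.
    static List<String> enrich(List<String> codes, Map<String, String> table) {
        List<String> out = new ArrayList<>();
        for (String code : codes) {
            out.add(code + "=" + table.getOrDefault(code, "?"));
        }
        return out;
    }

    // Starts numTasks task threads and returns how many distinct table
    // instances they used; 1 means no task made a private copy.
    static int distinctCopiesUsed(int numTasks) {
        Set<Map<String, String>> seen = Collections.synchronizedSet(
                Collections.newSetFromMap(new IdentityHashMap<Map<String, String>, Boolean>()));
        List<Thread> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            Thread t = new Thread(() -> {
                enrich(List.of("CN", "US"), COUNTRY_NAMES); // read-only access
                seen.add(COUNTRY_NAMES);
            });
            tasks.add(t);
            t.start();
        }
        for (Thread t : tasks) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }
}
```

distinctCopiesUsed(4) returns 1, because every thread reads the same object instead of copying it.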

46. Do you understand Flink backpressure?

Backpressure is a very common problem when developing real-time computing applications, especially in stream processing. It means that some node in the data pipeline has become a bottleneck: its processing rate cannot keep up with the rate at which the upstream sends data, so the upstream has to be slowed down. Real-time applications usually decouple the producer side from the consumer side with a message queue, and the consumer side pulls its data (pull-based). As a result, backpressure usually propagates from the bottleneck node back to the data source and lowers the source's ingestion rate (for example, that of a Kafka consumer).

To put it simply, the downstream processing rate cannot keep up with the upstream sending rate, so the downstream cannot consume data fast enough. When the queue between them fills up, the upstream producer blocks, and eventually the data source's ingestion is blocked as well.
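The blocking chain described above can be reproduced with a small discrete-time model. This is a simplification under stated assumptions: fixed per-tick rates, a single bounded buffer, and the hypothetical class name BackpressureSim. It is not Flink's credit-based network stack. The model shows that once the buffer is full, the upstream can only emit as fast as the downstream drains.

```java
// Discrete-time model, not Flink's credit-based flow control: each tick the
// upstream tries to emit upRate records into a bounded buffer, then the
// downstream removes up to downRate records from it.
final class BackpressureSim {

    // Returns how many records the upstream actually emitted in `ticks` ticks.
    static long emittedRecords(int upRate, int downRate, int capacity, int ticks) {
        long buffered = 0;
        long emitted = 0;
        for (int t = 0; t < ticks; t++) {
            // The upstream can only write into free space; the rest of its
            // attempts would block, which is backpressure.
            long accepted = Math.min(upRate, capacity - buffered);
            buffered += accepted;
            emitted += accepted;
            // The downstream drains at its own, slower pace.
            buffered -= Math.min(downRate, buffered);
        }
        return emitted;
    }
}
```

Take upRate = 3, downRate = 1 and capacity = 5. Then emittedRecords(3, 1, 5, 10) returns 14: the first two ticks run at full speed and fill the buffer, and from then on the upstream is throttled to one record per tick.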

47. What are the effects of Flink back pressure?

Backpressure affects two metrics: checkpoint duration and state size.

(1) Checkpoint duration grows because a checkpoint barrier does not overtake ordinary data records. When data processing is blocked, the barrier also takes longer to flow through the entire data pipeline, so the overall checkpoint time (End to End Duration) becomes longer.

(2) State size grows because of barrier alignment. To guarantee EOS (Exactly-Once Semantics), an operator with two or more input channels must align the checkpoint barriers (Alignment). After the barrier arrives on a faster input channel, the data behind it is buffered but not processed until the barrier on the slower input channel also arrives. This buffered data is put into state, which makes the state larger.
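To see why a slow input makes alignment more expensive, here is a toy model of aligned barriers for one checkpoint on an operator with several inputs. It assumes that arrivals are given in a single global order. The names BarrierAlignmentSketch and Arrival are hypothetical. The model only counts how many records pile up behind the barrier of the faster channel; it does not reproduce Flink's actual buffering code.

```java
import java.util.List;

// Toy model of barrier alignment for a single checkpoint; not Flink's implementation.
final class BarrierAlignmentSketch {
    // One arrival at the operator: the input channel and whether it is a barrier.
    record Arrival(int channel, boolean barrier) {}

    // Returns how many records were held back while waiting for the barriers
    // of all numChannels inputs.
    static int recordsBufferedDuringAlignment(List<Arrival> arrivals, int numChannels) {
        boolean[] barrierSeen = new boolean[numChannels];
        int barriers = 0;
        int buffered = 0;
        for (Arrival a : arrivals) {
            if (barriers == numChannels) {
                break; // aligned: checkpoint taken, held-back records are released
            }
            if (a.barrier()) {
                if (!barrierSeen[a.channel()]) {
                    barrierSeen[a.channel()] = true;
                    barriers++;
                }
            } else if (barrierSeen[a.channel()]) {
                buffered++; // record behind this channel's barrier: hold it back
            }
            // Records on channels still waiting for their barrier are processed normally.
        }
        return buffered;
    }
}
```

Consider this sequence: the barrier on channel 0, two records on channel 0, then one record and the barrier on channel 1. The two channel-0 records are held back, so the method returns 2. The later the slower channel's barrier arrives, the larger this number gets.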

Both effects are very dangerous for production jobs, because checkpoints are the key to guaranteeing data consistency. A longer checkpoint duration can lead to checkpoint timeouts and failures. A larger state can slow checkpoints down further, or even cause OOM (with a heap-based StateBackend) or stability problems when physical memory usage exceeds the container's resources (with RocksDBStateBackend).

48. How to solve Flink back pressure?

The Flink community proposed FLIP-76, which introduces unaligned checkpoints to decouple the checkpoint mechanism from the backpressure mechanism.

To solve backpressure, the first step is to locate the node that causes it. There are two main methods:

  • The backpressure monitoring panel built into the Flink Web UI
  • Flink Task Metrics

(1) Back pressure monitoring panel

The backpressure monitoring in the Flink Web UI works at the SubTask level. It periodically samples the stack traces of each Task thread and measures how often the thread is blocked while requesting a buffer, which means it is blocked by the downstream queue. This ratio determines whether the node is under backpressure. With the default configuration, a ratio of up to 0.1 is OK, 0.1 to 0.5 is LOW, and above 0.5 is HIGH.

[Figure: backpressure monitoring panel in the Flink Web UI]
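The sampling rule above can be written as two small functions. One computes the ratio of samples in which the thread was blocked requesting a buffer; the other maps that ratio to OK, LOW, or HIGH. This sketch assumes the default thresholds quoted above, with inclusive upper bounds: 0.1 counts as OK and 0.5 as LOW. The class name is hypothetical, and the caller supplies the samples rather than the code taking them from real thread dumps.

```java
// Sketch of the Web UI's classification rule; sampling itself is not modelled.
final class BackpressureRatio {
    // samples[i] == true means the i-th stack sample showed the task thread
    // blocked while requesting an output buffer.
    static double ratio(boolean[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        int blocked = 0;
        for (boolean sample : samples) {
            if (sample) blocked++;
        }
        return (double) blocked / samples.length;
    }

    // Assumed default thresholds: <= 0.10 OK, <= 0.50 LOW, otherwise HIGH.
    static String level(double ratio) {
        if (ratio <= 0.10) return "OK";
        if (ratio <= 0.50) return "LOW";
        return "HIGH";
    }
}
```

For example, one blocked sample out of four gives a ratio of 0.25, which is LOW.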
(2) Task Metrics

The Task Metrics provided by Flink are a better way to monitor backpressure.

  • If the buffer usage on the sending side of a subtask is high, the subtask is being throttled by downstream backpressure.
  • If the buffer usage on the receiving side of a subtask is high, the subtask is propagating backpressure to the upstream.
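Flink reports buffer usage through task metrics such as inPoolUsage (receiving side) and outPoolUsage (sending side). The two rules above can be applied mechanically, as in the sketch below. Several parts are assumptions: the 0.8 cut-off for "high", the class name, and the linear pipeline ordered from source to sink that findBottleneck expects. In that pipeline, the first subtask with full input buffers but non-full output buffers is the likely bottleneck.

```java
// Sketch of interpreting buffer-usage metrics; thresholds are hypothetical.
final class BufferUsageDiagnosis {
    // Hypothetical cut-off for "high" usage; choose a value that fits your job.
    static final double HIGH = 0.8;

    static String diagnose(double inPoolUsage, double outPoolUsage) {
        boolean inHigh = inPoolUsage >= HIGH;
        boolean outHigh = outPoolUsage >= HIGH;
        if (inHigh && outHigh) return "throttled by downstream, passing backpressure upstream";
        if (outHigh) return "throttled by downstream";
        if (inHigh) return "likely bottleneck, passing backpressure upstream";
        return "normal";
    }

    // For a linear pipeline ordered source -> sink, returns the index of the first
    // subtask whose input buffers are full while its output buffers are not,
    // or -1 if there is none.
    static int findBottleneck(double[] inPoolUsage, double[] outPoolUsage) {
        for (int i = 0; i < inPoolUsage.length; i++) {
            if (inPoolUsage[i] >= HIGH && outPoolUsage[i] < HIGH) return i;
        }
        return -1;
    }
}
```

Suppose inPoolUsage = {0, 0.9, 0.95, 0.1} and outPoolUsage = {0.95, 0.9, 0.2, 0}. Then findBottleneck returns 2, because subtasks 0 and 1 are only passing the pressure along.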

49. What data types does Flink support?

The data types supported by Flink are shown in the figure below:
[Figure: data types supported by Flink]
As the figure shows, Flink types can be divided into basic types (Basic), arrays (Arrays), composite types (Composite), auxiliary types (Auxiliary), and generic and other types (Generic). Flink supports arbitrary Java or Scala types.

50. How does Flink perform serialization and deserialization?

What serialization and deserialization mean:

  • Serialization: converting an in-memory object into a binary sequence, producing a data stream that can be transmitted over the network or persisted.
  • Deserialization: converting a binary sequence back into an in-memory object.

TypeInformation is the core class of the Flink type system.

In Flink, when data needs to be serialized, Flink calls the createSerializer() method of the data's TypeInformation to create a TypeSerializer, and the TypeSerializer provides the serialization and deserialization capabilities.

The serialization process of Flink is shown in the following figure:

[Figure: Flink serialization process]
For most data types, Flink can automatically generate the corresponding serializer, which serializes and deserializes data sets very efficiently, as shown in the following figure:

[Figure: serializers for Flink data types]
For example, BasicTypeInfo and WritableTypeInfo have their own serializers, but for GenericTypeInfo Flink uses Kryo for serialization and deserialization. Tuple, Pojo, and CaseClass types are composite types, and they may nest one or more data types. In that case their serializers are also composite: they delegate the serialization of the embedded types to the serializers of those types.

The following example illustrates how Flink serializes and deserializes data:

[Figure: serializing a Tuple3 that contains a Person object]

As shown in the figure above, the Tuple3 object created here has three fields: one of type int, one of type double, and one of type Person. The Person object contains two fields: id of type int and name of type String.

  • During serialization, the serialization of each field is delegated to the serializer for that field's type. The figure shows that Tuple3 serializes its int field with IntSerializer, which needs only four bytes.
  • The Person class is treated as a POJO. PojoSerializer stores some attribute information in one byte, and the fields are likewise serialized with their corresponding serializers. In the serialized result, all the data is backed by MemorySegments.
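The delegation in this example can be mimicked in plain Java. The sketch below is not Flink's TypeSerializer hierarchy. The following are all assumptions made for illustration: the Ser interface, the Triple record (which stands in for Tuple3<Integer, Double, Person>), the one-byte POJO header, and the length-prefixed UTF-8 string format. The sketch does show a composite serializer handing each field to the serializer for that field's type.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy re-creation of serializer delegation; NOT Flink's TypeSerializer classes,
// and the byte layout is our own.
final class SerializerSketch {
    interface Ser<T> {
        void serialize(T value, ByteBuffer out);
        T deserialize(ByteBuffer in);
    }

    record Person(int id, String name) {}
    // Stand-in for Tuple3<Integer, Double, Person>.
    record Triple(int f0, double f1, Person f2) {}

    static final Ser<Integer> INT = new Ser<>() {
        public void serialize(Integer v, ByteBuffer out) { out.putInt(v); } // 4 bytes
        public Integer deserialize(ByteBuffer in) { return in.getInt(); }
    };

    static final Ser<Double> DOUBLE = new Ser<>() {
        public void serialize(Double v, ByteBuffer out) { out.putDouble(v); } // 8 bytes
        public Double deserialize(ByteBuffer in) { return in.getDouble(); }
    };

    // Length-prefixed UTF-8 (our own format).
    static final Ser<String> STRING = new Ser<>() {
        public void serialize(String v, ByteBuffer out) {
            byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
            out.putInt(bytes.length);
            out.put(bytes);
        }
        public String deserialize(ByteBuffer in) {
            byte[] bytes = new byte[in.getInt()];
            in.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    };

    // POJO-style serializer: one header byte, then each field via its own serializer.
    static final Ser<Person> PERSON = new Ser<>() {
        public void serialize(Person p, ByteBuffer out) {
            out.put((byte) 0); // hypothetical flag byte
            INT.serialize(p.id(), out);
            STRING.serialize(p.name(), out);
        }
        public Person deserialize(ByteBuffer in) {
            in.get(); // skip the flag byte
            int id = INT.deserialize(in);
            String name = STRING.deserialize(in);
            return new Person(id, name);
        }
    };

    // Composite serializer: delegates every field, in order.
    static final Ser<Triple> TRIPLE = new Ser<>() {
        public void serialize(Triple t, ByteBuffer out) {
            INT.serialize(t.f0(), out);
            DOUBLE.serialize(t.f1(), out);
            PERSON.serialize(t.f2(), out);
        }
        public Triple deserialize(ByteBuffer in) {
            int f0 = INT.deserialize(in);
            double f1 = DOUBLE.deserialize(in);
            Person f2 = PERSON.deserialize(in);
            return new Triple(f0, f1, f2);
        }
    };

    static byte[] toBytes(Triple t) {
        ByteBuffer buf = ByteBuffer.allocate(1024); // large enough for this demo
        TRIPLE.serialize(t, buf);
        buf.flip();
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    static Triple fromBytes(byte[] bytes) {
        return TRIPLE.deserialize(ByteBuffer.wrap(bytes));
    }
}
```

For (1, 2.0, Person(7, "ab")), toBytes produces 23 bytes: 4 for the int, 8 for the double, and 1 + 4 + (4 + 2) for the Person. fromBytes then restores an equal object.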

What role does MemorySegment play?

In Flink, objects are serialized into pre-allocated memory blocks called MemorySegments. A MemorySegment represents one fixed-length block of memory, 32 KB by default. It is the smallest unit of memory allocation in Flink and is roughly equivalent to a byte array in Java. Each record is stored in serialized form in one or more MemorySegments.
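Records living inside fixed-size segments can be illustrated with a toy writer. It appends serialized records to 32 KB byte arrays, and a record that does not fit continues in the next segment. The class and method names are hypothetical. Real MemorySegments can also wrap off-heap memory; here they are plain byte[] arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: records are written back to back into fixed-size byte[] "segments";
// a record that does not fit in the current segment continues in the next one.
final class SegmentWriterSketch {
    static final int SEGMENT_SIZE = 32 * 1024; // 32 KB, the default mentioned above

    private final List<byte[]> segments = new ArrayList<>();
    private int position = SEGMENT_SIZE; // forces allocation of the first segment

    void write(byte[] record) {
        int offset = 0;
        while (offset < record.length) {
            if (position == SEGMENT_SIZE) { // current segment is full: take a new one
                segments.add(new byte[SEGMENT_SIZE]);
                position = 0;
            }
            int n = Math.min(record.length - offset, SEGMENT_SIZE - position);
            System.arraycopy(record, offset, segments.get(segments.size() - 1), position, n);
            position += n;
            offset += n;
        }
    }

    int segmentCount() {
        return segments.size();
    }
}
```

A 40,000-byte record fills one segment and spills 7,232 bytes into a second one, so segmentCount() returns 2 after that write.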

51. Why does Flink manage its own memory instead of relying on JVM memory management?

Because the JVM runs into many problems when large amounts of data are kept in memory (for caching and efficient processing), including the following:

  • Java objects have low storage density. An object in memory consists of 3 main parts: the object header, the instance data, and alignment padding. For example, an object containing only a boolean field takes 16 bytes: the object header occupies 8 bytes, the boolean field 1 byte, and another 7 bytes of padding round the size up to a multiple of 8 (the exact header size depends on the JVM). The information itself needs only 1 bit (1/8 byte).
  • Full GC can severely hurt performance. Especially for a JVM with a large heap for processing large amounts of data, GC (Garbage Collection) pauses can last seconds or even minutes.
  • OOM problems hurt stability. Memory overflow (OutOfMemoryError) is a problem often encountered in distributed computing frameworks. When the total size of all objects in the JVM exceeds the memory allocated to it, an OutOfMemoryError occurs and can crash the JVM, which affects both the robustness and the performance of the distributed framework.
  • Cache misses. When the CPU computes, it reads data from its caches. Modern CPUs have multi-level caches and load data in units of cache lines. If objects are stored contiguously, cache misses drop sharply, so the CPU spends its time on actual processing instead of idling.
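The density arithmetic above can be checked with a few lines of Java. The header sizes are inputs rather than facts about a particular JVM: 8 bytes as in the text, while 12 bytes is typical for a 64-bit JVM with compressed class pointers. The sketch also ignores the references that point at each object. The class name is hypothetical.

```java
// Worked arithmetic for object density; header sizes are caller-supplied assumptions.
final class ObjectDensity {
    // Rounds a raw object size up to the JVM's 8-byte alignment.
    static long aligned(long rawBytes) {
        return (rawBytes + 7) / 8 * 8;
    }

    // Shallow size of an object holding a single boolean field.
    static long booleanObjectSize(int headerBytes) {
        return aligned(headerBytes + 1);
    }

    // Bytes needed for n flags stored as separate objects (references ignored).
    static long asObjects(long n, int headerBytes) {
        return n * booleanObjectSize(headerBytes);
    }

    // Bytes needed for n flags packed as bits.
    static long asBits(long n) {
        return (n + 7) / 8;
    }
}
```

Either header size gives 16 bytes per boolean object. One million flags therefore take 16,000,000 bytes as objects, versus 125,000 bytes as packed bits.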

52. How does Flink's self-managed memory manage objects?

Flink does not store large numbers of objects in heap memory. Instead, it serializes objects into pre-allocated memory blocks called MemorySegments. A MemorySegment represents a fixed length of memory (32 KB by default), is the smallest unit of memory allocation in Flink, and provides very efficient read and write methods. Many operations can work directly on the binary data without deserializing it. Each record is stored in serialized form in one or more MemorySegments. If more data needs to be processed than fits in memory, Flink's operators spill part of the data to disk.
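One way to operate on binary data without deserializing it is to encode keys so that comparing their raw bytes gives the same order as comparing the values. The sketch below does this for int keys: it writes them big-endian with the sign bit flipped. It is inspired by the normalized-key idea Flink uses when sorting, but the class and its encoding are an illustration, not Flink's code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustration of comparing serialized keys directly; not Flink's implementation.
final class BinaryKeySketch {
    // Big-endian bytes with the sign bit flipped: unsigned byte order == signed int order.
    static byte[] encode(int key) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(key ^ Integer.MIN_VALUE).array();
    }

    static int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ Integer.MIN_VALUE;
    }

    // Sorts keys by comparing only their serialized bytes.
    static int[] sortViaBytes(int[] keys) {
        byte[][] encoded = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) {
            encoded[i] = encode(keys[i]);
        }
        // No decoding happens during the sort: only raw bytes are compared.
        Arrays.sort(encoded, (a, b) -> Arrays.compareUnsigned(a, b));
        int[] sorted = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = decode(encoded[i]);
        }
        return sorted;
    }
}
```

sortViaBytes(new int[]{3, -5, 0, 42, -1}) yields [-5, -1, 0, 3, 42], even though the comparator never turns the bytes back into ints.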

53. Tell me about the Flink memory model?

Flink's overall memory class diagram is as follows:

[Figure: Flink memory class diagram]
It mainly includes the JobManager memory model and the TaskManager memory model.

(1) JobManager memory model

[Figure: JobManager memory model]

In 1.10, Flink unified the memory management and configuration on the TM (TaskManager) side. Correspondingly, in 1.11 Flink reworked the memory configuration on the JM (JobManager) side, making its options and configuration methods consistent with those on the TM side.

[Figure]
(2) TaskManager memory model

Flink 1.10 made major changes to the TaskManager memory model and its configuration options, allowing users to control the memory overhead of their applications more tightly.

[Figures: TaskManager memory model]

  • JVM Heap (JVM heap memory)

    • Framework Heap Memory: the heap memory used by the Flink framework itself, that is, the heap memory occupied by the TaskManager itself; it is not counted toward slot resources. Configuration parameter: taskmanager.memory.framework.heap.size, default 128 MB.
    • Task Heap Memory: the heap memory used by tasks when executing user code. Configuration parameter: taskmanager.memory.task.heap.size.
  • Off-Heap Memory

    • Direct Memory
      • Framework Off-Heap Memory: the off-heap memory used by the Flink framework itself, that is, the off-heap memory occupied by the TaskManager itself; it is not counted toward slot resources. Configuration parameter: taskmanager.memory.framework.off-heap.size, default 128 MB.
      • Task Off-Heap Memory: the off-heap memory used by tasks when executing user code. Configuration parameter: taskmanager.memory.task.off-heap.size, default 0.
      • Network Memory: off-heap memory used for network data exchange, such as network buffers.
    • Managed Memory: off-heap memory managed by Flink, used for sorting, hash tables, caching intermediate results, and the native memory of the RocksDB State Backend.
  • JVM Specific Memory (memory used by the JVM itself)

    • JVM Metaspace: configuration parameter taskmanager.memory.jvm-metaspace.size.
    • JVM Overhead: memory the JVM itself needs at runtime, including thread stacks, IO, and the code cache. Configuration parameters: taskmanager.memory.jvm-overhead.min = 192MB, taskmanager.memory.jvm-overhead.max = 1GB, taskmanager.memory.jvm-overhead.fraction = 0.1.
  • Overall memory

    • Total Process Memory: the total memory consumed by the whole process, that is, the Flink Java application (including user code) plus the JVM running it. Total process memory = total Flink memory + JVM metaspace + JVM overhead. Configuration item: taskmanager.memory.process.size: 1728m.

    • Total Flink Memory: the memory consumed by the Flink Java application only, including user code but excluding the memory the JVM allocates for its own operation. Total Flink memory = framework heap + framework off-heap + task heap + task off-heap + network memory + managed memory.
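The formulas above can be turned into a rough calculator that derives each component from taskmanager.memory.process.size. The defaults it hard-codes are assumptions to verify against your Flink version:

  • metaspace: 256 MB
  • JVM overhead: 0.1 of process memory, clamped to [192 MB, 1 GB]
  • network memory: 0.1 of total Flink memory, clamped to [64 MB, 1 GB]
  • managed memory: 0.4 of total Flink memory
  • framework heap and framework off-heap: 128 MB each
  • task off-heap: 0

All values are in MB and rounded, unlike Flink's byte-exact computation. The class name is hypothetical.

```java
// Rough re-derivation of the TaskManager memory split, in MB; the hard-coded
// defaults are assumptions to check against your Flink version.
final class TmMemorySketch {
    record Split(long flinkTotal, long network, long managed,
                 long taskHeap, long jvmHeap, long directMemory) {}

    static long clamp(long value, long min, long max) {
        return Math.max(min, Math.min(max, value));
    }

    static Split split(long processMb) {
        long metaspace = 256;                                          // assumed default
        long overhead = clamp(Math.round(processMb * 0.1), 192, 1024); // fraction of process memory
        long flinkTotal = processMb - metaspace - overhead;

        long frameworkHeap = 128;
        long frameworkOffHeap = 128;
        long taskOffHeap = 0;
        long network = clamp(Math.round(flinkTotal * 0.1), 64, 1024);  // fraction of Flink memory
        long managed = Math.round(flinkTotal * 0.4);                   // fraction of Flink memory
        // Task heap gets whatever is left of total Flink memory.
        long taskHeap = flinkTotal - frameworkHeap - frameworkOffHeap
                - taskOffHeap - network - managed;

        long jvmHeap = frameworkHeap + taskHeap;                       // -Xmx / -Xms
        long directMemory = frameworkOffHeap + taskOffHeap + network;  // -XX:MaxDirectMemorySize
        return new Split(flinkTotal, network, managed, taskHeap, jvmHeap, directMemory);
    }
}
```

For 1728m the calculator yields:

  • 1280 MB of total Flink memory
  • 128 MB of network memory
  • 512 MB of managed memory
  • 384 MB of task heap
  • a 512 MB JVM heap
  • 256 MB of direct memory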

54. How does Flink manage resources?

Flink's resource management has two layers: cluster resources and Flink's own resources. For cluster resources, Flink supports mainstream resource management systems such as Yarn, Mesos, and K8s, as well as independently started Standalone clusters. Flink's own resources concern the resource usage of each task (subtask) and are maintained by Flink itself.

1. Analysis of Cluster Architecture

A running Flink cluster mainly consists of a client, a JobManager (hereinafter JM), and one or more TaskManagers (TM or Worker).

[Figure: Flink cluster architecture]

  • Client: The client is mainly used to submit jobs to the cluster. In Session or Per-Job mode, the client program is also responsible for parsing the user code and generating the JobGraph; in Application mode, it only needs to submit the user jar and execution parameters. The client generally supports two modes: in detached mode, the client exits automatically after submitting; in attached mode, the client blocks after submitting and waits for the job to finish before exiting.
  • JobManager: The JM decides when to schedule tasks and what to do when a task finishes or fails, and it coordinates checkpoints and failure recovery. The process mainly consists of the following parts:
    • ResourceManager: Responsible for requesting, releasing, and managing slots (the finest-grained unit of resource management in a Flink cluster). Flink provides several RM implementations to adapt to different resource management frameworks, such as Yarn, Mesos, K8s, or Standalone. In Standalone mode, the RM can only allocate slots and cannot start new TMs. Note: this RM is not the same thing as the ResourceManager in Yarn; the RM here is an independent service inside the JM.
    • Dispatcher: Provides a REST interface for submitting jobs to Flink, starts a new JobMaster for each submitted job, serves the Web UI for all jobs, and lets users query job execution status.
    • JobMaster: Responsible for managing the execution of a single JobGraph. Multiple jobs can run in a cluster at the same time, each with its own JobMaster. Note the difference between JobMaster and JobManager.
  • TaskManager: Also known as a Worker, the TM executes the tasks of the dataflow graph and buffers and exchanges data. A cluster has at least one TM. The smallest unit of resource management in a TM is the slot, and each slot can execute one task, so the number of slots in a TM is the number of tasks it can execute concurrently.

2. Slot and resource management

Each TM is an independent JVM process that executes one or more tasks in separate threads. To control the resources available to each task, the TM manages them with task slots. Each task slot represents a fixed share of the TM's resources; for example, if a TM has 3 slots, each slot gets 1/3 of the memory resources, and tasks in different slots do not compete for that memory. Note that GPUs are not isolated at present; slots currently divide only memory resources.

For example, after the following dataflow graph is expanded into a parallel dataflow graph, each task may be split into multiple subtasks that run in parallel in the cluster. Operator chaining combines several different subtasks so that they run in one thread, without frequently releasing and requesting threads. Chaining also lets data be buffered together, which increases data processing throughput and reduces processing latency.

In Flink, several conditions must be met for different subtasks to be chained together:

  • The downstream node has exactly 1 input edge (guaranteeing that there is no data shuffle);
  • Neither the upstream nor the downstream subtask is empty;
  • The chaining strategy is ALWAYS;
  • The partition type is ForwardPartitioner;
  • The parallelism is the same;
  • Chaining is enabled in Flink.
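The chaining conditions listed above translate directly into a boolean check. The Edge record, its fields, and the local ChainingStrategy enum are hypothetical stand-ins that mirror the list. They are not Flink's internal classes. The real check lives in Flink's job graph generator, which also considers other factors such as slot sharing groups.

```java
// Hypothetical, simplified model of the chaining check; mirrors the list above only.
final class ChainingSketch {
    enum ChainingStrategy { ALWAYS, NEVER, HEAD }

    record Edge(int downstreamInDegree, boolean upstreamPresent, boolean downstreamPresent,
                ChainingStrategy downstreamStrategy, String partitioner,
                int upstreamParallelism, int downstreamParallelism, boolean chainingEnabled) {}

    static boolean canChain(Edge e) {
        return e.downstreamInDegree() == 1                              // single input, no shuffle
                && e.upstreamPresent() && e.downstreamPresent()         // both subtasks exist
                && e.downstreamStrategy() == ChainingStrategy.ALWAYS    // chaining strategy
                && "ForwardPartitioner".equals(e.partitioner())         // forward partitioning
                && e.upstreamParallelism() == e.downstreamParallelism() // same parallelism
                && e.chainingEnabled();                                 // chaining not disabled
    }
}
```

A forward edge between two operators with parallelism 4 can be chained. Switching the partitioner to a rebalance, or the downstream parallelism to 2, makes canChain return false.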

[Figure: dataflow graph expanded into a parallel dataflow graph]
The execution graph in the cluster may look as follows:
[Figure: execution graph in the cluster]
Flink also supports slot sharing: different tasks are assigned to the same slot according to their dependencies. This brings several benefits. It makes it easy to work out the maximum resources the current job needs (the maximum parallelism among its subtasks), and it avoids requesting and releasing slots too often, which improves slot utilization.

[Figure: slot sharing]
With slot sharing, a single slot can hold a complete execution pipeline of the job.
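The resource benefit of slot sharing can be shown with simple counting. Assume that all operators are in one slot sharing group, and ignore chaining. Then a job needs as many slots as its largest operator parallelism, whereas without sharing every subtask would need its own slot. The class name is hypothetical.

```java
import java.util.Arrays;

// Simplified slot counting for one job; assumes a single slot sharing group.
final class SlotCountSketch {
    // With sharing, one slot can hold one subtask of every operator,
    // so the widest operator decides the slot count.
    static int slotsWithSharing(int... parallelisms) {
        return Arrays.stream(parallelisms).max().orElse(0);
    }

    // Without sharing (and ignoring chaining), every subtask needs its own slot.
    static int slotsWithoutSharing(int... parallelisms) {
        return Arrays.stream(parallelisms).sum();
    }
}
```

For operator parallelisms of 2, 2, 6 and 1, the job needs 6 slots with sharing and 11 without.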

3. Application execution

A Flink application is a user-written main function that may contain one or more Flink jobs. These jobs can run locally or on a remote cluster, and the cluster can either be long-running or be started for a single job. The following job submission schemes are currently supported:

  • Session Cluster
    • Lifecycle: The cluster is created in advance and runs for a long time, and the client connects to it when submitting a job. Even after all jobs have finished, the cluster keeps running until it is stopped manually, so its lifecycle is independent of the jobs.
    • Resource isolation: TM slots are requested by the RM and released automatically when the jobs running on them finish. Because multiple jobs share the same cluster, they compete for resources such as network bandwidth. If a TM dies, all jobs on it fail.
    • Other aspects: Creating the cluster in advance means you do not have to deal with cluster setup every time you use it. This mode suits scenarios where jobs run for a very short time and fast startup is important, such as interactive query analysis.
  • Per-Job Cluster
    • Lifecycle: A separate cluster is created for each submitted job. When the client submits a job, it communicates directly with the ClusterManager to create a JM, which runs the submitted job. TMs are requested lazily, according to the resources the job needs. Once the job finishes, the cluster is reclaimed.
    • Resource isolation: A fatal problem in one job affects only that job.
    • Other aspects: Because the RM has to request and wait for resources, startup takes somewhat longer. This mode suits jobs that are relatively large, run for a long time, need long-term stability, and are not sensitive to startup time.
  • Application Cluster
    • Lifecycle: Similar to Per-Job, except that the main method runs in the cluster. Submission is very simple: there is no need to start or connect to a cluster first. The application is packaged directly into the resource management system, and the corresponding EntryPoint is started. The EntryPoint then calls the user program's main method, parses it to generate the JobGraph, and starts execution. The cluster's lifecycle is the same as the application's.
    • Resource isolation: The RM and Dispatcher are application-level.


Origin blog.csdn.net/be_racle/article/details/132384103