In-depth understanding of Apache Flink core technology

Apache Flink (hereinafter referred to as Flink) is a rising star in the field of big data processing, and its many features that distinguish it from other big data projects have attracted growing attention. This article analyzes some of Flink's key technologies and features in depth, in the hope of giving readers a deeper understanding of Flink; it should also be useful to developers of other big data systems. The article assumes that the reader has some familiarity with big data processing frameworks such as MapReduce, Spark, and Storm, and knows the basic concepts of stream processing and batch processing.

Introduction to Flink

At its core, Flink is a streaming dataflow execution engine that provides data distribution, data communication, and fault tolerance for distributed computations over data streams. On top of this engine, Flink provides several higher-level APIs for writing distributed jobs:

DataSet API, which performs batch operations on static data, abstracting static data as distributed data sets. Users can conveniently process these data sets with the various operators Flink provides. Supported in Java, Scala, and Python.

DataStream API, which performs stream processing on data streams, abstracting streaming data as distributed data streams over which users can conveniently apply various operations. Supported in Java and Scala.

Table API, which performs query operations on structured data, abstracting structured data as relational tables that can be queried with an SQL-like DSL. Supported in Java and Scala.
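
As a brief illustration of the DataStream API, the following is a minimal word-count sketch in Java; the socket source on localhost:9999 and the job name are only assumed example inputs, not part of the original article:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed example source: lines of text arriving on a local socket.
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = text
            .flatMap(new Tokenizer())   // split lines into (word, 1) pairs
            .keyBy(0)                   // group by the word
            .sum(1);                    // running count per word

        counts.print();
        env.execute("Streaming WordCount");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
```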

In addition, Flink also provides domain libraries for specific application areas, such as:

Flink ML, Flink's machine learning library, provides the Machine Learning Pipelines API and implements various machine learning algorithms.

Gelly, Flink's graph computing library, provides APIs related to graph computing and the implementation of various graph computing algorithms.

Flink's technology stack is shown in Figure 1:

Figure 1 Flink technology stack

In addition, Flink integrates easily with other projects in the Hadoop ecosystem. For example, Flink can read static data stored in HDFS or HBase, use Kafka as a streaming data source, directly reuse MapReduce or Storm code, and request cluster resources from YARN.
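
As an example of such integration, the sketch below consumes a Kafka topic as a DataStream. The connector class names vary with the Flink and Kafka connector versions (older releases use versioned classes such as FlinkKafkaConsumer010), and the broker address, group id, and topic name here are assumptions for illustration only:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Connection properties for an assumed local Kafka broker.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        // Consume a hypothetical "events" topic as a stream of strings.
        DataStream<String> events = env.addSource(
            new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        events.print();
        env.execute("Kafka source example");
    }
}
```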

Unified batch and stream processing system

In the field of big data processing, batch jobs and stream processing jobs are generally considered two different kinds of workloads, and a big data project is usually designed to handle only one of them. For example, Apache Storm and Apache Samza only support stream processing, while Apache MapReduce, Apache Tez, and Apache Spark only support batch processing. Spark Streaming, a subsystem on top of Apache Spark that supports stream processing, may look like an exception, but it is not: Spark Streaming adopts a micro-batch architecture, in which the input stream is divided into fine-grained batches and a Spark batch job is submitted for each batch. Spark Streaming therefore processes streaming data on top of Spark's batch engine, which is fundamentally different from the purely streaming approach of systems such as Apache Storm and Apache Samza. Thanks to its flexible execution engine, Flink supports both batch and stream processing jobs.

At the execution-engine level, the biggest difference between a stream processing system and a batch system is how data is transferred between nodes. In a typical stream processing system, as soon as a record is processed it is serialized into a buffer and immediately sent over the network to the next node, which continues processing. In a typical batch system, a processed record is serialized into a buffer but not sent immediately; when the buffer is full it is persisted, and only after all data has been processed are the results transferred over the network to the next node. These two transfer models are two extremes, corresponding to the low-latency requirement of stream processing systems and the high-throughput requirement of batch systems.

Flink's execution engine is flexible enough to support both models. Flink transfers data over the network in units of fixed-size buffer blocks, and users control when a block is sent via a buffer timeout. With a timeout of 0, Flink's data transfer behaves like the stream processing model above and the system achieves the lowest processing latency; with an infinite timeout, it behaves like the batch model above and the system achieves the highest throughput. The timeout can also be set to any value between 0 and infinity: the smaller the timeout, the lower the latency of Flink's stream processing engine but the lower the throughput, and vice versa. By tuning the buffer timeout, users can flexibly trade off latency against throughput according to their needs.

Figure 2 Flink execution engine data transfer mode
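
The buffer timeout described above is exposed on the StreamExecutionEnvironment in the DataStream API. A minimal sketch of the three settings (the 100 ms value is only an illustrative choice):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 0: flush a buffer as soon as it contains any data -> lowest latency, lower throughput.
        env.setBufferTimeout(0);

        // -1: flush a buffer only when it is full -> batch-like behavior, highest throughput.
        env.setBufferTimeout(-1);

        // Any value in between trades latency for throughput, e.g. flush at the latest every 100 ms.
        env.setBufferTimeout(100);
    }
}
```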

Based on this unified streaming execution engine, Flink supports both stream computing and batch processing while still guaranteeing performance in terms of latency, throughput, and so on. Compared with running separate native stream processing and batch systems, performance does not suffer from the unified execution engine, and the cost of installation, deployment, monitoring, and maintenance for users is greatly reduced.

Fault tolerance mechanism of Flink stream processing

In a distributed system, it is common for a single process or node to crash and cause an entire job to fail. Recovering automatically from such failures without losing user data is a feature a distributed system must support. This section introduces the task-level fault tolerance mechanism of Flink's stream processing system.

Implementing fault tolerance is easier in a batch system: since files can be read repeatedly, a failed task can simply be restarted. In a stream processing system, however, the data source is an unbounded stream; after a job has been running for months, it is essentially impossible to cache or persist all of the data for replay. Flink therefore implements fault tolerance based on distributed snapshots together with data sources that can re-send part of their data. Users can configure the interval at which snapshots of the entire job are taken; when a task fails, Flink restores the whole job to the latest snapshot and replays the data after that snapshot from the data sources. Flink's distributed snapshot implementation draws on the distributed snapshot paper published by Chandy and Lamport in 1985. Its main ideas are as follows:
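
In the DataStream API, the snapshot (checkpoint) interval is configured on the execution environment. A brief sketch, where the 60-second interval is only an example value:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a distributed snapshot of the whole job every 60 seconds.
        env.enableCheckpointing(60_000);

        // Exactly-once processing is the default mode; shown here explicitly.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    }
}
```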

According to the user-defined snapshot interval, Flink periodically inserts special snapshot marker messages into all data sources. These marker messages flow through the DAG like other messages, but are not processed by user-defined business logic. Each snapshot marker message divides the stream it travels in into two parts: data belonging to the current snapshot and data belonging to the next snapshot.

Figure 3 Flink's message flow with snapshot marker messages

A snapshot marker message flows through each operator along the DAG. When an operator receives the marker, it takes a snapshot of its own state and stores it. When an operator has multiple inputs, Flink buffers the input on which the marker arrives first, together with the messages that follow it; only when the markers for the current snapshot have arrived on all inputs does the operator snapshot and store its state, after which it processes the buffered messages that followed the markers. Operators can snapshot and store their state asynchronously and incrementally, without blocking message processing. The distributed snapshot process is shown in Figure 4:

Figure 4 Flink distributed snapshot flowchart

When all Data Sinks (terminal operators) have received the snapshot marker messages and have snapshotted and stored their own state, the distributed snapshot is complete, and the data sources are notified that they may release all messages preceding the snapshot marker. If a node later crashes or some other failure occurs, it is only necessary to restore the previously stored distributed snapshot state and to replay the messages after the snapshot from the data sources.

Exactly-Once is a very important property for a stream processing system to support: it guarantees that each message is processed by the system exactly once, and the business logic of many streaming jobs depends on it. Compared with At-Least-Once or At-Most-Once, Exactly-Once places stricter requirements on the stream processing system and is harder to implement. Flink implements its Exactly-Once guarantee on top of distributed snapshots.

Compared with the fault-tolerant solutions of other stream processing systems, Flink's distributed snapshot-based solution has many advantages in terms of functionality and performance, including:

Low latency. Since the storage of operator state can be asynchronous, the process of taking a snapshot basically does not block the processing of messages and therefore does not negatively affect message latency.

High throughput. When operator state is small, throughput is essentially unaffected. When operator state is large, the snapshot interval is user-defined, so users can tune it to balance error recovery time against throughput requirements.

Isolation from business logic. Flink's distributed snapshot mechanism is completely isolated from the user's business logic, and the user's business logic will not depend on or have any impact on the distributed snapshot.

Tunable error recovery cost. The shorter the snapshot interval, the shorter the error recovery time, at the cost of lower throughput, and vice versa.

Time windows for Flink stream processing

In a stream processing system there is no upper bound on the number of incoming messages, so for operations such as aggregations or joins the system must segment the incoming messages and then aggregate or join within each segment. Such a segment of messages is called a window. Stream processing systems support many kinds of windows; the most common is the time window, which segments messages based on time intervals. This section introduces the different kinds of time windows supported by the Flink stream processing system.
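
For example, a keyed 10-second tumbling time window that sums a counter field might look like the following sketch; the input stream, its tuple layout, and the window length are assumptions for illustration:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExample {
    // Sum the per-key counters within each 10-second tumbling window.
    public static DataStream<Tuple2<String, Integer>> windowedCounts(
            DataStream<Tuple2<String, Integer>> events) {
        return events
            .keyBy(0)
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
            .sum(1);
    }
}
```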

In most current stream processing systems, time windows are divided according to the local clock of the node running the task. This is easy to implement and does not cause blocking, but it cannot satisfy some application requirements, for example:

Messages themselves carry timestamps, and the user wants to segment them according to these message timestamps.

Because clocks may differ across nodes, and messages experience different delays at each node, messages that belong to the same time window at one node may be split into different time windows when they flow to the next node, producing unexpected results.

Flink supports 3 types of time windows, which are suitable for users' requirements for different types of time windows:

Operator Time. Time windows are divided according to the local clock of the node running the task.

Event Time. Messages carry their own timestamps and are processed according to those timestamps, which guarantees that all messages whose timestamps fall in the same time window are processed correctly. Since messages may arrive at a task out of order, the task must keep the processing state of the current time window until it can confirm that all messages belonging to that window have been processed, and only then release it. If the delay of out-of-order messages is high, this affects the throughput and latency of the distributed system.

Ingress Time. Sometimes messages carry no timestamps, but the user still wants to divide time windows based on the messages rather than on node clocks, for example to avoid the second problem described above. In this case Flink can automatically assign monotonically increasing timestamps to messages as they enter the system at the source; the subsequent processing is the same as with Event Time. Ingress Time can be regarded as a special case of Event Time: because the timestamps are assigned in order at the source, out-of-order delays inside the stream processing system are lower than with Event Time, so the impact on the throughput and latency of the Flink system is also smaller.
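
In Flink's DataStream API these three options correspond to the TimeCharacteristic values ProcessingTime, EventTime, and IngestionTime. A sketch of selecting each of them (in a real job only one would be chosen):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeCharacteristicExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Operator Time (processing time in the API): windows follow the local node clock.
        env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

        // Event Time: windows follow the timestamps carried by the messages themselves.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Ingress Time (ingestion time in the API): timestamps are assigned at the source.
        env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
    }
}
```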

Implementation of Event Time windows

Flink draws on Google's MillWheel project to support Event Time-based time windows through WaterMark.

When an operator processes data in an Event Time window, it must determine that all messages belonging to that window have arrived before it can start processing. But because messages may arrive out of order, the operator cannot directly know when all messages of a window have arrived. A WaterMark carries a timestamp and marks the fact that all messages with smaller timestamps have already flowed in: once a Flink data source has confirmed that all messages with timestamps below a certain value have been emitted into the stream processing system, it inserts a WaterMark carrying that timestamp into the message stream. Flink operators buffer incoming messages by time window; when an operator receives a WaterMark, it processes all time windows whose data lies below the WaterMark timestamp, sends the results to the next operator, and then also forwards the WaterMark to the next operator node.
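
For illustration, timestamps and WaterMarks are attached to a stream with assignTimestampsAndWatermarks. The sketch below assumes a hypothetical event type MyEvent that carries its own timestamp, and allows up to 5 seconds of out-of-order delay:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WatermarkExample {
    // Hypothetical event type carrying its own timestamp.
    public static class MyEvent {
        private final long timestamp;
        public MyEvent(long timestamp) { this.timestamp = timestamp; }
        public long getTimestamp() { return timestamp; }
    }

    // Emitted WaterMarks lag 5 seconds behind the largest timestamp seen so far.
    public static DataStream<MyEvent> withWatermarks(DataStream<MyEvent> events) {
        return events.assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(5)) {
                @Override
                public long extractTimestamp(MyEvent event) {
                    return event.getTimestamp();
                }
            });
    }
}
```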

To guarantee that all messages belonging to a time window can be processed, an operator must wait for a WaterMark larger than that window before it can start processing the window. Compared with Operator Time windows, this requires Flink to hold more memory and directly increases message processing latency. A possible optimization is that aggregation operators can pre-aggregate some of the messages: when new messages belonging to the window arrive, computation continues from the previous partial aggregate, so only intermediate results need to be cached rather than all messages of the window.

For an operator based on Event Time windows, the simplest and ideal situation is that the timestamps of incoming WaterMarks match the clock of the current node, but this is impossible in practice: due to message disorder and the processing speed of upstream nodes, there will always be messages whose arrival time is later than their own timestamps. The difference between the actual WaterMark timestamp and the ideal WaterMark timestamp is called Time Skew, as shown in Figure 5:

Figure 5 Time Skew diagram of WaterMark

Time Skew determines how long all data in the window between one WaterMark and the previous one must be cached: the larger the Time Skew, the longer the data of a time window is delayed and the longer it occupies memory, which in turn negatively affects throughput.

Timestamp-based sorting

In a stream processing system, sorting messages is generally considered infeasible because the input is unbounded. In the Flink stream processing system, however, WaterMarks make timestamp-based global sorting possible. The idea is as follows: a sorting operator caches all incoming messages; when it receives a WaterMark, it sorts the cached messages whose timestamps are smaller than the WaterMark, sends them to the next node, and releases them from the operator. It then continues caching incoming messages and waits for the next WaterMark to trigger the next round of sorting.

Because a WaterMark guarantees that no message with a smaller timestamp will appear after it, the correctness of the ordering is guaranteed. Note that if the sorting operator runs with multiple parallel instances, each instance only guarantees that its own output is ordered; ordering across instances is not guaranteed. To achieve a global order, there can only be a single sorting operator instance.

By supporting Event Time-based message processing, Flink expands the application scope of its stream processing system, so that more stream processing tasks can be performed by Flink.

Customized memory management

The Flink project is implemented in JVM languages such as Java and Scala. The JVM itself is a general execution platform for many kinds of applications, and it manages Java objects with general-purpose strategies; its garbage collector manages memory by estimating the life cycles of Java objects.

For different types of applications, users may need to configure specific JVM parameters according to the application's characteristics in order to manage Java objects more efficiently and improve performance. This kind of JVM-tuning "black magic" requires deep knowledge of both the application itself and the JVM's many parameters, which considerably raises the tuning bar for a distributed computing platform. The Flink framework, by contrast, knows the computation logic and data flow of each step, so it understands the life cycles of its Java objects better than the JVM garbage collector does, making more efficient object management possible.

Problems with the JVM

Java object overhead

Compared with lower-level languages such as C/C++, Java objects have a relatively low storage density. For example [1], a simple string such as "abcd" needs 4 bytes in UTF-8 encoding, but Java, which stores strings in UTF-16, needs 8 bytes for the characters alone; with the object header and other metadata, a 4-character string object takes 48 bytes in Java. For most big data applications, memory is a scarce resource, and denser memory storage means higher CPU data access throughput and fewer spills to disk.

Cache miss caused by object storage structure

To bridge the gap between CPU processing speed and memory access speed [2], modern CPUs use multiple levels of cache. Data is loaded from memory into the cache in units of cache lines, so access is very efficient when the data the CPU needs is stored contiguously in memory. If the data the CPU needs is not in any currently cached cache line, it has to be loaded from memory; this is called a cache miss. When the cache miss rate is high, the CPU spends most of its time waiting for data to be loaded rather than actually processing it. Java objects are not stored contiguously in memory, and many Java data structures have poor data locality.

Garbage Collection for Big Data

Java's garbage collection has always been both loved and hated by Java developers. On the one hand, it frees developers from recycling resources by hand, improves productivity, and reduces the risk of memory leaks. On the other hand, garbage collection is an unpredictable time bomb in Java applications: pauses of seconds or even minutes can severely affect performance and availability. In today's data centers, large memories are common, with single machines carrying memory on the TB scale. Big data analytics typically traverses an entire source data set to transform, clean, and process the data, creating huge numbers of Java objects in the process, so the efficiency of JVM garbage collection has a large impact on performance. Improving garbage collection efficiency through JVM parameter tuning requires deep knowledge of the application, the distributed computing framework, and the JVM parameters, and sometimes even that is far from enough.

OOM problem

OutOfMemoryError is a problem that distributed computing frameworks frequently run into. When the total size of objects in the JVM exceeds the memory allocated to it, an OutOfMemoryError occurs and the JVM crashes, hurting both the robustness and the performance of the framework. Frameworks that manage memory through the JVM and try to avoid OOM usually have to estimate the sizes of Java objects and enforce thresholds in data structures that hold many objects. However, the JVM provides no official tool for measuring object sizes, and third-party libraries cannot always determine them accurately and portably [6]. Intrusive threshold checks also add a lot of code to the framework that has nothing to do with business logic.

Flink's processing strategy

In order to solve the problems mentioned above, high-performance distributed computing frameworks usually require the following technologies:

Custom serializers. The prerequisite for explicit memory management is serialization: Java objects are serialized into binary data and stored in memory (on heap or off-heap). General-purpose serialization frameworks, such as Java's default java.io.Serializable, include all of the meta information of a Java object and its member variables in the serialized data, so that the data contains everything needed for deserialization. This is necessary in some scenarios, but for a distributed computing framework like Flink this metadata is largely redundant. Customized serialization frameworks, such as Hadoop's org.apache.hadoop.io.Writable, require users to implement the interface and define the serialization and deserialization methods of each class themselves; this is the most efficient approach, but it puts extra work on the user and is not friendly enough.

Explicit memory management. The usual approach is to allocate and release memory in batches. Each JVM instance has a single memory manager through which all memory allocation and release goes. This avoids the common memory fragmentation problem, and because data is stored in binary form, it greatly reduces garbage collection pressure.

Cache-friendly data structures and algorithms. For computation-intensive data structures and algorithms, operate directly on the serialized binary data rather than deserializing objects first. At the same time, store only the data relevant to the operation contiguously, so that the L1/L2/L3 caches are used as effectively as possible, the probability of cache misses is reduced, and CPU throughput improves. Taking sorting as an example: since the main operation in sorting is key comparison, storing the keys of all records separately from their values and contiguously greatly improves the cache hit rate when accessing keys.

Custom serializer

A distributed computing framework can use custom serialization tools because the data streams it processes usually consist of records of the same type. Since the type of the data set's objects is fixed, only one copy of the schema information needs to be kept, saving a lot of storage space. For fixed-size types, member variables can also be accessed at fixed offsets: when a member variable of an object is needed, the custom serialization tool does not have to deserialize the whole Java object, but can deserialize just that member variable through its offset. If an object has many member variables, this greatly reduces the cost of creating Java objects and the amount of data copied in memory. Flink data sets support arbitrary Java or Scala types, and by automatically generating custom serializers Flink keeps its API user-friendly (there is no need to extend and implement an interface such as Hadoop's org.apache.hadoop.io.Writable) while achieving serialization efficiency comparable to Hadoop's.

Flink analyzes the type information of a data set and then automatically generates a customized serializer class. Flink supports arbitrary Java or Scala types: it uses the Java reflection framework to analyze the return types of UDFs (user-defined functions) in Java-based Flink programs, and the Scala compiler to analyze the return types of UDFs in Scala-based Flink programs. Type information is represented by the TypeInformation class, which has many concrete implementation classes, for example:

BasicTypeInfo: any Java primitive type (boxed or unboxed) and the String type.

BasicArrayTypeInfo: arrays of any Java primitive type (boxed or unboxed) and String arrays.

WritableTypeInfo: any implementation of Hadoop's Writable interface.

TupleTypeInfo: any Flink tuple type (Tuple1 through Tuple25 are supported). Flink tuples are fixed-length, fixed-type Java tuple implementations.

CaseClassTypeInfo: any Scala case class (including Scala tuples).

PojoTypeInfo: any POJO (Java or Scala) whose member variables are all either declared public or accessible through getter/setter methods.

GenericTypeInfo: any class that does not match any of the previous types.

The first six categories of data types cover the vast majority of Flink programs. For these, Flink automatically generates a corresponding TypeSerializer, which serializes and deserializes the data sets very efficiently. For the seventh category, Flink falls back to Kryo for serialization and deserialization. In addition, for types that can be used as keys, Flink also automatically generates a TypeComparator, which supports comparing, hashing, and other operations directly on the serialized binary data. For composite types such as Tuple, CaseClass, and POJO, the automatically generated TypeSerializer and TypeComparator are themselves composite and delegate the serialization/deserialization of each member to that member's own TypeSerializer and TypeComparator, as shown in Figure 6:

Figure 6 Flink composite type serialization
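
For illustration, the type information of a composite type and the serializer Flink generates for it can be obtained as in the following sketch:

```java
import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.api.java.tuple.Tuple2;

public class TypeInfoExample {
    public static void main(String[] args) {
        // TupleTypeInfo for Tuple2<String, Integer>, derived automatically from the type hint.
        TypeInformation<Tuple2<String, Integer>> typeInfo =
            TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {});

        // The composite serializer delegates to the serializers of its String and Integer members.
        TypeSerializer<Tuple2<String, Integer>> serializer =
            typeInfo.createSerializer(new ExecutionConfig());

        System.out.println(typeInfo + " -> " + serializer.getClass().getSimpleName());
    }
}
```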

In addition, if necessary, users can plug in their own serialization tools by extending the TypeInformation interface.

Explicit memory management

Garbage collection is an unavoidable part of JVM memory management. The G1 collector in JDK 8 improves the efficiency and availability of JVM garbage collection, but it is still far from sufficient for real big data workloads. This also conflicts with the way distributed frameworks are evolving: more and more distributed computing frameworks want to keep as much of the data set as possible in memory, while JVM garbage collection works best when the Java objects in memory are few and short-lived. OutOfMemoryError is also hard to avoid when memory is managed through the JVM; moreover, because the data of Java objects is not necessarily stored contiguously in memory, an OutOfMemoryError may occur even when the total size of all Java objects does not exceed the memory allocated to the JVM. Flink divides memory into three regions, each with a different purpose:

Network buffers: a number of buffers, each a 32KB byte array, used mainly by the network module for transmitting data over the network.

Memory Manager pool: a large pool of memory in units of 32KB byte arrays. All runtime algorithms (such as Sort/Shuffle/Join) request memory from this pool, store their serialized data in it, and release it back to the pool when they are done.

Remaining (Free) Heap: the rest of the heap, reserved for Java objects created by user code in UDFs and managed by the JVM.

Network buffers serve Flink's Netty-based network transport and need no further explanation. The Remaining Heap is used for Java objects created by users in UDFs; UDFs usually process data in a streaming fashion and do not need much memory, and Flink discourages caching large amounts of data in UDFs, because that causes the problems described earlier. The Memory Manager pool (hereinafter referred to as the memory pool) is usually configured as the largest of the three regions and is described in detail next.

In Flink, the memory pool consists of multiple MemorySegments. Each MemorySegment represents a contiguous block of memory backed by a byte[], 32KB by default. A MemorySegment provides offset-based methods for accessing data, such as get/put for int, long, float, and double, and methods for copying data between MemorySegments, similar to java.nio.ByteBuffer. A Flink data structure usually consists of several MemorySegments requested from the memory pool: the objects to be stored are serialized by a TypeSerializer into binary data held in MemorySegments, and deserialized by the TypeSerializer when read back, and the data structure accesses the binary data through the set/get methods of MemorySegment (see the sketch after the list below). The main advantages of this seemingly complicated memory management scheme are:

Binary data storage greatly improves data storage density and saves storage space.

All runtime data structures and algorithms can only obtain memory from the memory pool, which guarantees that the amount of memory they use is fixed and that they cannot cause OOM. In most distributed computing frameworks, this is exactly the part that most often causes OOM by caching large amounts of data.

Although the memory pool occupies most of the memory, its MemorySegments are large (32KB by default), so the pool actually contains very few Java objects, and they are always referenced by the pool; during garbage collection they quickly move to the old generation, which greatly reduces the pressure of JVM garbage collection.

Although the Remaining Heap is managed by the JVM, it mainly holds the short-lived objects created while user code processes streaming data, so they are quickly reclaimed by minor GCs and full GCs are rarely triggered.
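
As a concrete illustration of the offset-based access described above, the sketch below uses the MemorySegment class. Note that this is an internal Flink API whose factory methods and signatures may differ between Flink versions:

```java
import org.apache.flink.core.memory.MemorySegment;
import org.apache.flink.core.memory.MemorySegmentFactory;

public class MemorySegmentExample {
    public static void main(String[] args) {
        // A 32KB segment backed by a plain byte[], the default segment size of the memory pool.
        MemorySegment segment = MemorySegmentFactory.wrap(new byte[32 * 1024]);

        // Offset-based access, similar to java.nio.ByteBuffer: write values at fixed positions...
        segment.putInt(0, 42);
        segment.putLong(4, 123_456_789L);

        // ...and read them back from the same offsets.
        int i = segment.getInt(0);
        long l = segment.getLong(4);
        System.out.println(i + ", " + l);
    }
}
```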

Flink's current memory management is built on byte[] at the bottom, so the data still lives on the heap. Recently, Flink has added support for off-heap memory management. Its main advantages over on-heap management are:

Starting a JVM with a very large heap (e.g. 100GB) takes a long time, and garbage collection is slow. With off-heap memory, the heap only needs to hold the (much smaller) Network buffers and Remaining Heap, and the data in MemorySegments never has to be considered by the garbage collector.

More efficient I/O operations. With off-heap memory, writing a MemorySegment to disk or to the network can use zero-copy techniques, whereas on-heap memory requires at least one memory copy.

Off-heap memory can help with error recovery: if the JVM crashes, on-heap data is lost, while off-heap data may still be there. In addition, off-heap data can be shared with other programs.

Cache-friendly computation

Disk I/O and network I/O have long been regarded as the bottlenecks of Hadoop systems, but with the development of new-generation distributed computing frameworks such as Spark and Flink, several trends are making CPU and memory the bottleneck instead. These trends include:

More advanced I/O hardware is becoming common. 10Gb networks and SSDs are used by more and more data centers.

More efficient storage formats. Columnar formats such as Parquet and ORC are supported by more and more Hadoop projects, and their very effective compression greatly reduces the amount of data written to disk.

More efficient execution plans. For example, the filter push-down optimization in the execution plan optimizers of many SQL systems moves filter predicates as early as possible, even down into the Parquet data access layer, so that many real workloads no longer require much disk I/O.

Given the gap between CPU processing speed and memory access speed, the key to improving CPU efficiency is to make the best possible use of the L1/L2/L3 caches and memory and to avoid unnecessary cache misses. The customized serialization tools make this possible for Flink: the binary data Flink accesses occupies less memory and has a higher storage density, and data structures and algorithms can be designed so that data is stored contiguously, reducing the impact of memory fragmentation on the cache hit rate. Flink can even go a step further and store only the data actually needed by an operation (such as the keys during sorting) contiguously, keeping the rest elsewhere, to maximize the probability of cache hits.

Take sorting in Flink as an example. Sorting is usually a very heavy operation in a distributed computing framework, and Flink achieves very good performance with a specially designed sorting algorithm, implemented as follows:

The data to be sorted is serialized and stored in two separate sets of MemorySegments. The serialized values of all records are stored in the first MemorySegment set; the serialized keys, together with pointers to the corresponding values in the first set, are stored in the second MemorySegment set.

The keys in the second MemorySegment set are sorted. Whenever two records need to be exchanged, only the corresponding Key+Pointer entries are swapped; the data in the first MemorySegment set is never moved. When two keys are compared, TypeComparator compares the binary data directly, without deserializing anything.

After sorting is complete, the data is read in the order of the keys in the second MemorySegment set; the Pointer of each entry locates the value in the first MemorySegment set, which is then deserialized into a Java object by the TypeSerializer and returned.

Figure 7 Flink sorting algorithm

The benefits of doing this are:

The key and full data are stored separately to minimize the data to be manipulated, improve the probability of Cache hits, and thus improve the throughput of the CPU.

When moving data, you only need to move the Key+Pointer without moving the data itself, which greatly reduces the amount of data copied in memory.

TypeComparator operates directly on binary data, saving deserialization time.
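
To make the Key+Pointer layout concrete, here is a small, purely illustrative Java sketch (it does not use Flink's internal sorter classes): the values stay put in one area, and sorting only rearranges the (key, pointer) pairs:

```java
import java.util.Random;

public class KeyPointerSortSketch {
    public static void main(String[] args) {
        Random rnd = new Random(0);
        int n = 8;

        // "Value area": records stay put here, identified by their index (the pointer).
        String[] valueArea = new String[n];
        // "Sort area": interleaved key and pointer, two slots per record.
        long[] sortArea = new long[2 * n];

        for (int i = 0; i < n; i++) {
            long key = rnd.nextInt(100);
            valueArea[i] = "record-" + key;
            sortArea[2 * i] = key;       // serialized key
            sortArea[2 * i + 1] = i;     // pointer into the value area
        }

        // Sort by key, swapping only the (key, pointer) pairs; the value area is untouched.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (sortArea[2 * j] < sortArea[2 * i]) {
                    swapPair(sortArea, i, j);
                }
            }
        }

        // Read back in key order: follow the pointer to fetch ("deserialize") the value.
        for (int i = 0; i < n; i++) {
            int pointer = (int) sortArea[2 * i + 1];
            System.out.println(sortArea[2 * i] + " -> " + valueArea[pointer]);
        }
    }

    private static void swapPair(long[] sortArea, int i, int j) {
        for (int k = 0; k < 2; k++) {
            long tmp = sortArea[2 * i + k];
            sortArea[2 * i + k] = sortArea[2 * j + k];
            sortArea[2 * j + k] = tmp;
        }
    }
}
```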

Through customized memory management, Flink makes full use of memory and CPU caches and greatly improves CPU efficiency. At the same time, because most of the memory is controlled by the framework itself, the robustness of the system is significantly improved and the likelihood of OOM is reduced.

Summary

This article has introduced some key features of the Flink project. Flink is a project with many interesting features, including its unified batch and stream execution engine, its combination of techniques from general-purpose big data computing frameworks and traditional database systems, and many innovations in stream processing. Due to limited space, some other interesting features are not covered in detail, such as the execution plan optimizer at the DataSet API level and the native iteration operators; interested readers can learn more on the Flink official website. I hope this article gives readers a better understanding of Flink and encourages more people to use, and even contribute to, the Flink project.
