After reading this article, you should understand how Flink manages memory.


Foreword

Today, many open-source systems for analyzing large data sets are written in Java or other JVM-based programming languages. The best-known example is Apache Hadoop, followed by newer frameworks such as Apache Spark, Apache Drill, and Apache Flink. A common challenge faced by JVM-based data analysis engines is how to store large amounts of data in memory, including caching and efficient processing. Managing JVM memory well is what separates a system that is hard to configure and behaves unpredictably from one that runs stably with little configuration.

In this article, we will discuss how Apache Flink manages memory, look at its custom serialization and deserialization mechanism, and see how it operates on binary data.

Storing data objects directly in heap memory

The most straightforward way to process large amounts of data in the JVM is to store the data as objects on the heap and operate on those objects directly; sorting, for example, means sorting the list of objects in memory. However, this approach has some obvious drawbacks. First, when large numbers of objects are frequently created and destroyed, monitoring and controlling heap memory usage is not a simple matter. If too many objects are allocated, memory is overused and an OutOfMemoryError is triggered, which kills the JVM process outright. Second, because most of these objects live in the young generation, garbage-collection overhead can easily reach 50% or more when the JVM collects. Finally, Java objects carry a certain space overhead (depending on the JVM and platform). For data sets with many small objects, this can significantly reduce the effective amount of usable memory. If you are proficient at system design and tuning, you can adjust specific parameters to more or less control how often OutOfMemoryErrors occur and avoid overusing the heap, but the effect of such configuration and tuning is limited, especially when data volumes are large and the execution environment changes.
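To make the space overhead concrete, here is a back-of-the-envelope sketch (not from the original article) that estimates how much heap a collection of boxed integers consumes compared to the raw payload. The constants assume a 12-byte object header, 8-byte alignment, and 4-byte compressed references, which are typical for a 64-bit HotSpot JVM with compressed oops; the exact numbers vary by JVM and platform:

```java
// Back-of-the-envelope estimate of Java object overhead for boxed ints.
// Assumptions (typical 64-bit HotSpot with compressed oops; varies by JVM):
//   - 12-byte object header, padded to an 8-byte boundary
//   - 4-byte references inside a reference array
public class ObjectOverheadEstimate {

    static final int HEADER = 12;   // assumed object header size in bytes
    static final int ALIGN = 8;     // assumed heap object alignment

    // Size of one object instance with the given payload, padded to alignment.
    static long instanceSize(int payloadBytes) {
        long raw = HEADER + payloadBytes;
        return (raw + ALIGN - 1) / ALIGN * ALIGN;
    }

    // Estimated heap bytes for n ints stored as Integer objects in a reference array.
    static long boxedBytes(long n) {
        long perObject = instanceSize(4);   // Integer: header + 4-byte int payload
        long refArray = n * 4;              // 4-byte references pointing at the objects
        return n * perObject + refArray;
    }

    // Heap bytes for the same n ints stored as a primitive int[] (payload only).
    static long primitiveBytes(long n) {
        return n * 4;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("boxed:     " + boxedBytes(n) + " bytes");
        System.out.println("primitive: " + primitiveBytes(n) + " bytes");
    }
}
```

Under these assumptions, a million boxed ints need roughly 20 MB of heap to hold 4 MB of payload, a 5x overhead, which is exactly the effect described above.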

How does Flink do it?

Apache Flink originated in a research project that aimed to combine the best technologies of MapReduce-based systems and parallel database systems. In this context, Flink has always had its own way of processing data in memory. Instead of putting objects directly on the heap, Flink serializes objects into a fixed number of pre-allocated memory segments. Its DBMS-style sort and join algorithms operate on this binary data as much as possible, keeping serialization and deserialization costs to a minimum. If more data needs to be processed than fits in memory, Flink's operators spill part of the data to disk. In fact, a lot of Flink's internals look more like C/C++ than ordinary Java. The diagram below summarizes how Flink stores serialized data in memory segments and spills to disk when necessary:

Flink's active memory management and operation on binary data has several advantages:

1. Memory-safe execution and efficient out-of-core algorithms. Since the number of allocated memory segments is fixed, monitoring the remaining memory resources is very simple. When memory is short, an operator can efficiently write a large batch of memory segments to disk and later read them back into memory. Hence, OutOfMemoryErrors are effectively prevented.

2. Reduced garbage-collection pressure. Because all long-lived data is managed by Flink in binary representation inside its memory segments, all data objects are short-lived, or even mutable and reusable. Short-lived objects can be garbage-collected much more efficiently, which greatly reduces GC pressure. Currently, the pre-allocated memory segments are long-lived objects on the JVM heap; to reduce GC pressure further, the Flink community is actively working on allocating them off-heap. Such efforts will make the JVM heap smaller, so garbage collection will consume less time.
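The "short-lived, mutable, reusable" point can be sketched with a hypothetical record type (this is an illustration, not Flink's actual serializer API): instead of allocating a fresh object per deserialized element, the caller passes one mutable instance that is overwritten on every call, so the read loop produces no per-record garbage:

```java
import java.nio.ByteBuffer;

// Sketch of Flink-style object reuse: deserialize into a caller-supplied
// mutable instance instead of allocating a new object per record.
// MyRecord and its fixed 8-byte layout (two ints) are hypothetical.
public class ReuseSketch {

    static class MyRecord {
        int id;
        int value;
    }

    static void serialize(MyRecord r, ByteBuffer out) {
        out.putInt(r.id).putInt(r.value);
    }

    // Overwrites the caller-supplied instance; no allocation happens here.
    static MyRecord deserializeInto(ByteBuffer in, MyRecord reuse) {
        reuse.id = in.getInt();
        reuse.value = in.getInt();
        return reuse;
    }

    // Serializes n records with values 1..n, then sums the values while
    // reusing ONE instance for all reads.
    static long sumWithReuse(int n) {
        ByteBuffer buf = ByteBuffer.allocate(8 * n);
        MyRecord w = new MyRecord();
        for (int i = 0; i < n; i++) {
            w.id = i;
            w.value = i + 1;
            serialize(w, buf);
        }
        buf.flip();
        MyRecord reuse = new MyRecord();   // the only record allocated for reading
        long sum = 0;
        while (buf.hasRemaining()) {
            sum += deserializeInto(buf, reuse).value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumWithReuse(3));   // sums 1 + 2 + 3
    }
}
```

However many records flow through the loop, only one short-lived object exists on the read side, which is what keeps GC pressure low.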

3. Space-efficient data storage. Java objects have a storage overhead that can be avoided if the data is stored in binary form.

4. Efficient binary operations and cache sensitivity. Given a suitable binary representation, binary data can be compared and operated on efficiently. Furthermore, binary representations allow related values, hash codes, keys, and pointers to be stored contiguously in memory. This gives data structures generally more cache-efficient access patterns.
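As an illustration of comparing binary data directly (a simplified sketch, not Flink's actual normalized-key implementation), an int can be written big-endian with its sign bit flipped, so that comparing the raw bytes as unsigned values yields the same order as comparing the original ints, with no deserialization needed:

```java
public class NormalizedKeySketch {

    // Encode an int as 4 big-endian bytes with the sign bit flipped, so that
    // an unsigned lexicographic byte comparison matches the signed int order.
    static byte[] encodeInt(int v) {
        int u = v ^ 0x80000000;   // flip sign bit: signed order -> unsigned order
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    // Compare two equal-length encoded keys byte-by-byte as unsigned values.
    static int compareKeys(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static void main(String[] args) {
        // -5 < 3 holds for the encoded bytes as well:
        System.out.println(compareKeys(encodeInt(-5), encodeInt(3)) < 0);
    }
}
```

A sort that compares such encoded keys touches only contiguous bytes, which is exactly the cache-friendly access pattern described above.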

These properties of active memory management are highly desirable in a data processing system for large-scale data analysis, but implementing them comes at a high cost. Achieving automatic memory management and convenient operations on binary data is non-trivial: it means, for example, using spillable hash tables backed by byte arrays with custom serialization instead of java.util.HashMap. Of course, Apache Flink is not the only JVM-based data processing system that operates on binary data. Apache Drill, Apache Ignite, and Apache Geode use similar techniques, and Apache Spark recently announced that it would evolve in this direction as well.

Below we will discuss in detail how Flink allocates memory, how objects are serialized and deserialized, and how Flink operates on binary data. We will also compare processing objects on the heap with operating on binary data, using some performance numbers.

How does Flink allocate memory?

A Flink TaskManager is composed of several internal components: the actor system (for coordinating with the Flink master), the IOManager (responsible for spilling data to disk and reading it back), and the MemoryManager (responsible for coordinating memory usage). In this article, we mainly explain the MemoryManager.

The MemoryManager is responsible for allocating MemorySegments to data processing operators such as sort and join. A MemorySegment is Flink's unit of memory allocation and is backed by a regular Java byte array (32 KB by default). A MemorySegment provides very efficient read and write access to its byte array using Java's unsafe methods. You can think of a MemorySegment as a custom version of Java's NIO ByteBuffer. In order to operate on multiple MemorySegments as one larger contiguous block of memory, Flink uses logical views that implement Java's java.io.DataOutput and java.io.DataInput interfaces.
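To get a feel for what a MemorySegment does, here is a heavily simplified sketch backed by a plain byte array. The real Flink class uses sun.misc.Unsafe for speed and offers a much richer API; this version shows only the idea of typed, offset-based access into a fixed-size segment:

```java
// Heavily simplified sketch of a Flink-style MemorySegment: a fixed-size
// byte array with typed, offset-based accessors. The real implementation
// uses sun.misc.Unsafe and has many more accessor methods.
public class SimpleMemorySegment {

    public static final int DEFAULT_SIZE = 32 * 1024;   // Flink default: 32 KB

    private final byte[] memory;

    public SimpleMemorySegment(int size) {
        this.memory = new byte[size];
    }

    public int size() {
        return memory.length;
    }

    // Write an int big-endian at the given byte offset.
    public void putInt(int offset, int value) {
        memory[offset]     = (byte) (value >>> 24);
        memory[offset + 1] = (byte) (value >>> 16);
        memory[offset + 2] = (byte) (value >>> 8);
        memory[offset + 3] = (byte) value;
    }

    // Read the int stored big-endian at the given byte offset.
    public int getInt(int offset) {
        return ((memory[offset]     & 0xFF) << 24)
             | ((memory[offset + 1] & 0xFF) << 16)
             | ((memory[offset + 2] & 0xFF) << 8)
             |  (memory[offset + 3] & 0xFF);
    }

    public static void main(String[] args) {
        SimpleMemorySegment seg = new SimpleMemorySegment(DEFAULT_SIZE);
        seg.putInt(0, 42);
        seg.putInt(4, -7);
        System.out.println(seg.getInt(0) + ", " + seg.getInt(4));
    }
}
```

The DataInput/DataOutput views mentioned above do essentially this over a list of such segments, advancing from one segment to the next as offsets grow.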

MemorySegments are allocated once when the TaskManager starts and are destroyed when the TaskManager shuts down. Over the TaskManager's entire lifetime, MemorySegments are therefore reused and never garbage collected. After all of the TaskManager's internal data structures have been initialized and all core services have been started, the MemoryManager begins creating MemorySegments. By default, 70% of the JVM heap available after service initialization is allocated by the MemoryManager (all of this is configurable). The remaining JVM heap is used for objects instantiated during task processing, including objects created by user-defined functions. The following figure shows the TaskManager's JVM memory layout after startup:
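The sizing rule from the paragraph above comes down to simple arithmetic. The helper below is hypothetical, with Flink's defaults of a 0.7 managed-memory fraction and 32 KB segments written out as constants:

```java
public class MemoryBudgetSketch {

    static final double MANAGED_FRACTION = 0.7;   // default: 70% of free heap
    static final int SEGMENT_SIZE = 32 * 1024;    // default segment size: 32 KB

    // Number of whole MemorySegments the MemoryManager can pre-allocate
    // from the heap that is free after service initialization.
    static long segmentCount(long freeHeapBytes) {
        long managedBytes = (long) (freeHeapBytes * MANAGED_FRACTION);
        return managedBytes / SEGMENT_SIZE;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(segmentCount(oneGiB) + " segments");
    }
}
```

With 1 GiB of free heap, this yields about 22,937 segments of 32 KB, roughly 700 MiB of managed memory, leaving the remaining ~300 MiB for user objects.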

How does Flink serialize objects?

The Java ecosystem offers several libraries that convert objects to a binary representation and back. Common choices are standard Java serialization, Kryo, Apache Avro, Apache Thrift, and Google's Protobuf. Flink contains its own custom serialization framework in order to control the binary representation of data. This is important because operating on binary data requires exact knowledge of the serialized layout. Furthermore, tailoring the serialized layout to the operations that will be performed on the binary data can significantly improve performance. Flink's serialization mechanism exploits the fact that the types of the objects to be serialized and deserialized are completely known before a program is executed.

Flink programs can process data represented as arbitrary Java or Scala objects. Before a program is optimized, the data types at each processing step of the program's data flow must be identified. For Java programs, Flink provides a reflection-based type extraction component that analyzes the return types of user-defined functions. Scala programs are analyzed with the help of the Scala compiler. Flink represents each data type with a TypeInformation.

Flink's type classification

NOTE: This figure is taken from Ke Dongwei's article "An Introduction to Apache Flink Types and Serialization Mechanisms"; it will be removed upon request.

Flink has TypeInformations for several kinds of data types:

  • BasicTypeInfo: any Java primitive type (or its boxed form) or java.lang.String

  • BasicArrayTypeInfo: any array of Java primitive types or java.lang.String

  • WritableTypeInfo: any implementation of Hadoop's Writable interface

  • TupleTypeInfo: any Flink tuple (Tuple1 to Tuple25). Flink tuples are Java representations of fixed-length tuples with typed fields

  • CaseClassTypeInfo: any Scala CaseClass (including Scala tuples)

  • PojoTypeInfo: any POJO (Java or Scala), i.e., an object whose fields are all either public or accessible through getters and setters that follow the common naming conventions

  • GenericTypeInfo: any data type that cannot be identified as another type

TypeInformation class inheritance diagram
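The classification above can be sketched as a simplified decision function. This is hypothetical illustration code, not Flink's real type extractor, which inspects generic signatures, constructors, and field accessibility in far more depth; only two of the categories are implemented here:

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of Flink's type classification. Only the basic and
// basic-array cases are handled; tuple, Writable, CaseClass, and POJO
// checks are omitted, so everything else falls through to GenericTypeInfo.
public class TypeClassifierSketch {

    static final List<Class<?>> BASIC = Arrays.asList(
        Boolean.class, Byte.class, Short.class, Character.class,
        Integer.class, Long.class, Float.class, Double.class, String.class);

    static boolean isBasic(Class<?> c) {
        return c.isPrimitive() || BASIC.contains(c);
    }

    static String classify(Class<?> c) {
        if (isBasic(c)) {
            return "BasicTypeInfo";
        }
        if (c.isArray() && isBasic(c.getComponentType())) {
            return "BasicArrayTypeInfo";
        }
        // Tuple / Writable / CaseClass / POJO checks would go here.
        return "GenericTypeInfo";
    }

    public static void main(String[] args) {
        System.out.println(classify(String.class));
        System.out.println(classify(int[].class));
        System.out.println(classify(Thread.class));
    }
}
```

The key point the sketch preserves is the ordering: specific, efficiently serializable categories are checked first, and GenericTypeInfo is the fallback for anything Flink cannot recognize.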


There is no such thing as unearned fruit. I hope that young friends who want to learn these techniques will overcome the obstacles on the road and commit themselves to technology: read the books, then write the code; understand the principles, then go practice. It will bring your life, your work, and your future closer to your dreams.


Origin blog.csdn.net/weixin_41663412/article/details/104843144