Apache Arrow: Cross-platform memory data exchange format

Apache Arrow is an open source, top-level project of the Apache Software Foundation. Its purpose is to speed up big data analytics by serving as a cross-platform data layer: it specifies how columnar data is stored in memory, processed, and exchanged. Currently, developers from 13 communities, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm, are committed to making it the de facto standard for big data systems.

Besides using big data platforms such as Hadoop for economical storage and batch processing, users applying big data analytics also value the scalability and performance of the analysis system. Over the past few years the open source community has released many tools that improve the big data analytics ecosystem, covering every aspect of data analysis: columnar storage formats (Parquet, ORC), in-memory compute layers (Drill, Spark, Impala, and Storm), and powerful API interfaces (the Python and R languages). Arrow is the latest addition, and it provides a cross-platform, cross-application in-memory data interchange format. For example, Spark 3.0 uses Arrow to improve the performance of SparkR by at least 40%.

An important means of improving big data analytics performance is the columnar layout and processing of data. Columnar processing lets us fully tap the potential of the hardware through vectorized computation and SIMD. The big data query engine Apache Drill keeps data in columns both on disk and in memory, and Arrow grew out of Drill's ValueVector data format. Beyond flat columnar data, Apache Arrow also supports nested and dynamic data sets, which makes it an ideal format for data such as that generated by the Internet of Things.
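As a minimal sketch of what vectorized columnar processing looks like in practice, assuming the pyarrow Python bindings (the values below are made up for illustration), Arrow's compute kernels operate on whole contiguous columns at once rather than row by row:

import pyarrow as pa
import pyarrow.compute as pc

# A columnar array: values sit in one contiguous buffer, which is what
# makes vectorized, SIMD-friendly processing possible.
ages = pa.array([30, 33, 41, 27], type=pa.int32())

# Aggregations and element-wise kernels run over the whole buffer.
print(pc.sum(ages))     # 131
print(pc.mean(ages))    # 32.75
print(pc.add(ages, 1))  # [31, 34, 42, 28]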

Apache Arrow opens up enormous possibilities for the big data ecosystem. With Arrow as a standard data interchange format, interoperability between data analytics systems and applications reaches a new level. In the past, a large share of CPU cycles was spent serializing and deserializing data; now different systems can share data seamlessly, and users no longer have to worry about data formats when combining them. The following figure shows the difference between the columnar storage format and the row storage format:
[Figure: comparison of the columnar storage format and the row storage format]
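To make that seamless sharing concrete, here is a minimal sketch using pyarrow's IPC stream format (the table contents are illustrative). The bytes written by the producer are already in Arrow's memory layout, so the consumer can use them directly instead of deserializing into a different internal format:

import pyarrow as pa

# Producer: write a table in Arrow's IPC stream format.
table = pa.table({"name": ["mary", "mark"], "age": [30, 33]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Consumer: read the stream; the data is usable as Arrow arrays
# without a per-row deserialization step.
reader = pa.ipc.open_stream(buf)
received = reader.read_all()
print(received.column("age"))  # [30, 33]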

Comparison of advantages before and after using Apache Arrow

Before the birth of Apache Arrow, exchanging data between different systems had to be handled as follows:
[Figure: without Arrow, each pair of systems converts data through its own serialization and deserialization path]

The problems caused by this are as follows:

  • Each system has its own internal memory format;
  • 70 to 80% of CPU time is wasted on serialization and deserialization;
  • Similar functionality is implemented again and again across projects, with no single standard.

After adopting Apache Arrow for data interchange between systems, the architecture takes the following form:
[Figure: with Arrow, all systems exchange data through a shared in-memory columnar format]
It can be seen that:

  • All systems use the same memory format;
  • The serialization overhead of communication between systems is avoided;
  • Functionality can be shared between projects (such as a Parquet-to-Arrow reader; see the sketch after this list).
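As a minimal sketch of such shared functionality, assuming pyarrow and a hypothetical file name, a single Parquet reader can serve every Arrow-aware project:

import pyarrow.parquet as pq

# Read a Parquet file directly into an Arrow table. Any project that
# understands the Arrow memory format can consume this table without
# implementing its own Parquet reader.
table = pq.read_table("people.parquet")  # hypothetical file
print(table.schema)
print(table.num_rows)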

Apache Arrow has the following advantages:

  • The columnar memory layout makes random access to data O(1). This layout is very efficient on modern processors for analytical workloads and enables SIMD (Single Instruction, Multiple Data) optimizations; developers can write very efficient algorithms against Apache Arrow's data structures;
  • It makes data interchange between systems very efficient and avoids the cost of serialization and deserialization;
  • It supports complex data types.

Apache Arrow can offer O(1) random access because it is optimized for analyzing structured data, such as the following:

people = [
  {
    name: 'mary', age: 30,
    places_lived: [
      {city: 'Akron', state: 'OH'},
      {city: 'Bath', state: 'OH'}
    ]
  },
  {
    name: 'mark', age: 33,
    places_lived: [
      {city: 'Lodi', state: 'OH'},
      {city: 'Ada', state: 'OH'},
      {city: 'Akron', state: 'OH'}
    ]
  }
]

Now suppose we need to access the values of people.places_lived.city. In Arrow, access to the array values looks as follows:

[Figure: the columnar layout of people.places_lived.city, with offset buffers for places_lived and city]
Arrow records offsets for the places_lived field and the city field, and a field's values can be retrieved through those offsets. Because the offsets are stored explicitly, data access is very efficient.
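A minimal sketch of these offsets using pyarrow, building the same people data as above (the offset values [0, 2, 5] follow from the first person having two places_lived entries and the second having three):

import pyarrow as pa

# The nested people data as an Arrow array of structs.
people = pa.array([
    {"name": "mary", "age": 30,
     "places_lived": [{"city": "Akron", "state": "OH"},
                      {"city": "Bath", "state": "OH"}]},
    {"name": "mark", "age": 33,
     "places_lived": [{"city": "Lodi", "state": "OH"},
                      {"city": "Ada", "state": "OH"},
                      {"city": "Akron", "state": "OH"}]},
])

# places_lived is a list array: its offsets record where each person's
# list starts and ends within one contiguous values array.
places = people.field("places_lived")
print(places.offsets)  # [0, 2, 5]

# The city values themselves sit in one flat column.
print(places.flatten().field("city"))
# ["Akron", "Bath", "Lodi", "Ada", "Akron"]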

Apache Arrow project official website: http://arrow.apache.org/
Apache Arrow project Github: https://github.com/apache/arrow

Origin blog.51cto.com/15127589/2677604