Hadoop columnar storage engines Parquet/ORC and Snappy compression

Compared with traditional row-oriented formats, a columnar storage engine offers a higher compression ratio and fewer IO operations, which is why it is favored. Disadvantage of columnar storage: when a table has many columns and most of them are touched on every operation, CPU pressure rises sharply and processing time grows. Advantage: when a table has many columns but each query only operates on a few of them, columnar storage is cost-effective and delivers higher performance.

In many big data scenarios the data volume is huge and each record carries many fields, as in the telecommunications industry. For such regularly structured data with many fields, where each query only targets a few of them, columnar storage is an excellent choice. The open source big data ecosystem offers several columnar storage implementations, from the earliest RCFile supported by Hive to the later ORC and Parquet, and different SQL-on-Hadoop solutions apply different optimizations to the different columnar formats:

- Impala recommends the Parquet format and does not support ORC or RCFile
- Hive 0.x recommends RCFile
- PrestoDB recommends ORC
- Spark supports ORC, Parquet, and RCFile (see the sketch after this list)
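To make the Spark line concrete, here is a minimal sketch of writing one dataset in both of the columnar formats discussed below; the column names and output paths are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ColumnarWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("columnar-write-example")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset standing in for real telecom-style records.
    val calls = Seq(
      ("13800000001", "2023-01-01", 120),
      ("13800000002", "2023-01-01", 45)
    ).toDF("msisdn", "call_date", "duration_sec")

    // The same data written in the two columnar formats discussed in this article.
    calls.write.mode("overwrite").parquet("/tmp/calls_parquet")
    calls.write.mode("overwrite").orc("/tmp/calls_orc")

    spark.stop()
  }
}
```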

Apache Parquet

Parquet is derived from the Google Dremel system (the paper is available for reference): it is roughly equivalent to the data storage engine inside Dremel, while Apache's top-level open source project Drill is the open source implementation of Dremel as a whole. The original design motivation of Apache Parquet was to store nested data such as Protocol Buffers, Thrift, and JSON in a columnar format, so that it can be compressed and encoded efficiently and the required data can be retrieved with fewer IO operations. This is also Parquet's advantage over ORC: it can transparently store Protobuf and Thrift data in columnar form, a natural fit now that Protobuf and Thrift are so widely used. Beyond this, Parquet has few other remarkable points compared to ORC; for example, it does not support update operations (data cannot be modified once written) and does not support ACID.
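As a sketch of what storing nested, Protobuf/Thrift-shaped data in Parquet looks like in practice, the following Spark example uses hypothetical case classes (not from the original article); the nested structure is preserved in the Parquet schema, and a single nested column can be read back on its own.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical nested schema, similar in shape to a Protobuf/Thrift message.
case class Address(city: String, zip: String)
case class User(id: Long, name: String, addresses: Seq[Address])

object NestedParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-parquet").getOrCreate()
    import spark.implicits._

    val users = Seq(
      User(1L, "alice", Seq(Address("Beijing", "100000"))),
      User(2L, "bob", Seq(Address("Shanghai", "200000"), Address("Shenzhen", "518000")))
    ).toDS()

    // The nested case classes become nested groups / repeated fields in Parquet.
    users.write.mode("overwrite").parquet("/tmp/users_parquet")

    // Reading back only one nested column touches only that column's data.
    spark.read.parquet("/tmp/users_parquet")
      .select("name", "addresses.city")
      .show(truncate = false)

    spark.stop()
  }
}
```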

Apache ORC

ORC (Optimized RC File) storage is derived from the RC (RecordColumnar File) format. RC is a columnar storage engine with poor support for schema evolution (modifying the schema requires regenerating the data). ORC is an improvement on RC, but it still handles schema evolution poorly; the improvements are mainly in compression encoding and query performance. RC/ORC started out inside Hive and eventually grew into a separate, successful project. Hive 1.x supports transactions and update operations on top of ORC (other storage formats are not currently supported). As ORC has developed, it has gained some very advanced features, such as support for updates, ACID, and complex types like struct and array. Complex types can be used to build nested data structures similar to Parquet's, but when nesting is deep they are cumbersome and complicated to write, whereas the schema representation Parquet provides expresses multi-level nested data types more easily.
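The struct and array complex types mentioned above can be sketched the same way; the example below (hypothetical column names, written with Spark rather than Hive DDL) builds a struct column and an array column and stores them as ORC.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, lit, struct}

object OrcComplexTypesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("orc-complex-types").getOrCreate()
    import spark.implicits._

    val base = Seq((1L, "alice", "Beijing"), (2L, "bob", "Shanghai"))
      .toDF("id", "name", "city")

    // Build a struct column and an array column, the two complex types mentioned above.
    val df = base
      .withColumn("profile", struct($"name", $"city"))       // struct<name:string,city:string>
      .withColumn("tags", array(lit("vip"), lit("active")))  // array<string>

    df.write.mode("overwrite").orc("/tmp/users_orc")

    spark.read.orc("/tmp/users_orc").printSchema()
    spark.stop()
  }
}
```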

Parquet and ORC

Note: converting a large volume of data into a columnar storage format puts very heavy pressure on the servers; it trades time for space, so weigh the cost carefully. The time needed to generate columnar formats also differs between SQL-on-Hadoop frameworks: because Spark is multi-threaded underneath, it is currently the most efficient at generating the various columnar formats, and among the other frameworks Impala is faster than PrestoDB (see figure). A minimal conversion sketch follows.
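A typical conversion job of this kind, sketched with Spark and hypothetical input/output paths, simply reads the raw data and rewrites it in a columnar format:

```scala
import org.apache.spark.sql.SparkSession

object ConvertToColumnar {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("convert-to-columnar").getOrCreate()

    // Hypothetical raw CSV input; the schema is inferred here for brevity,
    // a real job would declare it explicitly to avoid a second pass over the data.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/raw/cdr/")

    // Rewriting the data as Parquet (or .orc(...)) is CPU-heavy, but pays off
    // in storage space and query time, as noted above.
    raw.write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("/data/columnar/cdr_parquet/")

    spark.stop()
  }
}
```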

HBase && Snappy

HBase chooses Snappy compression. HBase is a big data database and an open source NoSQL representative that also takes column-family storage into account; it provides column-oriented retrieval with fast response. Because it is key-value based, it compresses very well. To provide CRUD operations with eventually consistent results, update and delete operations only physically remove data during HBase's region split && compact process. If the volume of data written into HBase is very large, several TB per day, and each region grows beyond 100 GB, regions will split automatically, consuming a great deal of cluster IO and making other frameworks such as Spark, Impala, and PrestoDB inefficient. In this situation it becomes especially important to choose a high compression algorithm per column family for HBase to shrink the data volume and reduce IO operations. By pre-creating regions, strictly controlling region size to about 1 GB, turning off automatic region split && compact, and running split && compact manually at a reasonable time, cluster IO pressure is reduced, and Snappy is the best choice, as shown in the figure.
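A minimal sketch of such a table definition, assuming the HBase 2.x client API and hypothetical table name and split keys, looks like this:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.io.compress.Compression
import org.apache.hadoop.hbase.util.Bytes

object HBaseSnappyTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin

    // Column family compressed with Snappy, as recommended above.
    val cf = ColumnFamilyDescriptorBuilder
      .newBuilder(Bytes.toBytes("cf"))
      .setCompressionType(Compression.Algorithm.SNAPPY)
      .build()

    // Disable automatic region splitting; split && compact are then run manually.
    val table = TableDescriptorBuilder
      .newBuilder(TableName.valueOf("cdr_events"))
      .setColumnFamily(cf)
      .setRegionSplitPolicyClassName(
        "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy")
      .build()

    // Pre-create regions up front (hypothetical split keys).
    val splitKeys = Array("2", "4", "6", "8").map(s => Bytes.toBytes(s))
    admin.createTable(table, splitKeys)

    admin.close()
    connection.close()
  }
}
```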

  • Hadoop compression algorithm selection (a configuration sketch follows this list):
    • mapreduce.map.output.compress.codec
    • mapreduce.output.fileoutputformat.compress.codec
    • mapreduce.output.fileoutputformat.compress.type
    • Candidate codecs for the codec properties:
      • org.apache.hadoop.io.compress.DefaultCodec
      • org.apache.hadoop.io.compress.SnappyCodec [best choice]
      • org.apache.hadoop.io.compress.BZip2Codec / GzipCodec [GzipCodec compresses the most, but is the most time-consuming]
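A sketch of setting these properties when submitting a MapReduce job (property names as listed above; mapper, reducer, and paths omitted for brevity) might look like this:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.SequenceFile
import org.apache.hadoop.io.compress.{CompressionCodec, SnappyCodec}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, SequenceFileOutputFormat}

object SnappyJobConfig {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Compress intermediate map output with Snappy.
    conf.setBoolean("mapreduce.map.output.compress", true)
    conf.setClass("mapreduce.map.output.compress.codec",
      classOf[SnappyCodec], classOf[CompressionCodec])

    val job = Job.getInstance(conf, "snappy-output-example")

    // Compress the final job output with Snappy, block-compressed.
    FileOutputFormat.setCompressOutput(job, true)
    FileOutputFormat.setOutputCompressorClass(job, classOf[SnappyCodec])
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK)

    // ... set mapper/reducer/input/output paths here before job.waitForCompletion(true)
  }
}
```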

Summary

As Hadoop technology sees deeper use across industries, and Hadoop is applied to data warehouses and large-scale data analysis, choosing a reasonable columnar storage format and an efficient compression algorithm becomes especially important. Columnar storage is gradually being adopted across product lines; for example, Twitter and Facebook have converted most of their data to columnar formats such as Parquet and ORC, reducing storage space and query time by roughly 35%. The compression capabilities of the Parquet && ORC file formats significantly reduce the disk space used and the time it takes to execute queries and other operations.
