An introduction to and comparison of the file formats supported by Apache Hive

Apache Hive supports several well-known file formats used in Hadoop, and it can also load and query file formats created by other Hadoop components, such as Pig or MapReduce. This article compares the file formats available in Hive: TextFile, SequenceFile, RCFile, Avro, ORC, and Parquet; Cloudera Impala also supports these formats. In Apache Hive, different file formats and compression codecs can have significantly different effects on different data sets, so it is important to choose the appropriate file format for the scenario, much like selecting the appropriate storage engine in ClickHouse. The following sections describe the file formats supported by Hive.

Hive Text File Format

Hive Text File Format is the default file format and can be used to exchange data with other client applications. The text file format is supported by most applications. Data is stored row by row: each line represents a record, and each line ends with a newline character (\n).

Text files are simple flat files that can be compressed with BZIP2 to reduce storage space. The Hive create table command specifies this storage format with STORED AS TEXTFILE. Sample syntax is as follows:

CREATE TABLE textfile_table
(column_specs)
STORED AS TEXTFILE;
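
Because text tables are line-delimited, the field delimiter is usually declared explicitly. The following is a minimal sketch of a text table with a comma delimiter and a LOAD DATA statement; the table name, columns, and file path are hypothetical:

-- Text table with an explicit field delimiter (names and path are illustrative)
CREATE TABLE employee_text (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Load a local CSV file into the table
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employee_text;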

Hive Sequence File Format

The sequence file format is a flat file format supported by Hadoop in which data is stored as binary key-value pairs. Sequence files are binary and splittable; their main advantage is the ability to merge two or more files into one.

Creating a sequence file table in Hive is done by adding the storage option STORED AS SEQUENCEFILE. Here is example syntax:

CREATE TABLE sequencefile_table
(column_specs)
STORED AS SEQUENCEFILE;
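
Since sequence files are binary, they are typically populated by Hive itself rather than loaded from raw text. A sketch of this pattern, using the standard Hive/Hadoop compression settings and reusing the table names from the examples above (the assumption that textfile_table holds the source data is illustrative):

-- Enable compressed output and block-level compression for sequence files
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- Populate the sequence file table from an existing text table
INSERT OVERWRITE TABLE sequencefile_table
SELECT * FROM textfile_table;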

Hive RC File Format

The RCFile (Record Columnar File) format is a row-columnar file format, another Hive file format that offers a high row-level compression rate. If you need to process multiple rows at a time, the RCFile format can be used.

The RCFile format is very similar to the sequence file format and also stores data as key-value pairs. The STORED AS RCFILE option can be specified when Hive creates the RCFile table. Example syntax is as follows:

CREATE TABLE RCfile_table
(column_specs)
STORED AS RCFILE;
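
Like sequence files, RCFile data must be written by Hive rather than loaded directly from plain text, so a common pattern is to stage raw data in a text table and then convert it. A minimal sketch, reusing the table names from the examples above:

-- Convert data from a text staging table into the RCFile table
INSERT OVERWRITE TABLE RCfile_table
SELECT * FROM textfile_table;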

Hive AVRO File Format

Avro is an open-source project that provides data serialization and data exchange services for Hadoop. It can be used to exchange data between the Hadoop ecosystem and applications written in any programming language. Avro is one of the most popular file formats for Hadoop-based applications.

Creating a Hive Avro table can specify the STORED AS AVRO option:

CREATE TABLE avro_table
(column_specs)
STORED AS AVRO;
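
Because Avro is schema-based, the table definition can also be supplied as an Avro schema through the Avro SerDe's avro.schema.literal property, from which Hive derives the columns. A minimal sketch, assuming a hypothetical two-field record:

-- Table columns are derived from the embedded Avro schema (record is illustrative)
CREATE TABLE avro_users
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "name", "type": "string"}
  ]
}');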

Hive ORC File Format

The ORC (Optimized Row Columnar) file format provides a more efficient way to store Hive table data. This file format was designed to overcome the limitations of other Hive file formats. Using ORC files can improve performance when Hive reads, writes, and processes data in large tables.

Creating a Hive ORC table can specify the STORED AS ORC option:

CREATE TABLE orc_table
(column_specs)
STORED AS ORC;
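
ORC tables also accept table properties that tune compression and file layout. A sketch using commonly documented ORC properties; the table name and columns are hypothetical:

-- ORC table with explicit compression and stripe size
CREATE TABLE orc_events (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'    = 'ZLIB',     -- codec: NONE, ZLIB, or SNAPPY
  'orc.stripe.size' = '67108864'  -- stripe size in bytes (64 MB)
);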

Hive Parquet File Format

Parquet is a column-oriented binary file format that is very efficient for large-scale queries, especially queries that read specific columns of a table. Parquet data can be compressed with Snappy or gzip; the default is Snappy. For the advantages of the Parquet file format, please refer to: Understanding the Parquet file format based on the R language.

Creating a Hive Parquet table can specify the STORED AS PARQUET option:

CREATE TABLE parquet_table
(column_specs)
STORED AS PARQUET;
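
To make the compression choice explicit rather than relying on the default, the codec can be set before writing Parquet data. A sketch assuming the commonly used parquet.compression session setting:

-- Choose the Parquet compression codec for subsequent writes (SNAPPY or GZIP)
SET parquet.compression=SNAPPY;

CREATE TABLE parquet_table_snappy
(column_specs)
STORED AS PARQUET;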

Summary

This article introduced the different file formats supported in Hive. Understanding and selecting the appropriate file format is very important for big data applications.
