Apache Hive supports several well-known Hadoop file formats, and it can also load and query files created by other Hadoop components, such as Pig or MapReduce. This article compares the file formats available in Hive: TextFile, SequenceFile, RCFile, Avro, ORC, and Parquet, all of which are also supported by Cloudera Impala. Different file formats and compression codecs can have significantly different effects depending on the data set, so it is important to choose a format appropriate to the use case, much as one would select a storage engine in ClickHouse. The sections below describe each of the file formats supported by Hive.
Hive Text File Format
Hive Text File Format is the default file format and can be used to exchange data with other client applications. The text file format is supported by most applications. Data is stored in rows, each row represents a record, and each row ends with a newline character (\n).
Text files are simple flat files that can be compressed with BZIP2 to reduce storage space. The Hive CREATE TABLE command selects this format with the STORED AS TEXTFILE clause; the sample syntax is as follows:
CREATE TABLE textfile_table
(column_specs)
STORED AS TEXTFILE;
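To actually produce BZIP2-compressed text files, the standard Hive/Hadoop compression properties can be set before writing data. The sketch below assumes an existing table named source_table (a placeholder) and uses the stock Hadoop BZip2 codec:

```sql
-- Enable compressed output for this session (standard Hive/Hadoop settings)
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Rows written by this INSERT land as BZIP2-compressed text files;
-- source_table is a hypothetical existing table
INSERT OVERWRITE TABLE textfile_table
SELECT * FROM source_table;
```

BZIP2 has the advantage of being splittable, so the compressed files can still be processed in parallel by multiple mappers.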
Hive Sequence File Format
The sequence file format is a flat file format supported by Hadoop in which data is stored as binary key-value pairs. Sequence files are splittable, and their main advantage is the ability to merge two or more files into one.
A sequence file table is created in Hive by adding the storage option STORED AS SEQUENCEFILE. Here is example syntax:
CREATE TABLE sequencefile_table
(column_specs)
STORED AS SEQUENCEFILE;
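Sequence files support record-level or block-level compression; block compression batches many key-value pairs together and usually yields the best ratio. A minimal sketch, again using a hypothetical source_table:

```sql
-- BLOCK compresses groups of key-value pairs (vs. RECORD, one pair at a time)
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- source_table is a placeholder for an existing table
INSERT OVERWRITE TABLE sequencefile_table
SELECT * FROM source_table;
```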
Hive RC File Format
The RC (Record Columnar) file format is a row-column format: data is divided into row groups that are stored column by column within each group, which lets Hive achieve higher compression than purely row-oriented formats. RCFile is a good choice when queries process many rows but read only a subset of columns.
The RCFile format is similar to the sequence file format in that it also stores data as key-value pairs. The STORED AS RCFILE option is specified when Hive creates an RCFile table. Example syntax is as follows:
CREATE TABLE rcfile_table
(column_specs)
STORED AS RCFILE;
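A common way to populate an RCFile table is to rewrite an existing row-oriented table with CREATE TABLE ... AS SELECT; the table names below are placeholders:

```sql
-- Copy an existing (e.g. text-format) table into RCFile layout
CREATE TABLE rcfile_copy
STORED AS RCFILE
AS SELECT * FROM textfile_table;
```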
Hive AVRO File Format
Avro is an open-source project that provides data serialization and data exchange services for Hadoop. It can be used to exchange data between the Hadoop ecosystem and applications written in any programming language, and it is one of the most popular file formats for Hadoop-based applications.
Creating a Hive Avro table is done by specifying the STORED AS AVRO option:
CREATE TABLE avro_table
(column_specs)
STORED AS AVRO;
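Instead of listing columns explicitly, an Avro table can also derive its columns from an Avro schema supplied as a table property. The schema below is a made-up example for illustration:

```sql
-- Columns (id BIGINT, name STRING) are inferred from the Avro schema
CREATE TABLE avro_example
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal' = '{
  "namespace": "example",
  "name": "user",
  "type": "record",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}');
```

Keeping the schema with the table definition is what makes Avro convenient for schema evolution across applications.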
Hive ORC File Format
The ORC (Optimized Row Columnar) file format provides a highly efficient way to store Hive table data. This file format was designed to overcome limitations of the other Hive file formats, and using ORC files improves performance when Hive reads, writes, and processes data from large tables.
Creating a Hive ORC table is done by specifying the STORED AS ORC option:
CREATE TABLE orc_table
(column_specs)
STORED AS ORC;
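ORC's compression codec can be chosen per table through the orc.compress table property (ZLIB is the default; SNAPPY trades some compression ratio for speed). A sketch, with column_specs as a placeholder as in the examples above:

```sql
-- Store this table as ORC with Snappy compression
CREATE TABLE orc_compressed_table
(column_specs)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```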
Hive Parquet File Format
Parquet is a column-oriented binary file format that is very efficient for large-scale queries, especially those that read only specific columns of a table. Parquet data can be compressed with Snappy (the default) or gzip. For more on the advantages of the Parquet file format, see: Understanding the Parquet file format based on the R language.
Creating a Hive Parquet table is done by specifying the STORED AS PARQUET option:
CREATE TABLE parquet_table
(column_specs)
STORED AS PARQUET;
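As noted above, Snappy is the default codec; in recent Hive versions the codec can be overridden per table via the parquet.compression table property. A sketch, with column_specs again as a placeholder:

```sql
-- Override the default Snappy codec with gzip for better compression ratio
CREATE TABLE parquet_compressed_table
(column_specs)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'GZIP');
```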
Summary
This article introduced the different file formats supported by Hive. Understanding these formats and selecting the appropriate one is very important for big data applications.