Hive storage formats: ORC and PARQUET compared

  Hive has three commonly used storage formats: TEXT, ORC, and PARQUET. TEXT is the default format, while ORC and PARQUET are columnar formats. They differ in storage space and query efficiency, so I ran a test and recorded the results here.

One: Differences in the CREATE TABLE statements

create table if not exists text(
a bigint
) partitioned by (dt string)
row format delimited fields terminated by '\001'
location '/hdfs/text/';

create table if not exists orc(
a bigint)
partitioned by (dt string)
row format delimited fields terminated by '\001'
stored as orc
location '/hdfs/orc/';

create table if not exists parquet(
a bigint)
partitioned by (dt string)
row format delimited fields terminated by '\001'
stored as parquet
location '/hdfs/parquet/';

 

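In fact, the only difference between the statements is the format that follows `stored as`. The text table omits the clause and so uses Hive's default, TEXTFILE.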

Two: HDFS storage comparison

parquet   orc     text
709M      275M    1G
687M      249M    1G
647M      265M    1G

 

Three: query time comparison

parquet   orc      text
36.451    26.133   42.574
38.425    29.353   41.673
36.647    27.825   43.938
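
The two tables above are easier to compare once each column is averaged and expressed relative to text. The following is a minimal sketch of that arithmetic, not part of the original test. The object and method names are made up for illustration, and it assumes that the "1G" reported for text means 1024M.

```scala
object FormatSummary {
  // Sizes in MB, copied from the storage table.
  // Assumption: the "1G" reported for text is taken as 1024 MB.
  val sizeMb: Seq[(String, Seq[Double])] = Seq(
    "parquet" -> Seq(709.0, 687.0, 647.0),
    "orc"     -> Seq(275.0, 249.0, 265.0),
    "text"    -> Seq(1024.0, 1024.0, 1024.0)
  )

  // Query times, copied from the query time table (unit as reported there).
  val queryTime: Seq[(String, Seq[Double])] = Seq(
    "parquet" -> Seq(36.451, 38.425, 36.647),
    "orc"     -> Seq(26.133, 29.353, 27.825),
    "text"    -> Seq(42.574, 41.673, 43.938)
  )

  // Arithmetic mean of a non-empty sequence.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Mean of one format divided by the mean of the text format.
  def ratioToText(table: Seq[(String, Seq[Double])], format: String): Double = {
    val byFormat = table.toMap
    mean(byFormat(format)) / mean(byFormat("text"))
  }

  def main(args: Array[String]): Unit = {
    for ((label, table) <- Seq("size" -> sizeMb, "time" -> queryTime)) {
      for ((format, values) <- table) {
        val m = mean(values)
        val r = ratioToText(table, format)
        println(f"$label%-5s $format%-8s mean=$m%.3f ratio-to-text=$r%.2f")
      }
    }
  }
}
```

Under that assumption, ORC averages 263M, roughly a quarter of the text size, and Parquet averages 681M, roughly two thirds. ORC is also the fastest of the three formats in every query run.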

Four: How to generate the files

import org.apache.spark.sql.{SaveMode, SparkSession}

val sparkSession = SparkSession.builder().master("local").appName("pushFunnelV3").getOrCreate()
import sparkSession.implicits._

// Two sample JSON records; parse them once, then write the same data in each format
val nameDS = Seq("{'name':'zhangsan','age':'18'}", "{'name':'lisi','age':'19'}").toDS()
val df = sparkSession.read.json(nameDS)
df.write.mode(SaveMode.Overwrite).csv("/data/aa")      // text (CSV)
df.write.mode(SaveMode.Overwrite).orc("/data/bb")      // ORC
df.write.mode(SaveMode.Overwrite).parquet("/data/cc")  // PARQUET
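
To check the generated files, they can be read back with the matching Spark readers. This is a rough sketch that assumes the three paths written above exist. The object name `ReadBackFormats` is made up for illustration. ORC and PARQUET files carry their schema with them, while CSV written without a header does not store column names, so Spark assigns `_c0`, `_c1` on read.

```scala
import org.apache.spark.sql.SparkSession

object ReadBackFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("readBackFormats").getOrCreate()

    // Paths are the ones written by the snippet above.
    val csv = spark.read.csv("/data/aa")
    val orc = spark.read.orc("/data/bb")
    val parquet = spark.read.parquet("/data/cc")

    // Print the schema Spark recovers from each format.
    csv.printSchema()
    orc.printSchema()
    parquet.printSchema()

    // All three outputs should contain the same number of rows.
    println(s"rows: csv=${csv.count()}, orc=${orc.count()}, parquet=${parquet.count()}")

    spark.stop()
  }
}
```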

Origin www.cnblogs.com/wuxiaolong4/p/11809291.html