1. Prepare the test data: create the page_views table in Hive and load the test data into it
create table page_views(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

load data local inpath "/opt/data/page_views.dat" overwrite into table page_views;
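The load step expects a tab-separated file with seven fields per row, matching the table schema. A minimal sketch for generating a compatible sample file (the rows below are made up for illustration; the real /opt/data/page_views.dat contains 100,000 rows):

```python
# Write a tiny tab-separated sample matching the page_views schema.
# Values are hypothetical, purely to show the expected field layout.
columns = ["track_time", "url", "session_id", "referer",
           "ip", "end_user_id", "city_id"]

rows = [
    ["2013-05-19 13:00:00", "http://www.example.com/a", "s001",
     "http://www.example.com", "10.0.0.1", "u100", "c01"],
    ["2013-05-19 13:00:05", "http://www.example.com/b", "s002",
     "-", "10.0.0.2", "u101", "c02"],
]

with open("page_views_sample.dat", "w") as f:
    for row in rows:
        # Each row must carry exactly as many fields as the table has columns.
        assert len(row) == len(columns)
        f.write("\t".join(row) + "\n")
```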
2. Check the size of the uncompressed file
[root@hadoop001 data]# hadoop fs -du -h /user/hive/warehouse/demo.db/page_views/page_views.dat
18.1 M  54.4 M  /user/hive/warehouse/demo.db/page_views/page_views.dat
[root@hadoop001 data]# du -h page_views.dat
19M     page_views.dat

3. Compress the file with different codecs and compare the sizes
Note: run hadoop checknative to verify that the codecs are supported. If a codec is missing, you need to build Hadoop from source with the native library compiled in. For details, see http://blog.csdn.net/qq_26369213/article/details/78925760
[root@hadoop001 data]# hadoop checknative
18/03/01 00:54:52 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/03/01 00:54:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /opt/software/hadoop-2.6.0-cdh5.7.0/lib/native/libhadoop.so
zlib:    true /lib64/libz.so.1
snappy:  true /usr/lib64/libsnappy.so.1
lz4:     true revision:99
bzip2:   true /lib64/libbz2.so.1
openssl: true /usr/lib64/libcrypto.so
Compress with BZip2
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
create table page_views_bzip2
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
as select * from page_views;

Time taken: 21.224 seconds
Compress with Snappy
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
create table page_views_snappy
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
as select * from page_views;

Time taken: 17.899 seconds
Compress with Lz4Codec
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.Lz4Codec;
create table page_views_lz4
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
as select * from page_views;

Time taken: 17.663 seconds
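The three CTAS statements differ only in the codec class; each codec writes its output files with a characteristic suffix, which is how the files checked in step 4 get their names. A small sketch of the mapping (extensions as observed in the step-4 listing; the helper function is hypothetical):

```python
# Hive output codec class -> file extension it produces (as seen in step 4).
codec_extensions = {
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "org.apache.hadoop.io.compress.SnappyCodec": ".snappy",
    "org.apache.hadoop.io.compress.Lz4Codec": ".lz4",
}

def output_name(part, codec):
    """Name of the part file a task writes for a given codec."""
    return part + codec_extensions[codec]

print(output_name("000000_0", "org.apache.hadoop.io.compress.BZip2Codec"))
```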
4. Compare the size of the source data with the compressed data
hadoop fs -du -h /user/hive/warehouse/demo.db/page_views/page_views.dat /user/hive/warehouse/demo.db/page_views_snappy/000000_0.snappy /user/hive/warehouse/demo.db/page_views_bzip2/000000_0.bz2 /user/hive/warehouse/demo.db/page_views_lz4/000000_0.lz4 | sort -nk1
3.6 M   10.9 M  /user/hive/warehouse/demo.db/page_views_bzip2/000000_0.bz2
8.3 M   25.0 M  /user/hive/warehouse/demo.db/page_views_lz4/000000_0.lz4
8.4 M   25.2 M  /user/hive/warehouse/demo.db/page_views_snappy/000000_0.snappy
18.1 M  54.4 M  /user/hive/warehouse/demo.db/page_views/page_views.dat
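From the measured HDFS sizes, the compression ratios can be worked out in one place (a small sketch using the numbers reported above; BZip2 shrinks the file to roughly a fifth of the original, Snappy and LZ4 to roughly half):

```python
# File sizes in MB, copied from the `hadoop fs -du -h` listing above.
sizes_mb = {
    "uncompressed": 18.1,
    "bzip2": 3.6,
    "lz4": 8.3,
    "snappy": 8.4,
}

original = sizes_mb["uncompressed"]
for name, size in sizes_mb.items():
    ratio = size / original
    print(f"{name:12s} {size:5.1f} MB  {ratio:.0%} of original")
```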
5. Run SQL queries and compare execution efficiency
hive> select count(*) from page_views;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519828756164_0009, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0009/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-01 01:28:22,389 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:28:28,791 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.56 sec
2018-03-01 01:28:35,089 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1519828756164_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 2.78 sec  HDFS Read: 19021459 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 780 msec
OK
100000
Time taken: 22.692 seconds, Fetched: 1 row(s)

hive> select count(*) from page_views_lz4;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519828756164_0006, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0006/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-01 01:14:53,317 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:15:01,713 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.15 sec
2018-03-01 01:15:09,076 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.38 sec
MapReduce Total cumulative CPU time: 4 seconds 380 msec
Ended Job = job_1519828756164_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 4.38 sec  HDFS Read: 8753905 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 380 msec
OK
100000
Time taken: 24.707 seconds, Fetched: 1 row(s)

hive> select count(*) from page_views_snappy;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519828756164_0007, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0007/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-01 01:16:11,531 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:16:18,858 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2018-03-01 01:16:26,216 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1519828756164_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 3.0 sec  HDFS Read: 8820268 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
100000
Time taken: 22.719 seconds, Fetched: 1 row(s)

hive> select count(*) from page_views_bzip2;
Query ID = root_20180301002323_fdf409bc-f1af-4c2c-b26c-755722c31bfd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1519828756164_0008, Tracking URL = http://hadoop001:8088/proxy/application_1519828756164_0008/
Kill Command = /opt/software/hadoop/bin/hadoop job -kill job_1519828756164_0008
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2018-03-01 01:21:36,104 Stage-1 map = 0%, reduce = 0%
2018-03-01 01:21:50,147 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 2.95 sec
2018-03-01 01:21:53,385 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 7.77 sec
2018-03-01 01:21:55,517 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 8.63 sec
2018-03-01 01:21:56,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.46 sec
2018-03-01 01:22:02,852 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.77 sec
MapReduce Total cumulative CPU time: 11 seconds 770 msec
Ended Job = job_1519828756164_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2  Reduce: 1  Cumulative CPU: 11.77 sec  HDFS Read: 4106723 HDFS Write: 16 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 770 msec
OK
100000
Time taken: 34.986 seconds, Fetched: 1 row(s)
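Pulling the timings out of the job logs puts the trade-off in one place (a summary sketch; the values are copied from the count(*) job output above):

```python
# Query wall time (s), cumulative MapReduce CPU time (s), and HDFS bytes read,
# copied from the count(*) job logs above.
results = {
    "uncompressed": {"wall_s": 22.692, "cpu_s": 2.78,  "hdfs_read": 19021459},
    "lz4":          {"wall_s": 24.707, "cpu_s": 4.38,  "hdfs_read": 8753905},
    "snappy":       {"wall_s": 22.719, "cpu_s": 3.00,  "hdfs_read": 8820268},
    "bzip2":        {"wall_s": 34.986, "cpu_s": 11.77, "hdfs_read": 4106723},
}

# bzip2 reads the least data but spends the most CPU decompressing it;
# snappy and lz4 roughly halve the read volume at little extra CPU cost.
fastest = min(results, key=lambda k: results[k]["wall_s"])
print(f"fastest query: {fastest}")
```

Note that the bzip2 query ran with 2 mappers while the others ran with 1: bzip2 files are splittable, so Hadoop can process one file with multiple mappers, whereas plain Snappy- or LZ4-compressed text files are not.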
6. Compare the results

BZip2 gives the best compression ratio (18.1 M down to 3.6 M) but the slowest query (34.986 s wall time, 11.77 s CPU). Snappy and LZ4 compress to roughly half the original size while keeping query times close to the uncompressed table (about 22-25 s). For frequently queried data, Snappy or LZ4 is the usual choice; BZip2 suits cold data where storage savings matter most.