Hadoop LZO usage test

Copyright notice: this is an original post by the author and may not be reposted without permission. https://blog.csdn.net/qq_35744460/article/details/89315074

LZO test

The previous article, on hadoop lzo configuration, covered how to set LZO up.
This post tests how LZO behaves in actual use.

Prepare the raw data

   The log is 1.4G before compression and 213M after lzop compression:
    [root@spark001 hadoop]# du -sh *
    1.4G	baidu.log
    [root@spark001 hadoop]# lzop baidu.log
    [root@spark001 hadoop]# du -sh *
    1.4G	baidu.log
    213M	baidu.log.lzo

Upload to HDFS

[root@spark001 hadoop]# hdfs dfs -mkdir -p /user/hadoop/compress/log/200M
[root@spark001 hadoop]# hdfs dfs -put baidu.log.lzo /user/hadoop/compress/log/200M/


LZO without splitting
    ETL test
        hadoop jar hadoop-train-1.0.jar com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo /user/hadoop/compress/log/200M/ /user/hadoop/compress/log/etl_lzo/200/

      [root@spark001 hadoop]#  hadoop jar hadoop-train-1.0.jar  com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo  /user/hadoop/compress/log/200M/ /user/hadoop/compress/log/etl_lzo/200/
    19/04/15 17:06:34 INFO driver.LogETLDirverLzo: Processing trade with value: /user/hadoop/compress/log/etl_lzo/200/  
    19/04/15 17:06:34 INFO client.RMProxy: Connecting to ResourceManager at spark001/172.31.220.218:8032
    19/04/15 17:06:34 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    19/04/15 17:06:35 INFO input.FileInputFormat: Total input paths to process : 1
    19/04/15 17:06:35 INFO mapreduce.JobSubmitter: number of splits:1  (only 1 split: a plain .lzo file is not splittable)
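The single split follows from how FileInputFormat computes splits: when the input format reports the file as non-splittable, the whole file becomes one split regardless of its size. A minimal sketch of that logic (simplified; the real Hadoop code also applies a slop factor near the last split boundary):

```python
import math

def compute_splits(file_size, block_size, splittable):
    """Simplified FileInputFormat split logic:
    a non-splittable file is always exactly one split."""
    if file_size == 0:
        return 0
    if not splittable:
        return 1
    return math.ceil(file_size / block_size)

# the 213M (222877272-byte) .lzo file without an index, 128 MB blocks:
print(compute_splits(222877272, 128 * 1024 * 1024, splittable=False))  # 1
# the same file once indexed becomes splittable:
print(compute_splits(222877272, 128 * 1024 * 1024, splittable=True))   # 2
```

This matches both runs in this post: 1 split without an index, 2 splits with one.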

LZO with splitting

An LZO index must be built first.

[root@spark001 hadoop]# hdfs dfs -mkdir -p /user/hadoop/compress/log/200M_index
[root@spark001 hadoop]# hdfs dfs -put baidu.log.lzo /user/hadoop/compress/log/200M_index/
Create the index:
 hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.13.1.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/user/hadoop/compress/log/200M_index
A file with the .index suffix is generated in the same directory:
[root@spark001 hadoop]# hdfs dfs -ls /user/hadoop/compress/log/200M_index/
Found 2 items
-rw-r--r--   3 root supergroup  222877272 2019-04-15 16:00 /user/hadoop/compress/log/200M_index/baidu.log.lzo
-rw-r--r--   3 root supergroup      43384 2019-04-15 16:09 /user/hadoop/compress/log/200M_index/baidu.log.lzo.index
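The .index file is just a flat list of 8-byte big-endian offsets, one per compressed LZO block, which is what lets the input format seek to a block boundary inside each split. A hypothetical reader sketch (the format assumption matches hadoop-lzo's LzoIndex; verify against your version):

```python
import struct

def read_lzo_index(path):
    """Read a hadoop-lzo .index file: a sequence of 8-byte
    big-endian offsets into the compressed .lzo file."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(8)
            if len(chunk) < 8:
                break
            offsets.append(struct.unpack(">q", chunk)[0])
    return offsets
```

At 8 bytes per entry, the 43384-byte index above implies 43384 / 8 = 5423 compressed blocks in baidu.log.lzo.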
Run the ETL job:
[root@spark001 hadoop]#  hadoop jar hadoop-train-1.0.jar  com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo  /user/hadoop/compress/log/200M_index/ /user/hadoop/compress/log/etl_lzo/200_index/
19/04/15 17:10:09 INFO driver.LogETLDirverLzo: Processing trade with value: /user/hadoop/compress/log/etl_lzo/200_index/  
19/04/15 17:10:09 INFO client.RMProxy: Connecting to ResourceManager at spark001/172.31.220.218:8032
19/04/15 17:10:09 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/15 17:10:09 INFO input.FileInputFormat: Total input paths to process : 2
19/04/15 17:10:10 INFO mapreduce.JobSubmitter: number of splits:2                  (2 splits)
This shows that for LZO to support splitting, an index must be created.

Summary

   The 213M file should yield 2 splits if splitting is supported, and only 1 split if it is not.
    With splitting, a file larger than the block size is processed by 2 maps, which improves efficiency.
    Without splitting, a single map processes the whole file no matter how large it is, which wastes time.
	So in practice, control the size of the generated .lzo files and keep each one within one block: without an .index file the entire .lzo is handled by a single map, and an oversized file makes that one map run far too long.

	Alternatively, pair each .lzo with its .index file so that splits are supported. The benefit is that file size is no longer constrained, so files can be made somewhat larger, which helps reduce the total number of files. The trade-off is that generating the .index file, while it takes little space, still carries its own overhead.
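That index overhead can be estimated. Assuming lzop's default uncompressed block size of 256 KB and one 8-byte index entry per compressed block (both are assumptions about the defaults in use here), a rough sketch:

```python
import math

LZO_BLOCK_SIZE = 256 * 1024   # lzop default uncompressed block size (assumption)
BYTES_PER_ENTRY = 8           # one 8-byte offset per compressed block

def estimate_index_size(uncompressed_size):
    """Rough size in bytes of the .lzo.index file for a given input."""
    return math.ceil(uncompressed_size / LZO_BLOCK_SIZE) * BYTES_PER_ENTRY

# the 43384-byte index observed above implies 5423 blocks, i.e. roughly
# 5423 * 256 KB ~ 1.3 GiB of raw data -- consistent with the 1.4G
# reported by du for baidu.log
print(estimate_index_size(5423 * LZO_BLOCK_SIZE))  # 43384
```

So the index grows linearly with the input, but at roughly 8 bytes per 256 KB it stays tiny relative to the data.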

Link: the full ETL code
