Hadoop LZO Compression Support

1. Prerequisites

  • Hadoop built and installed

  • Java & Maven installed and configured

  • Required libraries installed (a quick toolchain check follows this list):

     yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
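
Before building anything, it helps to confirm the toolchain is actually available on the PATH (a minimal sanity check; any reasonably recent versions should do):

gcc --version | head -n 1
java -version
mvn -v | head -n 1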

2. Install lzo

2.1 Download

# Download
wget www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz

# Extract
[hadoop@hadoop000 app]$ tar -zxvf lzo-2.06.tar.gz -C ../app

2.2 Configure and build

[hadoop@hadoop000 app]$ cd lzo-2.06/
[hadoop@hadoop000 lzo-2.06]$ export CFLAGS=-m64

# Create a directory to hold the compiled lzo output
[hadoop@hadoop000 lzo-2.06]$ mkdir complie

# Point configure at the install location
[hadoop@hadoop000 lzo-2.06]$ ./configure --enable-shared --prefix=/home/hadoop/app/lzo-2.06/complie/

# Build and install
[hadoop@hadoop000 lzo-2.06]$ make && make install

# Verify the install; seeing the following directories means it worked
[hadoop@hadoop000 lzo-2.06]$ cd complie/
[hadoop@hadoop000 complie]$ ll
total 12
drwxrwxr-x 3 hadoop hadoop 4096 Dec  6 17:08 include
drwxrwxr-x 2 hadoop hadoop 4096 Dec  6 17:08 lib
drwxrwxr-x 3 hadoop hadoop 4096 Dec  6 17:08 share
[hadoop@hadoop000 complie]$ 
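
It is also worth confirming that the shared library itself was produced, since hadoop-lzo links against liblzo2 (a quick check; exact file names can vary slightly between lzo versions, but you should see liblzo2.so here alongside the static liblzo2.a):

[hadoop@hadoop000 complie]$ ls lib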


3. Install hadoop-lzo

3.1 Download & extract

# Download
[hadoop@hadoop000 soft]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip

# Extract
[hadoop@hadoop000 soft]$ unzip master.zip

# If unzip is not installed, install it with yum first
[root@hadoop000 ~]# yum -y install unzip

3.2 Edit the pom.xml under hadoop-lzo-master

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- change this to match your Hadoop version -->
    <hadoop.current.version>2.6.0</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
  </properties>
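
If you are unsure which version to put there, running hadoop version prints it (assuming $HADOOP_HOME/bin is on the PATH; the examples jar used later implies 2.6.0-cdh5.7.0 on this machine):

[hadoop@hadoop000 ~]$ hadoop version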

3.3 Export build environment variables

[hadoop@hadoop000 app]$ cd hadoop-lzo-master/
[hadoop@hadoop000 hadoop-lzo-master]$ export CFLAGS=-m64
[hadoop@hadoop000 hadoop-lzo-master]$  export CXXFLAGS=-m64
[hadoop@hadoop000 hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/lzo-2.06/complie/include/     # the include directory of the lzo built in step 2
[hadoop@hadoop000 hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib/           # the lib directory of the lzo built in step 2

3.4 Build

[hadoop@hadoop000 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true

When BUILD SUCCESS appears, the build has succeeded!

3.5 Install the build artifacts

# Inspect the build output
[hadoop@hadoop000 hadoop-lzo-master]$ ll
total 80
-rw-rw-r--  1 hadoop hadoop 35147 Oct 13  2017 COPYING
-rw-rw-r--  1 hadoop hadoop 19753 Dec  6 17:18 pom.xml
-rw-rw-r--  1 hadoop hadoop 10170 Oct 13  2017 README.md
drwxrwxr-x  2 hadoop hadoop  4096 Oct 13  2017 scripts
drwxrwxr-x  4 hadoop hadoop  4096 Oct 13  2017 src
drwxrwxr-x 10 hadoop hadoop  4096 Dec  6 17:21 target

# Change into target/native/Linux-amd64-64 and copy the native libraries out
[hadoop@hadoop000 hadoop-lzo-master]$ cd target/native/Linux-amd64-64
[hadoop@hadoop000 Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
./
./libgplcompression.so
./libgplcompression.so.0
./libgplcompression.la
./libgplcompression.a
./libgplcompression.so.0.0.
[hadoop@hadoop000 Linux-amd64-64]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/


# Important: copy hadoop-lzo-0.4.21-SNAPSHOT.jar into the Hadoop installation
[hadoop@hadoop000 hadoop-lzo-master]$  cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/ 
[hadoop@hadoop000 hadoop-lzo-master]$  cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/mapreduce/lib
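
A quick check that the native library and the jar both landed where Hadoop will look for them (a sanity check, assuming the paths used above):

[hadoop@hadoop000 hadoop-lzo-master]$ ls $HADOOP_HOME/lib/native/ | grep gplcompression
[hadoop@hadoop000 hadoop-lzo-master]$ ls $HADOOP_HOME/share/hadoop/common/ | grep hadoop-lzo
[hadoop@hadoop000 hadoop-lzo-master]$ ls $HADOOP_HOME/share/hadoop/mapreduce/lib/ | grep hadoop-lzo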

4. Configure Hadoop

4.1 Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Add the lib directory of the compiled lzo
export LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib

4.2 Edit $HADOOP_HOME/etc/hadoop/core-site.xml

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

4.3 Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=/home/hadoop/app/lzo-2.06/complie/lib</value>
</property>
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

4.4 Restart Hadoop
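
A restart sketch using the standard sbin scripts (for a single-node setup; substitute your own cluster's restart procedure if it differs):

[hadoop@hadoop000 ~]$ $HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh
[hadoop@hadoop000 ~]$ $HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh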

5. Using LZO in Hadoop

5.1 Prepare the data

I have a large data file ready; let's compress it with lzop.

# Original size
[hadoop@hadoop000 data]$ ls -lh
total 5.0G
-rw-r--r-- 1 hadoop hadoop 5.0G Dec  5 17:58 access.20161111.log
[hadoop@hadoop000 data]$ 
# Compress with lzop (by default it keeps the original file)
lzop access.20161111.log
# Size after compression: 5.0G shrank to 878M
[hadoop@hadoop000 data]$ ls -lh
total 5.9G
-rw-r--r-- 1 hadoop hadoop 5.0G Dec  5 17:58 access.20161111.log
-rw-r--r-- 1 hadoop hadoop 878M Dec  5 17:58 access.20161111.log.lzo
[hadoop@hadoop000 data]$ 

5.2 Upload the data to HDFS

# Upload
[hadoop@hadoop000 data]$ hdfs dfs -put access.20161111.log.lzo /data

# Verify the upload
[hadoop@hadoop000 data]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r--   1 hadoop supergroup  920128684 2018-12-06 18:36 /data/access.20161111.log.lzo
[hadoop@hadoop000 data]$ 
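
At this point, hdfs dfs -text makes a quick smoke test for the codec configuration, since it decompresses through the codecs registered in core-site.xml (assuming the restart in 4.4 picked up the new config):

[hadoop@hadoop000 data]$ hdfs dfs -text /data/access.20161111.log.lzo | head -n 1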

5.3 Run the Hadoop wordcount example

[hadoop@hadoop000 mapreduce]$ hadoop jar \
hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
/data/access.20161111.log.lzo \
/out

Looking at the job output, you can see **number of splits:1**, which means Hadoop did not split my lzo file:


18/12/06 18:39:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:39:00 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:39:00 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/12/06 18:39:00 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/12/06 18:39:01 INFO mapreduce.JobSubmitter: number of splits:1
18/12/06 18:39:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0001
18/12/06 18:39:01 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0001
18/12/06 18:39:01 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0001/
18/12/06 18:39:01 INFO mapreduce.Job: Running job: job_1544089631050_0001

5.4 Build an index for the lzo file

From earlier study I know that a splittable lzo file needs an lzo index file, so next let's generate one.

[hadoop@hadoop000 hadoop-2.6.0-cdh5.7.0]$ hadoop jar \
share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/data/access.20161111.log.lzo

[hadoop@hadoop000 mapreduce]$ hdfs dfs -ls /data
Found 2 items
-rw-r--r--   1 hadoop supergroup  920128684 2018-12-06 18:36 /data/access.20161111.log.lzo
-rw-r--r--   1 hadoop supergroup     163088 2018-12-06 18:42 /data/access.20161111.log.lzo.index
[hadoop@hadoop000 mapreduce]$ 
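
As an aside, hadoop-lzo also ships a single-process indexer, com.hadoop.compression.lzo.LzoIndexer, which builds the index locally instead of launching a MapReduce job; for a single file either one works (a sketch using the same jar as above):

[hadoop@hadoop000 hadoop-2.6.0-cdh5.7.0]$ hadoop jar \
share/hadoop/mapreduce/lib/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.LzoIndexer \
/data/access.20161111.log.lzo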

As shown above, the index file has been generated. Let's run wordcount again and see whether Hadoop can now split the lzo file.

[hadoop@hadoop000 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /data/access.20161111.log.lzo /out1
18/12/06 18:45:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:45:02 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:45:02 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
18/12/06 18:45:02 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
18/12/06 18:45:02 INFO mapreduce.JobSubmitter: number of splits:1
18/12/06 18:45:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0003
18/12/06 18:45:03 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0003
18/12/06 18:45:03 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0003/
18/12/06 18:45:03 INFO mapreduce.Job: Running job: job_1544089631050_0003

The output above shows that Hadoop still cannot split my lzo file (still number of splits:1), so back to the documentation…

5.5 The fix

Further reading revealed that the index alone is not enough: the job itself also needs a corresponding change, setting its input format to LzoTextInputFormat. Otherwise the index file is treated as just another input file, and still only one map runs.

So I change the job submission, adding -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat:

[hadoop@hadoop000 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/data/access.20161111.log.lzo \
/out3
18/12/06 18:48:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/12/06 18:48:40 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 18:48:40 INFO mapreduce.JobSubmitter: number of splits:7
18/12/06 18:48:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1544089631050_0005
18/12/06 18:48:41 INFO impl.YarnClientImpl: Submitted application application_1544089631050_0005
18/12/06 18:48:41 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1544089631050_0005/
18/12/06 18:48:41 INFO mapreduce.Job: Running job: job_1544089631050_0005

The result above shows number of splits:7: Hadoop can now automatically split my lzo file.
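
Once the job finishes, the output can be inspected directly (wordcount writes part-r-* files; a quick peek, assuming the job ran to completion):

[hadoop@hadoop000 mapreduce]$ hdfs dfs -ls /out3
[hadoop@hadoop000 mapreduce]$ hdfs dfs -cat /out3/part-r-00000 | head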

Success!

Reposted from blog.csdn.net/weixin_40420525/article/details/84869883