I. Remote Debugging
--------------------------------------------
1.Set the -agentlib:jdwp option on the server JVM.
[server]
//windows
//set JAVA_OPTS=%JAVA_OPTS% -agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=n
//linux
export HADOOP_CLIENT_OPTS=-agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=y
2.Start the Java program on the server
hadoop jar HdfsDemo.jar com.it18zhang.hdfs.mr.compress.TestCompress
3.With suspend=y, the server suspends and listens on port 8888.
Listening ...
4.From the client, attach a remote debugger to port 8888 on the remote host.
5.The client can now step through the program.
II. Adding the Ant plugin (maven-antrun-plugin) to pom.xml to copy files
-----------------------------------------------------------------------
1.In pom.xml, add the following (as a sibling of the dependencies element)
<!-- build - plugins -->
<project>
...
...
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-antrun-plugin</artifactId>
<version>1.8</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>run</goal>
</goals>
<configuration>
<tasks>
<echo>---------Copying the jar to the shared directory----------</echo>
<delete file="D:\share\my.jar"></delete>
<copy file="target\Test01-1.0-SNAPSHOT.jar" toFile="D:\share\my.jar">
</copy>
</tasks>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
...
...
</project>
III. Installing snappy with yum on CentOS
----------------------------------------------------------------
[google snappy]
$>sudo yum search snappy              #check whether a snappy package is available
$>sudo yum install -y snappy.x86_64   #install the snappy compression library
IV. Installing and Using LZO (CentOS)
--------------------------------------------------------------
1.Add the LZO dependency to pom.xml
<dependency>
<groupId>org.anarres.lzo</groupId>
<artifactId>lzo-hadoop</artifactId>
<version>1.0.0</version>
</dependency>
2.Install the lzo library on CentOS
$>sudo yum -y install lzo
3.Use mvn to download all of the artifact's dependencies (fill in the values from your own pom.xml)
a.cd into the directory containing pom.xml and run:
mvn -DoutputDirectory=./lib -DgroupId=groupId -DartifactId=Test01 -Dversion=1.0-SNAPSHOT dependency:copy-dependencies
b.mvn copies every dependency jar from the local .m2 repository into ./lib
4.All third-party jars the project depends on now sit under lib
5.Find lzo-hadoop.jar + lzo-core.jar + 305.jar and copy them into Hadoop's common lib directory.
$>cp lzo-hadoop.jar lzo-core.jar /soft/hadoop/share/hadoop/common/lib
6.Run the remote program.
V. Switching Maven to the Aliyun Mirror
-----------------------------------------------------------------------
[maven/conf/settings.xml]
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
...
...
<mirrors>
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>*</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
</mirrors>
...
...
</settings>
VI. File Format: SequenceFile
-------------------------------------------------------------------
1.SequenceFile
    Stores data as key-value pairs.
2.Not a text file but a binary file; it cannot be viewed directly.
3.Splittable
    Because it contains sync points.
    reader.sync(pos);   //seek to the first sync point after pos
    writer.sync();      //write a sync point
    A sync point typically occupies 20 bytes and forms a block of its own.
    A sync point written between two records acts as a marker between the end of one record and the start of the next.
    Note: next() moves the pointer to the end of the record, not past the following sync point; it is independent of sync points.
4.Compression modes
    1.no compression
    2.record compression    //only the value is compressed
    3.block compression     //several records form one block; both keys and values are compressed
5.Viewing a seq file
    hdfs dfs -text file:///d://seq//1.seq
6.Test class
/**
 * Tests for the SequenceFile input format.
 */
public class TestSeqFile {

    /**
     * Write.
     */
    @Test
    public void tsWrite() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, name, IntWritable.class, Text.class);
        for (int i = 0; i < 10; i++) {
            //write a sync point after each record
            w.append(new IntWritable(i), new Text("tom" + i));
            w.sync();
        }
        w.close();
    }

    /**
     * Read.
     */
    @Test
    public void tsRead() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, name, conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.get() + value.toString());
        }
        reader.close();
    }

    /**
     * Positioning.
     */
    @Test // sync points at 153 198 243 288 333 378
    public void tsPosSync() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, name, conf);
        //seek to the absolute position 153
        //reader.seek(153);
        //seek to the first sync point after 154; the next() calls that follow
        //read from the first record after that sync point
        reader.sync(154);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.get() + "_" + value.toString() + "_" + reader.getPosition());
        }
        reader.close();
    }

    /**
     * Compression.
     */
    @Test
    public void tsCompression() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.CompressionType compressionType = SequenceFile.CompressionType.BLOCK;
        CompressionCodec codec = new GzipCodec();
        SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, name, IntWritable.class, Text.class, compressionType, codec);
        for (int i = 0; i < 10; i++) {
            //write a sync point after each record
            w.append(new IntWritable(i), new Text("tom" + i));
            w.sync();
        }
        w.close();
    }
}
VII. File Format: MapFile
---------------------------------------------------------------
1.Key-value pairs.
2.Keys must be written in ascending order (duplicates allowed); descending keys are rejected. Duplicate keys do not overwrite; every entry is written.
3.A MapFile is a directory containing an index file and a data file, both of which are SequenceFiles.
4.The index file records key intervals, used for fast lookup.
5.Test
public class TestMapFile {

    /**
     * Write.
     */
    @Test
    public void tsWrite() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer w = new MapFile.Writer(conf, fs, "d:\\map", IntWritable.class, Text.class);
        //duplicate key: both entries are kept
        w.append(new IntWritable(1), new Text("tom" + 1));
        w.append(new IntWritable(1), new Text("tom" + 1));
        for (int i = 2; i < 10; i++) {
            w.append(new IntWritable(i), new Text("tom" + i));
        }
        w.close();
    }

    /**
     * Read.
     */
    @Test
    public void tsRead() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, "d:\\map", conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.get() + value.toString());
        }
        reader.close();
    }
}
VIII. Custom Partitioner
--------------------------------------------------------------------
1.Define the partitioner class
    public class MyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
            return 0;
        }
    }
2.Configure the job to use the partitioner class
    job.setPartitionerClass(MyPartitioner.class);
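Returning 0 routes every record to a single reducer, which is only useful for testing. A real partitioner usually hashes the key; the sketch below uses the same formula as Hadoop's default HashPartitioner, written in plain Java so it runs without Hadoop on the classpath (class and method names here are illustrative, not from the original notes).

```java
public class HashPartitionDemo {
    //Same formula as Hadoop's default HashPartitioner:
    //mask off the sign bit, then take the remainder, so the result
    //is always in [0, numPartitions).
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"tom1", "tom2", "tom3"}) {
            System.out.println(k + " -> partition " + getPartition(k, 3));
        }
    }
}
```

Records with equal keys always land in the same partition, which is what guarantees one reducer sees all values for a key.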
IX. Combiner (extends Reducer; any Reducer class can be used as a combiner)
--------------------------------------------------------------------------
1.A map-side reducer that pre-aggregates map output.
2.It saves network bandwidth by aggregating records on the map side before the shuffle. Not every job can use a combiner: the operation must be commutative and associative (e.g. sum or max), because the framework may run the combiner zero or more times.
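In a real job the combiner is enabled with job.setCombinerClass(...), typically reusing the job's Reducer class. The plain-Java sketch below (no Hadoop required; names are illustrative) shows why map-side aggregation shrinks what crosses the network: five (word, 1) pairs collapse to two (word, count) pairs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerDemo {
    //Simulated map output for word count: one (word, 1) pair per occurrence.
    static List<String> mapOutput = List.of("tom", "cat", "tom", "tom", "cat");

    //What a combiner does on the map side: sum counts per key so only
    //one pair per distinct word leaves this map task.
    static Map<String, Integer> combine(List<String> pairs) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : pairs) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Integer> out = combine(mapOutput);
        System.out.println("pairs before combine: " + mapOutput.size()); //5
        System.out.println("pairs after combine:  " + out.size());       //2
        System.out.println(out); //{tom=3, cat=2}
    }
}
```

Summing is safe here because it is commutative and associative; an operation like averaging cannot be combined this way directly.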