I. Remote Debugging
--------------------------------------------
1.Set the -agentlib:jdwp option on the server JVM.
[server]
//windows
//set JAVA_OPTS=%JAVA_OPTS% -agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=n
//linux
export HADOOP_CLIENT_OPTS=-agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=y
2.Start the Java program on the server
hadoop jar HdfsDemo.jar com.it18zhang.hdfs.mr.compress.TestCompress
3.With suspend=y, the server suspends and listens on port 8888.
Listening ...
4.From the client, attach a remote debugger to port 8888 on the remote host.
5.The client can now step through the program.
II. Adding the Ant plugin (maven-antrun-plugin) to pom.xml to copy files
-----------------------------------------------------------------------
1.In pom.xml, add the following (as a sibling of the dependencies element)
<!-- build - plugins -->
<project>
...
...
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-antrun-plugin</artifactId>
<version>1.8</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>run</goal>
</goals>
<configuration>
<tasks>
<echo>---------Copying the jar to the shared directory----------</echo>
<delete file="D:\share\my.jar"></delete>
<copy file="target\Test01-1.0-SNAPSHOT.jar" toFile="D:\share\my.jar">
</copy>
</tasks>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
...
...
</project>
III. Installing snappy with yum on CentOS
----------------------------------------------------------------
[google snappy]
$>sudo yum search snappy              #check whether a snappy package is available
$>sudo yum install -y snappy.x86_64   #install the snappy compression library
IV. Installing and Using LZO (CentOS)
--------------------------------------------------------------
1.Add the LZO dependency to pom.xml
<dependency>
<groupId>org.anarres.lzo</groupId>
<artifactId>lzo-hadoop</artifactId>
<version>1.0.0</version>
</dependency>
2.Install the lzo library on CentOS
$>sudo yum -y install lzo
3.Use mvn to download all of the artifact's dependencies (fill in the values from your own pom.xml)
a.cd into the directory containing pom.xml and run:
mvn -DoutputDirectory=./lib -DgroupId=groupId -DartifactId=Test01 -Dversion=1.0-SNAPSHOT dependency:copy-dependencies
b.mvn copies every dependency jar from the local .m2 repository into ./lib
4.All third-party jars the project depends on now sit under lib
5.Find lzo-hadoop.jar + lzo-core.jar + 305.jar and copy them into Hadoop's common lib directory.
$>cp lzo-hadoop.jar lzo-core.jar /soft/hadoop/share/hadoop/common/lib
6.Run the remote program.
V. Switching Maven to the Aliyun Mirror
-----------------------------------------------------------------------
[maven/conf/settings.xml]
<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
...
...
<mirrors>
<mirror>
<id>nexus-aliyun</id>
<mirrorOf>*</mirrorOf>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
</mirrors>
...
...
</settings>
VI. File Format: SequenceFile
-------------------------------------------------------------------
1.SequenceFile
    Stores data as key-value pairs.
2.Not a text file but a binary file; it cannot be viewed directly.
3.Splittable
    Because it contains sync points.
    reader.sync(pos);   //seek to the first sync point after pos
    writer.sync();      //write a sync point
    A sync point typically occupies 20 bytes and forms a block of its own.
    A sync point written between two records acts as a marker between the end of one record and the start of the next.
    Note: next() moves the pointer to the end of the record, not past the following sync point; it is independent of sync points.
4.Compression modes
    1.no compression
    2.record compression    //only the value is compressed
    3.block compression     //several records form one block; both keys and values are compressed
5.Viewing a seq file
    hdfs dfs -text file:///d://seq//1.seq
6.Test class
/**
 * Tests for the SequenceFile input format.
 */
public class TestSeqFile {

    /**
     * Write.
     */
    @Test
    public void tsWrite() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, name, IntWritable.class, Text.class);
        for (int i = 0; i < 10; i++) {
            //write a sync point after each record
            w.append(new IntWritable(i), new Text("tom" + i));
            w.sync();
        }
        w.close();
    }

    /**
     * Read.
     */
    @Test
    public void tsRead() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, name, conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.get() + value.toString());
        }
        reader.close();
    }

    /**
     * Positioning.
     */
    @Test // sync points at 153 198 243 288 333 378
    public void tsPosSync() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, name, conf);
        //seek to the absolute position 153
        //reader.seek(153);
        //seek to the first sync point after 154; the next() calls that follow
        //read from the first record after that sync point
        reader.sync(154);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.get() + "_" + value.toString() + "_" + reader.getPosition());
        }
        reader.close();
    }

    /**
     * Compression.
     */
    @Test
    public void tsCompression() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        Path name = new Path("d:\\seq\\1.seq");
        SequenceFile.CompressionType compressionType = SequenceFile.CompressionType.BLOCK;
        CompressionCodec codec = new GzipCodec();
        SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, name, IntWritable.class, Text.class, compressionType, codec);
        for (int i = 0; i < 10; i++) {
            //write a sync point after each record
            w.append(new IntWritable(i), new Text("tom" + i));
            w.sync();
        }
        w.close();
    }
}
VII. File Format: MapFile
---------------------------------------------------------------
1.Key-value pairs.
2.Keys must be written in ascending order (duplicates allowed); descending keys are rejected. Duplicate keys do not overwrite; every entry is written.
3.A MapFile is a directory containing an index file and a data file, both of which are SequenceFiles.
4.The index file records key intervals, used for fast lookup.
5.Test
public class TestMapFile {

    /**
     * Write.
     */
    @Test
    public void tsWrite() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer w = new MapFile.Writer(conf, fs, "d:\\map", IntWritable.class, Text.class);
        //duplicate key: both entries are kept
        w.append(new IntWritable(1), new Text("tom" + 1));
        w.append(new IntWritable(1), new Text("tom" + 1));
        for (int i = 2; i < 10; i++) {
            w.append(new IntWritable(i), new Text("tom" + i));
        }
        w.close();
    }

    /**
     * Read.
     */
    @Test
    public void tsRead() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, "d:\\map", conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
            System.out.println(key.get() + value.toString());
        }
        reader.close();
    }
}
VIII. Custom Partitioner
--------------------------------------------------------------------
1.Define the partitioner class
    public class MyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
            return 0;
        }
    }
2.Configure the job to use the partitioner class
    job.setPartitionerClass(MyPartitioner.class);
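Returning 0 routes every record to a single reducer, which is only useful for testing. A real partitioner usually hashes the key; the sketch below uses the same formula as Hadoop's default HashPartitioner, written in plain Java so it runs without Hadoop on the classpath (class and method names here are illustrative, not from the original notes).

```java
public class HashPartitionDemo {
    //Same formula as Hadoop's default HashPartitioner:
    //mask off the sign bit, then take the remainder, so the result
    //is always in [0, numPartitions).
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"tom1", "tom2", "tom3"}) {
            System.out.println(k + " -> partition " + getPartition(k, 3));
        }
    }
}
```

Records with equal keys always land in the same partition, which is what guarantees one reducer sees all values for a key.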
IX. Combiner (extends Reducer; any Reducer class can be used as a combiner)
--------------------------------------------------------------------------
1.A map-side reducer that pre-aggregates map output.
2.It saves network bandwidth by aggregating records on the map side before the shuffle. Not every job can use a combiner: the operation must be commutative and associative (e.g. sum or max), because the framework may run the combiner zero or more times.
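In a real job the combiner is enabled with job.setCombinerClass(...), typically reusing the job's Reducer class. The plain-Java sketch below (no Hadoop required; names are illustrative) shows why map-side aggregation shrinks what crosses the network: five (word, 1) pairs collapse to two (word, count) pairs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerDemo {
    //Simulated map output for word count: one (word, 1) pair per occurrence.
    static List<String> mapOutput = List.of("tom", "cat", "tom", "tom", "cat");

    //What a combiner does on the map side: sum counts per key so only
    //one pair per distinct word leaves this map task.
    static Map<String, Integer> combine(List<String> pairs) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : pairs) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Integer> out = combine(mapOutput);
        System.out.println("pairs before combine: " + mapOutput.size()); //5
        System.out.println("pairs after combine:  " + out.size());       //2
        System.out.println(out); //{tom=3, cat=2}
    }
}
```

Summing is safe here because it is commutative and associative; an operation like averaging cannot be combined this way directly.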