Errors encountered while learning Hadoop

 

0 Why does reduce also do grouping:

File 1 ---> map1 groups ---> one group for Zhang San, one group for Li Si
File 2 ---> map2 groups ---> one group for Zhang San, one group for Li Si

In the map phase, file 1 and file 2 are each grouped only inside their own map task; no grouping happens across map1 and map2. Only at the reduce stage can all the data be merged together and grouped globally, which is why reduce groups again.
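
A minimal sketch of the reduce side of this, assuming Text keys (the names above) and LongWritable counts; the class and variable names are illustrative, not from the original post:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each reduce() call receives one key (e.g. "Zhang San") together with the values
// from *all* map tasks, already merged and grouped by the shuffle.
class GroupCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
	@Override
	protected void reduce(Text key, Iterable<LongWritable> values, Context context)
			throws IOException, InterruptedException {
		long count = 0;
		for (LongWritable v : values) {
			count += v.get(); // values from map1 and map2 arrive in the same call
		}
		context.write(key, new LongWritable(count));
	}
}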

0.1

map tasks ---> the number is determined by the number of input splits, which by default is the number of HDFS blocks of the input file
map function: called once for every line of the input file

reduce tasks ---> determined by the number of partitions; the partition logic can be implemented yourself, and by default there is a single partition

                            for details see "hadoop partition: introduction and custom partitioners"; a Partitioner sketch follows below
reduce function: called as many times as there are groups (distinct keys) produced from the map output
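
A minimal sketch, assuming Text keys and LongWritable values (the class name is illustrative), of a custom Partitioner together with the driver calls that set the number of reduce tasks; by default Hadoop uses HashPartitioner and a single reducer:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reduce task; the return value must lie in [0, numPartitions).
class MyPartitioner extends Partitioner<Text, LongWritable> {
	@Override
	public int getPartition(Text key, LongWritable value, int numPartitions) {
		return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
	}
}

// In the driver, where job is an org.apache.hadoop.mapreduce.Job:
//   job.setPartitionerClass(MyPartitioner.class);
//   job.setNumReduceTasks(2); // number of reduce tasks = number of partitions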

1 When writing a custom reducer in Eclipse,

either give the Context parameter its full generic type:

class MyReducer2 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable>{

	protected void reduce(LongWritable k2, Iterable<LongWritable> v2s,
			org.apache.hadoop.mapreduce.Reducer<LongWritable,LongWritable,LongWritable,LongWritable>.Context context)
			throws IOException, InterruptedException {
		 System.out.println("reduce2");
	}

}

or leave off the generics, and the package path along with them:

class MyReducer1 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable>{
	protected void reduce(LongWritable k2, java.lang.Iterable<LongWritable> v2s, Context context) throws java.io.IOException ,InterruptedException {
	   System.out.println("reduce");
	}
}

If you use the package path but not the generics (the raw type org.apache.hadoop.mapreduce.Reducer.Context), reduce is never entered: with the raw Context the method signature no longer matches Reducer's generic reduce method, so it does not override it, and the framework keeps calling the inherited default reduce instead. Eclipse flags this form with a yellow warning squiggle telling you to add the type parameters.

class MyReducer2 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable>{

	protected void reduce(LongWritable k2, Iterable<LongWritable> v2s,
			org.apache.hadoop.mapreduce.Reducer.Context context)
			throws IOException, InterruptedException {
		 System.out.println("reduce2");
	}

}   
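
A minimal sketch (the class name MyReducer3 and the println are illustrative) of the safest form: keep Context unqualified or fully parameterized and add @Override, so the compiler rejects any method that does not actually override Reducer#reduce, including the raw Reducer.Context variant above:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

class MyReducer3 extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
	// With the raw org.apache.hadoop.mapreduce.Reducer.Context parameter this method
	// would not be an override, and @Override would turn that into a compile error
	// instead of a silently skipped reduce.
	@Override
	protected void reduce(LongWritable k2, Iterable<LongWritable> v2s, Context context)
			throws IOException, InterruptedException {
		System.out.println("reduce3");
	}
}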

2 When defining a custom key on the map side (usually an entity class), if the class has a String-typed field, the way that field is written to and read from the data stream differs from long and the other primitive types, as follows:

public static class MyUser implements Writable, DBWritable {
	int id;
	String name;

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(id);
		Text.writeString(out, name); // use org.apache.hadoop.io.Text to write the String field
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.id = in.readInt();
		this.name = Text.readString(in); // use org.apache.hadoop.io.Text to read the String field
	}

	// DBWritable methods (required by the interface; the column indices here are illustrative)
	@Override
	public void write(PreparedStatement statement) throws SQLException {
		statement.setInt(1, id);
		statement.setString(2, name);
	}

	@Override
	public void readFields(ResultSet resultSet) throws SQLException {
		this.id = resultSet.getInt(1);
		this.name = resultSet.getString(2);
	}
}

Otherwise you get an error whose stack trace includes:

 java.io.DataInputStream.readFully(Unknown Source)
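
A quick way to convince yourself that write() and readFields() stay symmetric is a round trip through a byte stream. This is a hypothetical test sketch, assuming the MyUser class above is visible from the test class:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class MyUserRoundTrip {
	public static void main(String[] args) throws IOException {
		MyUser original = new MyUser();
		original.id = 1;
		original.name = "Zhang San";

		// Serialize the same way Hadoop does when it ships the key between tasks.
		ByteArrayOutputStream bytes = new ByteArrayOutputStream();
		original.write(new DataOutputStream(bytes));

		// Deserialize into a fresh instance and check the fields survived.
		MyUser copy = new MyUser();
		copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
		System.out.println(copy.id + " " + copy.name); // expected: 1 Zhang San
	}
}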


Reposted from chengjianxiaoxue.iteye.com/blog/2165115