Reduce-side join (weather stations and weather records)

Code:
https://gitee.com/tanghongping/hadoopMapReduce/tree/master/src/com/thp/bigdata/rjon/station

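Scenario:
Suppose there are two datasets, a weather station database and a weather record database, and we want to combine them into one.
A typical query: output the historical weather records, with each row also carrying the metadata of its station.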

Reduce join
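Joining on the reduce side is the most common way to implement a join in the MapReduce framework. It works as follows:
Map side: tag each key/value pair coming from the different tables (files) so that records from different sources can be told apart. Then use the join field (the field the two tables have in common) as the key and the rest of the record plus the new tag as the value, and emit the pair.
Reduce side: by the time reduce runs, grouping by the join field is already done. Within each group we only need to separate the records by source file (using the tag added in the map phase) and then combine them.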
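
The two sides described above can be imitated in plain Java without a cluster. The following is a minimal sketch, not the code from the repository: the class name `TaggedReduceJoinSketch`, the tags "S" and "R", and the sample rows are made up for illustration, and a `TreeMap` stands in for the shuffle that groups values by join key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TaggedReduceJoinSketch {

    // "Map side": split off the join key and tag the rest of the line with its source.
    // Each stored entry is {tag, payload}; lines without a payload are skipped.
    static void emit(Map<String, List<String[]>> groups, String line, String tag) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            return;
        }
        groups.computeIfAbsent(parts[0], k -> new ArrayList<>())
              .add(new String[] {tag, parts[1]});
    }

    static List<String> join(List<String> stations, List<String> records) {
        // The TreeMap plays the role of the shuffle: values grouped by join key.
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String s : stations) emit(groups, s, "S");
        for (String r : records) emit(groups, r, "R");

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> group : groups.entrySet()) {
            // "Reduce side": separate the group by tag ...
            List<String> names = new ArrayList<>();
            List<String> recs = new ArrayList<>();
            for (String[] tagged : group.getValue()) {
                if (tagged[0].equals("S")) names.add(tagged[1]);
                else recs.add(tagged[1]);
            }
            // ... then combine every station row with every record row.
            for (String name : names) {
                for (String rec : recs) {
                    out.add(group.getKey() + "\t" + name + "\t" + rec);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> stations = Arrays.asList("S001 Alpha", "S002 Beta");
        List<String> records = Arrays.asList("S002 2020-01-02 5", "S001 2020-01-01 -3");
        for (String row : join(stations, records)) {
            System.out.println(row);
        }
    }
}
```

Note that this naive version has to buffer a whole group in memory before it can combine the two sides.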

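This raises a problem: for a given station ID, the map phase sends data from both files (station.txt and record.txt) to reduce, so we need to tell which values came from station.txt and which came from record.txt.
It also helps to remember that the two files are in a one-to-many relationship.
The station table behind station.txt is the "one" side; the weather record table behind record.txt is the "many" side.
In other words, when a reducer processes a group, the iterator holds values that all share the same station ID, but only one of them comes from station.txt; all the rest come from record.txt.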

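So the solution is a secondary sort: within each station ID, sort by the source tag so that the single station.txt row reaches the reducer before the record.txt rows.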

One thing to note
Secondary sort fits this scenario only when the join key is unique in one of the two tables.
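
To make the ordering concrete, here is a small plain-Java sketch of what a secondary sort does to composite keys. It is an illustration only: the class name `SecondarySortSketch` and the sample rows are hypothetical, and a `Comparator` on string arrays stands in for the key comparison that Hadoop performs during the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {

    // Each entry is {stationId, tag, payload}; tag "0" marks a station row, "1" a weather record.
    static List<String[]> sortByIdThenTag(List<String[]> entries) {
        List<String[]> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparing((String[] e) -> e[0])   // primary: station ID
                              .thenComparing(e -> e[1]));        // secondary: source tag
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[] {"S002", "1", "2020-01-02 5"},
            new String[] {"S001", "1", "2020-01-01 -3"},
            new String[] {"S001", "0", "Alpha"},
            new String[] {"S002", "0", "Beta"});
        // After sorting, each station ID starts with its "0" row, followed by its "1" rows.
        for (String[] e : sortByIdThenTag(entries)) {
            System.out.println(String.join("\t", e));
        }
    }
}
```

Because the "one" side contributes exactly one row per key, the reducer can read that row first and then stream the rest without buffering them.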

package com.thp.bigdata.rjon.station;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Custom composite key: station ID plus a source tag
 * @author tommy
 *
 */
public class TextPair implements WritableComparable<TextPair>{

	
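	/**
	 * Both fields are Text, presumably because Text already handles its own serialization and deserialization
	 */
	private Text first;    // first is the station ID
	
	private Text second;   // second is the source tag
	
	// Several constructors, for flexibility and convenience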
	public TextPair() {
		set(new Text(), new Text());
	}
	
	public TextPair(String first, String second) {
		set(new Text(first), new Text(second));
	}
	
	public TextPair(Text first, Text second) {
		set(first, second);
	}
	
	
	public void set(Text first, Text second) {
		this.first = first;
		this.second = second;
	}

	public Text getFirst() {
		return first;
	}

	
	public Text getSecond() {
		return second;
	}

	
	// Serialization: write the object to the output stream out as bytes
	@Override
	public void write(DataOutput out) throws IOException {
		first.write(out);
		second.write(out);
	}
	// Deserialization: read bytes from the input stream and restore the object's fields
	@Override
	public void readFields(DataInput in) throws IOException {
		first.readFields(in);
		second.readFields(in);
	}

	// hashCode() must be overridden so that it stays consistent with equals()
	@Override
	public int hashCode() {
		return first.hashCode()*163 + second.hashCode();
	}
	
	// equals() is overridden: two keys are equal when both first and second are equal
	@Override
	public boolean equals(Object obj) {
		if(obj instanceof TextPair) {
			TextPair tp = (TextPair) obj;
			return first.equals(tp.first) && second.equals(tp.second);
		}
		return false;
	}
	
	
	
	// Be careful when writing compareTo(): a small slip, such as calling second.compareTo(second)
	// instead of second.compareTo(o.second), silently breaks the ordering for the whole job.
	// compareTo() must be implemented because TextPair objects are emitted as the map output key,
	// and the framework calls compareTo() to sort them.
	// Only with compareTo() in place does the tag field second take effect.
	@Override
	public int compareTo(TextPair o) {
		if(!first.equals(o.first)) {
			return first.compareTo(o.first);
		} else if(!second.equals(o.second)) {
			return second.compareTo(o.second);
		}
		return 0;
	}
	
	
	@Override
	public String toString() {
		return first + "\t" + second;
	}
	
}

station.txt:
Each line of this file has only two fields: stationId and stationName.

record.txt:

We also treat this file as having two fields: the stationID, and everything else as a single descriptive field (lumped together, since it is all strings).

We now use two Mappers: one processes station.txt and the other processes record.txt.

Mapper for station.txt:
station.txt is the "one" side, so to keep the join easy to process its row must be the first to reach the reducer.

	/**
	 * Station mapper: tagged "0", so its output reaches the reducer first
	 * @author lenovo
	 *
	 */
	public static class JoinStationMapper extends Mapper<LongWritable, Text, TextPair, Text> {
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
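			String[] arr = line.split("\\s+");  // \\s matches whitespace (space, tab, newline, ...); + means one or more
			int length = arr.length;
			if(length == 2) {  // the line matches the expected format
				// key = (station ID, "0")   value = station name
				// note that the value emitted is the station name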
				context.write(new TextPair(arr[0], "0"), new Text(arr[1]));
			}
			
		}
	}

Mapper for record.txt:

    /**
	 * Weather record mapper: tagged "1", so its output reaches the reducer after the station row
	 * @author lenovo
	 *
	 */
	public static class JoinRecordMapper extends Mapper<LongWritable, Text, TextPair, Text> {
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			String[] arr = line.split("\\s+");  // parse the weather record
			if(arr.length == 3) {
				// key = (station ID, "1")   value = weather record data
				context.write(new TextPair(arr[0], "1"), new Text(arr[1] + "\t" + arr[2]));
			}
		}
	}

Custom Partitioner class:

    /**
	 * Custom partitioner: sends records with the same station ID to the same reducer.
	 * Hadoop's default partitioner, HashPartitioner, uses the same formula, but on the hash of the whole key
	 */
	static class KeyPartitioner extends Partitioner<TextPair, Text> {
		@Override
		public int getPartition(TextPair key, Text value, int numPartitions) {
			// partition on the station ID only, not on the whole composite key
			return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
		}
	}
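
The partition formula can be tried out on its own. The sketch below is a plain-Java illustration under stated assumptions: `PartitionSketch` and its method names are hypothetical, and `String.hashCode()` is used as a stand-in for `Text.hashCode()`, which is a different hash function. The point is only that masking with `Integer.MAX_VALUE` keeps the partition number in range and that the tag plays no part in it.

```java
class PartitionSketch {

    // Same formula as getPartition above, applied to a raw hash value.
    static int partitionForHash(int hash, int numPartitions) {
        // Clearing the sign bit keeps the result non-negative even for negative hashes.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    // Stand-in for hashing key.getFirst(): only the station ID is hashed, never the tag.
    static int partitionFor(String stationId, int numPartitions) {
        return partitionForHash(stationId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        // A station row and a weather record with the same ID get the same partition,
        // because the tag is not an input to the calculation.
        System.out.println(partitionFor("S001", 2));

        // Math.abs would not be a safe substitute for the mask:
        // Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, i.e. negative.
        System.out.println(Math.abs(Integer.MIN_VALUE));
        System.out.println(partitionForHash(Integer.MIN_VALUE, 2)); // prints 0
    }
}
```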

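TextPair is the map output key, and its compareTo() compares both fields (first and second). When values are grouped for a reduce() call, though, a key should be identified by the station ID alone, so the grouping comparator compares only first.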

    /**
	 * Custom grouping comparator: two keys count as the same key when their station IDs are equal
	 * @author lenovo
	 *
	 */
	public static class GroupingComparator extends WritableComparator {
		public GroupingComparator() {
			super(TextPair.class, true);
		}
		
		@Override
		public int compare(WritableComparable wc1, WritableComparable wc2) {
			TextPair tp1 = (TextPair) wc1;
			TextPair tp2 = (TextPair) wc2;
			Text L = tp1.getFirst();
			Text R = tp2.getFirst(); 
			return L.compareTo(R);
		}
	}
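
What the grouping comparator achieves can be sketched in plain Java: walk the already sorted composite keys and start a new group whenever the first field changes, ignoring the tag. This is only an illustration; `GroupingSketch` and the sample data are hypothetical, and the loop imitates the effect of grouping rather than Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupingSketch {

    // Input: entries {stationId, tag, payload}, already sorted by stationId and then tag.
    // Output: one list of payloads per station ID, i.e. what one reduce() call would iterate over.
    static List<List<String>> groupByFirst(List<String[]> sorted) {
        List<List<String>> groups = new ArrayList<>();
        String currentId = null;
        for (String[] e : sorted) {
            // Only the station ID decides where a group ends; the tag is ignored here.
            if (!e[0].equals(currentId)) {
                groups.add(new ArrayList<>());
                currentId = e[0];
            }
            groups.get(groups.size() - 1).add(e[2]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> sorted = Arrays.asList(
            new String[] {"S001", "0", "Alpha"},
            new String[] {"S001", "1", "2020-01-01 -3"},
            new String[] {"S002", "0", "Beta"},
            new String[] {"S002", "1", "2020-01-02 5"});
        // Each group starts with the station name, followed by that station's records.
        System.out.println(groupByFirst(sorted));
    }
}
```

Without the grouping step, keys such as ("S001", "0") and ("S001", "1") would be different keys and would be handed to separate reduce() calls.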

Reduce phase:

    /**
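	 * With the grouping above, each reduce() call receives all values for one station ID: the station name first, then its weather records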
	 * @author lenovo
	 *
	 */
	public static class JoinReducer extends Reducer<TextPair, Text, Text, Text> {
		@Override
		protected void reduce(TextPair key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			Iterator<Text> iterator = values.iterator();
			Text stationName = new Text(iterator.next());  // station name; copied because Hadoop reuses the value object
			
			
			System.out.println("First element in the iterator (stationName) -- " + stationName.toString());
			
			System.out.println("---------------");
			while(iterator.hasNext()) {
				// each weather record
				Text record = iterator.next();
				System.out.println("Next element in the iterator (record) -- " + record.toString());
				// final output value
				Text outValue = new Text(stationName.toString() + "\t" + record.toString());
				context.write(key.getFirst(), outValue);  // key.getFirst() is the station ID
			}
			System.out.println("--------------");
		}
	}
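
The reducer above takes the first value of every group to be the station name. If a station ID appears in record.txt but not in station.txt, the first weather record would be mistaken for the name. The following plain-Java sketch shows one possible guard under stated assumptions: `GuardedJoinSketch` is hypothetical, the tag is passed alongside each value for simplicity, and groups without a station row are simply dropped (inner-join behaviour), which may or may not be what a given job wants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GuardedJoinSketch {

    // values: one group's entries as {tag, payload}, already sorted so that tag "0" comes first.
    static List<String> reduceGroup(String stationId, List<String[]> values) {
        List<String> out = new ArrayList<>();
        if (values.isEmpty() || !values.get(0)[0].equals("0")) {
            return out;  // no station row for this ID: emit nothing
        }
        String stationName = values.get(0)[1];
        for (String[] v : values.subList(1, values.size())) {
            out.add(stationId + "\t" + stationName + "\t" + v[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> complete = Arrays.asList(
            new String[] {"0", "Alpha"},
            new String[] {"1", "2020-01-01 -3"});
        List<String[]> orphan = Arrays.asList(
            new String[] {"1", "2020-01-02 5"},
            new String[] {"1", "2020-01-03 1"});
        System.out.println(reduceGroup("S001", complete)); // one joined row
        System.out.println(reduceGroup("S999", orphan));   // prints []
    }
}
```

In the real reducer the values do not carry the tag. One option, offered here as an assumption rather than something taken from the repository, would be to inspect `key.getSecond()` before trusting the first value, since the key object reflects the composite key of the value currently being read.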

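Launching: I chose to run on the local file system first, which makes debugging easier.
The main class can extend Configured and implement the Tool interface, so that the job can be started through ToolRunner.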

Implementing the run() method:

    @Override
	public int run(String[] args) throws Exception {
		Configuration conf = new Configuration();
		
		
		conf.set("mapreduce.framework.name", "local");
		
		conf.set("fs.defaultFS", "file:///");
		
		Job job = Job.getInstance(conf);
		
		job.setJar("f:/rjoin.jar");
		
		
		Path recordInputPath = new Path(args[0]);  // weather record input
		Path stationInputPath = new Path(args[1]); // station input
		
		Path outputPath = new Path(args[2]);   // output path
		
		// delete the output directory if it already exists
		FileSystem fs = outputPath.getFileSystem(conf);
		if(fs.isDirectory(outputPath)) {
			fs.delete(outputPath, true);
		}
		
		
		
		// Mapper that reads the weather records
		MultipleInputs.addInputPath(job, recordInputPath, TextInputFormat.class, JoinRecordMapper.class);
		// Mapper that reads the stations
		MultipleInputs.addInputPath(job, stationInputPath, TextInputFormat.class, JoinStationMapper.class);
		
		
		FileOutputFormat.setOutputPath(job, outputPath);
		
		job.setReducerClass(JoinReducer.class);  // Reducer
		
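		// custom partitioner
		job.setPartitionerClass(KeyPartitioner.class);
		job.setNumReduceTasks(2);  // number of reduce tasks; one output file is written per reduce task
		// custom grouping comparator (the sort order itself comes from TextPair.compareTo())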
		job.setGroupingComparatorClass(GroupingComparator.class);
		
		
		job.setMapOutputKeyClass(TextPair.class);
		job.setMapOutputValueClass(Text.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		
		
		return job.waitForCompletion(true) ? 0 : 1;
	}

The main method:

public static void main(String[] args) throws Exception {
		String[] args0 = {"f:/station/input/record.txt","f:/station/input/station.txt","f:/station/output"};
		// String[] args0 = {};
		int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args0);
		System.out.println("----------------------------");
		System.out.println(exitCode);
		System.out.println("----------------------------");
		System.exit(exitCode);
	}

References:
http://www.cnblogs.com/LiCheng-/p/7353825.html

https://blog.csdn.net/u010834071/article/details/51365642
