1. Background
I have recently been learning Hadoop MapReduce. This post records how to achieve a TopN effect and how to implement custom grouping in MapReduce.
2. Requirements
We have a dataset with the following 3 fields: order ID, order item, and order item price. The output must satisfy two requirements:
- The rows are sorted by order ID in ascending order.
- For each order, the 2 highest-priced order items are output.
3. Analysis
- To output rows in ascending order of order ID, the order ID must be part of the key, because only keys are sorted.
- "Output the 2 highest-priced order items of each order": this output happens in the reduce stage, per order, so records must be grouped by order ID (adjacent keys are compared; if they are equal, they belong to the same group). Grouping also operates only on the key, so we need a composite key: a JavaBean holding (order ID, order item, order item price).
- Combining both requirements, the sort rule inside the key is: ascending by order ID, then descending by order item price; grouping is by order ID.
- By default, MapReduce partitions records by the hashCode of the key; a partition contains multiple groups, and the reduce method is called once per group. Our plan is to group by order ID, but when the key is a composite JavaBean, will beans with the same order ID land in the same partition? Not necessarily, because their hashCodes may differ, so a custom partitioner (extending the Partitioner class) would normally be needed. Here we set job.setNumReduceTasks to 1, so there is only one partition and this issue does not arise.
- There are multiple groups under a partition, and the reduce method is called once for each group.
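The shuffle-and-group behavior described above can be simulated in plain Java, outside of Hadoop: sort by (order ID ascending, price descending), then walk the sorted list and take at most 2 rows per run of identical order IDs. This is only a sketch of the idea; the class `TopNSketch` and its names are illustrative, not part of the actual job.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopNSketch {
    // One order line: (order ID, order item, order item price).
    record Order(long orderId, String itemId, long price) {}

    // Plain-Java simulation of what the MapReduce shuffle + grouped reduce do:
    // sort by orderId ascending then price descending, then take the first 2
    // rows of each consecutive run of identical orderIds.
    static List<Order> top2PerOrder(List<Order> rows) {
        List<Order> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong(Order::orderId)
                .thenComparing(Comparator.comparingLong(Order::price).reversed()));
        List<Order> out = new ArrayList<>();
        long prevOrderId = Long.MIN_VALUE;
        int taken = 0;
        for (Order o : sorted) {
            if (o.orderId() != prevOrderId) { // a new group starts here
                prevOrderId = o.orderId();
                taken = 0;
            }
            if (taken < 2) {                  // emit at most 2 rows per group
                out.add(o);
                taken++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Order> rows = List.of(
                new Order(20230713000010L, "item-101", 10),
                new Order(20230713000010L, "item-102", 30),
                new Order(20230713000010L, "item-103", 20),
                new Order(20230713000012L, "item-121", 50));
        top2PerOrder(rows).forEach(o ->
                System.out.println(o.orderId() + "\t" + o.itemId() + "\t" + o.price()));
    }
}
```

In the real job the sort comes from `OrderVo.compareTo`, and the "new group starts" check is what the grouping comparator decides.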
4. Preparing the data
4.1 Sample data
20230713000010 item-101 10
20230713000010 item-102 30
20230713000015 item-151 10
20230713000015 item-152 20
20230713000010 item-103 20
20230713000015 item-153 30
20230713000012 item-121 50
20230713000012 item-122 10
20230713000012 item-123 30
4.2 Data format of each row
order ID    order item    order item price
20230713000012 item-123 30
The fields in each row are separated by spaces.
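Parsing one such row is a simple whitespace split, the same logic the mapper below uses; this small sketch (variable names are illustrative) shows it in isolation:

```java
public class RowParseDemo {
    public static void main(String[] args) {
        String row = "20230713000012 item-123 30";
        // Split on one or more whitespace characters.
        String[] cells = row.split("\\s+");
        long orderId = Long.parseLong(cells[0]);   // order ID
        String itemId = cells[1];                  // order item
        long price = Long.parseLong(cells[2]);     // order item price
        System.out.println(orderId + " / " + itemId + " / " + price);
    }
}
```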
4.3 Expected output
20230713000010 item-102 30
20230713000010 item-103 20
20230713000012 item-121 50
20230713000012 item-123 30
20230713000015 item-153 30
20230713000015 item-152 20
5. Coding implementation
5.1 Maven dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.4</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.22</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>3.2.2</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                        <mainClass>com.huan.hadoop.mr.TopNDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>
5.2 Writing the entity class
package com.huan.hadoop.mr;

import lombok.Getter;
import lombok.Setter;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Order data
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:20
 */
@Getter
@Setter
public class OrderVo implements WritableComparable<OrderVo> {
    /**
     * order ID
     */
    private long orderId;
    /**
     * order item
     */
    private String itemId;
    /**
     * order item price
     */
    private long price;

    @Override
    public int compareTo(OrderVo o) {
        // Sort: ascending by order ID; if the order IDs are equal, descending by price
        int result = Long.compare(this.orderId, o.orderId);
        if (result == 0) {
            // Equal order IDs: fall back to descending order item price
            result = -Long.compare(this.price, o.price);
        }
        return result;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialization
        out.writeLong(orderId);
        out.writeUTF(itemId);
        out.writeLong(price);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialization: fields must be read in the same order they were written
        this.orderId = in.readLong();
        this.itemId = in.readUTF();
        this.price = in.readLong();
    }

    @Override
    public String toString() {
        return this.getOrderId() + "\t" + this.getItemId() + "\t" + this.getPrice();
    }
}
- The class must implement the WritableComparable interface.
- It must provide both the sorting logic (compareTo) and the serialization logic (write / readFields).
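The write/readFields contract can be exercised without a Hadoop cluster, because it is built on plain `java.io` DataOutput/DataInput streams. This sketch (the class `SerDemo` is illustrative) round-trips the same three fields OrderVo carries:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SerDemo {
    // Serialize the three OrderVo fields using the same DataOutput
    // pattern that OrderVo.write() relies on.
    static byte[] write(long orderId, String itemId, long price) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeLong(orderId);  // same field order as OrderVo.write()
        out.writeUTF(itemId);
        out.writeLong(price);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = write(20230713000010L, "item-101", 10);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        // readFields() must read the fields back in exactly the same order.
        System.out.println(in.readLong() + " " + in.readUTF() + " " + in.readLong());
    }
}
```

If the read order ever diverges from the write order, deserialization silently produces garbage values, which is why the two methods must mirror each other.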
5.3 Writing the grouping comparator
package com.huan.hadoop.mr;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Grouping: keys with the same order ID belong to the same group
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:30
 */
public class TopNGroupingComparator extends WritableComparator {
    public TopNGroupingComparator() {
        // The second argument is true: instances may be created via reflection
        super(OrderVo.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Keys with the same order ID belong to the same group;
        // only the 0 / non-0 distinction matters for grouping
        return Long.compare(((OrderVo) a).getOrderId(), ((OrderVo) b).getOrderId());
    }
}
- Extend the WritableComparator class and customize the grouping rule.
- Grouping happens in the reduce stage: each key is compared with the previous one, equal keys form one group, and the reduce method is called once per group.
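The "compare adjacent keys" rule can be illustrated without Hadoop: given keys already sorted by the full composite order, a comparator that only inspects the order ID decides where one group ends and the next begins. The class `GroupingDemo` and its names are illustrative:

```java
import java.util.Comparator;
import java.util.List;

public class GroupingDemo {
    record Key(long orderId, String itemId) {}

    // Count how many groups (i.e. reduce() invocations) a sorted key stream
    // produces under a grouping comparator that only inspects the order ID.
    static int countGroups(List<Key> sortedKeys, Comparator<Key> grouping) {
        int groups = 0;
        Key prev = null;
        for (Key k : sortedKeys) {
            // A new group starts whenever the grouping comparator says the
            // current key differs from the previous one.
            if (prev == null || grouping.compare(prev, k) != 0) {
                groups++;
            }
            prev = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        Comparator<Key> byOrderIdOnly = Comparator.comparingLong(Key::orderId);
        List<Key> keys = List.of(
                new Key(10, "item-102"), new Key(10, "item-103"), new Key(10, "item-101"),
                new Key(12, "item-121"), new Key(12, "item-123"));
        // 5 keys, but only 2 distinct order IDs -> reduce() runs twice
        System.out.println(countGroups(keys, byOrderIdOnly)); // prints 2
    }
}
```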
5.4 Writing the map method
package com.huan.hadoop.mr;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * map stage: the output key is an OrderVo, the output value is the price
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:28
 */
public class TopNMapper extends Mapper<LongWritable, Text, OrderVo, LongWritable> {
    private final OrderVo outKey = new OrderVo();
    private final LongWritable outValue = new LongWritable();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, OrderVo, LongWritable>.Context context) throws IOException, InterruptedException {
        // Read one line, e.g. "20230713000010 item-101 10"
        String row = value.toString();
        // Split on whitespace
        String[] cells = row.split("\\s+");
        // Order ID
        long orderId = Long.parseLong(cells[0]);
        // Order item
        String itemId = cells[1];
        // Order item price
        long price = Long.parseLong(cells[2]);
        // Fill in the output key and value
        outKey.setOrderId(orderId);
        outKey.setItemId(itemId);
        outKey.setPrice(price);
        outValue.set(price);
        // Emit
        context.write(outKey, outValue);
    }
}
- Map stage: the output key is the OrderVo composite key, and the output value is the price.
5.5 Writing the reduce method
package com.huan.hadoop.mr;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * reduce stage: keys (OrderVo) that compare equal are handled as one group.
 * Grouping is decided by TopNGroupingComparator, i.e. keys with the same
 * order ID form one group.
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:29
 */
public class TopNReducer extends Reducer<OrderVo, LongWritable, OrderVo, NullWritable> {
    @Override
    protected void reduce(OrderVo key, Iterable<LongWritable> values, Reducer<OrderVo, LongWritable, OrderVo, NullWritable>.Context context) throws IOException, InterruptedException {
        int topN = 0;
        // While iterating, the key's orderId stays the same (that is what we
        // group by), but its itemId and price change with each value
        for (LongWritable price : values) {
            topN++;
            if (topN > 2) {
                break;
            }
            // Note: the key holds different field values on each write
            context.write(key, NullWritable.get());
        }
    }
}
- Reduce stage: keys (OrderVo) that compare equal are handled as one group. Here OrderVo is the key, and grouping is decided by TopNGroupingComparator, i.e. keys with the same order ID form one group.
- With each step of the iteration, the key's orderId stays the same (that is what we group by), but Hadoop reuses the key object and refills its fields as the values iterator advances, so its itemId and price change on each write.
5.6 Writing the driver class
package com.huan.hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author huan.fu
 * @date 2023/7/13 - 14:29
 */
public class TopNDriver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        // Build the configuration object
        Configuration configuration = new Configuration();
        // Submit the program via ToolRunner
        int status = ToolRunner.run(configuration, new TopNDriver(), args);
        // Exit
        System.exit(status);
    }

    @Override
    public int run(String[] args) throws Exception {
        // Build the Job instance; arguments: (configuration, job name)
        Job job = Job.getInstance(getConf(), "topN");
        // Set the main class of the MR program
        job.setJarByClass(TopNDriver.class);
        // Set the mapper and reducer classes
        job.setMapperClass(TopNMapper.class);
        job.setReducerClass(TopNReducer.class);
        // Specify the key/value types of the map output
        job.setMapOutputKeyClass(OrderVo.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Specify the key/value types of the reduce output, i.e. the job's final output
        job.setOutputKeyClass(OrderVo.class);
        job.setOutputValueClass(NullWritable.class);
        // Configure the input and output paths; the default IO components are TextInputFormat and TextOutputFormat
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Delete the output directory first (convenient for local testing)
        FileSystem.get(this.getConf()).delete(new Path(args[1]), true);
        // Register the grouping comparator
        job.setGroupingComparatorClass(TopNGroupingComparator.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
- The grouping comparator must be registered on the job: job.setGroupingComparatorClass(TopNGroupingComparator.class);
5.7 Running Results
6. Complete code
https://gitee.com/huan1993/spring-cloud-parent/tree/master/hadoop/mr-topn-group