Implementing the TopN effect with MapReduce

1. Background

Recently I have been learning Hadoop's MapReduce. This post records how to achieve the TopN effect and how to implement custom grouping in MapReduce.

2. Demand

We have a dataset with the following 3 fields: order ID (订单编号), order item (订单项), and order item price (订单项价格). The output must satisfy the following requirements:

  1. Orders must be output in ascending order of order ID.
  2. For each order, output the 2 order items with the highest prices.

3. Analysis

  1. Since the output must be in ascending order of order ID, the order ID must be part of the Key, because only Keys are sorted in MapReduce.
  2. Outputting the 2 highest-priced order items of each order happens in the reduce stage and is per order, so records must be grouped by order ID (adjacent keys are compared; if they are equal, they belong to one group). Grouping is only available on the Key, so we need a JavaBean (order ID, order item, order item price) as a composite Key.
  3. Combining both requirements, the sort rule inside the Key is: ascending by order ID, then descending by order item price, with grouping by order ID (see the walk-through after this list).
  4. By default, MapReduce partitions records by the hashCode of the key; a partition contains multiple groups, and the reduce method is called once per group. Our plan above is to group by order ID. When the Key is a composite JavaBean, will beans with the same order ID land in the same partition? Not necessarily, because their hashCodes may differ, so in general a custom partitioner (extending the Partitioner class) is needed. Here we set job.setNumReduceTasks to 1, so this partitioning problem does not arise.
  5. A partition contains multiple groups, and the reduce method is called once per group.
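
To make the sorting and grouping concrete, here is how the 9 sample rows from section 4 look after being sorted by the composite key (ascending order ID, then descending price), together with the groups produced by comparing order IDs:

20230713000010  item-102    30    group 1 -> written (top 2)
20230713000010  item-103    20    group 1 -> written (top 2)
20230713000010  item-101    10    group 1 -> skipped
20230713000012  item-121    50    group 2 -> written (top 2)
20230713000012  item-123    30    group 2 -> written (top 2)
20230713000012  item-122    10    group 2 -> skipped
20230713000015  item-153    30    group 3 -> written (top 2)
20230713000015  item-152    20    group 3 -> written (top 2)
20230713000015  item-151    10    group 3 -> skipped

The reduce method runs once per group and writes only the first 2 keys it sees, which yields exactly the expected output in section 4.3.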

4. Prepare data

4.1 Input data

20230713000010  item-101    10
20230713000010  item-102    30
20230713000015  item-151    10
20230713000015  item-152    20
20230713000010  item-103    20
20230713000015  item-153    30
20230713000012  item-121    50
20230713000012  item-122    10
20230713000012  item-123    30

4.2 Data format of each row

Order ID        Order item  Order item price
20230713000012  item-123    30

Fields in each row are separated by spaces.

4.3 Expected output

20230713000010  item-102    30
20230713000010  item-103    20
20230713000012  item-121    50
20230713000012  item-123    30
20230713000015  item-153    30
20230713000015  item-152    20

5. Coding implementation

5.1 Maven dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.4</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.22</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>3.2.2</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                        <mainClass>com.huan.hadoop.mr.TopNDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>

5.2 Writing the entity class

package com.huan.hadoop.mr;

import lombok.Getter;
import lombok.Setter;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * Order data
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:20
 */
@Getter
@Setter
public class OrderVo implements WritableComparable<OrderVo> {

    /**
     * order ID
     */
    private long orderId;
    /**
     * order item
     */
    private String itemId;
    /**
     * order item price
     */
    private long price;

    @Override
    public int compareTo(OrderVo o) {
        // sort: ascending by order ID; if the order IDs are equal, descending by order item price
        int result = Long.compare(this.orderId, o.orderId);
        if (result == 0) {
            // 0 means the order IDs are equal, so fall back to order item price in descending order
            result = -Long.compare(this.price, o.price);
        }
        return result;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // serialization
        out.writeLong(orderId);
        out.writeUTF(itemId);
        out.writeLong(price);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialization: fields must be read in the same order they were written
        this.orderId = in.readLong();
        this.itemId = in.readUTF();
        this.price = in.readLong();
    }

    @Override
    public String toString() {
        return this.getOrderId() + "\t" + this.getItemId() + "\t" + this.getPrice();
    }
}

  1. The WritableComparable interface must be implemented here.
  2. The sorting (compareTo) and serialization (write / readFields) methods must be written, as sketched below.
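
A minimal sketch of how two keys compare under this rule (the instances and values are made up for illustration and rely on the Lombok setters above):

// same order ID: the higher price sorts first (descending by price)
OrderVo a = new OrderVo();
a.setOrderId(20230713000010L);
a.setPrice(10L);

OrderVo b = new OrderVo();
b.setOrderId(20230713000010L);
b.setPrice(30L);
// a.compareTo(b) > 0 : a (price 10) sorts after b (price 30)

// different order IDs: ascending order ID decides, price is ignored
OrderVo c = new OrderVo();
c.setOrderId(20230713000012L);
c.setPrice(50L);
// b.compareTo(c) < 0 : order 20230713000010 sorts before 20230713000012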

5.3 Writing the grouping comparator

package com.huan.hadoop.mr;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Grouping: the same order ID means the same group; otherwise the keys belong to different groups
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:30
 */
public class TopNGroupingComparator extends WritableComparator {

    public TopNGroupingComparator() {
        // the second argument is true: instances of the key may be created via reflection
        super(OrderVo.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // equal order IDs mean the same group; only the zero / non-zero result matters for grouping
        return Long.compare(((OrderVo) a).getOrderId(), ((OrderVo) b).getOrderId());
    }
}

  1. Extend the WritableComparator class and customize the grouping rule.
  2. Grouping happens in the reduce stage: adjacent keys are compared, keys that compare as equal form one group, and the reduce method is called once per group (see the sketch below).
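
A small sketch of the grouping rule (the instances are made up; in a real job the framework calls the comparator on adjacent keys of the sorted key stream):

TopNGroupingComparator comparator = new TopNGroupingComparator();

OrderVo x = new OrderVo();
x.setOrderId(20230713000010L);
x.setItemId("item-101");

OrderVo y = new OrderVo();
y.setOrderId(20230713000010L);
y.setItemId("item-102");
// same order ID -> comparator.compare(x, y) == 0 -> same group, although the item IDs differ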

5.4 Writing the map method

package com.huan.hadoop.mr;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * map operation: the output key is an OrderVo, the output value is the price
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:28
 */
public class TopNMapper extends Mapper<LongWritable, Text, OrderVo, LongWritable> {

    private final OrderVo outKey = new OrderVo();
    private final LongWritable outValue = new LongWritable();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, OrderVo, LongWritable>.Context context) throws IOException, InterruptedException {
        // read one line of data, e.g. 20230713000010  item-101    10
        String row = value.toString();
        // split on whitespace
        String[] cells = row.split("\\s+");
        // order ID
        long orderId = Long.parseLong(cells[0]);
        // order item
        String itemId = cells[1];
        // order item price
        long price = Long.parseLong(cells[2]);

        // populate the output key and value
        outKey.setOrderId(orderId);
        outKey.setItemId(itemId);
        outKey.setPrice(price);
        outValue.set(price);

        // write out
        context.write(outKey, outValue);
    }
}

  1. Map operation: the output key is an OrderVo and the output value is the price (see the sample trace below).
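
For example, the first sample row would produce the following (illustrative trace):

input line   : 20230713000010  item-101    10
output key   : OrderVo(orderId=20230713000010, itemId=item-101, price=10)
output value : 10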

5.5 Writing the reduce method

package com.huan.hadoop.mr;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * reduce operation: keys (OrderVo) that compare as equal form one group; OrderVo is the key here,
 * and grouping is implemented by TopNGroupingComparator, i.e. records with the same order ID form one group
 *
 * @author huan.fu
 * @date 2023/7/13 - 14:29
 */
public class TopNReducer extends Reducer<OrderVo, LongWritable, OrderVo, NullWritable> {

    @Override
    protected void reduce(OrderVo key, Iterable<LongWritable> values, Reducer<OrderVo, LongWritable, OrderVo, NullWritable>.Context context) throws IOException, InterruptedException {
        int topN = 0;
        // across the iterations the key's orderId stays the same (it is what we group by),
        // but its itemId and price change with every value
        for (LongWritable price : values) {
            topN++;
            if (topN > 2) {
                break;
            }
            // note: the key holds different field values on each write
            context.write(key, NullWritable.get());
        }
    }
}

  1. Reduce operation: keys (OrderVo) that compare as equal are placed in one group; grouping is implemented by TopNGroupingComparator, so records with the same order ID are treated as one group.
  2. With each iteration the key's orderId stays the same (because that is what we group by), but its itemId and price change: Hadoop reuses the key object and refills its fields as the values iterator advances (see the trace below).
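
An illustrative trace of one reduce call, for the group with order ID 20230713000010 (after sorting, the prices arrive in descending order):

iteration 1: key = 20230713000010  item-102  30  -> topN = 1, written
iteration 2: key = 20230713000010  item-103  20  -> topN = 2, written
iteration 3: topN = 3 > 2 -> break (item-101 with price 10 is skipped)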

5.6 Writing the driver class

package com.huan.hadoop.mr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author huan.fu
 * @date 2023/7/13 - 14:29
 */
public class TopNDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // build the configuration object
        Configuration configuration = new Configuration();
        // submit the program via ToolRunner
        int status = ToolRunner.run(configuration, new TopNDriver(), args);
        // exit
        System.exit(status);
    }

    @Override
    public int run(String[] args) throws Exception {
        // build the Job instance: (configuration object, job name)
        Job job = Job.getInstance(getConf(), "topN");
        // set the main class of the MR program
        job.setJarByClass(TopNDriver.class);
        // set the mapper and reducer classes
        job.setMapperClass(TopNMapper.class);
        job.setReducerClass(TopNReducer.class);
        // specify the kv types output by the map stage
        job.setMapOutputKeyClass(OrderVo.class);
        job.setMapOutputValueClass(LongWritable.class);
        // specify the kv types output by the reduce stage, i.e. the final output of the MR program
        job.setOutputKeyClass(OrderVo.class);
        job.setOutputValueClass(NullWritable.class);
        // configure input and output paths; the defaults are TextInputFormat and TextOutputFormat
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // delete the output directory first (convenient for local testing)
        FileSystem.get(this.getConf()).delete(new Path(args[1]), true);

        // set the grouping comparator
        job.setGroupingComparatorClass(TopNGroupingComparator.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}

  1. The grouping comparator must be registered via job.setGroupingComparatorClass(TopNGroupingComparator.class). A sample submission command follows.
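
For reference, a submission might look like the following (the jar name and HDFS paths are illustrative assumptions, not from the original post; the main class is taken from the jar manifest configured in section 5.1):

hadoop jar mr-topn-group.jar /input/topn /output/topn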

5.7 Running Results

(Screenshot of the run result omitted; the job's output corresponds to the expected output in section 4.3.)

6. Complete code

https://gitee.com/huan1993/spring-cloud-parent/tree/master/hadoop/mr-topn-group

Origin: blog.csdn.net/fu_huo_1993/article/details/131701393