14. Join operation in MapReduce

      As we all know, the two most important phases in MapReduce are the Map phase and the Reduce phase. Accordingly, there are two kinds of Join operations: Map Join and Reduce Join. This article mainly introduces the Join operations in MapReduce; follow the column "Broken Cocoon into Butterfly - Hadoop" to see the related articles in this series~


Table of Contents

1. Reduce Join

1.1 How Reduce Join Works

1.2 Example of Reduce Join

1.2.1 Requirements and data

1.2.2 Define the Bean class

1.2.3 Write the Mapper class

1.2.4 Write the Comparator class

1.2.5 Write the Reducer class

1.2.6 Write the Driver class

1.2.7 Test

2. Map Join

2.1 Introduction to Map Join

2.2 Map Join example

2.2.1 Requirements and data

2.2.2 Write the Mapper class

2.2.3 Write the Driver class

2.2.4 Test


 

1. Reduce Join

1.1 How Reduce Join Works

       The Map side tags the k/v pairs coming from different tables or files so that records from different sources can be told apart, then uses the join field as the key and the rest of the record plus the newly added tag as the value, and emits them. On the Reduce side the grouping by the join field as the key has already been done, so we only need to separate the records coming from different files within each group and then merge them.
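       As a minimal sketch of this generic "tag the value" idea (not the article's actual code, which is shown in 1.2 below and instead carries everything in a custom key), the Mapper and Reducer might look roughly like this, assuming two tab-separated files named orders (id, gid, amount) and info (gid, name):

// Hypothetical sketch of a "tag the value" Reduce Join; assumes tab-separated files
// named "orders" (id, gid, amount) and "info" (gid, name).
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class TaggedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;

    @Override
    protected void setup(Context context) {
        // remember which file this split comes from, so that each record can be tagged
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fileName.startsWith("orders")) {
            // key = join field (gid), value = source tag + the rest of the record
            context.write(new Text(fields[1]), new Text("orders\t" + fields[0] + "\t" + fields[2]));
        } else {
            context.write(new Text(fields[0]), new Text("info\t" + fields[1]));
        }
    }
}

class TaggedJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = "";
        List<String[]> orders = new ArrayList<>();
        // separate the records of this gid group by their source tag
        for (Text v : values) {
            String[] fields = v.toString().split("\t");
            if ("info".equals(fields[0])) {
                name = fields[1];
            } else {
                orders.add(fields);
            }
        }
        // merge: emit one joined line (id, amount, name) per order record
        for (String[] o : orders) {
            context.write(new Text(o[1] + "\t" + o[2] + "\t" + name), NullWritable.get());
        }
    }
}

A driver would wire these up just like the one in 1.2.6, with Text set as the map output key and value classes.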

1.2 Example of Reduce Join

1.2.1 Requirements and data

       Let's look at the data first. There are two files: an order file named orders and a product information file named info. The orders file has three columns: order id (id), product id (gid), and quantity (amount). The info file has two columns: product id (gid) and product name (name).
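       For illustration only, a hypothetical sample with this layout (columns separated by tabs) could look like the following. orders:

1001	01	1
1002	02	2
1003	03	3
1004	01	4

info:

01	apple
02	banana
03	orange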

       The requirement is as follows: merge the data in the info file into the orders file according to the product id. If these two files were two database tables, the requirement could be expressed in SQL with the following statement:

select a.id, b.name, a.amount from orders a left join info b on a.gid = b.gid;

       How can we implement this in code? On the Map side we output the join field as the key, attach to each record the information about which file it came from, and send the records of the two tables that satisfy the join condition to the same Reduce Task, where the data is stitched together. Let's look at the concrete implementation code and process together~

1.2.2 Define the Bean class

package com.xzw.hadoop.mapreduce.join.reducejoin;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/16 11:24
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class OrderBean implements WritableComparable<OrderBean> {
    private String id;
    private String gid;
    private int amount;
    private String name;

    @Override
    public String toString() {
        return id + '\t' + amount + "\t" + name;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getGid() {
        return gid;
    }

    public void setGid(String gid) {
        this.gid = gid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @Override
    public int compareTo(OrderBean o) {
        // primary sort: product id, so records with the same gid are adjacent
        int compare = this.gid.compareTo(o.gid);

        if (compare == 0) {
            // secondary sort: name in descending order, so the info record (which carries the
            // product name) comes before the order records (whose name is empty)
            return o.name.compareTo(this.name);
        } else {
            return compare;
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(gid);
        out.writeInt(amount);
        out.writeUTF(name);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readUTF();
        this.gid = in.readUTF();
        this.amount = in.readInt();
        this.name = in.readUTF();
    }
}

1.2.3 Write the Mapper class

package com.xzw.hadoop.mapreduce.join.reducejoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/16 11:30
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class OrderMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
    private OrderBean orderBean = new OrderBean();
    private String fileName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // remember which file this split comes from, so map() can tell the two sources apart
        FileSplit fs = (FileSplit) context.getInputSplit();
        fileName = fs.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");

        if (fileName.equals("orders")) {
            // record from the orders file: id, gid, amount; the name is not known yet
            orderBean.setId(fields[0]);
            orderBean.setGid(fields[1]);
            orderBean.setAmount(Integer.parseInt(fields[2]));
            orderBean.setName("");
        } else {
            // record from the info file: gid, name; id and amount are placeholders
            orderBean.setId("");
            orderBean.setGid(fields[0]);
            orderBean.setAmount(0);
            orderBean.setName(fields[1]);
        }
        context.write(orderBean, NullWritable.get());
    }
}

1.2.4 Write the Comparator class

package com.xzw.hadoop.mapreduce.join.reducejoin;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * @author: xzw
 * @create_date: 2020/8/16 11:43
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class OrderComparator extends WritableComparator {

    public OrderComparator() {
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // group by gid only, so that the info record and all order records of the same
        // product go into a single reduce() call
        OrderBean oa = (OrderBean) a;
        OrderBean ob = (OrderBean) b;
        return oa.getGid().compareTo(ob.getGid());
    }
}

1.2.5 Write the Reducer class

package com.xzw.hadoop.mapreduce.join.reducejoin;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.Iterator;

/**
 * @author: xzw
 * @create_date: 2020/8/16 11:43
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class OrderReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {

    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // get the iterator over this gid group
        Iterator<NullWritable> iterator = values.iterator();

        // move to the first record of the group; thanks to the secondary sort this is the info
        // record, so the (reused) key now carries the product name
        iterator.next();
        String name = key.getName();

        while (iterator.hasNext()) {
            // move on to the next order record of the group
            iterator.next();
            // fill in the product name obtained above
            key.setName(name);

            context.write(key, NullWritable.get());
        }
    }
}

1.2.6 Write the Driver class

package com.xzw.hadoop.mapreduce.join.reducejoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/16 12:19
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class OrderDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        args = new String[]{"e:/input", "e:/output"};  // local test paths: input directory and output directory

        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(OrderDriver.class);

        job.setMapperClass(OrderMapper.class);
        job.setReducerClass(OrderReducer.class);

        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        job.setGroupingComparatorClass(OrderComparator.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

1.2.7 Test
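       Run OrderDriver (after pointing the hard-coded e:/input and e:/output paths at your own directories). The output directory then contains a part-r-00000 file in which, per OrderBean.toString(), every line has the form id, amount, name separated by tabs. With the hypothetical sample data from 1.2.1 it would contain the following (the relative order of lines sharing the same gid may vary):

1001	1	apple
1004	4	apple
1002	2	banana
1003	3	orange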

2. Map Join

2.1 Introduction to Map Join

       Map Join is suitable for joining a small table with a large table. Map Join loads the entire small table into memory on each Map Task and performs the join on the Map side, so no shuffle to the Reduce side is needed for the join and the data-skew problem that a Reduce Join can suffer from is avoided. Let's walk through a concrete example of this process.

2.2 Map Join example

2.2.1 Requirements and data

       We use the data from 1.2.1 and the same requirement as before; the only difference is that info is treated as the small table (this is only for testing, so in a real scenario you should decide which table is the small one according to the actual business).

2.2.2 Write the Mapper class

package com.xzw.hadoop.mapreduce.join.mapjoin;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.*;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * @author: xzw
 * @create_date: 2020/8/16 13:40
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class OrderMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Map<String, String> map = new HashMap<>();
    private Text text = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // read the cached small table (info) completely into memory as gid -> name
        URI[] cacheFiles = context.getCacheFiles();
        String path = cacheFiles[0].getPath().toString();

        // local file stream
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(path)));
        // HDFS stream (use this instead when the cache file is on HDFS)
//        FileSystem fileSystem = FileSystem.get(context.getConfiguration());
//        FSDataInputStream bufferedReader = fileSystem.open(new Path(path));

        String line;
        while (StringUtils.isNotEmpty(line = bufferedReader.readLine())) {
            String[] fields = line.split("\t");
            map.put(fields[0], fields[1]);
        }
        IOUtils.closeStream(bufferedReader);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // join on the Map side: look up the product name for this order's gid in the in-memory map
        String[] fields = value.toString().split("\t");
        String name = map.get(fields[1]);

        text.set(fields[0] + "\t" + name + "\t" + fields[2]);
        context.write(text, NullWritable.get());
    }
}

2.2.3 Write the Driver class

package com.xzw.hadoop.mapreduce.join.mapjoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;

/**
 * @author: xzw
 * @create_date: 2020/8/16 12:19
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class OrderDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // local test paths: cache file (small table), input directory (big table), output directory
        args = new String[]{"file:///e://input//info", "e:/input/orders", "e:/output"};

        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(OrderDriver.class);

        job.setMapperClass(OrderMapper.class);
        // the join is done entirely on the Map side, so no Reduce Tasks are needed
        job.setNumReduceTasks(0);

        // cache the small table so that it can be read completely into memory in setup()
        job.addCacheFile(URI.create(args[0]));

        FileInputFormat.setInputPaths(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

2.2.4 Test
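       Because the number of Reduce Tasks is 0, the output directory contains map output files (part-m-00000, ...) and, following the Mapper above, each line has the form id, name, amount separated by tabs (note that the product name sits in the middle column here, unlike the Reduce Join output). With the hypothetical sample data from 1.2.1 the output would be:

1001	apple	1
1002	banana	2
1003	orange	3
1004	apple	4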

 

       That's it for this article. If you ran into any problems along the way, feel free to leave a message and let me know what they were~

Origin: blog.csdn.net/gdkyxy2013/article/details/108033429