The second core component of Hadoop: the MapReduce framework (Part 2)

6. MapReduce workflow (simplified)

1. When the client submits an MR program, it first splits the input files (getSplits) according to the configured InputFormat implementation class. If no InputFormat is configured, the program uses the default implementation (TextInputFormat, a subclass of FileInputFormat) to plan the splits and generates a split-plan file.

2. Once the split-plan file has been generated, the client also packages the program's configuration items (the Configuration object) into a job.xml file, and then submits the program code together with job.xml and the split-plan file to the resource scheduler (YARN, or the local runner when running locally). The scheduler first allocates resources to start the MRAppMaster process.

3. MRAppMaster requests resources from the scheduler to start one MapTask per planned split; the MapTasks run the computation logic of the Mapper stage.

4. After a MapTask starts, it uses the createRecordReader method of the specified InputFormat implementation class to read key-value records from its assigned split and hands each record to the map method for processing.

5. The map method processes the key-value records from the split and writes its output into an in-memory buffer (100 MB by default). When the buffer exceeds 80% of its capacity, the data spills to disk; during each spill the map output is partitioned according to the specified Partitioner and sorted by key within each partition. A MapTask may produce several spill files. When the map phase finishes, each MapTask merges its spill files, together with any data still sitting in the buffer, into one final large file that is partitioned and sorted.

6. Next, MRAppMaster requests resources from the resource manager to start the ReduceTasks. Once a ReduceTask starts, it copies the data of its own partition from the merged output files of the different MapTasks and then merge-sorts all of the copied data again.

7. The ReduceTask groups the sorted data by key; for each group of records sharing the same key it calls the reduce method once. The results are written by the specified OutputFormat class (TextOutputFormat, a subclass of FileOutputFormat, if none is specified) as key-value pairs to the final result file part-r-xxxxx. The sketch after this list shows where each of these defaults is configured on a Job.
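
All of the defaults mentioned in the steps above can be set explicitly on the Job object. The following minimal driver skeleton is not from the original text: the paths /input and /output and the use of the identity Mapper/Reducer are placeholder assumptions, and it only serves to show where TextInputFormat, the Partitioner, the number of ReduceTasks, and TextOutputFormat are wired in.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class SkeletonDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(SkeletonDriver.class);

        // Step 1: the InputFormat plans the splits and reads the source as key-value pairs (TextInputFormat is the default).
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/input"));   // placeholder input path

        // Steps 3-5: Mapper logic plus the Partitioner applied while the map output spills (HashPartitioner is the default).
        job.setMapperClass(Mapper.class);                         // identity Mapper used as a placeholder
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setPartitionerClass(HashPartitioner.class);

        // Steps 6-7: number of ReduceTasks, Reducer logic, and the OutputFormat that writes part-r-xxxxx (TextOutputFormat is the default).
        job.setNumReduceTasks(1);
        job.setReducerClass(Reducer.class);                       // identity Reducer used as a placeholder
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/output")); // placeholder output path; must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}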

7. The serialization mechanism in MapReduce

Both the Map phase and the Reduce phase of an MR program require their input and output data to be key-value pairs, and both the key and the value must be of serializable types.

Serialization: converting a Java data type into binary data.

Deserialization: converting binary data back into a Java data type.

Why the MR program uses a serialization mechanism: an MR program requires its input and output to be key-value pairs because it is a distributed computation that runs on multiple nodes at the same time, and the results computed on one node may have to be transferred to other nodes across the network. Data that is transferred across nodes and the network must be binary. (In other words, while a MapReduce program runs, the input and output of the Mapper and Reducer stages are in key-value form, and the data handled by Mapper and Reducer tasks may travel across the network or across nodes, so all of the program's input and output data must be serializable.)

When Hadoop serializes keys and values it does not use Java's own serialization mechanisms (Serializable, Externalizable), because Java serialization is heavyweight. Instead, Hadoop provides a new, lightweight serialization mechanism on top of Java, designed specifically for MR programs.
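
To make "lightweight" concrete, here is a small illustrative sketch (not from the original text) that serializes the same long value once through Hadoop's Writable mechanism and once through Java's ObjectOutputStream and prints the byte counts. The Writable form is just the 8 raw bytes of the long, while the Java-serialized form also carries class descriptors and stream headers (the exact count depends on the JDK).

import org.apache.hadoop.io.LongWritable;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;

public class SerializationSizeDemo {
    public static void main(String[] args) throws Exception {
        // Hadoop Writable: write() emits only the raw field bytes (8 bytes for a long).
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        new LongWritable(2023L).write(new DataOutputStream(writableBytes));

        // Java serialization: writeObject() also emits class metadata and headers.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(javaBytes);
        oos.writeObject(Long.valueOf(2023L));
        oos.close();

        System.out.println("Writable size: " + writableBytes.size() + " bytes");        // 8
        System.out.println("Java serialization size: " + javaBytes.size() + " bytes");  // several dozen
    }
}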

Hadoop's serialization mechanism is exposed through two interfaces: Writable and WritableComparable.

Writable

  • Provides serialization and deserialization only. If we want to use a custom data type (a Java class) as a value in an MR program, that class must implement the Writable interface and override two methods: write (serialization) and readFields (deserialization). These two methods define exactly what is serialized and deserialized.

  • Writable is used in much the same way as Java's Externalizable serialization mechanism.

WritableComparable

  • In addition to serialization and deserialization, this interface also provides a method for comparing two objects (compareTo).

  • If you want to use a custom data type (a Java class) as a key in an MR program, it must implement this interface so that it can be serialized, deserialized, and compared; a minimal skeleton appears after this list.

  • If the custom data type is only used as a value in the MR program, implementing the Writable interface is enough; no comparison is needed.
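
As a compact illustration of the contract just described (the FlowBean and FlowBean02 classes in the next section are full, real examples), here is a minimal hypothetical key type, AmountKey, with a single long field:

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical key type: one long field, serializable and comparable, so it can be used as a map output key.
public class AmountKey implements WritableComparable<AmountKey> {
    private long amount;

    public AmountKey() { }                      // no-arg constructor: the framework builds instances via reflection
    public AmountKey(long amount) { this.amount = amount; }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeLong(amount);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization, same order as write()
        amount = in.readLong();
    }

    @Override
    public int compareTo(AmountKey other) {     // defines the sort order of keys during the shuffle
        return Long.compare(this.amount, other.amount);
    }

    @Override
    public String toString() {                  // what ends up in the result file
        return String.valueOf(amount);
    }
}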

Common serialization types in Hadoop (Hadoop already provides serialization types corresponding to the Java wrapper classes and String); they all implement the WritableComparable interface.

Java type    Hadoop Writable type
boolean      BooleanWritable
byte         ByteWritable
int          IntWritable
float        FloatWritable
long         LongWritable
double       DoubleWritable
String       Text
Map          MapWritable
array        ArrayWritable
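
These wrapper types are converted to and from the plain Java values with their constructors and set/get methods. A few illustrative lines follow (the class name WritableConversions is just a placeholder):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableConversions {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(5);   // wrap an int
        int javaInt = count.get();                // unwrap back to int

        LongWritable offset = new LongWritable();
        offset.set(1024L);                        // set after construction
        long javaLong = offset.get();

        Text word = new Text("hadoop");           // wrap a String (stored as UTF-8)
        String javaString = word.toString();      // unwrap back to String

        System.out.println(javaInt + " " + javaLong + " " + javaString);
    }
}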

[Note]
1. If an MR program runs without errors but the output directory contains no data, it is usually because the custom key-value types used for input and output do not implement the serialization methods correctly.
2. If a custom JavaBean is output as key or value by the Reducer stage, it is best to override its toString method; otherwise the Reducer's final output will be the JavaBean's address value (the default Object.toString).

8. Traffic statistics example (an application of the serialization mechanism)

import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

/**
 * JavaBean: a "clean" Java class that only contains private fields, constructors, getter/setter methods,
 * hashCode/equals, and toString.
 * Entity class: a special kind of JavaBean; when a JavaBean corresponds to a database table it is called an entity class.
 *
 * A JavaBean can be written by hand, or generated quickly from annotations with Lombok.
 *      Use Lombok with care: it is quite invasive to the code base.
 *
 * If a custom JavaBean is used as the input/output key-value of an MR program, it should have a no-arg constructor
 * (the framework builds instances of this class via reflection).
 * If a custom JavaBean is used as the Reducer-stage key or value, its data is written to the final result file,
 * and the format written is whatever its toString method produces.
 */
public class FlowBean implements Writable {

    private Long upFlow;   // upstream traffic
    private Long downFlow; // downstream traffic
    private Long sumFlow;  // total traffic

    public FlowBean() {
    }

    public FlowBean(Long upFlow, Long downFlow, Long sumFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public Long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(Long upFlow) {
        this.upFlow = upFlow;
    }

    public Long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(Long downFlow) {
        this.downFlow = downFlow;
    }

    public Long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(Long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        FlowBean flowBean = (FlowBean) o;
        return Objects.equals(upFlow, flowBean.upFlow) && Objects.equals(downFlow, flowBean.downFlow) && Objects.equals(sumFlow, flowBean.sumFlow);
    }

    @Override
    public int hashCode() {
        return Objects.hash(upFlow, downFlow, sumFlow);
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    /**
     * Serialization: write the fields out.
     * @param out <code>DataOutput</code> to serialize this object into.
     * @throws IOException
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    /**
     * Deserialization: read the fields back, in the same order they were written.
     * @param in <code>DataInput</code> to deserialize this object from.
     * @throws IOException
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }
}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * There is a file phone_data.txt that records the traffic consumed by phone numbers.
 * Each line of the file represents one phone's traffic record and consists of multiple
 * fields separated by the \t tab character.
 * Use an MR program to compute, for every phone number, the total upstream traffic,
 * the total downstream traffic, and the total traffic.
 */
public class FlowDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        Configuration configuration = new Configuration();
        configuration.set("fs.defaultFS", "hdfs://192.168.31.104:9000");

        Job job = Job.getInstance(configuration);

        // Set the InputFormat class used by the MR program: it is responsible for splitting
        // and for reading the source data as key-value pairs.
//        job.setInputFormatClass(FileInputFormat.class); // the default is indeed based on FileInputFormat,
//        but FileInputFormat is abstract; the MR program actually uses its subclass TextInputFormat by default
        FileInputFormat.setInputPaths(job, "/phone_data.txt");

        // Configure the Mapper stage
        job.setMapperClass(FlowMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // Configure the Reducer stage
        job.setReducerClass(FlowReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Configure the output path
//        job.setOutputFormatClass(FileOutputFormat.class);
        // The MR program requires that the output path does not already exist; if it does, the job fails.
        Path path = new Path("/output");
        // Delete the output directory in advance to avoid that error.
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        // Finally, submit the job and wait for completion
        boolean b = job.waitForCompletion(true);
        System.out.println(b ? 0 : 1);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Reads the split data one line at a time; the input key is the line offset (LongWritable)
 * and the input value is the line itself (Text).
 * The output key is the phone number (Text) and the output value is a FlowBean.
 */
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] array = line.split("\t");
        String phoneNumber = array[1];
        Long downFlow = Long.parseLong(array[array.length - 2]);
        Long upFlow = Long.parseLong(array[array.length - 3]);
        FlowBean flowBean = new FlowBean(upFlow, downFlow, upFlow + downFlow);
        // Emit this record with the phone number as the key and the FlowBean as the value for the reduce stage.
        context.write(new Text(phoneNumber), flowBean);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Sums the upstream, downstream, and total traffic of all records belonging to one phone number.
 */
public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Reducer<Text, FlowBean, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        Long upFlowSum = 0L;
        Long downFlowSum = 0L;
        Long sumFlowSum = 0L;
        for (FlowBean value : values) {
            upFlowSum += value.getUpFlow();
            downFlowSum += value.getDownFlow();
            sumFlowSum += value.getSumFlow();
        }
        // Output one record per phone number: the key is the phone number and the value is a FlowBean
        // that wraps the accumulated totals.
        FlowBean flowBean = new FlowBean(upFlowSum, downFlowSum, sumFlowSum);
        context.write(key, flowBean);
    }
}


package com.kang.flow02;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Based on the result file of the earlier phone-traffic statistics job, run a second analysis that:
 * 1. Partitions the phone numbers by prefix:
 *       numbers starting with 134 -> partition 0
 *       numbers starting with 135 -> partition 1
 *       numbers starting with 136 -> partition 2
 *       numbers starting with 137 -> partition 3
 *       all other numbers         -> partition 4
 * 2. Sorts each partition by total consumed traffic from high to low.
 */
public class FlowDriver02 {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        Configuration configuration = new Configuration();
        // Point the job at the same HDFS cluster as the first driver so that /output/part-r-00000 resolves on HDFS.
        configuration.set("fs.defaultFS", "hdfs://192.168.31.104:9000");

        Job job = Job.getInstance(configuration);

        job.setJarByClass(FlowDriver02.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/output/part-r-00000"));

        job.setMapperClass(FlowMapper02.class);
        job.setMapOutputKeyClass(FlowBean02.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setPartitionerClass(FlowPartitioner.class);

        job.setReducerClass(FlowReducer02.class);
        job.setOutputKeyClass(FlowBean02.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(5);

        Path path = new Path("/output1");
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.31.104:9000"), configuration, "root");
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}

class FlowMapper02 extends Mapper<LongWritable, Text, FlowBean02, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, FlowBean02, NullWritable>.Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] message = line.split("\t");
        String phoneNumber = message[0];
        Long upFlow = Long.parseLong(message[1]);
        Long downFlow = Long.parseLong(message[2]);
        Long sumFlow = Long.parseLong(message[3]);
        FlowBean02 flowBean02 = new FlowBean02(phoneNumber, upFlow, downFlow, sumFlow);
        context.write(flowBean02, NullWritable.get());
    }
}

class FlowPartitioner extends Partitioner<FlowBean02, NullWritable> {

    @Override
    public int getPartition(FlowBean02 flowBean02, NullWritable nullWritable, int numPartitions) {
        // Partition by the prefix of the phone number carried inside the key.
        String phoneNumber = flowBean02.getPhoneNumber();
        if (phoneNumber.startsWith("134")) {
            return 0;
        } else if (phoneNumber.startsWith("135")) {
            return 1;
        } else if (phoneNumber.startsWith("136")) {
            return 2;
        } else if (phoneNumber.startsWith("137")) {
            return 3;
        } else {
            return 4;
        }
        // An alternative approach (commented out): parse the prefix characters from flowBean02.toString().
//        String message = flowBean02.toString();
//        String[] array = message.split("\t");
//        String phoneNumber = array[0];
//        char w1 = phoneNumber.charAt(0);
//        char w2 = phoneNumber.charAt(1);
//        char w3 = phoneNumber.charAt(2);
//        if (w1 == '1' && w2 == '3') {
//            if (w3 == '4') return 0;
//            if (w3 == '5') return 1;
//            if (w3 == '6') return 2;
//            if (w3 == '7') return 3;
//        }
//        return 4;
    }
}

class FlowReducer02 extends Reducer<FlowBean02, NullWritable, FlowBean02, NullWritable> {

    @Override
    protected void reduce(FlowBean02 key, Iterable<NullWritable> values, Reducer<FlowBean02, NullWritable, FlowBean02, NullWritable>.Context context) throws IOException, InterruptedException {
        // Partitioning and sorting already happened during the shuffle, so simply write each key through.
        context.write(key, NullWritable.get());
    }
}

package com.kang.flow02;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

public class FlowBean02 implements WritableComparable<FlowBean02> {

    private String phoneNumber;
    private Long upFlow;
    private Long downFlow;
    private Long sumFlow;

    public FlowBean02() {
    }

    public FlowBean02(String phoneNumber, Long upFlow, Long downFlow, Long sumFlow) {
        this.phoneNumber = phoneNumber;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public String getPhoneNumber() {
        return phoneNumber;
    }

    public void setPhoneNumber(String phoneNumber) {
        this.phoneNumber = phoneNumber;
    }

    public Long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(Long upFlow) {
        this.upFlow = upFlow;
    }

    public Long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(Long downFlow) {
        this.downFlow = downFlow;
    }

    public Long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(Long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        FlowBean02 that = (FlowBean02) o;
        return Objects.equals(phoneNumber, that.phoneNumber) && Objects.equals(upFlow, that.upFlow) && Objects.equals(downFlow, that.downFlow) && Objects.equals(sumFlow, that.sumFlow);
    }

    @Override
    public int hashCode() {
        return Objects.hash(phoneNumber, upFlow, downFlow, sumFlow);
    }

    @Override
    public String toString() {
        return phoneNumber + "\t" + upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    /**
     * Sort keys by total traffic from high to low, as the requirement asks.
     * The phone number is used as a tie-breaker so that two different phone numbers with the
     * same total traffic are not treated as the same key during grouping.
     */
    @Override
    public int compareTo(FlowBean02 o) {
        if (this.sumFlow > o.sumFlow) {
            return -1;
        } else if (this.sumFlow < o.sumFlow) {
            return 1;
        } else {
            return this.phoneNumber.compareTo(o.phoneNumber);
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(phoneNumber);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        phoneNumber = in.readUTF();
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }
}


Origin blog.csdn.net/weixin_57367513/article/details/132718054