10. Serialization of Hadoop

       The previous blog introduced MapReduce, the core component of Hadoop. This article focuses on Hadoop's serialization mechanism. Follow the column "Broken Cocoon and Become a Butterfly - Hadoop" to view the related articles in this series~


Table of Contents

One, Serialization Overview

Two, Serialization Example

2.1 Data and Requirements

2.2 Write the Bean Class

2.3 Write the Mapper Class

2.4 Write the Reducer Class

2.5 Write the Driver Class

2.6 Testing


 

One, Serialization Overview

       Serialization converts in-memory objects into a byte sequence (or another data transfer format) so that they can be persisted to disk or transmitted over the network. Deserialization converts a received byte sequence (or other data transfer format), or data persisted on disk, back into in-memory objects.

       Java's built-in serialization (Serializable) is a heavyweight framework: a serialized object carries a lot of extra information, such as various check data and headers, which makes it inefficient to transmit over the network. Hadoop therefore provides its own serialization mechanism (Writable). Its features are: (1) compact, making efficient use of storage space; (2) fast, with little overhead for reading and writing data; (3) extensible, so it can evolve as communication protocols are upgraded; (4) interoperable, supporting interaction across multiple languages.
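
       To get a feel for the difference in overhead, here is a minimal sketch (not from the original article) that serializes a single long value both ways and compares the resulting byte counts. The class name SerializationSizeDemo is made up for illustration; only LongWritable and the standard Java streams are real APIs.

import org.apache.hadoop.io.LongWritable;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SerializationSizeDemo {
    public static void main(String[] args) throws IOException {
        long size = 1024L;

        // Hadoop Writable: writes only the 8 bytes of the long itself
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        new LongWritable(size).write(new DataOutputStream(writableBytes));

        // Java Serializable: adds class metadata, headers, etc.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Long.valueOf(size));
        }

        System.out.println("Writable:     " + writableBytes.size() + " bytes"); // 8 bytes
        System.out.println("Serializable: " + javaBytes.size() + " bytes");     // considerably larger
    }
}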

Two, Serialization Example

2.1 Data and Requirements

       Let's first take a look at the data and analyze it:

       This is a piece of preprocessed Nginx access log data stored in a file named nginx_log. The fields are: time, version, client IP, access path, status, domain name, server IP, size, and response time. The requirement is to count the total size for each client IP.
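
       The original post shows the data as a screenshot. As a rough, made-up illustration (the values below are not from the real data set), each line consists of the nine fields above separated by tabs, which is what the Mapper's split("\t") expects, for example:

2020-07-28 10:00:00	HTTP/1.1	192.168.10.101	/index.html	200	www.example.com	192.168.10.1	2481	0.005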

2.2 Write the Bean Class

package com.xzw.hadoop.mapreduce.nginx;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/7/28 10:01
 * @desc:
 * @modifier:
 * @modified_date:
 */
public class NginxBean implements Writable { // implement the Writable interface
    private long size; // size

    // Deserialization calls the no-arg constructor via reflection, so the class must have one
    public NginxBean() {
    }

    public void set(long size) {
        this.size = size;
    }

    public long getSize() {
        return size;
    }

    public void setSize(long size) {
        this.size = size;
    }

    /**
     * Serialization method
     * @param dataOutput
     * @throws IOException
     */
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(size);
    }

    /**
     * Deserialization method: fields must be read in exactly the same order they were written
     * @param dataInput
     * @throws IOException
     */
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.size = dataInput.readLong();
    }

    /**
     * Override toString so the result can conveniently be written to the output file
     * @return
     */
     */
    @Override
    public String toString() {
        return size + "\t";
    }
}
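
       As a quick sanity check (not part of the original post), the bean can be round-tripped through an in-memory byte stream to verify that write and readFields are symmetric. The test class name below is made up; only NginxBean and the standard Java stream classes are used.

package com.xzw.hadoop.mapreduce.nginx;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NginxBeanRoundTripTest {
    public static void main(String[] args) throws IOException {
        NginxBean original = new NginxBean();
        original.set(2481L);

        // Serialize: write the bean into an in-memory byte stream
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        original.write(new DataOutputStream(out));

        // Deserialize: read the bytes back into a fresh bean
        NginxBean copy = new NginxBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(out.toByteArray())));

        System.out.println(copy.getSize()); // prints 2481
    }
}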

2.3 Write the Mapper Class

package com.xzw.hadoop.mapreduce.nginx;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/7/28 10:30
 * @desc:
 * @modifier:
 * @modified_date:
 */
public class NginxMapper extends Mapper<LongWritable, Text, Text, NginxBean> {
    NginxBean nginxBean = new NginxBean(); // reused across map() calls to avoid creating a new object per record
    Text text = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line of input
        String line = value.toString();

        // 2. Split the line into fields
        String[] fields = line.split("\t");

        // 3. Populate the output key and value
        String clientIP = fields[2]; // the client IP
        long size = Long.parseLong(fields[fields.length - 2]); // the size field

        text.set(clientIP);
        nginxBean.set(size);

        // 4. Write out the key-value pair
        context.write(text, nginxBean);
    }
}

2.4 Write the Reducer Class

package com.xzw.hadoop.mapreduce.nginx;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/7/28 10:39
 * @desc:
 * @modifier:
 * @modified_date:
 */
public class NginxReducer extends Reducer<Text, NginxBean, Text, NginxBean> {
    private NginxBean nginxBean = new NginxBean();

    @Override
    protected void reduce(Text key, Iterable<NginxBean> values, Context context) throws IOException, InterruptedException {
        long sum_size = 0;

        // 1. Iterate over all values for this key and accumulate the sizes
        for (NginxBean bean : values) {
            sum_size += bean.getSize();
        }

        // 2. Populate the output bean
        nginxBean.set(sum_size);

        // 3. Write out the result
        context.write(key, nginxBean);
    }
}

2.5 Write the Driver Class

package com.xzw.hadoop.mapreduce.nginx;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/7/28 10:48
 * @desc:
 * @modifier:
 * @modified_date:
 */
public class NginxDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // input and output paths (hard-coded here for a local test run)
        args = new String[]{"e:/input/nginx_log", "e:/output"};

        // 1. Get the configuration information and the job instance
        Job job = Job.getInstance(new Configuration());

        // 2. Set the class path (the jar containing this driver)
        job.setJarByClass(NginxDriver.class);

        // 3. Set the Mapper and the Reducer
        job.setMapperClass(NginxMapper.class);
        job.setReducerClass(NginxReducer.class);

        // 4. Set the output types of the Mapper and the Reducer
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NginxBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NginxBean.class);

        // 5. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 6. Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

2.6 Testing

       Run the Driver class to get the test result:
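
       The original post shows the result as a screenshot. Based on the toString method of NginxBean, each line of the output file (part-r-00000) is a client IP followed by its total size, separated by a tab; the values below are purely illustrative, not actual results:

192.168.10.101	24862
192.168.10.102	13570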

       Of course, the job can also be submitted to the cluster. The specific packaging steps will not be repeated here; you can refer to the packaging and running instructions in the previous blog, "Nine, MapReduce of Hadoop Core Components".
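
       For reference, a cluster submission generally looks like the command below, assuming the job has been packaged into a jar and the input file uploaded to HDFS. The jar name and HDFS paths are placeholders, and the hard-coded args line in NginxDriver would need to be removed so that the command-line arguments take effect:

hadoop jar nginx-serialization.jar com.xzw.hadoop.mapreduce.nginx.NginxDriver /input/nginx_log /output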

 

       That's the end of this article. If you ran into any problems along the way, feel free to leave a message and let me know what you encountered~
