Hadoop Serialization

1. Serialization Overview

1.1 What is serialization

  1. Serialization: the process of converting in-memory objects into a byte sequence (or another data transfer format) so that the data can be persisted to disk or transmitted over the network.
  2. Deserialization: the process of converting a received byte sequence (or other data transfer format), or data persisted on disk, back into in-memory objects.
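To make both directions concrete, below is a minimal round-trip sketch (not part of the original example; the class name and the choice of LongWritable are purely illustrative). It serializes a Hadoop LongWritable into a byte array and then reads it back:

import org.apache.hadoop.io.LongWritable;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        // Serialization: in-memory object -> byte sequence
        LongWritable original = new LongWritable(42L);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialization: byte sequence -> in-memory object
        LongWritable restored = new LongWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(restored.get()); // prints 42
    }
}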

1.2 Why serialize

In general, a "live" object exists only in memory and disappears when the power goes off. Moreover, a "live" object can only be used by the local process and cannot be sent to another computer over the network. Serialization, however, makes it possible to store a "live" object and send it to a remote machine.

1.3 Why not use Java's serialization (Serializable)

Java's serialization (Serializable) is a heavyweight framework: after an object is serialized, it carries a lot of extra information (various checks, headers, the inheritance hierarchy, etc.), which makes efficient transmission over the network difficult. Hadoop therefore developed its own serialization mechanism: Writable.
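One way to see this overhead for yourself is the small comparison sketch below (illustrative only; the exact byte counts depend on the JDK and Hadoop versions). It serializes the same long value once with Java's ObjectOutputStream and once with Hadoop's LongWritable, then prints the sizes:

import org.apache.hadoop.io.LongWritable;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class OverheadDemo {
    public static void main(String[] args) throws IOException {
        // Java serialization: class descriptor + stream header + value
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Long.valueOf(42L));
        }

        // Hadoop Writable: just the raw 8-byte long
        ByteArrayOutputStream hadoopBytes = new ByteArrayOutputStream();
        new LongWritable(42L).write(new DataOutputStream(hadoopBytes));

        System.out.println("Java Serializable: " + javaBytes.size() + " bytes");
        System.out.println("Hadoop Writable:   " + hadoopBytes.size() + " bytes");
    }
}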

1.4 Hadoop serialization characteristics

  1. Compact: uses storage space efficiently
  2. Fast: low overhead when reading and writing data
  3. Extensible: can evolve along with upgrades to the communication protocol
  4. Interoperable: supports interaction across multiple languages

2. Custom Bean objects: implementing the serialization interface (Writable)

In enterprise development you often need to use custom Bean objects. If a Bean object is passed around within the Hadoop framework, it must implement the serialization interface.

  1. The class must implement the Writable interface.
  2. It must have a no-argument constructor; deserialization requires one, because the framework instantiates the Bean by reflection before calling readFields.
  3. Override the serialization method: write.
  4. Override the deserialization method: readFields.
  5. Note that the order in which fields are read during deserialization must match the order in which they were written during serialization.
  6. If you want the results displayed in a file, override the toString() method; separating fields with "\t" makes later processing easier.
  7. If the custom Bean is to be used as a key in MapReduce, it must also implement the Comparable interface, because the Shuffle phase of the MapReduce framework requires keys to be sorted (see the sketch after this list).
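Below is a minimal sketch of point 7 (the class and field names are assumed for illustration; the FlowBean in the sample later does not need this because it is only used as a value). In practice the key class implements WritableComparable, which combines Writable and Comparable so that the Shuffle phase can sort by key:

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class SortableBean implements WritableComparable<SortableBean> {
    private long sumFlow;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sumFlow = in.readLong();
    }

    @Override
    public int compareTo(SortableBean other) {
        // Sort keys in descending order of total traffic, for example
        return Long.compare(other.sumFlow, this.sumFlow);
    }
}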

3. Serialization Example

  1. Requirement
    For each phone number, compute its upstream traffic, downstream traffic, and total traffic.
  2. Input data format (tab-separated):
    id    phone number    ip    upstream traffic    downstream traffic    network status code
    1    13700009999    8.8.8.8    1000    3500    200
  3. Desired output format:
    phone number    upstream traffic    downstream traffic    total traffic
    13700009999    1000    3500    4500
  4. Sample code

Custom Bean

package cstmbean;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {

    // Upstream traffic
    private Long upFlow;
    
    // Downstream traffic
    private Long downFlow;
    
    // Total traffic
    private Long sumFlow;

    public FlowBean() {
        // No-argument constructor, required for deserialization; it could be omitted if no explicit parameterized constructor were declared.
    }

    public void set(Long upFlow, Long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    @Override
    public String toString() {
        return this.upFlow + "\t" + this.downFlow + "\t" + this.sumFlow;
    }

    public Long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(Long upFlow) {
        this.upFlow = upFlow;
    }

    public Long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(Long downFlow) {
        this.downFlow = downFlow;
    }

    public Long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(Long sumFlow) {
        this.sumFlow = sumFlow;
    }

    // Serialization method: write the fields in a fixed order
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(this.upFlow);
        dataOutput.writeLong(this.downFlow);
        dataOutput.writeLong(this.sumFlow);
    }

    // Deserialization method: read the fields in the same order they were written
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }
}

Mapper

package cstmbean;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    private Text phone = new Text();
    private FlowBean bean = new FlowBean();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] words = value.toString().split("\t");
        // Phone number (second column)
        phone.set(words[1]);
        // Upstream traffic: third column from the end
        Long upFlow = Long.parseLong(words[words.length - 3]);
        // Downstream traffic: second column from the end
        Long downFlow = Long.parseLong(words[words.length - 2]);
        // Compute the total traffic from the upstream and downstream values
        bean.set(upFlow, downFlow);
        context.write(phone, bean);
    }
}

Reducer

package cstmbean;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    private Text phone = new Text();
    private FlowBean bean = new FlowBean();
    private Long upFlow;
    private Long downFlow;

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        upFlow = 0L;
        downFlow = 0L;
        phone.set(key);
        // Sum the upstream and downstream traffic of all records for this phone number
        for (FlowBean value : values) {
            upFlow += value.getUpFlow();
            downFlow += value.getDownFlow();
        }
        // set() also computes the total traffic
        bean.set(upFlow, downFlow);
        context.write(phone, bean);
    }
}

Driver

package cstmbean;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(FlowDriver.class);
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        job.setMapOutputKeyClass(Text.class);
        // The Mapper's output value type is FlowBean
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        // The Reducer's output value type is FlowBean
        job.setOutputValueClass(FlowBean.class);

        FileInputFormat.addInputPath(job, new Path("i:\\bean_input"));
        FileOutputFormat.setOutputPath(job, new Path("i:\\bean_output"));

        boolean rtn = job.waitForCompletion(true);
        System.exit(rtn ? 0 : 1);
    }
}
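The sample hard-codes local Windows paths, which is convenient for a quick local test. A common variant (an assumption here, not shown in the original article) is to take the input and output paths from the command-line arguments, so the same driver can be submitted to a cluster:

        // Illustrative variant: read the input/output paths from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));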