15. Simple application of data cleaning (ETL) in Hadoop

       Data cleaning is an indispensable part of every business. Before the core business MapReduce program runs, the data usually needs to be cleaned first. The cleaning process often only requires a Mapper program and no Reducer. This article introduces a simple application of data cleaning. Follow the column "Broken Cocoon and Become a Butterfly - Hadoop" to view the related articles in this series~


Table of Contents

1. The beginning

2. Requirements and data

3. Define the Bean class

4. Write the Mapper class

5. Write the Driver class

6. Test


 

1. The beginning

       Because this is a simple application, this article mainly uses one example to introduce ETL cleaning of data. It is worth mentioning that enterprises almost always have specialized data cleaning tools, so the approach described in this article may not see much use in practice. Without further ado, let's look at a simple example.

2. Requirements and data

       We will still use the Nginx log data from the previous articles. The data format is as follows:

       The fields are: time, version, client IP, access path, status, domain name, server IP, size, and response time. Now we need to clean the data: filter out access requests whose status is not 200, and filter out any log line with fewer than 5 fields, since we consider such lines useless data.
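
       For illustration only (the actual sample was shown in the original post), a line following this field order, with the fields separated by tab characters, might look like this made-up example:

2020-08-18T13:38:00    HTTP/1.1    192.168.21.1    /index.html    200    www.example.com    192.168.21.10    612    0.003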

3. Define the Bean class

package com.xzw.hadoop.mapreduce.etl;

/**
 * @author: xzw
 * @create_date: 2020/8/18 13:38
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class ETLLogBean {
    private String date;
    private String version;
    private String clientIP;
    private String url;
    private String status;
    private String domainName;
    private String serverIP;
    private String size;
    private String responseDate;

    private boolean valid = true; // whether the record is valid

    @Override
    public String toString() {

        StringBuilder sb = new StringBuilder();

        sb.append(this.valid);
        sb.append("\001").append(this.date);
        sb.append("\001").append(this.version);
        sb.append("\001").append(this.clientIP);
        sb.append("\001").append(this.url);
        sb.append("\001").append(this.status);
        sb.append("\001").append(this.domainName);
        sb.append("\001").append(this.serverIP);
        sb.append("\001").append(this.size);
        sb.append("\001").append(this.responseDate);

        return sb.toString();
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getVersion() {
        return version;
    }

    public void setVersion(String version) {
        this.version = version;
    }

    public String getClientIP() {
        return clientIP;
    }

    public void setClientIP(String clientIP) {
        this.clientIP = clientIP;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getDomainName() {
        return domainName;
    }

    public void setDomainName(String domainName) {
        this.domainName = domainName;
    }

    public String getServerIP() {
        return serverIP;
    }

    public void setServerIP(String serverIP) {
        this.serverIP = serverIP;
    }

    public String getSize() {
        return size;
    }

    public void setSize(String size) {
        this.size = size;
    }

    public String getResponseDate() {
        return responseDate;
    }

    public void setResponseDate(String responseDate) {
        this.responseDate = responseDate;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }
}

       A note about the code above: the following separator is used in the toString method:

sb.append("\001").append(this.date);

       The \001 character is the default field separator in Hive. Hive will be explained in a later article; for now, just know that this is why it is used here.
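
       As a purely illustrative example (the values are made up, and ^A stands for the unprintable \001 character), one cleaned output record produced by this toString method would look roughly like this:

true^A2020-08-18T13:38:00^AHTTP/1.1^A192.168.21.1^A/index.html^A200^Awww.example.com^A192.168.21.10^A612^A0.003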

4. Write the Mapper class

package com.xzw.hadoop.mapreduce.etl;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/18 13:44
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class ETLLogMapper02 extends Mapper<LongWritable, Text, Text, NullWritable> {
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();

        // parse the line and check whether it is valid
        ETLLogBean bean = parseLog(line);

        if (!bean.isValid())
            return;

        k.set(bean.toString());

        // write the output
        context.write(k, NullWritable.get());
    }

    /**
     * Parse a log line into an ETLLogBean and mark whether it is valid
     * @param line a raw log line
     * @return the parsed bean
     */
    private ETLLogBean parseLog(String line) {
        ETLLogBean logBean = new ETLLogBean();

        String[] fields = line.split("\t");

        if (fields.length > 5) {
            logBean.setDate(fields[0]);
            logBean.setVersion(fields[1]);
            logBean.setClientIP(fields[2]);
            logBean.setUrl(fields[3]);
            logBean.setStatus(fields[4]);
            logBean.setDomainName(fields[5]);
            logBean.setServerIP(fields[6]);
            logBean.setSize(fields[7]);
            logBean.setResponseDate(fields[8]);

            if (Integer.parseInt(logBean.getStatus()) >= 400) {
                logBean.setValid(false);
            }
        } else {
            logBean.setValid(false);
        }
        return logBean;
    }
}
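
       One optional hardening, not part of the original code: Integer.parseInt will throw a NumberFormatException if the status field is not numeric (for example, on a malformed line), which would fail the map task. A minimal sketch of a defensive variant of that check:

        // Sketch only: treat a non-numeric status field as invalid data
        // instead of letting the exception kill the task.
        try {
            if (Integer.parseInt(logBean.getStatus()) >= 400) {
                logBean.setValid(false);
            }
        } catch (NumberFormatException e) {
            logBean.setValid(false);
        }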

5. Write the Driver class

package com.xzw.hadoop.mapreduce.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/17 9:31
 * @desc:
 * @modifier:
 * @modified_date:
 * @desc:
 */
public class ETLLogDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        args = new String[]{"e:/input/nginx_log", "e:/output"};

        Job job = Job.getInstance(new Configuration());

        job.setJarByClass(ETLLogDriver.class);

        job.setMapperClass(ETLLogMapper02.class);

        job.setNumReduceTasks(0);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean b = job.waitForCompletion(true);

        System.exit(b ? 0 : 1);
    }
}
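
       Note that the input and output paths are hard-coded (e:/input/nginx_log and e:/output) for local testing. To run the job on a cluster you would typically remove that line, package the project into a jar, and submit it with the standard hadoop jar command; the jar name and HDFS paths below are placeholders:

hadoop jar etl.jar com.xzw.hadoop.mapreduce.etl.ETLLogDriver /input/nginx_log /output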

6. Test

       Run the program and take a look at the result:

       This is the cleaned data. One more point: in the console log printed while the job runs, there is a section of counter output.

       These are Hadoop's built-in counters, and we can use them to track a number of metrics. For example, we can monitor the amount of input data and output data that has been processed.
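
       If you want counters of your own in addition to the built-in ones, the MapReduce API lets the Mapper increment custom counters through the context. A minimal sketch, not part of the original code (the group and counter names "ETL", "valid" and "invalid" are made up for illustration), inside ETLLogMapper02.map could look like this:

        ETLLogBean bean = parseLog(line);

        if (!bean.isValid()) {
            // count the lines that were filtered out
            context.getCounter("ETL", "invalid").increment(1);
            return;
        }
        // count the lines that survived cleaning
        context.getCounter("ETL", "valid").increment(1);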

 

       This article is relatively simple. If you ran into any problems while following along, feel free to leave a comment and let me know what you encountered~

Origin blog.csdn.net/gdkyxy2013/article/details/108163604