Data cleaning is an indispensable part of almost every business. Before the core business MapReduce jobs run, the data is usually cleaned first, and the cleaning process often needs only a Mapper, not a Reducer. This article introduces a simple data-cleaning application. Follow the column "Broken Cocoon and Become a Butterfly - Hadoop" to view the related series of articles.
1. Introduction
Since this is a simple application, this article introduces ETL cleaning through a single example. It is worth mentioning that enterprises almost always have dedicated data-cleaning tools, so the approach described here may see limited use in practice. Without further ado, let's look at a simple example.
2. Requirements and Data
We again use the Nginx log data from earlier articles. The data format is as follows:
The fields are: time, version, client IP, access path, status, domain name, server IP, size, and response time, nine tab-separated fields in total. We need to clean this data: filter out failed requests (the code below treats status codes of 400 and above as invalid) and drop any row that does not contain all nine fields, since such rows are useless to us.
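As a quick illustration of the record layout (the log line below is made up; real values will differ), the nine tab-separated fields map to indices like this:

```java
public class NginxFieldSketch {
    // Index of each field in a tab-separated record, per the format above
    static final String[] FIELD_NAMES = {
            "time", "version", "clientIP", "url", "status",
            "domainName", "serverIP", "size", "responseTime"
    };

    public static void main(String[] args) {
        // A made-up record in the nine-field format described above
        String line = "2020-08-18 13:38:00\tHTTP/1.1\t192.168.0.10\t/index.html\t200"
                + "\twww.example.com\t192.168.0.1\t512\t0.05";
        String[] fields = line.split("\t");
        for (int i = 0; i < fields.length; i++) {
            System.out.println(FIELD_NAMES[i] + " = " + fields[i]);
        }
    }
}
```

The status field at index 4 and the overall field count are what the cleaning logic later inspects.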
3. Define the Bean Class
package com.xzw.hadoop.mapreduce.etl;

/**
 * @author: xzw
 * @create_date: 2020/8/18 13:38
 * @desc: Bean representing one Nginx log record, plus a validity flag
 */
public class ETLLogBean {
    private String date;
    private String version;
    private String clientIP;
    private String url;
    private String status;
    private String domainName;
    private String serverIP;
    private String size;
    private String responseDate;
    private boolean valid = true; // whether this record is valid

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append(this.valid);
        sb.append("\001").append(this.date);
        sb.append("\001").append(this.version);
        sb.append("\001").append(this.clientIP);
        sb.append("\001").append(this.url);
        sb.append("\001").append(this.status);
        sb.append("\001").append(this.domainName);
        sb.append("\001").append(this.serverIP);
        sb.append("\001").append(this.size);
        sb.append("\001").append(this.responseDate);
        return sb.toString();
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getVersion() {
        return version;
    }

    public void setVersion(String version) {
        this.version = version;
    }

    public String getClientIP() {
        return clientIP;
    }

    public void setClientIP(String clientIP) {
        this.clientIP = clientIP;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getStatus() {
        return status;
    }

    public void setStatus(String status) {
        this.status = status;
    }

    public String getDomainName() {
        return domainName;
    }

    public void setDomainName(String domainName) {
        this.domainName = domainName;
    }

    public String getServerIP() {
        return serverIP;
    }

    public void setServerIP(String serverIP) {
        this.serverIP = serverIP;
    }

    public String getSize() {
        return size;
    }

    public void setSize(String size) {
        this.size = size;
    }

    public String getResponseDate() {
        return responseDate;
    }

    public void setResponseDate(String responseDate) {
        this.responseDate = responseDate;
    }

    public boolean isValid() {
        return valid;
    }

    public void setValid(boolean valid) {
        this.valid = valid;
    }
}
A note on the separator used in the toString method above:
sb.append("\001").append(this.date);
The \001 separator is Hive's default field delimiter. Hive will be covered in a later article; for now, just knowing this is enough.
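To see what \001 actually is: in Java source it is the octal escape for the character U+0001 (ASCII SOH), the same character as "\u0001". A quick sketch showing that fields joined with it can be split back apart:

```java
public class Hive001Sketch {
    public static void main(String[] args) {
        // '\001' is the octal escape for U+0001, i.e. the same character as '\u0001'
        System.out.println((int) '\001');                    // 1
        String row = String.join("\u0001", "true", "2020-08-18", "HTTP/1.1");
        // Splitting on the same character recovers the original fields
        String[] fields = row.split("\u0001");
        System.out.println(fields.length);                   // 3
    }
}
```

Because U+0001 virtually never appears in real log text, it is a safer delimiter than a comma or tab.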
4. Write the Mapper Class
package com.xzw.hadoop.mapreduce.etl;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/18 13:44
 * @desc: Map-only ETL: emits valid records and drops the rest
 */
public class ETLLogMapper02 extends Mapper<LongWritable, Text, Text, NullWritable> {
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Parse the log line and check whether it is valid
        ETLLogBean bean = parseLog(line);
        if (!bean.isValid())
            return;
        k.set(bean.toString());
        // Emit the cleaned record
        context.write(k, NullWritable.get());
    }

    /**
     * Parse one log line into a bean, marking it invalid if it is incomplete
     * or represents a failed request.
     *
     * @param line one tab-separated log record
     * @return the populated bean
     */
    private ETLLogBean parseLog(String line) {
        ETLLogBean logBean = new ETLLogBean();
        String[] fields = line.split("\t");
        if (fields.length >= 9) { // all nine fields must be present
            logBean.setDate(fields[0]);
            logBean.setVersion(fields[1]);
            logBean.setClientIP(fields[2]);
            logBean.setUrl(fields[3]);
            logBean.setStatus(fields[4]);
            logBean.setDomainName(fields[5]);
            logBean.setServerIP(fields[6]);
            logBean.setSize(fields[7]);
            logBean.setResponseDate(fields[8]);
            if (Integer.parseInt(logBean.getStatus()) >= 400) {
                logBean.setValid(false);
            }
        } else {
            logBean.setValid(false);
        }
        return logBean;
    }
}
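One caveat: parseLog calls Integer.parseInt on the status field, so a malformed status value (not uncommon in raw logs, e.g. "-") would throw a NumberFormatException and fail the task. A defensive variant, just a sketch and not part of the original code, could treat non-numeric statuses as invalid:

```java
public class StatusParseSketch {
    // Returns the parsed status code, or -1 if the field is not a valid integer,
    // so the caller can simply mark the record invalid instead of crashing
    static int parseStatus(String raw) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseStatus("200"));   // 200
        System.out.println(parseStatus(" 404 ")); // 404
        System.out.println(parseStatus("-"));     // -1
    }
}
```

In a long-running job, one bad record should lower a data-quality counter, not kill the task.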
5. Write the Driver Class
package com.xzw.hadoop.mapreduce.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author: xzw
 * @create_date: 2020/8/17 9:31
 * @desc: Driver for the map-only ETL job
 */
public class ETLLogDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Hard-coded local paths for testing
        args = new String[]{"e:/input/nginx_log", "e:/output"};

        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(ETLLogDriver.class);
        job.setMapperClass(ETLLogMapper02.class);

        // Map-only job: with zero reducers, the mapper output is written directly
        job.setNumReduceTasks(0);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
6. Test
Run the job and look at the result:
The output is our cleaned data. One more thing: the job's console log contains a block of counter output.
These are Hadoop's built-in counters, and they describe several useful metrics, for example, the amount of input and output data processed and the number of records read and written.
This article is relatively simple. If you ran into any problems while following along, feel free to leave a comment and let me know.