InputFormat/OutputFormat


InputFormat&OutputFormat

InputFormat
- FileInputFormat
  - TextInputFormat
```
key/Value  : the key is the byte offset of the line; the value is one line of text
Split rule : per file, cut by SplitSize
```
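  - NLineInputFormat
```
key/Value  : the key is the byte offset of the line; the value is one line of text
Split rule : per file, every n lines form one split
```
  mapreduce.input.lineinputformat.linespermap=1000 controls how many lines go into one split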
  - KeyValueTextInputFormat
```
key/Value  : each line is divided into key and value at `\t` (the default separator)
Split rule : per file, cut by SplitSize
```
  mapreduce.input.keyvaluelinerecordreader.key.value.separator=\t
  - SequenceFileInputFormat
  - CombineTextInputFormat
```
key/Value  : the key is the byte offset of the line; the value is one line of text
Split rule : cut by SplitSize (a split may span several files)
```
  Handles jobs over many small files and so optimizes the MR job. All the small files must share the same format.
- MultipleInputs
Handles input data in different formats; every Mapper must emit the same key/value types.
- DBInputFormat
- TableInputFormat (third-party)
OutputFormat
- FileOutputFormat
  - TextOutputFormat
  - SequenceFileOutputFormat
  - LazyOutputFormat
- MultipleOutputs
- DBOutputFormat
- TableOutputFormat (third-party)
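
The two split rules listed above ("per file, cut by SplitSize" and "per file, every n lines") can be sketched in plain Java. This is a simplified illustration, not Hadoop's actual code: the class and method names are hypothetical, no HDFS is involved, and the 1.1 slop factor and the `max(minSize, min(maxSize, blockSize))` formula are assumed to match what FileInputFormat does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that only models split arithmetic.
class SplitSketch {
    // Assumed tolerance: a tail of up to 10% of splitSize is merged into the last split.
    static final double SPLIT_SLOP = 1.1;

    // SplitSize is the block size clamped between the configured min and max.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Lengths of the splits produced for ONE file (splits never cross files here).
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining > 0) {
            lengths.add(remaining); // the tail, possibly slightly larger than splitSize
        }
        return lengths;
    }

    // NLineInputFormat-style rule: every n lines of a file form one split.
    static long nLineSplitCount(long totalLines, long linesPerSplit) {
        return (totalLines + linesPerSplit - 1) / linesPerSplit; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(splitLengths(300 * mb, splitSize).size()); // 3 splits: 128 MB, 128 MB, 44 MB
        System.out.println(splitLengths(130 * mb, splitSize).size()); // 1 split: the 2 MB tail is absorbed
        System.out.println(nLineSplitCount(2500, 1000));              // 3 splits
    }
}
```

The 130 MB case shows why the number of map tasks is not always `ceil(fileLength / splitSize)`.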

MapReduce Shuffle

Understand the source-level flow of MR job submission
Resolve jar dependencies during MR job execution

JobSubmitter(DFS|YARNRunner)
    submitJobInternal(Job.this, cluster);
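        checkSpecs(job); // validate the output spec (the output directory must not already exist)
        JobID jobId = submitClient.getNewJobID(); // obtain the job id
        // build the staging (resource) directory for this job
        Path submitJobDir = new Path(jobStagingArea, jobId.toString());
        // upload the job jar, third-party resource jars and configuration files
        copyAndConfigureFiles(job, submitJobDir);
        int maps = writeSplits(job, submitJobDir); // compute the input splits
        writeConf(conf, submitJobFile); // upload the job context (configuration) as job.xml
        status = submitClient.submitJob( // submit the job to the ResourceManager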
          jobId, submitJobDir.toString(), job.getCredentials());

Resolving jar dependency problems while the program runs

Submit-time dependencies: configure HADOOP_CLASSPATH

HADOOP_CLASSPATH=/root/mysql-connector-java-xxxx.jar
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
export HADOOP_CLASSPATH

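This is typically needed early in job submission, when the client has to connect to an external database in order to compute the input splits.

Dependencies needed while tasks run

They can be set in the program:

conf.set("tmpjars", "file://<jar path>,...");

If the job is launched with hadoop jar (-libjars is a generic option, so the main class has to parse generic options, for example by running through ToolRunner):

hadoop jar xxx.jar <main class> -libjars <jar path>,...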

Understand Map Shuffle and Reduce Shuffle
Mapper only, no Reducer (data cleansing | noise reduction)

Extracting substrings with a regular expression

112.116.25.18 - [11/Aug/2017:09:57:24 +0800] "POST /json/startheartbeatactivity.action HTTP/1.1" 200 234 "http://wiki.wang-inc.com/pages/resumedraft.action?draftId=12058756&draftShareId=014b0741-df00-4fdc-90ca-4bf934f914d1" Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 - 0.023 0.023 12.129.120.121:8090 200

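// Each IP octet is 1 to 3 digits; \\d{3} would reject addresses such as 10.0.0.1
String regex = "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s-\\s\\[(.*)\\]\\s\".*\"\\s(\\d+)\\s(\\d+)\\s.*";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input); // input: one raw log line
if (matcher.matches()) {
    String value = matcher.group(1); // content of the first capture group (the client IP)
}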
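
For a Mapper-only cleansing job, the extraction above is usually wrapped in a helper that returns the wanted fields or signals that the line is dirty. Below is a minimal, self-contained sketch in plain Java; the class name `AccessLogParser` and the choice to return `null` for unparsable lines are assumptions made for illustration, and the sample line in `main` is a shortened version of the log record shown earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the body of a cleansing Mapper's map() would call parse()
// and simply skip the record when it returns null.
class AccessLogParser {
    // Groups: 1 = client IP, 2 = timestamp, 3 = HTTP status, 4 = response bytes.
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s-\\s\\[(.*)\\]\\s\".*\"\\s(\\d+)\\s(\\d+)\\s.*");

    // Returns {ip, time, status, bytes}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // dirty record: drop it
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String line = "112.116.25.18 - [11/Aug/2017:09:57:24 +0800] "
            + "\"POST /json/startheartbeatactivity.action HTTP/1.1\" 200 234 \"-\" "
            + "Mozilla/5.0 - 0.023 0.023 12.129.120.121:8090 200";
        String[] fields = parse(line);
        System.out.println(String.join("\t", fields));
        System.out.println(parse("not an access log line") == null);
    }
}
```

Compiling the pattern once as a static field avoids recompiling it for every input record.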

Custom WritableComparable
- The Map output key type must implement WritableComparable (keys take part in sorting)
- The Map output value type only needs to implement Writable
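
As a rough illustration of what a custom key involves, the sketch below defines a hypothetical `IpStatusKey` using only `java.io`, so it compiles without Hadoop on the classpath. In a real job the class would be declared as implementing `org.apache.hadoop.io.WritableComparable<IpStatusKey>`; the `write`, `readFields` and `compareTo` methods here are written with the signatures that interface expects. The field choice and the `roundTrip` helper are assumptions for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical composite key: client IP plus HTTP status.
class IpStatusKey implements Comparable<IpStatusKey> {
    private String ip = "";
    private int status;

    IpStatusKey() { }  // keys are instantiated reflectively, so keep a no-arg constructor

    IpStatusKey(String ip, int status) {
        this.ip = ip;
        this.status = status;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(ip);
        out.writeInt(status);
    }

    // Deserialize in exactly the same order as write().
    public void readFields(DataInput in) throws IOException {
        ip = in.readUTF();
        status = in.readInt();
    }

    // Sort order used during the shuffle: by IP, then by status.
    @Override
    public int compareTo(IpStatusKey other) {
        int c = ip.compareTo(other.ip);
        return c != 0 ? c : Integer.compare(status, other.status);
    }

    // hashCode feeds hash-based partitioning, so it must agree with equals.
    @Override
    public boolean equals(Object o) {
        return o instanceof IpStatusKey
            && ip.equals(((IpStatusKey) o).ip)
            && status == ((IpStatusKey) o).status;
    }

    @Override
    public int hashCode() {
        return Objects.hash(ip, status);
    }

    // Demo helper: serialize and deserialize through a byte array.
    static IpStatusKey roundTrip(IpStatusKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            IpStatusKey copy = new IpStatusKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A value type that only implements Writable needs the same `write` and `readFields` pair but no `compareTo`.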
What should you watch for in the key/value emitted on the Reduce side?
How to influence the partitioning strategy of an MR program

public class CustomPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
job.setPartitionerClass(CustomPartitioner.class);
// or
conf.setClass("mapreduce.job.partitioner.class",CustomPartitioner.class,Partitioner.class);
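
The `& Integer.MAX_VALUE` in the partitioner above matters because `hashCode()` may be negative, and Java's `%` keeps the sign of the dividend. The plain-Java sketch below (class and method names are hypothetical) reproduces only the arithmetic, without the Hadoop `Partitioner` base class.

```java
// Hypothetical stand-alone model of the getPartition() arithmetic.
class PartitionSketch {
    // Clearing the sign bit guarantees a result in [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7;                          // Integer.hashCode() is the value itself
        System.out.println(key.hashCode() % 4);    // -3: not a valid partition index
        System.out.println(partitionFor(key, 4));  // 1
        // Math.abs is not a safe substitute: Math.abs(Integer.MIN_VALUE) is still negative.
        System.out.println(partitionFor(Integer.MIN_VALUE, 4)); // 0
    }
}
```

A custom rule (for example routing by a key prefix) would replace the body of `partitionFor`, as long as it still returns a value between 0 and `numReduceTasks - 1`.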

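Why does data skew appear in MR computation?

An ill-chosen key makes the data distribute unevenly. Choose a key suited to the statistic being computed so that records spread evenly across the partitions. This usually requires the programmer to have some prior sense of the data being analyzed.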
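
One way to check a candidate key before running the job is to count how many sample records each partition would receive. The sketch below is an illustration in plain Java under assumed names (`SkewSketch`, `recordsPerPartition`) and made-up sample keys; it applies the same hash-modulo rule as the partitioner shown earlier.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for eyeballing skew on a sample of keys.
class SkewSketch {
    // Counts the records that would be routed to each reduce partition.
    static int[] recordsPerPartition(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample: one "hot" key dominates, so one partition gets most records.
        List<String> keys = Arrays.asList(
            "CN", "CN", "CN", "CN", "CN", "CN", "CN", "CN", "US", "JP");
        System.out.println(Arrays.toString(recordsPerPartition(keys, 3)));
    }
}
```

All records with the same key always land in the same partition, so a hot key cannot be balanced by adding reducers; only a different key choice changes the distribution.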

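Enabling Map-side compression (can only be tested in a real environment)

It effectively reduces the network bandwidth used during Reduce Shuffle, at the cost of extra CPU spent compressing and decompressing the data.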

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
GzipCodec.class, CompressionCodec.class);

A local simulation may throw a not a gzip file error, so it is better to test this on a cluster.

combiner

job.setCombinerClass(CombinerReducer.class);

CombinerReducer is simply a class that extends Reducer. The combiner typically runs during the spill phase and while spill files are merged.
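
To make the effect concrete, the plain-Java sketch below simulates what a summing combiner does to the output of a single map task in a word-count style job. It does not use the Hadoop API; `CombinerSketch` and `combine` are hypothetical names, and each map output record is assumed to carry a count of 1.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of local aggregation on the map side.
class CombinerSketch {
    // Sums the counts of identical keys within ONE map task's output.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum); // each record contributes a count of 1
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("a", "b", "a", "a", "b");
        Map<String, Integer> combined = combine(mapOutput);
        // prints: 5 records before, 2 after: {a=3, b=2}
        System.out.println(mapOutput.size() + " records before, "
            + combined.size() + " after: " + combined);
    }
}
```

Because the framework may run the combiner zero, one or several times, it is only safe for operations such as sum, max or min whose result does not change when applied to partial results; a plain average is not.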
