Big Data Study Notes (4): MapReduce, a Distributed Processing Framework

Overview from the official documentation:
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture Guide).

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI™ based).

Key points about MapReduce:

1. It is a distributed processing framework.
2. It provides reliable, fault-tolerant, parallel computation over big data across large numbers of nodes.

The computation over a big data set is spread across many server nodes and run in parallel, while the framework schedules the tasks, monitors them, and re-executes any that fail.

Let's walk through the MapReduce workflow with a word-count example.
Requirement: compute word frequencies for the file hello.txt. The file has two lines; each line contains words separated by \t, as follows:
hello world hello world hello
welcome world

Expected result:
hello 3
world 3
welcome 1

1: MapReduce Workflow
Using the word-count example:
[Figure: MapReduce workflow for the word-count example]
Execution steps:
Prepare the input data for the map phase
Mapper processing
Shuffle
Reduce processing
Output the results

The steps in detail (a code sketch follows this list):
1. Load the file contents.
2. Split the contents into individual lines.
3. Mapping: emit output as key-value pairs, one (word, 1) pair per word.
4. Shuffling: group the key-value pairs by key, merging the entries that share the same key.
5. Reduce: for each key, add up its values.
6. Output the final result.
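
To make these steps concrete, here is a minimal sketch of the word-count Mapper and Reducer against the org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are our own (the latter is reused by the Combiner setup below), and in a real project each class would live in its own .java file:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step 3 (Mapping): called once per input line; emits (word, 1) for each token
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // hello.txt is \t-delimited, so split each line on tabs
        for (String token : line.toString().split("\t")) {
            word.set(token);
            context.write(word, ONE);   // e.g. ("hello", 1)
        }
    }
}

// Step 5 (Reduce): after the shuffle, all values for one key arrive together
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();             // add up the 1s for this word
        }
        context.write(word, new IntWritable(sum));  // e.g. ("hello", 3)
    }
}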

2: Combiner
The aggregation performed on the map side is called the Combiner.

Advantage: it reduces IO (less intermediate data is shuffled across the network), which improves job performance.
Limitation: it only applies when partial aggregation is valid; computing an average, for example, goes wrong because the division cannot be split across map tasks (see the worked example below).
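
A quick worked illustration of that limitation: if map task A sees the values 1, 2, 3 (local average 2) and map task B sees only 4 (local average 4), averaging those partial averages gives (2 + 4) / 2 = 3, while the true average of 1, 2, 3, 4 is 2.5. A sum is safe, by contrast: (1 + 2 + 3) + 4 = 10 no matter how the values are grouped, which is exactly why the sum-based WordCountReducer can also serve as the Combiner: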

// When configuring the Job, enable the Combiner
// (here it reuses the Reducer class passed to job.setReducerClass(WordCountReducer.class);)
job.setCombinerClass(WordCountReducer.class);

3: Partitioner

The Partitioner decides which reduce task processes each record that a map task emits.
Default implementation: take the hash of the key, modulo the number of reduce tasks.

// The Partitioner the framework uses by default (from the Hadoop source)
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    // & Integer.MAX_VALUE clears the sign bit so the result is non-negative
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
// numReduceTasks: the number of reducers configured for the job; it also
// determines how many output files the reduce phase produces.
// HashPartitioner is MapReduce's default partitioning rule.
// With 3 reducers, hash values map to partitions like this:
// 1 % 3 = 1
// 2 % 3 = 2
// 3 % 3 = 0
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A custom Partitioner
// Phone is a user-defined class
public class PhonePartitioner extends Partitioner<Text, Phone> {

	// Routing rule: phone numbers starting with 13 go to file 0, numbers
	// starting with 15 go to file 1, and everything else goes to file 2,
	// so the job produces 3 output files in total.
	@Override
	public int getPartition(Text phone, Phone value, int numPartitions) {
		if (phone.toString().startsWith("13")) {
			return 0;
		} else if (phone.toString().startsWith("15")) {
			return 1;
		} else {
			return 2;
		}
	}
}

// When configuring the Job, set the custom Partitioner class
job.setPartitionerClass(PhonePartitioner.class);

// Set numReduceTasks to match the number of partitions (3 here)
job.setNumReduceTasks(3);
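
To show where these calls live, here is a sketch of a complete job driver under stated assumptions: PhoneMapper, PhoneReducer, and the Phone value class are hypothetical placeholders (they are not defined in this post), while the Job API calls are the standard org.apache.hadoop.mapreduce ones:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PhoneDriver {
	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration(), "phone partition demo");
		job.setJarByClass(PhoneDriver.class);

		// PhoneMapper/PhoneReducer are placeholders; the map output must be
		// (Text phone, Phone value) pairs for PhonePartitioner to apply
		job.setMapperClass(PhoneMapper.class);
		job.setReducerClass(PhoneReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Phone.class);

		// custom partitioner plus a matching number of reduce tasks
		job.setPartitionerClass(PhonePartitioner.class);
		job.setNumReduceTasks(3);

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}

One design note: the number of reduce tasks must be at least the largest partition index plus one; if getPartition returns 2 while only two reducers are configured, the job fails at runtime with an illegal-partition error.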

Reposted from blog.csdn.net/sinat_34979884/article/details/111885566