Hadoop Learning (3) - MapReduce and YARN Quick Start and Installation

MapReduce is a computation framework that runs jobs in parallel across multiple machines.

It divides every computation into two phases: a map phase and a reduce phase.

 

Map phase: the input file is read from HDFS and handed to map tasks running on multiple machines; the file is split among the map tasks according to its size.

For example, if each map task processes 128 MB of data and the input file is 500 MB, then ceil(500/128) = 4 map tasks are started.

Each map task reads the file line by line. The processing logic is something you write yourself, and note that it is the same for every map task.

The result of processing each line must be a key/value pair.

The method called inside the map task is map(k, v, context): k is the starting offset of the line in the file, v is the content of the line,

and the context is the container into which the generated key/value pairs are placed.

 

 

 

Reduce phase: reduce tasks also run on the machines, and their role is to aggregate, by key, the key/value pairs produced by the map tasks.

The rule for this aggregation is that all pairs with the same key must be sent to the same reduce task; this redistribution is called the shuffle.

The data sharing a key is then processed as one group, and the final results are written back into HDFS.

However many reduce tasks a job has, that many part-r-xxxxx files are generated (for example, two reduce tasks produce part-r-00000 and part-r-00001).

The method inside the reduce task is reduce(k, values, context): k is one key, values is an iterator over all the values that share that key, and the context is where the output is written to HDFS, again as key/value pairs.

 

Introductory example: wordcount

Design idea: each map task reads its part of the file.

The starting-offset key passed into map is not useful here. For each line value we read, we emit one key/value pair per word, with the word as the key and 1 as the value, and write these pairs into the context.

In the reduce phase, pairs with the same key, i.e. the same word, form one group, and counting that group tells us how many times the word appeared.
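
As a concrete illustration (the input line below is made up), the data flows roughly like this:

map input line:      "hello world hello"
map output:          (hello, 1)  (world, 1)  (hello, 1)
after the shuffle:   hello -> [1, 1]    world -> [1]
reduce output:       (hello, 2)  (world, 1)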

 

To start writing the MapReduce business logic in Eclipse, we first need the relevant jar packages; they are found under the share/hadoop folder of the extracted Hadoop distribution.

Copy the jars in those folders, plus the jars under each folder's lib subdirectory, into Eclipse and add them to the build path.

First, write the Mapper class:

package test;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * KEYIN: the type of the key in the data read by the map task; it is the starting offset of the line, a Long.
 * VALUEIN: the type of the value in the data read by the map task; it is the content of the line, a String.
 * 
 * KEYOUT: the key type of the kv result returned by the user-defined map method; in the wordcount logic we return a word, a String.
 * VALUEOUT: the value type of the kv result returned by the user-defined map method; in the wordcount logic we return an integer, an Integer.
 * 
 * 
 * However, in MapReduce the data produced by map has to be transferred to reduce, which requires serialization and deserialization.
 * The JDK's native serialization mechanism produces rather bloated output, which would make data transfer during a MapReduce job inefficient.
 * Hadoop therefore designed its own serialization mechanism, and the data types transferred in MapReduce must implement Hadoop's serialization interface.
 * 
 * For the common JDK types Long, String, Integer, Float, etc., Hadoop provides wrapper types that implement its serialization interface: LongWritable, Text, IntWritable, FloatWritable.
 */
// The first type parameter is the starting offset (not useful here); the second is the line of text that was read;
// the third and fourth are the key and value written to the context, i.e. the k,v pair sent to reduce.
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    
    @Override
    // override the map method
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // split the line into words
        String line = value.toString();
        String[] words = line.split(" ");
        for(String word:words){
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

Next comes the Reducer class:

package test;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// The first and second type parameters are the key and value received from map; the third and fourth are the key and value written to HDFS.
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
    
    
    @Override
    // one key, an iterator over that key's many values, and one context
    protected void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
    
        int count = 0; 
        Iterator<IntWritable> iterator = values.iterator();
        while(iterator.hasNext()){
            
            IntWritable value = iterator.next();
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
    
    

}

 

 

However, the program we wrote has to be submitted to our Hadoop cluster to run, and the component that manages this is YARN.

YARN is a platform for scheduling and running distributed programs.

YARN has two core roles:

1. ResourceManager

Accepts the distributed computation programs submitted by users and allocates resources for them.

Receives from the client how many containers the job needs and schedules the tasks.

Manages and monitors the resource usage on each NodeManager so that the load can be balanced.

2. NodeManager

Manages the computing resources (CPU + memory) of the machine it runs on and creates containers from them.

Accepts the tasks assigned by the ResourceManager, creates containers for them, and reclaims the resources afterwards.

We need to distribute our program's jar package to every NodeManager so that they can run it.

Physically, a NodeManager should be deployed on the same machine as a DataNode.

Physically, the ResourceManager should be deployed on its own dedicated machine.

 

Installing YARN

We do not need to download anything extra for YARN; it is already included in our Hadoop installation. All we have to do is edit a configuration file:

[root@hdp-04 ~]# vi apps/hadoop-2.8.1/etc/hadoop/yarn-site.xml

The first property specifies which machine acts as the ResourceManager; the second specifies the auxiliary service the NodeManager should run (the MapReduce shuffle).

 

<configuration>

<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hdp-01</value>
</property>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

</configuration>

Then copy this file to your other machines.
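
For example (assuming the other machines are called hdp-02 and hdp-03 and Hadoop is installed in the same path everywhere):

scp apps/hadoop-2.8.1/etc/hadoop/yarn-site.xml hdp-02:apps/hadoop-2.8.1/etc/hadoop/
scp apps/hadoop-2.8.1/etc/hadoop/yarn-site.xml hdp-03:apps/hadoop-2.8.1/etc/hadoop/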

On your ResourceManager machine, run start-yarn.sh (to shut down, run stop-yarn.sh).

Hadoop will then start the ResourceManager; it knows about the NodeManagers from the slaves file (at /root/apps/hadoop-2.8.1/etc/hadoop/slaves). Just write the IPs (or hostnames) of your NodeManager machines into it, one per line.
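
For example, if the NodeManager machines are hdp-02, hdp-03 and hdp-04 (hostnames here are only for illustration), the slaves file would simply contain:

hdp-02
hdp-03
hdp-04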

After starting it, you can run jps to check that the processes are up.
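
Roughly speaking (the PIDs will of course differ, and other Hadoop daemons may also be listed), you should see a ResourceManager process on the ResourceManager machine and a NodeManager process on every worker, e.g.:

[root@hdp-01 ~]# jps
2193 ResourceManager
2467 Jps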

Or check it through the web UI; the ResourceManager's web port is 8088, e.g. hdp-01:8088.

 

With YARN installed, we can now write a Java program that submits the job.

 

package test;


import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class JobSubmitter {
    
    public static void main(String[] args) throws Exception {
        
        // In the JVM, set the user identity for accessing HDFS to root, because we need to read and
        // write files stored on the DataNodes; otherwise we may run into permission problems.
        // Constructing a client object for a given HDFS system would look like this:
        //   arg 1: the URI of the HDFS system, arg 2: client configuration, arg 3: the client's identity (user name)
        //FileSystem fs = FileSystem.get(new URI("hdfs://172.31.2.38:9000/"), conf, "root");
        // But setting the user identity that way is not enough, because it is not only our own client that accesses HDFS:
        // the job also creates its own FileSystem object to access the DataNodes, and that object takes the user name
        // from the system properties, so we set the identity like this instead:
        System.setProperty("HADOOP_USER_NAME", "root");
        
        // set up the configuration parameters
        Configuration conf = new Configuration();
        // the default file system the job will access at runtime
        conf.set("fs.defaultFS", "hdfs://172.31.2.38:9000");
        
        // where the job is submitted to run: yarn, or local
        conf.set("mapreduce.framework.name", "yarn");
        // where the ResourceManager is
        conf.set("yarn.resourcemanager.hostname", "172.31.2.38");
        // When submitting the job from Windows, enable cross-platform submission so that Windows-style
        // commands are replaced with their Linux equivalents; e.g. the command for running a program
        // inside a jar differs between Linux and Windows, and this converts it automatically.
        conf.set("mapreduce.app-submission.cross-platform","true");
        // create the job
        Job job = Job.getInstance(conf);
        
        // the location of the packaged jar on the Windows machine
        job.setJar("d:/wc.jar");
        // the Mapper class and Reducer class this job will invoke
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        
        // the key and value types produced by the Mapper implementation
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // the key and value types produced by the Reducer implementation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        
        // the path of the input data the job will process, and the path where the results go
        FileInputFormat.setInputPaths(job, new Path("/wordcount/input"));
        // note: the output path must not already exist
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));
        
        // how many reduce tasks to start
        job.setNumReduceTasks(2);
        
        // submit to YARN and wait for this job to complete before exiting
        boolean res = job.waitForCompletion(true);
        
        System.exit(res?0:-1);
        
    }
    
    

}
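
One way to run this end to end (a sketch; adjust paths, hostnames and file names to your own setup): export the project from Eclipse as a jar to d:/wc.jar so it matches job.setJar above, make sure the input directory exists in HDFS and the output directory does not, then run the JobSubmitter main class. For example (words.txt is just a placeholder input file):

hadoop fs -mkdir -p /wordcount/input
hadoop fs -put words.txt /wordcount/input

Because numReduceTasks is 2, the results appear as /wordcount/output/part-r-00000 and part-r-00001, which you can view with hadoop fs -cat /wordcount/output/part-r-00000.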

 
