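Atguigu Big Data Hadoop Tutorial - Notes 05 [Hadoop-Yarn]

Video: Atguigu Big Data Hadoop Tutorial (Hadoop 3.x, from installation and setup to cluster tuning)

  1. Atguigu Big Data Hadoop Tutorial - Notes 01 [Big Data Overview]
  2. Atguigu Big Data Hadoop Tutorial - Notes 02 [Hadoop-Getting Started]
  3. Atguigu Big Data Hadoop Tutorial - Notes 03 [Hadoop-HDFS]
  4. Atguigu Big Data Hadoop Tutorial - Notes 04 [Hadoop-MapReduce]
  5. Atguigu Big Data Hadoop Tutorial - Notes 05 [Hadoop-Yarn]
  6. Atguigu Big Data Hadoop Tutorial - Notes 06 [Hadoop-Production Tuning Guide]
  7. Atguigu Big Data Hadoop Tutorial - Notes 07 [Hadoop-Source Code Analysis]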

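Contents

05_Atguigu Big Data Technology: Hadoop (Yarn) V3.3

P125 [125_Atguigu_Hadoop_Yarn_Course Introduction] 05:20

P126 [126_Atguigu_Hadoop_Yarn_Basic Architecture] 04:56

P127 [127_Atguigu_Hadoop_Yarn_Working Mechanism] 06:43

P128 [128_Atguigu_Hadoop_Yarn_End-to-End Job Flow] 03:36

P129 [129_Atguigu_Hadoop_Yarn_FIFO Scheduler] 04:18

P130 [130_Atguigu_Hadoop_Yarn_Capacity Scheduler] 10:24

P131 [131_Atguigu_Hadoop_Yarn_Fair Scheduler] 19:25

P132 [132_Atguigu_Hadoop_Yarn_Common Commands] 14:50

P133 [133_Atguigu_Hadoop_Yarn_Core Parameters for Production] 10:26

P134 [134_Atguigu_Hadoop_Yarn_Linux Cluster Snapshot] 04:15

P135 [135_Atguigu_Hadoop_Yarn_Core Parameter Configuration Case for Production] 15:33

P136 [136_Atguigu_Hadoop_Yarn_Creating Multiple Queues in Production & Benefits] 05:43

P137 [137_Atguigu_Hadoop_Yarn_Capacity Scheduler Multi-Queue Case] 12:40

P138 [138_Atguigu_Hadoop_Yarn_Capacity Scheduler Task Priority] 06:52

P139 [139_Atguigu_Hadoop_Yarn_Fair Scheduler Case] 15:06

P140 [140_Atguigu_Hadoop_Yarn_Tool Interface Case: Environment Setup] 05:12

P141 [141_Atguigu_Hadoop_Yarn_Tool Interface Case: Completion] 19:16

P142 [142_Atguigu_Hadoop_Yarn_Course Summary] 10:26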


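05_Atguigu Big Data Technology: Hadoop (Yarn) V3.3

P125 [125_Atguigu_Hadoop_Yarn_Course Introduction] 05:20

P126 [126_Atguigu_Hadoop_Yarn_Basic Architecture] 04:56

Yarn resource scheduler: Yarn is a resource scheduling platform that provides server computing resources to computation programs. It is comparable to a distributed operating system platform, while computation programs such as MapReduce are comparable to applications running on top of that operating system.

Yarn basic architecture: YARN consists mainly of the ResourceManager, NodeManager, ApplicationMaster, and Container components.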

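P127 [127_Atguigu_Hadoop_Yarn_Working Mechanism] 06:43

P128 [128_Atguigu_Hadoop_Yarn_End-to-End Job Flow] 03:36

Relationship among HDFS, YARN, and MapReduce
Job submission process: YARN
Job submission process: HDFS & MapReduce

P129 [129_Atguigu_Hadoop_Yarn_FIFO Scheduler] 04:18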

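1.4 Yarn Schedulers and Scheduling Algorithms

Hadoop currently has three main job schedulers: FIFO, Capacity Scheduler, and Fair Scheduler. The default resource scheduler in Apache Hadoop 3.1.3 is the Capacity Scheduler.

1.4.1 First-In-First-Out Scheduler (FIFO)

FIFO scheduler (First In First Out): a single queue that serves jobs in the order they were submitted, first come, first served.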
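The first-come, first-served rule can be illustrated with a small simulation. This is only a sketch under a simplifying assumption: each job occupies the whole queue until it finishes. The function name `fifo_schedule` and the job names and runtimes are made up for illustration and are not part of YARN.

```python
from collections import deque


def fifo_schedule(jobs):
    """Run jobs one at a time, strictly in submission order.

    jobs: list of (name, runtime) tuples, already in submission order.
    Returns a list of (name, start_time, finish_time) tuples.
    """
    queue = deque(jobs)          # single queue, oldest job at the left
    clock = 0
    timeline = []
    while queue:
        name, runtime = queue.popleft()   # always take the oldest job
        timeline.append((name, clock, clock + runtime))
        clock += runtime                  # the next job waits until this one ends
    return timeline


# Hypothetical jobs: a long job submitted first delays the short ones behind it.
timeline = fifo_schedule([("job1", 5), ("job2", 1), ("job3", 3)])
for name, start, finish in timeline:
    print(name, start, finish)
```

In this model the one-unit job2 cannot start until job1 has finished, which is the main drawback of a single FIFO queue.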

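P130 [130_Atguigu_Hadoop_Yarn_Capacity Scheduler] 10:24

1.4.2 Capacity Scheduler

P131 [131_Atguigu_Hadoop_Yarn_Fair Scheduler] 19:25

1.4.3 Fair Scheduler

P132 [132_Atguigu_Hadoop_Yarn_Common Commands] 14:50

1.5 Common Yarn Commands

P133 [133_Atguigu_Hadoop_Yarn_Core Parameters for Production] 10:26

1.6 Core Yarn Parameters for Production

P134 [134_Atguigu_Hadoop_Yarn_Linux Cluster Snapshot] 04:15

P135 [135_Atguigu_Hadoop_Yarn_Core Parameter Configuration Case for Production] 15:33

2.1 Core Yarn Parameter Configuration Case for Production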

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Have MapReduce use the shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- Hostname of the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node2</value>
    </property>

    <!-- Environment variables to inherit -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>

    <!-- Enable log aggregation -->
	<property>
	    <name>yarn.log-aggregation-enable</name>
	    <value>true</value>
	</property>
	<!-- URL of the log aggregation server -->
	<property>  
	    <name>yarn.log.server.url</name>  
	    <value>http://node1:19888/jobhistory/logs</value>
	</property>
	<!-- Retain aggregated logs for 7 days -->
	<property>
	    <name>yarn.log-aggregation.retain-seconds</name>
	    <value>604800</value>
	</property>

<!-- Scheduler selection; the default is the Capacity Scheduler -->
<property>
	<description>The class to use as the resource scheduler.</description>
	<name>yarn.resourcemanager.scheduler.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

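<!-- Number of threads the ResourceManager uses to handle scheduler requests, default 50. If more than 50 jobs are submitted this value can be raised, but it must not exceed 3 nodes * 4 threads = 12 threads (in practice no more than 8 once other applications are accounted for) -->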
<property>
	<description>Number of threads to handle scheduler interface.</description>
	<name>yarn.resourcemanager.scheduler.client.thread-count</name>
	<value>8</value>
</property>

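<!-- Whether YARN auto-detects hardware to configure itself, default false. If the node runs many other applications, configure manually; if it runs none, auto-detection can be used -->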
<property>
	<description>Enable auto-detection of node capabilities such as
	memory and CPU.
	</description>
	<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
	<value>false</value>
</property>

<!-- Whether logical processors (hyperthreads) count as CPU cores, default false, i.e. physical CPU cores are used -->
<property>
	<description>Flag to determine if logical processors(such as
	hyperthreads) should be counted as cores. Only applicable on Linux
	when yarn.nodemanager.resource.cpu-vcores is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true.
	</description>
	<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
	<value>false</value>
</property>

<!-- Multiplier from physical cores to virtual cores, default 1.0 -->
<property>
	<description>Multiplier to determine how to convert physical cores to
	vcores. This value is used if yarn.nodemanager.resource.cpu-vcores
	is set to -1(which implies auto-calculate vcores) and
	yarn.nodemanager.resource.detect-hardware-capabilities is set to true.
	The number of vcores will be calculated as number of CPUs * multiplier.
	</description>
	<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
	<value>1.0</value>
</property>

<!-- Memory available to the NodeManager, default 8 GB, changed to 4 GB -->
<property>
	<description>Amount of physical memory, in MB, that can be allocated 
	for containers. If set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically calculated(in case of Windows and Linux).
	In other cases, the default is 8192MB.
	</description>
	<name>yarn.nodemanager.resource.memory-mb</name>
	<value>4096</value>
</property>

<!-- Number of CPU cores for the NodeManager; defaults to 8 when not auto-detected from hardware, changed to 4 -->
<property>
	<description>Number of vcores that can be allocated
	for containers. This is used by the RM scheduler when allocating
	resources for containers. This is not used to limit the number of
	CPUs used by YARN containers. If it is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically determined from the hardware in case of Windows and Linux.
	In other cases, number of vcores is 8 by default.</description>
	<name>yarn.nodemanager.resource.cpu-vcores</name>
	<value>4</value>
</property>

<!-- Minimum container memory, default 1 GB -->
<property>
	<description>The minimum allocation for every container request at the RM
	in MBs. Memory requests lower than this will be set to the value of this
	property. Additionally, a node manager that is configured to have less memory
	than this value will be shut down by the resource manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-mb</name>
	<value>1024</value>
</property>

<!-- Maximum container memory, default 8 GB, changed to 2 GB -->
<property>
	<description>The maximum allocation for every container request at the RM
	in MBs. Memory requests higher than this will throw an
	InvalidResourceRequestException.
	</description>
	<name>yarn.scheduler.maximum-allocation-mb</name>
	<value>2048</value>
</property>

<!-- Minimum container CPU cores, default 1 -->
<property>
	<description>The minimum allocation for every container request at the RM
	in terms of virtual CPU cores. Requests lower than this will be set to the
	value of this property. Additionally, a node manager that is configured to
	have fewer virtual cores than this value will be shut down by the resource
	manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-vcores</name>
	<value>1</value>
</property>

<!-- Maximum container CPU cores, default 4, changed to 2 -->
<property>
	<description>The maximum allocation for every container request at the RM
	in terms of virtual CPU cores. Requests higher than this will throw an
	InvalidResourceRequestException.</description>
	<name>yarn.scheduler.maximum-allocation-vcores</name>
	<value>2</value>
</property>

<!-- Virtual memory check, enabled by default, changed to disabled -->
<property>
	<description>Whether virtual memory limits will be enforced for
	containers.</description>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>

<!-- Ratio of virtual memory to physical memory, default 2.1 -->
<property>
	<description>Ratio between virtual memory to physical memory when
	setting memory limits for containers. Container allocations are
	expressed in terms of physical memory, and virtual memory usage
	is allowed to exceed this allocation by this ratio.
	</description>
	<name>yarn.nodemanager.vmem-pmem-ratio</name>
	<value>2.1</value>
</property>

</configuration>
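To see how these values interact, the following sketch replays the rules stated in the property descriptions above: requests below the minimum are raised to the minimum, requests above the maximum are rejected, and the virtual memory limit is the physical allocation times the ratio. It is a simplified model, not YARN code. The class `InvalidResourceRequest` and the function names are hypothetical stand-ins, and any additional rounding a real scheduler applies is ignored.

```python
# Values copied from the yarn-site.xml above.
NODE_MEMORY_MB = 4096    # yarn.nodemanager.resource.memory-mb
NODE_VCORES = 4          # yarn.nodemanager.resource.cpu-vcores
MIN_ALLOC_MB = 1024      # yarn.scheduler.minimum-allocation-mb
MAX_ALLOC_MB = 2048      # yarn.scheduler.maximum-allocation-mb
VMEM_PMEM_RATIO = 2.1    # yarn.nodemanager.vmem-pmem-ratio


class InvalidResourceRequest(Exception):
    """Hypothetical stand-in for YARN's InvalidResourceRequestException."""


def normalize_memory_request(requested_mb):
    """Apply the minimum/maximum allocation rules to one container request."""
    if requested_mb > MAX_ALLOC_MB:
        raise InvalidResourceRequest(
            f"{requested_mb} MB exceeds the maximum of {MAX_ALLOC_MB} MB")
    return max(requested_mb, MIN_ALLOC_MB)


def max_containers_per_node(container_mb, container_vcores=1):
    """Upper bound on containers of one size that fit on a single node."""
    by_memory = NODE_MEMORY_MB // container_mb
    by_vcores = NODE_VCORES // container_vcores
    return min(by_memory, by_vcores)


def vmem_limit_mb(container_mb):
    """Virtual memory ceiling that applies only if the vmem check is enabled."""
    return container_mb * VMEM_PMEM_RATIO


print(normalize_memory_request(512))     # raised to the 1024 MB minimum
print(max_containers_per_node(2048))     # two 2 GB containers fit in 4 GB
print(vmem_limit_mb(1024))               # about 2150 MB of virtual memory
```

With the configuration above the vmem check is disabled (`yarn.nodemanager.vmem-check-enabled` is false), so the last function only shows what the limit would be if it were turned back on.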

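P136 [136_Atguigu_Hadoop_Yarn_Creating Multiple Queues in Production & Benefits] 05:43

2.2 Capacity Scheduler Multi-Queue Submission Case

1) How are queues created in a production environment?

(1) By default the scheduler has only one queue, default, which cannot meet production requirements.

(2) By framework: Hive, Spark, Flink. Each framework's jobs go into a designated queue (not widely used in enterprises).

(3) By business module: login/registration, shopping cart, order placement, business department 1, business department 2.

2) What are the benefits of creating multiple queues?

(1) It guards against an employee accidentally writing runaway code, such as infinite recursion or an infinite loop, that exhausts all cluster resources.

(2) It allows jobs to be degraded, so that during peak periods such as 11.11 and 6.18 the queues for important jobs keep enough resources.

Priority order: business department 1 (important) => business department 2 (fairly important) => order placement (normal) => shopping cart (normal) => login/registration (minor)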

P137 [137_Atguigu_Hadoop_Yarn_Capacity Scheduler Multi-Queue Case] 12:40

2.2.1 Requirements

capacity-scheduler.xml after the P137 changes

<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>
      Maximum number of applications that can be pending and running.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run 
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>
      The ResourceCalculator implementation to be used to compare 
      Resources in the scheduler.
      The default i.e. DefaultResourceCalculator only uses Memory while
      DominantResourceCalculator uses dominant-resource to compare 
      multi-dimensional resources such as Memory, CPU etc.
    </description>
  </property>

  <!-- Define multiple queues: add the hive queue -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default, hive</value>
    <description>
      The queues at the this level (root is the root queue).
    </description>
  </property>

  <!-- Lower the default queue's rated capacity to 40% (default 100%) -->
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>40</value>
    <description>Default queue target capacity.</description>
  </property>
  <!-- Rated capacity of the hive queue -->
  <property>
    <name>yarn.scheduler.capacity.root.hive.capacity</name>
    <value>60</value>
    <description>Hive queue target capacity.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>
      Default queue user limit a percentage from 0.0 to 1.0.
    </description>
  </property>
  <!-- Maximum share of the queue's resources a single user may use; 1 means up to 100% of the queue's configured capacity -->
  <property>
    <name>yarn.scheduler.capacity.root.hive.user-limit-factor</name>
    <value>1</value>
    <description>
      Hive queue user limit, as a multiple of the queue capacity.
    </description>
  </property>

  <!-- Lower the default queue's maximum capacity to 60% (default 100%) -->
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>60</value>
    <description>
      The maximum capacity of the default queue. 
    </description>
  </property>
  <!-- Maximum capacity of the hive queue -->
  <property>
    <name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>
    <value>80</value>
    <description>
      The maximum capacity of the hive queue. 
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>
      The state of the default queue. State can be one of RUNNING or STOPPED.
    </description>
  </property>
  <!-- Start the hive queue -->
  <property>
    <name>yarn.scheduler.capacity.root.hive.state</name>
    <value>RUNNING</value>
    <description>
      The state of the hive queue. State can be one of RUNNING or STOPPED.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>
      The ACL of who can submit jobs to the default queue.
    </description>
  </property>
  <!-- Which users may submit jobs to the queue -->
  <property>
    <name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name>
    <value>*</value>
    <description>
      The ACL of who can submit jobs to the hive queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>
      The ACL of who can administer jobs on the default queue.
    </description>
  </property>
  <!-- Which users may administer the queue (administrator rights: view/kill) -->
  <property>
    <name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name>
    <value>*</value>
    <description>
      The ACL of who can administer jobs on the hive queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_application_max_priority</name>
    <value>*</value>
    <description>
      The ACL of who can submit applications with configured priority.
      For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
    </description>
  </property>
  <!-- Which users may set the priority of submitted jobs -->
  <property>
    <name>yarn.scheduler.capacity.root.hive.acl_application_max_priority</name>
    <value>*</value>
    <description>
      The ACL of who can submit applications with configured priority.
      For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
    </description>
  </property>

   <property>
     <name>yarn.scheduler.capacity.root.default.maximum-application-lifetime
     </name>
     <value>-1</value>
     <description>
        Maximum lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        This will be a hard time limit for all applications in this
        queue. If positive value is configured then any application submitted
        to this queue will be killed after exceeds the configured lifetime.
        User can also specify lifetime per application basis in
        application submission context. But user lifetime will be
        overridden if it exceeds queue maximum lifetime. It is point-in-time
        configuration.
        Note : Configuring too low value will result in killing application
        sooner. This feature is applicable only for leaf queue.
     </description>
   </property>
   <!-- If an application specifies a timeout, the value it can specify when submitted to this queue must not exceed this maximum. -->
   <property>
     <name>yarn.scheduler.capacity.root.hive.maximum-application-lifetime
     </name>
     <value>-1</value>
     <description>
        Maximum lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        This will be a hard time limit for all applications in this
        queue. If positive value is configured then any application submitted
        to this queue will be killed after exceeds the configured lifetime.
        User can also specify lifetime per application basis in
        application submission context. But user lifetime will be
        overridden if it exceeds queue maximum lifetime. It is point-in-time
        configuration.
        Note : Configuring too low value will result in killing application
        sooner. This feature is applicable only for leaf queue.
     </description>
   </property>

   <property>
     <name>yarn.scheduler.capacity.root.default.default-application-lifetime
     </name>
     <value>-1</value>
     <description>
        Default lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        If the user has not submitted application with lifetime value then this
        value will be taken. It is point-in-time configuration.
        Note : Default lifetime can't exceed maximum lifetime. This feature is
        applicable only for leaf queue.
     </description>
   </property>
   <!-- If an application does not specify a timeout, default-application-lifetime is used as the default. -->
   <property>
     <name>yarn.scheduler.capacity.root.hive.default-application-lifetime
     </name>
     <value>-1</value>
     <description>
        Default lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        If the user has not submitted application with lifetime value then this
        value will be taken. It is point-in-time configuration.
        Note : Default lifetime can't exceed maximum lifetime. This feature is
        applicable only for leaf queue.
     </description>
   </property>

  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>
      Number of missed scheduling opportunities after which the CapacityScheduler 
      attempts to schedule rack-local containers.
      When setting this parameter, the size of the cluster should be taken into account.
      We use 40 as the default value, which is approximately the number of nodes in one rack.
      Note, if this value is -1, the locality constraint in the container request
      will be ignored, which disables the delay scheduling.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
    <value>-1</value>
    <description>
      Number of additional missed scheduling opportunities over the node-locality-delay
      ones, after which the CapacityScheduler attempts to schedule off-switch containers,
      instead of rack-local ones.
      Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
      attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
      after 40+20=60 missed opportunities.
      When setting this parameter, the size of the cluster should be taken into account.
      We use -1 as the default value, which disables this feature. In this case, the number
      of missed opportunities for assigning off-switch containers is calculated based on
      the number of containers and unique locations specified in the resource request,
      as well as the size of the cluster.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value></value>
    <description>
      A list of mappings that will be used to assign jobs to queues
      The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
      Typically this list will be used to map users to queues,
      for example, u:%user:%user maps all users to queues with the same name
      as the user.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>
      If a queue mapping is present, will it override the value specified
      by the user? This can be used by administrators to place jobs in queues
      that are different than the one specified by the user.
      The default is false.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
    <value>1</value>
    <description>
      Controls the number of OFF_SWITCH assignments allowed
      during a node's heartbeat. Increasing this value can improve
      scheduling rate for OFF_SWITCH containers. Lower values reduce
      "clumping" of applications on particular nodes. The default is 1.
      Legal values are 1-MAX_INT. This config is refreshable.
    </description>
  </property>


  <property>
    <name>yarn.scheduler.capacity.application.fail-fast</name>
    <value>false</value>
    <description>
      Whether RM should fail during recovery if previous applications'
      queue is no longer valid.
    </description>
  </property>

</configuration>
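A quick sanity check of the two-queue setup above can be sketched as follows. The percentages are copied from the configuration. The cluster size is an assumption: 3 NodeManagers with 4096 MB each, taken from the yarn-site.xml case earlier in these notes. The function names are hypothetical helpers, not part of any Hadoop API.

```python
# Queue percentages copied from the capacity-scheduler.xml above.
QUEUES = {
    "default": {"capacity": 40, "maximum_capacity": 60},
    "hive": {"capacity": 60, "maximum_capacity": 80},
}

# Assumption: 3 NodeManagers, each offering 4096 MB to containers.
CLUSTER_MEMORY_MB = 3 * 4096


def validate_queues(queues):
    """Check that sibling capacities sum to 100 and respect their maximums."""
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    for name, q in queues.items():
        if q["capacity"] > q["maximum_capacity"]:
            raise ValueError(f"{name}: capacity exceeds maximum-capacity")


def queue_memory_mb(queues, cluster_mb):
    """Return {queue: (rated MB, maximum MB)} for the given cluster size."""
    return {
        name: (cluster_mb * q["capacity"] / 100,
               cluster_mb * q["maximum_capacity"] / 100)
        for name, q in queues.items()
    }


validate_queues(QUEUES)
memory = queue_memory_mb(QUEUES, CLUSTER_MEMORY_MB)
for name, (rated, maximum) in memory.items():
    print(f"{name}: rated {rated:.1f} MB, can grow to {maximum:.1f} MB")
```

The gap between the rated and maximum figures is the room each queue has to borrow idle resources from the other.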

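P138 [138_Atguigu_Hadoop_Yarn_Capacity Scheduler Task Priority] 06:52

P139 [139_Atguigu_Hadoop_Yarn_Fair Scheduler Case] 15:06

2.3 Fair Scheduler Case

2.3.1 Requirements

P140 [140_Atguigu_Hadoop_Yarn_Tool Interface Case: Environment Setup] 05:12

2.4 Yarn Tool Interface Case

P141 [141_Atguigu_Hadoop_Yarn_Tool Interface Case: Completion] 19:16

P142 [142_Atguigu_Hadoop_Yarn_Course Summary] 10:26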

I. Hadoop Basics
    1. Common port numbers
        hadoop3.x 
            HDFS NameNode internal communication port: 8020/9000/9820
            HDFS NameNode web UI port for users: 9870
            Yarn job status web UI: 8088
            History server: 19888
        hadoop2.x 
            HDFS NameNode internal communication port: 8020/9000
            HDFS NameNode web UI port for users: 50070
            Yarn job status web UI: 8088
            History server: 19888
    2. Common configuration files
        3.x core-site.xml  hdfs-site.xml  yarn-site.xml  mapred-site.xml workers
        2.x core-site.xml  hdfs-site.xml  yarn-site.xml  mapred-site.xml slaves

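II. HDFS
    1. HDFS block size (key interview topic)
        Determined by disk read/write speed
        In practice, typically 128 MB (small and mid-sized companies) or 256 MB (large companies)
    2. HDFS shell operations (key development topic)
    3. HDFS read and write flows (key interview topic)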

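III. MapReduce
    1. InputFormat
        1) The default is TextInputFormat; input kv: key is the byte offset, value is one line of content
        2) CombineTextInputFormat handles small files by merging multiple files for unified splitting
    2. Mapper 
        setup(): initialization; map(): user business logic; cleanup(): release resources;
    3. Partitioning
        The default partitioner is HashPartitioner, which by default uses the key's hash value % the number of reduce tasks
        Custom partitioning
    4. Sorting
        1) Partial sort: each output file is internally sorted.
        2) Total sort: a single reducer sorts all the data.
        3) Secondary sort: define a custom sort order by implementing the WritableComparable interface and overriding compareTo
            e.g. descending by total traffic, then ascending by upstream traffic
    5. Combiner 
        Precondition: it must not affect the final business logic (summation is fine, averaging is not)
        Pre-aggregates on the map side => one way to mitigate data skew
    6. Reducer
        User business logic;
        setup(): initialization; reduce(): user business logic; cleanup(): release resources;
    7. OutputFormat
        1) The default is TextOutputFormat, which writes to file line by line
        2) Custom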
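The Combiner precondition in item 5 (summation is fine, averaging is not) can be checked with a few numbers. This is a plain Python sketch, not MapReduce code. The two lists stand for the values that two hypothetical map tasks emit for the same key.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical values emitted by two map tasks for the same key.
map_outputs = [[3, 5, 7], [2, 6]]
all_values = [v for part in map_outputs for v in part]

# Summation: combining per map task first gives the same final result.
sum_with_combiner = sum(sum(part) for part in map_outputs)
sum_without_combiner = sum(all_values)

# Averaging: averaging the per-task averages gives a different answer.
naive_avg = mean([mean(part) for part in map_outputs])
true_avg = mean(all_values)

# One common workaround: combine (sum, count) pairs and divide at the end.
partials = [(sum(part), len(part)) for part in map_outputs]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
avg_from_pairs = total / count

print(sum_with_combiner, sum_without_combiner)
print(naive_avg, true_avg, avg_from_pairs)
```

The sums agree, while the naive average of averages differs from the true average because the two map tasks hold different numbers of values.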

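IV. Yarn
    1. Yarn working mechanism (interview question)
    2. Yarn schedulers
        1) FIFO / Capacity / Fair
        2) Apache default scheduler: Capacity Scheduler; CDH default scheduler: Fair Scheduler
        3) Both the Fair and Capacity Schedulers start with a single default queue, so multiple queues need to be created
        4) Small and mid-sized companies: queues by framework: hive, spark, flink, mr
        5) Mid-sized and large companies: queues by business module: login / registration / shopping cart / marketing
        6) Benefits: decoupling, reduced risk, and degraded operation during 6.18 and 11.11
        7) Characteristics of each scheduler:
            In common: support for multiple queues, borrowing resources, and multiple users
            Differences: Capacity Scheduler: gives priority to jobs that arrived first
                 Fair Scheduler: jobs in a queue share the queue's resources fairly
        8) How to choose in production:
            Small and mid-sized companies: concurrency requirements are modest, choose Capacity.
            Mid-sized and large companies: concurrency requirements are higher, choose Fair.
    3. What developers need to master:
        1) How queues work
        2) Common Yarn commands
        3) Core parameter configuration
        4) Configuring the Capacity and Fair Schedulers
        5) Using the Tool interface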
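The difference noted in item 7 (earlier jobs first versus a fair share for every job in the queue) can be sketched with two toy allocation functions. This is a heavily simplified model: it looks at one resource inside one queue and ignores weights, minimum shares, preemption, and everything else the real schedulers do. The function names and demand figures are made up for illustration.

```python
def first_come_allocation(capacity, demands):
    """Give each job what it asks for, in arrival order, until capacity runs out."""
    remaining = capacity
    allocation = []
    for demand in demands:
        grant = min(demand, remaining)
        allocation.append(grant)
        remaining -= grant
    return allocation


def fair_allocation(capacity, demands):
    """Split capacity evenly, capping each job at its own demand (max-min fairness)."""
    allocation = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 1e-9:
        share = remaining / len(active)
        # Jobs whose unmet demand fits inside an equal share are fully served.
        satisfied = [i for i in active if demands[i] - allocation[i] <= share]
        if not satisfied:
            # Everyone wants more than an equal share: hand out equal shares.
            for i in active:
                allocation[i] += share
            break
        for i in satisfied:
            remaining -= demands[i] - allocation[i]
            allocation[i] = float(demands[i])
        active = [i for i in active if i not in satisfied]
    return allocation


# Hypothetical queue with 12 units of resource and three jobs in arrival order.
print(first_come_allocation(12, [8, 6, 4]))   # the last job gets nothing
print(fair_allocation(12, [8, 6, 4]))         # every job gets an equal share
```

Under the first-come rule the job that arrived last has to wait, whereas the fair split lets all three jobs run at once, which matches the note above that the Fair Scheduler suits workloads with higher concurrency requirements.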

Reposted from blog.csdn.net/weixin_44949135/article/details/129772664