Principles and Applications of Big Data Technology, Part III: Big Data Processing and Analysis (2), Hadoop Revisited

1. Optimization and development of Hadoop

1.1 Limitations of Hadoop

These limitations concern MapReduce and HDFS themselves (other ecosystem components are not considered):

1. Low level of abstraction: developers still have to write code by hand even for simple operations (the word-count sketch after this list illustrates the boilerplate involved).

2. Limited expressive power: the abstract Map and Reduce functions of MapReduce reduce development complexity, but they also limit what can be expressed, so some tasks cannot be implemented with Map and Reduce functions alone.

3. Developers must manage dependencies between jobs themselves: a job contains only a Map stage and a Reduce stage, so a realistic problem has to be solved with several jobs, and the dependencies between those jobs are managed by the developers.

4. The overall logic of a program is hard to see: the user's processing logic is buried in code details, and there is no higher-level abstraction mechanism, which makes the code harder to understand and maintain later.

5. Low iteration efficiency: large-scale machine learning tasks often require many rounds of iteration. When implemented with MapReduce, intermediate data and results are stored in HDFS, and repeatedly reading and writing HDFS is inefficient.

6. Wasted resources: Reduce tasks must wait until all Map tasks have completed before they can start.

7. Poor real-time performance: MapReduce supports only offline batch processing and cannot support interactive or real-time data processing.
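To make point 1 concrete, here is a minimal sketch of the classic word-count job written directly against the MapReduce Java API; even this trivial analysis needs a Mapper class, a Reducer class, and driver code (the class name and the input/output paths are illustrative, not from the original text).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in a line of input.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wire the job together and submit it.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```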

2. Hadoop 2.0

2.1 HDFS HA

Problem addressed

HDFS 1.0 has only one name node, so there is a single point of failure. Although there is a secondary name node, its job is to keep the EditLog from growing too large and the recovery time from becoming too long: it periodically fetches the FsImage and EditLog, merges them, and replaces the old FsImage. It does not provide hot standby. When the name node fails, the system cannot immediately switch over to the secondary name node to keep serving clients; the cluster still has to be taken down for recovery.

Solution:

HDFS 2.0 adopts a High Availability (HA) architecture. An HA cluster generally has two name nodes, one in the active state and one in the standby state. The active name node handles all client requests, while the standby name node acts as a backup. Once the active name node fails, the system switches to the standby name node immediately, without interrupting external services.

Features

State synchronization: the two name nodes are kept in sync in real time through a shared storage system. The active node writes its updates to the shared storage system; the standby node monitors that system, and as soon as new writes appear it reads the data and loads it into its own memory, so the state of the two nodes stays fully synchronized.

Data node block information is reported to both name nodes: to ensure that the block-to-data-node mapping held by the standby name node is up to date, every data node is configured with the addresses of both name nodes and sends its block reports and heartbeat messages to both of them.

Only one name node provides external services at any time: ZooKeeper ensures that no more than one name node is active at a time (a client-side configuration sketch follows).
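As an illustration of how an HA name service looks from the client side, here is a minimal sketch using the standard HDFS client API. The nameservice ID mycluster and the hosts namenode1/namenode2 are made-up example values, and in practice these settings usually live in hdfs-site.xml rather than in code.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Clients address a logical nameservice, not a single name node host.
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");

    // Client-side proxy that transparently fails over to whichever
    // name node is currently active.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // The URI refers to the nameservice; a failover of the active
    // name node does not change client code or paths.
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
    fs.close();
  }
}
```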

2.2 HDFS federation

Problem addressed

The single-name-node design of HDFS 1.0 has problems with scalability, overall system performance, and isolation.

In terms of scalability, all metadata is kept in the memory of the single name node and cannot be scaled out horizontally, which limits the number of data blocks, files and directories the system can hold; scaling the name node up vertically instead brings its own problems, such as long startup times and error-prone, crash-prone cleanup.

In terms of overall system performance, the throughput of the entire HDFS is limited by the throughput of the single name node.

In terms of isolation, a single name node can hardly provide isolation between different applications, and HA only solves the single-point-of-failure problem through hot standby.

Solution

HDFS federation introduces multiple independent name nodes so that the name service can scale out horizontally. Each name node manages its own namespace and its own blocks, and the name nodes do not need to coordinate with each other. Federation is also seamlessly backward compatible: a deployment with a single name node continues to work without any configuration changes.

Each name node in an HDFS federation provides namespace and block management. A federation therefore has multiple independent namespaces, and each namespace manages its own set of blocks; the blocks belonging to one namespace form a block pool (a logical concept), and the blocks in a block pool can reside on different data nodes. All name nodes thus share the underlying data node storage, and the failure of one name node neither takes its data nodes offline nor affects the work of the other name nodes.

Clients access the multiple namespaces through a mount table: a client reaches different sub-namespaces via different mount points. Mounting every namespace into the global mount table makes them globally shared, while mounting a namespace only into a personal mount table makes it visible just to that application. A sketch of a client-side mount table follows below.
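The following minimal sketch shows how a client-side mount table can be expressed with the ViewFs configuration keys; the mount table name ClusterX and the hosts nn1/nn2 are illustrative values, and real deployments normally place these settings in core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederationMountTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The client sees one logical file system named ClusterX ...
    conf.set("fs.defaultFS", "viewfs://ClusterX");

    // ... whose mount table maps each top-level directory to the
    // namespace of a different name node in the federation.
    conf.set("fs.viewfs.mounttable.ClusterX.link./user",
        "hdfs://nn1.example.com:8020/user");
    conf.set("fs.viewfs.mounttable.ClusterX.link./data",
        "hdfs://nn2.example.com:8020/data");

    FileSystem fs = FileSystem.get(conf);
    // Paths under /user resolve to nn1 and paths under /data to nn2,
    // but the application only ever sees viewfs://ClusterX/...
    System.out.println(fs.exists(new Path("/data")));
  }
}
```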

Advantages

HDFS cluster scalability: each of the multiple name nodes manages part of the directory tree, so the cluster can scale out and is no longer limited by the memory of a single name node.

Overall system performance: multiple name nodes manage different data and serve clients at the same time, giving higher read and write throughput.

Good isolation: users can assign different business data to different name nodes, so different workloads interfere with each other very little.

Note

HDFS federation does not solve the single-point-of-failure problem: a standby name node still has to be deployed for each name node.

2.3 A new-generation resource scheduling framework: YARN

Drawbacks of MapReduce 1.0

MapReduce 1.0 consists of one JobTracker and several TaskTrackers. The former is responsible for job scheduling and resource management; the latter execute the specific tasks assigned by the JobTracker.

Single point of failure: the system has only one JobTracker, so the whole system becomes unavailable when it fails.

The JobTracker does too much: it is responsible not only for job scheduling and failure recovery but also for resource management and allocation; the resulting memory overhead is huge and increases the risk of failure.

Prone to memory overflow: resource allocation on the TaskTracker side ignores the actual CPU and memory state and is based only on the number of MapReduce tasks; when two tasks with large memory consumption are assigned to the same TaskTracker, memory overflow easily occurs.

Unreasonable resource allocation: resources are rigidly divided into slots, which are further split into Map slots and Reduce slots for Map and Reduce tasks respectively. When the Map slots are used up, Map tasks cannot use the many remaining Reduce slots, which wastes resources.

YARN Design Ideas

The basic idea of YARN is to decentralize responsibilities and lighten the load on the JobTracker: its original functions of resource management, task scheduling and task monitoring are split apart and handed to different new components. YARN consists of a ResourceManager, ApplicationMasters and NodeManagers. The ResourceManager is responsible for resource management, the ApplicationMaster for task scheduling and monitoring, and the NodeManager for executing the work of the original TaskTracker. This greatly reduces the burden that the JobTracker carried and improves the stability and efficiency of the system.

In Hadoop 2.0, the resource scheduling function was separated out of MapReduce 1.0 to form YARN, which is a pure resource management and scheduling framework rather than a computing framework. The MapReduce stripped of its resource scheduling function became MapReduce 2.0, a pure computing framework that runs on YARN, with resource management and scheduling services provided by YARN.

Functions of the three major components of YARN

ResourceManager

Functions: handles client requests; starts and monitors ApplicationMasters; monitors NodeManagers; performs resource allocation and scheduling.

The ResourceManager is responsible for resource management and allocation for the whole system. It consists mainly of a resource scheduler and an applications manager.

The resource scheduler is responsible for scheduling and allocating resources and is not responsible for tracking or monitoring application state. Based on the resource requests from ApplicationMasters, it allocates cluster resources to the corresponding applications in the form of containers, according to certain constraints (such as capacity and queue limits).

The applications manager is responsible for managing all applications in the system, including application submission, negotiating resources with the scheduler to start the ApplicationMaster, monitoring the ApplicationMaster while it runs, and restarting it when it fails (a client-side submission sketch follows).
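To illustrate how a client hands a job to the applications manager, here is a minimal sketch using the public YarnClient API; the application name, queue and ApplicationMaster launch command are made-up example values, and localization of the ApplicationMaster jar is omitted.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitApplicationSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager (applications manager) for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");   // illustrative name
    ctx.setQueue("default");

    // Describe the container in which the ApplicationMaster should start;
    // the command is a placeholder for a real AM launch command.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        null, null,
        Collections.singletonList("echo launch-my-application-master"),
        null, null, null);
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM

    // The applications manager accepts the submission and negotiates with
    // the scheduler to start the ApplicationMaster.
    ApplicationId appId = yarnClient.submitApplication(ctx);
    System.out.println("Submitted application " + appId);
  }
}
```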

ApplicationMaster

Functions: requests resources for the application and assigns them to the application's internal tasks; handles task scheduling, monitoring and fault tolerance.

When the ResourceManager receives a job submitted by a user, it starts the scheduling process based on the job's context information and the container status information collected from NodeManagers, and launches an ApplicationMaster for the job.

When a user job is submitted, the ApplicationMaster negotiates with the ResourceManager for resources, and the ResourceManager allocates resources to the ApplicationMaster in the form of containers.

The ApplicationMaster then assigns the obtained resources to its internal tasks (Map or Reduce tasks), performing a "second-level allocation" of resources.

It keeps communicating with the NodeManagers to start, run, monitor and stop the application; it monitors how the requested resources are used, tracks the execution progress and status of all tasks, and performs failure recovery when a task fails (that is, it requests resources again and restarts the task).

It periodically sends heartbeat messages to the ResourceManager, reporting resource usage and application progress; when the job finishes, the ApplicationMaster deregisters its containers with the ResourceManager and its execution cycle ends.

NodeManager

Functions: resource management on a single node; handling commands from the ResourceManager; handling commands from the ApplicationMaster; container lifecycle management; monitoring the resource usage (CPU, memory, etc.) of each container; tracking node health and keeping in touch with the ResourceManager through heartbeat messages; reporting resource usage and the running state of each container to the ResourceManager; receiving requests from the ApplicationMaster to start or stop containers.

Characteristics of containers: as the unit of dynamic resource allocation, each container encapsulates a certain amount of CPU, memory, disk and other resources, which caps the resources each application can use. Container placement normally takes into account where the data to be processed is located and prefers nearby nodes, realizing the principle of "moving computation close to the data".

Characteristics of the scheduler: the scheduler is designed as a pluggable component. YARN ships with several ready-to-use schedulers and also allows users to design their own schedulers to match their needs.

Task status monitoring: the NodeManager manages abstract containers and handles only container-related matters; it is not responsible for tracking the state of each individual task (Map or Reduce task). That work is done by the ApplicationMaster, which keeps track of each task's execution state by continuously communicating with the NodeManagers.

Cluster deployment: the YARN components are deployed together with the other components of a Hadoop cluster in a unified way.

How YARN works

(1) A user writes a client application and submits it to YARN. The submission includes the ApplicationMaster program, the command for launching the ApplicationMaster, the user program, and so on.

(2) The ResourceManager in YARN receives and handles the request from the client, allocates a container for the application, and communicates with the NodeManager hosting that container to launch an ApplicationMaster for the application inside it.

(3) After the ApplicationMaster is created, it first registers with the ResourceManager, so that the running status of the application can be queried through the ResourceManager.

Steps (4) to (7) are the execution of the application itself:

(4) The ApplicationMaster requests resources from the ResourceManager by polling (a sketch of this request loop using the YARN client API follows this list).

(5) The ResourceManager allocates resources to the requesting ApplicationMaster in the form of containers. Once a request succeeds, the ApplicationMaster communicates with the NodeManager hosting each container and asks it to start the task.

(6) After asking a container to start a task, the ApplicationMaster sets up the task's runtime environment, writes the task launch command into a script, and runs that script in the container.

(7) Each task reports its status and progress to the ApplicationMaster, so that the ApplicationMaster always knows the running state of every task.

(8) When the application finishes, the ApplicationMaster deregisters with the ResourceManager's applications manager and shuts itself down. If the ApplicationMaster fails, the ResourceManager restarts it until the job completes.
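Here is a minimal sketch of steps (4) and (5) from the ApplicationMaster's point of view, using the public YARN client API (AMRMClient/NMClient). The container size, priority and launch command are made-up example values; registration details, error handling and resource localization are omitted.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ApplicationMasterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Step (3): register this ApplicationMaster with the ResourceManager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // Step (4): ask the ResourceManager for one container
    // (1 GB of memory, 1 vcore) at default priority.
    Resource capability = Resource.newInstance(1024, 1);
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // Step (5): poll until the container is allocated, then ask the
    // NodeManager that hosts it to launch a (made-up) task command.
    int launched = 0;
    while (launched < 1) {
      AllocateResponse response = rmClient.allocate(0.1f);
      for (Container container : response.getAllocatedContainers()) {
        ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
            null, null,
            Collections.singletonList("echo run-my-task"),  // illustrative command
            null, null, null);
        nmClient.startContainer(container, ctx);
        launched++;
      }
      Thread.sleep(1000);  // polling interval
    }

    // Step (8): deregister when the work is done.
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```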

Advantages of the YARN framework

(1) In moving from the MapReduce 1.0 framework to the YARN framework, the client side has not changed: most of its calling APIs and interfaces remain compatible, so code originally developed for Hadoop 1.0 can run on the Hadoop 2.0 platform without major modification.

(2) The resource consumption of the ResourceManager, which carries the central service functions, is greatly reduced: the resource-hungry work of task scheduling and monitoring is handled by ApplicationMasters, and since each job has its own ApplicationMaster, monitoring is distributed.

(3) YARN is a pure resource scheduling and management framework: different types of computing frameworks, including MapReduce, can run on top of it as long as a corresponding ApplicationMaster is implemented for each of them.

(4) Resource management in YARN is more efficient than in MapReduce 1.0: resources are managed in units of containers rather than slots.

Development goal of YARN

"One cluster, multiple frameworks":

"One cluster, multiple frameworks" means deploying a single unified resource scheduling and management framework, YARN, on one cluster and running various other computing frameworks on top of it. YARN provides unified resource scheduling and management services for these frameworks and can adjust the resources each one occupies according to its load, achieving cluster-wide resource sharing and elastic scaling of resources. Different application workloads can thus be mixed on a single cluster, which effectively improves cluster utilization, and the different computing frameworks can share the underlying storage, avoiding moving datasets between clusters.

Representative functional components in the Hadoop ecosystem

For each component below, its functions and typical application scenarios are listed.

Pig

Functions:

Provides Pig Latin, a SQL-like language (with operations such as Filter, GroupBy, Join and OrderBy, plus support for user-defined functions).

Lets users carry out complex data analysis by writing simple scripts, without having to write complex MapReduce applications.

Pig automatically converts the user's script into MapReduce jobs that run on the Hadoop cluster, and it can automatically optimize the generated MapReduce programs.

When writing a Pig program, the user does not need to worry about execution efficiency, which greatly reduces programming time.

Used together with Hadoop, Pig makes processing massive data much easier: it is far less difficult than writing MapReduce programs in Java or C++, and the same data processing and analysis can be implemented with much less code. A small sketch of driving Pig from Java follows this component.

Application scenarios:

Data queries aimed only at technical staff.

Ad hoc data processing needs: a Pig script can be written and run quickly, without preparatory work such as creating tables first.
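A minimal sketch of driving a short Pig Latin script from Java through the PigServer API; it assumes Pig is on the classpath and that a local tab-separated file logs.txt exists, and the field names, file paths and aliases are illustrative.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    // Local mode for illustration; ExecType.MAPREDUCE would submit
    // the generated jobs to a Hadoop cluster instead.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // The Pig Latin below is compiled by Pig into MapReduce jobs;
    // the script author never writes Mapper/Reducer code.
    pig.registerQuery("logs = LOAD 'logs.txt' AS (user:chararray, bytes:long);");
    pig.registerQuery("grouped = GROUP logs BY user;");
    pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

    // Triggers execution and writes the result directory.
    pig.store("totals", "totals_out");
    pig.shutdown();
  }
}
```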

Tez

Functions:

Tez is an Apache open-source computing framework that supports DAG (directed acyclic graph) jobs.

It derives directly from the MapReduce framework. Its core idea is to split the Map and Reduce operations further: Map is split into Input, Processor, Sort, Merge and Output, and Reduce is split into Input, Shuffle, Sort, Merge, Processor and Output. These elementary operations can be combined flexibly to form new operations, and after being assembled by control programs they form one large DAG job.

Running MapReduce work as a DAG job exposes the overall processing logic of the program, so redundant Map stages in a workflow can be removed and unnecessary operations eliminated, improving data processing performance.

Hortonworks applied Tez to the optimization of the Hive data warehouse and improved its performance by roughly a factor of 100.

Tez's approach to the high latency and low performance of Hive and Pig is different from that of products that support real-time interactive query and analysis, such as Impala, Dremel and Drill.

Application scenarios:

In the Hadoop 2.0 ecosystem, computing frameworks such as MapReduce, Hive and Pig ultimately execute their data analysis as MapReduce tasks, so the Tez framework can play an important role there.

Using Tez to optimize the performance of MapReduce, Pig, Hive and similar frameworks.

Addressing the problems of the existing MapReduce framework with iterative computation (such as PageRank) and interactive computation.

Kafka

Functions:

Kafka is a high-throughput distributed publish-subscribe messaging system. Through Kafka, users can publish large volumes of messages and also subscribe to and consume messages in real time.

Kafka can serve both online real-time processing and offline batch processing.

Application scenarios:

In a company's big data ecosystem, Kafka can serve as the data exchange hub: different types of distributed systems (relational databases, NoSQL databases, stream processing systems, batch processing systems, and so on) can all connect to Kafka, enabling real-time, efficient exchange of different kinds of data with the various Hadoop components. A producer/consumer sketch follows below.
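As an illustration of the publish-subscribe model described above, here is a minimal sketch using the standard Kafka Java client; the broker address localhost:9092, the topic name events, and the consumer group analytics are made-up example values.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSketch {
  public static void main(String[] args) {
    // Producer side: publish a message to the "events" topic.
    Properties p = new Properties();
    p.put("bootstrap.servers", "localhost:9092");
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
      producer.send(new ProducerRecord<>("events", "user-1", "page_view"));
    }

    // Consumer side: subscribe to the same topic and poll for messages.
    Properties c = new Properties();
    c.put("bootstrap.servers", "localhost:9092");
    c.put("group.id", "analytics");
    c.put("auto.offset.reset", "earliest"); // read from the beginning of the topic
    c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
      consumer.subscribe(Collections.singletonList("events"));
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<String, String> r : records) {
        System.out.println(r.key() + " -> " + r.value());
      }
    }
  }
}
```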
