Introduction to the Hadoop Ecosystem and Core Components: Hadoop | MapReduce | YARN

Hello everyone, I am Beishan. Long time no see! This article records my notes from learning about the Hadoop ecosystem.

Big Data Era

Big data refers to data sets that cannot be captured, managed, and processed by conventional software tools within an acceptable time. It is a massive, fast-growing, and diverse information asset that requires new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization.

The 5 Vs of big data

The five characteristics are commonly listed as Volume, Velocity, Variety, Value, and Veracity.

Application scenarios include product recommendations in e-commerce, personal credit evaluation in finance, and congestion prediction and optimal route planning in transportation.

Big data scenarios raise two questions: how to store massive data, and how to compute over massive data?

This is where the concepts of distributed systems and clusters come in.


Hadoop


Hadoop overview


Official website: https://hadoop.apache.org/

In a narrow sense, Hadoop refers to an open-source software project of the Apache Software Foundation.

It allows users to process massive data sets across clusters of machines using a simple programming model.

Hadoop core components
HDFS: distributed file system, which solves the storage of massive data
YARN: cluster resource management and task scheduling framework, which solves resource and task scheduling
MapReduce: distributed computing framework, which solves computation over massive data

In a broad sense, Hadoop refers to the big data ecosystem built around Hadoop


Features and advantages of Hadoop

Hadoop adoption in China and abroad

Hadoop was first adopted by Internet companies: Yahoo, Facebook, and IBM abroad, and BAT (Baidu, Alibaba, Tencent) and Huawei in China.

Hadoop's success lies in its versatility and simplicity

Hadoop cleanly separates what to do from how to do it. What to do is a business question; how to do it is a technical one. Users take care of the business logic, and Hadoop takes care of the technology.

Hadoop distributions

Hadoop is available as an open-source community version and as commercial releases.
Open-source community version: https://hadoop.apache.org/
Commercial release: https://www.cloudera.com/products/open-source/apache-hadoop.html

Hadoop has now reached version 3.x. Hadoop 1.0 consisted of HDFS (distributed file storage) and MapReduce (resource management plus distributed data processing). In 2.0, MapReduce was split up: it kept distributed data processing, and a new component, YARN, took over cluster resource management and task scheduling.

The architecture components of Hadoop 3.0 are similar to those of Hadoop 2.0; 3.0 focuses on performance optimization.

Overview of a Hadoop cluster

  • A Hadoop cluster is made up of two clusters: the HDFS cluster and the YARN cluster
  • The two clusters are logically separate but usually physically colocated
  • Both clusters use a standard master-slave architecture

MapReduce is a computing framework, a code-level component, so there is no such thing as a MapReduce cluster.

In the example deployment, the HDFS cluster consists of one master (NN, the NameNode), three slaves (DN, the DataNodes), and a "secretary" (SNN, the Secondary NameNode).

The YARN cluster consists of the RM (ResourceManager) and the NMs (NodeManagers).

Hadoop cluster = HDFS cluster + YARN cluster

  • Logically separate means that the two clusters do not depend on each other
  • Physically together means that their processes are deployed on the same machines

HDFS Distributed File System

A file system is a method of storing and organizing data. It implements data storage, hierarchical organization, access, and retrieval, making it easy for users to access and find files. A file system replaces the data blocks used by physical devices such as hard disks with the logical abstraction of a tree-structured directory. Users do not need to know where the data physically lives on disk; they only need to remember the directory and name of the file.

Common traditional file systems

Difficulties encountered when storing massive data:

  • Traditional storage hardware is not general-purpose, and the cost of the equipment plus later maintenance, upgrades, and expansion is very high
  • Traditional storage keeps storage and computation apart: data sits where it is stored and has to be moved whenever it needs to be processed
  • Performance is low: the I/O bottleneck of a single node cannot be overcome, so high concurrency and high throughput over massive data are hard to support
  • Scalability is poor

Data and Metadata


HDFS core properties

  • Distributed storage
  • Metadata records
  • Block storage
  • Replication mechanism

Distributed storage

When the amount of data is large, single-machine storage hits a bottleneck. Distributed storage solves the problem by scaling out horizontally across machines.

Metadata records

Files scattered across different machines are hard to find, so metadata records which machines hold each file, allowing a file to be located quickly.

Block storage

A file may be too large to fit on a single machine, and uploading or downloading it as a whole is inefficient. The file is therefore split into blocks that are stored on different machines, and the blocks are operated on in parallel to improve efficiency.
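
As a rough illustration of block storage, the sketch below computes how a file of a given size would be cut into fixed-size blocks. The 128 MB block size matches the HDFS default in Hadoop 2.x and later; the function name `split_into_blocks` is hypothetical and not part of any Hadoop API.

```python
from typing import List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size in Hadoop 2.x+


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) in bytes for each block of a file (hypothetical helper)."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A 300 MB file becomes two full blocks plus one shorter block.
blocks = split_into_blocks(300 * 1024 * 1024)
print([length // (1024 * 1024) for _, length in blocks])  # block sizes in MB
```

Each (offset, length) pair could then be read or written independently, which is what makes parallel access to a large file possible.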


Replication mechanism

Copies of each block are kept on different machines. This redundant storage keeps the data safe.
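
To make the idea of replicas and metadata a little more concrete, here is a minimal sketch that assigns every block to several distinct machines and keeps the resulting block-to-location mapping. It assumes a simple round-robin placement, which is a simplification: real HDFS placement is rack-aware. The replication factor of 3 matches the HDFS default, while `place_replicas`, the block IDs, and the DataNode names are made up for illustration.

```python
from typing import Dict, List


def place_replicas(block_ids: List[str], datanodes: List[str],
                   replication: int = 3) -> Dict[str, List[str]]:
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplified on purpose: this only shows that every block ends up on
    several different machines, not how HDFS really chooses them.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for i, block_id in enumerate(block_ids):
        placement[block_id] = [datanodes[(i + r) % len(datanodes)]
                               for r in range(replication)]
    return placement


# The mapping plays the role of metadata: block -> machines holding a copy.
metadata = place_replicas(["blk_0", "blk_1"], ["dn1", "dn2", "dn3", "dn4"])
print(metadata)
```

With three copies of each block on different machines, losing any single machine still leaves two copies of every block it held.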

Introduction to HDFS

  • HDFS (Hadoop Distributed File System) is one of the core components of Apache Hadoop and serves as the lowest-level distributed storage service of the big data ecosystem. It is fair to say that the first problem big data has to solve is the storage of massive data

HDFS shell operation

The HDFS shell provides a set of shell-like commands that interact directly with the Hadoop distributed file system and with the other file systems Hadoop supports. The summary of commonly used commands is credited to the CSDN user temperament & Mo Yu; thank you.

MapReduce

Hadoop component: MapReduce

Topics: the divide-and-conquer idea, design concepts, official examples, and the execution process

Divide and conquer

The core idea of MapReduce is divide and conquer: split the original problem into several sub-problems, solve the sub-problems, and finally combine their results to obtain the solution to the original problem.

Once the original problem has been split into small problems, they can be processed in parallel at the same time. Of course, if the problem cannot be split, or if the small problems depend on each other after the split, divide and conquer cannot be used. So there are two questions to ask:

  1. Can the problem be split?
  2. Are there dependencies between the parts?

Example: counting all the cars parked in a parking lot

Map: you count one column, I count another column, and so on. This is the Map stage. The more people counting at the same time, the faster it goes.

Reduce: after counting, everyone gets together and adds up all the counts. This is the Reduce stage, which merges and summarizes.
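
The parking lot example can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop code, and the parking lot data is made up.

```python
from functools import reduce

# Each inner list is one column of the parking lot:
# 1 means a parked car, 0 means an empty space (made-up data).
parking_lot = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]

# Map: every "person" counts one column, independently of the others.
column_counts = list(map(sum, parking_lot))

# Reduce: add the per-column counts together to get the total.
total = reduce(lambda a, b: a + b, column_counts, 0)

print(column_counts, total)
```

Because no column depends on any other, the counting in the Map step could be handed to as many people (or machines) as there are columns.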


Understanding the idea of MapReduce

  • The idea of MapReduce is easy to understand; the key is how to design a distributed computing program based on it
  • The following points explain the Hadoop team's design concepts for MapReduce
  1. How to deal with big data scenarios

  2. Build an abstract programming model

MapReduce borrows ideas from functional languages and provides a high-level abstract model for parallel programming built on two functions, Map and Reduce.

Map: apply the same processing to every element in a set of data

Reduce: further aggregate and organize the intermediate results produced by Map

MapReduce defines two abstract programming interfaces, Map and Reduce, which are implemented by the user. Both interfaces operate on <key, value> pairs, which is the data type that MapReduce processes.
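
The shape of the two interfaces is commonly written as map: (k1, v1) → list of (k2, v2) and reduce: (k2, list of v2) → list of (k3, v3). The sketch below expresses that shape with Python type hints and one user implementation. The maximum-temperature task, the "year,temperature" line format, and all function names are assumptions chosen for illustration; this is not the Hadoop Java API.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

# map:    (k1, v1)       -> list of (k2, v2)
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (k2, [v2, ...]) -> list of (k3, v3)
Reducer = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]


def max_temp_mapper(offset: int, line: str) -> Iterable[Tuple[str, int]]:
    """Input: (line offset, "year,temperature"). Output: (year, temperature)."""
    year, temp = line.split(",")
    yield year, int(temp)


def max_temp_reducer(year: str, temps: List[int]) -> Iterable[Tuple[str, int]]:
    """Input: (year, all temperatures of that year). Output: (year, maximum)."""
    yield year, max(temps)


print(list(max_temp_mapper(0, "1950,22")))
print(list(max_temp_reducer("1950", [22, 0, -11])))
```

The user only writes these two functions; grouping all values of the same key between them is left to the framework.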

  3. Provide a unified architecture that hides the underlying details

Distributed Computing Concepts

  • Distributed computing is a way of computing that stands in contrast to centralized computing
  • As computer technology has developed, some applications have come to need enormous computing power. With centralized computing they would take a very long time to finish
  • Distributed computing breaks the application into many small parts and distributes them to multiple computers for processing, which shortens the overall computing time and greatly improves efficiency

Introduction to MapReduce

  1. Hadoop MapReduce is a distributed computing framework for easily writing applications that process large amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner
  2. MapReduce is both a guiding idea for processing massive data and a programming model for distributed computation over large-scale data

Origins of MapReduce

MapReduce was proposed by Google in the 2004 paper "MapReduce: Simplified Data Processing on Large Clusters".

Features of MapReduce

Easy to program: MapReduce provides interfaces for secondary development. Implementing a few interfaces is enough to complete a distributed program, and the computation itself is handed over to the framework. The program is deployed and run on a Hadoop cluster, and the cluster can scale to hundreds or thousands of nodes.

Good scalability: when computing resources run short, computing power can be expanded by adding machines. Distributed computing based on MapReduce can keep growing nearly linearly with the number of nodes, which is the key to MapReduce's handling of massive data. With hundreds or thousands of compute nodes, offline data of several terabytes or even petabytes can be processed easily.

High fault tolerance: a Hadoop cluster is built and deployed in a distributed fashion. If any machine node goes down, its computing tasks are transferred to another node without affecting the completion of the overall job, and the whole process is handled inside Hadoop.

Suited to offline processing of massive data: it can handle data volumes at the GB, TB, and PB levels.

Limitations of MapReduce

Although MapReduce has many advantages, it also has limitations. A limitation does not mean that something cannot be done, only that in some scenarios the results are poor enough that MapReduce is not a good fit.

Poor real-time performance: MapReduce is mainly used for offline jobs and cannot return results within seconds.

No stream processing: in stream processing, data arrives and is computed continuously, and the data is dynamic. As an offline computing framework, MapReduce mainly targets static data sets, and the data cannot change dynamically.

MapReduce instance processes

At run time, a complete MapReduce program has three kinds of instance processes:

  • MRAppMaster: responsible for process scheduling and state coordination of the entire MR program
  • MapTask: responsible for the entire data processing flow of the map phase
  • ReduceTask: responsible for the entire data processing flow of the reduce phase

MapReduce phase composition

  • A MapReduce program can contain only one Map stage and one Reduce stage, or only a Map stage
  • It cannot contain multiple Map stages or multiple Reduce stages
  • If the user's business logic is very complex, the only option is to run several MapReduce programs one after another
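
Running several MapReduce programs in series simply means feeding the output of one job into the next. The sketch below simulates this in memory with a hypothetical `run_job` helper (one Map stage plus one Reduce stage); it is a toy model, not how Hadoop jobs are actually submitted.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[object, object]


def run_job(records: Iterable[Pair],
            mapper: Callable[[object, object], Iterable[Pair]],
            reducer: Callable[[object, List[object]], Iterable[Pair]]) -> List[Pair]:
    """Run exactly one Map stage and one Reduce stage in memory (hypothetical helper)."""
    groups: Dict[object, List[object]] = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output: List[Pair] = []
    for k in sorted(groups):  # keys reach the reducer in sorted order
        output.extend(reducer(k, groups[k]))
    return output


# Job 1: count how often each word occurs.
def count_mapper(_offset, line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Job 2: group words by their count, using job 1's output as its input.
def invert_mapper(word, count):
    yield count, word


def collect_reducer(count, words):
    yield count, sorted(words)


lines = [(0, "a b a"), (1, "c b a")]
job1_output = run_job(lines, count_mapper, count_reducer)
job2_output = run_job(job1_output, invert_mapper, collect_reducer)
print(job1_output)
print(job2_output)
```

Neither job needs a second Map or Reduce stage; the extra logic lives in a second, separate job.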

MapReduce data types

  • Throughout a MapReduce program, data flows in the form of key-value (kv) pairs
  • When programming to solve real business problems, you need to think about what the input and output kv pairs of each stage are
  • MapReduce has many built-in default behaviors, such as sorting and grouping, that depend on the key of the data, so deciding on the kv types is extremely important

MapReduce official example


Calculate the value of pi
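
The official pi example estimates the value with a quasi-Monte Carlo method: sample points in a square, count how many fall inside a circle, and derive pi from the ratio. Below is a minimal single-machine sketch of that idea. It assumes a Halton sequence as the point source and a quarter circle in the unit square; the function names, the number of tasks, and the sample counts are hypothetical, and this is not the code of the Hadoop example itself.

```python
from typing import Iterable, Tuple


def halton(index: int, base: int) -> float:
    """Return element `index` (1-based) of the Halton sequence for `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def map_task(task_id: int, samples: int) -> Tuple[int, int]:
    """Map: count this task's points inside and outside the quarter circle."""
    inside = 0
    start = task_id * samples + 1  # each task takes its own slice of the sequence
    for i in range(start, start + samples):
        x, y = halton(i, 2), halton(i, 3)
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples - inside


def reduce_task(partials: Iterable[Tuple[int, int]]) -> float:
    """Reduce: sum the partial counts; the inside ratio approximates pi / 4."""
    partials = list(partials)
    inside = sum(p[0] for p in partials)
    outside = sum(p[1] for p in partials)
    return 4.0 * inside / (inside + outside)


pi_estimate = reduce_task(map_task(t, 10_000) for t in range(4))
print(pi_estimate)
```

Each map task works on its own slice of points, so the tasks are independent and can run in parallel; only the two small counters per task travel to the reduce step.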


WordCount word frequency statistics

  • WordCount is the classic introductory example in big data computing, the equivalent of hello world
  • Through WordCount you can get a feel for the execution flow and the default behaviors behind MapReduce

WordCount programming ideas

  • The core of the map stage: split the input into words and tag each word with a 1, so the output is <word, 1>

  • The core of the shuffle stage: through the default sorting and grouping of the MR framework, pairs with the same key (the same word) are gathered into one group, forming a new kv pair

  • The core of the reduce stage: process each group produced by shuffle. A group holds all the key-value pairs of one word, and the sum of all its 1s is the total count of that word.
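
The three stages above can be sketched in plain Python as follows. This is an in-memory illustration of the idea with made-up input lines and hypothetical function names, not the Hadoop WordCount source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Split each line into words and emit <word, 1> for every word."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Sort by key and group the values of identical keys together."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for word, one in sorted(pairs):
        groups[word].append(one)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the 1s of each group to get the total count of each word."""
    return {word: sum(ones) for word, ones in groups.items()}


lines = ["hello hadoop", "hello mapreduce", "hello yarn hadoop"]
pairs = map_phase(lines)        # [("hello", 1), ("hadoop", 1), ...]
groups = shuffle_phase(pairs)   # {"hadoop": [1, 1], "hello": [1, 1, 1], ...}
counts = reduce_phase(groups)
print(counts)
```

Note that the user-visible logic lives only in `map_phase` and `reduce_phase`; in a real job the sorting and grouping in between is done by the framework.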

WordCount program submission

Map phase execution process

(Figures omitted: the MapReduce overall execution flow chart and the Map stage execution process, both illustrated with the WordCount program.)

Reduce phase execution process

(Figure omitted: the Reduce phase execution process.)

Shuffle mechanism

  • The original meaning of shuffle is to shuffle cards: to scramble ordered data into data that is as random as possible
  • In MR, shuffle is more like the reverse of that process. It means reorganizing the unordered output of the map side, according to specified rules, into data with a definite order so that the reduce side can receive and process it
  • Generally speaking, everything from Map producing its output to Reduce receiving that data as input is called shuffle
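
As a rough sketch of what "reorganizing according to rules" means, the code below partitions map output across reducers, then sorts and groups it by key. Choosing a reducer by hashing the key modulo the number of reducers mirrors the idea behind Hadoop's default hash partitioning, but `zlib.crc32` is an assumption used here only to get a hash that is stable between runs, and the function names are hypothetical.

```python
import zlib
from itertools import groupby
from typing import Dict, Iterable, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key: hash(key) mod number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output: Iterable[Tuple[str, int]],
            num_reducers: int) -> Dict[int, List[Tuple[str, List[int]]]]:
    """Partition, sort, and group map output: reducer -> [(key, [values])]."""
    buckets: Dict[int, List[Tuple[str, int]]] = {r: [] for r in range(num_reducers)}
    for key, value in map_output:
        buckets[partition(key, num_reducers)].append((key, value))

    result: Dict[int, List[Tuple[str, List[int]]]] = {}
    for reducer_id, pairs in buckets.items():
        pairs.sort(key=lambda kv: kv[0])  # sort by key within each partition
        result[reducer_id] = [
            (key, [value for _, value in group])  # group values of the same key
            for key, group in groupby(pairs, key=lambda kv: kv[0])
        ]
    return result


map_output = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
shuffled = shuffle(map_output, num_reducers=2)
print(shuffled)
```

The important property is that all values of one key land on the same reducer, in key order, no matter which map task produced them.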

Shuffle on the Map side


Shuffle on the Reduce side


Disadvantages of the shuffle mechanism

  • Shuffle is the core and essence of a MapReduce program
  • Shuffle is also where MapReduce is criticized the most. The fact that MapReduce is slower than computing engines such as Spark and Flink has a lot to do with the shuffle mechanism
  • Shuffle frequently moves data back and forth between memory and disk

YARN

Topics: introduction, architecture and components, program submission interaction flow, and the scheduler

Introduction to YARN

  • Apache Hadoop YARN (Yet Another Resource Negotiator) is a new Hadoop resource manager
  • YARN is a general-purpose resource management system and scheduling platform that provides unified resource management and scheduling for upper-layer applications
  • Its introduction has brought huge benefits to clusters in terms of utilization, unified resource management, and data sharing

YARN function description


Overview of YARN


YARN architecture, components

(Figure omitted: YARN official architecture diagram.)

Program submission YARN interaction process

  1. MR job submission: Client → RM
  2. Resource request: MRAppMaster → RM
  3. MR job status report: Container (Map|Reduce task) → Container (MRAppMaster)
  4. Node status report: NM → RM

YARN resource scheduler Scheduler


How to understand resource scheduling

  • In an ideal world, every request made by an application would be granted by YARN immediately. In reality resources are limited, and on a busy cluster applications often have to wait before some of their requests are satisfied. The job of the YARN scheduler is to allocate resources to applications according to some defined policy
  • In YARN, the component responsible for allocating resources to applications is the Scheduler. It is one of the core components of the ResourceManager. The Scheduler is dedicated entirely to scheduling jobs; it does not track the status of applications
  • Generally speaking, scheduling is a hard problem with no single optimal strategy. For this reason, YARN provides several schedulers and configurable policies to choose from

Scheduler strategies

Choose the scheduler that fits your needs:

  • FIFO Scheduler
  • Capacity Scheduler
  • Fair Scheduler
  1. FIFO Scheduler: applications are placed in a single queue and run in the order in which they were submitted
  2. Capacity Scheduler: the cluster is divided into queues, each with a configured share of the capacity
  3. Fair Scheduler: resources are balanced dynamically among all running applications
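
To get a feel for why the choice of scheduler matters, here is a small idealized model comparing first-in-first-out with equal sharing. It assumes all jobs are submitted at time 0, that each job can use the whole cluster, and that "work" is measured in cluster-time units. The function names and job sizes are made up, and the model ignores queues, containers, and everything else a real YARN scheduler deals with.

```python
from typing import Dict, List, Tuple

Job = Tuple[str, float]  # (name, amount of work in cluster-time units)


def fifo_finish_times(jobs: List[Job]) -> Dict[str, float]:
    """FIFO: jobs run one at a time, in submission order, on the whole cluster."""
    finish, now = {}, 0.0
    for name, work in jobs:
        now += work
        finish[name] = now
    return finish


def fair_finish_times(jobs: List[Job]) -> Dict[str, float]:
    """Equal sharing: all unfinished jobs split the cluster evenly."""
    finish, now, done = {}, 0.0, 0.0
    pending = sorted(jobs, key=lambda job: job[1])  # smallest job finishes first
    for i, (name, work) in enumerate(pending):
        active = len(pending) - i        # jobs still sharing the cluster
        now += (work - done) * active    # time until this job's remaining work is done
        done = work
        finish[name] = now
    return finish


jobs = [("big", 100.0), ("small", 10.0)]  # "big" was submitted first
fifo = fifo_finish_times(jobs)
fair = fair_finish_times(jobs)
print(fifo)
print(fair)
```

In this model the small job waits behind the big one under FIFO, while under equal sharing it finishes much earlier and the big job finishes at the same time as before.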

I am Beishan (https://beishan.blog.csdn.net/). Comments and discussion are welcome.

Origin blog.csdn.net/qq_45176548/article/details/130075523