[Big Data] The Hadoop Ecosystem and Its Components, Illustrated

Before diving into the wider Hadoop ecosystem, let's first look at the three major components of Hadoop itself, namely HDFS, MapReduce, and YARN, which together constitute the core of the Hadoop distributed computing framework.

  • HDFS (Hadoop Distributed File System): HDFS is Hadoop's distributed file system and the foundation for distributing and storing large-scale data across multiple nodes. HDFS is mainly responsible for data storage and management: it splits large data sets into data blocks and distributes those blocks across different nodes for storage, improving data reliability and processing efficiency.

  • MapReduce: MapReduce is Hadoop's distributed computing framework. It provides a simple programming model that decomposes large-scale data processing into many small tasks executed in parallel, greatly improving processing efficiency. The MapReduce model has two phases, Map and Reduce: the Map phase processes the data in small pieces, and the Reduce phase merges the intermediate results.

  • YARN (Yet Another Resource Negotiator): YARN is Hadoop's resource manager, responsible for allocating and managing computing resources across multiple applications, which improves the utilization of cluster resources. YARN divides the cluster's computing resources into containers, provides appropriate resources to different applications, and monitors and manages the running state of each application.

1. HDFS

HDFS is Hadoop's distributed file system designed to store large files on inexpensive hardware. It is highly fault tolerant and provides high throughput for applications. HDFS is best suited for applications with very large data sets.

HDFS follows a master/slave architecture: the master node runs the NameNode daemon, and the slave nodes run the DataNode daemon.
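
As a rough illustration, here is a minimal Java sketch that writes a file into HDFS through the standard org.apache.hadoop.fs.FileSystem API; the NameNode address and file path are placeholders for this article, not values from a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");       // placeholder path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");                      // HDFS splits larger files into blocks
            }
            // The block size shows how HDFS would chunk a large file across DataNodes.
            System.out.println("Block size: " + fs.getFileStatus(file).getBlockSize());
        }
    }
}
```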

2. MapReduce

MapReduce is Hadoop's data processing layer. It divides a job into small tasks, distributes those tasks to many machines connected by a network, and assembles all the partial results into a final result data set. The basic data unit in MapReduce is the key-value pair: all data, whether structured or not, must be converted into key-value pairs before it is passed through the MapReduce model. In the MapReduce framework, the processing unit is moved to the data, rather than the data to the processing unit.
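
To make the Map and Reduce phases concrete, here is the classic word-count job as a minimal sketch using the org.apache.hadoop.mapreduce API; the input and output paths are supplied on the command line and are illustrative only.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each line into words and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word and emit the total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```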

3. YARN

YARN stands for Yet Another Resource Negotiator; it is the resource manager of a Hadoop cluster. YARN implements resource management and job scheduling in Hadoop clusters. Its main idea is to split job scheduling and resource management into separate daemons.

YARN provides two daemons: the ResourceManager and the NodeManager. Both handle data computation in YARN. The ResourceManager runs on the master node of the Hadoop cluster and arbitrates resources among all applications, while a NodeManager runs on every slave node. The NodeManager's responsibility is to monitor containers and their resource usage (CPU, memory, disk, and network) and report these details to the ResourceManager.
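
As a small illustration of talking to the ResourceManager, the sketch below lists running applications through the YarnClient API; it assumes a reachable cluster whose address is picked up from the yarn-site.xml configuration on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppsDemo {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address from the cluster configuration on the classpath.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for all applications it knows about.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```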

4. Hive

Hive is a data warehouse project built on top of Hadoop. It is designed to make data summarization, ad hoc querying, and analysis of large volumes of data easy. With HiveQL, users can run ad hoc queries against data sets stored in HDFS and use the results for further analysis. Hive also supports custom user-defined functions (UDFs) for specialized analysis.

Let us understand how Apache Hive handles SQL queries:

  • The user submits a query to the driver (e.g. via ODBC/JDBC) using the command line or the web UI (a JDBC sketch follows this list).
  • The driver parses the query with the help of the query compiler to check the syntax and build a query plan.
  • The compiler sends a metadata request to the metastore.
  • In response, the metastore provides the metadata to the compiler.
  • The compiler then validates the requirements and sends the plan back to the driver.
  • The driver sends the execution plan to the execution engine.
  • The query is executed as a MapReduce job: the execution engine sends the job to the JobTracker on the name node, which assigns it to TaskTrackers on the data nodes, where the query runs.
  • After the query is executed, the execution engine receives the results from the data nodes.
  • The execution engine sends the result values to the driver.
  • The driver sends the results to the Hive interface (the user).
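
As mentioned in the first step above, one common way to submit a query is over JDBC. The following is a minimal sketch using the HiveServer2 JDBC driver; the server address and the sales table are hypothetical examples, not part of this article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; requires the hive-jdbc driver on the classpath.
        String url = "jdbc:hive2://hiveserver2-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this HiveQL into a distributed job behind the scenes.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```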

5. Pig

Pig was developed at Yahoo to analyze big data stored in Hadoop HDFS. Pig provides a platform for analyzing massive data sets, consisting of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs (a small embedding example follows the list of properties below).

Pig has the following key properties:

  • Optimization opportunities: Pig optimizes how queries are executed, so users can focus on the meaning of a program rather than its efficiency.
  • Extensibility: Pig lets users create user-defined functions for special-purpose processing.
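
For a feel of how a Pig program can be driven from Java, here is a minimal word-count sketch using the PigServer embedding API; the input path and the local execution mode are assumptions for illustration only.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCountDemo {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; use ExecType.MAPREDUCE against a real cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery call adds one Pig Latin statement to the logical plan.
        pig.registerQuery("lines   = LOAD 'input/access.log' AS (line:chararray);");
        pig.registerQuery("words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts  = FOREACH grouped GENERATE group, COUNT(words);");

        // Triggers evaluation of the plan and writes the results to 'word_counts'.
        pig.store("counts", "word_counts");
    }
}
```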

6. Mahout

Mahout is a framework for creating machine learning applications. It provides a rich set of components from which you can build a customized recommender system with an algorithm of your choice. Mahout was developed with performance, scalability, and flexibility in mind.

Mahout defines interfaces for these key recommender abstractions (the sketch after the list wires them together):

  • DataModel
  • UserSimilarity
  • ItemSimilarity
  • UserNeighborhood
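
The sketch below shows how those four abstractions fit together in a simple user-based recommender built on Mahout's Taste API; the ratings.csv input (userID,itemID,preference rows) and the neighborhood size are illustrative assumptions.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical input file with lines of the form: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend the top 3 items for user 1 based on similar users' preferences.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```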

7. HBase

HBase is a distributed, open-source, versioned, non-relational database modeled after Google Bigtable. It is an important component of the Hadoop ecosystem that leverages the fault tolerance of HDFS to provide real-time read and write access to data. Although HBase is a database, it is more accurately described as a data store, because it lacks RDBMS features such as triggers, a full query language, and secondary indexes.

HBase has the following features:

  • It provides linear and modular scalability.
  • It provides strictly consistent reads and writes.
  • Automatic and configurable sharding of tables.
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Easy-to-use Java API for client access (see the sketch below).
  • Query predicates are pushed down via server-side filters.
  • A Thrift gateway and RESTful web services supporting XML, Protobuf, and binary data encodings.
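
The sketch below uses the Java client API mentioned above to write and then read back a single cell; the users table and its info column family are hypothetical and would need to be created first (for example, via the HBase shell).

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        // Assumes hbase-site.xml on the classpath and an existing 'users' table
        // with a column family named 'info'.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "row1", column info:name = "alice".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back in real time.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```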

8. ZooKeeper

ZooKeeper acts as the coordinator between the different Hadoop services: it maintains configuration information, provides naming, and offers distributed synchronization and group services. ZooKeeper helps applications deployed in distributed environments cope with problems such as race conditions and coordination bugs.
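
As a small illustration of the coordination role, the sketch below connects to a ZooKeeper ensemble and stores a piece of shared configuration in a znode that other services could watch; the connection string and znode path are placeholders.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Placeholder connection string; point this at your ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, (WatchedEvent e) -> {
            if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration under a top-level znode.
        String path = zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        byte[] data = zk.getData(path, false, null);
        System.out.println("Stored config: " + new String(data));
        zk.close();
    }
}
```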

9. Sqoop

Sqoop is a data transfer tool for moving data between Hadoop and relational databases. It is used to import data from a relational database management system (such as MySQL or Oracle) or a mainframe into Hadoop (HDFS), transform the data with Hadoop MapReduce, and export it back to an RDBMS. Sqoop uses MapReduce to import and export data, so it inherits parallel processing and fault tolerance.

10. Flume

Flume is a log transport tool that complements Sqoop: Flume works on unstructured or semi-structured data such as logs, while Sqoop is aimed at structured data sources. Flume is a reliable, distributed, and available system for efficiently collecting, aggregating, and moving large volumes of log data from many different sources to HDFS. It is not limited to log aggregation; it can also be used to transport large amounts of event data.

Flume has the following three components:

  • Source
  • Channel
  • Sink

11. Oozie

Oozie is a workflow scheduling framework for scheduling Hadoop MapReduce and Pig jobs. An Apache Oozie workflow is a collection of actions, such as Hadoop MapReduce jobs and Pig jobs, arranged in a DAG (Directed Acyclic Graph) of control dependencies. A "control dependency" from one action to another means that the second action cannot start until the first action completes.

An Oozie workflow contains two types of nodes: control flow nodes and action nodes.

  • Control flow nodes: these nodes provide a mechanism to control the execution path of a workflow.

  • Action nodes: these nodes provide the mechanism by which a workflow triggers the execution of computation/processing tasks, such as Hadoop MapReduce, HDFS, Pig, SSH, and HTTP jobs (a submission sketch using the Java client follows).
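
The sketch below submits a workflow through the Java OozieClient; the Oozie URL, the workflow application path in HDFS, and the nameNode/jobTracker properties are placeholders that depend on how the hypothetical workflow.xml is parameterized.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Placeholder HDFS directory containing workflow.xml for this job.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow, then ask for its current status.
        String jobId = oozie.run(conf);
        System.out.println("Workflow " + jobId + " status: "
                + oozie.getJobInfo(jobId).getStatus());
    }
}
```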

12. Ambari

Ambari is used to configure, manage and monitor Apache Hadoop clusters.

It enables system administrators to perform the following tasks:

  • Configuration of a Hadoop cluster: Ambari provides a way to install Hadoop services on any number of nodes and handles the configuration of those services for the cluster.

  • Management of a Hadoop cluster: Ambari provides central control for managing Hadoop services, such as starting, stopping, and reconfiguring services across the entire cluster.

  • Monitoring of a Hadoop cluster: Ambari provides a dashboard for monitoring cluster health (e.g. nodes that are down, low remaining disk space, etc.).

13. Spark

Spark is a fast, general-purpose cluster computing system and a very powerful big data tool. Spark provides rich APIs in Python, Scala, Java, and R. It also ships with higher-level tools such as Spark SQL, GraphX, MLlib, and Spark Streaming, which are used for different types of workloads and will be covered in the Spark section.
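
For comparison with the MapReduce example earlier, here is the same word count written as a minimal Spark job using the Java API; the HDFS input and output paths are placeholders, and the job is assumed to be launched with spark-submit (which supplies the master URL).

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Master URL is provided by spark-submit; only the app name is set here.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder HDFS paths for input and output.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
                .mapToPair(word -> new Tuple2<>(word, 1))                       // emit (word, 1)
                .reduceByKey(Integer::sum);                                     // sum counts per word

        counts.saveAsTextFile("hdfs:///data/word-counts");
        sc.stop();
    }
}
```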

Origin blog.csdn.net/be_racle/article/details/132506264