Introduction to the Big Data Hadoop Ecosystem

A component-by-component introduction to the Hadoop big data ecosystem

    Hadoop is currently the most widely used distributed framework for big data processing; it is reliable, efficient, and scalable.

    The core components of Hadoop are HDFS and MapReduce. As processing needs have diversified, additional components have appeared one after another, enriching the Hadoop ecosystem. The rough structure of the current ecosystem is outlined below.

   By the objects they serve and the level at which they sit, the components can be divided into: the data source layer, data transmission layer, data storage layer, resource management layer, data computation layer, task scheduling layer, and business model layer. Below is a brief introduction to the main components of the Hadoop ecosystem.

    1. HDFS (Distributed File System)

HDFS is the foundation of the entire Hadoop system and is responsible for data storage and management. It is highly fault-tolerant, designed to be deployed on low-cost hardware, and provides high-throughput access to application data, which makes it suitable for applications with very large data sets.

Client: splits files into blocks; when accessing HDFS, it first interacts with the NameNode to obtain the locations of the target file's blocks, and then interacts with the DataNodes to read and write the data.

NameNode: the master node; there is only one per HDFS cluster. It manages the HDFS namespace and the block mapping information, maintains replication settings, and handles client requests.

DataNode: a slave node that stores the actual data and reports status information to the NameNode. By default, each block is replicated on three different DataNodes to achieve reliability and fault tolerance.

Secondary NameNode: periodically merges the fsimage and edits log and pushes the result to the NameNode; it can assist in recovering the NameNode in an emergency, but it is not a hot standby for the NameNode.
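As a concrete illustration of the client/NameNode/DataNode interaction described above, here is a minimal sketch using the Hadoop Java FileSystem API; the NameNode address and the file path are placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; "hdfs://namenode:8020" is a placeholder address.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");

            // Write: the client asks the NameNode where to place blocks,
            // then streams the bytes to the chosen DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client gets block locations from the NameNode,
            // then reads the data directly from the DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```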

Hadoop 2 introduces two important new HDFS features: Federation and High Availability (HA):

  • Federation allows multiple NameNodes in one cluster. They are independent of each other, do not need to coordinate with one another, and each manages its own portion of the namespace. DataNodes act as common block storage: every DataNode registers with all NameNodes in the cluster, sends them heartbeat reports, and executes commands from any NameNode.

  • High Availability eliminates the single point of failure of Hadoop 1, where a NameNode failure brought the cluster to a halt. HDFS HA provides automated failover: a standby NameNode takes over the work of the failed active NameNode.

    2. MapReduce (distributed computing framework)

MapReduce is a disk-based distributed parallel computing model for batch processing of large amounts of data. The Map phase applies a user-specified operation to independent elements of the data set and produces intermediate key-value pairs, and the Reduce phase merges all intermediate values that share the same key to obtain the final result.

JobTracker: the master node (only one per cluster). It manages all jobs, handles job/task monitoring and error handling, splits each job into a series of tasks, and assigns them to TaskTrackers.

TaskTracker: a slave node that runs Map tasks and Reduce tasks and interacts with the JobTracker to report task status.

Map task: parses each data record, passes it to the user-written map() function, and writes the output to local disk (for a map-only job, the output is written directly to HDFS).

Reduce task: remotely reads its input from the output of the Map tasks, sorts the data, groups records by key, and passes each group to the user-written reduce() function.
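To make the Map and Reduce roles concrete, here is the classic word-count job written against the Hadoop Java MapReduce API. This is a minimal sketch: the input and output paths are assumed to be passed on the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```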

    3. Spark (distributed computing framework)

Spark is a memory-based distributed parallel computing framework. Unlike MapReduce, the intermediate results of a job can be kept in memory, so they do not have to be written to and re-read from HDFS. This makes Spark better suited to data mining and machine learning algorithms that require iteration.

Cluster Manager: in standalone mode, this is the Master node, which controls the entire cluster and monitors the Workers; in YARN mode it is the ResourceManager.

Worker node: a slave node responsible for managing the compute node and starting Executors or the Driver.

Driver: runs the main() function of the application.

Executor: a process running on a worker node that executes tasks for a specific application.

Spark abstracts data as RDDs (Resilient Distributed Datasets) and provides a number of libraries on top of them, including Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. Developers can combine these libraries seamlessly within the same application.

Spark Core: contains the basic functionality of Spark; in particular, it defines the RDD API and the transformations and actions on RDDs. The other Spark libraries are built on top of RDDs and Spark Core.

Spark SQL: provides an API for interacting with Spark through the Hive Query Language (HiveQL), the SQL dialect of Apache Hive. Each database table is treated as an RDD, and Spark SQL queries are converted into Spark operations.

Spark Streaming: processes and controls real-time data streams. Spark Streaming allows programs to handle real-time data much like ordinary RDDs, implementing pseudo-stream processing through short micro-batches.

MLlib: a library of commonly used machine learning algorithms, implemented as Spark operations on RDDs. It contains scalable learning algorithms, such as classification and regression, that require iterative passes over large data sets.

GraphX: a set of algorithms and tools for parallel graph operations and computations. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices on a path.
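A minimal local sketch of the RDD model using Spark's Java API (Spark 2.x or later assumed); the hard-coded list stands in for data that would normally be read from HDFS.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Build an RDD; a real job would typically read from HDFS via sc.textFile(...).
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("hello spark", "hello hadoop"));

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator()) // transformation
                    .mapToPair(word -> new Tuple2<>(word, 1))                   // transformation
                    .reduceByKey(Integer::sum);                                 // transformation

            // collect() is the action that triggers execution of the lazy lineage above.
            counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        }
    }
}
```

Nothing is computed until the collect() action runs, and the intermediate results stay in memory rather than being written to HDFS, which is what makes iterative workloads cheaper than in MapReduce.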

    4. Flink (distributed computing framework)

Flink is a memory-based distributed parallel processing framework. It is similar to Spark but differs considerably in some design decisions: for Flink, the primary scenario is streaming data, and batch data is treated as just a special case of a stream.

Flink vs. Spark

At runtime, a Spark RDD is represented as Java objects, while a Flink data set is mainly represented as a logical plan. Flink programs are therefore optimized by default, much like programs written against Spark's DataFrame API, whereas Spark's RDD API receives no such optimization.

Spark has RDDs for batch processing and DStreams for streaming, but a DStream is internally still built on the RDD abstraction. Flink has DataSet for batch processing and DataStream for streaming, which are two independent abstractions on the same common engine. In this sense Spark performs pseudo stream processing (micro-batches), while Flink performs true stream processing.
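A minimal streaming word count with Flink's Java DataStream API (Flink 1.11+ assumed). A real job would read from an unbounded source such as Kafka; the bounded element list here just keeps the sketch self-contained.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkStreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder bounded source; an unbounded source (Kafka, sockets, ...) is typical.
        DataStream<String> lines = env.fromElements("hello flink", "hello hadoop");

        lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split(" ")) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                })
                .keyBy(t -> t.f0)  // partition the stream by word
                .sum(1)            // running count per word
                .print();

        env.execute("flink-streaming-word-count");
    }
}
```

Each incoming record updates the running count immediately rather than waiting for a micro-batch, which is the "true stream processing" contrast drawn above.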

    5. Yarn/Mesos (distributed resource manager)

YARN arrived with the next generation of MapReduce, MRv2, which evolved from the first-generation MapReduce. It was proposed mainly to address the poor scalability of the original Hadoop and its lack of support for multiple computing frameworks.

Mesos was born in a research project at UC Berkeley and has since become an Apache project. Some companies, such as Twitter, use Mesos to manage cluster resources. Like YARN, Mesos is a platform for unified resource management and scheduling, and it also supports multiple computing frameworks such as MapReduce and streaming engines.

    6. Zookeeper (distributed collaboration service)

ZooKeeper solves data management problems in a distributed environment: unified naming, status synchronization, cluster management, configuration synchronization, and so on.

Many Hadoop components rely on ZooKeeper, which runs on the computer cluster and helps coordinate their operation.
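As a sketch of the configuration-synchronization use case, the snippet below uses the plain ZooKeeper Java client to publish and read a configuration znode. The ensemble address and znode path are made up for illustration; production code would normally add retries or use a wrapper library such as Curator.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // "localhost:2181" is a placeholder ensemble address.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a piece of configuration as a znode...
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "batch.size=128".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // ...and read it back, as another process sharing the configuration would.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```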

    7. Sqoop (data synchronization tool)

Sqoop is short for SQL-to-Hadoop and is mainly used to transfer data between traditional databases and Hadoop. Its imports and exports are essentially MapReduce programs, taking full advantage of MapReduce's parallelism and fault tolerance.

Sqoop uses database technology to describe the data schema and transfers data between relational databases, data warehouses, and Hadoop.

    8. Hive/Impala (data warehouse based on Hadoop)

Hive defines a SQL-like query language (HQL) and converts queries into MapReduce jobs that run on Hadoop. It is usually used for offline analysis.

HQL is used to run queries over data stored on Hadoop. Hive allows developers who are unfamiliar with MapReduce to write familiar query statements, which are then translated into MapReduce jobs on Hadoop.
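A minimal sketch of submitting HQL to Hive from Java over JDBC, assuming a running HiveServer2 and the Hive JDBC driver on the classpath; the host, credentials, and the access_logs table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 address and database.
        String url = "jdbc:hive2://hiveserver2:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // Hive compiles the HQL below into distributed jobs behind the scenes.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS views FROM access_logs GROUP BY page");

            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```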

Impala is an MPP (Massively Parallel Processing) SQL query engine for processing large amounts of data stored in a Hadoop cluster. It is open-source software written in C++ and Java. Unlike Apache Hive, Impala is not based on MapReduce: it implements a distributed architecture based on daemon processes, each responsible for all aspects of query execution on its own machine, so its execution efficiency is higher than that of Apache Hive.

    9. HBase (distributed column storage database)

HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic database for structured data, built on top of HDFS.

HBase uses the BigTable data model: an enhanced sparse, sorted map (key/value), where the key is composed of a row key, a column key, and a timestamp.

HBase provides random, real-time read and write access to large-scale data. At the same time, data stored in HBase can be processed with MapReduce, neatly combining data storage with parallel computing.
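A minimal sketch of the row key / column family / timestamp model using the HBase Java client (HBase 1.0+ API assumed); the user_profile table and the ZooKeeper quorum address are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder: HBase clients locate the cluster through ZooKeeper.
        conf.set("hbase.zookeeper.quorum", "zk1");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profile"))) {

            // Write one cell: row key "user42", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random real-time read of the same row; the latest timestamp version is returned.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```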

   10. Flume (log collection tool)

Flume is a scalable log collection system for massive data volumes in complex environments. It abstracts the process of generating, transmitting, processing, and finally writing data to a target path as a data flow. Within a specific data flow, the data source supports customizing the data sender in Flume, so data arriving over various protocols can be collected.

At the same time, the Flume data flow can perform simple processing on log data, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets.

Flume's smallest independent unit of operation is the Agent, and each Agent is a JVM process. A single Agent is composed of three components: Source, Channel, and Sink.

Source: collects data from the client and passes it to the Channel.

Channel: a buffer that temporarily stores the data delivered by the Source.

Sink: takes data from the Channel and writes it to the specified destination.

Event: the basic unit of data flowing through Flume, such as a log record or an Avro object.

 11. Kafka (distributed message queue)

Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website. It implements topics, partitions and their queue model, as well as the producer and consumer architecture.

Both producer components and consumer components connect to the Kafka cluster, and Kafka can be regarded as message middleware used for communication between components. Internally, Kafka holds many topics (a highly abstract data structure); each topic is divided into many partitions, and the data in each partition is numbered and stored in queue order. The number of a data block within its queue is called the offset; the larger the offset, the newer the data block, that is, the closer to the current time. A best-practice architecture in production environments is Flume + Kafka + Spark Streaming.
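A minimal producer sketch with the Kafka Java client; the broker address and the page_views topic are hypothetical.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // "broker1:9092" is a placeholder bootstrap address.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key ("user42") land in the same partition,
            // so they keep their relative order and their offsets grow over time.
            producer.send(new ProducerRecord<>("page_views", "user42", "GET /home"));
            producer.send(new ProducerRecord<>("page_views", "user42", "GET /cart"));
        }
    }
}
```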

   12. Oozie (workflow scheduler)

Oozie is an extensible workflow system integrated into the Hadoop stack that coordinates the execution of multiple MapReduce jobs. It can manage complex job pipelines and trigger execution based on external events, which include time schedules and the arrival of data.

An Oozie workflow is a set of actions (for example, Hadoop MapReduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions run.

Oozie uses hPDL (an XML process definition language) to describe this graph.
