A comprehensive introduction to the Hadoop ecosystem

As a distributed computing framework for big data, Hadoop has grown a very complete ecosystem. This article introduces the main frameworks and components built around the Hadoop ecosystem.
Hadoop Ecosystem

Flume

  • Summary:
    Flume is a distributed, highly available service for efficiently collecting, aggregating, and moving large volumes of log data.
  • Role:
    The main role of Flume is to collect event or log data from various data sources and then sink it into a destination such as HDFS or a database.
    The role of Flume
  • Architecture
    Flume architecture
    The principle behind Flume's architecture is simple: data collection is handled by Agents. An Agent consists of three components: Source, Channel, and Sink.
    • Source: the source of the collected data. Different data sources correspond to different formats. Flume supports many source types, such as Avro, Thrift, Twitter, Exec, JMS, etc. For all source types, refer to the official Flume documentation: https://flume.apache.org/FlumeUserGuide.html#flume-sources
    • Channel: a buffer that caches data received from the Source for consumption by the downstream Sink; data is deleted only after it has been consumed by a Sink or passed to the next channel. To ensure availability, Flume provides several channel types, including Memory, JDBC, File, and Spillable Memory (which spills to disk when the in-memory queue is full), and it also supports custom channels.
    • Sink: consumes the data in the channel and sends it to the destination, such as HDFS, Hive, HBase, etc.

Sqoop

Sqoop is used to transfer data directly and efficiently between Hadoop and various relational databases.
One picture explains it. P.S.: Alibaba's open-source DataX achieves the same function.
sqoop flow chart
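
As a rough illustration of how Sqoop is typically driven, here is a minimal Python sketch that shells out to the sqoop CLI; the JDBC URL, credentials, table name, and HDFS directory are placeholders, not values from this article.

```python
import subprocess

# A minimal sketch of a typical Sqoop import, launched from Python.
# The JDBC URL, credentials, table name and HDFS target directory below
# are placeholders; adjust them for your own environment.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/shop",   # source relational database
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pwd",       # avoid plain-text passwords
    "--table", "orders",                             # table to import
    "--target-dir", "/warehouse/raw/orders",         # HDFS destination
    "--num-mappers", "4",                            # parallel map tasks
]

result = subprocess.run(sqoop_import, capture_output=True, text=True)
print(result.returncode, result.stderr[-500:] if result.stderr else "")
```

Running `sqoop export` instead moves data the other way, from HDFS back into the relational database.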

HDFS

The Hadoop Distributed File System (HDFS) is the core storage system underlying Hadoop's upper-layer components; HBase, Hive, and others are all built on top of HDFS storage.
hdfs architecture
HDFS also follows the Master/Slave architecture pattern. An HDFS cluster consists of a single NameNode and multiple DataNodes; the NameNode is the master and the DataNodes are the slaves.

  • NameNode: responsible for managing metadata and recording the block information for each file.
  • DataNode: usually one per node in the cluster; stores the actual data blocks.
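
As a small, hedged example of talking to this architecture from client code (not part of the original article), the sketch below uses the third-party Python `hdfs` package, which goes through the NameNode's WebHDFS interface; the host name, port, and paths are placeholders.

```python
from hdfs import InsecureClient  # third-party WebHDFS client: pip install hdfs

# Connect to the NameNode's WebHDFS endpoint (9870 is the Hadoop 3.x default port).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file; the NameNode records the metadata, DataNodes store the blocks.
client.write("/tmp/demo/hello.txt", data=b"hello hdfs\n", overwrite=True)

# Read it back.
with client.read("/tmp/demo/hello.txt") as reader:
    print(reader.read())

# List a directory, similar to `hdfs dfs -ls /tmp/demo`.
print(client.list("/tmp/demo"))
```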

YARN

YARN is Hadoop's resource manager, responsible for resource management and task scheduling.

The basic idea of YARN is to split the two main functions of the JobTracker (resource management and job scheduling/monitoring). It does this by creating a global ResourceManager (RM) and per-application ApplicationMasters (AM). An application here is either a traditional MapReduce job or a DAG (directed acyclic graph) of jobs.
yarn architecture

  • ResourceManager: Responsible for resource scheduling of all applications in the system
  • NodeManager: the per-machine framework agent, responsible for managing containers, monitoring their resource usage (CPU, memory, disk, network), and reporting it to the ResourceManager/Scheduler.
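
To make the ResourceManager's role more concrete, here is a small hedged sketch that queries its REST API for cluster metrics and running applications; the host name is a placeholder and 8088 is only the usual default port.

```python
import requests  # pip install requests

# ResourceManager REST API (8088 is the usual default port; adjust for your cluster).
RM = "http://resourcemanager-host:8088"

# Cluster-wide resource metrics reported by the ResourceManager.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("containers allocated:", metrics["containersAllocated"])
print("available MB:", metrics["availableMB"])

# Applications currently running (each one has its own ApplicationMaster).
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["queue"], app["state"])
```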

Spark

Spark is a large-scale computing engine that plays an important role in big data. It supports multiple languages: Scala, Java, R, Python, and SQL.
The main components are SparkCore, SparkSQL, Spark MLlib, Spark Streaming, and GraphX; a short PySpark sketch follows the list below.

  • SparkCore: implements the basic functions of Spark, including modules such as RDD, task scheduling, memory management, error recovery, and interaction with storage systems
  • SparkSQL: queries structured data through SQL and converts it into DataFrames. SparkSQL supports a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.
  • Spark MLlib: Spark's machine learning library; its high-quality algorithms can be up to 100 times faster than MapReduce. MLlib provides a rich set of algorithms and statistical methods:
    • Classification: Logistic Regression, Naive Bayes, …
    • Regression: generalized linear regression, survival regression, …
    • Decision Trees, Random Forests, and Gradient Boosted Trees
    • Recommendation: Alternating Least Squares (ALS)
    • Clustering: K-means, Gaussian Mixture (GMM), …
    • Topic Modeling: Latent Dirichlet Allocation (LDA)
    • Frequent Itemsets, Association Rules and Sequential Pattern Mining
    • Statistics: linear algebra, hypothesis testing
  • Spark Streaming: Spark's stream processing framework; it is actually time-based micro-batch processing. Apache Flink is often used instead.
  • GraphX: Apache Spark's API for graph computing
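
Below is a minimal PySpark sketch (an illustration, not from the article) touching SparkCore and SparkSQL: an RDD word count followed by a DataFrame queried with SQL; the data and column names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ecosystem-demo").getOrCreate()

# SparkCore: the classic RDD word count.
lines = spark.sparkContext.parallelize(["hadoop spark", "spark flink", "hadoop hive"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

# SparkSQL: build a DataFrame and query it with SQL.
df = spark.createDataFrame([("alice", 34), ("bob", 28)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```

Submitting the same script with spark-submit runs it on a YARN or standalone cluster instead of locally.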

Kafka

Kafka is a distributed message queue and currently the reigning king among message queues.
It offers high throughput, high scalability, high availability, and also supports data persistence.
kafka architecture
The main flow: producers send messages to a topic, the topic is stored across different partitions, and a consumer group consumes the messages in the topic. A consumer group is composed of multiple consumers, and a message can only be consumed by one consumer within the group, which avoids duplicate consumption.
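
As a hedged illustration of that flow, the sketch below uses the third-party kafka-python client; the broker address, topic name, and group id are placeholders.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKERS = "broker-host:9092"   # placeholder bootstrap server
TOPIC = "page-views"           # placeholder topic

# Producer: send messages to the topic; Kafka spreads them across partitions.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send(TOPIC, key=b"user-1", value=b"clicked /home")
producer.flush()

# Consumer: consumers in the same group share the topic's partitions,
# so each message is handled by only one consumer in the group.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="analytics",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.key, record.value)
    break  # just read one message for the demo
```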

Mahout

Mahout is Apache's machine learning library; its goal is to provide an environment for quickly building scalable, high-performance machine learning applications.
It mainly provides out-of-the-box algorithm libraries for clustering, classification, and collaborative filtering, and also makes it easy to implement your own algorithms.

Lucene / Solr / ElasticSearch

Apache Lucene is a high-performance, full-featured full-text search engine architecture written entirely in Java, providing a complete query engine, index engine, and partial text analysis engine. The purpose is to provide a simple and easy-to-use toolkit for software developers to easily realize the full-text search function in the target system, or to build a complete full-text search engine based on this.
Both Solr and ElasticSearch are built on Lucene and make Lucene more convenient to use. The inverted index they rely on is more efficient for fast retrieval than the traditional forward index.
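
A minimal, hedged sketch of full-text search with the official Elasticsearch Python client is shown below; the index name and documents are invented, and the keyword-style `document=`/`query=` arguments assume the 8.x client (older versions use `body=`).

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Connect to a node (the address is a placeholder).
es = Elasticsearch("http://localhost:9200")

# Index a document: Lucene builds an inverted index from its terms.
es.index(index="articles", id="1", document={
    "title": "A comprehensive introduction to the Hadoop ecosystem",
    "body": "Flume, Sqoop, HDFS, YARN, Spark, Kafka ...",
})

# Full-text search over the inverted index.
resp = es.search(index="articles", query={"match": {"body": "spark"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```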

Oozie

Oozie is a workflow scheduling system for managing Apache Hadoop jobs.
Oozie Workflow jobs are directed acyclic graphs (DAGs) of actions.
Oozie coordinator jobs are recurring Oozie workflow jobs triggered by time (frequency) and data availability.
Oozie integrates with the rest of the Hadoop stack, supporting out-of-the-box a variety of Hadoop jobs (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
Azkaban is another excellent scheduling framework.

Zookeeper

ZooKeeper is a software project of the Apache Software Foundation that provides open source distributed configuration services, synchronization services, and naming registries for large-scale distributed computing.

ZooKeeper's architecture achieves high availability through redundant services.

Zookeeper's design goal is to encapsulate those complex and error-prone distributed consistency services to form an efficient and reliable primitive set, and provide users with a series of simple and easy-to-use interfaces.

It is a typical solution for distributed data consistency, on top of which distributed applications can implement functions such as data publish/subscribe, load balancing, naming services, distributed coordination/notification, cluster management, master election, distributed locks, and distributed queues.
zookeeper service
ZooKeeper services communicate with each other to ensure high availability and consistency of services.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server is lost, the client will connect to another server.

zookeeper data structure
The namespace provided by Zookeeper is very similar to a standard file system and stores data as key-value pairs. A name (key) is a series of path elements separated by slashes (/), and every node in the Zookeeper namespace is identified by such a path.
zookeeper namespace
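
To illustrate the path-based namespace, here is a small sketch using the third-party kazoo client (an assumption, not something the article prescribes); the ensemble address and znode paths are placeholders.

```python
from kazoo.client import KazooClient  # pip install kazoo

# Connect to the ensemble (address is a placeholder; 2181 is the default port).
zk = KazooClient(hosts="zk-host:2181")
zk.start()

# Znodes are addressed by slash-separated paths, like files in a file system.
zk.ensure_path("/app/config")
if not zk.exists("/app/config/db_url"):
    zk.create("/app/config/db_url", b"jdbc:mysql://db-host/shop")

data, stat = zk.get("/app/config/db_url")
print(data, stat.version)

# A watch: the callback fires once immediately and again whenever the znode changes.
@zk.DataWatch("/app/config/db_url")
def on_change(data, stat):
    print("config is now:", data)

zk.set("/app/config/db_url", b"jdbc:mysql://db-replica/shop")
zk.stop()
```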

Ambari

Apache Ambari is a tool for configuring, managing, and monitoring Apache Hadoop clusters. It consists of a set of RESTful APIs and a web interface.

MapReduce

MapReduce is a distributed computing model consisting of two phases, Map and Reduce. Map() is responsible for splitting a large block of data and computing over the splits; Reduce() is responsible for aggregating the data produced by Map().
The core idea of execution: key-value pairs with the same key are grouped together, the Reduce method is called once per group, and the method iterates over that group of data to compute the result.
MapReduce plays an important role in the field of big data computing, and can use many cheap machines to achieve amazing computing power.
The figure below shows the approximate flow of a full MapReduce computation. For a detailed introduction, please refer to the official documentation; I will write a detailed introduction later.

mapreduce execution process
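
As a hedged sketch of the Map and Reduce phases, here is the classic word count written in the Hadoop Streaming style, where the mapper and reducer read stdin and write tab-separated key-value pairs to stdout; the script name and how you wire it into the streaming jar's `-mapper`/`-reducer` options are assumptions.

```python
import sys
from itertools import groupby

def mapper(stream):
    # Map phase: emit (word, 1) for every word in every input line.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")

def reducer(stream):
    # Reduce phase: Hadoop sorts map output by key, so all counts for one word
    # arrive adjacent; sum them per group (one reduce call per key group).
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    # Run as: wordcount.py map   (or)   wordcount.py reduce
    # For a local test, pipe the mapper output through `sort` before the reducer.
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```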

Hive

Built on Hadoop, Hive can map structured HDFS files into tables and provides HQL, a query language with SQL-like syntax. It is no exaggeration to say that it is precisely the birth of Hive that allowed Hadoop to be widely adopted and to remain in use for so long.

Core essence: convert HQL statements into MapReduce tasks
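
As a small illustration of that idea (an assumption about the client setup, not part of the article), the sketch below submits an HQL query to HiveServer2 with the third-party PyHive library; the host, database, and table are placeholders, and Hive compiles the query into MapReduce jobs behind the scenes.

```python
from pyhive import hive  # pip install pyhive

# Connect to HiveServer2 (10000 is its default port; host/db are placeholders).
conn = hive.Connection(host="hiveserver2-host", port=10000,
                       username="hadoop", database="warehouse")
cursor = conn.cursor()

# This HQL is compiled by Hive into one or more MapReduce jobs under the hood.
cursor.execute("""
    SELECT category, COUNT(*) AS cnt
    FROM orders
    GROUP BY category
    ORDER BY cnt DESC
    LIMIT 10
""")
for category, cnt in cursor.fetchall():
    print(category, cnt)

cursor.close()
conn.close()
```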

The main advantages and disadvantages of Hive
Advantages:
  • It saves developers from implementing the Map and Reduce interfaces themselves, which greatly reduces the learning cost.
  • HQL syntax is similar to SQL syntax, making it simple and easy to use.
Disadvantages:
  • Execution efficiency is relatively low; the MapReduce jobs generated by Hive are not intelligent enough and can easily cause data skew.
Hive architecture

  • Hive:
    Meta Store: metadata, generally stored in MySQL
    Client: the client
    Driver: the driver
    HQL Parser: parses HQL and performs grammatical analysis
    Physical Plan: compiles and generates the logical execution plan
    Query Optimizer: optimizes the logical execution plan
    Execution: converts the logical execution plan into a physical execution plan and runs it
  • Hadoop
    MapReduce: performs the computation
    HDFS: file storage

Pig

Similar to Hive, Pig is also a computation layer that wraps Hadoop, and its core purpose is to simplify MapReduce programming. The difference is that the syntax Pig provides is closer to a shell-like scripting style, so the two target different users: Hive is aimed more at developers, while Pig is aimed more at operations and maintenance personnel.
pig architecture

HBase

HBase is a distributed, scalable, column-oriented, real-time open source database that supports very large amounts of data.
HBase uses column-oriented storage, and its RowKey design lets it serve fast point and range queries, but it does not support complex SQL queries.
hbase architecture
The figure above is the basic architecture diagram of HBase.
HBase data is stored on HDFS.
Client: provides an interface for accessing HBase and maintains a cache to speed up access.
Zookeeper: stores HBase's metadata (the meta table). For both reads and writes, the client first asks Zookeeper for the meta metadata, which tells it which machine to read from or write to.
HRegionServer: handles the client's read and write requests and interacts with the underlying HDFS; it is the real working node.
The general flow is: the client sends a request to Zookeeper, Zookeeper returns the address of an HRegionServer, the client uses that address to contact the HRegionServer, and the HRegionServer reads or writes the data and returns the result to the client.
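
To make the RowKey-driven access pattern concrete, here is a hedged sketch using the third-party happybase client, which talks to HBase through its Thrift gateway; the host, table name, and column family are placeholders.

```python
import happybase  # pip install happybase; requires the HBase Thrift server

# Connect via the HBase Thrift gateway (9090 is its default port).
connection = happybase.Connection("thrift-host", port=9090)
table = connection.table("user_profiles")   # placeholder table with family 'cf'

# Point write and point read, addressed by RowKey.
table.put(b"user#1001", {b"cf:name": b"alice", b"cf:city": b"Beijing"})
print(table.row(b"user#1001"))

# Range scan: rows are sorted by RowKey, so prefix scans are cheap.
for key, data in table.scan(row_prefix=b"user#"):
    print(key, data)

connection.close()
```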

Since big data technology evolves so quickly, this article will be updated from time to time. If you are interested, feel free to follow it.


Origin: blog.csdn.net/tuposky/article/details/125011437