Introduction to Big Data Hadoop Ecosystem Technologies

The Hadoop ecosystem refers to the family of open-source software and tools that has formed around the Hadoop big data platform to support application scenarios such as large-scale data processing, storage, management, analysis, and visualization. Its core technologies can be roughly divided into nine categories: 

  1. Data collection technical framework: Flume, Logstash, Filebeat; Sqoop and DataX; Canal and Maxwell
  2. Data storage technology framework:  HDFS, HBase, Kudu, Kafka
  3. Distributed resource management framework:  YARN, Kubernetes and Mesos
  4. Data Computing Technology Framework
    1. Offline data computing: MapReduce, Tez, Spark
    2. Real-time data computing: Storm, Flink, and Spark Streaming (a Spark component)
  5. Data analysis technical framework:  Hive, Impala, Kylin, Clickhouse, Druid, Doris
  6. Task scheduling technical framework: Azkaban, Oozie, DolphinScheduler
  7. The underlying technical framework of big data: Zookeeper
  8. Data retrieval technology framework: Lucene, Solr and Elasticsearch
  9. Big data cluster installation management framework: HDP, CDH, CDP

 Commonly used components of the Hadoop ecosystem currently include:

  1. HDFS: The Hadoop Distributed File System stores large-scale data sets and provides highly fault-tolerant, high-throughput data access (a minimal Java write sketch follows this list).
  2. MapReduce: Hadoop's distributed computing framework for processing large-scale data sets; it supports tasks such as distributed computing, data cleaning, and batch processing.
  3. YARN: Yet Another Resource Negotiator is Hadoop's resource management framework; it decouples computing from storage and provides unified management and scheduling of cluster resources.
  4. Hive: A data warehouse tool on Hadoop, similar to a traditional SQL database, that supports data storage, query, and analysis through HiveQL.
  5. HBase: A distributed database on Hadoop, built on HDFS, that supports fast queries and random reads and writes.
  6. Pig: A Hadoop-based data-flow language for expressing complex data processing tasks on the Hadoop platform.
  7. Sqoop: Hadoop's data import and export tool for transferring data between the Hadoop platform and relational databases.
  8. Flume: Hadoop's data collection and aggregation tool for gathering data from multiple sources into Hadoop for processing.
  9. Spark: Although Spark is not part of Hadoop itself, it is tightly integrated with the Hadoop ecosystem. Spark provides faster data processing and analysis, covering batch processing, stream processing, machine learning, and graph computing.
  10. Zeppelin: A data analysis and visualization tool that brings data storage, query, analysis, and display together in an interactive notebook.
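
To make the HDFS item above concrete, here is a minimal sketch of writing a file through Hadoop's Java FileSystem API. The NameNode address hdfs://localhost:9000 and the path /demo/events.log are placeholders (in a real cluster the address would come from core-site.xml); the point is only that data lands as a write-once file with high-throughput access, not as records updated in place.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from the cluster's core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf);
             // Placeholder path; HDFS files are written once, not modified in place
             FSDataOutputStream out = fs.create(new Path("/demo/events.log"))) {
            out.write("one immutable record\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```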

Let's briefly compare the technology stacks within each of these nine categories.

1. Data collection technical framework

Data must be collected before it can be analyzed, so data collection is the foundation of big data.

The commonly used data collection frameworks are as follows: 

1.1  Flume, Logstash, and Filebeat are commonly used for real-time monitoring and collection of log data

Comparison of Flume, Logstash, and Filebeat
Comparison item | Flume | Logstash | Filebeat
Source | Apache | Elastic | Elastic
Development language | Java | JRuby | Go
Memory consumption | High | High | Low
CPU consumption | High | High | Low
Fault tolerance | High (internal transaction mechanism) | High (internal persistent queue) | None
Load balancing | Supported | Supported | Supported
Plug-ins | Rich input and output plug-ins | Rich input and output plug-ins | Only supports file data collection
Data filtering | Provides interceptors | Strong filtering capability | Weak filtering capability
Secondary development | Easy for Java programmers | Difficult | Difficult

1.2  Sqoop and DataX are commonly used for offline data collection from relational databases

Comparison of Sqoop and DataX
Comparison item | Sqoop | DataX
Source | Apache | Alibaba
Development language | Java | Java
Operating mode | MapReduce | Single process, multi-threaded
Distributed | Supported | Not supported
Efficiency | High | Medium
Data source types | Only relational databases and Hadoop-related storage systems | Supports more than 20 types
Scalability | Average | High

1.3  Canal and Maxwell are often used for real-time data collection from relational databases

Comparison of Canal and Maxwell
Comparison item | Canal | Maxwell
Source | Alibaba | Zendesk
Development language | Java | Java
Data format | Free format | JSON
HA | Supported | Not supported
Bootstrap (initial full load) | Not supported | Supported
Partitioning | Supported | Supported
Random read | Supported | Supported

2. Data storage technical framework

Data storage technology framework includes HDFS, HBase, Kudu, Kafka, etc.

  • HDFS: solves the problem of massive data storage, but does not support data modification
  • HBase: It is a distributed NoSQL database based on HDFS, which can utilize the massive data storage capacity of HDFS and support modification operations.
  • Kudu: A technical component positioned between HDFS and HBase; it supports both data modification and SQL-based data analysis. Its positioning is somewhat awkward because it is a compromise solution, and its practical adoption is limited.
  • Kafka: Commonly used as a temporary buffer for massive amounts of data, providing high-throughput read and write access to other systems (a minimal producer sketch follows this list).
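
To make the Kafka item concrete, here is a minimal sketch of publishing one event with the standard Kafka Java producer client. The broker address localhost:9092 and the topic access-logs are hypothetical; the producer batches records internally, which is where Kafka's high write throughput comes from, and downstream systems consume from the topic as a temporary buffer.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogBufferProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The producer batches records internally, which enables high write throughput
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "access-logs" is a hypothetical topic acting as a buffer for downstream consumers
            producer.send(new ProducerRecord<>("access-logs", "host-01", "GET /index.html 200"));
        }
    }
}
```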

3. Distributed resource management framework

Enterprise server resources (memory, CPU, etc.) are limited and fixed, while the workloads running on those servers are flexible and changeable. With the advent of the big data era, the demand for ad-hoc tasks has grown greatly, and these tasks often require a large amount of server resources. If the allocation and reclamation of server resources depended entirely on manual work by operations staff, it would be far too time-consuming and labor-intensive. A distributed resource management system is therefore required; the common ones are YARN, Kubernetes, and Mesos. 

  • YARN is mainly used in the big data field
  • Kubernetes is mainly used in the cloud computing field
  • Mesos is mainly used in the cloud computing field

4. Data Computing Technology Framework

Data computing is divided into offline (batch) computing and real-time computing.

4.1 Offline Data Calculation

  • MapReduce, the first generation of offline computing engines in the big data industry, is mainly used for distributed parallel computation over large-scale data sets. It abstracts the computation logic into two stages, map and reduce (a classic word-count sketch follows this list).
  • The Tez computing engine is now rarely used in the big data ecosystem.
  • Spark's biggest feature is in-memory computing: intermediate results of task execution are kept in memory rather than written to and read from disk, which improves computing performance. Spark also provides many higher-order operators, supports iterative computation with complex logic, and is well suited to fast, complex computation over massive data sets.
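
The map/reduce two-stage abstraction is easiest to see in the classic word-count job, sketched below against the standard Hadoop MapReduce Java API (essentially the well-known example from the Hadoop documentation). Input and output paths are passed as command-line arguments and are assumed to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce stage: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job is typically packaged into a jar and submitted to the cluster with the `hadoop jar` command.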

4.2 Real-time data calculation

  • Storm implements distributed real-time computing and is suitable for small, independent real-time projects.
  • Flink is a new-generation distributed real-time computing engine, with better performance and a richer ecosystem than Storm, offering high throughput and low latency (a minimal Flink sketch follows the comparison table below).
  • Spark's Spark Streaming component can also provide second-level (micro-batch) real-time computing.
Comparison of Storm, Spark Streaming, and Flink
Comparison item | Storm | Spark Streaming | Flink
Computational model | Native | Micro-batch | Native
API type | Compositional | Declarative | Declarative
Semantic guarantee | At-least-once | Exactly-once | Exactly-once
Fault tolerance mechanism | ACK | Checkpoint | Checkpoint
State management | None | Supported | Supported
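
As a small illustration of the engines compared above, here is a minimal Flink DataStream sketch that keeps a running word count over lines arriving on a socket. The host and port (localhost:9999) are placeholders, and the same logic could equally be written with Spark Streaming micro-batches or a Storm topology; this is a sketch, not a production job (no checkpointing or windowing is configured).

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical source: lines of text arriving on a local socket (e.g. started with `nc -lk 9999`)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                 @Override
                 public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                     for (String word : line.toLowerCase().split("\\s+")) {
                         if (!word.isEmpty()) {
                             out.collect(Tuple2.of(word, 1));   // emit (word, 1) for each word
                         }
                     }
                 }
             })
             .keyBy(t -> t.f0)   // group the stream by word
             .sum(1)             // running count per word, kept in Flink's managed state
             .print();

        env.execute("Streaming WordCount");
    }
}
```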

5. Data analysis technical framework

The data analysis technical frameworks include Hive, Impala, Kylin, ClickHouse, Druid, Doris, etc.

Hive, Impala, and Kylin belong to offline OLAP data analysis engines:

  • Hive's execution efficiency is average, but it is extremely stable (a minimal HiveQL-over-JDBC sketch follows the comparison table below)
  • Impala provides high, memory-based execution efficiency, but its stability is average
  • Kylin can provide millisecond-level responses over PB-scale data through precomputation
Comparison of Hive, Impala, and Kylin
Comparison item | Hive | Impala | Kylin
Computing engine | MapReduce | Self-developed execution engine | MapReduce/Spark
Computing performance | Medium | High | High
Stability | High | Low | High
Data size | TB level | TB level | TB and PB level
SQL support | HQL | Compatible with HQL | Standard SQL
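
To show what HQL support looks like in practice, here is a minimal sketch that runs a HiveQL aggregation through the HiveServer2 JDBC interface. The connection URL, credentials, and the page_views table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and database; requires the hive-jdbc driver on the classpath
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // page_views is a hypothetical table; Hive compiles the HiveQL below into a batch job
             ResultSet rs = stmt.executeQuery("SELECT dt, COUNT(*) AS pv FROM page_views GROUP BY dt")) {
            while (rs.next()) {
                System.out.println(rs.getString("dt") + "\t" + rs.getLong("pv"));
            }
        }
    }
}
```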

Clickhouse, Druid, Doris are real-time OLAP data analysis engines:

  • Druid supports high concurrency, its SQL support is limited, and its maturity is currently relatively high
  • ClickHouse has limited concurrency capability and supports non-standard SQL; its maturity is currently high
  • Doris supports high concurrency and standard SQL, and is still developing rapidly
Comparison of Druid, ClickHouse, and Doris
Comparison item | Druid | ClickHouse | Doris
Query performance | High | High | High
High concurrency | High | Low | High
Real-time data insertion | Supported | Supported | Supported
Real-time data update | Not supported | Weak | Medium
Join operations | Limited | Limited | Supported
SQL support | Limited | Non-standard SQL | Good
Maturity | High | High | Medium
O&M complexity | Medium | High | Low

6. Task scheduling technical framework

The task scheduling technical frameworks include Azkaban, Oozie, DolphinScheduler, etc. They are mainly used for executing ordinary scheduled tasks and for scheduling multi-level workflows with complex dependencies; they support distributed deployment and ensure the performance and stability of the scheduling system.

Comparison of Azkaban, Oozie, and DolphinScheduler
Comparison item | Azkaban | Oozie | DolphinScheduler
Task types | Shell scripts and big data tasks | Shell scripts and big data tasks | Shell scripts and big data tasks
Task configuration | Custom DSL syntax | XML file configuration | Drag-and-drop page configuration
Task pause | Not supported | Supported | Supported
High availability (HA) | Supported (via DB) | Supported (via DB) | Supported (multi-master, multi-worker)
Multi-tenancy | Not supported | Not supported | Supported
Email alerts | Supported | Supported | Supported
Access control | Coarse-grained | Coarse-grained | Fine-grained
Maturity | High | High | Medium
Ease of use | High | Medium | High
Backing company | LinkedIn | Cloudera | Analysys (China)

7. The underlying technical framework of big data

The underlying technical framework of big data here refers mainly to ZooKeeper.

Hadoop, HBase, Kafka, and other components in the big data ecosystem rely on ZooKeeper, which provides basic coordination functions such as naming services and configuration management. A minimal client sketch follows.
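
This sketch shows the configuration-service style of usage, assuming a ZooKeeper ensemble reachable at localhost:2181: create a znode holding a small configuration value and read it back with the plain ZooKeeper Java client. The connection string and znode path are placeholders.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address; 5-second session timeout; empty watcher
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Store a small configuration value under a (hypothetical) znode and read it back
        String path = "/demo-config";
        zk.create(path, "batch.size=500".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] value = zk.getData(path, false, null);
        System.out.println(new String(value));

        zk.close();
    }
}
```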

8. Data retrieval technical framework

The mainstream technology for data retrieval is Elasticsearch; other options include Lucene and Solr (a minimal Lucene indexing and search sketch follows the comparison table below).

Comparison of Lucene, Solr, and Elasticsearch
Comparison item | Lucene | Solr | Elasticsearch
Ease of use | Low | High | High
Scalability | Low | Medium | High
Stability | Medium | High | High
Cluster O&M difficulty | Clusters not supported | High | Low
Project (library) integration | High | Low | Low
Community activity | Medium | Medium | High
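
As mentioned above, here is a minimal Lucene sketch (assuming a reasonably recent Lucene release, roughly 8.x/9.x) that builds an in-memory index with a single document and searches it. It also illustrates why Lucene scores high on project integration in the table: it is embedded as a library in the application process rather than operated as a separate cluster.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneMiniExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();   // in-memory index, just for the demo

        // Index a single document with one full-text field
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "hadoop distributed file system", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Parse a query against the same field and print the matching documents
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("hadoop");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("content"));
            }
        }
    }
}
```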

9. Big data cluster installation management framework

If an enterprise wants to move from traditional data processing to big data processing, the first thing it must do is build a stable and reliable big data platform. A complete big data platform covers data collection, data storage, data computing, data analysis, cluster monitoring, and more, and these components need to be deployed on hundreds or even thousands of machines. If operations staff had to install and track everything manually, the workload would be enormous, and there would also be version compatibility problems between the various technology stacks. 

Based on the above problems, the big data cluster installation management tool was born. Currently common ones include CDH, HDP, and CDP. They encapsulate big data components and provide an integrated big data platform that can quickly install big data components.

  • HDP, short for Hortonworks Data Platform, is packaged on top of Hadoop. It provides GUI-based installation and management through the Ambari tool and integrates the common big data components, offering one-stop cluster management. It is completely open source and free, without commercial support, but it stopped being updated after version 3.x.
  • CDH stands for Cloudera's Distribution Including Apache Hadoop. It provides GUI-based installation and management through the Cloudera Manager tool and integrates most big data components, offering one-stop cluster management. It is a commercial, paid big data platform, and it stopped being updated after version 6.x.
  • CDP, Cloudera Data Platform, comes from the same company as CDH, and its version numbering continues from CDH, starting at 7.0. CDP supports Private Cloud and Hybrid Cloud deployments, integrates the better components of HDP and CDH, and adds some new ones.

This article is reprinted, with some modifications and additions, from: One article to understand the complete knowledge system of the big data ecosystem


Origin: blog.csdn.net/zhoushimiao1990/article/details/131213442