The Hadoop ecosystem refers to the collection of open-source software and tools that has grown up around the Hadoop big data platform to support application scenarios such as large-scale data processing, storage, management, analysis, and visualization. Its core technologies can be roughly divided into 9 categories:
- Data collection technical framework: Flume, Logstash, Filebeat; Sqoop and DataX; Canal and Maxwell
- Data storage technology framework: HDFS, HBase, Kudu, Kafka
- Distributed resource management framework: YARN, Kubernetes and Mesos
- Data Computing Technology Framework
- Offline data computing: MapReduce, Tez, Spark
- Real-time data computing: Storm, Flink, Spark Streaming (a Spark component)
- Data analysis technical framework: Hive, Impala, Kylin, Clickhouse, Druid, Doris
- Task scheduling technical framework: Azkaban, Oozie, DolphinScheduler
- The underlying technical framework of big data: Zookeeper
- Data retrieval technology framework: Lucene, Solr and Elasticsearch
- Big data cluster installation management framework: HDP, CDH, CDP
Commonly used components in the Hadoop ecosystem include:
- HDFS: Hadoop Distributed File System (HDFS) is Hadoop's distributed file system for storing large-scale data sets, providing high fault tolerance, high throughput data access.
- MapReduce: Hadoop's distributed computing framework for processing large-scale data sets, enabling tasks such as distributed computing, data cleaning, and batch processing.
- YARN: Yet Another Resource Negotiator (YARN) is Hadoop's resource management framework. It decouples resource management from data processing and provides unified management and scheduling of cluster resources.
- Hive: A data warehouse tool in the Hadoop ecosystem. It resembles a traditional SQL database and supports data storage, query, and analysis through HiveQL.
- HBase: A distributed database in the Hadoop ecosystem, built on HDFS, that supports fast queries and random reads and writes.
- Pig: Pig is a Hadoop-based data flow language for performing complex data processing tasks on the Hadoop platform.
- Sqoop: Hadoop's data import and export tool for data transfer between the Hadoop platform and relational databases.
- Flume: Hadoop's data collection and aggregation tool for collecting data from multiple sources into Hadoop for processing.
- Spark: Although Spark is not part of Hadoop, it is tightly integrated with the Hadoop ecosystem. Spark provides faster data processing and analysis capabilities, with functions such as batch processing, stream processing, machine learning, and graph computing.
- Zeppelin: A data analysis and visualization tool that brings together data storage, query, analysis, and display in an interactive notebook.
Let's briefly compare the technology stacks within each of these 9 categories.
1. Data collection technical framework
Data must be collected before it can be analyzed, so data collection is the foundation of big data. The commonly used data collection frameworks are as follows:
1.1 Flume, Logstash, and Filebeat are commonly used for real-time monitoring and collection of log data
Comparison item | Flume | Logstash | Filebeat
---|---|---|---
Source | Apache | Elastic | Elastic
Development language | Java | JRuby | Go
Memory consumption | High | High | Low
CPU consumption | High | High | Low
Fault tolerance | High (internal transaction mechanism) | High (internal persistent queue) | None
Load balancing | Supported | Supported | Supported
Plug-ins | Rich input and output plug-ins | Rich input and output plug-ins | Only supports file data collection
Data filtering | Interceptors provided | Strong filtering capability | Weak filtering capability
Secondary development | Easy for Java programmers | Difficult | Difficult
1.2 Sqoop and DataX are commonly used for offline data collection from relational databases
Comparison item | Sqoop | DataX
---|---|---
Source | Apache | Alibaba
Development language | Java | Java
Operating mode | MapReduce | Single process, multiple threads
Distributed | Supported | Not supported
Efficiency | High | Medium
Data source types | Only relational databases and Hadoop-related storage systems | More than 20 types supported
Scalability | Average | High
1.3 Canal and Maxwell are often used for real-time data collection from relational databases
Comparison item | Canal | Maxwell
---|---|---
Source | Alibaba | Zendesk
Development language | Java | Java
Data format | Free-form | JSON
HA | Supported | Not supported
Bootstrap (initial full sync) | Not supported | Supported
Partitioning | Supported | Supported
Random read | Supported | Supported
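Maxwell publishes each binlog change as a JSON document, which is what makes it easy for downstream consumers to handle. The sketch below shows roughly what a row-insert event looks like and how a consumer might route on it (the field layout follows Maxwell's documented JSON output, but the values here are made up):

```python
import json

# A hypothetical Maxwell-style change event for a row insert
# (field layout follows Maxwell's documented JSON output; values are made up)
event_json = '''{
  "database": "shop",
  "table": "orders",
  "type": "insert",
  "ts": 1700000000,
  "data": {"id": 42, "amount": 19.9}
}'''

event = json.loads(event_json)

# A downstream consumer can route on database/table and act on the change type
if event["type"] == "insert":
    row = event["data"]
    print(f'new row in {event["database"]}.{event["table"]}: {row}')
```

Because every change arrives as self-describing JSON, the same consumer code can handle events from any table without schema coordination, which is the advantage the "Data format" row above alludes to.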
2. Data storage technical framework
Data storage technology framework includes HDFS, HBase, Kudu, Kafka, etc.
- HDFS: Solves the problem of massive data storage, but does not support data modification.
- HBase: A distributed NoSQL database built on HDFS that combines HDFS's massive storage capacity with support for modification operations.
- Kudu: A component positioned between HDFS and HBase that supports both data modification and SQL-based data analysis. Its positioning is somewhat awkward; as a compromise solution, its practical adoption is limited.
- Kafka: Commonly used as a temporary buffer for massive data, providing high-throughput reads and writes.
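Kafka's buffering role rests on two ideas: writes are sequential appends to a log, and each consumer group tracks its own read offset. The toy class below illustrates this (it is an in-memory stand-in with hypothetical names, not the Kafka client API):

```python
class MiniTopic:
    """A toy stand-in for a Kafka topic: an append-only log with
    per-consumer-group offsets. Illustrative only, not the Kafka API."""

    def __init__(self):
        self.log = []      # append-only record log
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, record):
        self.log.append(record)  # writes are sequential appends (high throughput)

    def consume(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)  # advance this group's offset
        return batch

topic = MiniTopic()
for i in range(5):
    topic.produce({"event": i})

# Two independent consumer groups each read the full stream at their own pace
print(topic.consume("analytics", 3))  # first 3 records
print(topic.consume("analytics", 3))  # the remaining 2
print(topic.consume("alerts", 10))    # all 5, since this group has its own offset
```

Because records stay in the log after being read, slow consumers never cause data loss; this decoupling of producers from consumers is why Kafka works well as a buffer between collection and computation.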
3. Distributed resource management framework
Enterprise server resources (memory, CPU, etc.) are limited and fixed, yet the workloads running on those servers are flexible and changeable. With the advent of the big data era, the demand for ad hoc tasks has grown greatly, and these tasks often require large amounts of server resources. If allocating and reclaiming server resources depended entirely on manual work by operations staff, it would be far too time-consuming and labor-intensive. A distributed resource management system is therefore required; the common ones are YARN, Kubernetes, and Mesos.
- YARN is mainly applied in the big data field
- Kubernetes is mainly applied in the cloud computing field
- Mesos is mainly applied in the cloud computing field
4. Data Computing Technology Framework
Data computing is divided into offline data computing and real-time data computing.
4.1 Offline data computing
- MapReduce, the first-generation offline computing engine in the big data industry, is mainly used for distributed parallel computing over large-scale data sets. It abstracts the computation logic into two phases, map and reduce.
- The Tez computing engine is rarely used in the big data ecosystem.
- Spark's biggest feature is in-memory computing: the intermediate results of task execution stages are kept in memory rather than read from and written to disk, which improves computing performance. Spark also provides many higher-order operators that support iterative computation with complex logic, making it suitable for fast, complex computation over massive data.
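The two-phase map/reduce abstraction can be sketched in a few lines of plain Python. This is an in-memory toy with hypothetical helper names; in real MapReduce the framework runs these phases, plus the shuffle between them, distributed across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs from each input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data", "big compute"]
print(reduce_phase(shuffle(map_phase(lines))))  # {'big': 2, 'data': 1, 'compute': 1}
```

The point of the abstraction is that map and reduce are independent per key, so each phase can be parallelized across machines without the programmer writing any distribution logic.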
4.2 Real-time data computing
- Storm implements real-time distributed computing and is suitable for small, independent real-time projects.
- Flink is a new-generation real-time distributed computing engine with better performance and a better ecosystem than Storm, offering high throughput and low latency.
- Spark's Spark Streaming component can also provide second-level (micro-batch) real-time distributed computing.
Comparison item | Storm | Spark Streaming | Flink
---|---|---|---
Computational model | Native | Micro-Batch | Native
API type | Compositional | Declarative | Declarative
Semantic guarantee | At-Least-Once | Exactly-Once | Exactly-Once
Fault tolerance mechanism | ACK | Checkpoint | Checkpoint
State management | None | Yes | Yes
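The Native vs. Micro-Batch distinction in the table can be sketched as follows. This is a simplified illustration with hypothetical function names; real engines additionally handle event time, state, and fault tolerance:

```python
def native_process(stream, handler):
    """Native model (Storm/Flink): each record is handled as it arrives."""
    return [handler(record) for record in stream]

def micro_batch_process(stream, handler, batch_size):
    """Micro-batch model (Spark Streaming): records are grouped into
    small batches first, then each batch is processed as a unit."""
    results = []
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            results.append([handler(r) for r in batch])
            batch = []
    if batch:  # flush the final partial batch
        results.append([handler(r) for r in batch])
    return results

events = [1, 2, 3, 4, 5]
double = lambda x: x * 2
print(native_process(events, double))          # [2, 4, 6, 8, 10]
print(micro_batch_process(events, double, 2))  # [[2, 4], [6, 8], [10]]
```

Waiting to fill a batch is why micro-batch systems have second-level latency, while per-record processing gives the native model its millisecond-level latency.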
5. Data analysis technical framework
The data analysis technical frameworks include Hive, Impala, Kylin, ClickHouse, Druid, Doris, etc.
Hive, Impala, and Kylin are offline OLAP data analysis engines:
- Hive's execution efficiency is average, but its stability is extremely high.
- Impala provides high execution efficiency based on in-memory computation, but its stability is average.
- Kylin can provide millisecond-level responses on PB-scale data through precomputation.
Comparison item | Hive | Impala | Kylin
---|---|---|---
Computing engine | MapReduce | Self-developed engine | MapReduce/Spark
Computing performance | Medium | High | High
Stability | High | Low | High
Data scale | TB level | TB level | TB/PB level
SQL support | HQL | Compatible with HQL | Standard SQL
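Kylin's millisecond responses come from precomputation: aggregates for every combination of dimensions (its "cuboids") are built ahead of time, so answering a query becomes a lookup instead of a scan. A minimal sketch of the idea, with hypothetical data (real Kylin builds the cube on MapReduce/Spark and stores it in HBase):

```python
from itertools import combinations
from collections import defaultdict

rows = [
    {"region": "east", "product": "A", "sales": 10},
    {"region": "east", "product": "B", "sales": 20},
    {"region": "west", "product": "A", "sales": 5},
]
dimensions = ["region", "product"]

# Build phase: precompute SUM(sales) for every dimension combination (cuboid)
cube = defaultdict(int)
for r in rows:
    for n in range(len(dimensions) + 1):
        for dims in combinations(dimensions, n):
            key = (dims, tuple(r[d] for d in dims))
            cube[key] += r["sales"]

# Query phase: an aggregation query is now a dictionary lookup, not a scan
print(cube[(("region",), ("east",))])                # 30
print(cube[(("region", "product"), ("west", "A"))])  # 5
print(cube[((), ())])                                # 35 (grand total)
```

The trade-off is visible even in this toy: the build phase does all the scanning work up front and the cube's size grows with the number of dimension combinations, which is why Kylin trades storage and build time for query speed.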
ClickHouse, Druid, and Doris are real-time OLAP data analysis engines:
- Druid supports high concurrency, its SQL support is limited, and it is currently relatively mature.
- ClickHouse has limited concurrency and supports a non-standard SQL dialect; it is currently quite mature.
- Doris supports high concurrency and standard SQL, and is still developing rapidly.
Comparison item | Druid | ClickHouse | Doris
---|---|---|---
Query performance | High | High | High
Concurrency | High | Low | High
Real-time data insertion | Supported | Supported | Supported
Real-time data update | Not supported | Weak | Medium
Join operations | Limited | Limited | Supported
SQL support | Limited | Non-standard SQL | Good
Maturity | High | High | Medium
O&M complexity | Medium | High | Low
6. Task scheduling technical framework
The task scheduling technical frameworks include Azkaban, Oozie, DolphinScheduler, etc. They are mainly used for executing ordinary scheduled tasks and for scheduling multi-level tasks with complex dependencies. They support distributed deployment to ensure the performance and stability of the scheduling system.
Comparison item | Azkaban | Oozie | DolphinScheduler
---|---|---|---
Task type | Shell scripts and big data tasks | Shell scripts and big data tasks | Shell scripts and big data tasks
Task configuration | Custom DSL syntax | XML file configuration | Drag-and-drop page configuration
Task pause | Not supported | Supported | Supported
High availability (HA) | Via DB | Via DB | Supported (multi-master, multi-worker)
Multi-tenancy | Not supported | Not supported | Supported
Email alerts | Supported | Supported | Supported
Access control | Coarse-grained | Coarse-grained | Fine-grained
Maturity | High | High | Medium
Ease of use | High | Medium | High
Affiliated company | LinkedIn | Cloudera | Analysys
7. The underlying technical framework of big data
The underlying technical framework of big data refers to ZooKeeper.
Components across the big data ecosystem, such as Hadoop, HBase, and Kafka, rely on ZooKeeper during operation. It mainly provides basic functions such as naming services and configuration management.
8. Data retrieval technical framework
The mainstream technology for data retrieval is Elasticsearch; other technologies include Lucene and Solr.
Comparison item | Lucene | Solr | Elasticsearch
---|---|---|---
Ease of use | Low | High | High
Scalability | Low | Medium | High
Stability | Medium | High | High
Cluster O&M difficulty | Clustering not supported | High | Low
Project integration | High | Low | Low
Community activity | Medium | Medium | High
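Lucene, Solr, and Elasticsearch are all built on the same core data structure: the inverted index, a mapping from each term to the documents that contain it. A minimal sketch in Python (illustrative only; real engines add analyzers, scoring, and distributed shards):

```python
from collections import defaultdict

docs = {
    1: "hadoop stores big data",
    2: "spark computes big data fast",
    3: "elasticsearch searches text",
}

# Index phase: map each term to the set of document ids that contain it
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def search(query):
    """AND-search: return ids of documents containing every query term."""
    term_sets = [inverted_index[t] for t in query.lower().split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []

print(search("big data"))    # [1, 2]
print(search("spark data"))  # [2]
```

Because each lookup touches only the terms in the query rather than scanning every document, retrieval stays fast as the corpus grows; this is what makes full-text search engines fundamentally different from row-scanning databases.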
9. Big data cluster installation management framework
If an enterprise wants to move from traditional data processing to big data processing, the first step is to build a stable and reliable big data platform. A complete big data platform covers data collection, data storage, data computing, data analysis, cluster monitoring, and so on, and these components need to be deployed on hundreds or even thousands of machines. Installing them entirely by hand would impose an enormous workload on operations staff, and version compatibility problems would arise between the various technology stacks.
To address these problems, big data cluster installation management tools were born. Common ones today include CDH, HDP, and CDP. They package the big data components and provide an integrated platform on which those components can be installed quickly.
- HDP (Hortonworks Data Platform) packages Hadoop, provides GUI-based installation and management through the Ambari tool, and integrates the common big data components to offer one-stop cluster management. It is completely open source and free, with no commercial services, but it stopped being updated after version 3.x.
- CDH (Cloudera's Distribution Including Apache Hadoop) provides GUI-based installation and management through the Cloudera Manager tool and integrates most big data components to offer one-stop cluster management. It is a commercial, fee-based big data platform, and updates stopped after version 6.x.
- CDP (Cloudera Data Platform) comes from the same company as CDH, and its version numbering continues from CDH's. Starting from 7.0, CDP supports private cloud and hybrid cloud deployments. CDP integrates the better components of HDP and CDH and adds some new ones.
This article is adapted, with some modifications and additions, from: One article to understand the complete knowledge system of the big data ecosystem