Background knowledge for "Building and Using Big Data Clusters": an introduction to the Hadoop big data ecosystem

Table of contents

1. Introduction to Hadoop

2. The operating mode of Hadoop

1. Standalone mode

2. Pseudo-distributed mode

3. Fully distributed mode

3. Hadoop ecosystem components

1. HDFS

2. MapReduce

3. YARN

4. Hive

5. Pig

6. HBase

7. HCatalog

8. Avro

9. Thrift

10. Drill

11. Mahout

12. Sqoop

13. Flume

14. Ambari

15. Zookeeper

4. Advantages and disadvantages of Hadoop

5. Hadoop learning path


1. Introduction to Hadoop

Hadoop = MapReduce + HDFS (Hadoop Distributed File System)

To put it another way: MapReduce is one subproject and HDFS is another, and together they make up Hadoop. By analogy, if Hadoop is a computer, MapReduce is its CPU and HDFS is its hard disk: MapReduce processes the data, and HDFS stores it.

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. It lets users develop distributed programs without understanding the underlying details of distribution, and make full use of a cluster's power for high-speed computing and storage. Simply put, Hadoop is a software platform that makes it easier to develop and run applications that process large-scale data.

The core components of Hadoop are HDFS and MapReduce. As new processing requirements emerged, more and more components appeared, enriching the Hadoop ecosystem. The current ecosystem roughly breaks down as follows:

Data collection tools:

  • Log collection frameworks: Flume, Logstash, Filebeat
  • Data migration tool: Sqoop

Data storage tools:

  • Distributed file storage system: Hadoop HDFS
  • Database systems: MongoDB, HBase

Data processing tools:

  • Distributed computing frameworks:
      Batch processing: Hadoop MapReduce
      Stream processing: Storm
      Hybrid processing: Spark, Flink
  • Query and analysis frameworks: Hive, Spark SQL, Flink SQL, Pig, Phoenix

Resource and task management:

  • Cluster resource manager: Hadoop YARN
  • Distributed coordination service: Zookeeper
  • Task scheduling frameworks: Azkaban, Oozie
  • Cluster deployment and monitoring: Ambari, Cloudera Manager

The frameworks listed above are the relatively mainstream big data frameworks; their communities are very active and learning resources are plentiful. Start with Hadoop, because it is the cornerstone of the entire big data ecosystem and the other frameworks depend on it, directly or indirectly.

2. The operating mode of Hadoop

Hadoop can be installed and run in three modes.

1. Standalone mode

(1) This is Hadoop's default mode; no configuration files need to be modified during installation.

(2) Hadoop runs on a single machine, without starting HDFS or YARN.

(3) MapReduce runs as a single Java process and uses the local file system for data input and output.

(4) It is used to debug the logic of MapReduce programs and verify their correctness.

2. Pseudo-distributed mode

(1) Hadoop is installed on a single machine, and the relevant configuration files are modified so that one machine simulates a cluster of multiple hosts.

(2) HDFS and YARN must be started; their daemons run as independent Java processes.

(3) Each MapReduce job runs as an independent process, and input and output use the distributed file system.

(4) It is used for learning and development, to test whether Hadoop programs execute correctly.

3. Fully distributed mode

(1) The JDK and Hadoop are installed on multiple machines that form an interconnected cluster, and the relevant configuration files are modified accordingly.

(2) The Hadoop daemons run on a cluster of multiple hosts. This is the mode used in real production environments.

3. Hadoop ecosystem components

1. HDFS

HDFS (Hadoop Distributed File System) is a Java-based distributed file system and the most important part of the Hadoop ecosystem. It is Hadoop's primary storage system, providing scalable, highly fault-tolerant, reliable, and cost-effective data storage for big data. HDFS is designed to be deployed on inexpensive hardware and is the default storage layer in many installations. It provides high-throughput access to application data and is suitable for applications with very large data sets. Users can interact directly with HDFS through shell-like commands.

HDFS has two main components: NameNode and DataNode.

NameNode: The NameNode, also known as the master node, does not store the actual data or datasets. It stores metadata: file permissions, which blocks make up each uploaded file, which DataNodes those blocks are stored on, and other details. The namespace it manages consists of files and directories.

Tasks of the NameNode:

  • Manage the namespace of the file system;
  • Control client access to files;
  • Perform file and directory operations on the namespace, such as open, close, and rename.

DataNode: The DataNode stores the actual data in HDFS and serves read and write requests from file system clients. At startup, each DataNode connects to its NameNode and performs a handshake, which verifies the namespace ID and the DataNode's software version. If a mismatch is found, the DataNode shuts down automatically.

Tasks of DataNodes:

  • DataNodes manage the data stored on them.
  • DataNodes also execute block creation, deletion, and replication instructions from the NameNode.
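
To make the client's view of HDFS concrete, here is a minimal Java sketch using the Hadoop FileSystem API to write a file and read it back. The NameNode address (hdfs://localhost:9000) and the file path are assumptions for a local pseudo-distributed setup, not values from this article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; host and port are assumptions for a local setup.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt");

            // Write a file: the NameNode records the metadata, DataNodes store the blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the same file back through the FileSystem handle.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```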

2. MapReduce

MapReduce is a core component of the Hadoop ecosystem and provides data processing. It is a software framework for easily writing applications that process the large amounts of structured and unstructured data stored in the Hadoop distributed file system. The parallel nature of MapReduce programs makes them well suited to large-scale data analysis across many machines in a cluster, increasing both speed and reliability. Each MapReduce stage takes key-value pairs as input and produces key-value pairs as output. The Map function takes one set of data and transforms it into another set, breaking individual elements down into tuples (key/value pairs). The Reduce function takes the output of the Map as input, groups those tuples by key, and produces an aggregated value for each key.
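
To make the Map and Reduce stages concrete, here is the classic WordCount program as a minimal Java sketch: the Mapper emits (word, 1) pairs and the Reducer sums the counts for each word. The input and output paths come from the command line; this is the standard textbook example rather than anything specific to this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each line into words and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

On a pseudo-distributed or fully distributed cluster, a job like this is typically packaged into a jar and launched with "hadoop jar wordcount.jar WordCount <input path> <output path>".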

Features of MapReduce:

  • Simplicity: MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, or Python.
  • Scalability: MapReduce can process petabyte-scale data.
  • Speed: with parallel processing, problems that take days to solve can be solved in hours or minutes with MapReduce.
  • Fault tolerance: MapReduce handles failures. If one copy of the data is unavailable, another machine holding a copy of the same key-value pairs can be used to complete the same subtask.

3. YARN

YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management, and it is one of the most important components in the ecosystem. YARN is sometimes called Hadoop's operating system because it is responsible for managing and monitoring workloads. It allows multiple data processing engines, such as real-time streaming and batch processing, to work with the data stored on a single platform.

Features of YARN:

  • Flexibility: in addition to MapReduce (batch processing), other specialized data processing modes, such as interactive and streaming workloads, can also run. Because of this, other applications can run alongside MapReduce programs in Hadoop 2.
  • Efficiency: many applications can run on the same cluster, so Hadoop's overall efficiency increases without much impact on quality of service.
  • Sharing: provides a stable, reliable, and secure foundation with operational services shared across multiple workloads.
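
As a small illustration of YARN's role as the cluster resource manager, the hedged Java sketch below uses the YarnClient API to list the NodeManagers that are currently running and the resources they offer. It assumes a reachable ResourceManager whose address is picked up from yarn-site.xml on the classpath.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesDemo {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration loads yarn-site.xml from the classpath; assumes a running ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager which NodeManagers are RUNNING and what resources they offer.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }

        yarnClient.stop();
    }
}
```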

In addition to these core modules, the Hadoop ecosystem also includes the following components:

4. Hive

Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive mainly performs three functions: data summarization, query, and analysis. Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs and executes them on Hadoop.

The main parts of Hive:

  • Metastore: metadata storage.
  • Driver: manages the lifecycle of HiveQL statements.
  • Query Compiler: Compiles HiveQL into a Directed Acyclic Graph (DAG).
  • Hive server: Provides a Thrift interface and JDBC/ODBC server.
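
Since HiveServer2 exposes a JDBC/ODBC server, a client can talk to Hive with plain JDBC. Below is a minimal, hedged Java sketch; the connection URL, the empty credentials, and the page_views table are assumptions for a local installation, not details from this article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (requires the hive-jdbc dependency on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed URL for a local HiveServer2; user and password depend on the installation.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING) STORED AS ORC");

            // A HiveQL query; Hive compiles it into MapReduce (or Tez/Spark) jobs behind the scenes.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```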

5. Pig

Apache Pig is a high-level language platform for analyzing and querying the huge datasets stored in HDFS. Pig, an integral part of the Hadoop ecosystem, uses the Pig Latin language, which is similar to SQL. Its tasks include loading data, applying the required filters, and dumping the data in the required format. Pig requires a Java runtime environment for program execution.

Features of Apache Pig :

  • Extensibility: For special processing, users can create their own functions.
  • Optimization opportunities: Pig allows the system to perform optimizations automatically, which allows users to focus on semantics rather than efficiency.
  • Handles all kinds of data: Pig can analyze both structured and unstructured data.

6. HBase

Apache HBase, an integral part of the Hadoop ecosystem, is a distributed database designed to store structured data in tables with potentially billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access to read or write data in HDFS.

HBase has two components, namely HBase Master and RegionServer.

HBase Master

  • It does not store the actual data itself, but negotiates load balancing across all RegionServers.
  • Maintains and monitors the cluster.
  • Provides an administrative interface for creating, updating, and dropping tables.
  • Controls failover.
  • Handles DDL operations.

RegionServer

  • Handle read, write, update, delete requests from clients.
  • A RegionServer process runs on each node of the Hadoop cluster, and RegionServers are colocated with the HDFS DataNodes.
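
The hedged Java sketch below writes and reads a single cell through the HBase client API, with the request ultimately served by a RegionServer. It assumes an existing table named user with a column family info, and an hbase-site.xml on the classpath that points at the cluster's ZooKeeper quorum.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a table 'user' with column family 'info' already exists,
        // and that hbase-site.xml (ZooKeeper quorum) is on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user"))) {

            // Write one cell: row key 'row1', column info:name.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the same row back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```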

7. HCatalog

HCatalog is a table and storage management layer for Hadoop. It lets different components of the Hadoop ecosystem, such as MapReduce, Hive, and Pig, easily read and write data in the cluster. HCatalog is a key component of Hive and enables users to store their data in any format and structure. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats.

8. Avro

Avro is part of the Hadoop ecosystem and one of the most popular data serialization systems, providing data serialization and data exchange services for Hadoop. These services can be used together or independently. Avro enables the exchange of big data between programs written in different languages. Using the serialization service, programs can serialize data into files or messages. Avro stores the data definition (schema) together with the data in a message or file, so programs can dynamically understand the information stored in an Avro file or message.

  • Avro schema: Avro relies on schemas for serialization and deserialization, and requires a schema to write or read data. When Avro data is stored in a file, its schema is stored with it, so the file can be processed by any program later (see the sketch below).
  • Dynamic typing: serialization and deserialization without generating code. Code generation is complementary in Avro and can be used as an optional optimization for statically typed languages.
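
Here is a small Java sketch of both points, using a hypothetical User schema defined inline: the writer embeds the schema in the output file together with the records, and the reader recovers it without any generated code.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
    public static void main(String[] args) throws Exception {
        // A hypothetical 'User' schema, defined inline for the example.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                        + "{\"name\":\"name\",\"type\":\"string\"},"
                        + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize: the schema is embedded in the file alongside the data.
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize without generated code: the reader picks the schema up from the file.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " " + record.get("age"));
            }
        }
    }
}
```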

9. Thrift

Thrift is a software framework for scalable cross-language service development and an interface definition language for RPC (Remote Procedure Call) communication. Hadoop makes many RPC calls, so Thrift can be used for performance or other reasons.

10. Drill

The main purpose of Hadoop ecosystem components is large-scale data processing, including structured and semi-structured data. Apache Drill is a low-latency distributed query engine designed to scale to thousands of nodes and query petabytes of data. Drill is the first distributed SQL query engine with a schema-free model.

Drill has a dedicated memory management system that eliminates garbage collection and optimizes memory allocation and usage. Drill plays nicely with Hive, allowing developers to reuse their existing Hive deployments.

  • Scalability: Drill provides a scalable architecture at every layer, including the query layer, query optimization, and the client API. Each layer can be scaled according to specific enterprise needs.
  • Flexibility: Drill provides a hierarchical columnar data model that can represent complex, highly dynamic data and allows efficient processing.
  • Dynamic schema discovery: Drill does not require schema or type specifications for the data before starting query execution. Instead, it processes data in units called record batches and discovers the schema on the fly.
  • Decentralized metadata: unlike other SQL-on-Hadoop technologies, Drill has no centralized metadata requirement. Drill users do not need to create and manage tables in a metastore in order to query data.

11. Mahout

Apache Mahout is an open source framework for creating scalable machine learning algorithms and data mining libraries. Once the data is stored in HDFS, Mahout provides data science tools to automatically find meaningful patterns in these large datasets.

Mahout's algorithms include:

  • Clustering
  • Collaborative filtering
  • Classification
  • Frequent pattern mining

12. Sqoop

Apache Sqoop imports data from external sources into Hadoop ecosystem components such as HDFS, HBase, or Hive. It can also export data from Hadoop to external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.

Features of Apache Sqoop :

  • Importing Sequential Datasets from Mainframe: Sqoop addresses the growing need to move data from mainframe to HDFS.
  • Direct import of ORC files: Improved compression and lightweight indexing for improved query performance.
  • Parallel Data Transfer: Enables faster performance and optimal system utilization.
  • Efficient data analysis: improves the efficiency of data analysis by combining structured and unstructured data in a schema-on-read data lake.
  • Fast data copying: from external systems into Hadoop.

13. Flume

Apache Flume efficiently collects, aggregates, and moves large volumes of data from their origins into HDFS. It is a fault-tolerant and reliable mechanism. Flume lets data flow from a source into the Hadoop environment. It uses a simple, extensible data model that supports online analytics applications. With Flume, data from many servers can be ingested into Hadoop immediately.

14. Ambari

Ambari is a management platform for configuring, managing, monitoring and securing Apache Hadoop clusters. Hadoop administration is made easier because Ambari provides a consistent, secure operational control platform.

Features of Ambari :

  • Simplified installation, configuration and management: Ambari creates and manages large-scale clusters easily and efficiently.
  • Centralized Security Settings: Ambari reduces the complexity of managing and configuring cluster security across the platform.
  • Highly scalable and customizable: Ambari is highly scalable, allowing custom services to be managed.
  • Comprehensive visibility into cluster health: Ambari ensures cluster health and availability through a holistic monitoring approach.

15. Zookeeper

Apache Zookeeper maintains configuration information and naming, and provides distributed synchronization and group services. Zookeeper manages and coordinates large clusters of machines.

Features of Zookeeper:

  • Fast: Zookeeper performs especially well in workloads where reads are more common than writes; an ideal read-to-write ratio is around 10:1.
  • Ordered: Zookeeper keeps a record of all transactions.
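
As a quick illustration, the hedged Java sketch below uses the ZooKeeper client API to store a small configuration value in a znode and read it back. The connection string localhost:2181 and the znode path /demo-config are assumptions for a local installation.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server and wait until the session is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of configuration as a znode and read it back.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```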

4. Advantages and disadvantages of Hadoop

A big data platform developed based on Hadoop usually has the following characteristics:

  • Scalability: it can reliably store and process petabytes of data. The Hadoop ecosystem generally uses HDFS as its storage component, which offers high throughput, stability, and reliability.
  • Low cost: data can be distributed and processed on clusters of cheap, general-purpose machines, and these clusters can grow to thousands of nodes.
  • High efficiency: by distributing the data, Hadoop can process it in parallel on the nodes where the data resides, which makes processing very fast.
  • Reliability: Hadoop automatically maintains multiple copies of data and can automatically redeploy failed computing tasks.

Hadoop ecological disadvantages:

  • Because Hadoop is built on a file storage system, read and write latency is poor. So far there is no component that supports both fast updates and efficient queries.
  • The Hadoop ecosystem is becoming more and more complex, compatibility between components is poor, and installation and maintenance are difficult.
  • Each Hadoop component has a relatively narrow function; its strengths are obvious, and so are its weaknesses.
  • The impact of the cloud ecosystem on Hadoop is very visible. Cloud vendors' customized components widen version differences further, so the ecosystem cannot pull in one direction.
  • The overall ecosystem is built on Java, which in practice means poor fault tolerance, low usability, and components that crash easily.

5. Hadoop learning path

(1) Platform foundation

1.1 Big data

Learn what big data is and get a basic introduction to the field, including the problems big data brings, such as storage and computation, and what the solutions are.

1.2 Hadoop platform ecosystem

Become familiar with the open-source Hadoop platform ecosystem, as well as third-party big data platforms. Find some introductory Hadoop blogs or the official website and learn:

What Hadoop is

Why Hadoop exists

How to use Hadoop

1.3 Hadoop family members

Hadoop is a huge family that includes a range of storage and computing components. You need to get to know this series of components, including HDFS, MapReduce, YARN, Hive, HBase, ZooKeeper, Flume, Kafka, Sqoop, HUE, Phoenix, Impala, Pig, Oozie, Spark, and so on, and know what each one does, at least at the level of its Wikipedia definition.

1.4 HDFS

Distributed storage with HDFS: understand the HDFS architecture, the HDFS storage mechanism, and how the various nodes cooperate with one another.

1.5 Yarn

Distributed resource management with YARN: become familiar with the YARN architecture and how it manages resources.

1.6 MapReduce

Distributed computing with MapReduce: understand the underlying architecture of MapReduce, its processing and computing model, and the advantages and disadvantages of MapReduce computation.

1.7 HBase

Efficient big data storage with HBase: understand the underlying architecture of HBase, its application scenarios, and its storage design.

1.8 Hive

The big data warehouse Hive: understand Hive's storage mechanism, its transaction support, its application scenarios, and the computation it runs underneath.

1.9 Spark

The in-memory computing platform Spark: become familiar with Spark's in-memory computing architecture, its computation process, its running modes, and its application scenarios.

(2) Advanced platform

2.1 HDFS

Operate HDFS from the command line: view files, upload, download, and modify files, assign permissions, and so on.

Connect to and operate HDFS through a Java demo, implementing file reading, uploading, and downloading.

Use the DI (data integration) tool to configure HDFS workflows that store relational database data in HDFS and save HDFS files to a local directory.

2.2 MapReduce

Bind the Hadoop environment in Eclipse, add a MapReduce Location, and run WordCount, the classic MapReduce example, to see how it works. Then try modifying it to count Chinese words and to exclude irrelevant words.

2.3 Hive

Operate Hive from the command line: connect with Beeline and use SQL statements to work with the Hive data warehouse.

Connect to and operate Hive through a Java demo, implementing operations such as creating tables, inserting data, querying, deleting records, updating data, and dropping tables.

Using the DI tool, configure a process that extracts data from a relational database into Hive transactional tables; instead of connecting to Hive through a direct driver, implement it via HDFS and Hive external tables.

2.4 HBase

Use HBase from the command line: create column families, add data to columns, and modify and update data to observe the changes.

Through a Java demo, use the Phoenix driver to connect to HBase and implement table creation and the insert, delete, update, and query operations on HBase tables.

The DI tool needs its source code modified, or a Phoenix component added, before it can be used, because Phoenix's insert statement is not INSERT INTO but UPSERT INTO, which the DI tool does not recognize.
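
To illustrate the UPSERT INTO point, here is a hedged Java sketch that connects to HBase through the Phoenix JDBC driver; the ZooKeeper address and the USERS table are assumptions for a local setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixDemo {
    public static void main(String[] args) throws Exception {
        // Load the Phoenix driver (requires the Phoenix client jar on the classpath).
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

        // The Phoenix JDBC URL points at the HBase ZooKeeper quorum; host and port are assumptions.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS USERS ("
                        + "ID BIGINT NOT NULL PRIMARY KEY, NAME VARCHAR)");
            }

            // Phoenix has no INSERT INTO; both inserts and updates use UPSERT INTO.
            try (PreparedStatement ps =
                         conn.prepareStatement("UPSERT INTO USERS (ID, NAME) VALUES (?, ?)")) {
                ps.setLong(1, 1L);
                ps.setString(2, "alice");
                ps.executeUpdate();
            }
            conn.commit(); // Phoenix connections are not auto-commit by default.

            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT ID, NAME FROM USERS")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
                }
            }
        }
    }
}
```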

2.5 Spark

From the command line, run pyspark and spark-shell, perform Spark command-line operations, and submit the Spark sample jobs for a trial run.

Switch Spark's running mode and try the command line in each mode.

Connect to Spark through a Java demo to distribute and run computation tasks.
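
As a starting point for such a Java demo, here is a minimal Spark sketch. It runs locally (master local[*]) so it can be tried without a cluster; on a real cluster the master would typically be yarn, and the job would be submitted with spark-submit.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkDemo {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM; on a real cluster the master would be "yarn".
        SparkConf conf = new SparkConf().setAppName("spark-demo").setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Distribute a small dataset, square each element in parallel, and sum the results.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            int sumOfSquares = numbers.map(x -> x * x).reduce(Integer::sum);
            System.out.println("sum of squares = " + sumOfSquares);
        }
    }
}
```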

(3) Platform proficiency

Use the components above proficiently. Practice makes perfect, and one worked example should let you generalize to others: be able to write MapReduce code, Spark code, and so on for a given scenario, deeply understand the SQL features, stored procedures, triggers, and other operations supported by Hive and HBase, and be able to design the best solution for a given set of requirements.

(4) Platform Depth

Read the component source code in depth, understand the meaning and impact of each configuration item in a platform deployment, learn how to optimize components through source code and configuration, and modify the source code to improve the fault tolerance, scalability, and stability of the Hadoop platform.

