Proficient in HADOOP (3) - Introduction to Hadoop

1.1 Introduction to Hadoop

Hadoop is a top-level project under the Apache Software Foundation, with several sub-projects that began in the Apache incubator. The Hadoop project develops and supports open source software that provides a framework for building highly scalable distributed computing applications. The Hadoop framework handles the details of parallelizing and distributing tasks, allowing application developers to focus on application logic.

Note that the Hadoop logo is a plump yellow elephant; Hadoop happens to be the name the chief architect's young son gave to his stuffed yellow elephant.

The Hadoop project home page (http://hadoop.apache.org/) describes the project as follows:

The Hadoop project develops open source software for reliable, scalable, distributed computing, and it includes the following sub-projects:

Hadoop Core is the flagship sub-project. It provides the Hadoop Distributed File System (HDFS) and a software framework that supports the MapReduce distributed computing model.

HBase is built on the Hadoop core and provides a scalable distributed database.

Pig is a high-level dataflow language and framework for implementing parallel computing. It is also built on top of the Hadoop core.

ZooKeeper is a highly available and reliable coordination service. Distributed applications use ZooKeeper to store and coordinate updates to critical shared state.

Hive is a data warehouse built on Hadoop. It provides data summarization, ad hoc querying, and analysis of datasets.

The Hadoop Core project provides the basic services for building a cloud computing environment on commodity hardware, as well as the APIs needed by software that runs in that cloud. The two fundamental pieces of Hadoop Core are the MapReduce framework, which is the cloud computing environment, and the Hadoop Distributed File System (HDFS).

Note that in the Hadoop core framework, MapReduce is often referred to as mapred and HDFS is often referred to as dfs.

The Hadoop Core MapReduce framework requires a shared file system. This shared file system does not need to be a system-level file system; any distributed file system that the framework can access will satisfy MapReduce's needs.

Although Hadoop Core provides the HDFS distributed file system, it can also work without it. In the Hadoop JIRA (the project's issue-tracking system), item 1686 tracks the work of splitting HDFS out of Hadoop. In addition to HDFS, Hadoop Core also supports the CloudStore (formerly Kosmos) file system (http://kosmosfs.sourceforge.net/) and the Amazon Simple Storage Service (S3) file system (http://aws.amazon.com/s3/). The Hadoop Core framework accesses HDFS, CloudStore, and S3 through dedicated interfaces. Users are also free to use any distributed file system that is visible as a system-mounted file system, such as the Network File System (NFS), the Global File System (GFS), or Lustre.
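As an illustration, the following minimal sketch shows how a client program selects a file system through Hadoop's FileSystem abstraction: the scheme of the URI (hdfs://, file://, s3://) determines which implementation is used. The host name namenode and the bucket my-bucket are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemSchemes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The scheme of each URI decides which FileSystem implementation is returned.
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf); // hypothetical NameNode host
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        FileSystem s3    = FileSystem.get(URI.create("s3://my-bucket/"), conf);       // hypothetical bucket

        // Print the concrete implementation class chosen for each scheme.
        System.out.println(hdfs.getClass().getName());
        System.out.println(local.getClass().getName());
        System.out.println(s3.getClass().getName());

        // The rest of the code is the same regardless of which file system backs the path.
        System.out.println(hdfs.exists(new Path("/user/example/input")));
    }
}
```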

When HDFS is used as the shared file system, Hadoop can determine which nodes hold a copy of the input data and will try to schedule the task that reads that data on one of those machines. This book covers Hadoop application development on top of the HDFS file system.

1.1.1 Hadoop's core MapReduce

The Hadoop MapReduce environment provides users with a robust framework for managing and executing Map and Reduce jobs across a cluster. The user must supply the following information to the framework:

  • The location of the job input in the distributed file system
  • The location of the job output in the distributed file system
  • The input format
  • The output format
  • The class containing the map method
  • The class containing the reduce method (optional)
  • The location of the JAR file(s) containing the map and reduce methods and any supporting classes

If a job does not require a reduce phase, the user does not need to specify a Reducer class, and the reduce phase of the job is not run. The framework splits the input data and schedules and executes the map tasks across the cluster. If requested, it sorts the map output and passes the sorted results to the reduce tasks. Finally, the reduce output is written to the output directory, and the job status is reported to the user.
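The information listed above maps almost directly onto a job driver. Below is a minimal sketch of such a driver using the classic org.apache.hadoop.mapred API; the input and output paths are hypothetical, and WordCountMapper and WordCountReducer are placeholder classes (one possible implementation is sketched later in this section).

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word-count-sketch");

        // Job input and output locations in the distributed file system (hypothetical paths).
        FileInputFormat.setInputPaths(conf, new Path("/user/example/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/example/output"));

        // Input and output formats.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Classes containing the map and (optional) reduce methods.
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        // For a job with no reduce phase: conf.setNumReduceTasks(0);

        // Types of the key/value pairs the job emits.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // Submit the job and wait for it to finish.
        JobClient.runJob(conf);
    }
}
```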

MapReduce tasks process key/value pairs. The framework converts each record of the input into a key/value pair, and each pair is presented to the map method once. The map output is a set of key/value pairs: conceptually the input is one key/value pair, but the output may be zero or more pairs. The framework then groups and sorts the map output pairs by key. The reduce method is called once for each distinct key, with the key and the set of values associated with it, and it may emit any number of key/value pairs; these are written to output files in the job's output directory. If the reduce output keys are unchanged from the reduce input keys, the final output remains sorted.
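To make the key/value flow concrete, here is a hedged word-count sketch of the WordCountMapper and WordCountReducer classes referenced in the driver above, written against the classic org.apache.hadoop.mapred API. The two classes are shown in one listing for brevity.

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: each input record arrives as one key/value pair (here: byte offset / line of text);
// the map method may emit zero or more key/value pairs.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (token.length() > 0) {
                word.set(token);
                output.collect(word, ONE);   // one output pair per word occurrence
            }
        }
    }
}

// Reduce: called once per key with all values grouped under that key;
// it may emit any number of key/value pairs.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new LongWritable(sum));
    }
}
```

Each map call receives one line of text and emits one (word, 1) pair per token; each reduce call receives one word together with all of its counts and emits the total.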

The framework provides two processes to manage MapReduce jobs:

  • TaskTracker manages and executes individual map and reduce tasks on the compute nodes in the cluster.
  • JobTracker accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.

In general, there is one JobTracker process per cluster and one or more TaskTracker processes per compute node. The JobTracker is a single point of failure: if it fails, the whole MapReduce service is unavailable. If a TaskTracker fails, the JobTracker simply reschedules its tasks on other TaskTracker processes.

Note that a nice feature of the Hadoop Core MapReduce environment is that you can add TaskTracker nodes to a cluster while a job is running, and tasks will be dispatched to the new nodes.
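The division of labor between the JobTracker and the TaskTrackers can be observed from a client program. The sketch below assumes the classic org.apache.hadoop.mapred API and a configuration whose mapred.job.tracker property points at a running JobTracker; it asks the JobTracker for its view of the cluster, and running it again after new TaskTracker nodes join should show the TaskTracker count grow.

```java
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        // Connects to the JobTracker named by mapred.job.tracker in the configuration.
        JobClient client = new JobClient(new JobConf());

        ClusterStatus status = client.getClusterStatus();
        System.out.println("TaskTrackers reporting in: " + status.getTaskTrackers());
        System.out.println("Map tasks running:         " + status.getMapTasks()
                + " of " + status.getMaxMapTasks() + " slots");
        System.out.println("Reduce tasks running:      " + status.getReduceTasks()
                + " of " + status.getMaxReduceTasks() + " slots");
    }
}
```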

1.1.2 Hadoop's Distributed File System

HDFS is a file system designed specifically for MapReduce jobs. A MapReduce job typically reads a very large amount of input data from HDFS, processes it, and writes the output back to HDFS. HDFS is not designed for random-access workloads. For reliability, it stores each block of data on multiple storage nodes; in the Hadoop community these copies are called replicas. As long as one replica of a block survives, data consumers can still read the data safely.

HDFS is implemented by two types of processes:

  • The NameNode manages the file system metadata and provides management and control services.
  • DataNodes provide block storage and retrieval services.

There is one NameNode process in an HDFS file system, and it is a single point of failure: if it fails, the entire file system becomes unavailable. Hadoop Core provides recovery and automatic backup of the NameNode's metadata, but no hot (runtime) failover. A cluster has many DataNode processes; typically, each storage node in the cluster runs one DataNode process.
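The following short sketch shows a client writing and then reading a file through the org.apache.hadoop.fs.FileSystem API; the NameNode supplies the metadata and block locations, while the DataNodes store and serve the actual blocks. The path used is hypothetical, and the configuration is assumed to point at a running HDFS instance.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // The configuration's default file system setting points the client at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/example/hello.txt");   // hypothetical path

        // Write: the NameNode records the file's metadata; the bytes are
        // stored as replicated blocks on DataNodes.
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("hello, hdfs");
        out.close();

        // Read: the client asks the NameNode for block locations and then
        // streams the data from a DataNode that holds a replica.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}
```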

Note that it is common for a cluster node to run both the TaskTracker service and the DataNode service, and it is also common for a single node to run both the JobTracker and the NameNode services.
