Basics

The term Hadoop has come to refer not just to the previously mentioned base modules and sub-modules, but also to the ecosystem,[11] or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.
A Hadoop 1 cluster runs five services, covering both HDFS and the MapReduce engine:
1. NameNode
2. Secondary NameNode
3. JobTracker
4. DataNode
5. TaskTracker
The first three are master services (daemons) and the last two are slave services. HDFS runs on top of the file system of the underlying operating system, and it is not restricted to MapReduce jobs.
Every DataNode sends a heartbeat message to the NameNode every 3 seconds to signal that it is alive. The file system uses TCP/IP sockets for communication, and clients use remote procedure calls (RPC) to communicate with the NameNode.
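The heartbeat mechanism described above can be sketched as a small liveness tracker on the NameNode side. This is a minimal illustration, not Hadoop's implementation: the class `HeartbeatMonitor`, its method names, and the 30-second timeout are hypothetical choices made for this sketch; only the 3-second interval comes from the text.

```python
import time

HEARTBEAT_INTERVAL_S = 3.0  # interval stated in the text
DEAD_AFTER_S = 30.0         # hypothetical timeout for this sketch, not an HDFS default


class HeartbeatMonitor:
    """Tracks when each DataNode was last heard from (hypothetical helper)."""

    def __init__(self, dead_after_s=DEAD_AFTER_S, clock=time.monotonic):
        self._dead_after_s = dead_after_s
        self._clock = clock      # injectable so the logic can be tested without sleeping
        self._last_seen = {}     # node id -> time of last heartbeat

    def record_heartbeat(self, node_id):
        """Called whenever a heartbeat arrives from a DataNode."""
        self._last_seen[node_id] = self._clock()

    def live_nodes(self):
        """Nodes whose last heartbeat is within the timeout."""
        now = self._clock()
        return sorted(n for n, t in self._last_seen.items()
                      if now - t <= self._dead_after_s)

    def dead_nodes(self):
        """Nodes that have been silent for longer than the timeout."""
        now = self._clock()
        return sorted(n for n, t in self._last_seen.items()
                      if now - t > self._dead_after_s)
```

In this sketch a node is only ever considered dead after it has missed several consecutive heartbeats, so a single delayed message does not trigger recovery work.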
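Hadoop can work directly with any distributed file system that the underlying operating system can mount; however, this comes at a price: the loss of locality, since Hadoop no longer knows which nodes hold the data a task needs.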
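Locality here means running a task on a node that already stores the data it reads. The following sketch shows what a scheduler can do when it knows block locations, and what it falls back to when it does not. The function `pick_node` and its arguments are hypothetical names for illustration, not part of any Hadoop API.

```python
def pick_node(block_hosts, free_nodes):
    """Choose a node for a task, preferring one that stores the input block.

    block_hosts: set of node names holding a replica of the block
                 (empty when the file system exposes no location information).
    free_nodes:  list of node names with spare task capacity.
    Returns (node, is_local); node is None when nothing is free.
    """
    for node in free_nodes:
        if node in block_hosts:
            return node, True          # data-local: no network transfer needed
    if free_nodes:
        return free_nodes[0], False    # remote read: data crosses the network
    return None, False
```

With a file system that reports no block locations, `block_hosts` is always empty and every task takes the remote-read branch, which is the cost the paragraph above refers to.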
The TaskTracker on each node spawns a separate JVM process for the running job, so that the TaskTracker itself does not fail if the job crashes its JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
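The isolation idea can be illustrated by analogy: run each task in a child process so that a crash in the task cannot take down the supervisor. The sketch below uses a child Python interpreter in place of a JVM; `run_task_isolated` is a hypothetical helper and does not reflect how the TaskTracker is actually coded.

```python
import subprocess
import sys


def run_task_isolated(task_code, timeout_s=30):
    """Run task_code in a separate interpreter process.

    Returns (succeeded, stdout). A crash or non-zero exit in the child
    is reported to the caller instead of terminating the caller.
    """
    result = subprocess.run(
        [sys.executable, "-c", task_code],  # child process, analogous to a child JVM
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0, result.stdout.strip()
```

A supervisor using this helper can mark the task as failed and reschedule it, while continuing to serve its other tasks.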
The biggest difference between Hadoop 1 and Hadoop 2 is the introduction of YARN. Hadoop 3 allows multiple NameNodes, decreases storage overhead with erasure coding, and permits the use of GPU hardware within the cluster.
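The storage saving from erasure coding can be checked with simple arithmetic. The sketch below compares the extra storage needed by n-way replication with that of a scheme using k data units and m parity units. The specific figures used in the example (3 replicas, 6 data units plus 3 parity units) are assumptions chosen for illustration and are not stated in the text above.

```python
def replication_overhead(replicas):
    """Extra storage as a fraction of the data size for n-way replication."""
    return float(replicas - 1)


def erasure_coding_overhead(data_units, parity_units):
    """Extra storage as a fraction of the data size for k data + m parity units."""
    return parity_units / data_units


# Assumed example figures: 3 replicas versus 6 data + 3 parity units.
print(f"replication x3: {replication_overhead(3):.0%} extra storage")
print(f"erasure coding 6+3: {erasure_coding_overhead(6, 3):.0%} extra storage")
```

Under these assumptions, replication stores two extra copies of every block, while the erasure-coded layout stores only half as much parity as data.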
