Big Data

Basics

  • The term Hadoop has come to refer not just to its base modules and sub-modules, but also to the ecosystem,[11] or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.
  • A classic Hadoop 1 cluster runs five services:
    1. NameNode
    2. Secondary NameNode
    3. JobTracker
    4. DataNode
    5. TaskTracker
    The top three are master services (daemons) and the bottom two are slave services. The NameNode, Secondary NameNode, and DataNode belong to HDFS; the JobTracker and TaskTracker belong to the MapReduce engine. HDFS runs on top of the file systems of the underlying operating systems, and it is not restricted to MapReduce jobs.

  • Every DataNode sends a heartbeat message to the NameNode every 3 seconds to signal that it is alive. The file system uses TCP/IP sockets for communication, and clients talk to the NameNode through remote procedure calls (RPC).
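  • Hadoop works directly with any distributed file system that the underlying operating system can mount; however, this comes at a price: the loss of locality. Without location information, the scheduler cannot place tasks on the nodes that already hold the data.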
  • The TaskTracker on each node spawns a separate JVM process, so that the TaskTracker itself does not fail if the running job crashes its JVM. The TaskTracker sends a heartbeat to the JobTracker every few minutes to report its status. JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
  • The biggest difference between Hadoop 1 and Hadoop 2 is YARN. Hadoop 3 allows multiple NameNodes, reduces storage overhead with erasure coding, and permits the use of GPU hardware within the cluster.
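
To make the erasure-coding point concrete, here is a small arithmetic sketch comparing storage overhead. It assumes 3x replication (the usual HDFS default) and a Reed-Solomon layout with 6 data cells and 3 parity cells; neither figure comes from the notes above, and the function names are hypothetical.

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage as a fraction of the original data size."""
    return redundant_units / data_units


def raw_bytes_needed(logical_bytes: int, data_units: int, redundant_units: int) -> int:
    """Raw disk space consumed to store logical_bytes of user data."""
    return logical_bytes * (data_units + redundant_units) // data_units


# Assumption: 3x replication = 1 original copy + 2 extra copies.
replication_overhead = storage_overhead(1, 2)

# Assumption: Reed-Solomon with 6 data cells + 3 parity cells.
rs_6_3_overhead = storage_overhead(6, 3)

print(f"3x replication overhead: {replication_overhead:.0%}")
print(f"RS(6,3) overhead:        {rs_6_3_overhead:.0%}")
print("Raw bytes for 600 logical bytes:",
      raw_bytes_needed(600, 1, 2), "vs", raw_bytes_needed(600, 6, 3))
```

Under these assumptions, replication stores two extra bytes for every byte of data, while the erasure-coded layout stores half a byte, and both still tolerate the loss of more than one storage unit.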
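
The heartbeat mechanism described in the notes above can be sketched as a small liveness tracker. This is an illustration only, not Hadoop's code: the class name, the node ids, and the rule "dead after 10 missed heartbeats" are assumptions, and only the 3-second interval comes from the notes.

```python
import time
from typing import Dict, List, Optional

HEARTBEAT_INTERVAL_S = 3.0  # interval stated in the notes
DEAD_AFTER_MISSED = 10      # hypothetical threshold, not Hadoop's actual rule


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went silent."""

    def __init__(self,
                 interval_s: float = HEARTBEAT_INTERVAL_S,
                 dead_after_missed: int = DEAD_AFTER_MISSED) -> None:
        self.timeout_s = interval_s * dead_after_missed
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node_id: str, now: Optional[float] = None) -> None:
        # A monotonic clock avoids false timeouts when the wall clock is adjusted.
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def dead_nodes(self, now: Optional[float] = None) -> List[str]:
        now = time.monotonic() if now is None else now
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


monitor = HeartbeatMonitor()
monitor.record_heartbeat("datanode-1", now=0.0)
monitor.record_heartbeat("datanode-2", now=0.0)
monitor.record_heartbeat("datanode-1", now=27.0)  # datanode-2 stays silent
print(monitor.dead_nodes(now=31.0))
```

Passing `now` explicitly keeps the example deterministic; a real monitor would rely on the clock and run the check on a timer.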

Reposted from blog.csdn.net/wwwpcstarcomcn/article/details/86493741