Hadoop Tutorial

        Hadoop is an open-source distributed computing and storage framework developed and maintained by the Apache Software Foundation.

        Hadoop provides reliable, scalable, application-layer computing and storage for large computer clusters. It allows large data sets to be processed in a distributed fashion across clusters of computers using a simple programming model, and it scales from a single machine up to thousands of machines.

        Hadoop is written in Java, so it can be deployed and run on machines with many different hardware platforms. Its core components are the Hadoop Distributed File System (HDFS) and MapReduce.

        Users can develop distributed programs without knowing the low-level details of how the work is distributed, and can make full use of the cluster's power for high-speed computing and storage.

        Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is well suited to applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be read as a stream (streaming access).
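
        As an illustration of this streaming-access model, here is a minimal sketch of reading a file from HDFS with the Hadoop Java FileSystem API. The NameNode address and the file path are assumptions and would need to match the fs.defaultFS setting and directory layout of your own cluster.

```java
// HdfsStreamRead.java -- a minimal sketch of streaming a file out of HDFS.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; it must match fs.defaultFS in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        Path file = new Path("/user/demo/input.txt");  // hypothetical file

        // fs.open() returns a stream, reflecting HDFS's streaming-access design.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```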

        The core of the Hadoop framework is the combination of HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over that data.
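
        To make that division of labor concrete, the classic word-count example below sketches the computation side: a Mapper that emits a (word, 1) pair for every word in each input line, and a Reducer that sums those counts per word. The class names are illustrative, but they follow the standard Hadoop MapReduce Java API.

```java
// WordCountMapper.java -- map phase: emit (word, 1) for every token in a line.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate pair: (word, 1)
        }
    }
}
```

```java
// WordCountReducer.java -- reduce phase: sum the counts gathered for each word.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // final pair: (word, total)
    }
}
```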

        Apache Hadoop is a framework for running applications on large clusters built from commodity hardware. It implements the Map/Reduce programming paradigm, in which a computing job is divided into many small fragments of work, each of which may be executed (or re-executed) on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes themselves, providing very high aggregate bandwidth across the cluster.
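
        Continuing the word-count sketch above, a small driver class configures a Job, points it at input and output paths in HDFS, and submits it to the cluster; the framework then splits the input and schedules map tasks on the nodes holding the corresponding data blocks. The paths and the JAR name used here are hypothetical.

```java
// WordCountDriver.java -- configures and submits the job to the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS; both paths are placeholders
        // supplied on the command line, e.g. /user/demo/input /user/demo/output.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The framework splits the input into fragments, runs map tasks on the
        // nodes that store the corresponding HDFS blocks, shuffles the
        // intermediate (word, 1) pairs, and then runs the reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

        Packaged as a JAR, the job would typically be launched with something like hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output; the exact command depends on how the cluster is installed.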


Hadoop History

        In 2003 and 2004, Google published two famous papers, one on GFS and one on MapReduce.

        Together with the BigTable paper published in 2006, they became known as the now famous "Google Three Papers".

        Influenced by these ideas, Doug Cutting started developing Hadoop.

        Hadoop consists of two core components. In Google's papers, GFS is a distributed file system running on a huge computer cluster; HDFS implements the equivalent functionality in Hadoop. MapReduce is a distributed computing method; Hadoop implements it with a framework of the same name, which we will cover in detail in the later MapReduce chapter. Hadoop has been an Apache top-level project since 2008. It and its many sub-projects are widely used by large Internet service companies including Yahoo, Alibaba, and Tencent, and are supported by platform vendors such as IBM, Intel, and Microsoft.


The Role of Hadoop

        The role of Hadoop is simple: to create a unified, stable storage and computing environment on a cluster of many computers, and to provide a platform for other distributed application services.

        In other words, Hadoop organizes multiple computers into what behaves, to some extent, like a single computer (all doing the same work): HDFS is equivalent to that computer's hard disk, and MapReduce is equivalent to its CPU.
