Getting Started with Hadoop

Who Says Elephants Can't Dance?

What is Hadoop?

Official definition from the Apache website: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

In short: Hadoop is a platform for distributed storage and big data computing.

Now a top-level Apache open source project, Hadoop does not refer to any single framework or component; it is an open source distributed computing platform developed in Java under the Apache Software Foundation. It performs distributed computation over massive data sets on clusters built from large numbers of computers, making it well suited as a platform for big data storage and distributed computing.

Here is a simple example. Suppose you have a basket of fruit and want to know how many apples and pears it contains; you can simply count them one by one. If you have a whole container of fruit, you need many people counting at the same time, which is analogous to multiple processes or threads. If you have many containers of fruit, you need distributed computing, and that is Hadoop. Hadoop was born mainly because, in the era of big data, the volume of data a computer must process became too large. Such massive data has to be split up and assigned to N computers for processing, and when data is distributed across different computers, the distributed processing must be properly coordinated so that a correct final result can be obtained. Hadoop is exactly such a solution.
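The counting analogy above maps directly onto the map/shuffle/reduce pattern that Hadoop implements. Below is a minimal in-memory sketch of that idea in Python (not Hadoop itself, and the container data is made up for illustration): each "container" is a data split handled by one worker, and the partial counts are merged into a final result.

```python
from collections import defaultdict

# Hypothetical input: each "container" is a list of fruit labels,
# standing in for a data split assigned to one worker.
containers = [
    ["apple", "pear", "apple"],
    ["pear", "pear"],
    ["apple", "apple", "pear", "apple"],
]

def map_phase(split):
    # Each worker independently emits (fruit, 1) pairs for its own split.
    return [(fruit, 1) for fruit in split]

def reduce_phase(pairs):
    # Coordinated merge: combine the partial results into final totals.
    totals = defaultdict(int)
    for fruit, count in pairs:
        totals[fruit] += count
    return dict(totals)

# "Shuffle": gather the intermediate pairs from every worker.
intermediate = [pair for split in containers for pair in map_phase(split)]
print(reduce_phase(intermediate))  # {'apple': 5, 'pear': 4}
```

Real Hadoop runs the map and reduce phases on different machines and moves the intermediate pairs over the network, but the logical flow is the same.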

The origin of Hadoop

Hadoop originated from Google's three seminal papers: GFS, BigTable, and MapReduce. Inspired by them, Doug Cutting (the father of Hadoop) and others implemented DFS and MapReduce mechanisms in their spare time. In February 2006 this work was split out as a complete, stand-alone project and named Hadoop.

Hadoop's growth followed the path: Lucene -> Nutch -> Hadoop

The core ideas of the three papers gradually evolved into Hadoop's components:

GFS -> HDFS
Google MapReduce -> Hadoop MapReduce
BigTable -> HBase

Hadoop core versions and architecture

The Apache open source community version is now at 3.x.

The Hadoop 1.0 release had two core components: HDFS + MapReduce.

The Hadoop 2.0 release introduced Yarn. Core components: HDFS + Yarn + MapReduce.

Yarn is a resource scheduling framework capable of fine-grained management and scheduling of tasks. It can also host computing frameworks other than MapReduce, such as Spark.

The Hadoop 3.0 release introduced no new core component but brought many upgrades to the existing ones; see the official website for details.

Hadoop design ideas

Hadoop runs on ordinary commodity servers and offers high fault tolerance, high reliability, scalability, and so on.

It is particularly well suited to write-once, read-many scenarios.

Suitable scenarios

  • Large-scale data
  • Streaming data access (write once, read many)
  • Commodity hardware (ordinary, inexpensive hardware)

Unsuitable scenarios

  • Low latency data access
  • A large number of small files
  • Frequently modified files
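The "large number of small files" point can be made concrete with a back-of-the-envelope calculation. HDFS keeps every file's metadata in the NameNode's heap memory, and a commonly cited rule of thumb (an assumption here, not an exact figure) is roughly 150 bytes per file, directory, or block object, independent of file size:

```python
# Rule of thumb (assumption): each file and each block costs the HDFS
# NameNode on the order of 150 bytes of heap memory.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, blocks_per_file=1):
    # Each file contributes one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1024**3

# 100 million small files (one block each) vs. the same data packed
# into 1 million large files of 100 blocks each:
print(namenode_heap_gb(100_000_000))    # many small files
print(namenode_heap_gb(1_000_000, 100)) # fewer, larger files
```

The small-file layout needs roughly twice the NameNode heap for the same amount of data, which is why HDFS favors a smaller number of large files.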

PS

Chinese translations of the three Google papers behind Hadoop:

GFS: the Google File System, Google's distributed file system
BigTable: a large-scale distributed database
MapReduce: Google's distributed parallel computing framework

Source: www.cnblogs.com/valjeanshaw/p/11403379.html