Big Data Basics: An Introduction to the Hadoop Distributed System

With the rapid rise of the intelligent, everything-connected Internet era, data volumes have begun to surge. On the one hand, we need to think about how to store massive amounts of data efficiently and reliably; on the other hand, we also need to analyze and process that data to extract more valuable information. This is where Hadoop comes in.

Hadoop is an open-source distributed computing platform from the Apache Software Foundation. Its core consists of HDFS (Hadoop Distributed File System) and MapReduce (Hadoop 2.0 added YARN, a resource-scheduling framework that can manage and schedule tasks at a fine granularity and can also host other computing frameworks, such as Spark). Hadoop provides users with a distributed infrastructure whose low-level details stay transparent. Thanks to HDFS's high fault tolerance, high scalability, and high efficiency, users can deploy Hadoop on inexpensive hardware and form a distributed system.
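To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java client API. The cluster address (hdfs://localhost:9000) and the path /demo/hello.txt are placeholder assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed cluster address; in practice this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt"); // hypothetical path
            // Write: the client asks the NameNode for metadata,
            // then streams the data blocks to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }
            // Read the file back through the same API.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```

Notice that the code never mentions blocks or DataNodes; that is the sense in which the distributed details stay transparent to the user.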

The Hadoop Ecosystem

Beyond basic Hadoop, the project has grown a very complete and huge open-source ecosystem: HDFS provides file storage and YARN manages resources, and on this foundation a variety of processing engines run, including MapReduce, Tez, Spark, and Storm, to meet the requirements of different data-usage scenarios.

HDFS Architecture

[Figure: HDFS architecture diagram]

HDFS uses a master/slave model. An HDFS cluster consists of one NameNode and several DataNodes: the NameNode, as the master server, manages the file system namespace and clients' access to files, while the DataNodes are responsible for managing the data they store. Under the hood, HDFS cuts data into multiple blocks, and each block is replicated and stored on different DataNodes to achieve fault-tolerant redundancy.
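As a sketch of how blocks and replicas surface in the client API, the snippet below (reusing the hypothetical /demo/hello.txt file from above) asks the NameNode which DataNodes hold each block of a file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt"); // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            // One BlockLocation per block; each lists the DataNodes holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}
```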

MapReduce

MapReduce is a computing model that originated at Google; it abstracts the complex process of running parallel computation on a large cluster into two functions, Map and Reduce ("map" as in mapping, "reduce" as in reduction). The map function takes key/value pairs as input and generates another series of key/value pairs, which are written to local disk as intermediate output. The MapReduce framework automatically aggregates this intermediate data by key and hands all records with the same key to the same reduce function. The reduce function takes a key and the list of values associated with it as input, merges those values, and generates another set of key/value pairs that are written to HDFS as the final output.
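The canonical illustration is WordCount. Below is a minimal sketch using Hadoop's org.apache.hadoop.mapreduce API, with input and output paths taken from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for each token; the framework writes these
    // intermediate pairs to local disk and groups them by key.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receives one key plus the list of its values; sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result); // final output is written to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```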

Hive vs. HBase

Within the basic Hadoop ecosystem there are two components whose differences deserve discussion: Hive and HBase. Hive is a data-warehousing tool built on Hadoop; it can map structured data files onto database tables and provides a simple SQL query capability, translating SQL statements into MapReduce jobs. HBase is the Hadoop database: a distributed, scalable store for big data.
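As a sketch of that SQL capability, assuming a HiveServer2 instance at localhost:10000 and a hypothetical table pages, a query can be issued over JDBC and Hive will compile it into a distributed job:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 address; the hive-jdbc driver must be on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Hive compiles this SQL into a MapReduce (or Tez/Spark) job
            // over files stored in HDFS; "pages" is a hypothetical table.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM pages GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```

The points below summarize the main differences between the two.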

1. Hive itself neither computes nor stores data; it depends entirely on HDFS and MapReduce. Hive tables are purely logical: Hive uses HDFS to store its files and the MapReduce framework to run its computations.

2. Hive can be thought of as a wrapper around MapReduce. Its value is that it converts hand-written Hive SQL into MapReduce programs that would otherwise be complex and difficult to write.

3. HBase tables are physical tables, not logical tables; HBase provides a very large in-memory hash table that search engines can use to store indexes, which makes query operations convenient.

4. HBase can be considered a wrapper around HDFS. It is in essence a data store, a NoSQL (Not Only SQL) database; HBase is deployed on top of HDFS and overcomes HDFS's weakness at random access.
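To illustrate points 3 and 4, here is a minimal sketch using the HBase Java client, assuming a pre-created hypothetical table user with a column family info:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // Hypothetical table "user" with column family "info", created beforehand.
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Random write keyed by row key; HBase persists the data as files on HDFS.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);
            // Random read by row key: the access pattern plain HDFS is poor at.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```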
