Hadoop Literacy & Quick Start


Dingdu! Here is the compilation of Xiao Ah Woo's study course materials. A good memory is not as good as a bad pen. Today is also a day to make progress. Let's advance together!
Insert picture description here

1. About Hadoop

Hadoop Distributed File System

Hadoop is developed by the Apache Foundation 分布式系统基础架构. Users can develop distributed programs without understanding the underlying details of distributed. Make full use of the power of the cluster 高速运算和存储.

Hadoop implements a 分布式文件系统(Hadoop Distributed File System), one of the components is HDFS. HDFS has the characteristics of high fault tolerance and is designed to be deployed on low-cost hardware; and it provides high throughput (high throughput) to access application data, suitable for those with large data sets (large data sets). set) application. HDFS relaxes the requirements of POSIX, and can access data in the file system in the form of streaming access.

Hadoop的框架最核心的设计就是:HDFS和MapReduce。HDFS provides storage for massive amounts of data, and MapReduce provides calculations for massive amounts of data.

Core architecture

Hadoop is composed of many elements. At the very bottom Hadoop Distributed File System(HDFS), it stores files on all storage nodes in the Hadoop cluster. HDFS is a layer of MapReduce 引擎the engine by the JobTrackers 和 TaskTrackerscomposition. Through the introduction to the core distributed file system HDFS and MapReduce processing process of the Hadoop distributed computing platform, as well as the data warehouse tool Hive and the distributed database Hbase, it basically covers all the technical cores of the Hadoop distributed platform.

Two, Hadoop quick start

See the quick start document for details:

Hurry and poke here~

http://hadoop.apache.org/docs/r1.0.4/cn/quickstart.html

Three, the difference between Hadoop & MapReduce

Hadoop

It is a framework for distributed data and computing. It is good at it 存储大量的半结构化的数据集. Data can be stored randomly, so the failure of a disk will not cause data loss. Hadoop is also very good at distributed computing-quickly processing large data sets across multiple machines.

MapReduce

It is a programming model for dealing with large sets of semi-structured data . The programming model is a way to deal with and structure specific problems. For example, in a relational database, a collection language is used to execute queries, such as SQL. Tell the language the desired result and submit it to the system to figure out how to generate the calculation. You can also use more traditional languages ​​(C++, Java) to solve the problem step by step. These are two different programming models, and MapReduce is the other.

MapReduce和Hadoop是相互独立的,实际上又能相互配合工作得很好 。

Insert picture description here

Fourth, the significance of Hadoop big data processing

Insert picture description here

Hadoop is a large data processing applications in a wide range of applications benefit from their own 数据提取, 变形and 加载(ETL)on aspects of the natural advantages. The distributed architecture of Hadoop puts the big data processing engine as close to the storage as possible, which is relatively suitable for batch processing operations like ETL, because the batch processing results of similar operations can go directly to storage. The MapReduce function of Hadoop is implemented 将单个任务打碎,并将碎片任务(Map)发送到多个节点上,之后再以单个数据集的形式加载(Reduce)到数据仓库里.

Ending!
More course knowledge learning records will come later!

就酱,嘎啦!

Insert picture description here

Note:
Life is diligent, nothing is gained.

Guess you like

Origin blog.csdn.net/qq_43543789/article/details/108757421