To study big data, you have to know Hadoop

Let us first understand its benefits. Data that is large in volume, multi-dimensional, and highly complex needs a good platform to support its storage and processing.

What is Hadoop?

Hadoop is a software platform for analyzing and processing big data. It is an open-source framework from Apache, implemented in Java, that distributes the computation over massive amounts of data across large clusters of computers.

The core of Hadoop's design is two frameworks: HDFS and MapReduce. HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.

The flow of big data processing with Hadoop can be understood from the following simplified figure: input data is processed by the Hadoop cluster, which produces the result.

HDFS: the Hadoop Distributed File System.

By default, a data file is split into 64 MB blocks that are stored in a distributed fashion. In the figure below, the file Data1 is split across three machines in the cluster, and each block is redundantly replicated three times on different machines.
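The splitting and replication described above can be sketched as a simplified model. This is only an illustration of the arithmetic, not HDFS's actual placement policy (real HDFS is rack-aware); the machine names are hypothetical.

```python
# Simplified model of HDFS block splitting and replica placement.
# Block size (64 MB) and replication factor (3) match the defaults
# described above; the round-robin placement is a simplification.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # each block is stored on 3 machines

def split_into_blocks(file_size: int) -> int:
    """Number of blocks needed for a file of the given size."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_replicas(num_blocks: int, machines: list) -> dict:
    """Assign each block to REPLICATION distinct machines, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [machines[(b + r) % len(machines)]
                        for r in range(REPLICATION)]
    return placement

# A 150 MB file needs 3 blocks (64 + 64 + 22 MB).
blocks = split_into_blocks(150 * 1024 * 1024)
layout = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
```

Losing any single machine still leaves at least two copies of every block, which is why the replication in the figure makes the file fault-tolerant.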

MapReduce: Hadoop creates a Map task for each input split; the task processes the records in that split one by one and outputs its results as key-value pairs. Hadoop then sorts and groups the Map output by key and feeds it to Reduce as input. The output of the Reduce tasks is the output of the whole job, and it is saved to HDFS.
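The map, group-by-key, and reduce steps above can be sketched as a minimal in-process simulation using the classic word-count example. The function names here are illustrative, not Hadoop's actual Java API.

```python
# In-process sketch of the MapReduce flow: map emits (key, value)
# pairs, the framework groups them by key, and reduce aggregates
# each group. Word count is the canonical example.

from collections import defaultdict

def map_phase(record):
    # One Map call per record in an input split: emit (word, 1) pairs.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Hadoop sorts/groups the map output by key before Reduce runs.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # One Reduce call per key: sum the counts for that word.
    return (key, sum(values))

records = ["big data hadoop", "big data"]
mapped = [pair for r in records for pair in map_phase(r)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# result == {"big": 2, "data": 2, "hadoop": 1}
```

In a real cluster the map and reduce calls run in parallel on different machines, and the shuffle moves data between them over the network; the logic, however, is exactly this.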

A Hadoop cluster is mainly composed of a NameNode, DataNodes, a Secondary NameNode, a JobTracker, and TaskTrackers, as shown below:

The NameNode records how each file is split into blocks and which DataNodes those blocks are stored on; it also keeps the runtime state of the file system. The DataNodes store the split blocks. The Secondary NameNode helps the NameNode collect the runtime state of the file system. The JobTracker is responsible for running jobs when tasks are submitted to the Hadoop cluster and for scheduling multiple TaskTrackers. Each TaskTracker is responsible for a single map or reduce task.



Origin blog.csdn.net/sdddddddddddg/article/details/91402247