Rookie of Hadoop Quick Start

One, Concepts

1, Big Data

Big Data is both a concept and a set of technologies: techniques for analyzing all kinds of data, built on big data platforms with the Hadoop framework as the representative example.

Big data covers foundational platforms and frameworks represented by Hadoop and Spark, and also includes real-time and offline data processing, data analysis, data mining, and predictive analysis techniques such as machine learning algorithms.

2, Hadoop

Hadoop is an open-source big data framework and a distributed computing solution.

Hadoop's two cores solve two problems: data storage (the HDFS distributed file system) and distributed computing (MapReduce).

Example 1: a user requests data by path. The data may be stored across many machines, but the user does not need to know which machine holds it; HDFS fetches it automatically.

Example 2: suppose a 100 PB file must be filtered for the lines containing a desired string. In this scenario, HDFS distributed storage breaks the limit of a single server's disk size, solving the problem that one machine cannot store such a large file, while MapReduce distributed computing splits the calculation into jobs over the file's shards and finally merges the results into one output.
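As a rough sketch of this scenario, Hadoop ships with a distributed grep example job; the jar path and HDFS paths below are assumptions to adapt to your installation:

# distributed grep: Map tasks filter matching lines per block, Reduce merges the results
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep /input /grep-output 'desired-string'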

Two, Hadoop Features

Advantages

1, Support for very large files. Files stored in HDFS can reach TB and even PB scale.

2, Detection of and rapid response to hardware failure. Data is replicated, and the NameNode uses a heartbeat mechanism to detect whether each DataNode is still alive.

3, High scalability. A cluster can be built from inexpensive machines and scaled linearly (horizontally); when new nodes are added to the cluster, the NameNode detects them and distributes replicated data blocks to them.

4, Mature ecosystem. Powered by open source, many small companion tools have grown up around Hadoop.

Disadvantages

1, Cannot deliver low latency. HDFS is optimized for high data throughput at the expense of data-access latency.

2, Not suitable for storing a large number of small files (each file's metadata occupies NameNode memory).

3, Low file-modification efficiency. HDFS is designed for write-once, read-many scenarios.

Three, HDFS Introduction

1, HDFS Architecture

HDFS follows a master/slave architecture, consisting mainly of the NameNode, the Secondary NameNode, and DataNodes.


NameNode

Manages the HDFS namespace and stores the file system's metadata: the mapping from each file to its data blocks, and from each block to the DataNodes that hold it.

If the NameNode goes down, files can no longer be reconstructed from their blocks. What can be done? What fault-tolerance mechanisms exist?

Hadoop can be configured for HA, i.e. as a high-availability cluster, with two NameNode nodes: one active master node and one standby node, the two kept consistent at all times. When the master node becomes unavailable, the standby node switches over automatically and immediately, transparently to users, avoiding a NameNode single point of failure.

Secondary NameNode

An assistant to the NameNode that shares part of its work (for example, periodically merging the edit log into the namespace image) and can help recover the NameNode in an emergency.

DataNode

The slave nodes, where the data is actually stored; they read and write data blocks and report stored-block information to the NameNode.

2, HDFS file read and write

Files are stored on DataNodes as data blocks. The block is the abstract unit of storage and transfer, rather than the entire file.


Why are files stored as blocks?

First, blocks hide the concept of the file itself, simplifying the storage system's design: for example, a 100 TB file is larger than any single disk, so it must be split into many data blocks stored across many disks. Second, to keep data safe it must be replicated, and data blocks are ideal units for replication, improving the fault tolerance and availability of the data.

What should be considered when setting the block size?

If the block size is too small, ordinary files are split into many blocks, so reading a file requires looking up many block addresses; this is inefficient and also consumes NameNode memory heavily. If the block size is too large, parallelism suffers; and if the system needs to restart and reload data, larger blocks make recovery take longer.
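For reference, the HDFS default block size is 128 MB in Hadoop 2.x and 3.x (the dfs.blocksize property). A small sketch of inspecting how a file was actually split, with an example path:

# show the blocks, sizes, and DataNode locations backing a file
hdfs fsck /input/bigfile.dat -files -blocks -locations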

3.2.1 HDFS file reading process


1, The client communicates with the NameNode to query metadata (which DataNode holds each block) and locates the DataNode servers where the file resides.

2, The client picks a DataNode server (nearest first, then randomly) and requests a socket stream to it.

3, The DataNode starts sending the data (streaming it in from disk, with checksums verified packet by packet).

4, The client receives the data packet by packet, caches it locally, then writes it into the target file; each later block is appended after the earlier ones until they finally form the complete file the client wanted.
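From the client's point of view, the whole read path above is driven by a single command; a minimal sketch with example paths:

# print an HDFS file; the client library performs steps 1-4 transparently
hdfs dfs -cat /input/file.txt

# or copy it down to the local file system
hdfs dfs -get /input/file.txt ./file.txt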

3.2.2 HDFS file write process


1, The client sends the NameNode a request to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.

2, The NameNode confirms that the upload can proceed.

3, The client first splits the file. For example, with a 128 MB block size, a 300 MB file is split into three blocks: 128 MB, 128 MB, and 44 MB. The client then asks which DataNode servers the first block should be sent to.

4, The NameNode returns a list of DataNode servers.

5, The client asks one DataNode to receive the data; on getting the request, that first DataNode calls the second, and the second calls the third, until the whole pipeline is established and acknowledgments are returned to the client level by level.

6, The client starts uploading the first block to the first DataNode (which verifies the data as it is written); the first DataNode forwards it to the second, and the second to the third.

7, After one block finishes transferring, the client asks the NameNode again for the servers to receive the second block.
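Likewise, the whole write pipeline above is driven by one upload command; a minimal sketch with example paths:

# upload a local file; the client splits it into blocks and drives steps 1-7
hdfs dfs -put ./bigfile.dat /input/bigfile.dat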

Four, MapReduce Introduction

1, Concept

MapReduce is a programming model, a programming method, and an abstraction built on the divide-and-conquer idea. The core of the MapReduce framework has two main steps: Map and Reduce. Each file shard is processed by a separate machine; that is the Map step. The results computed by all the machines are then aggregated into the final result; that is the Reduce step.
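As an analogy only (not Hadoop itself), word counting in a Unix pipeline shows the same split-process-merge shape, assuming an example input.txt:

# "map": emit one word per line; "shuffle": sort identical words together; "reduce": count each run
tr -s ' ' '\n' < input.txt | sort | uniq -c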

2, Workflow

When a computing job is submitted to the MapReduce framework, it first splits the job into several Map tasks and dispatches them to different nodes for execution; each Map task processes part of the input data. When a Map task completes, it produces intermediate files, and these intermediate files become the input to the Reduce tasks. The main goal of the Reduce tasks is to aggregate the outputs of the preceding Map tasks and produce the final output.


3, Running a MapReduce Example

Run the classic MapReduce example WordCount that ships with Hadoop, which counts the words appearing in a text and how many times each occurs. First, submit the job to the Hadoop framework.
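A minimal sketch of the submission, assuming the text to count has already been uploaded to /input and that the examples jar version matches your installation:

# submit the built-in wordcount job; the /output directory must not exist yet
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount /input /output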


After the MapReduce job finishes, inspect the output file directory and the result contents.
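A sketch of the inspection commands, assuming the /output path used above:

# _SUCCESS marks a completed job; part-r-* files hold the reducer output
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000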


You can see the result: the number of times each word occurs.


Five, Hadoop Installation

Strongly recommended: the most detailed Hadoop environment setup guide ever (https://blog.csdn.net/hliq5399/article/details/78193113)

1, Hadoop Deployment Modes

Local mode

Pseudo-distributed mode

Fully distributed mode

These deployment modes are distinguished by how many JVM processes, and how many machines, the NameNode, DataNode, ResourceManager, NodeManager, and other modules run on.


2, Installation Steps (Pseudo-Distributed Mode as the Example)

Hadoop is usually learned in pseudo-distributed mode. In this mode, each Hadoop module runs in its own process on a single machine. "Pseudo-distributed" means that although the modules run in separate processes, they all run on one operating system, so it is not truly distributed.

5.2.1 Download and unpack the JDK, and configure the Java environment variables

export JAVA_HOME=/home/admin/apps/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
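After appending these lines to your shell profile (for example /etc/profile or ~/.bashrc; which file is an assumption), reload and verify:

# reload the environment and confirm the JDK is on the PATH
source /etc/profile
java -version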


5.2.2 Download and unpack Hadoop, and configure the Hadoop environment variables

export HADOOP_HOME="/zmq/modules/hadoop/hadoop-3.1.0"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
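Reload the profile again and verify the installation with a minimal check:

# should print Hadoop 3.1.0 along with build information
hadoop version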


5.2.3 Set the JAVA_HOME parameter in hadoop-env.sh, mapred-env.sh, and yarn-env.sh

exportJAVA_HOME="/home/admin/apps/jdk1.8.0_151"


5.2.4 Configure core-site.xml with the HDFS address and the Hadoop temporary directory
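A minimal sketch of what the file might contain; the hostname, port, and temporary path are placeholder assumptions to adapt:

<configuration>
  <!-- HDFS address: the NameNode RPC endpoint (placeholder host/port) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Hadoop temporary directory (placeholder path) -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/data/tmp</value>
  </property>
</configuration>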


5.2.5 Configure hdfs-site.xml to set the HDFS replication factor; for a pseudo-distributed deployment, set it to 1
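A minimal sketch of the corresponding configuration:

<configuration>
  <!-- replication factor 1: pseudo-distributed mode has only a single DataNode -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>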


5.2.6 Format HDFS; start the NameNode, DataNode, and SecondaryNameNode; check the processes
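A sketch of the commands for Hadoop 3.x (the older hadoop-daemon.sh scripts also work):

# format the NameNode (first run only; this erases existing HDFS metadata)
hdfs namenode -format

# start the three HDFS daemons
hdfs --daemon start namenode
hdfs --daemon start datanode
hdfs --daemon start secondarynamenode

# list the running Java processes to confirm all three are up
jps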


5.2.7 With the setup complete, operate HDFS (common operations: creating directories, uploading and downloading files, etc.) and run a MapReduce job
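A sketch of the common operations with example paths, ending with a job submission:

# create a directory, upload a file, list it, and download it back
hdfs dfs -mkdir -p /input
hdfs dfs -put localfile.txt /input
hdfs dfs -ls /input
hdfs dfs -get /input/localfile.txt ./localfile-copy.txt

# run a MapReduce job over the uploaded data
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount /input /output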

Six, More Hadoop

The above is only a preliminary look at learning and using Hadoop. Fully distributed Hadoop HA deployment, YARN resource scheduling, Hadoop's high availability and fault tolerance, other Hadoop ecosystem components, and more remain to be studied. The Hadoop waters run deep, haha.

About the author: Dream Piano, 2+ years of testing experience, currently responsible mainly for testing internal platforms and the delivery testing of some external product projects.


Source: blog.51cto.com/14463231/2425940