Big Data Introduction to Hadoop: Hadoop Installation and Configuration, HDFS Pseudo-Distributed Deployment (Part 1)

Disclaimer: This is an original article by the blogger and follows the CC 4.0 BY-SA copyright agreement. When reproducing it, please attach the original source link and this statement.
Original link: https://blog.csdn.net/ck784101777/article/details/102676795

I. Overview (skip to Section II for deployment)

1. The origin of Big Data

  With the development of computer technology and the spread of the Internet, the amount of accumulated information has become enormous and its growth keeps accelerating. As the Internet and the Internet of Things expand, information is exploding, and collecting, retrieving, and analyzing it becomes ever harder. Traditional database architectures struggle to cope with this change, so new technologies are needed to solve these problems.

2. What is Big Data

- Big data refers to data sets that cannot be captured, managed, and processed with conventional tools within a reasonable time frame

- Big data is a high-growth, diverse information asset that requires new processing models to deliver stronger decision-making power, insight discovery, and process-optimization capabilities

- Big data means quickly extracting valuable information from data of many different types

3. What Big Data Can Do

- Businesses use data analysis to reduce costs, improve efficiency, develop new products, and make more informed business decisions

- Collected data is analyzed to derive relationships that can be used to detect business trends, judge the quality of research, prevent the spread of disease, fight crime, or measure traffic conditions

- Supporting technologies include massively parallel databases, data-mining grids, distributed file systems and databases, and scalable cloud storage systems

4. Characteristics of Big Data

-Volume (large scale)

Data can range from hundreds of TB to tens or hundreds of PB, or even EB

-Variety (diversity)

Big data includes data in many different formats and forms

-Velocity (timeliness)

Much big data must be processed within a certain time limit

-Veracity (accuracy)

Processing results must guarantee a certain degree of accuracy

-Value (high value)

Big data contains deep value; analyzing and mining it can bring enormous commercial value

 

5. Big Data and Hadoop

What is Hadoop?

-Hadoop is a software platform for analyzing and processing massive amounts of data

-Hadoop is open-source software developed in Java

-Hadoop provides a distributed infrastructure

- High reliability, scalability, high efficiency, high fault tolerance, and low cost

 

6. The Origin of Hadoop

Beginning in 2003, Google published three papers in succession: GFS, MapReduce, and BigTable. These three technologies became Google's "troika"; although the source code was never released, the papers laid out the detailed design of the three products.

GFS:

GFS is a scalable distributed file system for large, distributed applications that access large amounts of data. It runs on inexpensive commodity hardware and provides fault tolerance.

MapReduce:

MapReduce is a programming model for distributed parallel computing, composed of Map and Reduce: Map maps the work out, dispatching tasks to multiple workers, and Reduce performs the reduction, merging the results computed by the workers.

BigTable:

BigTable stores structured data and is built on top of GFS, Scheduler, Lock Service, and MapReduce.

 

7. The Relationship Between Hadoop and Google

Hadoop was developed with funding from Yahoo, based on Google's three papers and written in Java. Its performance was far worse than Google's internal systems, but it was open source and anyone could use it, so it gradually became the mainstream big data tool; even the big data tools Google released later made compromises and chose to be compatible with Hadoop.

Hadoop was developed from Google's three papers and rests on three corresponding technologies: HDFS, MapReduce, and HBase.

The correspondence between Hadoop and Google's three papers:

GFS-->HDFS

MapReduce-->MapReduce

BigTable-->HBase

 

8. Hadoop Common Components

HDFS: Hadoop Distributed File System (core)

MapReduce: Distributed Computing Framework (core)

Yarn: cluster resource management system (core)

Zookeeper: Distributed Collaboration Services

HBase: column-oriented distributed database

Hive: Hadoop-based data warehouse

Sqoop: Data Synchronization Tool

Pig: Hadoop-based data flow system

Mahout: data mining algorithms library

Flume: log collection tool

 

9. The Relationship Between HDFS, Yarn, and MapReduce

HDFS is the distributed file system that acts as the storage layer, with the data stored on its nodes; Yarn manages the resources of the cluster that the HDFS data lives on; and MapReduce is the program that computes on and integrates the data.

 

10. A Closer Look at HDFS

A Hadoop cluster is essentially an HDFS cluster, so since HDFS has come up, let's talk about what it is.

HDFS is, at heart, a file system, similar to FastDFS; like Baidu Cloud or Alibaba Cloud, it is a file storage system. Of course, if all you need is file storage, FastDFS alone would be enough. HDFS aims to be more than simple file storage: it is also designed for distributed computing.

An HDFS system has a NameNode and DataNodes. The NameNode holds the directory of the entire file system, stored in memory, with detailed metadata about each file such as its name, size, creation time, and block locations. The DataNodes hold the data itself, that is, the file content split into smaller blocks. The diagram above already covers this, so it will not be repeated here.

Roles:

-Client

Splits files, accesses HDFS, interacts with the NameNode to obtain file location information, and interacts with DataNodes to read and write data

-NameNode

The master node: manages the HDFS namespace and block mapping information, configures the replication policy, and handles all client requests

-Secondary NameNode

Periodically merges the fsimage and edits files and pushes the result to the NameNode; in an emergency it can help recover the NameNode

-DataNode

The data storage node: stores the actual data and reports storage information to the NameNode

 

How HDFS works is a little difficult to express in words, so here is a diagram:

Look at it in order:

1) First, the master node, the NameNode

Its primary job is to split a large file and distribute the pieces to the DataNode child nodes (say the file is split into n small blocks). It records the BLOCK_ID of each of the n blocks (so it knows which child node each block is stored on) and has each block replicated to other child nodes (replication factor minus one extra copies).

You can think of it as a book's table of contents: an article is divided into n sections, and the page numbers are recorded.

Core functions: splitting, replication, and locating

2) Next, the child nodes, the DataNodes

The master node establishes heartbeats with the child nodes to confirm their status; if a child node goes down, the data it held is copied to other nodes.
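
To make the Client role above concrete, here is a minimal Java sketch of an HDFS client that writes and then reads a file: metadata requests go to the NameNode, while the actual bytes are streamed to and from the DataNodes. The address hdfs://nn01:9000 matches the core-site.xml configured later in this article; the path /demo/hello.txt is only an illustrative assumption.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import java.net.URI;
  import java.nio.charset.StandardCharsets;

  public class HdfsClientDemo {
      public static void main(String[] args) throws Exception {
          // Metadata goes to the NameNode; block data is streamed to/from the DataNodes.
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(URI.create("hdfs://nn01:9000"), conf);

          Path file = new Path("/demo/hello.txt");   // example path (assumption)

          // Write: the NameNode chooses DataNodes, the client streams the block data to them.
          try (FSDataOutputStream out = fs.create(file, true)) {
              out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
          }

          // Read: the NameNode returns block locations, the client reads from the DataNodes.
          try (FSDataInputStream in = fs.open(file)) {
              byte[] buf = new byte[128];
              int n = in.read(buf);
              System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
          }

          fs.close();
      }
  }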

 

11. A Closer Look at Yarn

Yarn is a new Hadoop resource manager. It is a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications, and its introduction brought major benefits to the cluster in terms of utilization, unified resource management, and data sharing.

Yarn roles:

-ResourceManager

Handles client requests, launches and monitors ApplicationMasters, monitors NodeManagers, and performs resource allocation and scheduling

-NodeManager

Manages resources on a single node and handles commands from the ResourceManager and the ApplicationMaster

-ApplicationMaster

Splits the data, requests resources for the application and assigns them to its internal tasks, and handles task monitoring and fault tolerance

-Container

An abstraction of the task runtime environment: it encapsulates multi-dimensional resources such as CPU and memory, along with environment variables, launch commands, and other information related to running the task

-Client

The client program through which users interact with Yarn: it submits applications, monitors their status, kills applications, and so on
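
As a small illustration of the Client role (Yarn itself is not deployed in this part of the article, so this is only a sketch), the code below uses Hadoop's YarnClient API to connect to the ResourceManager and list the applications it is tracking. It assumes a yarn-site.xml on the classpath that points at the ResourceManager.

  import org.apache.hadoop.yarn.api.records.ApplicationReport;
  import org.apache.hadoop.yarn.client.api.YarnClient;
  import org.apache.hadoop.yarn.conf.YarnConfiguration;
  import java.util.List;

  public class YarnClientDemo {
      public static void main(String[] args) throws Exception {
          // The ResourceManager address is read from yarn-site.xml on the classpath (assumption).
          YarnClient yarnClient = YarnClient.createYarnClient();
          yarnClient.init(new YarnConfiguration());
          yarnClient.start();

          // Ask the ResourceManager for every application it is tracking.
          List<ApplicationReport> apps = yarnClient.getApplications();
          for (ApplicationReport app : apps) {
              System.out.println(app.getApplicationId() + "  "
                      + app.getName() + "  "
                      + app.getYarnApplicationState());
          }

          yarnClient.stop();
      }
  }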

 

 

12. A Closer Look at MapReduce

MapReduce roles and structure:

-JobTracker

The single master node: manages and monitors all jobs/tasks, handles errors, breaks a job into a series of tasks, and dispatches them to TaskTrackers

-TaskTracker

The slave nodes, usually several: run Map Tasks and Reduce Tasks, interact with the JobTracker, and report task status

-Map Task

Parses each data record, passes it to the user-written map() function and executes it, and writes the output to local disk (for map-only jobs it is written directly to HDFS)

-Reduce Task

Remotely reads its input data from the Map Tasks' output, sorts it, and passes each group of data to the user-written reduce() function for execution

 

 

 

How MapReduce works:

MapReduce is a programming model for parallel processing of large data sets (larger than 1 TB). Its main ideas are the concepts "Map" and "Reduce", both borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it vastly easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system. In the current implementation, you specify a Map function that turns a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function that ensures all mapped key/value pairs sharing the same key are grouped together.

The key to understanding MapReduce is understanding the Map and Reduce functions.

Map produces a set of key/value pairs, using each individual string as the key and 1 as the value.

Reduce aggregates the key/value pairs produced by Map: pairs with the same key are grouped together and their values summed, finally giving the total count for each key.

The map and reduce functions have to be written by hand; defining your own rules for how the data is aggregated is the heart of big data work. The figure below shows counting string occurrences with wordcount.
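
As a concrete sketch of those two hand-written functions, here is the classic WordCount in Java, essentially the standard example that ships with Hadoop, trimmed slightly: the Mapper emits a (word, 1) pair for every token, and the Reducer sums the values for each key.

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Map: for every word in the input line, emit (word, 1).
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private final Text word = new Text();

          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, one);
              }
          }
      }

      // Reduce: for each key (word), sum the 1s emitted by the mappers.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. "input"
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. "output" (must not exist yet)
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }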

How MapReduce works is shown in the figure below:

The MapReduce program you write runs on the DataNodes. The key concept here is moving computation to the data: the data files are stored on the DataNodes, so the program is sent to the DataNode to read and analyze the data in place. During the job, data does get sent between DataNodes: for example, when DataNode1 shuffles the data it has read, all records with the same key must be handed to a single reduce call as one group, and some records with that key will naturally live on other DataNodes, so data must be transferred. The wordcount.txt under Input is the file data on a DataNode. The Split phase is always performed by MapReduce; that is part of its rules. The map phase is where we must intervene by hand, writing code to analyze the data into map output; the shuffle phase then redistributes the data automatically, with the rule that records sharing the same key form one group and trigger one reduce call, which compacts the data. Reduce is also hand-written: we code the computation, counting the occurrences of each key, and once the counting is done a result file can be written out. Throughout the whole process the data is carried in the context object. Below is a diagram borrowed from the web; the cooperation between HDFS and MapReduce is roughly as described.

 

 

 

13. The Hadoop Ecosystem

 

II. Hadoop Deployment: Standalone Mode

1. Hadoop has three deployment modes

-Standalone

-Pseudo-distributed

-Fully distributed

2. Installing and using Hadoop in standalone mode

You need to install a JDK and configure the Java environment.

1) Install the Java environment

  [root@nn01 ~]# yum -y install java-1.8.0-openjdk-devel  # any version 1.8 or later works
  [root@nn01 ~]# java -version
  openjdk version "1.8.0_131"
  OpenJDK Runtime Environment (build 1.8.0_131-b12)
  OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)
  [root@nn01 ~]# jps
  1235 Jps

2) Install Hadoop

Installing Hadoop requires a few packages, which can be downloaded from my GitHub:

[Hadoop-related packages] https://github.com/ck784101777/hadoop

  [root@nn01 ~]# cd hadoop/
  [root@nn01 hadoop]# ls  // this example needs hadoop-2.7.7.tar.gz
  hadoop-2.7.7.tar.gz kafka_2.12-2.1.0.tgz zookeeper-3.4.13.tar.gz
  [root@nn01 hadoop]# tar -xf hadoop-2.7.7.tar.gz
  [root@nn01 hadoop]# mv hadoop-2.7.7 /usr/local/hadoop  # no installation needed, just move it into place
  [root@nn01 hadoop]# cd /usr/local/hadoop
  [root@nn01 hadoop]# ls
  bin include libexec NOTICE.txt sbin
  etc lib LICENSE.txt README.txt share
  [root@nn01 hadoop]# ./bin/hadoop   // running it reports an error: JAVA_HOME not found
  Error: JAVA_HOME is not set and could not be found.
  [root@nn01 hadoop]#

3) Fix the error

The error occurs because the JAVA_HOME environment is not configured; the default is ${JAVA....}. Modify it as follows:

  [root@nn01 hadoop]# rpm -ql java-1.8.0-openjdk
  /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/bin/policytool
  /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/lib/amd64/libawt_xawt.so

  [root@nn01 hadoop]# cd ./etc/hadoop/
  [root@nn01 hadoop]# vim hadoop-env.sh  # 25 and 33 are line numbers; use :set nu to show them
  25 export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre"
  33 export HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"

4) Using Hadoop for the first time

This example uses Hadoop's wordcount (word-frequency counting) example to count how many times each string appears in a set of files.

Format:

./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar  wordcount  <input directory or file>  <output directory>

  [root@nn01 ~]# cd /usr/local/hadoop/
  [root@nn01 hadoop]# ./bin/hadoop  # runs without error this time
  Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME run the class named CLASSNAME
  .......

  [root@nn01 hadoop]# mkdir /usr/local/hadoop/input      # create a directory
  [root@nn01 hadoop]# ls
  bin etc include lib libexec LICENSE.txt NOTICE.txt input README.txt sbin share
  [root@nn01 hadoop]# cp *.txt /usr/local/hadoop/input
  [root@nn01 hadoop]# ./bin/hadoop jar \
  share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount input output
  // wordcount is the program argument: count the input directory and store the result in output
  // (output must not already exist; if it does, an error is raised to prevent overwriting data)
  [root@nn01 hadoop]# cat output/part-r-00000   // view the result
  [root@nn01 output]# cat part-r-00000    # string    count
  ""AS    2
  "AS    17
  "COPYRIGHTS    1
  "Contribution"    2
  "Contributor"    2
  "Derivative    1
  "GCC    1
  "Legal    1
  "License"    1
  "License");    2
  "Licensed    1
  "Licensor"    1

 

III. Hadoop Pseudo-Distributed Deployment

The pseudo-distributed configuration is similar to the fully distributed configuration; the difference is that all roles (Client, NameNode, Secondary NameNode) are installed on one machine and use local disk. Production environments generally use the fully distributed mode; the pseudo-distributed mode is for learning and testing Hadoop's features.

 

 

 

According to the HDFS architecture diagram, we need at least four hosts: one NameNode (which also runs the Secondary NameNode) and three DataNodes.

The overall procedure is as follows:

-Prepare the runtime environment: four hosts

-Configure name resolution

-Install the JDK (Java environment) on all hosts

-Generate a key pair on the master node and copy it to every node, so all four nodes can be reached without a password

-Install and deploy Hadoop on the NameNode and DataNodes

-Edit the slaves configuration file

-Edit the core-site.xml configuration file

-Edit the hdfs-site.xml configuration file

-Format the NameNode

-Test

Step 1: Prepare the runtime environment

1) Set the hostnames of the three node machines to node1, node2, and node3, configure their IP addresses (as shown in Figure 1), and configure the yum repository (system repository).

2) Edit /etc/hosts (do the same on all four hosts; nn01 is shown as the example)

Configure the hosts file so the machines can reach one another by hostname.

  [root@nn01 ~]# vim /etc/hosts
  192.168.1.60 nn01
  192.168.1.61 node1
  192.168.1.62 node2
  192.168.1.63 node3

3) Install the Java environment on node1, node2, and node3 (node1 shown as the example)

  [root@node1 ~]# yum -y install java-1.8.0-openjdk-devel

4) Set up SSH trust relationships

Generate a key pair on the NameNode (nn01) and distribute it to every node.

  [root@nn01 ~]# vim /etc/ssh/ssh_config  // so the first login does not prompt for "yes"
  Host *
  GSSAPIAuthentication yes
  StrictHostKeyChecking no
  [root@nn01 .ssh]# ssh-keygen
  [root@nn01 .ssh]# for i in 60 61 62 63 ; do ssh-copy-id 192.168.1.$i; done
  // distribute the public key to nn01, node1, node2, node3

5) Test the trust relationships

Logging in without being asked for a password proves the trust relationship is in place.

  [root@nn01 .ssh]# ssh node1
  Last login: Fri Sep 7 16:52:00 2018 from 192.168.1.60
  [root@node1 ~]# exit
  logout
  Connection to node1 closed.
  [root@nn01 .ssh]# ssh node2
  Last login: Fri Sep 7 16:52:05 2018 from 192.168.1.60
  [root@node2 ~]# exit
  logout
  Connection to node2 closed.
  [root@nn01 .ssh]# ssh node3

Step 2: Configure Hadoop

Environment configuration file: hadoop-env.sh

Core configuration file: core-site.xml

HDFS configuration file: hdfs-site.xml

Node configuration file: slaves

Installing Hadoop itself was covered in Section II.

1) Edit the slaves file

Edit it on the NameNode (nn01):

  [root@nn01 ~]# cd /usr/local/hadoop/etc/hadoop
  [root@nn01 hadoop]# vim slaves
  node1
  node2
  node3

2) Hadoop's core configuration file, core-site.xml

fs.defaultFS: file system configuration parameter

hadoop.tmp.dir: data directory configuration parameter

  [root@nn01 hadoop]# vim core-site.xml
  <configuration>
      <property>
          <name>fs.defaultFS</name>
          <value>hdfs://nn01:9000</value>     # hdfs:// followed by a name of your choosing
      </property>
      <property>
          <name>hadoop.tmp.dir</name>
          <value>/var/hadoop</value>          # directory where Hadoop stores its data
      </property>
  </configuration>

  [root@nn01 hadoop]# mkdir /var/hadoop        // Hadoop's data root directory

3) Configure hdfs-site.xml

Define the NameNode and Secondary NameNode addresses (both on the same host) and the replication factor; the slave nodes are the three DataNodes.

  [root@nn01 hadoop]# vim hdfs-site.xml
  <configuration>
      <property>
          <name>dfs.namenode.http-address</name>
          <value>nn01:50070</value>
      </property>
      <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>nn01:50090</value>
      </property>
      <property>
          <name>dfs.replication</name>
          <value>2</value>
      </property>
  </configuration>

4) Sync the configuration to node1, node2, and node3

Use rsync to synchronize the Hadoop directory to the node machines.

  [root@nn01 hadoop]# for i in 61 62 63 ; do rsync -aSH --delete /usr/local/hadoop/ \
  192.168.1.$i:/usr/local/hadoop/ -e 'ssh' & done
  [1] 23260
  [2] 23261
  [3] 23262

5) Check that the sync succeeded

  [root@nn01 hadoop]# ssh node1 ls /usr/local/hadoop/
  .............

Step 3: Format the NameNode

Format the NameNode, verify that the expected roles are present, and check whether the cluster was created successfully and how many nodes it has.

  [root@nn01 hadoop]# cd /usr/local/hadoop/
  [root@nn01 hadoop]# ./bin/hdfs namenode -format            // format the NameNode
  [root@nn01 hadoop]# ./sbin/start-dfs.sh                    // start HDFS
  [root@nn01 hadoop]# jps                                    // verify the roles
  23408 NameNode
  23700 Jps
  23591 SecondaryNameNode
  [root@nn01 hadoop]# ./bin/hdfs dfsadmin -report        // check whether the cluster formed successfully
  Live datanodes (3):                                    // three DataNodes are up

 

 
