Hadoop: Basics and Installation (Part 1)

Brief introduction

These notes record the basic principles of Hadoop's main components, their processes, and other key points, covering HDFS, YARN, MapReduce, and so on.

Background

  • Data is being generated faster than ever, and machines generate it faster still. Since more data usually beats better algorithms, additional methods of processing data are needed.

  • Hard disk capacity keeps increasing, but disk performance has not kept up. The solution is to spread the data across multiple disks and read them in parallel, but this brings problems of its own:

    Hardware failure: solved by replicating the data (RAID)

    Analysis that must combine data read from different disks: solved by MapReduce

What Hadoop provides

  1. Reliable shared storage (distributed storage)
  2. Abstract analysis interface (distributed analysis)

Big Data

Concept

Data too large to be processed on a single machine.

The core of big data: sample = whole population (analyze all of the data rather than a sample of it)

Characteristics

  • Volume:  in big data, individual files are generally at least tens or hundreds of GB
  • Velocity:  reflected in how quickly data is created and how frequently it changes
  • Variety:  refers to the diverse types and sources of data; by structure, data can be classified as structured, semi-structured, and unstructured
  • Volatility:  fast-moving data streams also fluctuate; a volatile stream shows periodic peaks tied to seasons or to days when specific events occur
  • Veracity:  also known as data assurance; data collected through different channels varies greatly in quality, and the error level and confidence of the analysis output depend largely on the quality of the collected data
  • Complexity:  reflected in data management and operations; extracting, transforming, loading, and joining data, and grasping the useful information inherent in it, becomes more challenging

Key Techniques

  1. Distribute data across multiple machines

    Reliability: each data block is replicated to multiple nodes

    Performance: multiple nodes process the data simultaneously

  2. Move the computation to the data

    Network IO speed << local disk IO speed, so a big data system tries to schedule each task on (or as near as possible to) the machine holding its data: the program and its dependencies are copied to that machine and run there.

    Migrating code to the data avoids moving data at large scale; as far as possible, computation on a piece of data happens on the machine that stores it.

  3. Replace random IO with sequential IO

    With large sequential reads and writes, seek time << transfer time, so seeks are amortized away; accordingly, data is generally written once and never modified in place.
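As a back-of-envelope illustration of why sequential IO wins (the 10 ms seek time and 100 MB/s transfer rate below are assumed figures, not measurements from this document): reading 100 MB sequentially pays one seek, while reading it as 1,000 random 100 KB chunks pays 1,000 seeks on top of the same transfer time.

```shell
# Assumed figures: 10 ms average seek, 100 MB/s transfer rate, 100 MB of data.
seek_ms=10
transfer_ms=1000                              # 100 MB / (100 MB/s) = 1000 ms
seq_ms=$(( seek_ms + transfer_ms ))           # 1 seek + full transfer
rand_ms=$(( 1000 * seek_ms + transfer_ms ))   # 1000 seeks + same transfer
echo "sequential: ${seq_ms} ms, random: ${rand_ms} ms"
```

With these figures, random access is roughly ten times slower even though exactly the same bytes are moved, which is why HDFS favors large blocks read and written sequentially.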

 

The origins of Hadoop

1) Lucene: open-source software created by Doug Cutting, written in Java, implementing full-text search similar to Google's; it provides a full-text search engine architecture, including a complete query engine and indexing engine

2) At the end of 2001, Lucene became a subproject of the Apache Foundation

3) For large-data scenarios, Lucene faced the same difficulties as Google

4) Learning from and imitating Google's solutions to these problems produced a miniature version: Nutch

5) Google can be said to be the source of Hadoop's ideas (Google's three big-data papers):

GFS --->HDFS

Map-Reduce --->MR

BigTable --->HBase

6) In 2003-2004, Google disclosed the details of GFS and the ideas behind MapReduce. Building on this in their spare time over two years, Doug Cutting and others implemented the DFS and MapReduce mechanisms, and Nutch's performance soared

7) In 2005, Hadoop was formally introduced into the Apache Foundation as part of Nutch, a subproject of Lucene. In March 2006, Map-Reduce and the Nutch Distributed File System (NDFS) were incorporated into a project named Hadoop

8) The name comes from a toy elephant belonging to Doug Cutting's son

9) Hadoop was born and developed rapidly, marking the start of the era of cloud computing

Advantages of Hadoop

1) High reliability: Hadoop maintains multiple copies of data at the storage layer, so even if a Hadoop compute or storage element fails, no data is lost.

2) High scalability: tasks are distributed across the cluster's nodes, and the cluster can easily be extended to thousands of nodes.

3) Efficiency: following the MapReduce model, Hadoop works in parallel to speed up task processing.

4) High fault tolerance: failed tasks are automatically redistributed.

Composition of Hadoop

HDFS Architecture Overview

The HDFS (Hadoop Distributed File System) architecture is shown in a figure in the original post (not reproduced here). In brief:

1) NameNode (nn): stores file metadata, such as file names, the directory structure, file attributes, and each file's block list and the DataNodes holding those blocks;

2) DataNode (dn): stores file block data, and the checksums of those blocks, on the local file system;

3) Secondary NameNode (2nn): periodically checkpoints HDFS metadata.

YARN Architecture Overview

1) ResourceManager (RM): handles client requests, starts and monitors ApplicationMasters, monitors NodeManagers, and performs resource allocation and scheduling;

2) NodeManager (NM): manages the resources on a single node and processes commands from the ResourceManager and from ApplicationMasters;

3) ApplicationMaster (AM): splits the input data, requests resources for the application and assigns them to its internal tasks, and handles task monitoring and fault tolerance;

4) Container: an abstraction of the task-running environment that encapsulates multi-dimensional resources such as CPU and memory, together with environment variables, startup commands, and other information needed to run a task.
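These components are wired together through configuration. As a minimal sketch only (the hostname hadoop001 and the 4096 MB limit are illustrative assumptions, not values from this install), a yarn-site.xml might tell NodeManagers and clients where the ResourceManager lives and cap the memory a node can hand out to Containers:

```xml
<!-- Hypothetical yarn-site.xml fragment; values are illustrative only -->
<configuration>
  <!-- Where NodeManagers and clients find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop001</value>
  </property>
  <!-- Total memory a single NodeManager may allocate to Containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
</configuration>
```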

Setting up the Hadoop runtime environment

I bought an Alibaba Cloud server, so the following operations are done directly on Alibaba Cloud.

[root@iZbp1efx14jd8471u20gpaZ ~]# hostnamectl set-hostname hadoop001
[root@iZbp1efx14jd8471u20gpaZ ~]# cd /opt
[root@iZbp1efx14jd8471u20gpaZ opt]# mkdir module
[root@iZbp1efx14jd8471u20gpaZ opt]# mkdir software
[root@iZbp1efx14jd8471u20gpaZ opt]#  rpm -qa | grep java
[root@iZbp1efx14jd8471u20gpaZ opt]# cd software/
[root@iZbp1efx14jd8471u20gpaZ software]# ll
total 257012
-rw-r--r-- 1 root root  77660160 Jan 13 20:26 hadoop-2.7.2.tar.gz
-rw-r--r-- 1 root root 185515842 Jan 13 20:24 jdk-8u144-linux-x64.tar.gz
[root@iZbp1efx14jd8471u20gpaZ software]# tar -zvxf jdk-8u144-linux-x64.tar.gz /opt/module/ -C
tar: option requires an argument -- 'C'
Try `tar --help' or `tar --usage' for more information.
[root@iZbp1efx14jd8471u20gpaZ software]# tar -zvxf jdk-8u144-linux-x64.tar.gz -C /opt/module/
jdk1.8.0_144/
jdk1.8.0_144/THIRDPARTYLICENSEREADME-JAVAFX.txt
jdk1.8.0_144/THIRDPARTYLICENSEREADME.txt

 

[root@iZbp1efx14jd8471u20gpaZ module]# cd jdk1.8.0_144/
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# vi /etc/profile
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# vi /etc/profile
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# source /etc/profile
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
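The edit made inside vi /etc/profile is not shown in the transcript above. For this JDK layout, the appended lines would typically look like the following (the path is taken from the extraction above, but treat this as a sketch rather than the author's exact edit):

```shell
# Appended to /etc/profile, then activated with: source /etc/profile
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
```

After sourcing the file, java resolves from $JAVA_HOME/bin, which is why java -version succeeds above.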
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# cd ../../software/
[root@iZbp1efx14jd8471u20gpaZ software]# ll
total 374200
-rw-r--r-- 1 root root 197657687 Jan 13 20:30 hadoop-2.7.2.tar.gz
-rw-r--r-- 1 root root 185515842 Jan 13 20:24 jdk-8u144-linux-x64.tar.gz
[root@iZbp1efx14jd8471u20gpaZ software]# tar -zxvf hadoop-2.7.2.tar.gz -C /opt/module/
hadoop-2.7.2/
hadoop-2.7.2/NOTICE.txt
hadoop-2.7.2/etc/
hadoop-2.7.2/etc/hadoop/
hadoop-2.7.2/etc/hadoop/kms-log4j.properties
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# vi /etc/profile
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# source /etc/profile
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# hadoop -version
Error: No command named `-version' was found. Perhaps you meant `hadoop version'
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2017-05-22T10:49Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
  credential           interact with credential providers
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]#
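Similarly, the second /etc/profile edit (made just before hadoop version succeeded above) is not shown. A typical sketch for this layout, with the path taken from the extraction above, would be:

```shell
# Appended to /etc/profile, then activated with: source /etc/profile
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Adding both bin and sbin puts the hadoop client commands and the start/stop scripts on the PATH.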

(1) bin directory: scripts for operating the Hadoop services (HDFS, YARN)
(2) etc directory: Hadoop's configuration file directory, holding Hadoop's configuration files
(3) lib directory: Hadoop's native libraries (for compressing and decompressing data)
(4) sbin directory: scripts for starting and stopping the Hadoop services
(5) share directory: Hadoop's dependency jars, documentation, and official examples

Origin www.cnblogs.com/dalianpai/p/12189469.html