Brief Introduction
These notes record the basic principles, processes, and key points of Hadoop's main components, including HDFS, YARN, and MapReduce.
Background
- Data is being generated ever faster, and machines are getting faster too. Since more data usually beats better algorithms, additional methods of processing data are needed.
- Hard disk capacity keeps increasing, but disk performance has not kept up. The solution is to spread data across multiple disks and read them in parallel, which brings new problems:
  - Hardware failure: solved by replicating data (RAID)
  - Analysis needs to combine data read from different disks: solved by MapReduce
Hadoop provides:
- Reliable shared storage (distributed storage)
- An abstract analysis interface (distributed analysis)
Big Data
Concept
Data sets too large for a single machine to store or process.
The core idea of big data: the sample is the whole population, i.e., analyze all the data rather than a sample of it.
Characteristics
- Volume: in big data, individual files are generally at least tens or hundreds of GB
- Velocity: reflected in how rapidly data is created and how frequently it changes
- Variety: data comes in diverse types and from diverse sources; its structure can be further classified as structured, semi-structured, or unstructured
- Volatility: fast-moving data streams also fluctuate; a volatile stream shows periodic peaks driven by seasons or by days with specific triggering events
- Veracity: also known as data assurance. Data collected through different channels varies greatly in quality, and the error level and confidence of the analysis output depend largely on the quality of the collected data
- Complexity: reflected in data management and operations. Extracting, transforming, loading, and joining data to grasp the useful information inside it becomes more challenging
Key Techniques
- Distribute data across multiple machines
  - Reliability: each data block is replicated to multiple nodes
  - Performance: multiple nodes process the data in parallel
- Move computation to the data
  - Network IO speed << local disk IO speed, so a big-data system tries to schedule each task on, or as close as possible to, the machine holding its data: the program and its dependencies are copied to that machine and run there
  - Migrating code to the data avoids large-scale data movement and tries to make computation happen on the same machine as the data
- Replace random IO with sequential IO
  - For small random accesses, seek time far exceeds transfer time; sequential IO amortizes the seek, and data is generally written once and not modified afterwards
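As a rough illustration of the MapReduce idea described above (not Hadoop itself), the map/shuffle/reduce flow of word counting can be mimicked with a Unix pipeline: `tr` plays the mapper, `sort` the shuffle, and `uniq -c` the reducer. The input file name is just an example:

```shell
# Create a tiny example input (hypothetical file name).
printf 'hello world\nhello hadoop\n' > input.txt

# map:     split each line into one word per line
# shuffle: sort brings identical words together
# reduce:  uniq -c counts each group of identical words
tr -s ' ' '\n' < input.txt | sort | uniq -c | sort -rn
# "hello" appears twice; "hadoop" and "world" once each.
```

Hadoop runs the same three phases, but with the map and reduce steps spread across many machines and the input read sequentially in large blocks, which is exactly where the techniques above pay off.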
History of Hadoop
1) Lucene: open-source software created by Doug Cutting, written in Java, implementing full-text search similar to Google's; it provides a full-text search engine framework, including a complete query engine and indexing engine
2) At the end of 2001 it became a subproject of the Apache Foundation
3) For massive-data scenarios, Lucene faced the same difficulties as Google
4) Nutch: a miniature version built by learning from and imitating Google's solutions to these problems
5) Google can be said to be the source of Hadoop's ideas (Google's three big-data papers):
GFS ---> HDFS
Map-Reduce ---> MR
BigTable ---> HBase
6) In 2003-2004, Google disclosed details of its GFS and MapReduce ideas. Building on them, Doug Cutting and others spent two years of spare time implementing the DFS and MapReduce mechanisms, which made Nutch's performance soar
7) In 2005, Hadoop was formally introduced into the Apache Foundation as part of Nutch, a subproject of Lucene. In March 2006, Map-Reduce and the Nutch Distributed File System (NDFS) were incorporated into a project named Hadoop
8) The name comes from a toy elephant belonging to Doug Cutting's son
9) Hadoop was born and developed rapidly, marking the era of cloud computing
Hadoop Advantages
1) High reliability: Hadoop maintains multiple copies of data at the storage layer, so the failure of a single compute or storage element does not cause data loss.
2) High scalability: tasks and data are distributed across the cluster, which can easily be extended to thousands of nodes.
3) Efficiency: following the MapReduce idea, Hadoop works in parallel to speed up task processing.
4) High fault tolerance: failed tasks are automatically redistributed.
Hadoop composition
HDFS Architecture Overview
HDFS (Hadoop Distributed File System) consists mainly of a NameNode (which stores the file system metadata), DataNodes (which store the actual data blocks), and a Secondary NameNode (which periodically checkpoints the NameNode's metadata). The original architecture figure is not reproduced here.
YARN Architecture Overview
1) ResourceManager (RM): handles client requests, starts and monitors ApplicationMasters, monitors NodeManagers, and performs resource allocation and scheduling
2) NodeManager (NM): manages the resources on a single node, and processes commands from both the ResourceManager and the ApplicationMaster
3) ApplicationMaster: splits the input data, requests resources for the application and assigns them to its internal tasks, and handles task monitoring and fault tolerance
4) Container: an abstraction of the task runtime environment. It packages multi-dimensional resources such as CPU and memory, along with environment variables, startup commands, and other information related to running the task
Setting Up the Hadoop Runtime Environment
Using an Alibaba Cloud instance; all operations below are performed directly on it.
```
[root@iZbp1efx14jd8471u20gpaZ ~]# hostnamectl set-hostname hadoop001
[root@iZbp1efx14jd8471u20gpaZ ~]# cd /opt
[root@iZbp1efx14jd8471u20gpaZ opt]# mkdir module
[root@iZbp1efx14jd8471u20gpaZ opt]# mkdir software
[root@iZbp1efx14jd8471u20gpaZ opt]# rpm -qa | grep java
[root@iZbp1efx14jd8471u20gpaZ opt]# cd software/
[root@iZbp1efx14jd8471u20gpaZ software]# ll
total 257012
-rw-r--r-- 1 root root  77660160 Jan 13 20:26 hadoop-2.7.2.tar.gz
-rw-r--r-- 1 root root 185515842 Jan 13 20:24 jdk-8u144-linux-x64.tar.gz
[root@iZbp1efx14jd8471u20gpaZ software]# tar -zvxf jdk-8u144-linux-x64.tar.gz /opt/module/ -C
tar: option requires an argument -- 'C'
Try `tar --help' or `tar --usage' for more information.
[root@iZbp1efx14jd8471u20gpaZ software]# tar -zvxf jdk-8u144-linux-x64.tar.gz -C /opt/module/
jdk1.8.0_144/
jdk1.8.0_144/THIRDPARTYLICENSEREADME-JAVAFX.txt
jdk1.8.0_144/THIRDPARTYLICENSEREADME.txt
```
```
[root@iZbp1efx14jd8471u20gpaZ module]# cd jdk1.8.0_144/
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# vi /etc/profile
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# vi /etc/profile
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# source /etc/profile
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
[root@iZbp1efx14jd8471u20gpaZ jdk1.8.0_144]# cd ../../software/
[root@iZbp1efx14jd8471u20gpaZ software]# ll
total 374200
-rw-r--r-- 1 root root 197657687 Jan 13 20:30 hadoop-2.7.2.tar.gz
-rw-r--r-- 1 root root 185515842 Jan 13 20:24 jdk-8u144-linux-x64.tar.gz
[root@iZbp1efx14jd8471u20gpaZ software]# tar -zxvf hadoop-2.7.2.tar.gz -C /opt/module/
hadoop-2.7.2/
hadoop-2.7.2/NOTICE.txt
hadoop-2.7.2/etc/
hadoop-2.7.2/etc/hadoop/
hadoop-2.7.2/etc/hadoop/kms-log4j.properties
```
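The transcript edits /etc/profile but does not show what was added. A typical snippet, assuming the extraction path from the tar output above, would be:

```shell
# Appended to /etc/profile (typical form; the path comes from the
# tar extraction above, the PATH line is the conventional addition)
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
```

After `source /etc/profile`, `java -version` resolves as shown in the transcript.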
```
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# vi /etc/profile
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# source /etc/profile
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# hadoop -version
Error: No command named `-version' was found. Perhaps you meant `hadoop version'
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# hadoop version
Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2017-05-22T10:49Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]# hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
[root@iZbp1efx14jd8471u20gpaZ hadoop-2.7.2]#
```
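Again the /etc/profile edit itself is not shown in the transcript. A typical snippet, assuming the extraction path from the tar output above, would be:

```shell
# Appended to /etc/profile (typical form; the path comes from the
# tar extraction above, the PATH line is the conventional addition)
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Adding both bin (client commands such as `hadoop fs`) and sbin (start/stop scripts) to PATH is what makes `hadoop version` resolvable in the transcript.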
The Hadoop installation directory layout:
(1) bin: scripts for operating the Hadoop services (HDFS, YARN)
(2) etc: Hadoop's configuration directory, holding its configuration files
(3) lib: Hadoop's native libraries (data compression and decompression)
(4) sbin: scripts for starting and stopping the Hadoop services
(5) share: Hadoop's dependency jars, documentation, and official examples