Basic concepts and simple use of Hadoop

1. Concepts

1.1 Hadoop 1.0 and Hadoop 2.0

What is Hadoop 1.0

Hadoop 1.0 is the first generation of Hadoop, referring to the Apache Hadoop 0.20.x and 1.x releases and the CDH3 series. Its core is composed of two systems: HDFS and MapReduce.
MapReduce, the offline processing framework, consists of three parts: the programming model (old and new APIs), the runtime environment (JobTracker and TaskTracker), and the data processing engine (MapTask and ReduceTask).

What is Hadoop 2.0

Hadoop 2.0 is the second generation of Hadoop, referring to the Apache Hadoop 0.23.x and 2.x releases and the CDH4 series. Its core is composed of three systems: HDFS, MapReduce, and YARN. YARN is a resource management system responsible for cluster resource management and scheduling, and MapReduce is an offline processing framework that runs on YARN. Its programming model (old and new APIs) and data processing engine (MapTask and ReduceTask) are the same as those of MapReduce in Hadoop 1.0.

The difference between the two

Differences in the overall Hadoop framework

Hadoop 1.0 is composed of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS consists of one NameNode and multiple DataNodes, and MapReduce consists of one JobTracker and multiple TaskTrackers.

Hadoop 2.0 makes the following improvements to overcome the shortcomings of Hadoop 1.0:

  1. To address the problem that a single NameNode limits HDFS scalability in Hadoop 1.0, HDFS Federation is introduced, which allows multiple NameNodes to manage different directories, providing access isolation and horizontal scaling; combined with NameNode high availability (HA), this also removes the NameNode single point of failure;
  2. To address the shortcomings of MapReduce in Hadoop 1.0 in scalability and support for multiple frameworks, resource management and job control are split out of the JobTracker and handled by the ResourceManager (responsible for resource allocation for all applications) and the ApplicationMaster (responsible for managing a single application), respectively; in other words, the resource management framework YARN is introduced.
  3. As the resource management system of Hadoop 2.0, YARN is a general resource management module that can manage and schedule resources for all kinds of applications. It is not limited to MapReduce; other frameworks such as Tez, Spark, and Storm can also run on it.

Differences in the MapReduce computing framework

The MapReduce 1.0 computing framework consists of three parts: the programming model, the data processing engine, and the runtime environment. Its basic programming model abstracts a problem into two stages, Map and Reduce: the Map stage parses the input data into key/value pairs, calls the map() function iteratively to process them, and writes the output as key/value pairs to a local directory; the Reduce stage processes all values that share the same key and writes the final result to HDFS. Its data processing engine consists of MapTask and ReduceTask, which handle the processing logic of the Map stage and the Reduce stage respectively. Its runtime environment consists of two kinds of services, one JobTracker and several TaskTrackers: the JobTracker is responsible for resource management and for controlling all jobs, while the TaskTrackers receive commands from the JobTracker and execute them.
MapReduce 2.0 has the same programming model and data processing engine as MRv1; the only difference is the runtime environment. MRv2 is the MapReduce computing framework reworked on top of MRv1 so that it runs on the resource management framework YARN. Its runtime environment is no longer made up of JobTracker and TaskTracker services, but of the general resource management system YARN and the per-application job control process ApplicationMaster: YARN is responsible for resource management and scheduling, while the ApplicationMaster is responsible for job management.

The differences between Hadoop 1 and Hadoop 2 are significant: both HDFS and MapReduce differ, and at a minimum the configuration files are different. For project use, it is recommended to go with a higher version whenever possible; if you prefer robustness, a stable release slightly below the latest is sufficient.

1.2 MapReduce and HDFS

What is MapReduce

MapReduce is a programming model for parallel computation over large-scale data sets (larger than 1 TB). The concepts "Map" and "Reduce" and their main ideas are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers to run their programs on a distributed system without writing distributed parallel code themselves. The common software implementation specifies a Map function that transforms a set of key/value pairs into a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same key.
Key points:
1. MapReduce is a distributed computing model proposed by Google. It is mainly used in the search field to solve the computation problems of massive data.
2. MR consists of two stages, Map and Reduce. Users only need to implement the two functions map() and reduce() to get distributed computing (see the sketch below).
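
For a concrete picture of those two functions, here is a minimal word-count sketch written against the Hadoop 2.x (new) Java MapReduce API. It is an illustrative sketch rather than the exact class shipped with Hadoop; the class names TokenizerMapper and IntSumReducer are chosen here for readability.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: parse each input line and emit a (word, 1) pair per token.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);              // intermediate key/value pair
        }
    }
}

// Reduce stage: sum the counts of every value that shares the same key (word).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result);               // final (word, total) written to HDFS
    }
}

A driver class then only needs to configure a Job with these two classes and the input/output paths; the bundled wordcount example run in section 2.5 follows the same structure.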

What is HDFS

DFS stands for distributed file system: files are stored across a cluster composed of multiple machines, and the system that manages this distributed file storage is called a distributed file system.

HDFS is the Hadoop Distributed File System. It is designed for storing large files, streaming reads, and running on commodity hardware. HDFS is not suitable for storing a large number of small files, because the NameNode keeps all metadata in memory and each file, directory, and block normally occupies about 150 bytes (roughly, ten million small files of one block each would consume on the order of 3 GB of NameNode memory). It is also not suitable for scenarios with concurrent writers; HDFS files are written in append-only mode.

1.3 NameNode and DataNode

What is a Block

In HDFS, files are divided into blocks that are stored on the cluster's DataNodes, while the file system metadata is managed centrally by the NameNode. An ordinary file system typically uses blocks of 512 bytes, whereas HDFS defaults to 128 MB; unlike an ordinary file system, however, a file smaller than 128 MB does not occupy an entire block. The large default block size is meant to reduce addressing (seek) time relative to transfer time. The benefits of the block abstraction are that a single file can be larger than any single disk, storage management is simplified, and blocks fit well with the replication mechanism used to achieve high availability.
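
To make the block concept concrete, the following sketch (assuming a client with the cluster configuration on its classpath, and reusing the test file path from section 2.3) uses the Hadoop FileSystem API to print a file's block size and the DataNodes that hold its blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml/hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Example path: the file uploaded in section 2.3.
        Path file = new Path("/test/input/hw_hadoop.txt");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("block size: " + status.getBlockSize()); // 128 MB by default in Hadoop 2.x
        System.out.println("file length: " + status.getLen());      // a small file does not occupy a whole block

        // Each BlockLocation lists the DataNodes holding a replica of one block of the file.
        for (BlockLocation location : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(location);
        }
        fs.close();
    }
}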

What is NameNode

NameNode is a piece of software that usually runs on a dedicated machine in an HDFS instance. The NameNode manages the file system namespace and maintains the metadata for the file system tree and for all files, directories, and blocks. The metadata is kept in two forms: the namespace image and the edit log.

What is DataNode

DataNode is also a piece of software that usually runs on a separate machine in an HDFS instance. The NameNode holds the mapping between blocks and DataNodes but does not persist it; this information comes from the block reports that DataNodes send to the NameNode when they start up. The DataNodes store the actual blocks and periodically send the NameNode the list of blocks they hold.

1.4 JobTracker and TaskTracker

What is JobTracker

JobTracker is a background service process. Once started, it continuously monitors and receives heartbeat messages from each TaskTracker, which carry information such as resource usage and task status.
The main functions of JobTracker:

  1. Job control: in Hadoop, each application is represented as a job, and each job is divided into multiple tasks. The job control module of the JobTracker is responsible for decomposing jobs and monitoring their status.
    Status monitoring is the most important part: it covers TaskTracker status, job status, and task status. Its main purpose is fault tolerance and providing the basis for task scheduling decisions.
  2. Resource management.

What is TaskTracker

TaskTracker is the bridge between the JobTracker and the Tasks: on the one hand, it receives and executes various commands from the JobTracker, such as launching, committing, and killing tasks; on the other hand,
it periodically reports the status of each task on its local node to the JobTracker through heartbeats. Communication between the TaskTracker and both the JobTracker and the Tasks uses the RPC protocol.
The main functions of TaskTracker:

  1. Heartbeat reporting: the TaskTracker periodically reports information about its node to the JobTracker through the heartbeat mechanism. This information has two parts:
    Machine-level information: node health, resource usage, etc.
    Task-level information: task execution progress, task running status, etc.
  2. Command execution: the JobTracker issues various commands to the TaskTracker, including launch task (LaunchTaskAction), commit task (CommitTaskAction), kill task (KillTaskAction),
    kill job (KillJobAction), and reinitialize (TaskTrackerReinitAction).

1.5 ResourceManager and NodeManager

The ResourceManager allocates resources according to the resource requirements of applications; each application needs different types of resources and therefore different containers. The ResourceManager is a central service: its job is to schedule resources, start the ApplicationMaster that each job belongs to, and monitor whether that ApplicationMaster is still alive.

The NodeManager is the per-machine agent of the framework. It runs the containers that execute the application, monitors their resource usage (CPU, memory, disk, network), and reports it to the scheduler (the ResourceManager).
The responsibilities of the ApplicationMaster include requesting appropriate resource containers from the scheduler, running tasks, tracking the application's status, monitoring its progress, and handling task failures.

2. Simple use of Hadoop

For instructions on setting up a Hadoop cluster, please refer to "Building Hadoop Cluster under CentOS7".

2.1 Create a folder

Create folder /test/input on HDFS

[root@hadoop-master bin]# hadoop fs -mkdir -p /test/input

2.2 View the created folder

[root@hadoop-master bin]# hadoop fs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-08-19 12:19 /test
[root@hadoop-master bin]# hadoop fs -ls /test
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-08-19 12:19 /test/input

2.3 Upload files

Prepare a test file

[root@hadoop-master test]# vi hw_hadoop.txt 
[root@hadoop-master test]# cat hw_hadoop.txt 
hello world leo825
hello world hadoop
hello that girl
hello that boy

Upload the file to the /test/input folder of HDFS

[root@hadoop-master test]# hadoop fs -put ./hw_hadoop.txt /test/input

Check upload results

[root@hadoop-master test]#  hadoop fs -ls /test/input
Found 1 items
-rw-r--r--   2 root supergroup         69 2020-08-19 12:26 /test/input/hw_hadoop.txt

2.4 Download files

[root@hadoop-master local]# hadoop fs -get /test/input/hw_hadoop.txt ./
[root@hadoop-master local]# ll
total 12
drwxr-xr-x.  2 root  root     6 Apr 11 2018 bin
drwxr-xr-x.  2 root  root     6 Apr 11 2018 etc
drwxr-xr-x.  2 root  root     6 Apr 11 2018 games
drwxr-xr-x.  4 root  root    30 Aug  8 10:59 hadoop
-rw-r--r--.  1 root  root    69 Aug 19 16:52 hw_hadoop.txt
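
The folder creation, upload, and download shown above can also be done programmatically. Here is a minimal sketch using the Hadoop FileSystem Java API with the same HDFS paths as the commands above; the local file names are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasicOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop fs -mkdir -p /test/input
        fs.mkdirs(new Path("/test/input"));

        // Equivalent of: hadoop fs -put ./hw_hadoop.txt /test/input
        fs.copyFromLocalFile(new Path("hw_hadoop.txt"), new Path("/test/input/hw_hadoop.txt"));

        // Equivalent of: hadoop fs -get /test/input/hw_hadoop.txt ./
        fs.copyToLocalFile(new Path("/test/input/hw_hadoop.txt"), new Path("hw_hadoop_copy.txt"));

        fs.close();
    }
}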

2.5 Run a MapReduce demo program: wordcount

Run the wordcount program in hadoop-mapreduce-examples-2.7.3.jar:

hadoop jar /usr/local/hadoop/apps/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar  wordcount /test/input /test/output

Execution result:

[root@hadoop-master test]# hadoop jar /usr/local/hadoop/apps/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar  wordcount /test/input /test/output
20/08/19 17:25:55 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/192.168.223.131:8032
20/08/19 17:25:56 INFO input.FileInputFormat: Total input paths to process : 1
20/08/19 17:25:56 INFO mapreduce.JobSubmitter: number of splits:1
20/08/19 17:25:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1597810488836_0001
20/08/19 17:25:57 INFO impl.YarnClientImpl: Submitted application application_1597810488836_0001
20/08/19 17:25:57 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1597810488836_0001/
20/08/19 17:25:57 INFO mapreduce.Job: Running job: job_1597810488836_0001
20/08/19 17:26:08 INFO mapreduce.Job: Job job_1597810488836_0001 running in uber mode : false
20/08/19 17:26:08 INFO mapreduce.Job:  map 0% reduce 0%
20/08/19 17:26:18 INFO mapreduce.Job:  map 100% reduce 0%
20/08/19 17:26:25 INFO mapreduce.Job:  map 100% reduce 100%
20/08/19 17:26:26 INFO mapreduce.Job: Job job_1597810488836_0001 completed successfully
20/08/19 17:26:26 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=88
		FILE: Number of bytes written=237555
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=184
		HDFS: Number of bytes written=54
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=1
		Launched reduce tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=6450
		Total time spent by all reduces in occupied slots (ms)=4613
		Total time spent by all map tasks (ms)=6450
		Total time spent by all reduce tasks (ms)=4613
		Total vcore-milliseconds taken by all map tasks=6450
		Total vcore-milliseconds taken by all reduce tasks=4613
		Total megabyte-milliseconds taken by all map tasks=6604800
		Total megabyte-milliseconds taken by all reduce tasks=4723712
	Map-Reduce Framework
		Map input records=4
		Map output records=12
		Map output bytes=117
		Map output materialized bytes=88
		Input split bytes=115
		Combine input records=12
		Combine output records=7
		Reduce input groups=7
		Reduce shuffle bytes=88
		Reduce input records=7
		Reduce output records=7
		Spilled Records=14
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=131
		CPU time spent (ms)=1490
		Physical memory (bytes) snapshot=290959360
		Virtual memory (bytes) snapshot=4160589824
		Total committed heap usage (bytes)=154607616
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=69
	File Output Format Counters 
		Bytes Written=54

The job can also be tracked in the YARN web interface: http://192.168.223.131:8088/cluster

View the output result:

[root@hadoop-master test]# hadoop fs -ls /test/output
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-08-19 17:26 /test/output/_SUCCESS
-rw-r--r--   2 root supergroup         54 2020-08-19 17:26 /test/output/part-r-00000
[root@hadoop-master test]# hadoop fs -cat /test/output/part-r-00000
boy	1
girl	1
hadoop	1
hello	4
leo825	1
that	2
world	2
