Big Data Course D2 - An Overview of Hadoop


 ▲ Purpose of this chapter

⚪ Understand the definition and characteristics of Hadoop;

⚪ Master the basic structure of Hadoop;

⚪ Master the common commands of Hadoop;

⚪ Understand the execution process of Hadoop;

1. Introduction

1. Overview

1. HDFS (Hadoop Distributed File System) is the mechanism Hadoop provides for distributed storage.

2. HDFS was implemented by Doug Cutting based on Google's paper "The Google File System" (GFS).

2. Features

1. Ability to store very large files. In an HDFS cluster, as long as there are enough nodes, a file can be stored regardless of its size, because HDFS splits the file into blocks and distributes them across the DataNodes.

2. Rapid detection of and response to failures. In an HDFS cluster, the operations staff do not need to monitor each node individually; the state of the DataNodes can be determined by monitoring the NameNode, because every DataNode sends a heartbeat to the NameNode at a fixed interval.
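
The heartbeat interval itself is configurable. A minimal sketch of the corresponding hdfs-site.xml entry, assuming the default of 3 seconds is simply being made explicit:

    <!-- hdfs-site.xml: how often each DataNode sends a heartbeat to the NameNode, in seconds -->
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>
    </property>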

3. High fault tolerance. HDFS automatically keeps multiple replicas of its data, so the loss of one or a few replicas does not lose the data.

4. High throughput. Throughput here means the total amount of data the cluster can read and write per unit of time.

5. It can be scaled out on relatively cheap machines.

6. Does not support low-latency access. In an HDFS cluster, response times are generally at the second level; millisecond-level responses are hard to achieve.

7. Not suitable for storing large numbers of small files. Every file produces a piece of metadata, so a large number of small files produces a large amount of metadata; too much metadata takes up a lot of NameNode memory and lowers query efficiency.

8. Simplified consistency model. HDFS allows a file to be written once and read many times: modification in place is not allowed, but appending is.

9. Does not provide strong transactions, and barely supports transactions at all. In HDFS, because the amount of data is so large, the system will not rewrite all the data just because one or a few data blocks go wrong; on the premise that the data volume is large enough, a certain amount of error is tolerated.

2. Basic concepts

1. Basic structure

1. HDFS itself is a typical master/slave (M/S) structure: the master node is the NameNode, and the slave nodes are the DataNodes.

2. HDFS splits each uploaded file into pieces; each piece is called a Block.

3. HDFS automatically backs up uploaded files. Each backup is called a replica. If not specified otherwise, the number of replicas defaults to 3.
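
The default replication factor can be changed in hdfs-site.xml, and the replication of files already in HDFS can be adjusted from the command line. A sketch (/log/a.log is an example path):

    <!-- hdfs-site.xml: default number of replicas per Block -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # change the replication factor of an existing file and wait for it to take effect
    hdfs dfs -setrep -w 2 /log/a.log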

4. HDFS provides a file system modeled on Linux: files are stored under virtual paths, and there is a set of permission policies similar to Linux's. The root path of HDFS is /.
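
Day-to-day work with this virtual file system goes through the hdfs dfs command family. A few common commands, with example paths:

    # create a directory (including parents) in HDFS
    hdfs dfs -mkdir -p /log
    # upload a local file into that directory
    hdfs dfs -put a.log /log/
    # list the directory
    hdfs dfs -ls /log
    # print the file's contents
    hdfs dfs -cat /log/a.log
    # change permissions, Linux style
    hdfs dfs -chmod 754 /log/a.log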

2. Block

1. The Block is the basic unit of data storage in HDFS; that is, data uploaded to HDFS ultimately lands on the DataNodes' disks in the form of Blocks.

2. If not specified otherwise, the block size defaults to 134217728 B (i.e. 128 MB). It can be adjusted through the dfs.blocksize property, placed in the hdfs-site.xml file; the unit is the byte.
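
A sketch of the corresponding hdfs-site.xml entry, assuming the block size is being raised to 256 MB:

    <!-- hdfs-site.xml: Block size in bytes (268435456 B = 256 MB) -->
    <property>
      <name>dfs.blocksize</name>
      <value>268435456</value>
    </property>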

3. If a file is smaller than the specified block size, then the Block is only as large as the file itself. For example, a 70 MB file produces a single 70 MB Block. In other words, the value specified by the dfs.blocksize property is really the maximum capacity of a Block.

4. Note the reasoning behind the block size. Blocks are kept on the DataNodes' disks, so the ratio between a block's seek (addressing) time on disk and its transfer (write) time has to be considered. Generally, efficiency is highest when the seek time is about 1% of the transfer time. A disk seek takes roughly 10 ms, so the target write time is 10 ms / 0.01 = 1000 ms = 1 s. Since most servers use mechanical disks, whose write speed is generally about 120 MB/s, a block should be about 1 s × 120 MB/s = 120 MB, which rounds to the nearest convenient power of two, 128 MB.

5. HDFS assigns each Block a unique number, the BlockID.

6. The point of splitting files into Blocks:

a. Ability to store very large files.

b. Ability to perform fast backups.

3. NameNode

1. The NameNode is the master (core) node in HDFS. In Hadoop 1.X there can be only one NameNode, which makes it prone to single-point failure; in Hadoop 2.X at most two NameNodes are allowed; in Hadoop 3.X the number of NameNodes is no longer limited, so in a Hadoop 3.X cluster the NameNode is no longer a single point of failure.
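
In an HA cluster, the state of each NameNode can be checked with the hdfs haadmin command. A sketch, assuming two NameNodes registered with the (example) IDs nn1 and nn2:

    # query which NameNode is active and which is standby
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2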

2. The roles of the NameNode: receiving external requests, recording metadata, and managing the DataNodes.

3. Metadata is data that describes data (it can be loosely understood as a ledger). In HDFS, metadata describes certain properties of files. The metadata is split into many items, mainly including the following:

a. The name of the uploaded file and the virtual path where it is stored, such as /log/a.log.

b. The uploading user and user group corresponding to the file.

c. Permissions for the file, for example -rwxr-xr--.

d. File size.

e. Block size.

f. The mapping relationship between files and BlockIDs.

g. The mapping relationship between BlockIDs and DataNodes (both of these mappings can be inspected with the fsck example after this list).

h. Number of copies, etc.
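
The file-to-BlockID and BlockID-to-DataNode mappings (items f and g) can be viewed with hdfs fsck. A sketch, reusing the example path from item a:

    # show the Blocks of a file and the DataNodes that hold them
    hdfs fsck /log/a.log -files -blocks -locations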

4. A piece of metadata is about 150 B. This is another reason small files are expensive: 10,000,000 small files need at least 10,000,000 × 150 B ≈ 1.4 GB of NameNode memory for metadata alone.

5. Metadata is maintained both in memory and on disk.

a. It is kept in memory so that queries are fast.

b. It is kept on disk so that it persists.

6. The on-disk location of the metadata is determined by the hadoop.tmp.dir property, placed in the core-site.xml file. If not specified, it defaults to a directory under /tmp.
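
Because /tmp is often cleared on reboot, this property is normally pointed at a persistent directory. A sketch of the core-site.xml entry, with /opt/hadoop/tmp as an example path:

    <!-- core-site.xml: base directory for Hadoop's local data, including the NameNode metadata -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/opt/hadoop/tmp</value>
    </property>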

7. Files related to metadata:

a. edits: write operation file. Used to record HDFS write operations.

b. fsimage: the metadata image file. It stores the serialized form of the NameNode's metadata (it can be understood as the persisted form of the metadata on disk).
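
Both files live in the NameNode's metadata directory and are binary, but Hadoop ships offline viewers that dump them to XML. A sketch (the transaction-ID suffixes in the file names are examples):

    # dump an edits file to XML
    hdfs oev -i edits_0000000000000000001-0000000000000000010 -o edits.xml
    # dump an fsimage file to XML
    hdfs oiv -i fsimage_0000000000000000010 -o fsimage.xml -p XML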

8. When the NameNode receives a write operation (command), it first records the operation into the edits_inprogress file. After the record succeeds, the NameNode parses the command and then modifies the metadata in memory. After the modification succeeds, it returns an ACK signal to the client to indicate success. Note that throughout this process the metadata in the fsimage file does not change.
