Hadoop Study Notes (3): HDFS Architecture

HDFS Overview

HDFS background

As data volumes keep growing, a single operating system can no longer hold all of the data, so the data must be spread across the disks managed by many operating systems. Managing and maintaining files scattered across machines is inconvenient, so a system is urgently needed to manage files on multiple machines: this is a distributed file system. HDFS is just one kind of distributed file management system.

HDFS concept

HDFS is, first of all, a file system: it stores files and locates them through a directory tree. Secondly, it is distributed:
many servers cooperate to provide its functionality, and each server in the cluster plays its own role.
HDFS is designed for write-once, read-many scenarios and does not support modifying files in place. It is well suited to data
analysis, but not to general-purpose web applications.

HDFS advantages and disadvantages

Advantages

  1. High fault tolerance
    • Data is automatically saved in multiple replicas; adding replicas improves fault tolerance;
    • After a replica is lost, it can be recovered automatically.
  2. Suitable for big data processing
    • Data scale: capable of handling data at the GB, TB, and even PB level;
    • File count: capable of handling more than a million files, which is a very large number.
  3. Streaming data access, which ensures data consistency.
  4. Can be built on low-cost machines, using the multi-replica mechanism to improve reliability.

Disadvantages

  1. Not suitable for low-latency data access; for example, storing and retrieving data at millisecond latency is impossible.
  2. Not efficient for storing large numbers of small files.
    • Storing lots of small files consumes a great deal of NameNode memory for file, directory, and block metadata. This is undesirable because NameNode memory is always limited;
    • For small files, the seek time can exceed the read time, which violates the HDFS design goals.
  3. Does not support concurrent writes or random file modification.
    • A file can have only one writer at a time; multiple threads are not allowed to write to it concurrently;
    • Only appending data is supported; random modification of a file is not.

HDFS architecture


The architecture consists of four main parts: the HDFS Client, the NameNode, the DataNodes, and the Secondary NameNode.

  1. Client: the client.
    • File splitting: when a file is uploaded to HDFS, the Client splits it into Blocks one by one, then uploads them for storage;
    • Interacts with the NameNode to obtain file location information;
    • Interacts with DataNodes to read or write data;
    • Provides a number of commands to manage HDFS, such as starting or shutting down HDFS;
    • Provides a number of commands to access HDFS.
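The client-side file splitting described above can be sketched in a few lines. This is a simplified illustration, not the actual HDFS client code; the 128 MB block size is the Hadoop 2.x default (dfs.blocksize) discussed below.

```python
# Illustrative sketch of client-side file splitting (not real HDFS code).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default (dfs.blocksize)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file is split into three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that only the last block of a file can be smaller than the block size; every other block is full.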
  2. NameNode: the Master; it acts as supervisor and manager.
    • Manages the HDFS namespace;
    • Manages block (Block) mapping information;
    • Configures the replica policy;
    • Handles client read and write requests.
  3. DataNode: the Slave. The NameNode issues commands; DataNodes perform the actual operations.
    • Stores the actual data blocks;
    • Performs read/write operations on data blocks.
  4. Secondary NameNode: not a hot standby for the NameNode. When the NameNode goes down, it cannot immediately replace the NameNode and serve requests.
    • Assists the NameNode and shares part of its workload;
    • Periodically merges the Fsimage and Edits files and pushes the result to the NameNode;
    • In an emergency, can assist in recovering the NameNode.
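The Fsimage/Edits merge can be illustrated with a toy model: the fsimage is a snapshot of the namespace, the edit log is a list of operations, and a checkpoint replays the log onto the snapshot. All names and structures here are illustrative, not the real NameNode data structures.

```python
# Toy model of a Secondary NameNode checkpoint: replay the edit log
# (Edits) onto the namespace snapshot (Fsimage). Purely illustrative.
def checkpoint(fsimage, edits):
    """Return a new fsimage with all logged operations applied."""
    merged = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            merged[path] = args[0]   # block list for the new file
        elif op == "delete":
            merged.pop(path, None)   # remove the file if present
    return merged

fsimage = {"/a.txt": ["blk_1"]}
edits = [("create", "/b.txt", ["blk_2", "blk_3"]),
         ("delete", "/a.txt")]
new_fsimage = checkpoint(fsimage, edits)
```

After the merge, the new fsimage reflects all operations recorded in the edit log, and the NameNode can start from the merged snapshot instead of replaying a long log itself.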

HDFS file block size

Files in HDFS are physically stored in blocks (Block). The block size can be specified via the configuration parameter dfs.blocksize; the default is 128 MB in Hadoop 2.x (64 MB in older versions).
HDFS blocks are larger than disk blocks in order to minimize seek (addressing) overhead. If the block is large enough, the time to transfer the data from disk will be significantly longer than the time needed to locate the start of the block. The time to transfer a file made up of multiple blocks therefore depends on the disk transfer rate.
If the seek time is about 10 ms and the transfer rate is 100 MB/s, then for the seek time to be only 1% of the transfer time, the block size should be set to about 100 MB. The default block size is 128 MB.
Block size: (10 ms / 1%) × 100 MB/s = 1 s × 100 MB/s = 100 MB
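The arithmetic above can be checked directly. This is a worked example of the rule of thumb; the 10 ms seek time, 100 MB/s transfer rate, and 1% target ratio are the figures from the text.

```python
# Rule-of-thumb block size: seek time should be ~1% of transfer time.
seek_time = 0.010      # seek (addressing) time: 10 ms, in seconds
transfer_rate = 100    # disk transfer rate: 100 MB/s
target_ratio = 0.01    # seek time as a fraction of transfer time (1%)

transfer_time = seek_time / target_ratio       # 1.0 s per block
block_size_mb = transfer_time * transfer_rate  # 100 MB per block
```

The result of about 100 MB is then rounded up to a power of two, giving the 128 MB default in Hadoop 2.x.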

 

Setting the file block size

  • It cannot be too small: if the HDFS block size is set too small, seek time increases, and the program spends its time locating the start of blocks.
  • It cannot be too large: if the block size is set too large, the time to transfer the data from disk will greatly exceed the time needed to locate the start of the block, making the program very slow when processing the data in a block.
  • Summary: the HDFS block size setting depends mainly on the disk transfer rate.

Origin www.cnblogs.com/wbyixx/p/10988137.html