Why the HDFS block size is set to 128M

This post records why the HDFS block size is set to 128M, based on my understanding of a referenced blog post.

Why the block size is set relatively large

This is a classic interview question, and it needs to be considered from both transmission performance and storage perspectives.

Transmission performance considerations

When reading data from a hard disk, the smallest unit that can be read is a sector, generally 512 bytes. To read a piece of data, the disk must complete an addressing operation and then a read operation: the head first seeks to the sector holding the data, and only then reads it.
HDFS is not a physical disk; it is an abstract file system built on top of physical file systems, and its smallest read unit is a block. Reading a block likewise requires addressing and reading. In general, the seek time should be kept as small a fraction of the read time as possible, which minimizes the addressing overhead, i.e., keeps the total time spent seeking while reading a file to a minimum. If the block size is set too small, the seek time could approach the read time, and addressing could end up taking a large share of the time spent reading a file.
Here is a rough calculation. Assume a transfer rate of 100M/s and a positioning time of 0.1s per block (in practice positioning is faster than this).
(1) If the block size is set to 10M, a 100M file is split into 10 blocks. Reading one block takes 0.1s for positioning and 0.1s for reading, so reading the whole 100M file takes 2s in total, with addressing accounting for half of the time.
(2) If the block size is set to 100M, only one 0.1s positioning is needed and the full read takes 1s, for a total of only 1.1s, a significant saving.
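
Below is a minimal sketch of the calculation above; the 100M/s transfer rate and 0.1s seek time are the assumed figures from the example, not measured values.

```java
public class BlockReadTime {
    // Total time to read a file: one seek per block plus streaming the whole file.
    static double totalReadTime(double fileMB, double blockMB,
                                double seekSec, double rateMBps) {
        double blocks = Math.ceil(fileMB / blockMB);
        return blocks * seekSec + fileMB / rateMBps;
    }

    public static void main(String[] args) {
        // 100M file, 10M blocks: 10 seeks * 0.1s + 1s transfer = 2.0s
        System.out.println(totalReadTime(100, 10, 0.1, 100));
        // 100M file, 100M blocks: 1 seek * 0.1s + 1s transfer = 1.1s
        System.out.println(totalReadTime(100, 100, 0.1, 100));
    }
}
```
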
Therefore, the block size cannot be set too small; it needs to be reasonably large. In Hadoop 2.x, taking transfer speed and disk seek time into account, it is usually set to 128M.
That said, bigger is not always better: a MapReduce map task processes only one block at a time, so the larger the block size, the fewer map tasks there are to run, which hurts computing performance. The block size therefore cannot be set too large either.
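
To make that trade-off concrete, here is a rough illustration (plain arithmetic, not Hadoop code); the 1G input file is an assumed example, and it relies on the rule of thumb that one map task handles one block.

```java
public class MapTaskCount {
    public static void main(String[] args) {
        double fileMB = 1024;  // assume a 1G input file
        // 128M blocks -> 8 blocks -> roughly 8 map tasks running in parallel
        System.out.println(Math.ceil(fileMB / 128));
        // 1G blocks -> 1 block -> a single map task, no parallelism
        System.out.println(Math.ceil(fileMB / 1024));
    }
}
```
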

Storage performance considerations

Simply put, if the block size is set too small, more metadata (block locations, sizes, and so on) is generated, which increases the pressure on the NameNode. For this reason, the block size should not be set too small.
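
The following back-of-the-envelope sketch shows how the block count, and hence the NameNode metadata, grows as the block size shrinks. The figure of roughly 150 bytes of NameNode memory per block object is a commonly cited rule of thumb, used here only as an assumption, and the 1TB data volume is an arbitrary example.

```java
public class NameNodeMetadata {
    public static void main(String[] args) {
        double dataMB = 1024.0 * 1024;   // assume 1TB of data stored in HDFS
        long bytesPerBlockObject = 150;  // assumed rule-of-thumb figure

        for (int blockMB : new int[]{1, 64, 128}) {
            double blocks = dataMB / blockMB;
            double metadataMB = blocks * bytesPerBlockObject / (1024 * 1024);
            // Smaller blocks -> many more block objects for the NameNode to track.
            System.out.printf("block=%dM -> %.0f blocks, ~%.1f MB of NameNode metadata%n",
                    blockMB, blocks, metadataMB);
        }
    }
}
```
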

Why 128M specifically

The following analysis of why 128M was chosen refers to the blog post listed at the end.

(1) Existing conditions: the actual seek time in HDFS is about 0.01s, i.e. 10ms, and the transfer speed of an ordinary disk is about 100M/s. Since HDFS is generally deployed on inexpensive machines with ordinary disks, these figures are reasonable.

(2) Test experience: extensive testing shows that performance is best when the seek time is about 1% of the transfer time. With a seek time of 0.01s, a transfer time of about 1s is appropriate, and at 100M/s that corresponds to roughly 100M of data, so 128M is chosen.
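
Here is a minimal sketch of this 1% rule of thumb using the figures above; the remark that 128M is the nearest power of two above the ~100M result is my own note, not something stated in the referenced blog.

```java
public class BlockSizeEstimate {
    public static void main(String[] args) {
        double seekSec = 0.01;      // ~10ms seek time (figure from the text)
        double rateMBps = 100;      // ordinary disk, ~100M/s (figure from the text)
        double targetRatio = 0.01;  // seek time should be ~1% of transfer time

        double transferSec = seekSec / targetRatio;  // 1s of transfer per block
        double idealMB = transferSec * rateMBps;     // ~100M of data per block
        // 128M is the nearest power of two above ~100M.
        System.out.println("ideal block size ~ " + idealMB + "M, set to 128M");
    }
}
```
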

In real production environments, disks transfer data more efficiently, so the block size may be set larger; it can be estimated with the same calculation as above.
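
If you do want a larger block size, below is a hedged sketch of two common ways to set it, assuming the Hadoop 2.x+ dfs.blocksize property and the org.apache.hadoop.fs.FileSystem API; the path, buffer size, and replication factor are illustrative values only, and cluster-wide defaults would normally go in hdfs-site.xml instead.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeBlockWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default block size for files created with this configuration.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);  // 256M

        FileSystem fs = FileSystem.get(conf);
        // Or pass the block size explicitly for a single file.
        FSDataOutputStream out = fs.create(new Path("/tmp/big-block-file"),
                true,                 // overwrite if it exists
                4096,                 // io buffer size
                (short) 3,            // replication factor
                256L * 1024 * 1024);  // block size in bytes
        out.close();
        fs.close();
    }
}
```
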

Reference blog:
(1) https://www.cnblogs.com/isabellasu/p/11175643.html
