Big Data Technology Hadoop: HDFS Storage Principles (5)

Table of contents

1. Principle introduction

1.1 Block

1.2 Replica mechanism

2. fsck command

2.1 Set the default number of replicas

2.2 Temporarily set the number of replicas for a file

2.3 Check the number of replicas of a file with fsck

2.4 Block size configuration

3. NameNode metadata

3.1 NameNode role

3.2 edits file

3.3 FSImage file

3.4 Metadata merge control parameters

3.5 The role of SecondaryNameNode

4. HDFS reading and writing process

4.1 Writing process

4.2 Reading process


1. Principle introduction

1.1 Block

HDFS distributed file storage usually splits a file into multiple parts and then sends them to different server nodes.

 

Problem: different files have different sizes. If they are split arbitrarily and placed on different server nodes, each part will have a different size, which is not conducive to unified management.

Solution: set up a unified unit of management, the Block.

  • Block: the minimum storage unit in HDFS
  • 256 MB each in this configuration (can be modified; see section 2.4)

In this way, a file can be divided into multiple Blocks, and the different Blocks are stored on the corresponding servers.

For example:

A 1 GB file can theoretically be divided into 4 Blocks (1024 MB ÷ 256 MB = 4).

If the cluster has three servers, then one server holds 2 Blocks and the other two servers hold 1 Block each.

1.2 Replica mechanism

Without backups, if a block is damaged, the entire file becomes unusable.

The replica mechanism is therefore a very important mechanism for ensuring data safety.

2. fsck command

2.1 Set the default number of replicas

The default number of replicas for an HDFS file is 3.

Of course, this value can be modified, specifically by configuring the following property in hdfs-site.xml:

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

This attribute defaults to 3, so we generally do not need to configure it explicitly (unless we need a value other than 3).

If you need to customize this property, modify the hdfs-site.xml file on each server and set it there.
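
To double-check the value a client will actually use, one quick way (assuming the hdfs command is on the PATH) is:

hdfs getconf -confKey dfs.replication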

2.2 Temporarily set the number of replicas for a file

If nothing else is specified, the files we create or upload get the default number of replicas set above.

But for a single upload, we can also specify how many replicas a particular file should have:

hadoop fs -D dfs.replication=2 -put test.txt /tmp/

For a file that already exists in HDFS, modifying dfs.replication will not take effect. To change an existing file, use the command:

hadoop fs -setrep [-R] 2 path

With the above command, the content under the specified path will be stored with 2 replicas.

The -R option is optional; with -R, the change also takes effect on subdirectories.
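
For example, a couple of invocations (the paths are placeholders and are assumed to already exist in HDFS):

hadoop fs -setrep 2 /tmp/test.txt
hadoop fs -setrep -R 2 /tmp

The first changes a single file to 2 replicas; the second applies the change to everything under /tmp, including subdirectories.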

2.3 Check the number of replicas of a file with fsck

If we want to view detailed information about a file's replicas, we can use the following command:

hdfs fsck path [-files [-blocks [-locations]]]

fsck checks whether the specified path is healthy.

  • -files lists the status of the files under the path
  • -files -blocks additionally outputs a block report (how many blocks, how many replicas)
  • -files -blocks -locations additionally outputs the location details of each block
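
A concrete run, using the hypothetical file uploaded earlier, might be:

hdfs fsck /tmp/test.txt -files -blocks -locations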

2.4 Block size configuration

In this setup, the block size is 256 MB (note that stock Hadoop 2.x/3.x actually defaults to 128 MB); of course, we can also modify it:

  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
    <description>Set the HDFS block size, in bytes</description>
  </property>
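
For reference on the value above: 268435456 bytes = 256 × 1024 × 1024 bytes = 256 MB; a 128 MB block size would correspond to 134217728.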

3. NameNode metadata

3.1 NameNode role

NameNode role: manages Block blocks.

In HDFS, files are split into piles of Block blocks. If files are large and numerous, how does Hadoop record and organize the mapping between files and their blocks?

The answer is the NameNode.

The NameNode completes the management and maintenance of the entire file system through the cooperation of a batch of edits files and an fsimage file.

3.2 edits file

The edits file is a journal (log) file that records every operation in HDFS, together with the blocks of the files affected by each operation.

 

3.3 FSImage file

Merging all of the edits files into a final result produces the FSImage file.
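
As an aside, Hadoop ships offline viewers for both kinds of file (hdfs oev for edits, hdfs oiv for fsimage), so their contents can be dumped to readable XML. A sketch, with hypothetical file names:

# Dump an edits file to XML (the input file name is hypothetical)
hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml
# Dump an fsimage file to XML (the input file name is hypothetical)
hdfs oiv -i fsimage_0000000000000000042 -o fsimage.xml -p XML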

Summary

The NameNode manages the file system's files through the cooperation of edits and FSImage:

1. Every write operation against HDFS is recorded in the current edits file.

2. When an edits file reaches its size limit, a new edits file is opened for subsequent records.

3. The edits files are merged periodically:

  • If no fsimage exists yet, all edits files are merged to build the first fsimage.
  • If an fsimage already exists, all edits files are merged with the existing fsimage to produce a new fsimage.

4. Steps 1 to 3 then repeat.

3.4 Metadata merge control parameters

Metadata merging is a scheduled process, controlled by:

dfs.namenode.checkpoint.period, the default is 3600 (seconds) which is 1 hour

dfs.namenode.checkpoint.txns, default 1000000, which is 1 million transactions

As long as either condition is met, a merge is executed.

Whether the conditions are met is checked once every 60 seconds by default, based on:

dfs.namenode.checkpoint.check.period, default 60 (seconds).
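
As a sketch, all three knobs can be made explicit in hdfs-site.xml, in the same style as the properties shown earlier (the values below simply restate the defaults):

  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.check.period</name>
    <value>60</value>
  </property>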

3.5 The role of SecondaryNameNode

Speaking of metadata merging: remember that an HDFS cluster has a secondary role, the SecondaryNameNode?

Yes, merging the metadata is exactly what it does.

The SecondaryNameNode pulls the fsimage and edits files from the NameNode over HTTP.

It completes the merge and then provides the result back to the NameNode for use.
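
On a running cluster this role is easy to spot: running the JDK's jps tool on the node that hosts it should list a SecondaryNameNode process among the Hadoop daemons.

jps
# the output should include a line like: <pid> SecondaryNameNode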

4. HDFS reading and writing process

4.1 Writing process

1. The client sends a write request to the NameNode.

2. The NameNode checks permissions and remaining space; if the conditions are met, it allows the write and tells the client the address of the DataNode to write to.

3. The client sends data packets to the specified DataNode.

4. While the data is being written to that DataNode, replication happens at the same time: the DataNode that received the data distributes it to the other DataNodes.

5. As above, DataNode1 replicates the data to DataNode2, and then, starting from DataNode2, it is replicated on to DataNode3 and DataNode4.

6. When writing and replication are complete, the NameNode is notified, and the NameNode records the metadata.

Key points:

The NameNode is not responsible for writing data; it only records metadata and approves permissions.

The client writes data directly to one DataNode, generally the one closest to the client in network distance.

Replication of the remaining copies is handled by the DataNodes themselves in a pipeline (as in the steps above: DataNode1 passes the data to DataNode2, which passes it to DataNode3 and DataNode4).
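
To tie the steps together, a quick experiment (the file name and path are placeholders) is to upload a file and then inspect the result with fsck:

# Uploading triggers the write pipeline described above
hadoop fs -put bigfile.dat /tmp/
# Then check how it was split into blocks and where the replicas landed
hdfs fsck /tmp/bigfile.dat -files -blocks -locations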

4.2 Reading process

1. The client sends a request to the NameNode to read a file.

2. The NameNode checks the client's permissions and other details; if reading is allowed, it returns the file's block list.

3. The client takes the block list and reads the blocks from the DataNodes on its own.

Key points:

The data itself is not served by the NameNode; it is read from the DataNodes.

The block list that the NameNode provides is ordered by network-distance calculation, so it tries to offer the replica closest to the client.

This is because each block has 3 replicas (by default), and the NameNode tries to pick the one closest to the client for it to read.
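
A minimal read example, continuing with the placeholder path from the write section:

# Download the file; the block data itself comes from the DataNodes
hadoop fs -get /tmp/bigfile.dat ./bigfile.dat
# Or stream it to the terminal without saving a local copy
hadoop fs -cat /tmp/bigfile.dat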

The hardest thing is to persist. On to the next level~


Origin blog.csdn.net/YuanFudao/article/details/132732657