I want to join a big company: big data Hadoop HDFS knowledge points (1)


01 Let's learn big data together

Today, Lao Liu begins a review of big data Hadoop knowledge points. Hadoop contains three modules; this time I will first share the basic knowledge points of the HDFS module, which you can regard as a summary of today's review. I hope it helps students learning big data, and I hope to receive criticism and guidance from the experts! (Every point is important and none can be ignored.)

02 Knowledge points to remember

Point 1: What is Hadoop?

Hadoop is a distributed system infrastructure developed by Apache. It consists of three modules: HDFS for distributed storage, MapReduce for distributed computing, and YARN, the resource scheduling engine.

Point 2: What is distributed?

Lao Liu saw the answer to this question at a training institution: distributed means using a batch of cheap, ordinary machines connected over a network to complete storage and computing tasks that a single machine cannot complete on its own.

Point 3: What is HDFS?

HDFS is, at first glance, an English abbreviation; its full name is Hadoop Distributed File System. In HDFS, a large number of files can be stored scattered across different servers. A single file that is relatively large and does not fit on a single disk can be split into many small blocks and stored on different servers; the servers are connected through the network and form a single whole.

Point 4: HDFS command usage

In Lao Liu's view, you should memorize at least a few commonly used HDFS commands, so that the interviewer cannot catch you out.

1. List the files in a directory:
hdfs dfs -ls /
2. Create a file in the HDFS file system:
hdfs dfs -touchz /test.txt
3. View the content of an HDFS file:
hdfs dfs -cat /test.txt
4. Upload a file from a local path to HDFS:
hdfs dfs -put <local path> <hdfs path>
5. Download a file from the HDFS file system:
hdfs dfs -get <hdfs path> <local path>
6. Create a directory in the HDFS file system:
hdfs dfs -mkdir /test01
7. Delete a file from the HDFS file system:
hdfs dfs -rm /edits.txt
Pay special attention to this one: there are several kinds of deletion, and this is just one of them (see the examples right after this list)!!!
8. Rename a file in the HDFS file system:
hdfs dfs -mv /test.sh /test01.sh
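
On item 7, a quick hedged aside: besides the plain -rm shown above, HDFS also supports recursive deletion and bypassing the trash. A minimal sketch (the paths here are only examples):

# delete a directory and everything under it
hdfs dfs -rm -r /test01
# delete immediately, skipping the HDFS trash
hdfs dfs -rm -skipTrash /test.txt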

Point 5: The HDFS core concept: the data block (block)

What is HDFS block?

Files on HDFS 3.0 are split into blocks of 128 MB each (the size is controlled by dfs.blocksize) and stored on different data nodes, DataNodes, of the cluster.
[Figure: the blocks of a file distributed across several DataNodes]
Looking at the figure above, you can see how the blocks are distributed. However, this layout has an obvious defect: if one of the data nodes, say DataNode1, fails, the blocks it stores are lost. Therefore, to ensure data availability and fault tolerance, HDFS is designed to keep three copies of each block, and the number of copies is set in hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

With three replicas, the block storage now looks like this:
[Figure: each block stored as three replicas across different DataNodes]
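
If you want to verify how a file's blocks and replicas are actually laid out, two commands that exist for this are fsck and setrep. A small sketch (/test.txt is only a sample path):

# report the blocks, their locations, and the replication of a file
hdfs fsck /test.txt -files -blocks -locations
# change the replication factor of an existing file to 3
hdfs dfs -setrep 3 /test.txt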

Point 6: Draw the HDFS architecture diagram
[Figure: HDFS architecture diagram]

From this diagram you can see that HDFS is a master-slave architecture, Master|Slave, also called management node|worker node.

Point 7: Talk about NameNode

The NameNode is mainly used to manage the nodes of HDFS and its metadata, and it stores the metadata in memory.

Here, you absolutely must know the concept of metadata!!!

Descriptive information about a file or directory, such as the path where the file is located, the file name, the file type, and so on, is called the file's metadata.

HDFS metadata is the file directory tree: all the files and directories of the entire tree, the block list of each file, and the DataNode list where each block is located. Each file, directory, and block occupies roughly 150 bytes of metadata.

HDFS metadata is persisted in two forms: ① the edits log and ② the namespace image file, fsimage. fsimage, the metadata image file, saves the file system directory tree and the correspondence between files and blocks; the edits log, the journal file, records changes to the file system.
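
Both files can be inspected offline with tools that ship with Hadoop: the Offline Image Viewer (oiv) for fsimage and the Offline Edits Viewer (oev) for the edits log. A hedged sketch; the input file names below are hypothetical, since real ones carry transaction IDs:

# dump an fsimage file to readable XML
hdfs oiv -p XML -i fsimage_0000000000000000000 -o fsimage.xml
# dump an edits log file to XML
hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml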

Here, Lao Liu has something to say. At first Lao Liu did not pay much attention to metadata, but as he studied further he kept running into the concept and kept wondering what it was, so he wasted a lot of time going back to review it. You must keep it firmly in mind!!!

Point 8: Talk about DataNode

The DataNode, the data node, stores blocks and block metadata. The metadata here includes the length of the data block, the checksum of the block data, and the timestamp.
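
To see which DataNodes a cluster has and how much they store, one existing admin command is dfsadmin (a quick sketch; superuser permission is needed for full detail):

# print overall capacity plus per-DataNode status and storage usage
hdfs dfsadmin -report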

Point 9: Talk about the Secondary NameNode (note that you can’t ignore it)

First of all, why is the metadata stored in the NameNode in memory?

Because with the metadata in memory, when a client makes a request the NameNode can answer it directly, and reads are extremely fast.

But this is problematic. What's the problem?

Once the system crashes, data will be lost.

But how can this problem be solved?

The answer lies in the edit log (editlog) on the NameNode node, which records every change a client makes to HDFS. Once the system fails, the state can be recovered from the editlog.

Having said all this, Lao Liu can now talk about the Secondary NameNode. User operation requests are generally applied in memory first and then persisted to the fsimage on disk; the edits log is also written to disk. As the NameNode runs for a long time, more and more log records accumulate. If the NameNode then stops, the data in memory disappears; after a restart, the NameNode loads the fsimage file from disk and merges in the log files to produce a complete fsimage. But if there are too many edits log files, NameNode recovery takes especially long. To avoid this situation, there is the Secondary NameNode, which assists the NameNode in merging metadata and speeds up the NameNode's next startup. (One more remark you rarely see: in a Hadoop HA setup built on ZooKeeper, this work is done by the Standby NameNode instead.) The workflow of the Secondary NameNode is shown below:
[Figure: Secondary NameNode checkpoint workflow]

1. The NameNode manages metadata in memory and periodically flushes it to disk as two files: the edits log (operation log) and the fsimage (metadata image). A new operation is not merged into fsimage immediately; it is first recorded in the edits log. When the edits file reaches a critical size (64 MB in older versions) or an interval of one hour has passed, a checkpoint triggers the Secondary NameNode to work (the corresponding configuration is shown after these steps).

2. When a checkpoint is triggered, the NameNode generates a new edits.new file, and the Secondary NameNode copies the current edits and fsimage files to its local disk.

3. The Secondary NameNode loads the local fsimage file into memory, merges it with the edits file, and generates a new fsimage.ckpt file.

4. The Secondary NameNode copies the newly generated fsimage.ckpt file back to the NameNode node.

5. On the NameNode node, edits.new and fsimage.ckpt replace the original edits and fsimage files. At this point one cycle is complete, and the system waits for the next checkpoint to be triggered.
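
The checkpoint trigger thresholds are configurable. In Hadoop 1.x they were fs.checkpoint.period and fs.checkpoint.size (the 64 MB mentioned in step 1); in Hadoop 2.x and later the checkpoint is driven by a time interval and a transaction count instead. A hedged sketch of the hdfs-site.xml settings with their usual defaults:

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>  <!-- seconds: the "1 hour" interval in step 1 -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>  <!-- or after this many uncheckpointed transactions -->
</property>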

03 Summary

That wraps up today's summary of big data Hadoop HDFS knowledge points. This time I summarized the basic knowledge points of HDFS; next time I will summarize and share some HDFS architecture knowledge. I hope it helps students learning big data, and I hope to receive criticism and guidance from the experts.

Finally, if something comes up, contact the official account "Lao Liu who is working hard"; if nothing is wrong, come learn big data with Lao Liu.
