HDFS is still the king of storage

Why is HDFS still the king of storage?

Let's take this question with us as we explore the architecture and principles of HDFS. I have always believed that the best way to learn a big data technology is to read its official documentation, so beginners should always start with the official website; even if you cannot read English, you can rely on translation software.

First, look at the official description:

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets.

From this description alone you can roughly see what HDFS is: in short, it uses many low-cost servers to store data while ensuring high availability, thereby reducing cost.

Since the official website already explains the basic concepts of HDFS very clearly, I will talk about some things the official site does not cover. We know that big data was born out of Google's three famous papers, the first of which, on the Google File System, was published in 2003. Now imagine we have no distributed file system and need to store a file larger than a single disk. What do we do? The corresponding solution is RAID technology.

RAID combines multiple ordinary disks into an array, mainly to improve storage capacity, disk access speed, availability, and fault tolerance.

Put simply, where a server used to have a single disk, it now has N disks. This solves the problem of a file being larger than one disk, and the file can be split into N parts that are read and written concurrently. This is RAID 0. However, RAID 0 keeps no backups, so as soon as one disk is damaged the integrity of the data is destroyed.

Of course, RAID has also been improved: guaranteeing data integrity requires data backups, which in turn reduces the usable disk capacity. You cannot have it both ways, as shown below:

(Figure: comparison of RAID levels, trading usable capacity for redundancy)
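To make that trade-off concrete, here is a small illustrative sketch (plain Java, nothing to do with any Hadoop API) that estimates the usable capacity of a hypothetical four-disk array under a few common RAID levels; real RAID behaviour is more nuanced than these back-of-the-envelope figures:

```java
// Illustrative sketch only: rough usable-capacity figures for common RAID levels,
// assuming N identical disks of the same size.
public class RaidCapacity {

    // RAID 0: striping only, full capacity, but no disk may fail.
    static double raid0(int disks, double diskTb) {
        return disks * diskTb;
    }

    // RAID 1: mirroring, half the capacity, survives one failure per mirror pair.
    static double raid1(int disks, double diskTb) {
        return disks * diskTb / 2;
    }

    // RAID 5: one disk's worth of parity, survives a single disk failure.
    static double raid5(int disks, double diskTb) {
        return (disks - 1) * diskTb;
    }

    public static void main(String[] args) {
        int disks = 4;        // hypothetical array of 4 disks
        double diskTb = 2.0;  // each disk 2 TB
        System.out.printf("RAID 0: %.1f TB usable, 0 failures tolerated%n", raid0(disks, diskTb));
        System.out.printf("RAID 1: %.1f TB usable, 1 failure tolerated%n", raid1(disks, diskTb));
        System.out.printf("RAID 5: %.1f TB usable, 1 failure tolerated%n", raid5(disks, diskTb));
    }
}
```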

In fact, this is the "scale-up" (vertical scaling) model: by upgrading the CPU, memory, disks, and so on, a single machine becomes more powerful. But this no longer works in the Internet era, which demands far greater computing power and data storage. Hence the later appearance of the HDFS distributed file system, which adopts "scale-out" (horizontal scaling) and achieves greater computing power and larger data storage by adding more servers.

Big data has now been developing for more than a decade; new techniques appear endlessly, and computing frameworks and algorithms are constantly being updated.

Everything in big data revolves around data. Since HDFS was the first big data storage system, any new computing framework or algorithm that wants to be widely adopted must support HDFS in order to reach the data stored inside it. So the more big data develops, the more inseparable it becomes from HDFS. HDFS may not be the best big data storage technology, but it is still the most important one.

OK, that is enough groundwork. Now let's talk about the architecture and principles of HDFS, the king of storage. Here is the classic diagram taken from the official website:

(Figure: HDFS architecture diagram from the official website, showing the NameNode and DataNodes)

As can be seen from the diagram above, there are two key components: one is the NameNode, the other is the DataNodes.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates clients' access to files. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on.

HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients.
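As a rough illustration of this division of labor, the following minimal Java sketch uses Hadoop's FileSystem API to ask the NameNode where the blocks of a file live. The file path and NameNode address here are made-up placeholders; the actual block bytes would still be served by the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/part-0"); // hypothetical file path

        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers this query from its namespace metadata:
        // which blocks make up the file, and which DataNodes hold them.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```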

Next, let's look at how HDFS achieves high availability.
The picture below is a schematic of data blocks stored in multiple copies. In the diagram, the file /users/sameerp/data/part-0 has its replication factor set to 2, and its data is stored in the blocks with IDs 1 and 3. The two replicas of Block 1 are stored on the servers DataNode0 and DataNode2, and the two replicas of Block 3 are stored on DataNode4 and DataNode6. If any one of these servers goes down, every data block still has at least one replica available, so access to the file /users/sameerp/data/part-0 is not affected.

(Figure: schematic of HDFS block replication across DataNodes)
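If you wanted to reproduce the replication factor of 2 from this example yourself, a minimal sketch with the Hadoop FileSystem API might look like the following. The NameNode address is a placeholder; in practice the default replication factor comes from the dfs.replication setting, which is 3 by default.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/users/sameerp/data/part-0"); // the file from the example above

        // Ask HDFS to keep 2 replicas of each block of this file.
        boolean changed = fs.setReplication(file, (short) 2);
        short actual = fs.getFileStatus(file).getReplication();
        System.out.println("replication changed: " + changed + ", now " + actual);
        fs.close();
    }
}
```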

That is all for the basic architecture and principles of HDFS. The next article will analyze HDFS's concrete high-availability design in more detail. For more content, you can follow the WeChat public account "懂点大数据".
