Big Data Distributed File System HDFS

Software version

hadoop-2.6.0-cdh5.7.0.tar.gz

Contents:

1. Distributed file system HDFS
2. Advantages and disadvantages of HDFS
3. Design ideas of a distributed file system
4. HDFS architecture diagram
5. Hadoop download and JDK installation
6. ssh installation and HDFS file parameter configuration
7. HDFS shell operation
8. HDFS Java API programming
9. HDFS read and write data flow
10. New features of Hadoop
11. A practical case: HDFS log collection


1. Distributed file system HDFS

 

1) Once a dataset grows beyond a certain scale, a single machine can no longer process it

2) So the data is distributed across independent machines (multiple machines work together)

 

Reference material can be found on the official website: hadoop.apache.org

 



 

 

Introduction

HDFS (Hadoop Distributed File System) is a distributed file system designed to run on low-cost commodity hardware. It has many similarities with existing distributed file systems, but the differences from them are significant: HDFS is highly fault-tolerant, provides high-throughput access to application data, and is well suited to applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. It was originally built as infrastructure for the Apache Nutch web search engine project and is now part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/.

 

2. Advantages and disadvantages of HDFS

Advantages: 1. Built from inexpensive machines. 2. Suitable for big data processing. 3. High fault tolerance.

Official description

1. Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With such a huge number of components, each having a non-trivial probability of failure, some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

2.  Streaming Data Access

Applications that run on HDFS need streaming access to their data sets; this distinguishes them from general-purpose applications running on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users: the emphasis is on high throughput of data access rather than low latency. POSIX imposes many hard requirements that such applications do not need, so POSIX semantics in a few key areas have been traded away to increase data throughput rates.

3.  Large Data Sets

Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

4.  Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates: content can be appended to the end of a file, but it cannot be updated at an arbitrary position. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits this model perfectly.
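To make the model concrete, here is a minimal sketch against the standard FileSystem Java API. The NameNode URI, file path, and class name are placeholders of my own, and the append call assumes append support is enabled on your cluster:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppendDemo {
    public static void main(String[] args) throws Exception {
        // hdfs://localhost:8020 is a placeholder; point it at your NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                new Configuration());
        Path file = new Path("/demo/events.log");

        // Write once: create, write, close. The closed file's contents are fixed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("first record\n");
        }

        // Appending to the end is the only supported mutation (no random updates).
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended record\n");
        }
        fs.close();
    }
}
```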

5. Moving Computation is Cheaper than Moving Data

A computation requested by an application is much more efficient if it executes near the data it operates on. This is especially true when the data set is huge: it minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running, and HDFS provides interfaces for applications to move themselves closer to the data.
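One such interface is block-location lookup: a client or scheduler can ask the NameNode which hosts hold a file's blocks and place computation there. A small illustrative sketch (URI, path, and class name are placeholders):

```java
import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/events.log"));

        // Ask the NameNode which DataNodes hold each block of the file;
        // a scheduler can use these hosts to run tasks next to the data.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```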

6.  Portability Across Heterogeneous Hardware and Software Platforms

HDFS is designed to be easy to port from one platform to another. This has contributed to the widespread adoption of HDFS as the platform of choice for a large set of applications.

 

Disadvantages

1. Not suitable for low-latency data access. 2. Not suitable for storing large numbers of small files.

3. Design ideas of a distributed file system

1. Split a file into multiple blocks

2. Each block is stored on multiple nodes as replicas

3. Maintain the metadata mapping (which blocks make up each file, and where each block is stored)

4. Load balancing

5. Distributed parallel computing



4. HDFS Architecture Diagram


HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients, and they also perform block creation, deletion, and replication upon instruction from the NameNode.
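This division of labor is visible from the client side: opening a file fetches block metadata from the NameNode, while the bytes themselves stream from the DataNodes. A minimal read sketch, assuming a placeholder NameNode URI and file path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                new Configuration());

        // open() gets block metadata from the NameNode; the bytes are then
        // streamed directly from the DataNodes that hold each block.
        try (FSDataInputStream in = fs.open(new Path("/demo/events.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```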

 

The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system. HDFS is built using the Java language, so any machine that supports Java can run the NameNode or the DataNode software; this high portability means HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software, while each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

 

HDFS file replica placement

  

 

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file.

The NameNode maintains the file system namespace; any change to the namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file, and this information is stored by the NameNode.
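For illustration, each call below is a pure namespace operation handled by the NameNode; no file data is moved. A rough sketch with placeholder URI and paths:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                new Configuration());

        fs.mkdirs(new Path("/demo/in"));                        // create a directory
        fs.rename(new Path("/demo/in"), new Path("/demo/out")); // move / rename
        fs.delete(new Path("/demo/out"), true);                 // recursive delete
        fs.close();
    }
}
```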

 

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks, and the blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.

All blocks in a file except the last block are the same size. (Since support for variable-length blocks was added to append and hsync, users can start a new block without filling the last block up to the configured block size.)

An application can specify the replication factor of a file. The replication factor can be specified at file creation time and changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
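Both ways of setting the factor can be sketched with the Java API. This is only an illustration: the URI and path are placeholders, and dfs.replication set on the client side affects only files this client creates:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default replication factor for files this client creates.
        conf.set("dfs.replication", "2");
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), conf);

        Path file = new Path("/demo/events.log");
        fs.create(file).close();            // created with replication factor 2

        // The factor is per file and can be changed after creation.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```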

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster: receipt of a Heartbeat implies that the DataNode is functioning properly, and a Blockreport contains a list of all blocks on a DataNode.

 

A summary in my own words:

1 master (NameNode/NN) with N slaves (DataNode/DN)

HDFS, YARN, and HBase all use a similar master/slave structure

 

A file is split into multiple blocks.

Block size: 128 MB

A 130 MB file ==> 2 blocks: 128 MB and 2 MB
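A quick sanity check of that arithmetic in plain Java (sizes in MB for readability; 128 MB is the Hadoop 2.x default block size):

```java
public class BlockSplit {
    public static void main(String[] args) {
        final long blockSize = 128;                         // MB, Hadoop 2.x default
        long fileSize = 130;                                // MB
        long fullBlocks = fileSize / blockSize;             // 1 full 128 MB block
        long remainder = fileSize % blockSize;              // 2 MB left over
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);
        System.out.println(totalBlocks + " blocks, last block = " + remainder + " MB");
        // Prints: 2 blocks, last block = 2 MB
    }
}
```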

 

NN

1) Responsible for responding to client requests

2) Responsible for managing metadata (file names, replication factor, and which DNs each block is stored on)

 

DN

1) Stores the data blocks that make up users' files

2) Periodically sends heartbeats to the NN, reporting its own health status and information about all of its blocks

 

1 NameNode + N DataNodes

Recommendation: deploy the NN and DNs on different nodes

Replication factor (replica coefficient): the number of copies kept of each block

All blocks in a file except the last block are the same size


5. Hadoop download and JDK installation


6. ssh installation and HDFS file parameter configuration


7. HDFS shell operation


8. HDFS Java API programming


9. HDFS read and write data flow


10. New features of Hadoop


11. A practical case: HDFS log collection

