Introduction to HDFS distributed file storage and common commands

1. Introduction to HDFS

1.1 What is HDFS?

HDFS (Hadoop Distributed File System) is a core part of the Hadoop ecosystem. It is the storage component of Hadoop and its most fundamental layer: computing models such as MapReduce rely on data stored in HDFS. HDFS is a distributed file system designed for streaming access to very large files, splitting data into blocks that are stored across different machines in a cluster of commodity hardware.

1.2 Design goals of HDFS

  • Storage of very large files: a single file is usually more than 100 MB, and HDFS is suited to storing massive numbers of files, with total capacity reaching the PB or EB level.
  • Streaming data access: designed for batch processing of data, emphasizing high throughput of data access.
  • Hardware fault tolerance: HDFS is built on ordinary machines, where hardware failure is the norm rather than the exception, so error detection and fast, automatic recovery are core architectural goals.
  • Simple consistency model: write once, read many times; a file does not need to change after it has been created, written, and closed.
  • No low-latency data access: HDFS cares about high data throughput and is not suitable for applications that require low-latency access.
  • Local computation: move the computation close to the data rather than moving the data to the computation.

1.3 The composition of HDFS

Data blocks

  • Files are split into blocks for storage, and blocks are usually set relatively large (64 MB in older versions, 128 MB by default).
  • The larger the block, the lower the addressing overhead relative to the data transferred, and the higher the read efficiency. At the same time, because MapReduce tasks are also processed block by block, blocks that are too large hurt parallel processing of the data.
  • A file occupies at least one block (the block is a logical concept; a file smaller than a block does not consume a full block of physical storage).
  • Redundant replication: data blocks lend themselves to replication, which provides fault tolerance and improves availability. Each block can have multiple replicas (three by default) stored on separate machines, so a single point of failure does not cause data loss (the fsck example below shows a file's blocks and replicas).
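
To see how a concrete file is split into blocks and how many replicas each block has, the fsck tool that ships with HDFS can be used; the path below is only an example.

# Show the blocks, their locations, and the replication factor of a file (the path is an example)
hdfs fsck /user/che/1021001/test.txt -files -blocks -locations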

NameNode

  • The NameNode is responsible for maintaining the metadata of the entire file system, including the file system tree, the distribution of file blocks, other file system metadata, and the data replication policy.

DataNode

  • The DataNode stores the actual contents of files and is responsible for the real read and write operations. It keeps communicating with the NameNode and reports its block information to it.
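
To see which DataNodes the NameNode currently knows about, along with their capacity and usage, the dfsadmin report is a quick check (on most clusters it requires HDFS admin privileges).

# Report overall cluster capacity and the status of each DataNode
hdfs dfsadmin -report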

Secondary NameNode

  • The Secondary NameNode merges the NameNode's edit logs into the fsimage file. Its whole purpose is to provide a checkpoint in HDFS: it is a helper node that lets the NameNode work better, not a replacement for the NameNode and not a backup of it. This is why it is referred to in the community as the checkpoint node.
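
As a small illustration of the checkpoint files involved, Hadoop's Offline Image Viewer can dump an fsimage file into a readable form; the fsimage path and file name below are placeholders for whatever checkpoint exists on a real cluster.

# Dump a checkpointed fsimage to XML for inspection (paths and file name are placeholders)
hdfs oiv -p XML -i /data/dfs/name/current/fsimage_0000000000000001234 -o /tmp/fsimage.xml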

1.4 Storage file formats supported by HDFS

HDFS supports any file format

The commonly used formats are as follows:

1. SequenceFile: key-value format; takes up more disk space than the source text format
2. TextFile: line-oriented plain text; used fairly often in production
3. RCFile: mixed row/column storage
4. ORC: columnar storage; widely used in production
5. Parquet: columnar storage; widely used in production
6. Avro: rarely used
7. JsonFile: JSON format; rarely used
8. InputFormat: rarely used

In big data storage, more than 99% of scenarios use columnar storage.
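
Whatever the format, the data ultimately sits in HDFS as ordinary files. One small example: hdfs dfs -text can render a SequenceFile or compressed file as readable text, whereas -cat only prints the raw bytes; the path below is made up for illustration.

# Print a SequenceFile or compressed file as readable text (the path is an example)
hdfs dfs -text /user/che/warehouse/part-00000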

1.5 HDFS small file problems and solutions

  • Problem: a large number of files whose size is much smaller than the block size
  • Background: the metadata object of each file occupies about 150 bytes of NameNode memory, so if there are 10 million small files and each file occupies one block, the NameNode needs about 2 GB of memory; if 100 million files are stored, the NameNode needs about 20 GB. In addition, data is processed in units of blocks.
  • Impact: small files occupy NameNode memory and cluster resources and reduce processing efficiency
  • Solutions:
    • Reduce small files at the source
    • Package small files with Hadoop Archive (see the example after this list)
    • Use other storage systems, such as HBase or ES
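
A sketch of the archive approach: Hadoop's built-in hadoop archive tool packs a directory of small files into a single .har archive, which the NameNode then tracks with far fewer metadata objects. The directory names and archive name below are invented for illustration.

# Pack the small files under /user/che/logs into one archive (paths and name are examples)
hadoop archive -archiveName logs.har -p /user/che/logs /user/che/archived

# The archived files can still be listed and read through the har:// scheme
hdfs dfs -ls har:///user/che/archived/logs.har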

2. Commonly used HDFS commands

# List the files and directories under a directory
hdfs dfs -ls /

# Create a directory
hdfs dfs -mkdir -p /user/che/1021001

# Copy a local file into the HDFS file system
hdfs dfs -copyFromLocal /data/test.txt  /user/che/1021001/

# View the contents of a file
hdfs dfs -tail /user/che/1021001/test.txt
hdfs dfs -cat /user/che/1021001/test.txt

# Delete a file
hdfs dfs -rm /user/che/1021001/test.txt

# Delete a directory
hdfs dfs -rm -r /user/che
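
Beyond the commands above, a few other subcommands come up often in day-to-day use; the paths reuse the same example directory.

# Upload a local file to HDFS (similar to -copyFromLocal)
hdfs dfs -put /data/test.txt /user/che/1021001/

# Download a file from HDFS to the local file system
hdfs dfs -get /user/che/1021001/test.txt /data/

# Show the size of files and directories in human-readable units
hdfs dfs -du -h /user/che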

Origin blog.csdn.net/ytangdigl/article/details/109201558