A Summary of HDFS with Hadoop 3 and Python

Hadoop overview

        Hadoop is a distributed computing infrastructure developed under the Apache Software Foundation. It lets users write distributed programs without having to understand the low-level details of distribution, while making full use of a cluster's power for high-speed computation and storage.
        Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant, is designed to run on inexpensive hardware, and provides high-throughput access to application data, which makes it well suited to applications with very large data sets. The core of the Hadoop framework is HDFS plus MapReduce: HDFS provides storage for massive data sets, while MapReduce provides computation over them.

The three core components of Hadoop

  • HDFS 
  • MapReduce
  • Yarn

Installing Hadoop 3 (stand-alone / pseudo-distributed) on Linux

Please refer to the article: Installing the Hadoop 3 stand-alone (pseudo-distributed) version on CentOS 7.

HDFS principle 

HDFS adopts Master/Slave architecture.

  • An HDFS cluster contains a single NameNode and multiple DataNodes.
  • The NameNode is the Master service. It manages the file system namespace and client access to files, and it holds the file system's metadata: file information, how each file is split into blocks, and which DataNodes each block resides on. For the entire cluster, HDFS presents a single namespace to users through the NameNode.
  • DataNodes are the Slave services; a cluster contains many of them, usually one per physical node. Each DataNode manages the storage attached to its node: it stores data as blocks, manages block metadata, and periodically reports all of its blocks to the NameNode.

The HDFS architecture involves three main roles: Client, NameNode, and DataNode.

Some key elements of HDFS

  • Block: files are split into fixed-size blocks, 128 MB by default in Hadoop 2.x/3.x (64 MB in older releases).
  • NameNode: the Master node. It manages the block mapping, handles client read and write requests, applies the replica-placement policy, and manages the HDFS namespace. The directory tree, file metadata, and block information of the entire file system are held on this single host.
  • SecondaryNameNode: an assistant that offloads work from the NameNode. It is a cold backup of the NameNode: it periodically merges fsimage and edits and sends the result back to the NameNode. (Hot backup: b is a hot backup of a; if a fails, b takes over immediately. Cold backup: b is a cold backup of a; if a fails, b cannot take over immediately, but it holds some of a's state, reducing the loss when a fails.)
  • DataNode: a Slave node that does the actual work. It stores the data blocks sent by the Client and performs read and write operations on them.
  • fsimage: the metadata image file (the file system's directory tree)
  • edits: the metadata operation log (a record of modifications made to the file system)

HDFS design highlights

  1. Data backup: HDFS is designed to reliably store massive amounts of data across the machines of a large cluster. It stores each file as a sequence of blocks; all blocks of a file are the same size except the last one.
  2. Files in HDFS follow a write-once, read-many model, and at any given time there is strictly at most one writer.
  3. The NameNode fully manages block replication. It periodically receives heartbeats and block reports (BlockReport) from each DataNode in the cluster. Receiving a heartbeat means the DataNode is working properly; a block report lists all blocks stored on that DataNode.
  4. What the NameNode keeps in memory is fsimage + edits. The SecondaryNameNode periodically (every hour by default) fetches fsimage and edits from the NameNode, merges them, and sends the result back, reducing the NameNode's workload.

HDFS file writing 

The Client initiates a file write request to the NameNode.

  1. Based on the file size and the block-size configuration, the NameNode returns to the Client information about the DataNodes it manages.
  2. The Client divides the file into blocks and, following the DataNode address information returned, writes the blocks to the DataNodes in sequence.

Example: there is a file FileA of size 100 MB, and the Client writes FileA to HDFS.

Hadoop environment configuration:

  1. HDFS is distributed across three racks: Rack1, Rack2, and Rack3.

The file writing process is as follows:

  1. The Client divides FileA into 64 MB blocks, producing two blocks: block1 and block2;
  2. The Client sends a write request to the NameNode, shown as the blue dotted line ① in the figure.
  3. The NameNode records the block information and returns the available DataNodes, shown as the pink dotted line ②:
    1. Block1: host2, host1, host3
    2. Block2: host7, host8, host4
    3. Placement principles:
      1. The NameNode has a rack-awareness (RackAware) feature, which can be configured.
      2. If the Client is itself a DataNode, the placement rules are: replica 1 on the same node as the Client; replica 2 on a node in a different rack; replica 3 on another node in the same rack as the second replica; any further replicas are placed randomly.
      3. If the Client is not a DataNode, the rules are: replica 1 on a randomly chosen node; replica 2 on a node in a different rack from replica 1; replica 3 on another node in the same rack as replica 2; any further replicas are placed randomly.
  4. The Client sends block1 to the DataNodes. The transfer is a streaming (pipelined) write, which proceeds as follows:
    1. The 64 MB block1 is split into 64 KB packets;
    2. The Client sends the first packet to host2;
    3. After host2 receives it, it forwards the first packet to host1 while the Client sends the second packet to host2;
    4. After host1 receives the first packet, it forwards it to host3 while receiving the second packet from host2;
    5. And so on, as shown by the solid red line in the figure, until block1 has been fully sent.
    6. host2, host1, and host3 then notify the NameNode, and host2 notifies the Client, that the block has been received, as shown by the solid pink line in the figure.
    7. After receiving host2's notification, the Client tells the NameNode that block1 has been written, as shown by the thick yellow line. block1 is now complete.
    8. After block1 has been sent, block2 is sent to host7, host8, and host4, as shown by the solid blue line in the figure.
    9. After block2 has been sent, host7, host8, and host4 notify the NameNode, and host7 notifies the Client, as shown by the light green solid line in the figure.
    10. The Client tells the NameNode that the file has been written, as shown by the thick yellow solid line in the figure. The write is now complete.
  5. Analysis: from the writing process we can see that (a small Python sketch of this flow follows this list):
    1. Writing a 1 TB file (with a replication factor of 3) requires 3 TB of storage and 3 TB of network traffic.
    2. During reads and writes, the NameNode and DataNodes communicate via heartbeats to confirm that each DataNode is alive. If a DataNode is found to be dead, its data is re-replicated onto other nodes, and reads are served from the remaining replicas.
    3. It does not matter if a node goes down, since other nodes hold the replicas; even if an entire rack goes down, replicas still exist on other racks.
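
A minimal Python sketch of this write flow, using the pyhdfs library introduced later in this article (the host, path, and payload are placeholder assumptions): HdfsClient.create() performs the NameNode request and the pipelined DataNode writes described above on the client's behalf.

import pyhdfs

# Assumed cluster address and user; adjust to your environment.
fs = pyhdfs.HdfsClient(hosts="192.168.43.11:9870", user_name="root")

# Create /demo/fileA on HDFS. WebHDFS lets the client suggest a block size and
# replication factor; the NameNode then chooses the DataNodes and the client
# streams the data to them packet by packet, as in the walkthrough above.
fs.mkdirs("/demo")
fs.create("/demo/fileA",
          b"example payload standing in for the 100 MB FileA",
          replication=3,                # three replicas, as in the example
          blocksize=64 * 1024 * 1024,   # 64 MB blocks, matching the walkthrough
          overwrite=True)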

HDFS file reading

The Client initiates a file read request to the NameNode.

  1. The Client sends the read request to the NameNode.
  2. The NameNode returns the file's block information and the DataNodes on which each block is stored.
  3. The Client reads the file data from those DataNodes.

As shown in the figure, the Client needs to read FileA from the DataNodes. FileA consists of block1 and block2. The read flow is as follows:

  1. The Client sends a read request to the NameNode.
  2. The NameNode checks its metadata and returns the locations of FileA's blocks:
    1. block1: host2, host1, host3
    2. block2: host7, host8, host4
  3. The blocks are read in order: block1 first, then block2. block1 is read from host2, and block2 from host7.

In the example above, the Client sits outside the racks. If the Client is instead located on a DataNode inside a rack, for example host6, then the rule when reading is to prefer the replica on its own rack.
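
For completeness, a matching read sketch with pyhdfs (same placeholder host and path as the write sketch above): open() asks the NameNode where the blocks live and streams them back, and HDFS itself decides which replica of each block serves the read.

import pyhdfs

fs = pyhdfs.HdfsClient(hosts="192.168.43.11:9870", user_name="root")

# The caller sees one continuous byte stream; block and replica selection
# happen inside HDFS.
response = fs.open("/demo/fileA")
data = response.read()
print(len(data), "bytes read")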

HDFS data backup

How replica data is stored is key to HDFS reliability and performance. HDFS uses a rack-aware placement strategy to decide where replicas are stored.

Through a process called Rack Awareness, the NameNode determines the rack id to which each DataNode belongs.

By default, each block has three replicas:

  1. one on the DataNode chosen by the NameNode (the Client's own node, if the Client is a DataNode);
  2. one on a DataNode in a different rack from the first;
  3. one on another DataNode in the same rack as the second replica.

This strategy balances tolerance of a whole-rack failure against the cost of replicating data between racks.

Replica selection: to reduce overall bandwidth consumption and read latency, HDFS tries to serve a read from the replica closest to the reader. If a replica exists on the same rack as the reader, that replica is used; if the HDFS cluster spans multiple data centers, a replica in the local data center is preferred.
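
As an aside, the replication factor of an existing file can also be inspected and changed programmatically. The sketch below uses the pyhdfs library introduced later in this article (placeholder host and path; the attribute and parameter names follow the WebHDFS FileStatus/SETREPLICATION fields): after the change, the NameNode adds or removes replicas according to the placement policy above.

import pyhdfs

fs = pyhdfs.HdfsClient(hosts="192.168.43.11:9870", user_name="root")

# The FileStatus returned by get_file_status() carries the current replication factor.
status = fs.get_file_status("/demo/fileA")
print("current replication:", status.replication)

# Ask the NameNode to keep two replicas instead; re-replication happens in the background.
fs.set_replication("/demo/fileA", replication=2)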

HDFS Shell Commands

Introduction to HDFS Shell

        To invoke the file system (FS) shell, use the form bin/hadoop fs <args>. All FS shell commands take URI paths as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local file system the scheme is file. The scheme and authority are optional; if not specified, the default scheme from the configuration is used.

Detailed explanation of HDFS Shell commands

cat

Usage: hadoop fs -cat URI [URI …]

Output the contents of the file specified by the path to stdout .

Example:

  • hadoop fs -cat hdfs://host1:port1/file1 hdfs://host2:port2/file2
  • hadoop fs -cat file:///file3 /user/hadoop/file4

Return value:
0 is returned on success, and -1 is returned on failure.

chgrp

Usage: hadoop fs -chgrp [-R] GROUP URI [URI …]

Change the group to which the file belongs. With -R, the change is made recursively through the directory structure. The user running the command must be the file's owner or a superuser. See the HDFS Permissions User Guide for more information.

chmod

Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

Change the permissions of the file. With -R, the change is made recursively through the directory structure. The user running the command must be the file's owner or a superuser. See the HDFS Permissions User Guide for more information.

chown

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

Change the owner of the file. With -R, the change is made recursively through the directory structure. The user running the command must be a superuser. See the HDFS Permissions User Guide for more information.

copyFromLocal

Usage: hadoop fs -copyFromLocal <localsrc> URI

Similar to the put command, except that the source path must be a local file.

copyToLocal

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to the get command, except that the destination path must be a local file.

cp

Usage: hadoop fs -cp URI [URI ...] <dest>

Copy files from source path to destination path. This command allows multiple source paths, in which case the target path must be a directory.
Example:

  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

Return value:

Returns 0 on success and -1 on failure.

du

Usage: hadoop fs -du URI [URI …]

Displays the sizes of all files in a directory, or the size of the single file when only one file is specified.
Example:
hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://host:port/user/hadoop/dir1
Return value:
0 is returned on success, -1 on failure.

dus

Usage: hadoop fs -dus <args>

Displays a summary of the file sizes. (Deprecated in Hadoop 2 and later; use hadoop fs -du -s instead.)

expunge

Usage: hadoop fs -expunge

Empty the Trash. Refer to the HDFS design documentation for more information on the trash feature.

get

Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

Copy the file to the local file system. Files that fail the CRC check can be copied using the -ignorecrc option. Use the -crc option to copy the file along with its CRC information.

Example:

  • hadoop fs -get /user/hadoop/file localfile
  • hadoop fs -get hdfs://host:port/user/hadoop/file localfile

Return value:

Returns 0 on success and -1 on failure.

getmerge

Usage: hadoop fs -getmerge <src> <localdst> [addnl]

Takes a source directory and a destination file as input and concatenates all files in the source directory into the local destination file. addnl is optional and specifies adding a newline at the end of each file.

ls

Usage: hadoop fs -ls <args>

If the argument is a file, returns the file's information in the following format:
filename <number of replicas> file size modification date modification time permissions user ID group ID
If the argument is a directory, returns a list of its direct children, as in Unix. A directory listing is formatted as:
directory name <dir> modification date modification time permissions user ID group ID
Example:
hadoop fs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://host:port/user/hadoop/dir1 /nonexistentfile
Return value:
0 on success, -1 on failure.

lsr

Usage: hadoop fs -lsr <args>
Recursive version of the ls command, similar to Unix ls -R. (Deprecated; use hadoop fs -ls -R in Hadoop 2 and later.)

mkdir

Usage: hadoop fs -mkdir <paths>

Takes path URIs as arguments and creates the directories. Its behavior is similar to Unix mkdir -p: parent directories along the path are created as needed.

Example:

  • hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
  • hadoop fs -mkdir hdfs://host1:port1/user/hadoop/dir hdfs://host2:port2/user/hadoop/dir

Return value:

Returns 0 on success and -1 on failure.

moveFromLocal

Usage: hadoop fs -moveFromLocal <localsrc> <dst>

Similar to the put command, except that the local source is deleted after it has been copied.

mv

Usage: hadoop fs -mv URI [URI …] <dest>

Move files from source path to destination path. This command allows multiple source paths, in which case the target path must be a directory. Moving files between different file systems is not allowed.
Example:

  • hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -mv hdfs://host:port/file1 hdfs://host:port/file2 hdfs://host:port/file3 hdfs://host:port/dir1

Return value:

Returns 0 on success and -1 on failure.

put

Usage: hadoop fs -put <localsrc> ... <dst>

Copy single or multiple source paths from the local file system to the target file system. Also supports reading input from standard input and writing to the target file system.

  • hadoop fs -put localfile /user/hadoop/hadoopfile
  • hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
  • hadoop fs -put localfile hdfs://host:port/hadoop/hadoopfile
  • hadoop fs -put - hdfs://host:port/hadoop/hadoopfile
    reads input from standard input.

Return value:

Returns 0 on success and -1 on failure.

rm

Usage: hadoop fs -rm URI [URI …]

Delete the specified files. This command does not delete directories recursively; see the rmr command (or hadoop fs -rm -r) for recursive deletion.
Example:

  • hadoop fs -rm hdfs://host:port/file /user/hadoop/emptydir

Return value:

Returns 0 on success and -1 on failure.

rmr

Usage: hadoop fs -rmr URI [URI …]

Recursive version of delete. (Deprecated; use hadoop fs -rm -r in Hadoop 2 and later.)
Example:

  • hadoop fs -rmr /user/hadoop/dir
  • hadoop fs -rmr hdfs://host:port/user/hadoop/dir

Return value:

Returns 0 on success and -1 on failure.

setrep

Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Change the replication factor of a file, or of all files under a directory recursively. The -w flag waits for the replication to complete.

Example:

  • hadoop fs -setrep -w 3 -R /user/hadoop/dir1

Return value:

Returns 0 on success and -1 on failure.

stat

Usage: hadoop fs -stat URI [URI …]

Returns statistics for the specified path.

Example:

  • hadoop fs -stat path

Return value:
0 is returned on success, and -1 is returned on failure.

tail

Usage: hadoop fs -tail [-f] URI

Output the last kilobyte of the file to stdout. The -f option is supported, behaving as in Unix.

Example:

  • hadoop fs -tail pathname

Return value:
0 is returned on success, and -1 is returned on failure.

test

Usage: hadoop fs -test -[ezd] URI

Options:
-e Check whether the file exists; returns 0 if it exists.
-z Check whether the file is zero length; returns 0 if it is.
-d Check whether the path is a directory; returns 0 if it is.

Example:

  • hadoop fs -test -e filename

text

Usage: hadoop fs -text <src>

Outputs the source file in text format. The allowed formats are zip and TextRecordInputStream.

touchz

Usage: hadoop fs -touchz URI [URI ...]

Create an empty file of 0 bytes.

Example:

  • hadoop fs -touchz pathname

Return value:
0 is returned on success, and -1 is returned on failure.

HDFS Python API

In addition to the HDFS shell commands, HDFS can also be accessed programmatically through the Java API, Python API, C++ API, and so on. Before programming against the Python API, you need to install the third-party Python library PyHDFS.

pyhdfs 

Official document address: https://pyhdfs.readthedocs.io/en/latest/pyhdfs.html

From the official description:

WebHDFS client with support for NN HA and automatic error checking

For details on the WebHDFS endpoints, see the Hadoop documentation:

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html

In short: the pyhdfs module is a WebHDFS client for connecting to HDFS, with support for NameNode HA and automatic error checking. See the official documentation linked above for detailed usage.
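
For example (a minimal sketch; the host names below are placeholders), NameNode HA support simply means that HdfsClient accepts several NameNode addresses and automatically directs requests to the active one:

import pyhdfs

# Two NameNodes of an HA pair; pyhdfs finds and uses the active NameNode.
fs = pyhdfs.HdfsClient(hosts=["nn1.example.com:9870", "nn2.example.com:9870"],
                       user_name="hdfs")
print(fs.get_active_namenode())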

pyhdfs core classes

In Python 3, pyhdfs is used mainly through the HdfsClient class.

Installing the pyhdfs module

pip install pyhdfs

pyhdfs usage example

# _*_ coding : UTF-8 _*_
# Author  : zhuozhiwengang
# Created : 2023/8/11 22:35
# File    : pythonHdfs_1
# IDE     : PyCharm

import pyhdfs

# Using the pyhdfs module, connect to the Hadoop host on port 9870 (the NameNode WebHDFS port)
fs = pyhdfs.HdfsClient(hosts="192.168.43.11:9870", user_name="root")

# Print the user's home directory
print(fs.get_home_directory())

# Print the currently active NameNode
print(fs.get_active_namenode())

# List all files under the given directory
print(fs.listdir("/"))

# Create a directory on Hadoop
fs.mkdirs('/uploads')

# List the root directory again to confirm the new directory exists
print(fs.listdir("/"))

# Upload a local file to the given HDFS directory
# fs.copy_from_local(r"D:\one.txt", '/uploads/one.txt')

# Download an HDFS file to the local file system
# fs.copy_to_local("/uploads/one.txt", r'D:\two.txt')

# Check whether the directory exists
print(fs.exists("/uploads"))
# Walk the directory: returns (path, directories, files) tuples
print(list(fs.walk('/uploads')))

# Delete a directory / a file
fs.delete("/uploads", recursive=True)  # delete a directory (recursive=True required)
fs.delete("/uploads/one.txt")  # delete a single file

A general-purpose Python 3 wrapper class around pyhdfs

# _*_ coding : UTF-8 _*_
# Author  : zhuozhiwengang
# Created : 2023/8/11 22:35
# File    : pythonHdfs
# IDE     : PyCharm

import pyhdfs

class HDFS:
    def __init__(self, host='192.168.43.11', user_name='root'):
        self.host = host
        self.user_name = user_name

    # Create a connection (HdfsClient) to the cluster
    def get_con(self):
        try:
            hdfs = pyhdfs.HdfsClient(hosts=self.host, user_name=self.user_name)
            return hdfs
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # List all files under the given directory
    def listdir(self, oper):
        try:
            client = self.get_con()
            dirs = client.listdir(oper)
            for row in dirs:
                print(row)
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Return the user's home directory
    def get_home_directory(self):
        try:
            client = self.get_con()
            print(client.get_home_directory())
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Return the currently active NameNode
    def get_active_namenode(self):
        try:
            client = self.get_con()
            print(client.get_active_namenode())
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Create a new directory
    def mkdirs(self, oper):
        try:
            client = self.get_con()
            print(client.mkdirs(oper))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Copy a file from the cluster to the local file system
    def copy_to_local(self, hdfs_path, local_path):
        try:
            client = self.get_con()
            print(client.copy_to_local(hdfs_path, local_path))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Upload a local file to the cluster
    def copy_from_local(self, localsrc, dest):
        try:
            client = self.get_con()
            print(client.copy_from_local(localsrc, dest))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Show the contents of a file
    def read_files(self, oper):
        try:
            client = self.get_con()
            response = client.open(oper)
            print(response.read())
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Append content to an existing file
    def append_files(self, file, content):
        try:
            client = self.get_con()
            print(client.append(file, content))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Check whether a file exists
    def check_files(self, oper):
        try:
            client = self.get_con()
            print(client.exists(oper))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Get the checksum of a file
    def get_file_checksum(self, oper):
        try:
            client = self.get_con()
            print(client.get_file_checksum(oper))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Get a summary of the given path (size, file and directory counts, quota)
    def get_content_summary(self, oper):
        try:
            client = self.get_con()
            print(client.get_content_summary(oper))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Get the status of the given path
    def list_status(self, oper):
        try:
            client = self.get_con()
            print(client.list_status(oper))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

    # Delete a file
    def delete_files(self, path):
        try:
            client = self.get_con()
            print(client.delete(path))
        except pyhdfs.HdfsException as e:
            print("Error: %s" % e)

if __name__ == '__main__':
    db = HDFS('Hadoop3-master','root')
    # db.listdir('/user')
    # db.get_home_directory()
    # db.get_active_namenode()
    # db.mkdirs('/dave')
    # db.copy_from_local("D:/one.txt","/uploads/two.txt")
    # db.listdir('/uploads')
    # db.read_files('/uploads/two.txt')
    # db.check_files('/uploads/two.txt')
    # db.get_file_checksum('/uploads/two.txt')
    # db.get_content_summary('/')
    # db.list_status('/')
    # db.list_status('/uploads/two.txt')
    # db.copy_to_local("/uploads/two.txt","D:/one.txt")
    # db.append_files('/uploads/two.txt',"88, CSDN 博客")
    # db.read_files('/uploads/two.txt')
    # db.copy_from_local("D:/three.txt", "/uploads/two.txt")
    db.listdir('/uploads')
    # db.delete_files('/uploads/two.txt')

Origin blog.csdn.net/zhouzhiwengang/article/details/132257422