Java Big Data Road--HDFS Detailed Explanation (5)--Execution Process and API Operation

HDFS (Hadoop Distributed File System) -- execution processes and API operations

Table of contents

HDFS (Hadoop Distributed File System) -- execution processes and API operations

Processes

1. Read process (download)

2. Write process (upload)

3. Deletion process

API operations

1. Preparation steps

2. API operations


Processes

1. Read process (download)

  1. The client sends an RPC request to the NameNode.
  2. After receiving the request, the NameNode validates it:
    1. Check whether the specified path exists.
    2. Check whether the file exists.
  3. If the file exists, the NameNode reads the metadata and sends a signal to the client.
  4. The client asks the NameNode for the addresses of the first Block.
  5. After receiving the request, the NameNode reads the metadata and returns the addresses of the first Block to the client as a queue.
  6. By default a Block has three addresses (3 replicas). After receiving the queue, the client picks the nearest node and reads the first Block from it, then verifies the Block with its checksum. If verification fails, the client notifies the NameNode and re-reads the Block from another address; if it succeeds, the client asks the NameNode for the addresses of the second Block and repeats steps 4, 5 and 6.
  7. After the client has read all the Blocks, it sends an end signal to the NameNode, and the NameNode closes the file (a client-side sketch of this flow follows the list).
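
From the client's point of view, the whole flow above is hidden behind a few calls: FileSystem.open() performs the RPC of steps 1-3, and the returned stream fetches Block addresses, picks a nearby replica and retries on checksum failures internally. Below is a minimal sketch of that client-side view; it reuses the NameNode address and file path from the API examples later in this article, and the class name is only for illustration.

import java.io.FileOutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Steps 1-3: open() is the RPC to the NameNode, which checks the path and the file
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf);
        FSDataInputStream in = fs.open(new Path("/park/1.txt"));

        // Steps 4-6: fetching Block addresses, choosing a nearby replica and retrying on a
        // failed checksum all happen inside the stream; the client only sees a byte stream
        FileOutputStream out = new FileOutputStream("1.txt");
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);
        }

        // Step 7: closing the stream ends the read
        out.close();
        in.close();
        fs.close();
    }
}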

2. Write process (upload)

  1. The client sends an RPC request to the NameNode.
  2. After receiving the request, the NameNode validates it:
    1. Check whether the specified path exists.
    2. Check whether the client has write permission on the path.
    3. Check whether a file with the same name already exists in the path.
  3. If validation fails, an exception is thrown; if it succeeds, the NameNode records the metadata and sends a signal to the client.
  4. After receiving the signal, the client asks the NameNode for the storage locations of the first Block.
  5. After receiving the request, the NameNode waits for DataNode heartbeats, selects DataNode addresses, puts them in a queue and returns it to the client. By default the NameNode selects 3 addresses.
  6. The client receives the 3 addresses in the queue and picks the nearest node (by network-topology distance) to write the first replica of the first Block.
  7. The node holding the first replica writes the second replica to another node through a pipeline (in practice, an NIO Channel), and the node holding the second replica writes the third replica in the same way.
  8. After the write finishes, the node holding the third replica returns an ack to the node holding the second replica; that node then returns an ack to the node holding the first replica, which in turn returns an ack to the client.
  9. After the first Block is written, the client asks the NameNode for the storage locations of the second Block and repeats steps 5, 6, 7 and 8.
  10. When all the Blocks have been written, the client sends an end signal to the NameNode, and the NameNode closes the file/stream. Once the stream is closed, the file can no longer be modified (a client-side sketch of this flow follows the list).
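
As with reading, the client only writes bytes: the Block-location requests of steps 4-5 and the replica pipeline with its acks in steps 6-8 run inside the stream returned by FileSystem.create(). A minimal sketch follows, assuming the same imports as the read sketch plus org.apache.hadoop.fs.FSDataOutputStream; the target path and the explicit replication and block-size values are illustrative, not taken from this article.

public class WriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf, "root");

        // Steps 1-3: create() is the RPC that checks the path, the write permission and
        // duplicate names; this overload also sets the replication factor (3) and the
        // block size (128 MB) explicitly
        FSDataOutputStream out = fs.create(new Path("/park/3.txt"), true, 4096,
                (short) 3, 128 * 1024 * 1024L);

        // Steps 4-8: the stream asks the NameNode for DataNode addresses Block by Block and
        // pushes packets down the replica pipeline; the acks come back along the same path
        out.write("hello pipeline".getBytes());

        // Steps 9-10: close() flushes the last packet, waits for the acks and completes the file
        out.close();
        fs.close();
    }
}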

3. Deletion process

  1. The client sends an RPC request to the NameNode.

  2. After receiving the request, the NameNode records it in the edits file and then updates the metadata in memory. Once the in-memory update succeeds, it returns an ack signal to the client. At this point the Blocks of the file are still stored on the DataNodes.

  3. When the NameNode later receives a heartbeat from a DataNode, it checks the Block information and, in the heartbeat response, asks the DataNode to delete the corresponding Blocks. The DataNode does not actually delete the Blocks until it receives that heartbeat response (a client-side sketch follows).
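
From the client side, deletion therefore looks instantaneous: delete() returns as soon as the NameNode has logged the edit and updated its in-memory metadata (step 2), even though the Blocks are only removed from the DataNodes later through the heartbeat responses (step 3). A minimal sketch, reusing the imports, NameNode address and file path of the read sketch above:

public class DeleteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf, "root");

        // Steps 1-2: the RPC returns once the edit is logged and the in-memory metadata is updated
        boolean deleted = fs.delete(new Path("/park/1.txt"), false); // false: file or empty directory only

        // The file is already gone from the namespace, although its Blocks may still sit on the
        // DataNodes until the next heartbeat response tells them to delete physically (step 3)
        System.out.println("deleted: " + deleted + ", still visible: " + fs.exists(new Path("/park/1.txt")));
        fs.close();
    }
}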

API operations

Throughput: the total amount of data a node or cluster reads and writes per unit of time. For example, reading and writing 100 MB in 1 s is a throughput of 100 MB/s.

High concurrency does not necessarily mean high throughput, but high throughput usually implies high concurrency.

1. Preparation steps

  1. The jar packages that need to be added to the project in order to use the HDFS API:

hadoop-2.7.1\share\hadoop\common\*.jar

hadoop-2.7.1\share\hadoop\common\lib\*.jar

hadoop-2.7.1\share\hadoop\hdfs\*.jar

hadoop-2.7.1\share\hadoop\hdfs\lib\*.jar
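
The examples in the next section are written as JUnit test methods; for reference, these are the imports they rely on (JUnit is assumed to be on the classpath in addition to the Hadoop jars listed above):

import java.io.ByteArrayInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;

import org.junit.Test;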

2. API operations

Read a file

@Test
public void testConnectNamenode() throws Exception{
    Configuration conf = new Configuration();
    // connect to HDFS through the NameNode
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf);
    // open the file on HDFS and copy it to a local file;
    // this IOUtils.copyBytes overload closes both streams when it finishes
    InputStream in = fs.open(new Path("/park/1.txt"));
    OutputStream out = new FileOutputStream("1.txt");
    IOUtils.copyBytes(in, out, conf);
    fs.close();
}

 

Upload a file

@Test
public void testPut() throws Exception{
    Configuration conf = new Configuration();
    // store the uploaded file with a replication factor of 1
    conf.set("dfs.replication", "1");
    // connect as user "root" to avoid permission problems
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf, "root");
    // write the in-memory string to /park/2.txt on HDFS
    ByteArrayInputStream in = new ByteArrayInputStream("hello hdfs".getBytes());
    OutputStream out = fs.create(new Path("/park/2.txt"));
    IOUtils.copyBytes(in, out, conf);
    fs.close();
}

 

Delete files

@Test
public void testDelete() throws Exception{
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf, "root");
    // true: delete recursively, so the directory is removed even if it is not empty;
    // this form can also delete a single file
    fs.delete(new Path("/park01"), true);
    // false: only a file or an empty directory can be deleted;
    // deleting a non-empty directory this way throws an exception
    fs.delete(new Path("/park01"), false);
    fs.close();
}

 

Create a directory on HDFS

@Test
public void testMkdir() throws Exception{
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf, "root");
    // creates the directory, including any missing parent directories
    fs.mkdirs(new Path("/park02"));
    fs.close();
}

 

List the files in a given HDFS directory

@Test
public void testLs() throws Exception{
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.21:9000"), conf, "root");
    // listStatus returns the direct children of the given path
    FileStatus[] ls = fs.listStatus(new Path("/"));
    for(FileStatus status : ls){
        System.out.println(status);
    }
    fs.close();
}

 

Recursively list the files in a given directory

@Test
public void testLsRecursive() throws Exception{
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.214:9000"), conf, "root");
    // listFiles with recursive=true walks the whole directory tree (files only, no directories)
    RemoteIterator<LocatedFileStatus> rt = fs.listFiles(new Path("/"), true);
    while(rt.hasNext()){
        System.out.println(rt.next());
    }
    fs.close();
}

 

Rename a file or directory

@Test
public void testRename() throws Exception{
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.176:9000"), conf, "root");
    // rename /park to /park01
    fs.rename(new Path("/park"), new Path("/park01"));
    fs.close();
}

 

Get the Block locations of a file

@Test
public void testGetBlockLocations() throws Exception{
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://192.168.234.176:9000"), conf, "root");
    // ask for the locations of the Blocks covering the byte range [0, Integer.MAX_VALUE)
    BlockLocation[] data = fs.getFileBlockLocations(new Path("/park01/1.txt"), 0, Integer.MAX_VALUE);
    for(BlockLocation bl : data){
        System.out.println(bl);
    }
    fs.close();
}

You can also operate HDFS through plug-ins.

Origin blog.csdn.net/a34651714/article/details/102821803