How does HDFS handle storage and access of large and small files?

HDFS (Hadoop Distributed File System) is a distributed file system designed for storing and processing large-scale data. It handles large files through block storage and parallel reads, and handles small files through combined storage and metadata compression.

For the storage and access of large files, HDFS uses block storage and parallel reads. When a large file is written to HDFS, it is split into fixed-size blocks (128 MB by default) that are distributed across different DataNodes, while the NameNode keeps the file-to-block mapping. This allows blocks to be written and read in parallel, which improves storage and access efficiency. At the same time, HDFS replicates each block (three copies by default) to ensure data reliability and high availability.
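Both the block size and the replication factor are configurable. The snippet below is a minimal sketch of a client-side configuration using the standard Hadoop client API; the property names dfs.blocksize and dfs.replication are the standard HDFS settings, the values shown are the common defaults, and the NameNode address is a placeholder.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientConfig {
    // Returns a FileSystem client configured with the usual defaults for
    // block size (128 MB) and replication factor (3).
    public static FileSystem connect() throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // placeholder NameNode address
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks (the default)
        conf.setInt("dfs.replication", 3);                 // 3 replicas per block (the default)
        return FileSystem.get(conf);
    }
}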

The following simplified sample code illustrates the storage and reading process for large files. Note that DataNode, BlockInfo, metadataService and the helper methods are illustrative stand-ins, not the actual HDFS client API:

// Store a large file
public void storeLargeFile(String filePath) {
    File file = new File(filePath);
    byte[] buffer = new byte[128 * 1024 * 1024]; // read one 128 MB block at a time

    try (InputStream inputStream = new FileInputStream(file)) {
        int bytesRead;
        while ((bytesRead = inputStream.read(buffer)) != -1) {
            String blockId = generateBlockId();   // generate a unique identifier for the block
            DataNode dataNode = selectDataNode(); // pick a DataNode as the target node

            dataNode.writeBlock(blockId, buffer, bytesRead);                   // write the block to the target node
            metadataService.updateMetadata(file.getName(), blockId, dataNode); // record file name, block id and node
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Read a large file
public void readLargeFile(String fileName) {
    List<BlockInfo> blockInfos = metadataService.getBlockInfos(fileName); // look up the file's block locations

    try (OutputStream outputStream = new FileOutputStream(fileName)) {
        for (BlockInfo blockInfo : blockInfos) {
            DataNode dataNode = blockInfo.getDataNode();
            byte[] blockData = dataNode.readBlock(blockInfo.getBlockId()); // fetch the block from its DataNode

            outputStream.write(blockData); // append the block to the output stream
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

In the above code, the process of storing a large file is as follows:

  1. First, split the large file into 128 MB blocks, using a buffer to read each block's contents.
  2. Then, generate a unique identifier for each block and select a DataNode as the target node.
  3. Next, write the block to the target node and update the metadata, including the file name, block identifier and target node.
  4. Repeat these steps until all blocks have been written.

In the process of reading a large file, the client first obtains the file's block information, then reads each block from the corresponding DataNode in sequence and writes it to the output stream.
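For comparison, in a real deployment the client does not choose DataNodes or manage blocks itself; the HDFS client library splits the file, places the blocks and handles replication behind the org.apache.hadoop.fs.FileSystem API. Below is a minimal sketch of uploading a large file and reading it back with that API; the cluster address and paths are placeholders.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsLargeFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            // Upload: the client library splits the file into blocks and
            // replicates them across DataNodes transparently.
            fs.copyFromLocalFile(new Path("/data/local/big.log"),
                                 new Path("/user/demo/big.log"));

            // Read back: each block is streamed from a DataNode holding a replica.
            try (InputStream in = fs.open(new Path("/user/demo/big.log"));
                 OutputStream out = new FileOutputStream("/data/local/big-copy.log")) {
                IOUtils.copyBytes(in, out, 4096, false);
            }
        }
    }
}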

For the storage and access of small files, HDFS relies on combined storage and metadata compression. Every file consumes a metadata entry on the NameNode regardless of its size, so when many small files are stored they are merged into one or more blocks, in practice via container formats such as SequenceFiles or Hadoop Archives (HAR), which reduces the metadata overhead. The NameNode can additionally compress its metadata image to further reduce storage usage.

The following simplified sample code illustrates the process of storing and reading small files, using the same illustrative stand-ins:

// Store a small file
public void storeSmallFile(String filePath) {
    File file = new File(filePath);
    byte[] data = new byte[(int) file.length()];

    try (DataInputStream inputStream = new DataInputStream(new FileInputStream(file))) {
        inputStream.readFully(data); // read the whole small file into memory

        String blockId = generateBlockId();   // generate a unique identifier for the block
        DataNode dataNode = selectDataNode(); // pick a DataNode as the target node

        dataNode.writeBlock(blockId, data, data.length);                   // write the block to the target node
        metadataService.updateMetadata(file.getName(), blockId, dataNode); // record file name, block id and node
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Read a small file
public void readSmallFile(String fileName) {
    BlockInfo blockInfo = metadataService.getBlockInfo(fileName); // look up the file's block location

    DataNode dataNode = blockInfo.getDataNode();
    byte[] blockData = dataNode.readBlock(blockInfo.getBlockId()); // fetch the block from its DataNode

    try (OutputStream outputStream = new FileOutputStream(fileName)) {
        outputStream.write(blockData); // write the block to the output stream
    } catch (IOException e) {
        e.printStackTrace();
    }
}

In the above code, the process of storing small files is as follows:

  1. First, read the contents of the small file into a byte array.
  2. Then, generate a unique identifier for the block and select a DataNode as the target node.
  3. Next, write the block to the target node and update the metadata, including the file name, block identifier and target node.

In the process of reading a small file, the client first obtains the file's block information, then reads the block from the corresponding DataNode and writes it to the output stream.
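In practice, the usual way to achieve this kind of combined storage is to pack many small files into a single container file, for example a SequenceFile or a Hadoop Archive, so that thousands of small files cost the NameNode only one file's worth of metadata. The following is a minimal sketch using org.apache.hadoop.io.SequenceFile; the class name and directory paths are placeholders chosen for illustration.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    // Pack every file under srcDir into one SequenceFile:
    // key = original file name, value = raw file bytes.
    public static void pack(FileSystem fs, Path srcDir, Path target) throws Exception {
        Configuration conf = fs.getConf();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(srcDir)) {
                if (status.isFile()) {
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    try (InputStream in = fs.open(status.getPath())) {
                        IOUtils.copyBytes(in, buffer, 4096, false);
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(buffer.toByteArray()));
                }
            }
        }
    }
}

Reading the packed data back is symmetric: a SequenceFile.Reader iterates over the (file name, bytes) pairs, so individual small files can be recovered by key without creating one NameNode entry per file.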

As the cases and code above show, HDFS handles large files through block storage and parallel reads, and handles small files through combined storage and metadata compression. This design allows HDFS to store and access both large and small files efficiently while ensuring data reliability and high availability.
