How to smoothly shrink HDFS in place?

Background

As the scale of data grows larger and larger, the cost of storage grows with it. Over time, the distribution of data popularity tends to follow the 80/20 rule: 80% of the accesses are concentrated on 20% of the data. For the 80% of the data that is rarely accessed, keeping it on multiple SSDs is a huge waste, and this cold data should be migrated to a system with lower storage cost. This is where JuiceFS becomes the ideal choice: the cost is reduced by a factor of 20, while it provides the same high-performance metadata capability as HDFS (avoiding avalanches when the metastore traverses metadata) and high throughput when scanning large amounts of cold data. If 80% of the data is moved to JuiceFS, the overall cost saving can reach 90%. If JuiceFS is given appropriate space for caching, it can even replace HDFS completely (the 20% of hot data is served by the cache disks managed by JuiceFS, which also delivers extremely high performance).

In 2019, we implemented several such cases. After the data is migrated to JuiceFS, the usage of HDFS drops, and the cluster needs to be shrunk to reduce storage cost. Everyone has scaled a cluster up, but many people are not familiar with scaling one down. Let's talk about how to shrink HDFS well, especially in this context.

Three shrinking schemes

The first shrinking method: if the number of DataNode nodes is relatively large, and the CPU and memory resources can be reduced along with the storage space, you can retire several DataNode nodes directly using the decommission mechanism provided by HDFS. This is the most common approach. The large amount of cross-node data migration involved generates heavy intranet traffic and may affect online workloads; it requires close attention and manual tuning by the operations staff and usually takes a week or two. If only 3 DataNode nodes are left in the cluster, or the CPU and memory resources cannot be reduced at the same time, this method cannot be used.
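
For reference, on a vanilla Hadoop setup the decommission roughly looks like the sketch below (the exclude file path and hostname are assumptions; on CDH this is normally driven from the Cloudera Manager interface instead):

echo "dn3.example.com" >> /etc/hadoop/conf/dfs.exclude    # the file pointed to by dfs.hosts.exclude
sudo -u hdfs hdfs dfsadmin -refreshNodes                  # the NameNode starts re-replicating blocks off the node
sudo -u hdfs hdfs dfsadmin -report                        # watch for "Decommission in progress" / "Decommissioned"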

The second shrinking method is to reduce the disk space on each node while keeping the number of DataNode nodes unchanged. You can modify the dfs.data.dir parameter, remove one or more disk directories from it, and then wait for HDFS to automatically replicate the missing copies. This method likewise leads to a large amount of data movement between nodes, which generates heavy intranet traffic and may affect online workloads; it requires close attention and manual tuning by the operations staff and may take a week or two. In addition, if the data only has 2 replicas, it is relatively dangerous: once a disk directory is removed, a node failure or a broken disk is very likely to cause data loss.
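
As a rough sketch of what this change involves (the directory names follow the example later in this article; on CDH the value is edited through Cloudera Manager rather than in hdfs-site.xml directly):

sudo -u hdfs hdfs getconf -confKey dfs.datanode.data.dir    # e.g. /dfs/dfs/dn,/dfs1/dfs/dn
# remove /dfs1/dfs/dn from the value, restart the DataNode, then watch re-replication progress:
sudo -u hdfs hdfs fsck / | grep -i 'under-replicated'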

Both of the above methods generate a lot of network traffic, which may affect online services and increase the risk of data loss. This article presents a third method: how to keep the intranet traffic generated by the shrinking from affecting online workloads, while minimizing the risk of data loss during the process.

Case analysis

First, let's take a look at the directory structure of the DataNode on disk:

└── dn
    ├── current
    │   ├── BP-847673977-192.168.0.120-1559552771699
    │   │   ├── current
    │   │   │   ├── dfsUsed
    │   │   │   ├── finalized
    │   │   │   │   ├── subdir0
    │   │   │   │   │   ├── subdir1
    │   │   │   │   │   │   ├── blk_1073742303
    │   │   │   │   │   │   ├── blk_1073742303_1479.meta
    │   │   │   ├── rbw
    │   │   │   └── VERSION
    │   │   ├── scanner.cursor
    │   │   └── tmp
    │   └── VERSION
    └── in_use.lock
  • BP-847673977-192.168.0.120-1559552771699: This is the block pool directory. If deployed in Federation mode, there will be multiple block pool directories.
  • dfsUsed: Saves the usage statistics of the disk, refreshed every 10 minutes.
  • finalized and rbw directories: These two are used to store data blocks; finalized holds blocks that have been fully written, and rbw holds blocks that are currently being written. Each data block corresponds to 2 files: the blk file stores the data, and the other one, ending with .meta, stores metadata such as checksums.
  • VERSION file: mainly contains the layout version, cluster ID, DataNode ID, block pool ID and other information.
  • scanner.cursor file: The DataNode periodically scans each blk file, and this file records the position the scan has reached.

It is not difficult to see that all data files live under finalized and rbw, and that the same DataNode never holds two data files with the same block ID. Therefore, it is entirely possible to move the data on one disk to another disk by migrating the blk files, and then unmount the disk to achieve the shrinking.
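
If you want to confirm this on your own cluster, two commands may help (the file path in the second one is just a placeholder):

find /dfs1/dfs/dn -type f -name 'blk_*' ! -name '*.meta' | head     # data blocks stored on this disk
sudo -u hdfs hdfs fsck /path/to/a/file -files -blocks -locations    # which DataNodes hold each block of that file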

Shrinking steps

The HDFS in this article's example is CDH 5.16, and the cluster is managed with Cloudera Manager. There are only 3 nodes in the cluster, each node has multiple SSD disks, the data has two replicas, and storage utilization is very low. One disk can be removed from each node, but neither of the two common shrinking methods above can be used. At the same time, the process should minimize the impact on online services as much as possible.

The following operations are for a single DataNode; the other DataNodes need to go through the same steps (which can be appropriately parallelized):

  1. Select the disks. Choose the data disk to be unmounted and the data disk that will receive its data, and make sure the free space on the receiving disk is larger than the amount of data on the disk being unmounted (a quick check is shown after the assumed paths below). The following is assumed here:

Disk being unmounted: /dfs1, DataNode data directory on this disk: /dfs1/dfs/dn

Data receiving disk: /dfs, DataNode data directory on this disk: /dfs/dfs/dn
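
A quick way to double-check the space condition with the assumed paths above:

du -sh /dfs1/dfs/dn    # amount of data on the disk being unmounted
df -h /dfs             # free space here must be larger than the number above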

  2. First data copy. Pick out the directory on the disk being unmounted from dfs.data.dir, then copy that entire directory to the receiving disk. To minimize IO usage, run rsync under ionice so that higher-priority tasks are not blocked:
ionice -c 2 -n 7 rsync -au /dfs1/dfs/dn/ /dfs/shrink_temp/dn
  3. Stop the DataNode. To make sure all of the data gets copied, the DataNode has to be stopped. It can be shut down through the Cloudera Manager interface.

  4. Second, incremental data copy. Repeat the rsync from step 2 to bring over the data newly written between steps 2 and 3. The increment is relatively small, so this should finish quickly:

ionice -c 2 -n 7 rsync -au /dfs1/dfs/dn/ /dfs/shrink_temp/dn
  5. Merge the directories. At this point the data on the disk being unmounted has been copied to the receiving disk, but it still sits in its own directory. If there are two DataNode data directories on the same disk, HDFS would count the capacity twice, so they need to be merged. With rsync's hard-link mode the merge does not involve any real data copying, so it executes very fast, and the copied source files are removed at the same time. Afterwards, check whether any blk files remain in the source; if not, the merge is complete (see the check after the command below).
ionice -c 2 -n 7 rsync -au --link-dest=/dfs/shrink_temp/dn --ignore-existing --remove-source-files /dfs/shrink_temp/dn/ /dfs/dfs/dn
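
One simple way to verify that the merge is complete is to count the blk files left in the temporary directory, for example:

find /dfs/shrink_temp/dn -type f -name 'blk_*' | wc -l    # should print 0 once the merge is done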
  6. Modify the dfs.data.dir configuration item through Cloudera Manager to remove the data directory on the unmounted disk.

  7. Start the DataNode through Cloudera Manager and check the status of HDFS:

sudo -u hdfs hdfs fsck /

Why not copy and merge the data from the unmounted disk directly into the DataNode data directory on the receiving disk? Because during the first copy the DataNode is still running and periodically checks the number of replicas; the copied blocks would look like extra replicas and might be deleted by the DataNode.

During the whole shrinking process the DataNode is only stopped for the duration of steps 4 and 5. Step 4 is an incremental copy, which is very fast, and step 5 is only a file metadata operation, which is also very fast.

There are quite a few steps above and manual operation is error-prone, so we wrote a script for this shrinking process (some operations depend on the APIs of the Hadoop distribution; currently CDH 5 is supported). Please download setup-hadoop.py, run the command below, and follow the prompts to shrink:

python setup-hadoop.py shrink_datanode

Future improvements

In the above shrinking process, the data has to be fully copied from one disk to another, which requires enough free space on the receiving disk and may also leave the disks within the DataNode unbalanced. In the future, the process can be improved so that when copying, the blk files are distributed across multiple disks according to a certain rule to keep the data balanced among them.
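
As a rough illustration of that idea (this is only a hypothetical sketch, not part of setup-hadoop.py; the receiving directories are assumptions), each blk file and its .meta file could be distributed round-robin across several disks:

# hypothetical sketch: spread blk/meta pairs from the disk being unmounted
# across several receiving directories round-robin, keeping the subdir layout
disks=(/dfs/shrink_temp/dn /dfs2/shrink_temp/dn)   # assumed receiving directories
i=0
cd /dfs1/dfs/dn || exit 1
find . -type f -name 'blk_*' ! -name '*.meta' | while read -r blk; do
    dst=${disks[i % ${#disks[@]}]}
    mkdir -p "$dst/$(dirname "$blk")"
    # copy the block file and its checksum .meta file with low IO priority
    ionice -c 2 -n 7 cp -a "$blk" "$blk"_*.meta "$dst/$(dirname "$blk")/"
    i=$((i + 1))
done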

If it is helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)
