Using Symbolic Links to Migrate HDFS Data Without the Business Noticing

The Problem

JuiceFS is a distributed file system built on object storage. As shown in our previous article comparing it with object storage, JuiceFS provides strong data consistency and very high read and write performance, so it can be used to replace HDFS. However, migrating an entire data platform is usually a time-consuming and labor-intensive project: ultra-large-scale data has to be moved while affecting the upper-level business as little as possible. This article introduces how to smoothly migrate massive amounts of data from HDFS to JuiceFS with the JuiceFS migration tool.

Smooth Migration Solution

Besides the files we actually see on HDFS, a data platform also holds equally important information, the so-called "metadata", which is stored in systems such as Hive Metastore. When we talk about data migration, these two kinds of data cannot be treated separately and must be considered together: after migrating the data, we also need to update the location (LOCATION) of the Hive table or partition. A problem with either one will affect the business side.

To keep data and metadata consistent, the usual practice is to update the location information in the metadata right after the data has been migrated. However, when the data volume is large and the business may still be updating the data, it is hard to make copying the data and updating the location information a single atomic operation. Data may be lost during the migration, which undermines the reliability of the whole migration. Avoiding this often means suspending the business, or implementing online migration with a mechanism such as double writes in the business code, which intrudes into the business logic and is time-consuming and labor-intensive.

If data can be accessed through a unified path during the migration, one that hides the actual data location and decouples the metadata from it, the risk of the overall migration is greatly reduced. File system symbolic links can achieve exactly this. JuiceFS supports symbolic links, including symbolic links across file systems, so it can serve as a unified access entry for multiple file systems and form a unified namespace.

Symbolic links are a concept widely used in operating systems: they let you manage data scattered across many places in a single directory tree. In the same way, the symbolic link feature of JuiceFS lets us manage multiple storage systems in one file system. The Hadoop community actually started implementing symbolic links on HDFS as early as 2013 (HADOOP-10019), but unfortunately the feature is still not fully supported. With symbolic links, various storage systems, including but not limited to HDFS and object storage, can be managed on JuiceFS. On the surface the client accesses JuiceFS, but what is actually accessed is the real underlying storage.

The atomic rename operation of JuiceFS also plays a key role in the migration. Until a file has been copied, JuiceFS follows a symbolic link back to the original data; once the data has been fully copied, the symbolic link has to be overwritten. Atomic rename makes this switch safe and reliable and avoids data loss or corruption.
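To make the role of atomic rename concrete, here is a minimal sketch of the idea on an ordinary POSIX file system. It is not the JuiceFS migration tool: the function name `replace_symlink_with_copy` and the `.migrating-` temporary prefix are our own assumptions, and a local directory stands in for both storage systems.

```python
import os
import shutil
import tempfile


def replace_symlink_with_copy(link_path: str) -> None:
    """Swap a symlink for a real copy of its target in one atomic step.

    Hypothetical helper for illustration only.
    """
    if not os.path.islink(link_path):
        return  # already a regular file, nothing to do
    target = os.path.realpath(link_path)
    directory = os.path.dirname(os.path.abspath(link_path))
    # Copy into a temporary file in the same directory so that the final
    # rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".migrating-")
    os.close(fd)
    try:
        shutil.copyfile(target, tmp_path)
        # On POSIX, os.replace is an atomic rename: readers see either the
        # old symlink or the complete new file, never a partial copy.
        os.replace(tmp_path, link_path)
    except BaseException:
        # Clean up the temporary file if the copy or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The point of copying to a temporary name first is that the path the business reads is never in a half-written state: it is a symbolic link until the instant it becomes the complete file.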

In addition, JuiceFS can detect the migration progress dynamically through configuration files and special flag files, and performs additional checks when files are created or deleted, so that newly created files also appear in the migrated directory and files to be deleted are also removed from the new system. Similar mechanisms guarantee correctness for the more complex rename operation.

With the JuiceFS features described above, data and metadata can be migrated separately, and the entire migration is completely transparent to the business. The specific steps are explained below.

Steps

Step 1: Use JuiceFS as the access entry of HDFS

On JuiceFS, create a symbolic link for every first-level directory (or file) of HDFS (assuming that nothing new will be created in the HDFS root directory). You can then access everything in HDFS through jfs://name/<path>; the two are completely equivalent. The accompanying figure illustrates this.
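The idea of step 1 can be sketched with two local directories, one standing in for the HDFS root and one for the JuiceFS root. This is only an illustration under that assumption; `link_top_level_entries` is a hypothetical name, and the real links are created with JuiceFS itself.

```python
import os
from typing import List


def link_top_level_entries(source_root: str, entry_root: str) -> List[str]:
    """Create one symlink in entry_root per first-level entry of source_root.

    Hypothetical helper: source_root plays the HDFS root, entry_root the
    JuiceFS root. Returns the names that were linked.
    """
    created = []
    os.makedirs(entry_root, exist_ok=True)
    source_root = os.path.abspath(source_root)
    for name in sorted(os.listdir(source_root)):
        link_path = os.path.join(entry_root, name)
        if os.path.lexists(link_path):
            continue  # keep anything that already exists at the entry point
        os.symlink(os.path.join(source_root, name), link_path)
        created.append(name)
    return created
```

After this runs, every path under the entry root resolves to the same content as the matching path under the source root, which is the equivalence described above.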

Step 2: Use JuiceFS to access data in HDFS

There are two ways to implement this step. The first is to change the LOCATION of the table or partition in the Hive Metastore to the corresponding JuiceFS path, for example from hdfs://ns/user/test.db/table_a to jfs://name/user/test.db/table_a. The second is to set fs.hdfs.impl to com.juicefs.MigratingFileSystem, which keeps LOCATION unchanged.
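For the first method, the new LOCATION differs from the old one only in scheme and authority. A small sketch of that mapping follows; `to_jfs_location` is a hypothetical helper, and applying the result to the Hive Metastore is left out on purpose.

```python
from urllib.parse import urlsplit


def to_jfs_location(location: str, jfs_name: str = "name") -> str:
    """Map an hdfs:// LOCATION to the equivalent jfs:// path.

    Hypothetical helper; jfs_name is the JuiceFS volume name used in the
    article's examples. Non-HDFS locations are returned unchanged.
    """
    parts = urlsplit(location)
    if parts.scheme != "hdfs":
        return location
    # Keep the path as-is; only the scheme and authority change.
    return f"jfs://{jfs_name}{parts.path}"


print(to_jfs_location("hdfs://ns/user/test.db/table_a"))
# prints: jfs://name/user/test.db/table_a
```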

Both methods redirect all HDFS access to JuiceFS. Because step 1 has already created symbolic links to HDFS, existing business access to the data in HDFS is not affected.

Step 3: Migrate the directory structure

From this step on, the migration work officially begins. Don't rush to copy the data, though: the directory structure has to be mapped from HDFS first. Select the table or directory you want to migrate, and use the tools provided by JuiceFS to quickly migrate its directory structure from HDFS to JuiceFS. Taking the migration of hdfs://ns/user/test.db/table_a as an example, all subdirectories under this directory are created in JuiceFS level by level. Because this step involves only metadata operations and no data copying, the directory structure of historical data can be migrated from HDFS to JuiceFS very quickly. Note that all files still point to their paths in HDFS through symbolic links. In the accompanying figure, the red part represents the directories newly created on JuiceFS.

As before, completing this step does not affect existing business access to HDFS, but newly written data is now stored directly in JuiceFS.
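The effect of this step can be sketched on a local file system: directories become real directories on the destination side, while every file is still a symbolic link back to the source. The function name `mirror_tree_with_links` is an assumption for illustration, and the sketch starts from an empty destination directory rather than from the symbolic link created in step 1.

```python
import os


def mirror_tree_with_links(source_dir: str, dest_dir: str) -> None:
    """Recreate source_dir's directories under dest_dir; link files back.

    Hypothetical helper: only metadata is touched, no file content is copied.
    """
    source_dir = os.path.abspath(source_dir)
    for current, _dirs, files in os.walk(source_dir):
        rel = os.path.relpath(current, source_dir)
        target_dir = os.path.normpath(os.path.join(dest_dir, rel))
        # Directories are created for real, level by level.
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            link_path = os.path.join(target_dir, name)
            if not os.path.lexists(link_path):
                # Files stay in the source; the destination only links to them.
                os.symlink(os.path.join(current, name), link_path)
```

Because new files written into the mirrored directories land on the destination side directly, this matches the behavior described above: old data is still read from the source, new data goes to the new system.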

Step 4: Migrate data

This step actually starts copying the data. The JuiceFS migration tool concurrently replaces the symbolic links left by the previous step, which point to regular files in HDFS, with the real data. When no symbolic links remain in the directory being migrated, the migration of that directory is complete. In the accompanying figure, the red part has changed from symbolic links to regular files.
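The completion criterion, that no symbolic links remain, is easy to check. Below is a minimal sketch for a directory tree on a local file system; `remaining_symlinks` is a hypothetical name and not part of the JuiceFS tooling.

```python
import os
from typing import List


def remaining_symlinks(root: str) -> List[str]:
    """Return every symlink still present under root, in sorted order.

    Hypothetical helper: an empty result means the directory is migrated.
    """
    found = []
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them, so they are reported here as well.
    for current, dirs, files in os.walk(root):
        for name in dirs + files:
            path = os.path.join(current, name)
            if os.path.islink(path):
                found.append(path)
    return sorted(found)
```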

Reverse Migration

During the migration you can roll back at any time through reverse migration, which undoes the migration. If the location information in the metadata has been modified, the JuiceFS migration tool ensures that it is restored to its original state during reverse migration. If new data has been written to JuiceFS, that data can also be copied back to the original storage system.

Summary

As the steps above show, the business can keep accessing HDFS throughout the migration and does not notice it from beginning to end. JuiceFS provides comprehensive tools to simplify the migration process. For a detailed operation guide, please refer to the official JuiceFS documentation.

This article uses the migration from HDFS to JuiceFS as an example to illustrate the symbolic link feature of JuiceFS. With a little imagination, JuiceFS symbolic links can be applied to many more scenarios, such as data migration between different HDFS clusters, or cross-cloud and cross-region data migration. It is precisely this powerful symbolic link feature that lets JuiceFS provide a unified data access layer and view, making smooth operation possible in many cases where it otherwise would not be.

If this was helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)