Design and implementation of Alluxio cross-cluster synchronization mechanism

1. Alluxio application scenarios and background

This article describes the design and implementation of Alluxio's cross-cluster synchronization mechanism, which keeps metadata consistent when running multiple Alluxio clusters.

Alluxio sits between the storage and compute layers, providing high-performance caching and a unified namespace on top of different underlying file systems (UFS). Updating the UFS through Alluxio keeps Alluxio consistent with the UFS, but in some cases, such as when multiple Alluxio clusters share one or more UFS namespaces, this guarantee no longer holds. To ensure consistency in such cases, Alluxio implements a cross-cluster synchronization mechanism, which this article introduces in detail.

1. Background introduction

As the volume of data grows, the way data is stored and accessed becomes more and more complex. For example, data may be located in different storage systems (S3, GCS, HDFS, etc.), stored in the cloud or on premises, or located in different geographical regions, and may be further isolated for privacy or security reasons. Furthermore, these complexities apply not only to how data is stored but also to how it is used for computation; for example, data may be stored in the cloud while computation happens locally.

Alluxio is a data orchestration platform that reduces such complexity by providing a unified access interface on UFS, and improves computing performance by providing data locality and caching.

For many organizations, running one Alluxio cluster may be sufficient, but some organizations need to run multiple Alluxio clusters. For example, if the computation is run in multiple regions, it may be more advantageous to run an Alluxio cluster in each region. Additionally, some organizations may need to run separate clusters due to data privacy concerns, or may wish to increase scalability by running multiple clusters. While part of the data space may be isolated within a cluster, other data can be shared across multiple clusters. For example, one cluster might be responsible for extracting and transforming data, while several other clusters might query that data and make updates.

Since each Alluxio cluster may replicate (i.e. mount) some part of the UFS storage space, Alluxio will be responsible for keeping its copy consistent with UFS, so that users can query the latest file copy. In this article, we describe the components used to make Alluxio data consistent with UFS across one or more clusters.

2. Alluxio data consistency

Keeping data consistent in a distributed system is complex, with dozens of different consistency levels, each allowing different users to query and modify different states of the data at specific times. These consistency levels form a spectrum from weak to strong, with stronger consistency being more restrictive and generally easier to build applications on. Alluxio is no exception: it provides different consistency guarantees depending on its configuration and the UFS used (see Alluxio's data consistency model for details).

To simplify the discussion about consistency, we will make the following assumptions:

● For any file, the UFS is the "single source of truth" for that file.

This means that every file in Alluxio corresponds to a file on the UFS, and the UFS always holds the latest version of the file. If the copy of the file stored in Alluxio differs from the file in the UFS, then the version in Alluxio is inconsistent. (Here we assume that the UFS itself ensures strong consistency, that is, some level of linearizability or external consistency. At a high level, this allows users to treat the UFS, even if it is itself a system made up of many distributed parts, as a single file system that performs operations sequentially in real time.)

Before discussing the consistency between Alluxio and the UFS, let's take a look at Alluxio's basic architecture. Alluxio is composed of master nodes and worker nodes. The masters keep track of file metadata, such as a file's path and size, while the workers store the data itself. If a client wants to read a file, it must first read the metadata from a master and then use it to locate a worker that stores a copy of the data (loading the data from the UFS if necessary). If a client wants to write a file, it first creates the file's metadata on the master, then writes the file to the UFS through a worker, and finally marks the file as complete on the master. While a file is being written, its metadata is marked as incomplete, which prevents other clients from accessing the file.
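The read path can be summarized with a minimal sketch. The maps below stand in for the master's metadata store, a worker's cache, and the UFS; none of these names correspond to Alluxio's actual client API.

```java
// Minimal sketch of the read path described above. The maps stand in for the
// master's metadata store, a worker's cache, and the UFS; names are illustrative.
import java.util.HashMap;
import java.util.Map;

final class ReadPathSketch {
  record FileMeta(String path, long length, boolean complete) {}

  static final Map<String, FileMeta> masterMetadata = new HashMap<>(); // master: path -> metadata
  static final Map<String, byte[]> workerCache = new HashMap<>();      // worker: path -> cached bytes
  static final Map<String, byte[]> ufs = new HashMap<>();              // the under file system

  static byte[] read(String path) {
    // 1. Look up the file's metadata on the master.
    FileMeta meta = masterMetadata.get(path);
    if (meta == null || !meta.complete()) {
      throw new IllegalStateException("file missing or still being written: " + path);
    }
    // 2. Read from the worker's cache, loading from the UFS on a miss.
    return workerCache.computeIfAbsent(path, ufs::get);
  }

  public static void main(String[] args) {
    ufs.put("/data/file", "hello".getBytes());
    masterMetadata.put("/data/file", new FileMeta("/data/file", 5, true));
    System.out.println(new String(read("/data/file"))); // prints "hello"
  }
}
```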

From this basic design, we can see that as long as all updates to files are written to UFS through Alluxio, the data in Alluxio will be consistent with the data in UFS, and the client will always query the latest data version.

However, in reality it is not so simple. For example, some users may update the UFS without going through Alluxio, or a client may fail after writing some files to the UFS but before marking them complete on the Alluxio master. Either situation can leave the data in Alluxio and the UFS inconsistent.

So, how are these issues addressed? Since we assume that the UFS is the single source of truth, resolving these inconsistencies only requires synchronizing Alluxio with the UFS.

3. Metadata synchronization

Metadata sync is the main component used to check and fix inconsistencies between Alluxio and UFS. When a client accesses a path in Alluxio, this function may be triggered under certain conditions (discussed later). The basic procedure is as follows:

● Load metadata for the path from UFS.

● Compare the metadata in the UFS with the metadata in Alluxio. The metadata contains a fingerprint of the file data (such as the last modification time and a collision-resistant hash), which can be used to check for data inconsistencies (a sketch of this comparison follows the list).

● If any inconsistencies are found, the metadata in Alluxio is updated and stale data is marked for eviction from the worker. The latest data is loaded from UFS to workers as needed.
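Here is a minimal sketch of the fingerprint comparison step, assuming a fingerprint built from the file length, last modification time, and a content hash; the field names are illustrative rather than Alluxio's actual Fingerprint class.

```java
// A sketch of the fingerprint comparison, assuming a fingerprint made of the file
// length, last modification time, and a content hash; field names are illustrative.
import java.util.Objects;

record Fingerprint(long contentLength, long lastModifiedMs, String contentHash) {}

final class FingerprintCheck {
  /** Returns true when the Alluxio copy must be refreshed from the UFS. */
  static boolean needsUpdate(Fingerprint alluxio, Fingerprint ufs) {
    if (alluxio == null) {
      return true; // the path has never been loaded into Alluxio
    }
    return alluxio.contentLength() != ufs.contentLength()
        || alluxio.lastModifiedMs() != ufs.lastModifiedMs()
        || !Objects.equals(alluxio.contentHash(), ufs.contentHash());
  }
}
```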

 

 

Figure: The metadata synchronization process when a client reads a path. 1. The client reads a path in the file system. 2. The metadata synchronization module on the master checks whether synchronization is required according to the user configuration. 3. To synchronize, the metadata is loaded from the UFS and a fingerprint is created to compare the metadata in Alluxio and the UFS; if the fingerprints differ, the metadata in Alluxio is updated. 4. The client reads the file data from a worker according to the updated metadata, loading the data from the UFS if necessary.

The only problem is deciding when to execute this metadata synchronization procedure, which requires us to make a trade-off between stronger consistency and better performance.

Metadata synchronization every time the data is accessed

If the client in Alluxio performs metadata synchronization every time it accesses a path, the client will always be able to view the latest data status on UFS. This will give us the highest level of consistency, usually the strongest that UFS can guarantee. However, this will also lead to performance degradation since every access to the data (even if the data has not been modified) will synchronize with UFS.

Metadata synchronization based on time

Alternatively, metadata synchronization can be performed based on a physical time interval. In this case, the metadata on the Alluxio master records the last time the path was successfully synced with the UFS, and a new synchronization occurs only after a user-defined time interval has elapsed (see UFS metadata synchronization for details).
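A rough sketch of this time-based check, assuming the master records the last successful sync time per path; the interval corresponds to a user-configured value (such as the alluxio.user.file.metadata.sync.interval property), where by convention a negative interval disables syncing and zero forces a sync on every access.

```java
// Sketch of the time-based check, assuming the master records the last successful
// sync time per path; a negative interval disables syncing, zero syncs on every access.
import java.time.Duration;

final class TimeBasedSyncCheck {
  static boolean shouldSync(long lastSyncMs, long nowMs, Duration interval) {
    if (interval.isNegative()) {
      return false; // syncing disabled
    }
    if (interval.isZero()) {
      return true;  // sync on every access
    }
    return nowMs - lastSyncMs >= interval.toMillis();
  }

  public static void main(String[] args) {
    long tenMinutesAgo = System.currentTimeMillis() - 10 * 60 * 1000;
    // Last synced 10 minutes ago with a 5-minute interval: a new sync is due.
    System.out.println(shouldSync(tenMinutesAgo, System.currentTimeMillis(), Duration.ofMinutes(5)));
  }
}
```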

While this approach may greatly improve performance, it also leads to a relatively weak level of consistency guarantee, namely eventual consistency. This means that any particular read result may or may not be consistent with UFS. Additionally, the order in which data updates are queried may be arbitrary. For example, in UFS, the update of file A is actually earlier than that of another file B, but the Alluxio cluster may query that the update of file B is earlier than file A. Therefore, users of the system must understand these different levels of consistency guarantees and tune applications as needed.

2. Cross-cluster synchronization mechanism

In the previous section, we discussed the application scenarios, background, and metadata synchronization of a single Alluxio cluster. This section introduces how metadata synchronization is established in a multi-cluster scenario to ensure metadata consistency.

1. Multi-cluster consistency based on time synchronization

One of the use cases for time-based metadata synchronization is when multiple Alluxio clusters are used and the clusters share part of the UFS data space. Often, we can think of these clusters as running separate workloads that may need to share data at some point in time. For example, one cluster might ingest and transform data from one day, and then another cluster queries that data the next day. Clusters running query tasks may not always need to see the latest data, e.g. latency of up to an hour can be tolerated.

In practice, time-based synchronization is not always efficient, because only certain workloads update files regularly. In fact, for many workloads most files are written only once, while only a small percentage of files are updated frequently. In this case, time-based synchronization becomes inefficient: most synchronizations are unnecessary, and increasing the time interval leaves frequently modified files in an inconsistent state for a longer period of time.

2. Using cross-cluster sync to achieve multi-cluster consistency

To avoid the inefficiencies of time-based synchronization, the cross-cluster synchronization feature tracks inconsistencies directly, so files are only synchronized when necessary. This means that whenever a path is changed on one Alluxio cluster, that cluster publishes an invalidation message informing other Alluxio clusters that the path has been modified. The next time a client accesses this path on a subscribing cluster (one with the cross-cluster synchronization feature enabled), a synchronization with the UFS is triggered.

Cross-cluster synchronization has two main advantages over time-based synchronization. First, synchronization is only performed on files that have been modified, and second, modifications are quickly visible to other clusters in a time roughly equivalent to the time it takes to send a message from one cluster to another.

From this we can see that the cross-cluster synchronization function will be most effective when the following assumptions are met.

● Portions of one or more UFSes mounted on multiple Alluxio clusters intersect. (We consider 2 to 20 Alluxio clusters a reasonable range for the number of clusters deployed in a system.)

● At least one cluster will update files on UFS.

● All updates to UFS must go through the Alluxio cluster (see "Other use cases" below for how to handle other cases).

Now we want to make sure that updates from one Alluxio cluster will eventually be observed in all other Alluxio clusters (i.e. the cluster and UFS meet eventual consistency guarantees), so that applications can share data across clusters.

Path invalidation publish/subscribe

The cross-cluster synchronization function is implemented based on the publish/subscribe (pub/sub) mechanism. When an Alluxio cluster mounts a UFS path, it subscribes to that path, and whenever the cluster modifies a file on UFS, it publishes the modified path to all subscribers.

Cluster | Mounted UFS path   | Alluxio mount path
C1      | s3://bucket/       | /mnt/
C2      | s3://bucket/folder | /mnt/folder
C3      | s3://bucket/other  | /

Table 1: Example of three Alluxio clusters mounting different UFS paths.

Referring to the example in Table 1, there are three Alluxio clusters, each mounting a different S3 path. Cluster C1 mounts the S3 bucket s3://bucket/ to its local path /mnt/, cluster C2 mounts the subset s3://bucket/folder of the same bucket to its local path /mnt/folder, and finally C3 mounts s3://bucket/other to its root path /.

Thus, cluster C1 will subscribe to the path ("topic" in pub/sub terminology) s3://bucket, cluster C2 will subscribe to the path s3://bucket/folder, and cluster C3 will subscribe to the path s3://bucket/other. Subscribers receive all published messages whose path starts with their subscribed topic.

For example, if cluster C1 creates a file /mnt/folder/new-file.dat, it will publish an invalidation message containing s3://bucket/folder/new-file.dat, which will be received by cluster C2. On the other hand, if cluster C1 creates a file /mnt/other-file.dat, no message is sent, because no subscriber has a topic matching s3://bucket/other-file.dat.
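The matching rule and the two examples above can be expressed in a short sketch. This is illustrative code, not Alluxio's implementation; note that the prefix match is per path component, which is why s3://bucket/other-file.dat matches no topic even though it shares the literal prefix s3://bucket/other.

```java
// Illustrative sketch of topic matching in the invalidation pub/sub.
import java.util.List;
import java.util.Map;

final class InvalidationPublishSketch {
  /** True when {@code topic} is a path-component prefix of {@code path}. */
  static boolean topicMatches(String topic, String path) {
    String t = topic.endsWith("/") ? topic.substring(0, topic.length() - 1) : topic;
    return path.equals(t) || path.startsWith(t + "/");
  }

  /** Publish a modified UFS path to every subscriber whose topic prefixes it. */
  static void publish(String ufsPath, Map<String, List<String>> subscribersByTopic) {
    subscribersByTopic.forEach((topic, clusters) -> {
      if (topicMatches(topic, ufsPath)) {
        clusters.forEach(cluster -> System.out.printf("send %s to %s%n", ufsPath, cluster));
      }
    });
  }

  public static void main(String[] args) {
    Map<String, List<String>> subscribers = Map.of(
        "s3://bucket/folder", List.of("C2"),
        "s3://bucket/other", List.of("C3"));
    publish("s3://bucket/folder/new-file.dat", subscribers); // delivered to C2 only
    publish("s3://bucket/other-file.dat", subscribers);      // no matching topic, nothing sent
  }
}
```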

As mentioned earlier, Alluxio's metadata includes the time when the path was last synchronized. With cross-cluster synchronization, it also records the time of the last invalidation message received for the path over the pub/sub interface. Using this, when a client accesses a path, it will synchronize with the UFS in the following two cases (a sketch of this check follows the list):

a) The path is accessed for the first time.

b) The invalidation time of the path is later than the last synchronization time.
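A minimal sketch of this per-path decision, with illustrative field names rather than Alluxio's actual inode metadata:

```java
// Minimal sketch of the per-path decision above; field names are illustrative.
final class CrossClusterSyncCheck {
  static final long NEVER = -1L;

  static boolean needsSync(long lastSyncMs, long lastInvalidationMs) {
    if (lastSyncMs == NEVER) {
      return true;                          // case (a): first access to the path
    }
    return lastInvalidationMs > lastSyncMs; // case (b): invalidated since the last sync
  }
}
```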

Assuming there are no failures in the system, eventual consistency is clearly guaranteed: every modification to a file causes each subscribing cluster to receive an invalidation message, so the file is synchronized the next time it is accessed.

Figure 1: Mechanism for cross-cluster synchronization during file creation. A. The client creates a file on cluster 1. B. The client writes the file data to a worker. C. The worker writes the file to the UFS. D. The client completes the file on the master. E. Cluster 1 publishes an invalidation message for the file to cluster 2's subscriber. F. Cluster 2 marks the file as requiring synchronization in its metadata synchronization component. When a client accesses the file in the future, steps 1-5 shown in Figure 1 will be used to synchronize it.

Implementing the pub/sub mechanism

The pub/sub mechanism is implemented using a discovery mechanism, which lets clusters know which paths are mounted on other clusters, and a networking component, which is used to send the messages.

The discovery mechanism is a single Java process called CrossClusterMaster, which must be reachable by all Alluxio clusters through a configurable address/port combination. Whenever an Alluxio cluster starts up, it notifies the CrossClusterMaster of the addresses of all of its master nodes. Additionally, whenever a cluster mounts or unmounts a UFS, the mounted path is sent to the CrossClusterMaster. Every time these values are updated, the CrossClusterMaster sends the new values to all Alluxio clusters.

Using this information, each Alluxio cluster computes the intersection of its local UFS mount paths with the UFS mount paths of all external clusters. For each intersecting path, the cluster's master creates a gRPC subscription to the external cluster's master with that path as the topic. In the example in Table 1, C1 will create a subscription to C2 with topic s3://bucket/folder and a subscription to C3 with topic s3://bucket/other. Additionally, C2 will create a subscription to C1 with topic s3://bucket/folder, and C3 will create a subscription to C1 with topic s3://bucket/other. This way, whenever a cluster modifies a path, for example to create a file, it publishes the modified path to any subscriber whose topic is a prefix of that path. For example, if C1 creates a file /mnt/other/file, it will publish s3://bucket/other/file to C3.
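The sketch below shows one way a cluster might derive its subscriptions from the mount lists it receives from the CrossClusterMaster: for every external mount that intersects a local mount, subscribe to that cluster with the intersecting UFS path as the topic. The names are illustrative; this is not the actual Alluxio implementation.

```java
// Sketch of subscription planning from local and external mount lists.
import java.util.List;
import java.util.Map;

final class SubscriptionPlanner {
  /** Two UFS paths intersect when one is a path-component prefix of the other. */
  static boolean intersects(String a, String b) {
    return isPrefix(a, b) || isPrefix(b, a);
  }

  static boolean isPrefix(String prefix, String path) {
    String p = prefix.endsWith("/") ? prefix.substring(0, prefix.length() - 1) : prefix;
    return path.equals(p) || path.startsWith(p + "/");
  }

  /** Plan (cluster, topic) subscriptions given the local and external mount lists. */
  static void plan(List<String> localMounts, Map<String, List<String>> externalMounts) {
    externalMounts.forEach((cluster, mounts) ->
        mounts.forEach(external ->
            localMounts.stream()
                .filter(local -> intersects(local, external))
                // The topic is the narrower (longer) of the two intersecting paths.
                .map(local -> isPrefix(local, external) ? external : local)
                .forEach(topic ->
                    System.out.printf("subscribe to %s with topic %s%n", cluster, topic))));
  }

  public static void main(String[] args) {
    // Cluster C1 from Table 1 mounts the whole bucket; C2 and C3 mount subsets of it.
    plan(List.of("s3://bucket"),
        Map.of("C2", List.of("s3://bucket/folder"), "C3", List.of("s3://bucket/other")));
  }
}
```

Run with the mounts from Table 1, this plans the subscriptions C1 → C2 (topic s3://bucket/folder) and C1 → C3 (topic s3://bucket/other) described above.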

In order to proactively maintain subscriptions to other clusters, a thread runs on each Alluxio master to handle path mounts and unmounts, clusters joining or leaving, and connection failures.

Whenever a subscriber receives a path, it updates the invalidation time in that path's metadata to the current time, so the next time a client accesses the path, it will be synced with the UFS. Following the example above, the next time a client reads the path /file on cluster C3, a sync with the UFS will be performed on s3://bucket/other/file.

Ensuring eventual consistency

If it could be guaranteed that each published message is delivered to all subscribers (including future subscribers) exactly once, then eventual consistency would clearly be guaranteed, because every modification would cause the subscribing clusters to synchronize the path on their next access. However, connections may be interrupted, clusters may leave and join the system, and nodes may fail. So how do we ensure exactly-once delivery of messages? The short answer is: we can't. Instead, exactly-once message delivery is guaranteed only while a subscription (backed by an underlying TCP connection) is active. Additionally, when a subscription is first established, the subscriber marks the metadata of the topic's root path as requiring synchronization. This means that any path having the topic as a prefix will be synchronized the first time it is accessed after the subscription has been established.

For example, when C1 establishes a subscription to C2 with topic s3://bucket/folder, C1 will mark s3://bucket/folder as requiring synchronization. Then, for example, when s3://bucket/folder/file is accessed for the first time, a sync will take place.

This greatly simplifies dealing with failures or configuration changes in the system. If a subscription fails for any reason, such as network issues, a master failover, or a configuration change, the recovery process is the same: the subscription is re-established and the corresponding path is marked as requiring synchronization. To mitigate the effects of network problems, user-defined parameters control how many messages can be buffered in the publisher's send queue and how long an operation will wait before timing out when the queue is full (see the sketch below).
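A minimal sketch of such publisher-side buffering, assuming a bounded per-subscriber queue whose capacity and offer timeout are user-configurable; the class and parameter names are illustrative, not actual Alluxio configuration keys.

```java
// Sketch of publisher-side buffering with a bounded queue and offer timeout.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class PublisherQueueSketch {
  private final BlockingQueue<String> queue;
  private final long offerTimeoutMs;

  PublisherQueueSketch(int capacity, long offerTimeoutMs) {
    this.queue = new ArrayBlockingQueue<>(capacity);
    this.offerTimeoutMs = offerTimeoutMs;
  }

  /**
   * Returns false when the message is dropped because the queue stayed full past the
   * timeout; the subscriber will resynchronize when its subscription is re-established.
   */
  boolean publish(String invalidatedPath) throws InterruptedException {
    return queue.offer(invalidatedPath, offerTimeoutMs, TimeUnit.MILLISECONDS);
  }
}
```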

Of course, we expect that while failures will occur in our system, they will not occur very often; otherwise performance will suffer. Fortunately, even with frequent failures, the performance degradation is similar to that of time-based synchronization. For example, if a failure occurs every 5 minutes, the expected performance is similar to having time-based synchronization enabled with a 5-minute interval.

Note that if the CrossClusterMaster process fails, discovery of new clusters and path mounts will stop working, but the clusters will maintain their existing subscriptions without interruption. Additionally, the CrossClusterMaster is stateless (it can be thought of simply as a rendezvous point for exchanging cluster addresses and mount paths), and can therefore be stopped and restarted when necessary.

Other use cases

As mentioned earlier, in order for this feature to work, all updates to UFS should go through the Alluxio cluster. Of course this condition may not necessarily be met, and there are several ways to deal with this problem.

● Users can manually mark a path as requiring synchronization.

● Time-based synchronization can be enabled together with cross-cluster synchronization.

3. Discussion and conclusion

1. Discussion and future work

Why not use a pub/sub mechanism that ensures exactly-once message delivery?

We know that using a pub/sub mechanism that ensures exactly-once message delivery would greatly simplify our design, and indeed there are many powerful systems such as Kafka and RabbitMQ that were created to solve exactly this problem. The benefit of using these systems is that failures may have less impact on performance. For example, if a subscriber loses its connection, when it reconnects, the system can pick up where it left off.

Even so, deploying and maintaining these systems is a complex task in itself. You need to work out issues such as how many machines to deploy, how many times to replicate each message, how long to retain messages, and whether to block operations when a message cannot be published due to connection problems. Moreover, a failure-recovery mechanism would likely still be needed eventually, resulting in a more complex design.

(Note that for eventual consistency we really only need at-least-once message delivery, since duplicate deliveries only hurt performance, not data consistency; but even in this case, most of the difficulties remain.)

Scaling to more than 20 Alluxio clusters or handling frequent failures

In the future, we hope to support scaling to hundreds of Alluxio clusters, but scaling from 20 clusters to hundreds of clusters may have different design considerations. First, we expect failures to occur more frequently; second, the design may cause significant overhead on the master.

As mentioned earlier, frequent failures can degrade performance to a level similar to time-based synchronization. With several hundred clusters, we expect network or master node failures to occur fairly frequently. (Note that this also depends on the configuration, since a failure only affects clusters whose mounted UFS paths intersect those of the failed cluster; so if the clusters mostly mount disjoint UFS paths, this may not be a big problem.) Furthermore, if the paths mounted by all clusters intersect, each cluster has to maintain subscriptions to all other clusters, and hundreds of messages may need to be sent for a single publication.

In this case, we may need to incorporate a reliable pub/sub mechanism such as Kafka or RabbitMQ, but this would simply replace the point-to-point subscriptions, not change the overall system design. Failures would still occur, and clusters would recover in the same way, by marking intersecting UFS paths as requiring synchronization. A reliable pub/sub mechanism would simply hide many of the failures from Alluxio. For example, if the mechanism reliably stores the last 5 minutes of messages, then only failures lasting longer than 5 minutes need to be recovered by the original method. Furthermore, these systems can scale independently of the number of Alluxio clusters by adding more nodes when necessary. However, using and maintaining these systems creates significant overhead, so it may only be worthwhile in certain configurations.

Some thoughts on consistency

Although this article introduces the basic idea of ensuring eventual consistency, several important details have not been explained.

First, an invalidation message must be published only after the modification to the UFS is complete. Second, the UFS must ensure strong consistency at the level of linearizability or external consistency (as S3, for example, now provides). If either of these conditions is not met, the cluster may not observe the latest version of the file when the subscriber receives the invalidation and performs a sync. Third, if a cluster is disconnected from the CrossClusterMaster and later reconnects, it must also go through the failure-recovery process, since an external cluster may have mounted and modified intersecting paths during the disconnection.

Publishing full metadata

As mentioned earlier, published invalidation messages contain only the modified paths. However, these messages could also include the updated metadata for the path, which would avoid a synchronization on the subscribing cluster. We do not do this because there is no general way to know which version of the metadata is the latest.

For example, suppose two Alluxio clusters C1 and C2 update the same file on the UFS, and on the UFS the update from cluster C1 occurs before the update from cluster C2. Both clusters then publish their updated metadata to a third cluster, C3. Due to network conditions, C2's message arrives before C1's. At this point, C3 needs to know that it should discard the update from C1, because it already has the newer metadata. This would be possible if the metadata contained version information, but unfortunately there is no general way to obtain such versions for all UFSes supported by Alluxio (see the sketch below). Therefore, C3 still has to run a metadata synchronization with the UFS to get the latest version directly from the single source of truth.
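To illustrate what publishing full metadata would require, the sketch below assumes a hypothetical monotonically increasing version number attached to each metadata update; a subscriber could then keep whichever copy has the higher version, regardless of arrival order. No such uniform version exists across the UFSes Alluxio supports, which is why only invalidations are published.

```java
// Illustration only: a hypothetical version number would let a subscriber discard
// stale metadata updates that arrive out of order.
final class VersionedMetadataSketch {
  record Meta(String path, long version, long length) {}

  /** Keep whichever copy has the higher version; ties keep the current copy. */
  static Meta merge(Meta current, Meta incoming) {
    if (current == null || incoming.version() > current.version()) {
      return incoming;
    }
    return current;
  }

  public static void main(String[] args) {
    // C2's (newer) update arrives first, then C1's (older) update: the older one is discarded.
    Meta afterC2 = merge(null, new Meta("s3://bucket/f", 2, 200));
    Meta afterC1 = merge(afterC2, new Meta("s3://bucket/f", 1, 100));
    System.out.println(afterC1.version()); // prints 2
  }
}
```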

Subscribing to notification services

Certain underlying storage systems (UFSes) provide notification services, such as Amazon SNS for S3 and iNotify for HDFS, that let users know when files have been modified. For this type of UFS, subscribing to these services may be a better option than subscribing to Alluxio clusters. The advantage is that writes to the UFS that do not go through Alluxio are also captured. The system design would remain the same; instead of subscribing to other Alluxio clusters, a cluster would subscribe to these notification services.

Note that Alluxio also provides ActiveSync functionality for HDFS, allowing metadata to be kept in sync with the underlying UFS. This is different from the synchronization mechanism across clusters, because ActiveSync performs synchronization when files are updated, while cross-cluster synchronization only performs synchronization when files are accessed.

2. Conclusion

This article introduced the scenarios where running multiple Alluxio clusters is advantageous, and how Alluxio uses time-based synchronization and cross-cluster synchronization to keep clusters in sync with the mounted UFS. For more information about how to deploy the cross-cluster synchronization feature, please refer to the original article.
