2010 RDMA-Based Job Migration Framework for MPI over InfiniBand: study notes

 

Nodes that are not involved in the migration also enter a waiting state (they enter the migration barrier).

The NLA assigns a state to each node in order to distinguish the nodes that are in use from the spare nodes. By changing this state it can also mark a node as the migration source or the migration destination, or mark a node as inactive once the processes that were running on it have been migrated to another node and it is no longer active.

 

Design and Implementation:

An NLA is deployed on each node, and the migration procedure is initialized (started) when a migration is triggered. In this design, the three MVAPICH2 components register with the FTB (Fault Tolerance Backplane) agent in order to send and receive messages.

Job Manager:

(1) Starts the NLAs on the original and spare nodes during the start-up stage

(2) Coordinates the interactions between the various components

Node Launch Agent:

(1) Together with the Job Manager, forms a hierarchical and extensible job start-up framework

(2) Is responsible for starting and managing the application processes on its local node

(3) The NLA has been extended so that it also supports start-up operations on the spare (migration target) nodes.

C/R:

Each MPI process has an associated C/R (checkpoint/restart) process. By calling BLCR, the C/R process is responsible for taking a checkpoint on the source node and restarting the process on the target node.
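The notes contain no code, but the checkpoint/restart step can be pictured with BLCR's command-line tools. The sketch below is a minimal illustration, assuming the cr_checkpoint and cr_restart utilities are installed (option spellings may differ between BLCR versions); it is not the MVAPICH2 C/R implementation, which calls the BLCR library directly.

```c
/* Minimal sketch (not the paper's implementation): drive a BLCR
 * checkpoint/restart by exec'ing BLCR's command-line tools.
 * Assumes cr_checkpoint/cr_restart are on PATH; options may vary
 * across BLCR versions. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Checkpoint the MPI process `pid` into `ctx_file` on the source node. */
static int checkpoint_process(pid_t pid, const char *ctx_file)
{
    char pid_str[32];
    snprintf(pid_str, sizeof(pid_str), "%d", (int)pid);

    pid_t child = fork();
    if (child == 0) {
        execlp("cr_checkpoint", "cr_checkpoint", "-f", ctx_file, pid_str, (char *)NULL);
        _exit(127);                      /* exec failed */
    }
    int status = 0;
    waitpid(child, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Restart a process from `ctx_file` on the destination node. */
static int restart_process(const char *ctx_file)
{
    pid_t child = fork();
    if (child == 0) {
        execlp("cr_restart", "cr_restart", ctx_file, (char *)NULL);
        _exit(127);
    }
    int status = 0;
    waitpid(child, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```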

 

A. The proposed process migration procedure in MVAPICH2

Process migration can be triggered either by a user request or by the health reports provided by a health monitoring component.

Phase 1: Job suspension (obtaining a globally consistent state)

First, at job start-up, the Job Manager provisions a spare node in addition to the compute nodes that run the parallel job. The Job Manager starts an NLA on each original host node and on the spare node, and registers all the relevant information with the FTB (Fault Tolerance Backplane). The NLAs on the original nodes are in the "MIGRATION_READY" state, and the NLA on the spare node is in the "MIGRATION_SPARE" state.
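The NLA states mentioned throughout these notes can be summarized in a small sketch. The enum values follow the names used in the text; the struct and helper function are hypothetical, not MVAPICH2 code.

```c
/* Illustrative sketch only: the NLA node states named in these notes,
 * with the transitions described in Phases 1-4.  The struct and
 * function are hypothetical, not MVAPICH2 code. */
typedef enum {
    MIGRATION_READY,     /* node hosts active MPI processes          */
    MIGRATION_SPARE,     /* spare node, waiting to become a target   */
    MIGRATION_INACTIVE   /* former source node, processes moved away */
} nla_state_t;

typedef struct {
    const char *hostname;
    nla_state_t state;
} nla_t;

/* After Phase 2 the source node becomes inactive; after Phase 3 the
 * spare (target) node becomes ready. */
static void migration_update_states(nla_t *source, nla_t *target)
{
    source->state = MIGRATION_INACTIVE;
    target->state = MIGRATION_READY;
}
```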

Once the migration is triggered, the Job Manager determines the correspondence between the migration source node and the migration destination node, and initiates the process migration by publishing an "FTB_MIGRATION" message that contains the node names. The NLAs and MPI processes on all nodes receive this message. Upon receiving it, every MPI process suspends its MPI communication activities, drains all in-flight MPI messages, and tears down its communication end-points. Because of the characteristics of native InfiniBand networks, this is necessary in order to checkpoint the individual MPI processes in a globally consistent state, for the following reasons (a verbs-level teardown sketch follows the list):

- First, InfiniBand achieves high performance by letting user-level communication protocols bypass the operating system. Since the OS is bypassed during the actual communication, it has only incomplete knowledge of the network activity, so it is very difficult for the operating system to directly stop network activity without losing global consistency.

- Second, unlike a conventional TCP/IP environment where the communication context resides in kernel memory, much of the InfiniBand connection context is cached on the network adapter and is not available in kernel memory. The connection context must therefore be released before a checkpoint is taken and re-established afterwards.

- Third, to obtain high performance, some of the InfiniBand connection context is even cached on remote nodes, for example the address information a remote node caches for RDMA. These remote caches must be released before checkpointing; otherwise, when the connection context is rebuilt at restart, the cached state on those nodes becomes invalid and introduces potential inconsistencies.
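A minimal sketch of what releasing the communication end-points can look like at the InfiniBand verbs level is shown below. This is illustrative only, not MVAPICH2's actual teardown path, and the ib_conn_t structure is hypothetical: the queue pair, completion queue, memory registration, protection domain and device context are destroyed before the checkpoint and recreated after restart.

```c
/* Illustrative verbs-level teardown, roughly the order required before
 * a BLCR checkpoint can be taken.  The ib_conn_t structure is
 * hypothetical. */
#include <infiniband/verbs.h>

typedef struct {
    struct ibv_context *ctx;
    struct ibv_pd      *pd;
    struct ibv_cq      *cq;
    struct ibv_qp      *qp;
    struct ibv_mr      *mr;
} ib_conn_t;

static void teardown_endpoint(ib_conn_t *c)
{
    if (c->qp)  ibv_destroy_qp(c->qp);    /* connection state on the HCA */
    if (c->cq)  ibv_destroy_cq(c->cq);
    if (c->mr)  ibv_dereg_mr(c->mr);      /* pinned, registered memory   */
    if (c->pd)  ibv_dealloc_pd(c->pd);
    if (c->ctx) ibv_close_device(c->ctx); /* HCA context                 */
}
```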


Phase 2: Job Migration

At the end of the first phase, all processes have suspended their MPI communication and released their communication channels. At this point a globally consistent state has been reached, and the second phase begins. The non-migrating processes simply remain suspended. Meanwhile, the migrating processes on the source node are checkpointed with BLCR and their process snapshots are sent to the destination node. We implement an RDMA-based process migration scheme using an extended BLCR library; this scheme exploits the low latency of the InfiniBand network to achieve low-cost process migration.

Once all the process states have been migrated to the target node, the NLA on the source node publishes an "FTB_MIGRATION_PIIC" message to mark the end of the second phase. The NLA on the source node then transitions to the "MIGRATION_INACTIVE" state, indicating that this NLA is no longer active.
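The coordination messages above ("FTB_MIGRATION", "FTB_MIGRATION_PIIC", "FTB_RESTART") travel over the Fault Tolerance Backplane. The sketch below is purely illustrative: publish_event() and migration_event_t are made-up names, not the real FTB client API, and only show the kind of payload the notes describe (source/target host names for each phase).

```c
/* Hypothetical illustration of the coordination events in the
 * migration protocol.  publish_event() and migration_event_t are
 * invented for this sketch; they are NOT the FTB client API. */
#include <stdio.h>

typedef struct {
    const char *name;        /* e.g. "FTB_MIGRATION", "FTB_MIGRATION_PIIC" */
    const char *source_host; /* migration source node                      */
    const char *target_host; /* migration destination (spare) node         */
} migration_event_t;

static void publish_event(const migration_event_t *ev)
{
    /* A real implementation would hand this to the FTB agent; here we
     * only print it to show the information each phase carries. */
    printf("publish %s: %s -> %s\n", ev->name, ev->source_host, ev->target_host);
}

int main(void)
{
    migration_event_t start = { "FTB_MIGRATION",      "node03", "spare01" };
    migration_event_t done  = { "FTB_MIGRATION_PIIC", "node03", "spare01" };
    publish_event(&start);   /* Phase 1: announce the source/target pairing */
    publish_event(&done);    /* Phase 2: all snapshots copied to the target */
    return 0;
}
```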

Phase 3: Restart on the Spare Node

After receiving the "FTB_MIGRATION_PIIC" message, the process manager adjusts the mpispawn tree structure to accommodate the topology change caused by some processes migrating to a new node. It then broadcasts an "FTB_RESTART" message whose payload contains the target host and the ordered list of processes being migrated to the migration destination node.

When the "FTB_RESTART" message reaches the destination node, its Node Launch Agent (NLA) extracts the MPI process placement information from the payload and restarts the migrated MPI processes on the destination node from the checkpoint snapshots transferred in the second phase. The NLA on the migration target node then changes its state from "MIGRATION_SPARE" to "MIGRATION_READY", indicating that it is now active.

Phase 4: Resume

Once these MPI processes are restarted on the target node, they enter the MPI migration barrier. At this point all nodes are in the migration barrier, all processes synchronize, and the migration barrier can then be released (completing the migration). After leaving the migration barrier, all nodes re-establish their communication end-points and resume communication activities. The process migration cycle is now complete, and the job is ready for the next migration cycle.
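At the verbs level, re-establishing a communication end-point means recreating the resources released in Phase 1. The following is a minimal, illustrative setup sequence (again not MVAPICH2's actual code, and reusing the hypothetical ib_conn_t from the teardown sketch); the newly created queue pairs still have to be exchanged and transitioned to the ready-to-send state before MPI traffic can actually resume.

```c
/* Illustrative verbs-level re-establishment of an end-point after the
 * migration barrier; error handling is omitted.  ib_conn_t is the
 * hypothetical structure from the teardown sketch above. */
#include <stdlib.h>
#include <infiniband/verbs.h>

static int setup_endpoint(ib_conn_t *c, void *buf, size_t len)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) return -1;

    c->ctx = ibv_open_device(devs[0]);
    c->pd  = ibv_alloc_pd(c->ctx);
    c->cq  = ibv_create_cq(c->ctx, 128, NULL, NULL, 0);
    c->mr  = ibv_reg_mr(c->pd, buf, len,
                        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

    struct ibv_qp_init_attr attr = {
        .send_cq = c->cq, .recv_cq = c->cq,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,        /* reliable connection */
    };
    c->qp = ibv_create_qp(c->pd, &attr);

    ibv_free_device_list(devs);
    /* The new QPs must still be exchanged with the peers and moved to
     * the RTS state before MPI communication resumes (omitted here). */
    return c->qp ? 0 : -1;
}
```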


B. RDMA-based process migration

In the second phase, the snapshots of the migrating processes are moved from the source node to the destination node. A snapshot is a single-process checkpoint of a process on the source node, obtained with the BLCR library. In a local-storage approach, process migration works by checkpointing with BLCR into the local file system, copying the checkpoint files to the target node, and restarting from those checkpoint files on the destination node. However, this approach incurs serious storage overhead as well as the overhead of transferring the files to the destination node. Even if the checkpoint files are stored in a global file system, the cost is not reduced: if several processes are checkpointed, the concurrent write streams from the same node conflict with each other, which significantly degrades global file system performance, and there are further conflicts when different nodes store into the same file of the global file system.

An online process migration mechanism can replicate the in-memory snapshot of a process to another node without any file system overhead. Such a mechanism has been implemented over TCP/IP networks by giving BLCR a TCP socket as the output/input file descriptor on the source/destination node, but the TCP/IP protocol stack then incurs heavy memory-copy overhead. Although InfiniBand provides a socket abstraction through IPoIB, it achieves only sub-optimal performance, because it still follows the socket-based protocol with memory copies and cannot take advantage of InfiniBand's zero-copy RDMA mechanism.


In this section, we present an RDMA-based process migration strategy that uses the high bandwidth of InfiniBand for transferring the checkpoint data. On the migration source node, our strategy modifies the BLCR library so that, instead of writing checkpoint files, the checkpoint data of multiple MPI processes is aggregated into a local buffer pool in which each chunk contains the data of one process. The destination node pulls these large data chunks with RDMA read operations and reconstructs the checkpoint files of the different processes. Figure 3 shows the basic design of this strategy.
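The per-chunk bookkeeping described here, which process a chunk belongs to, its size and offset in that process's checkpoint image, and the RDMA information needed to read it, can be pictured as a small descriptor like the one below. The field names are illustrative only, not the actual MVAPICH2/BLCR structures.

```c
/* Hypothetical descriptor for one filled buffer-pool chunk, carried in
 * the RDMA_Read request from the source to the destination buffer
 * manager.  Field names are illustrative only. */
#include <stdint.h>

typedef struct {
    /* (1) RDMA information the destination needs to issue the read    */
    uint64_t remote_addr;   /* address of the chunk in the source pool */
    uint32_t rkey;          /* remote key of the registered buffer pool */
    /* (2) placement information to reassemble checkpoint files        */
    int      mpi_rank;      /* which migrating process the data belongs to  */
    uint64_t file_offset;   /* offset of this chunk in that process's image */
    uint32_t data_size;     /* number of valid bytes in the chunk           */
} chunk_desc_t;
```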

On the source node, a user-level buffer manager prepares a pool of buffers at checkpoint initialization time in the second phase, and the pool is mapped into kernel space by BLCR. The buffer manager runs in user space because large buffers can be allocated and controlled more flexibly there than in the kernel. During the checkpoint, whenever a process executing BLCR code in kernel space has checkpoint data to save, it requests a chunk buffer from the buffer manager. If a buffer is available in the pool, it is assigned to that process, and all of that process's checkpoint data is written into the chunk until it is full. When a chunk is filled, it is handed back to the buffer pool and a new free chunk is chosen.

Whenever a filled chunk is returned to the pool, the buffer manager on the source node sends an RDMA_Read request to the buffer manager on the target node. This request contains two kinds of information: (1) the RDMA information the destination buffer manager needs in order to perform the RDMA read operation that pulls the data, and (2) placement information (e.g., the process rank, data size, and data offset), so that chunks belonging to the same process can be concatenated into a complete checkpoint file. If a free chunk is available in the destination node's buffer pool, an RDMA read operation pulls the chunk over to the target node, and the chunk is written into the checkpoint file at its proper place. The target node then sends an RDMA_Read reply, telling the source buffer manager to release the corresponding buffer chunk. Once all processes on the source node have been checkpointed and their snapshots migrated to the destination node, the processes are restarted there in the third phase.
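On the destination side, "pulling" a chunk is a one-sided RDMA read posted against the address/rkey carried in the request. Below is a minimal illustrative verbs sketch of that pull, using the hypothetical chunk_desc_t above; it is not the actual extended-BLCR code, and the qp/cq/local_mr arguments come from an already connected end-point such as the one in the earlier setup sketch.

```c
/* Illustrative destination-side pull of one chunk with an RDMA read,
 * followed by writing the data into the reassembled checkpoint file.
 * Error handling is minimal. */
#include <stdint.h>
#include <unistd.h>
#include <infiniband/verbs.h>

static int pull_chunk(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *local_mr, void *local_buf,
                      const chunk_desc_t *d, int ckpt_fd)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = d->data_size,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id               = (uintptr_t)d,
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_READ,   /* one-sided pull */
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = d->remote_addr,
        .wr.rdma.rkey        = d->rkey,
    };
    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* Wait for the read to complete. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    if (wc.status != IBV_WC_SUCCESS)
        return -1;

    /* Place the chunk at its offset in the process's checkpoint file. */
    if (pwrite(ckpt_fd, local_buf, d->data_size, (off_t)d->file_offset) < 0)
        return -1;
    return 0;   /* the reply releasing the source buffer is sent separately */
}
```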

Origin blog.csdn.net/juan190755422/article/details/42585929