Study Notes: Enhancing Fault Tolerance in MPI for Modern InfiniBand


Chapter 2: Design of a Checkpoint/Restart Framework for Multi-Channel MPI Communication

2.1 Introduction and Background


When a user requests a checkpoint, the global CR manager issues a "checkpoint request" and forwards it to the local CR managers (the per-process local controllers). Each local CR manager notifies the InfiniBand communication channel manager, which is part of the MPI process. The channel manager locks all communication, drains the outgoing buffers and any messages still in transit, and releases all network resources. It then calls back to the local CR manager to indicate that it is now safe to take an individual checkpoint. The local CR manager instructs the CR library to checkpoint the process. Our implementation uses the BLCR package as the CR library. Once the process state has been checkpointed, the local CR manager tells the InfiniBand channel manager to resume communication.

Restart works similarly: to restart the MPI job, the global CR manager sends a "restart request" to all local CR managers. On receiving the request, each local CR manager restarts its MPI process through BLCR and notifies the InfiniBand channel manager to recover. As this shows, besides saving local process state, a crucial step in coordinated checkpointing is coordinating communication so that every individual channel has reached a consistent state when the checkpoint is taken.
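The checkpoint sequence described above can be sketched as a small single-threaded simulation. This is only an illustrative model, not the thesis implementation: the class names `GlobalCRManager` and `LocalCRManager` and their methods are hypothetical, the CR library call is a placeholder (no real BLCR API is used), and the three phases run one after another here, whereas real processes would run concurrently.

```python
class LocalCRManager:
    """Hypothetical local CR manager for one MPI process."""

    def __init__(self, rank, log):
        self.rank = rank
        self.log = log
        self.channel_suspended = False

    def suspend_channel(self):
        # Stand-in for: lock communication, drain in-flight data,
        # release network resources.
        self.channel_suspended = True
        self.log.append((self.rank, "suspend"))

    def take_checkpoint(self):
        # Placeholder for the CR library call (BLCR in the source).
        if not self.channel_suspended:
            raise RuntimeError("channel must be suspended before checkpointing")
        self.log.append((self.rank, "checkpoint"))

    def resume_channel(self):
        # Stand-in for: re-establish resources and unlock communication.
        self.channel_suspended = False
        self.log.append((self.rank, "resume"))


class GlobalCRManager:
    """Hypothetical global CR manager that drives all local managers."""

    def __init__(self, local_managers):
        self.local_managers = local_managers

    def checkpoint(self):
        # Phase 1: every channel is suspended before any checkpoint is taken.
        for manager in self.local_managers:
            manager.suspend_channel()
        # Phase 2: each process is checkpointed individually.
        for manager in self.local_managers:
            manager.take_checkpoint()
        # Phase 3: communication resumes everywhere.
        for manager in self.local_managers:
            manager.resume_channel()


log = []
GlobalCRManager([LocalCRManager(r, log) for r in range(2)]).checkpoint()
```

The point of the phased loop is that no process is checkpointed while any channel could still carry traffic, which is the property the coordinated protocol relies on.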

2.2 Checkpoint Framework for Multi-Channel MPI Communication

The proposed method is based on a coordinated checkpoint protocol.

The CR layer is responsible for receiving requests from the job manager, telling each communication layer to suspend or resume transmission, and deciding when to use the CR library (here BLCR) to checkpoint the local process. The CR layer also needs to track where processes are located. For example, if two processes that ran on the same node are restarted on different nodes, it should detect the topology change and switch their communication channel from intra-node shared memory to the network. To preserve in-order message delivery when the channel is switched, the CR layer maintains two important queues: a temporary send queue and a temporary receive queue. Once the checkpoint phase begins, all pending send operations are placed in the temporary send queue. Meanwhile, every channel finishes transmitting its in-flight data, which is delivered into the temporary receive queue. As a result, the communication channels hold no messages during checkpointing, so it is safe to switch channels after a restart.
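The role of the two temporary queues can be illustrated with a toy loopback model. This is a minimal sketch under stated assumptions: `Channel` and `CRLayer` are hypothetical names, a channel is modeled as a simple FIFO of in-flight messages, and sender and receiver are collapsed into one object for brevity.

```python
from collections import deque


class Channel:
    """Toy channel: messages sent but not yet received sit in `in_flight`."""

    def __init__(self, name):
        self.name = name
        self.in_flight = deque()

    def send(self, msg):
        self.in_flight.append(msg)


class CRLayer:
    """Hypothetical CR layer holding the temporary send and receive queues."""

    def __init__(self, channel):
        self.channel = channel
        self.checkpointing = False
        self.temp_send_queue = deque()
        self.temp_recv_queue = deque()

    def send(self, msg):
        if self.checkpointing:
            # Sends issued during the checkpoint phase are deferred.
            self.temp_send_queue.append(msg)
        else:
            self.channel.send(msg)

    def begin_checkpoint(self):
        self.checkpointing = True
        # Drain in-flight data into the temporary receive queue so the
        # channel itself is empty when the checkpoint is taken.
        while self.channel.in_flight:
            self.temp_recv_queue.append(self.channel.in_flight.popleft())

    def end_checkpoint(self, new_channel=None):
        if new_channel is not None:
            # For example, shared memory replaced by the network after restart.
            self.channel = new_channel
        self.checkpointing = False
        # Replay deferred sends in their original order.
        while self.temp_send_queue:
            self.channel.send(self.temp_send_queue.popleft())

    def recv(self):
        # Drained messages are older than anything on the channel.
        if self.temp_recv_queue:
            return self.temp_recv_queue.popleft()
        if self.channel.in_flight:
            return self.channel.in_flight.popleft()
        return None


layer = CRLayer(Channel("shared-memory"))
layer.send("a")
layer.send("b")
layer.begin_checkpoint()
layer.send("c")                       # deferred, not put on the old channel
layer.end_checkpoint(Channel("network"))
received = [layer.recv(), layer.recv(), layer.recv()]
```

Because the receive side always empties the temporary receive queue before reading the (possibly new) channel, messages sent before the checkpoint are delivered ahead of those sent after it.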


Every channel that must be suspended during checkpointing provides these functions, exposing a standard suspend/resume hook interface to the CR layer. The CR layer calls these interfaces when needed and does not have to consider the detailed design of the individual channels.

2.3 Detailed Design and Challenges

2.3.1 Point-to-Point Communication

The new CR framework provides hooks through which each communication channel registers callback functions with the framework. The framework invokes each channel's callbacks during the checkpoint process. To make point-to-point communication checkpointable, each channel registers two callbacks, named the "Suspend callback" and the "Resume callback".

SuspendCallback function

The CR framework invokes the suspend callback before the CR library takes the checkpoint. Its task is to prepare the channel for the checkpoint. Because coordinated checkpointing is used, the suspend routine must ensure that the channel has no unprocessed sent messages and no messages in flight. This guarantees that all processes are in a consistent state.

ResumeCallback function

The CR framework invokes the resume callback after the CR library has taken the checkpoint. It is also invoked when a process is restarted from a previous checkpoint stored on disk.
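The registration of suspend and resume callbacks per channel can be sketched as follows. This is an illustrative sketch only: `CRFramework`, `register_channel`, and the `checkpoint_fn` parameter are hypothetical names, and the checkpoint function stands in for the CR library call.

```python
class CRFramework:
    """Hypothetical framework that drives per-channel suspend/resume hooks."""

    def __init__(self, checkpoint_fn):
        self._channels = []              # (name, suspend_cb, resume_cb)
        self._checkpoint_fn = checkpoint_fn

    def register_channel(self, name, suspend_cb, resume_cb):
        # The framework only stores the two hooks; it needs no knowledge
        # of how the channel is implemented.
        self._channels.append((name, suspend_cb, resume_cb))

    def checkpoint(self):
        # Suspend every channel before the CR library runs.
        for _name, suspend_cb, _resume_cb in self._channels:
            suspend_cb()
        self._checkpoint_fn()
        # Resume every channel once the checkpoint has been taken.
        for _name, _suspend_cb, resume_cb in self._channels:
            resume_cb()


events = []
framework = CRFramework(lambda: events.append("checkpoint"))
framework.register_channel(
    "network",
    lambda: events.append("network: suspend"),
    lambda: events.append("network: resume"),
)
framework.register_channel(
    "shared-memory",
    lambda: events.append("shared-memory: suspend"),
    lambda: events.append("shared-memory: resume"),
)
framework.checkpoint()
```

Adding a new channel type then only requires supplying its two callbacks, which matches the stated goal that the CR layer need not know channel internals.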

2.3.3 Checkpointing Collective Operations

Checkpointing collective operations uses two callbacks: the Request to Checkpoint callback and the Checkpoint Complete callback.

Request to Checkpoint Callback function

The Request to Checkpoint (RTC) callback is invoked by the local CR manager when a checkpoint request arrives. The call notifies the collectives of the checkpoint request and returns immediately. After the call returns, the local CR manager waits until the collectives signal that they are ready to be checkpointed.

In the intra-node design, each process stops its collective communication and atomically increments a special field in the shared memory region. After incrementing it, each process keeps polling this field. When its value equals the number of processes on the local node, every process has stopped collective communication, and all local processes have reached a consistent state. A leader is then chosen to copy the contents of the collective shared memory region into a local buffer outside shared memory. Once this is done, all processes on the node call CTC (Clear to Checkpoint) and wait until the checkpoint is complete.

The CTC (Clear to Checkpoint) function is passed to the collectives as a function pointer in the RTC call. The collectives call CTC when they are ready to be checkpointed.

The CR framework can now instruct the CR library to perform the checkpoint.
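The intra-node quiescence steps above can be modeled with threads standing in for the processes on one node. This is a sketch under assumptions, not the thesis code: `NodeSharedState`, `quiesce_collectives`, and `run_node` are hypothetical names, a lock emulates the atomic increment, a condition variable replaces busy polling of the counter, and local rank 0 is assumed to be the leader.

```python
import threading


class NodeSharedState:
    """Stand-in for the collective shared memory region on one node."""

    def __init__(self, nprocs, region):
        self.nprocs = nprocs
        self.region = region                    # collective shared data
        self.stopped = 0                        # the "special field"
        self.all_stopped = threading.Condition()
        self.copied = threading.Event()
        self.saved = None                       # leader's local buffer


def quiesce_collectives(state, local_rank, ctc):
    """Work done by one process after it is told of a checkpoint request."""
    with state.all_stopped:
        state.stopped += 1                      # emulated atomic increment
        state.all_stopped.notify_all()
        # Wait until every local process has stopped collective traffic.
        state.all_stopped.wait_for(lambda: state.stopped == state.nprocs)
    if local_rank == 0:
        # The leader copies the shared region into a private buffer.
        state.saved = list(state.region)
        state.copied.set()
    state.copied.wait()
    ctc(local_rank)                             # Clear to Checkpoint


def run_node(nprocs, region):
    state = NodeSharedState(nprocs, region)
    ready = []
    ready_lock = threading.Lock()

    def ctc(rank):
        with ready_lock:
            ready.append(rank)

    threads = [
        threading.Thread(target=quiesce_collectives, args=(state, r, ctc))
        for r in range(nprocs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state, sorted(ready)


state, ready = run_node(4, [10, 20, 30])
```

Passing `ctc` as an argument mirrors the text's description of CTC being handed over as a function pointer, so the collectives can report readiness without knowing who is waiting for it.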

Checkpoint Complete Callback function

After the checkpoint has been taken, or after the process has been restarted from a previously taken checkpoint, the framework invokes the Checkpoint Complete (CC) callback to indicate that the CR operation is complete.

When CC is invoked, the processes that were waiting for the checkpoint to complete are woken up. The leader process re-creates the collective shared memory region and restores into it the data saved in its local buffer. Once the leader has finished creating the shared memory region, all processes return from the callback.

The system is now in the same consistent state it was in before RTC was invoked, so all communication channels resume normal operation.
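The Checkpoint Complete step can be sketched in the same thread-based style. Again this is only an illustrative model with hypothetical names (`CollectiveState`, `checkpoint_complete`, `run_checkpoint_complete`); threads stand in for processes, a Python list stands in for the shared memory region, and rank 0 is assumed to be the leader.

```python
import threading


class CollectiveState:
    """Stand-in for per-node collective state after a checkpoint or restart."""

    def __init__(self, saved_buffer):
        self.saved_buffer = saved_buffer        # leader's copy made before the checkpoint
        self.region = None                      # shared region, not yet re-created
        self.region_ready = threading.Event()


def checkpoint_complete(state, is_leader):
    """Hypothetical CC callback body for one process."""
    if is_leader:
        # The leader re-creates the region from its local buffer.
        state.region = list(state.saved_buffer)
        state.region_ready.set()
    # Everyone returns only once the region exists again.
    state.region_ready.wait()
    return state.region


def run_checkpoint_complete(nprocs, saved_buffer):
    state = CollectiveState(saved_buffer)
    seen = [None] * nprocs

    def worker(rank):
        seen[rank] = checkpoint_complete(state, is_leader=(rank == 0))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


seen = run_checkpoint_complete(3, [7, 8, 9])
```

Every process observes the restored region before it returns, which corresponds to the text's statement that the system is back in its pre-RTC state when the callback completes.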

 


Origin blog.csdn.net/juan190755422/article/details/42678959