Distributed Snapshots

First, we introduce a method for recording a consistent global state of a distributed system. (The algorithm presented here cannot handle failures.) Even when there are no failures, obtaining a consistent global state in a distributed system is difficult. (This algorithm was proposed by Chandy & Lamport 85.)
System model:
A distributed system is modeled as a directed graph G = (P, C), where P is the set of processes and C is the set of logical unidirectional channels.
channel: reliable: messages are not lost in transit. FIFO: the ordering of messages is preserved.
state of the system: the states of all processes together with the states of all channels.
events: an event (send, receive, or internal) occurring in a process P may change the state of P itself and the state of at most one channel.
A process can record only its own state and the messages it sends or receives; it cannot record anything else.
execution: an execution of a distributed system is a sequence of states. (Whenever an event occurs in the system, it triggers a state transition.)
The goal of the distributed snapshot scheme: to ensure that, through proper cooperation among the processes, the local states recorded by the individual processes combine into a correct system state.
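A simple example: the "single-token conservation" system. (This is a counterexample showing that obtaining a consistent global state is not a simple matter.) Consider a system with only two processes, C and C', and one token. Suppose the distributed snapshot protocol we design is as follows:
Whenever a process holds the token, it records its own local state. Clearly this protocol is not correct: when we combine the local states of C and C' into a global state, we find that the system "has two tokens", which is a contradiction. (In the same way, one can construct an "incorrect" protocol under which the global state contains no token at all.)

The distributed snapshot algorithm
One process initiates the state recording procedure. Suppose process P initiates state recording: P first records its own local state and, at the same time, sends a special message, the "marker", on every outgoing channel.
When another process receives the "marker", it records its own local state and/or the state of the channel between the sender of the "marker" and itself, and then forwards the "marker" along all of its outgoing channels.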

The distributed snapshot algorithm is as follows:
Here is a brief look at how the algorithm works:
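First, when a process Q that has not previously received a "marker" receives one over channel C, Q records its own local state and then records the state of channel C as empty.
Later, if Q receives a "marker" over another channel C', Q does not record its local state again. Instead, it records as the state of C' the messages that arrived on C' during the interval between Q's first receipt of a "marker" and the arrival of the "marker" on C'.
In both cases, Q forwards the "marker" when it first receives one.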
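The marker rules above can be tried out on the single-token system from the earlier example. The following is only a minimal sketch under stated assumptions: the class `Process`, the helpers `connect` and `deliver`, and the message strings are hypothetical names introduced for illustration, channels are modeled as in-memory FIFO queues, and messages are delivered by hand in a fixed order.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical encoding of the special marker message


class Process:
    def __init__(self, name, tokens):
        self.name = name
        self.tokens = tokens        # application state: tokens currently held
        self.out = {}               # peer name -> outgoing FIFO channel
        self.incoming = []          # names of peers that have a channel into us
        self.recorded_state = None  # local state captured by the snapshot
        self.channel_state = {}     # peer name -> messages recorded for that channel
        self.recording = set()      # incoming channels still being recorded

    def send_token(self, peer):
        self.tokens -= 1
        self.out[peer].append("TOKEN")

    def start_snapshot(self):
        # Record the local state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        self.recorded_state = self.tokens
        self.recording = set(self.incoming)
        self.channel_state = {peer: [] for peer in self.incoming}
        for channel in self.out.values():
            channel.append(MARKER)

    def receive(self, sender, msg):
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot()       # first marker: record and forward
            # The channel from `sender` is now complete. If this was the first
            # marker, nothing was recorded for it, so its state is empty.
            self.recording.discard(sender)
        else:
            self.tokens += 1
            if sender in self.recording:    # message was in flight at snapshot time
                self.channel_state[sender].append(msg)


def connect(a, b):
    """Create a unidirectional FIFO channel a -> b."""
    a.out[b.name] = deque()
    b.incoming.append(a.name)


def deliver(src, dst):
    """Deliver the oldest message on the channel src -> dst."""
    dst.receive(src.name, src.out[dst.name].popleft())


p, q = Process("P", tokens=1), Process("Q", tokens=0)
connect(p, q)
connect(q, p)

p.send_token("Q")    # the token is now in flight on P -> Q
q.start_snapshot()   # Q initiates the snapshot while holding no token
deliver(p, q)        # Q receives the token after recording its state
deliver(q, p)        # P receives Q's marker, records its state, forwards a marker
deliver(p, q)        # Q receives P's marker and closes channel P -> Q

in_flight = sum(len(m) for m in p.channel_state.values()) + \
            sum(len(m) for m in q.channel_state.values())
total = p.recorded_state + q.recorded_state + in_flight
print("tokens in recorded global state:", total)
```

In this run neither process holds the token in its recorded local state, yet the token is not missing from the snapshot: it shows up in the recorded state of channel P -> Q, so the recorded global state still contains one token.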

5.2.2 A Distributed Checkpointing and Rollback Recovery Method
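We now discuss a method for establishing distributed checkpoints, together with the corresponding rollback recovery method.

In the distributed checkpointing method, each process records only its own local state. If a set of checkpoints forms a consistent global state, the set is called consistent. Note that channel states are not recorded here; only the local states of the processes are recorded (in the distributed snapshot algorithm, by contrast, channel states must be recorded).

As mentioned in earlier sections, rolling back a process may leave the whole system state inconsistent, because rolling back a process may undo a sending event (turning some message into an orphan). (Note: an orphan message is a message that has already been received by some process but for which no sender can be found; a lost message is a message that has been sent but has not yet been received.) In the same way, a rollback may also produce lost messages. We do not need to consider lost messages, however, because when the system detects a lost message it can simply handle it by retransmission. If an orphan message appears, on the other hand, we have no way of knowing which process sent it, so in that case the system has to roll back once more.

The method discussed in this section only prevents orphan messages (checkpoints are established only at "appropriate" times). A consistent state in this section therefore means a global state that contains no orphan messages, although it may contain lost messages.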
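To make the two definitions concrete, the sketch below classifies messages relative to a set of checkpoints. It is an illustration under assumptions, not part of the method itself: the `Message` record, the functions `classify` and `is_consistent`, and the use of local event indices as "times" are all hypothetical choices, and the sample messages are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: int  # index of the send event in the sender's local history
    recv_time: int  # index of the receive event in the receiver's local history


def classify(msg, cut):
    """Classify msg relative to cut, a map: process name -> checkpoint index.

    An event is "recorded" if it happened at or before its process's checkpoint.
    """
    send_recorded = msg.send_time <= cut[msg.sender]
    recv_recorded = msg.recv_time <= cut[msg.receiver]
    if recv_recorded and not send_recorded:
        return "orphan"  # received, but the send has been undone
    if send_recorded and not recv_recorded:
        return "lost"    # sent, but not (yet) received
    return "ok"


def is_consistent(messages, cut):
    """Consistent in the sense of this section: no orphan messages."""
    return all(classify(m, cut) != "orphan" for m in messages)


# Invented sample data: checkpoints at local event 3 on P and event 4 on Q.
cut = {"P": 3, "Q": 4}
m1 = Message("P", "Q", send_time=5, recv_time=2)  # receive recorded, send not
m2 = Message("Q", "P", send_time=1, recv_time=6)  # send recorded, receive not
m3 = Message("P", "Q", send_time=1, recv_time=3)  # both recorded

print(classify(m1, cut), classify(m2, cut), classify(m3, cut))
print(is_consistent([m2, m3], cut), is_consistent([m1, m2, m3], cut))
```

A checkpoint set that contains only `m2` and `m3` passes the check even though `m2` is lost, which matches the weaker notion of consistency used in this section.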
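The algorithm discussed in this section divides checkpoints into two kinds, permanent and tentative, and stores both kinds on stable storage. A permanent checkpoint cannot be undone (once established, it exists from then on). A tentative checkpoint can either be undone (after it is established, the process may back out for some reason, discard the checkpoint, and act as if nothing had happened) or be turned into a permanent checkpoint.

The basic idea of this algorithm is very similar to that of the distributed snapshot algorithm, with one difference: this algorithm uses a two-phase protocol to guarantee that either all processes take a checkpoint or no process does. This is why checkpoints are divided into two kinds (permanent and tentative). When a process records its own local state, a tentative checkpoint is established. When the process learns that the tentative checkpoint it has just established is acceptable, the tentative checkpoint is turned into a permanent checkpoint.

We now briefly describe how the algorithm works: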
Phase 1: An initiator Q first takes a tentative checkpoint and then sends a message asking all other processes to take tentative checkpoints as well. When a process P receives such a message, P establishes a tentative checkpoint. P may, however, choose not to take a tentative checkpoint (some local reason may prevent it). P then reports its decision to the initiator Q, and from this point until the end of Phase 2, P must not communicate with any other process.
Phase 2:
If the initiator Q learns that all other processes have taken tentative checkpoints, Q sends a message that makes all tentative checkpoints permanent. If any process reports that it did not take a tentative checkpoint, Q sends a message telling the processes that did take tentative checkpoints to undo them.
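This simple method guarantees that checkpoints taken together form a consistent global state, that is, no orphan message is recorded (although lost messages may be recorded; note that the notion of consistent global state used here differs from the definition in the previous subsection). Let us see why. An orphan message can arise in only one way: Q sends a message m to P after its local checkpoint Cq, and m is received by P before P's local checkpoint Cp. If some global state contains both checkpoints Cp and Cq, then m is an orphan message. This situation cannot occur, however, because from the moment a process starts taking its tentative checkpoint until the end of Phase 2 it is not allowed to communicate with other processes. Therefore, a message sent after checkpoint Cq can only be received after checkpoint Cp.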
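The two-phase protocol described above can be sketched in a few lines. This is a simplified, single-threaded illustration under assumptions: the class `Participant`, the function `coordinated_checkpoint`, and the integer `state` are hypothetical, real message passing and stable storage are replaced by direct method calls and attributes, and failures are ignored.

```python
class Participant:
    def __init__(self, name, willing=True):
        self.name = name
        self.willing = willing  # local decision: may this process checkpoint now?
        self.state = 0          # stand-in for the process's local state
        self.tentative = None   # tentative checkpoint, if any
        self.permanent = None   # latest permanent checkpoint, if any
        self.blocked = False    # True while communication is forbidden

    def take_tentative(self):
        """Phase 1: record a tentative checkpoint if willing, then block."""
        if self.willing:
            self.tentative = self.state
        self.blocked = True     # no communication until Phase 2 ends
        return self.willing     # the decision reported to the initiator

    def finish(self, commit):
        """Phase 2: make the tentative checkpoint permanent, or undo it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.blocked = False


def coordinated_checkpoint(initiator, others):
    """Run both phases; return True if the checkpoints became permanent."""
    everyone = [initiator, *others]
    replies = [proc.take_tentative() for proc in everyone]  # Phase 1
    commit = all(replies)
    for proc in everyone:                                   # Phase 2
        proc.finish(commit)
    return commit


q, p, r = Participant("Q"), Participant("P"), Participant("R")
q.state, p.state, r.state = 7, 3, 5
first = coordinated_checkpoint(q, [p, r])   # everyone is willing

q.state, p.state, r.state = 8, 4, 6
r.willing = False                           # R declines this round
second = coordinated_checkpoint(q, [p, r])

print(first, second, q.permanent, p.permanent, r.permanent)
```

In the second round R declines, so the tentative checkpoints taken by Q and P are undone and every process keeps the permanent checkpoint from the first round.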
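Although the method above is guaranteed to produce a consistent set of checkpoints, we do not actually have to work this hard: some new checkpoints combined with some old checkpoints may also form a consistent set of checkpoints. For example, consider the situation shown in the figure below: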

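Consider a program made up of three processes P, Q, and R. P sends a message asking the other processes to take a checkpoint together, and suppose the checkpoints produced on P, Q, and R are {p2, q2, r2}. Without question, {p2, q2, r2} is a consistent set of checkpoints. We can also see, however, that {p2, q1, r2} forms a consistent set of checkpoints as well (because no orphan message can be found in this combination). So process Q does not actually need to take a checkpoint, because the old checkpoint q1 is already sufficient.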
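This idea is applied in the protocol presented next, in order to reduce as far as possible the number of processes that take a checkpoint. Whenever a process sends a message, it attaches a monotonically increasing sequence number (also called a timestamp) to the message. Let
last_recdQ(P): the sequence number of the last message that Q received from P after Q's last permanent or tentative checkpoint.
first_sentQ(P): the sequence number of the first message that Q sent to P after Q's last permanent or tentative checkpoint.
When P receives a request from Q, P obtains the value last_recdQ(P) attached to the request message, and P takes a checkpoint if the following condition holds: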
last_recdQ(P) >= first_sentP(Q)
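(The condition above says: before P establishes a new checkpoint, Q has received a message that P sent to Q after P's previous checkpoint, so P must establish a checkpoint.) If the condition does not hold, that is, if the sequence number of the last message Q received from P is smaller than the sequence number of the first message P has sent since its previous checkpoint, then the state formed by Q's current checkpoint and P's previous checkpoint contains no orphan message.
Clearly, if a process Q has received no message from process P since its last checkpoint, then P does not need to take a checkpoint at all. Only the processes that have sent messages to Q need to consider whether to establish a checkpoint. To know which processes have sent messages to Q, we maintain a set: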
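ckpt_cohortQ = { P | Q has received a message from P since Q's last checkpoint }
When Q wants to establish a tentative checkpoint, ckpt_cohortQ is the set of processes to which Q should send a request asking them to establish tentative checkpoints as well. Not every process in this set has to establish a checkpoint together with Q, however; only those that satisfy last_recdQ(P) >= first_sentP(Q) need to do so.
Every process has a boolean variable, willing_to_ckpt, that indicates whether the process is currently willing to establish a checkpoint (its value is decided by the process itself).
The operation of the algorithm is summarized below: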
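An initiator Q starts the checkpointing procedure by sending "take a tentative ckpt and last_recdQ(P)" to every process P in ckpt_cohortQ. Q then collects the decisions made by all of these processes (whether or not to establish a tentative checkpoint) and finally sends the final decision to all of them. (The final decision is: if all processes in ckpt_cohortQ are willing to establish tentative checkpoints, the checkpoints are established; otherwise all tentative checkpoints are undone.) The Checkpoint Algorithm is shown in the figure below: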
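Independently of the figure, the bookkeeping behind last_recd, first_sent, and ckpt_cohort can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class `Proc` and its methods are hypothetical, messages are "delivered" instantly inside `send`, the willing_to_ckpt variable and the two phases are omitted, and the message pattern is invented rather than taken from the figure.

```python
class Proc:
    def __init__(self, name):
        self.name = name
        self.next_seq = 1     # monotonically increasing sequence number
        self.first_sent = {}  # peer -> seq of first message sent since last ckpt
        self.last_recd = {}   # peer -> seq of last message received since last ckpt

    def send(self, peer):
        """Send one message to peer (delivered immediately in this sketch)."""
        seq = self.next_seq
        self.next_seq += 1
        self.first_sent.setdefault(peer.name, seq)  # keep only the first one
        peer.last_recd[self.name] = seq             # receiver keeps the last one
        return seq

    def ckpt_cohort(self):
        """Processes from which we received a message since our last checkpoint."""
        return set(self.last_recd)

    def needs_checkpoint(self, requester, last_recd_value):
        """The test last_recdQ(P) >= first_sentP(Q), run by P for requester Q."""
        first = self.first_sent.get(requester)
        return first is not None and last_recd_value >= first

    def checkpoint(self):
        """Taking a checkpoint resets the per-interval bookkeeping."""
        self.first_sent.clear()
        self.last_recd.clear()


p, q, r = Proc("P"), Proc("Q"), Proc("R")
procs = {"P": p, "Q": q, "R": r}

q.send(p)        # Q sends to P ...
q.checkpoint()   # ... and then takes a checkpoint of its own
r.send(p)        # R sends to P and has not checkpointed since

# P initiates: it asks each member of its cohort, attaching last_recd for that member.
decisions = {
    name: procs[name].needs_checkpoint("P", p.last_recd[name])
    for name in sorted(p.ckpt_cohort())
}
print(decisions)
```

In this invented run Q is in P's cohort but its send to P is already covered by Q's own earlier checkpoint, so only R has to take a new checkpoint, which mirrors the observation that an old checkpoint can be sufficient.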

Rollback Recovery
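As noted in the distributed checkpointing method, every checkpoint we take forms a consistent system state, so the simplest rollback recovery method is "every process steps back to its previous checkpoint".
This is not the most efficient method, however. For example, suppose a particular process starts rollback recovery (perhaps because of a transient fault), and suppose this process has not communicated with any other process since its last permanent checkpoint. Clearly, in this situation the other processes do not need to roll back along with it. In other words, the current state of some processes and the last permanent checkpoints of the remaining processes may together form a consistent global state.
We now introduce a method that keeps the amount of rollback to a minimum. For any processes P and Q, we define: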
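last_sentQ(P): the sequence number of the last message that Q sent to P before Q's last permanent checkpoint.
When Q rolls back and asks P to roll back with it, Q attaches last_sentQ(P) to the request message sent to P. If last_recdP(Q) > last_sentQ(P), then P rolls back. In other words, if the sequence number of the last message P received from Q is larger than the sequence number of the last message Q sent before establishing its last permanent checkpoint, then P rolls back as well. Put simply: if Q sent a message to P after Q established its permanent checkpoint, then P must roll back together with Q. (If P did not roll back, the messages Q sent to P after its permanent checkpoint would become orphan messages.)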
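The rollback test is a single comparison, sketched below under assumptions: the function name `must_roll_back` is hypothetical, sequence numbers are taken to start at 1, and the value 0 is used here (as an illustrative convention only) to mean "no message so far".

```python
def must_roll_back(last_recd_p_from_q, last_sent_q_to_p):
    """P's decision when Q rolls back: last_recdP(Q) > last_sentQ(P).

    last_recd_p_from_q: seq of the last message P received from Q (0 if none).
    last_sent_q_to_p:   seq of the last message Q sent to P before Q's last
                        permanent checkpoint (0 if none).
    """
    return last_recd_p_from_q > last_sent_q_to_p


# Invented scenario: Q sent messages 1 and 2 to P, took a permanent checkpoint,
# then sent message 3 and later failed. Q therefore attaches last_sent = 2.
received_3 = must_roll_back(last_recd_p_from_q=3, last_sent_q_to_p=2)
received_up_to_2 = must_roll_back(last_recd_p_from_q=2, last_sent_q_to_p=2)
no_contact = must_roll_back(last_recd_p_from_q=0, last_sent_q_to_p=0)

print(received_3, received_up_to_2, no_contact)
```

Only in the first case has P received a message whose send is undone by Q's rollback, so only there does P have to roll back; in the other two cases P's current state stays consistent with Q's permanent checkpoint.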
