A Minimum-Process Coordinated Checkpointing Protocol For Mobile Distributed System
1
Department of Computer Science & Engineering
Meerut Institute of Engineering & Technology, Meerut, India, Pin-125005
2
Singhaniya University
Pechri, Rajasthan, India
IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, No 5, May 2010
www.IJCSI.org

minimize useless checkpoints. Pj is directly dependent upon Pk only if there exists m such that Pj receives m from Pk in the current CI and Pk has not taken its permanent checkpoint after sending m. A process Pi is in the minimum set only if the checkpoint initiator process is transitively dependent upon it. In minimum-process coordinated checkpointing algorithms, only a subset of interacting processes (called the minimum set) is required to take checkpoints in an initiation.

The Chandy-Lamport [6] algorithm is the earliest non-blocking all-process coordinated checkpointing algorithm. In this algorithm, markers are sent along all channels in the network, which leads to a message complexity of O(N^2), and channels are required to be FIFO. Elnozahy et al. [8] proposed an all-process non-blocking synchronous checkpointing algorithm with a message complexity of O(N). In coordinated checkpointing protocols, we may require piggybacking of an integer csn (checkpoint sequence number) on normal messages [5], [8], [13], [19], [22].

The existence of mobile nodes in a distributed system introduces new issues that need proper handling while designing a checkpointing algorithm for such systems. These issues are mobility, disconnection, finite power source, vulnerability to physical damage, lack of stable storage, etc. These issues make traditional checkpointing techniques unsuitable for checkpointing mobile distributed systems [1], [5], [15]. To take a checkpoint, an MH has to transfer a large amount of checkpoint data to its local MSS over the wireless network. Since the wireless network has low bandwidth and MHs have low computation power, all-process checkpointing will waste the scarce resources of the mobile system on every checkpoint. Prakash and Singhal [15] gave a minimum-process coordinated checkpointing protocol for mobile distributed systems.

A good checkpointing protocol for mobile distributed systems should have low overheads on MHs and wireless channels and should avoid awakening of MHs in doze mode operation. The disconnection of MHs should not lead to an infinite wait state. The algorithm should be non-intrusive and should force a minimum number of processes to take their local checkpoints [15]. In minimum-process coordinated checkpointing algorithms, some blocking of the processes takes place [4], [11], or some useless checkpoints are taken [5], [13], [19].

Cao and Singhal [5] achieved non-intrusiveness in the minimum-process algorithm by introducing the concept of mutable checkpoints. The number of useless checkpoints in [5] may be exceedingly high in some situations [19]. Kumar et al. [19] and Kumar et al. [13] reduced the height of the checkpointing tree and the number of useless checkpoints while keeping non-intrusiveness intact, at the extra cost of maintaining and collecting dependency vectors, computing the minimum set and broadcasting the same on the static network along with the checkpoint request.

Koo and Toueg [11], and Cao and Singhal [4] proposed minimum-process blocking coordinated checkpointing algorithms. Neves et al. [12] gave a loosely synchronized coordinated protocol that removes the overhead of synchronization. Higaki and Takizawa [10] proposed a hybrid checkpointing protocol where the mobile stations take checkpoints asynchronously and fixed ones synchronously. Kumar and Kumar [29] proposed a minimum-process coordinated checkpointing algorithm where the number of useless checkpoints and blocking are reduced by using a probabilistic approach. A process takes its mutable checkpoint only if the probability that it will get the checkpoint request in the current initiation is high. To balance the checkpointing overhead and the loss of computation on recovery, P. Kumar [24] proposed a hybrid-coordinated checkpointing protocol for mobile distributed systems, where an all-process checkpoint is taken after executing the minimum-process checkpointing algorithm a certain number of times.

Transferring the checkpoint of an MH to its local MSS may have a large overhead in terms of battery consumption and channel utilization. To reduce such an overhead, an incremental checkpointing technique could be used [16]. Only the information that changed since the last checkpoint is transferred to the MSS.

In the present study, we propose a minimum-process coordinated checkpointing algorithm for mobile distributed systems in which no useless checkpoints are taken and the blocking of processes is reduced to the bare minimum.

2. System Model

We use the system model presented in [2], [4]. In this model, a mobile computing system consists of n mobile hosts (MHs) and m mobile support stations (MSSs), where n > m. A cell is a logical or geographical coverage area under an MSS. An MH can directly communicate with an MSS Mi only if it is present in the cell serviced by Mi. At any time, an MH belongs to only one cell or may be disconnected. The static network provides reliable First-In-First-Out (FIFO) delivery of messages between any two MSSs with arbitrary message latency. Similarly, the wireless network within a cell ensures reliable FIFO delivery of messages between an MSS and an MH.

In this paper, we consider a distributed computation in a mobile computing system that consists of N processes, running concurrently on different MHs or MSSs. For simplicity, we assume that each MH runs one process. Message passing is the only way of communication. The computation is asynchronous. The processes do not share memory or clock. Each process progresses at its own speed and messages are exchanged through reliable channels, whose transmission delays are finite but arbitrary. A process in the cell of an MSS means the process is either running on the MSS or on an MH supported by it. It also includes the processes of MHs that have been disconnected from the MSS but whose checkpoint-related information is still with this MSS. We also assume that the processes are non-deterministic. The ith CI (checkpointing interval) of a process denotes all the computation performed between its ith and (i+1)th checkpoint, including the ith checkpoint but not the (i+1)th checkpoint.

P2 shows that P2 is not transitively dependent upon P1, due to m3 and m2.

3.1 Example

We explain our algorithm with an example. P1, P2, P3, P4 and P5 are processes with initial dependency sets [00001], [00010], [00100], [01000] and [10000], respectively.

[Figure: timeline of P1 for the example, annotated R1[00001], with message m1 carrying [00001]]
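The initial dependency sets of the example can be read as bit vectors in which each process depends only on itself. The following is a minimal Python sketch of that reading. It assumes that the rightmost bit stands for P1, that the receiver merges the piggybacked vector of the sender with a bitwise OR, and that m1 is delivered to P2. None of these three points is stated in the surviving text, so they are illustrative assumptions, and the function names are hypothetical. The sketch also ignores the condition that the sender must not have taken a permanent checkpoint after sending.

```python
N = 5  # number of processes in the example: P1..P5


def initial_vector(i: int) -> int:
    """Dependency vector of Pi (1-based) with only its own bit set."""
    return 1 << (i - 1)


def as_bits(vec: int, n: int = N) -> str:
    """Render a vector as an n-character bit string, P1 rightmost (assumed)."""
    return format(vec, f"0{n}b")


def on_receive(receiver_vec: int, sender: int, piggybacked_vec: int) -> int:
    """Assumed update rule: the receiver becomes dependent on the sender
    and on every process the sender already depended on."""
    return receiver_vec | (1 << (sender - 1)) | piggybacked_vec


vectors = {i: initial_vector(i) for i in range(1, N + 1)}

# Hypothetical delivery: P1 sends m1 carrying [00001] and P2 receives it.
vectors[2] = on_receive(vectors[2], sender=1, piggybacked_vec=vectors[1])

print([as_bits(vectors[i]) for i in range(1, N + 1)])
```

Under these assumptions only the vector of P2 changes, gaining the bit of P1, while the other four vectors keep their initial values.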
MSS, then the MSS is the initiator MSS. All data structures are initialized on completion of a checkpointing process, if not mentioned explicitly.

Pr_csni: A monotonically increasing integer checkpoint sequence number for each process. It is incremented by 1 on tentative checkpoint.

td_vecti[]: A bit array of length n for the n processes in the system. td_vecti[j] = 1 implies Pi is transitively dependent upon Pj. When Pi receives m from Pj such that Pj has not taken any permanent checkpoint after sending m, Pi sets td_vecti[j] = 1. When Pi commits its checkpoint, it sets td_vecti[] = 0 for all processes except itself, whose entry is initialized to 1.

chkpt-sti: A boolean which is set to 1 when Pi takes a tentative checkpoint; on commit or abort, it is reset to zero.

m_vect[]: A bit array of size n for the n processes in the system. When Pi starts the checkpointing procedure, it computes the tentative minimum set as follows: m_vect[j] = td_vecti[j], where j = 1, 2, ..., n.

TC[]: An array of size n to save information about the processes which have taken their tentative checkpoints. When process Pj takes its tentative checkpoint, the jth bit of this vector is set to 1. It is initialized to all zeros at the beginning of the checkpointing process. It is maintained by the checkpoint initiator MSS only.

Max_time: A flag used to provide timing in the checkpointing operation. It is initialized to zero when the timer is set and becomes 1 when the maximum allowable time for collecting the global checkpoint expires.

MSS_plist[]: A bit array of length n for n processes, maintained at each MSS. MSS_plistk[j] = 1 implies that process Pj is running on MSSk. If Pj is disconnected, then its checkpoint-related information is on MSSk.

MSS_chk_taken: A bit array of length n maintained by the MSS. MSS_chk_taken[j] = 1 implies that Pj, which is in the cell of the MSS, has taken its tentative checkpoint.

MSS_chk_request: A bit array of length n at each MSS. The jth bit of this array is set to 1 whenever the initiator sends the checkpoint request to Pj and Pj is in the cell of this MSS.

MSS_fail_bit: A flag maintained on every MSS, initialized to 0; set to 1 when any process in the cell of the MSS fails to take its tentative checkpoint.

Pin: The process which has initiated the checkpointing operation.

MSSin: The MSS which has Pin in its cell.

p_csnin: Checkpoint sequence number of the initiator process.

g_chkpt: A flag which indicates that some global checkpoint is being saved.

csn[]: An array of size n, maintained on every MSS, for n processes. csn[i] represents the most recently committed checkpoint sequence number of Pi. After the commit operation, if m_vect[i] = 1 then csn[i] is incremented. It should be noted that entries in this array are updated only after converting tentative checkpoints into permanent checkpoints and not after taking tentative checkpoints.

m_vect1[]: An array of size n maintained on every MSS. It contains those new processes which are found on getting the checkpoint request from the initiator.

m_vect2[]: An array of size n. For all j such that m_vect1[j] ≠ 0, m_vect2 = m_vect2 U m_vect1.

m_vect3[]: An array of length n. On receiving m_vect3[], m_vect[], m_vect1[] along with the checkpoint request [c_req], or on the computation of m_vect1[] locally:
m_vect3[] = m_vect3[] U c_req.m_vect3[];
m_vect3[] = m_vect3[] U m_vect[];
m_vect3[] = m_vect3[] U c_req.m_vect1[];
m_vect3[] = m_vect3[] U m_vect1[];
m_vect3[] maintains the best local knowledge of the minimum set at an MSS.

4.1 Computation of m_vect[], m_vect1[], m_vect2[], m_vect3[]:

1. Suppose a process Pr wants to initiate the checkpointing procedure. It sends its request to its local MSS, say MSSr. MSSr maintains the dependency vector of Pr (say td_vectr[]). MSSr coordinates checkpointing on behalf of Pr. It computes the tentative minimum set as follows:
∀ i = 1, ..., n: m_vect[i] = td_vectr[i]
2. On receiving m_vect[] from MSSr, any MSS (say MSSs) computes m_vect1[] as follows:
Suppose MSSs maintains the process Pj such that Pj ∈ MSSs and Pj ∈ m_vect.
m_vect1[i] = 1 iff m_vect[i] = 0 and td_vectj[i] = 1
m_vect1[] maintains the new processes found for the minimum set when a process receives the checkpoint request.
m_vect2 = m_vect2 U m_vect1
∀ i, m_vect1[i] = 0
3. m_vect3 = m_vect U m_vect2

MSSin sends c_req to MSSs along with m_vect[], and some process (say Pk) is found at MSSs which takes its checkpoint in response to this c_req. All MSSs maintain the processes of the minimum set to the best of their knowledge in m_vect3. This is required to minimize duplicate checkpoint requests. Suppose there exists some process (say Pl) such that Pk is directly dependent upon Pl and Pl is not in m_vect3; then MSSs sends c_req to Pl. The new processes found for the minimum set while executing a potential checkpoint request at an MSS are stored in m_vect1. When an MSS finds that all the local processes which were asked to take checkpoints have taken their checkpoints, it sends the response to MSSin along with m_vect2, so that MSSin may update its knowledge about
minimum set and wait for the new processes before sending commit. In this way, MSSin sends commit only if all the processes in the minimum set have taken their tentative checkpoints.

5. The Checkpointing Protocol

As wireless bandwidth is a scarce commodity in mobile systems, we impose minimum burden on wireless channels. The local MSS of an MH acts on behalf of the process running on the MH.

We piggyback checkpoint sequence numbers and dependency vectors onto normal computation messages, but this information is not sent on wireless channels. The local MSS of an MH strips all the additional information from the computation message and sends it to the concerned MH. The dependency vector of a process running on an MH is maintained by its local MSS.

Our algorithm is distributed in nature in the sense that any process can initiate checkpointing. If two processes initiate checkpointing concurrently, then the checkpoint initiator with the lower process ID will prevail. The local MSS of a process coordinates checkpointing on its behalf. Suppose two processes Pi and Pj start checkpointing concurrently and MSSp and MSSq are their local MSSs, respectively; then MSSp and MSSq will send checkpoint requests along with the tentative minimum set to all the MSSs. MSSp will receive the checkpoint request of MSSq and MSSq will receive the checkpoint request of MSSp. Suppose the process ID of Pi is less than the process ID of Pj; then the checkpoint initiation of Pi will prevail. Any other MSS will automatically ignore the request of Pj, because every MSS will compare the process IDs of Pi and Pj.

We propose that any process in the system can initiate the checkpointing operation. When a process Pin starts the checkpointing procedure, it sends its request to its local MSS, say MSSin. MSSin computes the tentative minimum set m_vect[] as follows:
∀ i = 1, ..., n: m_vect[i] = td_vectin[i]
MSSin coordinates the checkpointing process on behalf of Pin. We want to emphasize that td_vectin[] contains the processes on which Pin transitively depends and the set is not complete.

MSSin sends c_req to all MSSs along with m_vectin[]. When an MSS, say MSSp, receives c_req, it sends c_req to all processes which are running in its cell and are also members of m_vectin[]. Suppose Pj gets the checkpoint request at MSSp. Now we find any process Pk such that Pk does not belong to m_vectin[] and Pk belongs to td_vectj[]. In this case, Pk is also included in the minimum set.

During checkpointing, suppose Pi takes its tentative checkpoint and after that it sends m to Pj such that Pj has not taken its tentative checkpoint at the time of receiving m. If Pj receives m and gets the checkpoint request later on, then m will become an orphan. In order to handle this situation, we buffer m at Pj. Pj receives m after taking its tentative checkpoint if it is a member of the minimum set; otherwise it processes m on commit.

For a disconnected MH that is a member of the minimum set, the MSS that has its disconnected checkpoint converts its disconnected checkpoint into a tentative one. When an MSS learns that the concerned processes in its cell have taken their tentative checkpoints, it sends the response to MSSin. On receiving a positive response from all concerned MSSs, MSSin issues the commit request to all MSSs. On commit, when a process learns that it has buffered some messages and has not received the formal tentative checkpointing request from any process, it processes the buffered messages.

5.1 Formal Outline of the checkpointing Algorithm:

5.1.1 Actions taken when Pi sends m to Pj:
send(Pi, Pj, m, pr_csni, td_vecti[]);
// Pi piggybacks its own csn and transitive dependency vector onto m.

5.1.2 Algorithm executed at initiator MSS (say MSSin)
Suppose Pin initiates checkpointing. Pin sends the request to MSSin. MSSin computes m_vect [Refer section 4.1].
(1) On the basis of computed m_vect, MSSin computes m_vect1, m_vect2, m_vect3 [Refer section 4.1].
(2) m_vect = m_vect3.
(3) MSSin sends c_req to all MSSs along with m_vect[].
(4) Set max_time.
(5) Wait for response.
(6) On receiving response (Pin, MSSin, MSSs, mss_chk_taken, m_vect2, mss_fail_bit) or at max_time:
    (a) If (max_time) OR (mss_fail_bit) { send message abort(Pin, MSSin, pr_csnin) to all MSSs; Exit; }
        // Maximum allocated time expired or some process failed to take checkpoint
    (b) m_vect[] = m_vect[] U m_vect2[]. [U is a set union operator]
    (c) TC[] = TC[] U mss_chk_taken[]
(7) For (k=0; k<n; k++)
    If (∃ k such that TC[k] ≠ m_vect[k]) then go to step 5;
(8) Send message commit(Pin, MSSin, pr_csnin, m_vect[]) to all MSSs; // m_vect[] is the exact minimum set

5.1.3 Algorithm executed at a process Pj on receiving m from Pi:
Case 1: If (m.pr_csni == csn[i]) // Pi has not taken its tentative checkpoint before sending m
{ rec(m);
5.1.5 Algorithm executed at any process Pi:
On receiving tentative checkpoint request,
Take tentative checkpoint and inform local MSS.

An MH may fail during the checkpointing process. If an MH fails after taking its tentative checkpoint, or if it is not a member of the minimum set, then the checkpointing procedure can be completed uninterrupted. If a process fails during checkpointing, then our straightforward approach is to discard the whole checkpointing operation.
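The initiator-side steps of Section 5.1.2, including the abort on timeout or on a failed process, can be sketched in Python together with step 2 of Section 4.1. This is a minimal illustration, not the authors' implementation. It assumes the bit vectors are modelled as Python sets of process IDs, responses arrive as dictionaries, and a `timed_out` key stands in for max_time. The function names and dictionary keys are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def compute_m_vect1(m_vect: Set[int], local: Set[int],
                    td_vect: Dict[int, Set[int]]) -> Set[int]:
    """Step 2 of Section 4.1 at one MSS: processes that a local member of
    m_vect depends on but that are not yet in m_vect."""
    new_procs: Set[int] = set()
    for j in local & m_vect:
        new_procs |= td_vect[j] - m_vect
    return new_procs


def run_initiator(td_vect_in: Set[int],
                  responses: Iterable[dict]) -> Tuple[str, Optional[Set[int]]]:
    """Steps 5 to 8 of Section 5.1.2, driven by a sequence of responses."""
    m_vect = set(td_vect_in)   # tentative minimum set
    tc: Set[int] = set()       # TC[]: processes with tentative checkpoints
    for resp in responses:     # step 5: wait for a response
        # Step 6(a): timer expired or some process failed to checkpoint.
        if resp.get("timed_out", False) or resp["fail_bit"]:
            return "abort", None
        m_vect |= resp["m_vect2"]    # step 6(b)
        tc |= resp["chk_taken"]      # step 6(c)
        if tc == m_vect:             # step 7: nothing outstanding
            return "commit", m_vect  # step 8: m_vect is the minimum set
    # Assumption: running out of responses is treated like a timeout.
    return "abort", None


# Initiator P1 depends on P2; P2 (hosted with P4 on one MSS) depends on P3.
found = compute_m_vect1({1, 2}, local={2, 4}, td_vect={2: {2, 3}, 4: {4}})
responses = [
    {"chk_taken": {1}, "m_vect2": set(), "fail_bit": False},
    {"chk_taken": {2}, "m_vect2": found, "fail_bit": False},
    {"chk_taken": {3}, "m_vect2": set(), "fail_bit": False},
]
decision, minimum_set = run_initiator({1, 2}, responses)
print(decision, sorted(minimum_set))
```

In this run the MSS hosting P2 reports P3 as a newly found member, so the initiator waits for the checkpoint of P3 before it commits. A response with `fail_bit` set would instead end the round with an abort.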
The failed process will not be able to respond to the initiator's request, and the initiator will detect the failure by timeout and will discard the complete checkpointing operation. If the initiator fails after sending commit, the checkpointing process can be considered complete. If the initiator fails during checkpointing, then some processes waiting for commit will time out and will issue abort on their own.

Kim and Park [17] proposed that a process commits its tentative checkpoints if none of the processes on which it transitively depends fails, and the consistent recovery line is advanced for those processes that committed their checkpoints. The initiator and other processes which transitively depend on the failed process have to abort their tentative checkpoints. Thus, in case of a node failure during checkpointing, total abort of the checkpointing is avoided.

Both Pi and Pj have taken their permanent checkpoints during the current initiation; the following possibilities can take place:

Pi sends m after commit and Pj receives m before taking the tentative checkpoint. As Pj ∈ m_vect[], the initiator MSS can issue commit only after Pj has taken its tentative checkpoint and informed the initiator. Therefore rec(m) at Pj cannot take place before Pj takes its tentative checkpoint.

Suppose Pi sends m after taking the tentative checkpoint and Pj receives m before taking its tentative checkpoint. In this case, when Pj receives m, it will check the piggybacked Pr_csn of Pi along with m and will conclude that Pi has taken a tentative checkpoint for the new initiation and Pj has not taken its tentative checkpoint for this initiation. Therefore, Pj will process m only after Pj takes its tentative checkpoint. Hence the receipt of m at Pj cannot occur before Pj takes its tentative checkpoint.
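The buffering rule used in this argument can be sketched in Python. This is a minimal illustration, assuming the receiver compares the piggybacked checkpoint sequence number with the last committed value it holds for the sender and delays the message when the two differ. The class and method names are hypothetical and do not come from the paper.

```python
class Receiver:
    """Receiver-side handling of messages during a checkpoint initiation."""

    def __init__(self, known_csn):
        self.csn = dict(known_csn)    # csn[i]: last committed csn of Pi
        self.tentative_taken = False  # set once this process checkpoints
        self.buffer = []              # messages delayed during checkpointing
        self.processed = []           # messages handed to the application

    def on_message(self, sender, pr_csn, payload):
        if pr_csn == self.csn[sender] or self.tentative_taken:
            # Either the sender had not checkpointed before sending, or we
            # already hold our own tentative checkpoint: safe to process.
            self.processed.append(payload)
        else:
            # The sender checkpointed first; processing now could make the
            # message an orphan, so delay it.
            self.buffer.append(payload)

    def on_tentative_checkpoint(self):
        """Member of the minimum set: release the buffer after checkpointing."""
        self.tentative_taken = True
        self._flush()

    def on_commit(self, m_vect=()):
        """Commit: advance csn for minimum-set members, release the buffer."""
        for i in m_vect:
            self.csn[i] = self.csn.get(i, 0) + 1
        self.tentative_taken = False
        self._flush()

    def _flush(self):
        self.processed.extend(self.buffer)
        self.buffer.clear()


# Pj in the minimum set: the second message waits for the checkpoint.
pj = Receiver({1: 0})
pj.on_message(sender=1, pr_csn=0, payload="a")
pj.on_message(sender=1, pr_csn=1, payload="b")
held_before_checkpoint = list(pj.buffer)
pj.on_tentative_checkpoint()

# Pk outside the minimum set: the message waits for commit.
pk = Receiver({1: 0})
pk.on_message(sender=1, pr_csn=1, payload="c")
pk.on_commit(m_vect=[1])
```

The sketch keeps one buffer per receiver, so a delayed message is released by whichever of the two events, tentative checkpoint or commit, reaches the process first.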
In the proposed protocol, a process is blocked during the period between receiving m of higher CSN and receiving the checkpoint request or commit message.

In CS algorithm, the initiator MSS collects dependency vectors of all processes, computes the minimum set and broadcasts the minimum set to all MSSs. In KT algorithm and in the proposed protocol, no such step is taken.

In KT algorithm, transitive dependencies are captured by traversing direct dependencies and hence a checkpointing tree is formed. It may lead to exceedingly high time for global checkpoint collection and the blocking period may also be high. In our algorithm, transitive dependencies are captured during normal processing and hence a checkpointing tree is not formed. Therefore, the time to collect the global checkpoint will be low as compared to the KT algorithm. In CS algorithm, direct dependency vectors are collected at the initiation of the checkpointing algorithm. Therefore, this algorithm suffers from high synchronization message overhead.

(4) In KT algorithm and in the proposed protocol, an integer number is piggybacked onto normal messages. In CS algorithm, no such information is piggybacked onto normal messages. It cannot handle the following situation. Pi receives m from Pj in the current CI such that Pj has taken some permanent checkpoint after sending m. In this case, Pi does not become causally dependent upon Pj due to receipt of m. In this case, if Pi is in the minimum set, Pj will unnecessarily be included in the minimum set.

(5) Blocking of processes takes place differently in these three protocols as follows. In KT algorithm, processes are not allowed to send any messages. In CS algorithm, processes are not allowed to send or receive any messages. In the proposed protocol, a few processes are not allowed to process selected messages received during the checkpointing period. A process is allowed to send messages and perform normal computations during its blocking period. It is even allowed to receive selected messages.

(6) We maintain exact dependencies among processes and the best possible knowledge of the minimum set, computed so far, at the local MSS. In this way, the number of duplicate checkpoint requests is reduced as compared to the KT algorithm and no useless checkpoint requests are sent.

8.1 General Comparison with existing non-blocking minimum process algorithms:

In the algorithms [13], [19], the initiator process/MSS collects dependency vectors for all the processes, computes the minimum set and sends the checkpointing request to all the processes with the minimum set. These algorithms are non-blocking; the messages received during checkpointing may add processes to the minimum set. This approach suffers from the additional message overhead of sending a request to all processes to send their dependency vectors and of all processes sending their dependency vectors to the initiator process. In our algorithm, no such overhead is imposed. The Cao-Singhal algorithm [5] suffers from the formation of a checkpointing tree. In our algorithm, theoretically, we can say that the length of the checkpointing tree will be considerably low as compared to algorithm [2], as most of the transitive dependencies are captured during normal processing. We do not compare our algorithm with Prakash-Singhal [15], as Cao-Singhal proved that no such algorithm exists [4].

Furthermore, in algorithm [4], transitive dependencies are captured by direct dependencies. Hence the average number of useless checkpoint requests will be significantly higher than in the proposed algorithm. In [5], huge data structures are piggybacked along with the checkpointing request, because they are unable to maintain exact dependencies among processes. Incorrect dependencies are resolved by these huge data structures. In our case, no such data structures are piggybacked on the checkpointing request and no such useless checkpoint requests are sent, because we are able to maintain exact dependencies among processes and, furthermore, are able to capture transitive dependencies during normal computation at the cost of piggybacking a bit vector of length n for n processes onto normal computation messages.

8.2 Comparison with other Algorithms:

We use the following notations to compare our algorithm with other algorithms:
Nmss: number of MSSs.
Nmh: number of MHs.
Cpp: cost of sending a message from one process to another.
Cst: cost of sending a message between any two MSSs.
Cwl: cost of sending a message from an MH to its local MSS (or vice versa).
Cbst: cost of broadcasting a message over static network.
Csearch: cost incurred to locate an MH and forward a message to its current local MSS, from a source MSS.
Tst: average message delay in static network.
Twl: average message delay in the wireless network.
Tch: average delay to save a checkpoint on the stable storage. It also includes the time to transfer the checkpoint from an MH to its local MSS.
N: total number of processes.
Nmin: number of minimum processes required to take checkpoints.
Nmut: number of useless mutable checkpoints [2].
Tsearch: average delay incurred to locate an MH and forward a message to its current local MSS.
Nucr: average number of useless checkpoint requests in [2].
Ndep: average number of processes on which a process depends.
h1: height of the checkpointing tree in Koo-Toueg algorithm [11].
h2: height of the checkpointing tree in the proposed algorithm.

In Koo-Toueg algorithm [11] and in the proposed one, the checkpoint initiator process, say Pin, sends the checkpoint request to any process Pi if Pin is causally dependent upon Pi. Similarly, Pi sends the checkpoint request to any process Pj if Pi is causally dependent upon Pj. In this way, a checkpointing tree is formed. Theoretically, we can say that a checkpointing tree will not be formed in our algorithm. But due to Z-dependencies, a low-order checkpointing tree can be formed, because during normal computations all the transitive dependencies are not captured. Hence, the checkpointing tree in the proposed scheme will be negligible as compared to KT and CS algorithm in most of the practical situations.

8.3 Performance of our algorithm

8.3.1 The Synchronization message overhead:
In the first phase, a process taking a tentative checkpoint needs two system messages: request and reply. A process may receive more than one request for the same checkpoint initiation from different processes. However, we have used some techniques to reduce the duplicate checkpoint requests. Thus the system overhead is approximately 2*Nmin*Cpp in the first phase. In the second phase, the commit request is broadcast on the static network, and the system overhead is Cbst.

8.3.2 Number of processes taking checkpoints: In our algorithm, only the minimum number of processes is required to take their checkpoints.

8.4 A Comparative Study

The blocking time of the Koo-Toueg [11] protocol is highest, followed by Cao-Singhal [4] algorithm. In the algorithms proposed in [5], [8], no blocking of processes takes place, but some useless checkpoints are taken, which are discarded on commit. In Elnozahy et al [8] algorithm, all processes take checkpoints. In the protocols [4], [11], and the proposed one, only the minimum number of processes record their checkpoints. The message overhead in the proposed protocol is greater than [8], but less than [4], [5] and [11]. In algorithm [5], concurrent executions of the algorithm are allowed, but it may lead to inconsistencies in doing so [20]. We avoid concurrent executions of the proposed algorithm. In case two processes concurrently initiate checkpointing, the initiation of the process with lower process-ID will prevail.

Algorithm | Avg. blocking time | Average no. of checkpoints | Average message overhead
Cao-Singhal [4] | 2Tst | Nmin | 3Cbst + 2Cwireless + 2Nmss*Cst + 3Nmh*Cwl
Cao-Singhal [5] | 0 | Nmin + Nmut | 2*Nmin*Cpp + Cbst + Nucr*Cpp
Koo-Toueg Algorithm [11] | h1*Tch | Nmin | 3*Nmin*Cpp*Ndep
Elnozahy et al [8] | 0 | N | 2*Cbst + N*Cpp
Proposed Algorithm | h2*Tch | Nmin | 2*Nmin*Cpp + Cbst

Table 1: A Comparison of System Performance

9. Conclusion

We have proposed a minimum-process coordinated checkpointing algorithm for mobile distributed systems, where no useless checkpoints are taken and an effort is made to minimize the blocking of processes. The number of processes that take checkpoints is minimized to avoid awakening of MHs in doze mode of operation and thrashing of MHs with checkpointing activity. Further, it saves limited battery life of MHs and low bandwidth of wireless channels. We have used the concept of delaying selective messages at the receiver end only during the checkpointing period. By using this technique, only selective processes are blocked for a short duration and processes are allowed to do their normal computations and send messages in the blocking period. We captured the transitive dependencies during the normal execution. The Z-dependencies are well taken care of in this protocol. We also avoided collecting dependency vectors of all processes to compute the minimum set. Thus, the proposed protocol is simultaneously able to reduce the useless checkpoints to zero and tries to optimize the blocking of processes at the small cost of maintaining exact dependencies among processes and piggybacking checkpoint sequence numbers and dependency vectors onto normal computation messages.

10. References

[1] Acharya A. and Badrinath B. R., Checkpointing Distributed Applications on Mobile Computers, Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems, pp. 73-80, September 1994.
[2] Baldoni R., Hélary J-M., Mostefaoui A. and Raynal M., A Communication-Induced Checkpointing Protocol that Ensures Rollback-Dependency Trackability, Proceedings of the International Symposium on Fault-Tolerant-Computing Systems, pp. 68-77, June 1997.
[3] Cao G. and Singhal M., On coordinated checkpointing in Distributed Systems, IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 12, pp. 1213-1225, Dec 1998.
[4] Cao G. and Singhal M., On the Impossibility of Min-process Non-blocking Checkpointing and an Efficient Checkpointing Algorithm for Mobile Computing Systems, Proceedings of International Conference on Parallel Processing, pp. 37-44, August 1998.
[5] Cao G. and Singhal M., Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing systems, IEEE Transaction On Parallel and Distributed Systems, vol. 12, no. 2, pp. 157-172, February 2001.
[6] Chandy K. M. and Lamport L., Distributed Snapshots: Determining Global State of Distributed Systems, ACM Transaction on Computing Systems, vol. 3, no. 1, pp. 63-75, February 1985.
[7] Elnozahy E.N., Alvisi L., Wang Y.M. and Johnson D.B., A Survey of Rollback-Recovery Protocols in Message-Passing Systems, ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
[8] Elnozahy E.N., Johnson D.B. and Zwaenepoel W., The Performance of Consistent Checkpointing, Proceedings of the 11th Symposium on Reliable Distributed Systems, pp. 39-47, October 1992.
[9] Hélary J. M., Mostefaoui A. and Raynal M., Communication-Induced Determination of Consistent Snapshots, Proceedings of the 28th International Symposium on Fault-Tolerant Computing, pp. 208-217, June 1998.
[10] Higaki H. and Takizawa M., Checkpoint-recovery Protocol for Reliable Mobile Systems, Trans. of Information processing Japan, vol. 40, no. 1, pp. 236-244, Jan. 1999.
[11] Koo R. and Toueg S., Checkpointing and Roll-Back Recovery for Distributed Systems, IEEE Trans. on Software Engineering, vol. 13, no. 1, pp. 23-31, January 1987.
[12] Neves N. and Fuchs W. K., Adaptive Recovery for Mobile Environments, Communications of the ACM, vol. 40, no. 1, pp. 68-74, January 1997.
[13] Parveen Kumar, Lalit Kumar, R K Chauhan, V K Gupta, A Non-Intrusive Minimum Process Synchronous Checkpointing Protocol for Mobile Distributed Systems, Proceedings of IEEE ICPWC-2005, pp. 491-95, January 2005.
[14] Pradhan D.K., Krishana P.P. and Vaidya N.H., Recovery in Mobile Wireless Environment: Design and Trade-off Analysis, Proceedings 26th International Symposium on Fault-Tolerant Computing, pp. 16-25, 1996.
[15] Prakash R. and Singhal M., Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems, IEEE Transaction On Parallel and Distributed Systems, vol. 7, no. 10, pp. 1035-1048, October 1996.
[16] Ssu K.F., Yao B., Fuchs W.K. and Neves N. F., Adaptive Checkpointing with Storage Management for Mobile Environments, IEEE Transactions on Reliability, vol. 48, no. 4, pp. 315-324, December 1999.
[17] J.L. Kim, T. Park, An efficient Protocol for checkpointing Recovery in Distributed Systems, IEEE Trans. Parallel and Distributed Systems, pp. 955-960, Aug. 1993.
[18] L. Kumar, M. Misra, R.C. Joshi, Checkpointing in Distributed Computing Systems, Book Chapter, Concurrency in Dependable Computing, pp. 273-92, 2002.
[19] L. Kumar, M. Misra, R.C. Joshi, Low overhead optimal checkpointing for mobile distributed systems, Proceedings 19th IEEE International Conference on Data Engineering, pp. 686-88, 2003.
[20] Ni, W., S. Vrbsky and S. Ray, Pitfalls in Distributed Nonblocking Checkpointing, Journal of Interconnection Networks, Vol. 1 No. 5, pp. 47-78, March 2004.
[21] L. Lamport, Time, clocks and ordering of events in a distributed system, Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[22] Silva, L.M. and J.G. Silva, Global checkpointing for distributed programs, Proc. 11th symp. Reliable Distributed Systems, pp. 155-62, Oct. 1992.
[23] Parveen Kumar, Lalit Kumar, R K Chauhan, A Non-intrusive Hybrid Synchronous Checkpointing Protocol for Mobile Systems, IETE Journal of Research, Vol. 52 No. 2&3, 2006.
[24] Parveen Kumar, A Low-Cost Hybrid Coordinated Checkpointing Protocol for mobile distributed systems, To appear in Mobile Information Systems.
[25] Lalit Kumar Awasthi, P. Kumar, A Synchronous Checkpointing Protocol for Mobile Distributed Systems: Probabilistic Approach, International Journal of Information and Computer Security, Vol. 1, No. 3, pp. 298-314.