
Journal of Emerging Trends in Computing and Information Sciences, Volume 2, No. 1, ISSN 2079-8407
©2010-11 CIS Journal. All rights reserved. http://www.cisjournal.org

Identification of Critical Factors in Checkpointing Based Multiple Fault Tolerance for Distributed System

Sanjay Bansal (1), Sanjeev Sharma (2)

(1) Medi-Caps Institute of Technology and Management, Indore, India
[email protected]
(2) School of Information Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, India
[email protected]

ABSTRACT
The performance of checkpointing-based multiple fault tolerance is low, and the main reason is the overhead associated with checkpointing. A checkpointing algorithm can be improved through a better storage strategy and better checkpoint scheduling, both of which reduce the overheads associated with checkpointing. Performance and efficiency are the most desirable features of checkpointing-based recovery. In this paper, the important critical issues involved in fast and efficient checkpointing-based recovery are discussed. The impact of each issue on the performance of checkpointing-based recovery is also discussed, and the relationships among the issues are explored. Finally, the important issues are compared between coordinated checkpointing and uncoordinated checkpointing.

Keywords: Checkpointing, Distributed System, Recovery, Fault Tolerance

1. INTRODUCTION

Checkpointing with rollback-recovery is a well-known technique to tolerate process crashes and failures in a distributed system. One approach to tolerating the crash of a process is to create a new process with the same state. This is possible only if the state and complete description of every process executing in the distributed environment are saved on stable storage from time to time; when a crash failure of a process is detected, these saved checkpoints are used to create new processes identical to the crashed ones, and in this way multiple process crashes can be tolerated. A checkpoint is an operation that stores the current state of the computation on stable storage. Checkpoints are established periodically during the normal execution of a program. The saved information includes the process state, its environment, and the values of the registers, and it is kept on stable storage so that it can be used in case of node failures. A fundamental goal of any rollback-recovery protocol is to bring the system into a consistent state when inconsistencies occur because of a failure [1].
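As a concrete illustration of the basic operation described above, the following minimal sketch (not taken from the paper; file name and state layout are illustrative assumptions) saves a process's state to stable storage and resumes from it after a restart, using an atomic rename so that a partially written checkpoint is never used.

```python
import os
import pickle
import tempfile

CHECKPOINT_FILE = "worker_state.ckpt"  # illustrative path on stable storage

def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Write the process state atomically: write to a temp file, then rename."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())          # force the data onto stable storage
    os.replace(tmp_path, path)        # old checkpoint stays valid until this rename

def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Recovery pattern: resume from the last checkpoint instead of the initial state.
state = load_checkpoint() or {"iteration": 0, "partial_sum": 0.0}
for i in range(state["iteration"], 1000):
    state["partial_sum"] += i
    state["iteration"] = i + 1
    if state["iteration"] % 100 == 0:   # periodic checkpointing
        save_checkpoint(state)
```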
2. TYPES OF CHECKPOINTING AND CORRESPONDING RECOVERY ISSUES

Coordinated checkpointing and uncoordinated checkpointing combined with message logging are the two main techniques used for saving the distributed execution state and recovering from system failures [2].

(a) Coordinated checkpoint
In coordinated checkpointing, processes coordinate their checkpoints in order to save a system-wide consistent state. Coordinated checkpoints form a consistent set of checkpoints, and this consistent set is used to bound rollback propagation. Consistency is higher in the case of coordinated checkpoints because of the consistent set of checkpoints [3].

Coordinated checkpointing involves rolling back all processes to the last snapshot whenever a faulty situation is detected, even when only a single process crashes. For this reason the recovery time is very large, which makes the technique unsuitable for real-time applications, and in the case of frequent failures and multiple faults the coordinated checkpointing technique cannot be used. Performance can be improved by decreasing the recovery time. The main reason for the large recovery time is restarting all processes from the initial state; the recovery time can be reduced by enabling a restart from the last correct state instead of from the very first state. There must be some mechanism to ensure that restarting from the last correct state reaches a state matching the rest of the system, as before the crash: a checkpoint is taken only when all processes agree on a consistent state. There are two main ways to implement coordinated checkpointing: blocking and non-blocking. Blocking checkpointing consists of stopping the computation to take the global state; this permits better control over the state of the different processes and their communication channels. The second one, called non-blocking coordinated checkpointing, does not provide this kind of control, but does not require the interruption of the computation [4].
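The following sketch illustrates one blocking coordinated round in the spirit described above. It is an illustration only: the thread-based "processes", barriers, and in-memory store are assumptions, not the protocol of any cited system. All processes pause, checkpoint while quiesced, and resume only after every checkpoint is in place, so the saved states form a mutually consistent set.

```python
import threading
import pickle

NUM_PROCS = 4
pause_barrier = threading.Barrier(NUM_PROCS)    # all procs quiesced before anyone checkpoints
resume_barrier = threading.Barrier(NUM_PROCS)   # all checkpoints written before anyone resumes
checkpoints = {}                                 # stand-in for stable storage
store_lock = threading.Lock()

def worker(rank, steps=300, interval=100):
    state = {"rank": rank, "step": 0}
    while state["step"] < steps:
        state["step"] += 1                       # normal computation
        if state["step"] % interval == 0:
            # --- blocking coordinated checkpoint round ---
            pause_barrier.wait()                 # stop computing; channels are quiet
            with store_lock:
                checkpoints[rank] = pickle.dumps(state)   # consistent with all peers
            resume_barrier.wait()                # resume only when every process has checkpointed

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_PROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("saved one consistent checkpoint per process:", sorted(checkpoints))
```

In a real MPI-style system the barriers would be replaced by coordinator messages and channel flushing (or by a non-blocking, marker-based protocol), and the checkpoints would go to stable storage rather than an in-memory dictionary.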

(b) Uncoordinated checkpoint with message logging
In uncoordinated checkpointing protocols, all processes take checkpoints independently of one another, so that recovery can also be performed independently. The big question is: if checkpoints are taken independently, how is the complete, overall description and ordering of process execution determined? One way is to combine uncoordinated checkpointing with message logging.

Message logging is a common technique used to build systems that can tolerate process crash failures. These protocols require each process to periodically record its local state and to log the messages received since that state was recorded; message logging thus stores all inter-process messages in order to bring a checkpoint up to date. When a process crashes, a new process is created in its place: the new process is given the appropriate recorded local state, and then it replays the logged messages in the order they were originally received [5]. Message logging is combined with uncoordinated checkpointing to restart the system from the last correct state and to ensure a complete description of a process's execution state in case of its failure. Besides logging all received messages, re-sending the same relevant messages, in the same order, to the crashed processes during their re-execution is also a main function of message logging. There are three kinds of message logging protocols: optimistic, pessimistic, and causal. Pessimistic protocols ensure that all messages received by a process are logged on reliable media before the process sends information into the system; the logged information can be re-sent later, and only if necessary, during rollback. Optimistic message logging protocols just ensure that all messages will eventually be logged, so one usual way to implement optimistic logging is to log the messages on non-reliable media. Causal protocols log the message information of a process in all causally dependent processes [6].
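A minimal sketch of the pessimistic rule described above, with illustrative names only (not the protocol of any cited system): every received message is forced to a reliable log before the process acts on it, so a restarted process can replay its exact receive sequence on top of its last checkpoint.

```python
import json
import os

class PessimisticReceiver:
    """Logs every received message to stable storage before it is delivered."""

    def __init__(self, rank, log_path):
        self.rank = rank
        self.log_path = log_path
        self.state = 0                          # toy application state

    def receive(self, msg):
        # Pessimistic rule: persist the message before processing or sending anything.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(msg) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._deliver(msg)

    def _deliver(self, msg):
        self.state += msg["value"]              # deterministic processing

    def replay_after_crash(self, checkpoint_state=0):
        """Rebuild the pre-crash state from the last checkpoint plus the message log."""
        self.state = checkpoint_state
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    self._deliver(json.loads(line))   # same messages, same order
        return self.state

# Usage: log messages, "crash", then recover to the same state.
p = PessimisticReceiver(rank=1, log_path="rank1_msgs.log")
for v in (3, 5, 7):
    p.receive({"sender": 0, "value": v})
recovered = PessimisticReceiver(rank=1, log_path="rank1_msgs.log").replay_after_crash()
assert recovered == 15
```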
Uncoordinated checkpointing, when used with message logging, gives fast recovery, since the restart is from the last consistent state and not from the initial state as in the case of coordinated checkpointing. Since checkpointing is done independently, multiple faults can be handled by this approach, which cannot be handled by coordinated checkpointing.

3. ISSUES WITH CHECKPOINTING BASED RECOVERY

In this section we discuss the critical and important issues related to checkpointing-based fault tolerance.

(3.1) Recovery cost:
Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor; as a result, the recovery time of all such protocols is no less than the time between the last checkpoint and the crash. Researchers proposed a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery: when fail-stop failures occur and are detected, all surviving processes recompute the workload of the failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. In order to ease the FTPA implementation, the researchers developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool that automates the FTPA implementation, and evaluated the performance of FTPA with parallel matrix multiplication and five kernels of the NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that the performance of FTPA is better than that of the traditional checkpointing approach due to fast recovery [7]. However, this is only suitable for large problems: if the problem size is not large enough, not all processes will contribute to the parallel recomputation.
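The following toy sketch illustrates the general idea of parallel recomputation (it is not the FTPA algorithm or GiFT; the function and variable names are assumptions): when one worker fails, its portion of the work is re-partitioned among the survivors, so each survivor redoes only a fraction of the lost work instead of the whole chunk being redone on a single processor.

```python
def compute_chunk(chunk):
    """Stand-in for the real per-process workload."""
    return sum(x * x for x in chunk)

def run_with_parallel_recomputation(data, num_procs, failed_rank):
    # Initial partition: one chunk per process.
    chunks = [data[r::num_procs] for r in range(num_procs)]
    results = {}
    for rank in range(num_procs):
        if rank == failed_rank:
            continue                      # this process crashed before finishing
        results[rank] = compute_chunk(chunks[rank])

    # Recovery: survivors split the failed process's chunk and recompute it
    # (sequentially simulated here); each survivor redoes only a fraction of the lost work.
    survivors = [r for r in range(num_procs) if r != failed_rank]
    lost = chunks[failed_rank]
    partial = 0
    for i, rank in enumerate(survivors):
        piece = lost[i::len(survivors)]
        partial += compute_chunk(piece)
    results[failed_rank] = partial
    return sum(results.values())

data = list(range(1000))
assert run_with_parallel_recomputation(data, num_procs=4, failed_rank=2) == sum(x * x for x in data)
```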
(3.2) Adaptiveness:
A fault tolerance technique is expected to have the capability of dynamically adapting to distinct runtime conditions, and this capability is an important issue. One way in which a fault-tolerant technique can be made dynamic is through an adaptive programming model [8]. This programming model is hybrid, composed of a synchronous part (where there are time bounds on processing speed and message delay) and an asynchronous part (where there is no time bound). There is further research scope to develop more adaptive programming models that make fault tolerance techniques more adaptive to dynamic situations. In the case of a fault, the most important issue is efficient recovery in dynamic heterogeneous systems, and recovery under different numbers of processors is highly desirable. The fault tolerance and recovery approaches must be suitable for applications that need adaptive or reactionary configuration control. Researchers proposed flexible rollback recovery in dynamic heterogeneous computing for such crucial requirements [9]; still, the overhead of this technique is significant and needs to be addressed further. The performance of any fault-tolerant technique depends upon recovery time, and adaptive checkpointing is required to cope with a volatile, dynamic environment.

(3.3) Multiple fault capability:
Multiple fault handling in a distributed system is very crucial. In the case of multiple faults, most existing techniques do not have enough provision or capability to handle them, and even those capable of handling multiple faults do so with low, unacceptable performance. Single points of failure must be taken seriously while designing a multiple fault tolerance technique. Some multiple-fault-tolerant algorithms are based on the optimistic or causal message logging approach. One approach suggested by researchers to handle this critical issue is the use of uncoordinated checkpointing, distributed message logging, a reliable coordinator, and checkpoint servers. In real situations, none of the existing implementations tolerates more than one fault, so there is a strong need for proper augmentation with appropriate mechanisms. A restart of the full system in case of multiple faults is also a major disadvantage in terms of performance, due to the high recovery time; thus optimistic or causal message logging requires some mechanism to restart

from the last checkpoint instead of a full restart. This problem can be overcome by using the pessimistic message logging principle: in algorithms based on pessimistic message logging, all in-transit messages are stored on reliable media, so recovery does not require a restart. This reduces the recovery time at the cost of a large number of non-computational reliable resources, and the cost of such resources is very high [10]. Coordinated checkpointing cannot handle multiple faults, but uncoordinated checkpointing with message logging can. In order to tolerate multiple faults using checkpointing and recovery, three critical functionalities are necessary: a light-weight failure detection mechanism, dynamic process management that includes process migration, and a consistent checkpoint and recovery mechanism. Jung et al. proposed a technique to address these critical functionalities [1].
(3.4) Performance:
Uncoordinated checkpointing with message logging is based on piecewise information that must be integrated in order to recover the system: the most recent consistent state is constructed from this piecewise information. Storing this piecewise information adversely affects failure-free performance, bandwidth, and latency. The root causes are the overhead of collecting the piecewise information in the failure-free case and the overhead of constructing the most recent consistent state in the multiple-fault case. Coordinated checkpointing, although incapable of handling multiple faults, has low overheads, and it is simpler than uncoordinated checkpointing [9]. The run time must be low for any checkpointing-based fault tolerance in both the fault-free and the faulty case. Another overhead associated with sender-based pessimistic logging is due to the huge number of messages: these messages must be kept in memory, which not only lowers performance but also requires a large amount of stable storage. In general, the performance of the checkpointing technique is very low. Researchers proposed replication-based checkpointing to improve the performance [11]. There are many issues related to replication-based checkpointing fault tolerance, mainly the degree of replication, the checkpointing storage type and location, the checkpointing frequency, the checkpoint size, and the checkpoint run time. Researchers have also suggested adaptive checkpointing and replication to dynamically adapt the checkpointing frequency and the number of replicas as a reaction to changing system properties (number of active resources, resource failure frequency, and system load) [12].
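A minimal sketch of the replication idea mentioned above (an illustration under assumed names, not the protocol of [11] or [12]): each node writes its checkpoint locally and then pushes copies to k peer nodes, so the checkpoint remains available even if the node that produced it is lost.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.local_store = {}        # checkpoints this node keeps (its own and replicas)

    def store(self, owner_id, version, blob):
        self.local_store[(owner_id, version)] = blob

def replicate_checkpoint(nodes, owner_idx, version, state_blob, k=2):
    """Write the checkpoint locally, then copy it to the next k nodes in the ring."""
    owner = nodes[owner_idx]
    owner.store(owner.node_id, version, state_blob)
    replicas = []
    for step in range(1, k + 1):
        peer = nodes[(owner_idx + step) % len(nodes)]   # simple ring placement
        peer.store(owner.node_id, version, state_blob)
        replicas.append(peer.node_id)
    return replicas

def recover(nodes, failed_id, version):
    """Fetch the failed node's latest checkpoint from any surviving replica."""
    for node in nodes:
        if node.node_id != failed_id and (failed_id, version) in node.local_store:
            return node.local_store[(failed_id, version)]
    return None

nodes = [Node(i) for i in range(4)]
blob = b"serialized state of rank 1"
replicate_checkpoint(nodes, owner_idx=1, version=7, state_blob=blob, k=2)
assert recover(nodes, failed_id=1, version=7) == blob
```

More replicas raise availability but also raise the consistency-maintenance and storage cost, which is exactly the trade-off discussed under the memory requirement issue below.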

(3.5) Automatic multiple fault detection and recovery:
One issue with multiple fault tolerance is automatic detection and recovery. Uncoordinated checkpointing is combined with optimistic or causal message logging to achieve automatic multiple fault tolerance in a distributed system. A distributed system should tolerate n multiple faults, but in reality no existing system can tolerate multiple faults automatically [2]. A critical aspect of automatic recovery is the availability of checkpoint files: if a resource becomes unavailable, it is very likely that the associated storage is also unreachable due to a network partition. A strategy to increase the availability of checkpoints is replication, i.e., keeping a number of copies of each checkpoint. Migol is a framework for automatic recovery of grid applications based on replication [3].

(3.6) Memory requirement:
In order to tolerate multiple faults, the memory requirement should be low or medium, not large. Message logging stores all in-transit messages on reliable media, which requires a large number of non-computational reliable resources [10]; in order to provide multiple fault capability, this memory requirement becomes very large. If an MPI implementation is to tolerate n concurrent faults (n being the number of MPI processes), then a reliable coordinator and a set of reliable remote checkpoint servers should be used.
Multiple fault capability can be increased using replication-based checkpointing: N replicas can handle at least N faults. But a replication protocol must be practical and simple, and it must provide a rigorously proven yet simply stated consistency guarantee with reasonable performance; Niobe is such a protocol proposed by researchers [13]. The number of replicas must be sufficient: a large number of replicas increases the cost of maintaining consistency, while too few replicas affect performance, scalability, and multiple fault tolerance capability. Therefore, a reasonable number of replicas must be estimated as per the system configuration and load. Researchers have proposed an adaptive replica creation algorithm [14], and there is further research scope to develop improved algorithms that maintain a rational replica number. Replica-on-demand is a feature that can be implemented to make the approach more adaptive, flexible, and dynamic, and there is research scope to further improve protocols to achieve replication efficiently. There are some crucial requirements for a replication protocol: support for a flexible number of replicas; strict consistency in the presence of network, disk, and machine failures; and efficient common-case read and write operations that do not require potentially expensive two- or three-phase commit protocols.
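As a toy illustration of choosing a "reasonable number of replicas" (a back-of-the-envelope model under assumed parameters, not the algorithm of [14]): if each copy of a checkpoint is independently unavailable with probability p, the smallest replica count r with p^r below a target loss probability balances availability against the storage and consistency cost of extra copies.

```python
import math

def replicas_needed(p_unavailable, target_loss, max_replicas=10):
    """Smallest r such that the probability that all r copies are lost is below target_loss."""
    if not (0.0 < p_unavailable < 1.0):
        raise ValueError("p_unavailable must be in (0, 1)")
    r = math.ceil(math.log(target_loss) / math.log(p_unavailable))
    return min(max(r, 1), max_replicas)

# Example: each copy unavailable 5% of the time, tolerate at most a 1-in-a-million loss.
print(replicas_needed(p_unavailable=0.05, target_loss=1e-6))   # -> 5
```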

(3.7) Synchronization:
Checkpointing must not rely on global synchronization, because nodes may leave or join the distributed system dynamically.

(3.8) Domino effect and rollback propagation:
This is an undesirable feature generally associated with uncoordinated checkpointing with message logging. Uncoordinated checkpointing is based on inter-process communication, and messages are logged independently; in order to get a complete description of a failed process, these inter-process dependencies may force some healthy processes to roll back. This rollback of healthy processes is called rollback propagation. Such rollbacks may bring the system back to its initial state, with the loss of all computation; this situation is called the domino effect [9].

(3.9) Storage strategy:
The performance of a checkpointing technique depends upon the storage strategy. Central dedicated servers and network storage are generally used as storage for checkpointing [15] [16] [17]. Although this is a simple strategy for storing checkpoints, performance is low because the whole checkpointing load falls on the central storage, so the overheads are very high; furthermore, such central or network storage is a single point of failure, and the scalability of central storage is also low. Performance can be improved if the checkpointing load is distributed evenly over all nodes involved in the computation. Walters et al. suggested replication-based checkpointing, which distributes the checkpointing load over all computational nodes [11]; performance and scalability are improved at the cost of consistency. Replication-based checkpointing methods need more careful attention to consistency-related aspects such as the degree of replication, consistency among replicas, and replica-on-demand. Consistency among replicas is a major issue: multiple copies of the same entity cause a consistency problem when any one copy is updated by one of the users, so a replication protocol must ensure consistency among all replicas [18]. The major issues with checkpointing storage are capacity, scalability, performance, and overheads. The different storage types used for checkpointing are parallel file systems, central storage, distributed storage areas, network storage, disks, etc. Replica consistency usually requires deterministic replica behavior [14]. Researchers proposed an algorithm that uses both active and passive strategies to implement an optimistic replication protocol [18], and also proposed a simple protocol that combines a token with a cache, giving the benefits of both [3]. There is still a need for a more simple, adaptive, and practical replication protocol with adequately ensured consistency [19].

(3.10) Levels of checkpointing:
There are mainly three levels of checkpointing: kernel level, user level, and application level. Kernel-level checkpointing works as part of the kernel; it is easy to use but depends on the operating system. User-level checkpointing acts as a library; portability is high, but at the cost of limited access to kernel-specific attributes. At the application level, the programmer puts the checkpointing locations into the program. Researchers have proposed operational-level checkpointing to lower the cost of recovery [YTG, 2006 PRAHSANT].

(3.11) Scheduling of checkpointing:
The scheduling of checkpointing decides the overall performance of checkpointing, and improved scheduling approaches have been suggested to reduce the various overheads, such as the computational overhead, storage overheads, and transfer overheads of checkpoints. Wang et al. move checkpoint data to centralized storage in smaller groups; however, their results are limited to 16 nodes, making the scalability of their solution unknown [20]. Jung et al. show that the overhead of SAN-based checkpoint storage may be partially mitigated by first storing checkpoints locally before serializing their transfer to a SAN. To help improve the performance of checkpointing, particularly the checkpointing delay due to shared storage, they propose a group-based checkpointing solution that divides processes into multiple (smaller) checkpointing groups, and each group checkpoints individually in an effort to reduce the checkpointing overhead. In order to achieve low overhead, however, their solution requires that non-checkpointing groups make computational progress during other groups' checkpoints; this requires that the processes be divided into groups according to their communication patterns, which in turn requires information from the user [1].
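A small sketch of the group-based idea described above (illustrative only; the greedy grouping heuristic and names are assumptions, not Jung et al.'s implementation): processes are grouped by their communication pattern, and the groups take checkpoints at staggered times so that the non-checkpointing groups keep computing.

```python
from collections import defaultdict

def group_by_communication(comm_edges, num_groups):
    """Greedy grouping: processes that exchange the most messages land in the same group.

    comm_edges: dict mapping (proc_a, proc_b) -> message count.
    """
    group_of = {}
    groups = defaultdict(set)
    # Consider the heaviest-communicating pairs first.
    for (a, b), _count in sorted(comm_edges.items(), key=lambda kv: -kv[1]):
        for p in (a, b):
            if p not in group_of:
                partner = b if p == a else a
                # Join the partner's group if it has one, otherwise the smallest group.
                gid = group_of.get(partner,
                                   min(range(num_groups), key=lambda g: len(groups[g])))
                group_of[p] = gid
                groups[gid].add(p)
    return dict(groups)

def staggered_schedule(groups, base_interval):
    """Offset each group's first checkpoint so that only one group pauses at a time."""
    return {gid: base_interval * (i + 1) / len(groups)
            for i, gid in enumerate(sorted(groups))}

edges = {(0, 1): 120, (2, 3): 90, (1, 2): 5, (0, 3): 2}
groups = group_by_communication(edges, num_groups=2)
print(groups)                                   # e.g. {0: {0, 1}, 1: {2, 3}}
print(staggered_schedule(groups, base_interval=60.0))
```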
(3.12) Checkpointing overheads:
The checkpointing overhead consists of the coordination time, the memory write time, and the continue time. The coordination phase includes the time needed to quiesce the network channels and exchange bookmarks. The memory write time consists of the time needed to checkpoint the entire memory footprint of a single process and write it to a local disk. Finally, the continue phase includes the time needed to synchronize and resume the computation. On occasion, particularly with large memory footprints, the continue phase can seem disproportionately long; this is due to some nodes' slower checkpoint/file writing performance, which forces the faster nodes to wait [11]. By scheduling the checkpointing appropriately, based on each node's computational power, this continue phase can be minimized. Thus, the time required to checkpoint the entire system depends largely on the time needed to write the memory footprints of the individual nodes.

(3.13) Checkpointing interval and frequency:
The checkpointing time interval is the time elapsed between two successive checkpoints, and the checkpointing frequency is the number of checkpoints taken by a particular node in a given amount of time; the checkpointing frequency is therefore the reciprocal of the checkpointing time interval. The various overhead components of checkpointing also depend upon the checkpointing frequency: as the frequency is reduced, the overheads also reduce, because less time is spent writing memory footprints to storage. The checkpoint size and frequency must vary according to the trend and intensity of the dynamism of the distributed system.
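The trade-off in (3.12) and (3.13) can be made concrete with a toy cost model (a rough sketch with assumed parameter names, not a result from the paper): checkpointing more often adds more per-checkpoint overhead to failure-free execution, but reduces the expected recomputation after a failure, so there is an interval that minimizes the total expected cost.

```python
def expected_overhead_per_hour(interval_s, ckpt_cost_s, failures_per_hour):
    """Toy model: failure-free checkpoint overhead plus expected re-execution after failures.

    interval_s        : checkpointing interval in seconds; frequency = 1 / interval_s
    ckpt_cost_s       : coordination + memory-write + continue time per checkpoint (seconds)
    failures_per_hour : expected number of failures per hour
    On average half an interval of work is lost and must be recomputed per failure.
    """
    frequency_per_hour = 3600.0 / interval_s
    checkpoint_overhead = frequency_per_hour * ckpt_cost_s
    expected_rework = failures_per_hour * (interval_s / 2.0)
    return checkpoint_overhead + expected_rework

# Sweep a few intervals for a node with a 20 s checkpoint cost and about 0.1 failures/hour.
for interval in (300, 600, 1200, 2400, 4800):
    cost = expected_overhead_per_hour(interval, ckpt_cost_s=20.0, failures_per_hour=0.1)
    print(f"interval={interval:5d}s  expected overhead={cost:7.1f} s/hour")
```

Under this model the minimum falls near the square root of (2 x per-checkpoint cost x mean time between failures), which is the usual first-order rule of thumb for choosing the interval; an adaptive scheme would re-evaluate this as the failure rate and checkpoint size change.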
4. COMPARISON

In this section we compare coordinated and uncoordinated checkpointing on the basis of some of the important critical factors discussed in Section 3.

Table 1: Comparison Table

Issue                                     Coordinated checkpointing   Uncoordinated checkpointing
Consistency                               More                        Less
Recovery time                             More                        Less
Performance                               Low                         High
Single faults                             Yes                         Yes
Multiple faults                           No                          Yes
Frequent failures                         No                          Yes
Automatic recovery                        No                          Yes
Domino effect and rollback propagation    No                          Yes
Overhead                                  Less                        More
Protocol                                  Simple                      Complex
Scheduling                                Less complex                Complex
5. CONCLUSION

We have discussed many issues of checkpointing-based recovery for multiple faults, in terms of their impact on the overall performance of the distributed system; for a distributed system, performance and efficiency are important. Some issues are not relevant to coordinated checkpointing but are relevant to uncoordinated checkpointing. Better performance and efficiency can be achieved by an improved storage strategy and by lowering the overheads associated with checkpointing. The write time can be reduced by addressing the issues related to the storage strategy; likewise, scheduling the checkpointing load equally and evenly over all computing nodes can improve performance and reduce the overheads and cost of checkpointing. Multiple fault tolerance capability together with good performance can be achieved by optimizing the replicas and producing replicas on demand, and adaptive replication may further reduce the consistency cost.

The various overheads, such as the memory write time, coordination time, and continue time, differ with different checkpoint sizes and checkpointing frequencies; these overheads need to be optimized as a function of checkpoint size and frequency. The scalability of checkpointing can be further improved by improved replicated checkpointing.
6. REFERENCES

[1] H. Jung, D. Shin, H. Kim, and Heon Y. Lee, "Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M3)," SC '05, Nov. 12-18, 2005, Seattle, Washington, USA, ACM, 2005.

[2] M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson, "A Survey of Rollback-Recovery Protocols in Message Passing Systems," Technical Report CMU-CS-96-81, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, October 1996.

[3] X. China, "Token-Based Sequential Consistency in Asynchronous Distributed System," 17th International Conference on Advanced Information Networking and Applications (AINA '03), March 27-29, 2003, ISBN 0-7695-1906-7.

[4] C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello, "MPI Tools and Performance Studies—Blocking versus Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI," Proc. 18th Ann. Supercomputing Conf. (SC '06), pp. 127-140, 2006.

[5] E. Vassev, Q. T. D. Nguyen, and H. Kuang, "Fault-Tolerance through Message-Logging and Check-Pointing," Technical Report, Concordia University.

[6] L. Alvisi and K. Marzullo, "Message Logging: Pessimistic, Optimistic, and Causal," Proc. 15th International Conference on Distributed Computing Systems (ICDCS 1995), pp. 229-236, IEEE CS Press, May-June 1995.

[7] X. Yang, Y. Du, P. Wang, H. Fu, and J. Jia, "FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing," IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 10, October 2009.

[8] S. Gorender and M. Raynal, "An Adaptive Programming Model for Fault-Tolerant Distributed Computing," IEEE Transactions on Dependable and Secure Computing, Vol. 4, No. 1, January-March 2007.

[9] S. Jafar, A. Krings, and T. Gautier, "Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing," IEEE Transactions on Dependable and Secure Computing, Vol. 6, No. 1, Jan.-Mar. 2009.

[10] A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, and F. Magniette, "MPICH-V2: A Fault Tolerant MPI for Volatile Nodes Based on Pessimistic Sender Based Message Logging," SC '03, Nov. 15-21, 2003, Phoenix, Arizona, USA, ACM, 2003.

[11] J. Walters and V. Chaudhary, "Replication-Based Fault Tolerance for MPI Applications," IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 7, July 2009.

[12] M. Chtepen, F. Claeys, B. Dhoedt, P. Demeester, and P. Vanrolleghem, "Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids," IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 2, Feb. 2009.

[13] J. MacCormick, C. Thekkath, M. Jager, K. Roomp, and L. Peterson, "Niobe: A Practical Replication Protocol," ACM Journal Name, Vol. V, No. N, Month 20YY.

[14] Cao Huaihu and Zhu Jianming, "An Adaptive Replicas Creation Algorithm with Fault Tolerance in the Distributed Storage Network," IEEE, 2008.

[15] H. P. Reiser, M. J. Danel, and F. J. Hauck, "A Flexible Replication Framework for Scalable and Reliable .NET Services," Proc. IADIS Int. Conf. on Applied Computing, Vol. 1, pp. 161-169, 2005.

[16] Q. Gao, W. Yu, W. Huang, and D. K. Panda, "Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand," Proc. 35th Ann. Int'l Conf. Parallel Processing (ICPP '06), pp. 471-478, 2006.

[17] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing," Int'l J. High Performance Computing Applications, Vol. 19, No. 4, pp. 479-493, 2005.

[18] A. Kale and U. Bharambe, "Highly Available Fault Tolerant Distributed Computing Using Reflection and Replication," Proc. International Conference on Advances in Computing, Communication and Control, Mumbai, India, pp. 251-256, 2009.

[19] A. Luckow and B. Schnor, "Adaptive Checkpoint Replication for Supporting the Fault Tolerance of Applications in the Grid," Seventh IEEE International Symposium on Network Computing and Applications, IEEE, 2008.

[20] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott, "A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance," Proc. 21st Int'l Parallel and Distributed Processing Symp. (IPDPS '07), pp. 116-125, 2007.
