On Mobile Checkpointing using Index and Time Together

Checkpointing is one of the commonly used techniques to provide fault-tolerance in distributed systems so that the system can operate even if one or more components have failed. However, mobile computing systems are constrained by low bandwidth, mobility, lack of stable storage, frequent disconnections and limited battery life. Hence, checkpointing protocols having lesser number of synchronization messages and fewer checkpoints are preferred in mobile environment. There are two different approaches, although not orthogonal, to checkpoint mobile computing systems namely, time-based and index-based. Our protocol is a fusion of these two approaches, though not first of its kind. In the present exposition, an index-based checkpointing protocol has been developed, which uses time to indirectly coordinate the creation of consistent global checkpoints for mobile computing systems. The proposed algorithm is non-blocking, adaptive, and does not use any control message. Compared to other contemporary checkpointing algorithms, it is computationally more efficient because it takes lesser number of checkpoints and does not need to compute dependency relationships. A brief account of important and relevant works in both the fields, time-based and index-based, has also been included in the presentation.





References:
[1] K. M. Chandy and L. Lamport, "Distributed snapshots: determining
global states of distributed systems," ACM Transactions on Computer
Systems, 1985, February, vol. 3, no. 1, pp. 63-75.
[2] B. Gupta, S. Rahimi, R. Rias, and G. Bangalore, "A low-overhead nonblock
checkpointing algorithm for mobile computing environment," in
Lecture Notes in Computer Science 3947, GPC 2006, Springer-Verlag,
2006, pp. 597-608.
[3] N. Neves and W. Fuchs, "Adaptive recovery for mobile environments,"
in Proc. IEEE High-Assurance Systems Engineering Workshop, October
21-22, 1996, pp.134-141. Also in Communications of the ACM, 1997,
January, vol. 40, no. 1, pp. 68-74.
[4] B. Randell, "System structure for software fault-tolerance," IEEE
Transactions on Software Engineering, 1975, June, vol. 1, no. 2, pp.
220-232.
[5] J. Helary, A. Mostefaoui, R. Netzer, and M. Raynal, "Communicationbased
prevention of useless checkpoints in distributed computations,"
Distributed Computing, 2000, January, vol. 13, no. 1, pp. 29-43.
[6] J. Tsai, "An efficient index-based checkpointing protocol with constantsize
control information on messages," IEEE Transactions on
Dependable and Secure Computing, 2005, Oct-Dec, vol. 2, no. 4, pp.
278-296.
[7] R. Baldoni, F. Quaglia, and P. Fornara, "An index-based checkpointing
algorithm for autonomous distributed systems," in Proc. 16th IEEE
Symposium on Reliable Distributed Systems, October 22-24, 1997,
pp.27-34. Also in IEEE Transactions on Parallel and Distributed
Systems, 1999, February, vol. 10, no. 2, pp. 181-192.
[8] L. Lamport, "Time, clocks and the ordering of events in a distributed
system," Communications of the ACM, 1978, July, vol. 21, no. 7, pp.
558-565.
[9] L. Alvisi, E. Elnozahy, S. Rao, S. A. Husain, and A. D. Mel, "An
analysis of communication-induced checkpointing," in Proc. IEEE
Fault-Tolerant Computing Symposium, June 15-18, 1999, pp. 242-249.
[10] N. Neves, "Time-based coordinated checkpointing," Ph.D. dissertation,
UIUCDCS-R-98-2054, University of Illinois at Urbana-Champaign,
1998.
[11] N. Vaidya, "Another two-level failure recovery scheme: performance
impact of checkpoint placement and checkpoint latency," Tech. Rep. 94-
068 (revised), Texas A&M University, January, 1995.
[12] N. Vaidya, "On checkpoint latency," in Proc. Pacific Rim International
Symposium on Fault-Tolerant Systems, December, 1995, pp. 60-65.
[13] M. Chaoguang, Z. Yunlong, and Y. Wenbin, "A two-phase time-based
consistent checkpointing strategy," in Proc. ITNG-06 3rd IEEE
International Conference on Information Technology: New Generations,
April 10-12, 2006, pp. 518-523.
[14] Z. Tong, R. Kain, and W. Tsai, "A low overhead checkpointing and
rollback recovery scheme for distributed systems," in Proc. 8th IEEE
Symposium on Reliable distributed systems, October 10-12, 1989, pp.
12-20.
[15] F. Cristian and F. Jahanian, "A timestamp-based checkpointing protocol
for long-lived distributed computation," in Proc. 10th IEEE Symposium
on Reliable Distributed Systems, 30 Sep-02 Oct, 1991, pp. 12-20.
[16] N. Neves and W. Fuchs, "Using time to improve the performance of
coordinated checkpointing," in Proc. 2nd IEEE International Computer
Performance and Dependability Symposium, September 4-6, 1996, pp.
282-291.
[17] S. George, I. Chen, and Y. Jin, "Movement-based checkpointing and
logging for recovery in mobile computing systems," in Proc.
MobiDE-06 5th ACM International Workshop on Data Engineering for
Wireless and Mobile Access, June, 2006, pp. 51-58.
[18] J. Ahn and C. Hwang, "Low-cost fault-tolerance for mobile nodes in
mobile IP based systems," in Proc. IEEE International Conference on
Distributed Computing Systems Workshop, April 16-19, 2001, pp. 508-
513.
[19] J. Ahn, S. Min, and C. Hwang, "A causal message logging protocol for
mobile nodes in mobile computing systems," Future Generation
Computer Systems, 2004, May, vol. 20, no. 4, pp. 663-686.
[20] S. Neogy, A. Sinha, and P. Das, "Checkpoint processing in distributed
systems software using synchronized clocks," in Proc. ITCC 2001 IEEE
International Conference on Information Technology: Coding and
Computing, April 2-4, 2001, pp. 555-559.
[21] S. Neogy, A. Sinha, and P. Das, "Distributed checkpointing using
synchronized clocks," in Proc. COMPSAC-02 26th IEEE Annual
International Computer Software and Applications Conference, 2002,
pp. 199-206.
[22] C. Lin, S. Wang, S. Kuo, and I. Chen, "A low overhead checkpointing
protocol for mobile computing systems," in Proc. PRDC-02 Pacific Rim
International Symposium on Dependable Computing, December 16-18,
2002, pp. 37-44.
[23] C. Lin, S. Wang, and S. Kuo, "An efficient time-based checkpointing
protocol for mobile computing systems over wide area networks," in
Lecture Notes in Computer Science 2400, Euro-Par 2002, Springer-
Verlag, 2002, pp. 978-982. Also in Mobile Networks and Applications,
2003, vo. 8, no. 6, pp. 687-697.
[24] Z. Tong, R. Kain, and W. Tsai, "Rollback recovery in distributed
systems using loosely synchronized clocks," IEEE Transactions on
Parallel and Distributed Systems, 1992, March, vol. 3, no. 2, pp, 246-
251.
[25] N. Neves and W. Fuchs, "Coordinated checkpointing without direct
coordination," in Proc. IPDS-98 IEEE International Computer
Performance and Dependability Symposium, September 7-9, 1998, pp.
23-31.
[26] D. Briatico, A. Ciuffoletti, and L. Simoncini, "A distributed dominoeffect
free recovery algorithm," in Proc. 4th IEEE International
Symposium on Reliability in Distributed Software and Database
Systems, October, 1984, pp. 207-215.
[27] D. Manivannan and M. Singhal, "A low-overhead recovery technique
using quasi synchronous checkpointing," in Proc. 16th IEEE
International Conference on Distributed Computing Systems, May,
1996, pp. 100-107.
[28] Y. Luo, Y. Min, and D. Zhang, "An improved scheme of index-based
checkpointing," in Proc. 11th IEEE Pacific Rim International
Symposium on Dependable Computing, December 12-14, 2005, Page(s):
5 pp.
[29] G. Cao and M. Singhal, "Mutable checkpoints: a new checkpointing
approach for mobile computing systems," IEEE Transactions on
Parallel and Distributed Systems, 2001, February, vol. 12, no. 2, pp.
157-172.