Implementation of Watch Dog Timer for Fault Tolerant Computing on Cluster Server

Clusters have become a necessity for modern computing and data applications, since many applications need long computation times (days or even months). Parallelization speeds up computation, but the time required by many applications can still be long. The reliability of the cluster therefore becomes a very important issue, and implementing a fault-tolerance mechanism becomes essential. The difficulty of designing a fault-tolerant cluster system increases with the variety and severity of the failures it must handle. The most important requirement is that an algorithm that handles a simple failure in a system must also tolerate more severe failures. In this paper, we implement the concept of a watchdog timer in a parallel environment to handle failures. A simple algorithm in our project lets us handle different types of failures, and we found that the reliability of the cluster improves as a result.
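
As an illustration of the watchdog-timer idea only, the following minimal Python sketch tracks a heartbeat per worker node and reports the nodes that have been silent for longer than a timeout. It is not the implementation described in this paper: the class name Watchdog, the methods kick and expired, and the injectable clock parameter are all hypothetical names chosen for this example.

```python
import time


class Watchdog:
    """Tracks the last heartbeat of each worker and flags those that go silent.

    Hypothetical illustration; names and structure are assumptions,
    not the paper's implementation.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence tolerated per worker
        self.clock = clock       # injectable time source (eases testing)
        self.last_seen = {}      # worker id -> time of last heartbeat

    def kick(self, worker):
        """Record a heartbeat, which resets that worker's timer."""
        self.last_seen[worker] = self.clock()

    def expired(self):
        """Return the sorted ids of workers whose timers have run out."""
        now = self.clock()
        return sorted(w for w, t in self.last_seen.items()
                      if now - t > self.timeout)
```

In a cluster, a master node might call expired() periodically and restart or reassign the work of any node it returns. That recovery step is outside this sketch.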



