Implementation of Watch Dog Timer for Fault Tolerant Computing on Cluster Server
In today's technology era, clusters have become a necessity for
modern computing and data applications, since many applications
need a long time (even days or months) to compute. Although
parallelization speeds up computation, the time required by many
applications can still be long. Reliability of the cluster therefore
becomes a very important issue, and implementing a fault-tolerant
mechanism becomes essential. The difficulty of designing a
fault-tolerant cluster system grows as the failures it must handle
become more difficult. The most important requirement is that an
algorithm that handles a simple failure in a system must also
tolerate more severe failures. In this paper, we implement the
watchdog timer concept in a parallel environment to handle failures.
This simple algorithm lets us handle different types of failures;
consequently, we found that the reliability of the cluster improves.
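To make the watchdog idea concrete, the following is a minimal sketch of a software watchdog that tracks heartbeats from cluster nodes and reports the ones that have gone silent. It is an illustration only, not the authors' implementation: the `Watchdog` class, its `kick` and `expired` methods, the node names, and the 5-second timeout are all hypothetical, and a real cluster would receive heartbeats over the network rather than through direct method calls.

```python
import time


class Watchdog:
    """Hypothetical watchdog: remembers each node's last heartbeat ("kick")."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence tolerated (assumed value)
        self.clock = clock       # injectable clock, so the logic can be tested without sleeping
        self.last_kick = {}      # node name -> time of its last heartbeat

    def kick(self, node):
        """Record a heartbeat from a node, which restarts its timer."""
        self.last_kick[node] = self.clock()

    def expired(self):
        """Return, in sorted order, the nodes silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            node for node, last in self.last_kick.items()
            if now - last > self.timeout
        )


if __name__ == "__main__":
    # Simulated time stands in for real elapsed seconds.
    fake_now = [0.0]
    wd = Watchdog(timeout=5.0, clock=lambda: fake_now[0])

    wd.kick("node-a")
    wd.kick("node-b")

    fake_now[0] = 4.0
    wd.kick("node-a")        # node-a reports in again; node-b stays silent

    fake_now[0] = 7.0
    print(wd.expired())      # node-b has been silent for 7 s, node-a for only 3 s
```

In this sketch a monitoring process would call `expired()` periodically and hand the listed nodes to whatever recovery step the cluster uses, such as restarting the task or reassigning it to a healthy node. The timeout has to be longer than the normal gap between heartbeats, otherwise healthy nodes are reported as failed.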
@article{"International Journal of Information, Control and Computer Sciences:56662",
  author = "Meenakshi Bheevgade and Rajendra M. Patrikar",
  title = "Implementation of Watch Dog Timer for Fault Tolerant Computing on Cluster Server",
  abstract = "In today's new technology era, cluster has become a
necessity for the modern computing and data applications since many
applications take more time (even days or months) for computation.
Although after parallelization, computation speeds up, still time
required for much application can be more. Thus, reliability of the
cluster becomes very important issue and implementation of fault
tolerant mechanism becomes essential. The difficulty in designing a
fault tolerant cluster system increases with the difficulties of various
failures. The most imperative obsession is that the algorithm, which
avoids a simple failure in a system, must tolerate the more severe
failures. In this paper, we implemented the theory of watchdog timer
in a parallel environment, to take care of failures. Implementation of
simple algorithm in our project helps us to take care of different
types of failures; consequently, we found that the reliability of this
cluster improves.",
  keywords = "Cluster, Fault tolerant, Grid, Grid Computing System, Meta-computing.",
  volume = "2",
  number = "2",
  pages = "442-4",
}