Hierarchical Checkpoint Protocol in Data Grids

Grids of computing nodes have emerged as a representative means of
connecting distributed computers and resources scattered all over
the world for computing and distributed storage. Because resource
availability is unpredictable in a decentralized grid environment,
fault tolerance becomes complex; in data grids it can be combined
with replication. The objective of our work is to provide fault
tolerance in data grids using a data replication-driven model based
on clustering. The performance of the protocol is evaluated with
the OMNeT++ simulator. The simulation results show the efficiency
of our protocol in terms of recovery time and the number of
processes involved in rollbacks.



