Generic Workload Management System Using Condor-Based Pilot Factory in PanDA Framework

In the current Grid environment, efficient workload management is a significant challenge, compounded by numerous competing de facto standards for resource discovery, brokerage, and data transfer, among other functions. Moreover, real-time resource status, which is essential for optimal resource allocation, is often not readily accessible. To address these issues and to provide a cleaner abstraction of the Grid that can potentially generalize to arbitrary resource-sharing environments, this paper proposes PanDA-PF WMS, a new Condor-based pilot mechanism applied within the PanDA architecture, with the goal of providing a more generic yet efficient resource allocation strategy. In this architecture, the PanDA server acts primarily as a repository of user jobs and responds to pilot requests from distributed, remote resources. Scheduling decisions are then made based on the real-time resource information reported by the pilots. The Pilot Factory is a Condor-inspired solution for scalable pilot dissemination. It effectively functions as a resource provisioning mechanism through which the user-job server, PanDA, reaches out to candidate resources only on demand.
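The pull-based division of labour described above can be sketched in a few lines of Python. The server holds jobs, pilots report what they see on their node and ask for work, and a factory sends pilots only when there is queued work. All class and field names here (`JobRepository`, `PilotReport`, `PilotFactory`, `max_pilots_per_site`) are hypothetical. The first-fit matching rule is an assumption chosen for brevity. This is not PanDA's actual brokerage logic or API, and it does not model actual pilot submission through Condor.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    """A queued user job and its (hypothetical) resource requirements."""
    job_id: int
    cores: int
    memory_mb: int


@dataclass
class PilotReport:
    """Real-time resource state as measured by a pilot on its worker node."""
    site: str
    free_cores: int
    free_memory_mb: int


@dataclass
class JobRepository:
    """Server side: holds user jobs and answers pilot requests (pull model)."""
    queue: List[Job] = field(default_factory=list)

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def get_job(self, report: PilotReport) -> Optional[Job]:
        # Late binding: the scheduling decision is made only when a pilot
        # asks for work, using the resources that pilot reports right now.
        # First-fit matching is an illustrative assumption.
        for i, job in enumerate(self.queue):
            if job.cores <= report.free_cores and job.memory_mb <= report.free_memory_mb:
                return self.queue.pop(i)
        return None  # nothing fits; the pilot can exit and release the slot


@dataclass
class PilotFactory:
    """Provisioning side: decides how many pilots to send to a site."""
    repo: JobRepository
    max_pilots_per_site: int  # hypothetical per-site cap

    def pilots_to_submit(self, running_pilots: int) -> int:
        # On demand: never keep more pilots alive than there are queued jobs,
        # and never exceed the per-site cap.
        wanted = min(len(self.repo.queue), self.max_pilots_per_site)
        return max(0, wanted - running_pilots)


if __name__ == "__main__":
    repo = JobRepository()
    repo.submit(Job(job_id=1, cores=8, memory_mb=16000))
    repo.submit(Job(job_id=2, cores=1, memory_mb=2000))

    factory = PilotFactory(repo, max_pilots_per_site=5)
    print(factory.pilots_to_submit(running_pilots=0))  # 2

    # A pilot lands on a node with 4 free cores; only job 2 fits there.
    print(repo.get_job(PilotReport("SITE_A", free_cores=4, free_memory_mb=8000)))
```

The key design point is late binding. A job is bound to a resource only when a pilot running on that resource requests work. As a result, the server does not need a global and possibly stale view of resource status. It relies instead on what each pilot observes at the moment it asks.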



