High Performance in Parallel Data Integration: An Empirical Evaluation of the Ratio Between Processing Time and Number of Physical Nodes

Many studies have shown that parallelization decreases efficiency [1], [2]. There are many reasons for these decrements. This paper investigates those which appear in the context of parallel data integration. Integration processes generally cannot be allocated to packages of identical size (i. e. tasks of identical complexity). The reason for this is unknown heterogeneous input data which result in variable task lengths. Process delay is defined by the slowest processing node. It leads to a detrimental effect on the total processing time. With a real world example, this study will show that while process delay does initially increase with the introduction of more nodes it ultimately decreases again after a certain point. The example will make use of the cloud computing platform Hadoop and be run inside Amazon-s EC2 compute cloud. A stochastic model will be set up which can explain this effect.





References:
[1] Kumar, V.: Introduction to Parallel Computing, 2nd edition. Addison-
Wesley Longman Publishing Co., Inc., (2002)
[2] Eager, D. L., Zahorjan, J., Lozowska, E. D.: Speedup Versus Efficiency
in Parallel Systems. IEEE Transactions on Computers, Vol. 38, No. 3,
pp. 408--423 (1989)
[3] Cutting, D.: Nutch: an Open-Source Platform for Web Search. In:
Beigbeder, M., Yee, W. G. (eds.) Workshop on Open Source Web
Information Retrieval (OSWIR), pp. 31--33 (2005)
[4] Khare, R., Cutting, D., Sitaker, K., Rifkin, A.: Nutch: A Flexible and
Scalable Open-Source Web Search Engine (2004)
[5] Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google File System.
SOSP '03: Proceedings of 19th ACM symposium on Operating systems
principles, ACM Press, pp. 29--43, NY (2003)
[6] Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on
Large Clusters, OSDI'04: Sixth Symposium on Operating System
Design and Implementation, San Francisco, CA (2004)
[7] Cafarella, M. J., Etzioni, O.: A Search Engine for Natural Language
Applications. WWW '05: Proceedings of the 14th international
conference on World Wide Web, pp. 442--452. ACM Press, NY (2005)
[8] Drost, I., Scheffer, T.: Thwarting the Nigritude Ultramarine: Learning to
Identify Link Spam. ECML, pp. 96--107 (2005)
[9] Kimball, A., Michels-Slettvet, S., Bisciglia, C.: Cluster Computing for
Web-Scale Data Processing. SIGCSE '08: Proceedings of the 39th
SIGCSE technical symposium on Computer science education, ACM,
pp. 116--120 (2008)
[10] Butler, M. H., Rutherford, J.: Distributed Lucene : A distributed free
text index for Hadoop. HP Laboratories (2008)
[11] Hasan, W., Motwani, R.: Optimization Algorithms for Exploiting the
Parallelism-Communication Tradeoff in Pipelined Parallelism. VLDB
'94: Proceedings of the 20th International Conference on Very Large
Data Bases, Morgan Kaufmann Publishers Inc., pp. 36--47 (1994)
[12] Ahn, J. H., Erez, M., Dally, W. J.: Tradeoff between Data-, Instruction-,
and Thread-Level Parallelism in Stream Processors. ICS '07:
Proceedings of the 21st annual international conference on
Supercomputing, ACM, pp. 126--137 (2007)
[13] Amini, H., Kazakov D., Ridge, E.: Parallelism vs Communication
Overhead Trade-off in a JADE Multi-Agent Implementation of Cellular
Automata. The First International Symposium on Nature-Inspired
Systems for Parallel, Asynchronous and Decentralised Environments
(NISPADE), AISB convention, Bristol (2006)
[14] Balasubramonian, R., Dwarkadas, S., Albonesi, D. H.: Dynamically
managing the communication-parallelism trade-off in future clustered
processors. SIGARCH Comput. Archit. News, ACM, pp. 275--287
(2003)
[15] Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. H., Stoica, I.:
Improving MapReduce Performance in Heterogeneous Environments.
8th Symposium on Operating Systems Design and Implementation, pp.
29--42 (2008)
[16] Ukkonen, E.: Approximate string-matching with q-grams and maximal
matches. Theor. Comput. Sci., Elsevier Science Publishers Ltd., 92, pp.
191--211 (1992)