Regression Approach for Optimal Purchase of Hosts Cluster in Fixed Fund for Hadoop Big Data Platform

Given a fixed budget, choosing between fewer hosts of higher capability and more hosts of lower capability is an unavoidable trade-off when building a Hadoop big data platform. This paper presents an exploratory study for a Housing Big Data Platform (HBDP) project, in which the typical big data computing consists of SQL queries with aggregation, join, and space-time conditional selection, executed on massive data from more than 10 million housing units. In HBDP, an empirical formula, shaped via a regression approach, was introduced to predict the performance of candidate host clusters on the intended typical computing tasks. With this empirical formula, an optimal cluster configuration can be readily suggested. The investigation was based on a typical Hadoop computing ecosystem, HDFS+Hive+Spark. A metric was proposed to measure the performance of Hadoop clusters in HBDP; the measured performance was compared with its predicted counterpart on three kinds of typical SQL query tasks. Tests were conducted with respect to four factors: CPU benchmark, memory size, virtual host division, and the number of physical hosts in the cluster. The research has been applied to practical cluster procurement for housing big data computing.
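The general idea of fitting a performance formula by regression and then using it to choose a cluster under a budget can be sketched as follows. This is only an illustration under stated assumptions: the empirical formula and the performance metric of the study are not given in this section, so the sketch assumes a hypothetical power-law form, performance = k * cpu^a * mem^b * hosts^c, fitted by ordinary least squares in log space. The sample measurements are synthetic (generated from the assumed form, not measured), the host types and prices are invented for the example, virtual host division is left out, and only homogeneous clusters are considered.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_power_law(samples):
    """Least-squares fit of
    log(perf) = w0 + w1*log(cpu) + w2*log(mem) + w3*log(hosts).
    Each sample is (cpu_score, mem_gb, hosts, measured_perf)."""
    rows = [[1.0, math.log(c), math.log(g), math.log(h)] for c, g, h, _ in samples]
    ys = [math.log(p) for *_, p in samples]
    k = 4
    # Normal equations: (A^T A) w = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear(ata, aty)

def predict(w, cpu, mem, hosts):
    """Predicted performance under the assumed power-law form."""
    return math.exp(w[0] + w[1] * math.log(cpu)
                    + w[2] * math.log(mem) + w[3] * math.log(hosts))

def best_cluster(w, host_types, budget):
    """Pick the homogeneous cluster with the highest predicted performance.
    host_types: list of (name, cpu_score, mem_gb, unit_price)."""
    best = None
    for name, cpu, mem, price in host_types:
        n = int(budget // price)  # as many hosts of this type as the budget allows
        if n < 1:
            continue
        score = predict(w, cpu, mem, n)
        if best is None or score > best[2]:
            best = (name, n, score)
    return best

# Synthetic "measurements" from assumed exponents (NOT data from the study).
TRUE_K, TRUE_A, TRUE_B, TRUE_C = 2.0, 0.6, 0.3, 0.8
samples = [(c, g, h, TRUE_K * c**TRUE_A * g**TRUE_B * h**TRUE_C)
           for c in (4000, 8000, 16000)
           for g in (16, 32, 64)
           for h in (2, 4, 8)]

w = fit_power_law(samples)

# Hypothetical host catalogue: (name, CPU benchmark score, memory in GB, unit price).
host_types = [("small", 4000, 16, 1000), ("large", 16000, 64, 4000)]
choice = best_cluster(w, host_types, budget=16000)
print(choice[0], choice[1])
```

With real measurements the fitted exponents decide the outcome: an exponent on the host count close to 1 favors many cheaper hosts, while a smaller one favors fewer, stronger hosts. A different functional form would only require changing `fit_power_law` and `predict`.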




