Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset

Gradient boosting has proven to be a highly effective machine learning strategy, and many successful solutions have been built with XGBoost and its derivatives. The aim of this study is to investigate and compare the efficiency of three gradient boosting methods: XGBoost, LightGBM, and CatBoost. The Home Credit dataset, which contains 219 features and 356,251 records, is used in this work. In addition, new features are generated, and several techniques are applied to rank and select the best of them. The experiments indicate that LightGBM is faster and more accurate than CatBoost and XGBoost across varying numbers of features and records.
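
A minimal sketch of such a three-way comparison is shown below. It uses a synthetic stand-in for the Home Credit data and default-style settings; the sample sizes, estimator counts, and other parameters here are illustrative assumptions, not the tuned configuration used in the study.

```python
# Sketch: time and score XGBoost, LightGBM, and CatBoost on the same split.
# Synthetic data stands in for the Home Credit features (assumption).
import time

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Binary-classification data as a placeholder for the real dataset.
X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "XGBoost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
    "LightGBM": LGBMClassifier(n_estimators=200),
    "CatBoost": CatBoostClassifier(n_estimators=200, verbose=0),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)          # measure training time
    elapsed = time.perf_counter() - start
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.4f}, train time = {elapsed:.1f}s")
```

Reporting both AUC and wall-clock training time on an identical split mirrors the speed-versus-accuracy comparison the abstract describes.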


Authors:


