Using Fractional Factorial Designs for Variable Importance in Random Forest Models

Random Forests are a powerful classification technique, consisting of a collection of decision trees. One useful feature of Random Forests is the ability to determine the importance of each variable in predicting the outcome. This is done by permuting each variable in turn and computing the change in prediction accuracy before and after the permutation. This variable importance calculation resembles a one-factor-at-a-time experiment and is therefore inefficient. In this paper, we use a regular fractional factorial design to determine which variables to permute in each run. Based on the results of the runs in the experiment, we calculate the individual importance of the variables, with improved precision over the standard method. The method is illustrated with a study of student attrition at Monash University.
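
To make the idea concrete, the sketch below (in R, not the authors' code) contrasts the standard one-factor-at-a-time permutation importance with a fractional-factorial version: a regular 2^(7-3) resolution IV design chooses which predictors to permute together in each run, and each variable's importance is then estimated as its main-effect contrast on the accuracy drop. The simulated data, the use of the randomForest and FrF2 packages, and all object names are illustrative assumptions.

    ## Minimal sketch: fractional-factorial permutation importance for a
    ## random forest (illustrative only; data and names are assumptions).
    library(randomForest)
    library(FrF2)

    set.seed(1)

    ## Toy data: 7 predictors, binary response depending on x1, x2, x3.
    n <- 500
    X <- as.data.frame(matrix(rnorm(n * 7), n, 7))
    names(X) <- paste0("x", 1:7)
    y <- factor(ifelse(X$x1 + 0.5 * X$x2 - X$x3 + rnorm(n) > 0, "A", "B"))

    train <- sample(n, 350)
    fit <- randomForest(X[train, ], y[train])
    base_acc <- mean(predict(fit, X[-train, ]) == y[-train])

    ## Regular 2^(7-3) resolution IV design: 16 runs for 7 two-level factors,
    ## coded +1 = permute that variable in the run, -1 = leave it alone.
    design <- FrF2(nruns = 16, nfactors = 7, randomize = FALSE)
    D <- sapply(design, function(f) ifelse(f == 1, 1, -1))

    ## Run the design: permute all "+1" columns at once, record accuracy drop.
    drops <- apply(D, 1, function(run) {
      Xp <- X[-train, ]
      for (j in which(run == 1)) Xp[, j] <- sample(Xp[, j])
      base_acc - mean(predict(fit, Xp) == y[-train])
    })

    ## Main effect of each factor: mean drop when permuted minus when not,
    ## i.e. the usual two-level factorial contrast (2/nruns) * t(D) %*% drops.
    importance_est <- 2 * colMeans(D * drops)
    names(importance_est) <- names(X)
    round(sort(importance_est, decreasing = TRUE), 3)

Because every run permutes several variables at once, each main-effect estimate averages information from all 16 runs rather than relying on a single permutation per variable, which is the source of the precision gain described above.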
