A Large Dataset Imputation Approach Applied to Country Conflict Prediction Data

This study demonstrates an alternative stochastic imputation approach for large datasets, intended for cases where preferred commercial packages fail to iterate because of numerical problems. The motivating example is a large country conflict dataset whose missingness is well above the commonly used threshold of 20%. The methodology capitalizes on correlation among variables and uses model residuals to represent the uncertainty in estimating unknown values. Examining the methodology gives insight into the choice between linear and nonlinear modeling terms. The static tolerances common in most packages are replaced with tailorable tolerances that exploit residuals to fit each data element. The methodology is evaluated on computation time, model fit, and a comparison of known values against the replacement values created through imputation. Overall, results on the country conflict dataset show promise for modeling first-order interactions, while pointing to a need for further refinement that mimics predictive mean matching.
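The core idea of residual-based stochastic imputation can be illustrated with a minimal sketch. This is not the study's implementation: it is a generic stochastic regression imputation in Python with NumPy, and the function name `stochastic_regression_impute`, the single-predictor linear model, the normal residual draw, and the synthetic data are all assumptions made for illustration. The sketch picks the fully observed column most correlated with the target, fits a least-squares line on the observed rows, and fills each missing value with the prediction plus noise scaled by the residual standard deviation.

```python
import numpy as np

def stochastic_regression_impute(X, target, rng):
    """Fill NaNs in column `target` using the most correlated complete column.

    Hypothetical helper for illustration only; assumes a linear relationship
    and roughly normal residuals.
    """
    X = X.copy()
    y = X[:, target]
    missing = np.isnan(y)
    observed = ~missing

    # Candidate predictors: other columns with no missing values.
    candidates = [j for j in range(X.shape[1])
                  if j != target and not np.isnan(X[:, j]).any()]

    # Exploit correlation: pick the predictor most correlated with the target.
    corr = [abs(np.corrcoef(X[observed, j], y[observed])[0, 1])
            for j in candidates]
    best = candidates[int(np.argmax(corr))]

    # Ordinary least squares fit (intercept + slope) on the observed rows.
    A_obs = np.column_stack([np.ones(observed.sum()), X[observed, best]])
    beta, *_ = np.linalg.lstsq(A_obs, y[observed], rcond=None)

    # Residual spread supplies the uncertainty for the imputed values.
    resid = y[observed] - A_obs @ beta
    sigma = resid.std(ddof=2)

    # Prediction plus a random draw scaled by the residual standard deviation.
    A_mis = np.column_stack([np.ones(missing.sum()), X[missing, best]])
    X[missing, target] = A_mis @ beta + rng.normal(0.0, sigma, missing.sum())
    return X, best, sigma

# Synthetic demonstration data (assumed, not from the study).
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x2 = rng.normal(size=n)
x1 = 2.0 * x0 + rng.normal(scale=0.5, size=n)
data = np.column_stack([x0, x1, x2])

mask = rng.random(n) < 0.3           # roughly 30% missing in column 1
data_missing = data.copy()
data_missing[mask, 1] = np.nan

imputed, best, sigma = stochastic_regression_impute(data_missing, target=1, rng=rng)
```

Because the true values behind `mask` are known in this synthetic setting, `imputed[mask, 1]` can be compared directly with `data[mask, 1]`, which mirrors the known-versus-replaced comparison used in the evaluation. Adding first-order interaction terms would amount to appending product columns to the design matrix.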




