Adjusted K Nearest Neighbor Method Based on Decile Mean in Missing Data Imputation

Authors

  • Patchana Suwannasaen, College of Research Methodology and Cognitive Science, Burapha University.
  • Pattrawadee Makmee, College of Research Methodology and Cognitive Science, Burapha University.
  • Afifi Lateh, Faculty of Education, Prince of Songkla University.

Keywords:

missing data imputation, missing data, adjusted K Nearest Neighbor method

Abstract

This research aimed to develop a new method for estimating missing data, Decile Mean K-Nearest Neighbor Bhattacharyya Imputation (DKNN-BH), and to compare its effectiveness with Mean Imputation (MI), K-Nearest Neighbor Imputation (KNN), and Decile Mean K-Nearest Neighbor Imputation (DKNN). The new method adjusts KNN imputation by using the decile mean and the Bhattacharyya distance. A Monte Carlo simulation covered 300 cases defined by four factors: sample size, level of missing data, size of outliers, and the k constant for the DKNN-BH, KNN, and DKNN methods. Each situation was replicated 500 times. The results showed that DKNN-BH consists of two steps: calculating the Bhattacharyya distance, then estimating the missing data with the decile mean. In the simulation comparison, the new method (DKNN-BH) performed better than the earlier methods in all cases, giving the lowest mean square error. The simulation results also revealed that, for missing-data percentages of 5, 10, 20, 30, and 40, outlier percentages of 0, 5, 10, and 20, and k values of 11, 13, 15, 17, and 19, the lowest mean square error decreased as the percentage of outliers and the k constant decreased.
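The two-step idea in the abstract (measure distances to find neighbors, then impute with the decile mean of the neighbors' values) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `decile_mean`, `euclidean`, and `knn_decile_impute` are hypothetical, the decile mean is taken as the average of the nine deciles D1 to D9 computed with NumPy's default linear interpolation, and the distance defaults to Euclidean because the abstract does not specify how the Bhattacharyya distance is computed between records. A Bhattacharyya-based function could be passed through the `distance` parameter.

```python
import numpy as np


def decile_mean(values):
    """Average of the nine deciles D1..D9 (assumed definition of the decile mean)."""
    deciles = np.percentile(values, np.arange(10, 100, 10))
    return float(deciles.mean())


def euclidean(a, b):
    """Placeholder distance; the paper uses a Bhattacharyya distance instead."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def knn_decile_impute(X, k=3, distance=euclidean):
    """Fill each NaN with the decile mean of that column over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        observed = ~np.isnan(X[i])  # features available in the incomplete row
        # Donor rows must have column j and all of the target's observed features.
        donors = [
            r for r in range(X.shape[0])
            if r != i
            and not np.isnan(X[r, j])
            and not np.isnan(X[r, observed]).any()
        ]
        donors.sort(key=lambda r: distance(X[i, observed], X[r, observed]))
        nearest = donors[:k]
        if nearest:  # leave the value missing when no donor row exists
            filled[i, j] = decile_mean(X[nearest, j])
    return filled


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.0, np.nan]]
    print(knn_decile_impute(data, k=1))
    # The decile mean is far less affected by a single extreme value than the mean.
    sample = list(range(1, 11)) + [1000]
    print(decile_mean(sample), np.mean(sample))
```

Because the decile mean discards the extreme tails of the neighbor values, a single outlier among the k neighbors shifts the imputed value much less than an ordinary mean would, which matches the abstract's emphasis on outlier percentage as a simulation factor.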

Author Biographies

Patchana Suwannasaen, College of Research Methodology and Cognitive Science, Burapha University.

College of Research Methodology and Cognitive Science, Burapha University, 169 Longhaad Bangsaen Road, Saensook, Mueang, ChonBuri 20131, Thailand.

Pattrawadee Makmee, College of Research Methodology and Cognitive Science, Burapha University.

College of Research Methodology and Cognitive Science, Burapha University, 169 Longhaad Bangsaen Road, Saensook, Mueang, ChonBuri 20131, Thailand.

Afifi Lateh, Faculty of Education, Prince of Songkla University.

Faculty of Education, Prince of Songkla University, Pattani Campus, 181 Charoen Pradit Road, Rusamilae, Mueang, Pattani 94000, Thailand.

Published

2021-07-27

How to Cite

Suwannasaen, P., Makmee, P., & Lateh, A. (2021). Adjusted K Nearest Neighbor Method Based on Decile Mean in Missing Data Imputation. Recent Science and Technology, 13(2), 330–342. Retrieved from https://li01.tci-thaijo.org/index.php/rmutsvrj/article/view/225931

Section

Research Article