Comparison of the efficiency of noise handling methods for the dependent variable in classification models

Kritsadee Siriruang
Prapasiri Ratchaprapapornkul

Abstract

Noisy data is a major problem often encountered in real-world datasets. When noise affects the dependent variable, it can lead to incorrect class assignments in classification, so it is essential to handle the noise before analyzing and classifying the data. This research compares the effectiveness of class noise handling methods for classification models: noise removal using four noise filters, Condensed Nearest Neighbor (CNN), Edited Nearest Neighbors (ENN), Cross-Validated Committees Filter (CVCF), and Iterative Partitioning Filter (IPF), versus relabeling, which pairs the same noise filters with multiple imputation by one of three methods: polytomous regression (polyreg), random forest (rf), and multiple imputation through XGBoost (mixgb). The study is conducted through Monte Carlo simulation with sample sizes of 100, 500, and 1,000 units and noise levels of 10%, 20%, 30%, and 40%. The performance of each noise handling method is evaluated by the F1 score of four classification models (k-NN, Random Forest, Naïve Bayes, and Support Vector Machine) and compared through N-way analysis of variance (N-way ANOVA). The study found that the class noise handling method interacts with all factors, namely sample size, noise level, and classification model, in its effect on the F1 score, significant at the .05 level. For small sample sizes (n = 100), relabeling tended to outperform removal; as the sample size increased, the two approaches performed similarly. Overall, the ENN noise filter combined with polytomous regression imputation tended to yield the highest F1 score in most cases, except at a sample size of 1,000 units, where ENN alone performed best. These findings offer guidance on choosing an appropriate class noise handling method for different classification scenarios.
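To make the remove-versus-relabel comparison concrete, the sketch below illustrates the two strategies on simulated binary data. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the Python packages scikit-learn and imbalanced-learn, and a single multinomial logistic regression relabeling pass stands in for the paper's multiple imputation by polytomous regression.

```python
# Minimal sketch of "remove" vs. "relabel" class noise handling, assuming
# scikit-learn and imbalanced-learn. A single logistic-regression relabeling
# pass stands in for the paper's multiple imputation (polyreg).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import EditedNearestNeighbours

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Inject 20% class noise by flipping the labels of a random subset.
flip = rng.choice(len(y_tr), size=int(0.2 * len(y_tr)), replace=False)
y_noisy = y_tr.copy()
y_noisy[flip] = 1 - y_noisy[flip]

# ENN noise filter: flag instances whose label disagrees with its neighbors.
enn = EditedNearestNeighbours(sampling_strategy="all", n_neighbors=3)
enn.fit_resample(X_tr, y_noisy)
kept = enn.sample_indices_  # instances ENN considers clean
flagged = np.setdiff1d(np.arange(len(y_noisy)), kept)

# Relabel strategy: re-predict the flagged labels from the clean subset.
relabeler = LogisticRegression(max_iter=1000).fit(X_tr[kept], y_noisy[kept])
y_relab = y_noisy.copy()
y_relab[flagged] = relabeler.predict(X_tr[flagged])

# Compare downstream F1 of the two handling strategies on held-out data.
for name, (Xc, yc) in {"remove": (X_tr[kept], y_noisy[kept]),
                       "relabel": (X_tr, y_relab)}.items():
    clf = RandomForestClassifier(random_state=0).fit(Xc, yc)
    print(f"{name:7s} F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```

The paper's full design would repeat this over Monte Carlo replications, four noise filters, three imputation methods, and four classifiers before comparing F1 scores with N-way ANOVA.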
