Performance Comparison of Transformation Methods in Data Mining Classification Technique
Abstract
Transformation is a data preparation step in data mining. The main objective of this research was to compare five transformation methods in terms of the classification accuracy obtained from the transformed data. The methods were no transformation, Min-Max Normalization, Z-Score Standardization, Decimal Scaling, and the Median Method. Three classification methods, K-Nearest Neighbor, Artificial Neural Network, and Naïve Bayes, were used to evaluate the transformation methods. Six datasets were tested: White Wine Quality, Pima Indians Diabetes, and Vertebral Column, whose data values did not differ much, and Indian Liver Patient, Working Hours, and Avocado, whose data values differed considerably. Each dataset was divided into two groups at a ratio of 70:30; the first group served as the training set and the second as the testing set. The random seed parameter was tested at values of 10, 20, 30, 40, and 50. All algorithms and procedures were implemented in the R programming language. On 4 of the 6 tested datasets, transformation by Decimal Scaling combined with classification by K-Nearest Neighbor was the best combination, followed by transformation by Decimal Scaling combined with classification by Artificial Neural Network. Our findings may directly benefit those who are interested in efficiently mining big data.
Keywords: Transformation, Min-Max Normalization, Z-Score Standardization, Decimal Scaling, Median Method, K-Nearest Neighbor, Artificial Neural Network, Naïve Bayes
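The transformations and the 70:30 split named in the abstract can be illustrated with a minimal sketch. The study itself was implemented in R; the Python below is only an illustration, and the formulas are the commonly cited textbook definitions, which the abstract does not spell out. In particular, treating the Median Method as division by the attribute median is an assumption, and all function names are hypothetical.

```python
import random
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-Max Normalization: rescale linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values):
    """Z-Score Standardization: subtract the mean, divide by the sample SD.

    Needs at least two values and a non-zero standard deviation.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, the same convention as R's sd()
    return [(v - mean) / sd for v in values]


def decimal_scaling(values):
    """Decimal Scaling: divide by 10**j, with j the smallest integer
    such that every scaled absolute value is below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


def median_method(values):
    """Median Method (assumed definition): divide by the attribute median.

    Needs a non-zero median.
    """
    med = statistics.median(values)
    return [v / med for v in values]


def split_70_30(rows, seed):
    """Shuffle with a fixed seed, then split 70% training / 30% testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    column = [120.0, -45.0, 999.0]
    print(decimal_scaling(column))
    # Repeat the split for each seed value listed in the abstract.
    for seed in (10, 20, 30, 40, 50):
        train, test = split_70_30(range(10), seed)
        print(seed, len(train), len(test))
```

In a full comparison, each transformation would be applied per attribute, and the scaling parameters would usually be estimated on the training set only; the abstract does not state which convention the study followed.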
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Published articles are copyrighted by the Faculty of Science and Technology, Thammasat University. The statements appearing in each article of this journal are solely the personal opinions of the authors and have no connection with the Faculty of Science and Technology or with other faculty members of Thammasat University. Authors must affirm that they take responsibility for every statement presented in their articles, should there be any errors or inaccuracies.
References
Amit, P. and Achin, J., 2017, Comparative Analysis of KNN Algorithm using Various Normalization Techniques, International Journal of Computer Network and Information Security 11: 36-42.
Ramana, B. V., 2012, Indian Liver Patient, Available Source: https://www.mldata.io/dataset-details/indian_liver_patient/, December 26, 2019.
Mota, H. D., 2011, Vertebral Column Data Set, Available Source: https://www.kaggle.com/caesarlupum/vertebralcolumndataset, January 25, 2020.
Justin K., 2018, Avocado Prices Data Set, Available Source: https://www.kaggle.com/neuromusic/avocado-prices, December 20, 2019.
Patro, S. K. and Sahu, K. K., 2017, Normalization: A Preprocessing Stage, Department of CSE & IT, VSSUT, Burla, Odisha, India.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J., 2009, Wine Quality Data Set, Available Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality, December 8, 2019.
Shams, R., 2014, Creating Training, Validation and Test Sets Data Preprocessing, Available Source: https://www.youtube.com/watch?v=uiDFa7iY9yo, January 13, 2020.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. B., 2001, Missing Value Estimation Methods for DNA Microarrays, Bioinformatics 17(6): 520-525.