Data Quality Enhancement for Decision Tree Algorithm using Knowledge-Based Model


Sirichanya Chanmee
Kraisak Kesorn*

Abstract

Data mining is an approach to discovering knowledge or hidden patterns in large data sets using methods such as statistics, machine learning and other data analysis techniques. However, the main limitation of these conventional techniques is that they disregard the relationships and semantics of the data: the data are treated as meaningless numbers, and statistical methods alone are used to build the model. For example, the decision tree, a classification method in data mining, is induced from a given set of labeled data, and those data are classified without any understanding of their semantics or of the relationships between attributes. To capture the inherent meaning of the data and to take advantage of the relationships between data elements, we introduce a knowledge-based approach to improving data quality. The proposed approach uses an ontology as background knowledge to assist decision tree classification during data preparation. The ontology is used to infer relationships between attributes and concepts defined in the ontology. This relationship information helps the system to identify related attributes that can assist the classification process. Two datasets from different domains, agriculture and economics, were used to evaluate the generality of the proposed approach. Classification accuracy was used as the evaluation measure. The experimental results showed that the proposed approach can enhance the performance of the data classification process.
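
The role of the ontology during data preparation can be illustrated with a minimal sketch. The abstract does not specify the ontology or the inference procedure, so everything here is an assumption made for illustration: the toy concept graph, the attribute-to-concept mapping, and the helper names `related_concepts` and `select_attributes` are hypothetical. The sketch keeps only those attributes whose concepts are reachable from the target concept; the retained attributes would then be passed to a decision tree learner.

```python
from collections import deque

# Hypothetical toy ontology: each concept lists the concepts it is directly related to.
ONTOLOGY = {
    "Disease": ["Symptom", "Environment"],
    "Symptom": ["LeafSymptom", "StemSymptom"],
    "Environment": ["Precipitation"],
    "LeafSymptom": [],
    "StemSymptom": [],
    "Precipitation": [],
    "Administration": ["RecordId"],
    "RecordId": [],
}

# Hypothetical mapping from dataset attributes to ontology concepts.
ATTRIBUTE_CONCEPT = {
    "leaf_spots": "LeafSymptom",
    "stem_cankers": "StemSymptom",
    "precip": "Precipitation",
    "record_id": "RecordId",
}


def related_concepts(ontology, root):
    """Return every concept reachable from `root` by breadth-first traversal."""
    seen = {root}
    queue = deque([root])
    while queue:
        concept = queue.popleft()
        for neighbour in ontology.get(concept, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def select_attributes(attribute_concept, ontology, target_concept):
    """Keep attributes whose concept is related to the target concept."""
    reachable = related_concepts(ontology, target_concept)
    return sorted(a for a, c in attribute_concept.items() if c in reachable)


# Attributes related to the class concept "Disease" are kept; "record_id" is dropped.
selected = select_attributes(ATTRIBUTE_CONCEPT, ONTOLOGY, "Disease")
print(selected)
```

In this toy setting the identifier attribute is filtered out because its concept has no path from the class concept, which is one simple way that relationship information could guide attribute selection before tree induction.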

Keywords: data analytics; data mining; ontology; semantic; classification; decision tree  


*Corresponding author: Tel.: +66 81 555 7499
E-mail: kraisakk@nu.ac.th

Article Details

Section
Original Research Articles
