A smarter forest: Enhancing cardiovascular risk prediction using a knowledge-based random forest
Abstract
Accurately predicting heart disease and other cardiovascular conditions is critical for enabling early intervention and improving patient outcomes. This study proposes the semantic random forest (SRF) framework, which enhances the classification performance of conventional random forest (RF) algorithms for heart disease prediction. The framework augments conventional RF with knowledge from a formal ontology model that encapsulates domain-specific medical knowledge as a structured representation of concepts, relationships, and axioms. The SRF framework uses this ontology during the classification process to yield more precise predictions. The proposed framework was evaluated against the conventional RF, AdaBoost, and gradient boosting algorithms, with a focus on their ability to classify heart disease instances accurately. Experimental results demonstrate that the SRF framework outperformed the baseline algorithms on two datasets, achieving the highest accuracy and Matthews correlation coefficient (MCC) values: 0.8296 and 0.6589, respectively, on the University of California at Irvine (UCI) dataset, and 0.9856 and 0.9706, respectively, on the Mendeley dataset. These results indicate that ontology-based structured knowledge significantly improves the classification power of traditional RF algorithms, highlighting the potential of this knowledge-driven approach for predicting heart disease risk in computer-aided medical diagnosis.
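The two evaluation metrics named in the abstract, accuracy and the Matthews correlation coefficient, can both be derived from a binary confusion matrix. The following is a minimal sketch in plain Python, not the authors' evaluation code. It assumes binary labels (1 = heart disease, 0 = no heart disease), and the function names `confusion_counts`, `accuracy`, and `mcc` are hypothetical helpers introduced only for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels; assumes every label is 0 or 1."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            # Remaining case under the binary assumption: truth == 1, pred == 0.
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of instances classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when the denominator is zero, a common convention
    for degenerate confusion matrices.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative toy labels only; these are not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), mcc(*counts))
```

Because MCC uses all four cells of the confusion matrix, it can fall well below accuracy when errors are unevenly distributed across classes, which is one reason to report the two metrics together.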
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.