A smarter forest: Enhancing cardiovascular risk prediction using a knowledge-based random forest

Sirichanya Chanmee
Kraisak Kesorn

Abstract

Accurate prediction of heart disease and other cardiovascular conditions is critical for enabling early intervention and improving patient outcomes. This study proposes the semantic random forest (SRF) framework, which enhances the classification performance of the conventional random forest (RF) algorithm for heart disease prediction. The conventional RF is augmented with knowledge from a formal ontology model that encapsulates domain-specific medical knowledge and provides a structured representation of concepts, relationships, and axioms. The SRF framework uses this ontology during classification to yield more precise predictions. The proposed framework was evaluated against the conventional RF, AdaBoost, and gradient boosting algorithms, with a focus on their ability to classify heart disease instances accurately. Experimental results show that SRF outperformed the baseline algorithms on two datasets, achieving the highest accuracy and Matthews correlation coefficient (MCC): 0.8296 and 0.6589, respectively, on the University of California, Irvine (UCI) dataset, and 0.9856 and 0.9706, respectively, on the Mendeley dataset. These results indicate that ontology-based structured knowledge significantly improves the classification power of the traditional RF algorithm, highlighting the potential of this knowledge-driven approach for predicting heart disease risk in computer-aided medical diagnosis.
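The two evaluation metrics named in the abstract, accuracy and the Matthews correlation coefficient (MCC), can be illustrated with a minimal sketch for the binary case. The function names and the toy labels below are hypothetical and chosen only for illustration; they are not the study's code or data, and the zero-denominator convention (returning 0.0) is an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disease, 0 = no disease)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of instances classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical toy labels, not taken from the UCI or Mendeley datasets.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced.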

How to Cite
Chanmee, S., & Kesorn, K. (2025). A smarter forest: Enhancing cardiovascular risk prediction using a knowledge-based random forest. Science, Engineering and Health Studies, 19, 25020007. https://doi.org/10.69598/sehs.19.25020007
Section
Physical sciences
