This study compares the prediction performance of discrete-time survival analysis methods with and without accounting for the relationship between longitudinal observations from the same individual. We consider Random Forest and CatBoost as fixed-effect models, and a mixed-effect machine learning model that includes both fixed and random effects. We applied these methods to predict diabetes status using a diabetes screening dataset collected from the Thai population; the dataset is highly imbalanced. For the fixed-effect models, accounting for the relationships among observations from the same individual improved prediction performance when using CatBoost. For the mixed-effect model, however, only the fixed-effect component extracted from the Random Forest model achieved higher prediction performance. In summary, accounting for relationships within the data does not always improve prediction performance; the outcome depends on factors such as data characteristics, model selection, the choice of random-effect variables, and the method used to extract the fixed-effect component from tree-based models. Thus, although fixed-effect models are commonly used in discrete-time survival analysis, combining a mixed-effect model with machine learning is an alternative approach that may improve predictive performance.
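As a rough illustration of the setting described above, discrete-time survival data are commonly recast as a binary classification problem on "person-period" rows, one row per individual per observation period. The sketch below is a minimal, hypothetical example of that expansion; the field names (`id`, `visits`, `event`) are assumptions made for illustration and are not taken from the study's dataset or code.

```python
from typing import Dict, List


def to_person_period(subjects: List[Dict]) -> List[Dict]:
    """Expand one record per subject into one row per subject per period.

    Each input dict uses hypothetical keys:
      'id'     - subject identifier
      'visits' - number of observed screening periods (>= 1)
      'event'  - True if the event occurred in the last observed period
    """
    rows = []
    for s in subjects:
        for t in range(1, s["visits"] + 1):
            rows.append({
                "id": s["id"],       # repeated across rows of the same subject
                "period": t,
                # Label is 1 only in the period where the event occurs.
                "y": int(s["event"] and t == s["visits"]),
            })
    return rows


# Toy input: subject A has the event at visit 3, subject B is censored after visit 2.
subjects = [
    {"id": "A", "visits": 3, "event": True},
    {"id": "B", "visits": 2, "event": False},
]
rows = to_person_period(subjects)
positives = sum(r["y"] for r in rows)
```

In this toy input, five person-period rows are produced and only one carries a positive label, which hints at why such data tend to be imbalanced. Rows that share an `id` are the correlated observations referred to above: a fixed-effect classifier treats them as independent, whereas a mixed-effect model could use `id` as the grouping variable for a random effect.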