Analysis and prediction of undergraduate student dropouts using machine learning
Abstract
This study analyzes and compares the performance of eight machine learning models (Decision Tree, Random Forest, Logistic Regression, K-Nearest Neighbors, AdaBoost, Histogram-based Gradient Boosting, XGBoost, and Light Gradient Boosting Machine (LightGBM)) in identifying at-risk undergraduate students in Thailand. It uses academic data from university students to understand the factors that influence academic success or failure, and it develops predictive models to help instructors and administrators identify learning trends and implement effective teaching strategies. The dataset consists of records for 2,104 undergraduate students in science and technology in Thailand from 2020 to 2024, with 36 features. During data preparation, missing values were imputed with the k-nearest neighbors method (k = 5), and class imbalance was corrected with the Synthetic Minority Oversampling Technique (SMOTE). Recursive feature elimination (RFE) was used to identify important features. Model performance was evaluated with cross-validation on four metrics: accuracy, precision, recall, and F1 score. The Random Forest model performed best, achieving an accuracy of 84.32%, a precision of 0.85, a recall of 0.84, and an F1 score of 0.85. These findings suggest that the model has significant potential in the educational system, particularly for targeting interventions at at-risk students to reduce dropout rates and increase academic success. Future research should explore additional influencing factors, investigate other machine learning models that may perform better, and compare results across diverse student populations to ensure reliability and broader applicability.
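The preparation and evaluation steps named in the abstract can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the helper names (`smote_like`, `make_demo_data`, `evaluate`) are hypothetical, the number of retained features, the fold count, and macro averaging are arbitrary choices, and `smote_like` is a simplified stand-in for SMOTE (the imbalanced-learn library offers a full implementation). Imputation, oversampling, and feature selection are fitted on the training folds only, so the held-out fold stays untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.impute import KNNImputer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors


def smote_like(X, y, minority, k=5, seed=0):
    """Simplified SMOTE stand-in: interpolate between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Column 0 of the neighbour index is normally the point itself, so skip it.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neigh[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    X_new = X_min[base] + gap * (X_min[pick] - X_min[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority)])


def make_demo_data(seed=0):
    """Synthetic, imbalanced data with about 5% missing values (assumption)."""
    X, y = make_classification(n_samples=400, n_features=16, n_informative=6,
                               weights=[0.8, 0.2], random_state=seed)
    mask = np.random.default_rng(seed).random(X.shape) < 0.05
    X[mask] = np.nan
    return X, y


def evaluate(X, y, n_splits=5, n_features=10, seed=0):
    """Cross-validated accuracy, precision, recall, and F1 (macro-averaged)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in folds.split(X, y):
        # k-nearest-neighbours imputation with k = 5, fitted on training rows only.
        imputer = KNNImputer(n_neighbors=5).fit(X[train])
        X_tr, X_te = imputer.transform(X[train]), imputer.transform(X[test])
        # Oversample the rarest class in the training fold until it is balanced.
        minority = np.bincount(y[train]).argmin()
        X_tr, y_tr = smote_like(X_tr, y[train], minority, seed=seed)
        # Recursive feature elimination driven by a Random Forest ranking.
        selector = RFE(RandomForestClassifier(n_estimators=50, random_state=seed),
                       n_features_to_select=n_features).fit(X_tr, y_tr)
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(selector.transform(X_tr), y_tr)
        pred = model.predict(selector.transform(X_te))
        p, r, f1, _ = precision_recall_fscore_support(
            y[test], pred, average="macro", zero_division=0)
        scores.append((accuracy_score(y[test], pred), p, r, f1))
    return np.mean(scores, axis=0)


X, y = make_demo_data()
acc, prec, rec, f1 = evaluate(X, y)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

The scores printed here come from the synthetic data and say nothing about the study's dataset. Swapping `RandomForestClassifier` for another estimator in the final fit is one way to compare several models under the same folds.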