Efficiency Comparison of Classification Models for Pali and Sanskrit in Thai Languages Using Machine Learning Techniques
Abstract
This research tests and compares the performance of classification models for Pali and Sanskrit words in the Thai language using machine learning techniques. The study focuses on improving accuracy in distinguishing between words from these two languages, which often have similar pronunciation and spelling. Five models were tested: Random Forest, Decision Tree, K-Nearest Neighbors (K-NN), Naive Bayes, and Support Vector Machine (SVM). Model performance was assessed with 10-fold cross-validation. The results indicate that the SVM model performs best, with an accuracy of 95.75% and a precision of 90.90%. K-NN follows with an accuracy of 92.86%, and Naive Bayes achieves 81.29%. The Random Forest model shows the lowest performance, with an accuracy of only 55.27%. These findings highlight the effectiveness of the SVM model in classifying Pali and Sanskrit words in the Thai language. The results can be applied to further work in language classification, translation, and educational technology tools for language learning.
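The evaluation procedure named in the abstract (10-fold cross-validation scored by accuracy and precision) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the label strings "pali" and "sanskrit", the unshuffled fold assignment, and the majority-class baseline are all assumptions made for illustration, and none of the five models from the study is reproduced here.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs covering range(n_samples), without shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """Among items predicted as `positive`, the fraction that truly are."""
    hits = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not hits:
        return 0.0
    return sum(t == positive for t in hits) / len(hits)


def cross_validate(
    X: Sequence[str],
    y: Sequence[str],
    fit_predict: Callable[[List[str], List[str], List[str]], List[str]],
    k: int = 10,
) -> float:
    """Mean accuracy over k folds; fit_predict trains on one split and labels the other."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train], [X[i] for i in test])
        scores.append(accuracy([y[i] for i in test], preds))
    return sum(scores) / len(scores)


def majority_baseline(train_X: List[str], train_y: List[str], test_X: List[str]) -> List[str]:
    """Hypothetical stand-in model: always predict the most frequent training label."""
    most_common_label = Counter(train_y).most_common(1)[0][0]
    return [most_common_label] * len(test_X)
```

Any classifier can be passed to `cross_validate` as long as it follows the same `fit_predict(train_X, train_y, test_X)` shape as `majority_baseline`. In practice the data would usually be shuffled or stratified before splitting, which this sketch omits for brevity.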
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.