Emotion Classification from Speech Waveform Using Machine Learning and Deep Learning Techniques
Abstract
Emotions play a key role in determining a person's mental state and indirectly express an individual's well-being. A speech emotion recognition system can extract a person's emotions from his/her speech inputs. Universal emotions include anger, disgust, fear, happiness, pleasantness, and sadness, along with the neutral state. Recognizing these emotions is especially significant in situations such as the COVID-19 pandemic, when the elderly or the sick are vulnerable to depression. In the current paper, we examined various classification models that require only limited computational power and resources in order to determine a person's emotion from his/her speech. Prosodic features of speech, such as pitch, loudness, and tone, together with spectral features, such as the Mel Frequency Cepstral Coefficients (MFCCs) of the voice, were used to analyze a person's emotions. Although state-of-the-art sequence-to-sequence models for speech emotion detection offer high accuracy and precision, the computational demands of such approaches are high, making them inefficient. Therefore, in this work, we emphasized the analysis and comparison of classification algorithms such as the multilayer perceptron (MLP), decision tree, and support vector machine, and deep neural networks such as the convolutional neural network (CNN) and the long short-term memory (LSTM) network. Given an audio file, the emotions exhibited by the speaker were recognized using machine learning and deep learning techniques. A comparative study was performed to identify the most appropriate algorithms for recognizing emotions. Based on the experimental results, the MLP classifier and the CNN model offered better accuracy with smaller variation than the other models used in the study.
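To make the pipeline described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation: it extracts MFCCs together with simple pitch and loudness statistics using librosa, then trains a scikit-learn MLP classifier. The directory layout, sampling rate, and hyperparameters are hypothetical placeholders.

```python
# Minimal sketch of the feature-extraction-plus-classifier pipeline described
# in the abstract. Not the authors' code: the data layout (data/<emotion>/*.wav),
# sampling rate, and MLP hyperparameters are illustrative assumptions.
import glob
import os

import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def extract_features(path, n_mfcc=40):
    """Fixed-length vector: time-averaged MFCCs plus pitch and loudness statistics."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # spectral envelope
    f0 = librosa.yin(y, fmin=65, fmax=1047, sr=sr)           # frame-wise pitch (Hz)
    rms = librosa.feature.rms(y=y)                           # frame-wise loudness
    return np.concatenate([
        mfcc.mean(axis=1),
        [f0.mean(), f0.std()],
        [rms.mean(), rms.std()],
    ])

# Hypothetical corpus layout: one folder per emotion label (e.g., RAVDESS or
# TESS clips reorganized as data/angry/clip01.wav, data/happy/clip02.wav, ...).
files = glob.glob("data/*/*.wav")
labels = [os.path.basename(os.path.dirname(f)) for f in files]

X = np.stack([extract_features(f) for f in files])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Swapping MLPClassifier for a decision tree or SVM (sklearn.tree.DecisionTreeClassifier, sklearn.svm.SVC) would reproduce the classical baselines compared in the paper; the CNN and LSTM variants would instead consume the full MFCC time series rather than its time average.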
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright Transfer Statement
The copyright of this article is transferred to the Current Applied Science and Technology journal, effective if and when the article is accepted for publication. The copyright transfer covers the exclusive right to reproduce and distribute the article, including reprints, translations, photographic reproductions, electronic form (offline, online), or any other reproductions of a similar nature.
The author warrants that this contribution is original and that he/she has full power to make this grant. The author signs for and accepts responsibility for releasing this material on behalf of any and all co-authors.