Emotion Classification from Speech Waveform Using Machine Learning and Deep Learning Techniques
Abstract
Emotions play a key role in determining a person's mental state and indirectly express an individual's well-being. A speech emotion recognition system can extract a person's emotions from his/her speech inputs. Universal emotions include anger, disgust, fear, happiness, pleasantness, and sadness, along with the neutral state. Recognizing these emotions is especially significant in situations such as the COVID-19 pandemic, when the elderly or the sick are vulnerable to depression. In the current paper, we examined various classification models that require only limited computational power and resources in order to determine a person's emotion from his/her speech. Prosodic features of speech, such as pitch, loudness, and tone, together with spectral features, such as the Mel Frequency Cepstral Coefficients (MFCCs) of the voice, were used to analyze a person's emotions. Although state-of-the-art sequence-to-sequence models for speech emotion detection offer high accuracy and precision, the computational demands of such approaches are high, making them inefficient. Therefore, in this work, we emphasized the analysis and comparison of classification algorithms such as the multilayer perceptron (MLP), decision tree, and support vector machine, and deep neural networks such as the convolutional neural network (CNN) and the long short-term memory (LSTM) network. Given an audio file, the emotions exhibited by the speaker were recognized using machine learning and deep learning techniques. A comparative study was performed to identify the most appropriate algorithms for recognizing emotions. Based on the experimental results, the MLP classifier and the CNN model offered better accuracy with smaller variation than the other models used in the study.
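To make the pipeline described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation: it extracts MFCCs together with simple pitch and loudness statistics using librosa, then trains a scikit-learn MLP classifier. The directory layout, sampling rate, and hyperparameters are hypothetical placeholders.

```python
# Minimal sketch of the feature-extraction-plus-classifier pipeline described
# in the abstract. Not the authors' code: the data layout (data/<emotion>/*.wav),
# sampling rate, and MLP hyperparameters are illustrative assumptions.
import glob
import os

import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def extract_features(path, n_mfcc=40):
    """Fixed-length vector: time-averaged MFCCs plus pitch and loudness statistics."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # spectral envelope
    f0 = librosa.yin(y, fmin=65, fmax=1047, sr=sr)           # frame-wise pitch (Hz)
    rms = librosa.feature.rms(y=y)                           # frame-wise loudness
    return np.concatenate([
        mfcc.mean(axis=1),
        [f0.mean(), f0.std()],
        [rms.mean(), rms.std()],
    ])

# Hypothetical corpus layout: one folder per emotion label (e.g., RAVDESS or
# TESS clips reorganized as data/angry/clip01.wav, data/happy/clip02.wav, ...).
files = glob.glob("data/*/*.wav")
labels = [os.path.basename(os.path.dirname(f)) for f in files]

X = np.stack([extract_features(f) for f in files])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Swapping MLPClassifier for a decision tree or SVM (sklearn.tree.DecisionTreeClassifier, sklearn.svm.SVC) would reproduce the classical baselines compared in the paper; the CNN and LSTM variants would instead consume the full MFCC time series rather than its time average.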
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Copyright Transfer Statement
The copyright of this article is transferred to the Current Applied Science and Technology journal, effective if and when the article is accepted for publication. The copyright transfer covers the exclusive right to reproduce and distribute the article, including reprints, translations, photographic reproductions, electronic form (offline, online), or any other reproductions of a similar nature.
The author warrants that this contribution is original and that he/she has full power to make this grant. The author signs for and accepts responsibility for releasing this material on behalf of any and all co-authors.