Classification of Human Emotion from Speech Recognition Using Deep Learning
Abstract
Human emotions are complex mental processes that respond to surrounding stimuli. They form a mechanism that allows humans to adjust to, and express their emotions in, various situations. In a particular situation, humans can manifest their emotions in diverse ways, so it is difficult to capture and understand their actual emotions. Predicting an interlocutor's emotions helps in deciding on proper actions for specific situations, such as treating patients with depression or those who need psychotherapy. This study develops deep learning models to classify human emotions from speech. Voices are classified into five emotion classes: normal, anger, surprise, happiness, and sadness. The objectives of this study are to 1) compare the performance of two classification models, Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, and 2) propose the most appropriate model for classifying human emotions from speech. The results reveal that the LSTM classifier outperforms the CNN. With the LSTM, human speech is recognized into the five emotion classes: normal, angry, surprised, happy, and sad.
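To illustrate the kind of sequence model the abstract compares, the following is a minimal, untrained numpy sketch of an LSTM run over a sequence of speech feature frames (e.g. MFCCs) and projected onto the five emotion classes. This is not the authors' implementation; the hidden size, feature dimension, and weight initialization are all illustrative assumptions.

```python
import numpy as np

# The five emotion classes used in the study.
EMOTIONS = ["normal", "angry", "surprised", "happy", "sad"]

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: gates computed from input x and previous hidden state h."""
    hidden = h.shape[0]
    z = W @ x + U @ h + b                         # stacked gate pre-activations, shape (4*hidden,)
    i = 1.0 / (1.0 + np.exp(-z[:hidden]))         # input gate
    f = 1.0 / (1.0 + np.exp(-z[hidden:2*hidden])) # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*hidden:3*hidden]))  # output gate
    g = np.tanh(z[3*hidden:])                     # candidate cell state
    c = f * c + i * g                             # new cell state
    h = o * np.tanh(c)                            # new hidden state
    return h, c

def classify(frames, hidden=16, seed=0):
    """Run an LSTM over feature frames (time x features) and return a
    softmax distribution over the five emotion classes.
    Weights are random (untrained) -- this only sketches the architecture."""
    rng = np.random.default_rng(seed)
    n_feat = frames.shape[1]
    W = rng.normal(0.0, 0.1, (4 * hidden, n_feat))   # input-to-gate weights
    U = rng.normal(0.0, 0.1, (4 * hidden, hidden))   # recurrent weights
    b = np.zeros(4 * hidden)
    V = rng.normal(0.0, 0.1, (len(EMOTIONS), hidden))  # output projection
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in frames:                  # iterate over time steps
        h, c = lstm_step(x, h, c, W, U, b)
    logits = V @ h                    # classify from the final hidden state
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Hypothetical input: 50 frames of 13 MFCC coefficients.
probs = classify(np.random.default_rng(1).normal(size=(50, 13)))
print(EMOTIONS[int(np.argmax(probs))])
```

In a trained system the weights would be fit on labeled emotional speech, and the CNN baseline would instead convolve over a time-frequency representation of the same features before classification.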
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The content and information in the articles published in the Journal of Science and Technology, Sisaket Rajabhat University reflect the opinions and responsibilities of the respective authors. The editorial board of the journal does not necessarily agree with, nor share responsibility for, these views.
Articles, information, content, images, etc., published in the Journal of Science and Technology, Sisaket Rajabhat University are copyrighted by the Faculty of Science and Technology, Sisaket Rajabhat University. If any individual or organization wishes to republish all or part of the content, or use it for any other purpose, they must obtain written permission from the Journal of Science and Technology, Sisaket Rajabhat University beforehand.