1. College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, P. R. China;
2. Shanxi Academy of Advanced Research Innovation, Taiyuan 030032, P. R. China;
3. College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, P. R. China
XUE Peiyun, Email: xuepeiyun@tyut.edu.cn

Emotion reflects human psychological and physiological health, and speech and facial expressions are its principal channels of expression. Extracting and effectively fusing these two modalities of emotional information is one of the central challenges in emotion recognition. This paper proposes a multi-branch bidirectional multi-scale temporal perception model that processes the Mel-frequency cepstral coefficients of speech in both the forward and reverse directions along the time dimension. The model uses causal convolution to capture temporal correlations among features at different scales and, guided by this information, assigns attention maps to the scales, yielding a multi-scale fusion of speech emotion features. In addition, this paper proposes a bimodal feature dynamic fusion algorithm that draws on the strengths of AlexNet, applying overlapping max-pooling layers to the concatenated feature matrices of the two modalities to obtain richer fused features. Experimental results show that the proposed multi-branch bidirectional multi-scale temporal perception bimodal emotion recognition model achieves accuracies of 97.67% and 90.14% on two public audio-visual emotion datasets, outperforming other common methods and indicating that the proposed model can effectively capture emotional feature information and improve recognition accuracy.
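To make the bidirectional multi-scale temporal perception branch concrete, the sketch below gives one minimal PyTorch interpretation: causal convolutions at several dilations run over the forward and the time-reversed MFCC sequence, and a softmax attention map over the scales fuses them. All names and hyperparameters (`BiMultiScaleBranch`, 40 MFCC dimensions, kernel size 3, dilations 1/2/4) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming the structure described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D causal convolution: left-pads so an output never sees future frames."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class BiMultiScaleBranch(nn.Module):
    """Causal convs at several dilations over the forward and time-reversed
    MFCC sequence; an attention map over scales fuses the resulting features."""
    def __init__(self, in_ch=40, out_ch=64, dilations=(1, 2, 4)):
        super().__init__()
        self.scales = nn.ModuleList(
            CausalConv1d(in_ch, out_ch, kernel_size=3, dilation=d)
            for d in dilations)
        # One attention logit per scale, computed from that scale's features.
        self.attn = nn.Conv1d(out_ch, 1, kernel_size=1)

    def _multi_scale(self, x):
        feats = torch.stack([F.relu(c(x)) for c in self.scales], dim=1)  # (B,S,C,T)
        logits = torch.stack([self.attn(f) for f in feats.unbind(1)], dim=1)
        weights = torch.softmax(logits, dim=1)       # attention map over scales
        return (weights * feats).sum(dim=1)          # (B, C, T)

    def forward(self, mfcc):                   # mfcc: (batch, n_mfcc, time)
        fwd = self._multi_scale(mfcc)
        bwd = self._multi_scale(torch.flip(mfcc, dims=[-1]))  # reversed time
        return torch.cat([fwd, torch.flip(bwd, dims=[-1])], dim=1)

x = torch.randn(2, 40, 300)                    # 2 utterances, 40 MFCCs, 300 frames
print(BiMultiScaleBranch()(x).shape)           # torch.Size([2, 128, 300])
```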
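The bimodal fusion step can be read in the same spirit. The sketch below assumes channel-wise concatenation of the speech and facial feature maps followed by AlexNet-style overlapping max pooling (kernel size 3 with stride 2, so pooling windows overlap); the input shapes and the `OverlapPoolFusion` name are assumptions for illustration only.

```python
# A minimal sketch of the bimodal dynamic fusion step, under the
# assumptions stated above.
import torch
import torch.nn as nn

class OverlapPoolFusion(nn.Module):
    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # kernel_size > stride gives the overlapping pooling used in AlexNet.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, speech_feat, face_feat):
        fused = torch.cat([speech_feat, face_feat], dim=1)  # channel concat
        return self.pool(torch.relu(self.conv(fused)))

speech = torch.randn(2, 64, 32, 32)   # speech-branch feature map (assumed shape)
face = torch.randn(2, 64, 32, 32)     # facial-branch feature map (assumed shape)
print(OverlapPoolFusion(in_ch=128)(speech, face).shape)  # torch.Size([2, 128, 15, 15])
```

Because the 3x3 pooling windows advance by only 2 positions, adjacent windows share a column and a row of activations, which is the "richer fusion features" effect the abstract attributes to overlapping max pooling.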
