1. College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, P. R. China;
2. Shanxi Academy of Advanced Research Innovation, Taiyuan 030032, P. R. China;
3. College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, P. R. China
XUE Peiyun, Email: xuepeiyun@tyut.edu.cn

Emotion reflects human psychological and physiological health, and speech and facial expressions are its principal channels of expression. Extracting and effectively fusing these two modalities of emotional information is one of the central challenges in emotion recognition. This paper proposes a multi-branch bidirectional multi-scale temporal perception model that processes the Mel-frequency cepstral coefficients of speech in both the forward and reverse directions along the time dimension. The model uses causal convolution to capture temporal correlations among features at different scales and, guided by this information, assigns attention maps to the scales, yielding a multi-scale fusion of speech emotion features. In addition, this paper proposes a bimodal feature dynamic fusion algorithm that draws on the strengths of AlexNet, applying overlapping max-pooling layers to the concatenated feature matrices of the two modalities to obtain richer fused features. Experimental results show that the proposed multi-branch bidirectional multi-scale temporal perception bimodal emotion recognition model achieves accuracies of 97.67% and 90.14% on two public audio-visual emotion datasets, outperforming other common methods and indicating that the proposed model can effectively capture emotional feature information and improve recognition accuracy.
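To make the bidirectional multi-scale temporal perception branch concrete, the sketch below gives one minimal PyTorch interpretation: causal convolutions at several dilations run over the forward and the time-reversed MFCC sequence, and a softmax attention map over the scales fuses them. All names and hyperparameters (`BiMultiScaleBranch`, 40 MFCC dimensions, kernel size 3, dilations 1/2/4) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming the structure described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D causal convolution: left-pads so an output never sees future frames."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class BiMultiScaleBranch(nn.Module):
    """Causal convs at several dilations over the forward and time-reversed
    MFCC sequence; an attention map over scales fuses the resulting features."""
    def __init__(self, in_ch=40, out_ch=64, dilations=(1, 2, 4)):
        super().__init__()
        self.scales = nn.ModuleList(
            CausalConv1d(in_ch, out_ch, kernel_size=3, dilation=d)
            for d in dilations)
        # One attention logit per scale, computed from that scale's features.
        self.attn = nn.Conv1d(out_ch, 1, kernel_size=1)

    def _multi_scale(self, x):
        feats = torch.stack([F.relu(c(x)) for c in self.scales], dim=1)  # (B,S,C,T)
        logits = torch.stack([self.attn(f) for f in feats.unbind(1)], dim=1)
        weights = torch.softmax(logits, dim=1)       # attention map over scales
        return (weights * feats).sum(dim=1)          # (B, C, T)

    def forward(self, mfcc):                   # mfcc: (batch, n_mfcc, time)
        fwd = self._multi_scale(mfcc)
        bwd = self._multi_scale(torch.flip(mfcc, dims=[-1]))  # reversed time
        return torch.cat([fwd, torch.flip(bwd, dims=[-1])], dim=1)

x = torch.randn(2, 40, 300)                    # 2 utterances, 40 MFCCs, 300 frames
print(BiMultiScaleBranch()(x).shape)           # torch.Size([2, 128, 300])
```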
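The bimodal fusion step can be read in the same spirit. The sketch below assumes channel-wise concatenation of the speech and facial feature maps followed by AlexNet-style overlapping max pooling (kernel size 3 with stride 2, so pooling windows overlap); the input shapes and the `OverlapPoolFusion` name are assumptions for illustration only.

```python
# A minimal sketch of the bimodal dynamic fusion step, under the
# assumptions stated above.
import torch
import torch.nn as nn

class OverlapPoolFusion(nn.Module):
    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # kernel_size > stride gives the overlapping pooling used in AlexNet.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)

    def forward(self, speech_feat, face_feat):
        fused = torch.cat([speech_feat, face_feat], dim=1)  # channel concat
        return self.pool(torch.relu(self.conv(fused)))

speech = torch.randn(2, 64, 32, 32)   # speech-branch feature map (assumed shape)
face = torch.randn(2, 64, 32, 32)     # facial-branch feature map (assumed shape)
print(OverlapPoolFusion(in_ch=128)(speech, face).shape)  # torch.Size([2, 128, 15, 15])
```

Because the 3x3 pooling windows advance by only 2 positions, adjacent windows share a column and a row of activations, which is the "richer fusion features" effect the abstract attributes to overlapping max pooling.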
