RT info:eu-repo/semantics/article
T1 Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review
A1 Chaves-Villota, Andrea
A1 Jimenez-Martín, Ana
A1 Jojoa Acosta, Mario Fernando
A1 Bahillo Martínez, Alfonso
A1 García-Domínguez, Juan Jesús
K1 Emotion recognition
K1 Speech
K1 Linguistic
K1 Acoustic
K1 Fusion
K1 Deep learning
K1 Machine learning
K1 Low and high-level features
AB Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on several key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. We include journal and conference papers that provide essential insights into the key parameters required for dataset development, covering emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the modalities most commonly integrated with acoustic and linguistic data from speech, and the technologies used. Following this methodology, we also identified the representative features that serve as critical sources of emotional information in both modalities: for the acoustic modality, features extracted from the time and frequency domains; for the linguistic modality, earlier word embeddings and the most common transformer models. In addition, Deep Learning (DL) and attention-based methods were analyzed for both. Given the importance of effectively combining these diverse features to improve ER, we then explore fusion techniques organized by level of abstraction. Specifically, we focus on feature-, decision-, DL-, and attention-based fusion methods. Next, we provide a comparative analysis to assess the performance of the approaches included in our study. Our findings indicate that, on the most commonly used datasets in the literature (IEMOCAP and MELD), the integration of acoustic and linguistic features reached weighted accuracies (WA) of 85.71% and 63.80%, respectively. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech.
PB Elsevier Ltd.
SN 0885-2308
YR 2026
FD 2026
LK https://uvadoc.uva.es/handle/10324/80150
UL https://uvadoc.uva.es/handle/10324/80150
LA eng
NO Computer Speech & Language, 2026, vol. 96, p. 101873
NO Producción Científica
DS UVaDOC
RD 28-nov-2025