Please use this identifier to cite or link to this item: https://uvadoc.uva.es/handle/10324/80150
Title
Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review
Author
Document Year
2026
Publisher
Elsevier Ltd.
Description
Scientific Production
Source Document
Computer Speech & Language, 2026, vol. 96, p. 101873
Abstract
Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on multiple key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. Accordingly, we include journal and conference papers that provide essential insights into the key parameters required for dataset development, covering emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the modalities most commonly integrated with acoustic and linguistic data from speech, and the technologies used. Following the same methodology, we identified the representative features that serve as critical sources of emotional information in both modalities: for the acoustic modality, features extracted from the time and frequency domains; for the linguistic modality, earlier word embeddings and the most common transformer models. In addition, Deep Learning (DL) and attention-based methods were analyzed for both. Given the importance of effectively combining these diverse features to improve ER, we then explore fusion techniques organized by level of abstraction, focusing on traditional approaches: feature-, decision-, DL-, and attention-based fusion methods. Next, we provide a comparative analysis of the performance of the approaches included in our study. Our findings indicate that, for the most commonly used datasets in the literature, IEMOCAP and MELD, the integration of acoustic and linguistic features reached a weighted accuracy (WA) of 85.71% and 63.80%, respectively. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech.
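To illustrate the distinction the abstract draws between feature-level (early) and decision-level (late) fusion of acoustic and linguistic representations, the following is a minimal PyTorch sketch. It is not taken from the reviewed paper: the class names, embedding dimensions (a 128-dim acoustic vector and a 768-dim linguistic vector, e.g. from a transformer encoder), and the four-emotion output are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FeatureLevelFusion(nn.Module):
    """Early (feature-level) fusion: concatenate acoustic and linguistic
    embeddings, then apply a single shared emotion classifier."""

    def __init__(self, acoustic_dim=128, linguistic_dim=768, num_emotions=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(acoustic_dim + linguistic_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, acoustic_emb, linguistic_emb):
        fused = torch.cat([acoustic_emb, linguistic_emb], dim=-1)
        return self.classifier(fused)


class DecisionLevelFusion(nn.Module):
    """Late (decision-level) fusion: each modality has its own classifier,
    and the per-modality emotion scores are averaged."""

    def __init__(self, acoustic_dim=128, linguistic_dim=768, num_emotions=4):
        super().__init__()
        self.acoustic_head = nn.Linear(acoustic_dim, num_emotions)
        self.linguistic_head = nn.Linear(linguistic_dim, num_emotions)

    def forward(self, acoustic_emb, linguistic_emb):
        acoustic_logits = self.acoustic_head(acoustic_emb)
        linguistic_logits = self.linguistic_head(linguistic_emb)
        return (acoustic_logits + linguistic_logits) / 2


if __name__ == "__main__":
    # Toy batch: 8 utterances with pre-extracted embeddings
    # (hypothetical acoustic features and transformer sentence embeddings).
    acoustic = torch.randn(8, 128)
    linguistic = torch.randn(8, 768)

    print(FeatureLevelFusion()(acoustic, linguistic).shape)   # torch.Size([8, 4])
    print(DecisionLevelFusion()(acoustic, linguistic).shape)  # torch.Size([8, 4])
```

DL- and attention-based fusion methods discussed in the review instead learn the cross-modal interaction itself (e.g. via shared layers or cross-attention) rather than fixing it to concatenation or score averaging.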
Keywords
Emotion recognition
Speech
Linguistic
Acoustic
Fusion
Deep learning
Machine learning
Low- and high-level features
ISSN
0885-2308
Peer Reviewed
Yes
Sponsor
FrailAlert project SBPLY/21/180501/000216, co-funded by the Junta de Comunidades de Castilla-La Mancha and the European Union through the European Regional Development Fund
ActiTracker TED2021-130867B-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR
INDRI (PID2021-122642OB-C41 /AEI/10.13039/501100011033/ FEDER, UE)
Ministerio de Ciencia e Innovación under project PID2023-146254OB-C41
Publisher's Version
Language
eng
Version Type
info:eu-repo/semantics/publishedVersion
Rights
openAccess
Collections
Files in this item
Size:
2.748 MB
Format:
Adobe PDF
Except where otherwise noted, this item's license is described as Attribution 4.0 International