Show simple item record
| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Chaves-Villota, Andrea | |
| dc.contributor.author | Jimenez-Martín, Ana | |
| dc.contributor.author | Jojoa Acosta, Mario Fernando | |
| dc.contributor.author | Bahillo Martínez, Alfonso | |
| dc.contributor.author | García-Domínguez, Juan Jesús | |
| dc.date.accessioned | 2025-11-28T11:03:43Z | |
| dc.date.available | 2025-11-28T11:03:43Z | |
| dc.date.issued | 2026 | |
| dc.identifier.citation | Computer Speech & Language, 2026, vol. 96, p. 101873 | es |
| dc.identifier.issn | 0885-2308 | es |
| dc.identifier.uri | https://uvadoc.uva.es/handle/10324/80150 | |
| dc.description | Producción Científica | es |
| dc.description.abstract | Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on several key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. Accordingly, we include journal and conference papers that provide essential insights into the key parameters required for dataset development, covering emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the modalities most commonly integrated with acoustic and linguistic data from speech, and the technologies used. Following the same methodology, we identified the key representative features that serve as critical sources of emotional information in both modalities. For the acoustic modality, these include features extracted from the time and frequency domains, while for the linguistic modality, earlier embeddings and the most common transformer models were considered. In addition, Deep Learning (DL) and attention-based methods were analyzed for both modalities. Given the importance of effectively combining these diverse features to improve ER, we then explore fusion techniques organized by level of abstraction, focusing on traditional approaches that include feature-, decision-, DL-, and attention-based fusion methods. Next, we provide a comparative analysis of the performance of the approaches included in our study. Our findings indicate that, for the most commonly used datasets in the literature (IEMOCAP and MELD), the integration of acoustic and linguistic features reached weighted accuracies (WA) of 85.71% and 63.80%, respectively. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech. | es |
| dc.format.mimetype | application/pdf | es |
| dc.language.iso | eng | es |
| dc.publisher | Elsevier Ltd. | es |
| dc.rights.accessRights | info:eu-repo/semantics/openAccess | es |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | * |
| dc.subject.classification | Emotion recognition | es |
| dc.subject.classification | Speech | es |
| dc.subject.classification | Linguistic | es |
| dc.subject.classification | Acoustic | es |
| dc.subject.classification | Fusion | es |
| dc.subject.classification | Deep learning | es |
| dc.subject.classification | Machine learning | es |
| dc.subject.classification | Low and high-level features | es |
| dc.title | Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review | es |
| dc.type | info:eu-repo/semantics/article | es |
| dc.identifier.doi | 10.1016/j.csl.2025.101873 | es |
| dc.relation.publisherversion | https://www.sciencedirect.com/science/article/pii/S0885230825000981 | es |
| dc.identifier.publicationfirstpage | 101873 | es |
| dc.identifier.publicationtitle | Computer Speech & Language | es |
| dc.identifier.publicationvolume | 96 | es |
| dc.peerreviewed | SI | es |
| dc.description.project | FrailAlert project SBPLY/21/180501/000216, co-funded by the Junta de Comunidades de Castilla-La Mancha and the European Union through the European Regional Development Fund | es |
| dc.description.project | ActiTracker TED2021-130867B-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR | es |
| dc.description.project | INDRI (PID2021-122642OB-C41 /AEI/10.13039/501100011033/ FEDER, UE) | es |
| dc.description.project | Ministerio de Ciencia e Innovación under project PID2023-146254OB-C41 | es |
| dc.rights | Attribution 4.0 International | * |
| dc.type.hasVersion | info:eu-repo/semantics/publishedVersion | es |
Files in this item
This item appears in the following collection(s)
The item license is described as Attribution 4.0 International
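
The abstract above distinguishes feature-level (early) and decision-level (late) fusion of acoustic and linguistic representations. The following minimal Python sketch, not taken from the reviewed paper, illustrates the two strategies on toy data; the embedding dimensions, the random stand-in classifier, and the four-class label set are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical utterance-level embeddings: an 88-dim acoustic vector
# (eGeMAPS-like) and a 768-dim linguistic vector (BERT-like); the values
# here are random stand-ins, not real features.
acoustic = rng.normal(size=(1, 88))
linguistic = rng.normal(size=(1, 768))

N_CLASSES = 4  # e.g. angry, happy, neutral, sad (illustrative label set)


def toy_classifier(x: np.ndarray, n_classes: int, seed: int) -> np.ndarray:
    """Untrained random linear layer + softmax, standing in for a trained model."""
    w = np.random.default_rng(seed).normal(size=(x.shape[1], n_classes))
    logits = x @ w
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


# Feature-level (early) fusion: concatenate the modalities first,
# then run a single classifier on the joint representation.
early = toy_classifier(np.concatenate([acoustic, linguistic], axis=1),
                       N_CLASSES, seed=1)

# Decision-level (late) fusion: run one classifier per modality,
# then combine the class posteriors (here, a simple average).
late = 0.5 * (toy_classifier(acoustic, N_CLASSES, seed=2)
              + toy_classifier(linguistic, N_CLASSES, seed=3))

print("feature-level fusion posteriors:", np.round(early, 3))
print("decision-level fusion posteriors:", np.round(late, 3))
```

Feature-level fusion lets a single model see cross-modal interactions directly, whereas decision-level fusion keeps the modality-specific models independent and only merges their outputs; the DL- and attention-based fusion methods covered in the review replace the concatenation or averaging step with learned fusion modules.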



