Citation

Please use this identifier to cite or link to this item: https://uvadoc.uva.es/handle/10324/80150

Title
    Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review
Authors
    Chaves-Villota, Andrea
    Jimenez-Martín, Ana
Jojoa Acosta, Mario Fernando
Bahillo Martínez, Alfonso
    García-Domínguez, Juan Jesús
Year
    2026
Publisher
    Elsevier Ltd.
Description
Scientific production
Source Document
Computer Speech & Language, 2026, vol. 96, p. 101873
Abstract
Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on several key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. We included journal and conference papers that provide essential insights into the key parameters of dataset development: emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the modalities most commonly integrated with acoustic and linguistic data from speech, and the technologies used. Following the same methodology, we identified the representative features that serve as critical sources of emotional information in both modalities: for the acoustic modality, features extracted from the time and frequency domains; for the linguistic modality, earlier word embeddings and the most common transformer models. In addition, Deep Learning (DL) and attention-based methods were analyzed for both. Given the importance of effectively combining these diverse features to improve ER, we then explore fusion techniques organized by level of abstraction, focusing on traditional approaches: feature-, decision-, DL-, and attention-based fusion methods. We also provide a comparative analysis of the performance of the approaches included in our study. Our findings indicate that, on the two datasets most commonly used in the literature, IEMOCAP and MELD, the integration of acoustic and linguistic features reached weighted accuracies (WA) of 85.71% and 63.80%, respectively. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech.
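
To make the fusion levels discussed in the abstract concrete, the sketch below contrasts feature-level (early) fusion, which concatenates acoustic and linguistic feature vectors before classification, with decision-level (late) fusion, which averages the class probabilities of per-modality classifiers. It is a minimal illustration in Python: the feature dimensions, the random toy data, and the scikit-learn LogisticRegression classifiers are assumptions made for this example, not the setup of any study covered by the review.

# Minimal sketch of feature-level (early) vs. decision-level (late) fusion
# for speech emotion recognition. All data here is synthetic: the acoustic
# vector stands in for, e.g., time/frequency-domain statistics, and the
# linguistic vector for a pooled text embedding.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_acoustic, n_linguistic, n_classes = 200, 40, 64, 4

X_acoustic = rng.normal(size=(n_samples, n_acoustic))      # toy acoustic features
X_linguistic = rng.normal(size=(n_samples, n_linguistic))  # toy linguistic features
y = rng.integers(0, n_classes, size=n_samples)             # toy emotion labels

# Feature-level (early) fusion: concatenate modalities, train one classifier.
X_early = np.concatenate([X_acoustic, X_linguistic], axis=1)
early_clf = LogisticRegression(max_iter=1000).fit(X_early, y)

# Decision-level (late) fusion: train one classifier per modality,
# then average their class-probability outputs.
clf_a = LogisticRegression(max_iter=1000).fit(X_acoustic, y)
clf_l = LogisticRegression(max_iter=1000).fit(X_linguistic, y)
late_probs = 0.5 * (clf_a.predict_proba(X_acoustic)
                    + clf_l.predict_proba(X_linguistic))
late_pred = late_probs.argmax(axis=1)

print("early-fusion accuracy:", early_clf.score(X_early, y))
print("late-fusion accuracy:", (late_pred == y).mean())

In the surveyed approaches, DL- and attention-based fusion replace the fixed concatenation and averaging shown here with learned cross-modal interactions.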
Keywords
    Emotion recognition
    Speech
    Linguistic
    Acoustic
    Fusion
    Deep learning
    Machine learning
Low- and high-level features
    ISSN
    0885-2308
Peer Reviewed
Yes
    DOI
    10.1016/j.csl.2025.101873
Sponsors
FrailAlert project SBPLY/21/180501/000216, co-funded by the Junta de Comunidades de Castilla-La Mancha and the European Union through the European Regional Development Fund (ERDF)
ActiTracker project TED2021-130867B-I00, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR
INDRI project (PID2021-122642OB-C41 / AEI/10.13039/501100011033 / FEDER, UE)
Ministerio de Ciencia e Innovación under project PID2023-146254OB-C41
Publisher's Version
    https://www.sciencedirect.com/science/article/pii/S0885230825000981
Language
    eng
    URI
    https://uvadoc.uva.es/handle/10324/80150
Version Type
    info:eu-repo/semantics/publishedVersion
Rights
    openAccess
Appears in Collections
    • DEP71 - Artículos de revista [373]
Files in this Item
Name:
Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities A systematic review.pdf
Size:
2.748 MB
Format:
Adobe PDF
License
Atribución 4.0 Internacional (Creative Commons Attribution 4.0 International). Except where otherwise noted, the item's license is described as Atribución 4.0 Internacional.
