<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/style.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-23T20:28:20Z</responseDate><request verb="GetRecord" identifier="oai:uvadoc.uva.es:10324/80150" metadataPrefix="mods">https://uvadoc.uva.es/oai/request</request><GetRecord><record><header><identifier>oai:uvadoc.uva.es:10324/80150</identifier><datestamp>2025-12-15T09:25:05Z</datestamp><setSpec>com_10324_1191</setSpec><setSpec>com_10324_931</setSpec><setSpec>com_10324_894</setSpec><setSpec>col_10324_1379</setSpec></header><metadata><mods:mods xmlns:mods="http://www.loc.gov/mods/v3" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-1.xsd">
<mods:name>
<mods:namePart>Chaves Villota, Andrea</mods:namePart>
</mods:name>
<mods:name>
<mods:namePart>Jiménez Martín, Ana</mods:namePart>
</mods:name>
<mods:name>
<mods:namePart>Jojoa Acosta, Mario Fernando</mods:namePart>
</mods:name>
<mods:name>
<mods:namePart>Bahillo Martínez, Alfonso</mods:namePart>
</mods:name>
<mods:name>
<mods:namePart>García Domínguez, Juan Jesús</mods:namePart>
</mods:name>
<mods:extension>
<mods:dateAvailable encoding="iso8601">2025-11-28T11:03:43Z</mods:dateAvailable>
</mods:extension>
<mods:extension>
<mods:dateAccessioned encoding="iso8601">2025-11-28T11:03:43Z</mods:dateAccessioned>
</mods:extension>
<mods:originInfo>
<mods:dateIssued encoding="iso8601">2026</mods:dateIssued>
</mods:originInfo>
<mods:identifier type="citation">Computer Speech &amp; Language Volume, 2026, vol. 96, p. 101873</mods:identifier>
<mods:identifier type="issn">0885-2308</mods:identifier>
<mods:identifier type="uri">https://uvadoc.uva.es/handle/10324/80150</mods:identifier>
<mods:identifier type="doi">10.1016/j.csl.2025.101873</mods:identifier>
<mods:identifier type="publicationfirstpage">101873</mods:identifier>
<mods:identifier type="publicationtitle">Computer Speech &amp; Language</mods:identifier>
<mods:identifier type="publicationvolume">96</mods:identifier>
<mods:abstract>Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on multiple key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress in research, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. Thus, we include journal and conference papers that provide essential insights into key parameters required for dataset development, involving emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the most common modalities integrated with acoustic and linguistic data from speech, and the technologies used. Similarly, following this methodology, we identified the key representative features that serve as critical emotional information sources in both modalities. For the acoustic modality, this included features extracted from the time and frequency domains, while for the linguistic modality, earlier embeddings and the most common transformer models were considered. In addition, Deep Learning (DL) and attention-based methods were analyzed for both. Given the importance of effectively combining these diverse features for improving ER, we then explore fusion techniques based on the level of abstraction. Specifically, we focus on traditional approaches, including feature-, decision-, DL-, and attention-based fusion methods. Next, we provide a comparative analysis to assess the performance of the approaches included in our study. Our findings indicate that for the most commonly used datasets in the literature, IEMOCAP and MELD, the integration of acoustic and linguistic features reached a weighted accuracy (WA) of 85.71% and 63.80%, respectively. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech.</mods:abstract>
<mods:language>
<mods:languageTerm>eng</mods:languageTerm>
</mods:language>
<mods:accessCondition type="useAndReproduction">info:eu-repo/semantics/openAccess</mods:accessCondition>
<mods:accessCondition type="useAndReproduction">http://creativecommons.org/licenses/by/4.0/</mods:accessCondition>
<mods:accessCondition type="useAndReproduction">Atribución 4.0 Internacional</mods:accessCondition>
<mods:accessCondition type="useAndReproduction">Atribución 4.0 Internacional</mods:accessCondition>
<mods:titleInfo>
<mods:title>Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review</mods:title>
</mods:titleInfo>
<mods:genre>info:eu-repo/semantics/article</mods:genre>
</mods:mods></metadata></record></GetRecord></OAI-PMH>