<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/style.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-26T21:59:55Z</responseDate><request verb="GetRecord" identifier="oai:uvadoc.uva.es:10324/80150" metadataPrefix="rdf">https://uvadoc.uva.es/oai/request</request><GetRecord><record><header><identifier>oai:uvadoc.uva.es:10324/80150</identifier><datestamp>2025-12-15T09:25:05Z</datestamp><setSpec>com_10324_1191</setSpec><setSpec>com_10324_931</setSpec><setSpec>com_10324_894</setSpec><setSpec>col_10324_1379</setSpec></header><metadata><rdf:RDF xmlns:rdf="http://www.openarchives.org/OAI/2.0/rdf/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ds="http://dspace.org/ds/elements/1.1/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ow="http://www.ontoweb.org/ontology/1#" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/rdf/ http://www.openarchives.org/OAI/2.0/rdf.xsd">
<ow:Publication rdf:about="oai:uvadoc.uva.es:10324/80150">
<dc:title>Deep feature representations and fusion strategies for speech emotion recognition from acoustic and linguistic modalities: A systematic review</dc:title>
<dc:creator>Chaves Villota, Andrea</dc:creator>
<dc:creator>Jiménez Martín, Ana</dc:creator>
<dc:creator>Jojoa Acosta, Mario Fernando</dc:creator>
<dc:creator>Bahillo Martínez, Alfonso</dc:creator>
<dc:creator>García Domínguez, Juan Jesús</dc:creator>
<dc:description>Producción Científica</dc:description>
<dc:description>Emotion Recognition (ER) has gained significant attention due to its importance in advanced human-machine interaction and its widespread real-world applications. In recent years, research on ER systems has focused on several key aspects, including the development of high-quality emotional databases, the selection of robust feature representations, and the implementation of advanced classifiers leveraging AI-based techniques. Despite this progress, ER still faces significant challenges and gaps that must be addressed to develop accurate and reliable systems. To systematically assess these critical aspects, particularly those centered on AI-based techniques, we employed the PRISMA methodology. We include journal and conference papers that provide essential insights into the key parameters required for dataset development, covering emotion modeling (categorical or dimensional), the type of speech data (natural, acted, or elicited), the modalities most commonly integrated with acoustic and linguistic data from speech, and the technologies used. Following this methodology, we also identified the key representative features that serve as critical sources of emotional information in both modalities: for the acoustic modality, features extracted from the time and frequency domains; for the linguistic modality, earlier embeddings and the most common transformer models. In addition, Deep Learning (DL) and attention-based methods were analyzed for both. Given the importance of effectively combining these diverse features to improve ER, we then explore fusion techniques classified by their level of abstraction, focusing on traditional approaches, including feature-, decision-, DL-, and attention-based fusion methods. Next, we provide a comparative analysis of the performance of the approaches included in our study. Our findings indicate that, on the two datasets most commonly used in the literature, the integration of acoustic and linguistic features reached a weighted accuracy (WA) of 85.71% on IEMOCAP and 63.80% on MELD. Finally, we discuss the main challenges and propose future guidelines that could enhance the performance of ER systems using acoustic and linguistic features from speech.</dc:description>
<dc:date>2025-11-28T11:03:43Z</dc:date>
<dc:date>2025-11-28T11:03:43Z</dc:date>
<dc:date>2026</dc:date>
<dc:type>info:eu-repo/semantics/article</dc:type>
<dc:identifier>Computer Speech &amp; Language, 2026, vol. 96, p. 101873</dc:identifier>
<dc:identifier>0885-2308</dc:identifier>
<dc:identifier>https://uvadoc.uva.es/handle/10324/80150</dc:identifier>
<dc:identifier>10.1016/j.csl.2025.101873</dc:identifier>
<dc:identifier>101873</dc:identifier>
<dc:identifier>Computer Speech &amp; Language</dc:identifier>
<dc:identifier>96</dc:identifier>
<dc:language>eng</dc:language>
<dc:relation>https://www.sciencedirect.com/science/article/pii/S0885230825000981</dc:relation>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:rights>http://creativecommons.org/licenses/by/4.0/</dc:rights>
<dc:rights>Attribution 4.0 International</dc:rights>
<dc:publisher>Elsevier Ltd.</dc:publisher>
<dc:peerreviewed>SI</dc:peerreviewed>
</ow:Publication>
</rdf:RDF></metadata></record></GetRecord></OAI-PMH>