RT info:eu-repo/semantics/doctoralThesis
T1 Corpus linguistics and contrastive analysis of sensory discourse: Applications for bilingual (ES-EN) text production
A1 Sanz Valdivieso, Lucía
A2 Universidad de Valladolid. Escuela de Doctorado
K1 Filología inglesa
K1 English Philology
K1 Translation and interpretation
K1 Traducción e interpretación
K1 General Linguistics
K1 Lingüística general
K1 Computer Science
K1 Ciencias de la computación
K1 5701.13 Lingüística Aplicada a la Traducción E Interpretación
AB International professionals such as Spanish wine and olive oil experts need to write technical texts in English to participate in the international market. To do so, many L2 English professionals resort to language technologies to help them obtain domain-specific texts in English. However, available tools are often prone to linguistic and/or domain-specific mistakes. This problem is aggravated for users of linguistically low-resource language varieties, who have few reliable tools at hand: when the available bilingual data is scarce or defective, it is not easy to develop a language- and domain-compliant writing tool. This limitation is further exacerbated by current neural systems’ need for large datasets to achieve state-of-the-art performance. This dissertation proposes a data-centric, Corpus Linguistics-informed intervention focused on terminology injection as a domain adaptation strategy for neural bilingual production in low-resource language varieties. On the one hand, a language model based on small, comparable, domain-specific monolingual corpora is used to select the most similar data from material automatically downloaded from selected domain-specific sources. On the other hand, a full-form Spanish-English glossary is employed as a terminological reference to filter or curate large corpora so that the system learns language- and domain-adequate equivalences. Additionally, a backtranslation approach is used to augment the datasets used to train the system. To assess the proposed domain adaptation protocol, a set of experimental neural MT systems was evaluated and compared with one another and with the commercial system Google Translate from three perspectives: automated metrics, human judgement, and comparison to the gold-standard terminological reference.
Results of the automated evaluation of the experimental and commercial systems suggest that the domain-adapted systems outperform Google Translate in the translation of wine and olive oil tasting notes. Nevertheless, according to human judgement, the best-scoring experimental system is outperformed by Google Translate in terms of general performance and the number and severity of terminological errors. Most importantly, the findings show an improvement in the terminological performance of the experimental systems after training on the curated domain-specific data. These results are in line with or surpass those of previous experimental systems trained with different data-centric domain adaptation strategies based on terminology injection, which are usually more computationally demanding and less efficient than the techniques proposed here. However, diversity in evaluation frameworks and language pairs, together with a lack of detailed results, often hinders comparability with previous literature. In sum, and despite the study's many limitations, this dissertation shows the potential of Corpus Linguistics for the development of domain-adapted neural language production tools aimed at helping Spanish professionals in linguistically low-resource fields engage successfully in international professional communication.
YR 2025
FD 2025
LK https://uvadoc.uva.es/handle/10324/75898
UL https://uvadoc.uva.es/handle/10324/75898
LA eng
NO Escuela de Doctorado
DS UVaDOC
RD 13-Apr-2026