This soraUVALAL_Readme.txt file was generated on 2022-06-15 by Sonja Mujcinovic & Raquel Fernández Fuertes

INDEX OF THE soraUVALAL DATASET

1. GENERAL INFORMATION
   1.1. Title of dataset
   1.2. Author information
      1.2.1. PI and co-PI
      1.2.2. Lab
      1.2.3. People involved in the data collection and data transcription
   1.3. Corpus description
   1.4. Funding sources
   1.5. Citing information
2. ACCESS INFORMATION
   2.1. Licenses or restrictions
   2.2. Publications
3. METHODOLOGICAL INFORMATION
   3.1. Data elicitation procedure
   3.2. Data transcription procedure
4. DATA
   4.1. Inventory of data files
   4.2. Database
   4.3. Last update
5. RELATED DATASETS

1. GENERAL INFORMATION

1.1. Title of dataset: the soraUVALAL corpus

1.2. Author information

1.2.1. PI and co-PI:

Name: Sonja Mujcinovic
Institution: University of Valladolid (Spain)
Address: Facultad de Filosofía y Letras, Paseo del Cauce s/n 47011, Valladolid (Spain)
Email: sonja.mujcinovic@uva.es

Name: Raquel Fernández Fuertes
Institution: University of Valladolid
Address: Facultad de Filosofía y Letras, Paseo del Cauce s/n 47011, Valladolid (Spain)
Email: raquelff@uva.es

1.2.2. Lab:

Name of the lab: UVALAL (University of Valladolid Language Acquisition Lab)
Institution: University of Valladolid (Spain)
Address: https://uvalal.uva.es
Email: gir.uvalal@uva.es

1.2.3. People involved in the data collection and data transcription

Both the oral and the written data were collected and transcribed by Sonja Mujcinovic, Tamara Gómez Carrero and Luis Miguel Toquero Pérez.

1.3. Corpus description

This corpus contains oral and written experimental production data from a total of 106 sequential bilingual children for whom English was their L2. These children belong to three groups depending on whether their L1 was Spanish (n=33), Bosnian (n=39) or Danish (n=34). Within each language group, there are two subgroups depending on how long the children had been exposed to the L2 (either 2 or 4 years).
The data were collected in the schools the participants attended in the country where they lived (i.e., Spain, Denmark, and Bosnia). The criteria applied when selecting the participants were the following:
- both parents and the child had to share the same L1 (Spanish, Bosnian or Danish depending on the group);
- the L2 of the participants had to be English (if the participants had an L3 which they started learning as part of the curriculum during the 3rd or the 4th year of instruction of the L2, they were not excluded from the study; otherwise, they were excluded);
- the participants had only received instruction in the L2 in educational settings;
- the participants had received instruction for either 2 or 4 years at their primary school;
- participants who had taken part in any exchange program or had lived in an English-speaking country for longer than two weeks were excluded from the study.

1.4. Funding sources

- 2018-2022: Spanish Ministry of Science, Innovation and Universities and European Regional Development Fund (ERDF) [PGC2018-097693-B-I00], Linguistic competence indicators in heritage and non-native languages: linguistic, psycholinguistic and social aspects of English-Spanish bilingualism, PRINCIPAL INVESTIGATOR: R. Fernández Fuertes (University of Valladolid, Spain)
- 2017-2019: Regional Government of Castile and León (Spain) and ERDF [VA009P17], Aspectos de la dimensión internacional del contacto de lenguas: diagnósticos de la competencia lingüística bilingüe inglés-español, PRINCIPAL INVESTIGATOR: R. Fernández Fuertes (University of Valladolid, Spain)

1.5. Citing information

Publications using this dataset (or any part of it) should cite this dataset as follows:
Mujcinovic, S. (2015). The analysis of subjects in the oral and written production of L2 English learners: transfer and language typology. In Pedro A. Fuertes-Olivera et al. (eds.), Current Work in Corpus Linguistics: Working with Traditionally-conceived Corpora and Beyond.
Selected Papers from the 7th International Conference on Corpus Linguistics (CILC2015). Procedia Social and Behavioral Sciences. Amsterdam: Elsevier.

2. ACCESS INFORMATION

2.1. Licenses or restrictions:

There are no licenses/restrictions placed on the data from the corpora in CHILDES (Child Language Data Exchange System), as they are freely available at the CHILDES project (https://childes.talkbank.org/) (MacWhinney 2000). However, in order to run the CLAN programs (Computerized Language ANalysis) to perform automatic searches and calculations in the data from the soraUVALAL corpus, the CLAN software needs to be downloaded and installed. The CLAN software is freely available in CHILDES and there are Windows, Mac and Unix versions (https://dali.talkbank.org/clan/).

2.2. Publications:

Partial or total access to the information contained in the database can be found at the UVALAL webpage (publications section, http://uvalal.uva.es/index.php/results/publications-2/).

3. METHODOLOGICAL INFORMATION

3.1. Data elicitation procedure

In order to obtain the data, two different tasks were designed: an oral task and a written task. All the participants selected for this corpus completed both tasks. All the data were collected in the schools that the participants were attending in their home country.

The oral task is a semi-guided, audio-recorded individual interview which lasted 8 to 16 minutes. The participants were interviewed in a quiet room at their school. A protocol was followed to ensure uniformity across groups and across languages, and also to encourage naturalistic speech. Different topics were proposed (e.g., family, hobbies, interests, school, preferences, music, friends, etc.), and the participants were encouraged to talk about any desired topic. The questions were formulated so that the participants answered with complete sentences. The participants were allowed to ask for vocabulary, which was provided to them in a non-inflected form.
The written task is a wordless picture sequence task adapted from the A1-ball story from the Edmonton Narrative Norms Instrument (ENNI) (Schneider et al. 2005). The A1-ball story is based on five pictures in which an elephant and a giraffe are playing with a ball. The changes made to the original ENNI story concern the characters and their biological gender. Thus, the characters in the adapted version are Mary Giraffe and Tom Elephant.

Beforehand, an oral warm-up session was held in both English and the participants' native language to make sure that they completely understood the task. During the warm-up, a random picture of the characters was shown first, and the characters were then introduced as Mary Giraffe and Tom Elephant. Participants were encouraged to comment in their L1 on what they could see in these introductory pictures.

The task was treated as a regular classroom activity, since it was conducted in the classroom with the whole class participating together. The participants were first shown the sequence of five pictures, which was projected on a screen for all to see. Then they were asked to write the story in their own words. One hour in total was given to complete the task, and participants were allowed to ask for vocabulary, which was provided to them in a non-inflected form in the case of verbs.

3.2. Data transcription procedure

All the recorded audio files and all the written narratives were transcribed using the CHAT (Codes for the Human Analysis of Transcripts) transcription system from the CHILDES project (MacWhinney 2000). All the people involved in the transcription of the soraUVALAL corpus had previously been trained in the CHAT transcription system. Furthermore, all transcribers were bilingual in English (the target language) and in the L1 of the participants whose data they were transcribing.
To ensure uniformity in the transcription procedure followed by the different people involved in the transcription process, a transcribing-in-chat document was drawn up when the transcription procedure started and was frequently updated. This document was based on and in agreement with the CHAT transcription manual in CHILDES (https://talkbank.org/manuals/CHAT.pdf).

4. DATA

4.1. Inventory of data files

The inventory of the files in the soraUVALAL corpus appears in the following CSV file: soraUVALAL_files inventory.csv. The information in the inventory is divided into two parts:

1. Initially, a table with a list of files appears where each file corresponds to an interview (oral or written). For each file the following information appears:
(i) the name of the file: the first two letters correspond to the city where the data were collected (i.e., BL stands for Banja Luka, SO stands for Soroe and VA stands for Valladolid); the following two letters stand for the country where the data were collected (BO stands for Bosnia, DK stands for Denmark and ES stands for Spain); the number 2 or 4 indicates the years of instruction in the L2 (English); and finally, the last two digits are the number assigned to each participant;
(ii) age of the participants (from 10;00 to 13;00 years old);
(iii) gender of the participants (male / female);
(iv) session duration (hours:minutes:seconds);
(v) task modality (oral / written);
(vi) number of utterances produced by each participant;
(vii) number of words produced by each participant;
(viii) standard deviation;
(ix) mean length of utterance measured in words (MLUw);
(x) researcher's speech in utterances (i.e., the input of the researcher who was present in the interviews, quantified in terms of number of utterances); and
(xi) researcher's speech in words (i.e., the input of the researcher who was present in the interviews, quantified in terms of number of words).

2. After the file inventory, a series of calculations appear.
These correspond to:
(i) “individual averages” where averages per language group, time of instruction and modality have been calculated;
(ii) “total averages” where averages per language group and modality have been calculated;
(iii) “total per group” where the totals per language group, time of instruction and modality have been calculated; and
(iv) “total” where total amounts for session duration, utterances and words have been calculated and also subclassified in terms of modality (oral and written).

4.2. Database

The soraUVALAL corpus is fully available at https://slabank.talkbank.org/access/English/soraUVALAL.html

Data type: transcribed data (CHAT transcription files of all the oral recordings and all the written narratives)
Age range covered: 9;00-13;00
Number of files: 212
Number of words: 60,096
Number of utterances: 12,882
Number of hours recorded: 26 hours 36 min

4.3. Last update: 2022

5. RELATED DATASETS

- Bilingual acquisition data: Subject Overtness_SO-L2 dataset: https://uvadoc.uva.es/handle/10324/53753
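
NOTE ON FILE NAMES

As an illustration of the file-naming convention described in section 4.1, the following Python sketch splits a file name into city, country, years of instruction and participant number. It is a minimal sketch under stated assumptions, not part of the corpus tooling: it assumes the four components are written one after another with no separators, optionally followed by a .cha extension, which this readme does not confirm. The names parse_file_name and FileInfo, and the example file name VAES401, are hypothetical.

```python
import re
from typing import NamedTuple

# Codes taken from section 4.1 of this readme.
CITIES = {"BL": "Banja Luka", "SO": "Soroe", "VA": "Valladolid"}
COUNTRIES = {"BO": "Bosnia", "DK": "Denmark", "ES": "Spain"}

# ASSUMPTION: city code + country code + years (2 or 4) + two-digit
# participant number, with no separators and an optional ".cha" extension.
_PATTERN = re.compile(r"^(BL|SO|VA)(BO|DK|ES)([24])(\d{2})(?:\.cha)?$")


class FileInfo(NamedTuple):
    """Components encoded in a file name (hypothetical helper type)."""
    city: str
    country: str
    years_of_instruction: int
    participant: str


def parse_file_name(name: str) -> FileInfo:
    """Decode a file name according to the assumed layout above."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"file name does not follow the assumed layout: {name!r}")
    city, country, years, participant = match.groups()
    return FileInfo(CITIES[city], COUNTRIES[country], int(years), participant)


if __name__ == "__main__":
    # "VAES401" is a made-up name used only to show the decoding.
    print(parse_file_name("VAES401"))
```

If the actual file names use separators or a different ordering, only the regular expression would need to be adjusted.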