DiPSS - longitudinal corpus of drift in Polish students of Spanish

Sawicka-Stępińska, Brygida; Sypiańska, Jolanta

dc.contributor.author	Sawicka-Stępińska, Brygida
dc.contributor.author	Sypiańska, Jolanta
dc.date.accessioned	2025-12-02T08:29:05Z
dc.date.available	2025-12-02T08:29:05Z
dc.date.issued	2025-11-30
dc.identifier.uri	http://hdl.handle.net/11321/954
dc.description	The DiPSS corpus (part 1) is a longitudinal speech resource documenting the phonetic productions of L1 Polish students learning L2 English and L3 Spanish. It includes recordings from first-year Spanish philology students across five testing points over two academic years, capturing word-initial stops (lenis and fortis), vowels (e, o, u, a), rhotics ({rr}) and approximants ([β, ð, ɣ]). The corpus integrates rich metadata, including L2/L3 proficiency, language aptitude (LLAMA; Meara & Rogers, 2019), and age of onset for foreign languages, allowing for longitudinal and cross-linguistic analyses. DiPSS is designed as an open-access resource suitable for research on L1 drift, cross-linguistic influence, speech production and multilingual acquisition. Its detailed annotation, metadata and longitudinal structure make it a valuable tool for both linguistic research and computational modeling. The task consisted of reading words presented on auto-advancing slides in Polish, Spanish, and English. Instructions for the entire task were delivered in Polish. Before the Spanish and English sets of target words, participants received a written instruction and a brief audio prompt in the respective language to establish the appropriate language mode. Audio was captured with an AKG C4000 microphone connected to a computer via a Focusrite Scarlett 2i2 audio interface and recorded in Audacity, version 3.4.2. Data were collected from 28 speakers across testing times 1–4, and from 22 speakers across testing times 1–5. The testing times correspond to: T1: October, year 1, during the opening week of the program; T2: November, year 1, after approximately five full weeks of instruction; T3: February, year 1, at the end of the first semester; T4: June, year 1, at the end of the first academic year; T5: June-September, year 2, at the end of the second year of studies. 
The speaker metadata include the following information: A: Sociodemographic data (speaker ID, gender, age); B: Language background (self-reported L1, L2 and L3; level of Spanish: A - absolute beginners, B - false beginners, C - advanced learners); C and D: L2 and L3 profiles (self-reported proficiency, age of onset of formal education, age of exposure to naturalistic speech, stays in Spanish/English-speaking countries of longer than a month, weekly exposure to naturalistic speech); E: Proficiency and language aptitude test results. The DiPSS corpus consists of five packages (T1-T5) of recordings with three-tier forced-alignment annotation in TextGrid format, produced with WebMAUS Basic (Kisler et al., 2017). Each package corresponds to one testing time and contains three sets of data: Polish, Spanish, and English. Packages T1-T4 each include 28 recordings per language, with corresponding TextGrid files. Package T5 includes 22 recordings per language, also with corresponding TextGrid files. In total, the corpus comprises 402 pairs of WAV and TextGrid files from 28 speakers. The total recording time is approximately 20 hours, and the complete corpus size is 2.5 GB. The recordings in the released DiPSS corpus part 1 cover data collected in the mid-2020s. The recording labels follow a structured format, SPEAKER ID_TESTING TIME_LANGUAGE, where SPEAKER ID is a unique speaker ID consisting of 6 characters, TESTING TIME is one of the five recording sessions (T1, T2, T3, T4, T5), and LANGUAGE is the language in which the task was recorded (PL – Polish, ES – Spanish, EN – English). The data were processed using the server infrastructure developed within "Digital Research Infrastructure for the Arts and the Humanities" (POIR.04.02.00-00-D006/20).
dc.language.iso	spa
dc.language.iso	pol
dc.language.iso	eng
dc.publisher	Adam Mickiewicz University, Poznań
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	CC
dc.subject	DiPSS
dc.subject	speech resource
dc.title	DiPSS - longitudinal corpus of drift in Polish students of Spanish
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
hidden	false
hasMetadata	false
has.files	yes
branding	CLARIN-PL
contact.person	Brygida Sawicka-Stępińska brygida.sawicka-stepinska@amu.edu.pl Adam Mickiewicz University, Poznań
size.info	20 hours
files.size	2646643848
files.count	7
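
The label format given in the description (SPEAKER ID_TESTING TIME_LANGUAGE) and the stated file counts can be illustrated with a minimal Python sketch. It is not part of the corpus distribution. The names `parse_label`, `Label` and `expected_pairs` are hypothetical, as is the example ID `AB12CD`. The sketch assumes that the three fields are joined by underscores and that the 6-character speaker ID itself contains no underscore; the record does not specify which characters an ID may contain.

```python
import re
from typing import NamedTuple

# Assumed label shape: <6-char speaker ID>_<T1..T5>_<PL|ES|EN>.
# The record only says the ID has 6 characters; excluding "_" is an assumption.
LABEL_RE = re.compile(r"(?P<speaker>[^_]{6})_(?P<time>T[1-5])_(?P<lang>PL|ES|EN)")


class Label(NamedTuple):
    speaker: str  # unique 6-character speaker ID
    time: str     # testing time, T1..T5
    lang: str     # PL, ES or EN


def parse_label(stem: str) -> Label:
    """Split a file stem (name without .wav/.TextGrid) into its three fields."""
    m = LABEL_RE.fullmatch(stem)
    if m is None:
        raise ValueError(f"not a DiPSS-style label: {stem!r}")
    return Label(m["speaker"], m["time"], m["lang"])


# Recordings per language in each package, as stated in the description.
RECORDINGS_PER_PACKAGE = {"T1": 28, "T2": 28, "T3": 28, "T4": 28, "T5": 22}
LANGUAGES = ("PL", "ES", "EN")


def expected_pairs() -> int:
    """Number of WAV/TextGrid pairs implied by the per-package counts."""
    return sum(RECORDINGS_PER_PACKAGE.values()) * len(LANGUAGES)
```

Calling `parse_label("AB12CD_T3_ES")` yields the speaker, testing time and language as separate fields, and `expected_pairs()` reproduces the total of 402 pairs reported in the description (4 × 28 × 3 + 22 × 3).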