dc.contributor.author Sawicka-Stępińska, Brygida
dc.contributor.author Sypiańska, Jolanta
dc.date.accessioned 2025-12-02T08:29:05Z
dc.date.available 2025-12-02T08:29:05Z
dc.date.issued 2025-11-30
dc.identifier.uri http://hdl.handle.net/11321/954
dc.description The DiPSS corpus (part 1) is a longitudinal speech resource documenting the phonetic productions of L1 Polish students learning L2 English and L3 Spanish. It includes recordings from first-year Spanish philology students across five testing points over two academic years, capturing word-initial stops (lenis and fortis), vowels (e, o, u, a), rhotics ({rr}) and approximants ([β, ð, ɣ]). The corpus integrates rich metadata, including L2/L3 proficiency, language aptitude (LLAMA; Meara & Rogers, 2019), and age of onset for foreign languages, allowing for longitudinal and cross-linguistic analyses. DiPSS is designed as an open-access resource suitable for research on L1 drift, cross-linguistic influence, speech production and multilingual acquisition. Its detailed annotation, metadata and longitudinal structure make it a valuable tool for both linguistic research and computational modeling.
The task consisted of reading words presented on auto-advancing slides in Polish, Spanish, and English. Instructions for the entire task were delivered in Polish. Before the Spanish and English sets of target words, participants received a written instruction along with a brief audio prompt in the respective language to establish the appropriate language mode. Audio was captured with an AKG C4000 microphone connected to a computer via a Focusrite Scarlett 2i2 audio interface and recorded in Audacity, version 3.4.2.
Data were collected from 28 speakers across testing times 1–4, and 22 speakers across testing times 1–5. The testing times correspond to:
T1: October, year 1, during the opening week of the program
T2: November, year 1, after approximately five full weeks of instruction
T3: February, year 1, at the end of the first semester
T4: June, year 1, at the end of the first academic year
T5: June-September, year 2, at the end of the second year of studies
Metadata corresponding to the speakers include the following information:
A: Sociodemographic data: speaker ID, gender, age
B: Language background: self-reported L1, L2 and L3; level of Spanish (A - absolute beginners, B - false beginners, C - advanced learners)
C and D: L2 and L3 profile: self-reported proficiency, age of onset of formal education, age of exposure to naturalistic speech, stays in Spanish- or English-speaking countries for longer than a month, weekly exposure to naturalistic speech
E: Proficiency and language aptitude test results
The DiPSS corpus consists of five packages (T1-T5) of recordings with force-aligned three-tier annotation in TextGrid format, performed using WebMAUS Basic (Kisler et al., 2017). Each package corresponds to one testing time and contains three sets of data: Polish, Spanish, and English. Packages T1-T4 each include 28 recordings per language, with corresponding TextGrid files. Package T5 includes 22 recordings per language, also with corresponding TextGrid files. In total, the corpus comprises 402 pairs of WAV and TextGrid files from 28 speakers. The total recording time is approximately 20 hours, and the complete corpus size is 2.5 GB. The recordings in the released DiPSS corpus part 1 cover data collected in the mid-2020s.
The labels of the recordings adhere to a structured format, SPEAKER ID_TESTING TIME_LANGUAGE, where:
SPEAKER ID is a unique speaker ID consisting of 6 characters
TESTING TIME is one of the five recording sessions (T1, T2, T3, T4, T5)
LANGUAGE is the language in which the task was recorded (PL – Polish, ES – Spanish, EN – English)
The data were processed using the server infrastructure developed within "Digital Research Infrastructure for the Arts and the Humanities" (POIR.04.02.00-00-D006/20).
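The label format and package sizes described above can be checked with a short script. The following is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, the example label is invented, and it assumes that the 6-character speaker ID is alphanumeric and that the three parts are joined by single underscores with no file extension.

```python
import re

# Label pattern from the description: SPEAKERID_TESTINGTIME_LANGUAGE.
# Assumption: the 6-character speaker ID is alphanumeric.
LABEL_RE = re.compile(
    r"^(?P<speaker>[A-Za-z0-9]{6})_(?P<time>T[1-5])_(?P<lang>PL|ES|EN)$"
)

# Speakers per package and languages, as stated in the description.
SPEAKERS_PER_PACKAGE = {"T1": 28, "T2": 28, "T3": 28, "T4": 28, "T5": 22}
LANGUAGES = ("PL", "ES", "EN")


def parse_label(label):
    """Split a recording label into speaker ID, testing time and language."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a DiPSS-style label: {label!r}")
    return match.groupdict()


def expected_pairs():
    """Count the WAV/TextGrid pairs implied by the package sizes."""
    return sum(n * len(LANGUAGES) for n in SPEAKERS_PER_PACKAGE.values())


if __name__ == "__main__":
    # "AB12CD" is an invented speaker ID used only for illustration.
    print(parse_label("AB12CD_T3_ES"))
    print(expected_pairs())
```

The count works out as 4 packages × 28 speakers × 3 languages plus 1 package × 22 speakers × 3 languages, which matches the 402 pairs stated in the description.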
dc.language.iso spa
dc.language.iso pol
dc.language.iso eng
dc.publisher Adam Mickiewicz University, Poznań
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label CC
dc.subject DiPSS
dc.subject speech resource
dc.title DiPSS - longitudinal corpus of drift in Polish students of Spanish
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
hidden false
hasMetadata false
has.files yes
branding CLARIN-PL
contact.person Brygida Sawicka-Stępińska brygida.sawicka-stepinska@amu.edu.pl Adam Mickiewicz University, Poznań
size.info 20 hours
files.size 2646643848
files.count 7
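The files.size value is given in bytes, while the description quotes the corpus size as 2.5 GB. The sketch below converts the byte count to decimal gigabytes and binary gibibytes for comparison; the helper names are hypothetical. The quoted 2.5 GB appears to correspond to the binary value, though the record does not say which unit convention it uses.

```python
FILES_SIZE_BYTES = 2646643848  # value of files.size in this record


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return n_bytes / 10**9


def to_gib(n_bytes):
    """Convert bytes to binary gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    print(f"{to_gb(FILES_SIZE_BYTES):.2f} GB")    # prints "2.65 GB"
    print(f"{to_gib(FILES_SIZE_BYTES):.2f} GiB")  # prints "2.46 GiB"
```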


 Files in this item

This item is distributed under Creative Commons and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0). Attribution required.
Name                                Size       Format                Description
Metadata_speakers_DiPSS_part1.xlsx  14.57 KB   Microsoft Excel 2007  Unknown
DiPSS_part1_corpus_description.pdf  274.26 KB  PDF                   Unknown
T1.zip                              537.42 MB  application/zip       Unknown
T2.zip                              459.27 MB  application/zip       Unknown
T3.zip                              517.29 MB  application/zip       Unknown
T4.zip                              514.76 MB  application/zip       Unknown
T5.zip                              495.01 MB  application/zip       Unknown
