The LnNor Corpus: A spoken multilingual corpus of non-native and native Norwegian, English and Polish (Part 1)

Wrembel, Magdalena; Hwaszcz, Krzysztof; Pludra, Agnieszka; Skałba, Anna; Weckwerth, Jarosław; Walczak, Angelika; Sypiańska, Jolanta; Żychliński, Sylwiusz; Malarski, Kamil; Kędzierska, Hanna; Kaźmierski, Kamil; Gruszecka, Justyna; Dziubalska-Kołaczyk, Katarzyna; Czarnecki-Verner, Tristan; Cal, Zuzanna; Balas, Anna

dc.contributor.author	Wrembel, Magdalena
dc.contributor.author	Hwaszcz, Krzysztof
dc.contributor.author	Pludra, Agnieszka
dc.contributor.author	Skałba, Anna
dc.contributor.author	Weckwerth, Jarosław
dc.contributor.author	Walczak, Angelika
dc.contributor.author	Sypiańska, Jolanta
dc.contributor.author	Żychliński, Sylwiusz
dc.contributor.author	Malarski, Kamil
dc.contributor.author	Kędzierska, Hanna
dc.contributor.author	Kaźmierski, Kamil
dc.contributor.author	Gruszecka, Justyna
dc.contributor.author	Dziubalska-Kołaczyk, Katarzyna
dc.contributor.author	Czarnecki-Verner, Tristan
dc.contributor.author	Cal, Zuzanna
dc.contributor.author	Balas, Anna
dc.date.accessioned	2024-01-31T10:37:39Z
dc.date.available	2024-01-31T10:37:39Z
dc.date.issued	2024-01-31
dc.identifier.uri	http://hdl.handle.net/11321/931
dc.description	The LnNor corpus was created as part of the data collection in two projects: CLIMAD (Cross-linguistic influence in multilingualism across domains: phonology and syntax) and ADIM (Across-domain Investigations in Multilingualism: Modeling L3 Acquisition in Diverse Settings), led by Prof. Magdalena Wrembel at Adam Mickiewicz University in Poznań, Poland and by Prof. Marit Westergaard at the Arctic University of Norway, from December 2021 to April 2024 with funding from the National Science Centre (NCN) in Poland and Norway Grants. The CLIMAD and ADIM projects explored cross-linguistic influence (CLI) in the acquisition, processing, and use of a third language (L3/Ln) across various language domains and focused on different settings and stages of acquisition from a multilingual perspective. A range of methodologies, such as perception and production tests, grammaticality judgement tasks and online brain imaging techniques like EEG, were used to investigate multilingual processing. By capturing real-time insights into the interplay of cross-linguistic influences, the projects contributed to the understanding of L3/Ln acquisition and advanced theoretical frameworks in this field. Corpus data collection covered a broad range of speech elicitation tasks. The recordings consist of word, sentence and text reading, picture story description, video story retelling, spontaneous speech and socio-phonetic interviews in Polish, English and Norwegian. The corpus contains metadata based on the Language History Questionnaire (Li et al. 2020), such as age, gender, native languages, proficiency level, length of language exposure and age of onset.
Data was collected from different groups of speakers:
• L1 Polish learners of Norwegian as L3/Ln, attending Scandinavian studies at Poznań College of Modern Languages and the University of Szczecin (instructed learners)
• L1 Polish learners of Norwegian as L3/Ln, living in Norway (naturalistic learners)
• L1 English natives as controls
• L1 Norwegian natives as controls
• speakers of L2/L3/Ln English and L2/L3/Ln Norwegian with various L1 backgrounds
Six types of speech tasks were recorded in Norwegian, English and Polish:
• word reading
• sentence reading
• text reading (“The North Wind and the Sun”)
• picture description
• picture story telling
• video story telling
Metadata corresponding to the recordings include the following information:
• speaker ID, age, gender, education, current residence, speaker status (instructed/naturalistic/native), native language, additional languages spoken
• recording ID
• language: PL (Polish), EN (English), NO (Norwegian)
• status: L1, L2, L3/Ln
• speech task: WR (word reading), SR1/2/... (sentence reading), TR1/2/... (text reading), PD (picture description), ST (story telling), VT (video story telling)
• recording date, recording place, iteration, recording environment, recording device, type of microphone, noise level, etc.
The labels of the recordings adhere to a structured format: PROJECT_SPEAKER ID_LANGUAGE STATUS_TASK, wherein:
• PROJECT corresponds to the project within which the data were collected (A for ADIM, C for CLIMAD)
• SPEAKER ID corresponds to a unique speaker ID consisting of 8 characters
• LANGUAGE STATUS represents the language in which the task was recorded and its status for the speaker (e.g., L1PL, L2EN, L3NO)
• TASK corresponds to the type of speech task recorded (e.g., TR, SR, WR, etc.)
The LnNor corpus has been created to represent multilingual speech with a focus on L3/Ln Norwegian learners as well as native controls of Norwegian, English and Polish.
The corpus is designed to study linguistic variation in learners acquiring Norwegian as a foreign language in instructed and naturalistic settings. Additionally, a subcorpus of native speech patterns is provided to serve as a benchmark against which the learners' productions can be compared. Furthermore, parts of the corpus contain word alignment with orthographic transcriptions of speech to facilitate subsequent analyses across various linguistic domains. All speech samples were recorded with Shure SM-35 unidirectional cardioid head-worn condenser microphones and portable Marantz PMD620 solid state recorders, with the signal digitized at 48 kHz, 16-bit. This set-up was selected to minimize ambient noise and provide clear and focused recordings. The LnNor corpus part 1 consists of 1073 annotated files from 78 speakers. The speakers included 53 L1 Polish, 16 L1 Norwegian and 9 L1 speakers of other European languages. The total recording time is approximately 35 hours and the full size is 18 GB. The recordings in the released LnNor corpus part 1 cover data collected between 2021 and 2022.
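The recording label format described above (PROJECT_SPEAKER ID_LANGUAGE STATUS_TASK) can be split into its fields programmatically. The following is a minimal sketch in Python, not part of the corpus distribution. It assumes that the four fields are separated by single underscores, that speaker IDs are 8 alphanumeric characters, that the status prefix is one of L1, L2, L3 or Ln, and that an optional number after the task code (as in SR1 or TR2) is an iteration index. The names `parse_label` and `RecordingLabel` and the example label are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Project codes as given in the corpus description.
PROJECTS = {"A": "ADIM", "C": "CLIMAD"}

# Assumed layout: four underscore-separated fields, e.g. C_AB12CD34_L3NO_TR1.
LABEL_RE = re.compile(
    r"(?P<project>[AC])_"
    r"(?P<speaker>[A-Za-z0-9]{8})_"                # assumption: alphanumeric IDs
    r"(?P<status>L[123n])(?P<language>PL|EN|NO)_"  # assumption: "Ln" may occur
    r"(?P<task>WR|SR|TR|PD|ST|VT)(?P<iteration>\d*)"
)


class RecordingLabel(NamedTuple):
    project: str
    speaker_id: str
    status: str
    language: str
    task: str
    iteration: Optional[int]


def parse_label(label: str) -> RecordingLabel:
    """Split a recording label into its fields, or raise ValueError."""
    match = LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"label does not match the assumed format: {label!r}")
    digits = match["iteration"]
    return RecordingLabel(
        project=PROJECTS[match["project"]],
        speaker_id=match["speaker"],
        status=match["status"],
        language=match["language"],
        task=match["task"],
        iteration=int(digits) if digits else None,
    )


if __name__ == "__main__":
    # Hypothetical label, not taken from the corpus.
    print(parse_label("C_AB12CD34_L3NO_TR1"))
```

If the actual file names deviate from these assumptions (for example, by carrying a file extension or a different separator), the pattern would need to be adjusted accordingly.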
dc.language.iso	nor
dc.language.iso	eng
dc.language.iso	pol
dc.publisher	Adam Mickiewicz University
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	CC
dc.source.uri	https://adim.web.amu.edu.pl/en/
dc.subject	L2 English
dc.subject	L3 Norwegian
dc.subject	L1 Polish
dc.subject	spoken data
dc.title	The LnNor Corpus: A spoken multilingual corpus of non-native and native Norwegian, English and Polish (Part 1)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
has.files	yes
branding	CLARIN-PL
demo.uri	https://adim.web.amu.edu.pl/en/lnnor-corpus/
contact.person	Krzysztof Hwaszcz krzysztof.hwaszcz@gmail.com University of Wrocław
contact.person	Magdalena Wrembel magdala@amu.edu.pl Adam Mickiewicz University
sponsor	NCN GRIEG-1 project financed by EEA and Norway Grants UMO-2019/34/H/HS2/00495 1. Across-domain Investigations in Multilingualism: Modeling L3 Acquisition in Diverse Settings (ADIM)
sponsor	OPUS-19-HS financed by Polish National Science Centre UMO-2020/37/B/HS2/00617 2. Cross-linguistic influence in multilingualism across domains: Phonology and syntax (CLIMAD)
size.info	38 hours
files.size	13434829732
files.count	1

Files in this item

Name: LnNor_Corpus_part_1.zip
Size: 12.51 GB
Format: application/zip
Description: corpus files
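
The archive size appears in this record both as a byte count (files.size) and as a rounded figure (12.51 GB). The short Python sketch below converts the byte count into decimal and binary units. It suggests that the rounded figure is consistent with binary units (GiB), although how the repository computes the displayed value is an assumption here. The function name `bytes_to_units` is hypothetical.

```python
from typing import Tuple


def bytes_to_units(n_bytes: int) -> Tuple[float, float]:
    """Return the size in decimal gigabytes (10**9 B) and binary gibibytes (2**30 B)."""
    return n_bytes / 10**9, n_bytes / 2**30


if __name__ == "__main__":
    gb, gib = bytes_to_units(13434829732)  # value of files.size in this record
    print(f"{gb:.2f} GB (decimal), {gib:.2f} GiB (binary)")
```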
