ROMi Multimodal Corpus of Czech as a Second Language


A small research team of sociolinguists recorded monologues of bilingual Roma people speaking Czech, but transcribed only a very small part of the recordings. CLARIN offered technical as well as financial help to make this data accessible on a much larger scale: over 500,000 words in 500 recordings.
Personally sensitive portions were anonymised in two ways: in the transcriptions, a sensitive word (e.g. a person’s name) is replaced by a different word with the same morphology; in the audio files, the same passage is replaced by a “beep”. The resulting data (audio files + transcriptions) are available as:
1. A downloadable anonymised version with a free license in a repository, easy to find via generic search engines.
2. An anonymised version freely accessible in a specialised search engine and web application (synchronised audio + transcription).
3. The original versions without anonymisation, available to researchers under a restricted license to protect the privacy of the participants.
This showcase demonstrates that even with a moderate amount of money (approx. 10,000 EUR) it is possible to produce significant data (especially important for researchers in didactics and second-language learning) and make the data accessible to the general public, even though the originals contain some sensitive information.

Tool type


Tool task

browse, corpus exploration, search

Key words

data, spoken corpus, speech processing, mono-lingual

Research domain

Ethnolect, Language Documentation, Linguistics, Sociology





CLARIN centre

Charles University in Prague

Contact person

Jan Hajic


Similar to