Parallel Corpora


Bilingual parallel corpora of contemporary texts:

  • Polish-Bulgarian
  • Polish-Lithuanian
  • Polish-Ukrainian
  • Polish-Russian


The Parallel Corpora are a continuously growing bilingual resource. They contain manually aligned contemporary texts:

  • Polish-Bulgarian Corpus: more than 27.5 million word forms in total.
  • Polish-Lithuanian Corpus: more than 16.5 million word forms in total.
  • Polish-Ukrainian Corpus: more than 1.2 million word forms in total.
  • Polish-Russian Corpus: more than 5.6 million word forms in total.

All functional styles are represented in the corpora. Colloquial speech is represented by film dialogues. In addition to direct translations (from Polish into Russian or from Russian into Polish), the corpus includes translations from third languages. Work on full annotation of the resource is currently being completed, including lemmatization and morphosyntactic annotation of all word forms. The selection of corpus texts was guided by the principle of ensuring broad coverage of diverse lexemes and terms. Particular emphasis was placed on new vocabulary typical of colloquial speech, which appears in many of the film dialogues.