KPWr (Korpus Języka Polskiego Politechniki Wrocławskiej, Polish Corpus of Wrocław University of Technology)


KPWr is a collection of text documents available under a Creative Commons license. The documents have been tagged with the wcrft2 tool and annotated with several types of information: syntactic phrases (chunks), relations between syntactic phrases, named entities (including relations between them and their lemmatization), disambiguated word senses, spatial expressions, verbs with an implied subject, text keywords, temporal expressions (locally and globally normalized), situations, semantic roles and coreference.
Detailed statistics can be found on the page:
Each document is saved in three files that contain the following information:
*.xml (CCL file) – contains tokenization, sentence division, morphological analysis of the text, annotations and lemmas,
*.rel.xml (CCL-REL file) – contains relations between annotations,
*.ini (INI file) – contains document metadata.
Additionally, the corpus can be exported in CoNLL, TXT and JSON formats.
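The per-document file layout described above can be read with standard tooling. The following is a minimal Python sketch, not an official KPWr reader. It assumes the CCL file uses `tok`, `orth`, `lex` (with `disamb="1"` marking the chosen interpretation), `base`, `ctag` and `ann` (with a `chan` attribute) elements, as in commonly seen CCL files. The sample document, the `nam_loc` channel name, the INI section and key names, and the helper functions `read_tokens` and `read_metadata` are all hypothetical and serve only as illustration.

```python
import configparser
import xml.etree.ElementTree as ET

# Hypothetical miniature CCL document; real KPWr files are much larger
# and the element layout here is an assumption about the CCL format.
CCL_SAMPLE = """<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>Wrocław</orth>
    <lex disamb="1"><base>Wrocław</base><ctag>subst:sg:nom:m3</ctag></lex>
    <ann chan="nam_loc">1</ann>
   </tok>
   <tok>
    <orth>leży</orth>
    <lex disamb="1"><base>leżeć</base><ctag>fin:sg:ter:imperf</ctag></lex>
    <ann chan="nam_loc">0</ann>
   </tok>
  </sentence>
 </chunk>
</chunkList>"""

# Hypothetical metadata; the section and key names are assumptions.
INI_SAMPLE = """[metadata]
title = Przykład
source = Wikipedia
"""


def read_tokens(ccl_text):
    """Return one dict per token: surface form, lemma, tag, annotations."""
    root = ET.fromstring(ccl_text)
    tokens = []
    for tok in root.iter("tok"):
        # Prefer the disambiguated interpretation, fall back to the first one.
        lex = tok.find("lex[@disamb='1']")
        if lex is None:
            lex = tok.find("lex")
        # A channel value of 0 is treated as "token not in this annotation".
        annotations = {}
        for ann in tok.findall("ann"):
            value = int(ann.text)
            if value > 0:
                annotations[ann.get("chan")] = value
        tokens.append({
            "orth": tok.findtext("orth"),
            "base": lex.findtext("base") if lex is not None else None,
            "ctag": lex.findtext("ctag") if lex is not None else None,
            "annotations": annotations,
        })
    return tokens


def read_metadata(ini_text):
    """Return the INI content as a plain dict of sections."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for token in read_tokens(CCL_SAMPLE):
        print(token["orth"], token["base"], token["ctag"], token["annotations"])
    print(read_metadata(INI_SAMPLE))
```

To process real files, the two sample strings would be replaced with the contents of a document's `*.xml` and `*.ini` files. The `*.rel.xml` file is not covered here, since relations refer back to annotation channels and sentence identifiers and need a separate lookup step.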
Samples for the corpus have been taken from sources such as Wikipedia, Wikinews, news portals that publish content under a Creative Commons license, and literary works that are in the public domain or available under an open license. These sources allow the corpus to be used legally and free of charge.
In its latest published version the corpus consists of 449,985 tokens. It is constantly being expanded and developed towards a balanced corpus with equal shares of scientific, official, artistic/rhetorical, press/journalistic and colloquial texts.

Main publications (please cite them when using KPWr):

Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, Adam Wardyński. KPWr: Towards a Free Corpus of Polish. Proceedings of LREC’12, 2012.

Michał Marcińczuk, Marcin Oleksy, Jan Kocoń, Tomasz Bernaś, Michał Wolski. Towards an event annotated corpus of Polish. Cognitive Studies | Études cognitives, 2015.

Auxiliary materials:

Link to the manual

Examples of applications

Kobyliński, Ł. et al. “PolEval 2019 — the next chapter in evaluating Natural Language Processing tools for Polish.” (2019).

Altuna, B., Aranzabe, M. J., & Díaz de Ilarraza, A. (2020). EusTimeML: A mark-up language for temporal information in Basque. Research in Corpus Linguistics, 8(1), 86-104.

Łukasz Kobyliński and Michał Wasiluk. Deep learning in event detection in Polish. In Christiane Fellbaum, Piek Vossen, Ewa Rudnicka, Marek Maziarz, and Maciej Piasecki, editors, Proceedings of the 10th Global WordNet Conference (GWC 2019), pages 216–221, Wrocław, 2019. Oficyna Wydawnicza Politechniki Wrocławskiej.