Polish Parliamentary Corpus – a collection of texts from the plenary sittings of the Sejm and Senate of the Republic of Poland


The Polish Parliamentary Corpus is a collection of proceedings of the Polish parliament dating from 1919 to the present. It includes transcripts of Sejm sittings (including the Legislative Sejm and the State National Council), Sejm committee sittings from 1993, Sejm interpellations and questions from 1997, Senate sittings from 1922–1939 and from 1989 to the present, and Senate committee sittings from 2015. The collection is continuously updated with the most recent data acquired from the Sejm and Senate web portals. The corpus currently contains over 340 thousand documents and almost 750 million tokens. The texts are described by metadata and automatically processed with linguistic tools (segmentation, morphosyntactic analysis, recognition of syntactic groups and proper names). Both searchable and downloadable versions of the corpus are available.

Bibliographic reference for the main publication (please cite it when using the Polish Parliamentary Corpus):

Ogrodniczuk, M. (2018). Polish Parliamentary Corpus. In: Fišer, D., Eskevich, M., & de Jong, F. (Eds.), Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, 15–19. European Language Resources Association. http://lrec-conf.org/workshops/lrec2018/W2/pdf/11_W2.pdf


Downloadable version of the corpus: http://clip.ipipan.waw.pl/PPC
Corpus search engine: https://kdp.nlp.ipipan.waw.pl

Corpus search engine user manual: https://kdp.nlp.ipipan.waw.pl/manual

