Polish Parliamentary Corpus

Name

Polish Parliamentary Corpus – a collection of texts from the plenary sittings of the Sejm and Senate of the Republic of Poland

Description

The Polish Parliamentary Corpus is a collection of proceedings of the Polish parliament dating from 1919 to the present. It includes transcripts of Sejm sittings (including the Legislative Sejm and the State National Council), Sejm committee sittings from 1993, Sejm interpellations and questions from 1997, Senate sittings from 1922–1939 and from 1989 to the present, and Senate committee sittings from 2015. The collection is regularly updated with the most recent data acquired from the Sejm and Senate web portals. The textual data in the corpus currently amounts to over 340 thousand documents and almost 750 million tokens. The texts are described by metadata and are automatically processed by linguistic tools (segmentation, morphosyntactic analysis, and recognition of syntactic groups and proper names). Both searchable and downloadable versions of the corpus are available.

Bibliographic reference for the main publication (please cite it when using the Polish Parliamentary Corpus):

Ogrodniczuk, M. (2018). Polish Parliamentary Corpus. In: Fišer, D., Eskevich, M., & de Jong, F. (Eds.), Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, 15–19. European Language Resources Association. http://lrec-conf.org/workshops/lrec2018/W2/pdf/11_W2.pdf

Auxiliary materials:

Information about the corpus: https://kdp.nlp.ipipan.waw.pl/overview
Presentation from the CLARIN-PL workshop: http://clarin-pl.eu/wp-content/uploads/2019/10/kdp.pdf

Korpus dyskursu parlamentarnego (Polish Parliamentary Corpus) – dr hab. M. Ogrodniczuk, dr hab. M. Derwojedowa

Access

Downloadable version of the corpus: http://clip.ipipan.waw.pl/PPC
Corpus search engine: https://kdp.nlp.ipipan.waw.pl

Link to the manual

Corpus search engine user manual: https://kdp.nlp.ipipan.waw.pl/manual

Examples of applications

Roselló Beneitez N. U. (2020). Development and evaluation of a Polish Automatic Speech Recognition system using the TLK toolkit. Master's thesis. Universitat Politècnica de València (Polytechnic University of Valencia).
Szczyszek M. (2019). Emocje w parlamencie – parlament w emocjach: ujęcie statystyczne. O projekcie słownika polskiego parlamentaryzmu XX wieku (lata 1918–2018). Prace Językoznawcze 20(3), 203–218.
Ustaszewski M. (2016). Data Sparsity in Highly Inflected Languages: The Case of Morphosyntactic Tagging in Polish. Master's thesis. Euskal Herriko Unibertsitatea (University of the Basque Country).
Przybyła P., & Teisseyre P. (2014). Analysing Utterances in Polish Parliament to Predict Speaker’s Background. Journal of Quantitative Linguistics 21(4), 350–376.