Korpusomat – a tool for creating searchable morphosyntactically tagged corpora


Korpusomat is a simple web application that enables researchers to create morphosyntactically annotated text corpora without much technical knowledge of the underlying computational linguistic components. Korpusomat combines existing tools, such as a morphological analyser, a tagger and a corpus search engine, and provides an easy-to-use environment for building corpora technically compatible with the National Corpus of Polish from almost any text, including texts in binary formats. The resulting corpora can then be queried using standard search tools such as Poliqarp.

Language corpora, which lie at the intersection of corpus linguistics and computer technology, are huge collections of texts used in corpus research, applied linguistics and lexicography. Linguistic corpora are understood as collections of texts designed so that information about the formal structure and the content of a language can be efficiently extracted, classified and verified. Applying corpus methods with appropriate tools and digital databases enables users to significantly extend the scope of their research, avoid the time-consuming process of manual annotation, compute statistics automatically rather than by hand, etc. Examples of applications of corpus analysis include measuring the frequency of words, phrases and collocations; exploring the most common contexts in which a word or phrase occurs; examining language change over time using historical text corpora; and studying the actual use of language by its users (domain-specific corpora, foreign language corpora).
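As an illustration of the first application mentioned above, measuring the frequency of words and of adjacent word pairs, the following is a minimal Python sketch. It is not part of Korpusomat and does not use its tools: the regex tokenizer, the function names and the sample sentence are all assumptions made for this example, and raw bigram counts are only a crude stand-in for proper collocation measures.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive regex tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def bigram_frequencies(tokens):
    """Count how often each pair of adjacent tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(2))    # [('the', 3), ('cat', 2)]
print(bigram_frequencies(tokens).most_common(1))  # [(('the', 'cat'), 2)]
```

On a morphosyntactically annotated corpus, the same counts would typically be computed over lemmas or tags rather than raw word forms, which is what makes prior annotation valuable for inflected languages such as Polish.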

