ComCorp – a tool for comparing the linguistic features of corpora


The tool lets users upload any two corpora (each packed as a zip archive) and compare them with regard to the following linguistic features: specific multiword units, grammatical tags (according to the NKJP tagset), vocabulary specific to each corpus, vocabulary that differs across the corpora, proper names, morphosyntactic features of verbs, and statistical features of the corpora. ComCorp can thus be used to detect the linguistic characteristics that any two sets of texts share and those in which they differ.
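
To illustrate one of the comparisons listed above (vocabulary specific to each corpus), here is a minimal Python sketch. It is not ComCorp's implementation or API: the function names `read_corpus`, `compare_vocab` and `make_zip` are hypothetical, and it assumes each corpus is a zip archive of UTF-8 `.txt` files, tokenized naively on word characters with no lemmatization or tagging.

```python
import io
import re
import zipfile
from collections import Counter

# Naive tokenizer (assumption): runs of word characters, lowercased.
TOKEN = re.compile(r"\w+")


def read_corpus(zip_bytes):
    """Count word tokens across all .txt files in a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                text = archive.read(name).decode("utf-8")
                counts.update(TOKEN.findall(text.lower()))
    return counts


def compare_vocab(a, b):
    """Return (words only in a, words only in b, words in both), each sorted."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    shared = sorted(set(a) & set(b))
    return only_a, only_b, shared


def make_zip(files):
    """Build an in-memory zip from {filename: text}; used only for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buf.getvalue()


# Two toy corpora standing in for uploaded archives.
corpus_a = read_corpus(make_zip({"a1.txt": "Ala ma kota", "a2.txt": "kot ma mleko"}))
corpus_b = read_corpus(make_zip({"b1.txt": "Ala ma psa"}))
only_a, only_b, shared = compare_vocab(corpus_a, corpus_b)
print(only_a, only_b, shared)
```

A set difference over raw word forms treats inflected forms such as "kot" and "kota" as distinct words; a comparison based on lemmas or NKJP tags would need a morphosyntactic tagger, which this sketch deliberately leaves out.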

Walkowiak, T.: Language Processing Modelling Notation – Orchestration of NLP Microservices. In: Advances in Dependability Engineering of Complex Systems: Proceedings of the Twelfth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX, 2017, Springer International Publishing, pp. 464-473

Typical applications: building balanced corpora; rapid exploration of text collections; comparison of the linguistic features of diverse text collections (e.g. by authorship, genre or year of creation).