TiCCLops: Text-Induced Corpus Clean-up online processing system

Description

TICCL (Text-Induced Corpus Clean-up) is a system designed to detect and correct typographical errors (misprints) and optical character recognition (OCR) errors in texts. It searches a corpus for all existing variants of (potentially) all words occurring in that corpus. The corpus can be a single text or several texts, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. OCR errors arise when books or other texts are scanned from paper and the resulting images are converted into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect such errors and to suggest a correct form.
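The frequency bookkeeping described above can be illustrated with a minimal Python sketch. It is not TICCL's actual implementation: the function names, the simple tokenizer, and the hand-written variant mapping are all assumptions made for illustration. In particular, the sketch takes the mapping from variant to normalized form as given, whereas finding that mapping is the task TICCL itself performs.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word type occurs in the text (lowercased)."""
    # Illustrative tokenizer: runs of letters only, no digits or underscores.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def normalized_frequencies(freqs, variant_to_norm):
    """Sum the frequencies of all variants under their normalized form."""
    totals = Counter()
    for word, count in freqs.items():
        # Words without a known normalized form count as themselves.
        totals[variant_to_norm.get(word, word)] += count
    return totals


# Toy corpus containing the OCR error from the description ('in' read as 'm').
corpus = "De regeering besluit. De regeermg zwijgt. De regeering wacht."

# Hypothetical variant mapping, written by hand for this example.
variants = {"regeermg": "regeering"}

freqs = word_frequencies(corpus)
totals = normalized_frequencies(freqs, variants)
```

Here `freqs` keeps 'regeering' and 'regeermg' as separate word types, while `totals` reports a single entry for 'regeering' whose frequency is the sum of both.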

Tool type

Processing flow

Tool task

text processing, orthographic normalisation

Key words

service, text processing, web-application

Research domain

Linguistics, Orthography

Language

Dutch

Country

Netherlands

CLARIN centre

The Institute for Dutch Lexicology 

Contact person

dr. Martin Reynaert (Tilburg University) 

URL

https://portal.clarin.inl.nl/ticclops

Similar to