Description
TICCL (Text Induced Corpus Clean-up) is a system for detecting and correcting typographical errors (misprints) and optical character recognition (OCR) errors in texts. It searches a corpus for all existing variants of (potentially) all words occurring in that corpus. The corpus can be a single text or several texts, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. Errors arise when books or other texts are scanned from paper and a machine turns the scans, i.e. images, into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect such errors and to suggest a correct form.
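The frequency summing and the 'in'/'m' confusion described above can be illustrated with a small Python sketch. It is not TICCL's actual algorithm or interface: the function names, the single-entry confusion list, and the toy lexicon are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical confusion list: the OCR engine read "in" as "m".
CONFUSIONS = [("m", "in")]


def candidates(word):
    """Yield forms obtained by undoing one confusion at one position."""
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def normalise(tokens, lexicon):
    """Map unknown tokens to a lexicon form where possible and sum frequencies."""
    raw = Counter(tokens)      # frequencies of the actual word forms
    merged = Counter()         # frequencies of the normalized word forms
    mapping = {}               # actual form -> suggested form
    for word, freq in raw.items():
        target = word
        if word not in lexicon:
            for cand in candidates(word):
                if cand in lexicon:
                    target = cand
                    break
        mapping[word] = target
        merged[target] += freq
    return raw, merged, mapping


# Toy corpus and lexicon, invented for this example.
tokens = "de regeering en de regeermg en de regeering".split()
raw, merged, mapping = normalise(tokens, lexicon={"de", "en", "regeering"})
```

Here `mapping` suggests 'regeering' for 'regeermg', and the merged count for 'regeering' is the sum of the counts of both observed forms. A real system would need a much larger confusion inventory and a way to rank competing candidates.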
Tool type
Processing flow
Tool task
text processing, orthographic normalisation
Key words
service, text processing, web-application
Research domain
Linguistics, Orthography
Language
Dutch
Country
Netherlands
CLARIN centre
The Institute for Dutch Lexicology
Contact person
dr. Martin Reynaert (Tilburg University)
URL
https://portal.clarin.inl.nl/ticclops
Similar to