W2C – Web to Corpus – tool

Description

A tool for building multilingual corpora from Wikipedia. It downloads web pages, converts them to plain text, identifies their language, etc.
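The conversion and language-identification steps named above can be illustrated with a minimal sketch. This is not W2C's actual implementation: the class `TextExtractor`, the functions `html_to_text` and `guess_language`, and the tiny `STOPWORDS` table are all hypothetical names chosen for illustration, and the stop-word overlap heuristic is a stand-in for whatever language identifier the tool really uses. The download step is left out so the example runs offline.

```python
from html.parser import HTMLParser

# Hypothetical stop-word lists, far too small for real use (assumption).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "cs": {"a", "je", "se", "na", "že", "v"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Convert an HTML string to whitespace-normalized plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def guess_language(text):
    """Return the language whose stop words occur most often, or None."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


page = "<html><body><p>The cat is in the house.</p><script>var x = 1;</script></body></html>"
text = html_to_text(page)
language = guess_language(text)
```

A real corpus builder would need a far more robust language identifier and would also handle character encodings, duplicate pages, and boilerplate removal, none of which this sketch attempts.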

Tool type

Tool for creating one's own tools and resources

Tool task

corpus creation, corpus building

Key words

web data, Wikipedia, corpus, text processing, multilingual

Research domain

Computational Linguistics, Linguistics 

Language

Multiple languages

Country

Czech Republic

CLARIN centre

Charles University in Prague

Contact person

Martin Majliš

URL

https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0022-60D6-1?show=full

Similar to