Description
Victor is a web page cleaning tool. It removes menus, ads, footers, headers, and other boilerplate from HTML web pages so that only the main page content remains. Victor is based on conditional random fields (CRF).
Tool type
Tool for creating one's own tools and resources
Tool task
HTML cleaning
Key words
web data, text processing, web-service
Research domain
Computational Linguistics, Linguistics
Language
Czech
Country
Czech Republic
CLARIN centre
Charles University in Prague
Contact person
Michal Marek
URL
http://ufal.mff.cuni.cz/victor
Similar to