Victor

Description

Victor is a web page cleaning tool. It removes menus, ads, footers, headers, and similar boilerplate from HTML web pages so that only the main page content remains. Victor is based on a conditional random fields (CRF) algorithm.

Tool type

Tool for creating one's own tools and resources

Tool task

HTML cleaning

Key words

web data, text processing, web-service

Research domain

Computational Linguistics, Linguistics

Language

Czech

Country

Czech Republic

CLARIN centre

Charles University in Prague

Contact person

Michal Marek

URL

http://ufal.mff.cuni.cz/victor

Similar to