jusText

Description

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora.
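The idea of keeping only text that contains full sentences can be illustrated with a small Python sketch. This is not jusText's actual algorithm or API: the names BlockExtractor, looks_like_sentences, and strip_boilerplate, the list of block tags, and the eight-word threshold are all assumptions made for this illustration. It uses only the standard library and a deliberately crude heuristic.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks in this toy sketch.
BLOCK_TAGS = {
    "p", "div", "li", "ul", "ol", "td", "tr", "table", "br",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "nav", "header", "footer", "section", "article",
}
# Text inside these tags is never kept.
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collect the text between block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Normalise whitespace and store the block if it is non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def looks_like_sentences(text, min_words=8):
    """Hypothetical heuristic: enough words and final sentence punctuation."""
    return len(text.split()) >= min_words and text.endswith((".", "!", "?"))


def strip_boilerplate(html):
    """Return the text blocks that pass the sentence heuristic."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if looks_like_sentences(block)]


SAMPLE = """
<html><body>
<nav><ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul></nav>
<p>This paragraph contains a full sentence, so the toy heuristic keeps it.</p>
<footer>Copyright 2011</footer>
</body></html>
"""

kept = strip_boilerplate(SAMPLE)
```

In this sample the navigation links and the footer are short fragments without sentence punctuation, so only the paragraph survives. A real boilerplate remover relies on richer signals than this sketch does.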

Tool type

Tool for creating own tools and resources

Tool task

corpus cleaning, corpus building

Key words

web data, text processing, service

Research domain

Computational Linguistics

Language

English

Country

Czech Republic

CLARIN centre

Masaryk University

Contact person

Jan Pomikálek

URL

https://code.google.com/p/justext/

Similar to

Blog-Reader