CorpoGrabber

Kocoń, Jan

dc.contributor.author	Kocoń, Jan
dc.date.accessioned	2017-06-28T09:14:07Z
dc.date.available	2017-06-28T09:14:07Z
dc.date.issued	2017-06-28
dc.identifier.uri	http://hdl.handle.net/11321/403
dc.description	CorpoGrabber: The Toolchain to Automatic Acquiring and Extraction of the Website Content. Jan Kocoń, Wroclaw University of Technology. CorpoGrabber is a pipeline of tools that retrieves the most relevant content of a website, including all of its subpages (up to a user-defined depth). The toolchain can be used to build large Web corpora of text documents. It requires only a list of root websites as input. The tools composing CorpoGrabber are adapted to Polish, but most subtasks are language independent. The whole process can be run in parallel on a single machine and includes the following tasks: (1) downloading the HTML subpages of each input page URL [1]; (2) extracting plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers, and advertisements) [2]; (3) deduplicating the plain text [2]; (4) removing low-quality documents using the Morphological Analysis Converter and Aggregator (MACA) [3]; (5) tagging the documents using the Wrocław CRF Tagger (WCRFT) [4]. The last two steps are available only for Polish. The result is a corpus, delivered as a set of tagged documents for each website. References: [1] https://www.httrack.com/html/faq.html [2] J. Pomikalek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis. Masaryk University, Faculty of Informatics. Brno. [3] A. Radziszewski, T. Sniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation. Barcelona, Spain. [4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
dc.language.iso	pol
dc.language.iso	eng
dc.publisher	Jan Kocoń
dc.rights	GNU LGPL 3.0
dc.rights.uri	http://www.gnu.org/licenses/lgpl.html
dc.rights.label	PUB
dc.subject	CorpoGrabber
dc.subject	corpus
dc.subject	acquiring
dc.subject	web scraping
dc.subject	corpora builder
dc.title	CorpoGrabber
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN-PL
contact.person	Jan Kocoń jan.kocon@pwr.edu.pl Wroclaw University of Science and Technology
sponsor	Ministry of Science and Higher Education (Poland) 6358/IA/119/2013 CLARIN-PL nationalFunds
files.size	5123
files.count	1
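
The text-extraction and deduplication steps named in the description can be illustrated with a minimal sketch. This is not the CorpoGrabber code: the function names (extract_text, deduplicate, build_corpus) are hypothetical, the boilerplate filter is a naive tag-based stand-in for the method cited in [2], and deduplication here only catches exact duplicates after whitespace and case normalization. Downloading, MACA-based filtering, and WCRFT tagging are left out because their interfaces are not described in this record.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Naive stand-in for boilerplate removal (assumption, not the method in [2]).

    Keeps text that is not nested inside script, style, nav, header or footer.
    """

    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of one HTML page (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def deduplicate(texts):
    """Drop exact duplicates, comparing whitespace- and case-normalized text."""
    seen = set()
    unique = []
    for text in texts:
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def build_corpus(pages):
    """Run extraction, drop empty documents, then deduplicate."""
    texts = [extract_text(html) for html in pages]
    texts = [t for t in texts if t]
    return deduplicate(texts)
```

A real pipeline would replace the tag filter with a proper boilerplate classifier and use near-duplicate detection rather than exact hashing; the sketch only shows how the stages chain together.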

Files in this item

This item is publicly available and licensed under GNU LGPL 3.0.

Name: corpograbber.zip
Size: 5 KB
Format: application/zip
Description: Unknown
