Sopyła, Krzysztof 2021-07-30T11:23:09Z 2021-07-30T11:23:09Z 2021
dc.description Cleaned Polish Oscar corpus (part: 128M lines, 3.53 GB). The data was prepared with a few cleaning heuristics:
- remove sentences shorter than
- remove non-Polish sentences
- remove ungrammatical sentences
- perform sentence tokenization and save each sentence on a new line, with an additional newline added after each document
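The layout described above (one sentence per line, with an additional newline after each document) can be read back into documents with a few lines of Python. This is only a minimal sketch: it assumes that the extra newline shows up as an empty line between documents, and the function name `iter_documents` and the sample sentences are hypothetical, not part of the corpus or its tooling.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents.

    Assumption: an empty line marks the end of a document.
    """
    doc: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            # Non-empty line: one sentence of the current document.
            doc.append(sentence)
        elif doc:
            # Empty line: the current document is complete.
            yield doc
            doc = []
    if doc:
        # Final document when the input does not end with an empty line.
        yield doc


# Hypothetical sample in the assumed layout (not taken from the corpus).
sample = [
    "Ala ma kota.\n",
    "Kot ma Alę.\n",
    "\n",
    "To jest drugi dokument.\n",
]
docs = list(iter_documents(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a multi-gigabyte file would be streamed rather than loaded into memory at once.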
dc.language.iso pol
dc.publisher Ermlab
dc.subject corpus
dc.title Cleaned Polish Oscar corpus (128M lines)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files no
branding CLARIN-PL
contact.person Krzysztof Sopyła Ermlab
files.size 0
files.count 0
