
 
dc.contributor.author Sopyła, Krzysztof
dc.date.accessioned 2021-07-30T09:37:57Z
dc.date.available 2021-07-30T09:37:57Z
dc.date.issued 2021
dc.identifier.uri http://hdl.handle.net/11321/842
dc.description Cleaned Polish Oscar corpus (part: 32M lines, 3.35 GB). The data was prepared with a few cleaning heuristics: - remove sentences shorter than - remove non-Polish sentences - remove ungrammatical sentences - perform sentence tokenization and save each sentence on a new line, with a new line added after each document
dc.language.iso pol
dc.publisher Ermlab
dc.source.uri https://github.com/Ermlab/PoLitBert/
dc.subject corpus
dc.title Cleaned Polish Oscar corpus (32M lines)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files no
branding CLARIN-PL
demo.uri https://minio.clarin-pl.eu/ermlab/public/PoLitBert/corpus-oscar/corpus_oscar_2020-04-10_32M_lines.zip
contact.person Krzysztof Sopyła office@ermlab.com Ermlab
files.size 0
files.count 0
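
The layout given in dc.description (one sentence per line, a new line added after each document) can be read back with a short loop. The following is a minimal sketch, not the publisher's code. It assumes that the added new line shows up as an empty line between documents. The function name `iter_documents` and the sample sentences are hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents.

    Assumption: an empty line marks the end of a document,
    and every non-empty line holds exactly one sentence.
    """
    document: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            document.append(sentence)
        elif document:
            # Empty line: the current document is complete.
            yield document
            document = []
    if document:
        # Last document when the input does not end with an empty line.
        yield document


# Hypothetical in-memory sample standing in for the extracted corpus file.
sample = [
    "Pierwsze zdanie.\n",
    "Drugie zdanie.\n",
    "\n",
    "Nowy dokument.\n",
]
docs = list(iter_documents(sample))
print(len(docs))
```

Because the function accepts any iterable of lines, a file object opened with `open(path, encoding="utf-8")` on the extracted text file can be passed in directly, so the 3.35 GB corpus is streamed rather than loaded into memory.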

