Cleaned Polish Oscar corpus (32M lines)

dc.contributor.author	Sopyła, Krzysztof
dc.date.accessioned	2021-07-30T09:37:57Z
dc.date.available	2021-07-30T09:37:57Z
dc.date.issued	2021
dc.identifier.uri	http://hdl.handle.net/11321/842
dc.description	Cleaned Polish Oscar corpus (part: 32M lines, 3.35 GB). The data was prepared with a few cleaning heuristics: remove sentences shorter than; remove non-Polish sentences; remove ungrammatical sentences; perform sentence tokenization and save each sentence on its own line, with an additional newline added after each document.
dc.language.iso	pol
dc.publisher	Ermlab
dc.source.uri	https://github.com/Ermlab/PoLitBert/
dc.subject	corpus
dc.title	Cleaned Polish Oscar corpus (32M lines)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	no
branding	CLARIN-PL
demo.uri	https://minio.clarin-pl.eu/ermlab/public/PoLitBert/corpus-oscar/corpus_oscar_2020-04-10_32M_lines.zip
contact.person	Krzysztof Sopyła office@ermlab.com Ermlab
files.size	0
files.count	0
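
The description states that each sentence is stored on its own line and that a newline is added after each document. Assuming this means documents are separated by a blank line in a UTF-8 text file, a minimal sketch for grouping sentences back into documents could look like the following. The function name `iter_documents` and the sample sentences are illustrative and are not part of the corpus or the PoLitBert repository.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on blank lines.

    Assumption: one sentence per line, one blank line after each document.
    """
    sentences: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            # Non-empty line: one sentence of the current document.
            sentences.append(line)
        elif sentences:
            # Blank line: the current document is complete.
            yield sentences
            sentences = []
    if sentences:
        # Last document when the input does not end with a blank line.
        yield sentences


# Small in-memory sample standing in for the real corpus file.
sample = ["To jest zdanie.\n", "Drugie zdanie.\n", "\n", "Nowy dokument.\n"]
docs = list(iter_documents(sample))
```

Because the function accepts any iterable of lines, an open file handle for the extracted text file (for example `open(path, encoding="utf-8")`, where `path` is whatever the archive unpacks to) can be passed directly, so the 3.35 GB corpus is streamed rather than loaded into memory.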