dc.contributor.author |
Sopyła, Krzysztof |
dc.date.accessioned |
2021-07-30T11:23:09Z |
dc.date.available |
2021-07-30T11:23:09Z |
dc.date.issued |
2021 |
dc.identifier.uri |
http://hdl.handle.net/11321/845 |
dc.description |
Cleaned Polish Oscar corpus (part: 128M lines, 3.53 GB). Data was prepared with a few cleaning heuristics:
- remove sentences shorter than
- remove non-Polish sentences
- remove ungrammatical sentences
- perform sentence tokenization and save each sentence on a new line; an additional newline is added after each document |
dc.language.iso |
pol |
dc.publisher |
Ermlab |
dc.source.uri |
https://github.com/Ermlab/PoLitBert/ |
dc.subject |
corpus |
dc.title |
Cleaned Polish Oscar corpus (128M lines) |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
no |
branding |
CLARIN-PL |
demo.uri |
https://minio.clarin-pl.eu/ermlab/public/PoLitBert/corpus-oscar/corpus_oscar_2020-04-10_128M_lines.zip |
contact.person |
Krzysztof Sopyła office@ermlab.com Ermlab |
files.size |
0 |
files.count |
0 |
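
The description says the corpus stores one sentence per line, with an extra newline after each document. A minimal sketch of reading that layout back into documents follows. It assumes that the extra newline shows up as an empty line between documents; `iter_documents` and the sample sentences are hypothetical, not part of the published tooling.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on empty lines.

    Assumption: an empty line marks the end of a document.
    """
    sentences: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            # Non-empty line: one sentence of the current document.
            sentences.append(line)
        elif sentences:
            # Empty line: close the current document.
            yield sentences
            sentences = []
    if sentences:
        # The last document may not be followed by an empty line.
        yield sentences


# Hypothetical in-memory sample standing in for the corpus file.
sample = [
    "Ala ma kota.\n",
    "Kot ma Alę.\n",
    "\n",
    "To jest drugi dokument.\n",
]
docs = list(iter_documents(sample))
print(len(docs))
```

Because the function accepts any iterable of lines, an open file handle for the extracted text file (opened with `encoding="utf-8"`) can be passed in directly, so the 3.53 GB file is streamed rather than loaded into memory.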