dc.contributor.author | Sopyła, Krzysztof |
dc.date.accessioned | 2021-07-30T09:37:57Z |
dc.date.available | 2021-07-30T09:37:57Z |
dc.date.issued | 2021 |
dc.identifier.uri | http://hdl.handle.net/11321/842 |
dc.description | Cleaned Polish Oscar corpus (part: 32M lines, 3.35 GB). The data was prepared with a few cleaning heuristics: - remove sentences shorter than - remove non-Polish sentences - remove ungrammatical sentences - perform sentence tokenization and save each sentence on a new line; after each document, an extra newline is added |
dc.language.iso | pol |
dc.publisher | Ermlab |
dc.source.uri | https://github.com/Ermlab/PoLitBert/ |
dc.subject | corpus |
dc.title | Cleaned Polish Oscar corpus (32M lines) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | no |
branding | CLARIN-PL |
demo.uri | https://minio.clarin-pl.eu/ermlab/public/PoLitBert/corpus-oscar/corpus_oscar_2020-04-10_32M_lines.zip |
contact.person | Krzysztof Sopyła office@ermlab.com Ermlab |
files.size | 0 |
files.count | 0 |
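
The layout described in dc.description (one sentence per line, with an extra newline after each document) can be read back with a short helper. This is a minimal sketch, not part of the PoLitBert code: it assumes the extra newline shows up as an empty line between documents, and the name iter_documents is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence-per-line text into documents.

    Assumption: documents are separated by one or more empty lines,
    as suggested by the corpus description.
    """
    document: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            # Non-empty line: one sentence of the current document.
            document.append(sentence)
        elif document:
            # Empty line: the current document is complete.
            yield document
            document = []
    if document:
        # The last document may not be followed by an empty line.
        yield document


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the unpacked corpus file.
    sample = ["Pierwsze zdanie.\n", "Drugie zdanie.\n", "\n", "Nowy dokument.\n"]
    for doc in iter_documents(sample):
        print(len(doc), doc)
```

Because the helper accepts any iterable of lines, an open file handle can be passed directly, so the 3.35 GB corpus would not need to be loaded into memory at once.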