| dc.contributor.author | Sopyła, Krzysztof |
| dc.date.accessioned | 2021-07-30T11:23:09Z |
| dc.date.available | 2021-07-30T11:23:09Z |
| dc.date.issued | 2021 |
| dc.identifier.uri | http://hdl.handle.net/11321/845 |
| dc.description | Cleaned Polish Oscar corpus (part: 128M lines, 3.53 GB). The data was prepared with a few cleaning heuristics: - remove sentences shorter than - remove non-Polish sentences - remove ungrammatical sentences - perform sentence tokenization and save each sentence on a new line; after each document an additional newline was added |
| dc.language.iso | pol |
| dc.publisher | Ermlab |
| dc.source.uri | https://github.com/Ermlab/PoLitBert/ |
| dc.subject | corpus |
| dc.title | Cleaned Polish Oscar corpus (128M lines) |
| dc.type | corpus |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | no |
| branding | CLARIN-PL |
| demo.uri | https://minio.clarin-pl.eu/ermlab/public/PoLitBert/corpus-oscar/corpus_oscar_2020-04-10_128M_lines.zip |
| contact.person | Krzysztof Sopyła office@ermlab.com Ermlab |
| files.size | 0 |
| files.count | 0 |
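
The description above says that each sentence is stored on its own line and that an extra newline follows each document. A minimal sketch of how such a file could be read back into documents is shown below. It assumes that the extra newline appears as an empty line between documents; the record does not confirm this, and the function name `iter_documents` and the sample sentences are hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence-per-line text into documents.

    Assumption (not confirmed by the record): an empty line marks
    the end of a document.
    """
    sentences: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            # A non-empty line is one sentence of the current document.
            sentences.append(sentence)
        elif sentences:
            # An empty line closes the current document, if it has content.
            yield sentences
            sentences = []
    if sentences:
        # Emit the last document when the input has no trailing empty line.
        yield sentences


# Hypothetical sample in the assumed layout: two documents.
sample = ["Pierwsze zdanie.\n", "Drugie zdanie.\n", "\n", "Nowy dokument.\n"]
docs = list(iter_documents(sample))
```

Because the function accepts any iterable of lines, an open text file (opened with `encoding="utf-8"`) can be passed in directly, so the 3.53 GB corpus would be streamed rather than loaded into memory.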