Cleaned Polish Oscar corpus (128M above lines)

 
CLARIN-PL
  Authors
Sopyła, Krzysztof
 Project URL
https://github.com/Ermlab/PoLitBert/
 Demo URL
https://minio.clarin-pl.eu/ermlab/public/PoLitBert/corpus-oscar/corpus_oscar_2020-04-10_128M_above_lines.zip
 Date issued
2021
 Type
corpus
 Language(s)
Polish
 Description
Cleaned Polish Oscar corpus (part: 128M above lines, 1.93 GB). The data was prepared with a few cleaning heuristics:
- remove sentences shorter than
- remove non-Polish sentences
- remove ungrammatical sentences
- perform sentence tokenization and save each sentence on a new line; a new line was added after each document
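The layout described above (one sentence per line, with an extra new line after each document) suggests a simple way to read the corpus back into documents. The following is a minimal sketch, not the author's tooling: it assumes the extra new line shows up as an empty line between documents, and the function name `iter_documents` is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents.

    Assumption: each non-empty line is one sentence and an empty line
    marks the end of a document.
    """
    doc: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if sentence:
            doc.append(sentence)
        elif doc:
            # Empty line: the current document is complete.
            yield doc
            doc = []
    if doc:
        # Last document when the input does not end with an empty line.
        yield doc


# Small in-memory example standing in for the extracted corpus file.
sample = ["Pierwsze zdanie.\n", "Drugie zdanie.\n", "\n", "Nowy dokument.\n", "\n"]
docs = list(iter_documents(sample))
```

Because the function accepts any iterable of lines, an open file handle (opened with `encoding="utf-8"`) can be passed directly, so the 1.93 GB file is streamed rather than loaded into memory at once.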
 Publisher
Ermlab
 Subject(s)
corpus
 Collection(s)
CLARIN-PL
 
 