What are we working on?

CLARIN-PL tasks

A2 Construction of Language Technology Centre


The primary goal of task A2 is to promote services, resources and tools for Polish language processing among researchers in the humanities and social sciences. This objective requires establishing a Language Technology Centre, a type B centre for the Polish section of CLARIN, integrated with the CLARIN infrastructure.
To evaluate how the system functions in practice, we want to provide basic services as soon as possible and start working with humanists and social scientists to develop higher-level applications and research tools.
Another important objective of the Centre is to coordinate the CLARIN consortium in Poland at the technical level and to integrate the solutions developed into a coherent system.
The most important requirements are:
• to ensure an appropriate storage (repository) system with persistent identifiers for resources and tools, verified regularly by the relevant quality procedures (DSA, 2010), based on the agreement with CLARIN ERIC and compatible with CLARIN compliance requirements (the CLARIN compliance seal);
• to support all CLARIN specifications related to the accepted standards, formats, protocols, and programming interfaces (APIs);
• to participate in a national federation for identity and services, coordinated through networking and supercomputing centers;
• to follow strictly the standards of intellectual property rights, licenses and ethical rules;
• to establish security policy, for example by certification of servers and responsible management of personal data;
• to manage metadata in accordance with accepted standards (e.g. ISOcat) and CLARIN agreements;
• to share integrated resources and tools useful in humanities and social sciences research.
An important part of the project and of establishing the Language Technology Centre is a specially designed and implemented server installation with a large-capacity, high-speed disk array and an archiving system. The server solution for CLARIN is built from a set of servers, each with a separate function, so that together they can work as a supercomputer. The installed operating system is compatible with Linux systems.

Contact persons:
   dr Maciej Piasecki maciej.piasecki@pwr.wroc.pl;
   Marcin Pol marcin.pol@pwr.wroc.pl

Wrocław University of Technology

A3 Long-term archiving of digital data


Within task A3 we are building a prototype of a deep archive intended for long-term storage of digital data (up to 50 years). Access to the archive will be granted only upon request (it will not be available online). Unlike commercially available local storage, the deep archive enables persistent and safe data archiving.
This is possible thanks to built-in mechanisms that protect the physical condition of records and media (by reducing the level of chemically active oxygen, which causes corrosion).
Such a solution helps to build the IT infrastructure necessary to carry out studies in the humanities and social sciences. In particular, it will allow long-term archiving of written materials, audiovisual recordings, scans and images. The deep archive will complement the CLARIN infrastructure and will enable reliable storage of data related to the project.

Contact person: Krzysztof Marasek kmarasek@pjwstk.edu.pl

Polish-Japanese Institute of Information Technology

A4 Polish Speech Recording Corpus for Training and Evaluation


The aim of task A4 is to prepare a corpus of Polish speech recordings. The corpus will be available free of charge to researchers working on Polish speech, the form and structure of dialogue, experimental phonetics, forms of social interaction, etc. As a component of speech recognition systems, the corpus will support searching for information in speech (keywords) and speaker recognition.
The corpus will consist of three types of recordings:
a) high quality studio recordings of speech (about 200 people and 50 hours of speech),
b) broadcast recordings (about 50 hours of speech: interviews, news, ads, radio dramas etc.),
c) telephone dialogues recorded in call centres (spontaneous dialogues with background noise).


Polish-Japanese Institute of Information Technology

A5 Corpus of (Press) Articles Published Between 1945 and 1954


The aim of task A5 is to create a corpus of short press articles published between 1945 and 1954 (eventually it will be expanded to include contemporary press). This period will be divided into separate months, each represented by randomly selected text fragments from the most important newspapers and magazines. The texts will be arranged chronologically and linguistically processed so as to enable the observation of changes in the ways various occurrences, events and processes are presented in the daily press. Combined with plWordNet, the corpus could be searched by topic using general keywords such as FINANCE, WAR or NATURE. The corpus is aimed at researchers in linguistics, cultural anthropology, history and social studies. Its purpose is to provide a tool for working with articles that are not easily accessible, as the low-quality paper on which they were printed is gradually degrading. The corpus will be equipped with advanced tools for chronological analysis. It will fill a gap in source material for research on the Polish language and modern history.

Fig. 1 Recurrent phenomena embedded in timeline.
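The month-by-month sampling described above could look roughly like the following Python sketch. It is only an illustration: the project's actual sampling procedure is not described here, and the function name sample_per_month, the (year, month, text) input format, the per_month parameter and the fixed seed are all assumptions introduced for this example.

```python
import random
from collections import defaultdict

def sample_per_month(fragments, per_month, seed=0):
    """Pick up to `per_month` random fragments for each (year, month).

    `fragments` is an iterable of (year, month, text) tuples
    (a hypothetical input format). The result is a dict keyed by
    (year, month), inserted in chronological order.
    """
    by_month = defaultdict(list)
    for year, month, text in fragments:
        by_month[(year, month)].append(text)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        key: rng.sample(texts, min(per_month, len(texts)))
        for key, texts in sorted(by_month.items())
    }

# Toy input; the texts are placeholders, not corpus data.
FRAGMENTS = [
    (1946, 5, "fragment d"),
    (1945, 1, "fragment a"),
    (1945, 1, "fragment b"),
    (1945, 1, "fragment c"),
]
SAMPLE = sample_per_month(FRAGMENTS, per_month=2)
print(list(SAMPLE))
```

Keeping the result keyed by (year, month) in chronological order mirrors the idea that each month is represented separately and that the texts are arranged along a timeline.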


Contact person: Adam Pawłowski, apawlow@uni.wroc.pl

University of Wrocław

A6 The Corpus of Conversation Recordings


The aim of task A6 is to create a large corpus of conversation recordings. Sales representatives trained for this task will record informal conversations in natural circumstances. Additionally, recordings collected during other projects will be processed. The conversation transcripts will be temporally annotated by hand. The resulting corpus will contain over 120 hours of conversations representing unofficial Polish speech, which differs considerably from studio recordings or language spoken in the media. The gathered data will be described by demographic indicators (age, sex, education, background). Access to the corpus will be provided through a search system supporting specialized corpus queries, as well as the exploration and visualization of conversation data. The corpus will also be released under an open licence. We expect that the conversation corpus will have a wide range of applications in linguistics, sociology, anthropology and psychology, as well as in research on spoken discourse and communication modelling in informal circumstances.

Contact person: Piotr Pęzik, piotr.pezik@gmail.com

University of Lodz

A7 Parallel Polish-English Text Corpus


The aim of task A7 is to create a corpus from Polish and English versions of the same texts, in which various types of translation equivalents in both languages (division, merge, sentence insertion, omission, transposition, etc.) will be annotated. In total, the corpus will contain about 50 million tokens. Access to the corpus will be provided through a user-friendly search system. The corpus will prove especially useful in comparative analysis of language and culture, as well as in Polish-English psycholinguistic and sociolinguistic research.

Contact person: Piotr Pęzik, piotr.pezik@gmail.com

University of Lodz

A8 Polish, Bulgarian and Russian Text Corpus


The aim of task A8 is to collect, process and align a parallel corpus of texts in three languages: Polish, Bulgarian and Russian. The corpus will contain 6 million tokens and include both the original texts and their translations, which can be compared sentence by sentence to observe translation strategies. The corpus will consist of digitised contemporary literary texts, journals, and scientific and specialist texts. A manually prepared description of the temporal and quantitative meanings of verbal, adverbial and participial forms will make it possible to separate linguistic forms from their meanings in 2 000 selected sentences in the three languages. This type of manual semantic annotation will be applied in computational linguistics for the very first time. The corpus will fill a gap in digital resources for Slavic languages and will be an important milestone towards reliable manual and automatic translation. Available online, the parallel corpus of three Slavic languages will support queries from various users. It will prove useful not only in creating bilingual and trilingual dictionaries, but also in contrastive research conducted by specialists in Polish and Slavic studies. Moreover, it will be applicable to literary studies, cultural studies, sociology, political science, history, intercultural communication and anthropology, and will be a considerable aid to translators and terminologists. The parallel corpus of Polish, Bulgarian and Russian texts could also be used to evaluate automatic translation programs. Furthermore, it can play an important part in teaching native and foreign languages at different levels (from primary school to university).

Contact persons:
   Violetta Koseska amaz1312@gmail.com;
   Wojciech Sosnowski wososnow@uw.edu.pl

Institute of Slavic Studies, Polish Academy of Sciences

A9 The Corpus of Polish and Lithuanian Texts


The aim of task A9 is to create a corpus of Polish and Lithuanian texts. The corpus will be divided into two subcorpora. The first will be a parallel Polish-Lithuanian corpus containing 6 million words from texts aligned at the sentence level. A manually prepared description of the temporal and quantitative meanings of verbal, adverbial and participial forms will make it possible to fully separate linguistic forms from their meanings in 2 000 selected sentences in both languages. The second will be a comparative corpus of 400 000 words. A parallel corpus requires digitised texts, which means scanning printed text and converting it into a text file with an optical character recognition program. This is a necessary step, as the texts, linguistically important Polish-Lithuanian and Lithuanian-Polish translations, were created before computer typesetting was possible. The parallel corpus can be useful in producing dictionaries, gaining information on grammar and word meaning, building and synchronizing linguistic networks, as well as developing automatic translation. The Lithuanian content of the comparative corpus will be translated into Polish to make it available to users who do not speak Lithuanian. The comparative corpus will be a resource for research on history, political science, ethnography, cultural studies, sociology and anthropology.

Contact persons:
   Violetta Koseska amaz1312@gmail.com
   Roman Roszko roman.roszko@ispan.waw.pl


Institute of Slavic Studies, Polish Academy of Sciences

A10 plWordNet 3.0 – Semantic Dictionary of the Polish Language


The aim of task A10 is to extend plWordNet, a dictionary of the Polish language [http://plwordnet.pwr.wroc.pl/wordnet]. It is a unique dictionary, intended for both people and computers. Through plWordNet a computer can learn the meanings of words. Researchers have agreed that it is best to present a computer with word meanings in the form of a network of interconnected relations. For example, tiger meaning ‘an animal’ is connected to cat meaning ‘a feline’ by the relation A-KIND-OF (a tiger is a kind of undomesticated cat); bumper is connected to car by the relation A-PART-OF; bachelor and married man are classified in plWordNet as OPPOSITES. Of course, words often have more than one meaning: Tiger is also the name of a WWII German tank. In that sense, tiger is connected to tank by the relation A-KIND-OF (a tiger is a kind of tank). For computers, words in such a network gain meaning and start a life of their own [http://www.nlp.pwr.wroc.pl/pl/slowosiec-20/relacje-slowosieci].
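The relations in these examples can be pictured as labelled links between word senses. The following is a minimal Python sketch of that idea, not plWordNet's actual data model or API: the sense labels (such as "tiger (animal)"), the RELATIONS list and the helper functions are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (source sense, relation, target sense).
# Relation names follow the examples given in the text.
RELATIONS = [
    ("tiger (animal)", "A-KIND-OF", "cat (feline)"),
    ("tiger (tank)", "A-KIND-OF", "tank"),
    ("bumper", "A-PART-OF", "car"),
    ("bachelor", "OPPOSITE", "married man"),
]

def build_index(relations):
    """Index the triples by (source sense, relation) for quick lookup."""
    index = defaultdict(list)
    for source, relation, target in relations:
        index[(source, relation)].append(target)
    return index

def related(index, sense, relation):
    """Return the senses linked to `sense` by `relation` (possibly empty)."""
    return index.get((sense, relation), [])

INDEX = build_index(RELATIONS)
print(related(INDEX, "tiger (animal)", "A-KIND-OF"))
print(related(INDEX, "tiger (tank)", "A-KIND-OF"))
```

Treating the two senses of tiger as separate entries is what lets each of them carry its own A-KIND-OF link.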

Contact person: Marek Maziarz: mawroc@gmail.com

Wrocław University of Technology

A11 The Dictionary of Polish-English Semantic Relations (Linking plWordNet to Princeton WordNet)


The aim of task A11 is to link plWordNet with Princeton WordNet. Based on correspondence in meaning and position in the network structure, plWordNet synsets are mapped onto Princeton WordNet synsets. We link the synsets using one of seven inter-lingual relations:
· synonymy {kolor 3} - {color 1},
· near-synonymy {pracownia 1} - {workshop 1},
· inter-register synonymy {angol 1} - {Englishman 1},
· hyponymy {brat cioteczny 1} - ‘the son of an aunt’ - {cousin 1},
· hypernymy {palec 1} - {finger 1},
· meronymy {katapulta 2} - {airplane 1},
· holonymy {eskadra lotnictwa taktycznego 1} - {airplane 1}.
As a result a large bilingual linguistic database will be created, with the advantages of a dictionary, a bilingual thesaurus and a lexical-semantic database.
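As a rough illustration of how such links could be stored and queried, here is a small Python sketch. The record layout and the names InterlingualLink, english_counterparts and polish_counterparts are assumptions made for this example; only the seven relation names and the synset pairs come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InterlingualLink:
    """One link between a plWordNet synset and a Princeton WordNet synset."""
    relation: str
    pl_synset: str
    en_synset: str

# The example pairs listed in the text, one per relation type.
LINKS = [
    InterlingualLink("synonymy", "kolor 3", "color 1"),
    InterlingualLink("near-synonymy", "pracownia 1", "workshop 1"),
    InterlingualLink("inter-register synonymy", "angol 1", "Englishman 1"),
    InterlingualLink("hyponymy", "brat cioteczny 1", "cousin 1"),
    InterlingualLink("hypernymy", "palec 1", "finger 1"),
    InterlingualLink("meronymy", "katapulta 2", "airplane 1"),
    InterlingualLink("holonymy", "eskadra lotnictwa taktycznego 1", "airplane 1"),
]

def english_counterparts(links, pl_synset):
    """All (relation, English synset) pairs linked to a Polish synset."""
    return [(l.relation, l.en_synset) for l in links if l.pl_synset == pl_synset]

def polish_counterparts(links, en_synset):
    """All (relation, Polish synset) pairs linked to an English synset."""
    return [(l.relation, l.pl_synset) for l in links if l.en_synset == en_synset]

print(english_counterparts(LINKS, "kolor 3"))
print(polish_counterparts(LINKS, "airplane 1"))
```

Being able to look up links in both directions is what gives the resource the character of a bilingual dictionary as well as a lexical-semantic database.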
Contact person: Ewa Rudnicka, email: ewa.rudnicka78@gmail.com

Wrocław University of Technology

A12 Polish Shallow Semantic Parser


The main goal of task A12 is to create a parser, a tool that performs semantic analysis of text. The analysis will result in a formal description of a text, containing above all information about the relations between entities present in the text, for example:
an object – that which undergoes an activity, action or process (e.g. moving a pawn, peeling an egg, owning a property),
a subject – the performer or originator of an action (e.g. workers’ protest, Peter’s journey, howling of a wolf, volcano eruption).
Additionally, an analysis aimed at disambiguating the sentence elements will be performed, that is, linking them to plWordNet concepts and the SUMO ontology. SUMO (Suggested Upper Merged Ontology, http://www.ontologyportal.org) describes relations between entities, for example zdanie [a sentence] is a subclass of wyrażenie językowe [Linguistic expression], and człowiek [Man] is a subclass of istota ludzka [Human]. The disambiguation will result in a relation, for example between człowiek [Human] (as a word that appears in a text) and the meaning {człowiek 1} [Human 1] that appears in plWordNet. Information on the relation between the word that appeared in the text and the ontology will also be available, for example człowiek is linked to Human from SUMO by an equivalence relation.
The parser will conduct a shallow analysis, since its aim is not to identify the whole sentence structure, but to describe selected elements of the semantic text structure as accurately as possible. The focus is mostly on nominal phrases, which refer to actual entities and their attributes.
How does a parser work?
[Policja] [rozpoczęła] [poszukiwanie [zaginionego człowieka]].
[Police] [began] [searching for [a missing person]].
Poszukiwanie zaginionego człowieka is the analysed nominal phrase:
poszukiwanie is linked to the meaning of {poszukiwanie 1} in plWordNet,
zaginiony is linked to the meaning of {zaginiony 1} in plWordNet,
człowiek is linked to the meaning of {człowiek 1} in plWordNet.
A relation between the sentence elements and the SUMO ontology is defined:
człowiek – subsumed – Human.
Semantic relations are established between:
poszukiwanie → człowieka – human as an object of search,
zaginionego → człowieka – human as a subject, an executor, a provoker of disappearance.
The parser will be mainly used as a knowledge base to answer questions about the elements that appear in the text. Linking the text elements to the higher level ontology may also be used in a system that summarizes text by generalizing some of the concepts.
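The analysis of the example phrase can be written down as a small data structure. The Python sketch below is only an illustration of the kind of information involved; the parser's real output format is not specified here, and the class names Token and PhraseAnalysis, their fields and the role labels "object" and "subject" are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Token:
    form: str                                # word form as it appears in the text
    sense: str                               # plWordNet meaning, e.g. "człowiek 1"
    sumo: Optional[Tuple[str, str]] = None   # (relation, SUMO concept), if any

@dataclass
class PhraseAnalysis:
    tokens: List[Token]
    # Semantic relations as (head form, dependent form, role of the dependent).
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def roles_of(self, dependent: str) -> List[Tuple[str, str]]:
        """Return (head, role) pairs for a given dependent word form."""
        return [(h, role) for h, d, role in self.relations if d == dependent]

# The nominal phrase from the example: poszukiwanie zaginionego człowieka.
ANALYSIS = PhraseAnalysis(
    tokens=[
        Token("poszukiwanie", "poszukiwanie 1"),
        Token("zaginionego", "zaginiony 1"),
        Token("człowieka", "człowiek 1", sumo=("subsumed", "Human")),
    ],
    relations=[
        ("poszukiwanie", "człowieka", "object"),   # human as the object of search
        ("zaginionego", "człowieka", "subject"),   # human as the one who disappeared
    ],
)
print(ANALYSIS.roles_of("człowieka"))
```

A question-answering component could then look up, for instance, which roles a given entity plays in the text by calling roles_of on its word form.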

Contact person: Paweł Kędzia: paw.kedzia@gmail.com

Wrocław University of Technology

A13 Syntactic-Semantic Subcategorisation (Valency) Database


The aim of task A13 is to create a database containing 15 000 predicates (12 000 verbs, 3 000 nouns and adjectives), as well as tools to enable its extension and usage. Predicates are words in a sentence that open slots for further words (arguments). For example, the predicate deliver can have four arguments: somebody delivers something to someone by something → A courier delivers parcels to customers by lorry.
Predicates in the database will have a list of semantic arguments with which they can be combined. For example, only dogs can woof and only cats can meow, but both people and newspapers can write. While creating this database, the linguists will make use of a treebank in which words are annotated with their meanings from plWordNet. The database itself will be a bridge between plWordNet (a lexical database) and a semanto-syntactic analyser (task A20), extending the applications of both tools. Moreover, it can be used to indicate the relations between concepts that appear in texts, which would then be made available by a search engine. The so-called ‘open idioms’ (e.g. ‘to keep your fingers crossed’) and metaphoric expressions (e.g. ‘highlander’s proverb states that...’) will allow research in cultural and literary studies based solely on the collected data. In turn, the tools designed to create and manage the database can be employed by users without any experience in computer science or linguistics to build domain resources. This will enable sociologists, psychologists, historians etc. to use advanced tools in their field to an extent that was not possible before. Furthermore, the database can support systems for automatic information extraction, question answering and text summarization.
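The idea that a predicate restricts the semantic class of its arguments can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the structure of the actual database: the FRAMES dictionary, the slot name "subject", the class labels and the accepts function are all hypothetical.

```python
# Hypothetical frames: predicate -> argument slot -> allowed semantic classes.
# The restrictions mirror the woof / meow / write examples from the text.
FRAMES = {
    "woof": {"subject": {"dog"}},
    "meow": {"subject": {"cat"}},
    "write": {"subject": {"person", "newspaper"}},
}

def accepts(frames, predicate, slot, semantic_class):
    """True if the predicate's frame allows `semantic_class` in `slot`."""
    allowed = frames.get(predicate, {}).get(slot, set())
    return semantic_class in allowed

print(accepts(FRAMES, "woof", "subject", "dog"))
print(accepts(FRAMES, "woof", "subject", "cat"))
print(accepts(FRAMES, "write", "subject", "newspaper"))
```

In a fuller version the class labels would presumably be plWordNet meanings rather than plain strings, which is where the link between the two resources would come in.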

Institute of Computer Science, Polish Academy of Sciences

A14 Program for Searching Multi-word Lexemes in Texts and a Dictionary of Multi-word Lexemes


The task focuses on lexemes that belong to Polish vocabulary and consist of at least two words (multi-word lexemes). Those include:
• idioms (for example, crocodile tears, after dinner mustard, black mass),
• domain-specific terms/concepts (for example, topological space, dynamic-link library, multi-band compressor),
• idiomatic phrases from dictionaries of idioms (for example, black humour, to give one’s word).
The aim of task A14 is to create a program that will automatically extract multi-word lexemes from texts, and a dictionary containing 60 000 lexemes. In the future, the developed methods will make it possible to extend the lexicon of multi-word lexemes semi-automatically, with minimal human supervision. Furthermore, each lexeme will be described semantically and syntactically. The syntactic description will specify the fixed or free word order in a unit, the inflection of its components and additional restrictions on its usage (for example, occurrence only in the plural). Since multi-word lexemes are an important element of vocabulary, they will be included in plWordNet and linked by semantic relations to other words.

Contact persons:
   Adam Radziszewski: adam.radziszewski@pwr.wroc.pl
   Mariusz Paradowski mariusz.paradowski@pwr.wroc.pl

Wrocław University of Technology

A15 Program for Searching Proper Names in Texts and a Dictionary of Polish Proper Names


The task focuses on tools and resources for automatic recognition and classification of proper names (named entities) in text. The existing tool, Liner2 (http://www.nlp.pwr.wroc.pl/liner2), recognizes 56 categories. Its coverage is being extended to 100 categories, recognized by a program based on a specially annotated corpus, a dictionary of popular named entities and plWordNet. Classification means assigning the indicated entity to a set of synsets (linked by the instance relation), for example, Rudolf Schuster – {person, human, individual}, {president}, through analysis of the context in which the entity appeared, for example ‘Rudolf Schuster – a politician and the president of Slovakia between 1999 and 2004.’ Also, a tool will be designed to create domain-specific dictionaries of named entities for a chosen set of documents, e.g. a dictionary of political science terms based on texts from that field.

Contact person: Michał Marcińczuk: marcinczuk@gmail.com

Wrocław University of Technology

A16 Expanding the WordnetLoom Application


The aim of task A16 is to expand the existing version of the WordnetLoom application so as to add new features to plWordNet. WordnetLoom is an application that presents, in the form of a graph, the relations between the nodes (synsets) that make up a wordnet. Each node in the graph represents a meaning. An edge linking two nodes defines the direction and type of the relation between them.
The main goal is to separate the visualisation layer from the logic layer, which will have a positive impact on the application performance. The application will be able to open any WordNet in a given input format. We plan to internationalize the graphical user interface and improve the querying of synsets/lexical units by adding, among other things, search by multiple criteria. An important part of this task is also releasing a synset visualisation on plWordNet’s web page similar to the one available in the WordnetLoom application.

Contact person: Paweł Kędzia: paw.kedzia@gmail.com

Wrocław University of Technology

A17 A System for Querying Large Text Collections with Metadata


Texts gathered in large collections (corpora) can have metadata about:
~ syntax (which parts of speech occur, what function the respective words perform in a sentence, how the words inflect);
~ semantics (which meaning of the word is used, which words have an evaluating meaning);
~ the document’s origin (when the text was written, who the author is, what kind of source it is).
The aim of task A17 is to develop techniques that will allow text collections to be queried with simultaneous criteria on different levels of information. The resulting system will enable users to efficiently retrieve text fragments that meet several criteria at the same time. The system will have a user-friendly interface. It will accommodate various workflows and different levels of experience in linguistics and computer science.
The system will facilitate the retrieval of more complex information than is currently possible. Researchers in linguistics and cultural studies will be able to track changes in language over the years, especially changes in the meaning and connotations of given phrases. Researchers in history, sociology and political science will gain a fuller picture of a given person or social phenomenon based on evaluative terms.
Also, the system will aid other CLARIN tools. It will prove useful in building plWordNet, as well as various kinds of dictionaries and automatic tagging methods.
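Querying with simultaneous criteria on several levels of information can be illustrated with a very small Python sketch. It is not the planned system or its query language: the Token record, its fields, the toy data (including the sense numbers for the lemma zamek) and the query function are hypothetical and serve only to show criteria from different levels being combined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str    # word form in the text
    lemma: str   # base form
    pos: str     # syntactic level: part of speech
    sense: str   # semantic level: meaning label
    year: int    # document level: year of publication

# Hypothetical toy corpus; the sense numbers are made up for illustration.
CORPUS = [
    Token("zamek", "zamek", "noun", "zamek 1", 1950),
    Token("zamku", "zamek", "noun", "zamek 2", 1950),
    Token("zamek", "zamek", "noun", "zamek 1", 2010),
]

def query(tokens, **criteria):
    """Return tokens whose attributes equal every given criterion."""
    return [
        t for t in tokens
        if all(getattr(t, name) == value for name, value in criteria.items())
    ]

# Semantic criterion only, then semantic and document-level criteria together.
print(len(query(CORPUS, lemma="zamek", sense="zamek 1")))
print(query(CORPUS, lemma="zamek", sense="zamek 1", year=1950))
```

A real system would need indexes and richer conditions (ranges, patterns, sentence context), but the principle of intersecting criteria from the syntactic, semantic and document levels is the same.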

Institute of Computer Science, Polish Academy of Sciences

A18 A Speech Recording Manipulation Toolkit


The aim of task A18 is to develop tools that will enable access to speech recordings stored in CLARIN. At present there are no tools for processing and analysing such recordings in Polish. Most importantly, the tools will allow for:
~ search for words/phrases in recorded speech collections;
~ segmentation into fragments uttered by respective speakers;
~ orthographic-to-phonetic conversion;
~ detection of various kinds of events in the recordings (speech, music, noise, etc.).

Polish-Japanese Institute of Information Technology

A19 Specialized Text Analysis Tools


Specialized texts differ from articles or literary texts in that they contain specific vocabulary (names of theories, technologies, tools) and occasionally specific grammatical constructions. They also include abbreviations, which can be interpreted differently depending on the field. For example, Polish ‘kl.’ can expand to class, classification, cleric or cloister. Because of their specific vocabulary and ways of conveying information, specialized texts require adjustments to linguistic tools, which currently work best on texts from newspapers, magazines or prose works.
The aim of task A19 is to create a toolkit that is easily adjustable to new fields of knowledge and copes even with sloppy texts (typos, incomplete sentences, etc.). An important element of such a system will be a dictionary for a chosen field, containing words that do not belong to general vocabulary, and a dictionary of multi-word lexemes.
Dictionaries of terminology and tools recognizing naming units are essential for querying large text collections, labelling texts with field-specific information and creating query systems. A dictionary of terminology may be used to search the Internet (an example of a huge collection) for texts from a chosen field, that is, texts containing enough specialized vocabulary, and then to divide them into subgroups of closely related terms.

Institute of Computer Science, Polish Academy of Sciences

A20 A System for Deep Semanto-Syntactic Text Analysis


The aim of task A20 is to develop a tool for deep semanto-syntactic analysis. In contrast to shallow analysis, in which a program recognizes only selected elements of the sentence’s structure (such as a nominal phrase), deep analysis describes the sentence’s structure in full and names the function of each element. Currently available tools will be improved so that they can analyze raw texts, that is, texts that have not been cleaned up or grammatically disambiguated.

Institute of Computer Science, Polish Academy of Sciences

A21 A System for Recognizing and Analysing Information Structure of a Polish Text


The aim of task A21 is to develop a comprehensive tool that will recognize information scattered in a text concerning:
~ objects – people, organizations, locations, items, etc.;
~ relations – links between objects, for example name equivalence, affiliation to organization, authorship, object location;
~ situations – information on who did what to whom, as well as where and when it happened.
We are currently expanding and improving the types of recognized relations. We focus mostly on temporal relations (when something happened) and spatial relations (where someone or something is situated, for example: a bank is situated across the park).
The developed tool will be used to process Polish texts.

Contact persons:
   Jan Kocoń: janekkocon@gmail.com
   Michał Marcińczuk: marcinczuk@gmail.com

Wrocław University of Technology

A22 A Program for Extracting Semanto-Pragmatic Information from Texts


The aim of this task is to develop a system for extracting semanto-pragmatic information from text documents concerning primarily the relations between text fragments, for example between sentences:
~ two text fragments are equivalent:
◦ Fragment 1: The operation took place in Gaza and – to a lesser extent – in Chan Junis and Rafah.
◦ Fragment 2: The operation took place in Gaza and – to a lesser extent – in Chan Junis and Rafah.
~ two text fragments are opposite:
◦ Fragment 1: A failure in the 137th second of the flight most probably means that the first stage of the rocket, made in Russia, is to blame.
◦ Fragment 2: Some say that it could be the uncontrolled jettison of rocket stages and the premature ignition of the 2nd stage engine.
~ one text fragment is the elaboration of the other:
◦ Fragment 1: Moreover, the social criterion should take into consideration that each variant will cross the farmland or be located near the houses.
◦ Fragment 2: During the second round table talks the parties will discuss criteria, which should be taken into consideration before deciding on the bypass route variant.
In addition, task A22 aims at constructing tools for keyword extraction and automatic text summarization. The development of a system for keyword extraction and document structure recognition will benefit all other components of the CLARIN-PL infrastructure, as it will enable fast analysis of significant documents and their fragments.
Automatic text summarization could be used as a tool for quick document browsing, which itself is highly useful for researchers in humanities and social sciences. Additionally, it will allow the researchers to rank documents under a relevance criterion.

Contact person: Paweł Kędzia: paw.kedzia@gmail.com

Wrocław University of Technology

A23 A Program for Topic-Based Classification of Polish and English Texts


The aim of task A23 is to create a classifier, that is, a computer program for automatic labeling of Polish and English texts according to topic. The set of topics will be based on Wikipedia, as it is the largest and most up-to-date open multilingual encyclopedia. The classifier will be available as an independent web service. Additionally, this solution will be applied to classify texts from the National Corpus of Polish and the British National Corpus, which will enable a search not only by content, but also by topic. For example, a user interested in abortion could formulate a linguistic query on all inflections of the word ‘abortion’ and in addition narrow the topic context, like ‘abortion in the context of Catholic theology’ or ‘abortion in the context of the Spanish law.’ Moreover, a user could compare the labeled categories in similar texts written in different languages (Polish and English). As a result, researchers in linguistics, cultural studies, intercultural communication, sociology and political science will be able to automatically narrow corpus discourse analysis in a given language to specific topic categories.

Contact person: Piotr Pęzik: piotr.pezik@gmail.com

University of Lodz
A24
The table shows the list of tasks that concern partner centers co-operating within the CLARIN-PL project...