Purpose of the Programme

Our tools and applications can be divided according to what they let you achieve:

basic text processing (tokenization, morphosyntactic tagging, syntactic parsing, recognition of named entities)
creating, reviewing and annotating corpora
lexicographic search (extraction of terms and multi-word units, disambiguation of lexical meanings, search for word examples for further research)
speech processing

The rapidly growing range of services, tools and functions within the CLARIN-PL infrastructure can be overwhelming, and it is often difficult to see what kind of assistance we offer. To address this, and to sidestep the unstable terminology surrounding natural language processing (NLP) in Polish scientific discourse, we present a list of infrastructure elements ordered by function.

A helpful, though simplified, criterion for dividing functions is the ‘research phase’. We assign each function of our tools, services and applications to one of four research phases that can typically be distinguished when working with NLP methods in the humanities and social sciences (H&SS).

Forming

(Where, and possibly how, can research material be obtained in textual form?)

This is the stage that precedes the actual research activities – its aim is to obtain text in a form that is suitable for further stages of machine language analysis. In practice, this stage includes activities such as OCR, transcription of spoken texts, downloading texts from the Internet, collecting posts from social networking sites, etc. CLARIN only partially supports activities in this research phase.

Processing

(How to prepare the material for further research?)

In the processing stage, the collected texts are processed either by machine or by hand. As a result, the text material gains an additional layer of information relating to the linguistic and communicative aspects of the text. Machine processing can mean, for example, splitting the text into tokens (tokenisation), morphosyntactic tagging, basic form assignment (lemmatisation), normalisation, etc. Manual processing means manual annotation (marking, coding) carried out in order to assign information to text fragments that cannot be detected automatically.
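As an illustration of the "additional layer of information" described above, here is a minimal sketch of tokenisation followed by lemma and tag assignment. It does not use any CLARIN-PL service: the lexicon `TOY_LEXICON`, the tag names and the functions `tokenise` and `annotate` are hypothetical and exist only for this example. A real tagger relies on a full morphological dictionary and on context.

```python
import re

# Hypothetical toy lexicon: surface form -> (lemma, part-of-speech tag).
# Real tools use a full morphological dictionary and disambiguate by context.
TOY_LEXICON = {
    "cats": ("cat", "NOUN"),
    "were": ("be", "VERB"),
    "sleeping": ("sleep", "VERB"),
}


def tokenise(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(text):
    """Return one record per token, carrying its lemma and tag."""
    layer = []
    for token in tokenise(text):
        # Tokens missing from the toy lexicon keep their lowercased form
        # as the lemma and receive the placeholder tag "UNK".
        lemma, tag = TOY_LEXICON.get(token.lower(), (token.lower(), "UNK"))
        layer.append({"token": token, "lemma": lemma, "tag": tag})
    return layer


if __name__ == "__main__":
    for record in annotate("Cats were sleeping."):
        print(record)
```

The original text stays untouched. The annotation is stored alongside it, which is what allows later phases to query lemmas or tags instead of raw word forms.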

Analysis

(What information can be obtained from the material?)

In the analysis phase, the information assigned to the text in the processing phase undergoes extraction, grouping and other more advanced processes, which organise it according to a pre-defined set of categories. Examples of functions specific to this phase include stylometric analysis and terminology extraction. The analysis may also follow queries formulated individually by the researcher while browsing the corpora using standard corpus search engines (e.g. KonText, Korpusomat).
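Two of the simplest analyses from this phase, a frequency list and a concordance (keyword in context), can be sketched in a few lines. This is only an illustration of the idea, not of how KonText or Korpusomat work internally. The function names `frequency_list` and `concordance` and the `window` parameter are assumptions made for this example.

```python
from collections import Counter


def frequency_list(tokens):
    """Count case-insensitive token frequencies, most frequent first."""
    return Counter(token.lower() for token in tokens).most_common()


def concordance(tokens, keyword, window=2):
    """Return (left context, match, right context) for each keyword hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    sample = "the cat saw the dog and the bird".split()
    print(frequency_list(sample)[:3])
    for left, match, right in concordance(sample, "the", window=1):
        print(f"{left:>5} [{match}] {right}")
```

Run over lemmas rather than raw word forms, the same counts would group inflected variants together, which matters for a highly inflected language such as Polish.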

Discussion

(How to interpret information obtained from the material?)

This is the stage of research that currently takes place completely outside the CLARIN infrastructure. It is the stage of substantive interpretation of the data produced in the previous stages. CLARIN staff will be happy to provide the necessary technical assistance, but data interpretation is usually a task entirely dependent on the discipline represented by the researcher.

Research phase	Function	Services
Analysis	Authorship analysis	WebSty, Topic, LEM
Analysis	Analysis of the grammatical features of the text	WebSty, Verbs, KonText, Korpusomat, Chronocorpus, ComCorp, LEM
Analysis	Sentiment analysis	Multiemo, Sentemo
Processing	Syntactic analysis	Tager
Analysis	Stylistic analysis	WebSty, Verbs
Analysis	Topic analysis	Topic
Processing	Annotation (marking, coding) of corpora	Inforex
Forming	Automatic transcription
Analysis	Automatic text summarisation	Summarize
Forming	Cleaning the text of redundant elements	Speller, Punctuator
(ACCESS)	Programmatic access
Analysis	Extracting information from text	WebSty, LEM, Topic, TermoPL, MeWeX, Spatial, NER
Analysis	Text grouping	WebSty, Topic
Analysis	Identifying named entities in texts	NER
Analysis	Identifying keywords in texts	Respa
Analysis	Identifying foreign words in texts	Inkluz
Analysis	Identifying temporal expressions	NER
Analysis	Identifying spatial expressions	Spatial
Analysis	Thematic classification	Topic, WiKNN
Processing	Lemmatisation
Forming	Text standardisation
Forming	Improving punctuation	Punctuator
Forming	Improving spelling	Speller
Analysis	Comparing the characteristics of corpora	ComCorp, WebSty, Verbs, LEM
Analysis	Browsing the contents of corpora	KonText, Korpusomat, Inforex, Chronopress, Chronocorpus, Federated Content Search, VLO
Forming	(Acoustic) speech processing
Processing	Morphosyntactic tagging	Tager
Analysis	Creating verb characteristics of texts	Verbs, LEM
Forming	Creating corpora	Korpusomat, DSpace, CLARIN Cloud, KonText, Inforex
Analysis	Creating simple statistics (concordance, frequency, collocation)	KonText, Korpusomat, LEM, WebSty
Processing	Disambiguating lexical meanings	WSD, LEM
Analysis	Extracting characteristic phrases	TermoPL, MeWeX
Analysis	Extracting multi-word units	TermoPL, MeWeX
Analysis	Extracting terminology	TermoPL, MeWeX
Processing	Managing metadata	Inforex, Korpusomat, DSpace
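
If the table is saved as tab-separated text, it can be filtered by research phase with a short script. The sketch below is only an illustration: the excerpt `TABLE_EXCERPT` copies four rows from the table above, and the function name `services_by_phase` is hypothetical. Rows without a listed service produce an empty list.

```python
import csv
import io

# Four rows copied from the table above, tab-separated, header included.
TABLE_EXCERPT = "\n".join([
    "Research phase\tFunction\tServices",
    "Analysis\tSentiment analysis\tMultiemo, Sentemo",
    "Forming\tAutomatic transcription",
    "Forming\tImproving spelling\tSpeller",
    "Processing\tMorphosyntactic tagging\tTager",
])


def services_by_phase(table_text):
    """Map each research phase to {function: [service names]}."""
    result = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        # A missing Services column is read as None, so fall back to "".
        raw = row["Services"] or ""
        services = [name.strip() for name in raw.split(",") if name.strip()]
        result.setdefault(row["Research phase"], {})[row["Function"]] = services
    return result


if __name__ == "__main__":
    for function, services in services_by_phase(TABLE_EXCERPT)["Forming"].items():
        print(function, "->", services or "no service listed")
```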