Purpose of the programme

Our tools and applications can be grouped by the kind of task they support:

  • basic text processing (tokenization, morphosyntactic tagging, syntactic parsing, recognition of named entities)
  • creating, browsing and annotating corpora
  • lexicographic search (extraction of terms and multi-word units, disambiguation of lexical meanings, search for word usage examples for further research)
  • speech processing

The rapidly growing range of services, tools and functions within the CLARIN-PL infrastructure can be overwhelming, and it is often difficult to see what kind of assistance we offer. To overcome this difficulty, as well as the difficulties caused by the unstable terminology surrounding automatic (machine) language processing in Polish scientific discourse, we present a list of infrastructure elements ordered by function.

A helpful, though simplified, criterion for this functional division is the ‘research phase’. We have assigned each function of our tools, services and applications to one of four research phases that can typically be distinguished when working with NLP methods in research in the humanities and social sciences (H&SS sector).

Forming

(Where, and possibly how, to obtain the research material in textual form?)

This is the stage that precedes the actual research activities – its aim is to obtain text in a form that is suitable for further stages of machine language analysis. In practice, this stage includes activities such as OCR, transcription of spoken texts, downloading texts from the Internet, collecting posts from social networking sites, etc. CLARIN only partially supports activities in this research phase.
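As a rough illustration of what "obtaining text in a suitable form" can involve for material downloaded from the Internet, here is a minimal sketch that strips HTML markup and normalises whitespace using only the Python standard library. It is not a CLARIN-PL tool; the names `TextExtractor` and `html_to_text` are hypothetical and chosen only for this example, and real web pages usually need more careful handling (navigation menus, boilerplate, encoding problems).

```python
import re
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML fragment as one whitespace-normalised line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # NFC normalisation keeps Polish diacritics in a single, consistent form.
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "<h1>Tytuł</h1>\n<p>Ala   ma kota.</p><script>var x = 1;</script>"
    print(html_to_text(sample))
```

Joining the collected fragments with a space is a deliberate simplification: it keeps neighbouring block elements from running together, at the cost of occasionally adding a space around inline markup.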

Processing

(How to prepare the material for further research?)

In the processing stage the collected texts are processed either automatically or by hand. Processing adds to the text material an additional layer of information relating to its linguistic and communicative aspects. Machine processing can mean, for example, morphosyntactic tagging, base form assignment (lemmatisation), splitting the text into tokens (tokenisation), normalisation, etc. Manual processing means manual annotation (marking, coding) carried out in order to assign to text fragments information that cannot be detected automatically.
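The idea of an "additional layer of information" can be sketched as follows. This is a toy example, not how any CLARIN-PL service works: the tokeniser is a simple regular expression, and the lemmas and part-of-speech labels come from a tiny hand-made dictionary (`TOY_LEXICON`, hypothetical) with illustrative labels rather than a real tagset. A real tagger would resolve these values from context.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str                    # surface form as it appears in the text
    start: int                   # character offset where the token begins
    end: int                     # character offset just past the token
    lemma: Optional[str] = None  # annotation layer: base form
    pos: Optional[str] = None    # annotation layer: part of speech


# A word is a run of word characters; any other non-space character
# (e.g. punctuation) becomes a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Hypothetical, hand-made lexicon used only for this illustration.
TOY_LEXICON = {
    "ala": ("Ala", "NOUN"),
    "ma": ("mieć", "VERB"),
    "kota": ("kot", "NOUN"),
}


def tokenise(text: str) -> List[Token]:
    """Split text into tokens, keeping their character offsets."""
    return [Token(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def annotate(tokens: List[Token]) -> List[Token]:
    """Fill in lemma and part of speech where the toy lexicon knows the word."""
    for tok in tokens:
        entry = TOY_LEXICON.get(tok.text.lower())
        if entry is not None:
            tok.lemma, tok.pos = entry
        elif re.fullmatch(r"[^\w\s]", tok.text):
            tok.lemma, tok.pos = tok.text, "PUNCT"
    return tokens


if __name__ == "__main__":
    for tok in annotate(tokenise("Ala ma kota.")):
        print(tok)
```

Keeping character offsets alongside each token means that further layers, including manual annotation, can point back to the same stretch of the original text.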

Analysis

(What information can be obtained from the material?)

In the analysis phase, the information assigned to the text in the processing phase is extracted, grouped and subjected to other, more advanced operations that organise it according to a pre-defined scheme. Examples of functions specific to this phase are stylometric analysis and terminology extraction. The analysis may also follow queries formulated individually by the researcher while browsing the corpora with corpus search engines (e.g. KonText, Korpusomat).
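Two of the simplest analyses of this kind, a frequency list and a concordance (keyword in context), can be sketched in a few lines of Python. The sketch only conveys the concept; it is not the implementation used by KonText or Korpusomat, and the function names and the plain word-based tokenisation are assumptions made for this example.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenise(text: str) -> List[str]:
    """Lowercase the text and return its word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens: List[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens: List[str], keyword: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    tokens = tokenise("The cat sat. The dog saw the cat.")
    print(frequency_list(tokens, 2))
    for left, kw, right in concordance(tokens, "cat", window=2):
        print(f"{left:>12} [{kw}] {right}")
```

In practice such queries would normally run over lemmatised and tagged text, so that all inflected forms of a word are counted together.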

Discussion

(How to interpret the information obtained from the material?)

This is the stage of research that currently takes place completely outside the CLARIN infrastructure. It is the stage of substantive interpretation of the data produced in the previous stages. CLARIN staff will be happy to provide the necessary technical assistance, but data interpretation is usually a task entirely dependent on the discipline represented by the researcher.

Research phase    Function                                                            Services

Analysis          Authorship analysis
Analysis          Analysis of the grammatical features of the text
Analysis          Sentiment analysis
Processing        Syntactic analysis
Analysis          Stylistic analysis
Analysis          Topic analysis
Processing        Annotation (marking, coding) of corpora
Forming           Automatic transcription
Analysis          Automatic text summarisation
Forming           Cleaning the text from redundant elements
(ACCESS)          Programmatic access
Analysis          Extracting information from text
Analysis          Text grouping
Analysis          Identifying named entities in texts
Analysis          Identifying keywords in texts
Analysis          Identifying foreign words in texts
Analysis          Identifying temporal expressions
Analysis          Identifying spatial expressions
Analysis          Thematic classification
Processing        Lemmatisation
Forming           Text standardisation
Forming           Improving punctuation
Forming           Improving spelling
Analysis          Comparing the characteristics of corpora
Analysis          Browsing the contents of corpora
Forming           (Acoustic) speech processing
Processing        Morphosyntactic tagging
Analysis          Creating verb characteristics of texts
Forming           Creating corpora
Analysis          Creating simple statistics (concordance, frequency, collocation)
Processing        Disambiguating lexical meanings
Analysis          Extracting characteristic phrases
Analysis          Extracting multi-word units
Analysis          Extracting terminology
Processing        Managing metadata