One of the goals of the research project carried out at Collegium Civitas (a non-state university) in Warsaw was to examine the content of the web pages of Polish institutions, public and private, related to culture in its broadest sense.
Around 3200 institutions were pre-selected and almost 200 000 documents were acquired from their web sites. The content of the web pages was divided into around 1 200 000 paragraphs of different sizes. The goal was to classify the paragraphs into 20 semantic classes defined by the sociologists. The classes describe different aspects of the use of a web page as a communication medium, and they were organised into three groups (competences, functions of culture, and thematic areas) plus 6 individual classes (e.g. auto-presentation or local function).
The initial vision was a simple system for supervised classification of text documents. After the Context of Use Analysis, the plan was expanded to a complex system encompassing user-controlled corpus building, text preprocessing (text segmentation, morpho-syntactic tagging and parsing), automated sample selection, manual annotation, classifier training, automated annotation and result analysis. Moreover, we found that no open corpus annotation editor focused on applications in the Social Sciences was available. The constructed prototype system can also be adapted to many similar tasks in Digital H&SS.
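The core of the workflow described above (segment pages into paragraphs, annotate a sample manually, train a classifier, annotate the rest automatically) can be illustrated with a minimal sketch. It is not the project's implementation: the abstract does not name the classifier, so a multinomial Naive Bayes model with add-one smoothing is assumed here, plain word tokenisation stands in for morpho-syntactic tagging, and the sample sentences are invented. Only the two class names, "auto-presentation" and "local function", come from the text.

```python
import math
import re
from collections import Counter, defaultdict


def split_paragraphs(text):
    """Segment page text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def tokenize(paragraph):
    """Lowercased word tokens; a stand-in for real morpho-syntactic tagging."""
    return re.findall(r"\w+", paragraph.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed classifier)."""

    def fit(self, paragraphs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(paragraphs, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, paragraph):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(paragraph):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Manually annotated sample (invented sentences, class names from the abstract).
annotated = [
    ("The museum was founded in 1950 and presents its collection and staff.",
     "auto-presentation"),
    ("Our institution presents its mission and history.", "auto-presentation"),
    ("The centre supports the local community and residents of the town.",
     "local function"),
    ("Workshops for local residents are held in the town library.",
     "local function"),
]
model = NaiveBayes().fit([t for t, _ in annotated], [l for _, l in annotated])

# Automated annotation of an unseen page.
page = ("Our mission and the history of the museum.\n\n"
        "Events for residents of the local community.")
paragraphs = split_paragraphs(page)
predicted = [model.predict(p) for p in paragraphs]
for paragraph, label in zip(paragraphs, predicted):
    print(f"{label}: {paragraph}")
```

A single-label model is used only to keep the sketch short. Whether a paragraph may belong to several of the 20 classes at once is not stated in the abstract; if it may, one binary classifier per class would be a natural variation.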
semantic classification, corpus building, text processing, morpho-syntactic tagging, parsing, annotation, analysis, sample selection
classifier, mono-lingual, text processing, web-application
Communication & Media Studies, Cultural Sciences, Discourse, History, Sociology, Political Studies
Wrocław University of Technology