Services for speech processing
Name
Services for speech processing:
- ALIGN
- ANNPRO
- DIA
- G2P
- KWS
- RECO
- VAD
Description
Versions
-
Name
ALIGN – a tool for matching speech transcription to the audio recording
Description
The ALIGN service ("speech alignment") matches a speech transcription to a provided audio recording. Its result can be understood as the automatic generation of time codes when both the audio signal and its transcription are known. It can be used to find specific events in large collections of recordings and to compute statistics related to the timing (and other characteristics) of individual events. The matching is performed at both the word and the phoneme level. The result is an orthographic and a phonetic alignment (segmentation and labelling) of the recorded speech, which is rendered into the target format (currently TextGrid; other formats are planned) and returned to the user via the EMU-webApp browser interface, where the segmentation results can be viewed.
An extended version of the ALIGN service, with an improved acoustic model, is planned. Such an extension is essential for noisy data: the current version works well for clean and predictable recordings, but it can produce errors or fail completely for very noisy or low-energy signals.
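As an illustration of the output format described above, the sketch below serializes word-level alignment results into a minimal Praat TextGrid interval tier. The example tuples and the helper function are hypothetical, not actual service output; only the TextGrid layout follows the standard long format.

```python
# Minimal sketch: turning hypothetical word-level alignment results
# (start, end, label) into a Praat long-form TextGrid interval tier.

def to_textgrid(intervals, tier_name="words"):
    """Serialize (start, end, label) tuples into a minimal TextGrid string."""
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, label) in enumerate(intervals, 1):
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{label}"',
        ]
    return "\n".join(lines)

# Illustrative alignment of a two-word utterance (times in seconds).
words = [(0.0, 0.42, "dzień"), (0.42, 0.95, "dobry")]
tg = to_textgrid(words)
```

A file written this way can be opened directly in Praat or EMU-webApp alongside the audio.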
Publication to be cited in case of usage
Auxiliary materials
Speech processing service – dr inż. Danijel Koržinek
Access
https://mowa.clarin-pl.eu/tools/ui/align/segment ; https://mowa.clarin-pl.eu:8433/
Link to the manual
Examples of application
The correlation of acoustic-phonetic phenomena with the linguistic description of speech.
-
Name
ANNPRO – a plug-in for Annotation Pro
Description
Annotation Pro provides a mechanism for plugging in a custom automated segmentation module based on the user's own tools. This lets users run their own annotation tools inside the Annotation Pro environment, for example to automatically segment or transcribe files gathered in a collection, or to process multiple files and multiple annotation layers simultaneously. Desktop tools can be more effective when annotating audio or video files, especially long recordings containing various kinds of additional sounds. A tool dedicated to annotation and annotation exploration gives the user direct control over the process, and running an offline version makes it possible to annotate 'sensitive' data that cannot be sent to external servers due to data protection or other restrictions.
Publication to be cited in case of usage
Auxiliary materials
Speech processing service – dr inż. Danijel Koržinek
Access
https://mowa.clarin-pl.eu/tools/annotationpro
Link to the manual
Examples of application
The automation of the recording segmentation process in Annotation Pro
-
Name
DIA – a speaker diarization tool (segmenting recordings by speaker)
Description
DIA divides large audio files into smaller segments, each uttered by an individual speaker. There are several speaker segmentation strategies: the first recognizes the moments where one speaker changes to another; the second additionally groups the segments, marking which parts belong to the same speaker; the third also identifies the recognized parts, so that we know exactly who is speaking in each segment. The tool implements the second strategy: it recognizes speaker changes and determines how many speakers there are and at which moments of the recording they speak, but it treats the speakers anonymously, assigning them successive numbers rather than identities. The tool is useful for adapting other tools and models to individual speakers, as well as for any analysis that requires speaker segmentation.
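The anonymous, numbered speaker labels described above are enough for activity analysis. A minimal sketch (the segment tuples and label names are hypothetical example data, not real service output):

```python
from collections import defaultdict

def speaker_totals(segments):
    """Sum speaking time per anonymous speaker label
    from (start, end, speaker) segments in seconds."""
    totals = defaultdict(float)
    for start, end, spk in segments:
        totals[spk] += end - start
    return dict(totals)

# Hypothetical diarization output: two anonymous speakers.
segments = [(0.0, 4.5, "SPK1"), (4.5, 7.0, "SPK2"), (7.0, 10.0, "SPK1")]
totals = speaker_totals(segments)
```

This is the kind of per-speaker statistic mentioned under "Examples of application" below.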
Publication to be cited in case of usage
Auxiliary materials
Speech processing service – dr inż. Danijel Koržinek
Access
http://mowa.clarin-pl.eu/tools/ui/speech/diarize ; https://mowa.clarin-pl.eu:8433/
Link to the manual
Examples of application
Analyzing the activity of individual speakers in a recording of multiple people
-
Name
G2P – a tool for converting orthographic notation into phonetic notation
Description
The G2P (grapheme-to-phoneme) tool converts any orthographically written text into its phonetic (spoken) form, one of the basic steps in any speech processing pipeline. The tool accepts any form of text but does not perform text normalization: numbers, dates and abbreviations are not expanded automatically. The system is rule-based (972 basic rules plus 4802 word substitutions for exceptions) and includes an exception list for proper names, foreign names and uncommon words. It can generate word lists that take different pronunciations into account (coarticulation effects resulting from context) as well as a canonical transcription of the text. The tool uses a variant of the SAMPA phonetic alphabet, modified to use only letters of the alphabet (symbols such as the apostrophe and tilde have been replaced with i and n).
A transcription is a record of how text is pronounced. An orthographic alphabet does not fulfill this function: despite appearances, an orthographic spelling does not tell you exactly how a given word should be read. Furthermore, the multitude of writing systems (Latin, Cyrillic, Korean and others) would require knowledge of each system in order to read a word in a given language. There is an international phonetic alphabet (IPA, also known by its French name API – Alphabet Phonétique International), but it is not always widely used; it was created on the basis of the phonetics and phonology of Western European languages and is not particularly well adapted to Polish.
Several extensions to this tool are planned. The first is text normalization (numbers, dates, etc.) before conversion. Another is support for different phonetic alphabets, and possibly additional levels of annotation (accents or syllabification). The implementation of these extensions depends on community interest in the tool.
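The rule-plus-exception design described above can be sketched in miniature. The handful of rules and the exception entry below are a simplified, hypothetical subset chosen for illustration; the real service uses 972 rules and 4802 word substitutions, and its exact rule set and SAMPA variant may differ.

```python
# Toy grapheme-to-phoneme converter in the spirit of the G2P service:
# longest-grapheme-first rules so digraphs (sz, cz, rz, ch) match before
# single letters, plus an exception list consulted before the rules.

RULES = [("sz", "S"), ("cz", "t S"), ("rz", "Z"), ("ch", "x"),
         ("a", "a"), ("s", "s"), ("w", "v"), ("y", "I")]

# Hypothetical exception entry: a foreign word the rules would misrender.
EXCEPTIONS = {"weekend": "w i k e n t"}

def g2p(word):
    """Return a space-separated phoneme string for one lower-case word."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            phones.append(word[i])  # unknown letter: pass through unchanged
            i += 1
    return " ".join(phones)
```

For example, `g2p("czas")` yields "t S a s" via the digraph rule, while "weekend" is served from the exception list.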
Publication to be cited in case of usage
Auxiliary materials
Speech processing service – dr inż. Danijel Koržinek
Access
http://mowa.clarin-pl.eu/tools/ui/phonetize/word ;
https://mowa.clarin-pl.eu/tools/ui/phonetize/list
Link to the manual
http://mowa.clarin-pl.eu/tools/ui/phonetize/word ;
https://mowa.clarin-pl.eu:8433/docs/doc.html#g2p
Examples of application
Corpus analysis of texts concerning pronunciation or pronunciation similarity
-
Name
KWS – a keyword detection tool
Description
An accurate transcription of the audio material is often unnecessary when we are only interested in occurrences of individual words. Keyword detection takes an audio file and a list of keywords as input and generates a list of occurrences of those words within the audio file. Note, however, that the language model has a limited dictionary, so it cannot cover all possible words. For this reason, when a word outside the dictionary has to be found, the system falls back to a syllable-level representation of the word. This allows it to cope with out-of-dictionary words, but makes it more error-prone when very short keywords are provided. The overall precision of the tool is ~95%; the sensitivity (recall) is ~82% for known words but low (~20%) for unknown words. The syllable-based model will need improvement in the future to reduce errors on unknown words.
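The precision and recall figures quoted above can be computed by comparing detected keyword hits against a manually verified reference. A minimal sketch (the hit tuples below are invented example data):

```python
def precision_recall(detected, reference):
    """Precision and recall of detected hits against reference occurrences.
    A hit matches when keyword and (rounded) time agree."""
    detected, reference = set(detected), set(reference)
    tp = len(detected & reference)  # true positives
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical hits: (keyword, time in seconds rounded to the nearest second)
detected = {("sejm", 12), ("budżet", 45), ("budżet", 80)}
reference = {("sejm", 12), ("budżet", 45), ("ustawa", 60)}
p, r = precision_recall(detected, reference)
```

Here one false alarm and one miss give precision and recall of 2/3 each; the same computation underlies the ~95%/~82% figures for the service.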
Publication to be cited in case of usage
Auxiliary materials
Speech processing service – dr inż. Danijel Koržinek
Access
http://mowa.clarin-pl.eu/tools/ui/speech/kws
Link to the manual
http://mowa.clarin-pl.eu/tools/ui/speech/kws ;
https://mowa.clarin-pl.eu:8433/docs/doc.html#kws
Examples of application
Finding terms in interviews or television programs
-
Name
RECO – an automatic speech recognition tool
Description
This tool uses a speech recognition system to generate the most likely orthographic transcription of audio recordings of Polish speech. First, the audio signal is split into time frames and subjected to feature extraction: a standard set of 39 features (mainly MFCC) is computed at 100 frames per second from overlapping 25 ms windows. The frames are then filtered by the VAD module, and the frames containing speech are subjected to speaker recognition so that the acoustic model can be adapted to the speaker. The acoustic model predicts the probability of phonemes from the observed acoustic features; its phoneme output must then be converted into words by the phoneme-to-grapheme (P2G) conversion module, using a dictionary that maps words to the phonetic output of the acoustic model. The word strings must also be arranged in sequences consistent with the grammar of the language; this is handled by a language model, which computes the probability of each word sequence. The decoder selects the sequence with the highest overall probability and returns it as the recognized word sequence. P2G conversion is an important bridge between the sound of speech and the notation used when reading and writing; it matters both during the training phase (to convert training transcriptions into phonemes) and during normal use (to convert phonemes into readable text).
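The frame arithmetic of the feature-extraction step can be made concrete. Assuming (as is standard, though not stated explicitly above) that 100 frames per second corresponds to a 10 ms frame shift, adjacent 25 ms windows overlap by 15 ms:

```python
def num_frames(duration_ms, win_ms=25, shift_ms=10):
    """Number of full 25 ms analysis windows, taken every 10 ms (100 fps),
    that fit into a recording of the given duration in milliseconds."""
    if duration_ms < win_ms:
        return 0
    return (duration_ms - win_ms) // shift_ms + 1

frames = num_frames(1000)  # a 1-second recording
# Each frame yields 39 features, so the resulting feature
# matrix for this recording has shape (frames, 39).
```

A 1-second recording thus produces 98 frames, i.e. a 98 x 39 feature matrix fed to the acoustic model.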
Publication to be cited in case of usage
Auxiliary materials
Speech processing service – dr inż. Danijel Koržinek
Access
https://mowa.clarin-pl.eu/tools/ui/speech/recognize ;
https://mowa.clarin-pl.eu:8433/
Link to the manual
https://mowa.clarin-pl.eu:8433/docs/doc.html#asr ;
https://mowa.clarin-pl.eu:8433/apidoc/index.html#api-Narz%C4%99dzia-RECTool
Examples of application
Transcription of the audio material
-
Name
VAD – a speech detection tool
Description
Voice activity detection (VAD) is often used at the pre-processing stage of many speech processing tools, since audio data is usually not homogeneous: it contains mixed fragments of speech, music, background noise and silence. Distinguishing between these types of audio is crucial for a high-performance transcription system. The goal of VAD is to isolate the parts of a recording containing speech from those containing other kinds of events (silence, noise, music, etc.). The tool is completely independent of language and speech domain, although it can make errors on very noisy data. A small experiment confirmed a high sensitivity (recall ~99%) and a medium precision (~58%). This was the intended trade-off: the goal is not to lose any parts containing speech, even at the cost of accepting some parts that do not contain it, because the tools that consume VAD output tolerate a small amount of non-speech data but work erroneously when any part of the speech is missed.
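Downstream tools often bridge short pauses between detected speech segments before further processing. This is a common post-processing step, sketched below under the assumption that the VAD output is a list of (start, end) times in seconds; it is not necessarily how the service itself post-processes its output.

```python
def merge_segments(segments, max_gap=0.3):
    """Merge speech segments separated by gaps of at most max_gap seconds.
    Segments are (start, end) tuples sorted by start time."""
    merged = [list(segments[0])]
    for start, end in segments[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # bridge the short pause
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]

# Hypothetical VAD output: two utterances split by a 0.1 s pause,
# then a separate utterance after 1.5 s of non-speech.
speech = [(0.0, 1.2), (1.3, 2.0), (3.5, 4.0)]
merged = merge_segments(speech)
```

The 0.1 s pause is bridged while the 1.5 s gap keeps the last segment separate, matching the recall-first design described above.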
Publication to be cited in case of usage
Auxiliary materials
Speech processing service – dr inż. Danijel Koržinek
Access
https://mowa.clarin-pl.eu/tools/ui/speech/vad ;
https://mowa.clarin-pl.eu:8433/
Link to the manual
https://mowa.clarin-pl.eu:8433/docs/doc.html#det ;
https://mowa.clarin-pl.eu:8433/apidoc/index.html#api-Narz%C4%99dzia-VADTool
Examples of application
Analysis of speech activity in a recording; visualization of the recording