Fast lexical and phonetic search in the MALACH archive

Description

The search system presented here originated in the MALACH project, which ran from 2001 to 2007. The goals of the project were to employ automatic speech recognition (ASR) and information retrieval techniques to improve access to a large video archive of recorded testimonies of Holocaust survivors. So far, the system has been developed only for the Czech part of the archive. It relies on a state-of-the-art speech recognition system tailored to the challenging properties of the recordings (elderly speakers, spontaneous speech and emotionally loaded content) and on the close coupling of that recognizer with the search engine itself. The search algorithm follows the spoken term detection approach and is designed for retrieval speed. The resulting system can search the 1,000 hours of video that make up the Czech portion of the archive and find occurrences of the query word in a matter of seconds. Phonetic search, implemented alongside search over lexicon words, makes it possible to find words outside the ASR system lexicon, such as names, geographic locations or phrases containing Jewish slang. The interface to the search engine was recently redesigned to run over the Internet, so it no longer requires installation of dedicated software. Novel search engine techniques are also being investigated in the recently started follow-up project AMALACH.
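The combination of lexical and phonetic search described above can be illustrated with a small sketch. This is not the MALACH implementation: the data layout, the names `HITS`, `build_indexes`, `lexical_search`, `phonetic_search` and `search`, the phoneme notation, and the exact-match phonetic lookup are all assumptions made for illustration. A real spoken term detection system would typically search recognition lattices with confidence scores rather than a flat list of hits.

```python
from collections import defaultdict

# Hypothetical ASR output: (recording_id, start_seconds, word, phonemes).
# "<unk>" marks a stretch the recognizer could not map to a lexicon word.
HITS = [
    ("tape_001", 12.4, "praha", "p r a h a"),
    ("tape_001", 47.9, "rodina", "r o d i n a"),
    ("tape_002", 5.1, "praha", "p r a h a"),
    ("tape_002", 63.0, "<unk>", "t e r e z i: n"),
]


def build_indexes(hits):
    """Build a word -> occurrences index and a flat list of phoneme sequences."""
    lexical = defaultdict(list)
    phonetic = []
    for rec, start, word, phones in hits:
        lexical[word].append((rec, start))
        phonetic.append((rec, start, tuple(phones.split())))
    return lexical, phonetic


def lexical_search(lexical, word):
    """Return all (recording, time) occurrences of a lexicon word."""
    return list(lexical.get(word, []))


def phonetic_search(phonetic, query_phones):
    """Return occurrences whose phoneme sequence contains the query sequence."""
    q = tuple(query_phones.split())
    n = len(q)
    found = []
    for rec, start, phones in phonetic:
        # Slide a window of the query length over the recognized phonemes.
        if any(phones[i:i + n] == q for i in range(len(phones) - n + 1)):
            found.append((rec, start))
    return found


def search(lexical, phonetic, word, query_phones):
    """Try the lexicon first; fall back to phonetic search for unknown words."""
    hits = lexical_search(lexical, word)
    return hits if hits else phonetic_search(phonetic, query_phones)


lexical, phonetic = build_indexes(HITS)
```

In this sketch a lexicon word such as "praha" is answered directly from the word index, while a query for a word the recognizer does not know falls back to matching its phoneme sequence, which mirrors the role the description assigns to phonetic search for out-of-lexicon names and places.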

Tool type

Resources

Tool task

corpus exploration, search, speech recognition, information extraction

Key words

speech processing, mono-lingual, web-application

Research domain

History, Speech Recognition

Language

Czech

Country

Czech Republic

CLARIN centre

Charles University in Prague

Contact person

URL

Similar to

AAM-LR, WebMaus