Back to Our Roots for Retrieving Very Short Passages

Nada Naji, University of Neuchâtel
Jacques Savoy, University of Neuchâtel

Tuesday, 10:30am


This article tackles the task of retrieving very short documents with even shorter queries. The problem at hand may relate to the retrieval of tweets, image and table captions, short text messages (SMS), and sponsored retrieval, among others. In such cases, document and/or query expansion using thesauri and other external resources (e.g., Wikipedia), usually available on the World Wide Web (WWW), has proven to be an effective approach. However, the focus of this paper is on documents written in lesser-known languages, for which the WWW is of limited use. Our experiments are based on two main corpora extracted from historical manuscripts written in Latin and Middle High German. We found that when very short documents of similar length are retrieved with short queries and no external enrichment resources are available, the classical tf-idf model performs as well as the more complex models, and sometimes better.
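
As a rough illustration of the classical tf-idf model mentioned above, the following Python sketch ranks a handful of very short passages against a short query using tf-idf weights and cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which tf-idf variant, tokenization, or normalization was used, so the raw term frequency, the log(N/df) weighting, the whitespace tokenizer, and the names `build_index` and `search` are all hypothetical choices. The three toy passages are invented for illustration and are not taken from the corpora.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; no stemming or normalization.
    return text.lower().split()


def build_index(docs):
    """Return (idf, vectors) for a list of short passages."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of passages containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Assumed idf variant: log(N / df).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    # Each passage becomes a sparse vector of tf * idf weights.
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return idf, vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, idf, vectors):
    """Return (doc_index, score) pairs sorted by decreasing score."""
    q_tf = Counter(tokenize(query))
    # Query terms unseen in the collection are ignored.
    q_vec = {t: f * idf[t] for t, f in q_tf.items() if t in idf}
    scored = [(i, cosine(q_vec, vec)) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy passages (not from the paper's corpora).
docs = [
    "in principio erat verbum",
    "verbum caro factum est",
    "et lux in tenebris lucet",
]
idf, vectors = build_index(docs)
ranking = search("verbum caro", idf, vectors)
```

Here `ranking` places the second passage first, since it shares both query terms and the rarer term `caro` carries the higher idf weight. With passages this short, most terms occur once per passage, so the ranking is driven almost entirely by idf and passage length.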