Bulletin of the American Society for Information Science

Vol. 26, No. 5

June/July 2000

The TREC Spoken Document Retrieval Track

by Ellen Voorhees and John Garofolo

The Text REtrieval Conference (TREC) is a workshop series designed to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures and a forum for organizations interested in comparing results. In recent years the conference has contained one main task and a set of additional tasks called tracks. The main task investigates the performance of systems that search a static set of documents using new questions. This task is similar to how a researcher might use a library: the collection is known, but the questions likely to be asked are not. The tracks focus research on problems related to the main task, such as retrieving documents written in a variety of languages using questions in a single language (cross-language retrieval), retrieving documents from very large (100 GB) collections and measuring retrieval performance with humans in the loop (interactive retrieval). Taken together, the tracks represent the majority of the research performed in the most recent TRECs, and they keep TREC a vibrant research program by encouraging work in new areas of information retrieval.

The three most recent TRECs have had a track on Spoken Document Retrieval (SDR), that is, on content-based retrieval of excerpts from recordings of speech. In practice, SDR is accomplished by using a combination of automatic speech recognition and information retrieval technologies. A speech recognizer is applied to an audio stream and generates a time-marked transcription of the speech. The transcript is then indexed and searched by a retrieval system. The result returned for a query is a list of temporal pointers to the audio stream ordered by decreasing similarity between the content of the speech being pointed to and the query.
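
To make the pipeline concrete, the sketch below shows a toy version of it in Python. It is purely illustrative and does not correspond to any system in the track: the recognizer output is invented, and the index and similarity function are crude stand-ins for real retrieval components.

    # Toy SDR pipeline (illustrative only): index time-marked recognizer output
    # and return temporal pointers ranked by query/document similarity.
    from collections import Counter
    import math

    # Hypothetical recognizer output: story id -> list of (start_time_seconds, word)
    transcripts = {
        "story_001": [(0.0, "senate"), (0.4, "votes"), (0.9, "on"), (1.1, "budget")],
        "story_002": [(0.0, "storm"), (0.5, "hits"), (1.0, "gulf"), (1.4, "coast")],
    }

    # Index each story as a bag of words; real systems add stemming, stopping and tf-idf.
    index = {sid: Counter(w for _, w in words) for sid, words in transcripts.items()}

    def similarity(query_terms, doc_counts):
        # Crude cosine-style score between query terms and a document's term counts.
        overlap = sum(doc_counts[t] for t in query_terms)
        doc_norm = math.sqrt(sum(c * c for c in doc_counts.values())) or 1.0
        return overlap / (doc_norm * math.sqrt(len(query_terms)))

    def search(query):
        # Return temporal pointers (story id, start time) ordered by decreasing similarity.
        terms = query.lower().split()
        scored = sorted(((similarity(terms, index[sid]), sid) for sid in index), reverse=True)
        return [(sid, transcripts[sid][0][0]) for score, sid in scored if score > 0]

    print(search("budget votes"))   # -> [('story_001', 0.0)]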

The aim of the TREC SDR track is to provide the infrastructure required to enable research on the SDR problem. Large collections of multimedia documents are already being assembled, and content-based access to all of the information is required. While the component speech recognition and information retrieval technologies are mature enough to expect usable SDR systems for some domains, there remain a number of research issues. The track fosters research on the development of large-scale, near-real-time, continuous speech recognition technology as well as on retrieval technology that is robust in the face of input errors. More importantly, the track provides a venue for investigating hybrid systems that may be more effective than simple stovepipe combinations.

The three SDR tracks used different corpora, but for each track the corpus was a collection of broadcast news stories that were made available in several different forms. A hand-produced transcript of the entire corpus was the "reference" (ground truth) transcript. The reference transcript contained story boundaries that defined the documents in the collection; all other versions of the corpus used these same human-defined boundaries. "Baseline" transcripts were produced using one particular speech recognizer configured for different levels of recognition accuracy. In addition, many participants ran their own speech recognition system against audio files to produce their own "speech" transcripts.

The National Institute of Standards and Technology (NIST) provided a set of written information needs (called "topics" in TREC) that were used to search each version of the transcripts. The different versions of the transcripts allowed participants to observe the effect of recognizer errors on their retrieval strategy. The different speech transcripts provided a comparison of how different recognition strategies affect retrieval. To make this comparison as complete as possible, participants were also encouraged to retrieve using other groups' speech transcripts.

The TREC-6 (1997) SDR track was the first formal evaluation of SDR technology. The corpus was 50 hours of news broadcasts, an enormous amount of audio to recognize at the time, but a tiny IR document collection of only 1451 stories. The task in the track was "known item" searching using 49 test topics. The goal in known item searching is to retrieve a single specific document rather than a set of relevant documents. Although the TREC-6 track was primarily a feasibility experiment, it did demonstrate that speech recognition and IR technologies were sufficiently advanced to do a credible job of retrieving specific documents. The better systems retrieved the target document at rank 1 more than 70% of the time using their own speech transcripts, compared to a best performance of 79% on the reference transcripts. Search performance was a bigger factor in the overall results than recognition accuracy, although participants that had both speech and IR expertise obtained the best results. These promising results were considered preliminary, however, because the known item task is diagnostically limited and the collection size was so small.
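
For readers unfamiliar with the measure, the fragment below sketches how known item results such as those above can be scored: the fraction of topics whose target document appears at rank 1, along with mean reciprocal rank, a common companion measure. The rankings and targets are invented; this is not the official TREC scoring code.

    # Sketch of known item scoring: success at rank 1 and mean reciprocal rank.
    # For each topic: the system's ranked document ids and the single known target.
    runs = {
        "topic_01": (["d17", "d03", "d42"], "d17"),   # target at rank 1
        "topic_02": (["d08", "d11", "d02"], "d11"),   # target at rank 2
        "topic_03": (["d21", "d34", "d09"], "d55"),   # target not retrieved
    }

    hits_at_1 = 0
    reciprocal_ranks = []
    for ranking, target in runs.values():
        if target in ranking:
            rank = ranking.index(target) + 1          # ranks are 1-based
            hits_at_1 += (rank == 1)
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)              # a missed target scores zero

    print("success at rank 1:  ", hits_at_1 / len(runs))                # 1/3
    print("mean reciprocal rank:", sum(reciprocal_ranks) / len(runs))   # (1 + 0.5 + 0) / 3 = 0.5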

The TREC-7 (1998) SDR track used the standard ad hoc retrieval task and an 87-hour, 2866-story broadcast news corpus. A team of three NIST assessors created 23 test topics and judged the retrieved documents for relevance after the retrieval results were submitted to NIST. Once again, the overall performance of the systems was quite good, with only a very gradual decline in retrieval performance as recognition errors rose. Nonetheless, analysis of the retrieval results when participants used each other's speech transcripts did show a correlation between recognition word error rate and retrieval performance, a correlation that was not present in the TREC-6 known item search results. The correlation is stronger when recognizer error is computed over content-based words (for example, named entities) rather than all words.
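
Word error rate, the recognition measure referred to above, can be computed as the word-level edit distance between the reference and recognized transcripts divided by the number of reference words. The simplified sketch below shows that calculation and how it can be restricted to a hand-picked list of content words; it is only an illustration, not the NIST scoring tool.

    # Simplified word error rate: edit distance between reference and hypothesis
    # word sequences, divided by the number of reference words.
    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    reference  = "the president met the prime minister in washington"
    hypothesis = "the president met a prime minister in washing ton"

    # Restrict the same calculation to an assumed list of content words.
    content = {"president", "prime", "minister", "washington"}
    ref_c = " ".join(w for w in reference.split() if w in content)
    hyp_c = " ".join(w for w in hypothesis.split() if w in content)

    print("WER, all words:    ", word_error_rate(reference, hypothesis))  # 3 errors / 8 words
    print("WER, content words:", word_error_rate(ref_c, hyp_c))           # 1 error  / 4 words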

The 1999 TREC-8 SDR track was designed to determine whether the technology scaled to realistically large spoken document collections. As such, the track used a subset of the TDT-2 corpus consisting of 557 hours and almost 22,000 stories. This amount of audio was large enough that it required recognition algorithms that worked in close to real time, as opposed to the 40- or even 300-times-real-time algorithms that were common in other speech recognition evaluations. In addition to the test conditions supported in TREC-7, a "story boundaries unknown" condition was added to provide a more realistic picture of how systems could perform if given a set of continuous, unsegmented recording streams to recognize and search. Despite the required focus on recognition speed, recognition error rates improved from 1998. The retrieval results were comparable to TREC-7, suggesting that the technology scaled to a collection almost an order of magnitude larger with no loss in accuracy. The rate at which retrieval performance degrades due to increasing recognition errors also appears to be independent of collection size. Retrieving from unsegmented streams is a harder problem, however: retrieval effectiveness for the unknown boundary condition was always worse than that of the corresponding run using known story boundaries.
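
One simple way to handle the unknown boundary condition is to cut the time-marked recognizer output into overlapping fixed-length windows, index each window as a pseudo-document and return each window's start time as the temporal pointer. The sketch below illustrates only that windowing step; the window and step sizes are arbitrary assumptions, and the track did not prescribe any particular approach.

    # Sketch of windowing an unsegmented, time-marked word stream into overlapping
    # pseudo-documents. Window and step lengths are arbitrary illustrative values.
    def make_windows(words, window_seconds=30.0, step_seconds=15.0):
        end_time = words[-1][0]
        windows, start = [], 0.0
        while start <= end_time:
            text = [w for t, w in words if start <= t < start + window_seconds]
            if text:
                windows.append({"start": start, "text": " ".join(text)})
            start += step_seconds
        return windows

    # Hypothetical recognizer output for one broadcast: (start_time_seconds, word).
    stream = [(i * 2.0, w) for i, w in enumerate(
        "weapons inspectors returned to baghdad today amid new tensions".split())]

    for window in make_windows(stream):
        print(window["start"], window["text"])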

The TREC SDR Track has provided an infrastructure for the development and evaluation of spoken document retrieval technology and a common forum for the exchange of knowledge between the speech recognition and information retrieval research communities. It has also provided objective, demonstrable proof that the technology can be successfully applied to realistic audio collections. The track is scheduled to continue in TREC-9 and beyond, with an eventual goal of expanding the retrieval task to include other media types.

More information regarding TREC can be found on the TREC web site at http://trec.nist.gov

TREC SDR Track Participants

    TREC-6 (1997)

      AT&T Labs Research
      Carnegie Mellon University
      City University
      CLARITECH Corporation
      Dublin City University
      IBM T.J. Watson Research Center
      Royal Melbourne Institute of Technology
      Swiss Federal Institute of Technology (ETH)
      University of Glasgow
      University of Maryland
      University of Massachusetts
      University of Sheffield
      U.S. Department of Defense

    TREC-7 (1998)

      AT&T Labs Research
      Carnegie Mellon University (2 groups)
      Defense Evaluation and Research Agency
      Royal Melbourne Institute of Technology/University of Melbourne/CSIRO
      TNO-TPD TU-Delft
      University of Cambridge
      University of Maryland
      University of Massachusetts
      University of Sheffield/University of Cambridge/SoftSound/ICSI
      U.S. Department of Defense

    TREC-8 (1999)

      AT&T Labs Research
      Carnegie Mellon University
      IBM T.J. Watson Research Center
      LIMSI-CNRS
      Royal Melbourne Institute of Technology
      SUNY Buffalo
      TwentyOne
      University of Cambridge
      University of Massachusetts
      University of Sheffield/University of Cambridge/SoftSound/ICSI

Ellen Voorhees and John Garofolo are with the National Institute of Standards and Technology (NIST) in Gaithersburg, MD 20899.


© 2000, American Society for Information Science