Annual Meeting Coverage

Report from Technical Session

Classification and Indexing for Image Collections: Theory and Practice

Reported by Ernie Dornfeld

Technical Session sponsored by Special Interest Groups/Classification Research (CR) and Visualization, Images and Sound (VIS)

Speakers:

James M. Turner, Université de Montréal: Explorations in Using Audio Description as a Tool for Indexing Moving Image Documents

Corinne Jorgensen, State University of New York at Buffalo: How People Describe Images: Continuing Research

Abby Goodrum, Drexel University: Sharing Congruence: Text-Based and Image-Based Representations for Moving Images

Andrew Grove, Corbis Corporation: Classification and Indexing for Image Collections: Theory and Practice

Moderator: Nancy Blase, University of Washington

The practical and research questions surrounding image collections were discussed in this session, sponsored by SIG/Classification Research and SIG/Visualization, Images and Sound.

James M. Turner, Université de Montréal, discussed results of preliminary research in using audio descriptions as a tool for automating indexing of moving image documents. Audio descriptions, developed as an aid for the visually impaired, accompany some television shows. They fill the gaps in dialogue with description of the video scene.

Are such audio descriptions useful as data for indexing and retrieving scenes from moving image documents? Since shot-level indexing of moving images is very expensive, it is attractive to use sources of information about these documents that already exist in electronic and textual form. Audio descriptions are one such textual source, along with others such as running descriptions, dialogue and narration.

Turner’s earlier work indicated a close correspondence between the content of running descriptions of moving images and shot-level indexing done by professionals, suggesting that textual descriptions of moving image content might serve as the basis for deriving subject indexing.

The current work tests a sample of documentary and scripted television shows, in French and English, digitizing a 30-minute sample of each to produce a database of described shots. Preliminary analysis of described audio suggests that there is a complex relationship between moving image shots and the accompanying audio description, where the latter exists at all. Audio description is recorded in quiet space in the regular sound track and is not always closely associated with the moving image to which it refers; for example, it may appear with the next shot in a sequence. In the first sample analyzed, fewer than half of all shots had any audio description.

In the video segment analyzed so far in this ongoing study, the narration supplies a good part of the description that is of potential use for indexing. This suggests that automatic processing of associated text material for indexing of moving image documents will rely on a combination of sources. While audio description by itself is not a sufficient source of subject indexing data for moving images, its use in combination with other sources of data is a promising technique for subject indexing of moving image documents.

Corinne Jorgensen, State University of New York at Buffalo, presented continuing research on how people describe images. Jorgensen’s earlier work investigated the attributes people use to describe images by asking them to describe images in words. Jorgensen then organized subjects’ descriptive words into classes, including such groupings as objects and human figures, locational and relational terms, color terms, the story of the picture, style and type, and others. Overall, these classes can be further characterized as either descriptive of physical attributes of the picture, interpretive or reactive.

Jorgensen has now compared subjects’ description of a series of images when presented with either (1) the opportunity to describe them in their own words without further stimulus or (2) a template of classes of content elements as a framework for subjects’ content description.

Findings were that the presence of a template suggesting classes of content words decreased the proportion of subjects’ words referring to literal objects and increased the proportion of terms describing the picture’s story or interpreting its content.

New work to replicate these findings has involved refinement of the template used by subjects to include more descriptive text illustrating what the term classes mean and the use of a new set of images. Preliminary results of the current work indicate consistency with the earlier findings.

Implications of this work for subject indexing of images include the possibility that the reluctance of subject indexers to assign affective or other abstract terms to images is analogous to the behavior of subjects in this study. Suggesting interpretive or abstract terms as a class that is appropriate for images may increase the frequency of their assignment. The ideal retrieval system for images would use a combination of techniques, some optimal for retrieval of object images and others for abstract concepts.

Abby Goodrum, Drexel University, discussed research on similarity measures between text- and image-based representations of moving images and the original moving images. This work is based on the observation that text-based representation of moving images can fail to represent image attributes such as color, motion and image content, while semantic information in images can be lost in image-only representation.

The model proposed as a basis for this work suggested that for general tasks involving images, such as browsing, navigation and visualization, an image-based representation would prove more useful. An image-based representation would likewise provide more useful tools for non-language tasks, such as access to iconographic or emotive elements. By contrast, it was expected that for linguistic tasks such as those involving hierarchy, class membership or known-item retrieval, text-based representation would be a more powerful tool.

Still-image extracts and text descriptions were prepared from a sample of 10-second moving image documents. Subjects were asked to make judgments of similarity between document representations and original moving images. Two tasks were assigned to subjects: a specific task to find an image depicting a target item and a general, purpose-based task which asked subjects to find images suitable for use in a project.

Subjects judged image-based representation to be more congruent than text representations with the original moving images. Results showed that the text representation of the moving image provided better performance of the specific task, while no clear superiority of representation was shown for the general, purpose-based task. This research showed no clear continuity of appropriateness of image representation for tasks ranging from the specific to the general.

Andrew Grove, Corbis Corporation, described the controlled vocabulary used at Corbis Corporation for indexing a very large collection of images. He began the discussion by pointing out that text and images have different subject indexing requirements, as text is to some degree self-descriptive, while images contain no internal statements about their content.

The Corbis controlled vocabulary, while continually evolving, now contains about 65,000 terms, 21,000 of which are lead-in terms. The thesaurus features a hierarchical display and can display graphical as well as textual scope notes for use by indexers and searchers.
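The role of lead-in terms can be illustrated with a minimal Python sketch. It assumes the usual thesaurus convention that a lead-in term is a non-preferred entry term pointing to a preferred term. The term pairs and the names LEAD_IN, PREFERRED and resolve are hypothetical, chosen for illustration; they are not taken from the Corbis vocabulary.

```python
# Hypothetical entries for illustration only; not from the Corbis vocabulary.
PREFERRED = {"cars", "children"}

# Each lead-in (non-preferred) term points to the preferred term to use instead.
LEAD_IN = {
    "automobiles": "cars",
    "autos": "cars",
    "kids": "children",
}


def resolve(term):
    """Map a searcher's or indexer's word to a preferred term.

    Returns the preferred term, or None if the word is not in the vocabulary.
    """
    key = term.strip().lower()
    if key in PREFERRED:
        return key
    return LEAD_IN.get(key)
```

In this sketch, a search on "Autos" and a search on "cars" both resolve to the same preferred term, so both reach images indexed under it.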

Decisions on term selection involve considering image content and context, the image collection’s heterogeneous audience and the practical factors such as limited resources and alternate access which can be provided in natural language.

Especially notable in the thesaurus structure is the treatment of place names, whose records include relationships such as "previous name," "subsequent name" and "historic, obsolete name." This feature allows control for place name changes while enabling accurate description of historical images.
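One way to picture such place name records is the following minimal Python sketch. The PlaceTerm class, its field names and the related_names function are assumptions made for illustration, not the Corbis record format; the sample data uses the well-known renaming of Ceylon to Sri Lanka.

```python
from dataclasses import dataclass, field


@dataclass
class PlaceTerm:
    """Hypothetical place name record with name-change relationships."""
    name: str
    previous: list = field(default_factory=list)    # "previous name" links
    subsequent: list = field(default_factory=list)  # "subsequent name" links


def related_names(name, thesaurus):
    """Follow previous/subsequent links and return all names for the place."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen or current not in thesaurus:
            continue
        seen.add(current)
        term = thesaurus[current]
        stack.extend(term.previous + term.subsequent)
    return sorted(seen)


# Illustrative data only.
THESAURUS = {
    "Ceylon": PlaceTerm("Ceylon", subsequent=["Sri Lanka"]),
    "Sri Lanka": PlaceTerm("Sri Lanka", previous=["Ceylon"]),
}
```

Under these assumptions, a historical image can stay indexed under "Ceylon," the name accurate for its date, while a search on "Sri Lanka" can be expanded through the links to retrieve it as well.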

Lessons from the construction of the Corbis controlled vocabulary include practical adaptation of the principle of facet division. Facets have traditionally been considered a fundamental principle of division in an indexing language. Grove has found that facet divisions are often most useful, both conceptually and in managing the vocabulary, when they are placed deep in the hierarchy rather than at its top level.