BULLETIN of the American Society for Information Science and Technology
Vol. 29, No. 3, February/March 2003

Annual Meeting Coverage

2002 ASIST Award of Merit
Karen Sparck Jones

Karen Sparck Jones, professor of computers and information at the Computer Laboratory at Cambridge University, is the 2002 winner of the ASIST Award of Merit. The citation honoring her and her work reads as follows:

    Karen Sparck Jones has had a long, rich and remarkable career in information science. As her nominator noted, “It is remarkable, one might even say moving, that someone who co-authored a paper in one of the great founding collections of our discipline, the Proceedings of the 1958 International Conference on Scientific Information in Washington, DC, should still be an architect of information science more than 40 years later.” She has made outstanding theoretical contributions to information retrieval and natural language processing and has built upon this theoretical framework through numerous experiments. Her work is among the most highly cited in the field. She has also been instrumental in the design and implementation of TREC (the Text REtrieval Conference), has authored several major publications and has influenced a whole generation of researchers and practitioners. Dr. Sparck Jones continues to work with TREC and with developers of Web search engines, and she has branched out into new areas, such as video retrieval based on speech. For her extraordinary contributions spanning six decades, the 2002 ASIST Award of Merit is presented to Karen Sparck Jones.

Karen Sparck Jones is professor of computers and information in the Computer Laboratory at the University of Cambridge. She can be reached by mail at William Gates Building, JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom; by phone: (+44) 1223 334631; by fax: (+44) 1223 334611; or by e-mail: ksj@cl.cam.ac.uk

    Dr. Sparck Jones was unable to attend the ASIST 2002 Annual Meeting. Her acceptance speech was read by Edie Rasmussen, conference chair, from the University of Pittsburgh.

First and most importantly, thank you for doing me the honor of giving me this award. I much appreciate, when I see the distinguished list of my predecessors, being invited to follow in their footsteps. Among them are two whom I would especially like to mention as colleagues for many years, from whom I learnt so much: the late Cyril Cleverdon and the late Gerard Salton.

It is a particular pleasure also, as someone from the other side of the pond, to be recognized by an American society. But as several of the names of previous recipients attest, though your society is American in name, it is not so in nature. These names (and your now adding mine) confirm the truly international character of our concerns. This has been a feature of research in automated information management right from the beginning as illustrated, for example, by the first conference I attended: the 1958 Washington International Conference on Scientific Information (ICSI), which had participants from Europe and the Soviet Union. I have been happy ever since to be a member of, and to collaborate with others in, this international community.

Thus the very widely used tf*idf (term frequency*inverse document frequency) form of term weighting is a combination of the late Gerard Salton's work on tf weighting in the 60s and mine on idf weighting in the early 70s. In 1976 my English colleague Stephen Robertson and I were glad to publish what has become a much cited and influential paper on relevance weighting in your journal. Now, the NIST/DARPA Text Retrieval Conferences (TRECs), the latest just underway at NIST, have brought researchers from all over the world to participate in large, well organized and highly productive IR evaluation experiments that have been significantly advancing the state of the art.
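
To make the tf*idf combination concrete, here is a minimal sketch of the weighting in Python. The toy corpus, the whitespace tokenization and the raw-count tf with a log(N/n) idf are illustrative simplifications, not a description of any particular system; many weighting variants are in use.

    import math
    from collections import Counter

    def tf_idf(documents):
        """Weight each term in each document by tf*idf, where
        tf  = raw count of the term in the document, and
        idf = log(N / n): N documents in all, n containing the term."""
        n_docs = len(documents)
        tokenized = [doc.lower().split() for doc in documents]
        doc_freq = Counter()                 # how many documents contain each term?
        for tokens in tokenized:
            doc_freq.update(set(tokens))
        return [{term: count * math.log(n_docs / doc_freq[term])
                 for term, count in Counter(tokens).items()}
                for tokens in tokenized]

    docs = ["statistical term weighting for document retrieval",
            "document retrieval experiments with term weighting",
            "speech recognition and spoken document processing"]
    for weights in tf_idf(docs):
        print(sorted(weights.items(), key=lambda kv: -kv[1])[:3])

Terms that occur often within a document but in few documents overall receive the highest weights, which is the division of labour between the tf and idf components.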

The work I have just mentioned also illustrates two other vital points about research in our field. One is that you can have a good idea but, even though computing technology may be advancing fast, it may be a while before the idea's time comes. Statistical approaches to IR were intellectually developed and given proof of concept quite early, in the 60s. But using them for real did not take off until the 90s. Older approaches to information systems had become entrenched and too costly to change. It took the new situation represented by the arrival of bulk full text and the Web to provide the clean-slate opportunity to implement novel strategies: Alta Vista, for instance, started off with idf weighting. Now, after decades of crying in the wilderness, statistical methods are delivering the goods, not only for mainstream document retrieval but for a whole range of other information management operations, such as filtering (selective dissemination of information, or SDI).

This point connects with my third. When I compare the program for the 1964 conference of the then American Documentation Institute, which I attended, with that for your current meeting, there are, not surprisingly, some notable differences. The 1964 papers are more varied than one might suppose, also more “modern” than one might expect, and of course some topics are, for good reasons, extremely hardy perennials. But, setting aside the differences in format, the current conference is much bigger, with almost three times as many contributions, and, more significantly, it is evident that technology has given us many more new things to do and to try.

One of the most important things that technology has given us is the bulk data for statistical methods to work on, so we can derive meaningful content indicators from surface keys, using word occurrences and co-occurrences as clues to document content. This applies to the Web, too. Statistical clues mean that we can see the shape of the information forest through the billions of pages of information – or sometimes misinformation – trees. From this point of view the heterogeneity of the Web is high-class grist to the indexing mill. We are used to dealing with conventional text documents, but if the same sort of lexical combination occurs in an invoice, a memo, a meeting record and a picture caption, as well as in a technical report, it all helps to show what some organization or community is really doing.
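
As a rough illustration of such surface statistics, the sketch below simply counts which word pairs occur together in the same document across a deliberately mixed toy collection; the documents are invented, and real indexing would of course work at far larger scale, with stop-word handling and weighting.

    from collections import Counter
    from itertools import combinations

    def cooccurrences(documents):
        """Count how often each pair of distinct words appears in the same document."""
        pairs = Counter()
        for doc in documents:
            words = sorted(set(doc.lower().split()))
            pairs.update(combinations(words, 2))
        return pairs

    # A heterogeneous toy collection: an invoice, a memo, a caption and a report.
    docs = ["invoice for web indexing services rendered",
            "memo on progress of the web indexing project",
            "caption: architecture of the web indexing prototype",
            "technical report on statistical web indexing"]
    for pair, count in cooccurrences(docs).most_common(3):
        print(pair, count)

The pair ("indexing", "web") turns up in all four document types, which is just the sort of clue to what this little "organization" is really doing.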

This hodge-podge heterogeneity has its downside: I don't necessarily want a scrappy one-line lecture slide when I'm looking for an informative review of some topic. But the Web has also given some familiar devices a new and very effective form, most notably in hyperlinks. This is, of course, citation indexing writ large, but with a significant gain in both breadth and depth as a way of marking content relationships. But it is also noticeable that the system that was the first to really exploit this new and improved tool – Google – recognizes other document marks as significant as well.
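
To make the "citation indexing writ large" point concrete, here is a toy sketch that treats incoming hyperlinks as a signal of a page's standing. The page names are invented, and bare in-link counting is only the crudest relative of the link analysis that a system like Google actually performs.

    from collections import Counter

    def inlink_counts(links):
        """Given (source, target) hyperlink pairs, count incoming links per page."""
        return Counter(target for _, target in links)

    links = [("pageA", "pageC"), ("pageB", "pageC"),
             ("pageD", "pageC"), ("pageA", "pageB")]
    print(inlink_counts(links).most_common())   # pageC is the most linked-to page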

We live in very exciting times only equaled, as I remember it, by those of the late 50s and early 60s, when we first began to apply computers to information management.

One of the new challenges I welcome is one implicit in what I have already said, namely how far you can push old ideas in new directions. This is well illustrated by the TREC enterprise. The TREC program has not only confirmed in very large tests that statistical methods work acceptably; it has also provided the opportunity to explore their application to new modes of document retrieval, like cross language or spoken document search, and has pushed into quite new areas like question answering. In its expansion TREC has attracted researchers from other areas, and it is significant that the ascendance of statistical approaches to information management in the last decade has been enriched by interaction between hitherto separate communities, as shown for example by the import of so-called (but misleadingly named) language modeling from speech recognition to document retrieval.
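
The language modelling idea itself can be sketched quite simply: score a document by the probability that a statistical model of the document's words generates the query. The Dirichlet-smoothed unigram estimate and the tiny texts below are one common textbook form, offered only as an illustration under those assumptions, not as how any particular TREC system works.

    import math
    from collections import Counter

    def query_likelihood(query, doc, collection, mu=10):
        """Log-probability of the query under a unigram model of the document,
        smoothed toward collection-wide statistics (Dirichlet smoothing).
        mu is kept small for this tiny example; realistic collections
        use much larger values."""
        doc_tokens = doc.lower().split()
        coll_tokens = collection.lower().split()
        doc_counts, coll_counts = Counter(doc_tokens), Counter(coll_tokens)
        score = 0.0
        for term in query.lower().split():
            p_coll = coll_counts[term] / len(coll_tokens)
            p_term = (doc_counts[term] + mu * p_coll) / (len(doc_tokens) + mu)
            score += math.log(p_term) if p_term > 0 else float("-inf")
        return score

    docs = ["statistical methods for document retrieval",
            "acoustic models for speech recognition"]
    collection = " ".join(docs)
    for d in docs:
        print(d, "->", round(query_likelihood("document retrieval", d, collection), 2))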

Other tasks, like information extraction, topic detection and tracking, and summarization have been addressed in TREC's sister (or perhaps I should say brother) government-sponsored programs, with their own evaluation scenarios. But the way in which those working on the different tasks trade ideas, techniques and tools has encouraged a whole ferment of research activity that gives substance to the claim that processing (and using) information is working it into a seamless web, not sticking it in a pigeonhole, as was too often presupposed in research or enforced in practice in the past. This new research is remaking connections adumbrated in the early days, as illustrated by the paper by Margaret Masterman, Roger Needham and myself presented at ICSI (my first paper), on the analogy between machine translation and information retrieval based on the use of a thesaurus, connections that were subsequently broken.

Statistical methods play an important part in these various tasks; but they can't all be done by statistical means alone. People used to think that document indexing required natural language processing (syntactic parsing and semantic interpretation), even artificial intelligence; but that has never been demonstrated for general-purpose retrieval systems. On the other hand it is clear that with more refined tasks like question answering or with other tasks like translation or summarizing in particular contexts when better quality than rock bottom is essential, we do need some “proper” language processing.

So one reason why this current research on a range of tasks is interesting is the intellectual one – how to get statistical and non-statistical language processing to work together. This is not simply a matter of, for instance, making parsing more effective by enhancing it with statistically based lexical preferences, or, if parsing fails, defaulting to statistical collocations that can group some words together as complex concept indicators. It is about the much tougher matter of choosing a statistical or non-statistical strategy, on the fly, as appropriate to some specific, individual information-seeking situation: for example, determining from a user's statement of information need what style of interpretation to apply. Is this input, at this point in an extended interactive session, actually a question or actually a search topic? Would this user be satisfied now with a simple extractive summary or a fancier constructed one?

This last example leads to the other reason for investigating ways of combining statistical and non-statistical processes. This is the very practical one that if we can do these various tasks automatically, and in different modes, rough or smooth, and we can build engines that combine these capabilities, we are a good way toward realizing the vision we all have of being able to offer people the flexible, integrated information systems we know they want. That is, systems where the user can move freely from initiating a cross-language document search, to getting a question answered, to having a set of selected text passages summarized, to being given a filtering specification derived from the result, you name it.

We are already learning a lot from current research and can see some of the results spreading into engineered prototypes and even into practical systems more rapidly than in the past. But much of the work in the evaluation programs I have referred to is laboratory study with sanitized, frozen, two-dimensional users. Doing proper evaluations with real users doing real things in real time, and with fancy multi-tasking systems as well, is now the big hurdle that's looming ever closer. So as a research community we have lots more to do: but of course that's what a researcher like me really likes!

Personal References

  • Masterman, M., Needham, R.M. & Sparck Jones, K. (1959). The analogy between mechanical translation and library retrieval. In Proceedings of the International Conference on Scientific Information 1958 (pp. 917-935). Washington, DC: National Academy of Sciences - National Research Council.
  • Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28, 11-21. (Reprinted in Griffith, B.C. (Ed.), Key Papers in Information Science, 1980, and in Willett, P. (Ed.), Document Retrieval Systems, 1988.)
  • Robertson, S.E. & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27, 129-146. (Reprinted in Willett, P. (Ed.), Document Retrieval Systems, 1988.)
  • Sparck Jones, K. (1995). Reflections on TREC. Information Processing and Management, 31, 291-314.
  • _____ (2000). Further reflections on TREC. Information Processing and Management, 36, 37-85.
  • _____ (in press). Meta-reflections on TREC. In Voorhees, E.M. & Harman, D.K. (Eds.), TREC: Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press.
  • Sparck Jones, K., Gazdar, G. & Needham, R. (Eds.). (2000). Computers, language and speech: Formal theories and statistical data. Philosophical Transactions of the Royal Society of London, Series A, 358(1769), 1225-1431.

Other Relevant References

  • Croft, W.B. (Ed.) (in press). Language modelling for information retrieval. Dordrecht: Kluwer.
  • Proceedings of the International Conference on Scientific Information 1958. Washington, DC: National Academy of Sciences - National Research Council, 1959.
  • Voorhees, E.M. & Harman, D.K. (Eds.) (in press). TREC: Experiment and evaluation in information retrieval. Cambridge, MA: MIT Press.

American Society for Information Science and Technology
8555 16th Street, Suite 850, Silver Spring, Maryland 20910, USA
Tel. 301-495-0900, Fax: 301-495-0810 | E-mail: asis@asis.org

Copyright © 2003, American Society for Information Science and Technology