Feature

Beyond Boole: The Next Logical Step

by Charles H. Davis

Enormous retrieval sets are like the weather: people complain, but few do anything about it. There have been some good suggestions for "filtering" output. However, most vendors, librarians and other information specialists are mesmerized by Boolean logic, while computer scientists insist on vain attempts to eliminate human intervention altogether. Lost in the shuffle is a well-established technique that empowers searchers and provides sensibly ranked output.

Futurists, commentators, megatrendists, and other darlings of the best-seller lists have been saying for some time that human-kind has entered a new age - the Information Age. . . . Spend a few days in this brave new world, however, and you come away with a markedly different perspective. . . computers have just mired us in the data swamp. They give you access to mountains of data, but in many cases they have not made it easier to glean useful information. . . . Until you can get a handle on all the data available to you, the Information Age will remain little more than a blurb on a dust jacket.

Luddite lunacy? Someone bashing information science in American Libraries? No, it's Technical Editor Bob Ryan introducing us to "Managing Gigabytes" in a 1991 issue of Byte magazine (volume 16, no. 5).

Astonishingly, librarians and other information specialists have become victims of their own success. For example, many online public-access catalogs (OPACs) now provide keyword Boolean searching. People already comfortable with the AND, OR and NOT logic of online searching appreciate this feature and have come to expect it whether they are searching locally or accessing a library through the Internet.

Unfortunately, neither the vendors, with their virtual hammerlock on online software, nor the designers of OPACs have provided features that allow intermediaries and end-users to cope with large amounts of output. A useful interface has been suggested by Michael Buckland and others that would permit "filtering" by such categories as language and date of publication. However, while reducing the overall size of the retrieval set, this technique does not produce the same kind of output achieved by permitting the searcher to specify each term's perceived importance (its "weight"). With few exceptions (for example, Personal Librarian and Topic Real-Time), software entrepreneurs also have ignored the needs of people who use large databases and catalogs. Interestingly, researchers investigated the "impediments" to enhanced retrieval systems several years ago. Although vendors had lots of excuses, they simply failed to see what was in it for them (as noted by Peter H. Smit and Manfred Kochen in a 1988 article in Information Processing & Management, vol. 24, no. 3).

The problem is this: given the sizes of our research libraries and specialized databases, perfectly good Boolean logic can lead to enormous retrieval sets. Ranking output with a well-established but still little-known technique called weighted-term searching provides a general and straightforward solution.

What Is Weighted-Term Search Logic?

It is really just an extension of its famous Boolean counterpart. As such, weighted-term logic can be used with either controlled vocabulary or free-text searching and with inverted or uninverted files. Moreover, it is superior to Boolean logic in that it can rank output in decreasing order of probable relevance. This allows searchers to limit the amount of material they retrieve and, more importantly, to do so in a rational way that complements the original file organization: alphabetical, chronological or other. Weighted-term logic is also appropriate for relational databases and hypertext environments, offering a convenient way to limit the number of links that are generated. In addition, the technique can be used with any standard interface: command- or menu-driven, form-fill or query by example.

The most straightforward of these systems employs whole numbers and simple arithmetic so that it can be used even by people who are not particularly adept at mathematics. Nevertheless, the technique requires a measure of skill and is meant for intermediaries or sophisticated end users.

In its simplest form, weighted-term logic allows the searcher to assign a threshold value that remains constant throughout a given search. This threshold is an integer against which each document's total weight is compared. The searcher then assigns every search term (a character string comprising a subject heading, keyword or phrase - truncated or in its entirety) a number (weight) that is directly proportional to the value that the searcher places on it. The first time the computer finds a term in a given record, it adds the corresponding weight to a cumulative total that eventually becomes the total weight for that document. The total weight is then compared with the threshold value previously chosen by the searcher. If the document weight is equal to or greater than the threshold, then the document or its surrogate is retrieved.
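
The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names document_weight and is_retrieved, the representation of a record as a set of terms, and the sample weights and threshold are all hypothetical and do not come from any particular retrieval system.

```python
def document_weight(record_terms, weights):
    """Sum the weights of the search terms found in one record.

    Because record_terms is a set, each term counts once per record,
    however often it occurs in the underlying text.
    """
    return sum(w for term, w in weights.items() if term in record_terms)


def is_retrieved(record_terms, weights, threshold):
    """A record is a hit when its total weight meets or exceeds the threshold."""
    return document_weight(record_terms, weights) >= threshold


# Hypothetical weights and threshold chosen by the searcher.
weights = {"mars": 3, "geology": 2}
threshold = 5

record = {"mars", "geology", "volcanoes"}  # descriptors of one record

print(document_weight(record, weights))            # 5
print(is_retrieved(record, weights, threshold))    # True
print(is_retrieved({"mars"}, weights, threshold))  # False
```

Treating each record as a set of terms is one simple way to make sure a weight is added only the first time a term is found.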

Although they can be more complicated, such systems often use thresholds and weights that are simply non-zero integers - whole numbers other than zero, sometimes within a limited range such as -9 through +9. Searchers are free to choose their own numbers, but these numbers are not entirely arbitrary; the thresholds and term weights are related in a way that reflects the corresponding Boolean logic.

Basic Examples

Imagine three sets of documents indexed under (or containing) the following descriptors: Mars, Geology and Atmosphere. (The system can be programmed to permit coordination on any keywords, subject terms or other descriptors desired.)

Inclusive OR is achieved by assigning every term a value equal to or greater than the threshold. But consider how much more you can get using weighted-term logic, even with only two terms. After choosing a threshold of, say, 5, you can assign one term (Mars) a value of 6 and a second (Geology) a value of 5. The computer will look for each term or phrase and provide you with three possible retrieval sets:

  1. Mars AND Geology (with a total weight of 11);
  2. Mars (with a total weight of 6); and
  3. Geology (with a total weight of 5).

Interestingly, exclusive OR, a feature rarely seen in commercial systems, is easily accomplished using a negative threshold. Individual terms can be assigned negative values equal to the threshold, which will satisfy the search criterion; however, when both terms are present - voila - the computer adds the two negative weights and finds that their sum is less than the threshold, producing, for example, (Mars OR Geology) minus (Mars AND Geology).

Sample weighted-term search:

Threshold = -1;
Mars and Geology each has a weight of -1.
For documents with either term, the total document weight would equal minus one. For those with both terms, the total weight would equal minus two, which by mathematical convention and the computer's definition is less than minus one.

Slick.
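
The exclusive OR search can be checked with a short Python sketch. One assumption is made explicit here that the text leaves implicit: only records containing at least one search term are ever scored, as would happen naturally with an inverted-file lookup. Without that assumption, a record containing neither term would have a weight of 0, which is greater than -1. The function name xor_hit is hypothetical.

```python
def xor_hit(record_terms, weights, threshold):
    """Weighted-term test for one record, scoring only records that match a term."""
    matched = [term for term in weights if term in record_terms]
    if not matched:
        # Assumption: records with none of the search terms are never scored.
        return False
    return sum(weights[term] for term in matched) >= threshold


weights = {"Mars": -1, "Geology": -1}
threshold = -1

print(xor_hit({"Mars"}, weights, threshold))             # True  (-1 >= -1)
print(xor_hit({"Geology"}, weights, threshold))          # True  (-1 >= -1)
print(xor_hit({"Mars", "Geology"}, weights, threshold))  # False (-2 < -1)
print(xor_hit({"Venus"}, weights, threshold))            # False (never scored)
```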

Logical AND is achieved by assigning each term a weight so that only the combined weight of all the terms (the sum of their weights) reaches the threshold: Mars AND Geology AND Atmosphere.

Threshold = 6;
Mars, Geology and Atmosphere each has a weight of 2.
No single term and no pair of terms would result in a hit, since 2 or 4 is less than the threshold value. With all three terms present: 2+2+2=6.
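
As a quick check, the following Python sketch enumerates every combination of the three terms and keeps those whose total weight reaches the threshold. The variable names are illustrative only.

```python
from itertools import combinations

weights = {"Mars": 2, "Geology": 2, "Atmosphere": 2}
threshold = 6

hits = []
for size in range(1, len(weights) + 1):
    # Try every combination of 1, 2 and 3 terms.
    for combo in combinations(weights, size):
        total = sum(weights[term] for term in combo)
        if total >= threshold:
            hits.append(combo)

print(hits)  # [('Mars', 'Geology', 'Atmosphere')]
```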

Logical NOT is accomplished by assigning negative weights to undesired terms so that the total is less than the threshold. Remember that the technique always requires assigning weights to individual terms so that the sums of their weights equal or exceed the threshold. For example, if you want the equivalent of Mars NOT Atmosphere, you might use a threshold of, say, 7, and then assign Mars a weight of 7 and Atmosphere a weight of -1. In fact any negative number assigned to Atmosphere would have the same effect. Searchers can also adjust thresholds and small negative numbers simply to reduce the likelihood of retrieving a document with a less desirable term. In other words, the technique is more flexible than the "all or nothing" approach with logical NOT.
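
The difference between a strict NOT and a softer penalty can be sketched in Python. The first set of weights is the one given above; the second adds a weight of 2 for Geology, which is a hypothetical value chosen only to show a negative weight being outvoted by other desirable terms.

```python
def document_weight(record_terms, weights):
    """Sum the weights of the search terms present in a record."""
    return sum(w for term, w in weights.items() if term in record_terms)


threshold = 7

# Strict NOT from the text: Mars NOT Atmosphere.
strict = {"Mars": 7, "Atmosphere": -1}
print(document_weight({"Mars"}, strict) >= threshold)                # True  (7)
print(document_weight({"Mars", "Atmosphere"}, strict) >= threshold)  # False (6)

# Softer variant: Atmosphere lowers the score but does not veto the record.
soft = {"Mars": 7, "Geology": 2, "Atmosphere": -1}
print(document_weight({"Mars", "Atmosphere"}, soft) >= threshold)             # False (6)
print(document_weight({"Mars", "Geology", "Atmosphere"}, soft) >= threshold)  # True  (8)
```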

More Sophisticated Examples

Consider this Boolean operation:

(Mars AND Geology) NOT Atmosphere.

The choice of numbers is really up to you as a searcher. You could choose a threshold of 5, and then assign Mars a weight of 3, Geology a weight of 2 and Atmosphere a weight of -1. The computer would select all those documents with Mars and Geology, adding 3 + 2 and faithfully getting 5, which is equal to the threshold; however, if it found Atmosphere, it would subtract 1, getting 4, which no longer satisfies the requirement. Alternatively, you could assign both Mars and Geology weights of 3 and then assign Atmosphere a weight of -2. Six minus two equals four, which is one less than the threshold of five.
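
One way to confirm that a set of weights really does reproduce the Boolean expression is to compare the two over every combination of present and absent terms. The Python sketch below does this for both sets of weights given above, and for a third, deliberately faulty set (a hypothetical counterexample) in which the penalty for Atmosphere is too small. The function names are illustrative.

```python
from itertools import product

TERMS = ("Mars", "Geology", "Atmosphere")


def boolean_query(present):
    """(Mars AND Geology) NOT Atmosphere."""
    return present["Mars"] and present["Geology"] and not present["Atmosphere"]


def weighted_query(present, weights, threshold):
    """Weighted-term test: sum the weights of the terms that are present."""
    total = sum(weights[term] for term in TERMS if present[term])
    return total >= threshold


def equivalent(weights, threshold):
    """True if the weights match the Boolean query on all eight combinations."""
    for flags in product((False, True), repeat=len(TERMS)):
        present = dict(zip(TERMS, flags))
        if boolean_query(present) != weighted_query(present, weights, threshold):
            return False
    return True


print(equivalent({"Mars": 3, "Geology": 2, "Atmosphere": -1}, 5))  # True
print(equivalent({"Mars": 3, "Geology": 3, "Atmosphere": -2}, 5))  # True
# Counterexample: 3 + 3 - 1 = 5 still reaches the threshold.
print(equivalent({"Mars": 3, "Geology": 3, "Atmosphere": -1}, 5))  # False
```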

Here is an example that shows the advantage over Boolean logic more clearly:

(Mars AND Geology) OR (Mars AND Atmosphere)

One searcher might choose a threshold of 3, assigning a weight of 2 to Mars while assigning Geology and Atmosphere weights of 1 each. Another searcher might choose a threshold of 9, assigning Mars a weight of 8 while assigning Geology and Atmosphere weights of 2 and 1, respectively. Since the requirement is that the individual weights or their sums be greater than or equal to the threshold, these two sets of weights are equivalent as far as the Boolean logic is concerned; however, they have greater discriminating power than the Boolean expression, as the retrieval sets will consist of easily identifiable subsets. In the first case there will be two sets of retrieved documents with total weights of 4 (all three terms) and 3 (Mars with either of the other two); in the second case there will be three sets with total weights of 11 (all three terms), 10 (Mars and Geology) and 9 (Mars and Atmosphere).
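
The subsets produced by the two searchers' weights can be listed with a short Python sketch. The function name retrieval_sets is hypothetical; it simply groups the qualifying term combinations by their total weight.

```python
from itertools import combinations

TERMS = ("Mars", "Geology", "Atmosphere")


def retrieval_sets(weights, threshold):
    """Map each qualifying total weight to the term combinations that produce it."""
    sets = {}
    for size in range(1, len(TERMS) + 1):
        for combo in combinations(TERMS, size):
            total = sum(weights[term] for term in combo)
            if total >= threshold:
                sets.setdefault(total, []).append(combo)
    # Highest total weight first.
    return dict(sorted(sets.items(), reverse=True))


first = retrieval_sets({"Mars": 2, "Geology": 1, "Atmosphere": 1}, 3)
second = retrieval_sets({"Mars": 8, "Geology": 2, "Atmosphere": 1}, 9)

print(list(first))   # [4, 3]
print(list(second))  # [11, 10, 9]
print(first[3])      # [('Mars', 'Geology'), ('Mars', 'Atmosphere')]
```

With the first searcher's weights, the two two-term combinations share a total of 3 and cannot be told apart by weight alone; with the second searcher's weights, each combination has its own total.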

Using small integers will not always permit unique identification of the terms that cause a corresponding hit; however, there are tricks of the programmer's trade that can be employed if such things are wanted. For example, the system can be modified so that it will automatically employ term weights to the base two. H.P. Iker showed as early as 1967 that this method yields completely unambiguous results.
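
A minimal Python sketch of this idea follows. It assumes that "weights to the base two" means successive powers of two (1, 2, 4 and so on), which is one plausible reading; it does not claim to reproduce Iker's actual method. Under that assumption every combination of terms has a distinct total, and the total can be decoded back into its terms bit by bit.

```python
TERMS = ("Mars", "Geology", "Atmosphere")

# Assumption: "weights to the base two" read as successive powers of two.
weights = {term: 2 ** i for i, term in enumerate(TERMS)}  # 1, 2, 4


def total_weight(record_terms):
    """Sum the power-of-two weights of the terms present in a record."""
    return sum(weights[term] for term in TERMS if term in record_terms)


def decode(total):
    """Recover the matching terms from a total weight, one bit per term."""
    return [term for term in TERMS if total & weights[term]]


print(total_weight({"Mars", "Atmosphere"}))  # 5
print(decode(5))                             # ['Mars', 'Atmosphere']
print(decode(6))                             # ['Geology', 'Atmosphere']
```

Weights of this kind serve to identify which terms matched; they do not by themselves express how much the searcher values each term.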

Displaying the Output

When relatively few items are retrieved, the best items (those with the highest total weights) can be identified by visual inspection. This has the virtue of retaining the original file order - for example, by author, title or perhaps by a chronological arrangement. However, with weighting, the computer also can sort by document weights, meaning that the documents with the greatest weights (the ones having the most - or most important - descriptors) will appear first. While there is no guarantee that the documents displayed first will be most relevant, there is a much greater likelihood that they will be. At the very least they will reflect the values of the searcher.
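
Sorting by document weight might look like the following Python sketch. It reuses the weights from the inclusive OR example (threshold 5, Mars 6, Geology 5); the record labels and their descriptors are hypothetical. Python's sorted function is stable, so records with equal weights keep their original file order.

```python
def document_weight(record_terms, weights):
    """Sum the weights of the search terms present in a record."""
    return sum(w for term, w in weights.items() if term in record_terms)


weights = {"Mars": 6, "Geology": 5}
threshold = 5

# Hypothetical records, listed in their original file order.
records = [
    ("Doc A", {"Geology"}),
    ("Doc B", {"Mars", "Geology"}),
    ("Doc C", {"Moon"}),
    ("Doc D", {"Mars"}),
    ("Doc E", {"Mars", "Geology"}),
]

scored = [(document_weight(terms, weights), label) for label, terms in records]
hits = [(weight, label) for weight, label in scored if weight >= threshold]

# Highest weight first; ties stay in file order because the sort is stable.
ranked = sorted(hits, key=lambda pair: pair[0], reverse=True)

for weight, label in ranked:
    print(weight, label)
# 11 Doc B
# 11 Doc E
# 6 Doc D
# 5 Doc A
```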

Going Fishing

Searchers don't always know in advance what they want, and one tried and true Boolean technique has been to string together several terms using logical OR. An immediate advantage to using weighted terms should be apparent here: one simply has to choose a suitable threshold, perhaps just "1," and then assign the same weight to all potentially useful terms. Not only will the same documents appear as would have been retrieved by the Boolean search, but they can be sorted by document weight. Value added, indeed! One can imagine an "alphabet" of terms from A to Z, each with a weight of one. The computer could be programmed to sort the resulting retrieval sets, whose combined weights might be (in decreasing order): 26, 25, 24. . . 1. If searchers specified a cut-off number - the maximum number of documents they were willing to examine - then they would be skimming off the top, looking only at those documents most likely to have satisfied the search criteria.
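
The "alphabet" search with a cut-off can be sketched in Python as follows. Each capital letter stands for one search term with a weight of one; the four sample records and the cut-off of 2 are hypothetical.

```python
import string

# Each letter stands for one search term, all weighted 1.
weights = {letter: 1 for letter in string.ascii_uppercase}
threshold = 1
cutoff = 2  # hypothetical maximum number of documents to examine

# Hypothetical records and the terms each one contains.
records = {
    "doc1": set("ABC"),
    "doc2": set("A"),
    "doc3": set(),
    "doc4": set("ABCDE"),
}


def document_weight(terms):
    """Sum the weights of the search terms present in a record."""
    return sum(weights.get(term, 0) for term in terms)


hits = {}
for name, terms in records.items():
    total = document_weight(terms)
    if total >= threshold:
        hits[name] = total

# Rank by weight, highest first, and skim off the top.
top = sorted(hits, key=hits.get, reverse=True)[:cutoff]
print(top)  # ['doc4', 'doc1']
```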

Conclusion

Weighted-term searching provides searchers with a more powerful technique than Boolean logic. The method described has nothing to do with word frequencies in documents, nor does it involve the assignment of values by indexers. It empowers searchers to control their strategies.

Long taken for granted in many technical information centers and special libraries, weighted-term logic is overdue for general use. From a systems standpoint, it is fairly easy to implement and represents a straightforward method for getting ranked output. It can be coupled with term truncation to provide powerful capabilities for database searching and record display currently unavailable through bibliographic utilities, online search services, or typical database management software packages.

Many of the advanced information retrieval methods being explored by computer scientists seek to eliminate the intermediary altogether. Whether such a development would represent an improvement is debatable, since there is as yet no satisfactory alternative to human judgment in most practical matters - particularly those involving the arts and humanities. The method described here can be used profitably in any field by search intermediaries or end users who wish to employ techniques more sophisticated than those afforded by simple Boolean coordination.

Information specialists generally have been good about adopting contemporary techniques, but they have overlooked weighted-term searching, whose strengths were demonstrated over a quarter of a century ago in Ralph H. Sprague, Jr.'s A Comparison of Systems for Selectively Disseminating Information (Bureau of Business Research, Graduate School of Business, Indiana University). Nor can they expect vendors to implement systems that provide ranked output unless encouragement is provided. Writing about electronic publishing, Richard Bowers observes in Information Today (volume 8, number 6): "The conventional wisdom seems to be that whatever the technology experts decide is sufficient. . . . [However, t]he technology industry is simply trying to protect its self-interests by establishing standards that will enable them to work together to reduce risk and to create profitable products."

If vendors implement state-of-the-art systems only when there is a demand for them, then they should hear from practitioners that there is such a demand. With the innovative freedom implied by client-server retrieval systems, information professionals should consider a new approach that is now at least 30 years old.


Charles H. Davis is a past president of ASIS, a Visiting Scholar at Indiana University and a Professor Emeritus at the University of Illinois, Urbana-Champaign, where he also served as dean from 1979 to 1986.


Selected Bibliography