Google and Search Techniques


One of the most difficult tasks in information science is designing automated techniques for classifying material by its subject matter. For many years this has been a manual process. The most widely known example is the way books are classified in a library: a trained librarian evaluates the book and assigns it a category from a master list of available choices, the two most popular lists being the Dewey Decimal System and the Library of Congress System. This runs into problems when a book covers more than one field, or when a book is about a new field that has not yet been assigned a category. However, since books are fairly large documents, the method usually works reasonably well.

The next level of classification was applied to the shorter documents that appear in periodical publications such as magazines and journals. Every scientific discipline has one or more indexes covering articles in the field. Some indexes use broad categories similar to those used for books, but because the fields are so specialized, many subcategories must be created to keep the number of items under one heading manageable. People frequently disagree about which subcategory should be assigned. Many indexes try to supplement the categories with "keywords", which are supposed to be specific enough to limit retrieval to a small collection of documents. The question also arises whether the keywords come from a controlled list or an open-ended one. A controlled list fails when new concepts are discussed, and an open-ended list fails when the searcher has no idea what terms were assigned.

Because of these failures, and the high cost of using humans for classification, the goal has been to find an automated way to index material. When computers with large amounts of storage started to appear in the 1970s, many projects were started to make use of the new tools. The first attempts used keywords, because storage was too expensive to hold information about the whole document. In the 1990s the dramatic drop in the price of electronic storage, and the rise in its capacity, led to the use of "full-text" indexing. In this technique all the important words of a document are indexed, and when a person does a search, all the documents containing the search word or words are found. Unless the word is highly selective, so that it appears very infrequently, the number of documents found is too large to be useful. This is the well-known tradeoff between recall and relevance.
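To make the idea concrete, here is a minimal sketch in Python of the kind of "inverted index" that full-text systems build. The tiny document collection and the list of "unimportant" words are invented purely for illustration, and this is not how any particular search engine actually works; it simply shows why a common word like "apple" matches almost everything.

    from collections import defaultdict

    # Invented documents, purely for illustration.
    documents = {
        1: "the red headed boy was eating a green apple",
        2: "apple pie recipes for the fall harvest",
        3: "the history of the apple in european orchards",
    }

    # Words treated as too common to be worth indexing.
    STOP_WORDS = {"the", "a", "was", "of", "for", "in"}

    def build_index(docs):
        """Map every important word to the set of documents containing it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                if word not in STOP_WORDS:
                    index[word].add(doc_id)
        return index

    index = build_index(documents)
    # A common word matches nearly every document: high recall, low relevance.
    print(sorted(index["apple"]))   # [1, 2, 3]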

The first step toward solving this problem was the introduction of "Boolean logic" into retrieval. The user specifies several words in a formal format and the computer applies a filtering process to limit the size of the collection returned. We see examples of this in Google all the time: the user is instructed to search, for example, on red AND apple instead of just apple, so that only documents containing both words are returned. There are obvious weaknesses to this approach; for example, the phrase "the red headed boy was eating a green apple" would be found. So various additional techniques are added, such as requiring the words to be within a certain proximity of each other, or counting the number of times a word appears in the document. These are all crude attempts to substitute a technique that computers do well (that is, counting) for what is really needed: a technique to extract the concepts in the document.
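The rough sketch below, again with invented documents, shows these filters in Python: a Boolean AND test, a proximity test, and a simple term count. The gap threshold and the documents are arbitrary stand-ins, not what any real engine uses.

    # Invented documents, continuing the "red AND apple" example.
    documents = {
        1: "the red headed boy was eating a green apple",
        2: "red delicious apple varieties and where to buy them",
    }

    def contains_all(text, *terms):
        """Boolean AND: every term must appear somewhere in the document."""
        words = set(text.lower().split())
        return all(term in words for term in terms)

    def within_proximity(text, word1, word2, max_gap=2):
        """True if the two words occur within max_gap positions of each other."""
        words = text.lower().split()
        positions1 = [i for i, w in enumerate(words) if w == word1]
        positions2 = [i for i, w in enumerate(words) if w == word2]
        return any(abs(i - j) <= max_gap for i in positions1 for j in positions2)

    def term_count(text, word):
        """How often the word appears: a crude stand-in for relevance."""
        return text.lower().split().count(word)

    for doc_id, text in documents.items():
        if contains_all(text, "red", "apple"):   # both documents pass the AND filter
            print(doc_id, within_proximity(text, "red", "apple"), term_count(text, "apple"))
    # Document 1 passes the Boolean filter but fails the proximity test;
    # document 2 passes both, which is the distinction proximity is meant to add.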

On the web the majority of searches are for entertainment or shopping. In both cases the search is aimed at an object or object class, not a concept. The object name has good discrimination: it occurs in few documents, so those documents have a high probability of being relevant, and the list of links returned appears to satisfy the request. In addition, there are often many documents with essentially the same content value, for example listings of which stores carry a specific product, so any returned item will appear relevant. The fact that these documents are so similar is the reason the search engines spend so much effort trying to order them. The criteria used are some combination of popularity and commercial payments. Very little harm is done using these techniques for popular searches, except perhaps to the credibility of the search engines.

With specialized material these techniques work very poorly. Assume at this point that the limitations discussed in the first part of this essay have been overcome and the relevant material has at least been processed by the search engines. One of the more promising approaches has been cluster analysis, where documents are grouped together based upon some measure of similarity between them. As you can see this is a fairly esoteric topic, yet a Google search for the phrase "clustering techniques in document analysis" produces 171,000 hits! Obviously present search engines are not up to the task. This failure is not unique to the "free" search engines, as all the recent comments about the failure of the intelligence agencies to "connect the dots" about terrorist activities strikingly illustrate.

From work done by me and my colleagues, at one time the most promising approach combined mechanical techniques for clustering documents with an iterative process to refine the clusters so that they coalesce into very small groupings. The search strategy then consists of submitting a document that is "similar" to what is desired. That document is analyzed using the same clustering techniques and the closest sets are returned. The searcher studies what is found and repeats the process using the "best" of the returned documents. After doing this several times quite good relevance can be achieved. The difficulty with this technique is that the entire collection needs to be clustered before any searching is done, which is extremely expensive and time-consuming when a collection is large; this is one of the reasons the search engines don't consider using it.
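A minimal sketch of this kind of cluster-and-refine search is given below in Python. It is only an illustration of the general idea, not the actual system described above: TF-IDF vectors, k-means, and the scikit-learn library are assumed here as convenient stand-ins for the similarity measure and clustering method, and the four-document "collection" is invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    # An invented collection, far too small to be realistic.
    corpus = [
        "apple orchard management and fruit tree pruning",
        "pruning schedules for commercial fruit orchards",
        "red apple varieties sold in grocery stores",
        "grocery store pricing of fresh produce",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(corpus)

    # Step 1: cluster the whole collection ahead of time (the expensive part).
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_vectors)

    # Step 2: the searcher submits a document "similar" to what is desired.
    query = ["best practices for pruning apple trees"]
    query_vector = vectorizer.transform(query)

    # Step 3: find the nearest cluster and return its members ranked by
    # similarity; the searcher would then resubmit the best result and repeat.
    nearest = kmeans.predict(query_vector)[0]
    members = [i for i, label in enumerate(kmeans.labels_) if label == nearest]
    scores = cosine_similarity(query_vector, doc_vectors[members]).ravel()
    for score, i in sorted(zip(scores, members), reverse=True):
        print(f"{score:.2f}  {corpus[i]}")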

So for the foreseeable future we will be stuck with inadequate search technology and with the misperception that what we find with a web search optimally represents what is available. As the younger generation passes through school, the older techniques of information retrieval will become less popular, and thus the magnitude of the problem will increase. Having the gateway to the "information highway" controlled by three or four commercial search engine companies represents a serious threat to the diffusion of knowledge and to free speech in general.


Moral: Freedom of speech doesn't mean much if nobody can hear you


If you have any comments you would like to add, email me at robert.feinman@gmail.com
Copyright © 2004 Robert D Feinman
Feel free to use the ideas, but the words are mine.