More than 10 billion accessible web pages are interconnected throughout the world, which means we are now heavily dependent on the database those pages provide. The key tools for making use of so many pages are keyword retrieval, clustering representation, and index retrieval. What makes these tools so powerful?
Let's start with a fundamental reason. Consider, for example, reserving an airline ticket to the United States via the Internet. You can easily find the appropriate web page by typing the keywords "ticket", "airplane", and "reservation". You may need to add "airline" or "trip" as well. This works in much the same way regardless of language; only the wording differs.
The keywords used here have related meanings, and these relations form a layered structure called a thesaurus (Figure 1).
Although a thesaurus is not uniquely defined, its structure is very similar for groups or individuals who share a common culture. That is why we were able to easily trace upward to find the meta expression shown in Figure 1. This similarity is what makes web pages useful and valuable.
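The upward trace through a layered thesaurus can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hierarchy below is a hypothetical toy example, not the content of Figure 1, and the names BROADER, path_to_root, and common_broader_term are invented for this sketch.

```python
# Hypothetical toy thesaurus: each term points to its broader (parent) term.
# The hierarchy is an assumption for illustration, not the one in Figure 1.
BROADER = {
    "airplane": "transportation",
    "airline": "transportation",
    "ticket": "booking",
    "reservation": "booking",
    "transportation": "travel",
    "booking": "travel",
    "trip": "travel",
}


def path_to_root(term, broader=BROADER):
    """Return the chain of terms from `term` up to the top of the thesaurus."""
    path = [term]
    while path[-1] in broader:
        path.append(broader[path[-1]])
    return path


def common_broader_term(terms, broader=BROADER):
    """Return the most specific term lying on every keyword's upward path."""
    paths = [path_to_root(t, broader) for t in terms]
    # Walk up from the first keyword; the first term shared by all paths wins.
    for candidate in paths[0]:
        if all(candidate in p for p in paths[1:]):
            return candidate
    return None  # the keywords share no broader term in this thesaurus


print(common_broader_term(["airplane", "airline"]))
print(common_broader_term(["ticket", "airplane", "reservation"]))
```

Two people whose thesauri differ in detail but share the upper layers would reach the same broader term from the same keywords, which is the similarity the paragraph above relies on.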
An interesting episode illustrates how this language structure works. For secret communication, we convert natural language into code. To break the code, we must first identify the original language. During World War II, the US Navy successfully deciphered Japanese code, but the Japanese navy failed to break the US code. US forces used the Navajo language as the original language, which was completely unknown in Japan. However, Navajo does not have words like airplane, cannon, or battleship. So what did they do? For details, refer to Simon Singh's excellent book, The Code Book, published by Bantam Doubleday Dell, 1999. Singh points out the importance of expressions beyond mere coding. Today's Internet society likewise works under a common concept of value.
Constructing a reliable thesaurus still takes a great deal of human effort and time. To cut down on both, we recently announced a new retrieval engine based on the statistical distribution of words, which works in both Japanese and English. With it, we can do without a thesaurus. For details, refer to H. Itoh, H. Mano, and Y. Ogawa, "Term Distillation in Patent Retrieval," in Proceedings of the Workshop on Patent Corpus Processing, pp. 41-45, published by the ACL, 2003. A more challenging technology, automatic thesaurus construction, is described in Tokunaga, Iwayama, and Tanaka, "Automatic thesaurus construction based on grammatical relations," Technical Report, Department of Computer Science, Tokyo Institute of Technology, 1995.
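To give a feel for how word statistics alone can stand in for a hand-built thesaurus, here is a minimal Python sketch of TF-IDF term weighting. It is a generic textbook technique used here as a stand-in, not the term distillation method of the paper cited above, and the function name tf_idf_scores and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Weight each term in each document by TF-IDF.

    Generic illustration only; this is NOT the term distillation method
    of the cited paper. Tokenization is naive whitespace splitting.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # Frequent in this document, rare across the collection: high weight.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample documents.
docs = [
    "airline ticket reservation",
    "airline trip",
    "patent retrieval",
]
for doc, weights in zip(docs, tf_idf_scores(docs)):
    best = max(weights, key=weights.get)
    print(doc, "->", best)
```

Terms that appear in every document receive a weight of zero, so the ranking is driven by words that distinguish one page from another, with no knowledge of their meaning.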