Column

(No.11) Occurrences of Character Strings

It is well known that different characters occur with different frequencies. As a rule of thumb, the occurrence probability of a character string decreases with its length, while the variance among strings of the same length increases. This has been found to hold in every language examined. Decryption starts from this uneven distribution of string occurrences. A longer character string can often be divided into substrings called element strings; the original longer word is called a composite word. If a composite word is used often, it tends to be abbreviated to a simpler expression, in Japanese often a four-syllable spelling: personal computer → "pa-so-ko-n."

We will inspect this phenomenon using the novels of Natsume Soseki. Suppose an m-character word A_m and an n-character word B_n are merged into the composite word A_mB_n, whose length is m+n characters. If the two components occurred independently of each other, the occurrence probability of the composite would be p(A_m) × p(B_n). Of course, the composite word's actual occurrence probability p(A_mB_n) is smaller than either p(A_m) or p(B_n), but it is much larger than the product of the two word probabilities, i.e., p(A_mB_n) ≫ p(A_m) × p(B_n).
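To make the inequality concrete, here is a minimal Python sketch, with function names and a toy English corpus of my own choosing, that estimates p(A_m), p(B_n), and p(A_mB_n) directly from text and reports how far the composite exceeds the independence prediction:

```python
def prob(text: str, s: str) -> float:
    """Occurrence probability of string s: matches per possible position."""
    positions = len(text) - len(s) + 1
    count = sum(1 for i in range(positions) if text[i:i + len(s)] == s)
    return count / positions

def independence_lift(text: str, a: str, b: str) -> float:
    """How many times more often the composite a+b occurs than
    the product p(a) * p(b) predicted under independence."""
    return prob(text, a + b) / (prob(text, a) * prob(text, b))

# Toy corpus: the composite "thank you" appears far more often
# than chance placement of "thank " and "you" would predict.
corpus = ("thank you for coming " * 50
          + "please thank the host and see you soon " * 5)
print(independence_lift(corpus, "thank ", "you"))  # well above 1 (about 20)
```

Even on this tiny corpus the lift for a genuine composite comes out around 20; for Soseki-scale data the effect is the factor of 1,000 discussed below.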


The complete works of Natsume Soseki contain about 2,210,000 Japanese characters. Using this corpus, I counted the occurrences of the following three character strings, where "~" indicates an approximate number:


For simplicity, I assume the average word length in Japanese to be 2.2 characters, which means the above corpus contains about 1 million words. Therefore, the approximate occurrences of the individual character strings are:


If these occurrence events were independent of each other, the expected probability of the string "hikinoba" would be the product of its components' occurrence probabilities, which comes to about 3×10⁻⁹.
In reality, however, the occurrence is about 1,000 times higher. That is to say, independence predicts only 3×10⁻⁹ × 10⁶ = 3×10⁻³ occurrences in the corpus, a far smaller value than the one actually observed. As a matter of fact, even a single occurrence of a composite word is enough to suspect that it exists as a word in its own right. This rule applies in most existing natural languages.
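As a quick check of this arithmetic, spelled out with the column's own figures (the observed count of about 3 is simply what "1,000 times higher" implies):

```python
# All figures are the column's approximations.
corpus_words = 10**6      # Soseki corpus, assuming 2.2 characters per word
p_independent = 3e-9      # probability of "hikinoba" if its parts were independent

expected = p_independent * corpus_words   # occurrences predicted by independence
observed = 1000 * expected                # "1,000 times higher", as stated above
print(f"expected {expected:.3f}, observed about {observed:.0f}")
# -> expected 0.003, observed about 3
```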

Using this property, we can predict incoming character strings as they are typed. As soon as a longer string that begins with the characters entered so far is found in the frequency list, the system can propose its completion. Here is an example: "thank" is enough to predict "thank you," and "thank you v" is enough to predict "thank you very much." More than 15 years ago, J. Darragh and I. Witten of the University of Calgary developed an input-stroke prediction system along these lines and named it the "Reactive Keyboard." It was introduced to the market with little fanfare at the time. Today, however, this technology is widely used for one-finger input on Japanese mobile phones. The old technology has found new life in a new environment. The system merely uses the frequency of string occurrences. It has even been postulated that infants may follow a similar method (E. Batchelder, "Computational Evidence for the Use of Frequency Information in Discovery of the Infant's First Lexicon," Ph.D. thesis, Graduate Faculty in Linguistics, The City University of New York, 1997).
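The Reactive Keyboard itself used adaptive context modelling, so the Python sketch below is only in its spirit: it stores how often every substring has been seen and proposes the longest extension of the typed prefix that has occurred often enough before. Function names, parameters, and the toy corpus are my own illustrative choices.

```python
from collections import Counter

def build_model(corpus: str, max_len: int = 20) -> Counter:
    """Count every substring of up to max_len characters seen in the corpus."""
    counts = Counter()
    for i in range(len(corpus)):
        for n in range(1, min(max_len, len(corpus) - i) + 1):
            counts[corpus[i:i + n]] += 1
    return counts

def predict(counts: Counter, prefix: str, min_count: int = 2) -> str:
    """Propose the longest stored extension of the prefix
    that has been seen at least min_count times."""
    candidates = [s for s, c in counts.items()
                  if c >= min_count and len(s) > len(prefix) and s.startswith(prefix)]
    return max(candidates, key=len) if candidates else prefix

model = build_model("thank you very much. thank you very much. thank you. thanks.")
print(predict(model, "thank y"))  # -> "thank you very much."
```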

A recent study on humans and tamarins (a monkey) (Fitch and Hauser, Science, Vol. 303, Issue 5656, pp. 377-380, 16 January 2004) also shows that tamarins seem to learn the acoustic structure of adjacent phoneme strings. But more complex structures that pair strings far apart, like "if... then...," are too difficult for them to learn, although such structures are easily learned by human infants. Even a monkey, however, shows some success with the simpler adjacent patterns.

In information retrieval, efficient text search is a very important technology. For this purpose, character-string search, often called n-gram search, is widely used. The next step, search based on modification relations or on a thesaurus, is very close to market readiness. We will soon exceed the level of the monkey.
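As an illustration, here is a minimal sketch of an n-gram search (a character-bigram index; the document titles and function names are my own choices, in English for readability): each bigram maps to the documents containing it, the postings for the query's bigrams are intersected, and the surviving candidates are verified against the raw text.

```python
from collections import defaultdict

def build_index(docs: list[str], n: int = 2) -> dict[str, set[int]]:
    """Map each character n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(doc_id)
    return index

def search(index, docs: list[str], query: str, n: int = 2) -> list[int]:
    """Intersect the postings of the query's n-grams, then verify each candidate."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:  # query shorter than n: fall back to a linear scan
        return [d for d, text in enumerate(docs) if query in text]
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(d for d in candidates if query in docs[d])

docs = ["I Am a Cat", "Botchan", "Kokoro", "Grass on the Wayside"]
index = build_index(docs)
print(search(index, docs, "Cat"))  # -> [0]
```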

(Ej,2004.05)


