(No.26) Image Tagging and Its Hierarchy

We can identify almost everything visible by its name. If the object is a human face, we can tell whether the face is male, Asian, or about 40 years old, or whether the name "Mr. Takahashi," for example, might apply. We do this from a hierarchical perspective. Using this hierarchy, we can communicate with others without help from images.

Here, we have a picture (Figure 1(a)). The task of assigning keywords that represent the image content is often called image tagging.

Fig.1(a): Tagging (a photo of the Great Wall and its tags)

Because the chosen keywords vary from person to person, it is impossible, in principle, to assign a single set of keywords that suits everyone. In this example, words like "scenery," "Great Wall," "mountain," "sky," and "people" could serve as common keys. A photographer may want to add several more, such as "Mr. Inoue" for the person in the image or "City of Beijing" for the location. Assigning such tags in response to user requests is the goal of image tagging technology. A rough procedure follows.

(1) Image Analysis
This module analyzes the given image and recognizes objects in it. For example, in Figure 1(a), scenery, Great Wall, mountain, sky, people, and green might be identified. The most popular approach is to divide the picture into tens of mesh cells, then compare the image patch in each cell with training pictures that carry keyword tags. Processing speed and recognition performance depend greatly on the image features used.
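The mesh-based comparison can be sketched as follows. This is only a minimal illustration under loud assumptions: the feature is the average color of each mesh cell, the comparison is a nearest-neighbor lookup, and the names `TRAINING`, `mean_color`, and `tag_image` as well as the toy image are hypothetical. A real system would use richer features and many training pictures per keyword.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B)
Image = List[List[Pixel]]             # rows of pixels
Feature = Tuple[float, float, float]  # mean color of one mesh cell

# Hypothetical training data: one reference feature per keyword tag.
TRAINING: Dict[str, Feature] = {
    "sky": (120.0, 170.0, 230.0),
    "mountain": (60.0, 110.0, 60.0),
    "Great Wall": (150.0, 140.0, 120.0),
}


def mean_color(image: Image, top: int, left: int, height: int, width: int) -> Feature:
    """Average RGB value of one mesh cell (a deliberately crude feature)."""
    sums = [0, 0, 0]
    for row in image[top:top + height]:
        for pixel in row[left:left + width]:
            for channel in range(3):
                sums[channel] += pixel[channel]
    count = height * width
    return (sums[0] / count, sums[1] / count, sums[2] / count)


def squared_distance(a: Feature, b: Feature) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tag_image(image: Image, rows: int, cols: int, training: Dict[str, Feature]) -> Set[str]:
    """Split the image into rows x cols mesh cells and tag each cell with
    the keyword of the nearest training feature."""
    cell_h = len(image) // rows
    cell_w = len(image[0]) // cols
    tags: Set[str] = set()
    for r in range(rows):
        for c in range(cols):
            feature = mean_color(image, r * cell_h, c * cell_w, cell_h, cell_w)
            nearest = min(training, key=lambda name: squared_distance(feature, training[name]))
            tags.add(nearest)
    return tags


# Toy 4x4 image: sky on top, green slope at lower left, wall at lower right.
SKY, GREEN, WALL = (118, 172, 228), (62, 108, 58), (148, 142, 118)
toy_image: Image = [
    [SKY] * 4,
    [SKY] * 4,
    [GREEN] * 2 + [WALL] * 2,
    [GREEN] * 2 + [WALL] * 2,
]

print(sorted(tag_image(toy_image, rows=2, cols=2, training=TRAINING)))
```

Swapping `mean_color` for a different feature changes both the cost per cell and the quality of the match, which is where the choice of image features affects speed and performance.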

(2) Related Keyword Tagging
Based on first-level keywords like scenery, Great Wall, mountain, sky, people, and green, we come up with related words: Beijing, sightseeing spot, World Cultural Heritage. A dictionary of this type, which connects words to higher-level concepts, is called a thesaurus. Sometimes, information attached to the file includes a date or GPS tag. The name of a person in the photo cannot be obtained using current technologies.
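A thesaurus lookup of this kind can be sketched as a simple dictionary expansion. The entries in `THESAURUS` and the function name `expand_tags` are hypothetical and chosen only to mirror the example; a real thesaurus would be far larger.

```python
from typing import Dict, List

# Hypothetical thesaurus: first-level keyword -> related, higher-level words.
THESAURUS: Dict[str, List[str]] = {
    "Great Wall": ["Beijing", "sightseeing spot", "World Cultural Heritage"],
    "mountain": ["scenery"],
    "sky": ["scenery"],
}


def expand_tags(first_level: List[str], thesaurus: Dict[str, List[str]]) -> List[str]:
    """Return related words that are not already first-level tags,
    in the order they are found and without duplicates."""
    related: List[str] = []
    for tag in first_level:
        for word in thesaurus.get(tag, []):
            if word not in first_level and word not in related:
                related.append(word)
    return related


first_level_tags = ["scenery", "Great Wall", "mountain", "sky", "people", "green"]
print(expand_tags(first_level_tags, THESAURUS))
```

Metadata such as a date or GPS tag could be appended to the result in the same way, since it arrives with the file rather than from image analysis.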

(3) Hierarchy of Tagged Keywords
The hierarchical structure of keywords is critical for conceptual communication. This structure varies with each individual. Figure 1(b) shows the overall procedure.

Fig.1(b): Tagging (a photo of the Great Wall and a sample of hierarchical tags)

As shown in Figures 1(a) and 1(b), the extracted tags should form a multi-layered structure, not just a flat list of attributes like pansy or yellow. If only the lowest-level representations (attributes) are available, many more tags are required. If five million tags are provided for one million objects, the utility value will not be high. For high usability, the concept of hierarchy is indispensable. Without a hierarchy, efficient recognition or communication is impossible (Column No.5, "Hierarchical Representation of Knowledge and Culture").
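The benefit of a layered structure can be sketched with a small parent table. The hierarchy in `PARENT`, the photo names, and the function names are hypothetical; the point is only that a query at a higher level ("plant") retrieves photos that carry nothing but low-level tags.

```python
from typing import Dict, List

# Hypothetical tag hierarchy: child -> parent concept.
PARENT: Dict[str, str] = {
    "pansy": "flower",
    "cherry blossom": "flower",
    "pine": "tree",
    "flower": "plant",
    "tree": "plant",
}

# Hypothetical photo collection, tagged only with low-level attributes.
PHOTOS: Dict[str, List[str]] = {
    "photo_a": ["pansy", "yellow"],
    "photo_b": ["pine"],
    "photo_c": ["sky"],
}


def ancestors(tag: str, parent: Dict[str, str]) -> List[str]:
    """Walk up the hierarchy from a tag to its top-level concept."""
    chain: List[str] = []
    while tag in parent:
        tag = parent[tag]
        chain.append(tag)
    return chain


def search(query: str, photos: Dict[str, List[str]], parent: Dict[str, str]) -> List[str]:
    """Return photos having a tag equal to the query or lying below it."""
    hits: List[str] = []
    for name, tags in photos.items():
        if any(query == tag or query in ancestors(tag, parent) for tag in tags):
            hits.append(name)
    return hits


print(ancestors("pansy", PARENT))
print(search("plant", PHOTOS, PARENT))
```

With a flat list, the same query would succeed only if every photo had been tagged explicitly with "plant," which is how the number of tags grows without adding much value.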

What is the underlying concept of hierarchical representation? Is it shared only by the people of a specific culture, or is it common to all people? What about humans and other animals? Is the concept of hierarchy shared only within the same species?

Here is an interesting experimental report regarding chimpanzees. First, four categories of pictures are prepared: flower, tree, grass, and ground, each of which has four subclasses, as shown in Figure 2. Then, one of the pictures is presented to a chimpanzee, which must select the most closely related category. The chimpanzee can identify the correct category: it chooses the flower category when a cherry blossom is presented. Beyond that, it can identify the correct category even if the flower is unfamiliar (Matsuzawa, Tomonaga, Tanaka; "Cognitive Development in Chimpanzees", Springer, 2006). If chimpanzees can use the same categories as humans, we may be able to communicate with them.

Fig.2: An example of an image categorization experiment (four images each for four categories)

Now, consider the case of the honeybee. How does it find a flower hidden in the woods? The honeybee is also known to identify flowers that have an abundance of nectar. At first glance, the honeybee seems to use a top-down search strategy based on hierarchical knowledge.

However, it is too much of a stretch to assume that a honeybee possesses hierarchical knowledge. The same behavior is possible using only simple parameters. For example, if the honeybee responds to a preferable temperature, a preferable optical spectrum, and a preferable object size, as shown in Figure 3, even a random search strategy can easily lead to the target. Because the honeybee is able to measure flying height and optical resolution, it can estimate the object size. If the honeybee can also control the search order, that is a big step forward. With this capability, the honeybee can progress from a global search to a local search by flying higher at the beginning and lower later, as shown in Figure 4.

Fig.3: Multidimensional representation with parameters (multiple parameters for the honeybee)

Fig.4: Hierarchical structure of a scene (space hierarchy targeting a flower). Search spaces are hierarchical, but the search strategy is a simple cascade.
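The move from a global search to a local search can be sketched as a coarse-to-fine search over a grid. This is an illustrative model only, not a claim about how a honeybee actually navigates: the square `field` of attractiveness values, the flower position, and the function names are all hypothetical. "Flying higher" is modeled as looking at large blocks whose values are averaged, and "flying lower" as looking at smaller blocks.

```python
from typing import List, Tuple

Field = List[List[float]]


def block_mean(field: Field, top: int, left: int, size: int) -> float:
    """Average attractiveness of a square block, as seen from 'high up'."""
    total = sum(field[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size))
    return total / (size * size)


def coarse_to_fine_search(field: Field) -> Tuple[Tuple[int, int], List[Tuple[int, int, int]]]:
    """Repeatedly split the current window into four quadrants and descend
    into the most attractive one. Assumes a square field whose side length
    is a power of two."""
    size = len(field)
    top, left = 0, 0
    path: List[Tuple[int, int, int]] = []
    while size > 1:
        half = size // 2
        quadrants = [(top, left), (top, left + half),
                     (top + half, left), (top + half, left + half)]
        top, left = max(quadrants, key=lambda q: block_mean(field, q[0], q[1], half))
        size = half
        path.append((top, left, size))
    return (top, left), path


# Hypothetical 8x8 scene with one flower at row 5, column 2.
scene: Field = [[0.0] * 8 for _ in range(8)]
scene[5][2] = 1.0

position, path = coarse_to_fine_search(scene)
print(position, path)
```

The search visits four blocks at each of three levels instead of inspecting all 64 cells, yet each step is only a simple comparison; no hierarchical knowledge is stored anywhere.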

Figure 5 shows another case, using color information, in which the search order in color space is fixed. The honeybee begins with a wide-area search using a color that is visible globally. It then chooses a specific color that has a higher probability of being the target.

Fig.5: Search hierarchy targeting a flower (search hierarchy by color)

The above approach is almost identical to the use of "weak learners" (Column No.20, "Weak Classifier and Strong Classifier"). Each classifier's ability is weak, but by arranging the search strategy, we can narrow the target step by step from a wide space to a highly probable space. It is also well known that the honeybee has a thermal sensor ("Honeybee nest", in Japanese).
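The step-by-step narrowing can be sketched as a cascade of simple filters. Note that this is a plain filtering cascade, not a boosting algorithm such as AdaBoost, and that the candidate objects, thresholds, and names below are hypothetical values chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, object]
WeakTest = Callable[[Candidate], bool]

# Hypothetical objects a honeybee might encounter.
CANDIDATES: List[Candidate] = [
    {"name": "leaf", "temperature_c": 22, "color": "green", "size_cm": 6},
    {"name": "stone", "temperature_c": 35, "color": "gray", "size_cm": 5},
    {"name": "road sign", "temperature_c": 24, "color": "yellow", "size_cm": 60},
    {"name": "pansy", "temperature_c": 23, "color": "yellow", "size_cm": 5},
]

# Each test alone is a weak classifier: it rejects only some non-targets.
WEAK_TESTS: List[WeakTest] = [
    lambda c: 18 <= c["temperature_c"] <= 28,   # preferable temperature
    lambda c: c["color"] in {"yellow", "purple"},  # preferable spectrum
    lambda c: 1 <= c["size_cm"] <= 10,          # preferable object size
]


def cascade(candidates: List[Candidate], tests: List[WeakTest]) -> Tuple[List[Candidate], List[int]]:
    """Apply the weak tests in a fixed order and record how many
    candidates survive after each stage."""
    remaining = list(candidates)
    survivors = [len(remaining)]
    for test in tests:
        remaining = [c for c in remaining if test(c)]
        survivors.append(len(remaining))
    return remaining, survivors


targets, survivors = cascade(CANDIDATES, WEAK_TESTS)
print([c["name"] for c in targets], survivors)
```

No single test identifies the flower, but applying them in sequence shrinks the candidate set at every stage.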

Here is another simple example. Figure 6 is the logical map of a human face. A face is part of the human body. If a face is detected, then the object must be human and the following production rules apply.

Face is identified -> the object is human (sufficient condition).
An object is human -> the object stays in a human-made environment (almost sufficient condition).

Fig.6: A logical map of the human face (relationship among human face, human, and artificial environment)

After the human-made (artificial) environment is identified, the following inferences may hold:

Artificial environment -> temperature and humidity are stable.
Artificial environment -> there are fewer kinds and a smaller number of plants.
Artificial environment -> many bright areas exist even at night.

We could say that these statements are weak classifiers rather than strict inferences. Therefore, just as before, simple but repeated logical deductions lead to a higher-level search (Column No.21, "Weak Classifiers and Strong Logic").
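The chain of production rules can be sketched as forward chaining with a confidence attached to each rule. The confidence values (1.0 for the sufficient condition, 0.9 and lower for the weaker ones) are hypothetical numbers chosen for illustration, as are the names `RULES` and `forward_chain`.

```python
from typing import Dict, List, Tuple

# (premise, conclusion, strength); strengths are assumed, not measured.
RULES: List[Tuple[str, str, float]] = [
    ("face", "human", 1.0),                                   # sufficient
    ("human", "artificial environment", 0.9),                 # almost sufficient
    ("artificial environment", "stable temperature and humidity", 0.8),
    ("artificial environment", "fewer plants", 0.8),
    ("artificial environment", "bright areas at night", 0.7),
]


def forward_chain(facts: Dict[str, float], rules: List[Tuple[str, str, float]]) -> Dict[str, float]:
    """Apply the rules repeatedly until no conclusion gains confidence.
    A conclusion's confidence is the premise's confidence times the
    rule strength; the best value found so far is kept."""
    confidence = dict(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion, strength in rules:
            if premise in confidence:
                candidate = confidence[premise] * strength
                if candidate > confidence.get(conclusion, 0.0):
                    confidence[conclusion] = candidate
                    changed = True
    return confidence


result = forward_chain({"face": 1.0}, RULES)
for fact, value in result.items():
    print(f"{fact}: {value:.2f}")
```

Because each step multiplies by a strength no greater than one, conclusions that lie further from the observed face are held with less confidence, which matches the view of these rules as weak classifiers.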

The above example encourages us to try communicating with chimpanzees. Normally, images are more robust than language. We can show a flower to a chimpanzee and ask it to fetch the flower. In return, we can reward success by letting the chimpanzee choose a piece of fruit or cake. In this way, we may be able to play with chimps fairly well. Incidentally, one study reports that a chimpanzee outscored people on a short-term memory test. Unknown animal talents are waiting to be discovered ("Chimp Champ; Ape aces memory test, outscores people," Science News Online, Vol. 172, No. 23, p. 355).

How can we apply this technology? Digital photos are everywhere. A single user of a digital camera may store hundreds of thousands of photos, but they are stored in digital memory without tags. As a result, it is hard to retrieve the images later. If a computer program could propose possible tags when an image is stored, it would save time and bother. Preliminary research is already underway, for example "Mining Digital Imagery Data for Automatic Linguistic Indexing of Pictures" (2002).

(Ej, 2007.12)