Large-Scale Visual Semantic Extraction

SAMY BENGIO1
Google Research



ABSTRACT

Image annotation is the task of providing textual semantics for new images by ranking a large set of possible annotations according to how well they correspond to a given image. In the large-scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task, we propose to build a so-called embedding space, into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts.

INTRODUCTION

The emergence of the Web as a tool for sharing information has caused a massive increase in the size of potential data sets available for machines to learn from. Millions of images on web pages have tens of thousands of possible annotations in the form of HTML tags that can be conveniently collected by querying search engines (Torralba et al., 2008), tags such as those on www.flickr.com, or human-curated labels such as those on www.image-net.org (Deng et al., 2009). We therefore need machine learning algorithms for image annotation that can scale to learn from and annotate such data. This includes (i) scalable training and testing times and (ii) scalable memory usage. In the ideal case we would like a fast algorithm that fits on a laptop, at least at annotation time. For many recently proposed models tested on small data sets, it is unclear whether they satisfy these constraints.

1 This work summarizes the following papers: Weston et al. (2010) with Samy Bengio and Nicolas Usunier, and Bengio et al. (2010) with Jason Weston and David Grangier.

In the first part of this work, we study feasible methods for just such a goal. We consider models that learn to represent images and annotations jointly in a low-dimensional embedding space. Such embeddings are fast at testing time because the low dimension implies fast computations for ranking annotations. Simultaneously, the low dimension also implies small memory usage. To obtain good performance with such a model, we propose to train its parameters by learning to rank, optimizing for the top annotations in the list, for example, optimizing precision at k (p@k).

In the second part of this work, we propose a novel algorithm to improve testing time in multiclass classification tasks where the number of classes (or labels) is very large and where even an algorithm linear in the number of classes can become computationally infeasible. We propose an algorithm for learning a tree structure over the labels in the previously proposed joint embedding space, which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods.

JOINT EMBEDDING OF IMAGES AND LABELS

We propose to learn a mapping into a feature space where images and annotations are both represented, as illustrated in Figure 1. The mapping functions are therefore different but are learned jointly to optimize the supervised loss of interest for our final task, that of annotating images.

We start with a representation of images x ∈ ℜ^d and a representation of annotations i ∈ ϒ = {1, …, Y}, indices into a dictionary of possible annotations. We then learn a mapping from the image feature space to the joint space ℜ^D,

    Φ_I(x): ℜ^d → ℜ^D,

while jointly learning a mapping for annotations,

    Φ_W(i): {1, …, Y} → ℜ^D.

These are chosen to be linear maps, i.e., Φ_I(x) = Vx and Φ_W(i) = W_i, where W_i indexes the ith column of a D × Y matrix, but potentially any mapping could be used. In our work, we use sparse high-dimensional feature vectors of bags of visual terms for the image vectors x, and each annotation has its own learned representation (even if, for example, multiword annotations share words). Our goal is, for a given image, to rank the possible annotations such that the highest-ranked annotations best describe the semantic content of the image.
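As a concrete illustration, the following is a minimal numpy sketch of these two linear maps. The matrix names V and W follow the notation above; the dimensions, the dense (rather than sparse) image features, and the random initialization are placeholder choices for illustration, not the settings used in the actual experiments.

    import numpy as np

    # Small, illustrative sizes; the real system has ~100,000 labels.
    d, D, Y = 10_000, 100, 1_000   # image feature dim, embedding dim, number of labels

    rng = np.random.default_rng(0)
    V = rng.normal(scale=0.01, size=(D, d))   # image projection: Phi_I(x) = V x
    W = rng.normal(scale=0.01, size=(D, Y))   # label embeddings: Phi_W(i) = W[:, i]

    def embed_image(x):
        """Project an image feature vector into the joint embedding space."""
        return V @ x

    def embed_label(i):
        """Look up the embedding of label index i (the ith column of W)."""
        return W[:, i]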

FIGURE 1  Joint embedding space. [Figure not reproduced.]

We consider the following model:

    f_i(x) = Φ_W(i)^T Φ_I(x) = W_i^T V x,

where the possible annotations i are ranked according to the magnitude of f_i(x), largest first.
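Continuing the sketch above, ranking annotations for a new image is then a single matrix-vector product followed by a sort; again, this is only an illustrative implementation reusing the placeholder V and W defined earlier.

    def rank_annotations(x, top_k=10):
        """Score every label for image features x and return the top_k label indices."""
        scores = W.T @ (V @ x)          # f_i(x) = W_i^T V x, computed for all i at once
        return np.argsort(-scores)[:top_k]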

Image Labeling as a Learning-To-Rank Task

Labeling an image can be viewed as a ranking task where, given an image, one needs to order labels such that the top ones correspond to the image while the bottom ones are unrelated to it. Various learning-to-rank methods have been proposed in the machine learning literature over the years, some of which can scale to large data sets while others cannot. The simplest scalable approach is the following. One can decompose the ranking task into a large sum of smaller tasks:

    loss = ∑_x ∑_{y+} ∑_{y−} | 1 − f_{y+}(x) + f_{y−}(x) |_+ ,

where, for each training image x, we want the score of each good label y+ to be higher than the score of any bad label y− by a margin of at least 1; otherwise we pay the corresponding price. This loss can be trained very efficiently on very large data sets using stochastic gradient descent. However, a better alternative is a loss that concentrates on the top of the ranking, instead of considering every triplet (x, y+, y−) uniformly. In a previous study (Weston et al., 2010), we proposed the weighted approximate-rank pairwise (WARP) loss, which weighs each triplet according to the estimated rank of the good label and still yields an efficient implementation. The resulting model is much faster to train and obtains much better performance at the top of the ranking.
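The sketch below shows one way a WARP-style stochastic update could look, reusing V, W, Y, and rng from the earlier sketches: for a sampled image and one of its correct labels, negatives are drawn until a margin violation is found, and the update is weighted by a rank estimated from the number of draws. This is a simplified reconstruction for illustration, not the exact procedure of Weston et al. (2010); the learning rate and the harmonic rank-weighting are placeholder choices.

    def warp_sgd_step(x, pos_label, num_labels=Y, lr=0.01, max_tries=100):
        """One illustrative WARP update on the pair (image x, correct label pos_label)."""
        phi_x = V @ x
        pos_score = W[:, pos_label] @ phi_x

        # Sample negative labels until one violates the margin of 1.
        for tries in range(1, max_tries + 1):
            neg_label = int(rng.integers(num_labels))
            if neg_label == pos_label:
                continue
            neg_score = W[:, neg_label] @ phi_x
            if 1.0 - pos_score + neg_score > 0.0:
                # Estimated rank of the positive label, and a weight that
                # grows with that rank (harmonic weighting as an example).
                est_rank = max(1, (num_labels - 1) // tries)
                weight = sum(1.0 / r for r in range(1, est_rank + 1))

                # Gradient steps on the violated triplet (hinge is active).
                step = lr * weight
                diff = W[:, pos_label] - W[:, neg_label]
                W[:, pos_label] += step * phi_x
                W[:, neg_label] -= step * phi_x
                V += step * np.outer(diff, x)
                return

If no violating negative is found within the sampling budget, the estimated rank of the positive label is low and the step is simply skipped.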

Large-Scale Learning

We trained an embedding model with the WARP loss on a very large data set containing more than 10 million training images and more than 100,000 labels, where labels correspond to queries issued on Google Image Search and the images attributed to a label were those often clicked for that query. The result is a very noisy data set, in which queries appear in several languages, with many spelling mistakes and many apparently similar queries.

An interesting side effect of training such a model is that it provides a natural way of organizing labels among themselves, by looking at the nearest labels of a given label in the embedding space. Table 1 shows examples of the nearest labels of some labels, where we see several misspellings, translations, and semantically similar labels. Finally, Table 2 shows examples of images from our test set (there were more than 3 million of them, distinct from the 10 million training images), together with their nearest 10 labels in the embedding space.

TABLE 1  Nearest labels in the embedding space learned on the Web data.

Target Label     Nearest Labels
barack obama     barak obama, obama, barack, barrack obama, bow wow, george bush
david beckham    beckham, david beckam, alessandro del piero, del piero, david becham
santa            santa claus, papa noel, pere noel, santa clause, joyeux noel, tomte
dolphin          delphin, dauphin, whale, delfin, delfini, baleine, blue whale, walvis
cows             cattle, shire, dairy cows, kuh, horse, cow, shire horse, kone, holstein
rose             rosen, hibiscus, rose flower, rosa, roze, pink rose, red rose, a rose
pine tree        abies alba, abies, araucaria, pine, neem tree, oak tree, pinus sylvestris
mount fuji       mt fuji, fujisan, fujiyama, mountain, zugspitze, fuji mountain
eiffel tower     eiffel, tour eiffel, la tour eiffel, big ben, paris, blue mosque, eifel tower
ipod             i pod, ipod nano, apple ipod, ipod apple, new ipod, ipod shuffle
f18              f 18, eurofighter, f14, fighter jet, tomcat, mig21, f 16

TABLE 2  Examples of the top 10 labels obtained for some test images (the test images themselves are not reproduced here).

Test image 1:  delfini, orca, dolphin, mar, delfin, dauphin, whale, cancun, killer whale, sea world
Test image 2:  barrack obama, barack obama, barack hussein obama, barack obama, james marsden, jay z, obama, nelly, falco, barack
Test image 3:  eiffel tower, statue, eiffel, mole antoneliana, la tour eiffel, londra, cctv tower, big ben, calatrava, tokyo tower
Test image 4:  ipod, ipod nano, nokia, i pod, nintendo ds, nintendo, lg, pc, nokia 7610, vino

Source: Adapted from Weston et al., 2010.
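A minimal sketch of the nearest-label lookup behind results like those in Table 1, reusing the label matrix W from the earlier sketches; the use of cosine similarity is an illustrative choice, and any similarity in the embedding space would serve the same purpose.

    def nearest_labels(label_idx, top_k=10):
        """Return the indices of the labels closest to label_idx in the embedding space."""
        q = W[:, label_idx]
        # Cosine similarity between the query label and every label embedding.
        sims = (W.T @ q) / (np.linalg.norm(W, axis=0) * np.linalg.norm(q) + 1e-12)
        order = np.argsort(-sims)
        return [int(i) for i in order if i != label_idx][:top_k]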

LEARNING LABEL TREES

Labeling images when the number of labels is large (in the section Joint Embedding of Images and Labels, we had on the order of 100,000 labels) can be prohibitive for real-time applications. We thus proposed (Bengio et al., 2010) a novel approach to learn a label tree, where each node makes a prediction of the subset of labels to be considered by its children, thus decreasing the number of labels at a logarithmic rate until a prediction is reached. Existing approaches (Beygelzimer et al., 2009a, 2009b; Hsu et al., 2009) typically lose accuracy compared to naive linear-time approaches. Instead, we apply the following two steps: (a) learning a label tree and (b) learning predictors for each of the nodes of the tree.

FIGURE 2  Example of a label tree. [Tree diagram not reproduced; it recursively splits a set of labels such as Obama, Dolphin, Whale, Fish, Sting, Phone, Car, and Toyota into progressively smaller subsets.]

Learning the Label Tree Structure

In order to learn a label tree such as the one in Figure 2, we proceed as follows. Given the set of labels in a node, we look for a partition of that set into subsets such that, inside a subset, labels are difficult to distinguish with classifiers trained on their corresponding images, whereas images belonging to labels of one subset are easy to distinguish from images belonging to labels of another subset. We do so by computing the confusion matrix between all labels, counting the number of times our classifiers confuse class i with class j, and using this matrix to apply spectral clustering (Ng et al., 2002). This procedure can then be applied recursively to obtain a complete label tree. Table 3 gives examples of labels that were clustered together by this technique.
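A sketch of this partitioning step, using scikit-learn's SpectralClustering on a symmetrized confusion matrix; the two-way split and the way the confusion matrix is turned into an affinity are illustrative choices, not necessarily those of Bengio et al. (2010).

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def split_labels(confusion, n_subsets=2, seed=0):
        """Partition labels so that mutually confusable labels end up in the same subset.

        confusion[i, j] counts how often the classifiers predicted label j
        for images whose true label is i.
        """
        # Symmetrize the counts so they can serve as an affinity between labels.
        affinity = confusion + confusion.T
        np.fill_diagonal(affinity, 0.0)
        clustering = SpectralClustering(
            n_clusters=n_subsets,
            affinity="precomputed",
            random_state=seed,
        )
        assignments = clustering.fit_predict(affinity)
        return [np.flatnonzero(assignments == k) for k in range(n_subsets)]

Applying split_labels recursively to each resulting subset yields a complete label tree.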

TABLE 3  Examples of obtained clusters of labels.

Cluster 1: great white sharks, imagenes de delfines, liopleurodon, meduse, mermaid tail, monstre du loch ness, monstruo del lago ness, oarfish, oceans, sea otter, shark attacks, sperm whale, tauchen, whales

Cluster 2: apple iphone 3gs, apple ipod, apple tablet, bumper, iphone 4, htc diamond, htc hd, htc magic, htc touch pro 2, iphone 2g, iphone 3, iphone 5g, iphone app, iphone apple, iphone apps, iphone nano

Cluster 3: chevy colorado, custom trucks, dodge ram, f 250, ford excursion, ford f 150, mini truck, nissan frontier, offroad, pickup, toyota tundra

Learning a Label Embedding Tree

Once labels are organized into a tree, one can jointly retrain an embedding model (using the algorithm described in the section Joint Embedding of Images and Labels) where each image can now be labeled either with its original labels or with any of the nodes of the tree that contain them. Moreover, whenever an internal node is selected as a positive label for a given image during training, we select the competing negative label from among its sibling nodes in the label tree, as this corresponds to how the tree is used at test time. The result provides a structure over labels based on both semantic and visual similarities. Furthermore, a label embedding tree is not only faster at test time, it is also more accurate on average (Bengio et al., 2010).
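To make the test-time benefit concrete, here is a sketch of how an image could be labeled by descending such a tree: at each internal node only the children are scored, so the number of dot products grows with the depth of the tree rather than with the total number of labels. The simple Node class and the per-node embedding lookup are illustrative, not the data structures of Bengio et al. (2010).

    class Node:
        """A label-tree node: leaves carry a final label, internal nodes carry children."""
        def __init__(self, embedding_idx, children=None, label=None):
            self.embedding_idx = embedding_idx  # column of W holding this node's embedding
            self.children = children or []
            self.label = label                  # set only for leaves

    def label_image(x, root):
        """Descend the label tree, keeping the best-scoring child at each level."""
        phi_x = V @ x
        node = root
        while node.children:
            node = max(node.children, key=lambda c: W[:, c.embedding_idx] @ phi_x)
        return node.label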

REFERENCES

Bengio, S., J. Weston, and D. Grangier. 2010. Label embedding trees for large multi-class tasks. Advances in Neural Information Processing Systems 23: Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 6–9, 2010.

Beygelzimer, A., J. Langford, and P. Ravikumar. 2009a. Error-correcting tournaments. Proceedings of the International Conference on Algorithmic Learning Theory, Porto, Portugal, October 3–5, 2009, pp. 247–262.

Beygelzimer, A., J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl. 2009b. Conditional probability tree estimation analysis and algorithm. Proceedings of the Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, June 18–21, 2009.

Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, Fla., June 20–25, 2009.

Hsu, D., S. Kakade, J. Langford, and T. Zhang. 2009. Multi-label prediction via compressed sensing. Advances in Neural Information Processing Systems 22: Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 7–10, 2009.

Ng, A. Y., M. I. Jordan, and Y. Weiss. 2002. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 15: Proceedings of the 16th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 12–15, 2002.

Torralba, A., R. Fergus, and W. T. Freeman. 2008. 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11):1958–1970.

Weston, J., S. Bengio, and N. Usunier. 2010. Large scale image annotation: Learning to rank with joint word-image embeddings. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Barcelona, Spain, September 20–24, 2010.