Large-Scale Visual Semantic Extraction
SAMY BENGIO1
Google Research
ABSTRACT
Image annotation is the task of attaching textual semantics to new images by ranking a large set of possible annotations according to how well they correspond to a given image. In the large-scale setting, there can be millions of images to process and hundreds of thousands of potential distinct annotations. To achieve such a task, we propose to build a so-called embedding space, into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations that reflects how similar concepts (annotations) are to each other with respect to their visual characteristics. Such a tree differs from purely semantic trees, such as WordNet, which do not take into account the visual appearance of concepts.
INTRODUCTION
The emergence of the Web as a tool for sharing information has caused a massive increase in the size of potential data sets available for machines to learn from. Millions of images on web pages have tens of thousands of possible annotations in the form of HTML tags that can be conveniently collected by querying search engines (Torralba et al., 2008), tags such as in www.flickr.com, or human-curated
1 This work summarizes the following papers: Weston et al. (2010) with Samy Bengio and Nicolas Usunier, and Bengio et al. (2010) with Jason Weston and David Grangier.
labels such as in www.image-net.org (Deng et al., 2009). We therefore need machine learning algorithms for image annotation that can scale to learn from and annotate such data. This includes (i) scalable training and testing times and (ii) scalable memory usage. In the ideal case we would like a fast algorithm that fits on a laptop, at least at annotation time. For many recently proposed models tested on small data sets, it is unclear if they satisfy these constraints.
In the first part of this work, we study feasible methods for just such a goal. We consider models that learn to represent images and annotations jointly in a low-dimensional embedding space. Such embeddings are fast at test time, because ranking annotations in a low-dimensional space requires few computations, and they also keep memory usage small. To obtain good performance with such a model, we propose to train its parameters by learning to rank, optimizing for the top annotations in the list, for example, optimizing precision at k (p@k).
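Precision at k simply measures what fraction of the k highest-ranked annotations are correct. A minimal sketch (the labels below are made up for illustration):

```python
def precision_at_k(ranked_labels, relevant, k):
    """Fraction of the top-k ranked labels that are relevant to the image."""
    return sum(1 for label in ranked_labels[:k] if label in relevant) / k

# Hypothetical toy example: 2 of the top 5 predicted labels are correct.
ranked = ["dolphin", "whale", "shark", "boat", "sunset"]
print(precision_at_k(ranked, {"dolphin", "sunset"}, 5))  # 0.4
```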
In the second part of this work, we propose a novel algorithm to improve testing time in multiclass classification tasks where the number of classes (or labels) is very large and where even an algorithm linear in the number of classes can become computationally infeasible. We propose an algorithm for learning a tree structure over the labels in the previously proposed joint embedding space, which, by optimizing the overall tree loss, provides superior accuracy to existing tree-based labeling methods.
JOINT EMBEDDING OF IMAGES AND LABELS
We propose to learn a mapping into a feature space where images and annotations are both represented, as illustrated in Figure 1. The two mapping functions are therefore different but are learned jointly to optimize the supervised loss of interest for our final task, that of annotating images. We start with a representation of images x ∈ ℝ^d and a representation of annotations i ∈ {1, ..., Y}, indices into a dictionary of Y possible annotations. We then learn a mapping from the image feature space to the joint space ℝ^D,

Φ_I(x): ℝ^d → ℝ^D,

while jointly learning a mapping for annotations,

Φ_W(i): {1, ..., Y} → ℝ^D.

These are chosen to be linear maps, i.e., Φ_I(x) = Vx and Φ_W(i) = W_i, where W_i indexes the ith column of a D×Y matrix W, but potentially any mapping could be used. In our work, we use sparse high-dimensional feature vectors of bags of visual terms for the image vectors x, and each annotation has its own learned representation (even if, for example, multiword annotations share words). Our goal is, for a given image, to rank the possible annotations such that the highest ranked
FIGURE 1 Joint embedding space.
annotations best describe the semantic content of the image. We consider the following model:

f_i(x) = Φ_W(i)ᵀ Φ_I(x) = W_iᵀ V x,

where the possible annotations i are ranked according to the magnitude of f_i(x), largest first.
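Scoring all labels under this model amounts to two matrix-vector products. The following sketch illustrates this with made-up toy dimensions and random parameters (the sizes d, D, Y below are assumptions for illustration only):

```python
import numpy as np

# Toy sizes (assumptions): d image features, D embedding dims, Y labels.
d, D, Y = 1000, 100, 5000
rng = np.random.default_rng(0)
V = 0.01 * rng.standard_normal((D, d))   # image map: Phi_I(x) = V x
W = 0.01 * rng.standard_normal((D, Y))   # label map: Phi_W(i) = W_i (ith column)

def rank_labels(x, V, W, k=10):
    """Score every label as f_i(x) = W_i^T V x and return the top-k
    label indices, largest score first."""
    scores = W.T @ (V @ x)            # all Y scores at once, shape (Y,)
    return np.argsort(-scores)[:k]

x = rng.standard_normal(d)            # stand-in for a bag-of-visual-terms vector
top = rank_labels(x, V, W)
print(len(top))  # 10
```

Note that the image is embedded once (V @ x) and the per-label cost is only a D-dimensional dot product.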
Image Labeling as a Learning-To-Rank Task
Labeling an image can be viewed as a ranking task where, given an image, one needs to order labels such that the top ones correspond to the image while the bottom ones are unrelated to it. Various learning-to-rank methods have been proposed
in the machine learning literature over the years, some of which can scale to large data sets while others cannot. The simplest scalable approach is the following. One can decompose the ranking task into a large sum of smaller tasks:

Σ_x Σ_{y+} Σ_{y−} max(0, 1 − f_{y+}(x) + f_{y−}(x)),

where for each training image x, we want the score of each good label y+ to exceed the score of any bad label y− by a margin of at least 1; otherwise we pay the corresponding price. This loss can be trained very efficiently on very large data sets using stochastic gradient descent. However, a better alternative is a loss that concentrates on the top of the ranking, instead of considering every triplet (x, y+, y−) uniformly. In a previous study (Weston et al., 2010), we proposed the weighted approximate-rank pairwise (WARP) loss, which weighs each triplet according to the estimated rank of the good label and still admits an efficient implementation. The resulting model is much faster to train and performs much better at the top of the ranking.
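The key trick behind WARP is to sample negative labels until one violates the margin; the number of samples needed gives a cheap estimate of the positive label's rank, which in turn scales the update. The sketch below illustrates this sampling idea only; it is a simplified illustration, not the exact implementation of Weston et al. (2010), and all sizes and learning rates are made up:

```python
import numpy as np

def warp_step(x, y_pos, V, W, lr=0.1, rng=None):
    """One stochastic WARP-style update (sketch): sample negative labels
    until one violates the margin, estimate the rank of the positive label
    from the number of samples drawn, and scale the gradient step by a
    rank-dependent weight L(k) = sum_{r<=k} 1/r."""
    rng = rng or np.random.default_rng(0)
    Y = W.shape[1]
    emb_x = V @ x                      # Phi_I(x)
    s_pos = W[:, y_pos] @ emb_x        # f_{y+}(x)
    for trials in range(1, Y):
        y_neg = int(rng.integers(Y))
        if y_neg == y_pos:
            continue
        s_neg = W[:, y_neg] @ emb_x
        if 1 - s_pos + s_neg > 0:      # margin violated by this negative
            rank_est = max(1, (Y - 1) // trials)
            weight = sum(1.0 / r for r in range(1, rank_est + 1))
            step = lr * weight
            w_pos, w_neg = W[:, y_pos].copy(), W[:, y_neg].copy()
            W[:, y_pos] += step * emb_x          # pull positive toward image
            W[:, y_neg] -= step * emb_x          # push the violator away
            V += step * np.outer(w_pos - w_neg, x)
            return True
    return False                        # no violating negative found

# Toy usage with random parameters (hypothetical sizes).
rng = np.random.default_rng(1)
V = 0.01 * rng.standard_normal((4, 6))
W = 0.01 * rng.standard_normal((4, 5))
x = rng.standard_normal(6)
warp_step(x, 2, V, W)
```

Because a violating negative is found quickly for badly ranked positives (and slowly for well-ranked ones), most updates cost far less than scanning all Y labels.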
Large-Scale Learning
We trained an embedding model with the WARP loss on a very large data set containing more than 10 million training images and more than 100,000 labels, where labels correspond to queries issued on Google Image Search and the images associated with a label were those often clicked for that query. This yields a very noisy data set, in which queries appear in several languages, with many spelling mistakes and many apparently similar queries.
An interesting side effect of training such a model is that it provides a natural way of organizing labels among themselves, by looking at the nearest labels of a given label in the embedding space. Table 1 shows some examples of the nearest labels of some labels, where we see several misspellings, translations, and semantically similar labels.
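Finding the labels nearest to a given label reduces to a nearest-neighbor search among the columns of the label matrix W. A minimal sketch using cosine similarity (the tiny embedding below is hand-crafted for illustration):

```python
import numpy as np

def nearest_labels(W, i, k=5):
    """Return the indices of the k labels whose embedding columns are
    closest (by cosine similarity) to that of label i."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    sims = Wn.T @ Wn[:, i]             # cosine similarity to every label
    order = np.argsort(-sims)
    return [int(j) for j in order if j != i][:k]

# Toy embedding (D=2, Y=3): label 1 points almost the same way as label 0.
W = np.array([[1.0, 0.9, 0.0],
              [0.0, 0.1, 1.0]])
print(nearest_labels(W, 0, k=1))  # [1]
```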
Finally, we show in Table 2 examples of images from our test set (more than 3 million images, disjoint from the 10 million training images), together with their nearest 10 labels in the embedding space.
LEARNING LABEL TREES
Labeling images when the number of labels is large (in the section Joint Embedding of Images and Labels, we had on the order of 100,000 labels) can be prohibitive for real-time applications. We thus proposed (Bengio et al., 2010) a novel approach to learn a label tree, where each node predicts the subset of labels to be considered by its children, shrinking the candidate set at each level so that a prediction is reached after a logarithmic number of decisions. Existing approaches (Beygelzimer et al., 2009a, 2009b; Hsu et al., 2009) typically lose accuracy compared
TABLE 1 Nearest labels in the embedding space learned on the Web data.
Target Label | Nearest Labels
Barack obama | barak obama, obama, barack, barrack obama, bow wow, george bush |
david beckham | beckham, david beckam, alessandro del piero, del piero, david becham |
Santa | santa claus, papa noel, pere noel, santa clause, joyeux noel, tomte |
dolphin | delphin, dauphin, whale, delfin, delfini, baleine, blue whale, walvis |
Cows | cattle, shire, dairy cows, kuh, horse, cow, shire horse, kone, holstein |
Rose | rosen, hibiscus, rose flower, rosa, roze, pink rose, red rose, a rose |
pine tree | abies alba, abies, araucaria, pine, neem tree, oak tree, pinus sylvestris |
Mount fuji | mt fuji, fujisan, fujiyama, mountain, zugspitze, fuji mountain |
eiffel tower | eiffel, tour eiffel, la tour eiffel, big ben, paris, blue mosque, eifel tower |
Ipod | i pod, ipod nano, apple ipod, ipod apple, new ipod, ipod shuffle |
f18 | f 18, eurofighter, f14, fighter jet, tomcat, mig21, f 16 |
TABLE 2 Examples of the top 10 labels obtained for some test images.
Source: Adapted from Weston et al., 2010.
to naive linear-time approaches. Instead, we apply the following two steps: (a) learning a label tree and (b) learning predictors for each node of the tree.
Learning the Label Tree Structure
In order to learn a label tree such as the one in Figure 2, we proceed as follows. Given a set of labels in a node, we look for a partition of that set into subsets such that, inside a subset, labels are difficult to distinguish with classifiers trained
FIGURE 2 Example of a label tree.
on their corresponding images, whereas it is easy to distinguish images belonging to labels of one subset from images belonging to labels of another. We do so by computing the confusion matrix between all labels, counting the number of times our classifiers confuse class i with class j, and then use this matrix as an affinity for spectral clustering (Ng et al., 2002). This procedure can be applied recursively to obtain a complete label tree. Table 3 gives an example of labels that were clustered together by this technique.
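A minimal sketch of the node-splitting step: here a two-way partition via the Fiedler vector of the normalized graph Laplacian stands in for the k-way spectral clustering of Ng et al. (2002), and the confusion counts are made up. It assumes every label is confused with at least one other label (nonzero degree):

```python
import numpy as np

def split_labels(confusion):
    """Two-way spectral partition of a set of labels: labels that the
    classifiers often confuse get a high affinity and land on the same side."""
    A = (confusion + confusion.T).astype(float)   # symmetrized affinity
    np.fill_diagonal(A, 0.0)                      # ignore correct predictions
    d = A.sum(axis=1)                             # assumes all degrees > 0
    # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}
    D_is = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_is @ A @ D_is
    vals, vecs = np.linalg.eigh(L)
    # The eigenvector of the second-smallest eigenvalue (Fiedler vector)
    # encodes the bipartition: its sign splits the labels into two subsets.
    return (vecs[:, 1] > 0).astype(int)

# Hypothetical 4-label confusion matrix: labels {0,1} and {2,3} are
# confused within their pair but almost never across pairs.
C = np.array([[50, 10, 0, 0],
              [12, 48, 0, 1],
              [0, 0, 45, 15],
              [0, 1, 13, 46]])
g = split_labels(C)
print(g[0] == g[1], g[2] == g[3], g[0] != g[2])
```

Applying such a split recursively to each resulting subset yields the full label tree.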
TABLE 3 Examples of obtained clusters of labels.
Cluster 1: great white sharks, imagenes de delfines, liopleurodon, meduse, mermaid tail, monstre du loch ness, monstruo del lago ness, oarfish, oceans, sea otter, shark attacks, sperm whale, tauchen, whales
Cluster 2: apple iphone 3gs, apple ipod, apple tablet, bumper, iphone 4, htc diamond, htc hd, htc magic, htc touch pro 2, iphone 2g, iphone 3, iphone 5g, iphone app, iphone apple, iphone apps, iphone nano
Cluster 3: chevy colorado, custom trucks, dodge ram, f 250, ford excursion, ford f 150, mini truck, nissan frontier, offroad, pickup, toyota tundra
Learning a Label Embedding Tree
Once labels are organized into a tree, one can jointly retrain an embedding model (using the algorithm described in the section Joint Embedding of Images and Labels) in which each image can now be labeled either with its original labels or with any node of the tree that contains them. Moreover, whenever an internal node is selected as a positive label for a given image during training, we select a competing negative label among its sibling nodes in the label tree, as this corresponds to how the tree is used at test time.
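At test time the tree is traversed greedily from the root. The sketch below uses a hypothetical node structure and hand-crafted two-dimensional embeddings to show the descent; the real trees and embeddings are, of course, learned:

```python
import numpy as np

class Node:
    """A label-tree node (illustrative structure): internal nodes hold
    children plus one embedding column per child; leaves hold a final label."""
    def __init__(self, label=None, children=None, child_embeddings=None):
        self.label = label
        self.children = children or []
        self.child_embeddings = child_embeddings   # shape (D, n_children)

def predict(root, emb_x):
    """Descend the tree: at each node follow the child whose embedding scores
    highest against the embedded image, so test-time cost grows with the tree
    depth rather than with the total number of labels."""
    node = root
    while node.children:
        scores = node.child_embeddings.T @ emb_x
        node = node.children[int(np.argmax(scores))]
    return node.label

# Toy 2-level tree over four labels in a D=2 embedding space.
animals = Node(children=[Node(label="cat"), Node(label="dog")],
               child_embeddings=np.array([[1.0, 0.0], [0.0, 1.0]]))
vehicles = Node(children=[Node(label="car"), Node(label="truck")],
                child_embeddings=np.array([[-1.0, 0.0], [0.0, -1.0]]))
root = Node(children=[animals, vehicles],
            child_embeddings=np.array([[1.0, -1.0], [1.0, -1.0]]))
print(predict(root, np.array([1.0, 0.2])))    # cat
print(predict(root, np.array([-0.3, -1.0])))  # truck
```

With a branching factor b and Y labels, the descent evaluates only about b·log_b(Y) dot products instead of Y.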
The result provides a structure over labels based on both semantic and visual similarities. Furthermore, a label embedding tree is not only faster at test time, it is also more accurate on average, as shown in Bengio et al. (2010).
REFERENCES
Bengio, S., J. Weston, and D. Grangier. 2010. Label embedding trees for large multi-class tasks. Advances in Neural Information Processing Systems 23: Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 6–9, 2010.
Beygelzimer, A., J. Langford, and P. Ravikumar. 2009a. Error-correcting tournaments. Proceedings of the International Conference on Algorithmic Learning Theory, Porto, Portugal, October 3–5, 2009, pp. 247–262.
Beygelzimer, A., J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl. 2009b. Conditional probability tree estimation analysis and algorithm. Proceedings of the Conference in Uncertainty in Artificial Intelligence, Montreal, Canada, June 18–21, 2009.
Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, Fla., June 20–25, 2009.
Hsu, D., S. Kakade, J. Langford, and T. Zhang. 2009. Multi-label prediction via compressed sensing. Advances in Neural Information Processing Systems 22: Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 7–10, 2009.
Ng, A. Y., M. I. Jordan, and Y. Weiss. 2002. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 15: Proceedings of the 16th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 12–15, 2002.
Torralba, A., R. Fergus, and W. T. Freeman. 2008. 80 million tiny images: A large dataset for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11):1958–1970.
Weston, J., S. Bengio, and N. Usunier. 2010. Large scale image annotation: Learning to rank with joint word-image embeddings. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Barcelona, Spain, September 20–24, 2010.