
Large-Scale Visual Semantic Extraction

SAMY BENGIO1
Google Research



ABSTRACT

Image annotation is the task of attaching textual semantics to new images by ranking a large set of possible annotations according to how well each corresponds to a given image. In the large-scale setting, there may be millions of images to process and hundreds of thousands of potential distinct annotations. To accomplish this task, we propose to build a so-called embedding space, into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations that captures how concepts (annotations) resemble each other with respect to their visual characteristics. Such a tree differs from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts.

INTRODUCTION

The emergence of the Web as a tool for sharing information has caused a massive increase in the size of potential data sets available for machines to learn from. Millions of images on web pages have tens of thousands of possible annotations in the form of HTML tags that can be conveniently collected by querying search engines (Torralba et al., 2008), user-provided tags such as those on www.flickr.com, or human-curated labels such as those on www.image-net.org (Deng et al., 2009). We therefore need machine learning algorithms for image annotation that can scale to learn from and annotate such data. This includes (i) scalable training and testing times and (ii) scalable memory usage. In the ideal case we would like a fast algorithm that fits on a laptop, at least at annotation time. For many recently proposed models tested on small data sets, it is unclear whether they satisfy these constraints.

1 This work summarizes the following papers: Weston et al. (2010), with Samy Bengio and Nicolas Usunier, and Bengio et al. (2010), with Jason Weston and David Grangier.

In the first part of this work, we study feasible methods for just such a goal. We consider models that learn to represent images and annotations jointly in a low-dimensional embedding space. Such embeddings are fast at test time, because ranking annotations in a low-dimensional space requires few computations, and the low dimension likewise implies small memory usage. To obtain good performance from such a model, we propose to train its parameters by learning to rank, optimizing for the top annotations in the list, for example, optimizing precision at k (p@k).
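For concreteness, p@k simply measures the fraction of the k highest-ranked annotations that are relevant. A minimal sketch (the function and variable names are ours, not from the original system):

```python
import numpy as np

def precision_at_k(scores, positives, k=10):
    """Fraction of the k highest-scoring annotations that are relevant.

    scores: 1-D array of model scores, one per candidate annotation.
    positives: set of indices of the annotations known to be correct.
    """
    top_k = np.argsort(-scores)[:k]   # indices of the k best-scoring labels
    return sum(1 for i in top_k if i in positives) / k
```

For example, if the top 10 annotations returned for an image include 4 of its true labels, then p@10 = 0.4.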

In the second part of this work, we propose a novel algorithm to improve testing time in multiclass classification tasks where the number of classes (or labels) is very large and where even an algorithm linear in the number of classes can become computationally infeasible. We propose an algorithm for learning a tree structure over the labels in the previously proposed joint embedding space which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods.

JOINT EMBEDDING OF IMAGES AND LABELS

We propose to learn a mapping into a feature space where images and annotations are both represented, as illustrated in Figure 1. The mapping functions are therefore different but are learned jointly to optimize the supervised loss of interest for our final task, that of annotating images. We start with a representation of images $x \in \mathbb{R}^d$ and a representation of annotations $i \in \{1, \ldots, Y\}$, indices into a dictionary of possible annotations. We then learn a mapping from the image feature space to the joint space $\mathbb{R}^D$,

$$\Phi_I(x)\colon \mathbb{R}^d \to \mathbb{R}^D,$$

while jointly learning a mapping for annotations,

$$\Phi_W(i)\colon \{1, \ldots, Y\} \to \mathbb{R}^D.$$

These are chosen to be linear maps, i.e., $\Phi_I(x) = Vx$ and $\Phi_W(i) = W_i$, where $W_i$ indexes the $i$th column of a $D \times Y$ matrix, but potentially any mapping could be used. In our work, we use sparse high-dimensional feature vectors of bags of visual terms for image vectors $x$, and each annotation has its own learned representation (even if, for example, multiword annotations share words). Our goal is, for a given image, to rank the possible annotations such that the highest-ranked annotations best describe the semantic content of the image. We consider the following model:

$$f_i(x) = \Phi_W(i)^\top \Phi_I(x) = W_i^\top V x,$$

where the possible annotations $i$ are ranked according to the magnitude of $f_i(x)$, largest first.

FIGURE 1 Joint embedding space.
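To make the model concrete, the following sketch scores annotations exactly as $f_i(x) = W_i^\top V x$; the dimensions, random initialization, and function names are illustrative assumptions of ours, not the trained system:

```python
import numpy as np

# Illustrative sizes only, not the paper's actual dimensions.
d, D, Y = 10_000, 100, 100_000            # image features, embedding dim, labels
rng = np.random.default_rng(0)
V = rng.normal(scale=0.01, size=(D, d))   # image map: Phi_I(x) = V @ x
W = rng.normal(scale=0.01, size=(D, Y))   # one column per annotation: Phi_W(i) = W[:, i]

def rank_annotations(x, top=10):
    """Score every annotation as f_i(x) = W_i^T (V x) and return the best `top`."""
    scores = W.T @ (V @ x)                # embed the image once, then Y cheap dot products
    return np.argsort(-scores)[:top]
```

Note that the image is embedded once, after which each annotation costs only O(D) to score, and the total memory is just the two matrices, D(d + Y) numbers; this is what makes the low-dimensional joint space both fast and compact.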

Image Labeling as a Learning-To-Rank Task

Labeling an image can be viewed as a ranking task where, given an image, one needs to order labels such that the top ones correspond to the image while the bottom ones are unrelated to it. Various learning-to-rank methods have been proposed in the machine learning literature over the years, some of which can scale to large data sets while others cannot. The simplest scalable approach is the following. One can decompose the ranking task into a large sum of smaller tasks:

$$\sum_{x} \sum_{y^- \neq y^+} \max\left(0,\, 1 - f_{y^+}(x) + f_{y^-}(x)\right),$$

where for each training image $x$, we want the score of each good label $y^+$ to be higher than the score of any bad label $y^-$ by a margin of at least 1; otherwise we pay the corresponding price. This loss can be trained very efficiently on very large data sets using stochastic gradient descent. However, a better alternative is a loss that concentrates on the top of the ranking, instead of considering every triplet $(x, y^+, y^-)$ uniformly. In a previous study (Weston et al., 2010), we proposed the weighted approximate-rank pairwise (WARP) loss, which weighs each triplet according to the estimated rank of the good label and still admits an efficient implementation. The resulting model is much faster to train and performs much better at the top of the ranking.
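The key implementation trick behind WARP is that the rank of a positive label need not be computed exactly: one keeps sampling negative labels until a margin violator is found, and the number of samples drawn yields an estimate of the rank. A minimal sketch of one stochastic update on dense vectors, assuming the linear model above (the learning rate, names, and the omission of details such as norm constraints are ours):

```python
import numpy as np

def warp_step(x, y_pos, V, W, lr=0.01, margin=1.0, max_trials=1000):
    """One stochastic WARP update for an (image, positive label) pair:
    sample negative labels until one violates the margin, then weight the
    pairwise hinge update by the rank estimated from the samples drawn."""
    Y = W.shape[1]
    emb = V @ x                                   # embed the image once
    s_pos = W[:, y_pos] @ emb
    for trials in range(1, max_trials + 1):
        y_neg = np.random.randint(Y)
        if y_neg == y_pos:
            continue
        if W[:, y_neg] @ emb + margin > s_pos:    # violating triplet found
            rank_est = (Y - 1) // trials          # approximate rank of y_pos
            weight = (1.0 / np.arange(1, rank_est + 1)).sum()  # L(k) = sum_{j<=k} 1/j
            step = lr * weight
            diff = W[:, y_pos] - W[:, y_neg]      # save before updating W
            W[:, y_pos] += step * emb             # pull positive toward the image
            W[:, y_neg] -= step * emb             # push negative away
            V += step * np.outer(diff, x)         # chain rule into the image map
            return True
    return False    # no violator found: y_pos is already ranked near the top
```

The weight $L(k) = \sum_{j \le k} 1/j$ grows with the estimated rank, so mistakes near the top of the ranking are penalized far more heavily than mistakes near the bottom.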

Large-Scale Learning

We trained an embedding model with the WARP loss on a very large data set containing more than 10 million training images and more than 100,000 labels, where labels correspond to queries issued on Google Image Search and the images associated with each label were those often clicked for the corresponding query. The result is a very noisy data set, in which queries come in several languages, with many spelling mistakes and many apparently similar queries.

An interesting side effect of training such a model is that it provides a natural way of organizing labels among themselves: one simply looks at the nearest labels of a given label in the embedding space. Table 1 shows examples of the nearest labels of several labels, among which we see misspellings, translations, and semantically similar labels.
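Such a lookup reduces to nearest-neighbor search among the columns of W. A sketch using cosine similarity (one natural choice; the exact similarity measure used in the original system may differ):

```python
import numpy as np

def nearest_labels(W, query, k=8):
    """Indices of the k annotations whose embeddings are most similar
    (by cosine) to the embedding of annotation `query`."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm columns
    sims = Wn.T @ Wn[:, query]                          # cosine similarity to query
    order = np.argsort(-sims)
    return [i for i in order if i != query][:k]
```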

Finally, we show in Table 2 examples of images from our test set (there were more than 3 million of them, different from the 10 million training images), as well as the nearest 10 labels in the embedding space.

TABLE 1 Nearest labels in the embedding space learned on the Web data.

Target Label | Nearest Labels
barack obama | barak obama, obama, barack, barrack obama, bow wow, george bush
david beckham | beckham, david beckam, alessandro del piero, del piero, david becham
santa | santa claus, papa noel, pere noel, santa clause, joyeux noel, tomte
dolphin | delphin, dauphin, whale, delfin, delfini, baleine, blue whale, walvis
cows | cattle, shire, dairy cows, kuh, horse, cow, shire horse, kone, holstein
rose | rosen, hibiscus, rose flower, rosa, roze, pink rose, red rose, a rose
pine tree | abies alba, abies, araucaria, pine, neem tree, oak tree, pinus sylvestris
mount fuji | mt fuji, fujisan, fujiyama, mountain, zugspitze, fuji mountain
eiffel tower | eiffel, tour eiffel, la tour eiffel, big ben, paris, blue mosque, eifel tower
ipod | i pod, ipod nano, apple ipod, ipod apple, new ipod, ipod shuffle
f18 | f 18, eurofighter, f14, fighter jet, tomcat, mig21, f 16

TABLE 2 Examples of the top 10 labels obtained for some test images. [The table pairs test images with their top 10 predicted labels; the images themselves are not reproducible here.]

Source: Adapted from Weston et al., 2010.

LEARNING LABEL TREES

Labeling images when the number of labels is large (in the section Joint Embedding of Images and Labels, we had on the order of 100,000 labels) can be prohibitive for real-time applications. We thus proposed (Bengio et al., 2010) a novel approach to learn a label tree, in which each node predicts the subset of labels to be considered by its children, thus decreasing the number of labels at a logarithmic rate until a prediction is reached. Existing approaches (Beygelzimer et al., 2009a, 2009b; Hsu et al., 2009) typically lose accuracy compared to naive linear-time approaches. Instead, we apply the following two steps: (a) learning a label tree and (b) learning predictors for each of the nodes of the tree.

Learning the Label Tree Structure

In order to learn a label tree such as the one in Figure 2, we proceed as follows. Given a set of labels in a node, we look for a partition of that set into subsets such that, inside a subset, labels are difficult to distinguish with classifiers trained on their corresponding images, whereas it is easier to distinguish images belonging to labels of one subset from images belonging to labels of another subset. We do so by computing the confusion matrix between all labels, where we count the number of times our classifiers confuse class i with class j, and by using this matrix to apply spectral clustering (Ng et al., 2002). This procedure can then be applied recursively to obtain a complete label tree. Table 3 gives an example of labels that were clustered together by this technique.

FIGURE 2 Example of a label tree.
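A sketch of one such partitioning step, assuming held-out predictions from per-label classifiers are available; the helper name and the use of scikit-learn's SpectralClustering are our illustrative choices, not necessarily those of the original system:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def split_label_set(true_labels, predicted_labels, n_labels, n_subsets=2):
    """Partition labels so that labels the classifiers confuse with each
    other end up in the same subset."""
    # Confusion counts: C[i, j] = number of times class i was predicted as class j.
    C = np.zeros((n_labels, n_labels))
    for t, p in zip(true_labels, predicted_labels):
        C[t, p] += 1
    A = 0.5 * (C + C.T)                       # symmetrize into an affinity matrix
    np.fill_diagonal(A, 0.0)
    sc = SpectralClustering(n_clusters=n_subsets, affinity="precomputed")
    return sc.fit_predict(A)                  # subset assignment for each label
```

Applying split_label_set recursively to each resulting subset yields the full label tree.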


TABLE 3 Examples of obtained clusters of labels.

great white sharks, imagenes de delfines, liopleurodon, meduse, mermaid tail, monstre du loch ness, monstruo del lago ness, oarfish, oceans, sea otter, shark attacks, sperm whale, tauchen, whales

apple iphone 3gs, apple ipod, apple tablet, bumper, iphone 4, htc diamond, htc hd, htc magic, htc touch pro 2, iphone 2g, iphone 3, iphone 5g, iphone app, iphone apple, iphone apps, iphone nano

chevy colorado, custom trucks, dodge ram, f 250, ford excursion, ford f 150, mini truck, nissan frontier, offroad, pickup, toyota tundra

Learning a Label Embedding Tree

Once labels are organized into a tree, one can jointly retrain an embedding model (using the algorithm described in the section Joint Embedding of Images and Labels) in which each image can now be labeled either with its original labels or with any of the nodes of the tree that contain them. Moreover, whenever an internal node is selected as a positive label for a given image during training, we select a competing negative label from among its sibling nodes in the label tree, as this corresponds to how the tree will be used at test time.
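At test time, annotating an image then amounts to a greedy descent of the tree. A sketch with a hypothetical node structure (the fields w, children, and label are our naming, not the paper's):

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class LabelTreeNode:
    w: np.ndarray                        # learned embedding for this node's label set
    children: List["LabelTreeNode"] = field(default_factory=list)
    label: Optional[int] = None          # filled in only at leaves

def annotate(x: np.ndarray, root: LabelTreeNode, V: np.ndarray) -> int:
    """Descend the tree, at each internal node following the child whose
    embedding scores highest against the image."""
    emb = V @ x                          # embed the image once
    node = root
    while node.children:
        node = max(node.children, key=lambda c: float(c.w @ emb))
    return node.label
```

With a branching factor b and roughly log_b Y levels, annotating an image requires on the order of b log_b Y dot products instead of Y, which is what makes the tree attractive for real-time use.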

The result provides a structure over labels based on both semantic and visual similarities. Furthermore, a label embedding tree is not only faster at test time, it is also more accurate on average, as shown in Bengio et al. (2010).

REFERENCES

Bengio, S., J. Weston, and D. Grangier. 2010. Label embedding trees for large multi-class tasks. Advances in Neural Information Processing Systems 23: Proceedings of the 24th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 6–9, 2010.

Beygelzimer, A., J. Langford, and P. Ravikumar. 2009a. Error-correcting tournaments. Proceedings of the International Conference on Algorithmic Learning Theory, Porto, Portugal, October 3–5, 2009, pp. 247–262.

Beygelzimer, A., J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl. 2009b. Conditional probability tree estimation analysis and algorithm. Proceedings of the Conference in Uncertainty in Artificial Intelligence, Montreal, Canada, June 18–21, 2009.

Deng, J., W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, Fla., June 20–25, 2009.

Hsu, D., S. Kakade, J. Langford, and T. Zhang. 2009. Multi-label prediction via compressed sensing. Advances in Neural Information Processing Systems 22: Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 7–10, 2009.

Ng, A. Y., M. I. Jordan, and Y. Weiss. 2002. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 15: Proceedings of the 16th Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 12–15, 2002.

Torralba, A., R. Fergus, and W. T. Freeman. 2008. 80 million tiny images: A large dataset for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11):1958–1970.


Weston, J., S. Bengio, and N. Usunier. 2010. Large scale image annotation: Learning to rank with joint word-image embeddings. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Barcelona, Spain, September 20–24, 2010.
