labels such as in www.image-net.org (Deng et al., 2009). We therefore need machine learning algorithms for image annotation that can scale to learn from and annotate such data. This includes (i) scalable training and testing times and (ii) scalable memory usage. In the ideal case we would like a fast algorithm that fits on a laptop, at least at annotation time. For many recently proposed models tested on small data sets, it is unclear if they satisfy these constraints.
In the first part of this work, we study feasible methods for just such a goal. We consider models that learn to represent images and annotations jointly in a low-dimensional embedding space. Such embeddings are fast at testing time because the low dimension implies fast computations when ranking annotations; it likewise implies small memory usage. To obtain good performance for such a model, we propose to train its parameters by learning to rank, optimizing for the top annotations in the list, for example, optimizing precision at k (p@k).
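The p@k criterion mentioned above simply measures what fraction of the top k ranked annotations are relevant to the image. A minimal sketch (the function name and example labels are illustrative, not from the paper):

```python
def precision_at_k(ranked_labels, true_labels, k):
    """Fraction of the top-k predicted labels that appear in the true label set."""
    relevant = set(true_labels)
    top_k = ranked_labels[:k]
    return sum(1 for label in top_k if label in relevant) / k

# Two of the top three predictions are correct -> p@3 = 2/3
score = precision_at_k(["cat", "dog", "car", "tree"], ["dog", "cat"], k=3)
```

Optimizing p@k directly, rather than an average loss over all labels, focuses the model's capacity on getting the head of the ranked list right.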
In the second part of this work, we propose a novel algorithm to improve testing time in multiclass classification tasks where the number of classes (or labels) is very large and where even an algorithm linear in the number of classes can become computationally infeasible. We propose an algorithm for learning a tree structure over the labels in the previously proposed joint embedding space, which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods.
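The efficiency gain of a label tree comes from routing: at test time an example follows the best-scoring child at each node, so only O(depth × branching factor) scores are computed instead of one per label. A hypothetical sketch of this routing step (the node structure and dummy scorer are illustrative; in the paper the tree and node scorers are learned):

```python
class Node:
    """A label-tree node covering a subset of the label dictionary."""
    def __init__(self, labels, children=()):
        self.labels = labels        # labels reachable under this node
        self.children = children    # empty tuple => leaf

def route(node, score):
    """Descend the tree, following the best-scoring child at each step;
    return the label set at the reached leaf (which would then be
    scored exhaustively, as it is small)."""
    while node.children:
        node = max(node.children, key=score)
    return node.labels

# Tiny illustrative tree over 4 labels; `score` would normally apply a
# learned node classifier to the embedded image. Here a dummy scorer
# that prefers the child containing the largest label id stands in.
leaves = [Node([0]), Node([1]), Node([2]), Node([3])]
tree = Node([0, 1, 2, 3],
            (Node([0, 1], (leaves[0], leaves[1])),
             Node([2, 3], (leaves[2], leaves[3]))))
reached = route(tree, score=lambda n: max(n.labels))
```

With Y labels and a balanced tree, testing cost drops from O(Y) to roughly O(log Y) score evaluations.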
JOINT EMBEDDING OF IMAGES AND LABELS
We propose to learn a mapping into a feature space where images and annotations are both represented, as illustrated in Figure 1. The mapping functions are therefore different but are learned jointly to optimize the supervised loss of interest for our final task, that of annotating images. We start with a representation of images x ∈ ℝ^d and a representation of annotations i ∈ Y = {1, …, Y}, indices into a dictionary of possible annotations. We then learn a mapping from the image feature space to the joint space ℝ^D:

Φ_I(x): ℝ^d → ℝ^D,

while jointly learning a mapping for annotations:

Φ_W(i): {1, …, Y} → ℝ^D.

These are chosen to be linear maps, i.e., Φ_I(x) = V x and Φ_W(i) = W_i, where W_i indexes the ith column of a D×Y matrix, but potentially any mapping could be used. In our work, we use sparse high-dimensional feature vectors of bags of visual terms for image vectors x, and each annotation has its own learned representation (even if, for example, multiword annotations share words). Our goal is, for a given image, to rank the possible annotations such that the highest ranked
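With both maps linear, an annotation i is scored against an image x by the inner product of their embeddings, W_iᵀ V x, and annotations are ranked by this score. A minimal sketch with random (untrained) parameters; the dimensions and the dense stand-in for the sparse bag-of-visual-terms vector are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d, D, Y = 1_000, 100, 10_000            # image dim, embedding dim, dictionary size
V = rng.standard_normal((D, d)) * 0.01  # image map:      Phi_I(x) = V x
W = rng.standard_normal((D, Y)) * 0.01  # annotation map: Phi_W(i) = W[:, i]

def rank_annotations(x, k=5):
    """Score every annotation i by W_i^T V x and return the top-k label indices."""
    emb = V @ x                      # embed the image into R^D (one D x d product)
    scores = W.T @ emb               # one score per annotation
    return np.argsort(-scores)[:k]   # indices of the k highest-scoring labels

# In practice x is a sparse bag-of-visual-terms vector; a dense random
# vector stands in here.
x = rng.standard_normal(d)
top5 = rank_annotations(x, k=5)
```

Note why this scales: ranking costs one D×d product to embed the image plus Y inner products in ℝ^D, and memory is just the two matrices V and W, independent of the training set size.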