Rob Fergus, New York University and Facebook
Rob Fergus, New York University and Facebook, described Facebook as a multi-modal platform to which users upload billions of images, posts, comments, and videos each day. Research is being conducted on a variety of deep learning approaches for handling such a heterogeneous data stream.
The first project Fergus discussed described strategies for handling multi-modal data. Using deep neural networks, World2Vec represents the many different entities found on Facebook as high-dimensional vectors in the same space. Fergus explained that these different types of data (e.g., users and pages) have relations between them, which can be represented as a graph. A complete graph would have on the order of 10^10 edges^1 per day, so the method is designed to scale to very large data streams.
Fergus noted that when analyzing these large, noisy data streams, ambiguity creates challenges that can be addressed with these methods—for example, exploring the embedding space can reveal whether a post about “football” is discussing soccer or the National Football League. These methods also allow for Facebook page categorization. Fergus summarized that large-scale embedding methods are very powerful because common space can be used to represent many different types of data, and relationships between data can be represented as distance functions within that space. He explained that these embedding methods can be considered a continuous form of reasoning.
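To make the idea of a shared embedding space concrete, the sketch below disambiguates a post mentioning "football" by measuring cosine similarity against concept vectors. The three-dimensional vectors and their values are purely illustrative assumptions, not Facebook's actual embeddings:

```python
import numpy as np

# Hypothetical toy embeddings: in a shared space, a post's vector can be
# compared against concept vectors to resolve an ambiguous word like "football".
embeddings = {
    "soccer":        np.array([0.9, 0.1, 0.0]),
    "nfl":           np.array([0.1, 0.9, 0.0]),
    "post_football": np.array([0.8, 0.2, 0.1]),  # a post mentioning "football"
}

def cosine(u, v):
    """Cosine similarity: relationships become distances in the shared space."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def disambiguate(post_key, senses):
    """Return the sense whose embedding lies closest to the post's embedding."""
    post = embeddings[post_key]
    return max(senses, key=lambda s: cosine(post, embeddings[s]))

print(disambiguate("post_football", ["soccer", "nfl"]))  # -> soccer
```

Because all entity types live in one space, the same distance function serves for word-sense disambiguation, page categorization, or any other relation between entities.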
The second project Fergus introduced described ways to train deep models from scratch and thus learn from noisy labels, such as the metadata associated with images found on Flickr. Setting up a successful architecture involves treating the words that appear in an image's comments as targets and training convolutional neural networks to predict those words from the image. Training uses a multi-class logistic loss over all image–word pairs and mini-batch stochastic gradient descent with class-uniform sampling, so that the most common words do not dominate the training.
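The two training ingredients named above can be sketched as follows. The skewed word counts, vocabulary size, and two-level sampler are illustrative assumptions; the point is only that picking a word class uniformly first, and then an example of that class, keeps frequent words from dominating the mini-batches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical image-word pairs with a heavy skew, as in Flickr metadata,
# where a few common words account for most annotations.
pairs = [(f"img{i}", w) for i, w in enumerate([0] * 90 + [1] * 8 + [2] * 2)]

def class_uniform_batch(pairs, batch_size, rng):
    """Sample a mini-batch by first picking a word class uniformly, then an
    example of that class, so frequent words do not dominate training."""
    by_class = {}
    for img, w in pairs:
        by_class.setdefault(w, []).append((img, w))
    classes = list(by_class)
    batch = []
    for _ in range(batch_size):
        c = classes[rng.integers(len(classes))]
        batch.append(by_class[c][rng.integers(len(by_class[c]))])
    return batch

def multiclass_logistic_loss(logits, target):
    """Softmax cross-entropy over the word vocabulary for one image."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

batch = class_uniform_batch(pairs, 30, rng)
counts = np.bincount([w for _, w in batch], minlength=3)
print(counts)  # roughly balanced despite the 90/8/2 skew in the data
```

In a real system the logits would come from a convolutional network over the image; here the loss is shown on its own only to make the objective explicit.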
For the experimental setup, the networks are trained and then evaluated based on word prediction, transfer learning, and word embeddings. Fergus indicated that this "weak form of supervised training" has been successful; it is possible to learn valuable representations without explicit human annotation.
1 Edges are the links that connect the points of a graph.
The third program Fergus described focused on training big models faster—specifically, training ImageNet in 1 hour. He noted that combining more data with bigger models leads to higher accuracy, although the training time also increases in such a scenario. In distributed training, optimization issues arise as the batch size increases, so the learning rate must be scaled linearly to maintain accuracy. Increasing the number of graphics processing units (GPUs) then yields a near-linear speed-up in training time.
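The linear scaling rule is simple enough to state in a few lines. The base learning rate and batch size below are illustrative placeholders, not the values from the talk:

```python
# Linear learning-rate scaling: when the global mini-batch grows by a factor k
# in distributed training, scale the learning rate by the same factor.
BASE_LR = 0.1     # learning rate tuned for the reference batch size (assumed)
BASE_BATCH = 256  # reference mini-batch size (assumed)

def scaled_lr(batch_size, base_lr=BASE_LR, base_batch=BASE_BATCH):
    """Scale the learning rate linearly with the overall mini-batch size."""
    return base_lr * batch_size / base_batch

# With 256 GPUs each holding 32 images, the global batch is 8192:
print(scaled_lr(8192))  # 0.1 * 8192 / 256 = 3.2
```

In practice a large scaled rate is usually reached gradually via a warm-up phase rather than applied from the first iteration.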
The fourth and final program that Fergus presented studied training on large models that do not fit on a single graphics processing unit (GPU). Though the previously discussed program addressed the issue of faster training, it did not address increasing the size of the model itself. A barrier to this is that the memory of a GPU is limited. The key solution, Fergus explained, is a "hard" mixture-of-experts approach, in which each example is directed to a single expert.
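A minimal sketch of hard routing is shown below. The linear gating function, the linear "experts," and all dimensions are illustrative assumptions; the essential property is that the argmax of the gate sends each example to exactly one expert, so each expert's parameters can live on a different GPU:

```python
import numpy as np

rng = np.random.default_rng(1)

D, K = 8, 4                       # feature dimension, number of experts (assumed)
gate_w = rng.normal(size=(D, K))  # gating weights (assumed linear gater)
experts = [rng.normal(size=D) for _ in range(K)]  # one toy linear expert each

def route(x):
    """Hard routing: argmax over gate scores selects a single expert."""
    return int(np.argmax(x @ gate_w))

def forward(x):
    """Only the selected expert processes the example."""
    k = route(x)
    return k, float(x @ experts[k])

x = rng.normal(size=D)
k, y = forward(x)
print(k)  # index of the one expert that handled this example
```

Because only one expert runs per example, total model capacity can grow with the number of experts while the per-example compute and per-device memory stay roughly constant.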
Rama Chellappa, University of Maryland, College Park, noted that the mixture of experts method has also been useful in Janus’s face verification research, as discussed in Chapter 3 of this proceedings. In response to a question from Kathy McKeown, Columbia University, Fergus explained that the representation could be optimized for a particular objective. Darrell Young, Facebook, asked Fergus how he would use fastText vectors between languages. Fergus thought it may be possible to do machine translation with fastText representation, though he was unsure precisely how.