Anthony Hoogs, Kitware, Inc.
Jonathan Fiscus, National Institute of Standards and Technology
Rob Fergus, New York University and Facebook
Jason Duncan, MITRE Corporation
After each panel participant provided an overview of his work, the panel discussion focused on how to couple evaluation1 with annotation2 to achieve maximum performance. Annotation is a critical component of deep learning because it supplies the large volumes of labeled data that models require, independent of any evaluation.
Anthony Hoogs’s company, Kitware, Inc., is the test and evaluation support contractor for the Intelligence Advanced Research Projects Activity (IARPA) Deep Intermodal Video Analytics (DIVA) program, a 5-year project in ground surveillance video analytics. DIVA’s objective is to dramatically advance the state of the art in (1) automatic detection of activities in cluttered scenes, (2) temporal reasoning over video for improved activity detection, and (3) activity detection and scene understanding across overlapping and non-overlapping camera viewpoints. Kitware’s role in achieving this objective is to collect and annotate data, develop an open source video analytics framework, and establish a baseline algorithm. Kitware was chosen primarily because of the surveillance video data it collected (and kept sequestered) during a prior program. Before this restricted data can be used for deep learning, testing, and evaluation, in DIVA as well as in other programs, it must be annotated; Hoogs explained that hiring eight part-time in-house annotators proved a relatively easy and cost-effective way to complete this task.
Jonathan Fiscus, National Institute of Standards and Technology (NIST), shared lessons learned about the qualities of a good evaluation. Evaluations are driven by many factors—for example, task design, data, infrastructure, community engagement, and testing and analysis. He added that evaluation can support transition, if done correctly. In particular, he described the following:
- Task design requires an independent evaluator as well as clear technology use cases and purpose. He highlighted the importance of collaborating to convert the statement of what the technology needs to do into evaluation tasks. Task design can be innovative or prescriptive, and the evaluation can be component-level or end-to-end.
1 The purpose of evaluation is to “provide testing conditions that convincingly suggest that [an] algorithm will perform well on other people’s data, out in the real world. . . . [I]t’s important to keep track of the testing conditions, any modifications [made] to [the] algorithm, and places in [the] annotation scheme that could be changed to improve performance later” (A. Stubbs and J. Pustejovsky, 2013, Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications, O’Reilly Media, Sebastopol, Calif.).
2 Annotation is described as “any metadata tag used to mark up elements of a data set. . . . [I]n order for the algorithms to learn efficiently and effectively, the annotation done on the data must be accurate and relevant to the task the machine is being asked to perform” (A. Stubbs and J. Pustejovsky, 2013, Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications, O’Reilly Media, Sebastopol, Calif.).
- Data is needed for training, development testing, cross-validation, and evaluation testing. He emphasized the need for enough data to test for differences between systems, and for dividing evaluation resources into publicly available and sequestered data.
- Infrastructure impacts transition, and innovative programs can be more difficult to transition than prescriptive programs. He suggested using the cloud for effective testing.
- Community engagement can be a challenge when difficult problems have competing requirements. It is crucial that evaluators understand the relevant trade-offs.
- Testing and analysis is key for transition. Unfortunately, there is no way to automatically evaluate a new machine learning algorithm. Possible solutions to this problem include task-adaptable annotations.
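Fiscus’s point about dividing evaluation resources can be sketched in a few lines. The three-way split below, and the particular fractions used, are illustrative assumptions, not figures from the workshop: the training and development sets are released publicly, while the evaluation set is sequestered by the evaluator.

```python
import random

def split_corpus(items, train_frac=0.7, dev_frac=0.15, seed=42):
    """Three-way split: public training data, public development-test
    data, and a sequestered evaluation set held back by the evaluator.
    Fractions are illustrative, not from the workshop."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    sequestered = shuffled[n_train + n_dev:]
    return train, dev, sequestered

train, dev, seq = split_corpus(range(1000))
# The sequestered set never overlaps the public data, so evaluation
# results cannot be inflated by training on the test material.
assert not (set(train) | set(dev)) & set(seq)
```

Keeping the sequestered partition under the evaluator’s control is what makes reported system differences credible.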
Fiscus discussed the Defense Advanced Research Projects Agency (DARPA) Media Forensics (MediFor)3 program to highlight the challenges of doing so many evaluation tasks with only one data set. He noted that his team is designing data sets and evaluation tasks to do better factor analysis, which helps to decide the next steps in programming, evaluation, and research.
Rob Fergus, New York University and Facebook, described the “sweet spot” for the amount of data that can be used for deep learning. With too little data, it is difficult to train existing deep learning models; with too much data, the computational cost of training can become overwhelming. And when graduate students try to run big models on only one or two graphics processing units, the number of experiments that can be run decreases. This issue became apparent in a 2010–2011 DARPA deep learning program, when graphics processing units had far less computing power than they have now. Fergus explained that this group would have benefitted from infrastructure and hardware appropriate to that volume of data. However, hardware of this caliber is quite expensive, especially for use by university students, and many universities are ill-equipped for large-scale graphics processing unit deployments. From an evaluation perspective, Fergus noted that when developing algorithms, it is essential to iterate many times and complete hundreds or thousands of experiments.
Jason Duncan, MITRE, emphasized that an important goal associated with evaluating machine-generated products is to maintain situational awareness. Researchers want to learn about the world; doing so enables strategic warning and predictive scenarios and helps intelligence personnel get ahead of some threats before they occur. Duncan noted that both technology evaluations (e.g., core technologies) and utility evaluations (e.g., value to mission) are useful in achieving situational awareness. He described the value of multi-source data in knowledge discovery and the benefits of combining evaluation metrics across data input types. And he noted that when dealing with heterogeneous data with ambiguity, it is important to measure the system’s ability to have confidence in its predictions. He added that noise is inevitable in the output of unstructured content processing systems, but in many cases utility evaluations can demonstrate the value to mission of such systems even in the face of imperfect system output.
James Donlon, National Science Foundation, asked whether graduate students have enough computing power to make better use of data for academic research and keep up with industry contributions. Fergus responded that the answer depends upon the university in question; some of the larger universities have access to more graphics processing units. He suspects that image work is manageable on most campuses, while research in video remains a difficult computational problem. Hoogs added that the alignment of models to graphics processing units also needs to be considered: even if a university has multiple graphics processing units, they may be inadequate for a model that must fit into the memory of a single graphics processing unit. Donlon asked if the ability to discover is then hindered by researchers’ need to conform. Fergus responded in the affirmative and explained that conservative solutions are likely if one is limited to running only a small number of experiments. And in reference to Hoogs’s point about model alignment, Fergus noted that most people run many different experiments in parallel instead of distributing a single job across multiple graphics processing units.
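Hoogs’s point about model alignment comes down to simple arithmetic: a model’s training footprint, roughly parameter count times bytes per parameter inflated by the overhead of gradients and optimizer state, must fit in a single device’s memory regardless of how many devices a lab owns. A back-of-envelope check along those lines (the 3× overhead multiplier is an assumption for illustration, not a precise figure):

```python
def fits_on_one_gpu(n_params, gpu_mem_gb, bytes_per_param=4, overhead=3.0):
    """Rough feasibility check: does a model's training footprint fit
    in one GPU's memory? `overhead` approximates the extra memory for
    gradients and optimizer state (an illustrative multiplier)."""
    needed_gb = n_params * bytes_per_param * overhead / 1e9
    return needed_gb <= gpu_mem_gb

# A 100-million-parameter model trains comfortably on an 11 GB card...
print(fits_on_one_gpu(100_000_000, gpu_mem_gb=11))    # True
# ...but a 2-billion-parameter model does not, no matter how many such
# cards the lab has, if each copy of the model must fit on one device.
print(fits_on_one_gpu(2_000_000_000, gpu_mem_gb=11))  # False
```

This is why, as Fergus noted, labs with several modest cards tend to run independent experiments in parallel rather than one large job.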
Hoogs referenced MediFor as a program that successfully provides computing resources to its performers. Fiscus added that while that is a great solution, it raises a technology problem that can be addressed with cloud architecture. Hoogs noted another downside to program-wide compute resources such as MediFor’s: while shared resources work well during development, they may be less useful when an evaluation deadline arrives and all performers need significant computing resources at the same time. This is then a program problem, not just a performer problem. Fiscus noted that his group is currently collecting algorithms through MediFor, but it is too early to gauge whether the approach is successful.
Rama Chellappa, University of Maryland, College Park, asked if Facebook would be willing to develop agreements to work with academic professors and researchers, instead of only recruiting them. Fergus said that there could be security issues that would prevent such an arrangement, but discussions about the possibility continue at Facebook. Fergus noted that Amazon already has the infrastructure to do something similar: its Elastic Compute Cloud4 could provide credits to university researchers. An audience participant added that, in addition to offering academic credits to researchers, Amazon offers an academic rewards program and funding for Ph.D. students.
Donlon asked the panel about the role of reference data sets: ImageNet is central to academic research, and yet reference data sets can also constrain the machinery built around them and the capabilities of that machinery. Donlon continued that this issue is relevant for a customer who wants to deploy systems in the real world and asked how to use these reference data sets to guide academic research forward. Duncan responded that the advantages of reference data sets outweigh the disadvantages; especially in the case of government programs, reference data sets help to focus performers on the problem of interest. As an example, Hoogs said that he could not recall a major computer vision program sponsored by the U.S. government that had full training and testing data that was completely open and useful to the computer vision community. Hoogs added that any time a program can set up a full, complete data set, it should, so as to engage the entire community. Fergus explained that one of the reasons academics often conform to reference data sets is that they have a vested interest in publishing papers: making evaluation results and data sets available to the public adds credibility and gives people an opportunity to compare their work. Duncan added that much of this discussion depends upon the problem at hand for the performers.
Kathy Lajoie Malik, Technology Leadership Achieving Intended Outcomes, asked if there has been enough discussion of evaluating the ability to manage change on both the product and the processes, now that the time between research and implementation is shortening. Duncan responded that technology transfer has always been and will continue to be a difficult problem, especially since many analysts are resistant to change. Fiscus added that part of that problem could be addressed in the technology design and in evaluating the technology itself. Hoogs emphasized that it is critical to have the potential technology transition groups and end users represented in the construction of the evaluation.