Peter Pirolli, Institute for Human and Machine Cognition
Peter Pirolli, Institute for Human and Machine Cognition, introduced the topic of human–computer interaction in sensemaking tasks, which are prominent in the intelligence community. He defined sensemaking tasks as tasks involving massive amounts of data and initially ill-defined goals and constraints. The purpose of a sensemaking task is to organize and interpret the data in order to generate knowledge that supports a decision or a solution to a problem. Examples of sensemaking tasks include the need to understand a health problem in order to make a medical decision or the desire to produce an actionable intelligence report.
Pirolli continued by noting that since the cognitive capacity of human beings is relatively limited, they must adapt to deal with this influx of available data in sensemaking tasks. Though creating more technology can be helpful, doing so alone is insufficient. The solution to this problem is to develop technology that specifically augments the rate at which humans can perform sensemaking tasks. Such system-level improvements can help humans gain more knowledge, make better inferences, and arrive at better decisions. He added that system-level improvements also have the potential to reduce cognitive bias, improve system assessments, and allow analysts to evaluate the trustworthiness of the system’s products.
When human beings interact with information, Pirolli explained, phenomena occur at multiple time scales, each with its own laws. The lowest level could be considered the neurobiological (i.e., neurons fire and ensembles work together), with the next layers being the psychological (i.e., things are retrieved from memory to make a decision), the rational (i.e., the organization of tasks and their solutions), and the social (i.e., communication and organizational processes). Pirolli explained that it is important to understand how to develop predictive, multi-level models of the human analysts who perform sensemaking tasks. Doing so allows (1) an understanding of the analysts’ performance capabilities, biases, and failure modes; (2) an opportunity to develop better training programs or better systems; (3) a prediction of the impact of new technologies and new interaction techniques on the cognitive system; and (4) an opportunity to help artificial intelligence systems form common ground with humans to improve collaboration. The ultimate goal, he said, is to develop an interdependent joint activity between the human and the system: the user should understand what the system is doing, and the system should understand what the user is doing.
Pirolli introduced techniques developed around cognitive task analysis that try to understand how people
organize their thoughts. Understanding how things are organized makes it possible to target technologies at these different levels that can impact overall performance. Information foraging theory uses mathematical models to try to understand how humans search for information and demonstrates that humans make optimizing choices. Work is also being done around highly integrated cognitive architectures to understand how memory, perception, judgment, and problem solving work together to perform tasks.
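The rate-optimization idea behind information foraging theory can be illustrated with a small sketch. The specific gain function and parameter values below are hypothetical, chosen only to show the general pattern (in the spirit of Charnov's marginal value theorem, on which the theory draws): a searcher should leave an information "patch" when its diminishing returns drop the overall rate of gain.

```python
import numpy as np

# Illustrative sketch, not Pirolli's actual model: treat information
# search as rate optimization. A forager accrues gain g(t_w) while
# spending time t_w within a patch, and pays travel time t_b between
# patches; the optimal policy maximizes the overall rate of gain.

def gain(t_w, g_max=10.0, k=1.0):
    """Diminishing-returns gain from t_w time units spent in a patch
    (hypothetical exponential saturation)."""
    return g_max * (1.0 - np.exp(-k * t_w))

def overall_rate(t_w, t_b=2.0):
    """Average gain rate, including between-patch travel time t_b."""
    return gain(t_w) / (t_b + t_w)

# Numerically find the within-patch time that maximizes overall rate:
# staying longer keeps adding gain, but at a rate below the average,
# so the optimum is an interior point rather than "search forever."
times = np.linspace(0.01, 10.0, 1000)
best_t = times[np.argmax(overall_rate(times))]
```

Under these assumed parameters the optimal leaving time is an interior value (about 1.5 time units), capturing the theory's core claim that human search behavior reflects optimizing choices.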
Pirolli described the Intelligence Advanced Research Projects Activity (IARPA) Integrated Cognitive–Neuroscience Architectures for Understanding Sensemaking (ICArUS)1 program, which modeled human sensemaking and geospatial intelligence tasks in order to predict human bias and performance in such tasks. Both the task and the data were analyzed at multiple levels and mapped to a cognitive architecture, before being mapped to neural simulation models, which simulated the data with a high degree of fidelity.
Pirolli’s last example concerned the use of computational social science and social psychological models. Multi-level modeling is also useful in studying credibility and decision making: computational cognitive models help to better understand how people judge the credibility of others.
Current technology capabilities for sensemaking tools include (1) the use of commercial off-the-shelf interactive visual analytics (e.g., Tableau®) for big data analysis, (2) mature research platforms for visualization grammars and toolkits, and (3) early versions of interactive visualizations to understand what machine learning systems are doing. He added that a standard model of cognition has emerged for constrained, reasonably well-defined tasks. Cognitive models for tutoring systems are well established, as are user modeling and personalization.
Over the coming years, Pirolli stated, challenges will remain in human–artificial intelligence collaboration: (1) Sensemaking tasks for the intelligence community are ill-defined and dynamically changing (e.g., the data changes, as do the tactics of the adversaries); (2) Missions and tasks span multiple time scales, levels of aggregation, and levels of organization; and (3) The user and the machine will continually try to understand one another in the midst of change. Pirolli noted a Defense Advanced Research Projects Agency (DARPA) program, Explainable Artificial Intelligence (XAI),2 that creates new learning processes and interfaces so a user can understand why a system is doing something, how to correct failures, and how to make predictions.
In the next 3–5 years, Pirolli believes the community will likely have (1) multi-level models across the cognitive, rational, and social bands for well-defined stationary sensemaking tasks; (2) an emerging field of visual analytics for machine learning programming; and (3) interactive task learning for usable soft bots for well-defined tasks. Longer-term challenges include developing (1) a foundational science of Human–Autonomy Collaboration (HAC) that would aid the transition from programming to learning to work together; (2) multi-level models of sensemaking that include dynamically adapting humans and machine learning in joint analysis; (3) research methods and evaluation frameworks on joint human–autonomy tasks; and (4) open source data and code for tasks that have some relevance and validity to the intelligence community.
Joseph Mundy, Vision Systems, Inc., highlighted the varying levels of programming skills needed for different tasks. He asked if models for intelligence analysis can predict these needs and suggested that machines could do the mundane clean-up while humans do the discovery work. Pirolli explained that there is some debate about this model for human–machine interaction. He continued that it is important to view the optimal structure as a task itself and consider the best organization between the human and the machine.
Chris Callison-Burch, University of Pennsylvania
Chris Callison-Burch, University of Pennsylvania, defined crowdsourcing as hiring people to accomplish small tasks at low cost; Amazon’s Mechanical Turk is the most common platform for crowdsourcing. Here, humans perform “artificial artificial intelligence”; in other words, humans perform artificial intelligence tasks because artificial intelligence algorithms are not yet equipped to solve certain problems.
1 For information on the ICArUS program, see IARPA, “Integrated Cognitive-Neuroscience Architectures for Understanding Sensemaking (ICArUS),” https://www.iarpa.gov/index.php/research-programs/icarus/baa, accessed August 27, 2017.
2 To learn more about the XAI program, see D. Gunning, “Explainable Artificial Intelligence (XAI),” DARPA, https://www.darpa.mil/program/explainable-artificial-intelligence, accessed August 29, 2017.
Callison-Burch described ImageNet, a large-scale annotated resource, as a successful example of crowdsourcing; it categorizes images using the WordNet (a natural language processing resource) ontology. He highlighted a process from Fei-Fei Li that involved collecting candidate images from the internet, which were then verified by humans; the result was more than 1,000 object classes and more than 1 million labeled images. Li estimated that if a single person had to label 10,000 candidate images for each of 40,000 synsets, the task would take nearly 20 years. Li and her team therefore designed the ImageNet Mechanical Turk user interface so that many people could categorize images in parallel: 11 million images were labeled by 25,000 Mechanical Turk workers in only 1 year. Although neural networks have advanced the state of the art in computer vision, goals such as object recognition would not have been achievable without labeled training data.
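The human-verification step described above typically relies on redundancy: each image is shown to several workers, and a label is accepted only when enough of them agree. The following sketch is a hypothetical, minimal version of that majority-voting pattern, not the actual ImageNet aggregation code.

```python
from collections import Counter

# Hypothetical sketch of redundancy-plus-voting for crowdsourced labels:
# several workers label the same item, and the majority label is
# accepted only if its vote share clears a threshold.

def aggregate_labels(worker_labels, min_agreement=0.5):
    """Return the majority label if its share of votes exceeds
    min_agreement; otherwise return None (item needs more votes)."""
    counts = Counter(worker_labels)
    label, votes = counts.most_common(1)[0]
    if votes / len(worker_labels) > min_agreement:
        return label
    return None

# Three of four workers agree, so the majority label is accepted:
print(aggregate_labels(["cat", "cat", "dog", "cat"]))  # cat
# A 1-1 split clears no majority, so the item is not resolved:
print(aggregate_labels(["cat", "dog"]))  # None
```

In practice the agreement threshold and number of workers per item are tuned per task; higher redundancy costs more but yields cleaner labels.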
Crowdsourcing has also enabled advances in natural language processing because it allows collection of data needed to train models, according to Callison-Burch. He acknowledged that while the low costs associated with crowdsourcing may be appealing, the real benefit of crowdsourcing is that annotation tasks can be completed quickly by a large number of people.
Callison-Burch explained that machine translation takes a sentence in one language and produces a translation in another language. These translation systems are typically trained on paired sentences (i.e., parallel data) produced by human translators. Certain languages that are of interest to the intelligence community, however, often have very little parallel, or training, data available. Machine translation could be improved, according to Callison-Burch, either by building a better model or by adding more data to the training collection.
During his work with the Human Language Technology Center of Excellence, Callison-Burch quantified the quality of translations from Mechanical Turkers. He found that while machine translations are similar in quality to Turker translations, both are worse than those of professional translators. With additional quality control, he found that he could improve the quality of the Mechanical Turkers’ output.
Callison-Burch performed a number of experiments to study the capabilities of human language technologies on Mechanical Turk. He researched how to create bilingual parallel text for a variety of low resource languages; completed a large-scale demographic study in which the list of languages represented by Mechanical Turk workers increased; and used crowdsourcing to identify cases of written Arabic dialect and create an Arabic-to-English parallel corpus that could be used to train a machine translation system.
He explained that there is a need for low-cost, high-quality translations, especially to expand the number of languages currently covered by Google and the Center for Applied Machine Translation. There are both advantages and disadvantages to crowdsourcing, he reiterated: while it is scalable, costs little, and offers an on-demand workforce, there are quality concerns, worker skill levels cannot be assumed, and it cannot be used for sensitive data. In conclusion, he emphasized that crowdsourcing is a powerful tool that is worthy of additional investment, as it enables better data creation and thus better science and technology. Within the next 5 years, Callison-Burch expects to see a tighter integration of crowdsourcing and machine learning (e.g., active learning or domain adaptation) as well as a new crowdsourcing platform for natural language processing that could be deployed internally for the intelligence community.
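One common form of the crowdsourcing–machine learning integration mentioned above is active learning via uncertainty sampling: the model itself picks the unlabeled items it is least sure about, and only those are sent to crowd workers. The sketch below is an assumed illustration of that selection step, not a system from the talk; the toy probability function is purely hypothetical.

```python
# Minimal sketch of uncertainty sampling for active learning (assumed
# illustration): rank unlabeled items by the model's top predicted
# probability and send the least confident ones to crowd annotators.

def least_confident(unlabeled, predict_proba, batch_size=2):
    """Select the batch_size items whose top predicted probability is
    lowest -- the items where crowd labels help the model most."""
    scored = [(max(predict_proba(x)), x) for x in unlabeled]
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored[:batch_size]]

# Toy stand-in for a real classifier: pretend shorter strings are
# "harder," so they get lower top probabilities.
def toy_probs(x):
    p = min(0.5 + len(x) / 20.0, 0.99)
    return [p, 1.0 - p]

pool = ["a", "medium text", "a much longer example sentence"]
print(least_confident(pool, toy_probs))  # the two shortest items
```

Each round of crowd labels then retrains the model, so annotation effort concentrates where it most improves accuracy rather than being spread uniformly over the data.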
Rama Chellappa, University of Maryland, College Park, asked if there is any research on Turkers who do both image and natural language processing work. Callison-Burch said that work at the intersection of language and vision is only just beginning and added that different language capabilities allow for different tasks. Jonathan Fiscus, National Institute of Standards and Technology, mentioned a group at the National Geospatial-Intelligence Agency that is currently building a new crowdsourcing platform. Kathy McKeown, Columbia University, asked if Callison-Burch could discuss overall systems that had components that would sometimes be machine and sometimes human. He responded that the crowd component is usually placed at the start of the pipeline; however, it would be beneficial to have a component in a live pipeline that made a decision in real time about whether the predictions being made by the machine learning algorithms were good enough, and, if so, the work could be routed there. If not, the work could be routed to the crowd instead.
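The real-time routing Callison-Burch describes in response to McKeown can be sketched as a confidence-threshold dispatcher. Everything here is an assumed illustration: the threshold value, the function names, and the stand-in model and crowd calls are hypothetical, chosen only to show the routing logic.

```python
# Hypothetical sketch of a live human-in-the-loop pipeline component:
# trust the machine learning prediction when it is confident enough,
# and route the item to crowd workers otherwise.

CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff; tuned per task in practice

def route(item, model_predict, crowd_annotate):
    """Return (label, source): the model's label when its confidence
    clears the threshold, otherwise a crowd-supplied label."""
    label, confidence = model_predict(item)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "machine"
    return crowd_annotate(item), "crowd"

# Toy stand-ins for a real model and a crowdsourcing call:
label, source = route(
    "some input",
    model_predict=lambda x: ("positive", 0.60),  # low confidence
    crowd_annotate=lambda x: "negative",
)
print(label, source)  # negative crowd
```

This mirrors the design choice raised in the discussion: rather than fixing the crowd at the start of the pipeline, the decision of machine versus human is made per item, at run time, based on prediction quality.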