Assessing and Improving AI Trustworthiness: Current Contexts and Concerns
Proceedings of a Workshop—in Brief
On March 3–4, 2021, the National Academies of Sciences, Engineering, and Medicine held a workshop exploring current approaches to understanding and enhancing trustworthy artificial intelligence (AI) and identifying potential paths toward improved assessments of AI trustworthiness. Rather than proceed from a given definition of “trustworthiness,” the committee was tasked with clarifying the concept by exploring various attributes of trustworthiness. The workshop was sponsored by the National Institute of Standards and Technology (NIST), which recently launched an AI trustworthiness initiative. In a series of five panel discussions and one keynote address, the workshop planning committee led participants through an overview of current practices in AI trustworthiness, attributes of trustworthy systems, and tools and assessments to better understand and communicate a system’s trustworthiness. This proceedings concludes with four common themes identified at the workshop.
In her welcoming remarks, Elham Tabassi, chief of staff for NIST’s Information Technology Laboratory, positioned the workshop in the context of NIST’s ongoing work to advance the development and deployment of more trustworthy AI systems. NIST’s goals for public engagement, in addition to this workshop, include
- Gaining a deeper understanding of the technical requirements of trustworthy AI, moving from general statements of ethical principles to specific engineering requirements.
- Exploring specific requirements, including safety (mitigating the risk that a system produces new harms), security (monitoring a system’s integrity), interpretability (providing a framework for comprehending and interrogating a system’s outputs), privacy (protecting sensitive individual information in a system), fairness (developing systems whose outputs are not affected by irrelevant, sensitive personal attributes), robustness (designing systems whose outputs change in predictable ways as inputs change), and generalizability (designing systems that can perform reliably when presented with real-world data outside their training and testing sets).
- Developing a common vocabulary for discussing these requirements across the multiple relevant academic disciplines and diverse stakeholder groups.
Eric Horvitz of Microsoft Research, a member of the National Security Commission on Artificial Intelligence (NSCAI), provided remarks based on the NSCAI’s final report,1 which was published just before the workshop. He began with a brief overview of the NSCAI’s origins and structure before discussing its three objectives: (1) understanding ethical considerations around AI as they relate to national security, (2) promoting data standards and sharing, and (3) anticipating the evolution of AI over time. Horvitz’s remarks focused on the commission’s recommendations on upholding democratic values in AI, establishing confidence in AI, and accelerating the development of AI.
Horvitz emphasized the need to place the rule of law at the core of AI systems, ensuring their compliance with constitutional principles of nondiscrimination, privacy, and due process and seeking to prevent their abuse by authoritarian regimes. The report highlights a growing set of tools to analyze AI systems for fairness—for instance, helping to understand how error rates are distributed across different cohorts of a sample and how efforts to reduce certain errors may exacerbate others. The commission’s report also recommends investment in and adoption of AI systems to enhance auditing and oversight to promote privacy and civil rights. To better establish trust in AI systems, Horvitz pointed to the importance of testing, evaluation, verification, and validation methods. The NSCAI report recommends that NIST provide performance metrics, standards, and tools for AI models, data, and training methods. Last, Horvitz described NSCAI recommendations on future investments that would improve the assessment of AI systems: identifying best practices around data, model, and system documentation; facilitating the development of third-party test centers for AI systems; and providing increased support for research and development activity.
AI TRUSTWORTHINESS IN CONTEXT
The first panel brought together experts from three application areas—finance, health, and transportation—to discuss the challenges of creating trustworthy AI systems in these domains. David Palmer of the Federal Reserve Board opened by discussing risk in financial modeling in general, and how it relates to AI. Palmer noted that oversight
authorities expect banks to pursue responsible innovation and properly manage risk. A major risk for banks is the limited transparency and interpretability of some financial models, which can introduce uncertainty for banks (and for regulators) as to whether a given tool is suitable and could cause consumers harm. Today, as more machine learning (ML) is used, regulators need to ensure that developers use good-quality data and appropriate analytical techniques. Palmer closed by suggesting that an assessment of any AI approach should address both the conceptual soundness of the approach and its demonstrated performance in the real world.
Agus Sudjianto of Wells Fargo discussed the use of artificial intelligence in financial modeling, highlighting that all models have weaknesses and may cause harm. Sudjianto provided examples such as mis-hedging, which creates market risks; missed detection, which creates compliance risks; and privacy and fairness errors, which create legal risks. Validating financial models should thus assess not only the outputs of these tools but also the unintended outcomes, such as discriminatory lending practices, that might result from the use of flawed tools. Model explainability is an important requirement for identifying the root causes or sources of harm. Despite progress on post hoc explainability tools, Sudjianto argued that ML models that are self-explanatory or inherently interpretable should be used for critical applications such as credit approval.
Bakul Patel of the U.S. Food and Drug Administration (FDA) discussed the agency’s efforts to foster responsible health innovation, which he characterized as lowering burdens on both industry and regulators while meeting core requirements of patient safety, product quality, clinical responsibility, cybersecurity responsibility, and a proactive culture of ongoing AI monitoring. The FDA’s regulatory approach stratifies risk based on two factors: the criticality of the health care situation (ranging from routine visits to life-or-death situations) and the significance of the information (ranging from clinical management information to diagnostic information). For higher-risk AI applications, the FDA proposed a total product life cycle (TPLC) approach in a 2019 white paper.2 Such a regulatory approach incorporates setting expectations for good ML practices among health software developers, systems for device review before marketing and after modifications, and ongoing real-world performance monitoring to ensure continued safety and effectiveness.
Mark Sendak of the Duke Institute for Health Innovation introduced his team’s work in creating SepsisWatch, an ML tool that identifies patients at high risk of sepsis. This project sought to develop an AI tool that would earn the trust of front-line medical workers to ensure broad adoption. Developing the SepsisWatch tool required engagement with a broad array of stakeholders throughout the development life cycle, ultimately leading to the creation of a multi-stakeholder governance committee. Close collaboration among ML researchers and front-line workers allowed for iterative improvements to SepsisWatch and the integration of the tool into front-line health practice.
Chris Hart, former chair of the National Transportation Safety Board (NTSB), discussed quality assurance protocols for aviation software and how similar protocols might be used to ensure the quality of AI used in transportation. As Hart pointed out, existing protocols ensure the engagement of end users in software design, training for affected users and parties, and operational feedback to improve future design and training. Several factors may complicate the applicability of these processes to AI-based software. For instance, it may be difficult to simulate the behavior of AI systems across all operational situations, and the behavior may change with experience; users may have a wide range of experiences, backgrounds, and training, unlike commercial pilots, who are highly trained; and the infrastructure to collect and integrate operational feedback may not exist.
2 Food and Drug Administration, 2019, “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD): Discussion Paper and Request for Feedback,” https://www.fda.gov/media/122535/download.
Deborah Prince of Underwriters Laboratories (UL) concluded the panel presentations by discussing her standardization work on UL 4600, the Standard for Safety for the Evaluation of Autonomous Products, which provides requirements for fully autonomous systems that move, such as self-driving cars. This standard uses a safety case approach with structured narratives that lead system designers to think through a wide array of scenarios, environments, and risks and present credible proof that systems are robust enough to mitigate these risks. Through these safety cases, the designers are asked to consider criteria that are either mandatory (must be met in all cases), required (must be met unless the given criterion is not applicable to a given system), highly recommended, or recommended. The safety case approach to standards development also incorporates feedback from real-world models, allowing the organization to change or add criteria and incorporate best-practice examples, pitfalls to avoid, and key considerations into future iterations of the UL 4600 standard.
In the discussion that followed, the panel organizers and moderators, Anupam Datta and Deirdre Mulligan, prompted participants to share their thoughts on the suitability of the safety case model for evaluating systems other than autonomous systems. Sendak noted that in developing the SepsisWatch tool, his team developed a similar safety case methodology in consultation with front-line staff. The participants also discussed the role of training in standardizing and validating AI in their industries. Hart noted that inadequate training played a considerable role in the accident history of the Boeing 737 MAX. Palmer observed that the Federal Reserve Board develops principles-based guidance to ensure that banks understand the operations and risks of their models, but that prescriptive guidance specific to AI is difficult to produce and likely less useful given the rapid evolution of AI and its applications.
ATTRIBUTES OF AI TRUSTWORTHINESS: ROBUST, EXPLAINABLE, AND GENERALIZABLE
Katherine Heller of Duke University and Google Medical Brain opened the session by discussing her work on underspecification in AI.3 This work addresses the significant gap between AI systems’ ability to generalize well across their training and testing domains and their ability to encode a credible claim about the world. That is, if a model learns from features unique to the training domain but not observed in the broader world, it loses its ability to make credible claims about that broader world. This observation points to the need for better-designed evaluations and metrics that reflect a broader testing domain. Heller then discussed the generalizability problem in the context of health care. As an example, models trained to identify serious lesions are typically built using clinical images of patients who have already received surgery for serious lesions. These AI models then rely heavily on detecting the surgical scars to improve accuracy. Thus, the models exhibit problems generalizing to the broader population of patients with serious skin lesions who may require, but have not yet received, surgery.
3 A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, et al., 2020, “Underspecification Presents Challenges for Credibility in Modern Machine Learning,” arXiv:2011.03395 [cs, stat], November 24, http://arxiv.org/abs/2011.03395.
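The scar shortcut Heller described can be caricatured in a few lines of Python. Everything below is invented for illustration (the feature name, the toy model, and the data are hypothetical), but it shows how a spurious training-domain feature produces a large generalization gap:

```python
def accuracy(model, dataset):
    """Fraction of (features, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)

# Hypothetical "scar detector": predicts a serious lesion whenever a
# surgical scar is present, a feature unique to the training domain.
scar_model = lambda features: 1 if features["has_scar"] else 0

# In the clinical training/testing domain, serious lesions co-occur
# with surgical scars, so the spurious rule looks flawless.
in_domain = [({"has_scar": 1}, 1), ({"has_scar": 0}, 0)]

# In the broader world, serious lesions may not yet have been operated
# on, and the spurious rule collapses.
broader_world = [({"has_scar": 0}, 1), ({"has_scar": 0}, 0)]

gap = accuracy(scar_model, in_domain) - accuracy(scar_model, broader_world)
```

The point of the sketch is that in-domain evaluation alone cannot distinguish the scar shortcut from a genuine lesion detector; only evaluation on shifted data reveals the gap.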
Aleksander Madry of the Massachusetts Institute of Technology discussed the role of robustness in AI trustworthiness. System designers want models that are reliable even in less-than-ideal circumstances, are resistant to adversarial manipulation and biased data, and are interpretable. Madry suggested that achieving AI trustworthiness is a matter of alignment rather than prescription. The fundamental misalignment is that an ML model that has achieved success at a given task, such as image recognition, has not necessarily learned the desired concepts. Ensuring trustworthiness in this context, then, requires rethinking AI assessment. As an alternative to ensuring trustworthiness through technical specifications that may be insufficient to adequately capture the dimensions of trustworthiness, Madry suggested borrowing safety practices from medicine, wherein patients trust doctors in part because there is an infrastructure for reporting errors and malpractice, helping to align physician performance with understood principles for good medical care. By analogy, then, a form of reporting infrastructure could be developed to help oversee AI models and ensure compliance with principles for good AI development and use.
Martin Wattenberg of Google’s People and AI Research (PAIR) initiative discussed tools for better understanding ML systems. These tools promote participatory ML development and assessment and also probe whether models are correctly designed and whether they are performing as expected. PAIR’s Facets tool, for example, makes it possible for system developers to visualize the data sets that drive ML systems. PAIR’s What-If tool probes model performance and allows users to pose counterfactual scenarios to AI models and evaluate the resulting output. Last, the Testing with Concept Activation Vectors (TCAV) technique seeks to provide an understanding of the internal state of neural networks using human-understandable concepts. For instance, the TCAV tool allows users to understand a machine vision algorithm’s outputs not as a function of individual pixels but of attributes and patterns. As an example, the tool could help to clarify how a computer vision model’s output, such as identifying an image of a zebra, changes with the presence of simple, user-defined attributes such as stripes.
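The kind of counterfactual probing the What-If tool supports can be sketched abstractly. The function and the toy loan model below are illustrative stand-ins, not PAIR's actual API:

```python
def counterfactual_probe(model, example, feature, new_value):
    """Return the model's output on an example and on a copy of that
    example with a single feature changed, What-If style."""
    modified = dict(example)
    modified[feature] = new_value
    return model(example), model(modified)

# Hypothetical loan model whose decision depends only on income.
loan_model = lambda ex: "approve" if ex["income"] >= 50_000 else "deny"

# Probing shows how the decision changes when only income is altered.
before, after = counterfactual_probe(
    loan_model, {"income": 40_000, "age": 30}, "income", 60_000
)
```

Comparing `before` and `after` across many such single-feature edits is one simple way for a user to build an understanding of which attributes actually drive a model's output.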
In discussion, the session organizers and moderators, Ben Shneiderman and Dj Dvijotham, asked panelists to explore different aspects of system trustworthiness assessment. Heller observed that the easiest way to assess trustworthiness is to ask people, but also noted the difficulty in identifying which people should be asked. Wattenberg noted that his team seeks to calibrate trust for each model to create the correct impressions for users, understanding that a perfectly trusted system with 100 percent accuracy may not be possible. The participants also shared thoughts on how to determine when adequate due diligence for trustworthiness has been conducted.
Audience questions engendered discussion of the trade-offs among various trustworthiness attributes such as explainability, security, and privacy. Madry commented that so-called security through obscurity is actually counterproductive: opaque systems can be difficult to diagnose, whereas open systems can foster the creation of a community that is invested in the security of a model. The panelists also discussed
user feedback in AI systems. Madry noted the low level of user engagement with existing tools for soliciting user feedback, and the need for better incentives to ensure responsiveness to user feedback.
ATTRIBUTES OF AI TRUSTWORTHINESS: FAIR, PRIVATE, CONTESTABLE
Jenn Wortman Vaughan of Microsoft Research opened the session with a presentation on her work on responsible AI and Microsoft’s Aether (AI, Ethics, and Effects in Engineering and Research) Committee. First, she discussed a 2018 Microsoft Research investigation4 of industry challenges and needs for developing fair ML systems. Structured interviews with 35 practitioners across 10 technology companies uncovered differences between academic and industry approaches to ML fairness. Academic work on the topic, for instance, has typically assumed data as a given, focusing instead on model design and optimization for specific fairness criteria, and has concentrated largely on binary classification problems. In real-world practice, however, models present a much wider range of issues, and quantitative fairness metrics may not be appropriate. Human factors further complicate the picture, as humans can introduce bias at all stages of the development cycle. The investigation found that other high-stakes domains, such as medicine and aviation, mitigate risk through checklists, raising the question of whether similar practices could be usefully brought into AI development. Wortman Vaughan then described an AI fairness checklist, developed in 2019 and currently being piloted, which seeks to identify potential fairness harms throughout the development process.5 Finally, Wortman Vaughan offered remarks on the importance of interpretability if stakeholders are to have sufficient understanding of how models operate. Her remarks highlighted the difficulty of measuring interpretability, or even defining it, as different stakeholders have different needs for interpretability. She suggested that developers test which features of a model enable users to better achieve their goals, think about interpretability beyond the model itself, and design and evaluate interpretability in the context of stakeholder needs.
Tad Hirsch of Northeastern University began by underscoring that trust is not an inherent attribute of technical artifacts but rather is enacted and situated in practice. As an example, Hirsch discussed automated essay scoring tools. These systems can be adopted at both a macro level (e.g., used throughout school districts) and a micro level (e.g., used by individual teachers to lessen workload). He noted that when these systems are introduced, trust is often understood in terms of performance, accuracy, precision, and reliability. However, trustworthiness is most salient when the outputs of these tools are under dispute—that is, when there is a question about a tool’s accuracy, fairness, and so on. In such cases, the question of fairness extends beyond whether a given algorithm is fair to whether an entire social context (including the tool), from teachers to school boards, is fair. Hirsch discussed the need to design for inevitable
4 K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudík, and H. Wallach, 2019, “Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?” Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, May 2, 2019, 1–16, https://doi.org/10.1145/3290605.3300830.
5 M. Madaio, L. Stark, J. Wortman Vaughan, and H. Wallach, 2020, “Co-Designing Checklists to Understand Organizational Challenges and Opportunities Around Fairness in AI,” Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, April 25–30, 2020, 1–20, http://dx.doi.org/10.1145/3313831.3376445.
disputes, emphasizing principles of disclosure (informing affected populations that an algorithmic system is in use), collaboration rather than confrontation between parties disputing a potentially unfair result, specificity (explanations of how an output is derived should be specific to the individual input), a right to an established appeals process, human operator training, and oversight.
Ilya Mironov of Facebook followed with a presentation on the challenges of protecting privacy in AI systems. As an example, he used the Generative Pre-trained Transformer 2 (GPT-2) text generation model, which has become well known for its ability to provide reasonably intelligible responses to text prompts. Mironov described results from a December 2020 paper,6 in which adversarial attacks on the GPT-2 model succeeded in extracting privacy-sensitive data from the training corpus. Much of this material had appeared only once in the training data, and some included sensitive information such as names and addresses. Mironov then discussed how differential privacy techniques could be applied to AI systems. These techniques, formalized in a 2006 paper,7 have since been integrated into widely used ML frameworks such as PyTorch and used to safeguard public data sets, notably the 2020 U.S. Census data products. Last, Mironov warned against treating all AI systems identically, noting that different privacy-preservation techniques are appropriate depending on how a system is designed. For instance, algorithms that rely on cloud computing could use secure enclaves to process and store sensitive data, while device personalization algorithms can be secured by keeping the relevant data only on the device itself.
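The core mechanism behind differential privacy can be illustrated with a minimal sketch of the Laplace mechanism from the 2006 paper. This is a teaching example with invented data, not production code; real systems should use a vetted library:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from a Laplace(0, scale) distribution by inverse transform."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float, seed: int = 0) -> float:
    """Release a counting query with epsilon-differential privacy.

    A count has sensitivity 1: adding or removing one record changes
    the true answer by at most 1, so Laplace noise with scale
    1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, random.Random(seed))

# Invented records: release a noisy count of ages 40 and over.
ages = [34, 41, 29, 57, 62, 45]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
```

Smaller values of `epsilon` add more noise and give stronger privacy; the trade-off between `epsilon` and accuracy is exactly the kind of quantity that standards bodies could help practitioners calibrate.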
Michael Kearns of the University of Pennsylvania then discussed the difficulty of achieving fairness. As he noted, there are many different ways to measure fairness quantitatively, and they cannot all be satisfied simultaneously. Kearns observed that this echoes Arrow’s impossibility theorem, wherein multiple criteria for deriving social choice from individual preferences cannot be simultaneously met under a broad set of conditions. Even when multiple types of fairness can be achieved simultaneously, Kearns noted, there may be unavoidable trade-offs between fairness and accuracy, and even between fairness for different subgroups. Kearns also noted that where progress has been made toward AI fairness, it has typically come from applying a group-level definition of fairness—for instance, equalizing rejection rates for loan applicants of all races. These are blunt instruments and cannot guarantee fairness at an individual level. In recent years, however, work has been done to improve group-level fairness at a more granular level—for instance, ensuring reasonably similar error rates for intersectional categories that account for race, gender, age, income, and so on. Kearns also described the growth of so-called minimax fairness measures, which seek to ensure the best possible outcome for the groups with the worst outcomes.
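These group-level notions reduce to simple bookkeeping over per-group statistics. The sketch below uses invented decision data, and the function names are ours, not Kearns's:

```python
from collections import defaultdict

def acceptance_rates(decisions):
    """Per-group positive-decision rates from (group, decision) pairs,
    where decision is 1 (e.g., loan approved) or 0 (rejected)."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in decisions:
        totals[group] += 1
        positives[group] += decision
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_gap(decisions):
    """Group-level fairness as equalized acceptance rates: the gap is
    the spread between the best- and worst-treated groups."""
    rates = acceptance_rates(decisions).values()
    return max(rates) - min(rates)

def minimax_objective(error_rates_by_group):
    """Minimax fairness scores a model by its worst-off group."""
    return max(error_rates_by_group.values())

# Invented toy decisions for two groups.
decisions = [("a", 1), ("a", 1), ("a", 0), ("b", 1), ("b", 0), ("b", 0)]
gap = demographic_parity_gap(decisions)  # 2/3 - 1/3 = 1/3
```

Note what the sketch does not capture: a zero parity gap says nothing about fairness to any individual, which is exactly the bluntness Kearns described.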
The session organizers and moderators, Deirdre Mulligan and Aleksander Madry, then led a moderated discussion. The discussion began by asking the panelists what would constitute an acceptable baseline for assessing fairness, privacy, and other desiderata of trustworthy AI systems. Wortman Vaughan noted that, although humans are prone to error and unfair behavior, people generally expect machine systems to
6 N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, et al., 2020, “Extracting Training Data from Large Language Models,” arXiv:2012.07805 [cs], December 14, http://arxiv.org/abs/2012.07805.
7 C. Dwork, F. McSherry, K. Nissim, and A. Smith, 2016, “Calibrating Noise to Sensitivity in Private Data Analysis,” Journal of Privacy and Confidentiality 7(3):17–51, https://doi.org/10.29012/jpc.v7i3.405.
outperform humans if they are to be deemed acceptably fair. Hirsch suggested that the key question is how system designers should approach uncertainty. There are different thresholds for tolerating uncertainty, depending on context, but designers ought to plan for the cases in which inaccurate or biased predictions arise. The panelists also discussed prospects for improving fairness and privacy in AI. Mironov noted that just as the cryptography community took time to converge on standards and communicate them to nonspecialists in order to create secure products, it may similarly take time for standards and practices around differential privacy to emerge. Kearns noted the lack of empirical tests to verify differential privacy, unlike fairness, for which measuring a model’s performance is a simple matter of choosing a metric. Hirsch suggested that instead of developing metrics for privacy, which could be difficult, time-consuming, and inconclusive, NIST could emphasize standards and operationalize principles through practices such as checklists and third-party verification.
MEASUREMENT FOR AI SYSTEMS
Sriram Rajamani of Microsoft Research opened the panel with a presentation, informed by a background in formal methods, on trustworthiness as conformance to a set of specifications. In software engineering, he noted, it is possible to develop a reasonable set of specifications for performance. However, for AI systems derived from training data, it is much more difficult to formally describe a set of acceptable outputs for a given set of system inputs. While describing the domain of acceptable outputs for all possible inputs is not feasible, AI developers can nevertheless advance trustworthiness through conformance to a set of hygiene conditions such as robustness (i.e., that small changes in input do not cause large changes in output) and fairness (i.e., that output is independent of certain sensitive factors). Furthermore, developers of some safety-critical systems such as autonomous vehicles have begun to explore safety specifications that allow developers to define properties such as collision avoidance using safety margins.
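The robustness hygiene condition can be probed empirically, if not proven. The sketch below is an illustrative sampling check over assumed toy models; sampling can refute robustness but cannot establish it, which is where formal methods come in:

```python
import random

def probe_local_robustness(model, x, delta, tolerance, probes=200, seed=0):
    """Empirically check that perturbing each input coordinate by at
    most delta never moves the model's output by more than tolerance.
    Returns False as soon as a violating perturbation is found."""
    rng = random.Random(seed)
    baseline = model(x)
    for _ in range(probes):
        perturbed = [xi + rng.uniform(-delta, delta) for xi in x]
        if abs(model(perturbed) - baseline) > tolerance:
            return False
    return True

# Two invented models: one smooth, one with a hard decision boundary.
smooth_model = lambda v: sum(v)                          # changes gradually
brittle_model = lambda v: 100.0 if sum(v) >= 6 else 0.0  # jumps at sum == 6

x = [1.0, 2.0, 3.0]  # sits exactly on the brittle model's boundary
```

Here the smooth model passes the probe while the brittle model fails it, because inputs near the decision boundary flip its output dramatically.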
Alexandra Chouldechova of Carnegie Mellon University presented her work on risk assessment instruments, statistical models that output the probability of an undesirable outcome based on a set of features. In many cases, these models suffer from omitted payoff bias, wherein high predictive accuracy does not always translate into optimal decision making, because decisions rely on factors beyond simple prediction accuracy. Ensuring trustworthiness, then, may require a broader perspective than a tool’s adherence to certain specifications. Chouldechova offered a framework of organizational justice, which incorporates interpersonal, procedural, distributive, and informational perspectives on justice. This frame of assessment also incorporates ideas of transparency, robustness, and fairness, and addresses people’s need to feel that they are being treated with dignity and respect by those making decisions.
Nico Economou of H5 followed with a presentation focused on lessons about measurement from the field of electronic discovery (e-discovery). As a subset of text retrieval, e-discovery has been the subject of previous NIST evaluation through its Text Retrieval Conference (TREC). These studies have shown that e-discovery algorithms typically trade off a generally low degree of precision—that is, the proportion of identified material that is relevant—in favor of better recall—that is, the proportion of relevant
material that is identified. However, an inability to translate these accuracy measures into the legal context created a false impression of the efficacy of these AI tools. Furthermore, operators of these algorithms typically overestimate the accuracy of the tools relative to the accuracy measured in the NIST evaluations.
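Precision and recall as defined here are standard retrieval measures; a minimal sketch with invented document identifiers makes the trade-off concrete:

```python
def precision_recall(retrieved, relevant):
    """Precision: proportion of identified (retrieved) material that is
    relevant. Recall: proportion of relevant material that is identified."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# An e-discovery run that casts a wide net: 9 of the 10 relevant
# documents are found (high recall), but only 9 of the 60 retrieved
# documents are actually relevant (low precision).
retrieved_docs = set(range(60))         # invented document IDs 0..59
relevant_docs = set(range(51, 61))      # invented: documents 51..60
p, r = precision_recall(retrieved_docs, relevant_docs)
```

Reporting only one of the two numbers, as Economou's example suggests can happen in legal settings, leaves a misleading impression of a tool's efficacy.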
Suchi Saria of Johns Hopkins University then spoke on achieving trustworthiness through a focus on stability and reliability. Saria noted that trustworthiness is a vague concept for engineering purposes and emphasized decoupling subjective understandings of trust from the quality engineering aspects of trust—namely, ensuring that systems behave as intended and better understanding the circumstances in which systems exhibit unintended behavior. In health care, for example, it is especially important that AI systems be trustworthy, given the stakes of the decisions being made. The FDA has promoted an approach to trustworthiness that emphasizes evaluating the various potential risks of using an AI tool and identifying appropriate mitigations. One technique for achieving this is to quantify the situations in which a tool performs reliably and the situations in which it does not. In medicine, undesirable behavior can occur when a system’s performance degrades when it is presented with perturbations of the data. One can then characterize model performance in response to these changes, identify which changes reduce performance, and estimate the probability of encountering these changes in an operational context—thereby helping to identify the scenarios in which a tool can be used.
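One way to read this technique is as a stress-testing loop over named data perturbations. The sketch below is a schematic rendering with invented data, models, and thresholds, not Saria's actual methodology:

```python
def accuracy(model, dataset):
    """Fraction of (input, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in dataset) / len(dataset)

def characterize_reliability(model, clean_data, perturbations, min_accuracy):
    """Apply each named perturbation to the evaluation data and report
    which shifts keep the model above the required accuracy."""
    report = {}
    for name, perturb in perturbations.items():
        shifted = [(perturb(x), y) for x, y in clean_data]
        report[name] = accuracy(model, shifted) >= min_accuracy
    return report

# Hypothetical threshold model on a single lab value, with two shifts.
model = lambda value: 1 if value >= 10.0 else 0
clean = [(8.0, 0), (9.0, 0), (11.0, 1), (12.0, 1)]
report = characterize_reliability(
    model,
    clean,
    {
        "small_drift": lambda v: v + 0.1,  # benign calibration drift
        "unit_change": lambda v: v * 0.1,  # units changed upstream
    },
    min_accuracy=0.75,
)
```

A report like this, combined with estimates of how likely each shift is in deployment, delineates the operating scenarios in which the tool can responsibly be used.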
In the final presentation of the panel, Joaquin Quiñonero Candela of Facebook discussed practical lessons in addressing AI fairness, which, as he noted, is a process rather than a property of an algorithm. While this process does not converge on one solution, progress can be made. Quiñonero emphasized the lenses of implementation, policies, and outcomes for assessing fairness. For example, in the case of Facebook Portal, a videoconferencing device that uses AI to focus the camera but that initially performed less effectively with darker skin tones, fairness can be a matter of implementing a minimum quality of service. Meanwhile, in the case of AI-powered content moderation to combat misinformation during the Indian elections, fairness involves distributing a scarce resource—namely, human moderators. A fair policy may involve ensuring that risky content is equally likely to be flagged for human review regardless of region or language. Last, addressing hate speech on Facebook may require an outcome-focused notion of fairness that incorporates equity considerations, as hate speech against certain groups may have much more pronounced consequences than hate speech against others.
In moderated discussion, led by session organizers and moderators Jeannette Wing and Susan Dumais, the participants both posed questions to one another and also offered suggestions to NIST for future work. The panelists discussed the prospects for a research agenda connecting the development of quantitative metrics and assessments to notions of procedural and interpersonal justice. Saria noted the difficulty in incorporating both qualitative and quantitative measures. Researchers tend to prefer metrics that allow undesirable behavior to be quantified, but they must also embrace the qualitative and specific scenarios in which assessments are made. Dumais referred to the work of an earlier panelist, Wortman Vaughan, on checklists and transparency notes, which attempt to accurately depict how a model works and how it was made. The panelists also shared thoughts on the difficulties of self-assessment, as initially presented in Economou’s presentation. As Chouldechova observed, similar issues arise with the use of risk assessment instruments in the criminal justice system,
because judges may not always understand the probabilities produced by the model. Quiñonero suggested that, given the likelihood that AI regulations will rely heavily on self-assessment and self-regulation, any standardization of procedures, benchmarks, and measures would be valuable insofar as it helps to create a shared understanding of any assessment of the underlying data. The participants also offered other suggestions for NIST, including developing annual benchmarks for defined, desirable objectives on the use of AI, identifying measurable properties, presenting findings from NIST’s studies on these topics in both a scientific format and a format more accessible to ordinary citizens, and producing real-world case studies.
WORKSHOP SYNTHESIS AND OUTCOMES
In the final session of the workshop, members of the workshop planning committee served as panelists, identified key themes emerging from the discussions, and discussed potential future work to advance the development of trustworthy AI systems. The four common themes identified were:
- Expand the concept of “measurement”: As Dumais noted, trustworthiness is a deeply human-centered notion and does not lend itself easily to simple measurement. Rather, there are many levels at which measurement and assessment could take place, from evaluating the code of a given model to assessing public perceptions. Furthermore, so-called hybrid systems that incorporate both computational and human elements present a distinct measurement problem. Mulligan discussed the importance of broadening the unit of analysis in “measurement” by developing tools and techniques to interrogate sociotechnical systems rather than individual components. As Wing observed, many relevant attributes of trustworthiness may not lend themselves to formalizable metrics, but the research community should nevertheless pursue such metrics for as many attributes as possible.
- Independent oversight and stakeholder review: Tools for stakeholder review can include visualization tools, such as those developed by Wattenberg at Google, that help developers better understand how AI systems actually work. The role of end users and operators was also discussed at several points throughout the workshop; these groups can play an important role in the design and evaluation of AI models. Multiple views were expressed about the importance and feasibility of training as a means of improving trustworthiness: Wing noted the historical precedent of the gains realized by training a specialized cadre of IT specialists, while Dumais questioned whether adequately training all users of AI algorithms would be feasible.
- Need for ongoing measurement and incorporating feedback: Datta noted that multiple speakers had highlighted the need for testing approaches that allow developers to assess how well their tools generalize to the broader world. Continuously assessing how models respond to new data outside their training domains enables developers to quickly identify when AI performance degrades on inputs that differ from previously encountered examples.
- Designing for fallibility: Another major theme stemmed from the recognition that AI models cannot guarantee absolute accuracy. As Dumais noted, developers should devote as much attention to designing for inevitable failure modes as to pursuing performance gains.
The workshop planning committee concluded the session with a discussion of potential opportunities for NIST to promote trustworthy AI, ranging in scope from expanding existing topical work to launching new activities. For instance, Shneiderman suggested that NIST continue and expand its work on explainable AI (XAI) to encourage more prospective explanatory models and to promote the development of audit trail mechanisms that allow a model’s operations to be interrogated.
As longer-term and larger-scale work, Wing discussed the possibility of NIST fostering the development of a third-party repository for sensitive data, or using its convening power to establish a series of sector-specific external review boards to complement the internal review processes that currently exist at many companies. Finally, Shneiderman proposed a Trustworthiness Conference, or TRUC, analogous to NIST’s TREC work. Such a conference could propose challenges and ongoing activity streams that would bring together industry and academia to improve AI trustworthiness in a broad range of application domains.
DISCLAIMER: This Proceedings of a Workshop—in Brief was prepared by Brendan Roach, National Academies of Sciences, Engineering, and Medicine, as a factual summary of what occurred at the workshop. The statements made are those of the individual workshop participants and do not necessarily represent the views of all workshop participants; the planning committee; or the National Academies of Sciences, Engineering, and Medicine.
PLANNING COMMITTEE: Anupam Datta, Chair, Carnegie Mellon University; Susan Dumais, Microsoft Research (NAE); Krishnamurthy (Dj) Dvijotham, DeepMind; Aleksander Madry, Massachusetts Institute of Technology; Deirdre Mulligan, University of California, Berkeley; Ben Shneiderman, University of Maryland, College Park (NAE); and Jeannette Wing, Columbia University.
BOARD STAFF: Jon Eisenberg and Brendan Roach, Computer Science and Telecommunications Board, Division on Engineering and Physical Sciences, National Academies of Sciences, Engineering, and Medicine.
REVIEWERS: To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by Deborah Prince, Underwriters Laboratories; Ben Shneiderman, University of Maryland, College Park (NAE); Moshe Y. Vardi, Rice University (NAS/NAE); and Christo Wilson, Northeastern University. Elizabeth G. Panos, National Academies of Sciences, Engineering, and Medicine, served as the review coordinator.
SPONSORS: This workshop was sponsored by the National Institute of Standards and Technology.
For additional information regarding the workshop, visit https://www.nationalacademies.org/event/03-03-2021/assessing-and-improving-ai-trustworthiness-current-contexts-potential-paths.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2021. Assessing and Improving AI Trustworthiness: Current Contexts and Concerns: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press. doi: https://doi.org/10.17226/26208.
Division on Engineering and Physical Sciences
Copyright 2021 by the National Academy of Sciences. All rights reserved.