National Academies Press: OpenBook

Empowering the Defense Acquisition Workforce to Improve Mission Outcomes Using Data Science (2021)

Chapter: 3 Data Science and the Data Life Cycle: The Short Version

Suggested Citation:"3 Data Science and the Data Life Cycle: The Short Version." National Academies of Sciences, Engineering, and Medicine. 2021. Empowering the Defense Acquisition Workforce to Improve Mission Outcomes Using Data Science. Washington, DC: The National Academies Press. doi: 10.17226/25979.

3

Data Science and the Data Life Cycle: The Short Version

WHAT IS DATA SCIENCE AND WHO DOES IT?

Everyone consumes, processes, and interacts with data every day, and everyone makes decisions based, in part, on data. For example, we use weather forecasts to make plans for a day’s activities. We purchase groceries based on combining our past meal history and future dining plans. We scroll through and choose from a list of recommended news articles selected for us by algorithms that combine our interests with current events. We make decisions that impact our national security by integrating, visualizing, and analyzing different data sources. These examples show the wide range of areas in which data are routinely used—and in which the more systematic use of data science might yield better decisions.

In this section, the committee more clearly defines data science by describing it as a multi-phase process, facilitated by people, that extracts value from data to answer posed questions. Understanding this definition and the phases within it is fundamental to incorporating data science into defense acquisition. In Box 3.1, the committee outlines three key features of data science that shape this chapter and summarize the basics for the defense acquisition community.

DATA SCIENCE IS COLLABORATIVE AND CYCLICAL

Many of the early definitions of data science focused on identifying static collections of skills necessary for workforce members to have the title of “data scientist.” Typically visualized via Venn diagrams, these definitions highlighted overlapping academic disciplines (including computer science, engineering, mathematics, statistics, and social sciences), which emphasized the cross-disciplinary nature of data science at the central intersection of the diagram. Figure 3.1 below shows an example of this type of data science Venn diagram (Geringer 2014), although in this case the word “unicorn” is in the center, indicating that finding a single person with mastery of all related disciplines is akin to searching for a mythical creature. The Venn diagram framework, while a useful visual from a traditional disciplinary point of view, does not capture data science in practice.

Noting a tremendous demand for data scientists in the United States and some uncertainty for their education, the National Academies of Sciences, Engineering, and Medicine undertook an earlier consensus study titled Data Science for Undergraduates (NASEM 2018). This study included characterizing data science as being centered on “the notion of multidisciplinary and interdisciplinary approaches to extracting knowledge or insights from large quantities of complex data for use in a broad range of applications” (p. 13). Further, “data science is not just the practice of analyzing a certain data set about a particular question. It often results in the creation of processes that continuously take in new data, often from many sources, and generate refined distillations of that data, which in turn become sources for new inquiries, questions, and analyses” (p. 14). In keeping with this characterization, data science is more commonly viewed now as a process or workflow in which real problems are solved with real data through an often-cyclical set of phases requiring different skillsets.

FIGURE 3.1 Geringer data science Venn diagram. SOURCE: Copyright © 2014 by Steven Geringer, Raleigh, NC.

In her 2020 Harvard Data Science Review article, Jeannette Wing wrote, “Data science is the study of extracting value from data. ‘Value’ is subject to the interpretation by the end user and ‘extracting’ represents the work done in all phases of the data life cycle” (Wing 2020). The data life cycle included in the article appears in Figure 3.2; it shows a workflow that includes the following phases: generation, collection, processing, storage, management, analysis, visualization, and interpretation. Special emphasis is placed on the importance of security/privacy and ethical concerns in all phases. The data life cycle also begins with the integration of disparate sources of information and data, ending with dissemination, consumption, and adoption by stakeholders, which are phases crucial to optimizing the value of data.

This notion of a data life cycle has also been accepted beyond the data science community, including within the federal government. Figure 3.3, featured in the Federal Data Strategy: Improving Agency Data Skills Playbook and adopted from the National Institute of Standards and Technology (NIST), depicts a comprehensive and well-managed data life cycle with steps similar to those in Figure 3.2. Note that in Figure 3.3, as in the 2018 National Academies of Sciences, Engineering, and Medicine consensus study on data science for undergraduates, after assessment, implementation, and feedback from data consumers, the cycle returns to the beginning, where disseminated results inform the development of new inquiries for data sources.

FIGURE 3.2 Wing’s data life cycle. SOURCE: J.M. Wing, 2020, “Ten Research Challenge Areas in Data Science,” Harvard Data Science Review, https://doi.org/10.1162/99608f92.c6577b1f. Copyrighted Jeannette M. Wing (2017, 2019).

FIGURE 3.3 NIST data life cycle.

Cyclical workflow diagrams that include most aspects of the data life cycle are seen within the Department of Defense (DoD), albeit labeled primarily as “data analytics”; for example, see Figure 3.4 for a cycle that describes the Air Force “data analytics ecosystem.”

While these diagrams capture the multi-phase workflows associated with extracting value from data to solve problems, they have less emphasis on the initial questions and stakeholders, (possibly inadvertently) downplaying the central role that people play in a successful data science workflow (Marshall and Geier 2019). They also do not capture the iterative process that commonly occurs as people return to previous phases to correct errors or reexamine strategy—for example, moving back and forth between analysis and visualization as statistical modeling is fine-tuned. A data science workflow, while tending toward a direction, rarely travels in a straight line.

FIGURE 3.4 Air Force data analytics ecosystem.

In her remarks at the National Academies on January 31, 2020, Professor Sallie Ann Keller shared a data science framework that addresses many shortcomings of these diagrams. This framework, used at the Biocomplexity Institute at the University of Virginia, emphasizes that the multi-phase process of data science includes problem identification (via questions and working hypotheses), communication, dissemination, and data wrangling (Keller et al. 2020). Keller concluded her remarks by saying, “the data science framework enables creation of repeatable and measurable processes for the use of and repurposing of all data sources.”

THE DATA LIFE CYCLE AND ITS PHASES

Building upon these diagrams and for use in this report, the committee uses a workflow that (1) incorporates questions, (2) is more cyclical (as opposed to linear) and iterative, (3) communicates the need for small loops, (4) enables multiple entry points, and (5) emphasizes the role of people. With a slight modification to Wing’s definition of data science, the committee will use the definition in Box 3.2 for the remainder of the report.

Figure 3.5 depicts a bi-directional diagram with many people (including stakeholders) as a central focus along with an emphasis on data privacy, ethics, and security. Critically, data science is facilitated by people who make decisions about when to move forward and/or backward and are responsible for minimizing harm. It is important to note that the people at the center of the data life cycle can have different roles in facilitating data science. This feature will be discussed in Chapter 5.

FIGURE 3.5 Data life cycle.

The ideal entry point is the Question(s) phase; the process then moves along the blue arrow, returning to previous phases as needed along the green arrow. Once results are disseminated and interpreted, the assessment phase allows stakeholders an opportunity to refine their questions, beginning the cycle again. To better characterize how each phase in the data life cycle above contributes to data-informed decision making, the committee further describes the phases and corresponding questions and actions as follows:

  • Question—Data science extracts value from data to help answer a posed question or inform a decision. Questions are often developed from engagement with stakeholders; decisions can be path-critical turning points where leadership may need to determine whether or not to move forward with a new product, process, or program.
  • Define—Is the question or decision well-posed? Is it “answerable”? Which data are required to inform the decision or answer the question? Do the data exist already, or must they be generated? What data quality is required? Data quality includes considerations of accuracy, precision, completeness, authoritativeness, consistency, reliability, relevance, and timeliness.
  • Coordinate—Data must be accessible and available for analysis. Can current resources and infrastructure provide the required data? If not, what else is required?
  • Generate—There are many mechanisms for generating data: for example, some data are operational (e.g., from business processes or front-line operations), other data come from sensors, and yet other data originate from surveys. In the context of the question or decision, what data exist or should be created?
  • Collect—Not all generated data should be collected and stored for future use. Which data should we collect and store? How should the collected data be organized? What are the limitations of the data? How were the data created and sampled, and what impact does that have on their relevance for the question or decision?
  • Curate and Manage—In order to provide value, data must be organized, refined, and maintained with sufficient quality to support decisions and answer questions. Data curation and management have multiple sub-tasks:
    • Process—Clean, wrangle, format, compress, encrypt/decrypt, and authenticate data.
    • Store—Identify the appropriate data storage system and create appropriate metadata to maximize the ability to access and modify the data for subsequent analysis while ensuring proper data security.
    • Integrate, Fuse, Link—If relevant to underlying processes and questions, build metadata hierarchies that allow for linking data sets through common IDs.
    • Access—Implement appropriate data access methods, including system interoperability and other infrastructure to share data across pieces of the enterprise as needed.
  • Analyze—Data analysis encompasses the use of data and tools to generate insights through statistical modeling, machine learning algorithms, visualization, and human examination. These activities include descriptive statistics and predictive analysis.
  • Visualize—One of the most effective ways to present analysis results in clear, simple, interpretable forms is to use visualization techniques such as histograms, scatterplots, interactive charts, time-dependent graphs, and heat maps. Visualization methods are also commonly used prior to and during the analysis phase to help inform next steps.
  • Disseminate/Interpret—Explain what the analysis and results mean in a way that is interpretable and appropriate to the decision maker and stakeholders.
  • Assess—Continuously monitor and improve all processes in the data life cycle. Use the current analysis and results to refine and develop subsequent questions and decisions.
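
The movement through these phases can be illustrated with a short sketch. The Python below is only one possible reading of the cycle, not the committee's model: the `Phase` and `LifeCycle` names, the `advance` and `revisit` methods, and the choice to treat Curate and Manage as a single step are all assumptions made for illustration. It shows the three properties emphasized above: forward movement, small backward loops, and multiple entry points.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Phases of the data life cycle, in the forward order listed above."""
    QUESTION = 0
    DEFINE = 1
    COORDINATE = 2
    GENERATE = 3
    COLLECT = 4
    CURATE_AND_MANAGE = 5  # sub-tasks (process, store, link, access) folded in
    ANALYZE = 6
    VISUALIZE = 7
    DISSEMINATE_INTERPRET = 8
    ASSESS = 9


class LifeCycle:
    """Hypothetical tracker for a team's position in the cycle."""

    def __init__(self, entry=Phase.QUESTION):
        # Question is the ideal entry point, but any phase may be used.
        self.phase = entry
        self.history = [entry]

    def advance(self):
        """Move forward one phase; Assess wraps around to Question."""
        self.phase = Phase((self.phase + 1) % len(Phase))
        self.history.append(self.phase)
        return self.phase

    def revisit(self, earlier):
        """Return to an earlier phase, e.g. to correct an error."""
        if earlier >= self.phase:
            raise ValueError("revisit() only moves backward")
        self.phase = earlier
        self.history.append(earlier)
        return self.phase


if __name__ == "__main__":
    cycle = LifeCycle(entry=Phase.ANALYZE)  # entering mid-cycle
    cycle.advance()                         # on to Visualize
    cycle.revisit(Phase.ANALYZE)            # small loop back to refine a model
    print([p.name for p in cycle.history])
```

In this sketch the decision to call `advance` or `revisit` is left to the caller, which mirrors the point that people, not the workflow itself, decide when to move forward or backward.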


DATA ETHICS, PRIVACY, AND SECURITY

Elements of data ethics, privacy, and security are incorporated throughout the data life cycle. The American Statistical Association defines good statistical (and by extension, data science) practice as “fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” This includes using methodology and data that are relevant and appropriate; being transparent about any known or suspected limitations or biases in the data that may affect the reliability of the analysis; protecting the interests of the respondents whose data are considered; and considering the entire range of explanations for observed phenomena (ASA 2018).

Addressing threats to privacy, which are increased by broader data access, machine learning, and artificial intelligence, is an ongoing challenge for policy makers and practitioners. In “Data, Privacy, and the Greater Good,” Eric Horvitz and Deirdre Mulligan note that “machine learning can be used to draw powerful and compromising inferences from self-disclosed, seemingly benign data or readily observed behavior. These inferences can undermine a basic goal of many privacy laws—to allow individuals to control who knows what about them.” They also note that the White House and the Federal Trade Commission (FTC) have, in the past, sought to protect “privacy, regulate harmful uses of information, and increase transparency” (Horvitz and Mulligan 2015). See Box 3.3.

In many applications, the appropriate level of security for data through their life cycle is critical. Data can be manipulated, stolen, or destroyed. Resulting risks vary depending on data types and sources; extreme cases include intellectual property theft and threats to national security.

In the opinion of this committee, the DoD acquisition workforce has been leveraging data science, even if not identified as such. Nevertheless, in Chapter 4, the committee identifies some opportunities for improved data use and describes how the data life cycle can be incorporated into common acquisition functions.


REFERENCES

ASA (American Statistical Association). 2018. Ethical Guidelines for Statistical Practice. https://www.amstat.org.

Geringer, S. 2014. “Data Science Venn Diagram v2.0.” Steve’s Machine Learning Blog (Blog). January 6. http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html.

Horvitz, E., and D. Mulligan. 2015. Data, privacy, and the greater good. Science 349(6245): 253-255.

Keller, S.A., S.S. Shipp, A.D. Schroeder, and G. Korkmaz. 2020. “Doing Data Science: A Framework and Case Study,” HDSR. February 21. https://hdsr.mitpress.mit.edu/pub/hnptx6lq/release/6.

Marshall, B., and S. Geier. 2019. “Targeted Curricular Innovations in Data Science.” 2019 IEEE Frontiers in Education Conference (FIE). https://ieeexplore.ieee.org/abstract/document/9028491/.

NASEM (National Academies of Sciences, Engineering, and Medicine). 2018. Data Science for Undergraduates: Opportunities and Options. Washington, DC: The National Academies Press. https://doi.org/10.17226/25104.

Wing, J.M. 2020. “Ten Research Challenge Areas in Data Science.” Harvard Data Science Review. https://doi.org/10.1162/99608f92.c6577b1f.


The effective use of data science, the science and technology of extracting value from data, improves, enhances, and strengthens acquisition decision-making and outcomes. Using data science to support decision making is not new to the defense acquisition community; its use by the acquisition workforce has enabled acquisition and thus defense successes for decades. Still, more consistent and expanded application of data science will continue improving acquisition outcomes, and doing so requires coordinated efforts across the defense acquisition system and its related communities and stakeholders. Central to that effort is the development, growth, and sustainment of data science capabilities across the acquisition workforce.

At the request of the Under Secretary of Defense for Acquisition and Sustainment, Empowering the Defense Acquisition Workforce to Improve Mission Outcomes Using Data Science assesses how data science can improve acquisition processes and develops a framework for training and educating the defense acquisition workforce to better exploit the application of data science. This report identifies opportunities where data science can improve acquisition processes, the relevant data science skills and capabilities necessary for the acquisition workforce, and relevant models of data science training and education.
