The Office of the Under Secretary of Defense (Personnel & Readiness), referred to throughout the report as P&R, is responsible for the total management of all Department of Defense (DoD) personnel, including recruitment, readiness, and retention. This mission requires extensive data, a large number and variety of complex analyses, and access to skilled workers to extract meaningful information to guide DoD personnel and readiness policies. With the advent of newer sources of data, such as social media, and of modern data analytics, P&R has the opportunity to exploit new tools that may produce more powerful analyses and improve the effectiveness and efficiency with which it accomplishes its mission. However, cultural and technological challenges exist and must be addressed, including the following: improving data access and sharing while ensuring proper privacy protection, enhancing analytic methods, and improving workforce education. An important step in addressing these challenges is developing a data and analytics framework, taking into account current and desired capabilities and addressing barriers accordingly. This National Academies of Sciences, Engineering, and Medicine report of the Committee on Strengthening Data Science Methods for Department of Defense Personnel and Readiness Missions offers suggestions on which data analytics capabilities could be targeted and which considerations to keep in mind to advance the framework for these capabilities. The study’s full statement of task is shown in Box S.1.
This report considers data science in its broadest sense: a multidisciplinary field that concerns technologies, processes, and systems to extract knowledge and insight from data and to support reasoning and decision making under various kinds of uncertainty. There are two primary aspects of interest within the field of data science, namely (1) the management and processing of data and (2) the analytical methods and theories for descriptive and predictive analysis and for prescriptive analysis and optimization. The first aspect involves data systems and data preparation, including databases and warehousing, data cleaning and engineering, and some facets of data monitoring, reporting, and visualization. The second aspect involves data analytics and includes data mining, text analytics, machine and statistical learning, probability theory, mathematical optimization, and visualization of results.
Currently, analyses developed to support P&R are often disjointed, one-time efforts that respond to immediate questions and may lack any plan for future use of their data or methods. A comprehensive data and analytics framework, properly implemented, could add coherence to this work, expanding the types of questions that P&R can quickly examine, reducing the cost of analyses, improving the reliability of findings, and better informing policy decisions. While developing this framework, both the short-term and long-term needs of the Secretary of Defense and the responsibilities of P&R should be considered.
The Force of the Future initiatives being pursued by Secretary of Defense Ashton Carter aim to make the DoD workforce more equitable, efficient, and flexible through a number of efforts such as increasing the interchange of personnel with the civil sector, offering more family-friendly benefits, changing how military personnel are promoted, and improving the opportunities for civil service personnel. One aspect of this would be the establishment of an Office of People Analytics to better harness DoD’s big data capabilities in the service of managing personnel talent. This would be done by increasing the understanding of personnel characteristics and analyzing how policy or environmental changes will affect the performance or composition of the workforce. The development of a data and analytics framework could revolutionize how data and analytics are used by P&R while contributing to the goals of the current Force of the Future initiative.
Finding: Despite the substantial amount of data available on DoD personnel, the data may not be appropriate for DoD’s analytic tasks, or they may necessitate considerable investment in constructing the variables of interest.
Finding: Analyses developed to support the Secretary of Defense are often disjointed, one-off activities undertaken to respond to immediate questions and may lack a plan for future use of data or analytic methods.
Finding: The reuse of operational data for analytic purposes can expose issues in data collection, recording, transmission, cleaning, coding, and loading. Problems are often not detected until the point of analysis, when anomalies crop up in results.
Recommendation 1: The Office of the Under Secretary of Defense (Personnel & Readiness) should develop a data and analytics framework, and a strategy to implement that framework, that addresses both the principal outcomes of its responsibilities and the short-term and long-term needs of the Secretary, based on the findings, recommendations, and discussions outlined in this report and in the Force of the Future proposals.
Developing a data and analytics framework is a complex task, with many components that need to be addressed both individually and systematically. Data need to be easily accessible and shared across groups in a way that reduces the hurdles currently faced when researchers and analysts seek to find or share data while ensuring proper privacy and security protections. Analytic methods available to P&R need to be expanded to enable stronger and more rapid responses to significant P&R research and analysis questions. Prescriptive methods that would allow P&R to better assess alternatives and recommend actions could be used more extensively. The workforce that P&R relies on for its analytics also needs to be improved, which is a challenge facing organizations worldwide. Each of these components is briefly described in the following sections.
The following sections also discuss potential short-, medium-, and long-term goals to help move P&R in the direction of developing a data and analytics framework. Data quality and sharing can be improved immediately, while data science methods can be enhanced in the medium term and data science education strengthened in the long term.
IMPROVE DATA QUALITY AND SHARING
Collections of traditional administrative and transactional data used for P&R missions continue to grow owing to improved technical abilities to track and store data and an increased interest in capturing data that could provide meaningful insights. While the sheer quantity of data is growing in most domains, the importance of P&R’s mission makes it essential that these data are better understood and utilized. Steps have been taken to simplify and unify available data—such as the development of the Defense Manpower Data Center (DMDC), a unified personnel file, and the Civilian Personnel Data System—and these efforts have greatly enhanced the ability of the Office of the Secretary of Defense (OSD) to understand the behavior of its personnel and the effects of its policies. Still, there are benefits to be gained by enabling deeper and richer collection and sharing of data, including improved force readiness, better allocation of funds, and a more agile and adept workforce.
One important step toward improving the usefulness of data would be to add new fields and formats to personnel files that would improve the productivity of P&R’s analyses. This would require P&R to work with other organizations to identify the most useful such fields and formats. One particular technical need is to ensure that future records are capable of including unstructured data and free-form text, and that methods are available to search and combine such information reliably.
Another important need is to enable greater data sharing among the Services² and between P&R and the Services. This goal is challenging because data definitions can differ, software may not match up, and business practices may vary. These differences of practice reflect the separate histories of the Services. Moreover, there can be substantive reasons for the differences, rooted in variations in policies and practices across the Services. However, data sharing provides clear benefits, such as exposing incomplete or flawed data. Along these lines, there is also a need for standardization in the way data users can report problems with data collections and channel those problems back to the data providers when appropriate. Challenges of data sharing and repurposing are significant; in particular, different definitions and formatting of data complicate data merging and linking, making it difficult to bring to bear multiple databases and the additional insights they represent to inform studies. The reuse of administrative data for analytic purposes can also expose issues in data collection, recording, transmission, cleaning, coding, and loading. Problems with data are often not detected until the point of analysis, when anomalies crop up in results. In addition, there are significant up-front hurdles to identifying, accessing, and assembling the data needed to pursue desired data analyses. These hurdles can discourage decision makers from asking for data analyses and researchers from offering them.
² Throughout the report, the term “Services” is used to refer to the U.S. military Services, namely the Air Force, Army, Coast Guard, Marine Corps, Navy, National Guard, and Joint Chiefs.
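To illustrate why differing data definitions complicate merging and linking, the following minimal Python sketch maps two record layouts onto a shared schema and logs values that cannot be mapped, so that the problems could be reported back to the data provider. All source names, field names, codes, and records are hypothetical and are not drawn from any DMDC or Service system.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mappings from each source's local field names to a shared schema.
FIELD_MAP: Dict[str, Dict[str, str]] = {
    "source_a": {"id": "person_id", "grade": "pay_grade", "stat": "duty_status"},
    "source_b": {"member_no": "person_id", "paygrade": "pay_grade",
                 "status_cd": "duty_status"},
}

# Hypothetical mappings from each source's local status codes to shared codes.
STATUS_MAP: Dict[str, Dict[str, str]] = {
    "source_a": {"A": "active", "R": "reserve"},
    "source_b": {"1": "active", "2": "reserve"},
}


def harmonize(
    source: str, record: Dict[str, str]
) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Rename fields and recode values; return the record and any problems found."""
    problems: List[str] = []
    out: Dict[str, Optional[str]] = {}
    for local_name, value in record.items():
        shared_name = FIELD_MAP[source].get(local_name)
        if shared_name is None:
            problems.append(f"{source}: unmapped field {local_name!r}")
            continue
        out[shared_name] = value
    status = out.get("duty_status")
    if status is not None:
        recoded = STATUS_MAP[source].get(status)
        if recoded is None:
            # Keep the record but flag the value instead of guessing its meaning.
            problems.append(f"{source}: unknown duty_status code {status!r}")
        out["duty_status"] = recoded
    return out, problems


records = [
    ("source_a", {"id": "001", "grade": "E4", "stat": "A"}),
    ("source_b", {"member_no": "002", "paygrade": "O3", "status_cd": "9"}),
]

harmonized: List[Dict[str, Optional[str]]] = []
problem_log: List[str] = []
for source, rec in records:
    clean, problems = harmonize(source, rec)
    harmonized.append(clean)
    problem_log.extend(problems)

for line in problem_log:
    print(line)
```

In this sketch the unrecognized status code is surfaced in `problem_log` at load time rather than at the point of analysis, which is the kind of standardized problem reporting the text calls for.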
DMDC has developed the Person-Event Data Environment (PDE), which is designed to bring data together in a unified and secure system where researchers can conduct analyses easily and in a privacy-preserving fashion. The PDE is a positive step in attempting to make data more easily accessible. However, technical and cultural challenges (such as possible data reidentification and other privacy compromises), a slow and complicated approval process to gain access, lengthy reviews for data import and export, limited computational capabilities, concerns about data quality and comprehensiveness, and concerns about data ownership rules pose a significant deterrent to utilizing the PDE. In addition, it is not clear that the architecture scales up in such a way that it can serve all of P&R’s needs, and forcing analysts to work through the PDE personnel, who then must work through the data owners, may represent a barrier between the analyst and the raw data. The substantial efforts undertaken by PDE personnel to prepare the data for linkage are not transparent and may inadvertently impact the results of analysis.
Finding: The existence of DMDC and a unified personnel file has greatly enhanced OSD’s ability to understand the behavior of its personnel and to refine its policies so as to improve both retention and performance. The creation of the Civilian Personnel Data System was a similar achievement.
Finding: There are benefits to be gained by enabling deeper and richer collection and sharing of data, which support a richer picture of the individual. This could in turn allow for better matching of personnel to the needs at hand (e.g., with regard to desired data skills, language proficiencies, and experiences), improved identification of at-risk servicemembers, enhanced management of the force in terms of retention and training, and many other benefits.
Finding: The challenges of data sharing and repurposing are significant; in particular, different data definitions and formatting complicate data merging and linking. Business practices (e.g., methods, procedures, processes, and rules) vary from Service to Service and from one database to another.
Finding: Enhanced data sharing within DoD, across agencies, and with the research community at large could promote the creation of new statistical methods, tools, and products.
Finding: The existence of alternative data sources, such as social media, especially when they are tied to extensive information about individuals, may deliver deep insights relevant to the mission of P&R. Owing to concerns about privacy and appropriateness and to the difficulty of ensuring statistical validity, further pursuit of this path requires careful consideration and additional research.
Recommendation 2: The Office of the Under Secretary of Defense (Personnel & Readiness) should investigate the feasibility of exploiting alternative data sources to augment traditional methods for measuring collective sentiment, evaluating recruitment practices, and classifying individuals (for creditworthiness, perhaps, or for battle-readiness). Hand in hand with this effort there should be an investigation into privacy technology appropriate for these scenarios for data use.
Recommendation 3: The Office of the Under Secretary of Defense (Personnel & Readiness) should identify incentives to enhance data sharing and collection, such as the following:
- Tracking usage of data by source in repositories such as the Person-Event Data Environment and periodically reporting back to data providers on usage (e.g., number of uses, who the users are, and the nature of the study or analysis the data contributed to);
- Providing incremental funding on contracts that involve data collection and organization to cover the costs of archiving and documenting the data for other users; and
- Giving preference to projects for constructing or redesigning operational data systems that include explicit functionality to support data sharing.
Recommendation 4: The Office of the Under Secretary of Defense (Personnel & Readiness) should leverage opportunities to improve access, including better reuse of prior data, tools, and results, and should investigate incentives to increase interagency and inter-Service data sharing.
Recommendation 5: The Office of the Under Secretary of Defense (Personnel & Readiness) should establish a working group with representation from the Services and other elements of the Department of Defense, as appropriate, to
- Identify productive new fields and formats for personnel files, such as enabling the inclusion of unstructured data and free-form text in future records;
- Identify opportunities for data sharing between Services and the Office of the Under Secretary of Defense (Personnel & Readiness) and within Services and lower barriers to such sharing;
- Work with organizations that provide operational data or collect them for analysis to improve data quality by providing standard ways for data users to report problems with data collections and channel those reports back to data providers when appropriate;
- Clarify self-reporting rules and practices;
- Identify legal and regulatory barriers to the appropriate and responsible sharing of data; and
- Examine new hardware and software architectures that facilitate data access and data management.
Finding: The development of the Person-Event Data Environment is a positive step in making some data more easily accessible. However, certain technical and cultural factors deter the use of this tool.
The PDE:
- Spreads the overall cost of data acquisition, cleaning, ingestion, and linking;
- Reduces time for researchers identifying and downloading data, since they work on it in situ;
- Aims to improve handling of sensitive data;
- Monitors data usage; and
- Creates a group that supports users with data and tool issues.
Factors that deter use of the PDE include the following:
- Sensitive personally identifiable information is susceptible to reidentification and other privacy compromises such as revelation of sensitive traits or attributes.
- Linkage attacks—innocuous data in one data set used to identify a record in a different data set containing both innocuous and sensitive data—can be carried out via external data sets brought into the PDE by researchers.
- Review processes are lengthy for access to some data.
- Delays in the review process for export of analysis results pose a deterrent to publication and peer review.
- The hurdles to become a PDE user mean that the current user community is much smaller than intended.
- Some users have been limited by the computational power, memory, and tools of the current installation.
- The PDE does not solve completeness and quality issues in the underlying data sources.
- There does not exist a systematic mechanism for reporting data problems.
- Some PDE users say they have been given conflicting statements about the ownership of external data uploaded into the PDE.
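The linkage attack named in the list above can be illustrated with a minimal Python sketch. It joins a data set that carries no names to an external data set that does, using only fields that look innocuous on their own. All records, field names, and values are invented, and the choice of ZIP code, birth date, and sex as the linking fields is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# "De-identified" data set: no names, but innocuous fields plus a sensitive one.
deidentified: List[Record] = [
    {"zip": "22202", "birth_date": "1985-03-14", "sex": "F", "diagnosis": "condition_x"},
    {"zip": "22202", "birth_date": "1990-07-02", "sex": "M", "diagnosis": "condition_y"},
    {"zip": "20301", "birth_date": "1985-03-14", "sex": "F", "diagnosis": "condition_z"},
]

# External data set brought in by an analyst: names plus the same innocuous fields.
external: List[Record] = [
    {"name": "Person A", "zip": "22202", "birth_date": "1985-03-14", "sex": "F"},
    {"name": "Person B", "zip": "20301", "birth_date": "1971-11-30", "sex": "M"},
]

# Fields assumed to be present in both data sets (hypothetical choice).
QUASI_IDENTIFIERS: Tuple[str, ...] = ("zip", "birth_date", "sex")


def link_key(rec: Record) -> Tuple[str, ...]:
    """Build the join key from the shared innocuous fields."""
    return tuple(rec[field] for field in QUASI_IDENTIFIERS)


def link(deid: List[Record], ext: List[Record]) -> Dict[str, str]:
    """Map each name to a sensitive value when its key matches exactly one record."""
    index: Dict[Tuple[str, ...], List[Record]] = {}
    for rec in deid:
        index.setdefault(link_key(rec), []).append(rec)
    matches: Dict[str, str] = {}
    for person in ext:
        candidates = index.get(link_key(person), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


reidentified = link(deidentified, external)
print(reidentified)
```

The sketch shows why removing names alone is not sufficient protection: any combination of fields that is unique in both data sets acts as an identifier.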
Recommendation 6: The Defense Manpower Data Center should assess how well the Person-Event Data Environment is working and whether it is serving its intended community. In doing so, the center should consider taking the following steps to improve the usability of the Person-Event Data Environment and enhance its value:
- Assess whether current privacy and security policies are adequate, taking into account modern methods of attack and sources of auxiliary information that can aid in these attacks, such as multiple releases of statistics and data sets (Ganta et al., 2008), linkage attacks that make use of public sources (Sweeney, 1997; Narayanan and Shmatikov, 2008), and chronological correlations with public sources (Calandrino et al., 2011).
- Analyze data usage information, both for protecting privacy and for determining the value of assets.
- Do a better job of establishing and defining a user community for knowledge sharing. This includes improving relationships with the federally funded research and development centers doing work for the Department of Defense and determining which researchers would benefit from the capabilities of the Person-Event Data Environment.
- Remove unnecessary barriers for researchers to gain access to the system.
- Enhance computational power, memory, and tools.
- Respond to concerns about the quality and comprehensiveness of available data.
- Develop an explicit process for reporting data problems.
- Clarify data ownership rights to external data that are uploaded and merged.
- Assess protocols for accessing personally identifiable information.
- Review approval process for exporting analysis results.
- Consider widening access to the data and/or rebalancing Institutional Review Board requirements by establishing a differentially private interface.
When conducting analyses relating to personnel data, it is essential that privacy, confidentiality, and fairness be considered as primary factors rather than, as is too often the case, left as an afterthought secondary to an analyst’s findings. The current privacy and confidentiality protections in place with government databases rest heavily on Institutional Review Board (IRB) supervision. However, significant barriers arise in the overreliance on IRB reviews. Researchers often face multiple IRB reviews and re-reviews throughout a single study, which can significantly slow the research and analysis process and add months or years to the time it takes before researchers can access DoD data. This impedes their ability to respond to policy needs in a timely manner while also doing little to stop data reidentification and other compromises of sensitive personal information.
Finding: Reviews by multiple Institutional Review Boards can significantly slow down the research process and add months or years to the time it takes for researchers to have access to DoD data. This creates a serious problem for responding to policy needs in a timely manner.
Recommendation 7: In order to support timely and efficient research, the Office of the Under Secretary of Defense (Personnel & Readiness) should encourage streamlining of Institutional Review Board processes that involve multiple organizations—for example, federally funded research and development centers and the Department of Defense.
Recommendation 8: The Department of Defense should carry out research on the feasibility of differential privacy methods for its personnel analytics. These methods could reduce the need for Institutional Review Board oversight.
Recommendation 9: The Department of Defense should consider adopting or adapting the privacy and governance structure developed by the Office of Management and Budget for civilian statistical agencies. In particular, the department should follow the guidance on use of administrative records and establishing of statistical units under the Confidential Information Protection and Statistical Efficiency Act for both military and civil service personnel. In doing so, the department should examine the applicability of Fair Information Practice Principles in the treatment of Defense Manpower Data Center data.
Recommendation 10: The Defense Manpower Data Center, in its role as steward of the Person-Event Data Environment, should consider ways to adapt and use privacy and governance practices that the Office of Management and Budget has created for civilian use.
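To make the differential privacy methods named in Recommendation 8 more concrete, the following minimal Python sketch answers a counting query with the Laplace mechanism, one standard differentially private technique. The records, the query, and the value of the privacy parameter epsilon are hypothetical. The sketch is illustrative only: an operational interface would need a vetted implementation and an accounting of the privacy budget spent across queries, neither of which is shown here.

```python
import random
from typing import Callable, Dict, List

Record = Dict[str, object]


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(
    records: List[Record],
    predicate: Callable[[Record], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Return a noisy count of the records that satisfy the predicate.

    Adding or removing one record changes a count by at most 1, so the
    query has sensitivity 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for rec in records if predicate(rec))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical personnel records and query.
sample_records: List[Record] = [
    {"person_id": 1, "reenlisted": True},
    {"person_id": 2, "reenlisted": False},
    {"person_id": 3, "reenlisted": True},
]

noisy = private_count(
    sample_records,
    lambda rec: bool(rec["reenlisted"]),
    epsilon=0.5,  # smaller epsilon means more noise and stronger protection
    rng=random.Random(),
)
print(f"noisy count of reenlistments: {noisy:.2f}")
```

The analyst sees only the noisy answer, never the underlying records, which is the sense in which such an interface might widen access to the data.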
ENHANCE ANALYTIC METHODS
While comprehensive and reliable data are essential in informed decision making, they would be of little use without advanced analytic capabilities. New methods of analyzing data are increasingly available, and many cutting-edge approaches are ready to be more thoroughly applied for P&R missions. Data analytics are often categorized as descriptive analytics, predictive analytics, or prescriptive analytics. These categories are defined in Box S.2.
There are a number of opportunities to use prescriptive analytic techniques beyond those currently being used for P&R missions. In industry, for example, studies addressing the problem of retention typically start with a statistical analysis (predictive analytics) of the entire workforce that seeks to identify the characteristics of personnel most likely to leave. This analysis estimates losses for the different groups of personnel and also produces models that estimate how losses for each group change as a function of the additional compensation provided to that group. Then, methods for mathematical optimization under uncertainty (prescriptive analytics) that incorporate these predictive models are developed to determine the compensation to offer each group in the workforce so as to reduce losses (increase retention) and best match demand. This optimization also takes into account the trade-offs among other workforce policy levers, such as hiring (recruiting) and reskilling (training).
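The two-stage retention analysis described above can be sketched in a few lines of Python. The sketch assumes a hypothetical predictive model in which each group's loss rate falls exponentially with the bonus offered, and it then searches a small grid of bonus levels for the plan that minimizes total shortfall against demand within a budget. The groups, parameters, functional form, and budget are all invented for illustration. For simplicity the sketch optimizes expected losses only, so it omits the treatment of uncertainty and the hiring and training levers that the text mentions.

```python
import itertools
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Group:
    name: str
    headcount: int
    demand: int            # personnel needed next period
    base_loss_rate: float  # predicted loss rate with no bonus
    sensitivity: float     # hypothetical decline in losses per $1,000 of bonus


def predicted_loss_rate(group: Group, bonus_k: float) -> float:
    """Predictive step: assumed loss rate as a function of bonus (in $1,000s)."""
    return group.base_loss_rate * math.exp(-group.sensitivity * bonus_k)


def shortfall(group: Group, bonus_k: float) -> float:
    """Expected unmet demand for one group at a given bonus level."""
    retained = group.headcount * (1.0 - predicted_loss_rate(group, bonus_k))
    return max(0.0, group.demand - retained)


def best_plan(
    groups: Sequence[Group], bonus_options_k: Sequence[float], budget_k: float
) -> Tuple[float, float, Tuple[float, ...]]:
    """Prescriptive step: exhaustive search for the feasible plan with least shortfall.

    Returns (total shortfall, total cost in $1,000s, bonus per group).
    Ties on shortfall are broken in favor of the cheaper plan.
    """
    best = None
    for plan in itertools.product(bonus_options_k, repeat=len(groups)):
        cost = sum(bonus * g.headcount for bonus, g in zip(plan, groups))
        if cost > budget_k:
            continue  # plan exceeds the budget
        total = sum(shortfall(g, bonus) for g, bonus in zip(groups, plan))
        if best is None or (total, cost) < best[:2]:
            best = (total, cost, plan)
    if best is None:
        raise ValueError("no bonus plan fits within the budget")
    return best


# Hypothetical workforce groups.
groups = [
    Group("analysts", headcount=100, demand=90, base_loss_rate=0.20, sensitivity=0.10),
    Group("linguists", headcount=50, demand=48, base_loss_rate=0.10, sensitivity=0.05),
]

total_shortfall, cost_k, plan = best_plan(groups, [0, 5, 10], budget_k=1000)
for g, bonus in zip(groups, plan):
    print(f"{g.name}: bonus ${bonus}k, expected shortfall {shortfall(g, bonus):.1f}")
```

Exhaustive search is used only because the example is tiny; a realistic problem would call for a mathematical optimization solver and a model of uncertainty in the predicted losses.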
Several hurdles need to be overcome to exploit these new opportunities. The previously discussed improvements to data access and sharing are a first step. The best analytic methods can be accessed by enhancing training in the workforce and by building a data analytics center such as the Office of People Analytics, proposed in the Force of the Future initiatives.
Finding: A wide range of problems are being addressed for P&R using data analytic techniques and the rich data sources discussed in this report. These are often applied in response to specific questions but are not incorporated into a long-term plan.
Finding: Turnkey personnel analytic solutions and currently commercially available software are unlikely to meet P&R’s needs.
Recommendation 11: The Office of the Under Secretary of Defense (Personnel & Readiness) should assess which predictive and prescriptive analyses would benefit its mission over the longer term, taking into account its understanding of which specific decisions could, if evaluated by applying more powerful data and/or methods, better enable the Department of Defense to prepare for future demands it may face. Some possible steps that might follow include these:
- Emphasizing the use of prescriptive analytics in conjunction with predictive “what if” scenarios;
- Enhancing prescriptive analytics usage and disseminating best practices across the entire department; and
- Adapting the prescriptive analytics methods successfully used in the private sector for workforce and talent management.
Controlled experiments offer an opportunity to test potential policy solutions and provide additional data when needed. They can be used in a variety of areas important to P&R and can be particularly helpful in buttressing findings that contradict accepted conclusions.
Finding: The Department of Defense does not routinely employ controlled experiments to understand causes and effects of the Office of the Under Secretary of Defense (Personnel & Readiness) policies—for example, revisions to enlistment standards or choices affecting family welfare—to judge whether they produce the intended effects and provide benefits that justify their costs.
Recommendation 12: To the extent feasible and relevant, the Department of Defense should conduct carefully structured experiments to test the efficacy of policy.
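As a minimal sketch of the kind of structured experiment Recommendation 12 has in mind, the Python below randomly assigns units to a policy or a comparison condition and then uses a permutation test to judge whether the observed difference in mean outcomes could plausibly have arisen by chance. The units, outcome values, and number of permutations are hypothetical, and the permutation test is one of several valid analysis choices rather than a method prescribed by the report.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def assign(units: Sequence[int], rng: random.Random) -> Tuple[List[int], List[int]]:
    """Randomly split units into a treated half and a control half."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_test(
    treated: Sequence[float],
    control: Sequence[float],
    n_perm: int,
    rng: random.Random,
) -> Tuple[float, float]:
    """Return the observed difference in means and a two-sided p-value.

    The p-value estimates how often a random relabeling of the pooled
    outcomes gives a difference at least as large as the observed one.
    """
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


rng = random.Random(2016)

# Hypothetical unit identifiers, randomly assigned before the policy is applied.
treated_units, control_units = assign(range(12), rng)

# Hypothetical outcomes measured after the policy period (e.g., a retention score).
treated_outcomes = [10, 11, 12, 13, 14, 15]
control_outcomes = [1, 2, 3, 4, 5, 6]

effect, p_value = permutation_test(treated_outcomes, control_outcomes, 2000, rng)
print(f"estimated effect: {effect:.1f}, p-value: {p_value:.4f}")
```

Random assignment is what allows the difference in means to be read as an effect of the policy rather than of preexisting differences between the groups.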
STRENGTHEN DATA SCIENCE EDUCATION
A skilled workforce that can apply state-of-the-art methodology and adapt to the quickly evolving data analytics domain is essential. OSD would benefit if P&R strengthened the data analytics expertise of a portion of its staff, both military and civilian. Such background would allow these specialized staff to answer immediate questions quickly, to be better-informed consumers of external analyses, and to better integrate analyses into policy decisions. Such expertise would also help to transfer best practices and skills across silos within the P&R enterprise.
Finding: Based on its collective experience with seeing data science mature in other organizations, the committee’s judgment is that P&R’s skills, depth, and resources in data analytics are not sufficient to recognize the full range of analytics opportunities and to implement these methods to better support decision making. It is always problematic to leverage scattered pockets of data science expertise, so raising the general level of awareness and skill would be more effective.
Recommendation 13: The Office of the Under Secretary of Defense (Personnel & Readiness) should create greater awareness of data science methods and disseminate them more thoroughly to its personnel to increase the general understanding of data science and the benefits of its use.
Recommendation 14: The Office of the Under Secretary of Defense (Personnel & Readiness) should enhance education in data science for its personnel, including civil service employees. This education could range from short courses in specific techniques for personnel who already have the requisite foundational knowledge, to overview seminars for managers who need to be acquainted with what their analytical staff can undertake, to formal degree programs, whether at Department of Defense or civilian universities.
Calandrino, J.A., A. Kilzer, A. Narayanan, E.W. Felten, and V. Shmatikov. 2011. “You might also like”: Privacy risks of collaborative filtering. Pp. 231-246 in Proceedings of the 2011 IEEE Symposium on Security and Privacy. May 22-25.
Ganta, S.R., S. Kasiviswanathan, and A. Smith. 2008. Composition attacks and auxiliary information in data privacy. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://www.cse.psu.edu/~ads22/privacy598/papers/gks08.pdf.
Narayanan, A., and V. Shmatikov. 2008. Robust de-anonymization of large sparse datasets. Pp. 111-125 in Proceedings of the 2008 IEEE Symposium on Security and Privacy. doi:10.1109/SP.2008.33.
Sweeney, L. 1997. Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics 25(2-3):98-110.