Many of the issues related to data management and integration are not cloud-specific, said Alan Evans, James McGill Professor of Neurology and Psychiatry at McGill University. Indeed, he said, getting the major platforms to develop interoperability definitions to enable data sharing transcends the cloud. But without that cooperation, there will continue to be islands and communities that are unable to communicate with one another.
The web of regulations referred to in the section on privacy (see Chapter 3) further complicates efforts to integrate data across geographic boundaries, noted Eline Applemans, scientific program manager in neuroscience at the Foundation for the National Institutes of Health (FNIH). Benjamin Neale, associate professor in the Analytic and Translational Genetics Unit at Massachusetts General Hospital and the Broad Institute of MIT and Harvard, concurred that the GDPR regulations require cloud environments to be set up in each country, allowing investigators to analyze data within national boundaries. He suggested that although research could proceed more rapidly if data were housed in a single place, the community should be open to federated models and storage of different levels of data in different ways. For example, summary-level information might be shared in a highly interoperable environment, while individual-level data may be housed in a more restricted capacity.
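One way to make the federated model Neale described concrete is for each jurisdiction to compute summary statistics locally and share only those aggregates, never individual-level records. The sketch below is illustrative only: the per-site numbers are invented, and it simply pools counts, sums, and sums of squares into a global mean and variance.

```python
import math

# Hypothetical per-site summaries: each site shares only aggregates
# (n, sum, sum of squares), never individual-level records.
site_summaries = [
    {"n": 120, "sum": 540.0, "sumsq": 2500.0},
    {"n": 80,  "sum": 368.0, "sumsq": 1740.0},
]

n = sum(s["n"] for s in site_summaries)
total = sum(s["sum"] for s in site_summaries)
sumsq = sum(s["sumsq"] for s in site_summaries)

mean = total / n
variance = (sumsq - n * mean**2) / (n - 1)  # pooled sample variance
print(f"pooled mean={mean:.3f}, sd={math.sqrt(variance):.3f}")
```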
Interoperability is facilitated by standards, but developing widely accepted data standards requires cooperation and is itself challenging. Data standards could provide the opportunity for large cloud-based neuroscience resources to work together; however, in a dynamic field like neuroscience, with changing data modalities and technologies, it can be difficult to corral standards, said Michael Huerta. Daniel Marcus, professor of imaging neuroscience at the Washington University School of Medicine in St. Louis, suggested that inadequate cooperation arises not from a lack of interest but from a lack of incentives. Governments can play a role in creating such incentives, as well as in coordinating collaborative efforts, said Rebecca Li.
Maryann Martone added that the ideal people to develop standards may not be researchers themselves, because they may lack expertise in informatics and coding. However, the standards developed should map onto what researchers actually do, in a way that lets them understand how these constructs represent their experimental paradigms, she said. Thus, she said, it is probably helpful to start by asking researchers what they need from the data and what they are willing or unwilling to do to achieve their goals.
Data management can be costly and time consuming, said Huerta. Researchers should think about data integration and data sharing from the beginning as they are developing and designing their projects and should weigh the costs against the benefits (a value assessment) in deciding what level of data management is needed, he said. Martone suggested that it may be helpful to ask researchers to fill out templates of metadata schemes; these templates, she said, should be simple and not overly prescriptive. Huerta added that NIH staff need to understand the complexities of data management, noting, for example, that data cleaning is essential and can be expensive.
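To give a flavor of the kind of simple, non-prescriptive template Martone described, the sketch below shows a minimal metadata form and a lightweight check of required fields. The field names and values are hypothetical and are not drawn from any specific standard.

```python
# A deliberately simple, hypothetical metadata template a researcher
# could fill out when depositing a dataset. Field names are illustrative
# only; a real scheme would come from a community standard.
dataset_metadata = {
    "title": "Resting-state fMRI in healthy adults",  # free text
    "modality": "fMRI",           # from a short controlled list
    "species": "human",
    "n_subjects": 24,
    "license": "CC-BY-4.0",
    "contact_email": "pi@example.edu",
    "keywords": ["resting state", "connectivity"],
}

# A repository could check only a handful of required fields,
# keeping the burden on contributors low.
REQUIRED = {"title", "modality", "species", "license"}
missing = REQUIRED - dataset_metadata.keys()
if missing:
    raise ValueError(f"Missing required metadata fields: {sorted(missing)}")
```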
Huerta recalled that about 20 years ago, the major neuroimaging labs and software developers came together in a series of workshops to develop the Neuroimaging Informatics Technology Initiative (NIfTI) imaging standard, which is still widely used. Now, he said, his office is working to accelerate the promotion and adoption of Fast Healthcare Interoperability Resources (FHIR) standards to promote health care information exchange across NIH. He added that NIH is preparing to release for public comment a data management and sharing policy, which will require NIH-funded researchers to include a data management and sharing plan in their grant proposals.1
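As an illustration of what a format standard like NIfTI buys in practice, the short sketch below reads a NIfTI file with the widely used nibabel Python library; the file path is hypothetical. Because the format is standardized, any compliant tool can recover the voxel array and the affine that maps voxel indices to scanner coordinates.

```python
import nibabel as nib  # common Python reader for NIfTI files

# Hypothetical file path; any NIfTI-1/2 image would work the same way.
img = nib.load("subject01_T1w.nii.gz")

data = img.get_fdata()   # voxel intensities as a NumPy array
affine = img.affine      # 4x4 voxel-to-world coordinate transform
print(data.shape, img.header.get_zooms())  # dimensions and voxel sizes
```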
Neale suggested that genetics is one domain within the field of neuroscience that has already made progress in sharing data. The Psychiatric Genomics Consortium (PGC) was launched in 2007 with the goal of conducting large-scale genome-wide analyses of psychiatric disorders by bringing researchers together from around the world to work collaboratively (Psychiatric GWAS Consortium Steering Committee, 2009). The more than 800 investigators from 38 countries who have joined this consortium share data on a research compute cluster in the Netherlands that functions much like a cloud, enabling many different groups to work together using standardized processing and analysis, said Neale.
The UK Biobank has created a different kind of data model in which data are made available for downloading, said Neale. He suggested that it may be possible to set up cloud-based methods that would enable investigators to point to and analyze those data without downloading them.
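A minimal sketch of that idea, assuming a dataset already sits in public cloud object storage: with the s3fs backend installed, pandas can read directly from an object store URL, so the analysis runs wherever the code runs, without a bulk download first. The bucket path and column names here are hypothetical.

```python
import pandas as pd  # requires the s3fs package for s3:// URLs

# Hypothetical bucket and file; no local copy of the full dataset is kept.
URL = "s3://example-biobank-bucket/phenotypes/summary.csv"

df = pd.read_csv(URL, storage_options={"anon": True})  # anonymous access
print(df.groupby("diagnosis")["age"].mean())  # hypothetical column names
```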
Meanwhile, the National Center for Biotechnology Information (NCBI) has developed a database of genotypes and phenotypes (dbGaP)2 to archive and distribute data and results from genotype/phenotype studies conducted in humans, said Neale. He added that the National Human Genome Research Institute (NHGRI) and the National Heart, Lung, and Blood Institute (NHLBI) are trying to move toward a centralized dataset model in which researchers can apply for access and then work with the data in a centralized environment. Many pieces are on the table that are not all fully linked and interoperable, he said, suggesting that opportunities remain to improve the data management approach. However, Lyn Jakeman, director of the division of neuroscience at the National Institute of Neurological Disorders and Stroke (NINDS), suggested that there may not be one model that works for all areas within neuroscience.
1 This policy was released after the date of the workshop. For more information, see https://osp.od.nih.gov/scientific-sharing/nih-data-management-and-sharing-activities-related-to-public-access-and-open-science (accessed November 24, 2019).
Interoperability among a multiplicity of data management platforms can also be a problem, said Evans. For example, the Canadian Open Neurosciences Platform uses LORIS (Longitudinal Online Research and Imaging System)3 as its main data management platform, he said, but other institutions across Canada use other systems. Users see only a common application programming interface (API) that sits on top of these platforms, he said.
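The arrangement Evans described, a single API sitting on top of heterogeneous platforms, is essentially an adapter pattern. The sketch below is illustrative only; the class and method names are invented and do not reflect the actual LORIS or Canadian Open Neurosciences Platform interfaces.

```python
from abc import ABC, abstractmethod

class DataPlatform(ABC):
    """Common interface the API layer exposes; backends vary."""
    @abstractmethod
    def find_sessions(self, project: str) -> list[str]: ...

class LorisBackend(DataPlatform):
    # Hypothetical adapter; the real LORIS API differs.
    def find_sessions(self, project: str) -> list[str]:
        return [f"loris:{project}:sess-{i}" for i in range(3)]

class OtherBackend(DataPlatform):
    # Stand-in for any other institution's data management system.
    def find_sessions(self, project: str) -> list[str]:
        return [f"other:{project}:run{i}" for i in range(2)]

def list_all_sessions(backends: list[DataPlatform], project: str) -> list[str]:
    """Callers see one interface, regardless of the platform behind it."""
    return [s for b in backends for s in b.find_sessions(project)]

print(list_all_sessions([LorisBackend(), OtherBackend()], "demo"))
```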
Transforming data from a raw state into a standardized format capable of being analyzed and/or shared—a process called “data munging” or “data wrangling”—is costly and time consuming, sometimes accounting for as much as 70 percent of a project’s budget, said Michael Nalls, founder and CEO of Data Tecnica International and a consultant for the National Institute on Aging. Rachel Ramoni, chief research and development officer for the Department of Veterans Affairs (VA), suggested that funding agencies might be able to come together to support the development of harmonized approaches for data munging and then incentivize the use of these harmonized approaches by funding projects that use them. Heather Snyder, vice president of medical science relations at the Alzheimer’s Association, agreed that funders could make data cleaning (i.e., correcting or removing inaccurate or irrelevant data) a condition of funding, adding that the Alzheimer’s Association sometimes pays to have datasets cleaned, believing that there is tremendous value in those data being available and shared. Huerta added that there is a trans-NIH effort to train program officers about good data management practices, with tools that will enable even those who do not have computational biology in their portfolios to better understand the costs of data management, including the amount of principal investigator and technician time required for data cleaning and munging.
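To give a flavor of the routine work that munging involves, the minimal pandas sketch below deduplicates rows, harmonizes inconsistent labels, and converts an ad hoc missing-value code. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw export: duplicate rows, inconsistent labels,
# and -1 used as a missing-value code.
raw = pd.DataFrame({
    "subject": ["s01", "s01", "s02", "s03"],
    "sex": ["F", "F", "male", "M"],
    "age": [34, 34, -1, 41],
})

clean = (
    raw.drop_duplicates()  # remove repeated rows
       .assign(
           sex=lambda d: d["sex"].str.upper().str[0],   # harmonize labels
           age=lambda d: d["age"].where(d["age"] >= 0),  # -1 -> NaN
       )
)
print(clean)
```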
Derek Merck, director of medical informatics at the University of Florida, advocated moving as much of the burden of data cleaning as possible from the producers of the data to repositories, while requiring data producers to meet only minimal requirements in order to contribute data. Investigators who adhere to that format would be incentivized by having access to multiple automated processes, he said. Huerta added that in academic environments it can be difficult to recruit and retain people who have the skills necessary for data standardization. Martone said that researchers tend to have little interest in standards, often leaving data management to graduate students and postdoctoral fellows.
Michael Hawrylycz, senior director for informatics at the Allen Institute for Brain Science, said that because large datasets are often generated in a systematic way, standardization is less of an issue. However, standardization becomes more problematic with smaller datasets. Incentivizing researchers to standardize these datasets is especially important, he said. Benchmarking datasets and software that will be used in the cloud is also important when designing experiments, said Nalls.
Harmonizing and federating similar types of data residing in multiple repositories and platforms is only one challenge, said Marcus. When systems hold different types of data, the challenges are magnified. For example, he said, if neuroimaging data identify a potentially interesting region of interest, one might want to examine gene expression in that region. Making such connections is currently a manual process. Merging these data types would require defining a common coordinate frame, said Evans.
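To illustrate what a common coordinate frame means in practice, the sketch below uses nibabel to map a voxel index in one image into world (scanner) coordinates via the image's affine; if a gene expression atlas were registered to the same space, those coordinates would serve as the shared key. The file path and voxel index are hypothetical.

```python
import nibabel as nib
from nibabel.affines import apply_affine

# Hypothetical image registered to a standard space (e.g., MNI).
img = nib.load("roi_map.nii.gz")

voxel = (45, 60, 32)                         # hypothetical voxel index
world_xyz = apply_affine(img.affine, voxel)  # voxel -> world coordinates

# If an expression atlas shares the same space, world_xyz can be used
# to look up gene expression at or near this location.
print(world_xyz)
```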
Similarly, analyzing genetics data across datasets without being able to integrate phenotypic data and other information limits what can be learned from those data, added Snyder. Neale said that in the genetics field, a strategic decision early on to go broad in sample, but shallow in phenotype, has slowly shifted, especially with projects such as the UK Biobank, where links with electronic health records and other data sources are providing deeper and richer phenotypic information.