4
Managing Data and Promoting Interoperability in the Cloud

Highlights [a]

• Developing interoperability mechanisms that enable data platforms to talk to each other is critical, whether or not these platforms exist in the cloud (Evans).
• The National Library of Medicine is working to accelerate the promotion and adoption of Fast Healthcare Interoperability Resources standards to promote data exchange across the National Institutes of Health (Huerta).
• Although housing data in a single place could support more rapid research progress, sometimes it may be more practical to use federated models and store different levels of data in different ways. For example, the Psychiatric Genomics Consortium shares data on a compute cluster in the Netherlands that, while not in the cloud, uses similar data sharing and standardized processing approaches (Neale).
• Harmonized approaches, funding, and training are needed to enable transforming data from a raw state to a standardized format, which is costly and time consuming (Huerta, Nalls, Ramoni, Snyder).
• A common coordination frame would be needed to merge different types of data in repositories and platforms (Marcus).

[a] These points were made by the individual workshop participants identified above. They are not intended to reflect a consensus among workshop participants.

PREPUBLICATION COPY: Uncorrected Proofs

NEUROSCIENCE DATA IN THE CLOUD

Many of the issues related to data management and integration are not cloud specific, said Alan Evans, James McGill Professor of Neurology and Psychiatry at McGill University. Indeed, he said, getting the major platforms to develop interoperability definitions to enable data sharing transcends the cloud. But without that cooperation, there will continue to be islands and communities that are unable to communicate.

The web of regulations referred to in the section on privacy (see Chapter 3) further complicates efforts to integrate data across geographic boundaries, noted Eline Applemans, scientific program manager in neuroscience at the Foundation for the National Institutes of Health (FNIH). Benjamin Neale, associate professor in the Analytic and Translational Genetics Unit at Massachusetts General Hospital, the Broad Institute of the Massachusetts Institute of Technology (MIT) and Harvard, concurred that the GDPR regulations require cloud environments to be set up in each country, allowing investigators to analyze data within national boundaries. He suggested that although research could proceed more rapidly if data were housed in a single place, the community should be open to federated models and storage of different levels of data in different ways. For example, summary-level information might be shared in a highly interoperable environment, while individual-level data may be housed in a more restricted capacity.

Interoperability is facilitated by standards, but developing widely accepted data standards requires cooperation and is itself challenging. Data standards could provide the opportunity for large cloud-based neuroscience resources to work together; however, in a dynamic field like neuroscience with changing data modalities and technologies, it can be difficult to corral standards, said Michael Huerta. Daniel Marcus, professor of imaging neuroscience at Washington University in St.
Louis, suggested that inadequate cooperation arises not from a lack of interest, but from a lack of incentives. Governments can play a role in creating such incentives, as well as in coordinating collaborative efforts, said Rebecca Li. Maryann Martone added that the ideal people to develop standards may not be researchers themselves because they may lack expertise in informatics and coding. However, the standards developed should map onto what the researchers actually do, in a way that lets them understand how these constructs represent their experimental paradigms, she said. Thus, she said, it is probably helpful to start by asking researchers what they need from the data and what they are willing or unwilling to do to achieve their goals.

Data management can be costly and time consuming, said Huerta. Researchers should think about data integration and data sharing from the beginning as they are developing and designing their projects and should balance the costs versus benefits (value assessment) in deciding what level of data management is needed, he said. Martone suggested that it may be helpful to
ask researchers to fill out templates of metadata schemes. These templates, she said, should be simple and not overly prescriptive. Huerta added that NIH staff need to understand the complexities of data management; for example, data cleaning is essential and can be expensive.

CURRENT PROMISING PRACTICES REGARDING STANDARDS DEVELOPMENT AND INTEROPERABILITY

Huerta recalled that about 20 years ago, to develop the Neuroimaging Informatics Technology Initiative (NIFTI) as an imaging standard, the major neuroimaging labs and software developers came together for workshops to develop standards, which are still widely used. Now, he said, his office is working to accelerate the promotion and adoption of Fast Healthcare Interoperability Resources (FHIR) standards to promote health care information exchange across NIH. He added that NIH is preparing to release for public comment a data management and sharing policy, which will require NIH-funded researchers to include a data management and sharing plan in their grant proposals. [1]

Neale suggested that genetics is one domain within the field of neuroscience that has already made progress in sharing data. The Psychiatric Genomics Consortium (PGC) was launched in 2007 with the goal of conducting huge genome-wide analyses of psychiatric disorders by bringing researchers together from around the world to work collaboratively (Psychiatric GWAS Consortium Steering Committee, 2009). The more than 800 investigators from 38 countries that have joined this consortium share data on a research compute cluster in the Netherlands that functions in a manner similar to a cloud, enabling many different groups to share and work together with a standardized kind of processing and analysis, said Neale. The UK Biobank has created a different kind of data model in which data are made available for downloading, said Neale.
He suggested that it may be possible to set up cloud-based methods that would enable investigators to point to and analyze those data without downloading them. Meanwhile, the National Center for Biotechnology Information (NCBI) has developed a database of genotypes and phenotypes (dbGaP) [2] to archive and distribute data and results from genotype/phenotype studies conducted in humans, said Neale. He added that the National Human Genome Research Institute (NHGRI) and the National Heart, Lung, and Blood Institute (NHLBI) are trying to move toward a centralized dataset model where researchers can apply for access and then work with the data in a centralized environment. Many pieces are on the table that are not all totally linked and interoperable, he said, suggesting that opportunities remain to improve the data management approach. However, Lyn Jakeman, director of the division of neuroscience at the National Institute of Neurological Disorders and Stroke (NINDS), suggested that there may not be one model that works for all areas within neuroscience.

Interoperability among a multiplicity of data management platforms can also be a problem, said Evans. For example, the Canadian Open Neurosciences Platform uses LORIS (Longitudinal Online Research and Imaging System) [3] as its main data management platform, he said, but other institutions across Canada use other systems. Users see only a common application programming interface (API) that sits on top of these platforms, he said.

[1] This policy was released since the date of the workshop. For more information, see https://osp.od.nih.gov/scientific-sharing/nih-data-management-and-sharing-activities-related-to-public-access-and-open-science (accessed November 24, 2019).
[2] For more information, see https://www.ncbi.nlm.nih.gov/gap (accessed November 11, 2019).

DATA MANAGEMENT ISSUES TO BE RESOLVED

Transforming data from a raw state into a standardized format capable of being analyzed and/or shared (a process called "data munging" or "data wrangling") is costly and time consuming, sometimes accounting for as much as 70 percent of a project's budget, said Michael Nalls, founder and CEO of Data Tecnica International and a consultant for NIH. Rachel Ramoni, chief research and development officer for the Department of Veterans Affairs (VA), suggested that funding agencies might be able to come together to support the development of harmonized approaches for data munging and then incentivize the use of these harmonized approaches by funding projects that use them.
Heather Snyder, vice president of medical science relations at the Alzheimer's Association, agreed that funders could make data cleaning (i.e., correcting or removing inaccurate or irrelevant data) a condition of funding, adding that the Alzheimer's Association sometimes pays to have datasets cleaned, believing that there is tremendous value in those data being available and shared. Huerta added that there is a trans-NIH effort to train program officers about good data management practices, with tools that will enable even those who do not have computational biology in their portfolios to better understand the costs of data management, including the amount of Principal Investigator and technician time required for data cleaning and munging.

Derek Merck, director of medical informatics at the University of Florida, advocated moving as much of the burden of data cleaning as possible from the producers of the data to repositories, while requiring data producers to meet only minimal requirements in order to contribute data. Investigators who adhere to that format would be incentivized by having access to multiple automated processes, he said. Huerta added that in academic environments it can be difficult to recruit and retain people who have the skills necessary for data standardization. Martone said that researchers tend to have little interest in standards, often leaving data management to graduate students and postdoctoral fellows.

[3] For more information, see http://loris.ca (accessed November 11, 2019).

Michael Hawrylycz, senior director for informatics at the Allen Institute, said that because large datasets are often generated in a systematic way, standardization is less of an issue. However, standardization becomes more problematic with smaller datasets. Incentivizing researchers to standardize these datasets is especially important, he said. Benchmarking datasets and software that will be used in the cloud is also important when designing experiments, said Nalls.

Harmonizing and federating similar types of data residing in multiple repositories and platforms is only one challenge, said Marcus. When systems hold different types of data, the challenges are magnified. For example, he said, if neuroimaging data identify a potentially interesting region of interest, one might want to examine gene expression in that region. Making such connections is currently a manual process. Merging these data types would require defining a common coordinate frame, said Evans. Similarly, analyzing genetics data across datasets without being able to integrate phenotypic data and other information limits what can be learned from those data, added Snyder.
Neale said that in the genetics field, a strategic decision early on to go broad in sample, but shallow in phenotype, has slowly shifted, especially with projects such as the UK Biobank, where links with electronic health records and other data sources are providing deeper and richer phenotypic information.