8
Genetic Data

HIGHLIGHTS*

• Cloud usage has become common in the genetics landscape in part because of the size of the datasets; however, new methods and tools are needed to make full use of cloud infrastructure (Neale).
• The General Data Protection Regulation (GDPR) is a particular challenge for genomic data because DNA provides an identifiable footprint or label to the data, and there is no way to process away individual-level genetic variation (Agrawal, Neale, Rosati).
• Hybrid models that combine cloud computing with high-performance computing on local clusters are useful for managing genetic data in the cloud because they allow investigators to prototype software before applying it systematically to large datasets (Neale).
• Genetic studies in vulnerable populations, such as underserved communities or those with mental health disorders, present challenges related to participant engagement and the risk of stigmatization (Jakeman, Nalls, Neale, Rosati).
• Best practices need to be developed for disclosure of genetic results to research participants (Cohen, Hanson, Neale, Rosati).

* These points were made by the individual workshop participants identified above. They are not intended to reflect a consensus among workshop participants.

PREPUBLICATION COPY—Uncorrected Proofs
In the genetic landscape, cloud usage is becoming more common, in large part because of the size of the datasets, said Benjamin Neale. With more than a petabyte of raw data and additional processed data, the cloud is the only computational environment capable of managing these datasets, said Neale. The challenge, he said, is balancing large centralized infrastructural resources with systems that enable scientists to perform analyses and explore data locally. Neale added that making full use of cloud infrastructure to process these ever-growing genetic datasets requires novel methods and software tools.

The need for substantial computing power becomes particularly onerous when trying to integrate genomic data with other types of data, such as cognitive data, added Michael Nalls. On top of that, as investigators try to navigate GDPR and other regulatory environments, new methods for working in federated systems and switching between local and cloud computing will be increasingly important, said Nalls.

GDPR is a particular challenge for genomic data, which require special protections because some forms of DNA can provide an identifiable footprint, said Arpana Agrawal, professor of psychiatry at the Washington University School of Medicine. As was discussed in the privacy breakout sessions (see Chapter 3), GDPR treats genetic information as personal data, while HIPAA treats it as completely de-identified data, said Kristen Rosati.

CURRENT PROMISING PRACTICES FOR MANAGING GENETIC DATA IN THE CLOUD

Neale said there are large datasets already available in the cloud and a clear NIH investment in building infrastructure and supporting upload and access with a variety of different approaches.
Nalls, for example, is working with hybrid models that combine cloud computing with high-performance systems such as the Biowulf high-performance computing cluster at NIH.1 This sandbox approach, said Nalls, allows researchers to test their software locally or on a small local cluster before going to production scale in the cloud, thus maximizing resources and reducing costs. He said his group externally audits all datasets before they are pushed to the local cluster to ensure identifiable data are not inadvertently uploaded, and also checks the code to ensure privacy is maintained, since links in laboratory notebooks could potentially cause inadvertent breaches. Neale added that for genomic data, this approach of prototyping software with small data before applying it systematically to large datasets is beneficial because, on a 5-year horizon, there will probably be whole-genome sequences on millions of individuals.

1 For more information, see https://hpc.nih.gov/systems (accessed November 11, 2019).
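The sandbox workflow described above can be illustrated with a minimal sketch: run an analysis on a small random subsample locally, confirm it behaves sensibly, then apply the same code at full scale. This is a toy example under stated assumptions, not any participant's actual pipeline; the function and variable names are hypothetical, and the "analysis" is simply an alternate-allele frequency computed from simulated genotypes.

```python
# Illustrative sketch (not any specific group's pipeline) of the
# prototype-then-scale "sandbox" workflow: validate an analysis on a
# small local subsample before running it at production scale.
import random

def allele_frequency(genotypes):
    """Frequency of the alternate allele in a list of diploid
    genotypes coded as 0, 1, or 2 alternate-allele copies."""
    if not genotypes:
        raise ValueError("empty cohort")
    return sum(genotypes) / (2 * len(genotypes))

def prototype_locally(full_cohort, sample_size=100, seed=42):
    """Run the analysis on a small random subsample first, so bugs
    surface on a laptop or small cluster rather than in the cloud."""
    rng = random.Random(seed)
    subsample = rng.sample(full_cohort, min(sample_size, len(full_cohort)))
    return allele_frequency(subsample)

# Simulated cohort: 10,000 individuals at one biallelic site.
rng = random.Random(0)
cohort = [rng.choice([0, 0, 0, 1, 1, 2]) for _ in range(10_000)]

pilot = prototype_locally(cohort)   # cheap local sanity check
full = allele_frequency(cohort)     # production-scale run
assert abs(pilot - full) < 0.15     # pilot should approximate full result
```

Because the same function runs in both settings, any bug caught in the pilot is caught before cloud resources are committed, which is the cost-saving point Nalls makes.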
The Psychiatric Genomics Consortium takes a different approach, with data storage and analysis conducted not on the cloud, but on a dedicated and highly protected server in the Netherlands, said Agrawal. This allows EU researchers to conduct studies without the data leaving the European Union, and U.S. researchers also deposit data there. Nalls added that working in federated learning scenarios on local clusters before switching to the cloud is increasingly important as a means of adhering to GDPR regulations, although he acknowledged that exactly how GDPR will be interpreted and implemented has yet to be determined.

ISSUES TO BE RESOLVED REGARDING GENETIC DATA IN THE CLOUD

While the GDPR does not regulate "anonymized" data, it is unclear whether genetic data can be anonymized, said Kristen Rosati. She also noted that anticipated guidance under the Common Rule regarding the identifiability of genetic data is likely to present challenges for researchers in the United States, because nearly all U.S. institutions have been treating genetic information not accompanied by identifiers as de-identified information with no regulatory controls on data sharing.

An additional problem related to identification and de-identification is that there is no way to process away the individual-level genetic variation that in and of itself may be identifiable, said Neale. Researchers use this and other sensitive information all the time, and sometimes in a non-anonymized form, he said, but do so with the explicit commitment not to do nefarious things with those data. He added that engaging in individual re-identification should have serious consequences. The language used in consent forms related to the risks of re-identification varies considerably, said Rosati, and there is also a whole landscape of genetic information gathered for treatment purposes where there is no consent at all.
People assume that if their genetic data are moved to shared repositories, those data are de-identified, noted Jonathan Cohen. He suggested that the community might want to work to ensure that this is clarified in consent forms. Rosati agreed that genetic information and other rich, clinical, formerly de-identified data are going to require some protections. A question that has been debated, she said, is whether consent can provide adequate protection or whether much stronger federal laws are needed to protect against the re-identification of individuals. Neale noted, however, that as more barriers to access are introduced, it will become harder to realize the potential of these data to improve lives.

Engaging study participants for genetic research studies of vulnerable populations raises significant challenges, said Neale. Partnering with different ancestral groups from study inception is advisable although not
required, he said. Educating potential participants about the benefits and risks of the study needs to be managed with transparency and openness, he said. However, he acknowledged that there are potential risks of group characterization that can emerge from studies of vulnerable populations, which can cause distress. Rosati added that mental health disorders, addiction, and some other conditions that could be revealed by genetic information are associated with a substantial amount of stigma. Indeed, said Lyn Jakeman, neuroethicists and others are beginning to question whether genetic data become a code with which one can compare brain function. For example, if the genetic code is linked to functional imaging or other phenotypic data, it could become a code for the individual, she said. Nalls said attempts have already been made to harmonize data from many different sources using unsupervised learning methods. These require a lot of computing power and cloud-based technologies, he said.

Neale raised another potential challenge: Suppose a new class of mutation is identified that enables a new interpretation of genetic data. Should a reanalysis of older data be conducted, and if so, who is responsible for doing such studies? Moreover, this scenario raises questions about whether individuals consented to be contacted again after the initial study, said William Hanson. These questions are being grappled with not only in the research arena, but in clinical practice as well, he said.

Disclosure of genetic results to research participants raises other thorny issues, said Rosati. Cohen asserted that withholding genetic information that may raise medical concerns is wrong. How to disclose this information is not clear, however, said Cohen; for example, the medical risks of non-disclosure must be weighed against the risks (in anxiety and unnecessary tests) of disclosing what amount to non-consequential or false-positive findings.
Neale added that institutions will need to identify the infrastructure needed to deliver these kinds of results. Rosati suggested that best practices need to be identified regarding whether an institution has a duty to inform research participants when something concerning arises in a research study. Hanson added that he and his colleagues are exploring expanded consent for genetic testing in the clinical environment that would clarify this responsibility. An additional complication, Rosati said, is that the Centers for Medicare & Medicaid Services (CMS) have taken the position that any results reported back to individuals or their care providers for treatment purposes must be generated in a lab certified in accordance with the Clinical Laboratory Improvement Amendments (CLIA).
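The cross-source harmonization Nalls mentioned earlier in the chapter can be given a minimal flavor with a toy sketch. One common preprocessing step before pooled unsupervised analysis is standardizing each cohort's measurements separately (z-scoring), so that site-specific scale differences do not dominate the pooled data. This is an illustrative assumption on our part, not the method described at the workshop; the cohort values and names below are hypothetical.

```python
# Toy sketch of one preprocessing step in cross-cohort harmonization:
# per-cohort standardization (z-scoring), so pooled unsupervised
# analyses are not dominated by site-specific measurement scales.
# Illustrative only; not the method described by any participant.
from statistics import mean, stdev

def standardize(values):
    """Z-score a list of measurements: zero mean, unit sample variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Two hypothetical cohorts measuring the same trait on different scales.
cohort_a = [10.0, 12.0, 11.0, 13.0, 9.0]
cohort_b = [100.0, 120.0, 110.0, 130.0, 90.0]

# Standardize each cohort separately, then pool for downstream analysis.
pooled = standardize(cohort_a) + standardize(cohort_b)

# Identical relative patterns yield (numerically) identical z-scores,
# so the two cohorts become directly comparable after standardization.
assert all(abs(x - y) < 1e-9
           for x, y in zip(standardize(cohort_a), standardize(cohort_b)))
```

Real harmonization pipelines of the kind Nalls alluded to go much further (e.g., clustering or dimensionality reduction over pooled data), which is where the heavy computing and cloud-based technologies he mentioned come in.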