In the genetic landscape, cloud usage is becoming more common, in large part because of the size of the datasets, said Benjamin Neale. With more than a petabyte of raw data and additional processed data, the cloud is the only computational environment capable of managing these datasets, said Neale. The challenge, he said, is balancing large centralized infrastructural resources with systems that enable scientists to perform analyses and explore data locally.
Neale added that making full use of cloud infrastructure to process these ever-growing genetic datasets requires novel methods and software tools. The demand for computing power becomes particularly onerous when trying to integrate genomic data with other types of data, such as cognitive data, added Michael Nalls. On top of that, as investigators try to navigate GDPR and other regulatory environments, new methods for working in federated systems and switching from local to cloud computing will be increasingly important, said Nalls.
GDPR is a particular challenge for genomic data, which require special protections because some forms of DNA can provide an identifiable footprint, said Arpana Agrawal, professor of psychiatry at the Washington University School of Medicine in St. Louis. As was discussed in the privacy breakout sessions (see Chapter 3), GDPR treats genetic information as personal data while HIPAA treats it as completely de-identified data, said Kristen Rosati.
Neale said there are large datasets already available in the cloud and a clear NIH investment in building infrastructure and supporting upload and access with a variety of different approaches. Nalls, for example, is working with hybrid models that combine cloud computing with high-performance systems such as the Biowulf cluster at NIH.1 This sandbox approach, said Nalls, allows researchers to test their software locally or on a small local cluster before going to production scale in the cloud, thus maximizing resources and reducing costs. He said his group externally audits all datasets before they are pushed to the local cluster to ensure identifiable data are not inadvertently uploaded. The group also checks its code to ensure privacy is maintained, because links in laboratory notebooks could cause inadvertent breaches. Neale added that for genomic data, this approach of prototyping software with small data before applying it systematically to large datasets is beneficial because, on a 5-year horizon, there will probably be whole-genome sequences for millions of individuals.
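The kind of pre-upload audit Nalls describes can be illustrated with a minimal sketch. The patterns and column names below are hypothetical; a real audit pipeline would rely on curated institutional policies and manual review rather than simple header matching:

```python
import re

# Hypothetical patterns for identifier-like columns. A production audit
# would use a curated, policy-driven list, not this illustrative set.
IDENTIFIER_PATTERNS = [
    r"name", r"ssn", r"dob", r"birth", r"address", r"phone", r"email", r"mrn",
]

def flag_identifier_columns(columns):
    """Return the column headers that match identifier-like patterns,
    so they can be reviewed before the dataset leaves the local system."""
    flagged = []
    for col in columns:
        if any(re.search(p, col, re.IGNORECASE) for p in IDENTIFIER_PATTERNS):
            flagged.append(col)
    return flagged

# Example: a small cohort table with one problematic column header.
headers = ["sample_id", "patient_email", "genotype_call"]
print(flag_identifier_columns(headers))  # ['patient_email']
```

A check like this would run before any push to a shared cluster or cloud bucket, flagging columns for human review rather than silently dropping them.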
The PGC takes a different approach, with data storage and analysis conducted not on the cloud, but on a dedicated and highly protected server in the Netherlands, said Agrawal. This allows EU researchers to conduct studies without the data leaving the European Union, and U.S. researchers also deposit data there. Nalls added that working in federated learning scenarios in local clusters before switching to the cloud is increasingly important as a means of adhering to GDPR regulations, although he acknowledged that exactly how GDPR will be interpreted and implemented has yet to be determined.
While GDPR does not regulate “anonymized” data, it is unclear whether genetic data can be anonymized, said Rosati. She also noted that anticipated guidance under the Common Rule regarding the identifiability of genetic data is likely to present challenges for researchers in the United States, because nearly all U.S. institutions have been treating genetic information not accompanied by identifiers as de-identified information with no regulatory controls on data sharing.
An additional problem related to identification and de-identification is that there is no way to process away the individual-level genetic variation that in and of itself may be identifiable, said Neale. Researchers use this and other sensitive information all the time, and sometimes in non-anonymized form, he said, but they do so with an explicit commitment not to misuse those data. He added that engaging in individual re-identification should carry serious consequences. The language used in consent forms related to the risks of re-identification varies considerably, said Rosati, and there is also a whole landscape of genetic information gathered for treatment purposes where there is no consent at all. People assume that if their genetic data are moved to shared repositories, those data are de-identified, noted Jonathan Cohen. He suggested that the community might want to work to ensure that this is clarified in consent forms.
Rosati agreed that genetic information and other rich, clinical, formerly de-identified data are going to require some protections. A question that has been debated, she said, is whether consent can provide adequate protection or whether much stronger federal laws are needed to protect against the re-identification of individuals. Neale noted, however, that as more barriers to access are introduced, it will become hard to realize the potential of these data to improve lives.
Engaging study participants for genetic research studies of vulnerable populations raises significant challenges, said Neale. Partnering with different ancestral groups from study inception is advisable although not
required, he said. Educating potential participants about the benefits and risks of the study needs to be managed with transparency and openness, he said. However, he acknowledged that studies of vulnerable populations carry potential risks of group characterization, which can cause distress. Rosati added that mental health disorders, addiction, and some other conditions that could be revealed by genetic information are associated with a substantial amount of stigma. Indeed, said Lyn Jakeman, neuroethicists and others are beginning to question whether genetic data become a code with which one can compare brain function. For example, if the genetic code is linked to functional imaging or other phenotypic data, it could become code for the individual, she said. Nalls said attempts have already been made to harmonize data from many different sources using unsupervised learning methods, which require a lot of computing power and cloud-based technologies.
Neale raised another potential challenge: Suppose a new class of mutation is identified that enables a new interpretation of genetic data. Should a reanalysis of older data be conducted, and if so, who is responsible for doing such studies? Moreover, this scenario raises questions about whether individuals consented to be contacted again after the initial study, said William Hanson. These questions are being grappled with not only in the research arena, but in clinical practice as well, he said.
Disclosure of genetic results to research participants raises other thorny issues, said Rosati. Cohen asserted that withholding genetic information that may raise medical concerns is wrong. How to disclose this information is not clear, however, said Cohen; for example, the medical risks of non-disclosure must be weighed against the risks (such as anxiety and unnecessary tests) of disclosing what amount to non-consequential or false-positive findings. Neale added that institutions will need to identify the infrastructure needed to deliver these kinds of results. Rosati suggested that best practices need to be identified regarding whether an institution has a duty to inform research participants when something concerning arises in a research study. Hanson added that he and his colleagues are exploring expanded consent for genetic testing in the clinical environment that would clarify this responsibility. An additional complication, Rosati said, is that the Centers for Medicare &amp; Medicaid Services (CMS) has taken the position that any results reported back to individuals or their care providers for treatment purposes must be generated in a lab certified in accordance with the Clinical Laboratory Improvement Amendments.