Innovations Emerging in the Clinical Data Utility


Elmore and Platt

•  Distributed data queries can provide the foundation of a learning health system.

•  Advantages of distributed data networks include data accuracy, timeliness, flexibility, and sustainability.

•  Distributed queries facilitate asking questions of large datasets in ways that are HIPAA-compliant and maintain local context.


Chute

•  Data normalization and harmonization are critical to ensuring effective and accurate secondary use.

•  There are multiple approaches to data normalization, but a hybrid approach of new systems standardizing from inception and legacy systems transforming over time is most feasible.

•  Clinical element models, together with value sets, present opportunity for normalization in a way that maintains the context and provenance of the data.

•  Value-set management is a major component of normalization, and terminology service; a national repository of value sets is one suggested approach to handling this challenge.

The National Academies of Sciences, Engineering, and Medicine
500 Fifth St. N.W. | Washington, D.C. 20001

Copyright © National Academy of Sciences. All rights reserved.


Kheterpal

•  Modern health care challenges, such as chronic disease, require comprehensive, longitudinal information to support team care.

•  Blindfolded record linkage, such as using hashes, offers many advantages to better link data between sources while maintaining privacy.

INTRODUCTION

In order to make optimal use of the digital health data utility, novel and innovative approaches will have to be developed. These innovations include learning from large sets of data while dealing with the risk associated with physical aggregation, coping with incomplete standardization of data, and linking data from diverse sources without the use of universal identifiers. Richard Elmore, Coordinator of Query Health at the Office of the National Coordinator for Health Information Technology, and Richard Platt, Chair of Population Medicine at Harvard Medical School and Harvard Pilgrim Health Care Institute, discussed the specific case of distributed data queries. Christopher Chute, Professor of Medical Informatics at the Mayo Clinic, elaborated on challenges and opportunities associated with data harmonization and normalization. Vik Kheterpal, Principal at CareEvolution, focused on data linkage between sources.

DISTRIBUTED QUERIES

In their discussion of distributed queries, Richard Elmore and Richard Platt covered the broad definition and qualities of such queries, and provided specific examples of these queries in action. Distributed queries allow querying of data from multiple partners without having to physically aggregate data in one central repository; a query is sent to all partners, and each participant runs this query internally and returns summary results individually. Some example use cases for distributed population queries include population measures related to disease outbreaks, postmarket surveillance, prevention, quality, and performance. The advantages of this model, Elmore emphasized, are myriad.
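
The query pattern described above (one question sent to every partner, each partner answering from its own data, only summaries returned) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Query Health or Mini-Sentinel implementation: the `Partner` class, the `distributed_query` function, and the toy records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partner:
    """A hypothetical data partner whose records stay local."""
    name: str
    records: list  # list of dicts; row-level data is never returned

    def run_query(self, predicate):
        """Run the query locally and return summary counts only."""
        matched = sum(1 for record in self.records if predicate(record))
        return {"partner": self.name, "matched": matched, "total": len(self.records)}


def distributed_query(partners, predicate):
    """Send the same query to each partner and combine the returned summaries."""
    summaries = [partner.run_query(predicate) for partner in partners]
    combined = {
        "matched": sum(s["matched"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
    }
    return summaries, combined


# Toy data invented purely for illustration.
site_a = Partner("site_a", [{"dx": "stroke"}, {"dx": "asthma"}])
site_b = Partner("site_b", [{"dx": "stroke"}, {"dx": "stroke"}, {"dx": "flu"}])

summaries, combined = distributed_query([site_a, site_b], lambda r: r["dx"] == "stroke")
print(combined)
```

The coordinator in this sketch only ever sees per-partner counts, which mirrors the point that data need not be physically aggregated in one central repository.
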
A distributed query approach allows data partners to maintain HIPAA-mandated, contractual control of their protected health information (PHI), and it facilitates data validity by ensuring that results are returned by local content experts, those who are most familiar with the data and best understand their interpretation. The distributed data environment also supports data accuracy, timeliness, flexibility, and sustainability.

Despite their many advantages, distributed queries also face a number of data quality challenges. Complications in integrating results from several data sources due to a lack of standards were cited as an example. But, Elmore said, pathbreaking work is under way to address this problem. Difficulty in striking a balance between clinical intuitiveness and computability when expressing a query is another challenge. Moreover, once a query is formulated, the lack of semantic equivalency and standards to express clinical concepts among data systems must be addressed. Additionally, there is no cultivated standard value set, clinicians in the same practice often code differently, and each organization has its own established value sets. Furthermore, within those value sets, data are often missing, so completeness also presents a challenge to distributed queries.

Despite the obstacles inherent to such queries, several examples, across many domains, are ongoing and have achieved great success. Platt described Mini-Sentinel, an FDA-sponsored pilot initiative that has created a distributed dataset that includes data on 126 million people at 17 data partners to support active safety surveillance of medical products. The FDA now routinely uses the system.

Platt cited an example of a query dealing with drugs for smoking cessation, addressing concern that a certain drug increased risk of negative cardiac outcomes. Within 3 days of receiving FDA’s intent to query the network, Mini-Sentinel returned its first report on the results, including information on 300 million person-years of experience. While the speed and scope of the query result were impressive, Platt noted that it had several associated limitations.
These included that it was intended to be a quick look, not a final answer; that the result did not exclude excess risk; and that recorded exposures may have been missing or included a misclassified indication. Moreover, the cohort may have been unrepresentative, outcomes may have been misclassified, and there was a potential for residual confounding due to disparate smoking intensities or comorbidities. Nonetheless, with the right clarification on the query itself, specifications on the cohort of interest, and selection of diagnosis codes, the network was able to rapidly query hundreds of millions of people’s worth of data without transferring any institution’s PHI.

Another query focused on a comparison of individuals who had experienced a stroke or transient ischemic attack (TIA) and previously received one of two different types of platelet antagonists. Treatment with one of the platelet antagonists was contraindicated for individuals who had previously had a stroke or TIA; Mini-Sentinel determined that half as many individuals received the contraindicated drug following stroke or TIA compared to those individuals receiving the comparison drug. The limitations inherent to this query included that the ICD-9 codes used for TIA and stroke were not validated in Mini-Sentinel, and that the longest look back for stroke or TIA events was 1 year, so that patients who experienced an event earlier than 1 year prior were missed.

In both of these examples, it was possible to get very quick information that provided guidance that FDA found to be useful in determining how much urgency should be attached to a specific question, while also helping to develop next steps. Along these lines, Query Health, an ONC-sponsored initiative, is working with many partners to develop standards for distributed data queries. As Elmore emphasized, the idea is to send questions to voluntary, collaborative networks, whose varied data sources may range from EHRs, to health information exchanges (HIEs), to other clinical records. These queries have the potential to dramatically cut cycle time on population questions, from years to days, and thereby, Elmore said, are critical to ONC’s strategy to bend the curve toward transformed health, and will play a foundational role in the digital infrastructure for a learning health system, focusing on the patient and patient populations, while ensuring privacy and trust.

DATA HARMONIZATION AND NORMALIZATION

In his comments on data harmonization and normalization, Christopher Chute stressed that data from patient encounters must be comparable and consistent in order to provide knowledge and insights to inform future care decisions. This normalization is also necessary for big-data approaches to queries. However, most clinical data in the United States, even within institutions, are heterogeneous, which presents a major challenge for harmonization efforts. ONC’s initiation of Meaningful Use is mitigating this challenge, but more work is needed.

Data normalization, Chute said, comes in two varieties: clinical data normalization of structured information, and processing of unstructured natural language. Moreover, three potential approaches to instituting this normalization exist. The first approach is for all generators of data, including lab systems, departmental systems, and physician entry systems, to normalize their data at the source. Given the institutional effort necessary to realize this approach, it is not realistic in the short term. The second approach places all hopes for normalization in transformation and mapping on the back end of data systems; this approach sometimes works, but often is associated with ambiguous meanings and other transformation difficulties. Lastly, the third and most promising method is a hybrid approach, in which new systems begin by normalizing their data at the source, while established systems implement standard normalization protocols like Meaningful Use, and data from legacy systems are transformed.
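
The back-end transformation approach, and the ambiguity it can run into, might look something like the sketch below. Everything here is an assumption made for illustration: the local codes, the `LOCAL_TO_STANDARD` table, the function names, and the target codes, which are placeholders rather than real LOINC or SNOMED entries.

```python
# Hypothetical mapping from a legacy system's local codes to candidate standard codes.
LOCAL_TO_STANDARD = {
    "GLU": ["STD-GLUCOSE-SERUM", "STD-GLUCOSE-URINE"],  # one local code, two meanings
    "HGB": ["STD-HEMOGLOBIN"],
}


def map_local_code(code):
    """Return a (status, target) pair for one local code."""
    candidates = LOCAL_TO_STANDARD.get(code, [])
    if len(candidates) == 1:
        return ("mapped", candidates[0])
    if not candidates:
        return ("unmapped", None)
    return ("ambiguous", candidates)


def transform(records):
    """Split legacy records into cleanly mapped ones and ones needing local review."""
    mapped, needs_review = [], []
    for record in records:
        status, target = map_local_code(record["code"])
        if status == "mapped":
            mapped.append({**record, "standard_code": target})
        else:
            needs_review.append({**record, "status": status})
    return mapped, needs_review


legacy = [{"code": "HGB", "value": 13.5}, {"code": "GLU", "value": 98}, {"code": "XYZ", "value": 1}]
mapped, needs_review = transform(legacy)
print(len(mapped), "mapped;", len(needs_review), "need review")
```

Routing ambiguous and unmapped codes to a review queue, rather than guessing, is one way to keep the people who understand the local data involved in the mapping.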

In discussing these approaches, Chute emphasized, it is important to comprehend fully the definition of normalization, as it has both syntactic and semantic meanings. Syntactic normalization is highly mechanical and involves correction of malformed messages. An example of such work is the Health Open Source Software pipeline created by Regenstrief Institute, which is capable of this type of syntactic normalization. On the other hand, semantic normalization typically involves vocabulary and concept mapping. Both types of normalization assume that there is a normal form to target, yet extant national and international standards do not fully specify that target. Many standards exist, but, Chute said, they do not specify what is needed. The current standards and specifications of HIE and messaging are narrow, and do not look at the full representational problems of clinical data, so that efforts to meet the standards fall short on those fronts. Additionally, while there is tension on this point, machine readable, rather than human readable, standard representation is necessary for large-scale inferencing and secondary use.

Having elaborated on the definition and current characteristics of normalization, Chute turned to describing current efforts undertaken by ONC’s Strategic Health IT Advanced Research Projects (SHARP) Program, specifically SHARPn, whose major focus is on normalizing and standardizing data. SHARPn is approaching data normalization through clinical element model (CEM) structures, which are a basis for retaining consistent meaning for data when they are exchanged between heterogeneous computer systems or when clinical data are referenced in decision support logic or other modalities of secondary use. CEMs include the context and provenance of data; for example, a patient’s position and body location will be recorded alongside his or her blood pressure reading.
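
To make the blood pressure example concrete, a CEM-like structure could be sketched as a record that carries its context and provenance with it. The class and field names below are assumptions chosen for illustration and are not the actual SHARPn CEM schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Provenance:
    """Where the observation came from (hypothetical fields)."""
    source_system: str
    recorded_by: str
    recorded_at: str  # ISO 8601 timestamp kept as a string for simplicity


@dataclass(frozen=True)
class BloodPressureReading:
    """A measurement bundled with the context needed to interpret it."""
    systolic_mmhg: int
    diastolic_mmhg: int
    patient_position: str  # context, e.g. "sitting"
    body_location: str     # context, e.g. "left arm"
    provenance: Provenance


reading = BloodPressureReading(
    systolic_mmhg=128,
    diastolic_mmhg=82,
    patient_position="sitting",
    body_location="left arm",
    provenance=Provenance("clinic_ehr", "nurse_01", "2012-03-05T10:15:00"),
)

# When the reading is exchanged, context and provenance travel with the values.
exchanged = asdict(reading)
print(exchanged["patient_position"], exchanged["provenance"]["source_system"])
```

Because the context fields are part of the record itself, a receiving system does not have to infer how or where the measurement was taken.
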
This promising model has generated an international consortium, the Clinical Information Model Initiative (CIMI), which brings together a variety of efforts focused on CEMs. When comparing the resulting CEMs between different participating partners, it becomes clear that different secondary uses require different metadata, which raises the question of what structured information should be incorporated into these models. By binding value sets to CEMs, Chute suggested, it is possible to effectively institute semantic normalization. Ideally, all collaborating groups would implement the same value sets and they would be drawn from “standard vocabularies” like LOINC and SNOMED. However, it is likely that many value sets would have to be bound to these CEMs in order to truly have interoperability and a comparable and consistent representation of clinical data. Value-set management, therefore, is a major component of normalization, and terminology services and a national repository of value sets managed by the National Library of Medicine is one suggested approach to handling this challenge. Local codes would have to map to the major value sets, and the process of semantic mapping from local codes to “standard” codes, Chute emphasized, surely would be labor intensive. This underscores the critical importance of tagging data at the local level, so that those who best understand the data’s significance are the individuals determining its codes.

DATA LINKAGE

Vik Kheterpal began by emphasizing chronic disease as the dominant problem in health care as a way to highlight the challenges associated with data linkage. Chronic diseases are the principal cause of disability and health services utilization, and account for 78 percent of health care expenditures. Care for these conditions necessitates teamwork and coordination between multiple caregivers, and this team-based care requires data exchange, interoperability, and management over a patient’s extended care timeline. The data must be longitudinal and its management must be coordinated in order to ensure that clinicians are able to view the patient’s condition across time before making clinical decisions. This level of coordination, Kheterpal suggested, offers the opportunity to reduce costs, improve outcomes, and reduce care fragmentation.

In working toward this more interoperable vision of data exchange, it is important that the current focus on EHRs be broadened, Kheterpal suggested. He emphasized the need to focus not on the technology, but what can be done with it. For example, EHRs are necessary to facilitate exchange, but they are not sufficient to accrue transformational systemic value. Rather than simply digitizing the data contained in paper records, emphasis should be placed on improving data visualization, and leveraging the power of large datasets for extrapolation. The strategy also must address health care specific challenges, including false positives, lack of uniform identifiers, privacy regulations, dirty data, and the multitude of data sources.

Kheterpal highlighted that data linkage is a major challenge to integrating data from different sources and to providing longitudinal data on patients in order to assess downstream outcomes and get a complete picture. To confront these challenges, Kheterpal said, blindfolded record linkage holds much promise. This method of linking data allows for secure, one-way hash transformations so that records can be linked without any party having to reveal identifying information about any of the subjects. Its advantages are numerous in that it maintains patient privacy, is already viable and in production, and can process large population sets. Moreover, Kheterpal said, current health data efforts can easily be adapted to include it. Employing this strategy for linking data can reduce duplication and provide a longitudinal view of the patient’s care history, addressing two of the major challenges to optimizing learning from large datasets.
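
A rough sketch of linkage through one-way hashes follows. The text specifies only that secure one-way hash transformations are used; the keyed HMAC-SHA256 construction, the choice of name and date of birth as identifiers, the shared key, and all function names here are assumptions for illustration, not a description of any production system.

```python
import hashlib
import hmac


def normalize(first, last, dob):
    """Canonicalize identifiers so trivial formatting differences do not block a match."""
    return "|".join(part.strip().lower() for part in (first, last, dob))


def blind_token(first, last, dob, shared_key):
    """Derive a one-way token; the identifiers cannot be read back from it."""
    message = normalize(first, last, dob).encode("utf-8")
    return hmac.new(shared_key, message, hashlib.sha256).hexdigest()


def link(tokens_a, tokens_b):
    """Return the tokens present at both sites."""
    return sorted(set(tokens_a) & set(tokens_b))


KEY = b"demo-key-not-for-production"  # hypothetical key agreed on by the sites

# Each site maps token -> its own local record ID; no names are exchanged.
site_a = {
    blind_token("Ann", "Lee", "1950-01-02", KEY): "a-001",
    blind_token("Bob", "Ray", "1962-07-09", KEY): "a-002",
}
site_b = {
    blind_token(" ann ", "LEE", "1950-01-02", KEY): "b-417",
}

linked = [(site_a[t], site_b[t]) for t in link(site_a, site_b)]
print(linked)
```

This sketch matches only on exact normalized identifiers, so a misspelled name would be missed; handling such errors is outside what it attempts to show.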

To close, Kheterpal offered several recommendations to move the field forward. Increased utilization of distributed blindfolded linkage pilots will provide greater evidence on their fitness to address the challenges at hand. Research into the scale of overlap and missed signal problems associated with systems that do not link records stratified across disease states will help to make the case for improved record linkage. Lastly, Kheterpal suggested that development of a stratification model that matches a proposed research question with necessary data types could improve the accuracy and relevance of data linkage efforts.
