The second session of the workshop, moderated by steering committee member Lance Waller (Emory University), highlighted challenges in using available data for small population health research. Kelly Devers (NORC) discussed the feasibility of using electronic health records and electronic health data. Chris Fowler (Pennsylvania State University) focused on using geospatial methods with demographic data to identify populations, while consultant Ellen Cromley discussed using these methods with other health and environmental data for the same purpose.
In his introduction, Waller noted two recurring hypotheses in the field of data science. The first, “no data, no problem,” alludes to a lack of attention to certain groups because data on them do not exist. The second, “lots of data, lots of problems,” implies a heightened focus on certain groups because data on them are so prevalent; Waller said it, too, is not entirely accurate. The key is understanding what types of data are available and using them effectively.
THE FEASIBILITY OF USING ELECTRONIC HEALTH RECORDS AND ELECTRONIC HEALTH DATA FOR RESEARCH ON SMALL POPULATIONS
In 2013, Kelly Devers said, she and her colleagues wrote a paper on this topic for the Department of Health and Human Services (Devers et al., 2013),1 yet much has changed in the intervening years regarding the use of electronic health records (EHRs) and data collection as a whole.
Devers observed that people often view survey research and claims data within the federal statistical infrastructure as separate from different or newer sources of data, but the data sources complement each other and provide different approaches to answering research questions. She acknowledged the federal statistical system faces broad and often conflicting challenges: funding constraints versus a need for more granular data, the push for efficiency amid declining survey response rates, and a desire to obtain finer data about various populations while carefully considering privacy issues and questions about trust and data usage. She said these forces should serve as drivers to search for complementary data sources like EHRs and other electronic health data, which may require partnering with the private organizations that maintain these records.
More than 80 percent of office-based physicians and nonfederal acute care hospitals use EHRs.2 The 2009 American Recovery and Reinvestment Act, which included the Health Information Technology for Economic and Clinical Health Act, made federal funds available, and more providers are now using EHR systems. Many are using certified EHR systems, meaning the systems have certain functionalities and standardized data elements, which is useful when compiling the data for research. EHR system adoption in small and rural area hospitals has been on par with other hospital types, she noted, alleviating concerns over a “digital divide.”
To leverage electronic health data, providers must be able to exchange the data with one another, which remains a challenge. According to the 2014 National Electronic Health Records Survey, conducted by the National Center for Health Statistics, more than 60 percent of providers are still not sharing EHR data.
To Devers, an EHR is a rich and powerful resource that contains a wealth of data, from administrative data, such as a patient’s insurance status, to health-specific data, such as pharmaceutical information, laboratory results, and physicians’ notes. Applications are being developed that can feed data into an EHR, and mechanisms such as patient portals and electronic surveys allow patients to enter self-reported data. She said the data entered into an EHR can also be used to track patients over time; they may change their health insurer, but they often see a similar network of providers. Through the use of novel techniques such as natural language processing, Devers said EHR information can be applied to a variety of different kinds of research, including studying small populations.
2 Washington, V., DeSalvo, K., Mostashari, F., and Blumenthal, D. (2017). The HITECH era and the path forward. New England Journal of Medicine, 377:904-906.
Four Illustrative Examples
EHRs could also be used to identify subpopulations to study further through surveys or other primary data collection efforts. Devers cited several examples in which access to large health insurers or integrated delivery systems in areas with highly diverse populations can provide information on and enable researchers to identify hard-to-reach groups, using data such as language spoken and translation services provided. Devers and her colleagues have identified four illustrative populations that put these issues into context: Asian American subpopulations; lesbian, gay, bisexual, and transgender (LGBT) populations; adolescents with autism spectrum disorders; and rural populations.
Referencing Howard Koh’s earlier presentation on the challenges associated with collecting data and conducting analysis on the Asian American population and its underlying subgroups (see Chapter 2), Devers pointed to the Pan Asian Cohort Study.3 She noted that the Asian American population includes 50 different Asian ethnicities and over 100 languages, distributed across the United States. The Pan Asian Cohort Study was an early adopter of EHRs, beginning in 1999. It followed a virtual cohort of patients for diabetes incidence and predictors. According to the study, the prevalence of diabetes was three times higher among Filipino men than among Japanese men. She noted that while the study shows progress in collecting data for different subgroups, language barriers and small sample sizes keep federal surveys from collecting and analyzing data in the same fashion.
Devers noted that many health issues and research challenges facing the LGBT population relate to stigma. Historically, researchers have hesitated to collect data on this population and their status, thus preventing this population from identifying themselves. Trust concerns on the part of the respondents could also lead to lack of disclosure and underreporting. As one illustration, Vanderbilt University Medical Center found that the time between when patients were first seen and when their LGBT status appeared in their medical records averaged about 30 months.
Devers added that the lack of a standard definition and variations in the nature of questions asked (e.g., asking about actual sexual behavior versus attraction versus identity) can yield very different responses, each having important implications for the health of the population. Although she and her colleagues found differences of more than 10 percentage points in LGBT subgroup prevalence depending on the type of question asked, she believes the data still point to significant disparities in care for this subgroup. Devers said that by applying natural language processing to unstructured EHR data, researchers can begin to identify and analyze information about sexual orientation, gender identity, and behavior and their impact on various health outcomes. Both Vanderbilt and University of California Davis Health Systems are also now collecting information about patient sexual orientation through EHR patient portals on a voluntary basis. Additionally, the more advanced certified EHRs are now required to add gender identity and sexual orientation data, which will further aid research.
Turning to autism, Devers noted that much of the research on this population has focused on the diagnosis of the disorders, with very little known about the health and health care of people with autism once they become adults. The problem, she said, is a lack of longitudinal data; because most studies are cross-sectional in nature, and the manner in which autism spectrum disorders are measured has changed, it has been difficult to use the data to follow this population.
Kaiser Permanente of Northern California has compiled a list of autism diagnoses based on ICD-10 codes that also takes into account who made the diagnosis, which Devers said is important because the diagnostic categories are continually changing. Additionally, the American Academy of Pediatrics has created a large pediatric research virtual data network called ePROS that allows for large-scale studies of children with autism spectrum disorder and could follow them as they age into adulthood.
In her fourth example, Devers said that geographic isolation, lack of economic opportunities, and challenges with access to care have created a set of unique needs for rural populations. In addition to chronic health conditions and conditions associated with aging, there are often environmental health issues to consider for communities that exist in exclusively rural areas, such as farmers and miners. The lack of consistent or well-defined boundaries for rural areas and a myriad of definitions further complicate data collection in these areas. She presented several examples where health providers are using EHR data to better inform their work with rural populations, including studying drug-seeking behavior and the prevalence of chronic disease.
Issues in Electronic Records Research
Devers reminded the group of several conditions required for this type of research, many of which resemble issues that arise when using nonelectronic data. Challenges exclusive to electronic systems include the processing of free text, as well as extraction and formatting of the data. Privacy concerns, security conditions, and restrictions to data are also noteworthy factors in the use of EHRs. Adherence to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and to Common Rule regulations creates unique challenges, but researchers are discovering ways to work through the constraints and gain access while remaining HIPAA compliant. Researchers also have to consider data governance and ownership and how to effectively work with private organizations to gain access to their data.
Devers noted that coordinating EHR data from multiple sources involves the creation of a governance structure and underlying organizational conditions. The field is moving away from centralized data warehouses in favor of virtual data warehouses that can be used for very specific studies. For these to work, she said, a common programming language must be established to bolster interoperability between EHR systems. Users must also test retrieving the data from the EHRs and then monitor the data quality and validity as the data are exchanged with subordinate systems, such as patient registries, surveys, and insurance claim software. A number of research networks, including the Health Systems Research Network and National Patient-Centered Clinical Research Network (PCORnet), are putting this into practice.
Devers and her colleagues (2013) also identified specific studies on small and unique populations and how the quality and validity of the EHR data differ across different systems and for different populations. A variety of basic descriptive studies would help researchers better understand small populations and how particular covariates impact key health outcomes. The use of EHR data can fundamentally improve research on small populations, she said, and many conditions for success are present or close to being realized. Devers said that successful implementation will involve stakeholder engagement of those who own and control the data, including the private sector, integrated delivery systems, EHR vendors, and the public. She also noted the need for progress on the legal framework and other policy issues.
USING GEOSPATIAL METHODS WITH DEMOGRAPHIC DATA TO IDENTIFY POPULATIONS
Chris Fowler focused on the possibilities and problems with using small geographic areas to represent context for studying small populations. He noted that context and contextual variables are already used in health research to predict individuals’ outcomes (for example, neighborhoods, built environments, school districts, and socioeconomic status) but emphasized the importance of applying the context at the appropriate scale. He said that context affects individuals through processes that are relevant at particular scales and not relevant at others; as such, variables need to be measured at the appropriate scale. What we define as our units of observation will condition the results we achieve, he said. When discussing geographic context, the boundaries used to define the scale and the area matter. He showed a visual example of a school district delineation that did not align with the area’s census tract boundary delineation, but noted that either could be relevant, depending on context.
Fowler conducted much of his research in south Seattle, which he described as an extremely diverse community. He used census tract boundaries as his geographic frame of reference, but pointed out that census tracts sometimes cut through neighborhoods or may include more than one neighborhood, which can pose problems since neighborhoods are often used as a frame of reference. One tract in his study area contained apartments over storefronts occupied primarily by immigrants and refugees, single-family homes occupied by aging black and Italian populations, and waterfront homes worth millions of dollars. He said that while tract data may be efficient for census enumeration, just as zip code delineation makes postal delivery more efficient, neither may be a useful administrative unit for measurement in other contexts. Other delineations, he said, such as school districts, can provide meaningful information but should be carefully reviewed.
Contextual variables may be useful when direct access to a population is not possible, Fowler explained. For example, when individuals’ test scores or blood levels are not available because a population is nonresponsive or hard to find, address data could help provide insight on exposure rates. Useful contextual variables include U.S. Census demographic data, Environmental Protection Agency (EPA) environmental toxicity data, educational context data from the School Attendance Boundary Information System, crime data from the Bureau of Justice Statistics, and economy data from the Bureau of Labor Statistics.
Fowler talked about the Family Life Project,4 a longitudinal study on children and families in rural Pennsylvania and North Carolina. Displaying a map of North Carolina that used an EPA risk-screening environmental indicators model to measure exposure to a specific set of toxicities, he explained how grid cells helped researchers look at degrees of exposure for children over time. In this study, researchers had both exposure data and individual level (or uptake) data, and the two together provided valuable insight.
Fowler noted that, at times, the term “small population” can refer to a high concentration of certain subgroups of individuals, such as ethnic groups who concentrate themselves in a particular part of a city or individuals who live proximate to a chemical plant and may have unique exposures. In cases such as these, knowledge of small geographic areas can provide very useful information if the area of effect is properly identified.
In his work on segregation, Fowler (2016)5 explained how measures of segregation changed with scale. He noticed in conducting his work that variability increases at smaller geographic scales, and this variance has the potential to help or hinder a study. Conversely, much of that variability is lost in larger contexts. He started by measuring segregation within a 100-foot radius, and then continually increased the radius to include larger population groups out to a distance of over 2 miles. The Theil’s H measure he used compares the local measure to the population composition of Seattle as a whole; computing it at each radius generated “segregation profiles.”
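One way to make the idea of a segregation profile concrete is sketched below. This is a minimal illustration and not Fowler's implementation: it assumes the multigroup entropy form of Theil's H, uses the pooled composition of the supplied local environments as the citywide reference, and the function names (`entropy`, `theil_h`, `segregation_profile`) and toy counts are hypothetical. H is 0 when every local environment mirrors the overall composition and 1 when groups never share an environment.

```python
import math


def entropy(counts):
    """Shannon entropy (natural log) of a vector of group counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def theil_h(local_units):
    """Theil's H for a list of local environments.

    Each environment is a list of counts, one per group.
    H = sum_j t_j * (E - E_j) / (T * E), where E is the entropy of the
    pooled population, E_j the entropy of environment j, t_j its
    population, and T the total population.
    """
    n_groups = len(local_units[0])
    pooled = [sum(unit[g] for unit in local_units) for g in range(n_groups)]
    total_pop = sum(pooled)
    e_total = entropy(pooled)
    if total_pop == 0 or e_total == 0:
        return 0.0  # no diversity overall, so segregation is undefined; report 0
    weighted = sum(sum(unit) * (e_total - entropy(unit)) for unit in local_units)
    return weighted / (total_pop * e_total)


def segregation_profile(units_by_scale):
    """Compute H separately at each scale (for example, each radius)."""
    return {scale: theil_h(units) for scale, units in units_by_scale.items()}


# Hypothetical two-group counts: fully separated at a small radius,
# fully mixed at a large radius.
profile = segregation_profile({
    "small radius": [[100, 0], [0, 100]],
    "large radius": [[50, 50], [50, 50]],
})
```

In this toy input the profile falls from 1 at the small radius to 0 at the large radius, which mirrors the point that variability visible at small scales can vanish in larger contexts.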
Fowler described another study where they used individual-level, restricted-access data from the census that identified low-income black individuals in Nashville (and seven other metropolitan areas) who answered the American Community Survey. They conducted radial analysis, looking at the racial composition of each individual’s nearest 5 neighbors, 10 neighbors, and so on, up through 10,000 neighbors. When they looked at individuals’ first 100 neighbors, they found that approximately 60 percent were black; however, once they disaggregated, they saw some individuals’ first 100 neighbors were all black. At the 4,500-neighbor mark, roughly the size of the average census tract, they found that the percentage of black neighbors decreased to 45 percent and was equal to the percentage of whites. He noted that the larger the area becomes, the less variation is apparent, which could remove the context in which individuals find themselves on a day-to-day basis.
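The nearest-neighbor analysis described above can be sketched as follows. This is an illustrative example under stated assumptions, not the study's method: the coordinates are planar, distance is straight-line, and the function name `neighbor_share`, the group labels, and the sample points are all hypothetical.

```python
import math


def neighbor_share(people, focal_index, group, ks):
    """Share of `group` among the k nearest neighbors of one individual.

    `people` is a list of (x, y, group_label) tuples in planar coordinates.
    Returns a dict mapping each k in `ks` to the share among the k nearest
    other individuals (fewer if the dataset is smaller than k).
    """
    fx, fy, _ = people[focal_index]
    # Distance from the focal individual to everyone else, nearest first.
    others = sorted(
        (math.hypot(x - fx, y - fy), label)
        for i, (x, y, label) in enumerate(people)
        if i != focal_index
    )
    shares = {}
    for k in ks:
        nearest = others[:k]
        if not nearest:
            raise ValueError("k must be positive and the dataset must have neighbors")
        shares[k] = sum(1 for _, label in nearest if label == group) / len(nearest)
    return shares


# Hypothetical individuals along a line; the focal person is at the origin.
people = [(0, 0, "a"), (1, 0, "a"), (2, 0, "a"), (3, 0, "b"), (4, 0, "b")]
shares = neighbor_share(people, focal_index=0, group="a", ks=[2, 4])
```

Repeating the calculation for every individual and comparing the spread of shares at each k is one way to see how variation between individuals shrinks as the neighborhood grows.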
The variation observed in these individual contexts calls into question the common narrative about how racial composition shapes outcomes for individuals. When segregation can be seen at large scales, Fowler noted it is correctly understood as the result of long-standing historical racism, discrimination, and other negative structural factors. But at very small scales, those same observations of people of similar race living together may be associated with social solidarity, political efficacy, social networks, and social capital that bring significant benefits. The same marker—percentage black or percentage Asian—has different values and means different things at different scales. The degree to which boundaries effectively capture context varies not only tract to tract, but also can elicit geographical bias on larger scales.
Fowler said one solution to enhance the robustness of results is to run an analysis using multiple scales. He referenced a study by Elizabeth Root (2012)6 in which she measured spatial scale effects by conducting her analysis at the block, block group, and tract levels. Another way is to use the individual as the frame of reference and set the boundary around the individual. Fowler referred to this concept as the standard deviation of individual context, meaning the degree to which individual experience varies within a geographic unit. He identified a working paper that shares more about this method.7
5 Fowler, C.S. (2016). Segregation as a multiscalar phenomenon and its implications for neighborhood-scale research: The case of South Seattle 1990-2010. Urban Geography, 37(1):1-25.
In conclusion, Fowler stressed that scale matters. The better one’s boundaries are at the outset, and the more that is known about the optimal scale for the process to be captured, the better the analysis will be.
USING GEOSPATIAL METHODS WITH OTHER HEALTH AND ENVIRONMENTAL DATA
Ellen Cromley illustrated the use of geospatial methods to locate populations from data sources other than EHRs and the U.S. Census. She described how two seemingly identical datasets can be viewed in different ways. The two datasets may look the same in their frequency distribution, median, range, or relationship between variables, yet yield a completely different result in a spatial statistical view of the data. She noted the locations of the observations, or spatial data, are part of the data record and are required to support effective spatial data analysis.
Cromley noted that when observations are made for a health study from a population that is itself geographically distributed, the sample is considered as having been taken in a geographic space. In that sense, any random sample from such a population is implicitly a spatial sample. She added that it is possible to advance health research by making the spatial basis of evidence explicit using geospatial methods; in other words, create the map first, understand the geographical distribution of the sample, and consider the geographical distribution of the population of interest. For example, Troped et al. (2014)8 took a sample from the Nurses’ Health Study in Massachusetts to look at the environmental context for physical activity statewide; once they mapped the locations of the participants, they found the sample pertained only to one specific area in the state, and ultimately had to draw another sample.
6 Root, E.D. (2012). Moving neighborhoods and health research forward: Using geographic methods to examine the role of spatial scale in neighborhood effects on health. Annals of the Association of American Geographers, 102(5):986-995.
7 Fowler, C.S., Spielman, S., Folch, D.C., and Nagle, N. (2018). Who Are the People in My Neighborhood? The “Contextual Fallacy” of Measuring Individual Context with Census Geographies. Working Paper 18-11. Washington, DC: Center for Economic Studies, U.S. Census Bureau.
8 Troped, P.J., Starnes, H.A., Puett, R.C., Tamura, K., Cromley, E.K., James, P., Ben-Joseph, E., Melly, S.J., and Laden, F. (2014). Relationships between the built environment and walking and weight status among older women in three U.S. states. Journal of Aging and Physical Activity, 22(1):114-125. doi: 10.1123/japa.2012-0137.
Cromley said it is possible to locate small populations by conducting surveys based on residential location, and gave examples where respondent data were geocoded according to census tract and used along with other data to measure the distribution of certain health conditions in older adults in New Jersey.9 She and her colleagues found that the respondent distribution aligned with the general population distribution for the state. They were able to identify combinations of chronic conditions that showed no particular pattern of spatial association.10
Cromley pointed out that close to 8 million people lived in group quarters in the United States—about 2.5 percent of the population—per the 2010 U.S. Census. This proportion varies widely, from 0 percent of the population in some counties to 55 percent in others. Populations in group quarters often differ significantly from the general population in age, gender, and mobility. She and her colleagues used data from MassGIS, a statewide, one-stop site for interactive maps and geospatial data for Massachusetts, to look at long-term care facilities. To look more closely at capacity, they measured the distance from specific points to every nursing home using Gaussian spatial weighting to obtain a geographically weighted mean. They could then use these data to design an appropriate sampling scheme.
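A geographically weighted mean of the kind mentioned above can be sketched in a few lines. This is a hedged illustration, not the analysis Cromley and her colleagues ran: it assumes planar coordinates, a Gaussian kernel of the form exp(-0.5 (d / h)^2), and a bandwidth h chosen by the analyst. The function name `gaussian_weighted_mean`, the facility values (for example, bed counts), and the sample coordinates are hypothetical.

```python
import math


def gaussian_weighted_mean(point, facilities, bandwidth):
    """Gaussian-kernel weighted mean of facility values around one point.

    `facilities` is a list of (x, y, value) tuples in planar coordinates.
    Each facility receives weight exp(-0.5 * (d / bandwidth) ** 2), where d
    is its straight-line distance from `point`, so nearby facilities count
    more than distant ones.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    px, py = point
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in facilities:
        distance = math.hypot(x - px, y - py)
        weight = math.exp(-0.5 * (distance / bandwidth) ** 2)
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("all facilities are too far away for this bandwidth")
    return weighted_sum / weight_total


# Two hypothetical facilities equally far from the point of interest,
# so the weighted mean is the simple average of their values.
local_mean = gaussian_weighted_mean(
    point=(0.0, 0.0),
    facilities=[(1.0, 0.0, 10.0), (-1.0, 0.0, 30.0)],
    bandwidth=2.0,
)
```

The bandwidth controls how quickly a facility's influence decays with distance; a small bandwidth yields a very local estimate, while a large one approaches the unweighted mean.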
In addition to using residential data, Cromley said that administrative data, including the registration of life events (e.g., births and deaths), can be used to locate people. She referred to an article in the Guardian (titled “Bussed Out”)11 that used public records laws to obtain data from 16 cities and counties that give homeless people free bus tickets to go somewhere else. They documented 21,400 such journeys across the United States from 2011 to 2017. Because administrative records are collected with specific decision-making purposes in mind and often contain identifying information, she noted that privacy and confidentiality issues are very important.
Cromley discussed a study on collision risk factors that drew on a map of motor vehicle collisions occurring on federal and state roads in Connecticut in 1995 and 1996.12 The data were linked with the state hospital trauma registry and mortality databases and with geocoded collision data from the Connecticut Department of Transportation. Cromley found that after the site data were analyzed, the majority of the accidents occurred on dry roads, involved male drivers, and happened during daylight conditions, which is not what one might have guessed. Furthermore, she realized that sampling just 1 of these 10 sites would not have provided the same picture of the data. She then used the data to compute local odds ratios to determine whether there are differences in the characteristics between fixed object crashes and other types of collisions at these sites.
9 Cromley, E.K., Wilson-Genderson, M.P., Heid, A.R., and Pruchno, R.A. (2016). Spatial associations of multiple chronic conditions among older adults. Journal of Applied Gerontology, 1-25. doi: 10.1177/0733464816672044.
10 Cromley, E.K., Wilson-Genderson, M., Christman, Z., and Pruchno, R.A. (2015). Colocation of older adults with successful aging based on objective and subjective measures. Journal of Applied Geography, 56(1):13-20. doi: 10.1016/j.apgeog.2014.10.003.
Cromley identified challenges to using administrative records. She added that errors can also occur in the spatial data, such as when zip codes overlap county boundaries. Researchers can respond to these challenges and improve data quality by talking to the people who collect and code the data.
Another approach Cromley has used involves locating populations based on their social networks. She said two seemingly similar networks can possess different properties when considered in real geographic space. Venue-based studies, another common approach, involve understanding patterns of use of venues by the population of interest. Cromley mentioned the usefulness of data from geo-enabled devices for comparing individuals’ interactions and movement in time and space. Residences, venues, and medical care facilities would appear on daily time-space paths as places where people colocate. The data could be processed by creating indexes of association.
Cromley discussed a study she and colleagues conducted in Mumbai, India, where they studied drinking and sexual risk in three low-income communities.13 They found that 111 of the 751 men surveyed reported drinking with friends in places outside the study community, which prompted them to adjust their index of relevance from what was found in their formative work.
Cromley called for an increase in common repositories for spatial data, where people can access data on their own and develop interventions specific to their communities. She cited the Malaria Atlas Project14 as a great model for how to build spatial data commons. The Census Bureau’s wealth of population and spatial data is another example, as is Michael Emch’s work in Matlab, Bangladesh,15 which brings together many methods to conduct evaluations of cholera vaccine trials.
12 Cromley, E.K. (2007). Risk factors contributing to motor vehicle collisions in an environment of uncertainty. Stochastic Environmental Research and Risk Assessment (SERRA), 21(5):473-486. doi: 10.1007/s00477-007-0130-5.
13 Cromley, E.K., Schensul, J.J., Singh, S.K., Berg, M.J., and Coman, E. (2010). Spatial dimensions of research on alcohol and sexual risk: A case example from a Mumbai study. AIDS & Behavior, 14(S1):S104-S112.
Thomas Louis (Johns Hopkins Bloomberg School of Public Health) discussed altitude as the “third dimension,” noting malaria studies in Mutasa, Zimbabwe, where no malaria was seen at higher altitudes. He added that in some instances straight-line distance may not be the appropriate metric, citing the study of pollution in the Chesapeake Bay as an example. He suggested the research community use the term “domain” instead of “space,” since connections are made not just between spatial attributes but also sociologic and demographic attributes.
Eugene Lengerich (Pennsylvania State University Cancer Institute) referred to 2017 recommendations by the National Cancer Institute (NCI), American Cancer Society, and other organizations around cancer health disparities research in five areas. The first area specifically relates to data and standards. At Penn State, they now ensure that all their data are able to be used for health disparities research. He asked how the community can use existing data systems, processes, and standards to implement advancements. He suggested creating standards to assist with collecting data on rural populations. He also agreed with Devers’s discussion on natural language processing and would like to see this utilized more regularly.
Devers responded that when clinicians collect data for EHRs, the data are specific to the clinical encounter. The clinician may not be thinking about additional data for research. She suggested working with clinicians to emphasize the importance of the quality and reliability of data they document. It is also important to work with data vendors and researchers who can explore different ways of efficiently getting information into a database.
Richard Moser (NCI) asked about current methods to link EHR claims data to survey data at an individual level when the identity of the individual is unknown. Devers clarified that she was referring less to linking federal survey datasets with existing EHRs and more to the ways in which a large health system could tap into its own database to identify small populations, proactively reach out to them, and leverage them for studies. She said some studies are combining EHR data with data from national primary surveys for cohorts of people and using various methods to try to leverage both.
Bob Sun (National Institute of Environmental Health Sciences) asked about technological solutions being built into EHR systems that can facilitate research and asked about features lacking in current systems. Devers noted EHR system vendors have groups of users to help inform them on the kind of data needed for day-to-day clinical care, which could also potentially be used for research. Most large health systems and EHR vendors are aware of the multidisciplinary nature of this work, she added, and several have interdisciplinary teams that work closely with computer and data science staff to try to better understand the needs for the systems.
15 See http://spatialhealth.web.unc.edu/projects/present-projects/incorporating-geographic-context-into-randomized-controlled-trials-case-studies-on-the-rtss-malaria-and-the-oral-choleravaccines/ [May 2018].
Mandi Pratt-Chapman (George Washington Cancer Center) said she has seen how zip code boundaries can cross over neighborhood boundaries and asked at what level research should start if the primary question is not spatially oriented. She also asked how to mitigate the risk of disclosing personally identifiable information in secondary analyses. Fowler noted that at the point of an interview, the risk is already established. He suggested using small units of data such as block centroids. He added that researchers should take care to remove personally identifiable information and referred to techniques that can combine the data with other statistical methods and ensure a quantifiable amount of protection. Cromley suggested collecting information at the most disaggregated level possible and promoting informed consent among participants so they understand how their data are being used. In addition, data can be collected at one level and reported at another level that would not enable identification of the individual.
Robert Croyle (NCI) said that a common dilemma for investigators is that available data are county-level data, but county sizes vary tremendously. He asked the panelists how they address that challenge. Fowler suggested involving a spatial statistician who can perform cross-section analyses that can divide the population in a way that will help assign context. For example, if it is known that the entire Asian population of a certain county is in one area, one could constrain the spatial likelihood to, or do probabilistic models on, that particular area. He said bootstrapping and other methods allow researchers to test the significance of their assumptions of a particular spatial configuration.