Read "Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition" at NAP.edu

Page 293 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology

Chair: Patricia Nechodom, University of Utah

Authors:

Selma C.Kunitz, Clara Lee, and Rene C.Kozlojf, Kunitz and Associates, and Harvey Schwartz, Agency for Health Care Policy and Research

Christian Houle, Jean-Marie Berthelot, Pierre David, and Michael Wolfson, Statistics Canada Cam Mustard and Leslie Roos, University of Manitoba

Steve Kendrick, National Health Service, Scotland

Page 294 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

This page in the original is blank.

Page 295 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Record Linkage Methods Applied to Health Related Administrative DataSets Containing Racial and Ethnic Descriptors

Selma C.Kunitz, Clara Lee, and Rene C.Kozloff, Kunitz and Associates, Inc. Harvey Schwartz, Agency for Health Care Policy and Research

Abstract

In response to the lack of easily retrievable clinical data to address health services and medical effectiveness questions, especially as they relate to racial/ethnic minorities, the Center for Information Technology (CIT), Agency for Health Care Policy and Research (AHCPR) recently sponsored a project on record linkage methodology applied to automated medical administrative datasets containing racial and ethnic identifiers (Contract 282–94–2005). The primary objectives of the project were to:

link patient-level related datasets that contain racial and ethnics descriptors; and assess the value of the linked data to address medical effectiveness research questions that focus on the quality, effectiveness, and outcomes from care for minority populations.

KAI, AHCPR's contractor, received approval from the State of New York's Department of Health to utilize the Statewide Planning and Research Cooperative System (SPARCS) files Discharge Data Abstract (DDA) and Uniform Billing files (UBF), which contain all acute hospital discharge and claims data, the SPARCS Ambulatory Surgical files, and the Cardiac Surgery Reporting System (CSRS) files, a research dataset. KAI received files for the 1991, 1992, and 1993 time periods. The files were successfully linked by patient “visits” across the time periods. While the linked data appear to be of high quality, the process of obtaining and linking the data is lengthy. Additionally, these administrative health care data sets contain millions of records that document all hospital stays, and thus, identifying appropriate subpopulations for a particular research question is a time- and resource-consuming effort.

While the administrative health care datasets may be useful in answering questions about charges, length of stay, and other health service issues, their current utility may be less useful in answering clinical questions for minority populations. These datasets can be used to explore potential associations among diagnoses, treatment, and outcome variables. However, understanding the mediating factors and the decision-making variables that result in patient care may not be possible. For example, the results of diagnostic tests such as angiograms are not generally recorded in these datasets, thus limiting the ability to carefully subgroup patients by disease severity. With consideration for the potential utility of these datasets, however, there are several recommendations that emanate from the study.

This talk will briefly describe the research questions posed, linkage process, findings, and recommendations for additional action and policy considerations.

Page 296 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Introduction

The ability to link automated health data records is of critical importance in our rapidly changing health care system. In a managed care and cost containment environment, researchers require reliable and valid data collected over time and across providers that describe patient characteristics and the location, process, cost, quality, and outcome of care to analyze which procedures are effective and produce satisfactory patient outcomes. Approaches and methods to linking records across time and providers are needed to provide information to policy makers, health plans, practitioners, consumers, and patients to make decisions about accessing, using, and paying for care, as well as the effectiveness of that care.

Background

In response to the lack of easily retrievable clinical data to answer medical effectiveness questions, especially as they relate to racial/ethnic minorities, the Agency for Health Care Policy and Research (AHCPR) sponsored a project on “Record Linkage Methodology Applied to Linking Automated Data Bases Containing Racial and Ethnic Identifiers to Medical Administrative Data Bases” under AHCPR contract number 282 –94–2005 (Kunitz and Associates, Inc., 1996). This linkage demonstration project contributed to AHCPR's research goals by reviewing and adding to record linkage methodology; illustrating the value of this methodology; assessing the need for further development; and providing guiding principles to developers. The primary objectives of this record linkage methodology project were to: link two patient- level related datasets that contain racial and ethnic descriptor; and assess the value of the linked data to address medical effectiveness research questions that focus on the quality, effectiveness, and outcomes from care for minority populations.

Data Sets

AHCPR's contractor, KAI, a health research firm, identified data sets to use for assessing the value of linking administrative health related data bases to support medical effectiveness research in minority populations. KAI received approval from New York State's Department of Health (NYSDOH) to utilize the Statewide Planning and Research Cooperative System (SPARCS) files Discharge Data Abstract (DDA) and Uniform Billing files (UBF), the SPARCS Ambulatory Surgery files, and the Cardiac Surgery Reporting System (CSRS) files. KAI received files for 1991, 1992, and 1993. The selected systems and data files are briefly described as follows:

SPARCS (State Wide Planning and Research Cooperative System) is a system maintained by the NYSDOH. The Discharge Data Abstract files (DDA) contain all acute hospital discharge data and the Uniform Billing Files (UBF) contain all acute hospital billing records. Data about surgeries performed at hospital-based ambulatory care centers and certified diagnostic and treatment freestanding centers are maintained in the Ambulatory Surgery files. The data are used for planning and research. Three of the files extracted from SPARCS for this project were the DDA, UBF, and the Ambulatory Surgery file.

NYSDOH staff combined the acute hospital DDA and UBF data files by individual hospital stay for this project. Thus, we received both matched and unmatched records from the DDA and UBF for 1991–1993. Because the files were selected based on DDA variables, unmatched records are those that are in the DDA file but do not have a corresponding match in the UBF. A completeness level of 95% is typically achieved in SPARCS files, a figure that is supported by our re-

Page 297 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

search, as seen in Figure 1. Yet, those records which are unmatched may reflect not only missing UBF records, but also incorrect information which may have hindered the original matching process performed by the NYSDOH.

Figure 1. —Matching Rates Between DDA and UBF Records

Year	Total	Matched	Unmatched
			N	%
1991	Total	Matched	626,222	594,302	31,290	5%
1992	699,246	663,323	35,923	5%
1993	714,583	677,778	36,805	5%

KAI obtained 31,290 unmatched records out of a total of 626,222 for 1991, 35,923 unmatched out of a total of 699,246 for 1992, and 36,805 unmatched records out of 714,583 for 1993. The selection process did not enable KAI to receive data which was in the UBF file but absent in DDA. In addition, we received the Ambulatory Surgery files for these three years.

Cardiac Surgery Reporting System (CSRS) is a voluntary reporting system of all in-hospital cardiac surgeries. It contains risk factors, clinical descriptors and procedure data and is used as a research data set. We received these files for 1991 –1993.

Figure 2 summarizes the size of the original data files. The DDA/UBF files contain between 2.5 and 3 million records each year. The ambulatory surgery files were not segregated by year and contain slightly more than two million records. The CSRS data files are also summarized by year and contain a considerably smaller number of records because of the more narrow focus of the records on cardiac surgery.

Figure 2. —Summary of Sizes of Complete Data Files

Data Set	Year(s)	Number of Records
SPARCS	1991	1,687,521
		1992	1,677,948
		1993	1,660,109
Ambulatory Surgery	1991–1993	2,121,542
Cardiac Surgery Reporting System (CSRS)	1991	19,783
	1992	21,592
	1993	22,491

Page 298 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Research Question

One of the primary goals of this project was to determine whether a medical effectiveness research question could be successfully addressed by the linked data. The selected research question for this project relates risk factors, treatment, and outcome of cardiovascular disease to minority status:

Are the racial/ethnic differences in mortality and morbidity from coronary heart disease related to racial/ethnic differences in treatment?

The working hypothesis stated that minorities are less likely to receive surgical treatment for coronary artery disease and, therefore as a group, experience higher incidence of cardiovascular morbidity and mortality than the majority U.S. population. The cohort was to be extracted from the linked SPARCS and CSRS datasets. The linked datasets were to contain records for 3 years, 1991–1993. Males and females aged 45–75 who were assigned a diagnosis of ischemic heart disease (ICD-9 codes 410–414) were to be included.

Confidentiality Approach

One of the primary issues in acquiring the New York State files was data confidentiality. Technically, the problems of confidentiality of data are often addressed by suppressing, encrypting, or compressing information. In these data sets primary identifiers such as name, address, and telephone number were removed or suppressed from the files and secondary identifiers such as Medical Record Number (MRN), Admission Number, and Physician License Numbers (PLNs) were encrypted consistently across files and years to aid the matching process. Typically, confidentiality restrictions hinder the matching of large data sets. Identifiers such as name, address, and medical record number are important in order to be confident that the correct linkages are being made. If only demographic data and broad geographic identifiers are available such as gender, race, age, and zip code, then a large group of people may have the same characteristics with the result that their records inaccurately matched.

Cardiac Subset—Identification and Issues

The original research plan specified the use of ICD-9 codes 410–414 to address the research question. The low yield on initial matches, however, indicated that we needed to expand these codes to obtain a more complete record match between the SPARCS and CSRS files. Therefore, for the linkage process, the codes were expanded to include: 390.xx —459.xx—disease of the circulatory system; 212.7x—benign neoplasm of the heart; 745.xx—bulbus cordis anomalies and other cardiac anomalies; 861.0x—injury to the heart without open wound to thorax; 861.1x—injury to the heart with open wound to thorax; 901.xx— injury to thoracic aorta; and 996.0x—mechanical complication of cardiac device. Figure 3 summarizes the number of potential patients on the DDA/UBF files using the ischemic heart disease codes (ICD-9 410–414) and an expanded set of codes.

Figure 3. —Universe of Patient Records in DDA/UBF

DDA/UBF File Year	Initial Universe of ICD-9 Codes—410–414	Expanded Universe of ICD-9 Codes—390–459
1991	170,779	626,222
1992	189,198	699,246
1993	190,497	714,583

Page 299 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Record Linkage

The linkage software used for this project was MatchWare Technology Incorporated's (MTI) Automatch, developed by MTI's founder, Matthew Jaro (Jaro, 1997). MTI was KAI's subcontractor and its linkage experts collaborated with KAI's clinical researchers in conducting this project.

Several steps were involved in the data preparation process prior to performing the record matching or linking process. Fields that are common to the files had to be identified and recoded, where necessary, for potential use in the linkage process. Common person and event fields included for all three data sets were MRN, sex, date of birth, patient county, hospital identification number, diagnosis, procedure code and date, and Physician License Number (PLN). Fields common to two of the three files included age, patient zip code and state, admit date, discharge date, and payor.

As an example of recoding needs, race codes on the CSRS files were converted to correspond to SPARCS codes as shown in Figure 4.

Figure 4. —Race Code Conversions

Description	SPARCS Race	CSRS Race
Asian or Pacific Islander	1	8
Black	2	2
Hispanic	3	8
Native American	4	8
Other	5	8
White	6	1

Linkage Objective

The linkage objective was to build a longitudinal, comprehensive patient history that captured clinical encounters over time and across care settings. Thus, records for the same patient were linked in two ways: matches were performed within each of the three data sets; and matches were performed between the DDA/UBF files and CSRS and between the DDA/UBF and Ambulatory Surgery files.

Steps in Record Linkage

Steps in the linkage process included identifying duplicate records; running preliminary matches as an iterative process to determine which fields yielded the most appropriate matches; identifying appropriate cutoff weights; and running the final linkage.

Duplicate records were identified on each of the files with no file having duplicates that exceeded 1% of the records. Automatch's method for determining most effective variables and probability weights to match across files was evaluated in preliminary iterative match runs. The process was iterative and consisted of selecting key variables for each match strategy, producing preliminary matched pairs, examining matched pairs with marginal match weights, and revising the parameters to better discriminate between apparent true and false matches. For the final matches specific probabilities of agreement were

Page 300 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

determined based on the preliminary matches. The match cutoff weight was chosen so that the estimated absolute odds of a true match for record pairs with that match weight were 95:5.

Linkage Data Quality Analysis

The linkage results were reviewed for data reliability and validity. First, the same variables on linked and unlinked records were compared to assess internal consistency and reliability. Agreement was 99% or greater for all variables except for date of principal procedure (67%) and admission number (83%); MRN, zip code, county, and other procedure each exhibited an agreement rate of 93%. Principal procedure as well as other procedure differences may reflect differences in reimbursement categories that were changed on the UBF for payment advantages. Admission number and MRN are scrambled by computer and any clerical error such as a transposition of numbers in the original MRN yields an inconsistent scrambled MRN. Likewise, transposition of numbers in zip code and county can yield mismatches.

The DDA and UBF responses for linked and unlinked records were then compared for the same patient. The responses are fairly consistent across DDA and UBF subfiles and between linked and unlinked records with slight differences in reimburser and diagnoses, which could be a function of the research question reflected in the linked files.

The DDA variables were selected for matching and were compared for linked and unlinked patient records, because of their tendency to be more reliable in the clinical area. In the linked records, patients are older (age = 65–71% versus 59% for unlinked records), most likely reflecting the research question which focuses on cardiac diagnoses. Racial characteristics are similar as are ethnicity and gender.

Linked and unlinked records for Ambulatory Surgery patients were also compared. Analysis showed a greater percentage of the linked records to have a higher proportion of angina as the primary diagnosis while in the unlinked files there was a higher proportion of arterial disease, perhaps reflecting procedures performed in ambulatory surgery, i.e., angiograms. There were more Medicare reimbursers in the linked records, which is consistent with differences in age groups. Other fields show no differences. Linked records compared with unlinked records for the CSRS patients showed a greater proportion of persons over 65, most likely reflecting the diagnostic groups of research interest. There were no gender, race, or ethnicity differences in the linked and unlinked records, reflecting similar patient populations.

The general consistency between the DDA and UBF subgroups and the consistency between linked and unlinked records within each of the data sets demonstrate the reliability of the matching and indicates that the linked records generally reflect the file population.

Racial Subsets

Responses across racial subgroups for DDA variables were reviewed. As expected, more Blacks, Asians, and other minorities are treated in the New York City area (over 70%) than other parts of the state. Payment also differs, with a higher proportion of Whites on Medicare (69% versus 46% for Blacks and Others and 40% for Asian Americans). A higher proportion of Blacks and other minorities have Medicaid as the primary reimburser (Blacks—26%, Whites—5%, Asian Americans—25%, Other— 27%). Blacks have a higher proportion of diabetes (4% versus 1% for Whites, 2% for Asian Americans and Other) and hypertension diagnosis (5% versus 1% for Whites, and 2% for Asian Americans and Other), and a slightly lower proportion of myocardial infarctions (Whites—11%, Blacks—7%, Asian Americans—10%, Other—11%) as principal diagnosis. Responses for other variables for linked and

Page 301 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

unlinked records by racial and ethnic categories are consistent, indicating that the linked file is a representative subset of the larger file.

Research Subsets

The research subsets, defined as the original diagnoses categories, 410.xx-414.xx, were examined next. Comparing the DDA and UBF records on the SPARCS data set for linked and unlinked records indicates that age is higher on the linked records (age = 65 = 75%) than on the unlinked records (= 65 = 59%), reflecting the cardiac procedure research question. Also reflecting the research question is the larger number of patients on Medicare in the linked data set (74% versus 59% in the unlinked data set). Comparisons between linked and unlinked records in the ambulatory surgery research files indicate no significant differences between the two subsets.

A review of responses for racial and ethnic subgroups for the linked and unlinked subsets in the DDA research file indicates that in both, Whites are significantly older (78% Whites in the linked subset and 66% Whites in the unlinked subset are 65 or older). In the other racial categories, however, there is a larger proportion under 65 (Blacks—42%; Asian Americans—33%; and Other—56%) in both linked and unlinked subgroups. The age differences between White and minority racial subgroups are also reflected in the proportion of patients on Medicare. There do not appear to be other major differences between White and minority subgroups. These trends are also reflected in the differences between Hispanic and non-Hispanic subgroups.

Linked Data Sets and the Research Question

Preparing the data to answer the research question was a complex process despite the fact that record linkage had taken place. The primary reason for the complexity of the process is that the research question focuses on outcome while the linkage focused on diagnosis. The linkage focus on diagnosis appears logical because it is how patients are generally categorized for health services and clinical research. However, medical effectiveness questions often focus on outcomes and thus, within diagnoses, outcome is an important patient characteristic. The research question, while resulting in a complex subject identification procedure, was typical of many medical effectiveness questions. The amount of time, then, needed for progressing from a linked data set to analyses for outcomes research, is several months and should be built into the research planning process.

Data and Linkage Issues

Several issues related to health care data sets and application of linkage methodology were identified:

Purpose. —The purpose of the primary data collection endeavor impacts on the quality of specific variables and on their utility for linkage and their relevance for addressing a medical effectiveness question. For example, primary diagnosis frequently differed between the DDA and UBF subfiles. The diagnoses in the DDA is driven by clinical practice while in the UBF it is driven by reimbursement. Variables such as age and date of birth, gender, county of residence, hospital identification number, MRN, admission date, and procedure date may not be consistent across billing and discharge administrative records as well as the research records for several reasons: accuracy is not important for billing, discharge, and some research; an individual 's high anxiety state; and family members reporting information under stress. Further, discharge abstracts generally reflect clinical diagnoses more accurately, while billing data typically reflect

Page 302 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

charge justification.

Encryption. —Encrypting the Medical Record Number (MRN), admission numbers, and physician license numbers degrades the efficiency of the matching software. The matching software used in this study can take into account slight differences among identifiers such as transposition of characters and adjust the match for them. However, since the encryption process scrambles identifiers or assigns a sequential number to records, the software is not dealing with actual numeric identifiers, which may have typographical errors. Thus this feature of the software is not useful for electronically encrypted or created numbers. The degradation was demonstrated in the first matching pass between the ambulatory surgery file and the DDA/UBF file. The MRN in the DDA/UBF file is defined as ten characters and was encrypted as such. The MRN in the Ambulatory Surgery file is defined as seventeen characters in which the first ten characters actually contain the MRN and the last seven characters are spaces. When the initial match between the DDA/UBF file and the Ambulatory Surgery file took place there were no matches. The resolution involved the recreation of the Ambulatory Surgery file using only the first ten characters of the MRN in the encryption process. If however, the MRNs had been provided without being encrypted, the software could have adjusted for the spaces at the end of the original MRN in the Ambulatory Surgery file.
Race and Ethnicity Codes. —The race and ethnicity codes are not always accurate as demonstrated by all observations for a particular New York State hospital which contained a race code of 5 and ethnicity code of 2 for all patient records. Additionally, the state SPARCS programmer indicated that there were software problems for RACE and ETHNICITY for certain hospitals that affected accuracy.
Dependent Relationships Among Variables. —Certain pairs of patient and provider variables are strongly dependent on each other. For example, MRN is frequently hospital-specific and physicians are generally associated with only a few hospitals, thus PLN (Physician License Number) and Hospital Identification Number are also strongly dependent as shown statistically by chi square and uncertainty coefficient tests. The Automatch software requires that only one member of each dependent pair is used as a match variable because of relative odds of a true match calculation. For example, if both date of birth and age were used in a matching process, the calculated match weight would overstate the relative odds of a true match by exactly the contribution of the second occurrence. While date of birth and age represent the same concept, hospital and physicians may be logically independent entities although statistically associated. The nature of association in health related records should be considered in the matching process and perhaps, a different statistical approach used for these data.
Matching Variables. —A related issue is determining what variables provide the greatest yield during the blocking and matching procedures. Linking is generally dependent upon person identifiers such as name and address, and date of birth, as well as on procedure and diagnosis codes from health related records. Since name and address were omitted from the files used to preserve personal privacy, other variables assumed greater importance. The clinical research staff, experienced with clinical data, recommended the use of age and date of birth, gender, county of residence, hospital identification number, MRN, admission date, and procedure date. The researchers pointed out that procedure and diagnoses codes can vary between administrative and clinical data sets because of reimbursement interests and are more likely to be accurate in clinical files. Identification of the variables most appropriate for linking health related files is still an open research issue.

Page 303 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Type and Number of Variables Utilized for Linking. —Personal identifiers such as name and address are frequently used in census and vital statistics linkage efforts. Since these variables are not present on the health files, other variables that appear in several files and have a high probability of accuracy must be identified. Some examples are hospital identification number, admission date, and zip code. Additionally, linkage software experts often argue for numerous variables upon which to link. We found that the health-related data sets were more frequently linked with fewer discrepancies in the matching records when fewer variables are used. Thus the percentage of “true” matches was higher with fewer variables or, conversely, the number of false positives was lower.
Experience from Other Applications. —Experience and assumptions gathered from other applications of linkage methodology such as census data cannot necessarily be applied to health-related data. Thus, for health-related data, multidisciplinary teams of linkage software programmers and health researchers need to develop appropriate linkage algorithms and to identify variables pertinent for linking these files.

Findings

Despite time delays and other issues, the files were successfully linked and the data were used to address the above hypothesis that pertains to care among minority populations. General findings are as follows:

Data quality in the administrative and research files generally appears high and the data are potentially useful for health services research.
Both the linkage process and the analytic phase for large data sets are lengthy and resource consuming. The practicality of linking large health-related data sets needs to be balanced against the number of years the data will be useful. If data can be used to support research for three to five years, then the linkage overhead expense may be justifiable. Costs of linking large data sets, then, need to be balanced against the potential benefits.
Linking is only the first step when the data are to be used to address research questions. The linkage process identifies a set of unique indexes for each of the patient records in each of the linked files. Depending upon the focus of the research question, it is necessary to carefully review the data files and the index files, which consumes both time and computer processing. Since the data files for large data sets must reside on mainframe computers, it also is a costly process.

In this project, in which those subjects with the same diagnoses who received cardiac surgery are compared to those who did not, patients with relevant diagnoses had to be identified to form a subgroup from the SPARCS DDA/UBF files. The subgroup had to then be identified on the index files, determined whether linked or not linked to the CSRS file, and then found on the CSRS files. These steps precede any analytic procedures and represent the complexity of data management procedures that are associated with the analysis of the linked files.
Utility of administrative data sets in answering medical effectiveness questions is variable. Clearly, identifying diagnoses, treatment, and outcome at a general level is possible and meaningful. The data set can be used to explore potential associations among diagnoses, treatment,

Page 304 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

and outcome variables. However, understanding the mediating factors and decision making variables that result in a patient proceeding to surgery or not may not be possible. For example, the results of an angiogram for a patient with ischemic heart disease are not recorded in SPARCS DDA/UBF or in CSRS. Thus, understanding why some patients who have angiograms proceed to surgery and others do not is not possible.

Recommendations

This project yielded the following recommendations:

Utilize Linking Techniques for Projects with a Three- to Five-Year Life. —Because of the time, labor, and financial costs of linking large data sets, it would appear practical to utilize linking techniques for data that can be analyzed over a period of three to five years.
Continue Methods Research. —Issues in data dependence and optimal variables for use in linking health related data sets should be addressed in additional research projects.
Multidisciplinary Teams. —The need for utilizing multidisciplinary teams composed of health researchers, programmers, and linkage experts was demonstrated in the linkage process.
Linking Prior to Research Use. —Future efforts may enlarge record linkage before data are released from the agency that holds authority for the data to avoid degradation of data from scrambling or encryption. Linking prior to release across agencies raises issues of data sharing, protection of privacy, and other operational issues that must be addressed.
Recognize Time Needed for Research. —Research efforts using linked data sets must allocate sufficient time and manpower resources to identify and extract the suitable subpopulation for a specific research question.

Selected References

Kunitz and Associates, Inc. ( 1996). Record Linkage Methodology Applied to Linking Automated Data Bases Containing Racial and Ethnic Identifiers to Medical Administrative Data Bases. Unpublished Final Report.

Jaro, Matt ( 1997). MatchWare Product Overview, Record Linkage Techniques—1997, Washington, DC: National Academy Press.

Schwartz, H.; Kunitz, S.; Jaro, M.; Therlault, G.; and Kozloff, R. ( 1996). Studying Treatment Variation among Minority Populations via Linked Administrative and Clinical Data Sets, Proceedings of the Section on Social Statistics, American Statistical Association.

The views expressed in this paper are those of the authors and do not necessarily represent the views of the Agency for Health Care Policy and Research.

Page 305 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Matching Census Database and Manitoba Health Care Files

Christian Houle, Jean-Marie Berthelot, Pierre David, and Michael C.Wolfson, Statistics Canada; Cam Mustard and Leslie Roos, University of Manitoba

Abstract

Introduction: In the current economic context, all partners in health care delivery systems, be they public or private, are obliged to identify the factors that influence the utilization of health care services. To improve our understanding of the mechanisms that underlie these relationships, Statistics Canada and the Manitoba Centre for Health Policy and Evaluation have set up a new database. For a representative sample of the population of the province of Manitoba, cross-sectional microdata on individuals ' health and socio-economic characteristics were linked with detailed longitudinal data on utilization of health care services.

Data and methods: The 1986–87 Health and Activity Limitation Survey, the 1986 Census and the files of Manitoba Health were matched (without using names or addresses) utilizing a CANLINK software. In the pilot project, 20,000 units were selected from the Census according to modern sampling techniques. Before the files were matched, consultations were held and an agreement signed by all parties to establish a framework for protecting privacy and preserving the confidentiality of the data.

Results: A match rate of 74% was obtained for private households. A quality evaluation based on the comparisons of names and addresses over a small subsample established that the overall concordance rate among matched pairs was 95.5%. The match rates and concordance rates varied according by age and household composition. Estimates produced from the sample accurately reflected the socio-demographic profile, mortality, hospitalization rate, health care costs, and consumption of health care by Manitoba residents.

Discussion: The match rate of 74% was satisfactory in comparison with response rates reported by the majority of population surveys. Because of the excellent concordance rate and the accuracy of the estimates obtained from the sample, this database will provide an adequate basis for studying the association between sociodemographic characteristics, health and health care utilization in province of Manitoba.

Introduction

A number of studies have clearly shown that there is a link between an individual's socio-economic status and the probability of his or her death during a given period of time (Wolfson et al., 1993; Marmot, 1986; Wilkins et al., 1991). Other studies have shown that the prevalence of certain diseases varies greatly depending on the socio-economic characteristics of the area in which an individual resides (Anderson, 1993, Dougherty, 1990). In addition, several Canadian surveys have already provided cross-sectional data on individuals ' health status and socio-economic status, along with self-reported information on the use of health services, e.g., General Social Survey of 1991 (Statistics Canada, 1994a), Ontario Health Survey of 1990, Enquête Santé Quebec of 1987 and 1992–93, Health and Activity Limitation Survey of 1986 and 1991 (Statistics Canada, 1988), Canadian Health and Disability Survey of 1983–84 (Statistics Canada, 1986a), Canada Health Survey of 1978–79 (Health and Welfare Canada, 1981). However, to our knowledge, there is no Canadian longitudinal database that combines information on health, use of health services, and socio-economic characteristics. In an effort to meet this

Page 306 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

information need, Statistics Canada and the Manitoba Centre for Health Policy and Evaluation (MCHPE) set up a joint pilot project to evaluate the possibility of creating such a database using existing data sources.

The primary objective of the pilot project was to evaluate the feasibility of combining the following three data sources: the 1986 Census of Population, the 1986–1987 Health and Activity Limitation Survey (HALS) and the Manitoba Health (MH) longitudinal file on health care service utilization. The database resulting from this combination will enable researchers to explore new directions with respect to health determinants. In this article, we describe the matching of files, the selection of the sample for analysis purposes and the results, which show the representativeness of the database created and validate the techniques employed.

Confidentiality and Right to Privacy

When creating a database from both administrative and survey data, it is essential to ensure the confidentiality of the data and prevent any unwarranted intrusion into individuals' privacy. In accordance with the policies of the collaborating agencies, certain procedures were undertaken prior to matching these data sets. They include consultations with the Privacy Commissioner of Canada, the Faculty Committee on the Use of Human Subjects in Research at the University of Manitoba, and Statistics Canada's Confidentiality and Legislation Committee. In addition, Manitoba Health's Committee on Access and Confidentiality was informed of the project.

Following these consultations, and in accordance with the formal policies of Statistics Canada, the Minister responsible for Statistics Canada authorized the matching as outlined below:

A pilot project for evaluating the feasibility and utility of data matching.
It was explicitly stated that individuals' names and addresses would not be used for matching purposes, nor would they appear in the database.
The matching would be done entirely on the premises of Statistics Canada by persons sworn in under the Statistics Act.
Only a sample of 20,000 matched units would be used for purposes of research and analysis; and
Access to the final data would be strictly controlled in accordance with the provisions of the Statistics Act. In addition, all activities with the linked data set are covered by a memorandum of understanding including Statistics Canada, the University of Manitoba and the Manitoba Ministry of Health.

Data

The detailed questionnaire (questionnaire 2B) of the 1986 Census of Population contains extensive socioeconomic information including variables such as family composition, dwelling characteristics, tenure, ethnic origin and mother tongue, as well as a number of variables relating to income and educational attainment (Statistics Canada, 1986b). This questionnaire was filled by persons residing in Manitoba on June 3, 1986 in a proportion of approximately one household in five. The other households completed a short form designed solely for enumerating the population. Thus, the file used for matching purposes consisted of 261,861 records. The individuals represented by these records lived in two types of dwellings: private or collective. While this article focuses primarily on the private household component, there were in 1986 more than 26,161 persons in Manitoba living in a collective dwelling according to the Census. Examples of collective dwellings are hospitals, hospices,

Page 307 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

nursing homes, institutions for the physically handicapped, orphanages, psychiatric institutions, hotels/motels, work camps, jails, Hutterite colonies, military residences, religious institutions, student residences and YMCAs.

The 1986 HALS was a postcensal survey that sought to identify individuals who, because of their health, were limited by the type or amount of daily activities that they could perform. A postcensal survey refers to a question from the Census (in this case, Question 20 on disabilities) which serves to enrich the survey sample by identifying a high proportion of the target population. An appropriate questionnaire was then completed for each person sampled. For HALS, the Manitoba population living in private households and having disabilities was studied on the basis of a sample of 5,480 persons representing a population of 150,857 persons having at least one disability. The data set created, contained information on individuals' health and functional limitations as well as on type of employment; educational level, transportation, housing and recreation. Since the survey was of the self-reporting type, the data represent the situation of respondents from their viewpoint rather than from an administrative or clinical viewpoint.

The MH longitudinal file, for its part, contains information on visits to physicians, stays in hospital, diagnoses, surgical procedures, admission to personal care (nursing) home, health care received at home, the date and cause of death, and other data on health care utilization. A number of innovative studies in health care research have used this file (Roos et al., 1987, Shapiro and Roos, 1984). For this pilot project, a register of persons covered by Manitoba health insurance was identified from June 1986, using the date of commencement of health insurance coverage and the date of cancellation of coverage. The register contained 1,047,443 records.

Methods

The matching project was divided into three main stages. The first stage consisted of pairing individuals belonging to three distinct data sources. The second stage consisted of assessing what proportions of the pairs formed represented the same individual. The third stage consisted of selecting a sample of 20,000 matched units used to create the database for analysis purposes. In this section, we shall deal with the methodology used in each of these stages in turn.

Matching

The CANLINK system (Smith, 1981; Fellegi and Sunter, 1969) developed at Statistics Canada, was used for the pairing stage. CANLINK is a probabilistic matching software that pairs records from two sets of data by using the discriminatory power of the common variables available. The software weights the pairs of records according to the degree of concordance of the values observed and also takes account of the probability of random concordances. The files paired were that of the 2B sample from the 1986 Census covering the province of Manitoba and the file of persons registered with MH in June 1986, containing a subset of the variables available. Only these two files were involved in the probabilistic matching, since the 1986–1987 HALS sample was drawn from the Census 2B sample (Dolson et al., 1987), and all HALS records were already paired to those in the Census by a unique identifier.

The individual records which were paired came from two files, one containing the records of 261,861 individuals living in Manitoba (derived from the 2B file of the 1986 Census), and the other, containing the records of 1,047,443 persons (a derivative of the Manitoba Health file). The strategy adopted for identifying pairs representing the same individual (good pairs) consisted of dividing up the two data sets into blocks and forming only pairs of individuals belonging to the same block.

Page 308 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

The pairs of records were compared only if all the blocking variables concurred. It was therefore necessary to choose carefully so as not to eliminate at an early stage a great number of “good” pairs. It will be recalled that the most discriminant variables, namely surnames, given names and addresses, were not used in this study. Because of this constraint, we were forced to choose other combinations of variables that were limited in discriminatory power and then apply innovative techniques.

Two matching phases were carried out. First, after examining various possible definitions, we defined a block as a set of four individual characteristics, namely a person's sex, year of birth, month of birth and postal code. In the second matching phase, the definition was relaxed in order to form more pairs of individuals. The exact year and month of birth were replaced by the person's age, which made it possible to compare an individual with a greater number of candidates. In addition, the area covered by the geographic variable in urban settings was expanded by a factor of approximately three, with the postal code being replaced by the census enumeration area.

Through these matchings the census file was divided into three subsets: records which had clearly matched (definites), those which had matched but for which the discriminatory power of the available variables raised a doubt (based on CANLINK criteria (possibles)), and those which had never matched.

While the information on family structure was used in the matching process, the CANLINK system compared only two individuals at a time, without taking account of matches obtained for other family members. We had to define a series of rules in order to ensure the consistency of matchings within a given family and between two matching phases (David et al., 1993).

Evaluation of the Concordance of the Pairs Formed

The evaluation pursued two objectives. First, it was important to determine the degree of accuracy with which we had associated the Manitoba Health data with the Census data (definite matches only). Then it was necessary to assess whether the rules that had been developed for rejecting certain “possible” matches were adequate.

A sample of 1,000 families was drawn, representing 2,102 matched individuals. As stratification variables, we used urban/rural area as determined by Census, family composition (person living alone, couple with child, couple without child, multiple family) and matching status (definite or possible). MH extracted the names and addresses of all these individuals and their family members. It should be understood that this identifying information was not used to determine the validity of specific matches, but only to estimate actual matching rates at aggregated levels. Names and addresses were compared manually with those on the mircrofilmed 2B questionnaires kept at Statistics Canada.

Sample Selection

As the project involved three databases, the sampling frame was derived from the 2B file from the 1986 Census. The sample size was already set at a maximum of 20,000 units and the database created had to combine information from the census files and the MH file, as well as information from the HALS file. The HALS sample used the individual as its sampling unit, whereas the analysis of the overall population of Manitoba used the household as defined by the Census. Several options were considered in order to try to construct a single database, however the complexity of the analysis would have negated any potential gains in accuracy. To ensure a balance between simplicity of analysis and an effective design, the selection process consisted of constructing two independent databases: the first, to study the link between disability, socio-economic status and health, and

Page 309 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

the second, to analyse the general population of Manitoba. To maximize the use of the 20,000 units, it was also necessary to take account of the overlap between these two databases.

Owing to the complex sample design of HALS, the relatively small number of individuals sampled and the importance of this database from an analytical standpoint, matched individuals with disabilities were all selected. These accounted for 4,434 basic units. This sample formed the first database, used for the analysis of persons with at least one functional disability.

There were therefore 15,566 units left to form the general population database plus the expected number of units overlapping the two databases. Still pursuing the objective of optimizing the sample design, an evaluation indicated that stratification was appropriate. Stratification has several advantages. First, it serves to reduce the overall variance of the estimates. Second, it ensures a standard of quality for estimates relating to subgroups of interest in the population. Third, stratification can result in improved accuracy for cases in which non-sampling error can be taken into account. Finally, stratification is especially effective when the stratification variables are correlated with the target variable.

Since a number of studies have established links between socio-economic status and health, it was natural to use socio-economic variables to construct the strata. In addition, there was no disadvantage to using the household as the sampling unit, since socio-economic status is generally the same for all members of a given household. Since it was the 1986 Census file that entirely determines the composition of this population, all the stratification variables were either taken directly from that file or derived from it. The final number of strata for private households was 611. The total number of units drawn was 16,387. These represented 46,670 persons.

Finally, it is common practice to adjust sampling weights so that the totals estimated by the sample will reflect as accurately as possible the counts of the population studied. With post-stratification, the counts can be adjusted for categories for which the number of units was insufficient to create a real stratum but which were of sufficient analytical importance to justify the use of special techniques. These techniques changed the initial weight subject to the constraints of minimum change (Kovacevic, 1995). For private households, the counts by age group, rural or urban geographic area, marital status and sex were used to adjust the weights to the individual level, while rural or urban geographic area, household size and tenure played the same role for adjusting at the household level.

Results

Results for Matching

Despite the conservative approach applied to the initial matches, overall, 74% (174,476 out of 235,700) of individuals from the census living in a private household were matched with an individual in the Manitoba file. This rate varied according to geographic mobility, age, marital status and family size.

The factors that had the greatest influence on the match rate were all related to individuals' geographic mobility. Hence, the following groups of individuals were more difficult to match: young adults (between 20 and 25 years of age: Figure 1), persons who had changed their place of residence between the 1981 and 1986 censuses (Table 1), and divorced or separated persons (Table 2). Among these groups, frequent changes of address and family structure made concordance between the two data sources more difficult than among less mobile groups. The reason for this is that since the Census figures date from June 3, 1986, and some MH data are dated December 31, 1986, there was more likely to be an information lag with respect to mobile individuals.

Page 310 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Figure 1. —Match Rate by Age

Private Households Only

Table 1. —Match Rate According to Mobility: Private Households Only

Mobility	Match Rate %
Same household	81.7
Same CD	65.8
Other CD	62.5
CD: Census Division, a geographic unit used by the Census. Manitoba is made up of twenty-three Census Divisions.

Page 311 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Table 2. —Match Rate According to Marital Status: Private Households Only

Marital Status	Match Rate %
Married	78.5
Widowed	74.5
Single	71.2
Divorced	61.4
Separated	43.4

The effect of age on the match rate was not surprising. Children under fifteen years of age and adults between thirty and sixty years of age had better rates, owing to their more stable situation. Among individuals over 85 years of age, there was greater variability in the data due to the rate of institutionalization and the small number of cases.

Among individuals who did not move between the 1981 and 1986 censuses (same household), one might have expected an even better match rate. The rate of 81.7% is perhaps an indication that using the methodology described thus far, there is an upper limit of around 80% on matches, given that the files are not totally free of errors.

Intuitively, family size is correlated in two opposite ways with the match rate. While a large family has an intrinsic constraint on the mobility of the family nucleus, some members of the family will periodically attach or re-attach themselves to this nucleus. Table 3 indicates that match rate dropped off significantly as family size increased.

Table 3. —Match Rate According to Family Size: Private Households Only

Family Size	1	2	3	4	5	6	7	8	9	10+
Match rate %	66.4	78.5	74.7	79.8	77.1	70.0	55.9	51.4	39.2	46.7

Results for Evaluation of Concordance of Pairs Formed

Table 4 shows that overall, more than 95% of the definite matches retained represented the same individual. As the sample of 20,000 units was drawn from definite matches only, this meant that the matching was of exceptional quality. Also, due to the fact that the rate of concordance of names among possible pairs was only 40% indicated that the enforcement of a conservative methodology was justified and prudent, as they prevented a large proportion of bad matches.

Household size was closely related to the concordance rate. Persons living alone and those living in households of eight or more persons exhibited a lower concordance rate, namely 86.8% and 90.8% respectively. These results would seem to be due to the small number of discriminant variables available for persons living alone and the fact that in the case of large households, there was often more than one family within the household.

Page 312 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Table 4. —Rate of Concordance of Names According to Various Groupings: Private Households Only

Match Status	Concordance on Names %	Standard Error* %
“Possible” match”	40.1	2.4
“Definite” match”	95.5	0.5
Indian reserves	95.3	1.5
Household of 1 person	86.8	2.5
Household of 2 to 7 persons	96.5	0.5
Household of 8 or more persons	90.8	2.0
* The design effect is ignored in calculating the standard error.

A final point to be observed is that matched inhabitants of Indian reserves[1] had a concordance rate equivalent to that of persons living off reserve.

Results for Sample Selection

Overall, we were pursuing two specific objectives in designing the stratification. First, it was necessary to come as close as possible to having a self-weighted design so as to allow for the use of existing computer applications. The costs of custom applications and the time required to develop them would have been a major handicap for any subsequent analysis. The first objective was attained by avoiding oversampling of strata to the extent possible and by maintaining a certain uniformity of weights within each stratum formed. The second objective was to use socio-economic variables in the process of forming strata. Therefore, when stratum sizes permitted, we used variables derived from income, education level, family structure, age and geography.

There were major conceptual differences with respect to the definition of the populations represented by the Census and by the MH file. Only persons having a usual place of residence in Manitoba on June 3, 1986 were enumerated in that province. The MH file was made up of all persons who were covered by the health insurance plan. Some of these persons no longer lived in Manitoba or may not have indicated a change in their status, resulting in some overcoverage. The MH file contained no information on residents of several categories of collective dwellings for which medical services were provided by the federal government, such as military camps and some Indian reserves, whereas the Census considered these persons to be residents of Manitoba.

For purposes of comparison, we excluded persons living in nursing homes (an institutional collective dwelling) from the MH counts in the tables that follow. According to the census definition, persons who had stayed for 180 days or more in a health care institution were considered institutionalized, and therefore excluded. Despite efforts to make the two universes uniform, the fact remains that we managed only to approximate the counts of persons living in institutions. Consequently, the populations compared represent approximately those persons living in a private household or a non-institutional collective household.

Page 313 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

As Table 5 shows, despite major conceptual differences, the sizes of the two populations by age group were quite comparable. Overall, the estimated total sizes of the two populations differed by only 0.1%, although males were underestimated by 1.2% and females overestimated by 1.4%. It was also observed that the greatest differences were amongst younger individuals.

Table 5. —Accuracy of the Sample by Age Group Versus MH: Private And Non-Institutional Collective Households

Age	Males MH	Difference from Sample %	Females MH	Difference from Sample %	Total MH	Difference from Sample %
0 to 4 years	32743	2.57	31105	1.98	63848	2.28
5 to 14 years	78076	1.47	73912	2.87	151988	2.15
15 to 24 years	86722	-1.24	82971	1.61	169693	0.15
25 to 44 years	165783	-2.84	159458	1.92	325241	-0.51
45 to 64 years	96989	-1.71	98997	0.92	195986	-0.38
65 years and +	57904	-1.34	74129	-1.10	132033	-1.20
Total	518217	-1.20	520572	1.39	1038789	0.10

Tables 6, 7 and 8 compare the mortality rate, medical care utilization and hospital care utilization by whether they were estimated from our sample or from the MH file. It should be noted that the death rates reported in the literature (Statistics Canada, 1994b) were slightly higher than those presented in Table 6, with the difference increasing with age. This may be explained by the fact that our files exclude individuals living an institutional collective dwellings, who exhibit a higher mortality rate than persons living in private households.

Table 6. —Annualized Mortality Rates Based on the Period from June 1986 to May 1989: Private and Non-Institutional Collective Households

Age	Annual Mortality Rate* (x 1,000) MH	Annual Mortality Rate* (x 1,000) Sample	95% Confidence Interval for the Sample
0–4	0.51	0.02	(0,0.18)
5–44	2.49	2.04	(1.54, 2.54)
45–49	3.23	2.78	(0.57, 4.99)
50–54	5.12	3.68	(1.03, 6.33)
55–59	8.42	8.91	(4.80, 13.02)
60–64	12.43	10.48	(6.01, 14.95)
65–69	18.96	19.03	(12.69, 25.37)
70–74	28.63	24.59	(16.76, 32.42)
75–79	42.77	43.09	(30.85, 55.33)
80 and over	76.98	73.67	(57.41, 89.93)
Total	6.71	6.11	(5.39, 6.83)
* Number of estimated deaths (over the three years period) divided by estimated population total times 3.

Page 314 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Overall, the mortality rate estimated by the matched sample (6.11) was lower than the one derived from the MH file (6.71), but this difference was not statistically significant at a 95% confidence level. It should be noted that the confidence intervals derived from our sample contained the value calculated by MH for all age groups except children aged 0 to 4. While the number of deaths in this category were relatively small, the difficulty in matching children under one year of age may be related to this underestimate. Additionally, this also indicates that any analysis specific to children from 0 to 4 years of age be conducted with caution, especially where the prevalence of a disease or condition was low.

Table 7. —Number and Costs of Medical Services, 1986–87 Fiscal Year: Private and Non-Institutional Collective Households

Selected Type of Practice	Number of Services			Costs of Services ($)
	MH	Sample	Relative Difference (%)	MH	Sample	Relative Difference (%)
Internal medicine	699,542	702,735	0.46	19,060,658	18,904,922	-0.82
Paediatrics	416,157	449,122	7.92	6,932,217	7,359,290	6.16
Psychiatry	154,279	146,704	-4.91	8,970,489	8,468,584	-5.60
Surgery	406,907	409,097	0.54	21,772,057	21,230,743	-2.49
Ophthalmology	339,334	357,273	5.29	10,017,371	10,506,461	4.88
Radiology	623,712	653,850	4.83	9,564,330	9,970,191	4.24
Pathology	2,941,244	3,126,365	6.29	21,369,502	22,489,399	5.24
Obstetrics and Gynaecology	294,288	328,728	11.70	8,774,785	9,151,231	4.29
General practice	4,762,316	4,858,641	2.02	75,806,649	76,545,594	0.97
Totals*	10,938,103	11,351,540	3.78	193,386,798	195,908,988	1.30
* Totals includes some Type of Practice not showned on this table.

For most categories of medical practice, the estimates drawn from the sample were fairly close to those presented by MH, both for the number of services and for the costs generated in providing these services. The accuracy achieved was all the more remarkable as no post-stratification was carried out at any level to adjust the assumption of health care services to the MH figure.

Table 8. —Number and Duration of Hospital Stays, 1986–87 Fiscal Year: Private and Non-Institutional Collective Households

Age Group	Number of Stays			Duration of Stays
	MH	Sample	Relative Difference (%)	MH	Sample	Relative Difference (%)
0–64	100,127	96,303	-3.82	538,616	499,665	-7.23
65 and over	43,226	41,318	-4.41	452,172	414,555	-8.32
Total	143,353	137,621	-4.00	990,788	914,220	-7.73

Table 8 shows the results of the comparison of the number and duration of hospital stays. Taking the conceptual differences between the two data sources into account, it is deemed to be satisfactory to achieve an accuracy of the estimates within 10%. A larger underestimate for the duration of stays, than for the number of

Page 315 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

stays, indicates that longer stays were prone to underestimation. This situation may be explained by the difficulty in identifying residents of institutional collective dwellings on the MH files.

Discussion

With the methodology presented in this article, approximately 74% of the census file corresponding to private households could be matched with the MH file, using mainly age, sex, postal code, family size and family structure. This rate of 74% is satisfactory when compared to the response rate reported in a number of surveys. For example, response rates for the Nova Scotia Nutrition Survey were 79.7% among located respondents and 60.0% for the total sample (Maclean, 1993). The Manitoba Heart Health Survey registered response rates of 77.1% among located respondents and 60.8% for the total sample (Young et al., 1991).

Obviously, considering the various types of errors possible with matching on a large scale, it is not realistic to expect a matching rate of 100%. It is inevitable that the success rate of any probabilistic matching exercise be affected by erroneous data, lags in the collection or updating of the information, as well as conceptual differences between the data sets. Furthermore, while non-matched individuals exhibited different characteristics from matched individuals, rich socio-demographic information concerning the non-matched population was available from the Census. This information was used to select a sample of matches representative of the entire population.

In 95.5% of cases, the pairs formed did associate with the data on an individual's health care utilization and with the socio-economic data on the same individual in the 1986 Census. This rate of accuracy is exceptional, considering that surnames, given names and birth dates were not used in the matching process.

The accuracy obtained in estimating various indicators associated with the consumption of health care (such as mortality, number and costs of medical services, number and duration of hospital stays) justifies the care with which the matching and sampling methods were developed.

In light of the match rates obtained, the rates of concordance of names and the accuracy of the estimates, it can be said that not only is the new database unique in Canada but also that the quality of the data coded in it greatly exceeds that of many surveys based on interviews.

At a time when health expenditures exceed 10% of the GDP in Canada and 13% in the United States, substantial efforts are being made to identify the relationships between health care utilization and health itself. While it is suspected that the level of health perceived by the patient explains a sizable portion of consumption, many studies have focused on consumption by a specific client group, such as the elderly, or on consumption of health care in the years prior to death (Barer et al., 1987; Shapiro and Roos, 1984).

The newly constructed microdatabase opens the door to various studies that were previously not possible in Canada. For example, one of the projects proposed by the MCHPE consists of analysing morbidity with respect to an individual's occupation and by examining the extent to which the health care utilization for a particular class of illnesses is related to the basic occupational group. The census data can be used to classify individuals according to the reported occupation, or according to whether or not they are employed and whether or not they are in the labour force. Using the 9th revision of the International Classification of Diseases (ICD-9), the potential medical conditions to be studied will be musculo-skeletal disorders, cardio-vascular diseases, mental disorders, gastrointestinal illnesses and injuries.

From a general view, there are plans to study differences in the level of health care utilization by socioeconomic status at different stages in life. On the one hand, it is well-documented that the greatest consumption of health services occurs toward the end of one's life (Barer et al., 1987; Roos et al., 1987). On the other hand, a

Page 316 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

major decline in infant mortality between 1960 and 1990 has also been observed (Pappas et al., 1993; Marmot, 1986). These two phenomena alone are justification for undertaking more thorough comparisons at all age levels. Using data on visits to doctors, health care at home and hospital admissions, it will be possible to compare health care utilization by different age groups by socio-economic status for the classes of illnesses mentioned above. In addition, several studies (Ugnat and Mark, 1987; Williams, 1990; Wilkins et al, 1991) suggest that differences in health conditions according to socio-economic status are greater among persons between 35 and 64 years of age than for other age groups. Analyses by age group using this new database confirmed these hypotheses or shed new light on these matters (Mustard, 1995).

A study to examine the impact of parental socioeconomic status on the use of hospital and ambulatory medical care services during the first year of life has just been completed. It showed that after controlling for low birth weight, maternal age and the joint effects of education and income; for hospital care, education was significantly negatively associated and exhibited a threshold effect between the lowest quartile and all other quartiles; for ambulatory treatment care, income was significantly associated and exhibited a linear effect; for preventive care, both income and education were associated and exhibited a threshold effect between the lowest quartiles and all other quartiles. In the first year of life excluding the birth event, per person public health expenditures were more than twice as high in the lowest education or income quartile compared to the highest quartile (Knighton et al., 1997).

Although the links between socio-economic status and health are the object of intensive research, one of the most frequently encountered problems is that it is impossible to have precise information on socio-economic status at the individual level. Some researchers have no other choice but to use an indicator obtained through the aggregation of taxation or census data for an area of a given size, such as the census enumeration area or the postal code area (Wilkins, 1993). Little research has been done to verify the impact and the validity of this methodology. This tends to reduce the capacity of such models to detect more subtle but theoretically important determinants.

The HALS file, combined with administrative data from health care utilization records, opens the door to comparisons which, until now, have been difficult if not impossible to make. HALS offers us a clear and detailed image of individuals suffering from disability. Whether by age group or sex, by type of disability (mobility, sight, hearing, dexterity, cognition, etc.) or severity, the Manitoba population suffering from a disability can be compared to the general population by means of the census file. Specific analyses of these data focussed on mortality and on health care utilization (Tomiak et al., 1997).

Finally, as the population ages, a greater demand for long-term care services and in particular, nursing homes is expected. A study was initiated to assess the relative importance of predisposing, enabling and need characteristics in predicting nursing home entry.

The list presented here is not exhaustive and is provided only to demonstrate how the new database can be used to analyze health care utilization at different stages in life and for different topics. It is believed that the analytical benefits produced by the record linkage of the administrative sources are important in term of public interest.

Acknowledgments

The authors wish to thank the following persons for their significant and generous contribution to this study: Shelley Derksen, J.Patrick Nicol, and Leonard McWilliam, Manitoba Centre for Health Policy and Evaluation.

Page 317 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Footnote

[1] It should, however, be kept in mind that the match rate on Indian reserves was only 44.5%, considerably lower than the average rate of 74%. This could lead to a bias, since the matched individuals may have had very different characteristics from the reserve population as a whole.

Bibliography

Anderson, G.;Grumbach, K.; Lutt, H.; Roos, L.L.; and Mustard, C. ( 1993). Use of Coronary Artery Bypass Surgery in the United States and Canada: Influence of Age and Income, Journal of the American Medical Association, 269, 1661–1666.

Barer, M.L. et al. ( 1987). New Evidence on Old Fallacies, Social Science Medicine 24(10), 851–862.

David, P. et al. ( 1993). Linking Survey and Administrative Data to Study the Determinants of Health, Proceedings of the American Statistical Association, San Francisco.

Dougherty, G.; Pless, I.B.; and Wilkins, R. ( 1990). Social Class and the Occurrence of Traffic Injuries and Death in Urban Children, Canadian Journal of Public Health, 81, 204–209.

Dolson, D.; Maclean, K.; Morin, J.P.; and Théberge, A. ( 1987). Sample Design for the Health and Activity Limitation Survey, Survey Methodology, 13(1), 93–108.

Health and Welfare Canada ( 1981). The Health of Canadians: Report of the Canada Health Survey, Cat. No. 82–538E.

Fellegi, I.P. and Sunter, A.B. ( 1969). A Theory for Record Linkage, Journal of the American Statistical Association, 64, 1183–1210.

Knighton, T.; Houle C.; Berthelot J.M.; and Mustard C. ( 1997). The Impact of Socio-Economic Inequity on the Health Care Utilization Practices of Infants During the First Year of Life, Symposium on Intergenerational Equity in Canada, Statistics Canada, Ottawa. (For reprint contact Miles Corak at (613) 951–9047.)

Kovacevic, M. ( 1995). The Weight Adjustment for the Sample from the “Whole Population Database” (Private Household Component), Technical Note, Statistics Canada, March 24 (not published).

Maclean, D.R. ( 1993). Report of the Nova Scotia Nutrition Survey, Nova Scotia Heart Health Program, Department of Health, Government of Nova Scotia.

Marmot, M.G. ( 1986). Social Inequalities in Mortality: The Social Environment. In: Class and Health, Research and Longitudinal Data, (Ed. R.G.Wilkinson), London: Tavistock Publications.

Marmot, M.G. and Mcdowall, M.E. ( 1986). Mortality Decline and Widening Social Inequalities. Lancet, 1, 274– 276.

Mustard, C. ( 1995). Socioeconomic Gradients in Mortality and the Use of Health Care Services and Different Stages in the Life Course, Research Paper, Manitoba Centre for Health Policy and Evaluation.

Pappas, G. et al. ( 1993). The Increasing Disparity in Mortality Between Socioeconomic Groups in the United States, 1960 and 1986. New England Journal of Medicine, 329, 103–109.

Roos, L.L.; Nicol, J.P.; and Cageorge, S.M. ( 1987). Using Administrative Data for Longitudinal Research: Comparisons with Primary Data Collection, Journal of Chronic Diseases, 40(1), 41–49.

Shapiro, E. and Roos, L.L. ( 1984). Using Health Care: Rural/Urban Differences Among the Manitoba Elderly The Gerontologist, 24(3), 270–274.

Smith, Martha E. ( 1981). Generalized Iterative Record Linkage System, Health Division, Statistics Canada.

Statistics Canada ( 1994a). Health Status of Canadians: Report of the 1991 General Social Survey , Cat. No. 11–612E, No. 8.

Page 318 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Statistics Canada ( 1994b). Causes of Death 1992, Cat. No. 84–208

Statistics Canada ( 1988), The Health and Activity Limitation Survey, Selected Data for Canada, Provinces and Territories, Cat. No. 41034.

Statistics Canada ( 1986b). Report of the Canadian Health and Disability Survey 1983–1984, Cat. No. 82– 555E.

Statistics Canada ( 1986b). Census Handbook, Cat. No. 99–104.

Tomiak, M.; Berthelot, J.M; and Mustard, C. ( 1997). A Profile of Health Care Utilization of the Disabled Population in Manitoba, (submitted for publication).

Ugnat A M and Mark, E. ( 1987). Life Expectancy by Sex, Age and Income Level. Chronic Disease in Canada.

Young, T.K.; Gelskey, D.E.; Macdonald, S.M.; Hook, E.; and Hamilton, S. ( 1991). The Manitoba Heart Health Survey: Technical Report.

Williams, D.R. ( 1990). Socio-Economic Differentials in Health: A Review and Redirection, Social Psychology Quarterly 53(2):81–99.

Wilkins, R.; Adams, O.; and Brancker, A. ( 1991). Changes in Mortality by Income in Urban Canada from 1971 to 1986, Health Reports, 1(2), 137–174.

Wilkins, R. ( 1993). Use of Postal Codes and Addresses in the Analysis of Health Data, Health Reports, 5(2): 157–177, Cat. No. 82–003.

Wolfson, M.C.; Rowe, G.; Gentleman, J.F.; and Tomiak, M. ( 1993). Career Earnings and Death: A Longitudinal Analysis of Older Canadian Men, Journal of Gerontology: Social Sciences.

Page 319 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

The Development of Record Linkage in Scotland: The Responsive Application of Probability Matching

Steve Kendrick, National Health Service, Scotland

Abstract

Since 1968, patient identifiable records of hospital discharges, cancer registrations and death records have been held centrally in Scotland in machine readable form. Patient details are held in order to enable record linkage using probability matching. In the 1970s and early 1980s over forty ad hoc linkages were carried out. Since the late 1980s, the records have been brought together into permanently linked data sets the largest of which now contains over 12 million records span

These linked data sets have enabled a wide range of analyses to be carried out in response to demands from the health service and the medical research community. They have ranged from relatively simple aggregations of data at the patient level to complex studies of long term patient outcomes. Outcome indicators such as 30 day survival after acute myocardial infarction are now published at hospital level.

In addition to the main “internal” linkages over seventy linkages have been carried out between external data sets such as surveys (e.g., the West of Scotland Coronary Prevention Study), employee records and clinical audit records and the centrally held linked data sets.

The linkage techniques used have evolved to meet the challenges posed by a wide range of customer requirements and data sets. In particular there has been a shift from traditional sort-and-match methods to one pass techniques involving indexing in memory. This has been necessary to enable the linking of relatively small data sets to the main data set without multiple sorting of the data. The technique is currently being adapted to the main linkages to enable much more rapid incorporation of new data. Appropriate use of “best-link” principles has made possible either very high linkage accuracy (e.g., the linkage of Scotland's two main population registers) or reasonable accuracy in linking very poor quality data sets (e.g., linkage of records of victims of cardiac arrest to death records).

The paper will use the Scottish experience to illustrate how the application of probability matching needs to be closely attuned to the precise characteristics of and, in particular, the relationship between the data sets to be linked.

Introduction

Record linkage using probability matching, like many fields of human endeavour, has progressed as a highly fruitful interplay between theory and experiment, axioms and pragmatism. One viewpoint would see record linkage as primarily a highly practical enterprise based on common-sense and close attention to the empirical characteristics of the data sets involved in any linkage. Another would emphasize the

Page 320 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

rigorous grounding of record linkage practice in statistical theory and the theory of probability (Fellegi and Sunter, 1969; Newcombe et al., 1992; Arellano, 1992).

Howard Newcombe, pioneer and founder of probability matching techniques, recognises the value of both perspectives. His work has illustrated the continuing dialectic between the theory and the practical craft of linkage. From the point of view of the development of record linkage in Scotland however his most valuable contribution, beyond his initial formulation of the principles of probability matching, has been his emphasis on being guided by the characteristics and structure of the data sets in question and close empirical attention to the emergent qualities of each linkage (Newcombe et al., 1959; Newcombe, 1988). Particularly inspiring has been his insistence that probability matching is at heart a simple and intuitive process and should not be turned into a highly specialised procedure isolated from the day to day concerns of the organization in which it is carried out (Newcombe et al., 1986).

In this paper we wish to show how the development of the methods of record linkage used in the Scottish Health Service have been driven forward by concrete circumstances and in particular by the practical demands of our customers and the needs of the health service as a whole. Although almost no specifically “research and development” time has been devoted to the development of the Scottish system, our openness to the demands of customers and the sheer variety of linkages which this has engendered has in fact produced a rapid pace of development and change which shows no sign of abating.

Although we have pursued a highly pragmatic rather than a theoretical approach, the variety of linkages which have been undertaken has served to give shape to an overview of some of the main factors which need to be taken into account in designing linkages most effectively. The paper is thus in part the story of record linkage in Scotland, in part a concrete account of how our methods have evolved but also contains an overview of some the factors to do with data structure which may be relevant to linkage strategy. More than anything however the paper is an illustration of how the sensitive and flexible application of the very simple and basic principles outlined by Howard Newcombe can produce very powerful results.

The Context

The current system of medical record linkage in Scotland was made possible by an extremely far sighted decision made as long ago as 1967 by the predecessor organisation to the Information and Statistics Division of the Scottish Health Service and by the Registrar General for Scotland. The decision was taken that from 1968 all hospital discharge records, cancer registrations and death records would be held centrally in machine readable form and would contain patient identifying information (names, dates of birth, area of residence etc.).

The decision to hold patient identifying information was taken with probability matching in mind and reflected familiarity with the early work of Howard Newcombe in Canada and close contact between Scotland and the early stages of the Oxford record linkage initiative. (Heasman, 1968; Heasman and Clarke, 1979).

In what can now be regarded as the first phase of medical record linkage in Scotland, over 40 often sizeable linkages were carried out between the late 1960s and the mid-1980s (for example, Hole et al., 1981; Kendell et al., 1987). The linkages were primarily for epidemiological purposes and each involved the rather laborious specification and development of a bespoke computer program, the whole process often taking over a year to complete. Although the system represented a considerable achievement, by the mid-

Page 321 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

1980s it was acknowledged that it would be increasingly inadequate for the perceived future needs of the Scottish Health Service especially in terms of management information.

In the late 1980s the decision was taken to reconstitute the linkage system. Increased computing power and data storage capacity enhanced the feasibility of linking once and for all the set of records pertaining to a given patient. New enquiries, whether epidemiological or relating to service management, would increasingly involve analysis of already linked data rather than requiring fresh linkages.

The years since 1989 have seen the creation of such permanently linked data sets of Scottish health related data. (Kendrick and Clarke, 1993). The largest currently contains all hospital discharge data, cancer registrations and Registrar General's death records from 1981 to 1995 (over 14 million records relating to just over 4 million individuals). A maternity/neonatal data set contains all maternity admissions, neonatal records, Registrar General's birth records and stillbirth and infant death records from 1980 to 1995. Finally, the data set with the longest time span contains linked psychiatric inpatient records and Registrar General's death records from 1970 onwards.

It was envisioned that the creation of the national linked data sets would be carried out purely by automated algorithms with no clerical checking or intervention involved. After linkage of five years of data in the main linked data set it was found that the false positive rate in the larger groups of records was beginning to creep up beyond the 1% level felt to be acceptable for the statistical and management purposes for which the data sets are used. Limited clerical checking has been subsequently used to break up falsely linked groups. This has served to keep both the false positive and false negative rates at below one per cent. More extensive clerical checking is used for specialised purposes such as the linking of death records to the records of the Scottish cancer registry to enable accurate survival analysis for example.

The existence of the linked data sets has generated a high level of demand for analysis. Approaching a thousand analyses have been carried out ranging from simple patient based counts to complex epidemiological analyses. Among the major projects based on the linked data sets have been clinical outcome indicators (published at hospital level on a national basis), analyses of patterns of psychiatric inpatient readmissions and post-discharge mortality and analyses of trends and fluctuations in emergency admissions and the contribution of multiply admitted patients.

However, far from reducing the requirement for specialised data linkage, the existence of permanently linked national data and facilities for linkage has served to fuel the demand for new linkages. Over a hundred and fifty separate probability matching exercises have been carried out over the last five years. These have consisted primarily of linking external data sets of various forms—survey data, clinical audit data sets—to the central holdings. A particularly important linkage in the context of a major trial of cholesterol lowering drugs enabled comparison of the accuracy of follow-up using probability matching with reporting based on direct contact with patients. Automated linkage was found to be just as accurate for tracking hospital admissions (West of Scotland Coronary Prevention Study Group, 1995). Other specialised linkages have involved extending the linkage of subsets of the ISD data holdings back to 1968 for epidemiological purposes, (for example, Gillespie et al., 1996). These exercises have varied enormously in scale and complexity, from following up the patients of a particular consultant to linking what are virtually two different registers of the population of Scotland. Linkage proposals are subjected to close scrutiny in terms of the ethics of privacy and confidentiality by a Privacy Advisory Committee which oversees these issues for ISD Scotland and the Registrar General for Scotland.

The Scottish linkage project has been funded primarily as part of the normal operating budget of ISD Scotland. Relatively little time or resources have been available for general research into linkage method-

Page 322 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

ology. Instead the development and refinement of linkage methods has taken place as a response to a wide variety of immediate operational demands. We have become to all intents and purposes a general purpose linkage facility at the heart of the Scottish Health Service operating to very tight deadlines often set in terms of weeks and in extreme cases, days. This has placed a high premium on developing quick, effective and accurate methods of linkage with an emphasis on fitness for purpose rather than straining for precision for its own sake.

Despite the lack of time and resources available for background research and development in linkage methodology, these conditions have in fact fostered, especially in recent years, a rapidly changing and developing approach to linkage. Before describing the most significant developments involved, a brief overview of the main components will serve to set them in context.

The Elements of Linkage

For the purposes of this discussion, record linkage using probability matching can be regarded as having three phases or elements each involving a key question.

Bringing pairs of records together for comparison. —How do we bring the most effective subset of pairs of records together for comparison? It is usually impossible to carry out probability matching on all pairs of records involved in a linkage. Usually only a subset are compared, those which share a minimum level of identifying information. This has been traditionally achieved by sorting the files into “blocks” or “pockets” within which paired comparisons are carried out (Gill and Baldwin, 1987).
Calculating probability weights. —How do we assess the relative likelihood that pairs of records belong to the same person? This lies at the heart of probability matching and has probably been the main focus of much of the record linkage literature. (Newcombe, 1988).
Making the linkage decision. —How do we convert the probability weights representing relative odds into absolute odds which will support the linkage decision? The wide variety of linkages undertaken has been particularly important in moving forward understanding in this area.

It would probably be fair to say that of the three areas, it is the second, the calculation of probability weights which has received the most attention and is the best understood. Developments in Scotland over the last few years have occurred in the other two areas as the two subsequent sections will demonstrate.

Before moving on to these developments, our approach to the calculation of probability weights has been relatively conventional and can be quickly summarised. A concern has been to avoid overelaboration and over complexity in the algorithms which calculate the weights. Beyond a certain level increasing refinement of the weight calculation routines tends to involve diminishing returns. This relatively basic approach has been facilitated by the relative richness of the identifying information available on most health related records in Scotland. To take an example, for the internal linking of hospital discharge (SMR1) records across Scotland we have available the patient's surname (plus sometimes maiden name), first initial, sex and date of birth. We also have postcode of residence. For records within the same hospital (or sometimes the same Health Board) the hospital assigned case reference number can be used. In addition positive weights can be assigned for correspondence of the date of discharge on one record with the date of admission on another. Surnames are compressed using the Soundex/NYSIIS name compression algorithms

Page 323 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

(Newcombe, 1988) with additional scoring assigned for more detailed levels of agreement and disagreement. Wherever possible specific weights relating to degrees of agreement and disagreement are used.

Bringing the Pairs of Records Together: One Pass Linkage

The Limitations of Sort-and-Match

By the time the largest linked data set covered several years of data and consisted of several millions of records, a particular challenge emerged. The linkage team began to be asked to link data sets consisting of relatively small numbers of “external” or “newcomer” records to the central catalog of identifiable records. The external or newcomer records might consist of respondents to a survey, a specialised disease register or a particular group of employees.

In all cases the aim was to link the newcomer data set to the central catalog of records so that the experience of the individuals involved could be traced forward from the date of survey, the date last known to the disease register or the date of employment.

As we have seen, in record linkage it is impossible to bring together and compare all the pairs of records involved in the linkage. The number of pairs which are brought together for comparison is normally reduced to manageable proportions by some form of blocking by which only those pairs of records which share common sets of attributes are compared. For example a common strategy is to compare only those pairs of records which share either the same first initial and NYSIIS/Soundex code or the same date of birth. The normal method of achieving such blocking is to sort the two files concerned on the basis of the blocking criteria. Thus, for a first pass of linkage, the files would be sorted by first initial and NYSIIS/Soundex code to bring together into the same “pocket” or “block” all records sharing the same NYSIIS/Soundex code and first initial. Records would only be compared within this block. Because a number of truly linked pairs of records would not be brought together on this basis (for example, because of a misrecording of first initial), a second pass could be carried out which blocks by date of birth. This second pass involves resorting the files on the basis of date of birth to create a second set of pockets or blocks within which comparison takes place. The results of the first and second passes need to be reconciled and this involves sorting the file yet again.

The key point is that standard methods of blocking involve sorting all the records involved in the linkage at least twice and usually more often. When faced with the kind of linkage mentioned above, involving linking a small number of newcomer records to a central catalog holding several millions of records such a procedure is at best immensely wasteful and at worst impossible. No matter how few newcomer records are involved, it is still necessary to sort all the central catalog records for the years of interest. If only a few years are involved, and especially if linkage is restricted to a subset of the central records e.g., cancer registrations, the exercise is feasible but immensely inefficient. If it is desired to link newcomer records to the entire data set, the exercise becomes, in reality, impossible.

One Pass Linkage: Blocking Without Sorting

The question thus became: how can we link a relatively small number of newcomer records to the catalog without having to repeatedly sort the catalog? The solution adopted has been to store the newcomer records in memory and carry out blocking using indexes based on numerical elements of the blocking criteria. The catalog records can then be read in sequentially and compared with all the newcomer records which fit the chosen blocking criteria (Kendrick and Mcllroy, 1996).

Page 324 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

The linkage is thus carried out in the course of “one pass” through the catalog data set.

Before they are brought into contact with the catalog, all newcomer records are read into memory and stored in an array indexed by a unique numeric record identifier. Necessary pre-processing such as generation of NYSIIS/Soundex codes is also carried out.

The next step is the creation of blocking index arrays. In this description we assume that two sets of blocking criteria are being used: first initial and NYSIIS/Soundex code on the one hand, and date of birth on the other.

The blocking index arrays are indexed by numeric elements of the blocking criteria. Thus the first blocking array uses the numeric element of the NYSIIS/Soundex code as its index. All NYSIIS/Soundex codes consist of a letter followed by three figures e.g., A536 or B625. The first blocking index array has a row for each number from 001 to 999 which covers all possible numeric elements of NYSIIS/Soundex codes. In each row are stored the numeric identifiers of the newcomer records whose NYSIIS/Soundex code has the relevant numeric element. For example, the identifiers of newcomer records with surname FRAME (NYSIIS/Soundex code F650) and BROWN (B650) would be stored in the same row.

The second blocking index array has three indices: year, month and day of birth. Along the fourth dimension of the array are stored the numeric identifiers of the newcomer records sharing that date of birth.

Catalog records are then read in one by one. Suppose the first catalog record is for someone named BROWN (NYSIIS/Soundex code B650) with date of birth 1st March, 1922.

Row 650 of the first blocking index array is inspected to see whether any newcomer records share the numeric element of the Soundex code. If any are found, then the first newcomer record is accessed via its numeric identifier in the newcomer record array. An immediate comparison is made of the first letter of the Soundex code and the first initial. If both match then we proceed to full probability matching between the catalog and newcomer records. If neither or only one match then no further action is taken. We then look at the next newcomer record (if any) indexed on the relevant row of the blocking index array.
Blocking by date of birth is even easier to simulate in that the blocking criteria are entirely numeric. The catalog record can be directed to all newcomer records which share the same day (in this case 1); month (in this case 3); and year (in this case 22) by directly accessing the relevant array.

How the results of the ensuing pair comparisons are stored and implemented depends upon the structure and purpose of the linkage. Whenever links above a certain weight occur they can be output and stored for implementation in a provisional linkage file. This provisional linkage file can itself be flexibly interrogated to implement a given structure of linkage e.g., we may be only interested in the best link (the link with the highest weight) achieved by each newcomer record (see below).

One Pass Linkage: Practical Considerations

The above strategy, whereby the newcomer records can be indexed only in terms of the numeric elements of any blocking criteria, is necessary when we are using a programming environment which only allows numeric indexing of arrays. If either the newcomer data or the catalog data is stored in a database

Page 325 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

which allows direct access by any type of key then the logic of the exercise would be simplified. The file could be flexibly indexed by whatever blocking keys are felt appropriate.

Our impression at present would be that using memory still has advantages in terms of speed of access. This of course is a practical issue and may well change quickly as relational databases and “search engines ” improve in speed and efficiency.

The number of newcomer records which can be linked in one pass through the data is of course limited by the available memory. Memory is needed both for storing the elements of the newcomer records which are necessary for linkage and for storing the blocking index arrays. For most ad hoc linkages involving anything up to 15,000 newcomer records this has not tended to be a problem. Larger newcomer data sets can often be linked in sections without affecting the logic of the linkage.

Of the three elements of linkage it is this which is most dependent upon the capabilities of available hardware and software. The implication is that as these capabilities develop, there will be immense potential for moving beyond current limitations.

The Linkage Decision: Relative and Absolute Odds, Structuring the Linkage and the Best Link Principle

At the heart of the record linkage enterprise is the decision as to whether two records are truly linked. Most often the question is one of whether the records involved relate to the same person. The calculation of probability weights aims to provide a mathematical grounding for this decision.

However, it is a fundamental characteristic of the odds represented by probability weights that they are relative odds rather than absolute odds. They only serve to rank the pairs of records involved in a given linkage in order of the probability that they are truly linked. The relative odds do not represent fixed absolute or betting odds such that a probability weight of 25, for example, would always representing absolute or betting odds of 50/50. The conversion factor will vary from linkage to linkage.

It is absolute odds which are needed to inform the linkage decision. In practical terms the issue of the determinants of the conversion from relative odds to absolute odds can often be bypassed in that the required threshold in terms of absolute odds can usually be identified empirically from inspection of a sample of pairs. However a broad understanding of the relationship between relative and absolute odds is useful in that it can help optimise the way a given linkage should be structured.

Not all linkages require the same absolute odds. The absolute odds required depend upon the purpose of the linkage. They depend upon the costs associated with missing a true link compared with making a false link. For statistical purposes, the required absolute odds may be 50/50. If the linkage is to be used for administrative or patient contact purposes where a false linkage may have extremely damaging consequences, very high absolute odds may be required.

Relative to Absolute Odds: A Priori Factors

Two of the factors involving in converting relative to absolute odds take the form of relatively straightforward numerical principles. Newcombe has stated them in the context of a search file and a file being searched (Newcombe, 1988; Newcombe, 1995). The first principle is that the higher the proportion of records in the search file for which there exists a linked record in the file being searched, the more favorable

Page 326 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

will be the conversion factor between relative and absolute odds. The second proposition is that the larger the file being searched, the less favourable will be the conversion factor between relative and absolute odds.

These factors are given for any specific linkage.

Relative to Absolute Odds: Structural Factors

However the conversion factor between relative and absolute odds can be influenced by how the linkage is implemented. It is important to design the linkage in a way which takes maximum advantage of the structures of the files involved and the relationships between the records in the files. For example, are the relationships between the records in two files one-to-one, one-to-many or many-to-many? How much confidence do we have in previous linkages which may have been carried out on the files involved? How confident are we that a file to be linked already contains only one record per person?

For example, if we want to link to each other a set of hospital discharge records, we have no a priori knowledge of how many records belong to each person. Our best bet is to do a conventional internal linkage and inspect all resulting pairs in setting a threshold. In this case we have relatively little leverage to improve the terms of conversion between relative odds and absolute odds.

If however we are linking a file of hospital discharge records to a file of death records we can obtain some “structural leverage.” Death only occurs once and assuming that this is reflected in there being only one death record per person in the file of death records, the linkage becomes many-to-one. Each hospital discharge record should link to only one death record. The terms of conversion from relative to absolute odds can be improved by only retaining, for each hospital discharge record, the best (highest weight) link which is achieved to a death record (see also Winkler, 1994).

Similarly, at the other end of the life cycle, if we are linking baby records to mothers records, assuming that the mothers records themselves have been correctly linked we should allow each baby to link to only one mother. This in fact was the first context in which the importance of structuring the linkage emerged in Scotland.

Best Link and Structural Leverage: An Example

The CHI/NHSCR Linkage

These rather abstract considerations can be best understood in the context of a particular example. In common with the rest of the United Kingdom, Scotland is committed to the development of a unique patient identifier to help streamline the management of all patient contacts with the health service. Historically, Scotland has possessed two health-related registers of the Scottish population. It was felt that the combined strengths of the two registers would provide a firm basis for a new patient identifier.

For the last twenty years the Community Health Index (CHI) has operated on a regional basis as a primary care patient register for such purposes as screening for breast and cervical cancer and childhood immunisation. It contains a wealth of operational information with high population coverage. However, the regional indexes were initially compiled on an opportunistic basis and there was a general perception that there were gaps in its coverage and that there was a high proportion of duplicate records for people who had moved from one area of Scotland to another.

Page 327 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Scotland also possesses a National Health Service Central Register (NHSCR) which has been carefully maintained to contain one record for each resident of Scotland. The NHSCR however contains relatively little operational information.

The initial plan was to carry out an internal linkage of the aggregated regional CHI indexes in order to remove duplicate records and then to link the resulting aggregated CHI data set to the NHSCR to form the basis of a national index.

However, based on our early experience of structuring linkages to maximise the power of the linkage, it was felt that linking the CHI databases to each other via a linkage to the NHSCR would provide more “leverage” in the linkage. (Kendrick et al., 1997)

The data which was available on both data sets to enable linkage was reasonable but not excessively rich. We had forenames, surnames, sex and date of birth but the only residential data was Health Board of residence (average size 300,000). A major bonus was that National Health Service number was available in a well formatted form on all NHSCR records. Because of its irregular format, the British NHS number has been notoriously difficult to use and was available on only a proportion of CHI records and with wide variations in accuracy and formatting.

Although the linkage was primarily concerned with “current” CHI records, those reflecting the current residence and GP registration of the Scottish population, “redundant” CHI records for people who had died or moved to a new Health Board were included in the linkage as a possible basis for constructing historical traces. In order to find a correct NHSCR “home” for as many CHI records as possible the NHSCR file also contained deaths from 1981 as well as known emigrants.

Since we were confident that the NHSCR did contain one record for every Scottish resident but there were suspicions that the CHI data set contained duplicate records (as well as legitimately multiple historical records), it was decided to structure the link as many-to-one. Each CHI record was allowed to link to only one NHSCR record—the one with which it achieved the highest probability weight. Each NHSCR record on the other hand was allowed to link to as many CHI records as necessary.

Relative to Absolute Odds: Conversion Factors

The linkage can be described in terms of the factors which were outlined in the previous section as determining the conversion factor between relative and absolute odds.

Purpose of the linkage. —Any links accepted from the linkage would form the basis of patient contact. A very high level of confidence in the validity of any links was required. Missed links were regarded as less of a problem in that they would normally be picked up in the course of the running of the new index. Thus very high absolute odds for linkage were required.
A priori probabilities. —Given that both sets of data represented a high level of coverage of the Scottish population, there was a very high probability that a person represented on the CHI file would also be represented on the NHSCR file. In terms of Newcombe's first rule, circumstances could not have been more favourable.
File sizes. —Reflecting as they did the entire Scottish population as well as deaths and transfers these were large files: approximately 6.3 million NHSCR records against 7.8 million CHI records.

Page 328 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

This gives a high coincidence factor and, according to Newcombe's second rule, would normally serve to push up the relative odds required for given absolute odds.

Structuring the file. —Given the knowledge that all Scottish residents were likely to be represented by one NHSCR record and one or more CHI records, it made sense to structure the linkage as a best link many-to-one linkage i.e. allowing each CHI record to link only to the NHSCR record with which it achieved best link would be the most effective route and would maximise the conversion factor between relative and absolute odds.

In broad terms then the linkage faced two difficult circumstances: the requirement for very high absolute odds and the large file sizes. These were more than outweighed however by the massive leverage contributed by the use of the best link principle in the context of a very high a priori probability that people were represented in both files.

Linkage Results

Of approximately 5,360,000 current registered CHI records, 4,600,000 or 86% linked deterministically to an NHSCR record. There was a match between the Soundex/NYSIIS code of surname, first initial, date of birth, sex and NHS number. For the remaining 750,000 CHI records, probability matching was carried out.

Resources for clerical checking were limited and such checking was limited to a sample of best link pairs to determine a probability weight which would represent absolute odds for the correctness of a linkage which were sufficiently high for administrative purposes. Staff of Health Board Primary Care Teams and the National Health Service Central Register checked 2,500 pairs using existing search and confirmation systems. No incorrect links were found at a probability weight greater than 30 and this was chosen as the administratively acceptable threshold.

To put this outcome into a broad comparative perspective we can compare the CHI/NHSCR linkage with previous linkages in Scotland which did not use the best link principle but which linked similar types of record using virtually the same agreement and disagreement weights for the main identifying items such as name and date of birth.

In the linkage of the Scottish hospital discharge and death record data sets using probability matching, the fifty/fifty threshold (i.e., the weight at which it is equally likely that the two records belong or do not belong to the same person) has remained relatively constant at a probability weight of 25. The fifty/fifty threshold for the best links of CHI to NHSCR records is around 15. Similarly, the threshold below which links between Scottish Cancer Registrations and death records are clerically checked and above which they are accepted automatically is a weight of 40. In the CHI/NHSCR linkage as we have seen, this threshold is 30. In both cases the difference is ten units in the currency of binit weights or logs to the base 2. In terms of odds this is an improvement in the conversion factor from relative to absolute odds of 2¹⁰ or around a thousandfold.

Why the use of only best links in this context should contribute so much extra leverage compared with a pure threshold method is perhaps intuitively obvious but is much more difficult to explain in principle. The logic is perhaps best illustrated by a hypothetical example.

Let us suppose that a CHI record on which is recorded the name Angus MacAllan with date of birth 25/01/1952 has achieved its best link with an NHSCR record on which is recorded the name Angus

Page 329 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

McAllan born 24/01/1951. There is no NHS number on the CHI record and no other elements agree so that the link achieves what would be, in the context of an unstructured purely threshold linkage, only a moderate probability weight implying a less than fifty/fifty chance that the records belong to the same person. We can best assess the likelihood that these two records would belong to the same person in the CHI/NHSCR linkage context by an indirect route. Let us imagine what would have to be true for the two records not to belong to the same person. Either:

there is no NHSCR record relating to the individual represented on the CHI record and in addition there exists on the NHSCR file a record relating to another Angus Mc/MacAllan with a highly similar date of birth; or
there is an NHSCR record corresponding to the individual represented on the CHI record but there are sufficient discrepancies in the recording of the identifying information for this “true link” Angus MacAllan that an NHSCR record for another Angus MacAllan in fact achieves a higher probability weight with the CHI record.

Neither of these scenarios are impossible but they are highly improbable and it is much more likely that the two records really do belong to the same person.

The method used had two additional advantages. The file which was output from the linkage took the form of a copy of each CHI record to which was appended an extract from the NHSCR record to which it had achieved the best link and the weight at which the link was achieved. This file was used as a basis for generating pairs for inspection and links could be extracted at whatever weights were necessary. In essence this means that the threshold for linkage was set and could be varied retrospectively without having to rerun the linkage.

The problem of twins has always bedevilled record linkage. The CHI/NHSCR linkage was able to take advantage of the fact that the NHS numbers for most pairs of twins are consecutive and a high negative weight was given for pairs of records with consecutive NHS numbers. Linkages using best link are in normal circumstances better than linkages using only a numeric threshold. In the presence of consecutive NHS numbers for twins the linkage was very successful in correctly allocating the records for twins.

One Pass Linkage and the Structuring of Linkages

Although one pass linkage and the structuring of linkages in terms of the best link have developed as separate responses to different challenges, they are not entirely independent.

Given that one of the main aims of one pass linkage is to avoid having to repeatedly sort or restructure the larger or target file, it is natural to implement one pass linkage as a best link procedure i.e., each newcomer record is allowed to link only to the catalog or target record with which it achieves the highest probability weight. Thus, it is not possible for the linkage to bring together records in the target file by “bridging” between them—this would involve restructuring or resorting. As we saw earlier, as patient record sets in the main linked database grew larger, the false positive rate crept upwards, often because of illegitimate bridging by new records. As the main production linkages are adapted to one pass linkage, this problem will be minimised.

Although the affinity between one pass linkage and the best link principle is one of practical convenience, as we have seen, depending upon the circumstances of the linkage, the best link principle often has highly beneficial effects. Practicality and best practice often go hand in hand.

Page 330 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Linkage in Scotland: A Possible Future

Another way of looking at the CHI/NHSCR linkage is to see the NHSCR file as a target file at which the regional CHI files were aimed for linkage. As we have seen finding the best link record in the target file for each CHI record proved to have a dramatic effect on the accuracy of the linkage.

The much richer “national CHI” file which has resulted from the linkage and the introduction of national search and enquiry facilities provides an even better target for the linkage of other data sets.

For example, in November 1996 Scotland experienced a severe outbreak of infection by the E-coli 0157 bacterium. Several different sets of records were generated in the course of the outbreak: a case register, community clinic contacts, laboratory records, known exposed cohorts and hospital patients. The quality of identifying information on many of these records was rather poor reflecting the circumstances in which they were collected. ISD Scotland was asked to link these records so that the records for each individual involved in the outbreak could be gathered together. Rather than attempt to link the different sets of records directly to each other, the records were “aimed” at the local Community Health Index and linked to it. Again this method paid off in terms of much more accurate linkage.

It is likely that more and more linkages in Scotland will take the form of aiming data sets at the target of the national CHI. Ultimately the objective is to use such linkages, whereby for example laboratory data sets or hospital Master Patient Indexes are linked to the national CHI, to populate an increasing proportion of Scotland's health records with a unique patient identifier. It is intended that this will eventually reduce the need to record patient identification details such as names and dates of birth on operational records and communications. Instead identification will be via the national CHI number. Such a system is already in place in Tayside Health Board where the CHI number is implemented on a wide range of primary and acute health care records.

In this context the role of probability matching in the Scottish Health Service and the methods used to carry it out are likely to change even more rapidly over the next few years than they have over the last ten years.

As we have emphasised it has been the openness of record linkage in the Scottish Health Service to the demands of a wide range of customers which has driven the rapid development in our methods and this is likely to continue.

In this context the common sense and pragmatic approach to record linkage championed by Howard Newcombe has been especially useful and appropriate as guidance. Working as we are in his footsteps we can summarise some of the most salient emphases.

Record linkage is about being guided by the data and staying as close to the data as possible at all stages. The people who know the data best must be involved. Linkage is an evolutionary and recursive process at all levels. Linkage is a continual learning process and linkage is about what works, not what ought to work.

Finally, record linkage is not about the mechanical application of complex and abstract rules. As circumstances change and data sets vary there is unlikely ever to be one definitive best method of carrying out record linkage using probability matching. Progress will come rather from the flexible and responsive application of what are, at heart, very simple principles.

Page 331 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Acknowledgments

The following people have contributed over the years to the collective enterprise which is described. In ISD Scotland, James Boyd, Dorothy Gardner, Lena Henderson, Kevin McInneny, Margaret MacLeod, Fiona O'Brien, Chris Povey, Jack Vize, David Walsh, and Bruce Whyte have contributed their insight and programming skills. John Clarke and May Sleigh provided invaluable continuity and guidance. The expertise of Angela Bailey, Eileen Carmichael, Janey Read, and Maggi Reid in checking output has helped keep the system on course. In the Data Centre of the Scottish Health Service's Common Services Agency Gary Donaldson, Alison Jones, Ruth McIlroy, and Debbie McKenzie-Betts have laboured to put the linkage systems on a production basis.

References

Arellano, M.G. ( 1992). Comment on Newcombe et al. 1(992). Journal of the American Statistical Association, 87, 1204–1206.

Fellegi, I.P. and Sunter, A.B., ( 1969). A Theory of Record Linkage, Journal of the American Statistical Association, 40, 1183–1210.

Gill, L.E. and Baldwin, J.A. ( 1987). Methods and Technology of Record Linkage: Some Practical Considerations in Textbook of Medical Record Linkage, Baldwin J.A. et al. (eds), Oxford: Oxford University Press.

Gillespie, W.J.; Henry, D.A.; O'Connell, D.L.; Kendrick, S.W.; Juszczak, E.; McInneny, K.; and Derby, L. ( 1996). Development of Hematopoietic Cancers after Implantation of Total Joint Replacement, Clinical Orthopaedics and Related Research, 329S, S290–296.

Heasman, M.A. ( 1968). The Use of Record Linkage in Long-term Prospective Studies, in Record Linkage in Medicine: Proceedings of the International Symposium, Oxford, July 1967, Oxford: Oxford University Press.

Heasman, M.A. and Clarke, J.A. ( 1979). Medical Record Linkage in Scotland, Health Bulletin (Edinburgh), 37:97–103.

Hole, D.J.; Clarke, J.A.; Hawthorne, V.M.; and Murdoch, R.M. ( 1981). Cohort Follow-Up Using Computer Linkage with Routinely Collected Data, Journal of Chronic Disease, 34, 291–297.

Kendell, R.E.; Rennie, D.; Clarke, J.A.; and Dean, C. ( 1987). The Social and Obstetric Correlates of Psychiatric Admission in the Puerperium., in Textbook of Medical Record Linkage, Baldwin J.A. et al. (eds), Oxford: Oxford University Press.

Kendrick, S.W. and Clarke, J.A. ( 1993). The Scottish Medical Record Linkage System., Health Bulletin (Edinburgh), 51, 72–79.

Kendrick, S.W. and McIlroy, R. ( 1996). One Pass Linkage: The Rapid Creation of Patient-Based Data, in Proceedings of Healthcare Computing 1996: Current Perspectives in Healthcare Computing 1996, Weybridge, Surrey: British Journal of Healthcare Computing Books.

Page 332 Cite

Suggested Citation:"Chapter 10 Invited Session on More Record Linkage Applications in Epidemiology." National Research Council. 1999. Record Linkage Techniques -- 1997: Proceedings of an International Workshop and Exposition. Washington, DC: The National Academies Press. doi: 10.17226/6491.

×

Kendrick, S.W.; Douglas, M.M.; Gardner, D.; and Hucker, D. ( 1997). The Best-Link Principle in the Probability Matching of Population Data Sets: The Scottish Experience in Linking the Community Health Index to the National Health Service Central Register, Methods of Information in Medicine (in press).

Newcombe, H.B. ( 1988). Handbook of Record Linkage, Oxford: Oxford University Press.

Newcombe, H.B. ( 1995). Age-Related Bias in Probabilistic Death Searches Due to Neglect of the Prior Likelihoods, Computers and Biomedical Research, 28, 87–99.

Newcombe, H.B.; Kennedy, J.M.; Axford, S.J.; and James, A.P. ( 1959). Automatic Linkage of Vital Records, Science, 130, 954–959.

Newcombe, H.B.; Smith, M.E.; and Lalonde, P. ( 1986). Computerised Record Linkage in Health Research: An Overview, in Proceedings of the Workshop on Computerised Linkage in Health Research (Ottawa, Ontario, May 21–23, 1986), Howe, G.R. and Spasoff, R.A. (eds), Toronto: University of Toronto Express.

Newcombe, H.B.; Fair, M.E.; and Lalonde, P. ( 1992). The Use of Names for Linking Personal Records, Journal of the American Statistical Association, 87, 1193–1204.

West of Scotland Coronary Prevention Study Group ( 1995). Computerised Record Linkage Compared with Traditional Patient Follow-up Methods in Clinical Trials and Illustrated in a Prospective Epidemiological Study, Journal of Clinical Epidemiology, 48, 1441–1452.

Winkler, W.E. ( 1994). Advanced Methods for Record Linkage, Statistical Research Division, Statistical Research Report Series No. RR94/05, Washington D.C.: U.S. Bureau of the Census.