Chapter 7

Contributed Session on More Applications of Probabilistic Record Linkage

Chair: Jennifer Madans, National Center for Health Statistics

Authors:

Tony LaBillois, Marek Wysocki, and Frank J. Grabowiecki, Statistics Canada

Kenneth Robertson, Larry Huff, Gordon Mikkelson, Timothy Pivetz, and Alice Winkler, Bureau of Labor Statistics

Sandra Johnson, National Highway Traffic Safety Administration

Eva Miller, New Jersey Department of Education




Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition

A Comparison of Direct Match and Probabilistic Linkage in the Death Clearance of the Canadian Cancer Registry

Tony LaBillois, Marek Wysocki, and Frank J. Grabowiecki, Statistics Canada

Abstract

The Canadian Cancer Registry (CCR) is a longitudinal person-oriented database containing all the information on cancer patients and their tumours registered in Canada since 1992. The information at the national level is provided by the Provincial and Territorial Cancer Registries (PTCRs). An important aspect of the CCR is the Death Clearance Module (DCM). It is a system that is designed to use the death records from the Canadian Mortality Data Base to confirm the deaths of the CCR patients that occurred during a pre-specified period. After extensive preprocessing, the DCM uses a direct match approach to death confirm the CCR patients that had a death registration number on their record, and it performs a probabilistic record linkage between the remaining CCR patients and death records. For one province, death registration numbers are not provided with the cancer patient records. All these records go directly to the probabilistic linkage. For the rest of the country, a good proportion of the cancer patients reported as dead by the PTCRs have such a number that can be used to match the two databases directly. After an overview of the CCR and its DCM, this presentation will compare the situation where the direct match is used in conjunction with the probabilistic linkage to death confirm cancer patients versus the case where the probabilistic record linkage is used alone.

Introduction

In combining two sources of data, it is sometimes possible to match directly the records that represent the same units if these two sources have one common unique identifier.
Nevertheless, it is often not possible to find all the common units using only this approach, either because the two sources do not have a common unique identifier, or because the identifier, even where one is used, is not complete for all the records on the files. The case of the Death Clearance (DC) of the Canadian Cancer Registry (CCR) is an example of the latter. The purpose of this task is to associate cancer patient records with death certificate records to identify the individuals that are present on both files. The CCR already contains the death registration identifier for some patients, but not for all that may indeed be deceased. Consequently, the most reasonable process involves matching directly all the CCR patient records that have this information, and then using probabilistic record linkage in an attempt to couple the remaining records that could not directly match. It is our belief that this maximises the rate of association between the two files while reducing the processing cost and time. In this situation, one could also use probabilistic linkage, alone, to perform the same task. The intention of this study is to compare these two approaches.

Firstly, this paper provides an overview of the CCR with emphasis on the Death Clearance module. Secondly, the characteristics of the populations used in the study are described. Next, the paper explains the comparisons between the two approaches (process, results and interpretations); and finally, it presents the conclusions of this study.

Overview of the Canadian Cancer Registry

The Canadian Cancer Registry at Statistics Canada is a dynamic database of all Canadian residents diagnosed with cancer[1] from 1992 onwards. It replaced the National Cancer Incidence Reporting System (NCIRS) as Statistics Canada's vehicle for collecting information about cancer across the country. Data are fed into the CCR by the 11 Provincial and Territorial Cancer Registries (PTCRs) that are principally responsible for the degree of coverage and the quality of the data. Unlike the NCIRS that targeted and described the number of cancers diagnosed annually, the CCR is a patient-based system that records the kind and number of primary cancers diagnosed for each person over a number of years until death. Consequently, in addition to cancer incidence, information is now available about the characteristics of patients with multiple tumours, as well as about the nature and frequency of these tumours. Very importantly, since patients' records remain active on the CCR until confirmation of their death, survival rates for various forms of cancer can now be calculated.

The CCR comprises three modules: core, internal linkage and death clearance. The core module builds and maintains the registry. It accepts and validates PTCR data submissions, and subsequently posts, updates or deletes information on the CCR data base. The internal linkage module assures that the CCR is truly a person-based file, with only one patient record for each patient diagnosed with cancer from 1992 onwards. As a consequence, it also guarantees that there is only one tumour record for each unique primary tumour.
The internal linkage identifies and eliminates any duplicate patient records that may have been loaded onto the database as a result of name changes, subsequent diagnoses, or relocations to other communities or provinces/territories. Finally, death clearance essentially completes the information on cancer patients by furnishing the official date and cause of their death. It involves direct matching and probabilistic linking of cancer patient records to death registrations at the national level.

The Death Clearance Module

Death clearance is conducted on the CCR in order to meet a certain number of objectives (Grabowiecki, 1997). Among them, it will permit the calculation of survival rates for patients diagnosed with cancer; facilitate epidemiological studies using cause-of-death; and help file management of the CCR and PTCRs.

The death clearance module confirms the death of patients registered on the CCR by matching/linking[2] their patient records to death registrations on the Canadian Mortality Data Base (CMDB), or to official sources of mortality information other than the CMDB. These other sources include foreign death certificates and other legal documents attesting to, or declaring, death (they are added to the CMDB file before processing).

The first major input to this module is the CCR database, which is built of patient and tumour records. For every person described on the CCR, there is only one patient record, but as many tumour records as there are distinct, primary cancers diagnosed for that person. Patient records contain nominal, demographic and mortality information about the person, while tumour records principally describe the characteristics of the cancer and its diagnosis. CCR death clearance uses data from the patient record augmented with some fields from the tumour record (the tumour record describing the patient's most recently diagnosed tumour when there is more than one). More details on the variables involved are available in Grabowiecki (1997) and Statistics Canada (1994).

The second main input is the Canadian Mortality Data Base. This file is created by Statistics Canada's Health Division from the annual National Vital Statistics File of Death Registrations, also produced by Statistics Canada. Rather than going directly to the Vital Statistics File, death clearance uses the CMDB as the principal information source about all deaths in Canada, because of improvements that make it a better tool for record linkage. A separate record exists on the CMDB for every unique reported surname on each Vital Statistics record, namely the deceased's surname, birth/maiden name, and each component of a hyphenated surname (e.g., Gérin-Lajoie, Gérin, and Lajoie). All of the above surnames and the Surname of the Father of the Deceased have been transformed into NYSIIS[3] codes. For details on the CMDB data fields needed for death clearing the CCR, consult Grabowiecki (1997) and Statistics Canada (1997).

Death clearance can be performed at any time on the CCR. However, the most efficient and effective moment for performing death clearance is just after the completion of the Internal Record Linkage module, which identifies and removes any duplicate patient records on the CCR data base. The death clearance process has been divided into five steps.

Pre-Processing

In this phase the input data files for death clearance are verified and prepared for the subsequent processing steps. The specific years of CMDB data available to this death clearance cycle are entered into the system. Based upon these years, the cancer patient population from the CCR and the mortality records from the CMDB are selected.
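The surname handling described above for the CMDB (one record per reported surname, birth/maiden name, and each part of a hyphenated surname) can be illustrated with a minimal Python sketch. The field names `surname`, `maiden_name`, and `linkage_surname` are hypothetical and chosen only for illustration; the actual CMDB layout and the NYSIIS coding step are not reproduced here.

```python
def surname_variants(surname, maiden_name=None):
    """Return the distinct surnames under which one person would be recorded.

    Assumption: a hyphenated name contributes the full name plus each part.
    """
    variants = []
    for name in (surname, maiden_name):
        name = (name or "").strip()
        if not name:
            continue
        candidates = [name]
        if "-" in name:
            candidates += [part for part in name.split("-") if part]
        for candidate in candidates:
            if candidate not in variants:
                variants.append(candidate)
    return variants


def explode(record):
    """Yield one copy of the record per surname variant (hypothetical fields)."""
    for variant in surname_variants(record["surname"], record.get("maiden_name")):
        yield {**record, "linkage_surname": variant}
```

Applied to the paper's own example, `surname_variants("Gérin-Lajoie")` yields the full name followed by `Gérin` and `Lajoie`, so a linkage on any one of the three spellings can still reach the same person.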
Direct Match (DM)

The unique key to all the death registrations on the CMDB is a combination of three data fields: Year of Death, Province (/Territory/Country) of Death, and Death Registration Number. These three fields are also found on the CCR patient record. PTCRs can obtain this information by doing their own death clearance, using local provincial/territorial files of death registrations. Patient records having responses for all three key fields first pass through a direct match with the CMDB in an attempt to find mortality records with identical common identifiers. If none is found, they next pass through the probabilistic record linkage phase, along with those patient records missing one or more of the key match fields.

For the records that do match, five data items common to both the patient and CMDB records are compared (Sex, Day of Death, Month of Death, Year of Birth, Month of Birth). On both the CCR patient records and matched CMDB records, the responses must be non-missing and identical. If they are not, both the patient and mortality records are free to participate in the record linkage, where they may link together. Matched pairs that pass the comparison successfully are considered to represent the same person; they then move on to the post-processing phase.
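The direct match rule above can be sketched as follows. This is an illustrative reading of the description, not the DCM's actual code: the field names are hypothetical, missing values are assumed to be represented by `None`, and the CMDB key is assumed to be unique as the text states.

```python
# Hypothetical field names for the three key fields and five check fields.
KEY_FIELDS = ("death_year", "death_province", "death_reg_no")
CHECK_FIELDS = ("sex", "death_day", "death_month", "birth_year", "birth_month")


def direct_match(patients, deaths):
    """Split patients into directly matched pairs and records left for linkage."""
    deaths_by_key = {tuple(d[f] for f in KEY_FIELDS): d for d in deaths}
    matched, to_linkage = [], []
    for patient in patients:
        key = tuple(patient.get(f) for f in KEY_FIELDS)
        # A patient missing any key field cannot be matched directly.
        death = deaths_by_key.get(key) if None not in key else None
        confirmed = death is not None and all(
            patient.get(f) is not None and patient.get(f) == death.get(f)
            for f in CHECK_FIELDS
        )
        if confirmed:
            matched.append((patient, death))
        else:
            to_linkage.append(patient)
    return matched, to_linkage
```

A patient whose key finds a death record but who disagrees on any of the five check fields is returned in `to_linkage`, mirroring the rule that such records remain free to take part in the probabilistic linkage.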

Probabilistic Linkage (PL)

In order to maximise the possibility of successfully linking to the CMDB file, the file of unmatched CCR patient records is exploded by creating, for every person, a separate patient record for each unique Surname, each part of a hyphenated Surname, and the Birth/Maiden Name (a process similar to the one used to create the CMDB, described above). NYSIIS codes are generated for all names. The two files are then passed through the Generalised Record Linkage System (GRLS), and over 20 important fields are compared using a set of 22 rules. Based on the degree of similarity found in the comparisons, weights are assigned, and the CCR-CMDB record pairs with weights above the pre-established threshold are considered to be linked. When patient records link to more than one mortality record, the pair with the highest weight is taken and the other(s) rejected. Similarly, if two or more patients link to the same CMDB record, the pair with the highest weight is selected.

The threshold weight has been set at such a level that the probability of the linked pairs describing the same person is reasonably high; consequently, manual review is not necessary in the linkage phase. At the same time, the threshold has not been positioned too high, in order to avoid discarding too many valid links, and thus reducing the effectiveness of the record linkage process. The death information of linked CMDB records is posted onto the CCR patient records, overlaying any previously reported data in these fields. The linked pairs and unlinked CCR patient records join the matched pairs in proceeding to the post-processing phase of death clearance.

Post-Processing

Essentially, this phase updates the CCR data base with the results from the match and linkage phases. Also, the results are communicated to the PTCRs for their review, and for input into their own data bases.
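The link-selection rule of the Probabilistic Linkage step (keep pairs above the threshold, then keep only the highest-weight pair per patient and per mortality record) can be sketched as below. The greedy, descending-weight resolution is an assumption made for illustration; the text does not say how GRLS resolves competing pairs internally, and the weights here are placeholders rather than GRLS output.

```python
def select_links(scored_pairs, threshold):
    """Pick at most one link per patient and per death record.

    scored_pairs: iterable of (patient_id, death_id, weight) tuples, where
    patient_id identifies the person (not an exploded surname copy).
    Assumption: conflicts are resolved greedily, highest weight first.
    """
    used_patients, used_deaths = set(), set()
    links = []
    for patient_id, death_id, weight in sorted(
        scored_pairs, key=lambda pair: pair[2], reverse=True
    ):
        if weight <= threshold:
            break  # remaining pairs are at or below the threshold
        if patient_id in used_patients or death_id in used_deaths:
            continue  # a higher-weight pair already claimed this record
        used_patients.add(patient_id)
        used_deaths.add(death_id)
        links.append((patient_id, death_id, weight))
    return links
```

Because exploded surname copies share the same person identifier, several copies of one patient collapse to a single selected link in this sketch.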
Before the database is updated, copies are made of the patient records. This makes it possible to restore them to their pre-death confirmed state should the matches/linkages be judged to be incorrect later by the PTCRs.

Refusal Processing

Refusals are PTCR decisions, taken after their review of the feedback reports and files generated in the post-processing phase, that specific matches and linkages are incorrect, i.e., that the persons described on the CCR patient records are not the same persons to whose death registrations they matched or linked. In this step, the affected patient records have their confirmation of death reversed, and are restored to their pre-death clearance state.

A description of the entire DC Module is available in Grabowiecki (1997) and the detailed specifications of the Direct Match and Probabilistic Linkage can be found in Wysocki and LaBillois (1997).

Characteristics of the Target Populations for this Study

To perform our comparisons, a subset of the CCR population was selected that could best illustrate the effect of direct match versus probabilistic linkage. Three provinces were chosen: British Columbia, Ontario and Quebec. They were picked because they contain, within Canada, the largest populations of cancer patients, and the size of their respective populations is in the same order of magnitude. Quebec was specifically taken because its provincial cancer registry does not do death clearance. Consequently, the patient files sent to the CCR by this registry never contain complete death information. Therefore, no cancer patient record from Quebec can obtain a confirmation of death by means of the Direct Match process; all Quebec records participate in the Probabilistic Linkage. All other provinces do their own DC, and a significant number of their records on the CCR stand a good chance of being confirmed as dead as a result of the Direct Match.

Due to the availability of data from the CCR and the CMDB at this time, we used reference years of diagnosis 1992 and 1993. The distribution by age and sex of the cancer patients in the three provinces is shown in Figure 1, below. It appears that there are only minor differences in the populations of cancer patients between these three provinces. Consequently, such differences are not expected to cause differences in the results of the death clearances.

Figure 1. Demographic Characteristics of the Populations of Cancer Patients

It is also important to note that the data coming from different provinces are gathered by different PTCRs. Even though there is little difference between them in terms of coding practices, definitions and timeliness, certain variations still exist. In particular, the data sources used by the PTCRs to build their registries vary considerably among them (Gaudette et al., 1997). These considerations are taken into account in the interpretation of the results.

Direct Match and Probabilistic Linkage Vs. Only Probabilistic Linkage (Within the Same Province)

Process

This comparison is done by running the complete DC Module twice on the CCR data from British Columbia and Ontario.
In the first run, both the DM and the PL are used to identify pairs for death confirmation. In the second run, any death information contained on the CCR records from these provinces is ignored. The system thus channels all the records directly to the PL. Quebec data are not usable for this comparison because of the absence of complete death information on their CCR records.

By comparing the two sets of pairs obtained in each approach for death confirmation, it is possible to measure different phenomena: overall percentage of accepted pairs (death confirmations) for each approach; percentage of pairs that are common to both approaches; percentage of pairs that were present in the regular DC process (DM & PL) but not in the PL only; percentage of pairs that were not present in the regular DC process (DM & PL) but were found in the PL only; and computer time and cost for each approach. These measures help to evaluate the usefulness of the Direct Match in the DC process and, conversely, the impact of not having the CCR death information previously supplied by PTCRs.

Results and Observations

The results of this process are summarised in Figure 2, below. When both a DM and PL were performed, the majority of the pairs formed (approximately 95%) came from the DM. This was the case for both of the provinces involved in this part of the study. This result emphasises the importance of high quality death information in effectively matching records on these two files. There can be no direct match unless all of the death fields are identical on the two files, and these account for all but 5% of the total of pairs created in the DM and PL process.

Figure 2. Comparison of Ontario and British Columbia Using Both Methods

                          DM and PL                        PL only
        DC Population   Matched   Linked    Total     %     Total     %
Ont.           84,926    22,648    1,183   23,831  28.1    23,670  27.9
B.C.           33,103     8,058      360    8,418  25.4     8,367  25.3
Total         118,029    30,706    1,543   32,249  27.3    32,037  27.1

It is evident that in terms of the number of pairs obtained in the end, one can expect little difference between the two methods of death clearance. Additionally, the particular pairs obtained (which specific patients are confirmed) will also be very similar.
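The pair-level measures listed under Process reduce to set operations on the two lists of confirmed pairs. The sketch below assumes each pair is a hashable `(patient_id, death_id)` tuple; the function and key names are hypothetical, and computer time and cost are outside its scope.

```python
def compare_pair_sets(dm_pl_pairs, pl_only_pairs, population_size):
    """Summarise agreement between the DM-PL run and the PL-only run.

    Each argument is an iterable of (patient_id, death_id) tuples;
    population_size is the number of patient records in scope.
    """
    dm_pl, pl_only = set(dm_pl_pairs), set(pl_only_pairs)
    return {
        "pct_confirmed_dm_pl": round(100 * len(dm_pl) / population_size, 1),
        "pct_confirmed_pl_only": round(100 * len(pl_only) / population_size, 1),
        "common": len(dm_pl & pl_only),           # pairs found by both runs
        "only_in_dm_pl": len(dm_pl - pl_only),    # found only with the direct match
        "only_in_pl_only": len(pl_only - dm_pl),  # found only by linkage alone
    }
```

The same percentage formula reproduces the rates in Figure 2 from its totals, for example 23,831 confirmed pairs out of 84,926 Ontario records in scope.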
In this regard, the two methods differed by less than 1%. Those differences that did exist tended to reflect favourably on the DM-PL method. Both methods found the same 32,035 pairs. Beyond these, the DM-PL method found 214 pairs that the PL-only method did not (32,249 − 32,035). In percentage terms, this represented a negligible amount (again, less than 1%). Of those 214 pairs, roughly 94% were found in the direct match portion of the run; the others were found in the linkage. Conversely, two pairs were identified by the linkage-only method and not by its counterpart (32,037 − 32,035).

In regard to the actual cost of running the programs under the two different methods, the total for the DM-PL approach was 54% of the total cost incurred in running the PL alone. There is a certain small amount of instability in these numbers since the cost was dependent in part on the level of activity on the mainframe computer at the time that the programs were run. However, the percentage difference in the two costs is substantial even when this is considered. The relatively high cost of the linkage-only approach is due to the fact that the usual preprocessing steps must still be done but, at the same time, the number of records that are compared in the probabilistic linkage is considerably higher than the number used in the DM-PL approach (since many patient records, and their associated death records, will have been accounted for in the DM).

A Province With Only Probabilistic Linkage Vs. Provinces With Direct Match and Probabilistic Linkage

Process

For this part, the complete death clearance system is used to process the data of the three selected provinces. It will automatically produce death confirmation pairs by using the Direct Match and the Probabilistic Linkage for British Columbia and Ontario. Simultaneously, it will only apply the Probabilistic Linkage for Quebec, because the Quebec cancer registry does not report the necessary identifiers for the Direct Match to the CCR.

In comparing the death confirmation results obtained for each of the three provinces, it is possible to observe different phenomena. The first is the overall percentage of accepted pairs (death confirmations) for each province, and the possible contrast between Quebec and the two others. Another aspect to consider is the comparison of the percentage of death confirmations in Quebec versus those obtained with PL only for British Columbia and Ontario in the previous section. It is also interesting to evaluate the impact of not having the CCR death information previously supplied by PTCRs.

Results and Observations

The results obtained from the above process are summarised in Figure 3.

Figure 3. Ontario and British Columbia vs. Quebec, Where Only PL Was Possible

                          DM and PL                        PL only
        DC Population   Matched   Linked    Total     %     Total     %
Qué.           57,252         –        –        –     –    18,618  32.5
Ont.           84,926    22,648    1,183   23,831  28.1         –     –
B.C.           33,103     8,058      360    8,418  25.4         –     –

The percentage of pairs found from among the Quebec data is rather higher than the corresponding percentages for the other provinces. In addition, all the Quebec patient records which contained some death information were successfully linked to a mortality record during probabilistic linkage. This was not the case for all of the Ontario and BC records which contained death information; that is, there were some patients reported as deceased by Ontario and BC who neither matched nor linked to a CMDB record. Overall, 32.5% of the Quebec records that were in scope were successfully linked to the death file, while 28.1% of the Ontario records and 25.4% of the BC records were matched or linked.

As previously noted, the data from Quebec do not contain complete death information; they do, however, contain some records where the patient was reported as deceased by this province. It is probable that these were hospital deaths and so it is in turn very unlikely that the corresponding patients are being mistakenly reported as deceased. In essence, these patients can be anticipated to be good candidates to be successfully linked to a death record.

More generally, some cancer patients in Quebec receive treatment entirely outside of hospitals and such patients may not then be reported to the CCR. The data from Quebec might, therefore, contain a greater proportion of more serious cancers than do the data from the other provinces used in the study. This offers a possible explanation for the higher percentages of cancer patients confirmed in Quebec compared to Ontario and B.C.

Finally, we have seen that the differences between the outcomes observed for the Ontario-BC data, using the match and linkage, and the linkage only, in terms of the total number of pairs found, were relatively minor. Again, a greater percentage of pairs were found in Quebec than in the other provinces, possibly for the reasons outlined above.

Conclusions

Death Clearance of the CCR using PL only can be conducted as effectively as the DM-PL approach because of the reporting of high-quality personal and cancer data by the PTCRs. The advantages of the DM-PL method include lower operating costs to perform death clearance (increased efficiency), and greater certainty with the results (minimum manual review of cancer-mortality record pairs by PTCRs).

Footnotes

[1] The cancers that are reported to the CCR include all primary, non-benign tumours (with the exception of squamous and basal cell skin cancers, having morphology codes 805 to 808 or 809 to 811, respectively), as well as primary, benign tumours of the brain and central nervous system.
In the International Classification of Diseases System, 9th Revision (ICD-9), the following codes are included: for benign tumours, 225.0 to 225.9; for in situ/intraepithelial/noninfiltrating/noninvasive carcinomas, 230.0 to 234.9; for uncertain and borderline malignancies, 235.0 to 239.9; and finally, for primary site malignancies, 140.0 to 195.8, 199.0, 199.1, and 200.0 to 208.9. Similarly, according to the International Classification of Diseases for Oncology, 2nd Edition (ICD-O-2), the target population of cancers includes: all in situ, uncertain/borderline, and primary site malignancies (behaviour codes 1, 2, or 3), as well as benign tumours (behaviour code 0) with topography codes in the range C70.0 to C72.9 (brain and central nervous system).

[2] Matching entails finding a unique, assigned, identification number on two or more records, thus identifying them as belonging to the same person; whereas linkage concludes that two or more records probably refer to the same person because of the number of similar, personal characteristics found on them.

[3] NYSIIS (New York State Identification and Intelligence System) assigns the same codes to names that are phonetically similar. It is used to group like-sounding names and thus take into account, during record linkage, variations (and errors) in spelling, e.g., Burke and Bourque, Jensen and Jonson, Smith and Smythe.

References

Gaudette, L.; LaBillois, T.; Gao, R.-N.; and Whittaker, H. (1997). Quality Assurance of the Canadian Cancer Registry, Symposium 96, Nonsampling Errors, Proceedings, Ottawa, Statistics Canada.

Grabowiecki, F. (1997). Canadian Cancer Registry, Death Clearance Module Overview, Statistics Canada (internal document).

Statistics Canada (1994). Canadian Cancer Registry Data Dictionary, Health Statistics Division.

Statistics Canada (1997). Canadian Mortality Data Base Data Dictionary, Health Statistics Division (preliminary version).

Wysocki, M. and LaBillois, T. (1997). Death Clearance Record Linkage Specifications, Household Survey Methods Division (internal document).

Note: For further information, contact: Tony LaBillois, Senior Methodologist, Household Survey Methods Division, Statistics Canada, 16-L, R.H. Coats Building, Ottawa, Ontario K1A 0T6, e-mail: labiton@statcan.ca; Marek Wysocki, Methodologist, Household Survey Methods Division, Statistics Canada, 16-L, R.H. Coats Building, Ottawa, Ontario K1A 0T6, e-mail: wysomar@statcan.ca; Frank Grabowiecki, Project Manager, Health Division, Statistics Canada, 18-H, R.H. Coats Building, Ottawa, Ontario K1A 0T6, e-mail: grabfra@statcan.ca.

severe cases which by definition were likely to require treatment and thus to generate a medical record. About 76–87% of the drivers with incapacitating injuries linked to at least one injury or claims record (except for Wisconsin, which had limited access to outpatient data, and Pennsylvania, which used 6 levels to designate severity).

Linkage rates for persons with possible injuries varied widely among the seven states. Because of extensive insurance data resources, about two-thirds of the possible injuries linked in Hawaii and New York compared to a third or less in the other states. Many more records indicating “no injuries” matched in New York and Utah, again because of access to extensive computerized outpatient data for the minor injuries. Included in this group of not injured were people who appeared uninjured at the scene but who, hours or days after the crash, sought treatment for delayed symptoms, such as whiplash.

Overall, the CODES states without access to the insurance data linked 7–13% of the person-specific crash reports for crashes involving a car/light truck/van to at least one injury record, compared to 35–55% for Hawaii and New York, the states with extensive outpatient data. Wisconsin linked 2% of its drivers to the hospital inpatient state data and this rate matched that for the seven states as a group.

Linkage of the records for the motorcycle riders was much higher than for the car/light truck/van group, a reflection of the high injury rate for cyclists involved in police-reported crashes. As expected, the linkage rates were lower for the lower severities. Except for Pennsylvania and Wisconsin, more than 45 percent of the person-specific motorcycle crash records linked to at least one injury record.

Validation of the Linkages

Causes of false negatives and false positives vary with each linkage because each injury data file is unique.
Since it is unknown which records should link, validation of the linkage results is difficult. The absence of a record in the crash file prevents linkage to an injury record; the absence of a cause of injury code in the injury record risks a denominator inflated with non-motor vehicle crashes. The states assigned a high priority to preventing cases which should not match from matching and conservatively set the weight defining a match to a higher positive score. At the same time, they were careful not to set the weight defining a nonmatch too low so that fewer pairs would require manual review.

The false positive rate ranged from 3.0 to 8.8 percent for the seven states and was viewed as not significant since the linked data included thousands of records estimated to represent at least half of all persons involved in motor vehicle crashes in the seven CODES states. False positives were measured by identifying a random sample of crash and/or injury records and reviewing those that linked to verify that a motor vehicle crash was the cause of injury. Maine, Pennsylvania, and Wisconsin read the actual paper crash, EMS, and hospital records to validate the linkage. Missouri compared agreement on key linkage variables such as injury county, last initial, date of event, trafficway/trauma indicators, date of birth, or sex. Wisconsin determined that the false positive rate for the Medicaid linkage varied from that for hospitalizations generally since Medicaid cases were more likely to be found in urban areas.

False negatives were considered less serious than false positives, so the states adjusted the cut-off weight defining a nonmatch to give priority to minimizing the total matched pairs requiring manual review. A false negative represents an injury record with a motor vehicle crash designated as the cause which did not link to a crash report, or a crash record with a designated severe injury (i.e., fatal, incapacitating) for which no match was found.
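The two cut-off weights discussed above (one defining a match, one defining a nonmatch, with pairs in between sent to manual review) can be sketched as follows, together with the sample-based false positive estimate. The function names, the inclusive comparisons at each cut-off, and the boolean review format are assumptions for illustration; the states' actual software and cut-off values are not given in the text.

```python
def classify_pair(weight, match_cutoff, nonmatch_cutoff):
    """Assign a pair to 'match', 'nonmatch', or 'review' by its total weight."""
    if nonmatch_cutoff > match_cutoff:
        raise ValueError("nonmatch_cutoff must not exceed match_cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight <= nonmatch_cutoff:
        return "nonmatch"
    return "review"  # between the cut-offs: needs manual review


def false_positive_rate(reviewed_links):
    """Percent of sampled links that manual review judged incorrect.

    reviewed_links: list of booleans, True when review confirmed the link.
    """
    if not reviewed_links:
        raise ValueError("at least one reviewed link is required")
    rejected = sum(1 for confirmed in reviewed_links if not confirmed)
    return 100 * rejected / len(reviewed_links)
```

Raising `match_cutoff` trades false positives for false negatives, while raising `nonmatch_cutoff` toward it narrows the review band, which is the trade-off the states describe.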
The rates for false negatives varied from 4 to 30 percent depending on the linkage pass and the files being linked. The higher rates occurred when the power of the linkage variables to discriminate among the crashes and the persons involved was problematic. False negatives were measured by first identifying the records which should match. These included crash reports indicating ambulance transport, EMS records indicating motor vehicle crash as the cause of injury, or hospital records listing an E-code indicating a motor vehicle crash. These records were then compared to the linked records to identify those that did not link. False negatives were also identified by randomly selecting a group of crash reports and manually reviewing the paper records to identify those which did not link.

Crash and injury records failed to match for reasons including the following: one or the other was never submitted; the linking criteria were too restrictive; key data linkage variables were in error or missing; the case selection criteria, such as the E-code, were in error or missing; the crash-related hospitalization occurred after several hours or days had passed; or the crash or the treatment occurred out of state. Lack of date of birth on the crash report for passengers was a major obstacle to linkage for all of the states except Wisconsin, which included this information for all injured passengers. (As the result of the linkage process, Maine targeted the importance of including this data element on the crash report.)

Among the total false negatives identified by Wisconsin, 12 percent occurred because the admission was not the initial admission for the crash and 10 percent occurred because key linkage variables were missing. Another 7.5 percent occurred because the linking criteria were too strict. About 7 percent were missing a crash report because the crash occurred out of state or the patient had been transferred from another institution. Twelve percent of the false negatives were admitted as inpatients initially for reasons other than the crash. It was not possible to determine the false negative rates when the key data linkage variables or E-code were in error, when out-of-state injuries were treated in Wisconsin hospitals, or when the crash record was not received at DOT.
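The first measurement approach described above (identify the records that should match, then see which of them failed to link) amounts to a set difference. The following is a minimal sketch under the assumption that each record has some unique key; the function names and the record identifiers are hypothetical and the data are invented for illustration.

```python
from typing import Set


def missed_records(should_match_ids: Set[str], linked_ids: Set[str]) -> Set[str]:
    """Return the 'should match' records that never linked (false negatives)."""
    return should_match_ids - linked_ids


def false_negative_rate(should_match_ids: Set[str], linked_ids: Set[str]) -> float:
    """Share of 'should match' records that did not link."""
    if not should_match_ids:
        raise ValueError("no 'should match' records were identified")
    return len(missed_records(should_match_ids, linked_ids)) / len(should_match_ids)


# Hypothetical hospital records whose E-code indicates a motor vehicle crash.
should_match = {"H001", "H002", "H003", "H004", "H005"}

# Hypothetical records that linked to a crash report. "H900" linked but was
# not in the 'should match' group, so it does not affect this rate.
linked = {"H001", "H003", "H004", "H900"}

missed = missed_records(should_match, linked)
rate = false_negative_rate(should_match, linked)
```

As the text notes, a rate computed this way covers only records that can be identified as "should match"; records with a wrong or missing E-code never enter the denominator.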
In spite of the failure of some records to match, the estimates of matching among those that could be identified as “should match” were encouraging. Missouri estimated linkage rates of 65 percent of the hospital discharge records, 75 percent of the EMS records, and 88 percent of the head and spinal cord injury registry records when motor vehicle crash as the cause of injury was designated on the record. Comparison of Missouri's linked and unlinked records suggested that actual linkage rates were even higher, as the unlinked records included injuries not likely to be motor vehicle related (such as gunshot wounds, lacerations, punctures, and stabbings). The linked records showed higher rates of fractures and soft tissue injuries, which are typical of motor vehicle crashes. Seventy-nine percent of the fractures were linked, as were 78 percent of soft tissue injuries. The comparison of linked and unlinked records does not suggest that significant numbers of important types of records are being missed, though perhaps some less severely injured patients may be missed. Because ambulance linkage was used as an important intermediate link for the hospital discharge file, some individuals not injured severely enough to require an ambulance may have been missed, but they would also be less likely to require hospitalization. Any effect of this would be to raise slightly, and erroneously, the estimate of average charges for hospitalized patients.

Significance of the False Positive and False Negative Rates

Although the rates for the false negatives and false positives were not significant for the belt and helmet analyses, they may be significant for other analyses using different outcome measures and smaller population units. For example, analyses of rural/urban patterns may be sensitive to missing data from specific geographic areas. Analyses of EMS effectiveness may be sensitive to missing data from specific EMS ambulance services or age groups.
Another concern focuses on the definition of an injury link. Defining an injury to include linkage to any claim record that indicated medical treatment or payment increases the probability of including uninjured persons who go to the doctor for physical exams to rule out an injury. But this group also includes persons who are saved from a more serious injury by using a safety device, so although they inflate the number of total injuries, they are important to highway safety. When minor injuries are defined as injuries only if their existence is verified by linkage, then by definition the unlinked cases become non-injuries relative to the data sources used in the linkage. States using data sources covering the physician's office through to tertiary care will have more linkages and thus more “injuries.” Estimates of the percentage injured, transported, admitted as inpatients, and the total charges will vary accordingly.

The Linkage Methodology is Robust and the Linked Data Are Useful

Seven states with different routinely collected data that varied in quality and completeness were able to generate from the linkage process comparable results that could be combined to calculate effectiveness rates. The states also demonstrated the usefulness of the linked data. They developed state-specific applications to identify populations at risk and factors that increased the risk of high severity and health care costs. They used the linked data to identify issues related to roadway safety and EMS, to support safety legislation, to evaluate the quality of their state data, and for other state-specific purposes.

Record Linkage of Progress Towards Meeting the New Jersey High School Proficiency Testing Requirements

Eva Miller, Department of Education, New Jersey

Abstract

The New Jersey Department of Education has undertaken a records linkage procedure to follow the progress of New Jersey's public school students in meeting the state standardized graduation test, the High School Proficiency Test (HSPT). The HSPT is a test of higher order thinking skills mandated by state legislation in 1988 as a graduation requirement which measures “those basic skills all students must possess to function politically, economically, and socially in a democratic society.” The HSPT is first administered in the fall of the student's eleventh grade. If the student is unsuccessful in any of the three test sections (reading, mathematics, writing), he/she has additional opportunities, each semester, to retake those test sections for which the requirement is still unmet. In terms of public accountability of educational achievement, it is very important to define a population clearly and then to assess the quality of public education in two ways: the ability of the educational program to meet the challenge of the graduation test at the first opportunity (predominantly an evaluation of the curriculum); and the ability of the school system, essentially through the effectiveness of its interventions or remediations, to help the population meet the graduation requirement over the time remaining within a routine progression to graduation. New Jersey uses a unique student identifier (not social security number) and has designed a complete mechanism for following the students through the use of test answer folders, computerized internal consistency checks, and queries to the school districts.
The system has been carefully designed to protect confidentiality while tracking student progress in the many situations of moving from school to school or even in and out of the public school system, changes in grade levels, and changes in educational programs (such as mainstreaming, special education, and limited English proficient programs). Preserving confidentiality, linking completely to maintain the accuracy and completeness of the official records, definitions, and analysis will be discussed.

Introduction

The New Jersey Department of Education has undertaken a record linkage procedure involving use of computers in the deterministic matching of student records to follow the progress of New Jersey's public school students in meeting the state standardized graduation test, the High School Proficiency Test (HSPT). The HSPT is a test of higher order thinking skills mandated by state legislation in 1988 as a graduation requirement which measures “those basic skills all students must possess to function politically, economically, and socially in a democratic society.” The HSPT is first administered in the fall of the student's eleventh grade. If the student is unsuccessful in any of the three test sections (reading, mathematics, writing), he/she has additional opportunities, each semester, to retake the test section(s) not yet passed.

At first glance it would seem that the New Jersey Department of Education's records linkage task is an easy and straightforward one. In October 1995, 62,336 eleventh grade students were enrolled in regular educational programs in New Jersey's public schools, and 51,601 (or 82.8%) of these students met the HSPT testing requirement on their first testing opportunity (this figure also includes eleventh grade students who may have met the requirement in one or more test sections while categorized by their local educators as “retained tenth grade” students), so only 10,730 students need to be followed forward for three more semesters until graduation! Since some of these students (probably half again) will meet the requirement upon each testing opportunity, the number diminishes and the task should be trivial…right? We have high speed computers, and the public wanting this information thinks we just have to push a few buttons!

The problem is complicated, however, especially by flows of migration (students entering or leaving New Jersey's public schools) and mobility (students transferring from one public school to another), and it becomes increasingly subject to error as time from the original eleventh grade enrollment passes. From the perspective of the policy maker in the Department of Education whose intent it is to produce a report of test performance rates which are comparable over schools, districts, and socio-demographic aggregations, the problem is further complicated by the fact that grade designation is a decision determined by local educators, and rules may vary from school district to school district.
Changes in a student's educational status with respect to Limited English Proficiency programs and/or Special Education programs also complicate tracking. In terms of public accountability of educational achievement, it is very important to define a population clearly and then to assess the quality of public education in two ways: the ability of the educational program to meet the challenge of the graduation test at the first opportunity (predominantly an evaluation of the curriculum); and the ability of the school system, essentially through the effectiveness of its interventions or remediations, to help the population meet the graduation requirement over the time remaining within a routine progression to graduation.

Before the New Jersey Department of Education developed the cohort tracking system, information on HSPT test performance was reported specific to each test administration. This cross-sectional method of analysis was dependent on which students attended school during the test administration, and even more dependent on local determination of students' grade level attainments than a longitudinal study is. Using the cross-sectional reports, it was very difficult, if not impossible, to meaningfully interpret reports which were for predominantly retested student populations (i.e., what did the fall grade 12 test results report really mean?).

Methodology

The cohort tracking project is a joint effort involving the New Jersey Department of Education, National Computer Systems (NCS), and New Jersey educators in public high schools. The department is responsible for articulating the purpose of the project and establishing the procedures to be used, including such activities as statistical design and decision-making rules, maintaining confidentiality of individual performance information, and assuring appropriate use and interpretation of reported information. NCS is responsible for development and support of a customized computer system, its specifications, and its documentation. The system is written in COBOL and provides features necessary for generation of the identifier; sorting and matching; data query regarding mismatches, nonmatches, and the uniqueness of the identifier; and assurances of the one-to-one correspondence of identifier to student. The department and NCS share responsibility for maximizing the efficiency and effectiveness of the system and for trying to reduce the burden of paperwork involved in record keeping, minimizing queries back to local educators, utilizing the computer effectively in checking information for internal consistency, developing and maintaining quality control procedures for interim reports to the local educators and for public reports, and maximizing the yield of accurate information. The local educator maintains primary responsibility for the validity of the information by assuring the accuracy of identifier information about individual students; reviewing reports sent to them to assure the accuracy and completeness of information about their enrolled and tested student population; and ascertaining that every enrolled student is listed on the school's roster once and only once!

At its inception in October 1995, the cohort tracking project was intended to follow a defined population of eleventh grade students forward to their anticipated graduation (the static cohort). Local educators objected to this methodology because they could only educate students who were currently enrolled. To address this very important concern, the dynamic cohort was defined (see Figure 1).
In effect, the dynamic cohort represents a statistical adjustment of the original static cohort at each test administration: it removes from observation students who have left the reference group (school, district, or statewide) without meeting the graduation testing requirement, and it adds those students who entered the reference group after fall of the eleventh grade and have not already met the testing requirement. Statistics produced for either the static cohort (prospective perspective) or the dynamic cohort (retrospective perspective) are not true rates but rather indices, since after the first test administration (on the last day of testing in the fall of the eleventh grade) these populations are no longer groups of students served within a school, district, or state at a specific moment in time.

The mobility index is simply the sum of the number of students entering and the number of students leaving the reference group since the last day of testing in fall of the eleventh grade, divided by the size of the reference group at the initial time point. It was designed to help the user interested in evaluating educational progress as assessed by the HSPT (educator, parent, student, citizen, and/or policy maker) decide which set of statistics, static or dynamic, would be more appropriate with respect to a particular reference group (school, district, or state). The higher the mobility level, the greater the difference between the two sets of statistics, and the more the user should rely on the dynamic statistics.

In developing the system, the department needed a cost-effective, accurate, and timely system. The department needed exact matches and, therefore, could not rely on probability matching or phonetic schemes such as NYSIIS. A system with a number of opportunities for the local educator to review and correct the information was developed.
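The dynamic cohort adjustment and the mobility index defined above can be written out as a minimal sketch. The function names and student identifiers below are hypothetical, and the code is an illustration of the definitions in the text rather than the logic of the actual COBOL system.

```python
from typing import Set


def dynamic_cohort(
    static_cohort: Set[str],
    left_without_passing: Set[str],
    entered_not_yet_passed: Set[str],
) -> Set[str]:
    """Adjust the static cohort for one test administration.

    Students who left the reference group without meeting the testing
    requirement are removed; students who entered after fall of grade 11
    and have not yet met the requirement are added.
    """
    return (static_cohort - left_without_passing) | entered_not_yet_passed


def mobility_index(entered: int, left: int, initial_size: int) -> float:
    """(students entering + students leaving) / initial reference group size."""
    if initial_size <= 0:
        raise ValueError("initial reference group must be non-empty")
    return (entered + left) / initial_size


# Hypothetical school: four students in the fall grade 11 static cohort.
static = {"S1", "S2", "S3", "S4"}
left = {"S4"}           # left without meeting the requirement
entered = {"S5", "S6"}  # entered later, requirement not yet met

dynamic = dynamic_cohort(static, left, entered)

# In this toy example every mover has not yet met the requirement, so the
# same sets feed both calculations. The index in the text counts all
# students entering or leaving, whatever their testing status.
index = mobility_index(len(entered), len(left), len(static))
```

Here three students moved in or out of a reference group that started with four, giving an index of 0.75; a value that high would point the user toward the dynamic statistics.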
A mismatch (or Type II error) was considered to have far more serious consequences in this tracking application than a nonmatch (or Type I error), because an educator might be notified that a student met requirements in one or more testing sections when that has not yet occurred (and the student might have been denied an opportunity to participate in a test administration based on a mismatch). The nonmatch is especially of concern to the local educator, because the most likely scenario here is that the student is listed in the file more than once, and none of these (usually incomplete) student records is likely to show all of the student's successes. The student is therefore in the denominator population multiple times and has little or no chance of entering the numerator of successful students. In working with various lists and HSPT ID discrepancy reports, local educators have gained a heightened awareness of the “Quality in…Quality out” rule mentioned by Martha Fair (Fair and Whitridge, 1997).
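The exact-match approach and the two error types discussed above can be sketched as follows. This is a hedged illustration, not the NCS system: the record layout, the function names, and the choice to normalize names to upper case before comparing are all assumptions made for the example. Records are linked only on an identical identifier; a disagreement on the secondary fields is flagged for review rather than linked (guarding against mismatches), and the same student appearing under more than one identifier is reported (the nonmatch scenario).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class StudentRecord:
    hspt_id: str      # hypothetical identifier format
    last_name: str
    first_name: str
    birth_date: str   # ISO date string, an assumption for this sketch
    gender: str


def secondary_key(rec: StudentRecord) -> Tuple[str, str, str, str]:
    """Secondary fields used to confirm an identifier match."""
    return (
        rec.last_name.strip().upper(),
        rec.first_name.strip().upper(),
        rec.birth_date,
        rec.gender.upper(),
    )


def link_exact(
    master: List[StudentRecord], incoming: List[StudentRecord]
) -> Tuple[List[StudentRecord], List[StudentRecord], List[StudentRecord]]:
    """Return (matched, discrepancies, unmatched) for the incoming records."""
    master_by_id = {rec.hspt_id: rec for rec in master}
    matched, discrepancies, unmatched = [], [], []
    for rec in incoming:
        known = master_by_id.get(rec.hspt_id)
        if known is None:
            unmatched.append(rec)       # no such identifier on file
        elif secondary_key(known) == secondary_key(rec):
            matched.append(rec)         # identifier and secondary fields agree
        else:
            discrepancies.append(rec)   # possible mismatch: query the school
    return matched, discrepancies, unmatched


def ids_sharing_a_student(roster: List[StudentRecord]) -> Dict[tuple, Set[str]]:
    """Find students listed under more than one identifier."""
    ids_by_student: Dict[tuple, Set[str]] = defaultdict(set)
    for rec in roster:
        ids_by_student[secondary_key(rec)].add(rec.hspt_id)
    return {k: v for k, v in ids_by_student.items() if len(v) > 1}


# Invented records, for illustration only.
master = [
    StudentRecord("A100", "LEE", "ANA", "1979-03-02", "F"),
    StudentRecord("A101", "DIAZ", "LUIS", "1979-07-15", "M"),
    StudentRecord("A102", "Lee", "Ana", "1979-03-02", "F"),  # same student, second ID
]
incoming = [
    StudentRecord("A100", "Lee", "Ana", "1979-03-02", "F"),
    StudentRecord("A101", "DIAZ", "LUIS", "1979-07-16", "M"),  # birth date disagrees
    StudentRecord("A999", "KIM", "JO", "1979-01-01", "F"),
]

matched, discrepancies, unmatched = link_exact(master, incoming)
duplicates = ids_sharing_a_student(master)
```

In this sketch the flagged discrepancy and the duplicate listing would both be returned to the local educator for correction, which mirrors the review opportunities the text describes.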

Figure 1. Definitions of Static and Dynamic Cohort

Considerations Regarding the Identifier

This records linkage application is a relational database dependent on a unique identifier, the HSPT identification number (HSPT ID), and supported by the following secondary fields: name (last and first but not middle initial), date of birth, and gender. In determining a unique identifier, the department first considered the use of social security number (SSN) because it is a number which has meaning to the individual (is known), and is nearly universal and readily available to the individual. However, the department abandoned the plan to use SSN before the tracking project was implemented because citizens complained, both verbally and in writing. Concerns included not wanting to draw attention to illegal aliens, the reasonableness of the number in terms of an individual's willingness to disclose it to school officials for this purpose, and the possibility that use of the SSN might make it possible to access other unrelated files. Measured against Fair's criteria for a personal identifier (permanence, universality, reasonableness with respect to lack of objection to its disclosure, economy, simplicity, availability, having knowledge or meaning to the individual, accuracy, and uniqueness), the HSPT ID received low marks for permanence and for being easily known or meaningful to the student. The HSPT ID received high marks on universality and on reasonableness with respect to lack of objection to its disclosure, precisely because it lacked meaning and could not be easily related to other records.
The HSPT ID is also economical, simple, accurate, and secured and safeguarded: procedures have been implemented which assure that only appropriate school officials can access specific HSPT IDs for their enrolled populations and then access confidential information associated with these students' records of test results, in accordance with concerns regarding data confidentiality (U.S. Department of Education, 1994). Work to assure that there is only one number per student includes an HSPT ID update report, an HSPT ID discrepancy report, and multiple opportunities for record changes to correct information on student identifiers based on local educators' reviews of rosters (lists) of their students' test results.

The HSPT ID is generated within the tracking system on the answer folder for each first-time test taker. Repeat test takers use stickers with student identifiers contained in a computer bar code label provided by NCS. District test coordinators can also contact staff at NCS and, after reasonable security checks are completed, obtain the HSPT ID and test results (from previous test administrations) for entering students who have already been tested. A cohort year designation is to be assigned to a student once and only once. Safeguards are currently being developed to assure that, despite grade changes over time, each student is followed based upon the initial (and only) cohort year designation.

Validity Assurances

The department and NCS are currently developing additional computerized procedures to assure the one-to-one correspondence of the HSPT ID to the student. A critical element in assuring the correct identification of each enrolled student, as well as the validity of the pass/fail indicators (for each test section and the total test requirement), is the review of the static roster immediately following the fall eleventh grade test administration.
Recently, safeguards have also been added to the computer system to assure that, for a given HSPT ID, once a passing score in a particular test section has been obtained by a student, no further information on testing in that test section can be accepted by the cohort tracking system, because the first passing score is the official passing score; and that, for a student who has met the testing requirement by passing all test sections, the HSPT ID number is locked and the system accepts no new information to be associated with that HSPT ID.

Quality Control Procedures for Cohort Reports

Quality control of cohort reports is a joint effort on the part of the department and NCS. Quality control procedures include visual review of student rosters and statistical reports, and utilization of a system of SAS programs generated under the same project definitions and decision-making rules by a different programmer in order to check the logic used in the COBOL programs. To date, this quality control has been conducted three times. There is a written quality control protocol which has made it possible to move from the implicit understanding of records linkage methodology and computer systems capabilities to explicit criteria for this particular application. These explicit criteria are identified, clearly articulated, and observable. This protocol has been very useful in that it helped clarify expectations for NCS, allowed more department staff to participate fully in the quality control process while minimizing the need for specific project orientation or training time, and produced more complete documentation of the quality control effort for each cohort after each test administration. Refinement of these quality control procedures is ongoing.

Confidentiality

In addition to the procedures for release of HSPT ID described above, confidentiality is preserved on cohort dynamic out rosters in that students who have left a school are listed without pass/fail indicators for test sections and the total testing requirement.
With respect to public reporting, the department has been very conservative in using a rule of “10” instead of the rule of “three”; in this way individual student test performance information is not discernible from information reported publicly.

Results

Actual test performance results based on this longitudinal study have been reported only once to date (Klagholz, L. et al., 1996). These results were for the first cohort, juniors in October 1995, and followed students through one academic year (two test administrations). The audience of public users seemed to receive the information well and is currently anticipating the December 1997 release of information, which is to include the academic progress of the 1995 cohort toward graduation and, in comparison to last year's public release, the academic progress of the second cohort, juniors in October 1996, through their junior year.

While there are a variety of ways to correct and update the cohort tracking master data base, a key (and predominant) method is the first record change process after receiving initial test results. The record change process is an opportunity for correction of erroneous data related to permanent student identifiers (name, date of birth, and gender), personal status identifiers (school enrollment, grade, and participation in special programs such as Special Education, Limited English Proficiency programs, and Title I), and test-related information (attendance at time of testing each content area, void classifications, and first time or retest taking statuses). There were a total of 2,526 record changes processed at this first opportunity for data review. This is not an unduplicated student count, since one student's records might have needed several variables to be corrected. It is not readily obvious what denominator to suggest in determining rates. The 93,627 total students enrolled (including students in special programs) would be most appropriate in determining a proportion of the total population to be tested for a cohort year. Another approach, however, would be to segment the record changes as they relate to the cohort tracking project: the number of record changes for the students who had one or more test sections yet to pass after the October test administration would be useful, but that statistic is not readily available. Numbers of record changes by reason were as follows:

Student identifiers
  Name: 447
  Date of birth: 218
  Gender: 41
  School: 53
Status
  Grade: 367
  Special Education: 631
  Limited English Proficiency: 196
  Title I: 153
Test-specific information
  Attendance on test days: 43
  Void classifications: 57
  First time or retest status: 381

A systematic error occurred regarding 731 students who were tested for the first time in April 1996. These students were not appropriately reflected in the dynamic cohort statistics. The computer system has been corrected to handle these cases correctly. This also necessitated tightening the quality control protocol and procedures. Corrected test performance rates for that same time point in the longitudinal study will be released in December 1997.
While no one ever wants to release erroneous information, it was interesting to note the order of magnitude of the correction: for 51 schools there was no change, for 44 schools the correction increased pass rates by up to 0.8%, and for 136 schools pass rates decreased (by up to 1.0% for 104 schools and by between 1.1% and 4.2% for 32 schools).

The mobility index was designed to measure the stress on student populations caused by students who change the educational climate by either entering or leaving a particular high school after October of their junior year. This index was considered necessary to guide the decision as to whether the static or the dynamic set of statistics would be the more appropriate measure of progress for a given reference group (school, district, state). The mobility index was observed to have a strong negative correlation with test performance. This finding was especially important to educators in communities with high mobility in that it helped these educators quantify the seriousness of the socio-economic problem and communicate, in understandable terms, the consequences of these moves for the continuity of students' educational experiences and for their educational progress in meeting performance standards.

Plans for the Future

The department has an ambitious plan to increase the graduation test (and assessments at fourth and eighth grades) to include eight test sections (content areas). The cohort tracking project is to be expanded to include all test sections on the graduation test. Cohort tracking is also to be extended vertically to the other grades for which there is a statewide assessment program. The possibility of developing a population register for students enrolled in New Jersey's public schools is under discussion. The cohort tracking system would then be incorporated into a larger department information system for reviewing educational programs, attendance, and school funding as well as outcome measures such as test results. In discussions about a population registry, social security number has been proposed for the linkage variable.

References

Fair, M. and Whitridge, P. (1997). Record Linkage Tutorial, Record Linkage Techniques -- 1997, Washington, D.C.: Office of Management and Budget.

Klagholz, L.; Reece, G.T.; and DeMauro, G.E. (1996). 1995 Cohort State Summary: Includes October 1995 and April 1996 Administrations Grade 11 High School Proficiency Test, New Jersey Department of Education.

U.S. Department of Education (1994). Education Data Confidentiality: Two Studies, Issues in Education Data Confidentiality and Access and Compilation of Statutes, Laws, and Regulations Related to the Confidentiality of Educational Data, Washington, D.C.: National Center for Education Statistics, National Forum on Education Statistics.