Chapter 2

Invited Session on Record Linkage Applications for Epidemiological Research

Chair: John Armstrong, Elections Canada

Authors:

Leicester E. Gill, University of Oxford

John R. H. Charlton, Office for National Statistics, UK, and Judith D. Charlton, JDC Applications

John Van Voorhis, David Koepke, and David Yu, University of Chicago









OX-LINK: The Oxford Medical Record Linkage System

Leicester E. Gill, University of Oxford

Abstract

This paper describes the major features of the Oxford record linkage system (OX-LINK): its use of the Oxford Name Compression Algorithm (ONCA), the calculation of the name weights, the use of orthogonal matrices to determine the threshold acceptance weights, and the use of combinational and heuristic algebraic algorithms to select the potential links between pairs of records. The system was developed using the collection of linkable abstracts that comprise the Oxford Record Linkage Study (ORLS), which covers 10 million records for 5 million people and spans 1963 to date. The linked dataset is used for the preparation of health services statistics, and for epidemiological and health services research. The policy of the Oxford unit is to link all the records comprehensively rather than to prepare links on an ad hoc basis. The OX-LINK system has been further developed and refined for internally cross-matching the whole of the National Health Service Central Register (NHSCR) against itself (57.9 million records) and detecting and removing duplicate pairs, as a first step towards the issue of a new NHS number to everyone in England and Wales. A recent development is the matching of general practice (primary care) records with hospital and vital records to prepare a file for analyzing referral, prescribing and outcome measures. Other uses of the system include ad hoc linkages for specific cohorts, academic support for the development of test programs and data for efficiently and accurately tracing people within the NHSCR, and developing methodologies for preparing registers containing a high proportion of ethnic names.

Medical Record Linkage

The term record linkage, first used by H. L. Dunn (1946; Gill and Baldwin, 1987), expresses the concept of collating health-care records into a cumulative personal file, starting with birth and ending with death. Dunn also emphasised the use of linked files to establish the accuracy or otherwise of the recorded data. Newcombe (Newcombe et al., 1959; Newcombe, 1967, 1987, and 1988) undertook the pioneering work on medical record linkage in Canada in the 1950s and thereafter; Acheson (1967, 1968) established the first record linkage system in England in 1962. When the requirement is to link records created at different times and in different places, in principle it would be possible to link such records using a unique personal identification number. In practice, a unique number has not generally been available on UK records of interest in medicine, and other methods, such as the use of surnames, forenames and dates of birth, have therefore been necessary to identify different records relating to the same individual. In this paper, I will confine my discussion to the linkage of records for different events which relate to the same person.

Matching and Linking

The fundamental requirement for correct matching is that there should be a means of uniquely identifying the person on every document to be linked. Matching may be all-or-none, or it may be probabilistic, i.e., based on a computed calculation of the probability that the records relate to the same person, as described below. In probability matching, a threshold of likelihood is set (which can be varied in different circumstances), above which a pair of records is accepted as a match relating to the same person, and below which the match is rejected.

The main requirement for all-or-none matching is a unique identifier for the person which is fixed, easily recorded, verifiable, and available on every relevant record. Few, if any, identifiers meet all these specifications. However, systems of numbers or other ciphers can be generated which meet these criteria within an individual health-care setting (e.g., within a hospital or district) or, in principle, more widely (e.g., the National Health Service number). In the past, the National Health Service number in England and Wales had serious limitations as a matching variable, and it was not widely used on health-care records. With the allocation of the new ten-digit number throughout the NHS all this is being changed (Secretaries of State, 1989; National Health Service and Department of Health, 1990), and it will be incorporated in all health-care records from 1997.

Numbering systems, though simple in concept, are prone to errors of recording, transcription and keying. It is therefore essential to consider methods for reducing errors in their use. One such method is to incorporate a checking device such as check-digits (Wild, 1968; Hamming, 1986; Gallian, 1989; Baldwin and Gill, 1982; and Holmes, 1975). In circumstances where unique numbers or ciphers are not universally used, obvious candidates for use as matching variables are the person's names, date of birth, sex and perhaps other supplementary variables such as the address or postcode and place of birth. These, considered individually, are partial identifiers, and matching depends on their use in combination.

Unique Personal Identifiers

Personal identification, administrative and clinical data are gradually accumulated during a patient's spell in a hospital and finalized into a single record. This type of linkage is conducted as normal practice in hospital information systems, especially in those hospitals having Patient Administration Systems (PAS) and District Information Systems (DIS), which use a centrally allocated check-digited District Number as the unique identifier (Goldacre, 1986). Identifying numbers are often made up, in part, from stable features of a person's identification set, for example sex, date of birth and place of birth, and so can be reconstructed in full or in part even if the number is lost or forgotten. In the United Kingdom (UK), the new 10-digit NHS number is an arbitrarily allocated integer, almost impossible to commit to memory, and cannot be reconstructed from the person's personal identifiers. Difficulties arise, however, where the health event record does not include a unique identifier. In such cases, matching and linking depend on achieving the closest approach to unique identification by using several identifying variables, each of which is only a partial identifier but which, in combination, provide a match that is sufficiently accurate for the intended uses of the linked data.

Personal Identifying Variables

The personal identifying variables that are normally used for person matching can be considered in five quite separate groups.

Group 1. –The person's proper names, which, with the exception of the present surname when women adopt their husband's surname on marriage, rarely change during a person's lifetime: birth surname; present surname; first forename or first initial; second forename or second initial; and other forenames.

Group 2. –Non-name personal characteristics that are fixed at birth and very rarely change during the person's lifetime: gender (sex at birth); date of birth; place of birth (address where the parents were living when the person was born); NHS number (allocated at birth registration, both old and new formats); date of death; and ethnicity.

Group 3. –Socio-demographic variables that can change many times during the course of the person's lifetime: street address; postcode; general practitioner; marital status; social class; number(s) allocated by a health district or special health-care register; number(s) allocated by a hospital or trust; number(s) allocated by a general practitioner's computing system; and any other special hospital-allocated numbers.

Group 4. –Other variables that could be used for the compilation of special registers: clinical specialty; diagnosis; cancer site; drug idiosyncrasy or therapy; occupation; date of death; and other dates (for example, LMP).

Group 5. –Variables that could be used for family record linkage: other surnames; mother's birth surname; father's surname; marital status; number of births; birth order; birth weight; date of marriage; and number of marriages.

File Ordering and Blocking

Matching and linkage in established datasets usually involves comparing each new record with a master file containing existing records. Files are ordered or blocked in particular ways to increase the efficiency of searching. In similar fashion to looking up a name in a telephone directory, the matching algorithm must be able to generate the "see also" equivalent of the surname for variations in spelling (e.g., Stuart and Stewart; Mc, Mk, and Mac). Searching can be continued, if necessary, under the alternative surname. Algorithms that emulate the "see also" method are used for computer matching in record linkage. In this way, for example, Stuarts and Stewarts are collated into the same block. A match is determined by the amount of agreement and disagreement between the identifiers on the "incoming" record and those on the master file. The computer calculates the statistical probability that the person on the master file is the same as the person on the record with which it is compared.

File Blocking

The reliability and efficiency of matching is very dependent on the way in which the initial grouping, or "file-blocking," step is carried out. It is important to generate blocks of the right size. The balance between the number and size of blocks is particularly important when large files are being matched. The selection of variables to be used for file blocking is therefore critical, and will be discussed before considering the comparison and decision-making stages of probability matching. Any variable that is present on each and every record in the dataset to be matched could be used to divide or block the file, so enhancing the search and reducing the number of unproductive comparisons. Nevertheless, if there is a risk that the items chosen are wrongly recorded, so that records are assigned to the wrong file block, then potential matches will be missed. Items that are likely to change their value from one record to another for the same person, such as home address, are not suitable for file blocking.

The items used for file blocking must be universally available, reliably recorded and permanent. In practice, it is almost always necessary to use surnames, combined with one or two other ubiquitous items, such as sex and year of birth, to subdivide the file into blocks that are manageable in size and stable. Considerable attention has been given to the ways in which surnames are captured, and to algorithmic methods that reduce or eliminate the effects of variations in spelling and reporting and that "compress" names into fixed-length codes.

Phonemic Name Compression

In record linkage, name compression codes are used for grouping together variants of surnames for the purposes of blocking and searching, so that effective match comparisons can be made using both the full name and other identifying data, despite misspelled or misreported names. The first major advance in name compression was achieved by applying the principles of phonetics to group together classes of similar-sounding groups of letters, and thus similar-sounding names. The best known of these codes was devised in the 1920s by Odell and Russell (Knuth, 1973) and is known as the Soundex code. Other name compression algorithms are described by Dolby (1970) and elsewhere.

Soundex Code and the Oxford Name Compression Algorithm (ONCA)

The Soundex code has been widely used in medical record systems despite its disadvantages. Although the algorithm copes well with Anglo-Saxon and European names, it fails to bring together some common variants of names, such as Thomson/Thompson, Horton/Hawton, Goff/Gough, etc., and it does not perform well where the names are short (as is the case for the very common names), have a high percentage of vowels, or are of Oriental origin. A second approach to name compression transforms groups of consonants within names into specific combinations of both vowels and consonants (Dolby, 1970). Among several algorithms of this type, that devised by the New York State Information and Intelligence System (NYSIIS) has been particularly successful, and has been used in a modified form by Statistics Canada and in the USA for an extensive series of record linkage studies (Lynch and Arends, 1977).

A recent development in the Unit of Health-Care Epidemiology (UHCE) (Gill and Baldwin, 1987; Gill et al., 1993), referred to as the Oxford Name Compression Algorithm (ONCA), uses an anglicised version of the NYSIIS method of compression as the initial or pre-processing stage; the transformed and partially compressed name is then Soundexed in the usual way. This two-stage technique has been used successfully for blocking the files of the ORLS, and overcomes most of the unsatisfactory features of pure Soundexing while retaining a convenient four-character fixed-length format. The blocks produced using ONCA alone vary in size, from quite small and manageable for the less common surnames to very large and uneconomic for the more common surnames. Further subdivision of the ONCA blocks on the file can be effected using sex, forename initial and date of birth, either singly or in combination.
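The ONCA procedure itself is not reproduced here, but the flavour of phonetic compression and its use in blocking can be illustrated with plain Soundex, which forms the second stage of ONCA. The sketch below is an illustration only, not the OX-LINK code; the blocking_key function and its fields are assumptions for the example. (For the surnames used later in this paper, HALL and SMITH, plain Soundex happens to give the same four-character codes, H400 and S530, as the ONCA codes quoted.)

```python
# Illustrative only: standard Soundex plus a simple blocking key, not the actual
# ONCA implementation (ONCA applies an anglicised NYSIIS pass first and then
# Soundexes the partially compressed name).
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(name: str) -> str:
    """Return the four-character Soundex code for a surname."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return "Z000"
    first, rest = name[0], name[1:]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in rest:
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (first + "".join(digits) + "000")[:4]

def blocking_key(surname: str, sex: str, birth_year: int) -> str:
    """Combine a compressed surname with other ubiquitous items to form a block."""
    return f"{soundex(surname)}|{sex}|{birth_year}"

# Spelling variants fall into the same block:
assert soundex("Stewart") == soundex("Stuart") == "S363"
print(blocking_key("Stewart", "F", 1952))   # 'S363|F|1952'
```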
ORLS File Blocking Keys and Matching Variables

The file blocking keys used for the ORLS are generated in the following fashion:

The primary key is generated using the ONCA of the present surname.

The secondary key is generated from the initial letter of the first forename. Where this forename is a nickname or a known contraction of the "formal" forename, the initial of the "formal" forename is used. For example, if the recorded forename was BILL, the "formal" forename would be William, and the initial used would be W.

A further record is set up on the master file where a second forename or initial is present; the key is derived from this second initial.

Where the birth surname is not the same as the present surname, as in the case of married women, a further record is set up on the master file under the ONCA code of the birth surname, again subdivided by the initial. (This process is termed exploding the file.)

Further keys based on the date of birth and other blocking variables are also generated.

In addition to the sorting header, four other variables are added to each record before sorting and matching are undertaken:

Accession Number. –A unique number allocated from a pool of such numbers, absolutely unique to this record. The number is never changed and is used to identify this record for correction and amendment. The number is check digited to modulus 97.

Person or System Number. –A unique number allocated from a pool of such numbers. The number can be changed or replaced if this record matches with another record. The number is check digited to modulus 97.

Coding Editions. –Indicators that record the various editions of the coding frames used in this record, for example the version of the ICD (International Classification of Diseases) or of the surgical procedure codes. These indicators ensure that the correct coding edition is always recorded on each and every record, and reliance is not placed on a vague range of dates.

Input and Output Stream Number. –This variable is used for identifying a particular dataset during a matching run, and enables a number of matches to be undertaken independently in the same pass down the master file.
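The accession and person numbers above are described as check digited to modulus 97, but the exact scheme is not spelled out in this excerpt. The following is therefore only a generic illustration of a modulus-97 check of the kind used elsewhere (e.g., ISO 7064-style identifiers): two check digits are appended so that the complete number satisfies a fixed congruence, which allows single-digit keying errors, and most transpositions, to be detected on input.

```python
# A rough sketch only: the paper says these numbers are "check digited to
# modulus 97" but does not spell out the scheme.  One common mod-97
# construction appends a two-digit check value so the full number is
# divisible by 97.
def append_mod97_check(base: int) -> int:
    """Append two check digits so that the resulting number is divisible by 97."""
    check = (-base * 100) % 97          # choose c with (base*100 + c) % 97 == 0
    return base * 100 + check

def is_valid_mod97(number: int) -> bool:
    """Verify a number produced by append_mod97_check."""
    return number % 97 == 0

acc = append_mod97_check(1234567)       # hypothetical accession number
assert is_valid_mod97(acc)
# A single-digit keying error changes the residue, so it is detected:
assert not is_valid_mod97(acc + 10)
```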

Generating Extra Records Where a Number of Name Variants Are Present

To ensure that the data record can match with the blocks containing all possible variants of the names information, multiple records are generated on the master file containing combinations of present and birth surnames and forenames. To illustrate the generation of extra records where the identifying set for a person contains many variants of the names, consider the following example:

birth surname: SMITH
present surname (married surname): HALL
first forename: LIZ (contraction of Elizabeth)
second forename: PEGGY (contraction of Margaret)
year of birth: 1952 (old enough to be married).

Eight records would be generated on the master file, each indexed under a different combination of ONCA code and initial, as follows:

Indexed under the present surname HALL, i.e., ONCA H400:
H400L for Liz
H400E for Elizabeth (formal version of Liz)
H400P for Peggy
H400M for Margaret (formal version of Peggy);

Indexed under the birth surname SMITH, i.e., ONCA S530:
S530L for Liz
S530E for Elizabeth (formal version of Liz)
S530P for Peggy
S530M for Margaret (formal version of Peggy).

Mrs. Hall would have her master file record included under each of the above eight ONCA/initial blocks. A data record containing any combination of the above names would generate an ONCA/initial code matching one of the eight above, and would have a high expectation of matching to any of the variants during the matching phase.

To reduce the number of unproductive comparisons, a data record will only be matched with another record in the same block provided that the years of birth on the two records are within 16 years of each other. This constraint has been applied, firstly, to reduce the number of unproductive matches, and secondly, to confine matching to persons born within the same generation, and in this way eliminate father/son and mother/daughter matches. Further constraints could be built into the matching software: for example, matching only within the same sex, logically checking that the dates on the two records are in a particular sequence or range, or checking that the diagnoses on the two records are in a specified range, as required in the preparation of a cancer registry file.

Matching Methods

There are two methods of matching data records with a master file. The two-file method is used to match a data record from a data file with a block on the master file, and in this way compare the data record with every record in the master file block. The one-file/single-pass method is used to combine the data file block and the master file block into one block, and to match each record with every other in the block in a triangular fashion, i.e., the first with the rest, followed by the second with the rest, and so on. In this way every record can be matched with every other record. Use of a stream number on each record enables selective matching to be undertaken; for example, data records can be matched with the master file and with each other, while the master file records are not matched with themselves.
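A minimal sketch of the one-file/single-pass idea: data and master records for a block are pooled, every record is compared with every other record exactly once in a triangular fashion, and a stream marker is used to suppress master-with-master comparisons. The record layout, stream labels and toy comparison function are assumptions for illustration, not the OX-LINK implementation.

```python
from itertools import combinations

# Illustrative record layout: assume stream "D" marks data records and stream
# "M" marks master records, as a stand-in for the stream numbers described above.
def single_pass(block, compare):
    """Compare every record in a pooled block with every other record once,
    skipping master-with-master pairs."""
    candidate_pairs = []
    for rec_a, rec_b in combinations(block, 2):      # triangular: first with the rest, etc.
        if rec_a["stream"] == "M" and rec_b["stream"] == "M":
            continue                                  # master records are not matched with themselves
        weight = compare(rec_a, rec_b)                # summed binit weights for the pair
        candidate_pairs.append((rec_a["id"], rec_b["id"], weight))
    return candidate_pairs

# Toy comparison function: +3 binits per agreeing identifier, -3 per disagreement.
def toy_compare(a, b):
    items = ("surname", "forename_initial", "birth_year")
    return sum(3 if a[i] == b[i] else -3 for i in items)

block = [
    {"id": 1, "stream": "M", "surname": "HALL",  "forename_initial": "E", "birth_year": 1952},
    {"id": 2, "stream": "D", "surname": "HALL",  "forename_initial": "L", "birth_year": 1952},
    {"id": 3, "stream": "D", "surname": "HALLS", "forename_initial": "E", "birth_year": 1951},
]
print(single_pass(block, toy_compare))
```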
Match Weights

Considerable work has been undertaken to develop methods of calculating the probability that pairs of records, containing arrays of partial identifiers which may be subject to error or variation in recording, do or do not relate to the same person. Decisions can then be made about the level of probability to accept. The issues are those of reducing false negatives (Type I errors) and false positives (Type II errors) in matching (Winkler, 1995; Scheuren and Winkler, 1996; and Belin and Rubin, 1995). A false negative error, or "missed match," occurs when records which relate to the same person are not drawn together (perhaps because of minor variations in spelling or a minor error in recorded dates of birth). Matches may also be missed if the two records fall into different blocks. This may happen if, for example, a surname is misspelled and the phonemic compression algorithm puts the two records into different blocks.

Methods for probability matching depend on making comparisons between each of several items of identifying information. Computer-based calculations are then made which are based on the discriminating power of each item. For example, a comparison between two different records containing the same surname has greater discriminating power if the surname is rare than if it is common. Higher scores are given for agreement between identifiers (such as particular surnames) which are uncommon than for those which are common. The extent to which an identifier is uncommon or common can be determined empirically from its distribution in the population studied. Numerical values can then be calculated routinely in the process of matching for the amount of agreement or disagreement between the various identifying items on the records. In this way a composite score, or match weight, can be calculated for each pair of records, indicating the probability that they relate to the same person. In essence, these weights simulate the subjective judgement of a clerk. A detailed discussion of match weights and probability matching can be found in publications by Newcombe (Newcombe et al., 1959; Newcombe, 1967, 1987, and 1988) and by Gill and Baldwin (1987). (See also Acheson, 1968.)

Calculating the Weights for the Names Items

Name identifiers are weighted in a different fashion from the non-name identifiers, because there are many more variations for correctly spelled names. Analysis of the NHS Central Register for England and Wales shows that there are:

57,963,992 records;
1,071,603 surnames;
15,143,043 surname/forename pairs.

The low-frequency names were mainly non-Anglo-Saxon names, hyphenated names and misspelled names. In general the misspellings were due to embedded vowel changes or to miskeying. A more detailed examination of the register showed that 954 different surnames covered about 50% of the population, with the following frequency distribution:

10% of the population: 24 different surnames
20%: 84 different surnames
30%: 213 different surnames
40%: 460 different surnames
50%: 954 different surnames
60%: 1,908 different surnames
70%: 3,912 different surnames
80%: 10,214 different surnames
90%: 100,000 different surnames
100%: 1,071,603 different surnames.

Many spelling variations were detected for the common forenames. Using data from the NHSCR, various forename directories and other sources of forenames, a formal forename lexicon was prepared that contained the well-known contractions and nicknames. The problem in preparing the lexicon was whether to include forenames that had minor spelling errors, for example JOHN and JON. This lexicon is being used in the matching algorithm to convert nicknames and contractions, for example LIZ, to the formal forename ELIZABETH, and both names are used as part of the search strategy.

Calculation of Weights for Surnames

The binit weight calculated from the frequency of the first letter in the surname (26 different values) was found to be too crude for matching files containing over 1 million records. The weights for Smith, Snaith, Sneath, Smoothey, Samuda, and Szabo would all have been set to some low value calculated from the frequency of Smith in the population, ignoring the frequency of the much rarer Szabo. Using the frequencies of all of the 1 million or more different surnames on the master match file is too cumbersome, too time consuming to keep up to date, and operationally difficult to store during the match run. The list would also have contained all of the one-off surnames generated by bad transcription and bad spelling.

A compromise solution was devised by calculating the weights based on the frequency of the ONCA block (8,000 values), with a cut-off value of 1 in 1,000 in order to prevent the very rare and one-off names from carrying very high weights. Although this approach does not get round the problem of the very different names that can be found in the same ONCA block (block S530, for instance, contains Smith, Smithies, Smoothey, Snaith, Sneath, Samuda, Szabo, etc.), it does provide a higher level of discrimination and, in part, accommodates the erroneous names. The theoretical weight based on the frequency of the surname in the studied population is modified according to the algorithm devised by Knuth-Morris-Pratt (Stephen, 1994; Gonnet and Baeza-Yates, 1991; and Baeza-Yates, 1989), taking into account the length of the shorter of the two names being compared, the difference in length of the two names, the number of letters agreeing and the number of letters disagreeing. Where the two names are absolutely identical, the weight is set to +2N, but it falls to a lower bound of -2N where the amount of disagreement is large.

If the birth surname and present surname are swapped with each other, exploding the file as described previously enables the system to find and access the block containing the records for the appropriate surnames. The weights for the present and birth surname pairs are calculated, and then the present surname/birth surname and birth surname/present surname pairs are also calculated. The highest of these values is used in the subsequent calculations for the derivation of the match weight. In cases where the marital status of the person is single (i.e., never married), or the sex is male, or the age is less than 16 years, it is normal practice in the UK for the present surname to be the same as the birth surname; for this reason only the weight for the present surname is calculated and used for the determination of a match.

Forenames

The weights derived for the forenames are based on the frequency of the initial letter of the forename in the population. Since the distributions of male and female forenames are different, there are two sets of weights, one for males and a second for females. Since the forenames can be recorded in any order, the weights for the two forenames are calculated and the highest value is used for the match. Where there are wide variations in the spelling of the forenames, the Daitch-Mokotoff version of Soundex ("Ask Glenda") is being evaluated for weighting the forenames in a fashion similar to that used for the surnames.
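The published weighting formulae are not given in this excerpt, so the sketch below only suggests the general shape of a frequency-based surname weight: a rarity score derived from the ONCA-block frequency, capped at the 1-in-1,000 cut-off mentioned above so that one-off spellings cannot carry extreme weights, and scaled between the +2N and -2N bounds by a simple measure of letter agreement. The log-base-2 ("binit") formulation and the use of a generic string-similarity ratio are assumptions; the actual algorithm uses a Knuth-Morris-Pratt-based comparison of the two names.

```python
import math
from difflib import SequenceMatcher

def surname_weight(name_a, name_b, block_freq, cutoff=1e-3):
    """Frequency-based surname weight, capped and scaled by letter agreement.

    block_freq: relative frequency of the ONCA block in the population
    (e.g., roughly 0.011 for a very common block such as S530)."""
    freq = max(block_freq, cutoff)            # 1-in-1,000 cut-off for very rare blocks
    full_weight = 2 * math.log2(1.0 / freq)   # "+2N": rarer blocks earn higher agreement weights
    # Simple letter-agreement score in [0, 1]; the paper's scheme is based on a
    # Knuth-Morris-Pratt-style comparison of the two names instead.
    agreement = SequenceMatcher(None, name_a.upper(), name_b.upper()).ratio()
    # Map agreement 1.0 -> +full_weight and agreement 0.0 -> -full_weight.
    return full_weight * (2 * agreement - 1)

print(round(surname_weight("SMITH", "SMITH", 0.011), 1))    # identical names, common block
print(round(surname_weight("SZABO", "SZABO", 0.00002), 1))  # identical names, rare block (capped)
print(round(surname_weight("SMITH", "SNAITH", 0.011), 1))   # partial agreement
```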
Calculating the Weights for the Non-Names Items

The weights for date of birth, sex, place of birth and NHS number are calculated using the frequency of the item on the ORLS and on the NHSCR file. The weight for the year-of-birth comparison has been extended to allow for known errors: for example, only a small deduction is made where the two years of birth differ by 1 or by 10 years, but the weight is substantially reduced where the years of birth differ by, say, 7 years. The weight for the street address is based on the first 8 characters of the full street address, where these characters signify a house number (31, High Street), a house name (High Trees), or indeed a public house name (THE RED LION). Terms like "Flat" or "Apartment" are ignored and other parts of the address are then used for the comparison. The postcode is treated and weighted as a single field, although the inward and outward parts of the code can be weighted and used separately.

The range of binit weights used for the ORLS is shown in Table 1. When the matching item is present on both records, a weight is calculated expressing the amount of agreement or disagreement between the item on the data record and the corresponding item on the master file record.

Table 1. —The Range of Binit Weights Used for Matching

Identifying item (scores in binits[1])    Exact match    Partial match    No match
Surnames: birth                           +2S            +2S to -2S       -2S
Surnames: present[2]                      +2S            +2S to -2S       -2S
Surnames: mother's birth                  +2S            +2S to -2S       -2S
  (where: common surname S = 6; rare surname S = 17)
Forenames[3]                              +2F            +2F to -2F       -2F
  (where: common forename F = 3; rare forename F = 12)
NHS number                                +7             NP[4]            0
Place of birth (code)                     +4             +2               -4
Street address[5]                         +7             NP               0
Post code                                 +4             NP               0
GP (code)                                 +4             +2               0
Sex[6]                                    +1             NP               -10
Date of birth                             +14            +13 to -22       -23
Hospital and hospital unit number         +7             NP               -9

[1] Where an item has been recorded as not known, the field has been left blank, or the field is filled with an error flag, the match weight is set to 0, except for the special values described in the following notes.
[2] Where the surname is not known or has been entered as blank, the record cannot be matched in the usual way, but is added to the file to enable true counts of all the events to be made.
[3] Forename entries such as boy, girl, baby, infant, twin, or not known are weighted as -10.
[4] Where the weight is shown as NP (not permissible), the partially known value cannot be weighted in the normal fashion and is treated as a no match.
[5] No fixed abode is scored 0.
[6] Where sex is not known, blank, or in error, it is scored -10. (All records input to the match are checked against forename/sex indices, and the sex is corrected where it is missing or in error.)

It is possible for the calculated weight to become negative where there is extreme disagreement between the item on the data record and the corresponding item on the master file. In matching street address, postcode and general practitioner, however, the score cannot go negative, although it can be zero: the individual may have changed their home address or their family doctor since they were last entered into the system, and this is a change in family circumstances rather than an error in the data, so a negative weight is not justified.

Threshold Weighting

The procedure for deciding whether two records belong to the same person was first developed by Newcombe, Kennedy, Axford, and James (1959), and rigorously examined by Copas and Hilton (1990), Belin and Rubin (1995), and Winkler (1995). The decision is based on the total binit weight, derived by summing algebraically the individual binit weights calculated from the comparisons of each identifying item on the master file and data file.

MSGP4 was the fourth survey of morbidity in general practice. In previous MSGP surveys the output consisted only of a series of tables produced by COBOL programs; MSGP4 was the first survey for which relational databases were used to provide flexible outputs.

Figure 2. —Example Database (Simplified) —Patient Consultations of General Medical Practitioners

Some Definitions

Read Code. —A code used in England and Wales by general practice staff to identify uniquely a medical term. This coding was used in the MSGP project because it is familiar to general practice staff, but it is not internationally recognised and the codes have a structure that does not facilitate verification.

ICD Code. —International Classification of Diseases. Groups of Read codes can be mapped onto ICD9 codes (for example, Read code F3810, "acute serous otitis media," maps to ICD 381.0, "acute nonsuppurative otitis media"). Such mappings form part of the consultation metadata (see below).

Consultation. —A "consultation" refers to a particular diagnosis by a particular member of staff on a particular date at a particular location, resulting from a face-to-face meeting between a patient and a doctor or nurse. A "diagnosis" is identified by a single Read code.

"Patients Consulting." —Some registered patients did not consult a doctor or other staff member during the study. "Patients consulting" is therefore a subset of the practice list of all registered patients. Consultations must be carefully distinguished from "patients consulting." A combination of patient number, date and place of consultation, and diagnosis uniquely defines each record in the consultation file. Patient numbers are not unique, because a patient may consult more than once; nor are combinations of patient number and diagnosis unique. On the other hand, a "patients consulting" file will contain at most one record for each patient consulting for a particular diagnosis (or group of diagnoses), no matter how many times that patient has consulted a member of the practice staff.

"Consultations" are more relevant when workload is being studied, but if prevalence is the issue then "patients consulting," i.e., how many patients consulted for the illness, is more useful.

Patient Years at Risk. —The population involved in the MSGP project did not remain constant throughout the study. Patients entered and left practices as a result of moving house or for other reasons, and births and deaths also contributed to a changing population. The "patient years at risk" derived variable was created to take account of this. The patient file contains a "days in" variable, which gives the number of days the patient was registered with the practice (range 1–366 days for the study). "Patient years at risk" is "days in" divided by 366, since 1992 was a leap year.

Database Structure

To facilitate future analyses, some non-changing data were combined at the outset. For example, some consultation metadata, such as International Classification of Diseases (ICD) codes and indicators of disease seriousness, were added to the consultation dataset. The resultant simplified data structure is as follows:

Practice: practice number; information about the practice (confidential).
Primary key: practice number.
A practice is a group of doctors, nurses, and other staff working together. Although patients register with a particular doctor, their records are kept by the practice and the patient may be regarded as belonging to a practice. Data on practices and practice staff are particularly confidential, and are not considered in this paper. Individual practice staff consulted are identified in the consultation file by a code.

Patients: patient number; age; sex; postcode; socio-economic data.
Primary key: patient number.
Foreign key: postcode references geographic data.
These data were stored as four separate files relating to all patients, adult patients, children, and married/cohabiting women, because different information was collected for each subgroup.

Consultation: patient number; practice number; ID of who consulted; date of contact; diagnosis; place of consultation; whether referred to hospital; other consultation information.
Primary key: patient number, doctor ID, date of contact, diagnosis.
Foreign keys: practice number references Practice; patient number references Patients; staff ID references staff (e.g., doctor/nurse).

Episode: for each consultation the doctor or nurse identified whether this was the "first ever," a "new," or an "ongoing" consultation for that problem. An "episode" consists of a series of consultations for the same problem (e.g., the same Read code).

Geographically-referenced data: postcodes, ED, latitude/longitude, census ward, local authority, small-area census data, and locality classifications such as rural/urban or prosperous/inner city. These data were not collected by the survey, but come from other sources, linked by postcode or higher-level geography.

Patient metadata: these describe the codes used in the socio-economic survey (e.g., ethnic group, occupation groups, social class, housing tenure, whether a smoker, etc.).

Consultation metadata: the ReadICD file links Read codes with the corresponding ICD codes. In addition, a lookup table links 150 common diseases, immunisations and accidents to their ICD codes. Each diagnosis is classified as serious, intermediate or minor.

Derived files: the MSGP database contains information on individual patients and consultations. To make comparisons between groups of patients, and to standardise the data (e.g., for age differences), it is necessary to generate files of derived data, using database queries and linkages as described below. In some derived files duplicate records need to be eliminated. For example, we may wish to count patients consulting for a particular reason rather than consultations, and hence wish to produce at most one record per patient in a "patients consulting" derived file (see "Some Definitions" above).

Types of Linkage (with Examples)

In this section we classify the variety of possible linkages into three main types, illustrating them with examples based on the MSGP4 study.

Simple Linkage

Straightforward data extracts (lists) combining several sources. —Example: making a list of patients with asthma, including age, sex and social class for each patient.

Observed frequencies. —Example: linking the "all patients" file and the "consultations" file to count the number of consultations by the age, sex and social class of the patient, or cross-classifying home visits and hospital referrals with socio-economic characteristics.

Conditional data, where the availability of data items depends on the value of another variable. —Example: in MSGP4 some data are available only for adults, or children, or married/cohabiting women. Smoking status was only obtained from adult patients, so tabulating "home visits" by "smoking status," "age," and "sex" involves linking the "all patients" (to find age and sex), "adult patients" (to find smoking status) and "consultations" (to find home visits) files. Linking the "adult" file to the "all patients" file excludes records for children.

Linking files with "foreign" files. —Useful information can often be obtained by linking data in two or more different datasets, where the data files share common codes. For example, data referenced by postcode, census ED or ward, or local authority are available from many different sources as described above. Example: the MSGP4 study included the postcode of residence for each patient, facilitating studies of neighbourhood effects. The crow-fly distance from the patient's home to the practice was calculated by linking patient and practice postcodes to a grid co-ordinates file and using Pythagoras's theorem. The distance was stored permanently on the patient file for future use.
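A rough sketch of the crow-fly distance calculation just described: patient and practice postcodes are each linked to a grid co-ordinates lookup, and Pythagoras's theorem is applied to the easting/northing differences. The file names, column names and co-ordinates are invented for illustration; MSGP4 used its own postcode and grid-reference files.

```python
import pandas as pd

# Invented example data standing in for the MSGP4 files and the grid lookup.
grid = pd.DataFrame({"postcode": ["OX1 1AA", "OX2 6HE", "OX4 4XN"],
                     "easting":  [451500, 450300, 454800],
                     "northing": [206000, 207900, 204100]})
patients = pd.DataFrame({"patient_no": [1, 2], "postcode": ["OX2 6HE", "OX4 4XN"],
                         "practice_postcode": ["OX1 1AA", "OX1 1AA"]})

# Link each patient's postcode, and their practice's postcode, to grid co-ordinates.
pat = patients.merge(grid, on="postcode")
pat = pat.merge(grid, left_on="practice_postcode", right_on="postcode",
                suffixes=("", "_prac"))

# Pythagoras on the easting/northing differences gives the crow-fly distance (metres).
pat["crowfly_m"] = ((pat["easting"] - pat["easting_prac"]) ** 2 +
                    (pat["northing"] - pat["northing_prac"]) ** 2) ** 0.5
print(pat[["patient_no", "crowfly_m"]])
```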

Linking to lookup tables (user-defined and pre-defined). —Examples: the information in the MSGP database is mostly held in coded form, with the keys to the codes held in a number of lookup tables linked to the main database. Most of these are quite small and simple (e.g., ethnic group, housing tenure, etc.), but some variables are linked to large tables of standard codes (e.g., occupational codes, country of birth). In some cases the coded information is quite detailed and it is desirable to group the data into broader categories, e.g., to group diagnostic codes into broad diagnostic groups such as ischaemic heart disease (ICD 410–414). For some diseases a group of not necessarily contiguous codes is needed to define a medical condition. A lookup file of these codes can be created to extract the codes of interest from the main data, using a lookup table that could be user-defined. Missing-value codes could also be grouped, ages grouped into broad age groups, social classes combined, etc.

Auto-Linkage Within a File (Associations Within a File)

Different records for the same "individual." —Records for the same individual can be linked together to analyse patterns or sums of events, or associations between events of different kinds. In general a file is linked to a subset of itself to find records relating to individuals of interest. Example: diabetes is a chronic disease with major complications. It is of interest to examine, for those patients who consulted for diabetes, what other diseases they consulted for. Consultations for diabetes can be found from their ICD code (250). Extracting just the patient identification numbers from this dataset, and eliminating duplicates, results in a list of patients who consulted for diabetes at least once during the year. This subset of the consultation file can be linked with the original consultation file to produce a derived file containing the consultation history of all diabetic patients in the study, which can be used for further analysis. Note that in this example only the consultation file (and derived subsets) has been used. (A small sketch of this within-file linkage is given at the end of this section.)

Different records for the same households/other groups. —Example: information on households was not collected as part of MSGP4. However, "synthetic" households can be constructed using postcode and socio-economic data, where the members of the same "household" must, by definition, share the same socio-economic characteristics, and it would be rare for two distinct households to have exactly the same characteristics. These "households" can be used to discover how the behaviour of one "household" member may affect another. For example, we can examine the relationship between smoking by adults and asthma in children. Clearly in this example some sort of check needs to be made on how accurately "households" can be assembled from the information available and the algorithm used.

Temporal relationships. —Files containing "event" data can be analysed by studying temporal patterns relating to the same individual. Example: the relationship between exposure to pollution or infection and asthma can be studied in terms of both immediate and delayed effects. Consultations for an individual can be linked together and sorted by date, showing temporal relationships. The duration of clinical events can sometimes be determined by the sequence of consultations. In MSGP4 each consultation for a particular medical condition was labelled "first ever," "new," or "ongoing," and the date of each consultation was recorded. Survival analysis techniques cater for these types of data.
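As flagged above, a minimal sketch of the diabetes example of within-file linkage: patient numbers are extracted from consultations coded ICD 250, duplicates are eliminated, and the resulting list is linked back to the full consultation file to give the consultation history of every diabetic patient. The miniature data and column names are invented for illustration.

```python
import pandas as pd

# Invented miniature consultation file (one row per consultation).
consultations = pd.DataFrame({
    "patient_no": [1, 1, 2, 3, 3, 3],
    "date":       ["1992-01-10", "1992-03-02", "1992-02-14",
                   "1992-01-05", "1992-05-20", "1992-09-01"],
    "icd":        ["250", "401", "493", "250", "250", "366"],
})

# Patients who consulted for diabetes (ICD 250) at least once, duplicates removed.
diabetic_ids = (consultations.loc[consultations["icd"] == "250", ["patient_no"]]
                .drop_duplicates())

# Link the subset back to the original file: full consultation history of diabetics.
diabetic_history = consultations.merge(diabetic_ids, on="patient_no")
print(diabetic_history)
```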

Complex Linkages

Linkages that are combinations of the two types of linkage previously described could be termed "complex linkages." These can always be broken down into a sequence of simpler linkages. A number of examples of complex linkages are given, in order of complexity.

Finding subsets through linkage. —Example: in the MSGP4 data this is particularly useful in the study of chronic conditions such as diabetes and heart disease. Linking the file of patients consulting for diabetes discussed in section 3.2 with the patient dataset results in a subset of the patient file, containing only the socio-economic details of diabetic patients.

Linking a derived file to a lookup table and other files. —Example: diabetes is particularly associated with diseases of the eye (retinopathy), kidney, nervous system and cardiovascular system, and it is of interest to analyse the relationship between diabetes and such diseases. In this slightly more complex situation it is necessary to create a lookup table containing the diseases of interest and their ICD codes and link this to the "consultations by diabetic patients" file, to create a further subset of the consultation file containing consultations for diabetes and its complications. It is likely that this file, as well as the simpler one described above, would be linked to the patient file to include age, sex and other patient characteristics before analysis using conventional statistical packages.

Linking a derived file with another derived file: rates for groups of individuals. —Rates are found by linking a derived file of numerators with a derived file of denominators. The numerators are usually found by linking the patient and consultation files; for example, age, sex, social class or ethnic group linked to diagnosis, referral or home visits. Denominators can be derived from the patient file (patient years at risk) or the consultation file (consultations or patients consulting) for the various categories of age, sex, etc.

Standardised ratios. —This is the ratio of the number of events (e.g., consultations or deaths) observed in a sub-group to the number that would be expected if the sub-group had the same age-sex-specific rates as a standard population (e.g., the whole sample), multiplied by 100. Examples of sub-groups are different ethnic groups or geographical areas. The calculation of standard population rates involves linking the whole-population observed frequencies to whole-population patient years at risk. Each of these is a derived file, and the result is a new derived file. Calculating expected numbers involves linking the standard population rates to the subgroups' "years at risk" file. This produces two new derived files, "observed" and "expected." Age-standardised patient consulting ratios are obtained by linking these two derived files together, using outer joins to ensure no loss of "expected" records where there are no observed events in some age-sex categories.
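A small sketch of the final step just described: observed and expected counts, each a derived file keyed by age-sex category, are combined with an outer join so that "expected" records are not lost where a category has no observed events, and the ratio of the totals, multiplied by 100, gives the standardised ratio. The figures are invented for illustration.

```python
import pandas as pd

# Invented derived files for one sub-group, keyed by age-sex category.
observed = pd.DataFrame({"age_sex": ["0-14 F", "15-44 F"],
                         "observed": [12, 30]})
expected = pd.DataFrame({"age_sex": ["0-14 F", "15-44 F", "45-64 F"],
                         "expected": [10.0, 25.0, 8.0]})

# The outer join keeps the "45-64 F" expected record even though nothing was observed.
ratio = observed.merge(expected, on="age_sex", how="outer")
ratio["observed"] = ratio["observed"].fillna(0)

standardised_ratio = 100 * ratio["observed"].sum() / ratio["expected"].sum()
print(ratio)
print(round(standardised_ratio, 1))   # 100 * 42 / 43 = 97.7
```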

Establishing population rates for a series of nested definitions. —Example: individuals at particular risk from influenza are offered vaccination. In order to estimate how changes in the recommendations might affect the numbers eligible for vaccination, population rates for those living in their own homes were estimated for each of several options. People aged 65 and over living in communal establishments are automatically eligible for vaccination, and hence were selected out and treated separately. The options tested were to include patients with:

A. any chronic respiratory disease, chronic heart disease, endocrine disease, or immunosuppression;
B. as A, but also including hereditary degenerative diseases;
C. as B, but also including thyroid disease;
D. as C, but also including essential hypertension.

The MSGP dataset was used to estimate the proportion of the population in need of vaccination against influenza according to each option. The problem was to find all those patients who had consulted for any of the diseases on the list, taking care not to count any patient more than once. This involved creating a lookup table defining the disease groups mentioned in options A-D, linking this to the consultation dataset, eliminating duplicates, linking the result to the patient dataset (to obtain age group and sex), and then doing a series of queries to obtain appropriate numerator data files. A denominator data file was separately obtained from the patient dataset to obtain patient years at risk, by age group and sex. The numerator and denominator files were then joined to obtain rates. These rates were then applied to census tables to obtain the estimated numbers of patients eligible for vaccination under assumptions A-D.

Record matching for case-control studies. —These are special studies of association, extracting "cases" and "controls" from the same database. Example: what socio-economic factors are associated with increased risk of Crohn's disease? All patients who consulted for ICD 555 (regional non-infective enteritis) during the MSGP4 study were selected and referred back to their GP to confirm that they were genuine cases of Crohn's disease. Patients who were not confirmed as having Crohn's disease were then excluded. This resulted in 294 cases. Controls were selected from patients who did not have the disease: those who matched cases for practice, sex, and month and year of birth. In each of two practices there were two cases who were of the same sex and the same month and year of birth; in each of these practices the controls were divided randomly between these cases as equally as possible. There were 23 cases for whom no controls could be found using these criteria. In 20 of these cases it was possible to find controls who matched on practice and sex and whose date of birth was within two months of the case's date of birth. The remaining three cases were excluded from the analysis. This procedure resulted in 291 cases and 1,682 controls.
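A simplified sketch of the control selection described above: candidate controls are patients not identified as cases, matched to cases on practice, sex, and month and year of birth. The two-month fallback window and the random division of controls between tied cases are omitted, and the data and column names are invented.

```python
import pandas as pd

# Invented patient file; birth_month is "YYYY-MM".
patients = pd.DataFrame({
    "patient_no":  [1, 2, 3, 4, 5],
    "practice":    [7, 7, 7, 9, 9],
    "sex":         ["F", "F", "F", "M", "M"],
    "birth_month": ["1950-03", "1950-03", "1971-11", "1962-06", "1962-06"],
})
case_ids = {1, 4}                                   # confirmed Crohn's disease cases

cases = patients[patients["patient_no"].isin(case_ids)]
candidates = patients[~patients["patient_no"].isin(case_ids)]

# Match controls to cases on practice, sex and month/year of birth.
matched = cases.merge(candidates, on=["practice", "sex", "birth_month"],
                      suffixes=("_case", "_control"))
print(matched[["patient_no_case", "patient_no_control"]])
```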
User-Friendly Linkage Software

The MSGP4 practice software was originally written so that participating practices could gain access to the data collected from their own practice. The software was designed to be used easily by people with no knowledge of database technology, and because the software runs directly under DOS or Windows, no specialised database software is needed. The structure of the MSGP database is transparent to the user, who can refer to entities (e.g., diseases or occupations) by name rather than by code.

Later, a modified version of the software was developed to enable researchers to use the complete dataset (60 practices). Although it may be possible for some of these linkages to be performed as a single query, it is generally best to do a series of simple linkages, for two reasons. Firstly, database software creates large temporary files of cross products, which is time-consuming and may lead to memory problems. Secondly, queries involving complex linkages are often difficult to formulate and may easily turn out to be incorrect. The order in which the linkages are performed is also important for efficiency. In general, only the smallest possible files should be linked together. For example, rather than linking the patient and consultations files together and then finding the diseases and patient characteristics of interest, it is better to find the relevant subsets of the two files first and then link them together (a small sketch of this subset-first strategy follows).
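A small sketch of the subset-first strategy flagged above: rather than joining the full patient and consultation files and filtering afterwards, each file is first reduced to the rows of interest and the two small subsets are then linked. Both routes give the same answer; the second avoids building a large intermediate cross product. Data and column names are invented.

```python
import pandas as pd

patients = pd.DataFrame({"patient_no": range(1, 6),
                         "age_group": ["0-14", "15-44", "15-44", "45-64", "65+"],
                         "sex": ["F", "F", "M", "F", "M"]})
consultations = pd.DataFrame({"patient_no": [2, 2, 3, 4, 5],
                              "icd": ["493", "250", "493", "410", "493"]})

# Wasteful: join everything, then filter.
big = patients.merge(consultations, on="patient_no")
slow = big[(big["icd"] == "493") & (big["sex"] == "F")]

# Better: subset each file first, then link the two small subsets.
asthma = consultations[consultations["icd"] == "493"]
women  = patients[patients["sex"] == "F"]
fast = women.merge(asthma, on="patient_no")

assert sorted(fast["patient_no"]) == sorted(slow["patient_no"])   # same result either way
print(fast)
```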

The software performs the required linkages and then analyses the data in two stages. The first part of the program performs the sequence of linkages and queries needed to find the subsets required for the second stage, and the second part performs the analyses and displays the output. The data flow through the program is shown in Figure 3. It can be seen from the diagram that any of the three input files may be linked to themselves or to either of the others, in any combination, to form subsets of the data, or the entire dataset can be used.

Finding Subsets

The program enables the user to find any combination of characteristics required, simply by choosing the characteristics from menus. The program finds subsets of individual files, as well as linking files in the dataset to each other and to lookup tables, and finding subsets of one file according to data in another. For example, the program can produce a list of young women with asthma who live in local authority accommodation, or of patients with a particular combination of diagnoses. It is also possible to examine the data for a particular group of people (for example, one ethnic group), or for a particular geographical area.

Dealing with missing values. —When the data for MSGP4 were collected it was not possible to collect socio-economic data for all patients. The user is given the option to exclude missing values, or to restrict the data to missing values only, should they want to find out more about those patients for whom certain information is missing. For example, an analysis of the frequency of cigarette smoking in each age/sex group in the practice might include only those patients for whom smoking information is available.

The Output

The output from the program is of three types, any of which may be exported by the program in a variety of formats (e.g., WK1, DBF, TXT, DB) for further statistical analyses.

Lists output consists of one record for each patient, consultation or episode of interest, with files linked together as appropriate. Each record contains a patient number together with any other information that the user has requested. These flat files can be used for further analysis using spreadsheet or statistical software.

Frequency output consists of counts of the numbers of patients, consultations or episodes in each of the categories defined by the fields selected by the user.

Rate output enables a variety of rates with different types of numerators and denominators to be calculated. Any of the following rates may be chosen: diagnostic rates for a specified diagnostic group (patients consulting; consultations; episodes); referral rates; and home visit rates. Rates are generally calculated for standard age and sex groups, but other appropriate patient and consultation characteristics may be included in the analysis. Denominators can be consultations, patients consulting, or patient years at risk.

Figure 3. —Data-flow Diagram for MSGPX Data Extractor Program

Discussion and Conclusions

We have demonstrated, through the use of one example database, the potential that relational databases offer for storing statistical data. They are also a natural way to capture the data, since they reflect real data relationships, and they are economical in storage requirements. They also facilitate linking in new data from other sources. However, most statistical analyses require simple rectangular files, and complex database queries may be required to obtain these. We have shown that such complex linkages can be decomposed into a sequence of simple linkages, and that user-friendly software can be developed to make such complex data readily available to users who may not fully understand the data structure or relational databases. The major advantage of such software is that the naïve user can be more confident in the results than if they were to extract the data themselves. They can also describe their problem in terms closer to natural language.

Although such programs enable a user with no knowledge of database technology to perform all the linkages shown above, they do have their limitations. Choosing options from several dialogue boxes is simple, but certainly much slower than performing queries directly using SQL, Paradox or other database technology. Since the most efficient way to perform a complex query depends on the exact nature of the query, the program will not always perform queries in the most efficient order. The user is also restricted to the queries and tables defined by the program, and as more options are added the program must of necessity become more unwieldy and possibly less efficient. User-friendly software remains, however, useful for the casual user who may not be familiar with the structures of a database, and essential for the user who does not have access to, or knowledge of, database technology.

References

Fellegi, I.P. and Sunter, A.B. (1969). A Theory for Record Linkage, Journal of the American Statistical Association, 64, 1183–1210.

McCormick, A.; Fleming, D.; and Charlton, J. (1995). Morbidity Statistics from General Practice, Fourth National Study 1991–92, Series MB5, no. 3, London: HMSO.

Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford: Oxford University Press.

Newcombe, H.B.; Kennedy, J.M.; Axford, A.P.; and James, A.P. (1959). Automatic Linkage of Vital Records, Science, 130, 954–959.

Winkler, W.E. (1994). Advanced Methods of Record Linkage, American Statistical Association, Proceedings of the Section on Survey Research Methods, 467–472.

Tips and Techniques for Linking Multiple Data Systems: The Illinois Department of Human Services Consolidation Project

John Van Voorhis, David Koepke, and David Yu, University of Chicago

Abstract

This project involves the linkage of individuals across more than 20 state-run programs, including TANF (AFDC), Medicaid, JOBS, Child Protection, Child Welfare Services, alcohol and substance abuse programs, WIC, and mental health services. The count before linking is over 7.5 million records of individuals. Unduplicating the datasets leaves 5.9 million records, and the final linked dataset contains records for 4.1 million individuals. This study will provide the basic population counts for the State of Illinois's planning for the consolidation of these programs into a new Department of Human Services.

In the context of linking multiple systems, we have done a number of different things to make using AutoMatch easier. Some features of the process relate to standardized file and directory layouts, automatically generated match scripts, "data improvement" algorithms, and false-match detection. The first two issues, files and directories and scripts, are primarily technical, while the latter two have more general substantive content in addition to the technical matter. Properly laying out the tools for a matching project is a critical part of its success. Having a standard form for variable standardization, unduplication and matching provides a firm and stable foundation for linking many files together. Creating additional automation tools for working within such standards is also well worth the time it takes to make them.

With multiple sources of data it is possible to improve the data fields for individuals who are linked across multiple datasets. We will discuss both how we extract the information needed for such improvements and how we use it to improve the master list of individuals. One particular example of these improvements involves resolving the false linking of family members.
