Chapter 11 Selected Related Papers, 1986-1997
Pages 333-454

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 333...
... Fair and Pierre Lalonde, Statistics Canada Matthew A Jaro, System Automation Corporation Nancy P
From page 335...
... Fair is Chief and Pierre Lalonde is Project Manager, Occupational and Environmental Health Research Section, Canadian Centre for Health Information, Statistics Canada, Ottawa, Ontario K1A 0T6, Canada. The authors thank John Armstrong, Michael Eagen, and William E. Winkler for helpful critical comments on an early version of this article, and also the associate editors and referees who substantially influenced its final form. The test is special to names as identifiers; is suitable for fine-tuning this component of a record linkage system; and is uninfluenced by the adequacy of the rest of the identifiers.
From page 336...
... This study found that people often compare identifiers in unexpected ways whenever there is a need for added ... A decade later, the linkage rationale was restated and expanded into a formal mathematical theory by Fellegi and Sunter at Statistics Canada (Fellegi 1985; Fellegi and Sunter 1969; Sunter 1968). These authors confirmed what had not been rigorously proved earlier and chose for purposes of illustration such simple outcome definitions as disagreement (nonspecific for value)
From page 337...
... When Statistics Canada first actually used probabilistic linkage in the early 1980s, based on the Fellegi-Sunter theory, it was to search the newly established Canadian mortality data base, which extended back to 1950. Their linkage system, known as CANLINK or GIRLS (for Generalized Iterative Record Linkage System), included innovations described by Howe and Lindsay (1981)
From page 338...
... The problem posed by the value-specific partial agreements of names may be handled in various ways, but only one of N(AX BY ~ LI" )
From page 339...
... and are independent of each other, and P(~INR) is the prior likelihood of a correct match on a single random pairing.
From page 340...
... It can be seen to be misleading when scanning visually for record pairs ... NONLINKS over LINKS is not kept in mind. Thus the distributions and their crossover points serve little purpose if ... Figure 1. (Plot of the NONLINKS and LINKS distributions against THEORETICAL ODDS.)
From page 341...
... All of the projects involved searches of the Canadian Mortality Data Base (File B containing 3,397,860 male given names)
From page 342...
... Such choices are inescapable, but only rarely have their effects on the calculated ODDS been quantified. Indeed, where data to support the more refined alternative were lacking, the comparison often was not possible. But now the extensive data from large files of LINKS accumulated at Statistics Canada make it attractive to assess the effects on discriminating power when people's names are compared in alternative ways.
From page 343...
... Error factors are greater for the partial agreements than for the full agreements and disagreements, independent of the actual values of the names; for this reason, only the partial agreements are considered here. Again, the effects of the shortcut are modest.
From page 344...
... (with their error factors)
From page 345...
... Methodology," Health Reports (Statistics Canada)
From page 346...
... E. Fair, Ottawa: Statistics Canada, pp.
From page 347...
... For instance, the reporting of demographic information by psychiatric patients may be much less reliable than information gathered for epidemiologic research purposes, the point being that "memories of past encounters with similar problems" may well lead you astray. ... the correct direction of these "difficult matches" without reference to the subjects whose records are being linked? In the development of their decision criteria, Fellegi and Sunter stated very clearly that the effect of their weight computation is "to array the record pairs, relative to one another, ...
From page 348...
... (1969), "A Theory for Record Linkage," Journal of the American Statistical Association. ... is that subjective judgment is based on perceptions of ...
From page 349...
... was observed for workers in the uranium mines of Saskatchewan and the Northwest Territories; so there was no question of extrapolating from one subset of the cohort to the other. Here, final verification of the linkage status of the record pairs was not in doubt.
From page 350...
... This is why details of our use and derivation of an estimated "prior likelihood" are included here together with the related idea that blocking be treated as not altering either the total number of possible record pairings (actual plus potential), or the use of likelihood ratios derived from the blocking identifiers.
From page 351...
... The user specifies the fields to be matched, the record formats, parameters, blocking variables, et cetera, and the generation program creates a matcher that can be run with the desired files.
From page 352...
... The two files can be partitioned into mutually exclusive and exhaustive blocks designed to increase the proportion of matched pairs observed while decreasing the number of record pairs to compare. Comparisons are restricted to record pairs within each block.
From page 353...
... These counts are obtained as follows: each file is partitioned into blocks by means of the blocking variables. For each block, all record pairs in the block are examined.
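The partition-then-compare step described above can be sketched as follows. This is a minimal illustration, not the matcher from the paper: the record layout, the field names (`zip`, `name`) and the helper `block_pairs` are all assumptions made for the example.

```python
from collections import defaultdict

def block_pairs(file_a, file_b, key):
    """Yield (a, b) record pairs that share the same blocking key value."""
    blocks_b = defaultdict(list)
    for rec in file_b:
        blocks_b[key(rec)].append(rec)      # partition file B into blocks
    for a in file_a:
        for b in blocks_b.get(key(a), []):  # compare only within a block
            yield a, b

# Hypothetical toy files; "zip" plays the role of the blocking variable.
file_a = [{"id": 1, "zip": "33601", "name": "SMITH"},
          {"id": 2, "zip": "33602", "name": "JONES"}]
file_b = [{"id": 10, "zip": "33601", "name": "SMYTH"},
          {"id": 11, "zip": "33601", "name": "BROWN"},
          {"id": 12, "zip": "33603", "name": "JONES"}]

pairs = list(block_pairs(file_a, file_b, key=lambda r: r["zip"]))
all_pairs = len(file_a) * len(file_b)
print(len(pairs), "blocked pairs out of", all_pairs, "possible pairs")
```

In this toy data the two JONES records fall in different blocks, so blocking never brings them together. That is the cost that goes with the reduction in comparisons.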
From page 354...
... Since blocking greatly reduces the number of nonmatched pairs observed and because blocking selects record pairs that are likely to match, the u probability estimates obtained from blocked data will be biased. Consequently, the u probabilities must be computed directly on unblocked data, as explained in Section 2.4, and the EM algorithm is only used to compute the m probabilities, where blocking enriches the number of matched pairs observed while avoiding comparisons on relatively large numbers of unmatched pairs.
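A rough sketch of that division of labor: the u probabilities are held fixed (as if estimated from unblocked data) and EM updates only the m probabilities and the proportion of matches among the blocked pairs. It assumes a two-class model with conditionally independent fields. The function name, the agreement patterns and the numeric values are all hypothetical, and the excerpt does not show the paper's actual equations.

```python
def em_m_probabilities(patterns, u, m_init, p_init=0.5, iterations=50):
    """Estimate m probabilities by EM while keeping the u probabilities fixed."""
    m, p = list(m_init), p_init
    for _ in range(iterations):
        # E-step: probability that each pair is a match, given current m and p.
        g = []
        for gamma in patterns:
            like_m, like_u = p, 1.0 - p
            for agree, mi, ui in zip(gamma, m, u):
                like_m *= mi if agree else 1.0 - mi
                like_u *= ui if agree else 1.0 - ui
            g.append(like_m / (like_m + like_u))
        # M-step: re-estimate m (clamped away from 0 and 1) and p; u is untouched.
        total = sum(g)
        m = [min(max(sum(gi * gamma[i] for gi, gamma in zip(g, patterns)) / total,
                     1e-6), 1.0 - 1e-6)
             for i in range(len(m))]
        p = total / len(patterns)
    return m, p

# Hypothetical agreement patterns on three fields, observed within blocks.
patterns = ([(1, 1, 1)] * 40 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5
            + [(1, 0, 0)] * 10 + [(0, 0, 0)] * 50)
u = [0.10, 0.05, 0.05]   # assumed to have been computed on unblocked data
m, p = em_m_probabilities(patterns, u, m_init=[0.9, 0.9, 0.9])
print([round(x, 3) for x in m], round(p, 3))
```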
From page 355...
... scheme that maximizes the sum of the composite weights of the assigned record pairs.
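To illustrate what maximizing the sum of composite weights means, here is a brute-force sketch over a small square weight matrix. It is for illustration only: the weights are invented, and a production matcher would use a linear sum assignment solver rather than trying every permutation.

```python
from itertools import permutations

def best_assignment(weights):
    """Return the one-to-one assignment with the largest total weight.

    weights[i][j] is the composite weight for record i of file A paired
    with record j of file B (square matrix, small n only).
    """
    n = len(weights)
    best = max(permutations(range(n)),
               key=lambda perm: sum(weights[i][perm[i]] for i in range(n)))
    total = sum(weights[i][best[i]] for i in range(n))
    return list(enumerate(best)), total

# Hypothetical composite weights.
weights = [[10, 9, -5],
           [9, 2, -4],
           [-3, -6, 8]]
assignment, total = best_assignment(weights)
print(assignment, total)
```

Taking the single best pair first (weight 10) would force record 1 onto a weight of 2, for a total of 20. The joint optimum gives up the 10 and reaches 26.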
From page 356...
... 5.2 Pass II Match In an attempt to match records that failed to match in Pass I, an independent blocking scheme was chosen for Pass II. The blocking variables were CBNA, census block number, SOUNDEX of street name, house number, and apartment number.
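For readers unfamiliar with SOUNDEX as a blocking variable, the following sketch implements the standard American Soundex code. The excerpt does not say which Soundex variant the census matcher used, so treating it as the standard one is an assumption.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code                  # a vowel resets prev, so repeats count again
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"), soundex("Main"), soundex("Mayne"))
```

Street names that differ only in spelling, such as the hypothetical "Main" and "Mayne", receive the same code and therefore land in the same block.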
From page 357...
... Census Bureau where he developed the mathematical methodology and software to perform statistically valid matching procedures in support of estimating census coverage. Although application specific, the census estimation methodology Matt developed was precedent-setting in the field of probabilistic record linkage.
From page 358...
... For example, if a surname in two records matches, the computer calculates the odds of the two names matching by chance and how often the surname field agrees in the truly linked records contained in the comparison file. From these two statistics the program determines a score which represents the odds above chance that the two surnames matching are for the same person.
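The score described here is commonly expressed as a log-odds weight. The sketch below assumes the usual convention in which m is the agreement rate among truly linked pairs and u is the chance agreement rate. The numeric values are invented, and the base-2 logarithm is a convention rather than something stated in the excerpt.

```python
import math

def agreement_weight(m, u):
    """Log2 odds above chance when the field agrees."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Log2 odds when the field disagrees (normally negative)."""
    return math.log2((1.0 - m) / (1.0 - u))

# Hypothetical surname statistics: agrees in 90% of true links, 1% by chance.
m, u = 0.90, 0.01
print(round(agreement_weight(m, u), 2), round(disagreement_weight(m, u), 2))
```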
From page 359...
... It is possible to enhance the recall of a system without greatly reducing the precision by using some form of authority control to bring together the equivalent names of people which are spelled differently and locality names which are different but refer to the same locality. Blocking schemes are tuned to the specific file or part of the file being searched.
From page 360...
... The Family History Department and Record Linkage The theory underlying this technology is the best approach known to the scientific community. For this reason, the Family History Department of The Church of Jesus Christ of Latter-day Saints has chosen to implement its usage in their genealogical systems and databases.
From page 361...
... She is currently a Systems User Specialist in the LDS Family History Department. Personal Ancestral File is a registered trademark of The Church of Jesus Christ of Latter-day Saints.
From page 362...
... Introduction The Church of Jesus Christ of Latter-day Saints maintains massive genealogical files, which consist of millions of names. The two largest files are the International Genealogical Index (IGI)
From page 363...
... where ≈ means that the two sides of the equation are "close," although they may not be exactly equal. We read P(S|M)
From page 364...
... is to actually create a set of randomly matched pairs and calculate the proportion of matches obtained. This way is computationally less intensive and may be a practical alternative for people with more meager computational resources.
From page 365...
... = PA if one or both elements are missing in the record pair. This makes sense, since E tells us nothing new about the match when the information is missing.
From page 366...
... involves the time of experienced researchers who must consider a large sample of record pairs and find those which will be identified as duplicates. If all possible pairs are to be considered for the cases of interest, we will have an impossible task before us, with literally billions of pairs to evaluate.
From page 367...
... If we now consider several blocking schemes, the percentage of known duplicate pairs, which are picked up with any one of the schemes, may well be less than 100%. Hopefully, we will find one of them, which picks up a higher percentage of duplicate pairs than do the others.
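One way to compare blocking schemes on this criterion is to count how many known duplicate pairs each scheme keeps in the same block. The sketch below is illustrative only: the records, the duplicate list and the two schemes (a surname prefix and a birth year) are invented for the example.

```python
def coverage(known_duplicates, records, key):
    """Fraction of known duplicate pairs whose two records share a block."""
    captured = sum(1 for a, b in known_duplicates
                   if key(records[a]) == key(records[b]))
    return captured / len(known_duplicates)

# Hypothetical records: id -> (surname, birth year).
records = {1: ("SMITH", 1850), 2: ("SMYTH", 1850),
           3: ("JONES", 1900), 4: ("JONES", 1901),
           5: ("BROWN", 1875), 6: ("BROWN", 1875)}
known_duplicates = [(1, 2), (3, 4), (5, 6)]

by_surname = lambda rec: rec[0][:3]   # scheme 1: first three letters of surname
by_year = lambda rec: rec[1]          # scheme 2: birth year

cov_surname = coverage(known_duplicates, records, by_surname)
cov_year = coverage(known_duplicates, records, by_year)
# A pair is found by the combination if either scheme finds it.
cov_either = sum(1 for a, b in known_duplicates
                 if by_surname(records[a]) == by_surname(records[b])
                 or by_year(records[a]) == by_year(records[b])) / len(known_duplicates)
print(cov_surname, cov_year, cov_either)
```

Here each scheme alone misses one duplicate pair, while the two schemes together find all three.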
From page 368...
... Several blocking schemes were used for identifying duplicates. For each blocking scheme and for each block in the scheme, all possible pairs were obtained, and the worker identified each pair as either a match (i.e., a duplicate)
From page 369...
... Table 1. -- Frequency Distributions for Matched and Unmatched Pairs
Class Limits for Weight Totals    Unmatched Pairs
-34.35 to -27.56                    0
-27.55 to -20.76                    6
-20.75 to -13.96                  257
-13.95 to  -7.16                  602
 -7.15 to  -0.36                  381
 -0.35 to   6.44                   58
  6.45 to  13.24                    2
 13.25 to  20.04                    0
 20.05 to  26.84                    0
 26.85 to  33.64                    0
 33.65 to  40.44                    0
 40.45 to  47.24                    0
Matched Pairs: 0 0 0 33 255 540 344 15 19 13 0
Minimizing False Duplicates It will be noted that the scores for the unmatched pairs are consistently lower than for the matched pairs, but that occasionally, the scores for the unmatched pairs will be higher than some of the scores for the matched pairs.
From page 370...
... -- Threshold which Minimizes Missed Duplicates (Figure: score distributions of the unmatched and matched pairs, with the regions of Unlinked Pairs and Linked Pairs on either side of the threshold and the Missed Duplicates marked.)
From page 371...
... Note that increasing the threshold value decreases false duplicates but increases the percentage of missed duplicates.
Table 2. -- Duplicates and Missed Duplicates for Alternative Thresholds
Threshold Value for    % False Duplicates in    % Missed Duplicates in
Sum of Log-Odds        Nonmatched Sample        Matched Sample
-8.81                  26.85                     0.00
-7.03                  26.52                     0.08
-5.25                  26.38                     0.08
-3.48                   7.66                     2.21
-1.70                   5.11                     2.62
 0.06                   4.66                     2.78
 1.84                   4.66                     2.78
 3.61                   1.54                    16.47
 5.39                   0.21                    23.60
 7.16                   0.21                    23.93
Now consider Figure 3.
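The trade-off can be computed for any candidate threshold once scores are available for a sample of known unmatched and known matched pairs. The sketch below uses invented scores, not the data behind the paper's tables. Treating a score equal to the threshold as a link is an assumption.

```python
def error_rates(unmatched_scores, matched_scores, threshold):
    """Percent false duplicates and percent missed duplicates at a threshold."""
    false_dups = sum(1 for s in unmatched_scores if s >= threshold)
    missed = sum(1 for s in matched_scores if s < threshold)
    return (100 * false_dups / len(unmatched_scores),
            100 * missed / len(matched_scores))

# Hypothetical sums of log-odds for pairs of known status.
unmatched = [-12, -9, -6, -3, 1]
matched = [-2, 4, 7, 11, 15]

for threshold in (-5, 0, 5):
    print(threshold, error_rates(unmatched, matched, threshold))
```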
From page 372...
... The position of the threshold or thresholds depends on the desired type and size of the error rates (see Figures 1-3). Merge records which have been linked, allowing storage space for possible conflicts.
From page 373...
... There are many refinements, which need to be investigated, including the use of value-specific techniques, partial agreements, lack of independence between field entries, and the use of other statistical procedures to enhance current techniques. Note David White is professor emeritus at Utah State University, Department of Mathematics, Logan, Utah 84322 and has been statistical consultant to the Record Linkage Team, Family History Department, Church of Jesus Christ of Latter-Day Saints.
From page 374...
... There are potential advantages to using administrative data in analyses. Administrative data sources may contain greater amounts of data and that data may be more accurate due to improvements over time.
From page 375...
... string comparator metrics, search strategies, and name and address parsing/standardization from computer science; (2) discriminatory decision rules, error rate estimation, and iterative fitting procedures from statistics; and (3)
From page 376...
... In such a situation there is insufficient information to separate new units from existing units that have different mailing addresses associated with them. The matching weight or score is a number assigned to a pair that simplifies assignment of link and nonlink status via decision rules.
From page 377...
... As an adjunct to computer operations, clerical review is still needed to deal with pairs having significant amounts of missing information, typographical errors, or contradictory information. Even then, using the computer to bring pairs together and having computer-assisted methods of review at terminals is more efficient than manual review of printouts.
From page 378...
... About 14,000 person-hours (as many as 75 clerks for 3 months) were used in this clerical review, and an additional 450,000 duplicates (7.5 percent)
From page 379...
... Because the name information in Tables 20.3 and 20.4 may be insufficient for accurately determining match status, address information or other identifying characteristics may have to be obtained via clerical review. If the additional address information is indeterminate, then at least one establishment in each pair may have to be contacted.
From page 381...
... 20.4 MATCHING DECISION RULES For many projects, automated matching decision rules are developed using ad hoc, intuitive approaches. For instance, the decision rule might be as follows: · If the pair agrees on a specific three characteristics or agrees on four or more within a set of five characteristics, designate the pair as a link.
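An ad hoc rule of this kind is easy to write down, which is part of its appeal. The sketch below encodes the quoted example. The five field names and the choice of which three count as the "specific three" are hypothetical.

```python
# Hypothetical set of five characteristics and the "specific three" among them.
FIELDS = ("surname", "first_name", "house_number", "street", "zip")
REQUIRED = ("surname", "house_number", "zip")

def ad_hoc_link(agreements):
    """Link if all three required fields agree, or if at least four of five agree."""
    if all(agreements[f] for f in REQUIRED):
        return True
    return sum(bool(agreements[f]) for f in FIELDS) >= 4

pair_1 = dict(surname=True, first_name=False, house_number=True, street=False, zip=True)
pair_2 = dict(surname=False, first_name=True, house_number=True, street=True, zip=True)
pair_3 = dict(surname=True, first_name=True, house_number=False, street=False, zip=True)
print(ad_hoc_link(pair_1), ad_hoc_link(pair_2), ad_hoc_link(pair_3))
```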
From page 382...
... error rates. In my view the best way to build record linkage strategies is to start with formal mathematical techniques based on the Fellegi-Sunter model
From page 383...
... that rule (20.2) is optimal in that for any pair of fixed upper bounds on the rates of false matches and false nonmatches, the clerical review region is minimized over all decision rules on the same comparison space Γ.
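Rule (20.2) itself is not reproduced in this excerpt. As a sketch, assume it has the standard Fellegi-Sunter three-way form, in which the likelihood ratio R for a pair is compared with an upper and a lower cutoff. The cutoff values and the function name below are hypothetical.

```python
def classify(ratio, lower, upper):
    """Three-way decision on the likelihood ratio R = P(pattern | match) / P(pattern | nonmatch)."""
    if ratio >= upper:
        return "link"
    if ratio <= lower:
        return "nonlink"
    return "possible link"   # held for clerical review

# Hypothetical cutoffs; in practice they follow from the chosen error bounds.
LOWER, UPPER = 0.5, 200.0
for r in (1000.0, 12.0, 0.01):
    print(r, classify(r, LOWER, UPPER))
```

Moving the two cutoffs closer together shrinks the clerical review region at the price of higher error rates. The optimality statement in the text concerns that balance.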
From page 384...
... The intuitive idea is that if surnames such as "Vijayan" occur less often than surnames such as "Smith," then "Vijayan" has more distinguishing power. A variant of Newcombe's ideas was later mathematically formalized by Fellegi and Sunter (1969; see also Winkler 1988, 1989c for extensions)
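A simplified sketch of the frequency idea: if the chance of agreeing on a particular surname is approximated by that surname's relative frequency in the file, the agreement weight grows as the surname becomes rarer. Both that approximation and the single fixed m value are assumptions made for illustration. They are not the formal Fellegi-Sunter development, and the surname counts are invented.

```python
import math
from collections import Counter

def value_specific_weights(surnames, m=0.95):
    """Agreement weight per surname, using relative frequency as the chance rate."""
    counts = Counter(surnames)
    total = len(surnames)
    return {name: math.log2(m / (count / total))
            for name, count in counts.items()}

# Hypothetical file of 100 surnames.
surnames = ["SMITH"] * 50 + ["JONES"] * 30 + ["BROWN"] * 18 + ["VIJAYAN"] * 2
weights = value_specific_weights(surnames)
print(round(weights["SMITH"], 2), round(weights["VIJAYAN"], 2))
```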
From page 385...
... 20.4.5 Jaro String Comparator Metrics for Typographical Error Jaro (1989) introduced methods for dealing with typographical error such as "Smith" versus "Smoth." Jaro's procedure consists of two steps.
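The excerpt stops before the two steps are described. For orientation, the sketch below implements the form of the Jaro comparator that is commonly cited today: characters are matched within a sliding window, then transpositions are counted. Whether this agrees in every detail with the procedure in the chapter is an assumption.

```python
def jaro(s1, s2):
    """Commonly cited Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    # Match each character of s1 to an unused equal character of s2 within the window.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order (transpositions).
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

print(round(jaro("SMITH", "SMOTH"), 4), round(jaro("MARTHA", "MARHTA"), 4))
```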
From page 386...
... Belin demonstrated that the original Jaro string comparator and the Winkler extensions were the two best ways of improving matching efficacy in files for which identifying fields had significant percentages of minor typographical errors. 20.4.6 General Parameter Estimation Two difficulties arise in applying the EM procedures of Section 20.4.3.
From page 387...
... When both name and address information is used for matching, the two-class EM tends to divide a set of pairs into those agreeing on address information and those disagreeing. If address information associated with many pairs is indeterminate (e.g., Rural Route 1 or Highway 1)
From page 388...
... because it yields reasonably good matching decision rules (e.g., Newcombe 1988; Winkler 1990b)
From page 389...
... All three-way interactions are used in the last model. The basic reason that iterative refinement and three-class independent EM perform poorly is that independence does not hold.
From page 390...
... truth, cumulative distribution of nonmatches, three-class, selected interaction EM.
From page 391...
... truth, cumulative distribution of nonmatches, three-class, three-way interaction EM convex. ... matching status is used to determine the interactions that must be included.
From page 392...
... Improving parameter estimates can reduce clerical review regions by as much as 90 percent. 20.5.1 Quality of Blocking Criteria While use of blocking criteria facilitates the matching process by reducing the number of pairs to be considered, it can increase the number of false nonmatches because some pairs disagree on the blocking criteria.
From page 393...
... For instance, with name information alone, it may only be feasible to create subsets of pairs that are held for clerical review. With name and address information, a substantial number of the matches can be correctly distinguished.
From page 394...
... Because the overwhelming majority of farming operations have names of the form given in Table 20.8, the resultant parsed names will likely all have "Smith" as a surname that will yield good distinguishing power when combined with address information. The exception can occur when two names containing "Smith" have the same address. A similar situation occurs with the 1992 match of the Standard Statistical Establishment List (SSEL)
From page 395...
... The most straightforward means of parameter reestimation is the iterative refinement procedure of Statistics Canada (e.g., Newcombe 1988, pp. 65-66; Statistics Canada 1983; Jaro 1992).
From page 396...
... developed a method of estimating matching error rates when the curves (ratio R versus frequency) for matches and nonmatches are somewhat separated and the failure of the independence assumption is not too severe.
From page 397...
... software that they did not write or to port new algorithms in other languages to PL/I. A secondary concern is lack of appropriate, general-purpose software. In many situations for which name, address, and other comparable information are available, existing matching software will work well if names and addresses can be parsed correctly.
From page 398...
... While name parsing software is written and used by commercial firms, the associated source code is generally considered proprietary. 20.7.2 Need for General Address-Parsing Software and What Is Available Statistics Canada has the ASKGEN package (again written in PL/I
From page 399...
... While the ASKGEN and NSKGEN packages from Statistics Canada have been given out to individuals for use on IBM mainframes, associated documentation does not cover installation or details of the algorithms. To a lesser extent, the lack of detailed documentation is also true for the USDA system.
From page 400...
... , "On the Problem of Matching Lists by Samples," Journal of the American Statistical Association, 54, pp.
From page 401...
... (1993), "Generalized Record Linkage at Statistics Canada," Proceedings of the International Conference on Establishment Surveys, Alexandria, VA: American Statistical Association, pp.
From page 402...
... 118-125. Statistics Canada (1983), "Generalized Iterative Record Linkage System," unpublished report, Ottawa: Systems Development Division.
From page 403...
... of Record Linkage," Proceedings of the Survey Research Methods Section, American Statistical Association, pp.
From page 404...
... Young, LLP 1. Purpose The purpose of this paper is to provide an introduction or "starter set" for reflecting on human rights issues that arise when bringing together or linking the health records of individuals.
From page 405...
... 2.1 Types of Record Linkages It seems fairly safe to speculate that once human beings began to keep records there were efforts to link them together. Until well into this century, though, such work was done manually and often only with great difficulty and expense; however, there now exist four broad types of automated record linkage (see Figure 2)
From page 406...
... They also dealt with the implications of restricting the comparison pairs to be looked at, that is of "blocking" the files, something that generally has had to be done when linking files that are at all large. Many of the major public health research advances made in recent decades have benefitted at least in part from probabilistic linkage techniques.
From page 407...
... If an efficient (low cost, essentially error free) health care linkage system is a goal, then consideration needs to be given to the establishment of a health identification "number." In ideal circumstances, personal identifying information on a medical record should satisfy the following requirements [12]
From page 408...
... Its use could lead to matching errors and might greatly increase the potential for unregulated linkages between health and nonhealth data sets. 2.3 Some Proposed Health Record Linkage Systems The proposed Health Security Act [16]
From page 409...
... -- Record Linkage Architecture (Figure: diagram with components labeled Primary Care Unit, Linkage Processor, and Automated Patient Record.)
From page 410...
... All afford the individual or data subject some, sometimes sole, rights over what matters they want to keep private and what matters they are willing -- or want -- to reveal. Record linkage settings pose a particular challenge to an individual's ability to exercise his or her privacy rights. The sheer complexity of the setting makes it hard to clarify for the subject what the potential benefit or harm may be to permitting access.
From page 411...
... Arguments in favor of doing record linkages for efficiency reasons have not fully weighed these costs. In Brannigan and Beier [26] still further sound system architecture issues and recommendations are made that would be needed to implement essential confidentiality and security procedures, especially if large scale record linkages are to be employed.
From page 412...
... Hoilinan in a recent paper makes the observation that "too many people may already have insufficiently monitored access to hospital patient records." He seconds Mark Siegler's thesis that "medical confidentiality, as it has been traditionally understood by patients and doctors, no longer exists." Siegler, after a patient ... Figure 7. -- Administrative Data Linkages Conducted for the Health of Patients. Broad Areas (Column 1); Possible Response: Under what conditions; For future study
From page 413...
... which states: Personally-identifiable health records must be in the control of the individual. Personal information should only be disclosed with the knowing, meaningful consent of the individual.
From page 414...
... 4. Research Data Linkages Within the Health System It can be argued that some research uses of data linkages within the Health System are administrative and so are already covered by the discussion in Section 3, especially subsection 3.2.
From page 415...
... Figure 8. -- Basic Research Data Linkages within the Health System. Broad Areas (Column 1): Overall Recommendations, Technical Aspects; Possible Response: Under what conditions
From page 416...
... The control of any linkages between health and nonhealth records, say with Census Bureau data, needs careful study too [50]
From page 417...
... -- Research Data Linkages between Health and Other Record Systems. Broad Areas (Column 1); Possible Response: Under what conditions (Column 2); For future study
From page 418...
... Legal Questions: Ban any use of a new health identifier in nonhealth record systems. Study conforming legislative needs.
From page 419...
... Nonresearch Administrative Data Linkages between Health and Nonhealth Record Systems: For nonhealth reasons, only with a court order. For health reasons, only to directly aid patients.
From page 420...
... This point may have been lost in the detailed discussion of privacy and consent concerns. For example, the epidemiological literature is full of health studies that use record linkage techniques to advance knowledge [56]
From page 421...
... , 19 (1) 39-58, Statistics Canada; Part II was delivered at the XII Methodology Symposium, Ottawa Canada, November 1, 1995, under the title Linking Data to Create Information and will be included in a forthcoming issue of Survey Methodology.)
From page 422...
... (1995). Building Data Research Resources From Existing Data Sets: A Model for Integrating Patient Data to Form a Core Data Set, presented at the American Statistical Association Annual Meetings in Orlando, FL, August 1995.
From page 423...
... (1990). Implementing NCES's New Confidentiality Protections, American Statistical Association, 1990 Proceedings on the Section on Survey Research Methods, Alexandria, Va.: American Statistical Association.
From page 424...
... (1997). Improving Health Data among American Indians and Alaska Natives: An Approach from the Pacific Northwest, Health Care and Information Ethics: Protecting Fundamental Human Rights, Kansas City, MO: Sheed and Ward, 88-113.
From page 425...
... (1985). Methodological Issues in Linkage of Multiple Data Bases, Record Linkage Techniques -- 1985, Washington, DC: Department of the Treasury, Internal Revenue Service, pp.
From page 426...
... , (1997). Health Care and Information Ethics: Protecting Fundamental Human Rights, Kansas City, MO: Sheed and Ward.
From page 427...
... This paper was reprinted with permission from the Proceedings of the Census Bureau's Conference and Technology Interchange, March 17-21, 1996.
From page 428...
... In probabilistic record linkages the comparison or matching algorithm yields for each record pair a probability or "weight" which indicates the likelihood that record pairs relate to the same entity (Fair, 1995).
From page 429...
... There is a corresponding rising consumer expectation, particularly with respect to timeliness and quality of statistical data products. This has implications in terms of standard data concepts, definitions, coding, methodology used for record linkage.
From page 430...
... (1994), Population Projections for Canada, Provinces and Territories 1993-2016, Statistics Canada Catalogue No.
From page 431...
... The Information Age The Tofflers have described our times as being that of the third wave (Toffler and Toffler, 1995). The first wave was agricultural and it lasted thousands of years until the 18th century.
From page 432...
... The XIIth International Symposium on Methodological Issues, held at Statistics Canada on November 1-3, 1995, was entitled "From Data to Information." At this symposium topics included the role of statistics in making social policy, data integration, analytical methods, access and control of data, quality of statistical information, technical aspects of confidentiality, making data accessible to the general public, data warehousing, and electronic information dissemination. An earlier symposium dealt with re-engineering for statistical agencies (Statistics Canada, 1994).
From page 433...
... The original version of generalized record linkage software (GRLS.V1) that was developed at Statistics Canada was for a mainframe environment.
From page 434...
... files against the same master file. Record Linkage in the Toolbox of Software -- Some Examples of its Use Statistics Canada uses a common set of software products in re-engineering its administrative and statistical programs.
From page 435...
... (A roundtable luncheon of the Social Statistics Section at the 1995 American Statistical Association, chaired by G. Hole, discussed some of the above and future uses of administrative records to complement or supplement data from household surveys.)
From page 436...
... ), national databases of existing administrative records (e.g., Canadian Birth Data Base, the Canadian Cancer Data Base and the Canadian Mortality Data Base)
From page 437...
... Seven major files were linked to create the data required for the analysis file in this study, namely: the 1971 Census of Agriculture; the 1971 Census of Population; the 1971 Central Farm Register; the 1981 Central Farm Register; the Canadian Mortality Data Base; the 1966-71-76-81-86 Census of Agriculture Longitudinal file; and the Canadian Cancer Data Base. Analyses of these data have examined prostate (Morrison et al., 1993)
From page 438...
... As a result, the Manitoba Centre for Health Policy and Evaluation has collaborated with Statistics Canada to determine the feasibility of linking provincial administrative health care utilization with census data for a sample of Manitobans (Mustard et al., 1995). Mortality and health services utilization have been described in relation to the socio-economic status measure mortality and the use of health care services at seven different stages in the life course ...
From page 439...
... Proceedings of Statistics Canada Symposium 94 -- Re-Engineering for Statistical Agencies, November
From page 440...
... (1981). Generalized Iterative Record Linkage System, Ottawa, Canada: Statistics Canada.
From page 441...
... Statistics Canada (1996). Generalized Record Linkage System Concepts, Draft version dated 1996 February 14, Research and General Systems Development Division, Ottawa, K1A 0T6.
From page 442...
... To the public, patient confidentiality implies that only people directly involved in their care will have access to their medical records and that these people will be bound by strict ethical and legal standards that prohibit further disclosure (Woodward, 1996). The public is not likely to accept that their records are kept "confidential" if large numbers of people have access to their contents.
From page 443...
... The goal of this work is to provide tools for extracting needed information from medical records while maintaining a commitment to patient confidentiality. These same techniques are equally applicable to financial, demographic and educational microdata releases, as well.
From page 444...
... We can identify 29% by just birth date and gender, 69% with only a birth date and a 5 digit ZIP code, and 97% (53,033 voters) when the full postal code and birth date are used.
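Percentages of this kind come from counting how many records are the only ones with a given combination of field values. A minimal sketch follows, using four invented records rather than the voter list described in the paper. The field names are hypothetical.

```python
from collections import Counter

def fraction_unique(records, fields):
    """Fraction of records that are the only one with their combination of values."""
    keys = [tuple(rec[f] for f in fields) for rec in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Hypothetical records.
voters = [
    {"birth_date": "1960-01-02", "gender": "F", "zip": "02138"},
    {"birth_date": "1960-01-02", "gender": "M", "zip": "02138"},
    {"birth_date": "1960-01-02", "gender": "M", "zip": "02139"},
    {"birth_date": "1971-05-09", "gender": "F", "zip": "02139"},
]
for fields in (("birth_date",), ("birth_date", "gender"), ("birth_date", "gender", "zip")):
    print(fields, fraction_unique(voters, fields))
```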
From page 445...
... In medical databases, the minimum bin size should be much larger than the SSA guidelines suggest. Consider these three reasons: most medical databases are geographically located and so one can presume, for example, the ZIP codes of a hospital's patients, the fields in a medical database provide a tremendous amount of detail
From page 446...
... In extremely large databases like that of SSA, the database itself can be used to compute frequencies of characteristics found in the general population since it contains almost all the general population; small, specialized databases, however, must estimate these values. In the next section, we will present a computer program that generalizes data based on bin sizes and estimates.
From page 447...
... Table 5 shows the relationship between bin sizes and selected anonymity levels using the Cambridge voters database. As A increased, the minimum bin size increased, and in order to achieve the minimal bin size requirement, values within the birth date field, for example, were re-coded as shown.
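The re-coding described here can be pictured as stepping a field through coarser and coarser forms until every bin reaches the required size. The sketch below does this for a single birth date field. It is a simplified illustration, not the Datafly algorithm: the generalization levels, the suppression fallback and the dates are all assumptions.

```python
from collections import Counter

# Hypothetical generalization ladder for dates written as YYYY-MM-DD.
LEVELS = [
    lambda d: d,              # full date
    lambda d: d[:7],          # month and year
    lambda d: d[:4],          # year only
    lambda d: d[:3] + "0s",   # decade
]

def generalize(dates, min_bin_size):
    """Return the first level at which every bin holds at least min_bin_size records."""
    for level, recode in enumerate(LEVELS):
        recoded = [recode(d) for d in dates]
        if min(Counter(recoded).values()) >= min_bin_size:
            return level, recoded
    return None, ["*"] * len(dates)   # nothing suffices: suppress the field

dates = ["1964-03-12", "1964-03-30", "1964-09-01",
         "1965-01-15", "1965-07-04", "1965-07-20"]
for k in (1, 2, 4, 7):
    level, recoded = generalize(dates, k)
    print(k, level, sorted(set(recoded)))
```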
From page 448...
... The average bin size based only on birth date and gender for that population is 3, but had the researcher received only the year of birth in the birth date field, the average bin size based on birth year and gender would have increased to 1125 people. It is estimated that most of this data could be re-identified since collected fields also included residential ZIP codes and city, occupational department or agency, and provider information.
From page 449...
... The program μ-Argus, like the Datafly System, makes decisions based on bin sizes, generalizes values within fields as needed, and removes extreme outlier information from the released data. The user provides an overall bin size and specifies which fields are sensitive by assigning a value between 0 and 3 to each field.
From page 450...
... chest pain chest pain hypertension obesity shortness of breath chest pain chest pain 02138 * The minimum bin size is 2.
From page 451...
... Of course, guaranteeing anonymity in data requires a criterion against which to check resulting data and to locate sensitive values. If this is based only on the database itself, the minimum bin sizes and sampling fractions may be far from optimal and may not reflect the general population.
From page 452...
... But recall that profiling requires guesswork in identifying fields on which the recipient could link. Suppose a profile is incorrect; that is, the producer misjudges which fields are sensitive for linking.
From page 453...
... (1997). Panel Told Releases of Medical Records Hurt Privacy, Boston Herald, Boston, (35).


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.