Chapter 6

Invited Session on Business and Miscellaneous Record Linkage Applications

Chair: Richard Allen, National Agricultural Statistics Service

Authors:

Jenny B. Wahl, St. Olaf College

Philip M. Steel and Carl A. Konschnik, Bureau of the Census

Edward H. Porter and William E. Winkler, Bureau of the Census









Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition

Linking Federal Estate Tax Records

Jenny B. Wahl, St. Olaf College

Abstract

This chapter focuses on the construction of a dataset that links together tax records and contemplates possible uses of these data. I first provide an overview of scholarly work regarding inherited wealth and establish the need for intergenerationally linked data. I then discuss techniques that I used to work with Federal estate tax returns filed in Wisconsin up to 1981 (which included 93,539 decedents and their 299,688 beneficiaries). By combining a standardizing/matching software package with a series of SAS programs, I linked these records to form a database containing 27,535 observations. Each observation has information on an individual who was reported on at least two estate tax returns: once as a decedent and at least once as a beneficiary. Of the 27,535 observations, 6,453 are matched pairs and the remaining 21,082 are likely pairs. I conclude by revealing certain problems associated with linking together tax records and by suggesting future research.

Introduction

The only sure things in life are death and taxes—and, unfortunately for some, death taxes. Fortunately for the rest of us, Federal estate tax data offer a rare opportunity to observe the total wealth, portfolios, and bequest behavior of certain individuals. Not only that, these data can be linked across generations, providing testing grounds for hypotheses about motives for intergenerational transfers, tradeoffs of family size and bequest amount, and the like. I have used all the estate tax records filed in Wisconsin from 1916 to 1981 to assemble just such a data set. These data consist of 27,535 observations; each observation has information on a single individual who was reported on two estate tax records: once as a decedent and once as a beneficiary[1]. Of this number, 6,453 are matched pairs and the remaining 21,082 are likely pairs. Figure 1 illustrates the configuration of each observation. In addition to the linked data, a residual set of 272,153 beneficiaries did not match to any decedent.

Figure 1.—Configuration of an Observation in the Matched Data Set
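The configuration in Figure 1 pairs the same individual's decedent-side and beneficiary-side information in one record. A minimal sketch of such an observation's layout is below; the field names and types are hypothetical illustrations, not the actual file layout:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LinkedObservation:
    """One matched observation: the same person seen as a decedent on
    one estate tax return and as a beneficiary on another.
    All field names are illustrative, not the original file layout."""
    name: str
    # As decedent: portfolio, age, marital status, number of children, ...
    death_year: Optional[int] = None
    portfolio_value: Optional[float] = None
    # As beneficiary: relationship to benefactor, trust receipt, bequest size
    relationship_to_benefactor: Optional[str] = None
    bequest_amount: Optional[float] = None
    # "matched" (6,453 observations) or "likely" (21,082 observations)
    status: str = "likely"
```

A matched pair would carry `status="matched"`; the remaining observations are likely pairs.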

What follows is, first, a brief overview of some of the questions and theories that scholars have put forth regarding wealth and intergenerational transfers. I then turn to a fuller description of the data, a discussion of the linking methodology, and a short mention of the empirical work that lies ahead.

The Importance of Inherited Wealth

For a variety of reasons, scholars have studied the transfer of wealth across generations. Some have focused on macroeconomic issues such as the influence of wealth transfers on the distribution of wealth (Menchik, 1979; Kotlikoff and Summers, 1981; Modigliani, 1988; and Tachibanaki, 1994), the degree to which intergenerational wealth transfers affect savings rates across countries (Darby, 1979; Hayashi, 1986; and Kotlikoff, 1988), and the interaction of cross-generational transfers and fiscal policy (Barro, 1974; and Aaron and Munnell, 1992). Others have concentrated on microeconomic questions such as the propensity of parents to compensate their less able children or, alternatively, to leave more money to their relatively capable offspring (Becker and Tomes, 1979; and Tomes, 1981). In the process, researchers have speculated as to the appropriate model of behavior. Do individuals leave bequests because they care about their descendants or other heirs? Or do people design bequests strategically to induce potential heirs to offer attention and companionship? Or might the leaving of an estate simply be a mistake born of miscalculating one's own mortality? (Kotlikoff and Spivak, 1981; Bernheim, Schleifer, and Summers, 1985; Abel, 1985; Hurd, 1987; Modigliani, 1988; Lord and Rangazas, 1991; Altonji, Hayashi, and Kotlikoff, 1992; Gale and Scholz, 1994; Abel and Kotlikoff, 1994; Hurd, 1994; Arrondell, Perelman, and Pestieau, 1994; Yagi and Maki, 1994; and Tachibanaki and Takata, 1994.)
Professors Martin David and Paul Menchik (1982) took yet a different tack. They used wealth data to estimate propensities to bequeath out of earnings. Although they did not propose any new theories, David and Menchik cast doubt on an old one: their results indicated that the life-cycle hypothesis cannot explain the bequest behavior of a set of Wisconsin decedents[2]. Others have posed additional interesting questions. Do people behave differently—choose alternative occupations or retire early, for example—if they receive or anticipate a bequest? What relationship do estate size and life insurance bear to a decedent's earnings? What connections exist among fertility, estate size, and earnings? Can one find evidence, for instance, of a tradeoff between the number of children and the wealth left to each one (Becker and Tomes, 1976; Behrman, Pollak, and Taubman, 1982; Wahl, 1986; and Wahl, 1991)? Do people tend to allocate estates equally among their children? Are people increasingly “spending the kids' inheritance,” as the bumper stickers proclaim? What patterns in charitable giving have appeared over the years? Is age at death related to lifetime earnings? Many of these questions remain open. Answering them requires a sufficiently large, intergenerationally linked data set that contains comprehensive demographic and socioeconomic information.

The Original Estate Tax Data: Saved in the Nick of Time

Estate tax records contain a wealth of data on a nation's citizens. One can find not only detailed information on accumulated capital and portfolio holdings but also clues about family composition, residence and migration patterns, fertility, and mortality. By dint of much effort (and good computer software) one can even link records together to reconstitute families and their financial and demographic histories. I have drawn upon Federal estate tax records to do just this. Let me offer a short history of the initial data collection effort.
In 1916, the modern Federal estate tax came into being—well before modern computers, but recently enough that paper documents still existed in archives seven decades later. In efforts to clean house during the Reagan years, zealous politicians nearly caused an untimely end for the boxed estate tax returns that were scattered in warehouses around the country. Fortunately, the Statistics of Income (SOI) Division at the Internal Revenue Service marshaled its forces to preserve these important historical artifacts in computerized form. The result was two enormous files: one consisting of economic and demographic information on decedents, the other of information on beneficiaries (linked via record number to the original estate tax record). Any attempt to match these two files required reducing their size. Because other researchers have used Wisconsin data to investigate wealth and estate issues (for example, David and Menchik, 1982), SOI extracted all the Wisconsin estate tax returns to use for a pilot project. The result was a decedent file with 93,711 observations and a beneficiary file with 300,269 observations. In the decedent file, 93,539 are unique individuals. For consistency's sake, omitting records from the decedent file meant purging the same records from the beneficiary file. The outcome was a file of 299,688 beneficiaries. Of this number, 188 seem to be duplicates on the same estate tax record—that is, beneficiaries with the same name and same relationship code to the decedent, but appearing twice on a given tax return. Such apparent duplicates may, however, represent different persons with the same name—cousins, for example. Alternatively, these may constitute separate bequests to a single individual—one direct and one in trust. Rather than investigate these observations before the match procedure, I simply marked them so that, if any appeared after the match, I could inspect them more carefully at that time.
Linking the Data: Overlapping Estate Tax Returns

Linking data from one set of records to another requires much information and, frequently, creative computer programming (Fellegi and Sunter, 1969). The AUTOMATCH software written by Matt Jaro provides a solid foundation (Jaro, 1997); variations on his programs coupled with SAS programming produced the linked estate tax records. The critical linkage was this: Joseph McCarthy, say, appears as a beneficiary on his father's estate tax return. In turn, the estate of Joseph McCarthy also files a tax return. The two are linked into a single observation, given consistency in social security numbers, sex, years of birth and death, and the like. Each observation then contains detailed information about Joseph the decedent: his portfolio, age, marital status, and number of children, for instance. Information about Joseph the beneficiary appears as well: his relationship to his benefactor, receipt of a trust, and sometimes the size of his bequest. The AUTOMATCH software contains several attractive features that help create good links between records. It standardizes individual names and creates NYSIIS and Soundex codes. (Because I had maiden names for many women, I ran the standardization/coding step twice.) These codes work well as blocking variables in the match process. The software also allows specification of values for missing variables; this helps distinguish between true mismatches and apparent mismatches caused by missing data. The match procedure itself allows multiple rounds so that I could block and match over different sets of variables. Table 1 shows the salient variables for each match round. The matching process itself also has nice characteristics. I could request multiple matches—important, because Joseph McCarthy may have inherited from more than one person. Each matching variable has a designation to control for miskeying in the original data.
For example, I could allow for mismatched numbers in the social security number string and mismatched letters in the name character strings. These designations also allow matching around intervals, which proved essential for my year-of-birth variables because I had to construct them from rounded-year ages. Each matching variable also carries a set of probabilities to allow for type I and type II errors[3]. Altogether, these probabilities translate into a single weight associated with each match in each match round. I could choose two cutoff weights per round:

one the lower bound for declared matches, the other the lower bound for potential matches. After each match round, I could perform an interactive clerical review on the potential matches and change their status to declared matches or residuals. Following the clerical review, the software outputs all residuals to the next match round.

Table 1.—Matching Rounds

Pass | Blocking Variables | Matching Variables | Original Matches | Original Clerical | Final Matches | Final Clerical
1 | SSN | surname, first name, maiden name, suffix, initial, sex, year of birth | 4,805 | 119 | 4,884 | 0
2 | SSN | surname/maiden name, first name, suffix, initial, sex, year of birth | 4,906 | 43 | 4,928 | 0
3 | surname NYSIIS, first name Soundex, sex | SSN, surname, first name, maiden name, initial, suffix, year of birth | 5,514 | 30,651 | 5,514 | 30,651
4 | surname/maiden name NYSIIS, first name Soundex, sex | SSN, surname/maiden name, first name, initial, suffix, year of birth | 5,515 | 30,652 | 5,515 | 30,652

The clerical review process is extremely time-consuming. Although I used it for the first two match rounds, thereafter I used SAS programs to decide whether to change the status of potentially matched pairs[4]. Simply put, I distilled a set of decision rules into SAS programs rather than using the same rules on an interactive, case-by-case basis. For example, suppose the initial matching process paired Joseph McCarthy from the decedent file to Joseph McCarthy from the beneficiary file. The beneficiary file includes a date of death for Joe's benefactor. If this date of death was after the date of death of Joe the decedent, I called it a nonmatch.

Particular Features of Estate Tax Data

Any two data sets have quirks that make matching difficult. Let me point out a few issues associated with matching data on people observed at two different points in time, often several years apart. Some problems pertained primarily to females. During the time period covered by my data, a woman often took her husband's social security number at marriage. Sorting and matching by SSN for women was therefore problematic if a woman got married after receiving a bequest. Women also sometimes changed their middle initials upon marriage to reflect their maiden names. I had to take care, then, with the probabilities placed on type I and type II errors when initials appeared as matching variables. Yet women provided information—namely, maiden names—that helped me refine the likelihood of matches as well. Suppose a decedent carried the maiden name Scheuren. Say that the decedent potentially matches to a beneficiary whose benefactor carried the last name Scheuren. Provided that birth and death years were logical, I could declare this a match. By the same token, if a (potentially matched) decedent had the last name Winkler and the benefactor named on the beneficiary file had the maiden name Winkler, again this might be considered a match. Males created certain problems as well, albeit less directly than females. I had hoped to use cities as matching variables. Yet this hope was dashed: Wisconsin men seemed to like passing their names on to their sons, people did not seem to move around much, and missing ages for beneficiaries frequently meant that I could not screen matches by birth year. As a result, I could not use locational variables to improve the matching process.
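The decision rules that replaced case-by-case clerical review can be sketched as follows. This is a hedged illustration in Python rather than SAS, showing only the date-of-death screen described above; field names are hypothetical:

```python
from datetime import date

def screen_pair(decedent_death, benefactor_death):
    """One distilled decision rule: a beneficiary cannot inherit from a
    benefactor who died after the beneficiary's own death.
    Both arguments are datetime.date values (or None when missing)."""
    if decedent_death is None or benefactor_death is None:
        return "keep"          # missing data: leave for other rules
    if benefactor_death > decedent_death:
        return "nonmatch"      # benefactor outlived the decedent: impossible link
    return "keep"
```

Running such a rule over all potential pairs in batch is what reduced seven hours of interactive review to roughly half an hour.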
A last discovery: one should always assign unique record identification numbers to observations on each file. Initially, the beneficiary file contained identifiers that pointed back to the estate tax record, but it did not have unique identifiers. Because my original files were so large, I excluded some variables while performing the match. When I attempted to reattach data after the match, I could not be sure that the right data went to the right individual. I therefore had to retrace my steps, this time with unique identifiers for each original file.

What Lies Ahead

In the coming months, I will use these linked data to fulfill two objectives. One is to compare matched and unmatched beneficiaries and report any significant differences. The other is to generate a proxy for bequest amount. To proceed, I must convert dollar figures to constant-dollar amounts, control for changes in filing thresholds, and implement a logical cutoff process so as to separate nonmatches from impossibilities. That is, I do not want to call unmatched data a “nonmatch” if the individual could not possibly have entered the matched data set because he or she was born before 1916 or was still living after 1981. Eventually, I hope to extend matches forward and back to reconstitute multiple generations of families.
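The logical cutoff that separates nonmatches from impossibilities can be sketched directly from the criterion stated above (a minimal sketch; argument names are hypothetical, and the 1916/1981 bounds come from the chapter):

```python
def classify_unmatched(birth_year, death_year):
    """Mirror the stated criterion: an individual born before 1916, or
    still living after 1981 (death_year None or later than 1981), could
    not have entered the matched data set, so the record is an
    'impossibility' rather than a true nonmatch."""
    if birth_year is not None and birth_year < 1916:
        return "impossible"
    if death_year is None or death_year > 1981:
        return "impossible"
    return "nonmatch"
```

Only records classified as "nonmatch" would then enter comparisons of matched versus unmatched beneficiaries.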

Acknowledgments

Many thanks to Fritz Scheuren, who made sure this work saw the light of day; Dan Skelly, who encouraged me to dirty my hands with these data; and Barry Johnson, who answered my questions and did most of the hard stuff.

Footnotes

[1] Individuals can appear as beneficiaries on more than one estate tax return. The pairs do not therefore represent unique persons.

[2] The life-cycle hypothesis, associated originally with Franco Modigliani, suggests that people tend to decumulate wealth after a certain age, as they begin to anticipate death. For a review, see Ando and Modigliani (1963) and Modigliani (1988).

[3] Type I errors occur when true matches are declared nonmatches; Type II errors occur when nonmatches are declared matches.

[4] Here is a time comparison: using the clerical review process on 3,827 potential pairs took me seven hours. Writing and running SAS programs with embedded decision rules took about one-half hour for the same data.

References

Aaron, H. and Munnell, A. (1992). Reassessing the Role for Wealth Transfer Taxes, National Tax Journal, 45:119–44.

Abel, A. (1985). Precautionary Saving and Accidental Bequests, American Economic Review, 75:777–91.

Abel, A. and Kotlikoff, L. (1994). Intergenerational Altruism and the Effectiveness of Fiscal Policy—New Tests Based on Cohort Data, Savings and Bequests, ed. T. Tachibanaki, Ann Arbor: University of Michigan Press, 167–96.

Altonji, J.; Hayashi, F.; and Kotlikoff, L. (1992). Is the Extended Family Altruistically Linked? New Tests Based on Micro Data, American Economic Review, 82:1177–98.

Ando, A. and Modigliani, F. (1963). Lifecycle Hypothesis of Savings: Aggregate Implications and Tests, American Economic Review, 53.

Arrondell, L.; Perelman, S.; and Pestieau, P. (1994). The Effect of Bequest Motives on the Composition and Distribution of Assets in France, Savings and Bequests, ed. T. Tachibanaki, Ann Arbor: University of Michigan Press, 229–44.

Barro, R. (1974). Are Government Bonds Net Wealth? Journal of Political Economy, 82:1095–1118.

Becker, G. and Tomes, N. (1976). Child Endowments and the Quantity and Quality of Children, Journal of Political Economy, 84:143–62.

Becker, G. and Tomes, N. (1979). An Equilibrium Theory of the Distribution of Income and Intergenerational Mobility, Journal of Political Economy, 87:1153–89.

Behrman, J.; Pollak, R.; and Taubman, P. (1982). Parental Preferences and Provision for Progeny, Journal of Political Economy, 90:52–73.

Bernheim, B.D.; Schleifer, A.; and Summers, L. (1985). The Strategic Bequest Motive, Journal of Political Economy, 93:1045–76.

Darby, M. (1979). The Effects of Social Security on Income and the Capital Stock, Washington, DC: American Enterprise Institute.

David, M. and Menchik, P. (1982). Modeling Household Bequests, University of Wisconsin, working paper.

Fellegi, I. and Sunter, A. (1969). A Theory for Record Linkage, Journal of the American Statistical Association, 64:1183–1210.

Gale, W. and Scholz, J.K. (1994). Intergenerational Transfers and the Accumulation of Wealth, Journal of Economic Perspectives, 8:145–60.

Hayashi, F. (1986). Why is Japan's Saving Rate So Apparently High, NBER Macro Annual, ed. S. Fisher, Cambridge: MIT Press.

Hurd, M. (1987). Savings of the Elderly and Desired Bequests, American Economic Review, 77:298–312.

Hurd, M. (1994). Measuring the Bequest Motive: The Effect of Children on Saving by the Elderly in the U.S., Savings and Bequests, ed. T. Tachibanaki, Ann Arbor: University of Michigan Press, 111–36.

Jaro, M. (1997). Matchware Product Overview, Record Linkage Techniques—1997, eds. W. Alvey and B. Jamerson, Washington, D.C.: Office of Management and Budget.

Kotlikoff, L. (1988). Intergenerational Transfers and Savings, Journal of Economic Perspectives, 2:48–51.

Kotlikoff, L. and Spivak, A. (1981). The Family as an Incomplete Annuities Market, Journal of Political Economy, 89:372–91.

Kotlikoff, L. and Summers, L. (1981). The Role of Intergenerational Transfers in Aggregate Capital Accumulation, Journal of Political Economy, 89:706–32.

Lord, W. and Rangazas, P. (1991). Savings and Wealth in Models with Altruistic Bequests, American Economic Review, 81:289–96.

Menchik, P. (1979). Intergenerational Transmission of Inequality: An Empirical Study of Wealth Mobility, Economica, 46:749–62.

Modigliani, F. (1988). The Role of Intergenerational Transfers and Life Cycle Saving in the Accumulation of Wealth, Journal of Economic Perspectives, 2:15–40.

Tachibanaki, T., ed. (1994). Savings and Bequests, Ann Arbor: University of Michigan Press.

Tachibanaki, T. and Takata, S. (1994). Bequest and Asset Distribution: Human Capital Investment and Intergenerational Wealth Transfers, Savings and Bequests, ed. T. Tachibanaki, Ann Arbor: University of Michigan Press, 197–228.

Tomes, N. (1981). The Family, Inheritance, and Intergenerational Transmission of Inequality, Journal of Political Economy, 89:928–58.

Wahl, J. (1991). American Fertility Decline in the Nineteenth Century: Tradeoff of Quantity and Quality? Essays in Honor of Robert William Fogel, eds. C. Goldin and H. Rockoff, Chicago: University of Chicago Press.

Wahl, J. (1986). New Results on the Decline in Household Fertility in the United States from 1750 to 1900, Studies in Income and Wealth, eds. R. Gallman and S. Engerman, Chicago: University of Chicago Press, 391–438.

Yagi, T. and Maki, H. (1994). Cost of Care and Bequests, Savings and Bequests, ed. T. Tachibanaki, Ann Arbor: University of Michigan Press, 39–62.

Post-Matching Administrative Record Linkage Between Sole Proprietorship Tax Returns and the Standard Statistical Establishment List

Philip M. Steel and Carl A. Konschnik, Bureau of the Census

Abstract

In 1992 a match was performed between the IRS Form 1040, Schedule C file and the Standard Statistical Establishment List (SSEL). The match supplemented existing linkages already established between the two files. Though no matching operation has been performed on subsequent 1040 files, the links established on the 1992 data continue to be used in the processing of these files. We are now in a position to analyze the long-term effectiveness of the procedure and how frequently it should be applied. As a by-product of the matching operation, we obtained a measure of the fit between two records. We explore the possibility of utilizing this measure to link records or to select records to be subjected to a matching procedure. The measure of fit derived from the 1992 processing can also be applied to test the validity of existing linkages derived from other procedures.

Introduction

This paper describes a matching process which improves the linkage between sole proprietorship income tax return records from the Internal Revenue Service (IRS) and their associated payroll records on the Census Bureau's Standard Statistical Establishment List (SSEL). The matching process supplements the linkages made previously based on a common primary identifying number on the two types of records. This number is the Employer Identification Number (EIN), issued by IRS to businesses with employees, and used by them as a principal taxpayer identification number. Unfortunately, this common identifier is omitted on roughly 30 percent of the annual income tax returns on which it should appear.
In matching, our aim was to make the linkages more complete by using other information besides the EIN—chiefly, name, city, state, ZIP code, payroll, and kind-of-business activity code.

Context and Motivation for the Matching

Linking receipts and payroll records depends largely on associating the correct EIN with each annual income tax return. A sole proprietorship business, when filing the required annual Form 1040, Schedule C (or, briefly, 1040-C) tax return with the IRS, uses the owner's social security number (SSN) as its taxpayer identification number. If the business has employees, it is required to have an EIN and use it for filing IRS Form 941, Employer's Quarterly Federal Tax Return. When it files its annual 1040-C tax return, the sole proprietorship business is asked to provide its EIN if it has one. This reported EIN is the principal link between the annual business income and quarterly payroll tax returns for sole proprietorship employers.
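The supplemental matching on non-EIN information can be sketched as a simple blocking step: group records sharing a crude key built from the listed fields, then score only within-block pairs. The blocking key, record layouts, and field names below are illustrative assumptions, not the actual Census Bureau procedure:

```python
from collections import defaultdict

def block_key(record):
    """Hypothetical blocking key from the non-EIN fields the paper lists:
    a short standardized-name prefix plus state and ZIP code."""
    return (record["name"][:4].upper(), record["state"], record["zip"])

def candidate_pairs(schedule_c, ssel):
    """Yield (1040-C record, SSEL record) pairs for later scoring, but only
    for returns that omitted the EIN (roughly 30 percent of returns)."""
    index = defaultdict(list)
    for rec in ssel:
        index[block_key(rec)].append(rec)
    for rec in schedule_c:
        if rec.get("ein") is None:
            for cand in index[block_key(rec)]:
                yield rec, cand
```

Each candidate pair would then receive a measure of fit computed from name, payroll, and activity-code agreement, which is the quantity the authors propose reusing to validate existing linkages.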

Approximate String Comparison and its Effect on an Advanced Record Linkage System

Edward H. Porter and William E. Winkler, Bureau of the Census

Abstract

Record linkage, sometimes referred to as information retrieval (Frakes and Baeza-Yates, 1992), is needed for the creation, unduplication, and maintenance of name and address lists. This paper describes string comparators and their effect in a production matching system. Because many lists have typographical errors in more than 20 percent of first names and also in last names, effective methods for dealing with typographical error can greatly improve matching efficacy. The enhanced methods of approximate string comparison deal with typographical variations and scanning errors. The values returned by the string comparator are used in a statistical model for adjusting parameters that are automatically estimated by an expectation-maximization algorithm for latent class, log-linear models of the type arising in the Fellegi-Sunter model of record linkage (1969). Overall matching efficacy is further improved by a linear sum assignment algorithm that forces 1–1 matching.

Introduction

Modern record linkage represents a collection of methods from three different disciplines: computer science, statistics, and operations research. While the foundations are from statistics, beginning with the seminal work of Newcombe (Newcombe et al., 1959; also Newcombe, 1988) and Fellegi and Sunter (1969), the means of implementing the methods have primarily involved computer science. Record linkage begins with highly evolved software for parsing and standardizing names and addresses that are used in the matching. Name standardization identifies components such as first names, last names (surnames), titles, and middle initials. Address standardization locates components such as house numbers, street names, PO Boxes, apartment numbers, and rural routes.
With good standardization, effective comparison of corresponding components of information and the advanced methods described in this paper become possible. Because pairs of strings often exhibit typographical variation (e.g., Smith versus Smoth), the record linkage needs effective string comparator functions that deal with typographical variations. While approximate string comparison has been a subject of research in computer science for many years (see the survey article by Hall and Dowling, 1980), some of the most effective ideas in the record linkage context were introduced by Jaro (1989) (see also Winkler, 1985, 1990). Budzinsky (1991), in an extensive review of twenty string comparison methods, concluded that the original Jaro method, the extended method due to Winkler (1990), and a widely used computer science method called bigrams worked well. This paper describes two new enhancements to the string comparators used at the Census Bureau. The first, due to McLaughlin (1993), adds logic for dealing with scanning errors (e.g., “I” versus “1”) and certain common keypunch errors (e.g., “V” versus “B”). The second, due to Lynch and Winkler (1994), makes adjustments

for pairs of long strings having a high proportion of characters in common. We also describe the method of computing bigrams and present results comparing them with the other string comparators of this paper. Our record linkage system uses the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin, 1977) to estimate optimal matching parameters. We use a linear sum assignment procedure (lsap) to force 1–1 matching. Jaro (1989) introduced the lsap as a highly effective means of eliminating many pairs that ordinarily might be clerically reviewed. With a household data source containing multiple individuals in a household, it effectively keeps the four pairs associated with father-father, mother-mother, son-son, and daughter-daughter pairs while eliminating the remaining twelve pairs associated with the household. The next section describes the string comparator. In the third section, we provide a summary of the parameters that are obtained via the EM algorithm. The results of section four provide empirical examples of how matching efficacy is improved for three small pairs of high-quality lists. The final section consists of a summary and conclusion.

Approximate String Comparison

Dealing with typographical error can be vitally important in a record linkage context. If comparisons of pairs of strings are only done in an exact character-by-character manner, then many matches may be lost. An extreme example is the Post Enumeration Survey (PES) (Winkler and Thibaudeau, 1991; also Jaro, 1989) in which, among true matches, almost 20 percent of last names and 25 percent of first names disagreed character-by-character. If matching had been performed on a character-by-character basis, then more than 30 percent of matches would have been missed by computer algorithms that were intended to delineate matches automatically.
In such a situation, required manual review and (possibly) matching error would have greatly increased. Jaro (1989) introduced a string comparator that accounts for insertions, deletions, and transpositions. In a small study, Winkler (1985) showed that the Jaro comparator worked better than some others from computer science. In a large study, Budzinsky (1991) concluded that the comparators due to Jaro and Winkler (1990) were the best among twenty in the computer science literature. The basic Jaro algorithm is: compute the string lengths, find the number of common characters in the two strings, and find the number of transpositions. The definition of common is that the agreeing character must be within 1/2 the length of the shorter string. The definition of transposition is that the character from one string is out of order with the corresponding common character from the other string. The string comparator value (rescaled for consistency with the practice in computer science) is:

jaro(s1,s2) = 1/3 (#common/str_len1 + #common/str_len2 + (#common − 0.5 #transpositions)/#common), (1)

where s1 and s2 are the strings with lengths str_len1 and str_len2, respectively. The new string comparator algorithm begins with the basic Jaro algorithm and then proceeds to three additional loops corresponding to the enhancements. Each enhancement makes use of information that is obtained from the loops prior to it. The first enhancement due to McLaughlin (1993) assigns value 0.3 to each disagreeing but similar character. Each exact agreement gets value 1.0 and all exact agreements are located prior to searching for similar characters. Similar characters might occur because of scanning errors (“I” versus “1”) or keypunch errors (“V” versus “B”). The number of common characters (#common) in equation (1) gets increased by 0.3 for

each similar character; this total is denoted by #similar, and #similar is substituted for #common in the first two components of equation (1).

The second enhancement, due to Winkler (1990), gives increased value to agreement on the beginning characters of a string. It was based on ideas from a very large empirical study by Pollock and Zamora (1984) for the Chemical Abstracts Service. The study showed that the fewest errors typically occur at the beginning of a string and that error rates by character position increase monotonically as the position moves to the right. The enhancement basically consisted of adjusting the string comparator value upward by a fixed amount if the first four characters agreed, and by lesser amounts if only the first three, two, or one characters agreed. The string comparator examined by Budzinsky (1991) consisted of the Jaro comparator with only the Winkler enhancement.

The final enhancement, due to Lynch and Winkler (1994), adjusts the string comparator value if the strings are longer than six characters and more than half the characters beyond the first four agree. The final enhancement was based on detailed comparisons between versions of the comparator. The comparisons involved tens of thousands of pairs of last names, first names, and street names that did not agree on a character-by-character basis but were associated with truly matching records.

A common string comparison methodology is comparing the bigrams that two strings have in common. A bigram is two consecutive letters within a string. Hence the word "bigram" contains the bigrams "bi," "ig," "gr," "ra," and "am." The bigram function also returns a value between 0 and 1: the total number of bigrams that the two strings have in common divided by the average number of bigrams in the two strings. Bigrams are known to be a very effective, simply programmed means of dealing with minor typographical errors.
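The bigram measure just described is simple to program. The following is a minimal Python sketch under the definition above (common bigrams divided by the average number of bigrams in the two strings); it is not the production routine, and each bigram is allowed to match at most once.

```python
def bigrams(s):
    """All two-character consecutive substrings of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def bigram_sim(s1, s2):
    """Common bigrams divided by the average number of bigrams."""
    b1, b2 = bigrams(s1), bigrams(s2)
    if not b1 or not b2:
        return 0.0
    remaining = list(b2)
    common = 0
    for b in b1:
        if b in remaining:
            remaining.remove(b)  # each bigram matches at most once
            common += 1
    return common / ((len(b1) + len(b2)) / 2)
```

For example, `bigrams("bigram")` yields the five bigrams listed above, and identical strings score 1.0.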
They are widely used by computer scientists working in information retrieval (Frakes and Baeza-Yates, 1992). Table 1 illustrates the effect of the new enhanced comparators on last names, first names, and street names, respectively.

To make the value returned by the bigram weighting function more comparable to the other string comparators, we make a functional adjustment. If x is the value returned by the bigram weighting function, we use f(x) = x^0.2435 if x is greater than 0.8 and 0.0 otherwise.

If each string in a pair is less than four characters long, then the Jaro and Winkler comparators return the value zero. The Jaro and Winkler comparator values are produced by a loop in the main production software (e.g., Winkler and Thibaudeau, 1991) that is only entered if the two strings do not agree character-by-character. The return value of zero is justified because, if each of the strings has three or fewer characters, then they necessarily disagree on at least one.

In record linkage situations, the string comparator value is used to adjust the matching weight associated with the comparison downward from the agreement weight toward the disagreement weight. Using crude statistical modeling techniques, Winkler (1990) developed downweighting functions for last names, first names, street names, and some numerical comparisons that generalized the original downweighting function introduced by Jaro.
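The basic Jaro comparator and the Winkler prefix enhancement can be sketched in Python. This is a minimal sketch, not the Census Bureau production code: it uses the commonly implemented search window of half the longer string minus one (the text above defines "common" relative to the shorter string), the standard prefix scaling factor of 0.1, and at most four leading characters, and it omits the short-string cutoff and the McLaughlin and Lynch-Winkler loops.

```python
def jaro(s1, s2):
    """Basic Jaro comparator: common characters and transpositions."""
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    common = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0
    # Count half-transpositions between the two matched subsequences.
    k = half_transpositions = 0
    for i, c in enumerate(s1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if c != s2[k]:
                half_transpositions += 1
            k += 1
    transpositions = half_transpositions / 2
    return (common / len(s1) + common / len(s2)
            + (common - transpositions) / common) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Winkler enhancement: boost for agreement on leading characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)
```

For MARTHA and MARHTA this sketch returns 0.944 and 0.961, agreeing with the Jaro and Wink columns of Table 1.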

Table 1. —Comparison of String Comparators Using Last Names, First Names, and Street Names

Two Strings                   Jaro    Wink    McLa    Lynch   Bigram
SHACKLEFORD  SHACKELFORD      0.970   0.982   0.982   0.989   0.925
DUNNINGHAM   CUNNIGHAM        0.896   0.896   0.896   0.931   0.917
NICHLESON    NICHULSON        0.926   0.956   0.969   0.977   0.906
JONES        JOHNSON          0.790   0.832   0.860   0.874   0.000
MASSEY       MASSIE           0.889   0.933   0.953   0.953   0.845
ABROMS       ABRAMS           0.889   0.922   0.946   0.952   0.906
HARDIN       MARTINEZ         0.000   0.000   0.000   0.000   0.000
ITMAN        SMITH            0.000   0.000   0.000   0.000   0.000
JERALDINE    GERALDINE        0.926   0.926   0.948   0.966   0.972
MARHTA       MARTHA           0.944   0.961   0.961   0.971   0.845
MICHELLE     MICHAEL          0.869   0.921   0.938   0.944   0.845
JULIES       JULIUS           0.889   0.933   0.953   0.953   0.906
TANYA        TONYA            0.867   0.880   0.916   0.933   0.883
DWAYNE       DUANE            0.822   0.840   0.873   0.896   0.000
SEAN         SUSAN            0.783   0.805   0.845   0.845   0.800
JON          JOHN             0.917   0.933   0.933   0.933   0.847
JON          JAN              0.000   0.000   0.860   0.860   0.000
BROOKHAVEN   BRROKHAVEN       0.933   0.947   0.947   0.964   0.975
BROOK HALLOW BROOK HLLW       0.944   0.967   0.967   0.977   0.906
DECATUR      DECATIR          0.905   0.943   0.960   0.965   0.921
FITZRUREITER FITZENREITER     0.856   0.913   0.923   0.945   0.932
HIGBEE       HIGHEE           0.889   0.922   0.922   0.932   0.906
HIGBEE       HIGVEE           0.889   0.922   0.946   0.952   0.906
LACURA       LOCURA           0.889   0.900   0.930   0.947   0.845
IOWA         IONA             0.833   0.867   0.867   0.867   0.906
1ST          IST              0.000   0.000   0.844   0.844   0.947

Data and Matching Weights—Parameters

In this section, we describe the fields and the associated matching weights that are used in the record linkage decision rule. We do not give details of the EM algorithm or the assignment algorithm because they have been given elsewhere (Winkler, 1994). The fields used in the creation of a mailing list during the 1995 test census are first name, last name (surname), sex, month of birth, day of birth, year of birth, race, and Hispanic origin. The census file is linked with an update file.
These update files have been either IRS, driver's license, or school records. In the first pass, only pairs of records whose housing unit identifiers agreed were compared. The housing unit identifiers were calculated by the Census Bureau Geography Division's address standardization software. A housing unit identifier consists of a State Code, County Code, TIGER Line ID (e.g., a city block), Side ID (right or left), house number,

and apartment number. In the 1995 test census of Oakland, California, 95.0 percent of the test census file records were geocoded with a housing unit identifier. Also, 94.7 percent of the IRS file records for the corresponding area were geocoded with a housing unit identifier. The names were standardized at a 95.2 percent rate in the test census file and at a 99.0 percent rate in the IRS file.

Each field was assigned an agreement and a disagreement weight. (See Table 2.) Certain fields such as first name are assigned a higher agreement weight. Since matching was done within a household, surname carried less distinguishing power than first name. After initial trial runs and review of the output, the expectation-maximization (EM) software was run to produce the parameters for the test.

Table 2. —Parameters Used in Matching for the 1995 Test Census of Oakland, California

Parameter        Agreement Weight   Disagreement Weight
first                  4.3385            -2.7119
last (surname)         2.4189            -2.5915
sex                    0.7365            -3.1163
month                  2.6252            -3.8535
day                    3.5206            -2.9652
year                   1.7715            -4.1745
Hispanic               0.2291            -0.3029
race                   0.5499            -0.5996

String comparators were used only with first names and surnames. For example, if the first names were MARTHA and MARHTA, the matching weight would be computed as follows:

                     Jaro    Wink    McLa    Lynch
Comparator Value     0.944   0.961   0.961   0.971
Matching Weight      3.943   4.063   4.063   4.134

The piecewise linear function that uses the value returned by the different string comparators to adjust the matching agreement weight downward is detailed in Winkler (1990).

Results

Results are presented in two parts. In each part, the different string comparators are substituted into the string comparison subroutine of an overall matching system. The matching weights returned by the EM algorithm are held constant. Two different versions of a linear sum assignment procedure are used. For a description of the lsap, see Winkler (1994).
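The downweighting of the agreement weight by the comparator value can be approximated by a simple linear interpolation between the agreement and disagreement weights. This is a deliberate simplification: the actual function in Winkler (1990) is piecewise linear, so the sketch below only roughly reproduces the tabulated weights for the MARTHA/MARHTA example.

```python
def adjusted_weight(comparator_value, agree_wt, disagree_wt):
    """Move the weight from agree_wt toward disagree_wt as the
    comparator value falls below exact agreement (a linear
    simplification of the piecewise linear function in Winkler 1990)."""
    return agree_wt - (1.0 - comparator_value) * (agree_wt - disagree_wt)

# First-name weights from Table 2, with the Wink value for MARTHA/MARHTA.
w = adjusted_weight(0.961, 4.3385, -2.7119)
```

With the Winkler comparator value 0.961 this yields about 4.064, close to the 4.063 shown in the example above; exact agreement (comparator value 1.0) returns the full agreement weight.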
The main empirical data consist of three pairs of files having known matching status. In the first part, we show how much the string comparators can improve the matching results. The second part provides an overall comparison of matching methods that utilize various combinations of the new and old string comparators and the new and old assignment algorithms.

Exact Matching Versus String Comparator Enhanced Matching

In Table 3, we illustrate how much string comparators improve matching in comparison with exact matching. After ordering pairs by decreasing matching weight in the first and third of the empirical data files, we plot the proportion of false matches against the total number of pairs. We see that, if matching is adjusted via bigrams or the string comparators, then error rates are much lower than those obtained when exact matching is used. Since exact matching is not competitive, remaining results are presented only for the string comparators.

Table 3. —Matching Results at Different Error Rates: First Pair of Files with 4,539 and 4,859 Records; 38,795 Pairs Agreeing on Block and First Character of Last Name

Link Error Rate   Method    Link Match/Nonm   Clerical Match/Nonm
0.002             base          3172/6             242/64
                  s_c           3176/6             236/64
                  as            3176/6             234/64
                  os_l          3174/6             242/64
                  bigram        3224/7             174/63
0.005             base          3363/17             51/53
                  s_c           3357/17             55/53
                  as            3357/17             53/53
                  os_l          3364/17             52/53
                  bigram        3327/17             71/53
0.010             base          3401/34             13/36
                  s_c           3396/34             16/36
                  as            3396/34             14/36
                  os_l          3402/34             14/36
                  bigram        3376/34             22/36
0.020             base          3414/70              0/0
                  s_c           3411/70              0/0
                  as            3410/70              0/0
                  os_l          3416/70              0/0
                  bigram        3398/70              0/0

Overall Comparison of Matching Methods

The baseline matching is done under 3-class, latent-class models under the conditional independence assumption. The 3-class models are essentially the same ones used in Winkler (1994). In Table 4, results are reported for error rates of 0.002, 0.005, 0.01, and 0.02, respectively. Link, Nonlink, and Clerical (or Possible Link) are the computer designations; Match and Nonmatch are the true statuses. The baseline results (designated base) are produced using the existing lsap algorithm and the previous string comparator (see, e.g., Winkler, 1990) but use the newer 3-class EM procedures for parameter estimation (Winkler, 1994). The results with the new string comparator (designated s_c) are produced with the existing string comparator replaced by the new one. The results with the new assignment algorithm (designated as) use both the new string comparator and the new assignment algorithm. For comparison, results produced using the previous string comparator but with the new assignment algorithm (designated os_l) are also given. Finally, results using the bigram adjustments are denoted bigram.

Table 4. —Matching Results at Different Error Rates: Second Pair of Files with 5,022 and 5,212 Records; 37,327 Pairs Agreeing on Block and First Character of Last Name

Link Error Rate   Method    Link Match/Nonm   Clerical Match/Nonm
0.002             base          3475/7              63/65
                  s_c           3414/7             127/65
                  as            3414/7             127/65
                  os_l          3477/7              63/65
                  bigram        3090/7             461/66
0.005             base          3503/18             35/54
                  s_c           3493/18             48/54
                  as            3493/18             48/54
                  os_l          3505/18             36/54
                  bigram        3509/18             42/55
0.010             base          3525/36             13/36
                  s_c           3526/36             15/36
                  as            3526/36             15/36
                  os_l          3527/36             14/36
                  bigram        3543/36              8/73
0.020             base          3538/72              0/0
                  s_c           3541/72              0/0
                  as            3541/72              0/0
                  os_l          3541/72              0/0
                  bigram        3551/73              0/0

As Table 5 shows, matching efficacy improves if more pairs can be designated as links and nonlinks at fixed error-rate levels. In Tables 3–5, computer-designated links and clerical pairs are subdivided into (true) matches and nonmatches. Only the subset of pairs produced via 1–1 assignments is considered. In producing the tables, pairs are sorted by decreasing weight. The weights vary according to the different model assumptions and string comparators used. The numbers of pairs above different thresholds at different link error rates (0.002, 0.005, 0.01, and 0.02) are presented. False-match error rates above 2 percent are not considered because the sets of pairs above the cutoff threshold contain virtually all of the true matches from the entire set of pairs when error rates rise to slightly less than 2 percent. In each line, the proportion of nonmatches (among the sum of all pairs in the Link and Clerical columns) is 2 percent.

Table 5. —Matching Results at Different Error Rates: Third Pair of Files with 15,048 and 12,072 Records; 116,305 Pairs Agreeing on Block and First Character of Last Name

Link Error Rate   Method    Link Match/Nonm   Clerical Match/Nonm
0.002             base          9696/19            155/182
                  s_c           9434/19            407/182
                  as            9436/19            406/182
                  os_l          9692/19            157/182
                  bigram        9515/19            335/182
0.005             base          9792/49             59/152
                  s_c           9781/49             60/152
                  as            9783/49             57/152
                  os_l          9791/49             58/152
                  bigram        9784/49             66/152
0.010             base          9833/99             18/102
                  s_c           9822/99             19/102
                  as            9823/99             17/102
                  os_l          9831/99             18/102
                  bigram        9823/99             27/102
0.020             base          9851/201             0/0
                  s_c           9841/201             0/0
                  as            9842/201             0/0
                  os_l          9849/201             0/0
                  bigram        9850/201             0/0

The results generally show that the different string comparators improve matching efficacy. In all of the best situations, error levels are very low.
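The tabulation procedure described above, sorting candidate 1–1 pairs by decreasing weight and counting computer-designated links against true match status at a chosen link error rate, can be sketched as follows. The data here are hypothetical toy pairs, not the empirical files; the function is an illustrative reconstruction of the counting, not the production tabulation code.

```python
def tally_links(pairs, max_error_rate):
    """pairs: (weight, is_true_match) tuples. Sort by decreasing
    weight and extend the link region while the proportion of
    false matches (nonmatches) stays within max_error_rate.
    Returns (matches, nonmatches) among the designated links."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    links, false_links = 0, 0
    best = (0, 0)
    for weight, is_match in ranked:
        links += 1
        if not is_match:
            false_links += 1
        if false_links / links <= max_error_rate:
            best = (links - false_links, false_links)
    return best

# Hypothetical (weight, true-match-status) pairs for illustration.
pairs = [(4.1, True), (3.9, True), (3.5, True),
         (3.2, False), (2.8, True), (1.0, False)]
```

Lowering the allowed error rate shrinks the link region, which is exactly the trade-off displayed across the rows of Tables 3–5.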
The new string comparator produces worse results than the previous one (see, e.g., Winkler, 1990), and the new assignment algorithm (when combined with the new string comparator) performs slightly worse (between 0.1 and 0.01 percent) than the existing string comparator and lsap algorithm. In all situations (new or old string comparator), the new assignment algorithm slightly improves matching efficacy.

To test the effect of the Winkler variant of the Jaro string comparator and of bigrams on more recent files, we use 1995 test census files from Oakland, California. (See Table 6.) The match rates were as follows. In the first matching pass, we used only pairs of records that agreed on housing unit ID. Records that were not matched were processed in a second pass. Blocking during the second pass was on house number and first character of the first name. The results generally show that either string comparator produces good results. The variant of the Jaro string comparator yields a slightly smaller clerical review region.

Table 6. —First Pass—Housing Unit Identifier Match: Matching Results of a Pair of Files with 226,713 and 153,644 Records, Respectively

                             Jaro String Comparator        Bigram
                             Links      Clerical      Links      Clerical
                             78814      5091          78652      5888
Estimated false match rate   0.1%       30%           0.1%       35%

Second Pass—House Number and First Character of First Name: Matching Results of a Pair of Files with 132,100 and 64,121 Records, Respectively

                             Links      Clerical
                             16893      7207
Estimated false match rate   0.3%       40%

Summary and Conclusion

Application of the new string comparator functions can improve matching efficacy in files having large amounts of typographical error. Since many administrative files typically have high typographical error rates, the string comparators can yield increased accuracy and reduced costs in the matching of administrative lists and censuses.

References

Budzinsky, C.D. (1991). Automated Spelling Correction, Statistics Canada.

Dempster, A.P.; Laird, N.M.; and Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39, 1–38.

Fellegi, I.P. and Sunter, A.B. (1969).
A Theory for Record Linkage, Journal of the American Statistical Association, 64, 1183–1210.

Frakes, W.B. and Baeza-Yates, R. (eds.) (1992). Information Retrieval: Data Structures and Algorithms, Upper Saddle River, NJ: Prentice-Hall PTR.

Hall, P.A.V. and Dowling, G.R. (1980). Approximate String Comparison, Computing Surveys, 12, 381–402.

Jaro, M.A. (1989). Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association, 84, 414–420.

Lynch, M.P. and Winkler, W.E. (1994). Improved String Comparator, Technical Report, Statistical Research Division, Washington, DC: U.S. Bureau of the Census.

McLaughlin, G. (1993). Private Communication of C String-Comparison Routine.

Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford: Oxford University Press.

Newcombe, H.B.; Kennedy, J.M.; Axford, S.J.; and James, A.P. (1959). Automatic Linkage of Vital Records, Science, 130, 954–959.

Pollock, J. and Zamora, A. (1984). Automatic Spelling Correction in Scientific and Scholarly Text, Communications of the ACM, 27, 358–368.

Winkler, W.E. (1985). Preprocessing of Lists and String Comparison, Record Linkage Techniques—1985, W. Alvey and B. Kilss (eds.), U.S. Internal Revenue Service, Publication 1299, 181–187.

Winkler, W.E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 354–359.

Winkler, W.E. (1994). Advanced Methods for Record Linkage, Technical Report, Statistical Research Division, Washington, DC: U.S. Bureau of the Census.

Winkler, W.E. and Thibaudeau, Y. (1991). An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census, Statistical Research Division Report 91/09, Washington, DC: U.S. Bureau of the Census.
*This paper reports general results of research by Census Bureau staff. The views expressed are attributable to the authors and do not necessarily reflect those of the U.S. Bureau of the Census.
