Read "Development of a Comprehensive Approach for Serious Traffic Crash Injury Measurement and Reporting Systems" at NAP.edu



Suggested Citation:"7 Roadmap to Linkage." National Academies of Sciences, Engineering, and Medicine. 2021. Development of a Comprehensive Approach for Serious Traffic Crash Injury Measurement and Reporting Systems. Washington, DC: The National Academies Press. doi: 10.17226/26305.



7 Roadmap to Linkage

The previous section described the necessary components of a comprehensive statewide data system with linkages to crash and including medical outcome (among other datasets). However, the process of putting these elements in place is complex and challenging. In this section, we present a process, or roadmap, by which states can reach the goal of comprehensively measuring serious injury in crashes at the state level through linked datasets. The contents of this roadmap are based on interviews and discussions with staff of state agencies and researchers working with state data. Most of the ideas in the roadmap have been tried and/or are in use in at least one state. In addition, we present reasonable alternatives that allow states to choose what works best given their circumstances.

7.1 Roadmap Overview

Table 11 provides an overview of the steps in the roadmap. Each step is detailed in a subsection below.

Table 11. Summary of Steps to Linkage

| Step | Key Goal |
|---|---|
| 1: Arrange Collaboration Among Relevant Agencies | Facilitate critical communication and decision-making pathways; create a data linkage project group; identify the motivation for and benefits of participation for each group. |
| 2: Catalog Available Databases | Know coverage, contents, schema, inclusion criteria, and potential problems with each database before trying linkage. |
| 3: Determine Databases to Be Linked | Make a plan for the order in which databases will be linked. |
| 4: Identify the Identifiers | Know what is available in existing databases to aid linkage. |
| 5: Determine Linkage Mechanism | For each pair of databases to be linked, choose a mechanism for that linkage. Consider facilitating linkages among more than two databases. |
| 6: Determine Database Storage Mechanism | Data must be stored and managed, and access must be protected. |
| 7: Harmonize Common Data Elements | Common data elements in linked files must conform to a common schema. |
| 8: Set Up a Pilot Project | Testing on a small scale helps find problems to fix. |
| 9: Set Up a Sampling Program (Optional but strongly recommended) | Provides pre-linkage ability to measure serious injuries and a way of testing linkage approaches as they are developed. |
| 10: Set Up Statewide Linkage | Pilot project must eventually be launched statewide. |

7.2 Step 1: Arrange Collaboration Among Relevant Agencies

In most of the states we interviewed, the TRCC was the group that motivated, initiated, facilitated, and often provided funds to launch data linkage activities. Representation of all relevant agencies, especially the combination of public health and crash/roadway agencies, is necessary for a successful program. In some states, Memoranda of Understanding were signed by agency officials to commit to supporting linkage-related activities specifically. In other states, the agreements to support linkage were less formal, but many of the people we interviewed talked about how important multi-agency participation and advance buy-in were to the process.

To engage the different agencies that need to participate for a successful program, it is important to start by identifying the motivations and potential benefits of linkage. Some of these may be specific to one agency (e.g., tying diagnosis-based injury outcome to roadway features), while others may serve the greater good by enabling research and improving assessment of many different countermeasures. The motivation provided by MAP-21 is not sufficient to sustain an effective multi-agency linkage program. Moreover, knowing what each group hopes to get out of a linked data system will help in prioritizing development of the various components of the system.

At a practical level, the multi-agency discussion around the linkage process should include a number of specific issues that will have to be resolved for a successful program. First, agencies will need to identify what each will put in (e.g., money, staff resources, data), and what each wants to get out (e.g., data access, specific reports, new data elements). Dataset ownership and rules for access must be worked out. Data access rules must comply with HIPAA, state law, and agency policy.
As a result, there should be a legal review early in the process to determine whether any state laws or agency policies need to be changed. This has been necessary in some states and, since the process may be slow, it should be addressed early.

7.3 Step 2: Catalog Available Databases

Since the first requirement of a linked data system is statewide datasets in good condition, the second step in the data linkage roadmap is to catalog the available datasets. Although crash linked to medical outcome is necessary to meet the goal of measuring serious injuries in crashes, the cataloguing process should include a much wider variety of datasets. A useful goal might be to measure the comprehensive cost of crashes in the state. This would promote inclusion of many datasets that address costs (e.g., Medicare, roadway asset management and repair) in addition to datasets that focus on what happened (e.g., medical outcome and crash data). The items to catalog for each dataset are listed below.

7.3.1 Data Dictionary

As described earlier, a data dictionary, or schema, is essentially the contents, or codebook, of the dataset. It includes the relational structure of any tables, all variable names, values each variable can take on, and the meaning of numeric codes.

7.3.2 Inclusion Criteria

Inclusion criteria describe how cases are selected to be in the database (or not). Inclusion criteria are critical in linkage because they influence what cases can be in linked datasets and they influence results of any analysis. For example, if the medical outcome dataset is a trauma dataset, then only those occupants with injuries requiring hospital admission and meeting trauma inclusion criteria will be available for matching. This should make it possible to measure MAIS 3+ injuries fairly well, but will not facilitate measurement of all injury costs or of less serious injuries not necessitating hospital admission. Inclusion criteria may also restrict the dataset to certain reporting units, such as Level 1 and 2 trauma centers. These centers may be more prevalent in urban areas and therefore may bias the state dataset towards events that occur within their catchment areas.

In general, among hospital datasets, trauma registries have the most restrictive inclusion criteria, followed by hospital discharge and then ED. In the long run, linkage to ED data, in addition to hospital and/or trauma data, will be key to broad determination of the injury outcomes of people involved in crashes. The order in which a state chooses to incorporate linkages will depend on the condition of each of the databases.

EMS datasets, by definition, include only cases that are transported by ambulance or other emergency services. Thus, linkage via EMS will not capture those who are transported by private vehicle. As with a trauma registry, EMS linkage will likely capture almost all MAIS 3+ cases, but a future desire to analyze less serious injuries may require direct ED-to-crash linkage in addition to EMS-based linkage.

Crash datasets also have inclusion criteria, and in some states, the criteria include a requirement of injury to one or more persons involved in the crash. A stricter police-reporting requirement will change the nature of the included crashes in both the original and linked crash datasets. This is not likely to be an issue for measuring serious injuries, but might become relevant if injury measurement is eventually broadened to include all injuries.

Finally, other datasets, such as licensing and driver history, will include only those licensed within the state. Out-of-state drivers will not be linkable within a state's databases without some additional agreements with neighboring states.
7.3.3 Coverage

Coverage, described above, reflects the proportion of possible cases in a state that are available in the statewide dataset. For example, some entities may not successfully report to the state, so they will be left out of the state database. Often, this is a small percentage of cases, but they may be systematically biased towards certain types of entities (e.g., small or rural). As a result, the dataset will be biased to some degree, which needs to be assessed and accounted for in analysis.

7.3.4 Quality Control

QC was discussed in the previous section, but in this step, quality issues need to be catalogued. In particular, problems should be evaluated in terms of their specific potential impact on the analysis of serious injuries in crashes. For example, if crash location is used to enable linkage (e.g., to EMS data), then substantial missingness in that data element will hamper the linkage process. In contrast, if specific driver distraction is missing, then only driver-distraction-related analyses will be affected. In other cases, some jurisdictions may apply codes differently than others, biasing results of analysis and indicating the need for better training.

Cataloguing quality issues is critical because these problems compound in the linkage process. Only cases that have all necessary identifiers and other key data elements in all datasets can be linked and used in analyses.

One key data element that should be mentioned specifically is the external-cause code required by hospital datasets. Separating MVC cases from the large number of other cases in trauma, hospital, and ED datasets is critical for linkage. External-cause code coverage and consistency should be carefully evaluated at the cataloguing step, and issues in its use should be addressed early in the process.

7.4 Step 3: Determine Databases to Be Linked

The goal of comprehensiveness argues for a data collection and linkage process that maximizes linkages between different datasets as well as successful linkages among cases that are expected to link. This ambitious goal is probably best achieved by a step-by-step process of adding databases and linkages. Thus, this step should involve setting up a long-term plan for adding linkages over time.

Those planning the linkage process should take a number of factors into consideration. First, the condition and coverage of databases need to be good enough to support linkage at the state level. Thus, some databases may not be initial candidates until they are improved. Second, the utility of a linkage should be considered. What questions can be answered through this linkage? Are they high-priority and high-impact? Third, the amenability of datasets to linkage should be considered in planning (though difficult linkages should not necessarily be developed last). Considerations should include the compatibility of data elements, presence/absence of common identifiers, and the amenability of the existing data structure to linkage.

A good starting point for tying injury outcome to crash is linkage from crash to EMS to trauma registry. National standardization of EMS databases through NEMSIS, as well as the physical presence of the ambulance at the crash scene and at the hospital location, facilitates linkages between EMS and both crash and trauma registries. Trauma registries commonly include AIS coding and tend to capture nearly all MAIS 3+ cases, making them easiest to work with.

As described earlier, one challenge for EMS linkage is that most EMS databases do not include a patient-specific identifier that tracks transfers.
Thus, a patient who is taken by ambulance to a local hospital and is then transferred by ambulance to a trauma center for more specialized treatment will typically have two unlinked entries in the EMS dataset. Crash data will link to the first EMS transport and the first hospital, and trauma registry will link to the second transport. Linking the two is possible, but requires an extra process. Unfortunately, this issue will tend to affect some cases more than others: more seriously injured patients and patients in more rural locations are more likely to be transferred and therefore may fail to link via EMS when the within-EMS patient identifier is not handled.

There are at least two solutions to the patient transfer problem. One is to identify transfer cases in the EMS dataset so that a patient is tracked across runs. This can be accomplished by completing an "internal linkage". In other words, using common elements within the EMS dataset it is possible to identify (i.e., link) all of the EMS resources reporting data for the same patient. Another approach is to implement two linkages: one with EMS as an intermediate step and another direct linkage between crash and trauma registry. This allows checking of the linked cases from two directions and could aid in solving the transfer problem.

To expand the medical outcome linkages, crash-EMS-trauma could be followed by successively expanding the inclusiveness of medical databases (e.g., adding hospital discharge and ED datasets). Each more inclusive hospital dataset will introduce new challenges, including the need to handle ICD codes, but will also expand the breadth of crash-related injuries that can be analyzed.

Roadway linkage to crash is generally more straightforward and has been implemented in most states. Once crash is linked to injury outcome, this information can be brought through to the roadway database to facilitate analysis of roadway safety as it relates to injury as well as fatality.

Similarly, driver license and history databases are generally already linked and can be tied to injury once injury is available in the crash dataset. It should be noted here that linkage to a state license file will only include in-state drivers. In some states, there is a reasonably large population of out-of-state drivers in crashes, and the inability to link to these drivers might be noted in this step. Solutions include: 1) setting up agreements with neighboring states, and 2) a national driver license and history dataset.

In general, multi-directional linkage among relevant databases is the ideal goal. Linkage from crash to EMS, EMS to hospital, and hospital to crash allows for assessment of many aspects of the cost of crashes. Hospital data with good-quality external-cause codes will represent the universe of crash-related hospital admissions. EMS data often include injury cases that do not have an associated police report. These can help assess how police are using inclusion criteria for police reports and the extent to which police are called in for different types of crashes. For example, pedestrian and bicycle crashes might be underreported when considering only police crash reports.

7.5 Step 4: Identify the Identifiers

State databases that need to be linked were, in most cases, developed independently and prior to thoughts of linkage. As a result, common identifiers may or may not be present in databases to be linked. Once datasets have been catalogued, a high-level linkage schema should be developed to understand how databases are related to each other. This schema will call out the linkages that require a linkage mechanism, and then each linkage can be addressed. An example of such a linkage schema for Michigan is shown in Figure 6. In Figure 6, each line between data tables must have some means of identifying common cases.
The linkages among tables within the crash database are defined in the crash database schema. However, the linkage between the "Person" table in the crash database and the "Medical Record" table in the medical database must be addressed. As a starting point, identifiers in the "Person" table and identifiers in the "Medical Record" table must be selected.

Figure 6. Example linkage schema from Michigan
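A high-level linkage schema of this kind can be recorded in a form that makes the unresolved linkages easy to list. The following is a minimal sketch under stated assumptions, not a description of Michigan's system: only the "Person" and "Medical Record" table names come from the text, while the `Link` class, the "Crash" and "Vehicle" tables, and the key names `crash_id` and `vehicle_no` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One line in a linkage schema: two tables and the keys they share."""
    left: str
    right: str
    shared_keys: tuple = ()  # empty when no common identifier exists yet


def links_needing_mechanism(links):
    """Return the links that have no common identifier and so need a mechanism."""
    return [link for link in links if not link.shared_keys]


# Hypothetical schema: within-database links share keys, the cross-database link does not.
schema = [
    Link("Crash", "Vehicle", ("crash_id",)),
    Link("Vehicle", "Person", ("crash_id", "vehicle_no")),
    Link("Person", "Medical Record"),
]

for link in links_needing_mechanism(schema):
    print(f"{link.left} -> {link.right}: choose a linkage mechanism")
```

Listing the links this way simply separates the connections already defined by a database schema from those that Step 5 has to address.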

7.6 Step 5: Determine Linkage Mechanisms

If common, unique identifiers are present in datasets to be linked, then the process is straightforward. However, for linkages across different databases, such as the "Person" to "Medical Record" linkage in Figure 6, common unique identifiers will usually not be present. This section presents a variety of linkage options from which states can choose, along with their pros and cons.

7.6.1 Adding Identifiers After the Fact

The first set of options for identifiers are those that are added to databases after the data are collected. These require datasets to have enough data elements in common that the linkage can be done after the fact. Typically, these approaches are less comprehensive and precise than implementing a process for assigning and passing identifiers in the original records (i.e., at the scene or at/near the time of the event), but they can be easier logistically.

7.6.1.1 Probabilistic Linkage

From 1992-2012, the NHTSA funded the CODES program, which provided both funding and technical support to a set of CODES states to work on linkage of their own data systems. Since 2012, a number of CODES states have continued to self-fund their programs. One of the key features of the NHTSA-funded CODES program was the CODES Technical Assistance Center, which developed statistical methods and software to support the probabilistic linkage process. In addition to the large volume of content work, the CODES program produced a number of papers focused on improvements to and understanding of probabilistic linkage (e.g., Cook, Olson & Dean, 2001). In addition, the statistical work was implemented in software that is now commercially available as LinkSolv. The papers, analytical reports, and software provide a large storehouse of knowledge and applications of data linkage. However, the loss of centralized technical assistance has left states to implement programs on their own (seeking support individually).
This issue will be discussed further later in the report.

Probabilistic linkage is the process of using common (but non-unique) variables in a pair of datasets to compute the likelihood that a case in one dataset refers to the same person as a case in another dataset. A patient in a trauma dataset can, for example, have non-zero probability of matching more than one case in a crash dataset. The technical approach to probabilistic linkage is described elsewhere (e.g., Fellegi & Sunter, 1969; Jaro, 1995). Although it is not necessary to understand all of the technical details of the method, there are certain key concepts that have an effect on how states might go about using this method to link datasets.

The basic idea of probabilistic linkage is the assignment of a match weight to each possible pair of cases drawn from the two datasets. Suppose, for example, that a state has a crash dataset with 100,000 cases and an EMS dataset with 200,000 cases. There are 100,000 × 200,000 = 20,000,000,000 possible pairs. The match weight for a given pair of cases is the sum of the individual match weights for each of the variables that are used in the linkage. If two cases match on the value of a particular variable, then that variable's contribution to the match weight is given in Equation 1.

$$w_i = f\left(\frac{m_i}{u_i}\right) \qquad (1)$$

where $m_i$ is the probability that the values of variable $i$ match given that the cases refer to the same person, $u_i$ is the probability that the values of variable $i$ match given that the cases do not refer to the same person, and $f$ is a monotonic function, typically $\log_2$.

If the two cases do not match on the value of a particular variable, then the contribution to the total weight is given in Equation 2.

$$w_i = f\left(\frac{1 - m_i}{1 - u_i}\right) \qquad (2)$$

In Equations 1 and 2, the value of m essentially represents the quality of the variable. In theory, a true match should have the same value of each variable with probability 1. However, data entry errors and missing data generally cause mismatches with some non-zero probability, so m can be somewhat smaller than 1. The denominator component, u, reflects the tendency of a variable to match at random, or the discrimination ability of the variable. For example, sex has only two values and will match at random about 50% of the time. This creates a large denominator in Equation 1 and results in a small contribution to the match weight. By comparison, birthday (without year) has 366 possible values and thus will match approximately 1 out of 366 times at random (u ≈ 1/366).

In practice, the match weight is specific to the value that does or does not match. This allows for a more nuanced assessment of the information value of each match (or non-match) in helping to determine whether two cases match. For example, if a state has 10 counties, and 80% of crashes occur in one county, then a match on that county is not as informative as a match in another county. The value of u for the common county is much greater than for other counties and results in smaller contributions to the total match weight. Similarly, matching on a common last name, such as Smith, is less informative than matching on an uncommon last name.

For computational efficiency, many possible pairs are eliminated from consideration based on a blocking variable such as time of crash (within a certain time window). Nonetheless, once total match weight has been computed for all pairs under consideration, a histogram of total match weight should ideally result in two peaks.
One peak, with very low match weights, will contain the clear non-matches, and another with high match weights will contain clear matches. The cases in between represent "possible" matches. It is important to keep in mind that match weight is assigned to each considered pair, so a specific case in one dataset may have a "possible" match to more than one case in the other dataset.

To understand the consequences of the matching process, Figure 7 and Figure 8 illustrate two different matching scenarios. Figure 7 shows a hypothetical example of distributions of match weights for actual matches (blue) and actual non-matches (pink) in the case where match quality is generally high. Figure 8 shows a hypothetical example where match quality is generally low. In Figure 7, the matches and non-matches are easily distinguished and there is little uncertainty in the areas between the two distributions. In Figure 8, there is a large range of uncertainty in which a given match weight could arise from either a true match or a true non-match.
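The weight calculation in Equations 1 and 2, the value-specific treatment of u, and blocking on time can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the CODES or LinkSolv implementation: f is taken to be log2, value-specific u is approximated by the relative frequency of the value, a disagreement uses the overall chance-agreement rate, missing values contribute nothing, and all function names, field names, and m values are hypothetical.

```python
import math
from collections import Counter


def agreement_weight(m, u):
    """Equation 1 with f = log2: contribution when a variable agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Equation 2 with f = log2: contribution when a variable disagrees."""
    return math.log2((1 - m) / (1 - u))


def value_frequencies(values):
    """Relative frequency of each value, used here as a value-specific u."""
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}


def pair_weight(rec_a, rec_b, m_probs, freqs):
    """Total match weight for one candidate pair (sum over variables)."""
    total = 0.0
    for var, m in m_probs.items():
        a, b = rec_a.get(var), rec_b.get(var)
        if a is None or b is None:
            continue  # assumption: a missing value adds no evidence either way
        if a == b:
            total += agreement_weight(m, freqs[var][a])
        else:
            # assumption: use the overall chance-agreement rate for disagreements
            u_overall = sum(f * f for f in freqs[var].values())
            total += disagreement_weight(m, u_overall)
    return total


def blocked_pairs(crashes, ems_runs, window_minutes=60):
    """Blocking: keep only pairs whose times fall within the window."""
    return [
        (c, e)
        for c in crashes
        for e in ems_runs
        if abs(c["minute"] - e["minute"]) <= window_minutes
    ]


# Hypothetical data: one county holds 80% of records, as in the example above.
counties = ["Metro"] * 80 + ["Rural%d" % i for i in range(10) for _ in range(2)]
freqs = {"county": value_frequencies(counties), "sex": {"F": 0.5, "M": 0.5}}
m_probs = {"county": 0.95, "sex": 0.98}

common = agreement_weight(m_probs["county"], freqs["county"]["Metro"])
rare = agreement_weight(m_probs["county"], freqs["county"]["Rural3"])
print(round(common, 2), round(rare, 2))
```

With these made-up numbers, agreement on the common county contributes about 0.25 to the total weight while agreement on a rare county contributes about 5.57, which is the sense in which a match on a common value is less informative.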

Figure 7. Hypothetical high-quality linkage (distributions labeled Non-Matches and Matches; horizontal axis: increasing match weight)

Figure 8. Hypothetical low-quality linkage (distributions labeled Non-Matches, Possible Matches, and Matches; horizontal axis: increasing match weight)

When match quality is low enough, some additional process must be used to account for the uncertainty in the "possible linkage" category. Three approaches include: 1) Inspecting each possible match by hand; 2) Selecting all matches of specific quality; or 3) Multiple imputation.

The first approach, hand-inspection of "possible matches", is recommended by Jaro (1995). If there are few matches in this category, the method is potentially feasible. However, in most cases, the number of inspections needed may be too high to be manageable.

The second approach is appealing in its simplicity and because it results in a single linked dataset. To incorporate matches into a data warehouse or other integrated data system, it is necessary to have no more than one match per case. Some states achieve this by requiring a match on each of a complete set of variables. Others keep only matches above a certain cutoff match weight (which can adjust for missing data). Either way, the cutoff approach simplifies the resulting dataset and analysis, but unless the dataset has very high-quality matching variables with low missingness, the resulting dataset will tend to be biased towards rare events (i.e., unique combinations of match variables).

The tendency for high-weight matches to be biased is a critical issue if states are to use probabilistic linkage as the primary mechanism for measuring serious injury in crashes. Match weights inherently measure the informativeness of a particular case-pair's set of variable values (matches and non-matches). The numerator of a match weight component is at most 1 and is only affected by data quality. However, the denominator represents the probability of a match by chance. Rare values are less likely to match by chance, resulting in smaller denominators and larger contributions to match weight. Thus, these values have a greater likelihood of ending up at the right end of Figure 8 and being included in a dataset where only high-probability matches are kept.

The consequence to states of retaining a biased linked dataset is that rare events show up in the resulting metrics with greater probability. For example, in the scenario where one county has 80% of crashes, that county is more common and less likely to show up in the censored dataset (high-weight matches only). When counting serious injuries by county, injuries in the populous county will be undercounted relative to those in less populous counties.

The third, more statistically rigorous, alternative is to use multiple imputation (McGlincy, 2004). Multiple imputation (MI) produces a small set of parallel datasets (3-5) in which a matching row in the second dataset is selected at random for each row in the first dataset (no match may also be selected). Analysis is done in parallel on the datasets and the results are combined. The key benefit of MI is that it accounts for the uncertainty introduced by the matching process and reduces or eliminates bias in that way.
In particular, it decreases false negatives (cases that should have matched but did not), which is a key problem in the use of a fixed cutoff. The software developed for the CODES program handles the MI process.

However, a critical disadvantage of the MI approach to handling lower-quality linkage is that it does not lend itself to producing a single linked dataset with one match per case. It is possible to produce a single imputation for use as a linked dataset, but results of analysis will be influenced by unusual random selections of links. In addition, the means of choosing imputed links (McGlincy, 2004) does not guarantee one link per case, since imputation is at the level of a pair of cases and each case can be evaluated as part of several pairs.

If a state chooses to use probabilistic linkage, it will be vitally important to use metrics to assess the overall quality and potential bias in the resulting dataset. Using a single imputed dataset and taking matches above a cutoff have different pros and cons and either might be used to produce one dataset. However, the ideal solution is to have sufficient separation of the matching and non-matching groups. This quality depends on the particulars of the dataset and the variables being used, so a standard set of metrics for linkage quality would be helpful to states trying to determine whether additional identifiers are needed. In particular, some of the identifiers discussed in the next section go hand-in-hand with probabilistic linkage and should improve match quality.

Once probabilistic linkage has produced a matched dataset (of sufficient quality), the linked cases can be assigned a numeric identifier that is inserted in the separate databases. This allows the matches to be recreated on the fly when data are accessed. Thus, the linkage process can be carried out on new data once (updated at regular intervals) and the results can be used by anyone with access to the linked datasets.
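The cutoff approach and the subsequent assignment of a numeric link identifier can be sketched as follows. This is a minimal illustration under stated assumptions rather than any state's procedure: the greedy highest-weight-first rule for enforcing one match per case is an assumption, and the function names, record IDs, weights, and cutoff value are all hypothetical.

```python
def select_links(scored_pairs, cutoff):
    """Keep pairs at or above the cutoff, allowing at most one link per case.

    scored_pairs: iterable of (crash_id, medical_id, match_weight).
    Assumption: ties between competing pairs are resolved greedily,
    taking the highest-weight pair first.
    """
    chosen = []
    used_crash, used_medical = set(), set()
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)
    for crash_id, medical_id, weight in ranked:
        if weight < cutoff:
            break  # every remaining pair is below the cutoff
        if crash_id in used_crash or medical_id in used_medical:
            continue  # one of the two cases is already linked
        chosen.append((crash_id, medical_id, weight))
        used_crash.add(crash_id)
        used_medical.add(medical_id)
    return chosen


def assign_link_ids(links, start=1):
    """Give each accepted link a numeric ID that both databases can store."""
    crash_to_link, medical_to_link = {}, {}
    for link_id, (crash_id, medical_id, _weight) in enumerate(links, start=start):
        crash_to_link[crash_id] = link_id
        medical_to_link[medical_id] = link_id
    return crash_to_link, medical_to_link


# Hypothetical scored pairs: crash person C1 is a candidate for two medical records.
pairs = [("C1", "M1", 12.0), ("C1", "M2", 9.5), ("C2", "M2", 8.0), ("C3", "M3", 2.0)]
links = select_links(pairs, cutoff=6.0)
crash_ids, medical_ids = assign_link_ids(links)
print(links)
print(crash_ids, medical_ids)
```

Because every pair below the cutoff is dropped, a sketch like this reproduces the limitation discussed above: a true match with a low weight, such as one built only from common values, is lost.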

In general, probabilistic linkage allows states that have not previously incorporated common identifiers to link datasets going back a number of years. However, the complexity of the process and the potential for producing a biased linked dataset argue for two things: First, there is a strong need for a technical assistance center at the national level that can help states assess the quality of linkage mechanisms, including but not limited to probabilistic linkage, and assess the quality of the linked dataset. Second, probabilistic linkage should ideally be viewed as an intermediate step on a path to incorporating an on-scene identifier (see Section 7.6.2).

7.6.1.2 Hand Linkage

Hand linkage is a form of after-the-fact linkage done by a human, but made logistically feasible by software. This approach, employed in Kansas to link EMS to trauma registry data, uses software to select a small set of EMS runs that are potential matches to a single trauma case. The potential matches are selected based on the timing of the EMS run and the destination hospital, and the trauma registrar selects the correct match by looking at name, address, birthdate, gender, and other identifying information in both records. Once the match is selected, a common identifier can be pushed into both the trauma and EMS databases to enable future linking of de-identified information.

This approach has certain advantages. First, matching can make use of the ability of the human to know what names are likely to be matches even when nicknames or different spellings are used. (Name-based matching is a challenge for probabilistic linkage software.) Second, the workload of hand-matching, which would normally make the process infeasible, is spread among trauma registrars, whose job is data entry and data management. The list of potential matches is made as short as possible through software intelligence, and then the final decision is made by the human.
Finally, the trauma registrar is, by definition, allowed to see PII on patients, but once cases are matched, PII can be removed.

There are certain disadvantages to this approach as well. First, it is difficult to assess the matching algorithm or success rate for a human match process. Different registrars may have different criteria for accepting an apparent match, and the overall error rate is unknown without a separate study (which might be warranted). Second, Kansas was able to successfully motivate their trauma registrars to do this because some EMS data are needed in the trauma dataset as well, and these data elements are pushed automatically once the link is made. However, this motivating benefit may not exist for all datasets the state might want to link (e.g., crash), so there must be some consideration for the additional burdens on the trauma registrars (or any other personnel used to hand-link).

7.6.2 Assigning Identifiers at the Time of the Event

The alternative to identifying matches after the fact is to assign some type of identifier at the time of the event, ideally on-scene, and pass that identifier among responding agencies. This group of approaches is more logistically challenging to implement, but once implemented, should allow linkage without introducing bias and without the technical challenges of the after-the-fact approaches. We present several classes of on-scene identifiers below.

7.6.2.1 Event-Specific

An event-specific identifier is one that specifies the crash event, but does not separately identify the people involved in that event. This further identification would have to be done using one of the after-the-fact approaches, but by limiting possible matches to only those people involved in a single crash event, the matching process should be more successful.
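The effect of an event-specific identifier on the matching problem can be shown with a small sketch: candidate pairs are formed only among records that carry the same event identifier, and an after-the-fact method then has far fewer pairs to resolve. The field names (`event_id`, `person_id`, `patient_id`) and the sample records are hypothetical.

```python
from collections import defaultdict


def candidates_within_event(crash_persons, ems_patients):
    """Pair crash occupants with EMS patients only inside the same event."""
    patients_by_event = defaultdict(list)
    for patient in ems_patients:
        patients_by_event[patient["event_id"]].append(patient)

    pairs = []
    for person in crash_persons:
        # .get avoids creating empty entries for events with no EMS record
        for patient in patients_by_event.get(person["event_id"], []):
            pairs.append((person["person_id"], patient["patient_id"]))
    return pairs


# Hypothetical records: event E1 has two occupants but only one transported patient.
crash_persons = [
    {"event_id": "E1", "person_id": "P1"},
    {"event_id": "E1", "person_id": "P2"},
    {"event_id": "E2", "person_id": "P3"},
]
ems_patients = [
    {"event_id": "E1", "patient_id": "X1"},
    {"event_id": "E2", "patient_id": "X2"},
    {"event_id": "E3", "patient_id": "X3"},
]

within_event = candidates_within_event(crash_persons, ems_patients)
all_pairs = len(crash_persons) * len(ems_patients)
print(len(within_event), "candidate pairs instead of", all_pairs)
```

In event E1 the identifier narrows the field to two occupants for one patient, but deciding which occupant was transported still requires person-level information such as age or sex.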

The advantage of using an event-specific identifier is logistical simplicity. One approach is to pass either the EMS run number or the crash report number (or both) between agencies, ideally at the scene. In the near future, it may be possible to use vehicle-to-vehicle (V2V) communication, based on dedicated short-range communication (DSRC), to pass the event number automatically to other responding units. Bettisworth et al. (2015) describe some of the data issues, including meeting HIPAA requirements, that need to be addressed to make DSRC data useful for emergency response applications. Solutions to these issues will facilitate the use of V2V to enable data linkage among crash, EMS, and hospital datasets.

GPS location and time can also be used to identify an event. For linkage to roadway datasets, location is already used in states. For linkage between crash and EMS, the GPS location and time recorded by police and by EMS will not be identical. However, a fairly simple algorithm could choose time-location combinations that match (between police and EMS). Because the person does not have to be specified, this approach is less time-consuming at the scene and can more easily be automated.

The disadvantage is that extra work still has to be done to identify specific occupants for matching to hospital records. The event-specific identifier becomes a good way to limit the field and improve probabilistic (or hand) linkage, but it does not solve the whole problem of finding the same individual in multiple datasets.

7.6.2.2 Person-Specific/Event-Specific

A person-specific/event-specific identifier is one that is assigned to each person involved in a crash, but only for that crash. Trauma bands, which are physical ID wristbands given to anyone seen by rescue personnel, fall into this category. An alternative is for the police to assign numbers to each occupant in a crash and pass along the number to EMS for any occupants who are transported. EMS would then pass the number to the hospital on arrival.
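The time-and-location matching between police and EMS records mentioned in Section 7.6.2.1 could look something like the sketch below. It pairs a crash with any EMS run that is close in both space and time. The record layout (dictionaries with id, lat, lon, and time keys) and the thresholds of 0.5 km and 30 minutes are assumptions chosen for illustration; a state would need to tune them against its own data.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def match_events(crashes, ems_runs, max_km=0.5, max_gap=timedelta(minutes=30)):
    """Map each crash id to the ids of EMS runs near it in space and time.

    Both inputs are lists of dicts with keys: id, lat, lon, time.
    The distance and time thresholds are illustrative assumptions.
    """
    matches = {}
    for crash in crashes:
        matches[crash["id"]] = [
            run["id"] for run in ems_runs
            if abs(run["time"] - crash["time"]) <= max_gap
            and haversine_km(crash["lat"], crash["lon"],
                             run["lat"], run["lon"]) <= max_km
        ]
    return matches
```

A crash that maps to no run, or to more than one, would still need review or a fallback to probabilistic linkage, which is consistent with the point that the event-level match limits the field rather than solving the person-level problem.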
The advantage of a person-specific identifier over an event-specific identifier is obvious. Any linkage method aimed at capturing injury outcome will have to link people (patients in the hospital dataset to occupants/non-motorists in a crash), and a person-specific approach of any kind accomplishes this without additional analysis. The disadvantage is generally logistical burden, particularly at the scene, when other activities (e.g., treating victims) are higher priority. Any on-scene solution must take little time and be very simple, and the person-specific approaches may be difficult to implement in this way.

7.6.2.3 Person-Specific/Global

The gold standard of identifiers is one that is person-specific and permanent, or global. This ID, like a Social Security number (SSN), follows the person throughout all datasets over time and allows for assessment of long-term follow-up, delayed treatment, and repeat visits. It also allows patients to be easily followed when they are transferred between hospitals.

An example of a person-specific/global identifier is the driver's license number. For linkage from crash to driver license/history files, this identifier is ideal. However, young occupants do not have licenses, and crash reports generally do not include the license numbers of non-drivers. Medical outcome datasets also do not tend to include license number. Thus, license number is feasible only for information pertaining specifically to the driver in a crash, but not for linkage to injury outcome.

Alaska and Massachusetts are in the planning stages of implementing two versions of person-specific global identifiers. In Alaska, everyone who interacts with public safety personnel (for crash, crime, or other reason) is assigned an Alaska Public Safety Identification Number (APSIN). When police respond to a crash, they look up each occupant in the APSIN system. If a number has been assigned, it is automatically entered on the crash report. If a number has not been assigned, a new one is generated, and an entry is made in the APSIN database for future use. Alaska is now embarking on a program to allow hospital and trauma registrars to access the APSIN system to put that number in the hospital record.

Massachusetts uses encrypted SSN in a variety of datasets including ten-year death data, trauma registry, hospital discharge, and all-care claims data. The state uses a single encryption algorithm, which translates each SSN into a different number, but one that remains the same across databases. In this way, the patient can give their SSN rather than be assigned a new number, but the SSN does not reside in any of the databases, providing an extra level of security. Massachusetts is exploring incorporating the same identifier into EMS data, and possibly eventually crash data.

One key advantage of these two approaches is that numbers do not need to be passed on-scene between agencies. Common person-specific ID numbers can be accessed separately by each agency that the person encounters. In Alaska, the number can be looked up separately by the rescue or hospital personnel, and in Massachusetts, the patient (or family) can provide the number (to be encrypted). The other obvious advantage is the ability to track the same person beyond the initial EMS run and hospital admission.

The disadvantage of the Alaska approach for other states is the need to implement a statewide ID system first. The APSIN system was already in place in Alaska, which facilitated its use in this case. However, setting up a statewide ID system will delay implementation of any data linkage that depends on it. The disadvantage of the Massachusetts approach is that SSN must be provided by the patient or family. Parents do not always have children's SSNs available and some very young children may not have a number at all.
Unconscious victims cannot provide SSN, leading to a potential injury-based bias in missing data in that field.

7.6.3 Summary of Linkage Mechanisms Used by States

Table 12 summarizes the linkage mechanisms used by the states we visited and interviewed. Other states, particularly those with CODES programs, reported using probabilistic linkage in the survey. Although some states are trying on-scene identifier methods, most are using probabilistic linkage. This approach can work if the technical challenges are handled and appropriate linkage-quality metrics are used. Although most states have struggled with the linkage process initially, North Carolina has demonstrated that once the system is in place, it can work in a timely way (NC can produce a linked crash-EMS annual state dataset by the end of February, two months after the year ends).

Table 12 Summary of Linkage Mechanisms Used by States Interviewed

Linkage Approach | States
Probabilistic (Linkage Software) | CA, MA (crash-EMS), WA, MT (planned), NC (crash-EMS); (many other CODES states continue to use this method)
Probabilistic (Perfect Matches Only) | UT
Hand Linkage | KS (for EMS to trauma, crash under discussion), NC (for EMS to trauma)
Event-Specific Identifier | FL (pilot)
Person-Specific Identifier (Event-Specific) | AL (pilot)
Person-Specific Identifier (Global) | AK (for crash, under development), MA (for trauma to case-mix (ED, admitted, observation))

As discussed earlier, probabilistic linkage is ideally seen as an intermediate step in the development of a state linkage process. Realistically, event-specific identifiers and even person-specific on-scene identifiers will need to be combined with probabilistic linkage for some time.

Generally, the goal of a linkage system should be to return exactly one linkage for each and every transported or injured case between the crash record and the medical record for that case. This link (per person) should be tied to the crash record so that key information can be further linked to roadway, licensing, and other datasets related to understanding traffic safety. The extent to which this is achieved is a measure of the performance of the linkage process. The choice of linkage mechanism will influence the present and future performance of the linkage system.

7.7 Step 6: Determine Database Storage Mechanism

7.7.1 Data Warehouse

The recommended approach to handling a large number of semi-related databases is to set up a data warehouse. Sometimes called a "hub-and-spoke" system, the data warehouse allows databases to be stored separately, but linked when linkage is possible (i.e., when common identifiers are present). Databases are accessed using software (the hub) that can extract from the component databases and link for analysis and reporting purposes.
This approach allows individual agencies to keep control over their databases, including storage location, input, editing, and access, while still allowing access and linkage by other permitted individuals. It also allows databases to be brought into the system one at a time, as resources are available.

Figure 9. Data warehouse diagram from The University of Alabama Center for Advanced Public Safety's CARE data warehouse (used with permission)

Figure 9 shows an example data warehouse architecture used by the state of Alabama and developed at the University of Alabama Center for Advanced Public Safety. Individual databases represented by the data silo icons (e.g., crash data, linear referencing system (LRS) data, and roadway features data) reside in various locations and formats, generally original to the collecting agency. This allows the originating agency to retain control over the contents and format of the database, and allows them to continue to use existing software for access to that specific database. The Extract-Translate-Load (ETL) middleware is a dataset-specific translator that changes the format of the original dataset into one that is standardized for use by the CARE data analysis engine. The ETL system may also filter the original dataset, providing a subset of data that is translated. A reasonable goal is to filter as little as possible at the middleware phase. The analysis engine can be used by web-based software or desktop clients. The client software represents the user's experience and should be designed to facilitate selection of key variables from linkable datasets, filtering of information that is not needed, and development of necessary analyses and reports. There are many commercial software packages available for this purpose, or it can be custom-developed.

The key to using the data warehouse for linkage is in the ETL middleware (and the presence of linking variables). Developing the ETL program for each database is time-consuming and requires an understanding of the standards required by the analysis engine, the contents of the database itself, and the potential uses to be made of it via the analysis engine. For example, all common variable types (e.g., time, date, sex) must have a standard format in all datasets used by the analysis engine (e.g., all time variables must be in 24-hour format, all dates in Julian, and sex must be coded as 1=Male, 2=Female). Knowledge of the needs of the analyst might prompt a recoding of variables (e.g., "Driver Age" is a characteristic of a vehicle that is usually in the person table and not the driver table originally; "EMS run time" might originally only be available by subtracting "Run Start Time" from "Run End Time," but can be computed by the ETL automatically).

Fortunately, this process can be completed one database at a time, and once done, it will not need to be repeated in its entirety. In other words, one-time resources can be used to pull databases such as EMS and trauma into the data warehouse with the expectation that there will not be significant ongoing costs. (Some resources should be set aside for minor updates to the ETL program as new needs are identified by analysts and new variables become available in the original database.)

7.7.2 Separate Linked Dataset

The primary alternative to the more comprehensive data warehouse approach is to handle the linkage separately. In this approach, datasets are linked separately and either the linked dataset is saved on its own or variables are pulled from one dataset into the other permanently.

The state of Washington ran a pilot project to investigate linkage from trauma registry and EMS to crash datasets using probabilistic linkage. The resulting linked dataset includes both medical outcome and crash data.
By state law, crash data in Washington are public, but trauma registry data are protected by HIPAA. As a result, the linked variables could not be returned to the original crash dataset. Instead, Washington produced a separate, linked dataset and requires IRB permission to access it. This solution took two years to sort out because of the ambiguous status of a linked dataset with partially public and partially private information.

The advantage of separating the linked dataset from its origins is that it can be done without the overhead of incorporating the component datasets into a data warehouse. In addition, the unique permission issues of the linked dataset can be handled separately.

One disadvantage is that changes to the component datasets are not automatically reflected in the linked dataset. In addition, some organization must take responsibility for the linked dataset, even though it is of interest to multiple agencies. Each new dataset must have rules and a process for access, whereas the data warehouse approach centralizes access control (even though access to specific datasets and data elements may still be granted by different agencies).

The single linked dataset approach should be thought of as an intermediate step on the way to incorporation in a data warehouse. Working out linkage issues on a static dataset is helpful as well, since linkage issues can be separated from issues introduced by component dataset updates. However, in the long run, the data warehouse model makes the most sense, given that data linkage is increasingly desirable across a wide variety of state databases (not just crash-EMS-hospital).

7.8 Step 7: Harmonize Common Data Elements

Before linking datasets, it is necessary to have common data elements harmonized. This means that variables are in the same format, and that numeric codes mean the same thing.
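As a rough illustration of what harmonization involves, the sketch below applies three of the rules given as examples in Section 7.7.1: times in 24-hour format, sex coded as 1=Male and 2=Female, and an EMS run time derived from start and end times. The function names, the accepted input spellings, and the assumption that a run lasts less than 24 hours are all choices made for this sketch, not features of any particular state's ETL program.

```python
from datetime import datetime

# Accepted raw spellings are an assumption; a real ETL step would be driven
# by each source database's data dictionary.
SEX_CODES = {"m": 1, "male": 1, "1": 1, "f": 2, "female": 2, "2": 2}


def to_24_hour(value):
    """Convert a time such as '2:05 PM' or '14:05' to 'HH:MM' (24-hour)."""
    text = value.strip().upper()
    for fmt in ("%I:%M %p", "%H:%M"):
        try:
            return datetime.strptime(text, fmt).strftime("%H:%M")
        except ValueError:
            continue
    raise ValueError(f"unrecognized time: {value!r}")


def harmonize_sex(value):
    """Map a raw sex value to 1 (male) or 2 (female); None if unknown."""
    return SEX_CODES.get(str(value).strip().lower())


def run_minutes(start_hhmm, end_hhmm):
    """EMS run time in minutes from two 'HH:MM' times.

    Assumes the run lasted less than 24 hours, so an end time earlier than
    the start time is treated as crossing midnight.
    """
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    return int((end - start).total_seconds() // 60) % (24 * 60)
```

Returning None for an unrecognized sex value, rather than guessing, keeps missing data visible to whoever later assesses linkage quality.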

An advantage of the data warehouse model is that it provides a partial schema for many common variables (e.g., date formats, sex, age, location). In addition, it allows the original databases to remain in their original formats, while ensuring that datasets to be linked have common formats.

In selecting harmonized schemas for common data elements, it is important to use an established national schema (e.g., NEMSIS, MMUCC) wherever possible. Using an established schema will enable future linkages to databases with harmonized schemas (e.g., NEMSIS to NTDB). The state also benefits from any existing work done using those schemas, including existing XML, training, manuals, etc.

7.9 Step 8: Set Up a Pilot Project

Almost universally, states that have set up or are setting up linkage programs reported doing so on a small scale first. Pilot projects may be focused on the logistics of passing identifiers between agencies at the scene, or they may be focused on data issues (e.g., probabilistic linkage, schema standardization across databases).

A good pilot test of on-scene logistics is typically limited geographically. This reduces the number of agencies involved and aids communication, training, and feedback. Logistical problems are the focus of this type of pilot. Missed identifiers, related activities that interfere with patient care or take significant extra time, and confusing processes are among the issues that can be found in trying something out on a small scale.

In Florida, the Traffic-Related Injury Prevention task force has proposed a pilot test of passing an identifier at the scene between EMS and police in Orlando for 2014. In Alabama, one EMS service, the fire department, and the department of community health, all in Tuscaloosa, are setting up a pilot project to pass a person-specific identifier between EMS and trauma. Trauma bands are among the options being considered.
A pilot exploration of probabilistic or hand linkage might proceed on a small number of years of data or a small geographic area. Pilot tests can be useful for gathering several types of information. First, a pilot test of probabilistic linkage will identify problems with data structures and data quality that affect all types of linkage, and it can help to prioritize data improvements. Second, a pilot should identify problems with linkage success, both in overall rate and bias. This should feed back to the data collection system by determining additional linkage variables (e.g., name) that should be included to improve the linkage success rate and reduce bias. In Washington, for example, probabilistic linkage was tested on three years of trauma registry and crash data. A geographically limited pilot is best if statewide datasets are not all in good condition. This is the case in Michigan, where the data linkage committee recommended a pilot project of probabilistic linkage among trauma and crash datasets in one county. However, linkage quality is affected by dataset size, and in particular, the discrimination performance of some variables will change with dataset size (Cook et al., 2001). Linkage testing may start on a smaller dataset, but will need to expand to the statewide dataset before the process is put in place. For all linkage processes, pilot testing can help to estimate true costs of implementing the linkage program. This is critical for states to be able to plan. However, it is important to account for the likelihood that many costs will go down over time as linkage processes are put in place. For example, initial training and software costs may not repeat often as the process progresses.

7.10 Step 9 (Optional): Set Up a Sampling Program

In the second interim report from this project (Flannagan & Rupp, 2013), we recommended sampling medical outcome data associated with crashes at the state level as an intermediate solution to measuring serious injuries in crashes. In addition to allowing for relatively near-term measurement of serious injuries, sampling also has major advantages in monitoring the progress and success of data linkage approaches. We list this as optional because it is not technically required to complete the process of data linkage, but it is a valuable tool both for achieving the goal of measuring serious injuries early on and for evaluating the quality of linkages as they are developed.

Many, if not most, linkage processes have the potential to bias results, at least in the early phases. Probabilistic linkage, which is the approach most commonly used in states at this time, has the greatest potential for bias. However, even on-the-ground identifier passing has potential for bias. For example, in crashes with multiple victims, passing person-specific identifiers is harder and more likely to fail. This means that single-vehicle crashes are more likely to link, resulting in a bias in the linked dataset towards outcomes typical of single-vehicle crashes. Sampling proceeds independent of these logistical issues, because the logistical problems of sampling revolve around the challenges of getting data from hospitals rather than identifying unique individuals in crashes.

Having a dataset available from a small state sampling program allows for year-to-year assessment of the quality of the linkage process and areas for improvement. In addition, sampling allows serious injury to be measured throughout the development period. Finally, sampling can make use of any source of outcome data without the presence of a statewide database.
Thus, sampling may continue to be useful in capturing information about datasets (e.g., ED) that are not yet well covered at the state level, even when other outcome databases are being successfully linked.

7.11 Step 10: Set Up Statewide Linkage

Once pilot linkage is working and the details of the system have been decided, the next step is to set up the linkage process on a statewide basis. Statewide implementation needs to address a number of issues for the long run:

• Who pays for ongoing costs associated with the linkage process?
• Development of training materials for anyone involved in implementation (even probabilistic linkage)
• How is the linkage stored? Numeric codes added to databases? Separate linked dataset?
• Who has access and what is the process for access?
• Establishing a regular process for evaluation of the quality and coverage of the resulting linked database
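A regular evaluation of quality and coverage could start from two simple measures suggested by this chapter: the share of transported or injured crash persons that receive exactly one link to a medical record (the performance goal stated in Section 7.6.3), and the link rate within subgroups as a first check for bias (for example, single-vehicle versus multi-vehicle crashes). The sketch below is one possible way to compute them; the function names, identifiers, and input shapes are hypothetical.

```python
from collections import Counter


def exactly_one_link_rate(person_ids, links):
    """Share of crash persons that have exactly one linked medical record.

    person_ids: crash-person ids that are expected to link (e.g., transported
                or injured cases).
    links:      list of (crash_person_id, medical_record_id) pairs.
    """
    if not person_ids:
        return 0.0
    link_counts = Counter(pid for pid, _ in links)
    ok = sum(1 for pid in person_ids if link_counts[pid] == 1)
    return ok / len(person_ids)


def link_rate_by_group(groups, links):
    """Fraction of persons with at least one link, within each subgroup.

    groups: dict mapping crash_person_id -> subgroup label.
    Large differences between subgroups suggest the linked dataset is biased.
    """
    linked = {pid for pid, _ in links}
    totals = Counter(groups.values())
    hits = Counter(label for pid, label in groups.items() if pid in linked)
    return {label: hits[label] / totals[label] for label in totals}
```

Neither measure says whether a given link is correct; assessing false matches would still require a validated subset, such as the sampled outcome data described in Section 7.10.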


