APPENDIX F

Selection and Assessment of Model Form for Cross-Sectional Models

The functional form of the regression model sets the relationship between the explanatory and dependent variables. An incorrect functional form results in biased and inconsistent parameter estimates, and the crash modification factor (CMF) is derived from those estimates, so the selection of a functional form is critical to developing a reliable CMF. The selection of model form followed logical considerations based on forms found to be appropriate for similar research and used two tools to aid in investigating the appropriate functional form. These tools were introduced in "Two Tools for Finding what Function Links the Dependent Variable to the Explanatory Variables" in the Proceedings for ICTCT (International Cooperation on Theories and Concepts in Traffic Safety) (Hauer and Bamfo 1997) and are called the Integrate-Differentiate (ID) and Cumulative Residual Plot (CURE) methods. Below is an overview of the methods.

ID Method Overview

In the ID method, the integral function is a cumulative function, $F(x)$. The primary assumption of the ID method is that if the empirical integral function, $F_E(x)$, is close to a candidate integral function, $F_1(x)$, then the linear transformation of $F_E(x)$ should be close to that of $F_1(x)$. One can list the candidate integral functions and choose the one closest to $F_E(x)$. For a better understanding of this method, refer to Hauer and Bamfo (1997).

In their dataset, there is no visible relationship between crash frequency and the explanatory variable, so Hauer and Bamfo (1997) draw a bin graph and sum the bin areas, which yields the empirical integral function. For each group, the width of the bin is the difference between the nearest higher average annual daily traffic (AADT) and the nearest lower AADT, divided by two. Then, the candidate functions (e.g., power function, polynomial function, and Hoerl's function) are listed and their linear transform graphs are compared with that of the empirical integral function, $F_E(x)$.

Figure F-1 provides an example. If the candidate model is

$$F_1(x) = \frac{\alpha}{\beta + 1}\, x^{\beta + 1},$$

then taking logarithms gives $\log F_1(x) = \log\left(\frac{\alpha}{\beta + 1}\right) + (\beta + 1)\log x$, so a plot of $\log[F_E(x)]$ against $\log(x)$ should be a straight line with $\log\left(\frac{\alpha}{\beta + 1}\right)$ as the intercept and $\beta + 1$ as the slope. In Figure F-1, this holds where log(AADT) is larger than 6.5 (the dashed circle).
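The ID check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by Hauer and Bamfo (1997) or in this study: the function names (`empirical_integral`, `fit_line`, `power_form_check`), the `x_min` cutoff, the treatment of the first and last bins (half the gap to the single available neighbor), and the synthetic data are all hypothetical. The sketch builds $F_E(x)$ from bin areas and fits a straight line to $\log F_E(x)$ against $\log x$; for a power-type relationship the slope should be near $\beta + 1$.

```python
import math

def empirical_integral(x, y):
    """Running sum of bin areas y_i * width_i, with x sorted ascending."""
    n = len(x)
    f_e, total = [], 0.0
    for i in range(n):
        # Bin width: (nearest higher x - nearest lower x) / 2.
        # Assumption: edge bins use the only neighbor that exists.
        lower = x[i - 1] if i > 0 else x[i]
        upper = x[i + 1] if i < n - 1 else x[i]
        total += y[i] * (upper - lower) / 2.0
        f_e.append(total)
    return f_e

def fit_line(u, v):
    """Ordinary least-squares (intercept, slope) of v on u."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    slope = sxy / sxx
    return mean_v - slope * mean_u, slope

def power_form_check(x, y, x_min=0.0):
    """Fit log F_E(x) against log x, using only points with x >= x_min."""
    f_e = empirical_integral(x, y)
    pts = [(math.log(xi), math.log(fi))
           for xi, fi in zip(x, f_e) if xi >= x_min and fi > 0]
    u, v = zip(*pts)
    return fit_line(u, v)

# Synthetic, noise-free data following y = alpha * x**beta (alpha=2, beta=0.5).
aadt = [100.0 * k for k in range(1, 501)]
crashes = [2.0 * a ** 0.5 for a in aadt]
intercept, slope = power_form_check(aadt, crashes, x_min=5000.0)
```

With these synthetic inputs the fitted slope should land close to $\beta + 1 = 1.5$ and the intercept close to $\log(\alpha / (\beta + 1))$. Restricting the fit to larger values of x mirrors the observation that the straight-line pattern holds only over part of the range.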
Development of Crash Modification Factors for Uncontrolled Pedestrian Crossing Treatments

For most of the variables available, and for all those included in the final models, the variables were categorical rather than continuous; for example, Area Type is either rural or urban. For these variables, the ID method is not relevant and was not applied. For the remaining variables, such as crossing width, the ID method indicated that either a power or an exponential function seemed appropriate. With the exception of AADT and pedestrian AADT, however, none of these variables proved to be statistically significant, so they were not included in the final models.

Figure F-1. Illustration of ID method. Panels: (a) $F_1(x)$; (b) $F_E(x)$; (c) $\log(F_E(x))$ / $\log(x)$.

CURE Plot Overview

A complementary method to the ID method is to build the model one variable at a time and examine the residuals to see if an alternate model form may improve the fit to the data. The tool for examining residuals is called the CURE plot. A CURE plot is a graph of the cumulative residuals (observed minus predicted crashes) against a variable of interest sorted in ascending order.

A good CURE plot should not have vertical drops because these indicate inordinately large residuals, which are possible outliers. It should not have long increasing or decreasing runs because these correspond to regions of consistent over- or underestimation. It should meander around the horizontal axis in a manner consistent with a "symmetric random walk." Even in the absence of any bias, symmetric random walks have "runs," i.e., stretches in which several consecutive residuals tend to be positive (an up-run) or negative (a down-run). For a CURE plot that is consistent with a symmetric random walk, one can find the limits beyond which the plot should only rarely go.

The steps to constructing a CURE plot are the following:

• Step 1: Sort the sites in ascending order of the variable of interest. Let N be the number of sites, n an integer between 1 and N, and S(n) the cumulative sum of residuals over sites 1 to n.
• Step 2: For each site, calculate the residual, res, as observed minus predicted crashes.
• Step 3: For each site, calculate the cumulative residuals, S(n).
• Step 4: For each site, calculate the squared residual, $\text{res}^2$.
• Step 5: For each site, calculate the cumulative squared residuals, $\sigma^2(n)$.
• Step 6: Sum the squared residuals over all sites to obtain $\sigma^2(N)$.
• Step 7: For each site, estimate the variance of the random walk as:

$$\sigma^{*2}(n) = \sigma^2(n)\left[1 - \frac{\sigma^2(n)}{\sigma^2(N)}\right]$$

• Step 8: For each site, calculate the 95% confidence limits as:

$$\text{Lower Limit} = -1.96\sqrt{\sigma^{*2}(n)}, \qquad \text{Upper Limit} = +1.96\sqrt{\sigma^{*2}(n)}$$

• Step 9: Plot the cumulative residuals S(n) and the 95% confidence limits on the y-axis with the explanatory variable of interest on the x-axis.

An example CURE plot is shown in Figure F-2 for the variable major road AADT. This is for the model for pedestrian crashes for the refuge island treatment. In this example, the model is performing fairly well but does exhibit some bias by venturing outside the boundary limits around AADTs of 20,000 and above 35,000. The amount of bias is reasonably small in that the maximum deviation from zero of the cumulative residuals is approximately 32, while the number of pedestrian crashes in the data is 340.

Figure F-2. Example of CURE plot. (y-axis: cumulative residuals; x-axis: major road AADT; series: cumulative residuals, lower limit, upper limit.)

While the CURE plot method works well for continuous variables, it is not applicable to variables with few categories, e.g., urban versus suburban area type or midblock versus intersection location. For such variables, a table can be produced showing the prediction bias for each level of the variable, as shown in Table F-1. In this example, there are two categories of area type. As shown by the values of observed divided by predicted, the safety performance function (SPF) shows no bias between the two levels of area type. Where some bias appears evident, however, there is no statistical test to indicate whether the biases are statistically significant.

Table F-1. Example of categorical variable results.

                      Urban    Suburban
Observed/Predicted    1.00     1.00
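The CURE plot steps and the categorical check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function names (`cure_table`, `ratio_by_level`), the input layout (parallel lists of site values, observed crashes, and predicted crashes), and the small example data are hypothetical. The first function covers Steps 1 through 8 and returns the values that Step 9 would plot; the second sums observed and predicted crashes within each level of a categorical variable and returns their ratio.

```python
import math
from collections import defaultdict

def cure_table(values, observed, predicted):
    """Return one row per site: (value, S(n), lower limit, upper limit)."""
    # Step 1: sort sites in ascending order of the variable of interest.
    sites = sorted(zip(values, observed, predicted), key=lambda s: s[0])
    # Step 2: residual = observed - predicted.
    residuals = [obs - pred for _, obs, pred in sites]
    # Steps 3-5: cumulative residuals and cumulative squared residuals.
    cum_res, cum_sq = [], []
    running_res = running_sq = 0.0
    for r in residuals:
        running_res += r
        running_sq += r * r
        cum_res.append(running_res)
        cum_sq.append(running_sq)
    # Step 6: total of squared residuals over all N sites.
    total_sq = cum_sq[-1]
    rows = []
    for (value, _, _), s_n, sq_n in zip(sites, cum_res, cum_sq):
        # Step 7: variance of the random walk at site n.
        var_star = sq_n * (1.0 - sq_n / total_sq) if total_sq > 0 else 0.0
        # Step 8: 95% confidence limits.
        limit = 1.96 * math.sqrt(max(var_star, 0.0))
        rows.append((value, s_n, -limit, limit))
    return rows

def ratio_by_level(levels, observed, predicted):
    """Observed/predicted ratio of summed crashes for each category level."""
    obs_sum, pred_sum = defaultdict(float), defaultdict(float)
    for level, obs, pred in zip(levels, observed, predicted):
        obs_sum[level] += obs
        pred_sum[level] += pred
    return {level: obs_sum[level] / pred_sum[level] for level in obs_sum}

# Hypothetical three-site example.
rows = cure_table([3000, 1000, 2000], [2, 0, 1], [1.0, 1.0, 1.0])
ratios = ratio_by_level(["urban", "suburban", "urban"], [2, 3, 4], [3.0, 3.0, 3.0])
```

Plotting the second, third, and fourth columns of `rows` against the first would complete Step 9. By construction the limits close to zero at the last site, because the cumulative squared residuals equal their total there.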