**Suggested Citation:** "Chapter 2 - Research Approach." National Academies of Sciences, Engineering, and Medicine. 2020. *Procedures and Guidelines for Validating Contractor Test Data*. Washington, DC: The National Academies Press. doi: 10.17226/25823.

Chapter 2

Research Approach

The research approach gave special consideration to the diverse state of the practice, current federal policy, AASHTO guidelines and standards, relevant research, recent developments in project procurement methods, modified and new alternative tests and procedures, concerns with potential bias, and practical constraints. It built upon the findings of the literature review, the assessment of the state of current practice as identified through the survey of SHAs, a review of selected SHAs' practices, and a review of the fundamental statistics associated with procedures currently in use. The research included evaluation of identified candidate validation procedures, selection of a set of procedures for further consideration, and subsequent application of these procedures to a series of data sets. Figure 1 illustrates the three stages of the research approach that provided a basis for developing a proposed practice for validating contractor test data: gathering information, numerical analysis, and SHA data analysis.

Figure 1. Illustration of research approach.

2.1 Gathering Information

2.1.1 Literature Review

The literature review covered technical papers, reports, and standard specifications in several fields, including highways, transportation, pavements, materials, quality management, statistics, biostatistics, and biometrics. Biostatistics and biometrics consider the application of statistical and mathematical theory and methods in agriculture, biomedical science, environmental sciences, and allied disciplines. The literature review was summarized into seven categories: (a) Validation Techniques and Diversity in Procurement Methods; (b) Modification of Existing or New Statistical Tests; (c) Concern with Bias; (d) Nonparametric Tests; (e) Potential Risks Associated with F- and t-tests; (f) State of the Practice; and (g) Policy, Standards, and Guidelines.

Validation Techniques and Procurement Methods

Schmidt et al. proposed a procedure for developing verification test tolerances for either independent assurance (IA) programs or quality assurance (QA) specifications of highway construction materials (15). The procedure introduced statistical parameters associated with determining mean differences for either split or independent sampling. The parameters included risk, variation, and sample size and were used to develop statistical equations for comparing the mean difference between two data sets. The methods were also illustrated in an analysis example of data from six SHAs. The analysis confirmed that the power of statistical tests increased as sample size increased when all other variables remained unchanged.

LaVassar et al. analyzed SHA and contractor test data from four states using various statistical methods, including F- and t-tests, and concluded that with adequate sample size, F- and t-tests were effective in validating quality control (QC) test results (16). Also, analyzing the data at the statewide level resulted in sample sizes so large that noncompliant material was lost in the abundance of compliant results, making the statistical tests irrelevant.

Wani and Gharaibeh used Monte Carlo simulation to evaluate contractor test data verification processes based on F- and t-tests in terms of the probability of detecting data manipulation and expected pay (10). For the simulations, SHA and contractor test results were randomly sampled from normal distributions representing independent samples. The contractor data were then manipulated by (a) reducing the standard deviation while not changing the mean and (b) changing the mean while not changing the standard deviation in attempts to increase pay factors. The data were then put through the verification process. The probability of detecting the data manipulation and the associated pay factors were computed for thousands of iterations to develop operating characteristic (OC) curves for measuring the probability of detecting such data manipulation and related pay factors. The OC curves showed that even at relatively high sampling rates (five SHA to 20 contractor test results), the power of the tests was low. For example, there was only a 60% chance of identifying a 50% reduction in contractor standard deviation. The simulations showed that the expected pay for the same sample sizes (5 vs. 20) could be manipulated from 87% to 100% without detection in the most extreme cases. Such manipulation would lead to overpayment by the SHA. The authors recommended a series of options to reduce risk, such as separating the contractor quality management team from the project management team and using larger lots (and thus larger sample sizes) to compare contractor test results with those from the SHA (10, 17).

To address the probability of detecting data manipulation and associated pay factors, Arambula and Gharaibeh proposed accumulating SHA and contractor test results on consecutive lots to increase sample size and therefore the power of the statistical tests used to verify contractor test results (18). Two types of cumulative sampling techniques were used: continuous cumulative and chain-lot. The chain-lot method uses a concept similar to a moving average, where a fixed number of lots (e.g., three) are individually tested first, and if they are found to conform to the statistical procedure, the lots are accumulated to a maximum of four lots. The set of four lots continues until a nonconforming lot is encountered, and the process then reverts to the sampling of individual lots. The chain-lot method is illustrated in Figure 2. These two cumulative sampling techniques were applied to actual SHA and contractor data with independent sampling. Analysis showed that a chain-lot sampling method with three accumulated lots significantly increased the power of F- and t-tests for verification purposes, while not significantly changing the percent within limits (PWL) associated with the same data.
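
The chain-lot logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Arambula and Gharaibeh: the function name `chain_lot_groups`, the parameter `i`, and the pass/fail flags are hypothetical, and each lot's conformance is taken as given rather than computed from test data.

```python
def chain_lot_groups(lot_conforms, i=3):
    """Return, for each lot, the list of lot indices pooled for verification.

    lot_conforms[k] is True when lot k passed verification. Here that
    outcome is simply assumed to be known (in practice it would come from
    a statistical comparison such as F- and t-tests).
    Assumed rule: once `i` consecutive conforming lots precede the current
    lot, the current lot is pooled with those `i` lots (i + 1 lots in all);
    a nonconforming lot resets the chain to individual lots.
    """
    groups = []
    streak = []  # indices of consecutive conforming lots seen so far
    for k, ok in enumerate(lot_conforms):
        if len(streak) >= i:
            groups.append(streak[-i:] + [k])
        else:
            groups.append([k])
        streak = streak + [k] if ok else []
    return groups


if __name__ == "__main__":
    # Hypothetical sequence: lot 5 fails, so pooling restarts at lot 6.
    flags = [True, True, True, True, True, False, True, True, True, True]
    for lot, group in enumerate(chain_lot_groups(flags)):
        print(lot, group)
```

With `i = 3`, the pooled set behaves like a moving window of four lots, which is the source of the larger sample size and higher test power noted above.
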
The chain-lot method is illustrated in Section 3.3.2 with examples from two SHAs that are currently using a similar approach.

Figure 2. Illustration of chain-lot sampling methodology, I = 3 (i.e., 4 lots) (18).

Cleveland et al. (19) and Tam et al. (20) have both reported on the use of a risk-based, multi-tiered verification approach. The methodology ranks the acceptance quality characteristics (AQCs) of a material by their relative impact on performance into three categories (1, 2, and 3), with 1 being the most critical and 3 the least critical. For portland cement concrete (PCC), compressive strength, temperature, and slump may be used as AQCs, with compressive strength in category 1 and slump in category 3. Different levels of scrutiny are then used in the contractor test verification procedures for each category, with the most rigorous used for category 1 characteristics and the least rigorous for category 3. Sampling and testing frequencies are also established based on risk, with greater frequencies for higher risk characteristics. For category 1 characteristics, an AASHTO procedure using F- and t-tests is employed, but for category 3 items, verification might be limited to review of control charts of contractor data and/or inspection. This approach is becoming common with design-build and public-private partnerships. The Texas Department of Transportation has a quality assurance program manual for this approach (59).

Modification of Existing or New Statistical Tests

A critical task of the research was the evaluation of potential contractor test validation procedures or modification of existing procedures. This activity relied on the review of current AASHTO guidelines, the state of the practice, the fundamentals of the statistical methods currently used, and consideration of new potential statistical methods. The AASHTO implementation manual refers to two procedures for verifying independently sampled contractor and SHA tests (5). One procedure involves hypothesis tests, while the other simply compares a single SHA test to a set of contractor tests.

Statistical tests, termed "hypothesis tests," are used when it is necessary to test whether it is reasonable to accept an assumption about a data set. Conducting a hypothesis test requires that an assumed set of conditions, known as the null hypothesis (Ho), be defined. An alternative hypothesis (Ha) is an alternative set of conditions that are assumed to exist if Ho is rejected. The statistical procedure assumes that Ho is true and then investigates the data to determine whether there is adequate evidence that it should be rejected. Ho cannot actually be proved by the test; it can only be disproved. If Ho cannot be disproved (rejected, in statistically correct terms), the test fails to reject the hypothesis. Hypothesis tests are performed at a significance level (α), which is the probability of incorrectly rejecting Ho when it is actually true. Significance levels are typically 0.10, 0.05, 0.025, or 0.01. This means, for example, that if α of 0.01 is used, there is only 1 chance in 100 that a true Ho will be erroneously rejected.

The most commonly used procedure relies on both F- and t-tests. For each hypothesis test, Ho is that the contractor and SHA test results come from the same population. For the F-test, Ho is defined by the equal variability of the contractor and SHA data, and for the t-test, Ho is defined by the equal means of the contractor and SHA data sets. It is important to compare both the variability (variances) and the means when comparing two data sets. The F-test is used for comparing the variances (standard deviations squared) of the two data sets; the t-test is used for comparing the means of the two data sets. These procedures are statistically sound and have more statistical power in identifying actual differences than a procedure that relies on a single SHA test for comparison. Commonly available computer programs make the use of these tests relatively simple.

The procedure for comparing a single SHA test to a set of contractor tests is based on a single, independently sampled SHA test being compared with five to 10 contractor tests. To be acceptable, the SHA test result must fall within the interval X̄ ± CR, where X̄ and R are the mean and range, respectively, of the five to 10 contractor tests, and C is a factor that is a function of the number of contractor tests. The allowable interval is based on the number of contractor tests and a significance level, α, of approximately 0.02. This is not an effective procedure for validation, as research has shown the test to be weak statistically, with high associated risk for use on highway projects (5).

Several other statistical tests are available. An important factor to consider in any test is the relatively small sample sizes commonly observed with SHA quality testing (verification testing and acceptance). In the independent, two-sample t-test, researchers have used the equivalence of variances between two samples as a criterion for deciding between using the pooled-variance procedure or Satterthwaite's method (21).
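
As a rough illustration of the statistics named above, the following Python sketch computes the F ratio, the pooled-variance t statistic, and the unequal-variance t statistic with Satterthwaite's approximate degrees of freedom. The function names and sample values are hypothetical. The sketch stops at the test statistics: turning them into p-values or critical values requires the F and t distributions, which are assumed to come from statistical tables or software and are not reproduced here.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (n - 1)


def f_ratio(a, b):
    """Larger sample variance over smaller, with (numerator df, denominator df)."""
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1


def pooled_t(a, b):
    """t statistic and degrees of freedom assuming equal population variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2


def satterthwaite_t(a, b):
    """t statistic with a separate-variance standard error and approximate df."""
    na, nb = len(a), len(b)
    qa, qb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(qa + qb)
    df = (qa + qb) ** 2 / (qa ** 2 / (na - 1) + qb ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical asphalt binder contents (%), for illustration only.
    agency = [5.1, 5.4, 4.9, 5.3, 5.0]
    contractor = [5.2, 5.1, 5.2, 5.3, 5.1, 5.2, 5.2, 5.1]
    print(f_ratio(agency, contractor))
    print(pooled_t(agency, contractor))
    print(satterthwaite_t(agency, contractor))
```

The two t statistics share the same numerator; they differ only in the standard error and in the degrees of freedom used to look up the reference distribution.
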
The pooled t-test and confidence interval (CI) were suggested if the equivalence of the two variances was not violated, and Satterthwaite's method was recommended if the variances were not equal. The pooled t-test assumes that the populations are normally distributed, with equal population standard deviations. The pooled CI and t-tests are sensitive to the normality and equal standard deviation assumptions. The observed data can be used to assess the reasonableness of these assumptions. When using a t-test, the data normality assumption should first be assessed (using box plots and histograms). The F-test is then used to examine the equal variances assumption. Satterthwaite's method assumes normality, does not require equal population standard deviations, and uses the same CI and test statistic formulas but with a different standard error. Satterthwaite's procedures are somewhat conservative and adjust the standard error and degrees of freedom to account for unequal population variances (21).

Nonparametric methods are available for two-sample tests. These methods do not involve any assumptions as to the form or parameters of a frequency distribution (22). In the independent, two-sample t-test, normality, independence, and equal variances are assumed. This t-test is robust against non-normality but is sensitive to dependence. The test is moderately robust against unequal variances if the two sample sizes are close to each other, but it is much less robust if they are quite different (i.e., differ by a ratio of three or more). To determine whether the equal variance assumption is appropriate under normality, population variances are compared using sample variances. Because such tests are sensitive to non-normality, their use is commonly avoided. Instead, Levene's test, a nonparametric test for comparing two variances that does not assume normality, may be used (23). The Fligner-Killeen test is another nonparametric test that is robust against departures from normality (24). The Mann-Whitney test (also known as the Wilcoxon rank-sum test) is a nonparametric test for two independent samples, although analogous tests are possible for paired samples (25).
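
To make the rank-based idea concrete, the sketch below computes the Mann-Whitney U statistic by direct pairwise comparison and attaches an approximate two-sided p-value. It is an illustration under stated assumptions: the function name is hypothetical, the p-value uses the large-sample normal approximation without continuity or tie corrections, and that approximation is rough for the small samples typical of verification testing, where exact tables or statistical software would normally be used.

```python
import math
from statistics import NormalDist


def mann_whitney_u(a, b):
    """Return (U for sample a, approximate two-sided p-value).

    U counts the pairs (x from a, y from b) with x > y; ties count 1/2.
    The p-value is a normal approximation (assumption: no tie or
    continuity correction), so treat it as indicative only.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    na, nb = len(a), len(b)
    mu = na * nb / 2                                  # mean of U under Ho
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12)   # std. dev. of U under Ho
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


if __name__ == "__main__":
    # Hypothetical in-place density results (% of maximum), for illustration.
    agency = [92.1, 93.4, 91.8, 92.7, 93.0]
    contractor = [93.2, 93.5, 93.1, 93.8, 93.3, 93.6]
    print(mann_whitney_u(agency, contractor))
```

Because only the ordering of the observations enters the statistic, a single extreme test result influences U far less than it would influence a mean or a variance.
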
The Mann-Whitney procedure assumes independent random samples from two populations that have the same shapes and spreads (i.e., the frequency curves for the two populations are "shifted" versions of each other but are not required to be symmetric). The Mann-Whitney procedure provides a CI and tests the difference between the two population medians (and means, if the populations are symmetric).

Permutation tests (also called randomization, rerandomization, or exact tests) are available for comparison purposes. A permutation test is a type of statistical significance test in which the distribution of the test statistic under Ho is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points (26, 27). In other words, the method by which treatments are allocated to data points in an experiment design is mirrored in the analysis of that design. If the labels are exchangeable under Ho, then the resulting tests yield exact significance levels. Confidence intervals can then be derived from the tests. The theory has evolved from work by Fisher (56) and Pitman (57) in the 1930s. The basic premise of the permutation test is to use only the assumption that all of the data sets may be equivalent and that every data point is the same before sampling begins (i.e., the position in the data set to which a point belongs is not distinguishable from other positions before the positions are filled).

Bootstrap-based methods that test hypotheses concerning parameters and entail less stringent assumptions are alternatives to the permutation test methods (28). The bootstrap test uses the data of the sample study at hand as a "surrogate population" to approximate the sampling distribution of a statistic. In other words, the procedure resamples (with replacement) from the sample data at hand to create a large number of "phantom samples" known as bootstrap samples (29).
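
Both resampling ideas can be sketched with the Python standard library. The code below is illustrative only and rests on several assumptions: the function names are hypothetical, the permutation test samples random rearrangements instead of enumerating all of them (so its p-value is a Monte Carlo approximation, not an exact level), the statistic is the difference in means, and the bootstrap interval is the simple percentile interval.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        diff = abs(_mean(pooled[:len(a)]) - _mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return (hits + 1) / (n_perm + 1)


def bootstrap_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))  # sampling with replacement
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(_mean(resampled_a) - _mean(resampled_b))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical air void results (%), for illustration only.
    agency = [4.2, 3.8, 4.5, 4.9, 3.6]
    contractor = [4.0, 4.1, 3.9, 4.2, 4.0, 4.1, 3.8, 4.0]
    print(permutation_p_value(agency, contractor))
    print(bootstrap_ci(agency, contractor))
```

With very small samples, enumerating every rearrangement is feasible and gives the exact test; random sampling of rearrangements is a convenience for larger data sets.
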
The sample summary is then computed on each of the bootstrap samples (usually a few thousand samples). For the bootstrap-based test, verifying the assumptions of normality and equality of variances for the population is unnecessary (inferences are valid even when the assumptions are not verified), and there is no need to determine the underlying sampling distribution for any population quantity (although a large number of resamples must be generated).

Concern with Bias

There is a general concern about contractor test data due to fear of bias ultimately related to payment (11–14). The role of contractor test results in the QA process has been the focus of many publications (3, 4, 12). One such project found that key characteristics of the acceptance processes for hot mix asphalt (HMA) included verification through various combinations of one-to-one comparisons and F- and t-tests. Contractor and SHA test results were compared using F- and t-tests, with Ho that contractor tests provided effectively the same results as SHA tests, at α of 0.01 (12). For each material property tested, the differences in proximity, defined as deviation from target values [such as percent asphalt binder content (AC) specified in a mix design, or excess beyond minimum percentage of mat density], and the variance among the differences between contractor and SHA test results formed the basis for comparison. Analyses were conducted at the project level for projects with adequately large data sets and across all projects within each state.

The trend among HMA test data was that contractor test results generally yielded smaller variances than SHA test results, and in many cases the differences were statistically significant. SHA test results yielded larger mean deviations from target values more frequently than contractor test results, but not to the extent observed with variances, and statistically significant differences were less common than with variances. PCC and aggregate base test result data sets were too small to allow for generalization of results. The findings from this research were considered controversial within the industry. Over the years, the U.S. Department of Justice's Office of Inspector General has investigated multiple construction projects with findings of fraud and/or misconduct associated with contractor test data that were used as a basis for acceptance and payment (14).

Nonparametric Tests

Ludbrook and Dudley compared F- and t-tests to nonparametric rank-order tests in biomedical experiments and recommended employing nonparametric rank-order tests when there are reasonable doubts about fulfilling the F-test and t-test assumptions (26). Eudey et al. noted the need to use nonparametric tests for data sets that are too small to adequately test the assumptions of the F- and t-tests or to meet their normality assumption (27). Derryberry et al. (30) and Callaert (31) stressed the point of using nonparametric tests such as the Wilcoxon test and the Mann-Whitney-Wilcoxon (MWW) test. O'Gorman compared several statistical tests, including Friedman's test, to the F-test and proposed a new aligned rank F-test that maintains its α and has relatively high power compared to Friedman's test, especially when the number of observations is high (32). Fay and Proschan conducted a similar effort comparing the t-test and the MWW test (33). A framework for using nonparametric tests when there are reasonable doubts about fulfilling the F- and t-test assumptions is presented in Chapter 3.

Potential Risks Associated with F- and t-tests

The F- and t-tests are considered powerful methods for identifying differences in variances and means, and they achieve this purpose as long as the fundamental assumptions of the tests are not violated. Concerns have been identified in the state of the practice with respect to violation of some of the F- and t-test assumptions. For example, Kahler noted some of these violations and gave examples of several misleading results (34). Other studies identified situations where the F-test fails to indicate that a two-sample t-test may be inappropriate (35–38). The F-test is not helpful when the data do not follow a normal distribution because of its sensitivity to the normality assumption.

A Type I error is defined as the incorrect rejection of a true Ho (e.g., when comparing two means and concluding that the means are different when in reality they are not). The probability of the statistical test leading to a Type I error is denoted by α. A Type II error is defined as the failure to reject a false Ho (e.g., when comparing two means and concluding that the means are not different when in reality they are). The probability of the statistical test leading to a Type II error is denoted by β (39–41).
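
The two error rates can be estimated by simulation. The Python sketch below is a hedged illustration, not a reproduction of any cited study: it draws normally distributed agency and contractor results, compares their variances with a variance-ratio test whose two-sided critical values are calibrated by simulation under Ho (instead of being read from F tables), and then counts rejections. The function names are hypothetical; the sample sizes (5 agency, 20 contractor) and the halved contractor standard deviation are assumptions chosen to echo the kind of scenario discussed in the literature review.

```python
import random


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def variance_ratio(rng, n_sha, n_con, sd_sha, sd_con):
    """Ratio of agency to contractor sample variance for one simulated lot."""
    sha = [rng.gauss(0.0, sd_sha) for _ in range(n_sha)]
    con = [rng.gauss(0.0, sd_con) for _ in range(n_con)]
    return sample_variance(sha) / sample_variance(con)


def rejection_rate(sd_con, alpha=0.05, n_sha=5, n_con=20, n_sim=4000, seed=1):
    """Fraction of simulated comparisons in which Ho (equal variances) is rejected."""
    rng = random.Random(seed)
    # Step 1: simulate the ratio under Ho to estimate two-sided critical values.
    null = sorted(variance_ratio(rng, n_sha, n_con, 1.0, 1.0) for _ in range(n_sim))
    lower = null[int(alpha / 2 * n_sim)]
    upper = null[int((1 - alpha / 2) * n_sim) - 1]
    # Step 2: simulate fresh data under the scenario of interest and count rejections.
    rejected = 0
    for _ in range(n_sim):
        ratio = variance_ratio(rng, n_sha, n_con, 1.0, sd_con)
        if ratio < lower or ratio > upper:
            rejected += 1
    return rejected / n_sim


# Ho true: the rejection rate estimates the Type I error rate (close to alpha).
type_i_rate = rejection_rate(sd_con=1.0)
# Ho false (contractor standard deviation halved): the rejection rate is the power.
power = rejection_rate(sd_con=0.5)
type_ii_rate = 1.0 - power  # estimate of beta
```

Raising `n_sha` or `n_con` in this sketch shows how β shrinks as sample size grows while α stays fixed by design.
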

For unequal sample sizes, a preliminary F-test is typically applied prior to the t-test. However, Markowski and Markowski noted that the F-test is unlikely to detect many situations where the t-test should be avoided, even when sampling from a normal distribution (36). Zimmerman and Zumbo (37) noted that other preliminary tests of equality of variances (e.g., the Levene or O'Brien test) are also ineffective and recommended that an unequal variance t-test without a prior variance test be used (38).

State of the Practice

FHWA conducts QA reviews of selected SHA practices and procedures every year, as well as a survey of SHA practices every 2 years, to ascertain the status of 23 CFR 637B implementation (3, 4, 7, 8, 42). It is common that up to four SHA reviews are conducted annually; these have included:

- Interviews with SHA headquarters staff, region/district and field office personnel, and FHWA Division personnel.
- Review of SHA implementation strategies, which includes policy and procedure documentation and office records where applicable.
- Visits to construction project sites to assess field practices.
- Identification of best practices.

The reviews are summarized over a period of time to produce state-of-practice reports.

FHWA also produces a QA assessment report every 2 years based on a survey of SHA practices relevant to 18 QA items. This assessment evaluates the effectiveness of the QA programs in ensuring that states receive high quality materials, make appropriate payment for the quality provided, and minimize the potential for fraud and abuse (7).

Figure 3 identifies the 31 SHAs that, at the time of the survey, were using contractor test results in the acceptance decision process related to PCC pavement, HMA pavement, and PCC bridge decking materials (7, 8). The most commonly used acceptance criteria were strength and smoothness for PCC pavements and in-place density and smoothness for HMA pavements.
For PCC bridge deck acceptance, concrete strength and permeability were the most commonly used acceptance criteria.

Figure 3. States that use contractor test results in acceptance decisions (7).

The SHAs' responses were weighted based on minimizing the potential for waste, fraud, and abuse. The SHAs' total response was given a score based on the percentage of the SHA collective score divided by the total potential score; practices with larger numbers indicate the areas with the largest risk for the SHA. Figure 4 shows the number of SHAs meeting high-risk QA best practices at the time of the survey. The figure shows that only three of 12 (25%) SHAs using F- and t-tests for validating contractor PCC test results and nine of 30 (30%) SHAs using F- and t-tests for validating contractor HMA test results met this high-risk best practice. Figure 5 illustrates the risk related to the 18 survey topics based on the number of SHAs requiring improvement in each area multiplied by a weighting factor assigned to that area. Practices with larger numbers designate areas with the larger SHA risk (7).

Figure 4. Number of SHAs meeting high-risk QA best practices (7). (PCCP = portland cement concrete pavement.) (Figure labels: FY 08, FY 14.)

The report notes that statistical validation of contractor test results and improving the independent assurance approach provide the greatest opportunity for the SHA to further reduce overall program risk (7).

AASHTO Provisional Practice PP-84 (43) is a performance-based PCC mix design procedure with recommended acceptance criteria. The practice includes two new test methods: Super Air Meter (SAM) and electrical resistivity (ER) (44). A recommended practice for testing PCC mixtures using SAM and ER is currently under development (58).

Policy, Standards, and Guidelines

Burati et al. presented guidance for developing new or modifying existing acceptance plans and QA specifications (45). This guidance supports 23 CFR 637B requirements (4) and is consistent with FHWA Technical Advisory T6120.3 (46), which provides clarification of the requirements and guidance to field personnel.
These documents highlight the importance of independent sampling and validation of contractor test results using F- and t-tests, as outlined in the AASHTO implementation manual and guide specifications (5, 6).
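
The weighted risk plotted in Figure 5 is described in the State of the Practice discussion as the number of SHAs requiring improvement in a topic multiplied by that topic's weighting factor. A minimal Python sketch of that arithmetic follows. The weighting factors are those listed in the note to Figure 5 (topic 14 is not listed there, so it is left out); the function name and the SHA counts in the example are hypothetical and are not survey results.

```python
# Weighting factors by survey topic, as listed in the note to Figure 5.
# Topic 14 does not appear in that note and is intentionally omitted.
WEIGHTS = {}
for topics, weight in (
    ((2, 3, 9, 16, 18), 7),
    ((15, 17), 5),
    ((1, 8, 11), 3),
    ((4, 5, 6), 2),
    ((7, 10, 12, 13), 1),
):
    for topic in topics:
        WEIGHTS[topic] = weight


def weighted_risk(needs_improvement):
    """Map topic -> (number of SHAs requiring improvement) x (topic weight)."""
    return {topic: count * WEIGHTS[topic] for topic, count in needs_improvement.items()}


if __name__ == "__main__":
    # Hypothetical counts of SHAs requiring improvement, for illustration only.
    example_counts = {2: 10, 7: 10, 15: 4}
    print(weighted_risk(example_counts))
```

The example shows why the weighting matters: the same number of agencies needing improvement produces seven times the weighted risk for a topic with a factor of 7 as for a topic with a factor of 1.
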

Figure 5. Weighted risk associated with best practices (7).

Note: Topics 2, 3, 9, 16, and 18 have a weighting factor of 7. Topics 15 and 17 have a weighting factor of 5. Topics 1, 8, and 11 have a weighting factor of 3. Topics 4, 5, and 6 have a weighting factor of 2. Topics 7, 10, 12, and 13 have a weighting factor of 1. NTPEP = National Transportation Product Evaluation Program. PWL = percent within limits. PD = percent defective.

2.1.2 Survey of State Highway Agencies

A web-based survey was conducted to obtain information on the use of contractor test results as part of the acceptance procedure for different types of materials, including asphalt concrete mixture, PCC mixture, base or drainage aggregate, subgrade or embankment, and reinforcing or structural steel. The survey questions are presented in Appendix A.

Twenty-eight SHAs completed the survey; 22 of them (79%) responded that they use contractor test results as part of the acceptance process, and six (21%) do not use contractor test results. Among the latter are three SHAs that never used contractor results and three SHAs that indicated future plans for considering use of contractor results. Table 1 shows a breakdown of SHA use of contractor results for the different material types. Since only about half the SHAs responded to the survey, the conclusions derived from the survey may not be representative of the entire country.

Table 1. SHAs' use of contractor test results.

| Material Type | Number (%) of SHAs That Use Contractor Test Results |
|---|---|
| Overall Response | 22 (100) |
| Asphalt Concrete Mixture | 21 (95) |
| PCC Mixture | 14 (64) |
| Base or Drainage Aggregate | 9 (41) |
| Subgrade or Embankment | 9 (41) |
| Reinforcing or Structural Steel | 4 (18) |
| Other Materials | 4 (18) |

The results listed in Table 1 show that 21 agencies (95%) use contractor test results in the acceptance process for asphalt concrete mixtures, 14 agencies (64%) use contractor test data for PCC mixture acceptance, and nine agencies (41%) use contractor test results in the acceptance process for unbound material. SHAs' use of contractor test results for asphalt concrete mixture acceptance is discussed in the following section. Use of contractor test results for other materials is presented in Appendix B.

Asphalt Concrete Mixture

Of the 21 SHAs that indicated use of contractor test results for acceptance of asphalt concrete mixtures, 17 provided information on the process for validating contractor test results; the methods are listed in Table 2. Table 2 shows that F- and t-tests are used by four SHAs (24%) and less fundamental methods (average deviation and multi-laboratory precision) are used by six SHAs (35%). Seven SHAs (41%) reported use of other processes:

• Moving average with department verification tests and split sample comparison tests.
• Independent assurance parameters on split samples.
• Independent samples for contractor and SHA.
• Direct test-to-test comparison within specified comparison limits.
• Design-build projects use F- and t-tests with independent samples. Design-bid-build projects use operational tolerances.
• Multi-laboratory precision value and F- and t-tests for both independent and split samples.
• Starting on F- and t-tests on independent samples. Most of the rest are accepted on a 4-pt running average of results and disputes determined by multi-laboratory precision.

A majority of the SHAs noted that the provisions for using contractor test results are covered in standard specifications, material/construction manuals, and/or supplemental specifications. Fourteen SHAs (48%) indicated no concerns with their processes, and three SHAs (10%) noted concerns. Nine SHAs (31%) had no problems, five SHAs (17%) noted a problem with adequate staffing, and one SHA (3%) indicated a problem with the time needed to complete the testing. Four SHAs (14%) noted concerns related to the differences between laboratories and the provisions for dispute resolution.

The survey asked whether the construction process changed when contractor test results were used. Twenty-four SHAs (83%) responded; the responses are summarized in Table 3.
Table 2. Methods used to validate the contractor test results for asphalt concrete mixtures.

| Method Used to Validate the Contractor Test Data for Asphalt Concrete Mixtures | Number of Responses (%) |
|---|---|
| F- and t-tests, independent samples | 3 (17.6) |
| F- and t-tests, split samples | 1 (5.9) |
| Paired t-test, split samples | 0 (0.0) |
| t-test, independent samples (analysis assumes similar variance in data sets) | 0 (0.0) |
| Average deviation (AD) or average absolute deviation (AAD) | 3 (17.6) |
| Multi-laboratory precision value (acceptable deviation between test values) | 3 (17.6) |
| Other | 7 (41.2) |

Table 3. Effect of use of contractor test results for asphalt concrete mixtures on the frequency of noncompliance actions.

| Effect of Use of Contractor's Test Data for Asphalt Concrete Mixtures on the Frequency of Noncompliance Action | Number of Responses (%) |
|---|---|
| No change in frequency for noncompliance actions | 8 (33.3) |
| Higher frequency of efforts to resolve test result differences between laboratories without dispute | 4 (16.7) |
| Higher frequency of dispute | 5 (20.8) |
| Higher frequency of work stoppages | 2 (8.3) |
| Higher frequency of in-place material removal and replacement | 2 (8.3) |
| Other | 3 (12.5) |

Eight SHAs noted that use of contractor test data did not affect the frequency of noncompliance actions, five SHAs noted a higher frequency of dispute, and three SHAs noted other issues. These issues included changes in the process and staffing levels, consideration of methods to make the contractor sampling and testing random or blind, challenges in implementing F- and t-tests, and changes to the acceptance criteria.

Figure 6. Methods used by SHAs for validating contractor test results. [Bar chart: number of SHA responses (0 to 18) by material type (asphalt concrete mixture, portland cement concrete mixture, base or drainage aggregate, subgrade or embankment, reinforcing or structural steel) and by method (F- and t-test, independent; F- and t-test, split; paired t-test, split; t-test, independent, similar variance; average deviation (AD); multi-lab precision value; other).]

Survey Summary and Observations

Twenty-eight SHAs completed the survey, 22 of which use contractor test results as part of the acceptance procedure for different material types (see Table 1). Of these, 21 SHAs use contractor test results in the acceptance process for asphalt concrete mixtures, and 14 and nine SHAs use contractor test results in the acceptance process for PCC mixtures and unbound material, respectively. Figure 6 shows the methods used to validate contractor test results for various materials. The figure shows that F- and t-tests are used much less by SHAs than less fundamental or higher-risk methods. The survey also revealed how SHAs' validation of contractor test results has affected noncompliance actions. About half of the survey respondents noted concerns with validation processes currently being used. The responses from 28 SHAs led to the following observations:

• Contractor test results are most commonly used in the acceptance process of asphalt concrete mixtures, followed by PCC, base aggregate, and subgrade.
• Different methods are used to validate contractor test results, including F- and t-tests, average deviation, and multiple laboratory difference (or a variation on these methods); there is no dominant method.
• A majority of SHAs have no concerns about their current validation process and introduced no changes to their sampling and testing program because of the use of contractor test results as part of their acceptance program, but they noted a concern about adequacy of staffing to perform the validation.

2.2 Numerical Simulations

Based on the findings of the literature review and the survey of state practices, the procedures/tests listed in Table 4 were recommended for evaluation. Several of these tests were identified based on application [e.g., analysis of variance (ANOVA), normality assumption], robustness of the test, and reported successful application. These tests were categorized based on function. Hypothesis tests included Student's t-test, Welch's t-test (unequal variance t-test), paired t-test, Mann-Whitney test, and Kolmogorov-Smirnov test. Variance tests included the F-test, Ansari-Bradley test, Levene's test, and Bartlett's test. Normality was checked using the Anderson-Darling test, Shapiro-Wilk test, and Lilliefors test.

Table 4. Tests identified for evaluation.

| Test | Also Known As | Comments |
|---|---|---|
| D2S limits | | One-on-one comparison (tests method variability only) |
| X̄ ± CR | | Low power range test |
| Equal variance t-test | Student's t-test | Mean comparison |
| Unequal variance t-test | Welch's, Satterthwaite's | Mean comparison |
| Paired t-test | | Mean comparison |
| Ansari-Bradley test | | Nonparametric |
| Mann-Whitney | Wilcoxon test, Mann-Whitney U, Mann-Whitney-Wilcoxon (MWW) | Nonparametric |
| Fligner-Killeen test | | Nonparametric |
| F-test | | Variance comparison |
| Levene's test | | Variance comparison |
| Bartlett's test | | Variance comparison |
| Friedman's test | | Variance comparison |
| Kruskal-Wallis test | | Variance comparison |
| Kolmogorov-Smirnov test | | Mean comparison |
| Anderson-Darling test | | Normality |
| Shapiro-Wilk test | | Normality |
| Permutation test | | Randomization |
| Bootstrap-based test | | Randomization |

2.2.1 Normal Distribution Data Sets

The identified tests were evaluated using numerical simulations to quantify associated risks and select acceptable tests considering multiple distribution types and construction material AQCs; the evaluation process is illustrated in Figure 7. The first step in the process was generating a random sample from a distribution with a known mean, $\mu_1$, and a known standard deviation, $\sigma_1$ (illustrated by the normal distribution curve on the right half of Figure 7), which represents the SHA sample (or Sample 1). The mean, $\bar{x}_1$, and the standard deviation, $s_1$, of Sample 1 were then calculated. The next step was generating another random sample from a second distribution with a known mean, $\mu_2$, and a known standard deviation, $\sigma_2$ (illustrated by the normal distribution curve on the left half of Figure 7), which represents the contractor sample (or Sample 2). The mean, $\bar{x}_2$, and the standard deviation, $s_2$, of Sample 2 were then calculated.

Figure 7. Numerical simulations flow chart for normal distribution.

With these sample statistics, the tests recommended for further evaluation were applied to the two samples (Sample 1 and Sample 2), and the result was recorded. For instance, when the t-test is applied, $H_0$ was that the data in Sample 1 and Sample 2 come from independent random samples from normal distributions with equal means; $H_a$ was that the two means are not equal. The t-test result was coded as 0 if the test did not reject $H_0$ (equal means) and as 1 if the test rejected $H_0$ (unequal means). These three steps completed one iteration of evaluating the t-test in this example. The process was then iterated a number of times to account for the variability coming from the random number generation. The "success rate" of each test was then evaluated by calculating the ratio of the number of hypothesis test results with a value of zero, $N_{H_0}$ (i.e., "Pass"), to the total number of iterations, $N_T$, as follows:

$$\text{Success Rate (\%)} = \frac{N_{H_0}}{N_T} \times 100$$

For each AQC, four different scenarios of distributions were examined using this iterative process. Figure 8 illustrates the four scenarios considered for in-place density when the mean of the SHA distribution, $\mu_1$, was 94.0% and the standard deviation, $\sigma_1$, was 1.0%. In the first scenario, $\mu_1$ was equal to the mean of the contractor distribution, $\mu_2$, and $\sigma_1$ was equal to the standard deviation of the contractor distribution, $\sigma_2$ ($\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$); the two distributions coincide as shown in Figure 8.
In this case, the t-test result is expected to be zero because the two samples were drawn from distributions with equal means ($\bar{X}_1 = \bar{X}_2$ and $S_1 = S_2$). In the second scenario, $\mu_1$ was equal to $\mu_2$ but $\sigma_1$ and $\sigma_2$ were not equal ($\mu_1 = \mu_2$ and $\sigma_1 \neq \sigma_2$). In the third scenario, $\mu_1$ was not equal to $\mu_2$ but $\sigma_1$ and $\sigma_2$ were equal ($\mu_1 \neq \mu_2$ and $\sigma_1 = \sigma_2$). In the fourth scenario, neither the means nor the standard deviations were equal ($\mu_1 \neq \mu_2$ and $\sigma_1 \neq \sigma_2$). A MATLAB code (47) was developed to run the iterative process (details are presented in Appendix C).

Six different AQCs were considered in the numerical simulations. In-place density, AC, and air voids (AV) were considered for HMA, and compressive strength, flexural strength, and thickness were considered for PCC. Table 5 lists these AQCs and the values used in the numerical simulations. Because of the wide range of target means and standard deviations of these AQCs, the coefficient of variation (CV), the ratio of the standard deviation to the mean, was considered the most appropriate parameter for comparing the test results.

Over 1.2 million numerical simulations were completed for normally distributed data sets. Table 6 lists, for each AQC, the number of distribution scenarios, the number of SHA (Sample 1) sizes, the number of contractor (Sample 2) sizes, the number of iterations, and the resulting number of simulations for the normally distributed data sets. A similar set of analyses was also conducted for two non-normal data sets, using skewed and bimodal distributions. The total number of simulations completed was over 3.7 million; the results are presented in Chapter 3.

2.3 SHA Data

Data for highway projects were used to test the effectiveness of the validation procedures. These data were obtained from six states and included test results for HMA, PCC, and aggregate base.

Figure 8. Numerical simulations distribution scenarios for in-place density (panels: Scenario 1, Scenario 2, Scenario 3, and Scenario 4).
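The iterative process and the four scenarios of Figure 8 can be sketched in a few lines. This is a minimal illustration in Python, not the MATLAB code referenced in the report. It assumes the equal-variance (Student's) t-test, equal sample sizes of 10, a two-sided critical t of 2.101 (α = 0.05, 18 degrees of freedom, read from a standard t-table), and hypothetical contractor parameters for the "not equal" cases; the function names are made up for this example.

```python
import math
import random
import statistics

# Assumption: two-sided critical t for alpha = 0.05 and df = 18 (n1 = n2 = 10),
# taken from a standard t-table. Other sample sizes need a different value.
T_CRITICAL_DF18 = 2.101


def student_t_rejects(sample1, sample2, t_critical):
    """Return 1 if the equal-variance t-test rejects H0 (equal means), else 0."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * statistics.variance(sample1)
                  + (n2 - 1) * statistics.variance(sample2)) / (n1 + n2 - 2)
    t_stat = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(
        pooled_var * (1 / n1 + 1 / n2))
    return 1 if abs(t_stat) > t_critical else 0


def success_rate(mu1, sigma1, mu2, sigma2, n=10, iterations=1000, seed=0):
    """Percentage of iterations in which H0 was not rejected (coded 0)."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(iterations):
        sha = [rng.gauss(mu1, sigma1) for _ in range(n)]         # Sample 1
        contractor = [rng.gauss(mu2, sigma2) for _ in range(n)]  # Sample 2
        if student_t_rejects(sha, contractor, T_CRITICAL_DF18) == 0:
            passes += 1
    return 100.0 * passes / iterations


# SHA distribution fixed at mean 94.0 and standard deviation 1.0, as in the
# text. The contractor values in scenarios 2 to 4 are hypothetical.
scenarios = {
    1: (94.0, 1.0, 94.0, 1.0),  # equal means, equal standard deviations
    2: (94.0, 1.0, 94.0, 2.0),  # equal means, unequal standard deviations
    3: (94.0, 1.0, 92.0, 1.0),  # unequal means, equal standard deviations
    4: (94.0, 1.0, 92.0, 2.0),  # unequal means, unequal standard deviations
}
rates = {k: success_rate(*params) for k, params in scenarios.items()}
```

In Scenario 1 the success rate should sit near 100(1 − α) = 95%, while in Scenario 3 a low success rate is the desired outcome because the means truly differ.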

2.3.1 Data Processing

SHA data were received in PDF files and spreadsheets. The PDF files were scanned test reports from multiple projects within a SHA. These reports contained SHA test results and the corresponding contractor results. The PDF files included some duplicate reports; these were cross-checked and only one copy of each report was used. The PDF files were converted into spreadsheets to allow analysis of the data using Microsoft Excel. The test results in the scanned PDF reports were then cross-checked against the spreadsheets; all identified discrepancies were corrected to match the original scanned reports. The spreadsheets received from SHAs typically listed project numbers, lot and sublot numbers, SHA test results, and corresponding contractor test results; they required only minimal pre-processing before use. A MATLAB code was developed to scan and sort the data contained in the spreadsheets according to project number and lot number. The test results of one lot represented a sample. Table 7 lists the SHA data processed for further analysis. HMA AQCs included density, AV, and AC.
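The sort-and-group step described above can be illustrated with a short sketch. This is Python rather than the MATLAB code used in the study, and the row layout, project identifiers, and values below are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (project number, lot number, SHA result, contractor result).
rows = [
    ("P-001", 2, 93.8, 94.1),
    ("P-001", 1, 94.2, 94.0),
    ("P-002", 1, 92.9, 93.5),
    ("P-001", 1, 93.6, 93.9),
]


def group_by_lot(rows):
    """Sort rows by project and lot, then collect the results of each lot."""
    lots = defaultdict(lambda: {"sha": [], "contractor": []})
    for project, lot, sha_result, contractor_result in sorted(
            rows, key=lambda r: (r[0], r[1])):
        lots[(project, lot)]["sha"].append(sha_result)
        lots[(project, lot)]["contractor"].append(contractor_result)
    return dict(lots)


# Each (project, lot) key holds one sample, i.e., the test results of one lot.
samples = group_by_lot(rows)
```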
Table 5. AQCs and their representative values.

| Pavement Type | AQC | Units | Representative Mean (µ) | Representative Standard Deviation (σ) | Coefficient of Variation (CV), % |
|---|---|---|---|---|---|
| HMA | AC | % | 5.5 | ± 0.5 | 9.1 |
| HMA | In-place density | % | 94 | ± 1.0 | 1.1 |
| HMA | AV | % | 7 | ± 0.5 | 7.1 |
| PCC | Flexural Strength | psi | 550 | ± 100.0 | 18.2 |
| PCC | Compressive Strength | psi | 6,000 | ± 1000.0 | 16.7 |
| PCC | Thickness | inch | 10 | ± 0.25 | 2.5 |

Table 6. Numerical simulations summary.

| AQC | Number of Scenarios | Number of Sample 1 Sizes | Number of Sample 2 Sizes | Number of Iterations | Number of Simulations |
|---|---|---|---|---|---|
| HMA - In-place Density | 4 | 5 | 10 | 1,000 | 200,000 |
| HMA - Asphalt Binder Content | 4 | 5 | 10 | 1,000 | 200,000 |
| HMA - Laboratory Air Voids | 4 | 5 | 10 | 1,000 | 200,000 |
| PCC - Compressive Strength | 4 | 5 | 10 | 1,000 | 200,000 |
| PCC - Flexural Strength | 4 | 5 | 10 | 1,000 | 200,000 |
| PCC - Thickness | 4 | 5 | 10 | 1,000 | 200,000 |
| HMA - In-place Density | 4 | 5 | 10 | 20 | 4,000 |
| HMA - Asphalt Binder Content | 4 | 5 | 10 | 20 | 4,000 |
| HMA - Laboratory Air Voids | 4 | 5 | 10 | 20 | 4,000 |
| PCC - Compressive Strength | 4 | 5 | 10 | 20 | 4,000 |
| PCC - Flexural Strength | 4 | 5 | 10 | 20 | 4,000 |
| PCC - Thickness | 4 | 5 | 10 | 20 | 4,000 |
| HMA - In-place Density | 4 | 5 | 10 | 10 | 2,000 |
| HMA - Asphalt Binder Content | 4 | 5 | 10 | 10 | 2,000 |
| HMA - Laboratory Air Voids | 4 | 5 | 10 | 10 | 2,000 |
| PCC - Compressive Strength | 4 | 5 | 10 | 10 | 2,000 |
| PCC - Flexural Strength | 4 | 5 | 10 | 10 | 2,000 |
| PCC - Thickness | 4 | 5 | 10 | 10 | 2,000 |
| HMA - In-place Density | 4 | 5 | 10 | 5 | 1,000 |
| HMA - Asphalt Binder Content | 4 | 5 | 10 | 5 | 1,000 |
| HMA - Laboratory Air Voids | 4 | 5 | 10 | 5 | 1,000 |
| PCC - Compressive Strength | 4 | 5 | 10 | 5 | 1,000 |
| PCC - Flexural Strength | 4 | 5 | 10 | 5 | 1,000 |
| PCC - Thickness | 4 | 5 | 10 | 5 | 1,000 |
| Total number of simulations for normal distribution | | | | | 1,242,000 |
| Total number of simulations for all three distributions (1,242,000 × 3) | | | | | 3,726,000 |
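The simulation counts in Table 6 and the CVs in Table 5 follow from simple arithmetic, which the short Python check below reproduces. It assumes that each row count is the product of scenarios, Sample 1 sizes, Sample 2 sizes, and iterations, which is consistent with every row of the table; the variable names are illustrative only.

```python
# Table 6: simulations per AQC = scenarios x Sample 1 sizes x Sample 2 sizes x iterations.
n_scenarios, n_sample1_sizes, n_sample2_sizes = 4, 5, 10
iteration_levels = [1000, 20, 10, 5]
n_aqcs = 6

per_aqc = [n_scenarios * n_sample1_sizes * n_sample2_sizes * it
           for it in iteration_levels]
total_normal = n_aqcs * sum(per_aqc)   # normal distribution only
total_all = 3 * total_normal           # normal, skewed, and bimodal

# Table 5: CV (%) = 100 x standard deviation / mean.
aqcs = {
    "AC": (5.5, 0.5),
    "In-place density": (94.0, 1.0),
    "AV": (7.0, 0.5),
    "Flexural Strength": (550.0, 100.0),
    "Compressive Strength": (6000.0, 1000.0),
    "Thickness": (10.0, 0.25),
}
cv = {name: round(100.0 * sd / mean, 1) for name, (mean, sd) in aqcs.items()}
```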

Processing of SHAs' data revealed the following observations:

• Most of the SHA data were obtained using independent sampling techniques, but some data were obtained using split samples with contractors. Using split rather than independent samples can influence the acceptance and payment decisions. While the data obtained from independent samples are influenced by the variability in material, process, sampling, and test method, only variability of the test method influences the data obtained from split samples. Sampling methods are discussed and illustrated with an example in Section 3.3.2.
• SHA definitions of lots, sampling, and testing frequencies varied, resulting in different scenarios for the number of SHA and contractor samples. In general, sample sizes can be classified into three categories based on the number of SHA samples per lot: (1) a single SHA test result per lot, (2) 3 to 20 SHA test results per lot, and (3) more than 20 SHA test results per lot; these categories are discussed and illustrated with an example in Chapter 3.

2.3.2 Plan for Sampling, Testing, and Validation

Processing SHA data revealed that some SHA sampling and testing plans that use contractor data in acceptance decisions lack independent sampling (and thus do not meet the requirements of 23 CFR 637B) and some use a single SHA sample per lot. The research team developed plans for sampling and testing the SHA data and for validating contractor data; these plans are presented as a proposed practice for validating contractor test data (see Part II). The sampling plans are illustrated for the minimum SHA tests per lot (Case 1) and for cumulative lots (Case 2).

Sample Size

It is suggested that anticipated risk be considered when establishing minimum sample sizes for both the SHA and contractor.
Guidelines for selecting the optimum number of samples for validating contractor test data as a function of the SHA buyer's risk (β), the contractor seller's risk (α), and sample size (n), and for determining the risks to SHAs when using F- and t-tests to compare SHA and contractor data sets and integrating OC curves, are discussed in a few publications (45, 49).

Table 7. SHA data used for further analysis.

| SHA ID | Material Type | AQC | Number of Projects | Average Lots per Project | Total Samples (Lots) |
|---|---|---|---|---|---|
| SHA 1 | HMA | Density | 259 | 15 | 3,804 |
| SHA 1 | HMA | AV | 302 | 7 | 2,050 |
| SHA 1 | PCC | Strength | 16 | 22 | 354 |
| SHA 1 | PCC | Thickness | 16 | 22 | 354 |
| SHA 2 | PCC | Strength | 18 | 1 | 25 |
| SHA 3 | HMA | Density | 690 | 7 | 5,084 |
| SHA 3 | HMA | AV | 708 | 8 | 5,620 |
| SHA 3 | HMA | AC | 720 | 9 | 6,488 |
| SHA 3 | HMA | No. 8 Sieve | 720 | 9 | 6,487 |
| SHA 3 | HMA | No. 200 Sieve | 720 | 9 | 6,490 |
| SHA 4 | Aggregate Base | 2 inch Sieve | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | 1 inch Sieve | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | 3/8 inch Sieve | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | No. 10 Sieve | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | No. 40 Sieve | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | No. 200 Sieve | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | Liquid Limit (LL) | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | Plasticity Index (PI) | 3 | 41 | 123 |
| SHA 4 | Aggregate Base | Moisture Content (MC) | 3 | 41 | 123 |
| SHA 5 | HMA | AV | 289 | 6 | 1,734 |
| SHA 5 | HMA | AC | 289 | 6 | 1,734 |
| SHA 5 | HMA | VMA | 289 | 6 | 1,734 |

Note: VMA = voids in mineral aggregate.

Case 1: Minimum SHA Tests Per Lot

This case requires a minimum of six sublots per lot, where a minimum of three SHA results and six contractor results are used for sampling and validation using F- and t-tests. A minimum of three results from each entity is required for the statistical comparisons, although using just three results diminishes the statistical power of the F- and t-tests substantially and increases the risks associated with Type I and Type II errors; larger sample sizes would reduce both SHA and contractor risk. An assessment of anticipated risk should be considered when establishing minimum sample sizes for both the SHA and contractor.

Figure 9 illustrates the sampling, testing, and validation process for Case 1. The process consists of three stages: sampling, primary validation, and secondary validation. In the sampling stage (upper portion of Figure 9), each sample in each lot that consists of a minimum of six split samples was split into three equal portions and labeled 1-A, 1-C, 2-A, 2-C, etc. The number designates the sample number, and the letter designates the identity [A for the Agency (SHA) portion of the split, and C for the contractor portion of the split]. Three sublots were randomly selected to represent the SHA portions for primary validation (1-A, 3-A, and 6-A in the illustration); the results of the contractor tests on sublots corresponding to these SHA samples (i.e., 1-C, 3-C, and 6-C) were excluded from the F- and t-test statistical comparisons, and results from samples 2-C, 4-C, and 5-C were used in the primary validation (F- and t-tests).
Thus, the contractor and SHA results from the same sublots were not used in the primary validation (similar to that required by 23 CFR 637B).

The initial step in the acceptance procedure is to test SHA and contractor data sets according to ASTM E178 to detect outlying observations (48). If an outlier is detected in either set, the outlying observation is discarded, although the plan recommends determining probable causes before discarding a test result. Outlier detection is discussed with an example in Chapter 3.

For primary validation, the independent contractor data set is validated against the SHA data set using the F-test and Welch's t-test (unequal variance t-test) at a significance level, α, of 0.05. The F-statistic is calculated as the ratio of the larger of the two variances (contractor or SHA), $s_a^2$, to the smaller one, $s_b^2$, as follows:

$$F\text{-statistic} = \frac{s_a^2}{s_b^2}$$

The F-critical value is obtained from Tables E.2 through E.5 in Appendix E at a level of α/2 and degrees of freedom, $df_a$ and $df_b$, determined as follows:

$$df_a = n_a - 1 \quad \text{and} \quad df_b = n_b - 1$$

where $n_a$ and $n_b$ are the numbers of samples corresponding to the larger and smaller variances, respectively. When the F-statistic is less than F-critical, the hypothesis is not rejected, and it is concluded that the variabilities of the two data sets are not different. Otherwise, it is concluded that the variabilities of the two data sets are different. When the variabilities are found to be different, the cause of the difference should be investigated. As a starting point, the methods used by SHA and contractor personnel to obtain the samples and conduct the tests should be reviewed.
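The F-test comparison described above can be sketched as follows. This is an illustrative Python fragment, not the report's implementation: the test results are hypothetical, the function name is made up, and the critical value (39.00 for α/2 = 0.025 with 2 and 2 degrees of freedom) is assumed from a standard F-table rather than taken from Appendix E.

```python
import math
import statistics


def f_test(sha, contractor, f_critical):
    """F-statistic with the larger sample variance in the numerator."""
    var_sha = statistics.variance(sha)
    var_con = statistics.variance(contractor)
    if var_sha >= var_con:
        s2_a, n_a, s2_b, n_b = var_sha, len(sha), var_con, len(contractor)
    else:
        s2_a, n_a, s2_b, n_b = var_con, len(contractor), var_sha, len(sha)
    # Guard against a zero smaller variance (e.g., identical results).
    f_stat = s2_a / s2_b if s2_b > 0 else math.inf
    return {
        "f": f_stat,
        "df_a": n_a - 1,
        "df_b": n_b - 1,
        # Variabilities are judged different unless F is below F-critical.
        "variances_differ": not (f_stat < f_critical),
    }


# Hypothetical in-place density results (%), three per entity.
sha_results = [94.1, 93.5, 94.8]
contractor_results = [93.9, 94.4, 93.2]
result = f_test(sha_results, contractor_results, f_critical=39.00)
```

With only three results per entity the critical value is very large, which illustrates why small samples give the F-test little power to detect a real difference in variability.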

Figure 9. Sampling and validation process (Case 1).
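The sampling stage of Case 1 can be illustrated with a short sketch: three of six sublots are drawn at random for the SHA portions, and the contractor results from the remaining sublots are used in primary validation. The Python below is a minimal illustration under those assumptions; the function name and the seed are hypothetical.

```python
import random


def select_primary_validation(sublots, n_sha=3, seed=None):
    """Randomly pick SHA sublots; contractor results come from the others."""
    rng = random.Random(seed)
    sha_sublots = sorted(rng.sample(sublots, n_sha))
    contractor_sublots = [s for s in sublots if s not in sha_sublots]
    sha_labels = [f"{s}-A" for s in sha_sublots]                # Agency portions
    contractor_labels = [f"{s}-C" for s in contractor_sublots]  # contractor portions
    return sha_labels, contractor_labels


sha_labels, contractor_labels = select_primary_validation(
    [1, 2, 3, 4, 5, 6], seed=1)
```

Because no sublot contributes to both lists, the two data sets entering the F- and t-tests are independent of each other.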

Since the primary validation tests are based on independent samples, there is a potential for differences in sampling, testing, and materials variability. For example, material segregation, whether it occurs in haul vehicles or during placement or sample handling, is a common source of variability.

Welch's t-test (also known as the unequal variance t-test) can be used to compare the means of the data sets. Welch's t-statistic is calculated as follows:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{x}_1$, $s_1^2$, and $n_1$ are the mean, variance, and number of results of one data set, and $\bar{x}_2$, $s_2^2$, and $n_2$ are those of the other. The critical t-value is obtained from the t-table (Table E.1 in Appendix E) at a level of α/2 and degrees of freedom, $df'$, approximated as follows:

$$df' = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$

The estimated degrees of freedom should be rounded down to the nearest integer because the degrees of freedom are presented in the t-tables as integers. When the absolute value of the t-statistic is less than t-critical, the hypothesis is not rejected, and it is concluded that the means of the SHA's results and the contractor's results are not different. Otherwise, it is concluded that the means of the SHA's results and the contractor's results from independent samples are different.

If both the F-test and Welch's t-test indicate that the SHA results and the contractor results are not statistically different (i.e., the data sets are from the same population), the contractor's data are considered validated. If either the F-test or Welch's t-test indicates that the contractor results and the SHA results are statistically different, then the contractor's data are not validated and secondary validation is required.

Secondary validation is performed when the contractor's data are not validated in the primary validation stage. The next step is to compare SHA results and the contractor results from the same sublots (i.e., split portions) using the paired t-test.
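Welch's t-statistic and the rounded-down degrees of freedom can be computed as in the sketch below. The data are hypothetical, the function names are made up for illustration, and the critical value of 3.182 (two-sided α = 0.05, 3 degrees of freedom) is assumed from a standard t-table rather than from Appendix E.

```python
import math
import statistics


def welch_t(sample1, sample2):
    """Return Welch's t-statistic and df' rounded down to an integer."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1  # s1^2 / n1
    v2 = statistics.variance(sample2) / n2  # s2^2 / n2
    t_stat = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, math.floor(df)


def means_differ(t_stat, t_critical):
    """Means are judged different unless |t| is below t-critical."""
    return not (abs(t_stat) < t_critical)


# Hypothetical in-place density results (%), three per entity.
sha_results = [94.1, 93.5, 94.8]
contractor_results = [93.9, 94.4, 93.2]
t_stat, df = welch_t(sha_results, contractor_results)
different = means_differ(t_stat, t_critical=3.182)
```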
As illustrated in Figure 9, the secondary validation compares all available results (i.e., 1-A to 1-C, 2-A to 2-C, 3-A to 3-C, 4-A to 4-C, 5-A to 5-C, and 6-A to 6-C). The paired t-test is then used to determine if the average difference between these pairs of results is statistically different from zero. The t-statistic for the paired t-test is calculated as follows:

$$t = \frac{\bar{x}_d}{s_d / \sqrt{n}}$$

where $\bar{x}_d$ is the average of the differences between the split sample test results, $s_d$ is the standard deviation of the differences between the split sample test results, and $n$ is the number of split samples. The critical t-value is obtained from the t-table (Table E.1 in Appendix E) at a level of α/2 and ($n - 1$) degrees of freedom. When the absolute value of the paired t-statistic is less than t-critical, it is concluded that the pairwise difference between SHA results and contractor results is not statistically different from zero, and the contractor's results are validated by secondary validation. Otherwise, the contractor's data are not validated. Because the power of the paired t-test increases with sample size, use of all available split sample results for secondary validation is desired.

Case 2: Cumulative Validation Lots

This case compares lots with a single SHA result to a contractor sample size of three or more observations per lot. Because an F-test cannot be performed in this situation, a cumulative sampling technique is proposed. This sampling technique utilizes a concept similar to a moving average, where a fixed number of lots (e.g., three) are accumulated to form a single cumulative validation lot (CVL). Lots 1, 2, and 3 form CVL 1; then Lot 1 is dropped and a new lot (i.e., Lot 4) is added to form CVL 2. In this technique, illustrated in Figure 10, a window of three (or more) lots continues until a nonconforming lot is encountered; the process then restarts and a new CVL is formed.

The process used in Case 2 is similar to that used in Case 1 except at the sampling stage, where validation results from three consecutive lots are combined to form a CVL. This moving CVL process continues as long as the validation process confirms the contractor's data. An example to illustrate the cumulative sampling technique when performing data validation is presented in Chapter 3. This technique is part of the validation plan described in the proposed practice for validating contractor test data. The results of applying the sampling, testing, and validation plan on SHA data are presented in Chapter 3.

Figure 10. Cumulative sampling technique (Case 2).
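Two pieces of the procedure above lend themselves to a short sketch: the paired t-statistic used in secondary validation and the moving window that forms cumulative validation lots. The Python below is illustrative only. The data and function names are hypothetical, and the restart rule (begin a fresh window with the lots that follow a CVL that fails validation) is an assumption, since the text does not spell out the restart in detail.

```python
import math
import statistics


def paired_t(sha, contractor):
    """Paired t-statistic and degrees of freedom for split-sample results."""
    diffs = [a - c for a, c in zip(sha, contractor)]
    n = len(diffs)
    # Assumes the differences are not all identical (s_d > 0).
    t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


def cumulative_validation_lots(lot_ids, validate, window=3):
    """Form CVLs with a moving window; restart after a CVL fails validation."""
    cvls = []
    start = 0
    while start + window <= len(lot_ids):
        cvl = lot_ids[start:start + window]
        ok = validate(cvl)
        cvls.append((cvl, ok))
        # Slide by one lot while validation holds; otherwise start a new
        # window after the lots just examined (assumed restart rule).
        start = start + 1 if ok else start + window
    return cvls


# Hypothetical split-sample results for six sublots (A and C portions).
sha_split = [94.1, 93.5, 94.8, 94.0, 93.7, 94.3]
contractor_split = [93.9, 93.6, 94.5, 94.2, 93.4, 94.1]
t_paired, df_paired = paired_t(sha_split, contractor_split)

# Hypothetical validation outcome: any CVL containing Lot 5 is not validated.
cvls = cumulative_validation_lots(list(range(1, 9)), lambda cvl: 5 not in cvl)
```

In practice the `validate` callable would run the Case 1 comparisons on the pooled results of the lots in the window; a placeholder is used here to keep the window logic visible.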