**Suggested Citation:** "Appendix J: A Technical Discussion of the Process of Rating and Ranking Programs in a Field." National Research Council. 2011. *A Data-Based Assessment of Research-Doctorate Programs in the United States (with CD)*. Washington, DC: The National Academies Press. doi: 10.17226/12994.



APPENDIX J

A Technical Discussion of the Process of Rating and Ranking Programs in a Field

This appendix explains in detail how the various parts of the rating and ranking process for graduate programs fit together and how the process is carried out. Figure J-1 provides a graphical overview of the entire process and forms the basis for this appendix. The appendix addresses each of the boxes in Figure J-1 separately, starting at the top and generally working downward and to the right. The topics in this appendix include:

- a summary of the sources of data used in the rating and ranking process,
- the survey (S)-based weights, the regression (R)-based weights, and the details of the calculations of the endpoints of the 90 percent ranges,
- the simulation of the uncertainty in the weights by random-halves sampling,
- the simulation of the uncertainty in the values of the program variables,
- the combination of the simulated weights for the significant program variables with the simulated standardized values of the program variables to obtain simulated rankings,
- the resulting 90 percent ranges of rankings that are the primary rating and ranking quantities that we report, and
- a description of an alternative ranking methodology that combines measures of interest to the user.

THE METHOD FOR CALCULATING THE R AND S RANKINGS

[Figure J-1: A graphical summary of the NRC's approach to rating and thereby ranking graduate programs. The figure shows the three input data sets (X, the faculty importance measures; P, the values of the program variables; and R, the faculty raters' ratings of the sampled programs) and numbered boxes tracing the process: (1) random-halves sampling of faculty in X and (1b) averaging to get the survey-based (S) weights; (2) random-halves sampling of raters in R and (2b) averaging to get the ratings used as the dependent variable; (3) random perturbation of P; (4) standardization of P to mean 0 and SD 1; (5a) backwards step-wise regression on principal components to get the regression-based weights; and (5b), (5c) repeating the sampling 500 times to rank each program 500 times and obtain the 90 percent ranges of the S- and R-rankings. Each box is described in the sections that follow.]

The Three Data Sets

The empirical basis of the NRC ratings and rankings is the three data sets indicated in the three unlabeled boxes at the top of Figure J-1. The first, denoted by X, is the collection of faculty importance measures that were derived from data collected in the faculty questionnaire. The data in X are used to derive the direct or survey-based (S) weights discussed more extensively below. The second data set, denoted by P, is the collection of the values of the 20 program variables that were collected from various sources for each program. The data in P are used in the final ratings and rankings of the programs and are discussed in greater detail below. The third, denoted by R, is the collection of ratings of programs by faculty raters. These ratings were made separately from the faculty questionnaire and involved only a sample of programs from each field and only a sample of faculty raters from that field. This sample of faculty ratings plays a crucial role in the derivation of the regression-based weights, discussed more extensively below.

Box (1b): The Direct Weights From the Faculty Questionnaire[1]

Let us turn first to the survey (S) or direct weights in box (1b) in Figure J-1, leaving boxes (1) and (1a) to the later discussion of how the uncertainty in these data was simulated. The faculty questionnaire asks each graduate-program faculty respondent to indicate how important each of 21 characteristics is to the quality of a program in his or her field of study.[2] This information is then used to derive the survey (S) or direct weights for each surveyed faculty member, as described below.

[1] The importance of program attributes to program quality is surveyed in Section G of the faculty questionnaire.

[2] The number of student publications and presentations was not used because consistent data on it were unavailable. The direct or survey-based and regression-based weights were calculated without it.

The original 21 program characteristics listed on the faculty questionnaire are shown in Table J-1; they were divided into three categories: faculty, student, and program characteristics. Of the original 21, there are 20 for which adequate data were deemed to be available for use in the rating process, and these 20 data values for each program became the 20 program variables used in this study, to which we repeatedly refer.

Faculty respondents were first asked to indicate up to four characteristics in each category that they thought were "most important" to program quality. Each characteristic so listed received an initial score of 1 for that faculty respondent. These preferences were then narrowed by asking the faculty members to further identify a maximum of two characteristics in each category that they thought were the most important. Each of these selected characteristics received an additional point, resulting in a score of 2. Given this approach, at most 12 of the program characteristics can have a nonzero value for any given faculty member; of these 12, at most 6 will have a score of 2, and the rest will have a score of 1. At least 8 program characteristics will have a score of 0 for each faculty respondent; more than 8 will be zero if the respondent selected fewer than four "important" or fewer than two "most important" characteristics.

A final question asked faculty respondents to indicate the relative importance of each of the three categories by assigning them values that summed to 100 over the three categories.[3] For each faculty respondent, his or her importance measure for each program characteristic was calculated as the product of the score that it received times the relative importance value assigned to its category. Finally, the 20 importance measures for each faculty respondent were transformed by dividing each one by the sum of his or her importance measures across the 20 program variables.

[3] The faculty task can be thought of as asking faculty how many percentage points should be assigned to each category. The sum of the percentage point weights adds up to 100.

Table J-1. The 21 Program Characteristics Listed in the Faculty Questionnaire

Faculty characteristics
i. Number of publications per faculty member
ii. Number of citations per publication (for non-humanities fields)
iii. Percent of faculty holding grants
iv. Involvement in interdisciplinary work
v. Racial/ethnic diversity of program faculty
vi. Gender diversity of program faculty
vii. Reception by peers of a faculty member's work as measured by honors and awards

Student characteristics
i. Median GRE scores of entering students
ii. Percentage of students receiving full financial support
iii. Percentage of students with external funding
iv. Number of student publications and presentations (not used)
v. Racial/ethnic diversity of the student population
vi. Gender diversity of the student population
vii. A high percentage of international students

Program characteristics
i. Average number of Ph.D.'s granted in last five years
ii. Percentage of entering students who complete a doctoral degree in a given time (6 years for non-humanities, 8 years for humanities)
iii. Time to degree
iv. Placement of students after graduation (percent in either academic positions or postdoctoral fellowships)
v. Percentage of students with individual work space
vi. Percentage of health insurance premiums covered by institution or program
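The per-respondent weight computation just described (score of 1 or 2, multiplied by the category's share of 100 points, then normalized to sum to 1.0) can be sketched in code. This is a minimal illustration with invented variable names and toy responses, not the NRC implementation; only the arithmetic follows the text above.

```python
def direct_weights(scores, category_points, categories):
    """Direct (survey-based) weights for one faculty respondent.
    scores: variable -> 0, 1, or 2 (the scoring scheme above)
    category_points: category -> points, summing to 100
    categories: variable -> its category"""
    raw = {v: s * category_points[categories[v]] for v, s in scores.items()}
    total = sum(raw.values())
    return {v: r / total for v, r in raw.items()}

# Toy example with 4 variables in 2 categories (the real questionnaire
# has 20 usable variables in 3 categories).
categories = {"pubs": "faculty", "grants": "faculty",
              "gre": "student", "funding": "student"}
scores = {"pubs": 2, "grants": 1, "gre": 1, "funding": 0}
cat_pts = {"faculty": 60, "student": 40}  # the respondent's 100-point split

w = direct_weights(scores, cat_pts, categories)
assert abs(sum(w.values()) - 1.0) < 1e-12  # importance measures sum to 1.0
# The field-level survey-based weights are the average of these vectors
# over all responding faculty in the field (box (1b)).
```
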

We will use the following notation consistently: i for a faculty respondent, j for a program in a field, and k for one of the 20 program variables. Thus, x_ik denotes the measure of importance placed on program variable k by faculty respondent i. The values x_ik are non-negative and, over k, sum to 1.0 for each faculty respondent i. The importance measure vector for faculty respondent i is the collection of these 20 values,

    x_i = (x_i1, x_i2, ..., x_i20).    (1)

The entries in these x-vectors are non-negative and sum to 1.0. Denote the vector of average importance weights, averaged across the entire set of faculty respondents in a field, by

    x̄ = (x̄_1, x̄_2, ..., x̄_20).    (2)

The mean value x̄_k is the average weight of the importance given to the kth program variable by all the surveyed faculty respondents in the field. The averages {x̄_k} are the direct or survey-based weights of the faculty respondents because they directly give the average relative importance of each program variable, as indicated by the faculty questionnaire responses in the field of study. Thus, the final 20 importance measures of the program characteristics for each faculty respondent are non-negative and sum to 1.0.

Boxes (2b), (4), and (5a): The Regression-Based Weights

We next consider the processes in boxes (2b), (4), and (5a) in Figure J-1 that lead to the regression-based weights. Again, we leave boxes (2) and (2a) to our later discussion of how we simulated the uncertainty in these data. The regression-based weights represent our attempt to ascertain how much weight is implicitly given to each program variable by faculty members when they rate programs using their own perceived quality of the programs they are rating. We used linear regression to predict average faculty ratings from the 20 program variables and interpreted the resulting regression coefficients as indicating the implicit importance of each program variable for faculty ratings.
This is different from the survey or direct weights that were just described. We have broken down the process of obtaining the regression-based weights into the three parts indicated by boxes (2b), (4), and (5a), which we now discuss in turn.

Box (2b): The average ratings for the sampled programs

The ratings data in R of Figure J-1 are the ratings given by the sampled faculty members to the sample of programs that they were requested to rate. A randomly selected faculty member, i, rates a randomly selected program, j, on a scale of 1 to 6 in terms of his or her perception of its quality. Denote this rating by r_ij. The matrix sampling plan was designed so that a sample of up to 50 of the programs in a field was rated by a sample of the graduate faculty members in that same field. Each rater rated about 15 programs, and none rated his or her own program. On average, each rated program was rated by about 44 faculty raters. The rater sample was stratified to ensure proportionality by geographic region, program size (measured by number of faculty), and academic rank. The program sample was stratified to ensure proportionality by geographic region and program size.

R is the array of all the values of r_ij. Note that R is an incomplete array, because many faculty members who responded to the questionnaire did not rate programs and, except in the small fields, many programs in a field were not rated. Box (2b) indicates that we compute the average of these ratings for program j and denote this average rating by r̄_j. Because each program's average rating is determined by a different random sample of graduate faculty raters, it is highly unlikely that any two programs will be evaluated by exactly the same set of raters. Denote the vector of the average ratings for the sampled programs in a field by r̄. The values of the average ratings in r̄ are the dependent variable in the regression analyses used to form the regression-based weights.

Box (4): The program variables and standardizing

Denote the value of program variable k for program j by p_jk, and define the vector of all program variables for program j by

    p_j = (p_j1, p_j2, ..., p_j20),    (3)

and the array with rows given by p_j by P. A cursory examination of the program characteristics listed in Table J-1 shows that they are on different scales. For example, the number of publications per faculty member (numbers in the fives and tens), the median GRE scores of entering students (numbers in the hundreds), and the percentage of entering students who complete a doctoral degree in 10 years or less (fractions) are reported in values that are of very different orders of magnitude. If these values are left as they are, the size of any regression coefficient based on them will be influenced both by the importance of that program variable for predicting the average ratings (which is what we are interested in) and by the scale of that variable (which is arbitrary and does not interest us). The program variables with large values, such as the median GRE scores, will have very small coefficients to reflect the change in scale in going from GRE scores (in the hundreds) to ratings (in the 1 to 6 range). Conversely, program variables with small values, such as proportions, will have larger regression coefficients to reflect the change in scale in going from numbers less than 1 to ratings (in the 1 to 6 range).

To avoid the ambiguity between the influence of the scale and the real predictive importance of a variable, we needed to modify the values of the different program variables so that they have similar scales. This ensures that program variables with the same influence on the prediction of faculty ratings will have similar regression-coefficient values. Our solution is the very common one of standardizing the p_jk-values by subtracting their mean across the programs in a field and dividing by the corresponding standard deviation. This results in program variables that have the same mean (0.0) and standard deviation (1.0) across the programs in the field. In this way, no program variable will have substantially larger or smaller values than any other program variable across the programs in a field. For the regressions of box (5a), the standardization was done only over the programs that were sampled for rating. We denote the values of the standardized program variables with an asterisk (p_jk* and P*). Two program variables (Student Work Space and Health Insurance) were coded as 1 (present) or -1 (absent). We felt that there was no need for additional standardization of these two program variables, and they were not standardized to have mean 0 and variance 1.
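As a concrete sketch of the standardization in box (4), with the two +1/-1-coded variables left untouched (the toy data and array layout here are invented; the real analysis standardizes the 20 program variables over the sampled programs in a field):

```python
import numpy as np

def standardize(P, binary_cols=()):
    """Standardize each column of P (programs x variables) to mean 0,
    SD 1 over the sampled programs, leaving the +1/-1-coded columns
    (e.g., student work space, health insurance) as they are."""
    P = np.asarray(P, dtype=float)
    P_star = P.copy()
    for k in range(P.shape[1]):
        if k in binary_cols:
            continue  # already on a comparable scale
        col = P[:, k]
        P_star[:, k] = (col - col.mean()) / col.std()
    return P_star

# Toy data: 4 programs, 3 variables; column 2 is +1/-1 coded.
P = [[5.0, 600.0, 1], [10.0, 700.0, -1], [7.0, 650.0, 1], [8.0, 750.0, -1]]
P_star = standardize(P, binary_cols={2})
assert np.allclose(P_star[:, 0].mean(), 0.0)
assert np.allclose(P_star[:, 0].std(), 1.0)
assert np.allclose(P_star[:, 2], [1, -1, 1, -1])  # left untouched
```
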

The standardized program variables for the sampled and rated programs served as the predictor or independent variables in the regressions that lead to the regression-based weights.

Box (5a): The regressions and the regression-based weights

The statistical problem addressed in box (5a) is to use r̄ and P* as the dependent and independent variables, respectively, in a linear regression, to obtain the vector of regression-based weights, m̂, using least squares. It should be noted that only the data in P* for the sampled programs are used; the data for the non-sampled programs in P* are not used in this step of the process.

Two immediate problems arise: (1) the number of observations (i.e., the number of sampled programs in a field) is 50 or less, while the number of independent variables (i.e., the program variables in P*) is 20, and (2) a number of the program variables are correlated with each other across the programs in a field. This is less than an ideal situation for obtaining stable regression coefficients. There are too few observations to hope for stable estimates of the coefficients for 20 variables, and the fact that these variables are also correlated does not help matters. Had we ignored these two problems, least-squares regression methods would have tended to assign coefficients rather arbitrarily to one particular variable or to other variables correlated with it, and how this worked out would depend on which programs were included in the sample of rated programs. The resulting unstable regression coefficients would have been unusable for our purposes. For example, as expected, when we fit a linear model that included all 20 of the program variables, we found that for a number of the variables the coefficients and their signs did not make intuitive sense. However, we found, as expected, that they made more sense when we used various step-wise selection methods to reduce the number of variables used as predictors. With only 50 cases, we had to expect that we could not use all 20 variables in the prediction equations without adjustments.

After examining a variety of approaches, we settled on a backwards, step-wise selection method applied to the 20 principal component (PC) variables formed from the 20 program variables (rather than to the original 20 program variables). The regression coefficients obtained for the remaining PC variables were then transformed back to the scale of the original 20 program variables, with the result that all 20 program variables now had non-zero coefficients, but these coefficients were subject to several linear constraints implied by the deleted PC variables.

The principal component variables are linear combinations of the original 20 program variables that have two properties: (1) they are uncorrelated in the sample, and (2) they can give exactly the same predictions as the original variables; that is, every prediction equation that is possible with the original variables can also be formed with the PC variables, using different regression coefficients. The PC variables are usually ordered by their variances from largest to smallest, but this plays no role here. There are as many PC variables as there are original variables, in our case 20. If we denote the array of the original 20 standardized variables for the sample of rated programs as P*, then the corresponding array of the 20 PC variables, C, is given by the matrix multiplication C = P*V, where V is the 20 by 20 orthogonal matrix specified by, among other things, the singular value decomposition of P*. After the regression coefficients are estimated using the PC variables, we get back to the coefficients for the original standardized variables in P* by transforming the vector of regression coefficients by the matrix V.

Our step-wise use of the PC variables proceeded as follows. We begin with a least-squares prediction equation, predicting r̄ from C, that includes all of the PC variables. Then a series of analyses is performed, with one PC variable at a time being left out of the prediction equation; the PC variable that has the least impact on the fit of the predicted ratings (as measured by its t-statistic) is removed. This process is repeated, removing one PC variable each time, until the remaining PC variables each add statistically significant improvements to the fit of the predictions of the ratings (at the 0.05 level). The result is a set of regression coefficients, the PC coefficients, γ̂, which predict the sample of program ratings from a subset of the PC variables, i.e.,

    r̂ = C γ̂.    (4)

In Equation 4, the caret denotes estimation. Moreover, for the PC variables that have been eliminated during the backwards selection process, the corresponding PC coefficients, γ̂_k, are zero. These zeros mean that we are setting the coefficients of certain linear combinations of the original variables to zero rather than setting the coefficients for some of the original program variables to zero. This was regarded as a virtue, because we did not necessarily eliminate any of the original program variables from the prediction equation used to find the regression-based weights. By proceeding this way, we are not forced to give a zero weight to one of two collinear variables in the step-wise procedure. Instead, both collinear variables will typically load onto the same principal components and get some weight when the matrix V is applied to the PC coefficients to obtain the coefficients for the original program variables, i.e.,

    m̂ = V γ̂.    (5)

In the same way, the matrix of estimated variances and covariances of γ̂, obtained from the least-squares output, may be transformed to the corresponding matrix for m̂.[4]

[4] If the weights from the R and S measures were to be combined, the variances from this matrix would be used later [in box (6) of the computation of combined weights] in the computation of the "optimal fraction" for combining the survey-based and regression-based weights.

The regression coefficient for the kth program variable, denoted by m̂_k, is the regression-based weight for program characteristic k as a predictor of the average ratings of the programs by the faculty raters, and m̂ = (m̂_1, m̂_2, ..., m̂_20).

The predicted perceived quality rating for a sampled program can be expected to differ somewhat from the actual average rating for that program. For example, for the two fields studied in Assessing Research Doctorate Programs: A Methodology Study, the root-mean-square deviation between the predictions and the average ratings was 0.42 on a 1-to-6 rating scale for both mathematics and English. In addition, the (adjusted) R² of the regressions of average ratings on measured program characteristics was 0.82 for mathematics and 0.80 for English. These values indicate that the predictions account for about 80 percent of the variability in average ratings. We regarded this as a satisfactory level of agreement between predicted and actual ratings for the use of these methods in this study.

These results show that the predicted perceived quality ratings agree fairly well with the actual ratings. However, they do not indicate how well a prediction equation based on a sample of programs will reproduce the predictions of the equation for the whole population of programs in a field. The data for mathematics, reported in Assessing Research Doctorate Programs: A Methodology Study, indicate that using 49 programs did a reasonably good job of reproducing the predictions based on the whole field of 147 mathematics programs.[5] Thus, we decided that in developing the regression-based ratings we would use a sample of 50 programs from a field if it had more than 50 programs and use almost all of the programs in fields with 50 or fewer programs. When there were fewer than 30 programs in a field, it was combined with a larger discipline with similar direct weights for the purposes of estimating the regression-based weights.[6] In two cases, computer engineering and engineering science and materials, there were fewer than 25 programs, and these fields were not ranked, although data are reported for all 20 characteristics.[7]

[5] See Appendix G of Assessing Research Doctorate Programs: A Methodology Study, National Research Council (2003).

[6] The fields for which this was done were:

    Small Field                    Surrogate Field
    Aerospace engineering          Mechanical engineering
    Agricultural economics         Economics
    American studies               English literature
    Astrophysics and astronomy     Physics
    Entomology                     Plant science
    Forestry                       Plant science
    Food science                   Plant science
    Theatre and performance        English literature

[7] Ranges of rankings are not provided for three fields that were in the original taxonomy: (1) Languages, Societies, and Cultures, for which the sub-fields were too diverse to treat it as a coherent field; and (2) Engineering Science and Materials and (3) Computer Engineering, which fell below the minimum of 25 programs needed to permit the calculation of rankings for a field. The committee had not anticipated this when it developed the taxonomy, or these would not have been included as separate fields.

There is one final alteration in the values of m̂ that needs to be mentioned. The survey-based or direct weights, {x̄_k}, have absolute values that sum to 1.0. This is not necessarily true of the regression coefficients, {m̂_k}; the scale of m̂_k depends on both the scale of p_jk and the scale of the average ratings, {r̄_j}. Because our initial intent was to combine these two sources of the importance of the various program variables, we decided that they needed to be on similar scales, and we forced both to sum to 1.0 in absolute value.[8] This allows the direct and regression-based weights to have negative values where they arise, typically in the regression-based weights, without requiring anything complicated to deal with them. Using the sum of absolute values allows the sign of the regression-based weights to be determined by the data rather than by an a priori hypothesis. Thus, we divided each regression coefficient, m̂_k, by the sum of the absolute values of all the regression coefficients. In this way, both the direct and

[8] We use the absolute value here because, for time to degree, a higher value should receive a negative weight. Note that this normalization has no effect on relative rankings, since it is simply a linear transformation.
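The backwards step-wise PC regression and the final normalization can be sketched as follows. This is a simplified illustration, not the NRC code: the fixed cutoff |t| >= 2.0 stands in for the exact 0.05-level significance test, the function names are invented, and the data are synthetic.

```python
import numpy as np

def ols_t_stats(X, y):
    """Least-squares fit with a t-statistic for each coefficient."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    s2 = (resid @ resid) / dof
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta, beta / np.sqrt(np.diag(cov))

def pc_stepwise_weights(P_star, r_bar, t_crit=2.0):
    """Backwards step-wise regression on principal components,
    mapped back to the original variables (Equations 4 and 5)."""
    _, _, Vt = np.linalg.svd(P_star, full_matrices=False)
    V = Vt.T                       # orthogonal matrix from the SVD of P*
    C = P_star @ V                 # the PC variables: C = P*V
    keep = list(range(C.shape[1]))
    while len(keep) > 1:
        _, t = ols_t_stats(C[:, keep], r_bar)
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_crit:
            break                  # every remaining PC is significant
        keep.pop(worst)            # drop the least significant PC
    gamma = np.zeros(C.shape[1])   # deleted PCs keep zero coefficients
    gamma[keep] = ols_t_stats(C[:, keep], r_bar)[0]
    m = V @ gamma                  # back to the original variables (Eq. 5)
    return m / np.abs(m).sum()     # force the absolute sum to 1.0

# Synthetic check: 50 rated programs, 20 standardized variables.
rng = np.random.default_rng(0)
P_star = rng.standard_normal((50, 20))
true_w = np.zeros(20)
true_w[0], true_w[1] = 1.0, -0.5   # only two variables really matter
r_bar = P_star @ true_w + 0.1 * rng.standard_normal(50)
m_hat = pc_stepwise_weights(P_star, r_bar)
assert abs(np.abs(m_hat).sum() - 1.0) < 1e-9
```

Note how dropping PCs (rather than original variables) leaves every original variable with a generally non-zero entry in the returned weight vector, as the text describes.
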

regression-based weights are fractional values, mostly positive but some negative, whose absolute sums equal 1.0.9

Boxes (1), (1a), (2) and (2a): Simulating the Uncertainty in the Direct and Regression-Based Weights

The survey-based (S) or direct weight vector, x̄, is subject to uncertainty; that is, a different set of respondent faculty would have led to different values in x̄. Disagreement among the graduate faculty on the relative importance of the 20 program variables is the source of the uncertainty of the direct or survey-based weights. The average ratings of the sampled faculty in r̄ are also subject to uncertainty; a different sample of raters or programs would have produced different values in r̄. One way to reflect this uncertainty is to use the sampling distributions of x̄ and r̄. There are various ways that these sampling distributions may be realized. We chose an empirical approach that made no assumptions about the shapes of the various distributions involved; instead, it allowed us to use computer-intensive methods to let the sampling variability of both x̄ and r̄ influence the final ratings and rankings. We examined two empirical approaches, Efron's bootstrap and a random-halves (RH) procedure suggested by the committee chairman. We found that both gave very similar final results in terms of the final ranges of rankings and ratings. The bootstrap requires taking a sample of N with replacement from the relevant empirical distribution. The RH procedure requires taking a sample of N/2 without replacement from the same empirical distribution. We chose to use the RH procedure because it cut the sampling computations in half, is fairly easy to explain, and, as far as we could tell, gave essentially the same results as the bootstrap for ranking and rating.

Boxes (1) and (2): The random halves procedure

The RH procedures for x̄ and r̄ are nearly the same, and have the same justifications.
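Why the two procedures agree so closely can be checked exactly on a toy population: for a small N we can enumerate every possible half sample of size N/2 drawn without replacement and compare the variance of the half-sample means to the variance of a bootstrap-style mean, σ²/N. The faculty importance values below are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical importance ratings from N = 8 faculty respondents.
x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 4.0, 5.0, 5.0])
N = len(x)

# Variance of the mean of a full sample of size N drawn with replacement:
# sigma^2 / N, where sigma^2 uses the divisor N.
boot_var = x.var() / N

# Enumerate every half sample of size N/2 drawn without replacement and
# compute the exact variance of the half-sample means over all of them.
half_means = [np.mean([x[i] for i in idx])
              for idx in itertools.combinations(range(N), N // 2)]
rh_var = np.var(half_means)

# The two agree up to the small "N versus N - 1" factor.
print(boot_var, rh_var, rh_var / boot_var)  # ratio = N / (N - 1)
```

The exact enumeration reproduces the finite-population result: the variance of a half-sample mean without replacement equals S²/N, which differs from the bootstrap's σ²/N only by the factor N/(N − 1).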
X is a complete array whose rows denote the N faculty respondents, while R is an incomplete array whose rows denote the n sampled faculty raters for a field. In the case of X, the RH procedure requires a random sample of size N/2 of the faculty respondents. In the case of R, the RH procedure requires a random sample of size n/2 of the faculty raters. Repeated draws of these random half samples are then used to simulate the uncertainty in x̄ and r̄, respectively. Alert readers may worry that these half samples will exhibit too much variability in the resulting averages; after all, a half sample has only half the number of cases of a full sample, and the bootstrap always takes a full sample of N or n. The explanation of why a half sample without replacement has essentially the same variability as a full sample with replacement is most easily seen by considering the variance of the mean of a sample without replacement from a finite population. It is well known from sampling theory that the variance of the mean from a sample of size N/2, from a population of size N, is, essentially,

9 The estimated standard deviations of the { m̂k }, obtained in standard ways from the regression output, were also divided by this sum to make them the correct size for use in the process of combining the direct and regression-based weights, discussed below.

Var(x̄k) = (σ²xk/(N/2))(1 − (N/2)/N) = σ²xk/N.   (6)

That is, because of the "finite sampling correction," the variance from a random half sample without replacement is exactly the same as the variance of a random sample of twice the size with replacement (there is a small "N versus N − 1" effect that Equation 6 ignores). This is why the bootstrap and the RH methods give such similar results in our application to the uncertainty of the direct weights. There are other reasons to expect the RH method to produce a useful simulation of the uncertainty of averages.10 The same reasoning applies to the RH sampling of the faculty raters in R to simulate the uncertainty in the average ratings, r̄, used to obtain the regression-based weights. The procedure was to sample a random half of all raters for programs in a field and compute the average rating for each program from that half sample. The regression-based weights are subject to uncertainty from two sources. The first is the uncertainty arising from sampling the faculty raters; as indicated above, the RH sampling directly addresses this source. The second arises from using average ratings from a sample of programs, rather than all the programs, to develop the regression equation from which the regression-based weights are derived. In the discussion of box (4), above, we gave our reasoning for believing that the sample of 50 programs is adequate and described how we pool the data from other related fields when the number of programs in a field is smaller than 50. In addition, while the use of ratings for a sample of programs has the practical value of reducing the workload of the faculty raters, our implicit use of the predicted average ratings, { Mj }, from Equation 5 above, rather than the actual average ratings, { r̄j }, also reduces some of the uncertainty due to the sampling of the programs to be rated.
For these two reasons, we believe that this second source of uncertainty is not as important as the one simulated by the RH procedure for the uncertainty in the average ratings and, consequently, for the regression-based weights, m̂. We always drew the RH samples 500 times, and those for x̄ were statistically independent of those for r̄. This gives us 500 replications of the direct or survey-based weights and 500 replications of the regression-based weights.

Boxes (3) and (3a): Incorporating Uncertainty into the Program Variables

In addition to the uncertainty in the survey-based (direct) and regression-based weights discussed above, there is also some uncertainty in the values of the program variables themselves. Some of the 20 program variables used to calculate the ratings also vary, or have an error associated with their values, due to year-to-year fluctuations. Data for five of the variables (publications per faculty, citations per publication, GRE scores, Ph.D. completion, and number of Ph.D.'s) were collected over time, and averages over a number of years were used as the values of these program variables. If a different time period had been used, the values would have been

10 The random-halves procedure has a place in the statistical literature, but under other names. It is an example of the "deleted-d" jackknife, as described in Efron and Tibshirani (1993), An Introduction to the Bootstrap. New York: Chapman and Hall, p. 149, with d = n/2. It was described by Kirk Wolter in a private communication as an example of "balanced repeated replication" or "balanced half samples," as described in Wolter, K. M. (2007), Introduction to Variance Estimation, 2nd ed. New York: Springer-Verlag.

different. To express this type of uncertainty, a relative error factor, ejk, was associated with each program variable value, pjk. The relative error factor was calculated by dividing the standard deviation over the series by the square root of the number of observations in the series, and then dividing that number by the value of the variable pjk. For example, the publications per faculty variable is the average number of allocated publications per allocated faculty member over 7 years, and a standard error was calculated for this variable as SD/√7. This standard error was then divided by the value of the publications per faculty variable to get the relative error factor for this program variable. For the other 15 program variables that are used in the ratings, no data on variability were directly obtained during the study, and we assigned a relative error of 0, 0.1, or 0.2 to these variables. The variables Student Workspace and Health Insurance were given a relative error of 0, because they were thought to have little or no temporal fluctuation over the interval considered; for Percent of Faculty Holding Grants, the error assigned was 0.2, because an examination of data from the National Science Foundation Survey of Research Expenditure indicated this to be an appropriate estimate. The remaining 12 program variables were assigned a relative error of 0.1. Each program had its own relative error factor for each program variable, ejk. Just as we had simulated values from the sampling distributions of x̄ and r̄ via RH sampling, we also wanted to reflect the uncertainty in the values of the program variables themselves, rather than using the fixed values, { pjk }, in computing program ratings. We did this in the following way.
The value pjk was perturbed by drawing randomly from the Gaussian distribution N(pjk, (ejkpjk)²). This distribution has a mean equal to the variable value, pjk, and a standard deviation equal to the relative error, ejk, times the variable value, pjk. Thus, the entire array P is randomly perturbed to a new array, P̃. This perturbing process is repeated 500 times, and each result is standardized to have mean 0.0 and standard deviation 1.0 for each of the 20 program variables, producing 500 standardized arrays, P̃*.
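In outline, one such perturbation and standardization looks like the following sketch (the program-variable values and relative error factors are hypothetical, and a single perturbation stands in for the 500 replications):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values of 3 program variables for 5 programs
# (e.g., publications per faculty, GRE score, Ph.D. completion rate).
P = np.array([[2.0, 650.0, 0.6],
              [1.2, 700.0, 0.4],
              [3.1, 620.0, 0.7],
              [0.8, 690.0, 0.5],
              [2.4, 660.0, 0.6]])

# Hypothetical relative error factors e_jk, one per cell.
E = np.full_like(P, 0.1)

# Perturb each value: draw from N(p_jk, (e_jk * p_jk)^2).
P_tilde = rng.normal(loc=P, scale=E * np.abs(P))

# Standardize each variable (column) to mean 0.0 and SD 1.0.
P_star = (P_tilde - P_tilde.mean(axis=0)) / P_tilde.std(axis=0)

print(P_star.mean(axis=0))  # ~0 for each variable
print(P_star.std(axis=0))   # 1 for each variable
```

Standardizing after each perturbation puts all 20 variables on a common scale, so that the weights, rather than the variables' raw units, determine each variable's influence on the ratings.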

Boxes (5b) and (5c): The Ninety Percent Ranges of the S and R Rankings

In box (5b) we have already calculated 500 replications of the survey-based weights, and in box (5c) we have done the same for the regression-based weights for the given field [from box (2b)]. From 500 replications of the steps in boxes (5b) and (5c), we have 500 replications of the standardized perturbed version of P that contains the program variable data for all of the programs to be rated in the field. For either measure, denote the kth replication of Rj by Rj(k). To obtain the kth replication of the rankings of the programs, sort the values of Rj(k) over j from high to low and assign the rank of 1 to the program with the highest rating in this set. In the case of tied ratings, we use the standard procedure in which the ranks are averaged for the tied cases: the common rank given to the tied programs is the average of the ranks that would otherwise have been given to the programs in the tied set. For each replication of the ratings, there is a corresponding replication of the rankings of the programs, resulting in 500 replications of the ranking of each program. Instead of reporting a single ranking of the programs in a field, we report the ninety percent range of the rankings for each program. This is an interval starting with the rank at the 5th percentile of the distribution of the 500 replications of the ranks for the given program and ending at the 95th percentile of this distribution. The ninety percent range covers the middle ninety percent of the rankings and reflects the uncertainty in the survey-based (direct) and regression-based weights and in the program data values: five percent of a program's rankings in our process are lower than this interval and five percent are higher.
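The steps above can be sketched as follows, assuming a hypothetical array of replicated ratings for a small field (a real run would use the 500 replicated ratings produced by the boxes above):

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_ranks_desc(r):
    """Rank 1 = highest rating; tied ratings share the average rank."""
    order = np.argsort(-r)
    ranks = np.empty(len(r))
    ranks[order] = np.arange(1, len(r) + 1)
    for v in np.unique(r):          # average the ranks of any tied set
        mask = r == v
        if mask.sum() > 1:
            ranks[mask] = ranks[mask].mean()
    return ranks

# Hypothetical ratings: 500 replications (rows) for 6 programs (columns).
ratings = rng.normal(loc=[2.0, 1.5, 1.4, 1.0, 0.5, 0.4],
                     scale=0.3, size=(500, 6))

# One ranking per replication: 500 rankings for each program.
ranks = np.apply_along_axis(avg_ranks_desc, 1, ratings)

# Ninety percent range: 5th to 95th percentile of each program's 500 ranks.
lo = np.percentile(ranks, 5, axis=0)
hi = np.percentile(ranks, 95, axis=0)
print(list(zip(lo, hi)))
```

Programs with similar ratings (here, the second and third columns) get wide, overlapping rank ranges, while clearly separated programs get narrow ones, which is exactly the behavior the ninety percent range is meant to convey.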
The interval itself represents what we would expect the typical rankings for that program to be, given the uncertainty in the process and the ratings of the other programs in the field.11 These ninety percent ranges are reported for the R and S measures, as well as for the three dimensional measures.

AN ALTERNATIVE APPROACH TO CONSTRUCTING RANKINGS: COMBINING THE R AND S MEASURES

The prepublication version of the revised Methodology Guide appeared in July 2009 and explained the methodology developed by the committee at that time, that is, one that combined the R-based and S-based measures in a way that will be described below. In July 2009, the committee had estimated ranges of rankings for only a handful of fields and assumed that this method of estimation would be generally satisfactory. In theory it is, but when it was applied to data for additional fields, it became clear that there were some fields for which the range of program rankings based on the S measure differed considerably from that based on the R measure. Further, the committee came to view any set of ranges of rankings that it might develop as

11 In an earlier draft of this guide, we chose an inter-quartile range, but this choice, rather than some other range (eliminating the top and bottom quintile, for example), is arbitrary. The current approach uses broader ranges, which results in greater overlap of ranges but has the advantage of covering most of the rankings a program might achieve. The point of introducing uncertainty into our calculations is that we do not know the "true" ranking of a program. The purpose of presenting a ninety percent range is to provide a range in which a program's ranking is likely to fall.

illustrative; that is, any range of rankings depended critically on the characteristics chosen and the weights applied to those characteristics. The R-based and S-based ranges of rankings were two examples of data-based ranking schemes, but there are others. In fact, the dimensional measures described in the body of this Guide are an example.12 The technical description of the further steps that the committee carried out to obtain ranges of rankings using the combined measures is given in this section, beginning with an alternative conceptual diagram.

12 In most cases, it would not make sense to combine the dimensional measures because they yield differing results for most programs.

Figure J-2  A graphical summary of the alternative method.

The three sets of data: X, P, and R.
X = the collection of the faculty importance measures. A complete array with an importance value for every program variable by every responding faculty member.
P = the collection of the values of the program variables. A complete array with a value for every program (that satisfies the inclusion criteria for rating and ranking) in a field, on every program variable.
R = the collection of ratings of programs by the faculty raters. An incomplete array, with ratings only for the sampled programs and rated only by those faculty members who were sampled to rate a given sampled program.

(1) Random halves sampling of faculty in X.
(1a) Results in one random half of X, denoted by X̃.
(1b) Average X̃ over faculty to get the survey-based (S) weights, x̄. The sum of these weights = 1.0.

(2) Random halves sampling of raters in R.
(2a) Results in one random half of R, denoted by R̃.
(2b) Average R̃ over raters to get average ratings for the sampled programs, r̄. This is the dependent variable in the regressions.

(3) Random perturbation of the values in P.
(3a) Results in one randomly perturbed version of P, denoted by P̃.
(4) Standardize P̃ to get P̃*: standardize the program variables to mean = 0 and SD = 1. These are the independent variables in the regressions.

(5) (a) Transform the original program variables to principal components (PCs). (b) Perform backwards stepwise regression to obtain a stable fitted equation predicting average ratings from the remaining PCs. (c) Transform the resulting coefficients back to the original program variables to get the regression-based weights, m̂, and make their absolute sum = 1.0.

(6) Select policy weight, w = ½.

(7) Combine x̄, m̂, and w = ½ using the optimal fraction to form the combined weights, f0.

(8) Repeat the steps from (1) to (7) 500 times. Use the resulting 500 samples of f0 to eliminate program variables in X and P* having non-significant combined weights. Repeat this until there are no non-significant program variables. The final output is the last 500 replications of f0, with zero entries for all non-significant variables.

(9) Repeat steps (3) to (3a) to get 500 replications of P̃*, and combine them with the final 500 replications of f0 to get 500 ratings for each program. Rank the programs for each set of 500 ratings. This results in 500 rankings for each program. Use these 500 rankings to get the inter-quartile range of the rankings for each program.

Note: Shaded boxes indicate steps used in an alternative technique and are omitted from the technique used to generate the current rankings.

Boxes (6) and (7): The Combined Weights

To motivate our method of combining the direct and regression-based weights, we start by describing the direct and regression-based ratings. Recalling that the standardized values of the program variables for program j are denoted by pjk*, the direct rating for program j, using the average direct weight vector, x̄, is Xj, given by

Xj = Σ(k=1 to 20) x̄k pjk*.   (7)

The regression-based rating for program j, using the regression-based weight vector, m̂, is Mj, given by

Mj = Σ(k=1 to 20) m̂k pjk*.   (8)

Note that the regression-based rating is a linear transformation of the predicted ratings used to obtain the regression-based weights, because the constant term of the regression is deleted and the weights have been scaled by a common value so that their absolute sum is 1.0. The procedure for computing regression-based ratings can be used for any program, sampled or not, in the given field. Simply use Mj as defined in Equation 8 above, where { pjk* } comes from the data for program j and the { m̂k } are the regression-based weights based on the sample of programs and raters.13 We combined the direct ratings with the regression-based ratings as follows. Let w denote a policy weight and form the following combination of the direct and regression-based ratings:

Rj = wMj + (1 − w)Xj.   (9)

The policy weight, w, is chosen in box (5) of Figure J-1 and is the amount by which the regression-based ratings are allowed to influence the combined rating, Rj. When w = 0, the regression-based rating has no influence on Rj. When w = 1, the Rj are totally based upon the regression-based ratings. Any compromise value of w is somewhere between 0 and 1. We did not actually form both the direct and regression-based ratings in our work.
Instead, we exploited the simple linear form of these ratings, given by

Rj = w Σ(k=1 to 20) m̂k pjk* + (1 − w) Σ(k=1 to 20) x̄k pjk* = Σ(k=1 to 20) f̄k pjk*,   (10)

where the combined weight, f̄k, is given by

13 We have throughout estimated linear regressions. Is this assumption justified? We can only say that, empirically, we tried alternative specifications that included quadratic terms for the most important variables (publications and citations) and did not find an improved fit.

f̄k = w m̂k + (1 − w) x̄k.   (11)

The representation of the combined rating given in Equations 9 and 10 is a linear combination of the program variables that uses the combined weights, { f̄k }, defined in Equation 11. The combined weight f̄k is applied to the kth standardized program characteristic, pjk*, for each k, and then all 20 of these weighted values are summed to obtain the final combined rating for program j. However, because both m̂k and x̄k are subject to uncertainty, we made one additional adjustment to Equation 10 that is described below, following the discussion of how we simulated the uncertainty in both the direct weights and in the average ratings used to form the regression-based weights.

Box (7): Using the optimal fraction to combine the direct and regression-based weights.

In deriving the ranges of ratings that reflect the uncertainty in m̂k and x̄k, simulated values, mk and xk, are drawn from the sampling distributions of m̂k and x̄k, respectively, using independent RH samples from the appropriate parts of R and X. These two simulated values are to be combined to form a simulated value, fk, for f̄k in Equation 11. However, the simple weighted average in Equation 11 only reflects the effect of the policy weighting, w, and ignores the fact that both mk and xk are independent random draws from distributions rather than fixed values. We want to combine mk and xk in such a way as to bring the simulated value, fk, as close as possible to f̄k on average, and in a way that will also reflect the policy weight, w, appropriately. This section outlines our approach to choosing the optimal fraction to apply to mk to achieve this. The optimal fraction is the amount of weight applied to mk that minimizes the mean-square error of fk, treating f̄k as a target parameter to be estimated. First, consider a general weighting, fk(u), that uses a fraction, u. This weighting has the form

fk(u) = umk + (1 − u)xk.
   (12)

By construction of the RH procedure, the mean of the distribution of mk is m̂k (the regression coefficients that are obtained when the data from all n faculty raters are used). Similarly, the mean of the distribution of xk is x̄k, the mean importance value that is obtained when the data from all N faculty respondents are averaged. We may regard fk(u) as an estimator of φk, given by

φk = w m̂k + (1 − w) x̄k.   (13)

The problem then is to find the value of u that will minimize the mean-square error (MSE) of fk(u), given by

MSE(u) = E(fk(u) − φk)²,   (14)

where, in Equation 14, the notation E(fk(u) − φk)² denotes the expectation, or average, taken over the independent RH distributions of m̂k and x̄k. The MSE is a measure of the combined uncertainty in fk(u). The MSE in Equation 14 can be written as

MSE(u) = E(umk + (1 − u)xk − w m̂k − (1 − w) x̄k)²
       = E(u(mk − m̂k) + (1 − u)(xk − x̄k) + (u − w) m̂k + (w − u) x̄k)²
       = E(u(mk − m̂k) + (1 − u)(xk − x̄k) + (u − w)(m̂k − x̄k))².   (15)

The point of re-expressing Equation 14 as Equation 15 is that, when the squaring is carried out, all of the terms except the squared ones have zero expected values and can be ignored. If we denote the variance of the sampling distribution of mk by σ²(m̂k) and the variance of xk by σ²(x̄k), then Equation 15 becomes

MSE(u) = u²σ²(m̂k) + (1 − u)²σ²(x̄k) + (u − w)²(m̂k − x̄k)².   (16)

It is now a straightforward task to differentiate Equation 16 in u, set the result to zero, and solve for the optimal u-value, u0k, which we call the optimal fraction. This calculation results in

u0k = [σ²(x̄k) + w(m̂k − x̄k)²] / [σ²(x̄k) + σ²(m̂k) + (m̂k − x̄k)²].   (17)

The optimal fraction in Equation 17 has some useful and intuitive properties. It takes on the value w when there is no uncertainty about the direct and regression-based weights. Moreover, w has no influence on the optimal fraction when m̂k and x̄k are equal. In that case, the direct weights and regression-based weights on the kth program characteristic are the same, and the optimal fraction combines the two simulated values in a way that is inversely proportional to their variances, so that the value with less variation gets more weight. Note also that the value in Equation 17 is the same for all of the RH simulated values of mk and xk. The two variances in Equation 17, σ²(x̄k) and σ²(m̂k), may be found in standard ways.
The value of σ²(x̄k) is given by

σ²(x̄k) = σ²(xk)/NF,   (18)

where NF denotes the number of faculty in the field who supplied direct weight data, and σ²(xk) denotes the variance of the individual direct weights given to the kth program variable by these faculty respondents. The value of σ²(m̂k) is obtained from the regression output that produces m̂k when the data from all faculty raters in a field are used. Its square root, σ(m̂k), is the standard error of the regression coefficient, m̂k. Finally, because we rescaled the m̂k so that their absolute sum was 1.0, the same divisor must be applied to σ(m̂k) to put it on the corresponding scale.
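The optimal fraction and its limiting behavior can be sketched directly from Equation 17 (all numerical values below are hypothetical):

```python
import numpy as np

def optimal_fraction(w, m_bar, x_bar, var_m, var_x):
    """u_0k from Equation 17: the weight on the simulated regression-based
    value that minimizes the mean-square error of the combined weight."""
    gap2 = (m_bar - x_bar) ** 2
    return (var_x + w * gap2) / (var_x + var_m + gap2)

w = 0.5

# With no uncertainty in either weight, u_0k reduces to the policy weight w.
assert np.isclose(optimal_fraction(w, 0.3, 0.1, var_m=0.0, var_x=0.0), w)

# When m_bar = x_bar, w drops out and the combination is inversely
# proportional to the variances: the less variable source gets more weight.
u = optimal_fraction(w, 0.2, 0.2, var_m=0.03, var_x=0.01)
print(u)  # 0.01 / (0.01 + 0.03) = 0.25, so x gets weight 0.75
```

Here the direct weight, with the smaller variance (0.01), receives the larger share (0.75) of the combination, illustrating the inverse-variance behavior described above.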

If we now replace the u in Equation 12 with the u0k given in Equation 17, we obtain the combined weight that optimally combines the two simulated values of the weights, mk and xk, into the combined rating, given by

R0j = Σ(k=1 to 20) f0k pjk*,   (19)

where

f0k = u0kmk + (1 − u0k)xk,   (20)

and u0k is given by Equation 17. The vector of optimally combined weights is denoted by f0.14 The values of R0j from Equations 19 and 20 are used as the 500 simulated values of the combined ratings for the purposes of determining the ranking interval ranges for each program, as discussed below. In performing the RH sampling to mimic the uncertainty in the direct and regression-based weights, it should be emphasized that the random half samples from X and R were statistically independent. This is our justification for assuming that the random draws, mk and xk, are statistically independent in the calculation of the optimal fraction, u0k.15

As a final point, we realized that the approach to calculating the optimal fraction described above did not take into account any correlation between the direct and regression-based weights for different program variables. We did examine a method that did; it simply produced a matrix version of Equation 17 that reduced to the procedure we used when the program variables were uncorrelated, but was otherwise difficult to implement with the resources available to us.

Box 8: Eliminating Non-Significant Program Variables

After we obtained the 500 simulated values of the combined weights by applying Equations 17 and 20 to the 500 simulated values of the direct and regression-based weights, we were in a position to examine the distributions of these 500 values of the combined weights for each program variable. The distributions of the combined weights for some of the program variables did not contain zero and were not even near zero. However, other program variables had combined weight distributions that did contain zero.
If zero is inside the middle 95 percent of this distribution, we declare the combined weight for that program variable to be non-significant for the rating and ranking process (in analogy with the usual way that distributions of parameters are tested for statistical significance). If the combined weight for a program variable is not significantly different from zero, the variable for that coefficient is dropped from further computations. This elimination of program variables required us to recalculate everything above box (8) in Figure J-2. The eliminated program variables are ignored in calculating the direct and regression-based weights for the other variables. New RH samples are drawn, the direct weights are retransformed so that the absolute sum of the remaining direct weights is 1.0, the regressions are re-run using the reduced set of program variables as predictors, and new optimal

15 The fact that the raters for each field were a subset of those who answered the faculty questionnaire may confuse some into thinking that our independence assumption is not justified. This is a misunderstanding of the simulation of uncertainty in the rating and ranking process. It is the statistical independence of the two RH sampling processes that matters, nothing else.

fractions are computed to combine the direct and regression-based weights. Finally, the 500 simulated combined coefficients are again tested for statistical significance from zero. This process is repeated until a final set of combined weights, each of which is significantly different from zero, is obtained. Only after this testing and retesting process is performed are the final sets of 500 combined coefficients ready for use in the computation of the intervals of rankings discussed in box (5) of Figure J-1. The values of the combined weights that correspond to the eliminated variables are set to 0.0 in each of the final 500 simulated values of f0. These 500 vectors of combined weights are used in the production of the ratings that produce the final intervals of rankings for each program, as discussed later. Empirically, the examination of three fields suggests that this process has two useful effects. First, the middle of the inter-quartile ranges of rankings of programs changes very little, so that the ranges before eliminating non-significant program variables and those after this elimination are centered in nearly the same places.16 Second, the widths of these inter-quartile ranges are slightly reduced or unchanged. These are the effects we would expect from eliminating variables that have only a noisy effect on the ranking and rating process, and for this reason we have continued to include box (8) in our rating and ranking process. Nonetheless, the inter-quartile intervals do shift more markedly than the medians when estimated coefficients are set to zero, largely for those departments near the middle of the rankings. This is because quartile estimates are more variable than median estimates. There are even rare instances in which the intervals calculated both ways do not overlap.
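The significance screen in box (8) can be sketched as follows, assuming a hypothetical array of 500 simulated combined weights for four program variables:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical simulated combined weights: 500 replications x 4 variables.
# Variables 0 and 1 have distributions well away from zero; variables 2
# and 3 straddle zero.
f = np.column_stack([
    rng.normal(0.30, 0.05, 500),
    rng.normal(-0.20, 0.04, 500),
    rng.normal(0.01, 0.05, 500),
    rng.normal(0.00, 0.03, 500),
])

# A variable is non-significant if zero lies inside the middle 95 percent
# of its 500 simulated combined weights.
lo = np.percentile(f, 2.5, axis=0)
hi = np.percentile(f, 97.5, axis=0)
keep = (lo > 0) | (hi < 0)

print(keep)  # variables 2 and 3 would be dropped and the process re-run
```

In the actual procedure this test is only the first pass: after dropping the flagged variables, the weights are recomputed from new RH samples and the test is applied again until every remaining combined weight is significantly different from zero.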
From this point on, the calculation of the ranges of rankings is carried out as described in the section about the R- and S-ranges of rankings.

16 Examination of the effect of this procedure gave correlations of .99 between the median rankings with and without the elimination of non-significant variables.