

Suggested Citation:"APPENDIX A A Technical Discussion of the Process of Rating and Ranking Programs in a Field." National Research Council. 2009. A Guide to the Methodology of the National Research Council Assessment of Doctorate Programs. Washington, DC: The National Academies Press. doi: 10.17226/12676.



APPENDIX A

A Technical Discussion of the Process of Rating and Ranking Programs in a Field

This appendix explains in detail how the various parts of the rating and ranking process for graduate programs fit together and how the process is carried out. Figure A-1 provides a graphical overview of the entire process and forms the basis for this appendix. We address each of the boxes in Figure A-1 separately, starting at the top and generally working downward and to the right. The topics in this appendix include:

• a summary of the sources of data used in the rating and ranking process,
• the direct weights, the regression-based weights, and the methods used to calculate the regression-based weights,
• the simulation of the uncertainty in the weights by random-halves sampling,
• the construction of the combined weights using an optimal fraction to combine the simulated values of the direct and regression-based weights,
• the elimination of variables with nonsignificant combined weights,
• the simulation of the uncertainty in the values of the program variables,
• the combination of the simulated combined weights for the significant program variables with the simulated standardized values of the program variables to obtain simulated rankings, and
• the resulting inter-quartile ranges of rankings that are the primary rating and ranking quantities we report.

PREPUBLICATION COPY (UNEDITED PROOFS)

Figure A-1 A graphical summary of the NRC's approach to rating and thereby ranking graduate programs.

The three sets of data: X, R and P.

X = the collection of the faculty importance measures. A complete array with an importance value for every program variable by every responding faculty member.

R = the collection of ratings of programs by the faculty raters. An incomplete array, with ratings only for the sampled programs and rated only by those faculty members who were sampled to rate a given sampled program.

P = the collection of the values of the program variables. A complete array with a value for every program (that satisfies the inclusion criteria for rating and ranking) in a field, on every program variable.

(1) Random halves sampling of faculty in X.
(1a) Results in one random half of X, denoted by $\tilde{X}$.
(1b) Average $\tilde{X}$ over faculty to get the direct weights, $\bar{x}$. The sum of direct weights = 1.0.

(2) Random halves sampling of raters in R.
(2a) Results in one random half of R, denoted by $\tilde{R}$.
(2b) Average $\tilde{R}$ over raters to get average ratings for sampled programs, $\bar{r}$. This is the dependent variable in the regressions.

(3) Standardize program variables to Mean = 0, and SD = 1. Denote result by $P^*$. These are the independent variables in the regressions.

The regressions
(4) (a) Transform original program variables to principal components (PCs). (b) Perform backwards stepwise regression to obtain a stable fitted equation predicting average ratings from the remaining PCs. (c) Transform resulting coefficients back to the original program variables to get the regression-based weights, $\hat{m}$, and make their absolute sum = 1.0.

The combined weights
(5) Select policy weight, w = 1/2.
(6) Combine $\bar{x}$, $\hat{m}$ and w = 1/2 using the optimal fraction to form the combined weights, f0.

Eliminating non-significant variables
(7) Repeat the steps from (1) to (6) 500 times. Use the resulting 500 samples of f0 to eliminate program variables in X and $P^*$ having non-significant combined weights. Repeat this until there are no non-significant program variables. Final output is last 500 replications of f0 with zero entries for all non-significant variables.

(8) Random perturbation of the values in P.
(8a) Results in one randomly perturbed version of P, denoted by $\tilde{P}$.
(8b) Standardize $\tilde{P}$ to get $\tilde{P}^*$.

The Inter-quartile ranges of Rankings
(9) Repeat the steps (8) to (8b) to get 500 replications of $\tilde{P}^*$, and combine them with the final 500 replications of f0 to get 500 Ratings for each program. Rank the programs for each set of 500 ratings. This results in 500 Rankings for each program. Use these 500 Rankings to get the Inter-quartile range of the Rankings for each program.

THE THREE DATA SETS

The empirical basis of the NRC ratings and rankings is the three data sets indicated in the three unlabeled boxes at the top of Figure A-1. The first, denoted by X, is the collection of faculty importance measures derived from data collected in the faculty questionnaire. The data in X are used to derive the direct weights discussed more extensively below. The second, denoted by R, is the collection of ratings of programs by faculty raters. These ratings were made separately from the faculty questionnaire and involved only a sample of programs from each field and only a sample of faculty raters from that field. This sample of faculty ratings plays a crucial role in the derivation of the regression-based weights, discussed more extensively below. The third data set, denoted by P, is the collection of the values of the 20 program variables that were collected from various sources for each program. The data in P are used in the final ratings and rankings of the programs and are discussed in greater detail below. More details about these three data sets are also available in Section 2 of this report.

BOX (1b): THE DIRECT WEIGHTS FROM THE FACULTY QUESTIONNAIRE [36]

We turn first to the direct weights in box (1b) in Figure A-1, leaving boxes (1) and (1a) to our later discussion of how we simulated the uncertainty in these data. The faculty questionnaire asks each graduate-program faculty respondent to indicate how important each of 21 characteristics is to the quality of a program in his or her field of study. [37] This information is then used to derive the direct weights for each surveyed faculty member, as described below. The original 21 program characteristics listed on the faculty questionnaire are shown in Table A-1; they were divided into three categories: faculty, student, and program characteristics.
Of the original 21, there are 20 for which adequate data were deemed to be available to use in the rating process, and these 20 data values for each program became the 20 program variables used in this study to which we repeatedly refer.

Faculty respondents were first asked to indicate up to four characteristics in each category that they thought were "most important" to program quality. Each characteristic that was listed received an initial score of 1 for that faculty respondent. These preferences were then narrowed by asking the faculty members to further identify a maximum of two characteristics in each category that they thought were the most important. Each of these selected characteristics received an additional point, resulting in a score of 2. Given this approach, at most 12 of the program characteristics can have a non-zero value for any given faculty member; of these 12, at most 6 will have a score of 2, and the rest will have a score of 1. At least 8 program characteristics will have a score of 0 for each faculty respondent; more than 8 will be zero if the respondent selected fewer than four characteristics as "important" or fewer than two as "most important" in a category.

A final question asked faculty respondents to indicate the relative importance of each of the three categories by assigning them values that summed to 100 over the three categories. [38] For each faculty respondent, his or her importance measure for each program characteristic was calculated as the product of the score that it received and the relative importance value assigned to its category. Finally, the 20 importance measures for each faculty respondent were transformed by dividing each one by the sum of his or her importance measures across the 20 program variables.

[36] The importance of program attributes to program quality is surveyed in Section G of the faculty questionnaire.
[37] The number of student publications and presentations was not used because consistent data on it were unavailable. The direct and regression-based weights were calculated without it.
[38] The faculty task can be thought of as asking faculty how many percentage points should be assigned to each category. The percentage-point weights sum to 100.
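The calculation of one respondent's importance measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `importance_measures`, the six characteristic labels, and the scores and category points are all hypothetical, and only 6 characteristics are used instead of the study's 20.

```python
def importance_measures(scores, category_of, category_points):
    """Return normalized importance measures for one faculty respondent.

    scores          -- characteristic -> 0, 1, or 2 (questionnaire score)
    category_of     -- characteristic -> category name
    category_points -- category name -> relative importance (sums to 100)
    """
    # Raw measure: the score times the relative importance of its category.
    raw = {k: s * category_points[category_of[k]] for k, s in scores.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("respondent marked no characteristic as important")
    # Divide by the respondent's total so the measures sum to 1.0.
    return {k: v / total for k, v in raw.items()}


# Hypothetical respondent (illustrative labels and values only).
category_of = {
    "publications": "faculty", "grants": "faculty",
    "gre": "student", "support": "student",
    "completion": "program", "time_to_degree": "program",
}
scores = {
    "publications": 2, "grants": 1,
    "gre": 2, "support": 0,
    "completion": 1, "time_to_degree": 0,
}
category_points = {"faculty": 50, "student": 30, "program": 20}

x_i = importance_measures(scores, category_of, category_points)
```

Here the raw measures are 100, 50, 60, 0, 20 and 0, so "publications" receives 100/230 of this respondent's total importance.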

Table A-1 The 21 Program Characteristics Listed in the Faculty Questionnaire

Faculty characteristics
i. Number of publications per faculty member
ii. Number of citations per publication (for non-humanities fields)
iii. Percent of faculty holding grants
iv. Involvement in interdisciplinary work
v. Racial/ethnic diversity of program faculty
vi. Gender diversity of program faculty
vii. Reception by peers of a faculty member's work as measured by honors and awards

Student characteristics
i. Median GRE scores of entering students
ii. Percentage of students receiving full financial support
iii. Percentage of students with portable fellowships
iv. Number of student publications and presentations (not used)
v. Racial/ethnic diversity of the student population
vi. Gender diversity of the student population
vii. A high percentage of international students

Program characteristics
i. Average number of Ph.D.'s granted in last five years
ii. Percentage of entering students who complete a doctoral degree in a given time (6 years for non-humanities, 8 years for humanities)
iii. Time to degree
iv. Placement of students after graduation (percent in either positions or postdoctoral fellowships in academia)
v. Percentage of students with individual work space
vi. Percentage of health insurance premiums covered by institution or program
vii. Number of student support activities provided by the institution or program

We will use the following notation consistently: i for a faculty respondent, j for a program in a field, and k for one of the 20 program variables. Thus, $x_{ik}$ denotes the measure of importance placed on program variable k by faculty respondent i. The values $x_{ik}$ are non-negative and, over k, sum to 1.0 for each faculty respondent i. The importance measure vector for faculty respondent i is the collection of these 20 values,

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{i,20}). \tag{1}$$

Denote the vector of average importance weights, averaged across the entire set of faculty respondents in a field, by

$$\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{20}). \tag{2}$$

The mean value, $\bar{x}_k$, is the average importance given to the kth program variable by all the surveyed faculty respondents in the field. The averages, $\{\bar{x}_k\}$, are the direct weights of the faculty respondents because they directly give the average relative importance of each program variable, as indicated by the faculty questionnaire responses in the field of study. Because each respondent's 20 importance measures are non-negative and sum to 1.0, the direct weights are also non-negative and sum to 1.0.

BOXES (2b), (3) AND (4): THE REGRESSION-BASED WEIGHTS

We next consider the processes in boxes (2b), (3) and (4) in Figure A-1 that lead to the regression-based weights. Again, we leave boxes (2) and (2a) to our later discussion of how we simulated the uncertainty in these data. The regression-based weights represent our attempt to ascertain how much weight faculty members implicitly give to each program variable when they rate programs according to their own perception of program quality. We used linear regression to predict average faculty ratings from the 20 program variables and interpreted the resulting regression coefficients as indicating the implicit importance of each program variable for faculty ratings.
This is different from the direct weights that were just described. We have broken down the process of obtaining the regression-based weights into the three parts indicated by boxes (2b), (3) and (4), which we now discuss in turn.

Box (2b): The average ratings for the sampled programs

The ratings data in R of Figure A-1 are the ratings given by the sampled faculty members to the sample of programs that they were requested to rate. A randomly selected faculty member, i, rates a randomly selected program, j, on a scale of 1 to 6 in terms of his or her perception of its quality. Denote this rating by $r_{ij}$. The matrix sampling plan was designed so that a sample of up to 50 of the programs in a field was rated by a sample of the graduate faculty members in that same field. Each rater rated about 15 programs, and none rated his or her own program. On average, each rated program was rated by about 44 faculty raters. The rater sample was stratified to ensure proportionality by geographic region, program size (measured by number of faculty), and academic rank. The program sample was stratified to ensure proportionality by geographic region and program size.

R is the array of all the values of $r_{ij}$. Note that R is an incomplete array because many faculty members who responded to the questionnaire did not rate programs and, except in the small fields, many programs in a field were not rated.

Box (2b) indicates that we compute the average of these ratings for program j, and denote this average rating by $\bar{r}_j$. Because each program's average rating is determined by a different random sample of graduate faculty raters, it is highly unlikely that any two programs will be evaluated by exactly the same set of raters. Denote the vector of the average ratings for the sampled programs in a field by $\bar{r}$. The values of the average ratings in $\bar{r}$ are the dependent variable in the regression analyses used to form the regression-based weights.

Box (3): The program variables and standardizing

Denote the value of program variable k for program j by $p_{jk}$, and define the vector of all program variables for program j by

$$p_j = (p_{j1}, p_{j2}, \ldots, p_{j,20}), \tag{3}$$

and the array with rows given by $p_j$ by P. A cursory examination of the program characteristics listed in Table A-1 shows that they are on different scales.
For example, the number of publications per faculty member (numbers in the fives and tens), the median GRE scores of entering students (numbers in the hundreds), and the percentage of entering students who complete a doctoral degree in a given time (fractions) are reported in values of very different orders of magnitude. If these values are left as they are, the size of any regression coefficient based on them will be influenced both by the importance of that program variable for predicting the average ratings (which is what we are interested in) and by the scale of that variable (which is arbitrary and does not interest us). The program variables with large values, such as the median GRE scores, will have very small coefficients to reflect the change in scale in going from GRE scores (in the hundreds) to ratings (in the 1 to 6 range). Conversely, program variables with small values, such as proportions, will have larger regression coefficients to reflect the change in scale in going from numbers less than 1 to ratings (in the 1 to 6 range).

To avoid the ambiguity between the influence of the scale and the real predictive importance of a variable, we needed to modify the values of the different program variables so they have similar scales. This would ensure that program variables with the same influence on the prediction of faculty ratings would have similar regression-coefficient values. Our solution is the very common one of standardizing the $p_{jk}$-values by subtracting their mean across the programs in a field and dividing by the corresponding standard deviation. This results in program variables that have the same mean (0.0) and standard deviation (1.0) across the programs in the field. In this way, no program variable will have substantially larger or smaller values than any other program variable across the programs in a field. For the regressions of box (4), the standardization was done only over the programs that were sampled for rating. We denote the values of the standardized program variables with an asterisk ($p_{jk}^*$ and $P^*$).

Two program variables (Student Work Space and Health Insurance) were coded as 1 (present) or -1 (absent). We felt that there was no need for additional standardization of these two program variables, and they were not standardized to have mean 0 and variance 1. The standardized program variables for the sampled and rated programs served as the predictor or independent variables in the regressions that lead to the regression-based weights.

Box (4): The regressions and the regression-based weights

The statistical problem addressed in box (4) is to use $\bar{r}$ and $P^*$ as the dependent and independent variables, respectively, in a linear regression, to obtain the vector of regression-based weights, $\hat{m}$, using least squares. It should be noted that only the data in $P^*$ for the sampled programs are used. The data for the non-sampled programs in $P^*$ are not used in this step of the process.

Two immediate problems arise: (1) the number of observations (i.e., the number of sampled programs in a field) is 50 or less, while the number of independent variables (i.e., the program variables in $P^*$) is 20, and (2) a number of the program variables are correlated with each other across the programs in a field. This is a less than ideal situation for obtaining stable regression coefficients. There are too few observations to hope for stable estimates of the coefficients for 20 variables.
The fact that these variables are also correlated does not help matters either. If we had ignored these two problems, least-squares regression methods would have tended to assign coefficients rather arbitrarily to one particular variable or to other variables that are correlated with it, and how this worked out would depend on which programs were included in the sample of rated programs. The resulting unstable regression coefficients would have been unusable for our purposes. For example, as expected, when we fit a linear model that included all 20 of the program variables, we found that for a number of the variables, the coefficients and their signs did not make intuitive sense. However, we found, as expected, that they made more sense when we used various step-wise selection methods for reducing the number of variables used as predictors. With only 50 cases, we had to expect that we could not use all 20 variables in the prediction equations without adjustments.

After examining a variety of approaches, we settled on using a backwards, step-wise selection method applied to the 20 principal component (PC) variables formed from the 20 program variables (rather than using the original 20 program variables). The regression coefficients obtained for the remaining PC variables were then transformed back to the scale of the original 20 program variables, with the result that all 20 program variables now had non-zero coefficients, but these coefficients were subject to several linear constraints implied by the deleted PC variables.

The principal component variables are linear combinations of the original 20 program variables that have two properties: (1) they are uncorrelated in the sample, and (2) they can give exactly the same predictions as do the original variables; that is, every prediction equation that is possible with the original variables can also be formed using the PC variables, with different regression coefficients. The PC variables are usually ordered by their variances from largest to smallest, but this plays no role here. There are as many PC variables as there are original variables, in our case 20. If we denote the array of original 20 standardized variables for the sample of rated programs as $P^*$, then the corresponding array of the 20 PC variables, C, is given by the matrix multiplication $C = P^*V$, where V is the 20 by 20 orthogonal matrix specified by, among other things, the singular value decomposition of $P^*$. After the regression coefficients are estimated using the PC variables, we get back to the coefficients for the original standardized variables in $P^*$ by applying the transformation V to the vector of regression coefficients.

Our step-wise use of the PC variables proceeded as follows. We begin with a least-squares prediction equation, predicting $\bar{r}$ from C, that includes all of the PC variables. Then a series of analyses is performed, with one PC variable at a time being left out of the prediction equation; the PC variable that has the least impact on the fit of the predicted ratings (as measured by its t-statistic) is removed. This process is repeated, removing one PC variable each time, until the remaining PC variables each add statistically significant improvements to the fit of the predictions of the ratings (at the 0.05 level).
The result is a set of regression coefficients, the PC coefficients, $\hat{\gamma}$, which predict the sample of program ratings from a subset of the PC variables, i.e.,

$$\hat{\bar{r}} = C\hat{\gamma}. \tag{4}$$

In Equation 4, the caret denotes estimation. Moreover, for the PC variables that have been eliminated during the backwards selection process, the corresponding PC coefficients, $\hat{\gamma}_k$, are zero. These zeros mean that we are setting the coefficients of certain linear combinations of the original variables to zero rather than setting the coefficients for some of the original program variables to zero. This was regarded as a virtue, because we did not necessarily eliminate any of the original program variables from the prediction equation used to find the regression-based weights. By proceeding this way, we are not forced to give a zero weight to one of two collinear variables in the step-wise procedure. Instead, both collinear variables will typically load onto the same principal components and get some weight when the matrix V is applied to the PC coefficients to obtain the coefficients for the original program variables, i.e.,

$$\hat{m} = V\hat{\gamma}. \tag{5}$$
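The steps of boxes (3) and (4) can be sketched with NumPy as follows. This is a simplified illustration, not the committee's implementation: the function names `standardize` and `pc_backward_weights` are hypothetical, the data are synthetic (40 programs, 5 variables), and a fixed cutoff on |t| (`t_crit`, assumed to be 2.0) stands in for the 0.05-level significance test.

```python
import numpy as np


def standardize(P):
    """Center each column to mean 0 and scale it to SD 1 (box (3))."""
    return (P - P.mean(axis=0)) / P.std(axis=0, ddof=1)


def pc_backward_weights(P_star, r_bar, t_crit=2.0):
    """Backward stepwise regression on principal components.

    Returns (m_hat, gamma_hat, V) with m_hat = V @ gamma_hat (Equation 5).
    t_crit is an assumed |t| cutoff standing in for the 0.05-level test.
    """
    n = P_star.shape[0]
    # The right singular vectors of P* form the orthogonal matrix V.
    _, _, Vt = np.linalg.svd(P_star, full_matrices=False)
    V = Vt.T
    C = P_star @ V                        # PC variables, C = P* V
    keep = list(range(C.shape[1]))        # PCs still in the equation
    coef = np.zeros(0)
    while keep:
        D = np.column_stack([np.ones(n), C[:, keep]])   # intercept + PCs
        beta, *_ = np.linalg.lstsq(D, r_bar, rcond=None)
        resid = r_bar - D @ beta
        s2 = resid @ resid / (n - D.shape[1])           # residual variance
        se = np.sqrt(s2 * np.diag(np.linalg.inv(D.T @ D)))
        t = np.abs(beta[1:] / se[1:])                   # skip the intercept
        weakest = int(np.argmin(t))
        if t[weakest] >= t_crit:          # every remaining PC passes
            coef = beta[1:]
            break
        del keep[weakest]                 # drop the least useful PC, refit
    gamma_hat = np.zeros(C.shape[1])      # deleted PCs keep coefficient 0
    if keep:
        gamma_hat[keep] = coef
    return V @ gamma_hat, gamma_hat, V


# Synthetic example: columns 0 and 1 are nearly collinear.
rng = np.random.default_rng(0)
P = rng.normal(size=(40, 5))
P[:, 1] = P[:, 0] + 0.1 * rng.normal(size=40)
P_star = standardize(P)
r_bar = 3.5 + 0.5 * (P_star[:, 0] + P_star[:, 1]) + 0.2 * rng.normal(size=40)

m_hat, gamma_hat, V = pc_backward_weights(P_star, r_bar)
weights = m_hat / np.abs(m_hat).sum()     # scaled so the absolute sum is 1.0
```

Because the columns of the standardized array are centered, the PC variables are orthogonal to one another and to the intercept, so dropping one PC does not change the estimated coefficients of the others; only the residual variance, and hence the t-statistics, change.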

In the same way, the matrix of estimated variances and covariances of $\hat{\gamma}$, obtained from the least-squares output, may be transformed to the corresponding matrix for $\hat{m}$. The variances from this matrix are used later in box (6) in the computation of the "optimal fraction" for combining the direct and regression-based weights.

The regression coefficient for the kth program variable, denoted by $\hat{m}_k$, is the regression-based weight for program characteristic k as a predictor of the average ratings of the programs by the faculty raters, and $\hat{m} = (\hat{m}_1, \hat{m}_2, \ldots, \hat{m}_{20})$.

The predicted perceived quality rating for a sampled program can be expected to differ somewhat from the actual average rating for that program. For example, for the two fields studied in Assessing Research Doctorate Programs: A Methodology Study, the root-mean-square deviation between the predictions and the average ratings was 0.42 on a 1-to-6 rating scale for both mathematics and English. In addition, the (adjusted) $R^2$ of the regressions of average ratings on measured program characteristics was 0.82 for mathematics and 0.80 for English. These values indicate that the predictions account for about 80 percent of the variability in average ratings. We regarded these as satisfactory levels of agreement between predicted and actual ratings for using these methods in this study.

These results show that the predicted perceived quality ratings agree fairly well with the actual ratings. However, they do not indicate how well a prediction equation based on a sample of programs will reproduce the predictions of the equation for the whole population of programs in a field.
The data for mathematics, reported in Assessing Research Doctorate Programs: A Methodology Study, indicate that using 49 programs did a reasonably good job of reproducing the predictions based on the whole field of 147 mathematics programs. [39] Thus, we decided that in developing the regression-based ratings, we would use a sample of 50 programs from a field if it had more than 50 programs and use almost all of the programs in fields with 50 or fewer programs. When there were fewer than 30 programs in a field, it was combined with a larger discipline with similar direct weights for the purposes of estimating the regression-based weights. [40] In one case, computer engineering, there were fewer than 25 programs, and this field was combined with the field of electrical and computer engineering to estimate the regression-based coefficients. [41]

There is one final alteration in the values of $\hat{m}$ that needs to be mentioned. The direct weights, $\{\bar{x}_k\}$, have absolute values that sum to 1.0. This is not necessarily true of the regression coefficients, $\{\hat{m}_k\}$. The scale of $\hat{m}_k$ depends on both the scale of $p_{jk}$ and the scale of the average ratings, $\{\bar{r}_j\}$. Because our intent was to combine these two sources of information about the importance of the various program variables, we decided that they needed to be on similar scales, and we forced them both to sum to 1.0 in absolute value. [42] This allows the direct and regression-based weights to have negative values where they arise, typically in the regression-based weights, without requiring anything complicated to deal with this. Using the sum of absolute values allows the sign of the regression-based weights to be determined by the data rather than by an a priori hypothesis. Thus, we divided each regression coefficient, $\hat{m}_k$, by the sum of the absolute values of all the regression coefficients. In this way, both the direct and regression-based weights are fractional values, mostly positive but some negative, whose absolute sums equal 1.0. The estimated standard deviations of the $\{\hat{m}_k\}$, obtained in standard ways from the regression output, were also divided by this sum to make them the correct size for use in the process of combining the direct and regression-based weights, discussed below.

BOXES (5) AND (6): THE COMBINED WEIGHTS

To motivate our method of combining the direct and regression-based weights, we start by describing the direct and regression-based ratings. Remembering that the standardized values of the program variables for program j are denoted by $p_{jk}^*$, the direct rating for program j, using the average direct weight vector, $\bar{x}$, is $X_j$, given by

$$X_j = \sum_{k=1}^{20} \bar{x}_k \, p_{jk}^*. \tag{6}$$

The regression-based rating for program j, using the regression-based weight vector, $\hat{m}$, is $M_j$, given by

$$M_j = \sum_{k=1}^{20} \hat{m}_k \, p_{jk}^*. \tag{7}$$

[39] See Appendix G of Assessing Research Doctorate Programs: A Methodology Study, National Research Council (2003).
[40] The fields for which this was done were:

Small Field                          Surrogate Field
Aerospace engineering                Mechanical engineering
Agricultural economics               Economics
American studies                     English literature
Astrophysics and astronomy           Physics
Entomology                           Plant science
Forestry                             Plant science
Food science                         Plant science
Engineering science and mechanics    Mechanical engineering
Theatre and performance              English literature

[41] The committee had not anticipated this when it developed the taxonomy, or the field would not have been included as a separate field.
[42] We use the absolute value here because, for time to degree, a higher value should receive a negative weight.
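The rescaling of the regression coefficients and the two ratings in Equations 6 and 7 can be illustrated with a short Python sketch. The helper names `normalize_abs_sum` and `rating` and all numerical values are hypothetical, and 4 program variables are used instead of 20.

```python
def normalize_abs_sum(coefs, std_devs):
    """Scale coefficients and their SDs so the absolute coefficients sum to 1."""
    scale = sum(abs(c) for c in coefs)
    return [c / scale for c in coefs], [s / scale for s in std_devs]


def rating(weights, p_star_row):
    """Weighted sum of standardized program variables (Equations 6 and 7)."""
    return sum(w * p for w, p in zip(weights, p_star_row))


# Hypothetical values for 4 program variables (the study uses 20).
raw_coefs = [0.6, 0.3, -0.2, 0.1]     # the negative one could be time to degree
raw_sds = [0.12, 0.10, 0.08, 0.09]    # estimated SDs from the regression output
m_hat, m_sd = normalize_abs_sum(raw_coefs, raw_sds)

x_bar = [0.4, 0.3, 0.2, 0.1]          # direct weights, which already sum to 1.0
p_star_j = [1.0, 0.5, -1.0, 0.0]      # standardized variables for program j

X_j = rating(x_bar, p_star_j)         # direct rating, Equation 6
M_j = rating(m_hat, p_star_j)         # regression-based rating, Equation 7
```

Dividing by the sum of absolute values keeps the sign of each coefficient, so a variable such as time to degree can retain a negative weight after rescaling.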

Note that the regression-based rating is a linear transformation of the predicted ratings used to obtain the regression-based weights, because the constant term of the regression is deleted, and the weights have been scaled by a common value so that their absolute sum is 1.0. The procedure for computing regression-based ratings can be used for any program, sampled or not, in the given field. Simply use $M_j$ as defined in Equation 7 above, where $\{p_{jk}^*\}$ comes from the data for program j and the $\{\hat{m}_k\}$ are the regression-based weights based on the sample of programs and raters. [43]

We combined the direct ratings with the regression-based ratings as follows. Let w denote a policy weight and form the following combination of the direct and regression-based ratings:

$$R_j = wM_j + (1 - w)X_j. \tag{8}$$

The policy weight, w, is chosen in box (5) of Figure A-1, and is the amount the regression-based ratings are allowed to influence the combined rating, $R_j$. When w = 0, the regression-based rating has no influence on $R_j$. When w = 1, the $R_j$ are based entirely on the regression-based ratings. Any compromise value of w is somewhere between 0 and 1.

We did not actually form both the direct and regression-based ratings in our work. Instead, we exploited the simple linear form of these given by

$$R_j = w\sum_{k=1}^{20} \hat{m}_k \, p_{jk}^* + (1 - w)\sum_{k=1}^{20} \bar{x}_k \, p_{jk}^* = \sum_{k=1}^{20} f_k \, p_{jk}^*, \tag{9}$$

where the combined weight, $f_k$, is given by

$$f_k = w\,\hat{m}_k + (1 - w)\,\bar{x}_k. \tag{10}$$

The representation of the combined rating given in Equations 9 and 10 is a linear combination of the program variables that uses the combined weights, $\{f_k\}$, defined in Equation 10. The combined weight $f_k$ is applied to the kth standardized program characteristic, $p_{jk}^*$, for each k, and then all 20 of these weighted values are summed to obtain the final combined rating for program j.

However, because both $\hat{m}_k$ and $\bar{x}_k$ are subject to uncertainty, we made one additional adjustment to Equation 10 that is described below, following the discussion of how we simulated the uncertainty in both the direct weights and in the average ratings used to form the regression-based weights.

[43] We have throughout estimated linear regressions. Is this assumption justified? We can only say that, empirically, we tried alternative specifications that included quadratic terms for the most important variables (publications and citations) and did not find an improved fit.
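The equivalence of Equations 8 and 9 is easy to check numerically. The sketch below uses hypothetical weights and standardized values for 4 program variables, with the policy weight w = 1/2 from box (5); it does not include the additional "optimal fraction" adjustment mentioned above.

```python
def combined_weights(m_hat, x_bar, w=0.5):
    """Equation 10: f_k = w * m_hat_k + (1 - w) * x_bar_k."""
    return [w * m + (1 - w) * x for m, x in zip(m_hat, x_bar)]


def weighted_rating(weights, p_star_row):
    """Sum of weight times standardized program variable."""
    return sum(wt * p for wt, p in zip(weights, p_star_row))


# Hypothetical values for 4 program variables (the study uses 20).
m_hat = [0.5, 0.25, -0.15, 0.10]      # regression-based weights, absolute sum 1.0
x_bar = [0.4, 0.3, 0.2, 0.1]          # direct weights, sum 1.0
p_star_j = [1.0, 0.5, -1.0, 0.0]      # standardized variables for program j
w = 0.5                               # policy weight

f = combined_weights(m_hat, x_bar, w)

# Equation 9: apply the combined weights directly.
R_from_f = weighted_rating(f, p_star_j)

# Equation 8: form both ratings first, then mix them.
R_two_step = (w * weighted_rating(m_hat, p_star_j)
              + (1 - w) * weighted_rating(x_bar, p_star_j))
```

Both routes give the same combined rating, which is why only the combined weights need to be carried through the later simulation steps.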

BOXES (1), (1a), (2) AND (2a): SIMULATING THE UNCERTAINTY IN THE DIRECT AND REGRESSION-BASED WEIGHTS

The direct weight vector, $\bar{x}$, is subject to uncertainty; that is, a different set of respondent faculty would have led to different values in $\bar{x}$. Disagreement among the graduate faculty on the relative importance of the 20 program variables is the source of the uncertainty of the direct weights. The average ratings of the sampled faculty in $\bar{r}$ are also subject to uncertainty; a different sample of raters or programs would have produced different values in $\bar{r}$.

One way to reflect this uncertainty is to use the sampling distributions of $\bar{x}$ and $\bar{r}$. There are various ways that these sampling distributions may be realized. We chose an empirical approach that made no assumptions about the shapes of the various distributions involved and that allowed us to use computer-intensive methods to let the sampling variability of both $\bar{x}$ and $\bar{r}$ influence the final ratings and rankings. We examined two empirical approaches, Efron's bootstrap and a random-halves (RH) procedure suggested by the committee chairman. We found that both gave very similar final results in terms of the final ranges of rankings and ratings. The bootstrap requires taking a sample of N with replacement from the relevant empirical distribution. The RH procedure requires taking a sample of N/2 without replacement from the same empirical distribution. We chose to use the RH procedure because it cut the sampling computations in half, is fairly easy to explain, and, as far as we could tell, gave essentially the same results as the bootstrap for ranking and rating.

Boxes (1) and (2): The random halves procedure

The RH procedures for $\bar{x}$ and $\bar{r}$ are nearly the same and have the same justification. X is a complete array whose rows denote the N faculty respondents, while R is an incomplete array whose rows denote the n sampled faculty raters for a field.
In the case of X, the RH procedure requires a random sample of size N/2 of the faculty respondents. In the case of R, the RH procedure requires a random sample of size n/2 of the faculty raters. Repeated draws from these random half samples are then used to simulate the uncertainty in x and r , respectively. Alert readers may worry that these half samples will exhibit too much variability in the resulting averages; after all, a half sample has only half the number of cases as a full sampleâ and the bootstrap always takes a full sample of N or n. The explanation of why a half sample without replacement has essentially the same variability as a full sample with replacement is most easily seen by considering the variance of the mean of a sample without replacement from a finite population. It is well known from sampling theory that the variance of the mean from a sample of size N/2, from a population of size N is, essentially, Ï x2 N Ïx 2 Var( xk ) = k (1 â / N ) = k . (11) âNâ 2 N â â â2â That is, because of the âfinite sampling correction,â the variance from a random half sample without replacement is exactly the same as the variance of a random sample of twice the 45 PREPUBLICATION COPYâUNEDITED PROOFS

size with replacement (there is a small âN versus N â 1â effect that Formula 11 ignores). This is why the bootstrap and the RH methods give such similar results in our application to the uncertainty of the direct weights. There are other reasons to expect the RH method to produce a useful simulation of the uncertainty of averages.44 The same reasoning applies to the RH sampling of the faculty raters in R to simulate the uncertainty in the average ratings, r , used to obtain the regression-based weights. The procedure was to sample a random half of all raters for programs in a field and compute the average rating for each program from that half sample. The regression-based weights are subject to uncertainty from two sources. The first is the uncertainty arising from sampling the faculty raters and, as indicated above, the RH sampling directly addresses this source. The second is from using average ratings from a sample of programs rather than all the programs to develop the regression equation from which the regression-based weights are derived. In the discussion of box (4), above, we gave our reasoning for believing the sample of 50 programs is adequate, and how we pool the data from other related fields when the number of programs in a field is smaller than 50. In addition, while the use of ratings for a sample of programs has the practical value of reducing the workload of the faculty raters, our implicit use of the predicted average ratings, {Mj}, from Equation 7 above, rather than actual average ratings, { rj }, also reduces some of the uncertainty due to the sampling of the programs to be rated. For these two reasons, we believe that this second source of uncertainty is not as important as that simulated by the RH procedure for the uncertainty in the average ratings, Ë and consequently, for the regression-based weights, m . We always drew the RH samples 500 times, and those for x were statistically independent of those for r . 
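The variance argument behind Formula 11 can be checked numerically. The following is only an illustrative sketch, not the committee's code: the function names, the made-up importance scores, and the choice of 2,000 replications (rather than the 500 used in the report) are all assumptions made for the example.

```python
import random
import statistics


def rh_means(values, reps, seed=0):
    """Means of random half samples drawn WITHOUT replacement (RH procedure)."""
    rng = random.Random(seed)
    half = len(values) // 2
    return [statistics.fmean(rng.sample(values, half)) for _ in range(reps)]


def bootstrap_means(values, reps, seed=0):
    """Means of full-size samples drawn WITH replacement (bootstrap)."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(reps)]


# Hypothetical importance scores from N = 200 respondents (made-up data).
data_rng = random.Random(42)
scores = [data_rng.gauss(0.05, 0.02) for _ in range(200)]

# Spread of the simulated averages under each scheme.
rh_sd = statistics.stdev(rh_means(scores, reps=2000))
boot_sd = statistics.stdev(bootstrap_means(scores, reps=2000))

# Formula 11: the standard deviation of the mean is sigma / sqrt(N).
theory_sd = statistics.pstdev(scores) / len(scores) ** 0.5

print(f"RH: {rh_sd:.5f}  bootstrap: {boot_sd:.5f}  Formula 11: {theory_sd:.5f}")
```

Under these assumptions the three printed values should agree closely, while each RH replication averages only half as many cases as a bootstrap replication.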
The RH draws give us 500 replications of the direct weights and 500 replications of the regression-based weights, which we then combined into 500 replications of the combined weights, as described next.

44 The random-halves procedure has a place in the statistical literature, but under other names. It is an example of the "deleted-d" jackknife as described in Efron and Tibshirani (1993), An Introduction to the Bootstrap, New York: Chapman and Hall, p. 149, with d = n/2. It is described by Kirk Wolter in a private communication as an example of "balanced repeated replication" or "balanced half samples," and is described in Wolter, K. M. (2007), Introduction to Variance Estimation, 2nd ed., New York: Springer-Verlag.

Box (6): Using the optimal fraction to combine the direct and regression-based weights

In deriving the ranges of ratings that reflect the uncertainty in $\hat{m}_k$ and $\bar{x}_k$, simulated values, $m_k$ and $x_k$, are drawn from the sampling distributions of $\hat{m}_k$ and $\bar{x}_k$, respectively, using independent RH samples from the appropriate parts of R and X. These two simulated values are to be combined to form a simulated value, $f_k$, of the combined weight in Equation 10. However, the simple weighted average in Equation 10 only reflects the effect of the policy weighting, w, and ignores the fact that both $m_k$ and $x_k$ are independent random draws from distributions, rather than fixed values. We want to combine $m_k$ and $x_k$ in such a way as to bring the simulated value, $f_k$, as close as possible on average to the combined weight of Equation 10, and in a way that will also reflect the policy weight, w, appropriately. This section outlines our approach to choosing the optimal fraction to apply to $m_k$ to achieve this. The optimal fraction is the amount of weight applied to $m_k$ that minimizes the mean-square error of $f_k$, treating the combined weight of Equation 10 as a target parameter to be estimated.

First, consider a general weighting, $f_k(u)$, that uses a fraction, u. This weighting has the form

$$f_k(u) = u\,m_k + (1 - u)\,x_k. \tag{12}$$

By construction of the RH procedure, the mean of the distribution of $m_k$ is $\hat{m}_k$ (the regression coefficient that is obtained when the data from all n faculty raters are used). Similarly, the mean of the distribution of $x_k$ is $\bar{x}_k$, the mean importance value that is obtained when the data from all N faculty respondents are averaged. We may regard $f_k(u)$ as an estimator of $\varphi_k$, given by

$$\varphi_k = w\,\hat{m}_k + (1 - w)\,\bar{x}_k. \tag{13}$$

The problem then is to find the value of u that will minimize the mean-square error (MSE) of $f_k(u)$, given by

$$\mathrm{MSE}(u) = E\big(f_k(u) - \varphi_k\big)^2, \tag{14}$$

where, in Equation 14, the notation $E(f_k(u) - \varphi_k)^2$ denotes the expectation or average taken over the independent RH distributions of $\hat{m}_k$ and $\bar{x}_k$. The MSE is a measure of the combined uncertainty in $f_k(u)$. The MSE in (14) can be written as

$$\begin{aligned}
\mathrm{MSE}(u) &= E\big(u\,m_k + (1 - u)\,x_k - w\,\hat{m}_k - (1 - w)\,\bar{x}_k\big)^2 \\
&= E\big(u(m_k - \hat{m}_k) + (1 - u)(x_k - \bar{x}_k) + (u - w)\,\hat{m}_k + (w - u)\,\bar{x}_k\big)^2 \\
&= E\big(u(m_k - \hat{m}_k) + (1 - u)(x_k - \bar{x}_k) + (u - w)(\hat{m}_k - \bar{x}_k)\big)^2.
\end{aligned} \tag{15}$$

The point of re-expressing Equation 14 as Equation 15 is that now, when the squaring is carried out, all of the terms except the squared ones have zero expected values and can be ignored: $m_k - \hat{m}_k$ and $x_k - \bar{x}_k$ each have mean zero and are independent of one another, and $(u - w)(\hat{m}_k - \bar{x}_k)$ is a constant. If we denote the variance of the sampling distribution of $\hat{m}_k$ by $\sigma^2(\hat{m}_k)$ and the variance of $\bar{x}_k$ by $\sigma^2(\bar{x}_k)$, then Equation 15 becomes

$$\mathrm{MSE}(u) = u^2\,\sigma^2(\hat{m}_k) + (1 - u)^2\,\sigma^2(\bar{x}_k) + (u - w)^2(\hat{m}_k - \bar{x}_k)^2. \tag{16}$$

It is now a straightforward task to differentiate Equation 16 in u, set the result to zero, and solve for the optimal u-value, $u_{0k}$, which we call the optimal fraction. This calculation results in

$$u_{0k} = \frac{\sigma^2(\bar{x}_k) + w(\hat{m}_k - \bar{x}_k)^2}{\sigma^2(\bar{x}_k) + \sigma^2(\hat{m}_k) + (\hat{m}_k - \bar{x}_k)^2}. \tag{17}$$

The optimal fraction in Equation 17 has some useful and intuitive properties. It takes on the value w when there is no uncertainty about the direct and regression-based weights. Moreover, w has no influence on the optimal fraction when $\hat{m}_k$ and $\bar{x}_k$ are equal. In that case, the direct weights and regression-based weights on the kth program characteristic are the same, and the optimal fraction combines the two simulated values in a way that is inversely proportional to their variances, so that the value with less variation gets more weight. Note also that the value in Equation 17 is the same for all of the RH simulated values of $m_k$ and $x_k$.

The two variances in Equation 17, $\sigma^2(\bar{x}_k)$ and $\sigma^2(\hat{m}_k)$, may be found in standard ways. The value of $\sigma^2(\bar{x}_k)$ is given by

$$\sigma^2(\bar{x}_k) = \sigma^2(x_k)/N_F, \tag{18}$$

where $N_F$ denotes the number of faculty in the field who supply direct weight data, and $\sigma^2(x_k)$ denotes the variance of the individual direct weights given to the kth program variable by these faculty respondents. The value of $\sigma^2(\hat{m}_k)$ is obtained from the regression output that produces $\hat{m}_k$ when the data from all faculty raters in a field are used. Its square root, $\sigma(\hat{m}_k)$, is the standard error of the regression coefficient, $\hat{m}_k$. Finally, because we rescaled the $\hat{m}_k$ so that their absolute sum was 1.0, the same divisor must be applied to $\sigma(\hat{m}_k)$ to put it on the corresponding scale.

If we now replace the u in Equation 12 with $u_{0k}$ given in Equation 17, we then obtain the combined weight that optimally combines the two simulated values of the weights, $m_k$ and $x_k$, into the combined rating, given by

$$R_{0j} = \sum_{k=1}^{20} f_{0k}\,p^*_{kj}, \tag{19}$$

where

$$f_{0k} = u_{0k}\,m_k + (1 - u_{0k})\,x_k, \tag{20}$$

and $u_{0k}$ is given by Equation 17.
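Equations 16, 17, and 20 translate directly into a few lines of code. The sketch below is illustrative only: the function names and the numerical inputs are hypothetical, and the handling of a zero denominator is an assumption added for the example rather than something the text specifies.

```python
def optimal_fraction(m_hat, x_bar, var_m, var_x, w):
    """Equation 17: the fraction u that minimizes the MSE of Equation 16."""
    gap_sq = (m_hat - x_bar) ** 2
    denom = var_x + var_m + gap_sq
    if denom == 0:
        # Assumption for this sketch: with no uncertainty and equal weights,
        # every u gives zero MSE, so fall back to the policy weight w.
        return w
    return (var_x + w * gap_sq) / denom


def mse(u, m_hat, x_bar, var_m, var_x, w):
    """Equation 16: mean-square error of the weighting f_k(u)."""
    return (u ** 2 * var_m
            + (1 - u) ** 2 * var_x
            + (u - w) ** 2 * (m_hat - x_bar) ** 2)


def combined_weight(u0, m_sim, x_sim):
    """Equation 20: combine one pair of simulated weights."""
    return u0 * m_sim + (1 - u0) * x_sim


# Hypothetical inputs for a single program variable k.
m_hat, x_bar = 0.12, 0.08        # full-sample regression-based and direct weights
var_m, var_x = 0.0004, 0.0001    # their sampling variances
w = 0.5                          # policy weight

u0 = optimal_fraction(m_hat, x_bar, var_m, var_x, w)
f0 = combined_weight(u0, m_sim=0.13, x_sim=0.07)  # one simulated RH pair
print(u0, f0)
```

With these made-up inputs the regression-based weight is the noisier of the two, so the optimal fraction falls below the policy weight of 0.5 and the direct weight receives more than half of the emphasis.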
The vector of optimally combined weights is denoted by $f_0$.45

The values of $R_{0j}$ from Equations 19 and 20 are used as the 500 simulated values of the combined ratings for the purposes of determining the ranges of rankings for each program that are discussed below.

In performing the RH sampling to mimic the uncertainty in the direct and regression-based weights, it should be emphasized that the random half samples from X and R were statistically independent. This is our justification for assuming that the random draws, $m_k$ and $x_k$, are statistically independent in the calculation of the optimal fraction, $u_{0k}$.46

As a final point, we did realize that the approach to calculating the optimal fraction described above did not take into account any correlation between the direct and regression-based weights for different program variables. We did examine a method that did, but it simply produced a matrix version of Equation 17 that reduced to the procedure we used when the program variables were uncorrelated, and was otherwise difficult to implement with the resources available to us.

45 The weights $f_{0k}$ differ little from the weights that would be obtained from Equation 10 with w = ½ in fields with a large number of programs. For example, the program described in Chapter 5 in economics is one of 117 programs, and the root mean square difference between the optimal weights calculated from Equation 20 and those from Equation 10 with w = ½ over the 500 iterations is 0.00468. The average absolute difference in rankings for the 117 programs in economics between those for the optimal weights and those with w = ½ is 3.972 and 3.979 for the 1st and 3rd quartile ratings, respectively. The average difference in the lengths of the ranking range over the 117 programs was 6.047 for optimal weighting and 6.032 for the w = ½ weighting. These differences may be greater if the field is composed of a small number of programs with fewer responses by the faculty for the importance weights and a larger variance on those weights, such as applied mathematics with 33 programs.

46 The fact that the raters for each field were a subset of those who answered the faculty questionnaire may lead some readers to think that our independence assumption is not justified. This is a misunderstanding of the simulation of uncertainty in the rating and ranking process. It is the statistical independence of the two RH sampling processes that matters, nothing else.

BOX (7): ELIMINATING NONSIGNIFICANT PROGRAM VARIABLES

After we had obtained the 500 simulated values of the combined weights by applying Equations 17 and 20 to the 500 simulated values for the direct and regression-based weights, we were in a position to examine the distributions of these 500 values of the combined weights for each program variable. The distributions of the combined weights for some of the program variables did not contain zero and were not even near zero. However, other program variables had combined weight distributions that did contain zero. If zero is inside the middle 95 percent of this distribution, we declare the combined weight for that program variable to be nonsignificant for the rating and ranking process (in analogy with the usual way that distributions of parameters are tested for statistical significance). If the combined weight for a program variable is not significantly different from zero, the variable for that coefficient is dropped from further computations.

This elimination of program variables required us to recalculate everything above box (7) in Figure A-1. The eliminated program variables are ignored in calculating the direct and regression-based weights for the other variables. New RH samples are drawn, the direct weights are retransformed so that the absolute sum of the remaining direct weights is 1.0, the regressions are re-run using the reduced set of program variables as predictors, and new optimal fractions are computed to combine the direct and regression-based weights. Finally, the 500 simulated combined coefficients are again tested for statistical significance from zero. This process is repeated until a final set of combined weights, each of which is significantly different from zero, is obtained. Only after this testing and retesting process is performed are the final sets of 500 combined coefficients ready for use in the computation of the intervals of rankings that are discussed in box (9) of Figure A-1. The values for the combined weights that correspond to the eliminated variables are set to 0.0 in each of the final 500 simulated values of $f_0$. These 500 vectors of combined weights are used in the production of the ratings that are used to produce the final intervals of rankings for each program, as discussed later.

Empirically, the examination of three fields suggests that this process has two useful effects. First, the middle of the inter-quartile ranges of rankings of programs is changed very little, so that the ranges before eliminating nonsignificant program variables and those after this elimination are centered in nearly the same places.47 Second, the widths of these inter-quartile ranges are slightly reduced or are unchanged. These are the effects that we would expect from eliminating variables that are having only a noisy effect on the ranking and rating process, and for this reason, we have continued to include box (7) in our rating and ranking process. Nonetheless, the inter-quartile intervals do shift more markedly than the medians when estimated coefficients are set to zero, largely for those departments near the middle of the rankings. This is because quartile estimates are more variable than median estimates. There are even rare instances in which the intervals calculated both ways do not overlap.

BOX (8), (8a) AND (8b): INCORPORATING UNCERTAINTY INTO THE PROGRAM VARIABLES

In addition to the uncertainty in the direct and regression-based weights discussed above, there is also some uncertainty in the values of the program variables themselves.
Some of the 20 program variables used to calculate the ratings also vary or have an error associated with their values due to year-to-year fluctuations. Data for five of the variables (publications per faculty, citations per publication, GRE scores, Ph.D. completion, and number of Ph.D.'s) were collected over time, and averages over a number of years were used as the values of these program variables. If a different time period had been used, the values would have been different. To express this type of uncertainty, a relative error factor, $e_{kj}$, was associated with each program variable value, $p_{kj}$. The relative error factor was calculated by dividing the standard deviation over the series by the square root of the number of observations in the series, and then dividing that number by the value of the variable, $p_{kj}$. For example, the publications per faculty variable is the average number of allocated publications per allocated faculty over 7 years, and a standard error value was calculated for this variable as $\mathrm{SD}/\sqrt{7}$. This standard error was then divided by the value of the publications per faculty variable to get the relative error factor for this program variable.

47 Examination of the effect of this procedure gave correlations of .99 between the median rankings with and without the elimination of nonsignificant variables.
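The relative error factor for a variable measured over several years can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the yearly figures are made up, the function name is hypothetical, the variable's value is taken to be the simple mean of the yearly figures, and the sample (n − 1) standard deviation is used because the text does not say which form of the standard deviation was applied.

```python
import math
import statistics


def relative_error(yearly_values):
    """Standard error of the multi-year mean, divided by that mean."""
    mean = statistics.fmean(yearly_values)
    # Assumption: sample standard deviation (n - 1 divisor).
    std_err = statistics.stdev(yearly_values) / math.sqrt(len(yearly_values))
    return std_err / mean


# Hypothetical publications-per-faculty figures for 7 years (made-up numbers).
pubs_per_faculty = [1.8, 2.1, 2.0, 2.4, 1.9, 2.2, 2.3]
e_pubs = relative_error(pubs_per_faculty)
print(f"relative error factor: {e_pubs:.4f}")
```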

For the other 15 program variables that are used in the ratings, no data on variability were directly obtained during the study, and we assigned a relative error of 0, 0.1 or 0.2 to these variables. The variables Student Workspace and Health Insurance were given a relative error of 0, because they were thought to have little or no temporal fluctuation over the interval considered; and for Percent of Faculty Holding Grants, the error assigned was 0.2, because an examination of data from the National Science Foundation Survey of Research Expenditure indicated this to be an appropriate estimate. The remaining 12 program variables were assigned a relative error of 0.1. Each program had its own relative error factor for each program variable, $e_{kj}$.

Just as we had simulated values from the sampling distributions of $\bar{x}$ and $\bar{r}$ via RH sampling, we also wanted to reflect the uncertainty in the values of the program variables themselves rather than using the fixed values, $\{p_{kj}\}$, in computing program ratings. We did this in the following way. The value, $p_{kj}$, was perturbed by drawing randomly from the Gaussian distribution, $N(p_{kj}, (e_{kj}\,p_{kj})^2)$. This distribution has a mean equal to the variable value, $p_{kj}$, and a standard deviation equal to the relative error, $e_{kj}$, times the variable value, $p_{kj}$. Thus, the entire array P is randomly perturbed to a new array, $\tilde{P}$. This perturbing process is repeated 500 times, and each perturbed array is standardized to have mean 0.0 and standard deviation 1.0 for each of the 20 program variables to produce 500 standardized arrays, $\tilde{P}^*$.
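The perturb-then-standardize step might be sketched as below. Everything specific here is an assumption made for illustration: the tiny 4-program by 3-variable arrays are invented, the function names are hypothetical, and the population standard deviation is used for standardizing because the text does not state which divisor was used.

```python
import random
import statistics


def perturb(P, E, rng):
    """Draw each value from N(p, (e * p)^2). P and E are programs x variables."""
    return [[rng.gauss(p, abs(e * p)) for p, e in zip(p_row, e_row)]
            for p_row, e_row in zip(P, E)]


def standardize(P):
    """Rescale every column (program variable) to mean 0 and SD 1."""
    columns = list(zip(*P))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]  # assumption: N divisor
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
            for row in P]


# Hypothetical data: 4 programs (rows) by 3 program variables (columns).
P = [[2.0, 1.0, 0.4],
     [3.5, 0.0, 0.6],
     [1.2, 1.0, 0.5],
     [4.1, 0.0, 0.3]]
# Relative errors of 0.1, 0 and 0.2 for the three variables, for every program.
E = [[0.1, 0.0, 0.2] for _ in P]

rng = random.Random(2009)
replications = [standardize(perturb(P, E, rng)) for _ in range(500)]
```

A variable with a relative error of 0 is never perturbed, so its standardized column is identical in every replication, while the other columns vary from one replication to the next.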
BOX (9): THE INTER-QUARTILE RANGES OF RANKINGS

In box (9) we have already calculated 500 replications of the combined weights after eliminating the nonsignificant program variables for the given field [from box (7)], and from 500 replications of the steps in boxes (8), (8a) and (8b), we have 500 replications of the standardized perturbed version of P that contains the program variable data for all of the programs to be rated in the field. Now we use Equations 17, 19, and 20 to combine the replications of the combined weights with the replications of the standardized perturbed program variables to obtain 500 replications of the combined rating $R_j$ for each program, j. Denote the kth replication of $R_j$ by $R_j^{(k)}$.

To obtain the kth replication of the rankings of the programs, sort the values of $R_j^{(k)}$ over j from high to low and assign the rank of 1 to the program with the highest rating in this set. In case of tied ratings, we use the standard procedure in which the ranks are averaged for the tied cases, and the common rank given to the tied programs is the average of the ranks that would have been given to the tied set of programs. For each of the replications of the ratings, there is a corresponding replication of the rankings of the programs, resulting in 500 replications of the ranking of each program.

Instead of reporting a single ranking of the programs in a field, we report the inter-quartile range of the rankings for each program. This is an interval starting with the rank that was at the 25th percentile (also called the first quartile) in the distribution of the 500 replications of the ranks for the given program, and ending at the 75th percentile (the third quartile) of this distribution. The interpretation of the inter-quartile range is that it is the middle of the distribution of rankings and reflects the uncertainty in the direct and regression-based weights and in the program data values: 25 percent of a program's rankings in our process are less than this interval and 25 percent are higher. The interval itself represents what we would expect the typical rankings for that program to be, given the uncertainty in the process and the ratings of the other programs in the field.48

48 The choice of an inter-quartile range, rather than some other range (eliminating the top and bottom quintile, for example), is arbitrary. IQRs are standard in the statistical literature. Broader ranges would result in greater overlap. The point of introducing uncertainty in our calculations is that we do not know the "true" ranking of a program. The purpose of presenting an IQR is to provide a range in which a program's ranking is likely to fall.
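The ranking step in box (9) can be sketched as follows. This is a minimal illustration rather than the committee's implementation: the function names and demo ratings are hypothetical, only 4 replications are shown instead of 500, and the "inclusive" percentile rule of Python's `statistics.quantiles` is an assumption, since the text does not say how the 25th and 75th percentiles were interpolated.

```python
import statistics


def average_ranks(ratings):
    """Rank 1 = highest rating; tied programs share the average of their ranks."""
    order = sorted(range(len(ratings)), key=lambda j: -ratings[j])
    ranks = [0.0] * len(ratings)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of programs tied with position i.
        while j + 1 < len(order) and ratings[order[j + 1]] == ratings[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of rank positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def rank_iqr(replicated_ratings):
    """replicated_ratings[k][j] is the rating of program j in replication k.

    Returns one (first quartile, third quartile) pair of ranks per program.
    """
    rank_reps = [average_ranks(r) for r in replicated_ratings]
    intervals = []
    for j in range(len(replicated_ratings[0])):
        q1, _, q3 = statistics.quantiles(
            [ranks[j] for ranks in rank_reps], n=4, method="inclusive")
        intervals.append((q1, q3))
    return intervals


# Hypothetical ratings: 4 replications for 3 programs.
demo = [[0.9, 0.5, 0.1],
        [0.4, 0.6, 0.1],
        [0.8, 0.3, 0.2],
        [0.7, 0.7, 0.0]]
for program, (low, high) in enumerate(rank_iqr(demo)):
    print(f"program {program}: ranks {low} to {high}")
```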
