4 The Methodologies Used to Derive Two Illustrative Rankings

Ranking programs based on quantitatively based estimates of program quality is a highly complex task. Rankings should be based both on data that reflect the relative importance to the user of the available measures and on the uncertainty inherent in them. Users of rankings should clearly understand the basis of the ranking, the choice of measures, and the source and extent of uncertainty in them. It is highly unlikely that rankings calculated from composite measures will serve all or even most purposes in comparing the quality of doctoral programs.

The committee has worked for more than three years on arriving at a satisfactory methodology for generating rankings for doctoral programs. This work was pursuant to the portion of the charge which states: "The study will consist of . . . 3) the design and construction of program ratings using the collected data including quantitatively based estimates of program quality." It is this portion of the charge that called for constructing program ratings and deriving rankings from them that reflect program quality. Were it not in the committee's charge, it would be a useful exercise in itself simply to collect program data under comparable definitions and share them widely.

This chapter describes how the committee decided what kinds of data to collect and how to use those data to approach the task of providing ratings and rankings for programs. In pursuing this task, it was guided by some motivating ideas that reflected concerns in the higher education community about rankings and their uses and that were described in considerable detail in the 2003 study already noted, Assessing Research-Doctorate Programs: A Methodology Study. These concerns about the 1995 rankings and rankings from other sources were that they do the following:

• Encourage spurious inferences of precision.
As the committee describes in this report, there are many sources of uncertainty in any ranking, ranging from the philosophical (any ranking implies comparability of what may not be comparable) to the statistical (sources of variation are present in any aggregation of measures).

• Overly rely on reputation. Reputation, although it has the advantage of reflecting dimensions of program quality that are difficult to quantify, may also be dated and include halo effects, that is, visibility effects that obscure the quality of smaller programs or good programs in less well-known universities.
A DATA-BASED ASSESSMENT OF RESEARCH-DOCTORATE PROGRAMS IN THE U.S.

• Lack transparency. Even when it is based on explicit measures, the weighting of these measures in the ranking may not be discernible or may change from year to year in ways that are not made clear.

In addressing these weaknesses in rankings, the committee sought to design a methodology that would result in rankings with the following characteristics:

• Data-based. The rankings were constructed from observable measures derived from variables that reflected academic values.
• A reflection of the prevailing values of faculty in each program area. The rankings were calculated using the opinions of faculty in each program area of both what was important to program quality in the abstract and, separately, how experts implicitly valued the same measures when asked to rate the quality of specific programs.
• Transparent. Users of the rankings could understand the weights applied to the different measures that underlay the rankings and, if they wished, calculate rankings under alternative weighting assumptions.

Achieving these seemingly simple objectives in a scientifically defensible way was not a simple undertaking. The committee had to undertake the following tasks:

• Determine what kinds of measures to include. To be included, a measure had to be one that the participating universities either collected in the course of regular institutional research and management, such as enrollment counts, or that the committee felt should be known by a responsible doctoral program, such as the percentage of entering students who complete a degree in a given amount of time.
• Ascertain faculty values. Faculty were asked, on the one hand, to identify the measures that were important to program quality and then asked, on the other, to rate a stratified sample of programs in their fields.
• Reflect variation among faculty and faculty raters.
Because faculty may not be in complete agreement on the importance of the different measures or the rating of sampled programs, differences in views were reflected by repeatedly resampling the ratings and, for each resampling, calculating the resulting weights or overall program ratings. This approach leads naturally to presenting a range of rankings on any measure for a given program.
• Design specific measures along separate dimensions of program quality. Although overall measures are useful, some users may be particularly interested in measures that focus on specific aspects of a graduate program. The committee calculated ranges of rankings for three of these aspects: research activity, student support and outcomes, and diversity of the academic environment.

That said, the two approaches provided in this report are intended to be illustrative of the process of constructing data-based ranges of rankings that reflect the values of the faculty who teach in these programs. It is also possible to produce ranges of rankings that reflect the values of the users. Production of the rankings turned out to be more complicated and to be accompanied by more uncertainty than originally thought. As a consequence, the illustrative rankings
described in this chapter are neither endorsed nor recommended by the National Research Council as an authoritative conclusion about the relative quality of doctoral programs. In summary, the committee urges users of these rankings and data to examine them very carefully, as the committee has. It also apologizes for any errors that might be uncovered. It does expect that, as a result of this data collection effort, updating will be easier next time for the respondents and will result in fewer errors.

USE OF RANKINGS

In attempts to rank doctoral programs, sports analogies are especially inappropriate. There are no doctoral programs that, after a long regular season of competition followed by a month or more of elimination playoffs, survive to claim "We're Number 1!" Perceptions of the quality of doctoral programs are built over many years of making agonizing tenure decisions, making choices about areas of specialization, and resolving competing views about the most fruitful direction of a field of study. The evidence of excellence is not easily summarized in runs batted in, earned run averages, or percentage of games won. Instead, it is the result of hundreds of judgments by peer reviewers for journals and presses, as well as citations that accumulate as an area of study develops and grows. The answer, then, to "What is the best doctoral program in biochemistry?" should not be the name of a university, but a follow-up question to the interlocutor about what he or she means by "best" and in what respects.

The committee was keenly aware of the complexity of assessing quality in doctoral programs and chose to approach it in two separate ways. The first, the general survey (S) approach, was to present faculty in a field with characteristics of doctoral programs and ask them to identify the ones they felt were the most important to doctoral program quality.
The second, the rating or regression (R) approach, was to ask a sample of faculty to provide ratings (on a scale of 1 to 6) for a representative sample of programs and then to ascertain how, statistically, those ratings were related to the measurable program characteristics. In many cases the rankings that could be inferred from the S approach and the R approach were very similar, but in some cases they were not. Thus the committee decided to publish both the S-based and R-based rankings and to encourage users to look at the range of rankings on both measures. Appendix G shows the correlations of the medians of the two overall measures for programs in each field. The fields for which the agreement between the R and S medians is poorest are shown in Box 4-1.
BOX 4-1 Fields for Which the Correlation of the Median R and S Ranking Is Less than 0.75

Animal Sciences
Civil and Environmental Engineering
Comparative Literature
French and Francophone Language and Literature
Geography
German Language and Literature
Linguistics
Mechanical Engineering
Pharmacology, Toxicology and Environmental Health
Philosophy
Religion
Sociology
Spanish and Portuguese Language and Literature

The online tables that accompany this study present ranges of rankings for two overall measures for all ranked programs and additional ranges of rankings for three dimensional measures. Those who view rankings as a competition may find this abundance of rankings confusing, but those who care about informative indicators of the quality of doctoral programs will likely be pleased to have access to data that will help them to improve their doctoral programs.
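The agreement check behind Box 4-1 can be illustrated with a short sketch. This is not the committee's code; it is a minimal, synthetic illustration, assuming each program's R and S rankings are available as arrays of resampled values, of how one might correlate the medians of the two measures within a field and flag fields that fall below the 0.75 threshold.

```python
import numpy as np

def median_rank_correlation(r_ranges, s_ranges):
    """Correlation between the medians of each program's R-based and
    S-based ranking distributions within one field.

    r_ranges, s_ranges: lists of 1-D arrays, one array of resampled
    rankings per program (synthetic stand-ins for the 500 calculations).
    """
    r_med = np.array([np.median(r) for r in r_ranges])
    s_med = np.array([np.median(s) for s in s_ranges])
    return float(np.corrcoef(r_med, s_med)[0, 1])

# Synthetic example: 6 programs, 500 resampled rankings each, where the
# R and S measures agree up to noise.
rng = np.random.default_rng(0)
true_rank = np.arange(1, 7)
r_ranges = [true_rank[i] + rng.normal(0, 1.0, 500) for i in range(6)]
s_ranges = [true_rank[i] + rng.normal(0, 1.0, 500) for i in range(6)]

corr = median_rank_correlation(r_ranges, s_ranges)
flagged = corr < 0.75  # fields below 0.75 were listed in Box 4-1
```

In a field with few raters or heterogeneous forms of scholarly productivity, the medians would scatter more and the correlation would drop, which is how fields ended up in Box 4-1.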
SUMMARY OF THE METHODOLOGY OF THE ILLUSTRATIVE PROGRAM RANKINGS

Figure 4-1 shows the steps involved in calculating the two types of overall program rankings (R and S).

1. DATA. Answers to questions provided by 4,838 doctoral programs at 221 institutions and combinations of institutions in 59 fields across the sciences, engineering, social sciences, arts, and humanities, covering institutional practices, program characteristics, and faculty and student demographics, obtained through a combination of original surveys and existing data sources (NSF surveys and Thomson Reuters publication and citation data).

2. WEIGHTS. In two surveys shown in Appendix D, program faculty provided the NRC with information on what they value most in Ph.D. programs: (1) faculty were asked directly how important they felt 21 items in a list of program characteristics were (for S weights); (2) a sample of faculty rated a sample of programs in their fields, and these ratings were then related through regressions to the same items as appeared in (1), using a principal components transformation to correct for collinearity (for R weights).

3. ANALYSIS. "Survey (S)" and "regression-based (R)" weights provided by faculty were used to calculate separate ratings, reflecting the multidimensional views faculty hold about factors contributing to the quality of doctoral programs.

4. RANGES OF RANKINGS. Each program's rating was calculated 500 times by randomly selecting half of the raters from the faculty sample in step 2 and also incorporating statistical and measurement variability. Similarly, 500 samples of survey-based weights were selected. The R weights and the S weights were then applied to 500 randomly selected sets of program data to produce two sets of ratings for each program. These ratings for each of the 500 samples determined the R and S rank orderings of the programs. A "range of rankings" was then constructed showing the middle 90 percent range of calculated rankings. What may be compared, among programs in a field, is this range of rankings.
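The regression step in step 2 can be sketched in code. The following is a simplified, synthetic illustration, not the committee's actual procedure (which is detailed in Appendix J): it standardizes program characteristics, transforms them to principal components to guard against collinearity, regresses ratings on the leading components, and rotates the coefficients back to obtain one implied weight per measure. The variable names and the choice of three components are assumptions for the demonstration.

```python
import numpy as np

def r_weights(X, ratings, n_components=3):
    """Sketch of regression-based (R) weights: regress faculty ratings
    on program characteristics after a principal components
    transformation, which corrects for collinearity among measures.

    X: (programs x measures) array of program characteristics.
    ratings: one mean faculty rating per program.
    Returns one implied weight per original measure.
    """
    # Standardize each measure so weights are on a common scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Principal components via SVD of the standardized data.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T  # component scores per program
    beta, *_ = np.linalg.lstsq(scores, ratings - ratings.mean(), rcond=None)
    # Rotate component coefficients back to weights on the measures.
    return Vt[:n_components].T @ beta

# Synthetic demo: ratings driven mainly by measure 0, with measure 1
# deliberately made collinear with it.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=40)
ratings = 3.0 + 0.8 * X[:, 0] + 0.1 * rng.normal(size=40)
w = r_weights(X, ratings)
```

Because measures 0 and 1 are nearly identical, the recovered weight is shared between them rather than assigned arbitrarily to one, which is the point of the principal components correction.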
Faculty were surveyed to obtain their views on the importance of different characteristics of programs as measures of quality.1 Ratings were then constructed based on these faculty views of how those measures related to criteria of program quality, as discussed in the section on dimensional measures. The views were related to program quality using two distinct methods: (1) asking faculty directly to rank the importance of characteristics in a survey (S); and (2) asking faculty raters2 to provide reputational program ratings (R) for a sample of programs in a field and then relating these ratings, through a regression model that corrected for correlation among the characteristics, to data on the program characteristics.

The two methods approach the ratings from different perspectives. The direct, or survey-based, approach is a bottom-up approach that builds up the ratings from the importance that faculty members give to specific program characteristics without reference to any actual program. The regression-based method is a top-down approach that begins with ratings of actual programs and uses statistical techniques to infer the weights given by the raters to specific program characteristics. The survey-based approach is idealized: it asks about the characteristics that faculty feel contribute to the quality of doctoral programs without reference to any particular program. The second approach presents the respondent with 15 programs in his or her field and information about them3 and asks for ratings of program quality,4 but the responders are not explicitly queried about the basis of their ratings. The weights derived from each approach were then applied to the value of the 20 measures for each program to yield two sets of ratings for each program.
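The bottom-up, survey-based side can be sketched in the same spirit. This is a deliberately simplified illustration, not the NRC's actual weighting procedure (Appendix J describes that): it averages the importance scores faculty assign to each characteristic and normalizes the averages into weights. The scoring scale and data here are synthetic assumptions.

```python
import numpy as np

def s_weights(importance):
    """Sketch of survey-based (S) weights: average the importance
    scores faculty assign to each program characteristic, then
    normalize so the weights sum to 1.

    importance: (respondents x characteristics) array of scores.
    """
    mean_importance = importance.mean(axis=0)
    return mean_importance / mean_importance.sum()

# Synthetic demo: 100 faculty scoring 4 characteristics on a 1-5 scale,
# with characteristic 0 systematically rated as more important.
rng = np.random.default_rng(2)
importance = rng.integers(1, 6, size=(100, 4)).astype(float)
importance[:, 0] += 1.0
w_s = s_weights(np.clip(importance, 1, 5))
```

The contrast with the regression-based sketch is the direction of inference: here the weights come straight from stated importance, with no actual program ever rated.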
Each rating was then recalculated 500 times using different samples of raters and varying the data values within a range.5 The program ratings obtained from all these calculations could then be arranged in rank order and, in conjunction with all the ratings from all the other programs in the field, used to determine a range of possible rankings. Because of the various sources of uncertainty, each ranking is expressed as a range of values. These ranges were obtained by taking into account the different sources of uncertainty in these ratings (statistical variability from the estimation, program data variability, and variability among raters). A Monte Carlo selection was used to vary the selection of raters and of data. The measure of uncertainty is expressed by reporting the endpoints of the 90 percent range of rankings6 for each program, that is, the range that contains the middle 90 percent of a large number of ratings calculations that take uncertainty into account.7

In summary, the committee obtained a range of rankings for each program in a given field by first devising two sets of weights through two different methods, direct (survey-based) and regression-based. It then standardized all the measures to put them on the same scale and obtained ratings by multiplying the value of each standardized measure by its weight and adding the products together. It acquired both the direct weights and the coefficients from regressions through calculations carried out 500 times, each time with a different set of faculty, to generate a distribution of ratings that reflects their variability. The range of rankings for each program was obtained by trimming the bottom 5 percent and the top 5 percent of the 500 rankings to obtain the 90 percent range. This method of calculating ratings and rankings takes into account variability in rater assessment of the things that contribute to program quality within a field, variability in the values of the measures for a particular program, and the range of error in the statistical estimation.

It is important that these techniques yield a range of rankings for most programs. The committee does not know the exact ranking for each program, and to try to obtain one, by averaging, for example, would be misleading because it has not assumed any particular distribution of the range of rankings.8 Thus within the 90 percent range, a program's rankings could be clustered at one endpoint or the other, so that averaging the two endpoints could be misleading. The datasheet that presents the range of rankings for each program lists the programs alphabetically and gives the range for each program.

1 All questionnaires, including that for faculty, appear in Appendix D.
2 The raters were chosen through a sampling process that was representative of the distribution in each field of faculty by rank, size of program, and region of the country.
3 The following data were given to the raters: the program URL, the list of program faculty, the average number of Ph.D.'s (2001-2006), the percentage of new Ph.D.'s planning academic positions, the percentage of the entering cohort completing their degree in six years or less (fields outside the humanities) or eight years or less (humanities), the median time to degree (2004-2006), the percentage of female faculty, and the percentage of faculty from underrepresented minorities. All data were for 2005-2006 unless otherwise indicated.
4 The question given raters about program quality was as follows: "On a scale from 1 to 6, where 1 equals not adequate for doctoral education and 6 equals a distinguished program, how would you rate this program?" The response options were: 1 = Not Adequate for Doctoral Education; 2 = Marginal; 3 = Adequate; 4 = Good; 5 = Strong; 6 = Distinguished; 9 = Don't Know Well Enough.
5 The range of data values was either plus or minus 10 percent or the actual range of variation if multiyear data were collected on the questionnaire.
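The resampling-and-trimming procedure just summarized can be sketched as follows. This is a minimal, synthetic illustration of the idea only: it perturbs each program's measures within plus or minus 10 percent (mimicking the data-variability step), recomputes standardized weighted ratings 500 times, ranks the programs on each draw, and reports the 5th and 95th percentiles of each program's rankings. The rater-resampling step, which the study also performed, is omitted here for brevity, and the weights and data are invented for the demonstration.

```python
import numpy as np

def ranking_range(X, weights, n_draws=500, jitter=0.10, seed=0):
    """Sketch of the range-of-rankings calculation: jitter the program
    measures, recompute weighted ratings n_draws times, rank programs
    on each draw, and return the middle 90 percent (5th to 95th
    percentile) of the rankings per program."""
    rng = np.random.default_rng(seed)
    n_programs = X.shape[0]
    ranks = np.empty((n_draws, n_programs), dtype=int)
    for d in range(n_draws):
        noisy = X * rng.uniform(1 - jitter, 1 + jitter, size=X.shape)
        # Standardize measures so they are on the same scale.
        Z = (noisy - noisy.mean(axis=0)) / noisy.std(axis=0)
        rating = Z @ weights
        order = np.argsort(-rating)          # rank 1 = highest rating
        ranks[d, order] = np.arange(1, n_programs + 1)
    lo = np.percentile(ranks, 5, axis=0).astype(int)
    hi = np.percentile(ranks, 95, axis=0).astype(int)
    return lo, hi

# Synthetic demo: 10 programs measured on 4 characteristics.
rng = np.random.default_rng(3)
X = np.abs(rng.normal(5, 2, size=(10, 4)))
weights = np.array([0.4, 0.3, 0.2, 0.1])
lo, hi = ranking_range(X, weights)
```

Programs whose measures sit close together produce wide, overlapping intervals under this scheme, which is exactly why the committee reports ranges rather than point rankings.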
Users are encouraged to look at groups of programs that are in the same range as their own programs, as well as programs whose ranges are above or below, in trying to answer the question "Where do we stand?" A similar technique was used to calculate the range of rankings for each of the dimensional measures for each field. Some possible ways of using the ranges of rankings and the data tables are discussed in Chapter 6. The rankings for the overall R and S measures and for the dimensional measures for each of the programs in each of the 59 fields with ranges of rankings are available on the Web site (www.nap.edu/rdp) and should be taken as illustrative of the different approaches.9

The 2009 Guide to the Methodology described a methodology that assumed that the R-based coefficients could be combined with the S-based coefficients using a formula that appears on page 48 of the Technical Appendix to the prepublication version of the guide. This formula relied on the variances of the samples used to calculate each set of coefficients. Upon looking at every field, however, the committee found that these variances could be very large for some fields, especially those in which the field was heterogeneous, in the sense that the same field encompassed very different forms of scholarly productivity, or in which there were relatively few raters. This situation resulted in R and S medians that did not correlate well, and so the committee abandoned its plan to combine the coefficients that were calculated in the two ways. Instead of one overall range of rankings, the committee presents these two measures separately. The fields for which the correlation of the two measures at the median was below 0.75 were listed earlier in Box 4-1, with details shown in Appendix G.

6 The committee calls these endpoint values the 95th percentile and the 5th percentile.
7 The 90 percent range eliminates the top and bottom 25 ratings calculated from 500 regressions and 500 samples of direct weights from faculty. The range contains 90 percent of all the rankings for a program. In the Guide to the Methodology, the range chosen was 50 percent, but the committee later decided that this range was overly restrictive.
8 Two programs with the same 90 percent range could have very different means and medians.
9 The 24,190 rankings (one range for each of the 5 measures for 4,838 programs) are too numerous to present in this written report.
DIFFERENCES FROM THE 1995 REPORT

The summary in Table 4-1 makes it immediately clear that there are significant differences in the methodology for the two studies. These differences alone can have an effect on the relative ranking of a program. Here are some of the more obvious sources of difference:

TABLE 4-1 Summary of Differences Between 1995 and 2006 Studies

University Participation
1995 Study: 274 universities (including schools of professional psychology).
2006 Study: 221 universities and combinations of universities.

Field Coverage
1995 Study: 41 fields, all of which were ranked.
2006 Study: 59 ranked fields, 3 fields not ranked but with full data collection, 14 emerging fields.

Program Inclusion
1995 Study: Nominated by institutional coordinators.
2006 Study: Based on NSF Ph.D. production data and the nominations of institutional coordinators.

Number of Programs
1995 Study: 3,634 ranked.
2006 Study: 4,838 ranked, 166 unrated.

Faculty Definition
1995 Study: 78,000 total, 16,738 nominated as raters (faculty could be counted in more than one program).
2006 Study: Of the 104,600 total, 7,932 faculty were chosen through a stratified sample for each field to participate in the rating study. Faculty could be counted in more than one program, but were usually counted as "core" in only one. Faculty members were allocated fractionally among programs according to dissertation service so that, over all programs, each was counted no more than once.

Ratings and Rankings
1995 Study: Raters nominated by the institutional coordinators were sent the National Survey of Graduate Faculty, which contained a faculty list for up to 50 programs in the field. Raters were asked to indicate familiarity with program faculty, scholarly quality of program faculty (scale 1-6), familiarity with graduates of the program (scale 1-3), effectiveness of program in educating research scholars (scale 1-4), and change in program quality in the last five years (scale 1-3). Rankings were determined for each program by calculating the average rating for a program and arranging all the programs from lowest to highest based on average rating.
2006 Study: 1. All faculty were given a questionnaire and asked to identify the program characteristics in three categories that they felt were most important, and then to identify the categories that were most important. This technique provided the survey-based (S) weights for each field. 2. A stratified sample of faculty in each field were given a stratified sample of 15 or fewer programs to rate on a scale of 1 to 6. Included in the data for raters were a faculty list and program characteristics. These ratings were regressed on the program characteristics to determine the regression-based (R) weights. These weights were then assumed to hold for all programs in the field so that all programs could receive a rating based on these weights. 3. The S weights and the R weights, calculated as just described, were used to calculate S rankings and R rankings. 4. Uncertainty was taken into account by introducing variation into the values of the measures and by repeatedly estimating the ratings obtained by taking repeated halves of the raters chosen at random. Ratings were calculated 500 times. 5. The ratings in step 4 were ordered from lowest to highest. The ratings of all programs in a field were pooled and arranged in rank order. The range covering 90 percent of rankings was then calculated for each program.a

a This is a simplified description. The exact process is more complex and is described in detail in Appendix J.

Measurement of Quality

The 1995 measure of program quality is known as a "reputational measure"; that is, raters judged the "scholarly quality of program faculty." The 1995 study noted that this measure is highly correlated with program size and, quite possibly, with visible faculty.10 Reputation may also be "dated" and not reflect recent changes in faculty composition. Finally, the reputation of program faculty may not be closely related to faculty performance in mentoring students or encouraging a high proportion to complete their degrees within a reasonable period of time.11

By contrast, in its rating exercise the committee for the current study asked respondents for their familiarity with each program and presented data on size, completion, time to degree, and faculty diversity. It also provided a Web site for the program, in addition to a faculty list. The task was to rate the program rather than the scholarly quality of program faculty. A rater had to rate at most 15 programs, not 50. Once the ratings were obtained, they were then related to the 20 measures through a modified regression technique.12

Specification of the Measures

In addition to the reputational measures, the 1995 study provided a few program characteristics: faculty size, percentage of full professors, and percentage of faculty with research support.
In addition, awards and honors received in the previous five years and the percentage of program faculty who had received at least one honor or award in that period were given for the arts and humanities. For engineering and the sciences, the percentage of program faculty who published in the previous five years and the ratio of citations to total faculty, as well as the Gini coefficients for these measures (a measure of dispersion), were shown. Data were also presented on students: the total number of students, the percentage of students who were female, and the number of Ph.D.'s produced in the previous seven years. Finally, information was provided on doctoral recipients: the percentage who were female, minority, and U.S. citizens; the percentage who reported research assistantships and teaching assistantships as their primary form of support; and the median time to degree. But even though all of these "objective measures" were reported, they played no explicit part in determining program ranking. By contrast, the current study explicitly includes most of these measures and many more, and attempts to relate them directly to the rating that goes into the program ranking.

10 "Visible faculty" refers to faculty who are highly productive and visible in the scholarly literature, but also faculty who may have been highly productive in the past, are less productive in the present, and are often called upon for public comment.
11 For a more detailed discussion of the strengths and weaknesses of reputational measures, see the 1995 study (National Research Council, Research-Doctorate Programs in the United States, 22-23) and the section in this chapter "Cautionary Words."
12 For details of the statistical techniques, see Appendix J.

Overall Comparability

If the "quality" of a program is unchanged, will any of the present ranges of rankings be the same as the 1995 ranking? Although an excellent program is an excellent program by any measure, there is no reason to expect the 1995 rankings to match the present range of rankings on either the S-based or the R-based measure. As this description of the two studies makes clear, the studies used different methodologies for all three calculations. Some important sources of variability are as follows:

• The current study is highly data-dependent. Although the data submitted by the universities were checked and verified repeatedly, errors may remain, and large errors could skew the rankings. Nonreputational data were not explicitly a part of the 1995 rankings, although they were reported in tables in the appendixes.
• The research strength of the faculty as measured by publications and citations was an important determinant of quality in both studies, but the method of counting differed between the studies in two important respects. First, in the current study, publications for the previous 10 years for humanities faculty, which were not counted in 1995, were collected from faculty résumés; books were given a weight of 5, and humanities articles were given a weight of 1. Second, in non-humanities fields, the 1995 study counted citations for articles published by faculty that had appeared in the previous five years.
In the current study, citations that appeared in 2001-2006 were traced to articles that had been published as far back as 1981. This method of counting had the advantage of including "classic" long-lived articles. Again, the committee was unable to collect citation data for the humanities.
• The committee for the current study asked the institutional coordinators to name the programs they wished to include, but it did define a program as an academic unit that fits at least three of these four criteria:
  - Enrolls doctoral students
  - Has a designated faculty
  - Develops a curriculum for doctoral study
  - Makes recommendations for the award of degrees.13
Because separate programs may be housed in different academic units, a few institutions used this definition to split what would normally be considered a program into smaller units that still met the criteria; that is, what is normally perceived as a unified program was ranked as separate programs. In the rating sample, however, only the one program judged to be the major program in the field at that institution was included.

13 These were the criteria listed on the NRC program questionnaire.
• Dimensional measures were not included in the 1995 study. In the planning meetings that preceded the study, the point was repeatedly raised that earlier rankings had not explicitly taken into account measures that reflected on graduate education14 or the diversity of the educational environment.

In summary, the current study differs in methodology and conception from the 1995 study. Both studies do provide rankings; however, the current study provides ranges of rankings, reflecting a variety of sources of uncertainty. In addition, the rankings presented here are illustrative of two different approaches. There are other approaches and weightings of characteristics that reflect alternative user values. It is possible to try to compare the sets of rankings from the two studies, but the definition of faculty, the methods of enumerating publications and citations, and the inclusion of additional characteristics in this study have all changed, as well as the methodology.

CAUTIONARY WORDS

A Guide to the Methodology of the National Research Council Assessment of Doctoral Programs (2009) details the methodology used to create the rankings in the current study. As noted in the previous section, the methodology adopted in the current work is substantially different from that used to obtain the rankings described in the 1995 National Research Council report Research-Doctorate Programs in the United States: Continuity and Change, although it is very similar to that proposed in the 2003 NRC report Assessing Research-Doctorate Programs: A Methodology Study. Under the current methodology, when program measures in a field are similar, program differences in the range of rankings can be highly dependent on the precise values of the input data and very sensitive to errors in those data.
The committee and the staff have worked diligently in recent years to ensure the quality of the measures used to generate ratings and rankings and have tried to reduce measurement errors as much as possible. Such errors can arise from clerical mistakes, from misunderstandings by respondents about the nature of the data requested from them, or from problems within the public databases used. That said, even though the input data underwent numerous consistency checks and the participating respondent institutions were given the opportunity to provide additional quality assurance, the committee is certain that errors in input data remain, and that these errors will propagate through to the final posted rankings. Its hope is that after all of the input data are made public, any significant errors will be found and reported so they can be rectified in a timely fashion before the ranking and rating exercise is repeated.

Some readers may be surprised about the degree to which program rankings will have changed from the 1995 report. These changes may stem from three factors: (1) real changes in the quality of the programs over time; (2) changes in the principles behind the ranking methods; and (3) simple error, either statistical or from faulty data. The reader should keep in mind that the charge to the committee and the consequent decisions of the committee may have increased the sensitivity of the results to the third factor, and it would now like to spell out some of these issues in greater detail.

14 In the 1995 study "93E" was a reputational measure of the effectiveness of the program in graduate education, but it was very closely correlated with "93Q," the reputational measure of scholarly quality. The committee felt it needed a separate measure based on data.
60 A DATA-BASED ASSESSMENT OF RESEARCH-DOCTORATE PROGRAMS IN THE U.S.

Reputational Measures

From the outset the committee, responsive to the statement of task, favored producing a large variety of measures that correlate with the quality of Ph.D. programs. Those measures included publications and citations, peer recognition in the form of honorific awards, and indicators of the resources necessary to create new knowledge. One measure rejected was the direct use of perceived quality, or reputational standing, of these programs, even though this measure was the principal one used in the 1995 study.

At present there is widespread distaste in the academic community for the use of reputational measures. On the one hand, reputational measures are generally recognized to have many strengths, including subtlety and breadth of assessment and the widespread use of such markers. On the other hand, reputational measures may reflect outdated perceptions of program strength as well as the well-known halo effect, by which some weak programs at a strong institution may be overrated.15 On balance, recognition of these shortcomings resulted in the committee's decision to reject the direct use of these perceived-quality measures.

But the committee was divided: some members did not want to collect data on perceived quality at all, while others favored the direct use of reputation. The policy finally adopted was an intermediate one: to collect direct data on the perceived quality of Ph.D. programs only for a sample of programs in each field and from a sample of faculty members who had responded to the faculty survey that produced the "direct measures" of quality. The ratings that resulted were then correlated with the measured variables (such as citations, honors, and awards), and "weights" were obtained for the latter to best "predict" the reputational measures.
The idea here was to benchmark objective measures against a reputational measure, but not to use the reputational measure itself. This was the procedure recommended in the 2003 NRC report, and it had numerous consequences, foreseen and unforeseen. Perceived quality, or reputation, is, of course, real, and it is real in its consequences. Reputation affects students' and professors' perceptions and their actions related to graduate Ph.D. education. Because it is an important element in the measurement of program quality, the methodology was designed to utilize its virtues but avoid some of the attendant defects (such as time lag). And yet this decision, while required by the statement of task, remained controversial, with some committee members still preferring the direct use of reputational measures. As several committee members noted, some of the other "quantitative" measures used, such as honors and awards, were in fact very closely related to and reflected perceived quality, that is, reputation.

Weights and Measures

The committee collected an unprecedented amount of useful data on Ph.D. programs. But to turn this set of discrete measures into an overall set of rankings, it had to combine the various measures into a summary measure, which required, in turn, decisions about how much weight to give to each of the measures. One obvious way to weight the different quality measures was to use faculty ratings of the importance of the measures in assessing overall program quality, and this method was one of the two (the S measure) used to derive the criteria for quality. However, because faculty were not asked to evaluate reputation as a quality, it was excluded from the summary measure constructed from the weighted average of strictly objective measures.

15 However, weak programs at strong institutions may benefit from the presence of the stronger programs in an increasingly interdisciplinary environment.
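The survey-weighted approach just described can be sketched in a few lines of code: standardize each objective measure across programs, then combine the standardized values using importance weights. The program names, measure names, and weights below are entirely hypothetical; this is a minimal sketch of a weighted composite, not the committee's actual S-measure formula.

```python
import statistics

def s_measure(programs, weights):
    """Weighted sum of standardized (z-scored) measures: an S-like
    composite. Measure names and weights here are hypothetical."""
    z = {}
    for m in weights:
        vals = [p[m] for p in programs.values()]
        mean, sd = statistics.mean(vals), statistics.pstdev(vals)
        # Standardize so measures on different scales are comparable.
        z[m] = {pid: (p[m] - mean) / sd for pid, p in programs.items()}
    return {pid: sum(weights[m] * z[m][pid] for m in weights)
            for pid in programs}

# Hypothetical programs and survey-derived importance weights.
programs = {
    "A": {"pubs_per_faculty": 3.0, "cites_per_pub": 10.0},
    "B": {"pubs_per_faculty": 2.0, "cites_per_pub": 14.0},
    "C": {"pubs_per_faculty": 1.0, "cites_per_pub": 6.0},
}
weights = {"pubs_per_faculty": 0.6, "cites_per_pub": 0.4}
scores = s_measure(programs, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Standardizing before weighting keeps a measure with large raw values (such as citation counts) from dominating the composite simply because of its scale.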
An attempt to model reputation was made by conducting a rating exercise for a sample of programs and then relating these ratings to the same characteristics as were included in the "S" measure. The result was a measure of quality (the R measure) based on statistical modeling of the quality evaluations (the regression-based model) but made up of the same components as the direct measures. The measures of importance to the faculty were correlated with the perceived-quality measure, suggesting that these two parameters describe valid measures of real program quality. The R ranking, then, reflects the relation of the subjective ratings to the data; but by relying entirely on objective data, even this measure, in effect, eliminated any subjective adjustments raters might make in the way they perceived the quality of specific programs, as contrasted with the application of rules they might apply to evaluate programs in general.

Reliance solely on data-based objective measures rather than the explicit use of direct measures of reputation sometimes resulted in ratings that appeared to some committee members to lack face validity. To take one example of what may be lost when using only the objective data: faculty members in almost all fields in the sciences give a high weight to citations. Citations, however, are a complex indicator of impact or quality, and, of course, they are indirect measures of reputation. Their complexity arises from the equally complex pattern of behavior of scholars and scientists when referencing works in their published writings. The pattern of citations varies considerably by field, by specialty, by the use of books versus journals as the principal mode of scholarly communication, by self-citation practices, and by the decay of citation frequency over time, among many other patterned differences.
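The regression-based approach can be sketched similarly: fit the sampled perceived-quality ratings against the same objective measures, then apply the fitted coefficients to every program, rated or not. The data values below are invented for illustration, and ordinary least squares is used here as a simple stand-in for the committee's actual statistical model.

```python
import numpy as np

# Hypothetical sample: rows are the programs that received reputational
# ratings; columns are two standardized objective measures.
X = np.array([
    [1.2, 0.3],
    [0.5, 1.1],
    [-0.4, -0.2],
    [-1.3, -1.2],
])
ratings = np.array([4.5, 4.0, 2.8, 1.9])  # sampled perceived-quality ratings

# Regress ratings on the measures; the fitted coefficients become
# data-derived weights, used in place of the ratings themselves.
X1 = np.column_stack([np.ones(len(X)), X])  # add an intercept column
coef, *_ = np.linalg.lstsq(X1, ratings, rcond=None)

# Apply those weights to ALL programs to obtain R-like scores.
all_programs = np.array([[0.8, 0.9], [-0.6, 0.4]])
r_scores = np.column_stack([np.ones(len(all_programs)), all_programs]) @ coef
```

Note that only the fitted weights survive into the R scores: any rater judgment not explained by the measures ends up in the regression's error term, which is exactly the loss of subjective adjustment described above.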
Take, for example, two equally distinguished statistics departments: one heavily emphasizes statistical theory, the other biostatistics. Every member of these departments is honored in a variety of ways. And yet it is almost certain that the average number of citations will be far greater in the department that emphasizes biostatistics, because it is a much larger field with a far larger publishing audience than statistical theory. Thus the score on citations as a measure of quality will differ greatly between the two departments and will lead to very different ratings. A reputational measure would have confirmed that both of these departments are truly distinguished.

The ranges of rankings based on S measures and R measures differ in the degree to which subjective considerations enter. The ratings on which the R measures are based may depend on the subjective assessment of omitted variables for which there may be no quantitative measures. The omission of subjective considerations in the regression-based measure is treated as an error term in the regression equation and does not appear in the model values reported in the rankings. The result is ranges of rankings for some programs that deviate markedly from what many experts in the fields might find convincing when they take subjective considerations into account. By contrast, the S measures may be subject to variations resulting from incorrect or misunderstood reports of data. Users of these ranges of rankings need to be aware of the consequences of using purely objective measures and interpret the range of rankings in light of the major methodological differences between what was done in this study and what has been done previously.

Principles of Academic Organization

The interpretation of ranking ranges is further complicated by two other decisions that the committee made in designing the study.
It decided to accept the respondent institution's principles of academic organization and thus to include multiple programs from the same university in the same program category if they met the criteria for a separate program and the university submitted the data for assessment. Each of these programs was rated separately, but all were included in the computation of the range of rankings. For example, Harvard has three doctoral programs under "Economics," and Princeton has two doctoral programs under "History."16 Because the assessed quality of these programs tends to be similar, multiple programs from the same university could occupy multiple slots in a similar position in the range of rankings, thereby "crowding out" other programs from higher ranking ranges and thus distorting the reported results.

Another factor is that the committee's definition of a program to be rated did not always produce uniform definitions of comparable program areas, leading in some cases to results that are difficult to interpret in terms of ranges of rankings. For example, some mathematics programs include statistics and applied mathematics, whereas others do not. Some anthropology departments include physical anthropology, while others do not. Perhaps at the extreme is the broad field of public health, where different subfields, such as biostatistics and epidemiology, are included as if they were the same program area, when clearly they are different in kind. Where these differences were known, they are noted, but the reader should be alert, when comparing specific programs, to the possibility that they are not completely comparable.

Counting Citations and Publications

The committee initially chose to collect citation data for relatively recent publications produced by core and new faculty in each of the Ph.D. programs. This decision, however, tended to bias the data against Ph.D.
programs with more senior scholars, particularly in the social and behavioral sciences and the humanities, where the pattern of publication over a career differs considerably from the patterns in the physical and biological sciences and engineering. After this bias was noted, the committee decided to collect citations over a much longer timescale. Thus publications going back roughly 20 years, to 1981, in the sciences, social sciences, and engineering fields are considered in the citation count. This set of procedures can lead to a bias either for or against senior scholars, and without further research even the sign of the effect is uncertain. The situation is inherently complex.

Summary of Cautions

The cautions mentioned here are intended to alert readers to aspects of the methods used in this study that differ from those used in other studies, including ones conducted by the National Research Council. These innovations may produce ranges of rankings that surprise knowledgeable people in a field and contradict their views of the actual quality of specific programs. An examination of the data on individual variables for a program, together with the weights assigned to the different objective measures for that program, should help to clarify the reasons for the specific rankings. The subtle, nonquantifiable variables that might make reputation more than the weighted sum of objective variables are not captured by the method adopted by the committee. In view of these limitations to the methods for obtaining ranges of rankings, some members of the committee remained skeptical that these results fully capture the relative quality

16 At Harvard each one is in a different administrative unit; Princeton has both a history program and a history of science program.
of the doctoral programs in certain fields. For this reason, the rankings should be treated as illustrative. In general, the range of rankings captures well the relative standing of most Ph.D. programs. Some outliers, however, cannot be explained by the data in hand, and had more robust measures of reputational standing, or perceived quality, been used, these anomalies might be better understood or might have disappeared. Therefore, the committee suggests that anyone making strong comparisons with the 1995 rankings using either the R or S measure be cautious. Such comparisons can lead to a misinterpretation of the "actual" rankings of programs, however they might be defined. It would be especially misleading to overstate the significance of changes in rankings from the 1995 NRC report in view of the differences in adopted methodologies.

Finally, it is useful at this point to return to the topic of simple errors in the input data, because this is the most serious problem with which the staff and the committee wrestled. At a very late stage in its work the committee undertook a final set of "sanity checks": the fields were divided up, and groups of academic fields were assigned to individual committee members to see if they could identify any anomalies in areas with which they were familiar. Many anomalies were found and addressed, but surely some escaped notice. The committee thus urges readers to use the illustrative ranges of rankings with caution. Small differences in the variables can result in major differences in the range of rankings, especially when a program is very similar on other measures to other programs in its field.
Individual instances of programs that should have been ranked considerably lower or higher than the tables indicate may nonetheless emerge, and so it is strongly recommended that the rankings of individual programs be treated with circumspection and analyzed carefully.
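The sensitivity just described, where small differences in the input variables can produce large differences in the range of rankings, can be illustrated with a simple perturbation experiment: add small random noise to the inputs, re-rank, and record the spread of ranks each program receives. Everything below (the function name, the noise level, the toy programs) is hypothetical, and this is only an illustrative sketch, not necessarily the resampling procedure the committee actually used.

```python
import random

def rank_range(score_fn, programs, trials=500, noise=0.05, seed=0):
    """Perturb each program's measures with small multiplicative noise,
    re-rank on every trial, and report the middle 90% of the ranks
    observed for each program. Purely illustrative."""
    rng = random.Random(seed)
    ranks = {p: [] for p in programs}
    for _ in range(trials):
        perturbed = {p: {m: v * (1 + rng.gauss(0, noise))
                         for m, v in meas.items()}
                     for p, meas in programs.items()}
        scores = score_fn(perturbed)
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, p in enumerate(ordered, start=1):
            ranks[p].append(rank)
    out = {}
    for p, rs in ranks.items():
        rs.sort()  # 5th and 95th percentile ranks bound the range
        out[p] = (rs[int(0.05 * len(rs))], rs[int(0.95 * len(rs)) - 1])
    return out

# Hypothetical programs: A is clearly ahead; B and C are nearly tied.
programs = {"A": {"m": 10.0}, "B": {"m": 5.0}, "C": {"m": 4.9}}
total = lambda ps: {p: sum(v.values()) for p, v in ps.items()}
ranges = rank_range(total, programs)
```

In this toy example the clearly superior program keeps a stable rank across trials, while the two nearly tied programs receive overlapping ranges of rankings, which is exactly the behavior the cautions above warn about when program measures in a field are similar.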