Read "Assessing Medical Technologies" at NAP.edu

Suggested Citation:"3. Methods of Technology Assessment." Institute of Medicine. 1985. Assessing Medical Technologies. Washington, DC: The National Academies Press. doi: 10.17226/607.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

3. Methods of Technology Assessment

As Chapter 1 indicates, technology assessment offers the essential bridge between basic research and development and prudent practical application of medical technology. We have a substantial body of methods that can be applied to the various tasks of assessment, and their availability makes possible the acceptance, modification, or rejection of new technologies on a largely rational basis. That rationality, however, depends on many factors that go well beyond safety and efficacy, including, among other components, economics, ethics, preferences of patients, education of physicians, and diffusion of information. The methods that have been developed can take some account of most of these components, although combining the results for the components is a major task and one that is far from settled or solved. The existence of these assessment methods provides a foundation for building a system of technology assessment for the nation.

The outline, introduction, and conclusions of this chapter were developed by Frederick Mosteller. The various sections of the chapter were drafted primarily by other authors identified at the opening of each section.

Most innovations in health care technology rest on some theoretical ideas held by the innovators. These ideas inevitably range in strength from very well informed to hopeful speculation. Beyond this, a few innovations are purely empirical in the sense that someone has noticed that the technology seemed to work, even though no underlying mechanism was proposed or understood. In considering medical technologies, no matter how strong or weak the theoretical justification, experience must be decisive. If in practice the innovation is clearly better or clearly worse than existing technologies, then the innovation deserves adoption or rejection. It is known from much experience that merely having a good idea, a good theory, or a constructive observation is not enough because there are so many unexpected interfering variables that may thwart the innovation and the innovator. Learning from controlled experience is central to progress in health care. Learning from experience itself without formal planning often presents great difficulties and sometimes leads to long-maintained fallacies, partly because of the lack of control of variables. This method is slow and expensive unless the effects are huge. Planning and analysis and scientific testing provide ways to strengthen the learning process. This chapter describes a number of techniques or methodologies that help to systematize learning from experience in health care technology.

Few people are acquainted with more than a few of the methods used for assessment. Usually investigators are acquainted with the few methods most frequently used in their own specialties. Consequently, it seems worthwhile to give a brief description of the more widely used methods and what they are most useful for studying.

For direct attack on evaluation through data acquisition, clinical trials are highly regarded. For generating hypotheses, the case study and the series of cases have special value. Registries and data bases sometimes produce hypotheses, sometimes they help evaluate hypotheses, and sometimes they aid directly in the treatment of patients. Sample surveys excel in describing collections of patients, health workers, transactions, and institutions.

Epidemiological and surveillance studies, although not synonymous, are well adapted to identifying rare events that may be caused by adverse effects of a technology.

Quantitative synthesis (meta-analysis) and group judgment methods give us ways to summarize current states of knowledge and sometimes to predict the future. Similarly, cost-effectiveness analysis (CEA) and cost-benefit analysis (CBA) offer ways of introducing costs and economics into these assessments. Modeling provides a way to simulate the future and still include complicated features of the real life process and to see what variables or parameters seem to produce the more substantial effects. When backed with strong, although limited, empirical investigation, it may add much breadth to an evaluation.

Sometimes what is learned to be true in a scientific laboratory may not, at first, be successfully applied in practical circumstances. Myriad reasons can explain this: the new technique is not correctly applied, or not applied to the right kinds of cases, or it is not applied assiduously enough, or too assiduously, etc. This idea in medical contexts is captured in the terms efficacy and effectiveness. Efficacy refers to what a method can accomplish in expert hands when correctly applied to an appropriate patient; effectiveness refers to its performance in more general routine applications. The relevance of these ideas here is that some of the methods presented below are more naturally adaptable to assessing one of these or the other. The reader will probably appreciate, for example, that surveillance and data banks point toward assessing effectiveness, and most randomized clinical trials point toward assessing efficacy.

Although randomized clinical trials offer the strongest method of assessing the efficacy of a new therapy, it is recognized that it is not possible to have randomized trials for every version of every innovation. However desirable that might be, it is not feasible. Consequently, other methods of assessment are often going to be depended on; of course, some technologies actually require other methods. This in turn means that steps need to be taken to strengthen the other methods. These steps have two forms. First, where possible, apply the known ways of improving studies, such as observational studies (for example, have a careful protocol, use random samples, use blindness where possible, and so on). Second, many of these methods could be improved if research were carried out to find new ways to improve them. Therefore, specific research that could lead to getting stronger results from the weaker methods is often suggested.
Possibly, research will find that particular methodologies are best when applied to special classes of treatments. For example, perhaps noninvasive drugs and devices could be handled in one way and invasive methods in another. Perhaps data banks and registries could offer good results from some class of problems. Answers to such questions are not now available.

At the same time that the need for improving the weaker methods is recognized, it is also recognized that the methods already in existence are not sufficiently often applied. The Office of Health Technology Assessment (OHTA) evaluates the safety and effectiveness of new or as yet unestablished medical technologies and procedures that are being considered for coverage under Medicare. Requests for these evaluations come from the Health Care Financing Administration (HCFA). OHTA carries out its evaluations by reviewing the literature and by getting advice from various agencies and professional organizations. The information so acquired is synthesized to reach some conclusion. OHTA does not gather primary data itself. Again and again, it turns out, and OHTA notes, that the primary data are almost nonexistent and that primary data would be required to reach a well-informed conclusion.

In advising HCFA about coverage for various medical technologies, OHTA prepared 65 reports in the years 1982, 1983, and 1984. Lasch (1985) reviewed these reports to see what the state of the informational base on safety and efficacy seemed to be (K. E. Lasch, Synthesizing in HRST Reports, unpublished report, Harvard School of Public Health, 1985). Lasch sorted the reports into four categories, as follows:

1. The technology enjoyed widespread use and was considered an established technology.
2. The data base for the technology was insufficient; there was a call for more studies and better research designs, or accuracy was questioned for diagnostic tests.
3. The data base was sufficient; the technology was not recommended.
4. The technology was outmoded, not routinely used, and not an established therapy.

After the studies were categorized for the 3 years, Lasch found the results shown in Table 3-1. The percentage values of the results are similar from year to year. The category of insufficient data stands out.

TABLE 3-1 Distribution of Technologies for Years 1982, 1983, and 1984 into Four Types [a] when Reviewed by OHTA for HCFA

Percentages for Each Year [b]

Category                                     1982        1983        1984        Total
Widespread use                               16 (4)      19 (4)      21 (4)      18 (12)
Insufficient data                            68 (17)     76 (16)     63 (12)     69 (45)
Data sufficient; technology not effective    4 (1)       0 (0)       0 (0)       2 (1)
Technology not used or outmoded              12 (3)      5 (1)       16 (3)      11 (7)
Totals                                       100 (25)    100 (21)    100 (19)    100 (65)

[a] Each of the 65 reports was assigned to one of the above categories based on a reading of the summary and discussion sections. Coding of the 65 reports revealed that the four categories were mutually exclusive; each report fell neatly into one of the categories.
[b] Numbers of studies shown in parentheses.

In noting that 69 percent of these assessments have insufficient data to reach a satisfactory conclusion, it should not be assumed that the technologies in the other categories have always been evaluated on the basis of strong data. The categories were chosen to generate a clear set when the evidence was inadequate. The first category of widespread use may also include poorly evaluated technologies. This study, then, offers a clear message that many technologies that physicians wish to use have not been adequately evaluated. Similarly, at the consensus conferences, speakers frequently point out the lack of primary data (National Institutes of Health [NIH], 1983, 1984). Thus, the most important need is to gather more primary data.

As we report later in this chapter, the Office of Technology Assessment (OTA; 1980a) polled data analysts who conduct cost-effectiveness and cost-benefit analyses of health care technologies and found lack of information to be a uniformly significant problem.

More primary research is needed, and this will have to be led in part by research physicians with training in quantitative methods and supported by doctoral-level epidemiologists and biostatisticians. All three groups are in short supply (National Academy of Sciences, 1978, 1981, 1983). At the least the development of methods will also require epidemiologists and biostatisticians. Therefore, on both grounds, we will need funds for training research personnel.

Many assessment methods are described in some detail in the sections that constitute the main body of this chapter. Unless explicitly interested in research methods, some readers may wish to scan cursorily through the chapter.

Most sections follow a pattern that opens with a brief description of the method, followed by typical purposes and uses and by a subsection addressing capabilities and limitations, including some remarks on ways of strengthening the method in practical use. Sometimes a final subsection discusses research that could be done that might lead to improvements in the method.

RANDOMIZED CLINICAL TRIALS*

*This section was drafted by Lincoln E. Moses.

The randomized clinical trial (RCT) is a method of comparing the relative merits (and shortcomings) of two or more treatments tested in human subjects. A well-designed and well-executed RCT is widely regarded as the most powerful and sensitive tool for the comparison of therapies, diagnostic procedures, and regimens of care.

More broadly, the RCT can be regarded as an unusually reliable method for learning from experience; its success lies in structuring that experience so as to foreclose many sources of ambiguity.
In the health sciences the method is applied not only in comparing therapies but also diagnostic methodologies, ways of imparting information to patients, and regimens of care (e.g., home care versus critical care units for certain heart patients). In general, if alternative ways of accomplishing an aim are in competition, the RCT may be the best technique for resolving their relative merits.

Notice that comparison is at the heart of the method. A clinical trial is not a device for ascertaining the health consequences of a toxic substance in food or for elucidating the etiology of a disease. It is a method for comparing interventions that are applied and controlled by the investigator.

The clinical trial becomes an RCT if there is a deliberate introduction of randomness into the assignment of patients (eligible for both, or all, of the treatments) to treatment A, treatment B, etc. The reasons for such a method of assignment are discussed below. Hereafter, when referring to an RCT, it is contemplated that it satisfies these two conditions:

1. No subject is admitted without having been judged to be equally suitable to

receive any one of the treatments being offered to the subject's class of patients.
2. No subject is admitted without having volunteered to receive either treatment, as may be assigned.

Practical Problems of Comparing Treatments

Two factors make it intrinsically difficult to compare different treatments. First, the subjects receiving the treatments usually are different people, so differences found between the treatments could be due to differences among the subjects in the groups. If the groups differ in any systematic way (whether recognized or not), the treatment comparison may be biased; bias can exaggerate, nullify, or reverse true differences. Second, even if the treatments could be compared in the same patients (as sometimes happens), the contrast between the treatments will vary from one patient to another, producing uncertainty in the overall assessment. This is the problem of variability. Large samples can reduce the disturbance of variability but do not help with bias.

If two treatment groups are differently constituted, then bias in the treatment comparison must be regarded as likely. The phrase "differently constituted" applies, for example, where the treatment groups are (1) admitted to the study by different means; (2) treated in different places, at different times, or by different sets of practitioners; (3) assessed by different groups; or (4) analyzed and reported by different teams.

Randomization in a clinical trial is aimed at preventing bias. Two characteristic features are essential to realizing that aim. First, the study is conducted under a protocol that makes explicit exactly what questions are to be studied, what treatments are to be applied, and how, to what kind of patients, when, and where. It also specifies how assessment of outcomes will be done and how statistical analyses will be conducted.
Second, the RCT calls for assignment of the respective treatments to each eligible patient admitted to the study by means of a random choice. The effect of this is to ensure that the two treatment groups are not "differently constituted"; indeed, they are brought into being as random subsets of a singly constituted group which is operationally defined by the protocol.

The protocol-controlled RCT is even stronger whenever knowledge of which treatment a patient has received is screened from participants (patients, treating physicians, outcome assessors). A result of such "blinding" is to ensure that placebo effects remain randomly assorted to the treatments. Another result is to prevent differential decisions about care during the study. It is especially important that those assessing outcomes be blind to the type of treatment unless the outcome is entirely objective, e.g., length of survival. In some cases, blinding of physicians may not be possible, such as when a medical modality is being compared with a surgical one.

The Protocol

The protocol is a written prescriptive document that spells out the purposes and rationale of the trial and how it will be conducted. Specifics include the criteria of eligibility for inclusion of patients in the trial and criteria for exclusion, and description of treatments, adjuvant therapy, outcome measurements, patient follow-up, and statistical analyses to be performed. The protocol also specifies the numbers of patients to be entered and the mechanics of randomization. The protocol is both a planning document and a procedures manual. The aim is to provide trustworthy answers at the end of the study to the following questions: What treatments were applied, to what kinds of patients, with what results? What do the results mean?

Provisions for blindness and for the order in which processes are to be performed can be central to the validity of a study and to the value of the protocol that governs it. If the decision to enter each patient into the trial is made in the knowledge of which treatment the next patient will receive, then ample opportunity for building up noncomparable treatment groups is at hand, so the protocol should not use alternate-patient assignment to the treatments. If a rather subjective diagnostic test W assesses a condition thought to be related to another test V, then W measured after V is not the same as W measured before V; it may be important that the protocol specify the order in which they are to be done. The careful protocol attempts to specify in advance all procedural steps that may materially affect execution of the trial and interpretation of its results.

A well-conducted RCT requires not only a good protocol but also that the trial be carried out in accordance with it. The protocol may call for specific steps to check on (and promote) protocol adherence. Staging and laboratory analyses may be checked by introducing (blindly) occasional standard specimens. Samples of study records may be checked back to more basic clinical records. Visits by monitors, combined with audit, may be routinely conducted in multicenter studies.

The protocol also has the character of a compact among the participating investigators, relevant human subjects committees, and funding sponsors. This contractual character lends stability to a study over its lifetime, helping to supply definite answers to the questions concerning what was done, to what kinds of patients, and with what results.
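The caution above against alternate-patient assignment can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the source: the function name `run_trial`, the prognosis model, the 0.3 cutoff, and the clinician's behavior are all hypothetical assumptions. The two treatments are constructed to be exactly equivalent, so any difference between arms reflects how the groups were assembled.

```python
import random
from statistics import mean

def run_trial(scheme, n_candidates=20000, seed=1):
    """Simulate a two-arm trial in which the treatments are truly equivalent.

    scheme == "alternate": the enrolling clinician knows the next slot and
    (hypothetical behavior) turns away poor-prognosis candidates whenever
    the next slot is treatment A.
    scheme == "random": enrollment is decided first and the arm is then
    drawn at random, so the assignment cannot influence who is admitted.
    Returns mean outcome in arm A minus mean outcome in arm B.
    """
    rng = random.Random(seed)
    outcomes = {"A": [], "B": []}
    next_alternate = "A"
    for _ in range(n_candidates):
        prognosis = rng.random()                   # assumed uniform on [0, 1)
        outcome = prognosis + rng.gauss(0.0, 0.1)  # no treatment effect at all
        if scheme == "alternate":
            arm = next_alternate
            if arm == "A" and prognosis < 0.3:
                continue                           # not enrolled; slot A stays open
            next_alternate = "B" if arm == "A" else "A"
        else:
            arm = rng.choice(["A", "B"])
        outcomes[arm].append(outcome)
    return mean(outcomes["A"]) - mean(outcomes["B"])

biased = run_trial("alternate")
fair = run_trial("random")
print(f"alternate assignment: A - B = {biased:+.3f}")
print(f"random assignment:    A - B = {fair:+.3f}")
```

Under these assumptions the alternating scheme shows an apparent advantage for treatment A that comes entirely from selective enrollment, while the randomized scheme shows a difference near zero.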
Random Assignment to Treatment

The primary reason for random assignment is to prevent bias by breaking any possible systematic connection of one treatment or the other with favorable values of interfering variables (whether recognized or not). A fuller appreciation of this principle may be gained by considering two alternative modes of treatment comparison that are sometimes advocated. The first is the use of historical controls; the second is the use of statistical procedures to adjust for treatment group differences in the important interfering variables.

The historically controlled trial (HCT) compares outcomes on a new treatment to outcomes in previous (historical) cases from the same setting. The motivation is to arrive at decisions sooner by assigning all eligible patients rather than only half of them to the new treatment. But because the treatment and control groups come from different time periods, they are "differently constituted groups." This raises the spectre of bias and sometimes the actuality. The drop in cardiovascular deaths and the decrease in perinatal mortality over the last decade are both not really understood, and both exemplify temporal shifts in control levels of the sort that vitiate historical controls. Time changes all things, including the patients' characteristics at a hospital, the effectiveness of adjuvant treatments not under study, the skill of surgeons with a new operation, and the skill of physicians with a new drug. Thus, it is hard to know when an HCT does reach a valid conclusion. There are successes and there are failures.

An example of what seems to be a successful HCT is that of a changing policy by an institution toward stab wounds. Originally, the policy had been to perform an exploratory laparotomy on all patients presenting with abdominal stab wounds.
On the basis of advances in handling wounds and some data from refusals to give consent, the institution decided to change to a policy allowing surgical judgment to be exercised. This reduced considerably the number of laparotomies performed (92 to 40 percent) and also the numbers of infections (Nance and Cohn, 1969). The overall complication rate dropped from 27 to 12

percent, and no complications occurred in 72 unexplored patients.

Byar et al. (1976) call attention to an RCT comparing placebo and estrogen therapy for prostate cancer in which the survival of placebo controls admitted in the first 2.5 years was significantly shorter (p = .01) than the survival of those admitted in the second 2.5 years, although admission criteria, in a fixed setting, were unchanged. They point out that the use of the early placebo group (as historical controls) would have falsely led to the conclusion that estrogen therapy (in the second period) was effective.

It is possible to consider the use of historical controls whenever the variation in successive control levels is statistically taken into account. However, it may be difficult or impossible to estimate that variation; that is a practical difficulty. Furthermore, there is a theoretical principle that applies. The work of Meier (1975), and later Pocock (1976), shows that for a given standard deviation in batch-to-batch random bias, there is a minimum study size (the number of experimental subjects) beyond which relying on historical controls, no matter how numerous they are, is inferior to dividing the sample into two equal groups, half experimental and half control.

In summary, historical control trials are inferior to RCTs because (1) differently constituted groups are inherently likely to produce bias; (2) if the historical controls were comparable and if the random bias of successive batches of controls had variability that was exactly known, then reliance on the historical control data would be preferable to randomization only for studies below a certain threshold size; and (3) knowledge of variability of the random bias is often not available.
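The threshold in point (2) can be illustrated with a deliberately simplified model. This is a sketch under stated assumptions, not the formulation used by the cited authors. Assume each patient's outcome has standard deviation sigma, that the historical control level carries a random batch bias with standard deviation tau, and that the historical controls are so numerous that their sampling error is negligible. With n experimental subjects, the variance of the estimated treatment difference is then sigma^2/n + tau^2 for the historical comparison and 4*sigma^2/n for an RCT that splits the same n into two equal groups. The historical comparison is better only when n < 3*sigma^2/tau^2. The function names below are hypothetical.

```python
def variance_historical(n, sigma, tau):
    """Variance of the treatment-difference estimate when all n subjects get
    the new treatment and are compared with unlimited historical controls
    whose level is off by a random bias of standard deviation tau."""
    return sigma**2 / n + tau**2

def variance_randomized(n, sigma):
    """Variance of the estimate when the same n subjects are split into
    two equal randomized groups of n/2 each."""
    half = n / 2
    return sigma**2 / half + sigma**2 / half

def threshold_size(sigma, tau):
    """Study size above which the randomized design has the smaller variance
    (obtained by solving sigma^2/n + tau^2 = 4*sigma^2/n for n)."""
    return 3 * sigma**2 / tau**2

# Illustrative (assumed) values: outcome SD of 1.0, batch-bias SD of 0.2.
sigma, tau = 1.0, 0.2
print("threshold size:", round(threshold_size(sigma, tau)))
for n in (50, 100, 400):
    print(n, variance_historical(n, sigma, tau), variance_randomized(n, sigma))
```

In this model the bias term tau^2 does not shrink as n grows, which is why adding more historical controls cannot rescue a large study.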
One often sees the argument that the need for randomization can be circumvented by making statistical adjustments for differently constituted subgroups, correcting for differences in the influential variables that affect outcomes, and rendering the subgroups comparable. It is easy to find statisticians who place little credence in this trust of statistical adjustment, and for cogent reasons. First, some of the most influential variables may not even be recognized as important. Second, the ones that are recognized as important may not have been measured, or they may not have been measured comparably. Third, just how to make the adjustment can be very unclear; mutually influential variables can be interrelated in ways that are both important and poorly understood. Randomization avoids these difficulties by ensuring that whatever the critical variables may be and however they may conspire together to affect the outcomes, they cannot systematically benefit one treatment over the other, beyond those vagaries of chance for which the significance test specifically makes allowances. This approach avoids the effort of trying to unravel the Gordian knot of causation and cuts through it at one stroke, by random assignment.

Before leaving the subject of random assignment, the idea of randomization within strata should be addressed. If some pretreatment variable, say stage of disease, is known to be strongly related to outcome, then it can be wise to design the study so that (nearly) equal numbers of both treatments occur at each level of that pretreatment variable. This kind of design is quite natural for multi-institutional studies, when each institution is treated as a stratum. Refining the randomization to be done separately within strata does not give added protection against bias, but it may increase the efficiency of a study, i.e., increase its effective sample size (usually only moderately).
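One common way to obtain (nearly) equal numbers of both treatments within each stratum is permuted-block randomization carried out separately per stratum. The source does not prescribe a mechanism, so the following is a minimal sketch under that assumption; the function name, the block size of 4, and the example strata are all hypothetical.

```python
import random
from collections import Counter

def stratified_block_assignments(patients, block_size=4, seed=2024):
    """Assign 'A' or 'B' to each patient using shuffled blocks kept per stratum.

    patients: iterable of (patient_id, stratum) in order of enrollment.
    Within each stratum, every completed block holds equal numbers of A and B,
    so the imbalance inside a stratum never exceeds block_size / 2.
    """
    if block_size < 2 or block_size % 2:
        raise ValueError("block_size must be a positive even number")
    rng = random.Random(seed)
    pending = {}   # stratum -> assignments still unused in its current block
    result = {}
    for patient_id, stratum in patients:
        block = pending.get(stratum)
        if not block:                       # no block yet, or block used up
            block = ["A", "B"] * (block_size // 2)
            rng.shuffle(block)
            pending[stratum] = block
        result[patient_id] = block.pop()
    return result

# Hypothetical enrollment list: 40 patients in two disease stages.
patients = [(i, "stage II" if i % 3 == 0 else "stage I") for i in range(40)]
arms = stratified_block_assignments(patients)
for stratum in ("stage I", "stage II"):
    counts = Counter(arms[pid] for pid, s in patients if s == stratum)
    print(stratum, dict(counts))
```

In practice the allocation list would be generated in advance and kept concealed from those enrolling patients, consistent with the earlier discussion of the protocol.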
Limitations of RCTs

The method, powerful as it is, is hard to apply under certain circumstances. If outcomes mature after decades, then completion of the RCT requires long-term maintenance of protocol-controlled follow-up, which is difficult and expensive.

If a sufficiently rare outcome is the endpoint of interest, then detection of treatment differences may call for unworkably large sample sizes. One example was concern about the safety of the anesthetic, halothane. Detection of differences in surgical death rates (about 2 percent overall) that might relate to anesthetic choice would amount to trying to distinguish between death rates such as 1.9 percent and 2.1 percent, a task calling for at least hundreds of thousands of patients. The retrospective study that was done did arrive at conclusions, but they were expressed with diffidence made necessary by the possible existence of unrecognized biases.

Sometimes it is objected that an RCT is not applicable because treatments are too variable to be controlled with the specificity that an RCT demands. This objection is sometimes false; for example, a treatment may be defined to allow modification as indications arise in the course of therapy. In other cases, the objection is simply specious, for it asserts the impossibility of answering the question "What is the treatment?" That impossibility would block any kind of objective assessment of it.

A rather more difficult limitation to deal with grows out of the possibility that a new procedure started in an RCT may, outside that trial, evolve into a superior modified version of the treatment. Then, continuation of the RCT is at risk of being irrelevant or unethical. There is a real problem here, and it deserves more study; the question is how the use of protocol and randomization can help to speed sound evolution of new therapies. One proposal has been to "randomize the first patient." (See, for instance, Chalmers, 1975, 1981.)
Inherent in the concept of randomizing the first patient is a fluid protocol that allows a change in the details of a new treatment as the investigators improve their performance (the "learning curve") or as other information appears. It has not found wide agreement. The definitive treatment of these issues is not yet at hand.

The sample size of an RCT may have been planned to resolve differences of a stated size, but when it is completed, questions about treatment comparisons in certain subclasses of patients cannot be resolved. This is not a limitation of the RCT per se, for more questions always can be asked of a body of data than can be answered by it, but one should be warned to think at the planning stage about choosing sample sizes large enough to support adequate treatment comparisons in particularly salient subgroups.

It is sometimes argued that RCTs are too costly. The cost of disciplined, careful, checked medical work is of course high; the advantages of the protocol are not cheaply bought. But in many medical centers with already high standards of recordkeeping, diagnosis, etc., the incremental cost of the protocol might not be great. The incremental cost of randomization is negligible. Costs can be high when the base costs of bed, drugs, tests, and care are all loaded onto the RCT budget. Most of these costs would have been incurred anyway, regardless of how the patients were treated. Failure to distinguish total costs, which include those that would be incurred anyway, from the incremental costs of RCTs is inherently misleading and could lead to grievous policy errors. Good measurements of incremental costs of RCTs are needed. This will involve both conceptual effort and data gathering. Better information concerning actual incremental costs of RCTs is a topic that should receive systematic research attention.

Two other limitations of RCTs also are drawbacks to any investigational method.
The first is that dispute may grow around unwelcome conclusions and hinder adoption of the findings. The second is that the

RCT may give a clear verdict in patients of the kind used in the trial, but leave unanswered the question of efficacy in different kinds of subjects. This issue, dubbed external validity, sometimes is readily dealt with; thus, the Salk vaccine trials showed the vaccine to be effective in first-, second-, and third-grade children. No difficulty was found in generalizing the conclusion to both older and younger children. Sometimes things are harder; they may even demand further RCTs. External validity is of course a problem whenever we undertake to learn from one body of experience and then apply the results to other experience; it is not a peculiar difficulty of RCTs. We do not know as much as we could afford to about designing studies with an eye on external validity. This is another area that deserves further research effort.

Strengthening RCTs

The primary paths to good quality lie in designing a strong protocol and executing it faithfully. The paper by Goodman in Appendix 3-B of this committee's report gives a systematic treatment of most of the key features of a strong protocol. Extensive accounts of RCT protocol are given in works by Friedman et al. (1981) and Shapiro and Louis (1983). Some additional ideas on pre- and post-protocol execution deserve comment here.

First, the study should be large enough; if it is too small to have a good chance of establishing the existence of a plausibly sized actual improvement, then it needs to be made larger or to be abandoned. Otherwise, work, money, and time will be devoted to an effort that lacks a good chance of producing a useful finding. Statistical methods for assessing adequacy of planned study size (power calculations) are well established and should be used. (Sometimes, however, the opportunity to do a study is too good to be missed even if it is too small to be definitive.
This should be reported with the study in hope that results of other studies can be combined with these and together they may reach firm conclusions.)

Second, the participating investigators should fully understand and be fully supportive of the investigation. Persons with initial convictions about relative merits of the treatments may prove to be encumbrances to successful execution of the protocol.

Third, in planning for the time and number of cooperating centers that will be needed to carry the study through, be realistically guarded about the flow of eligible patients that can be anticipated. Seasoned RCT veterans recommend safety factors of two, five, even ten.

The foregoing suggestions all relate to the planning phase. A final way of strengthening the RCT applies to the completion phase.

Write about and report it well. In particular, the operational definitions of all terms should be clear. Thus, the reader should not be left with doubts about how the subjects were defined and selected, how they were assigned to treatments, what treatments were applied, or how outcomes were measured. In addition, the report should specify whether study staff were blind to treatment allocation at key steps like enrollment in study, determination of eligibility, interpretation of diagnostic tests, measurement of outcome, etc. These issues were prominent among those that DerSimonian et al. (1982) checked in reviewing reports of clinical trials in four leading medical journals and that Emerson et al. (1984) checked in reviewing reports in six leading surgical journals. Both studies answered five questions: (1) What were the eligibility criteria for admission to the study? (2) Was admission to the study done prior to allocation of treatment? (3) Was allocation to treatment done at random? (4) What was the method of randomization? (5) Were outcomes assessed by persons who were blind to treatment?

Good reporting will also explain the quality control measures that were applied, methods of follow-up used, and audit checks employed.

Not only should the reader be told what was done, and how, but also what happened. Summary statistics should have the aim of revealing information to the reader. The methods of statistical analysis should be explained. The best way to do this is topic by topic. The analysis was actually done in such a pattern; it should be reported that way: for understanding, for specificity, and, incidentally, for ease of writing. Sometimes one finds a published paper which lists statistical procedures in the methods section. "We used chi-squared, the t-test, the F-test, and Jonckheere's test." The use of this style of reporting for banquet recipes would list all the ingredients in all the dishes together and report the use of stove, mixer, oven, meat grinder, egg beater, and double boiler.

In addition to showing the data, or generously detailed summaries of them, the statistical analysis should state each of the principal questions that motivated the study and what light the data shed on those questions. (Note that this is not the same thing at all as reporting just those results that are statistically significant.) To lend understanding both to significant and nonsignificant results, it is wise to use confidence intervals whenever feasible and to report the power of statistical tests that are applied (Freiman et al., 1978). Interesting statistical results that arise out of studying the data (rather than from studying the principal questions that motivated the study) are necessarily on a different, and somewhat ambiguous, logical footing. It is usually wise to regard such outcomes with considerable reserve, more as hypotheses turned up than as facts established.
It is especially important to be candid about the nature and amount of "data dredging" that has accompanied the analysis.

A Final Remark

The protocol has been described as a compact; its construction is typically a collegial exercise. This entails some advantages. Of course, deliberation and consultation give opportunities for better planning. Sometimes a sequence of RCTs leads to cumulative expertise and strategizing. But, some of the greatest advantages may lie in the ethical domain.

The use in human beings of a new treatment with only partially understood properties raises certain problems of ethical portent. (This is true whether that new treatment is tried in an RCT or in any other way.) Among these questions are the following: How strong is the evidence that this new treatment may be at least as good as the best available current therapy? How shall we know when we should stop using both treatments and prefer only one of them? Who shall be able to receive this new treatment, and who shall not? Each of these questions is likely to be better answered when decided by a group of professionals, acting explicitly and consultatively, in a process open to review. Wishful thinking blooms wherever Homo sapiens is found, but group consultation tends more often than not to restrain it.

Another advantage of the collegial building of the protocol is that investigators who already believe they know which treatment is superior have the opportunity to drop out, leaving to the trial's execution investigators able to proceed in good conscience to participate themselves and to invite their patients to participate.
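As a worked illustration of the study-size advice given earlier in this section, the sketch below applies a standard normal-approximation formula for comparing two proportions to the halothane example (death rates of 1.9 versus 2.1 percent). The two-sided significance level of 0.05 and the power of 0.80 are assumptions chosen for illustration; the source states neither, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def patients_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate number of patients needed in EACH of two equal groups to
    detect a difference between event rates p1 and p2, using the usual
    normal approximation for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value of the test
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)

n_80 = patients_per_group(0.019, 0.021)
n_90 = patients_per_group(0.019, 0.021, power=0.90)
print("per group at 80% power:", n_80, " total:", 2 * n_80)
print("per group at 90% power:", n_90, " total:", 2 * n_90)
```

Under these assumptions the total comes to well over one hundred thousand patients, which is consistent in order of magnitude with the text's remark about the halothane question.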

EVALUATING DIAGNOSTIC TECHNOLOGIES*

*This section was contributed by J. Sanford Schwartz.

Accurate diagnosis is central to good medical practice. Diagnostic technology provides the physician with diagnostic information. However, all diagnostic tests and procedures have associated costs and risks. Thus, persons involved with medical care must determine whether an individual test or procedure provides significant new diagnostic information and whether the information provided and its impact on subsequent medical care offset the costs and risks of the technology. For each diagnostic test, these and related questions require assessment of (1) the diagnostic information provided and (2) the impact of the resulting therapy on patient outcome. Such assessments of diagnostic technology rarely are performed. Most diagnostic technology undergoes only narrow and limited evaluation. The lack of more comprehensive assessment severely limits the efficient and optimal use of diagnostic tests and procedures.

Fineberg et al. (1977) have formulated a hierarchy of evaluation of diagnostic technologies:

1. Technical capacity: Does the device or procedure perform reliably and deliver accurate information?
2. Diagnostic accuracy: Does the test contribute to making an accurate diagnosis?
3. Diagnostic impact: Does the test result influence the pattern of subsequent diagnostic testing? Does it replace other diagnostic tests or procedures?
4. Therapeutic impact: Does the test result influence the selection and delivery of therapy? Is more appropriate therapy used after application of the diagnostic test than would be used if the test was not available?
5. Patient outcome: Does performance of the test contribute to improved health of the patient?

Clearly, if diagnostic technology fails utterly at any step in this chain, then it cannot be successful at any later stage.
If it succeeds at some stage, this implies success in the prior stages (even if they have not been explicitly tested) but does not tell what success may be attached to later stages. Thus, an accurate test may or may not lead to more accurate diagnosis, which in turn may or may not lead to better therapy, and that in turn may or may not eventuate in better health of the patient. Because many tests may be involved, it can require carefully designed studies to gauge success or failure of any particular one at stages 2 through 5.

Present Evaluation Methods

The first step in the hierarchy of evaluating diagnostic tests and procedures is determination of the technical performance of the test. Several factors are involved in this evaluation. The first deals with the ability of the test actually to measure what it claims to measure. Replicability and bias of test results are important measures of test performance. Replicability (i.e., precision) reflects the variance in a test result that occurs when the test is repeated on the same specimen. A highly precise test exhibits little variance among repeated measurements; an imprecise test exhibits great variance. The greater this variation, the less faith one may have in a single test's results. However, a precise test is not necessarily a good test. A test may exhibit a high level of replicability yet be in error. A good test must be reliable (i.e., unbiased); that is, it must exhibit agreement between the mean test result and the true value of the biologic variable being measured in the

sample being tested. Evaluations of clinical tests should consider both the replicability and reliability of the technology. Finally, the safety of a diagnostic technology should be determined. Performance of the test should involve no unusual, unacceptable, or unexpected hazard. FDA regulations require some minimal level of safety and technical performance to be demonstrated for many diagnostic tests before marketing approval is granted (OTA, 1978a).

The purpose of a diagnostic test or procedure is to discriminate between patients with a particular disease and those who do not have the disease. However, most diagnostic tests measure some disease marker or surrogate (e.g., a metabolic abnormality that is variably associated with the disease) rather than the presence or absence of the disease itself. The performance level of a diagnostic test depends on the distribution of the marker being measured in diseased and nondiseased patients and on the technical performance characteristics of the test itself (its precision and reliability).

Each disease marker has a distribution in populations of diseased and nondiseased patients. Unfortunately, these distributions frequently overlap so that measurement of the markers does not permit complete separation of the diseased and nondiseased populations (Figure 3-1). In these circumstances no matter what cutoff value, k, is chosen it is not possible to ensure that all patients on one side have the disease and all those on the other are free of the disease. We are instead left with some false positives and some false negatives, as indicated in Figure 3-1. By moving k to a larger or smaller value the relative probabilities of these two kinds of error will be altered. These probabilities can be tabulated in a format like that in Table 3-2. It should be borne in mind that the numerical values of these probabilities will change if the cutoff value of k is changed.

[FIGURE 3-1 Relationship of test value to diseased and nondiseased populations for a hypothetical diagnostic test. Horizontal axis: test value, with the cutoff point k lying between the nondiseased and diseased population curves. Regions are labeled TN = true negatives, FP = false positives, FN = false negatives, TP = true positives.]

The two most commonly used measures of diagnostic test performance are sensitivity and specificity (Table 3-3). These test characteristics deal with the ability of the diagnostic test to identify correctly subjects

with and without the condition of interest.

TABLE 3-2 Outcomes of Diagnostic Test Use

                 Disease Status
Test Result      Disease Present      Disease Absent
Positive         True positives       False positives
Negative         False negatives      True negatives

Sensitivity measures the ability of a test to detect disease when it is present. It measures the proportion of diseased patients with a positive test. This can be expressed by the ratio,

$$\text{Sensitivity} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$

Specificity measures the ability of a test to correctly exclude disease in nondiseased patients. It measures the proportion of nondiseased patients with a negative test. This can be expressed as

$$\text{Specificity} = \frac{\text{True negatives}}{\text{True negatives} + \text{False positives}}$$

Sensitivity and specificity have been adopted widely because they are considered to be stable properties of diagnostic tests when properly derived on a broad spectrum of diseased and nondiseased patients. That is, under such circumstances their values are thought not to change significantly when applied in populations with different prevalence, presentation, or severity of disease. However, if diagnostic tests are not derived on an appropriately broad spectrum of subjects their values will change as the prevalence and severity of disease are varied in the populations tested (Ransohoff and Feinstein, 1978).

Test sensitivity and specificity as measures of diagnostic test performance taken alone do not reveal how likely it is that a given patient really has the condition in question if the test is positive, or the probability that a given patient does not have the disease if the test is negative. The fraction of those patients with a positive test result who actually have the disease is called the predictive value positive of a test. It is calculated by the ratio of

$$\text{Predictive value positive} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$$

The fraction of patients with a negative test result who are actually free of the disease is called the predictive value negative and is determined by the ratio of

$$\text{Predictive value negative} = \frac{\text{True negatives}}{\text{True negatives} + \text{False negatives}}$$

TABLE 3-3 Operating Characteristics of Diagnostic Tests

Characteristic               Measure of Performance
Sensitivity                  True positives / (True positives + False negatives)
Specificity                  True negatives / (True negatives + False positives)
Predictive value positive    True positives / (True positives + False positives)
Predictive value negative    True negatives / (True negatives + False negatives)
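The four ratios in Table 3-3 can be computed directly from the counts in a table laid out like Table 3-2. The sketch below does this for a set of invented counts; the class name, the helper function, and all numbers are hypothetical and serve only as illustration. The helper `predictive_values` is an addition not found in the source tables: it applies Bayes' rule to obtain the predictive values from sensitivity, specificity, and an assumed prevalence.

```python
from dataclasses import dataclass

@dataclass
class TwoByTwo:
    """Counts from a table laid out like Table 3-2."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    def specificity(self):
        return self.tn / (self.tn + self.fp)

    def predictive_value_positive(self):
        return self.tp / (self.tp + self.fp)

    def predictive_value_negative(self):
        return self.tn / (self.tn + self.fn)

def predictive_values(sensitivity, specificity, prevalence):
    """Predictive value positive and negative via Bayes' rule for a
    population with the given prevalence (pretest probability)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)

# Invented example: 1,000 patients, 100 of whom have the disease.
table = TwoByTwo(tp=90, fp=50, fn=10, tn=850)
print("sensitivity:", table.sensitivity())
print("specificity:", round(table.specificity(), 3))
print("predictive value positive:", round(table.predictive_value_positive(), 3))
print("predictive value negative:", round(table.predictive_value_negative(), 3))

# Same test applied where only 1 patient in 100 has the disease.
ppv_low, npv_low = predictive_values(table.sensitivity(), table.specificity(), 0.01)
print("predictive value positive at 1% prevalence:", round(ppv_low, 3))
```

With the counts held fixed the two routes agree; changing only the assumed prevalence changes the predictive values while leaving sensitivity and specificity untouched.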

The predictive value positive and predictive value negative of a diagnostic test measure respectively how likely it is that a positive or negative test result actually represents the presence or absence of disease in a given population of patients with a given prevalence of disease. The positive and negative predictive values of a diagnostic test, however, are not stable characteristics of that test. Rather, they depend strongly on the prevalence of the condition being examined in the population being tested.

As the disease prevalence (pretest likelihood of disease) decreases, the proportion of individuals with a positive test result who actually are diseased falls and the proportion of nondiseased patients falsely identified as being diseased rises. Conversely, as the prevalence of disease increases, the proportion of patients with a positive test result who are in fact diseased increases, while the proportion of patients with a negative test result who are not suffering from the disease falls. This fact has enormous implications for diagnostic tests, particularly when they are used in populations with a low prevalence of disease, such as when a test is used to screen for the presence of an uncommon disease.

The receiver operating characteristic (ROC) curve (Lusted, 1969; Metz, 1978; Metz et al., 1973; Robertson and Zweig, 1981; Swets, 1979; Swets and Pickett, 1982) provides an economical display of the information in the two-by-two table for various values of k. Figure 3-2 is an example showing for each of five values of k the sensitivity and specificity information.
Consider the point marked B; we see that using the cutoff value of k = 1.0 mm in the exercise stress test yields sensitivity of about 0.65 and specificity of about 0.85 (since 1 - specificity is about 0.15). From the curve, it is easy to see how lowering the cutoff value increases the sensitivity at the cost of also increasing the false-positive ratio.

[Figure 3-2: ROC curve. Horizontal axis: false positive ratio (1-specificity). ST segment depression is called "positive" if A: ≥0.5 mm; B: ≥1.0 mm; C: ≥1.5 mm; D: ≥2.0 mm; E: ≥2.5 mm.]

FIGURE 3-2 Receiver operating characteristic curve for the exercise stress test for the diagnosis of coronary artery disease as the criterion for a positive test is varied.

Figure 3-3 shows another use of ROC curves. The curves of tests A and B make it evident that test A is the better of the two, because at every false-positive ratio it has higher sensitivity than does test B (or equivalently, at every sensitivity, test A has a lower false-positive ratio than does test B). The more closely an ROC curve can fit into the upper left-hand corner, the better its performance. The diagonal curve C represents a test based on pure chance; if the test called every patient positive with a probability of one-fourth (for example, if cutting a deck of cards produced a spade), the point P would result, showing sensitivity and a false-positive ratio both equal to 0.25.

So, an important use of the ROC curve is to compare alternative tests; an ROC curve that lies above and to the left of another corresponds to the better of the two tests. Then, the choice of a particular k value for that test amounts to choosing the sensitivity and specificity that will be employed.

The particular choice of k may depend on the purpose for which the test is being used. One might require more stringent criteria to confirm (rule in) a suspected clinical diagnosis than to screen for or exclude (rule out) disease. A cutoff criterion with high specificity (to the left on an ROC curve) is desired when confirming a disease. A cutoff point with high sensitivity is desired when screening for a disease, although such a point is accompanied by lower test specificity. Such a cutoff point corresponds to a point upward and to the right on an ROC curve.
[Figure 3-3: ROC curves for Test A, Test B, and Test C. Horizontal axis: false positive ratio (1-specificity).]

FIGURE 3-3 Three hypothetical ROC curves: for tests A and B and for the chance test C.
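The way an ROC curve is traced out by varying the cutoff k can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `roc_points` and the measurements are hypothetical, and the values are not the exercise stress test data behind Figure 3-2. A measurement is called positive when it is at least k, and each cutoff yields one point (false-positive ratio, sensitivity).

```python
def roc_points(diseased, nondiseased, cutoffs):
    """Return one (false_positive_ratio, sensitivity) pair per cutoff k.

    A measurement is called "positive" when it is >= k.
    """
    points = []
    for k in cutoffs:
        # Fraction of diseased patients called positive at this cutoff.
        sensitivity = sum(x >= k for x in diseased) / len(diseased)
        # Fraction of nondiseased patients called positive (1 - specificity).
        false_positive_ratio = sum(x >= k for x in nondiseased) / len(nondiseased)
        points.append((false_positive_ratio, sensitivity))
    return points


# Hypothetical measurements in mm for ten diseased and ten nondiseased patients.
diseased = [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.9, 2.1, 2.6]
nondiseased = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9, 1.1, 1.7]
cutoffs = [0.5, 1.0, 1.5, 2.0, 2.5]

points = roc_points(diseased, nondiseased, cutoffs)

if __name__ == "__main__":
    for k, (fpr, sens) in zip(cutoffs, points):
        print(f"k = {k} mm: false-positive ratio = {fpr}, sensitivity = {sens}")
```

Reading the points from the highest cutoff to the lowest, both the sensitivity and the false-positive ratio rise or stay the same, which is the tradeoff the text describes. A chance test that calls each patient positive with probability p would sit at the point (p, p) on the diagonal.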

Comparison of two tests by means of their ROCs is far better than attempting to do so from the two-by-two tables, as can be seen by the following argument. If one test (A) has a higher sensitivity but a lower specificity than the other test (B), one cannot be sure which test performs better. Three alternative possibilities exist: test A may perform better than test B (Figure 3-4a), test B may perform better than test A (Figure 3-4b), or they may represent the same or equivalent tests with different cutoff values used (Figure 3-4c). But ROC curve analysis can differentiate among these possibilities by empirically determining test performance and comparisons over a range of cutoff points (Lusted, 1969; Metz, 1978; Metz et al., 1973; Robertson and Zweig, 1981; Swets, 1979).

When one test performs better than a second test at some levels of use (e.g., with high specificity) but worse at other levels of use (e.g., with high sensitivity), the ROC curves cross and comparison is then more complicated (Schwartz et al., 1983).

Limitations of Present Evaluation Methods

Diagnostic technologies usually are evaluated adequately with respect to technical performance. However, issues of test replicability and reliability commonly receive less consideration.

Evaluations of diagnostic tests are hampered by the common practice of excluding indeterminate or uninterpretable results from published reports of test performance. Some patients cannot cooperate with a diagnostic test or procedure. In other patients, the test is uninterpretable because of technical factors. Few investigators evaluating diagnostic tests identify such patients in their published evaluations. Usually these patients are deleted from evaluations of test performance because they do not fit neatly into a two-by-two table or ROC.
However, such patients may constitute a considerable portion of patients for whom a test is advocated. Inclusion only of those patients with definitive test results represents reporting of a selected sample. In these cases published results of test performance overstate the diagnostic test's actual performance in clinical application.

[Figure 3-4: three panels, each plotted against false positive ratio. 4a. Test A is superior to test B. 4b. Test B is superior to test A. 4c. Test A is equivalent to test B.]

FIGURE 3-4 Comparison of two hypothetical tests (A and B) illustrating the need for receiver operating characteristic curve analysis to compare their performance over a range of cutoff points.

The range of patients on whom diagnostic tests are evaluated often is inadequate. Commonly, a test first is evaluated on patients with advanced disease and on young, very healthy controls. Such a strategy may be appropriate at a preliminary stage of test evaluation, because if a diagnostic test cannot separate patients with extremes of disease presentation, it is unlikely to perform well when the diagnosis is less obvious. However, many tests perform well in patients with extreme cases of disease but perform poorly in those patients with intermediate probabilities of disease. The performance estimates of diagnostic tests that are derived from extremes of populations (patients with severe disease and healthy controls) will deteriorate as the test is performed on a broader spectrum of patients. Feinstein (1977) has identified several groups of patients on whom a diagnostic test should be evaluated: (1) patients with the disease who are asymptomatic, (2) patients with symptoms and signs representative of the spectrum of the disease of interest, (3) patients without the disease of interest who have other diseases which produce similar signs and symptoms, and (4) patients without the disease of interest who have other diseases which affect the same organ(s) as the disease of interest or which occur in a similar anatomical location(s).

Most important for the clinician are those patients to whom the clinician will apply the test. Generally this group is composed of those patients suspected of the disease by virtue of their symptoms or clinical findings, but in whom disease presentation is not obvious or is somewhat atypical. The irony is that while it is in such patients that the diagnostic test is most needed, it is in this population that its performance may be poorest and least predictable and in which it is least likely to be evaluated. The lack of a proper spectrum of patients in diagnostic test evaluations has been shown to lead to overstated performance of diagnostic tests (Ransohoff and Feinstein, 1978).

A very large percentage of new evaluations in diagnostic radiology are currently based on ROC curves, whose use also is increasing in other areas of investigation. Although ROC curve analysis represents the state-of-the-art method to evaluate diagnostic tests, it has several limitations. First, unless ROC analysis is used to analyze data in a timely fashion, it usually is impossible to construct such curves at a later time.
Second, as with other procedures, adequate methods are not available at present to determine the importance of differences among the ROC curves. As with other methods to evaluate diagnostic technologies, statistically significant differences among several ROC curves (Censor and Schwartz, in press; Hanley and McNeil, 1982; Swets and Pickett, 1982) do not necessarily imply clinically important differences.

Many diagnostic tests require some degree of interpretation to arrive at a test result. Thus, diagnostic test performance commonly depends on a combination of the technical performance of the test and how it is interpreted (the test-interpreter unit). For example, the diagnostic performance of a chest radiograph depends both on the technical quality of the film image and the expertise of the radiologist or other physician interpreting the film. ROC curves evaluate the complete test-interpreter unit. However, when evaluating a technology it may be important to differentiate between deficiencies in the performance of the technology and deficiencies in the performance of the technology interpreter, as one or the other may be more easily improvable.

A major problem in determining the performance characteristics of diagnostic tests is the lack of an appropriate reference standard (gold standard) against which to judge the test. The true state of nature generally is not known in clinical medicine. For most diseases even the best available diagnostic test has some associated error rate. In practice one is forced to accept the best available, albeit imperfect, diagnostic test as a pseudo-reference standard. Evaluating a diagnostic test against an imperfect reference standard obviously results in an inaccurate measurement of test performance.

The clinical use of a reference standard often presents a number of other problems, many of which are avoidable.
A true reference standard should be a means of determining the correct diagnosis independent of the measures of the diagnostic technology being evaluated. In some circumstances a reference standard is adopted that depends on the subjective judgment of an observer whose judgment in turn might be based, in part, on the technology in question. Another common reference standard is the degree of concordance between the test's results and those found by subsequent tissue examination. This is a partial and inadequate solution. The problems here involve case selection bias and work-up bias, so that the results may not be generalizable to many cases. A third method is to use clinical follow-up as a reference standard. Such an outcome measure provides some inferential data regarding reference standard performance. However, outcome measures may be confounded by the effects of time or intervening therapy. A low correlation between a positive test and a bad outcome might be consistent with a correct diagnosis and an appropriate, successful intervention. A high correlation might arise with ineffective diagnoses that result in deleterious treatment.

The dynamic evolution of many diagnostic technologies complicates the timely evaluation of many tests and procedures. Diagnostic technologies must be evaluated before they are adopted widely. This requires early evaluation. However, the results of early evaluations often are questioned or rejected in light of improvements in the technology that occur subsequent to the evaluation. The problem of technological creep and how to identify optimal times to assess diagnostic technologies is important, unsolved, and vexing.

Strengthening the Method

Assessments of a diagnostic technology typically are confined to evaluation of the diagnostic performance of the test and do not often measure clinically important impacts of diagnostic tests, such as the therapy chosen or the clinical outcome following therapy.
The scope of evaluations of diagnostic technologies should be broadened more often to include consideration of the diagnostic decisions, therapeutic choices, and health outcomes. When appropriate, financial and social impacts of a test (e.g., a screening test) may call for careful evaluation.

Diagnostic tests often are used in combination, and it can require carefully designed studies to disentangle separate contributions of individual tests to clinical decisions and outcomes.

Although one diagnostic procedure may be superior to a second as judged from their ROC curves, the choice between them may have to take account of other information such as the comparative invasiveness of the two tests, patient acceptance, or the time scale on which results are available.

ROC curve analysis can measure the differences in diagnostic performance of various combinations of diagnostic tests, but only rarely has it been used for this purpose (Feinstein, 1977).

Diagnostic technology often is evaluated on patient populations that are small in size, limited in disease spectrum, and highly dependent on expert interpretation, limiting the adequacy of the evaluation of the test or procedure. Pooling of data from different studies and different sites may help resolve many of these methodologic problems, particularly when studies consist of small numbers of observations or have conflicting results. Thus, data pooling should be explored further as a mechanism to improve the quality and timeliness of evaluations of diagnostic technology. However, several methodologic issues must be investigated before it can be determined that data pooling is both possible and appropriate. Data pooling requires studies with similar clinical situations, study design (randomization, selection criteria), diagnostic methods and techniques, observer interpretations and skills, and outcome measures. It is possible that studying diagnostic tests in collaborative studies at several institutions, as therapies are often studied in clinical trials, would be a constructive move. Unresolved methodologic issues include questions of weighting of various studies (by size, quality), selection of appropriate methods of statistical analysis and hypothesis testing, and specification of criteria for inclusion of studies in pooled analysis.

Summary

Although assessing diagnostic technologies unquestionably is difficult, a conceptual base has been laid. The most important problems remaining to be addressed are practical ones. One of the important problems for those researchers engaged in the evaluation of diagnostic technologies is to appreciate and acknowledge the uncertainty involved in test performance and interpretation and to consider the many factors that confound the results of such evaluations. Awareness of these problems must be coupled with improved evaluation methodologies. In particular, adoption of better experimental design standards and use of ROC curve analysis will improve the results of such evaluations. Diagnostic tests should be evaluated in terms of their use with and contribution to other diagnostic tests and not merely as to the absolute accuracy of a test in isolation from already known clinical information.

Even if these problems are addressed, other important factors remain unresolved. One of the most important of these is the definition and measurement of appropriate clinical endpoints for evaluation. Up to the present most research has dealt with the validity of tests, and few studies have evaluated the outcome of testing through clinical trials. For example, what is the impact of a test on the diagnostic process or on therapy? How is performance of a test related to ultimate health outcome? How does one evaluate a diagnostic test when there is no adequate reference standard?
How do patients and physicians value positive test results when there is no effective treatment for the disease of interest? How do patients and physicians value the reassurance inherent in an expensive technological examination as compared with that of a careful physical examination? How much are patients consulted about their desires for immediate diagnosis?

Investment in medical research has led to major advances in physiology, pathophysiology, biochemistry, genetics, and other basic sciences. These in turn have led to the development of many technological advances in diagnosis. However, knowledge of how best to apply such information clinically has lagged significantly. A major reason is underinvestment in this kind of research. Thus, we have the ironic situation in which important and painstakingly developed knowledge often is applied haphazardly and anecdotally. Such a situation, which is not acceptable in the basic sciences or in drug therapy, also should not be acceptable in clinical applications of diagnostic technology.

It is clear that existing research does not provide a firm basis for comprehensively assessing the usefulness of diagnostic tests. Diagnostic testing in the setting of patient care is expensive and has been the subject of increasing scrutiny and concern. Diagnosis is one of the most rapidly expanding activities of medical practice, with estimated annual growth rates of 15 to 20 percent. Concern for health care costs has led to moves to limit expenditures for medical care. Programs that employ such categories as Diagnosis-Related Groups (DRGs) for provider reimbursement certainly will affect the use of diagnostic testing. Present limitations of knowledge, however, will continue to hamper the physician's ability to arrive at appropriate decisions regarding the utilization of diagnostic tests and procedures, regardless of the reimbursement and planning systems in place. Substantial progress in the measurement of test performance and more appropriate utilization of tests and procedures requires more comprehensive technology evaluation that focuses on the clinical impact of the technology on the patient and the patient's health.

THE SERIES OF CONSECUTIVE CASES AS A DEVICE FOR ASSESSING OUTCOMES OF INTERVENTION*

*This section is adapted from an article written by Lincoln E. Moses for the New England Journal Project in the Department of Biostatistics, Harvard University.

An air of serving the common good clings to the process of publishing for general information the results of one's own extensive experience. Medicine enjoys a long tradition of such publication; valuable results sometimes ensue. Moreover, a large share of medical knowledge has been accumulated in just this way, through the publication of series of cases. This paper examines the usefulness and limitations of series for assessing safety and efficacy of medical interventions. Two historical examples initiate the discussion; the first demonstrates results with a new technique; the second compares outcomes between two differently treated subsets of a single series of patients.

In 1847 John Snow published an epochal work, On the Inhalation of the Vapor of Ether in Surgical Operations (Snow, 1847). In it he described the equipment he had devised, his procedure, and the 52 operations at St. George's Hospital and the 23 operations at University College Hospital in which he had delivered ether anesthesia by September 16, 1847. These two series (with four and two deaths, respectively) doubtless were, in the eyes of the author and his readers, harbingers of the future. For a modern reader they are, as well, a window on the past; in all 75 operations, neither the thorax nor the abdomen was ever entered.
The two series showed the effectiveness of Snow's apparatus for vaporizing the ether for patient inhalation and that with the new apparatus and procedure (1) anesthesia was induced in all patients, (2) they all revived from the anesthesia, and (3) the surgery went forward more easily. All this helped to dispel the mistrust of ether anesthesia that had grown up around earlier, inept applications of ether in England during 1846.

In 1835 Pierre Louis published, from his practice over the years, an account of 77 patients who had had pneumonia, uncomplicated by other disease (Louis, 1836). He classified them by whether or not they had survived the disease and by the day in the course of their illness on which he had begun bleeding them. Early bleeding turned out to be associated with reduced survival. That series of observations was an important part of his attack on bleeding as a panacea.

These accounts, although much abbreviated, allow us to see some of the issues relating to series as an information source in medicine. First, a series typically contains information acquired over a period of time. Second, the patients in the series are all similar in some essential way; with Snow they had all received ether, although with various operations; with Louis they all had the same disease (and physician), but varied in how they were treated. Third, all the patients of a defined class are reported; with Snow, all ether administrations on or before September 16, 1847, were reported; with Louis, all the pneumonia patients for whom he had records indicating no other disease, and whom he had bled, were reported. Fourth, comparison is involved either directly, as with Louis, or indirectly, as with Snow; fairness of comparison becomes a crucial issue. (Louis assured his readers that the two groups, survivors and decedents, were as alike in initial severity of disease as he could arrange by including and excluding cases from his files.) Fifth, the series, whatever its value as evidence, may be influential or it may not. Louis was a member of the faculty at Paris; that lent weight to his series. (This contrasts notably to the low impact of James Lind's beautifully controlled experiment demonstrating the curative power of lemons in treating scurvy; he was a naval surgeon without high standing, and his study's effect on the policy of the Royal Navy was delayed by some 40 years.)

Description of a Series

The term series will be applied to studies of the results of an intervention if the study has certain characteristics:

1. It is longitudinal, not cross-sectional; postintervention outcomes are reported for a group of subjects known to the investigator before the intervention.
2. All eligible patients in some stated setting, over a stated period of time, are reported. These eligible patients are alike; they have a common disease, they have received the same intervention, or they share some other essential characteristic.

Series may have other important design characteristics, such as (1) the presence or absence of comparison groups and (2) whether the research was planned before or after the data were acquired.

Thus a series, as the term shall be used, studies the outcomes of an intervention applied to all eligible subjects, chosen by criteria that depend only on pretreatment status. The actual data collection may go forward in time according to a research plan, or it may be undertaken after all cases are complete. (Intermediate cases can occur.) The data are regarded as if the subjects were first identified as to eligibility, and then given the intervention, and then observed as to outcome.
A significant fraction of current medical literature consists of articles that meet this description. Feinstein (1978) reviewed all issues of the Lancet and the New England Journal of Medicine (NEJM) appearing between October 1, 1977, and March 31, 1978. Of the 324 structured research papers that he identified, 47 (transition cohort and outcome cohort) contained reports of series, as the term is used here. This 15 percent of articles was approximately equaled by 16 percent (53 papers) which reported clinical trials. Bailar et al. (1984) reviewed all Original Articles published in NEJM during 1978 and 1979. Among the 332 articles studied, there were 80 that apparently met the description of series used here.

Needed Information

At a minimum, to interpret a series' findings securely, it is necessary to know answers to the cub reporter's legendary questions: Who were the subjects (i.e., what were their relevant characteristics)? What was done? (This calls for defining the treatment, diagnosis, staging, adjuvant care, follow-up, etc.) By whom was it done? (By world-class experts? By teaching hospital staff? By community hospital staff?) When was it done? (Over a time span long enough to permit the existence of large trends of various sorts within the series?) We may even need to know why a treatment was done. (Because other treatments had already failed? Because the patients were not strong enough to tolerate other treatment? For palliation? For cure?)

Adjustment for Interfering Variables

Recent series from the United Kingdom of 5,174 births at home and 11,156 births in hospitals show perinatal mortality of 5.4/1,000 in the home births and 27.8/1,000 in the hospital births (Health and Social Service Journal, 1980). What use can be made of these numbers? A moment's thought fills the mind with questions about the comparability of the two series of mothers: How did they differ in age, parity, prenatal care, prenatal complications, home circumstances, general health, and disease status? Without answers to these questions, we must hold back from any firm interpretation whatsoever. With information on all these variables, and doubtless some others, we are better off. But with such information in hand, we would still face the hard question of how to adjust the raw results for differences in these other variables: their relevance to perinatal mortality is likely, but we do not know how to adjust numerically for these factors, even if we had the information.

The complexities that are attached to adjustment are nicely exemplified in a study of more than 15,000 consecutive (eligible) deliveries at Beth Israel Hospital in Boston, about half of which involved electronic fetal monitoring, which was the intervention being studied (Neutra et al., 1978). The authors identified many variables as risk factors; among them were gestational age; hydramnios; placental and cord abnormalities; multiple birth; breech delivery; and prolonged rupture of membranes. Their primary analysis used 18 variables in a multiple-regression-derived risk index scored for each delivery. Then, each case was assigned to one of five (ordered) strata, depending on its risk score. In addition to the primary analysis just sketched, the authors applied risk stratification in two other ways, and they also independently analyzed the data in terms of log-linear models. Clearly, how to adjust is not always a straightforward question.
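Why raw rates from two series can mislead, and what a stratified adjustment does, can be shown with a small numerical sketch in Python. Everything here is an assumption made for illustration: the counts are invented, the two groups are not the home and hospital births or the monitored and unmonitored deliveries discussed above, and direct standardization is used as one simple adjustment technique, not as the method of the cited authors.

```python
def crude_rate(strata):
    """Deaths per 1,000 births, ignoring risk stratum.

    `strata` is a list of (deaths, births) pairs, one per risk stratum.
    """
    deaths = sum(d for d, _ in strata)
    births = sum(n for _, n in strata)
    return 1000 * deaths / births


def standardized_rate(strata, standard_births):
    """Deaths per 1,000 births if the group had the standard mix of strata.

    Each stratum-specific rate is applied to the number of births in the
    corresponding stratum of a common standard population.
    """
    expected_deaths = sum(
        (d / n) * s for (d, n), s in zip(strata, standard_births)
    )
    return 1000 * expected_deaths / sum(standard_births)


# Hypothetical (deaths, births) in a low-risk and a high-risk stratum.
group_a = [(4, 4000), (20, 1000)]   # mostly low-risk births
group_b = [(1, 1000), (80, 4000)]   # mostly high-risk births

# Standard population: both groups combined, stratum by stratum.
standard = [a[1] + b[1] for a, b in zip(group_a, group_b)]

if __name__ == "__main__":
    print("crude:", crude_rate(group_a), crude_rate(group_b))
    print("standardized:",
          standardized_rate(group_a, standard),
          standardized_rate(group_b, standard))
```

In these invented data the stratum-specific rates are identical in the two groups, yet the crude rates differ severalfold because the groups differ in their mix of low-risk and high-risk births; after standardization the two rates agree. Real data rarely separate this cleanly, and the result depends on which variables define the strata.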
The authors qualify their results with this observation: "Since we are applying our risk score to the set of data from which the weights for the score were computed, we may be overstating the concentration of benefit in the high-risk categories." This candid caveat further attests to the intrinsic difficulty of adjusting for relevant variables in the effort to interpret series results.

The message here is that the interpretation of even an apparently crisp series-based difference may make heavy demands for additional information about the data in the series, and even with such additional information the meaning of the series' result may remain ambiguous.

Capabilities and Limitations

Just as a series can advance correct understanding, so can a series promote the pursuit of bad leads. It is probable that nearly every discarded, once-popular therapy was supported by a series of favorable cases. This is known to be true, even in recent times, with portacaval shunt for the treatment of esophageal varices and with gastric freezing for the treatment of ulcers.

The strengths and weaknesses of series as information sources deserve analysis. Perhaps there are straightforward ways to identify trustworthy information conveyed by series and to recognize spurious, misleading information from them. We turn now to these matters.

The publication of a series of successive cases provides readers with vicarious experience. The reader acquires this new "experience" with little outlay of effort. Often the writer also has expended relatively little effort in collating and writing up the experience to report the series. Thus, in terms of effort, the series may be regarded as an efficient information source.

The useful interpretation of this vicarious information is likely to involve considerable difficulty. Good knowledge of surrounding circumstances is ordinarily necessary; the series may not adequately report these. Even if the needed supplementary information is reported, correct methods for taking quantitative account of it may be hard, or even impossible, to devise.

Face value acceptance of the result of a series is almost never justified. Any statistic is simply the reported outcome of some process; until the process is known, one cannot know what the statistic means, however it may be named or labeled. Thus, the use and interpretation of a series result is typically a task calling for analysis, analysis which in some instances will prove to be feasible and in others infeasible.

A series is a record of experience, and as such it has prima facie value; it may give very useful information about how to apply a new technique and what kinds of difficulties and complications may be encountered. The reader of Snow's (1847) book will see this. Postmarketing surveillance produces what might be called partial series (where total numbers under observation can only be estimated). It is a method of study that has its just role in medical investigation.

The series is most liable to infirmity as an arbiter of treatment effectiveness. The two principal threats to validity are vagueness and bias.

Interpretation of Series

A number of factors bear on the interpretation of a report of a series.

Integrity of Counting. The definition of a series used here has included the word all, and that word is essential. Conclusions based on selected cases are notoriously treacherous because selection can grossly affect the data; in the extreme, only the successes or only the failures might be reported. Presumably, the limitations of selected cases underlie the skepticism sometimes voiced about the usefulness of voluntary disease registries.

At a minimum the reader needs to know what criteria were used to determine inclusion and exclusion, how many subjects were included, and what happened to each of them.
There lurk here two kinds of problems. The first is operational; it may be difficult or impossible to learn some of the essential information in retrospect. The outcomes for some who belong in the series may be unknown. Patients who cannot be followed up often differ on average from ones who can; more of them may be dead; more of them may be cured. Without complete follow-up, the available figures lose much of their meaning.

The second counting problem is definitional. The series report, to avoid being a recital of selected cases, needs to describe what was done to (all) eligible patients and how things turned out for each of them. Who is (was) eligible? This may depend on diagnostic criteria that require making judgment calls. What was done to the patients? Judgment calls may be involved here as well; if the intervention is a new surgical procedure that has changed somewhat with time, the designation of patients who did and did not receive the new operation requires a decision by the investigator. Even identifying the outcome for a patient may demand a judgment call. If the surgical patient dies on the operating table, there may be a question (and a decision) as to whether it was an anesthetic death, a treatment failure, or a result of the patient's disease.

The definitional and operational problems of counting are likely to loom larger when the series study is planned after the data are already in existence.

Consequences of No Protocol. The absence of a protocol, prepared before data are acquired, may allow certain kinds of defects to arise in a series-based study. Exactly what interventions were performed on what kinds of patients for what indications may be unclear in hindsight. Who was counted eligible and so included and who was omitted may have been based on judgment calls. Withdrawals may be ill-documented or entirely tacit, with possibly a great effect on the results. The reader may be left to wonder whether the results reported were searched out from among many possible endpoints and thus less likely to be reproducible than significance tests indicate.

There are indications of an increasing number of studies in which the research is planned after the data have been collected. Fletcher and Fletcher (1979), studying articles in the Lancet and the Journal of the American Medical Association, found that in 1946, 24 percent of the articles were post hoc in this sense; in 1976 the corresponding figure was 56 percent. Of course difficulties can be mitigated by careful reporting, but they can be eliminated only if the data are gathered so systematically as to conform to an invisible protocol in all important respects.

Consequences of No Randomization. The series-based study stands vulnerable to the many dangers that randomization forestalls. The key considerations are comparability of cases receiving different interventions and equivalence of outcome evaluation. Were the patients who got the new treatment chosen because they were strong enough to be able to tolerate it or so sick that there was no other possible therapy? In either case, they are not likely to be comparable to the controls. Was any judgment needed in assessing the outcomes? If so, then evaluation biases can easily masquerade as treatment differences, i.e., as treatment effects.

Thus, in the absence of randomization, doubts about interpretation can, and should, nag the reader. Crossing over illustrates the difficulty.
Suppose that a serious disease can be treated effectively by surgery, but operative mortality and postsurgical sequelae are drawbacks, so a medical therapy is an attractive alternative. If there is a class of patients for whom the two treatments appear to be equally reasonable, then a suitably designed and executed RCT should tell which treatment is actually superior in that class of patients. Now, it may happen that some medically treated patients do not respond, and the gravity of the disease requires that they receive the surgical treatment, after having begun therapy in the medically treated group. This is crossing over.

The effect of this in an RCT is simply to change the research question from its original form to this one: "Which is the superior policy, for patients of the class originally defined, (1) apply surgery immediately or (2) treat medically and defer surgery until it may become indicated?" This question is very likely to be a better, i.e., more realistic and practical, question to answer than was the original one, so no harm is done.

It is in nonrandomized studies that crossing over is more likely to be a serious problem. Now the two policies, surgery immediately and medical therapy until surgery may be necessary, can be very difficult to compare because the patients receiving surgery may not be identifiable in retrospect as to which policy had been applied to them.

A summary of this point would suggest that series-based studies are liable to grave difficulties, although of course not every study comes to a false conclusion. The problem lies in checking out the value of the individual study under consideration. This amounts to ascertaining whether selection biases have operated, whether assessment of treatment outcomes has differed with treatment, and whether withdrawal of patients has biased the results. The reader may recognize the questions but be powerless to answer them from the information published. The investigator may be unable to answer them from the records. These difficulties are larger when the research plan comes after the data have already been recorded.

Advance planning is not the only way in which timing enters as a strategic variable in reporting a series; there is a second way. The cases can be defined as all those present at one of several temporally ordered stages. Thus, the series study might look at all cases of a certain disease that are present in a given setting. It might look at the subset of those who (after presenting) receive treatment A or B. It might study all those who received treatment A or B more than 6 months ago. In general, the later the stage in terms of which the series is defined, the greater the need for retrospective inference (judgment calls) and the larger the difficulties with ambiguous or unobtainable information.

What About the Clear-Cut Series?

The reader may think that the picture has been presented too negatively. One may reason, "If a small series is done, clear differences may be observed at once, and the complexities referred to may not need to be unraveled. If a new approach is so good that it has an explosive impact, then an acceptable study can be devised readily enough." This objection raises fair questions. Isn't a large fraction of medical practice based on series results? Aren't there many examples, like penicillin and ether anesthesia, in which the series unambiguously asserts the truth? It is true that the bulk of medical practice has evolved largely from series-based information. We know also that much of the accepted doctrine will be discarded when more and more careful evaluations are done. The problem is to ascertain which series (which uncontrolled studies) have right answers and which do not.

What about penicillin for syphilis, sulfa against pneumococcus, and other examples? These have been called "slam-bang" effects.
When they occur, they are dramatic. The very fact that these are so dramatic should remind us that they are also rare. An effort to enumerate them will bring us to vitamin B12 against pernicious anemia, penicillin for subacute bacterial endocarditis, x rays to guide setting of fractures, cortisone for adrenal insufficiency, insulin for severe diabetes, propranolol for hypertrophic aortic stenosis, methotrexate for choriocarcinoma, indomethacin for patent ductus arteriosus, and perhaps as many again, or twice or thrice as many, or possibly even more. But more than one or two per year in the last half-century? Perhaps not.

Slam-bang effects are uncommon. They result from only a tiny part of the thousands of studies published each year. Furthermore, they are not always open and shut cases. In 1847, the year that Snow published his ether series, J. Y. Simpson published his results lauding chloroform anesthesia. Controversy about comparative merits of the two anesthetics extended at least until the Lancet Commission (1893) examined the matter more than four decades later. At that time 64,693 administrations of chloroform and only 9,380 administrations of ether were identified, both since 1848. The commission (1893) recommended ether as safer than chloroform in general surgery "in temperate climes."

Similarly, x rays in fractures clearly work, but how long did it take to discard x rays for the treatment of acne? Prefrontal lobotomy as a treatment for schizophrenia stands as a reminder that a treatment may come into wide use and prominence on the basis of inadequate evidence only to be discarded later. The occasional slam-bang effect, confidently detectable from an uncontrolled study, is at the favorable end of the spectrum; at the other lies the sequence of series-based studies that defy interpretation. The Office of Health Research Statistics and Technology, in its assessment report concerning transsexual surgery (1981), reviewed the nine published series that reported at least 10 cases, and then declared:

These studies represent the major clinical reports thus far published on the outcome of transsexual surgery. None of these studies meets the ideal criteria of a valid scientific assessment of a clinical procedure, and they share many of the following deficiencies:

a. There is often a lack of clearly specified goals and objectives of the intervention making it difficult to evaluate the outcomes;
b. The patients represent heterogeneous groups because diagnostic criteria have varied from center to center and over time;
c. The therapeutic techniques are not standardized with varying surgical techniques being combined with various other therapies;
d. None has had adequate (if any) control groups (perhaps this is impossible);
e. There is no blinding with the observers usually being part of the therapeutic team;
f. Systematically collected baseline data are usually missing making comparison of pre- and postsurgery status difficult;
g. There is a lack of valid and reliable instruments for assessing pre- and postsurgery status and the selection and scoring of outcome criteria usually involve arbitrary value judgments;
h. A large number of patients are lost to follow-up, apparently due in great part to the desire of transsexuals to leave their past behind; and,
i. None of the studies are presented in sufficient detail to permit replication.

Although the procedure under consideration is quite unusual, most of the difficulties listed are general threats to assessments using series of patients receiving a new treatment or procedure. They also amount to a list of most of the problems that the protocol of an RCT is intended to forestall.
Some Additional Issues

Subgroups. The difficulties of directly relying on data in the whole series worsen if one attempts to pick out subclasses marked by strikingly good or bad results. The idea seems reasonable enough but ignores a somewhat subtle, inescapable fact: there are always to be expected some good-looking and some bad-looking subsets in any body of data, even when no preferential influences have operated on any part of it. Furthermore, such subset differences can easily be large enough to look quite convincing to the unsophisticated analyst. Numerical statistics can be used to elucidate the point.

Suppose that n subjects, a random sample from some population with standard deviation s, are further divided at random into k equal-sized subgroups. Then the standard error (SE) for the mean of the undivided sample is s/√n = SE. Now, of course, the largest of the subgroup means must exceed the whole-group mean, but it can be surprising how large the excess must be. The average excess of the largest subgroup mean over the whole-group mean to be expected from random division, if k = 4, is 2.06 SE; if k = 7, then 3.56 SE; if k = 10, then 4.75 SE.

The comparison of the best and the worst of subgroup means can produce even more vivid-looking (but meaningless) differences. The following rule of thumb shows this well: if a group of n subjects is divided at random into k equal-sized subgroups, then the difference between the largest subgroup mean and the smallest one has an expected value that is approximately k times the standard error of the whole group. (This rule applies for a k value of not more than 15.)

With such large subgroup differences to be expected by random division, we must temper our enthusiasm when we search out a series subset that looks better than the whole group; we should face the question: "Is this large enough to believe, considering what chance alone would produce?" There are methods for answering this question, which also arises with RCTs (Ingelfinger et al., 1983).

Temporal Drift. If a series has accumulated over a long time (as happened with Louis, but not Snow) additional problems are likely. Over a long enough time period, shifts can occur; indeed, they are to be expected. The patient population may change as referral patterns do; demographic composition may drift. Supportive care, diagnostic criteria, and exposure to pathogenic agents may change over time. Even treatments may change. It follows that information about the sequence and timing of the cases in the series may be essential to a realistic analysis. The issue here is not hypothetical; Schneiderman (1966) gives examples of clinical trials in which, because treatments were modified part way through a trial, a second control group was initiated. Analysis then showed that the two successive control groups in the same setting, meeting the same criteria, differed importantly in their survival experience.

Grab Samples. Statistical inference is a powerful tool for learning from experience. It is at its best if data are obtained in ways in which probability theory can be correctly applied, e.g., with random sampling from a population or with data from a randomized clinical trial. Where probabilistic structure is unknown, the data constitute what is often called a grab sample. Inference from such a sample is necessarily treacherous, whether by application of formal statistical methods or otherwise. The personal experience of a single physician is in some sense a grab sample; so is the case study; so is the series of successive cases. The fact that we can learn from experience shows that it is not impossible to reach valid conclusions from grab samples; but the process is fraught with difficulty, uncertainty, and error.
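The subgroup arithmetic given earlier under Subgroups (an expected excess of about 2.06 SE for k = 4, and a largest-minus-smallest difference of roughly k times SE) can be checked informally by simulation. The following is a minimal sketch, not a method from the text: the function name `subgroup_extremes`, the choice of normally distributed outcomes, the known standard deviation s = 1, and the numbers of subjects and trials are all assumptions made for illustration.

```python
import random
import statistics


def subgroup_extremes(n, k, trials=2000, seed=0):
    """Estimate, by simulation, how extreme subgroup means look by chance.

    Assumptions (not from the text): outcomes are normal with known
    standard deviation s = 1, so SE = s / sqrt(n) = 1 / sqrt(n).
    Returns a pair, both in units of SE:
      (mean excess of the largest subgroup mean over the whole-group mean,
       mean difference between the largest and smallest subgroup means).
    """
    rng = random.Random(seed)
    size = n // k            # subjects per subgroup
    total = size * k         # drop any remainder so subgroups are equal-sized
    se = 1.0 / total ** 0.5
    excess_sum = 0.0
    spread_sum = 0.0
    for _ in range(trials):
        # Draws are independent, so consecutive blocks form a random division.
        data = [rng.gauss(0.0, 1.0) for _ in range(total)]
        whole_mean = statistics.fmean(data)
        means = [statistics.fmean(data[i * size:(i + 1) * size])
                 for i in range(k)]
        excess_sum += (max(means) - whole_mean) / se
        spread_sum += (max(means) - min(means)) / se
    return excess_sum / trials, spread_sum / trials


if __name__ == "__main__":
    for k in (4, 7, 10):
        excess, spread = subgroup_extremes(n=420, k=k)
        print(f"k={k}: excess = {excess:.2f} SE, spread = {spread:.2f} SE")
```

No treatment effect is built into the simulated data, so any apparent "best" subgroup arises from chance alone. Under these assumptions the simulated values should fall roughly in line with the figures quoted above, with some discrepancy to be expected from simulation noise and from the distributional assumption.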
Grab sample data may be especially useful where the following two conditions obtain: First, the data come from a well-identified setup that is relatively stable over time. Second, the data are taken from this setup at regular intervals over a protracted period.

Two examples help explain. First, statistical reports from the Metropolitan Life Insurance Company have long given useful indication of trends in longevity and disease attack rates, despite the fact that their statistics could not wisely be used to estimate U.S. population averages of longevity and attack rates (because of selective factors that apply to insurance policyholders). Similarly, cross-sectional information from particular health care populations such as Kaiser, Mayo Clinic, and Veterans Administration (VA) hospitals would not be expected to apply directly to larger or different groups, but temporal changes in such series might so apply.

Second, air pollutants are monitored at stations situated in particular locations; the relation of pollutant levels at such stations to levels experienced by persons in the schools, homes, roads, and factories near the monitoring stations is, in general, poorly known. Nonetheless, when those monitored levels rise or fall, we feel justified in thinking that the pollution levels experienced by the nearby population rise or fall as well. The monitored pollutant levels would serve less well, or even not at all, as measures of absolute dose to nearby persons.

Even such restricted use for trends depends strongly on the assumption of stability in the system. For example, a change in membership, fees, reporting methods, etc., can affect the interpretation of trends observed in the statistics of a health plan. Similarly, seasonal changes in prevailing wind direction might cause some areas to receive higher levels of pollution, even though every monitoring station shows lower levels. To see this, consider a location, near a major pollution source, which is upwind of that source during the region's high-pollution season but is downwind from it during the region's low-pollution season.

Epidemiology is largely devoted to methodically, and often imaginatively, identifying, estimating, and correcting for interfering variables that obscure direct interpretation of grab samples, natural experiments, and series.

So the message here again is that the interpretation and reliance that can be placed on the data in a series are generally obscure until resolved by detailed study. It is possible that detailed study will reveal essential flaws that bar trustworthy interpretation of a kind that one might have initially hoped would have been available.

Strengthening the Method

The author can do much to mitigate problems of interpretation by advance planning and full, careful reporting. The planning should be done while contemplating the way he or she would investigate the problem by an RCT. The author can identify probable disturbing variables and measure them and report them. It is impossible to reliably make two differently constituted groups comparable by doing statistical adjustments, so doubts about selection bias and assessment bias cannot ordinarily be entirely removed. But more complete information helps with the difficult task of interpretation.

The series of consecutive cases is a device much used in the cumulation of experience concerning medical technology. In general, trustworthy interpretation of series-based data demands clear and complete information about (at least) (1) the defining characteristics of the cases in the series, (2) the intervention(s) applied, and (3) the outcomes and how they were assessed. When one or more of these is ambiguous, then the meaning of the series becomes correspondingly obscured. Typically, the reliable interpretation of a series requires an analytic effort, which may or may not be crowned with success.

THE CASE STUDY AS A TOOL FOR MEDICAL TECHNOLOGY ASSESSMENT*

*This section was drafted by Harvey V. Fineberg and Ann Lawthers-Higgins.

Case studies can be useful tools for medical technology assessment. Although limited in important ways, case studies can reveal some of the implications of medical technology that are not readily exposed by other methods of evaluation. Case studies also can provide insight into decision making about a new technology in a vivid and memorable way. The contribution of case studies may be enhanced when they incorporate the findings of other forms of evaluation (controlled clinical trials, epidemiologic surveys, simulation studies, cost-effectiveness analysis, etc.) and when they are prepared as part of a series of related case studies. The example of El Camino management information system given in Chapter 1 could be regarded as a case study. The book by Yin (1984) is useful in pointing to the case study as a research strategy.

In this paper we describe the features that characterize case studies of medical technology and then review some recent efforts by the U.S. Congress Office of Technology Assessment to conduct case studies of medical technology, discuss the strengths and weaknesses of the method, and end with a summary.

What Is a Case Study?

In the most general sense, any coherent discussion of events related to a topic might be considered a kind of case study. For the purpose of evaluating a medical technology, a case study is a detailed account of selected aspects of the technology. These aspects include issues, events, processes, decisions, consequences, and programs occurring over time and affecting individuals, institutions, and policies. A case study is particularistic in its detail and may also be holistic in the scope of its coverage of a medical technology (Wilson, 1979b). The form and content of case studies reflect the values and points of view of the writers, their intended audiences, and the purposes of the evaluation.

In sorting out the variety of documents described as case studies, it is helpful to distinguish several types of cases, as indicated in Figure 3-5.

Case studies of medical technology tend to take one of two forms. The first type of case study attempts to reveal causes, to explain the basis for decisions about a medical technology. The second type of case study attempts to reveal consequences, to describe the direct and indirect effects of medical technology. Some case studies of medical technology are concerned both with causes and consequences.

The kinds of case study concerned with causes include studies of policymaking about medical technology, studies of the development of new technology, and studies of the diffusion of medical technology. This type of case study typically spans a particular time horizon and may be aptly characterized as a narrative history of a process. A case study of this type usually portrays the interests, motivations, and opportunities of different players; describes the institutional environment in which they work; and characterizes the external forces acting on the decision makers.
The elements in such case studies may be related in complex ways, and even those closely involved in decisions about a medical technology may have only a dim perception of the interests and actions of others. The aim of such a case study is to describe what happened in a vivid way and, more importantly, to provide insight into why certain decisions were made at certain times and produced certain reactions and responses.

The second type of case study is an effort to enumerate the consequences of a medical technology. This type of evaluative case study organizes and presents diverse information on the impact of a medical technology. This information may be derived in part from clinical trials, epidemiologic studies, and other forms of evaluation as well as from data banks, insurance records, vital statistics, and other sources of primary data. There is a continuum between review papers that synthesize evidence from such primary evaluations as controlled clinical trials and case studies that characterize the manifold consequences of a medical technology. Compared with a typical clinical review paper, a case study of medical technology highlights a wider array of consequences, such as ethical issues, effects on the organization of medical care, and economic, social, and political consequences. A case study typically contains qualitative as well as quantitative assessments of a medical technology. Instead of measuring experienced effects of technology, a case study of an emerging medical technology may aim at anticipating expected consequences. The values and judgment of a case writer are likely to be prominent parts of such an evaluative case study.

FIGURE 3-5. Typology of case studies. [Diagram: the types of cases shown are legal cases, teaching cases, medical technology, medical case histories, sociologic case histories, and other. Medical technology case studies divide into two branches. CAUSES: explain the basis for decisions about medical technology (decisions as the end-point of the analysis). CONSEQUENCES: explain what happens (or will happen) following decisions about medical technology (decisions as the starting-point of the analysis).]

Both case studies aimed at causes and case studies aimed at consequences draw upon eclectic sources of information. These range across personal interviews, historical documents, news accounts, company and institutional records, data banks, epidemiologic studies, clinical trials, and more. Any source of information bearing on the subject properly is grist for the case writer's mill. A case writer may employ a similarly varied array of analytic methods in developing a case study, including, for example, historical and journalistic research, survey methods, cost accounting, and statistical analysis.

The challenge in writing a useful case study is to create an accurate and coherent picture of the subject. Stylistically, a case study may be written in academic prose for an academic audience or, to engage a busy policymaker, in the form of a vivid narrative, complete with dialogue.
Closely related to case studies of particular medical technologies are case studies whose subjects are diseases, medical institutions, or health care policies, because these all bear directly on medical technology. The kinds of case studies considered here are quite different from case reports of individual patients or of unusual clusters of patients that may be reported in the medical literature.

A case study developed for technology assessment also differs from cases developed for teaching purposes. A teaching case study in a school of management, government, or public health typically presents a cast of characters and a chronology of events. The case may describe the values and attitudes of the characters, and it may discuss outside factors affecting the situation. The cases used in teaching may or may not adhere to all the facts of a real-life situation. The aim of the case study is to project the student into a decision-making role and to stimulate discussion about the best way to deal with the situation. Through vicarious experience with a number of cases, the student is expected to become better prepared to face real situations that resemble the cases. In this sense of aiming to produce lessons that will apply to new situations, teaching case studies are like case studies of medical technology.

Case Studies from the Office of Technology Assessment

The U.S. Congress Office of Technology Assessment (OTA) has sponsored a series of case studies of medical technology. As of the spring of 1984, two dozen had been published (OTA, 1983a). These cases cover a range of technologies (drugs, devices, equipment, procedures) at different stages of diffusion; the technologies are used by diverse medical specialists for a variety of clinical purposes (prevention, diagnosis, therapy, rehabilitation); they involve high costs and raise a variety of ethical and policy issues.
The first 19 of these studies were undertaken as part of OTA's assessment of cost-effectiveness analysis of medical technology. As such, a principal emphasis in most of the studies was to synthesize available information on the costs and clinical effects of the subject technology. Some of the cases represent new areas of application of cost-effectiveness analysis, and several develop new methods for assessing the costs and benefits of particular types of medical technology, such as diagnostic equipment. Some of the case studies describe events and decisions about the subject technology and discuss the reasons behind some of the decisions.

The OTA relied on the collection of studies to identify common problems and advantages of cost-effectiveness and cost-benefit analysis in health care. The cases also illustrated numerous observations and conclusions in the OTA report on the implications of cost-effectiveness analysis of medical technology (OTA, 1980a).

The remaining case studies in the OTA series cover sundry health technologies and topics: passive restraint systems in automobiles, telecommunications devices for hearing-impaired persons, alcoholism treatment, therapeutic apheresis, and the relation between hospital length of stay and health outcome. The issue of cost-effectiveness arises in some of these cases, and they also deal with questions of technical feasibility, variation in medical practices, political decision making, and ethical consequences.

Weaknesses of Case Studies

The accuracy and completeness of a case study depend on the skill and insights of the case analyst. If the case writer is biased, careless, or misguided, the case study will be similarly misleading. Similar comments might be made about the analyst in any type of evaluation. However, oversights and analytic errors in a case study may be more difficult to detect than similar problems in other forms of evaluation, such as reports of a clinical trial.

A central problem in the use of case studies for technology assessment is the derivation of generalities from particular instances.
The main interest in a case study is usually less what it says about the subject itself than what it implies about a class of technologies (decisions, institutions, issues, etc.) that are like the subject of the case study in some ways. This kind of implica- tion is prone to error, because the case study omits important considerations or misrepresents causes and consequences, because the reader misinterprets the case study, or because the new situation differs from the case study in unrecognized ways. Campbell and Stanley (1963) regarded case studies as useless tools for evaluating the benefits and risks of a program or treat- ment because individual case studies lack controls. In making this judgment, they did not take account of case studies as methods for revealing the process of deci- sion making, nor did they consider the role of case studies in assessing the ethical and social consequences of medical technology. Strengths of Case Studies In some instances of major policy deci- sions about medical technology, case stud- ies may be the only practical means of in- vestigating the causes and consequences of those decisions. Case studies can provide some leads and suggest more detailed and structured studies. Cases like the artificial heart (OTA, 1983a) and the swine flu im- munization program (Neustadt and Fine- berg, 1983) are complex and singular events. Case studies of each may help deci- sion makers prepare to deal with similar situations that are likely to arise in the future. Case studies can convey the complexity involved in decisions about many medical technologies. The open form of a case study makes it suitable for raising some of the ethical, social, legal, and political con- sequences of technology. A case study al- lows these issues to be presented from dif-

METHODS OF TECHNOLOGY ASSESSMENT ferent points of view and can juxtapose Conclusions these issues against evaluations of the safety and efficacy of a medical tech- nology. The vividness and concreteness of a case study may carry a powerful intellectual and emotional impact on the reader. A sin- gle case, properly presented, may motivate a decision maker to act more surely than a scientifically sounder abstract analysis. A case study can provide the kind of memo- rable paradigm that people use to interpret other experiences. This feature is, of course, a hazard as well as a benefit. Strengthening Case Studies Some analysts have suggested that sets of case studies may overcome some limita- tions of generalizability that apply to indi- vidual cases (Kennedy, 1979; Hoag~in et al., 1982~. When case studies are grouped with an eye toward using them as sample surveys, data collected for the cases should reflect a range of attributes that relate to the specific purposes of the case analysis. Even if it is not possible to assemble a sam- ple that permits estimates of the fraction of instances in which an attribute occurs, it still may be possible to represent the bulk of pertinent attributes in the sample. Kruskal and Mosteller (1980) use the term coverage to describe the idea of such a broadly representative sample. Including both examples and counterex- amples (successes and failures) in the set of cases may help reduce the chances of draw- ing misleading conclusions. For example, one set of case studies of medical innova- tions examined only instances of clinically valuable innovations (Globe et al., 1967~. This study found that enthusiastic advo- cates appeared to accelerate the accep- tance of good medical innovations, not rec- ognizing that strong advocates can be equally effective in promoting what turn out in retrospect to have been medical mis- takes (Fineberg, 1979~. 
Case studies of the causes and consequences of decisions about medical technology can be useful forms of technology assessment. Case studies provide the most practical means of investigating some complex and exceptional medical technologies. Case studies also provide a mechanism for discussing some of the ethical, social, and political consequences of medical technology that are not readily assessed in other ways. Case studies can link assessments of the development and diffusion of medical technology with evaluations of the impact of the technology.

Among the principal weaknesses of case studies are their dependence on the perceptions and judgments of the case analyst and the hazards of generalization from uncontrolled observations. The vividness and memorability of a case study can produce unwarranted convictions as well as helpful lessons. These shortcomings may be at least partially overcome by integrating the results of stronger methods of analysis into case studies and by undertaking a series of related case studies to investigate a class of medical technologies or a set of issues about medical technology.

REGISTERS AND DATA BASES*

*Contributors to this section were John Laszlo, John C. Bailar III, and Frederick Mosteller.

In this section we distinguish between registers (lists of patients, usually with limited clinical and demographic descriptors about each patient) and data bases (with more detailed and comprehensive data about each patient). Except in names of organizations, we reserve the term registry for an organizational structure that develops and maintains a register.

The distinction between registers or series and data bases is not sharp, and many sources of data have some characteristics of each. Table 3-4 lists additional features that tend to distinguish between registers and data bases but that may occur in either. Registers generally cover larger numbers of cases, and they often are suitable for use by numerous cooperating institutions, whereas data bases usually serve more restricted uses. Registers are better adapted for general, multiuse, and multiuser public resources and may be used for measuring trends in disease incidence or in the use of medical facilities, for tracking patients, or sometimes as indexes to more detailed patient records for special studies. Data bases are more often developed to answer research questions dealing with clinical epidemiology, such as questions about prognostic factors or the natural history of disease. Registers generally are file-oriented and readily searchable by patient name or key word identifiers. Data bases generally have a hierarchical structure, and they contain and can generate multiple files. Many registers and most data bases are stored in computers, and data bases are maintained and used with the help of a data base management system.

Research investigators develop disease-oriented registers of many descriptions largely to identify patients for epidemiologic studies and to track identified patients for data on the outcome of the disease and for treatment. Such registers have often been misunderstood by persons not directly involved. One might expect that health professionals would want and need an accurate filing and tracking system to identify numbers and types of patients and treatments given, to quantify and evaluate the usage of expensive and sometimes hazardous resources, and to ascertain treatment results and survival. However, there is no recognized consensus on the value of existing registers, nor even a well-accepted definition of their functions.
Furthermore, the services that registers provide are difficult to evaluate quantitatively, leaving only intuitive or global impressions. These difficulties reflect, and ultimately contribute to, several problems that often affect disease registers: unstable funding, weak local support (especially from persons asked to provide the raw data), and uncertain long-term continuity.

TABLE 3-4 Comparison of Health Registers and Data Bases

Feature: Data source
  Health registers: Single institutions, regional, or nationwide
  Data bases: Limited number of institutions. Often uni-institutional

Feature: Job characteristics
  Health registers: Epidemiology of disease categories for large populations. Large-scale follow-up. Large numbers of patients
  Data bases: Specifically oriented to a particular disease, therapy, and/or procedure. Smaller numbers of patients

Feature: Detail required
  Health registers: Minimal
  Data bases: Comprehensive

Feature: File structure
  Health registers: Single-file oriented; can be manually stored
  Data bases: Multiple file capabilities. Generally computer stored by a data base management system

Feature: Types of benefits
  Health registers: Assessment of public health programs. Influence on distribution of health resources. Regional variations in diseases, treatments
  Data bases: Clinical applications

Feature: Time scale of major benefits
  Health registers: Long-term
  Data bases: Short-term to long-term

Some of these issues about registers and data bases are examined below; Bailar (1970a,b) has expanded on some of the points. Many of the examples refer to cancer registers, which, taken together, are both the oldest and the largest of these enterprises. Approximately 1,000 hospital-based cancer registries are included in the cancer programs approved by the American College of Surgeons' Commission on Cancer (Commission on Cancer, 1974); these cover 55 to 60 percent of all patients found to have cancer in the United States, exclusive of superficial skin cancers that are not treated in hospitals and present very little threat to life and health. Numerous other registries are not yet approved or are seeking approval. A rough estimate is that the total cost of approved cancer registries in the United States is at least $20 million per year.

Medical accrediting agencies consider that registers are crucial to cancer programs, presumably because registers have important functions in patient follow-up, education, and research. Oncologists consider them useful as an index for case finding at best and woefully incomplete at worst. Hospital administrators consider them costly and often fail to understand their need, and most physicians have never used a register or been inside a cancer registry office.

Data bases tend to be of two types. One type covers specified diseases or populations; the other covers procedures or therapies. Either type may also collect information about a control group.
For example, the Coronary Artery Disease (CAD) data bank at Duke University Medical Center includes all patients with chest pain, regardless of cause. This data base originally was limited to patients who had cardiac catheterization but has been expanded to collect information from additional patient groups regarding noninvasive tests and coronary care unit (CCU) admissions (Rosati, 1973). The Duke data base is used to evaluate various technologies that may be used for patients with stable chest pain. Data from coronary care units, such as arrhythmia monitoring, hemodynamic monitoring, rehabilitation programs, and electrophysiology studies, contribute to knowledge about the management of patients with myocardial infarction.

In contrast, the National Heart, Lung, and Blood Institute (NHLBI) data base of angioplasty candidates remains procedure oriented, because it covers only patients undergoing percutaneous transluminal coronary angioplasty (PTCA) (NHLBI, 1982). Unlike the Duke CAD data base, the NHLBI data base does not collect data about patients who are considered candidates for the procedure but do not receive it.

Some other clinical data bases are those of the Seattle Heart Watch (Bruce, 1974, 1981), the Coronary Artery Surgery Study (Principal investigators of CASS and their associates, 1981; Chaitman, 1981), the Maryland Institute of Emergency Medicine (MIEM) (Cowley, 1974), and patients seen at the Mayo Clinic (Kurland and Molgaard, 1981). Other disease-oriented data bases focus on rheumatology (Dannenberg et al., 1979; Fries et al., 1974; McShane et al., 1978), gastrointestinal diseases (Graham and Wyllie, 1979; Horrocks et al., 1976; Wilson, 1979a), radiology (Jeans and Morris, 1976), mental health (Evenson et al., 1975), head injuries (Braakman, 1978; Galbraith, 1978; Jennett, 1976; Knill-Jones, 1978; Teasdale, 1978), and cerebral ischemia (Heyman et al., 1979).

Uses of Registers and Data Bases

The largest cancer registry system in the United States (really a consortium of regional registries) is the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute, which gathers data on cancer incidence and mortality rates in several widely dispersed areas of the United States (five states, five additional metropolitan areas, and Puerto Rico) that together contain about 10 percent of the U.S. population. The SEER data file includes all known invasive cancers (excluding superficial skin cancers) diagnosed in residents of these areas since 1973. Data elements include demographic factors, disease site, type of cancer, and whether or not surgery was done (Pollack, 1982a). National estimates of cancer incidence rates are largely based on these data. Quality control procedures include a regular program of workshop training for medical record abstracters. The currently available 5-year base of data (1973-1977) contains about 350,000 cases (Young et al., 1981). More recent data are to be published soon.

Regional registers with a direct and immediate role in patient care have been developed to support programs of organ transplantation, by the timely matching and transport of donor kidneys to potential recipients. The present 32 regional hemodialysis registries may in time be coordinated, perhaps along the lines of the European Dialysis Transplant Association (Groot, 1982).

The National Implant Registry, a private organization, contracts with hospitals to track patients with mechanical and prosthetic implants, such as heart valves and vascular grafts, to preserve accountability between the manufacturer of a device and its recipient and to alert hospitals and physicians to defects and deficiencies.
The National Implant Registry has a research component that permits plain language inquiries of its computer records for such things as the frequency and location of use of a specific brand, model, or generic class of a device. If a device is replaced and a second implant is registered for the same patient, the system flags the episode for further investigation.

Because registers tend to be large and complex, the kinds of data to be collected must be carefully thought out. This is often done in terms of a minimal data set of items to be submitted for each patient, with or without optional supplementary items. Settling on the minimal data set for a register requires careful thought and agreement by the sponsors on such matters as the general problems to be attacked, the data items the registry will collect, what can profitably be analyzed, what the cost will be, and who will pay which costs at various steps from the initial collection of data through the analysis and dissemination of results. For example, if a cancer register is to be used for more than studies of crude counts of patients, it should contain demographic descriptors, information about the type and severity or extent of the disease, and considerable follow-up and survival data.

Very strict requirements for completeness of documentation can have the effect of restricting the coverage of the registry by excluding otherwise appropriate cases that fail to meet those requirements. On the other hand, lax standards of documentation can result in the mistaken inclusion of patients whose disease status is not actually appropriate to the registry. Balancing these opposed considerations is a challenge that should not be shirked or ignored.

Disease registers that have nearly complete coverage of a defined population, such as residents of a given state, are especially useful.
They can avoid or measure certain biases from factors that cause different kinds of patients to go to different hospitals, and they can be used for important classes of studies that require the computation of incidence rates.

Additional functions of a disease registry, such as deriving statistics on patient survival for each hospital, comparing modes of treatment, and finding clues regarding predisposing occupational, environmental, or genetic risk factors, are highly desirable when they can be obtained in a cost-effective manner. Administrative uses of registers are important in resource allocations. Questions of patient access to the hospital, travel distance, and availability of a local physician are easily answered from demographic data. Such data have some value to medical investigators and health administrators and offer the potential for substantial cost-effective use.

Despite the attractiveness of using registers to answer questions about the effectiveness of treatments, experience shows that retrospective analysis of hospital charts often is imprecise. The necessary data about diagnostic evaluations and the nuances that affect therapeutic decisions and outcomes seem to be unobtainable with a basic registry system. The same considerations apply to certain kinds of epidemiologic analyses in which only a detailed prospective search for occupational, environmental, or genetic factors can be given serious attention. A cancer registry in the Netherlands is specifically oriented to look for genetic predisposition (Lips et al., 1982). Older cancer staging systems are based solely on anatomic and pathologic factors, whereas functional tests, hormonal status, and the like are now recognized as important in describing disease biology and prognosis.

Opportunities to use large registers or specialized data bases for clinical research and for exchange of information among various centers were mentioned above; see also Laszlo (submitted for publication). Previous publications (Blum, 1982; Cox and Stanley, 1979a; Cox et al., 1979b) have illustrated the kinds of information that can be obtained and have offered recommendations on how to modernize cancer staging systems.
For example, Blum (1982) has developed a computer program that searches a substantial data base for causal hypotheses.

The number of drugs on the U.S. market is huge, but the number of combinations of drugs is astronomical. If there were only 1,000 drugs in common use, there would be about 500,000 pairs of such drugs and a far greater number of combinations when three or more drugs are used. Some patients take many drugs simultaneously (Blum, 1982). Physicians cannot hope to keep track of the consequences for patients of each possible combination. Both Stanford University and Massachusetts General Hospital have computer systems that track and warn of adverse drug reactions. In the Stanford system, each prescription for a hospital inpatient triggers a literature-based listing of potential drug interactions between the prescribed drug and other previously prescribed drugs. The program automatically notifies pharmacists, nursing staff, and physicians of these potential interactions (Cohen et al., 1974; Tatro et al., 1979).

The Kaiser-Permanente organization developed and used a Drug Reaction Monitoring System (DRMS), which the Food and Drug Administration (FDA) and the National Center for Health Services Research (NCHSR) supported for 5 years as a demonstration project. The DRMS followed both inpatients and outpatients to assess the frequency of drug and event association, to assist in determining causality, to assess the size of the public health problem, and to support other scientific investigations (Friedman, 1972). One long-range goal of such studies is to build up evidence of safety and risk for many drugs. Using the DRMS, Friedman (1983a) was able to review earlier concerns that long-term use of rauwolfia predisposes to breast cancer after age 50. He found no statistically significant relation and estimated that for this population the risk ratio is less than twofold, if in fact there is any excess risk at all.
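The count of drug pairs given in the discussion of drug combinations can be checked directly. The short Python sketch below is an illustration added for that purpose and is not part of any of the monitoring systems described; the function name `drug_combinations` is hypothetical, and the figure of 1,000 drugs is the text's own illustrative assumption.

```python
from math import comb


def drug_combinations(n_drugs: int, k: int) -> int:
    """Return the number of distinct unordered sets of k drugs drawn from n_drugs."""
    return comb(n_drugs, k)


# Illustrative assumption taken from the text: 1,000 drugs in common use.
pairs = drug_combinations(1000, 2)    # unordered pairs of drugs
triples = drug_combinations(1000, 3)  # unordered sets of three drugs

print(f"pairs:   {pairs:,}")
print(f"triples: {triples:,}")
```

Choosing 2 drugs from 1,000 gives 499,500 unordered pairs, which agrees with the rounded figure of 500,000 in the text. Choosing 3 gives more than 166 million sets, which suggests why no physician could track every combination unaided.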

An extension of this approach can screen drugs for carcinogenicity, as Friedman and Ury (1983b) have done. They reported on 143,574 patients followed from 1969 through 1978. The 95 most commonly used drugs or drug groups and 120 selected drugs used less commonly were screened. Because many drugs and many cancer sites lead to many statistical tests, the authors point out that results can be no more than preliminary and suggestive. As an example, concern had been expressed about an association between amphetamines and Hodgkin's disease. The authors found only one case of Hodgkin's disease among 506 users of phenmetrazine hydrochloride and 880 users of diethylpropion hydrochloride.

Capabilities and Limitations of Registers and Data Bases

The strengths and weaknesses of disease registers are reviewed in this section, again using cancer registers as a model (Bailar, 1970a,b; Commission on Cancer, 1974; Cutler et al., 1974; Demographic Analysis Section, 1976; Feigl et al., 1981; Gershman et al., 1976; Maclennan et al., 1978; Queram, 1977; WHO, 1976a,b; Young et al., 1976). The old-style registry is characteristically in or near a medical record room, often at a site remote from physicians' offices. This separation creates two related problems: It is unnecessarily difficult for registrars to consult physicians about questions (e.g., What did the surgical staging reveal? Was this a primary pancreatic cancer?), and distance and the lack of frequent contact inhibit the use of registry data. A register that is used little is largely useless, doomed to mediocrity, and likely to contribute to the poor reputation of disease registries in general.
Registries work best when they are not separate and distant appendages but rather an integral part of both the clinical program (as some cancer registries are integral to cancer consultation, treatment, and follow-up) and the patient data program (Laszlo et al., 1976). Such an approach has been successfully employed in many cancer centers.

A surprising number of problems and concerns arise even in a basic registry system (Bailar, 1970a,b; Laszlo et al., 1976). Completeness of case finding is difficult to document but may be a major problem at many institutions that code only the records of inpatients. Furthermore, uniformity of approaches at different centers can be assured only by an extensive and costly set of quality control procedures.

As an example of the efforts required to maintain data quality at even the most basic level, the Centralized Cancer Patient Data System (CCPDS) worked for several years to develop a 36-item minimal data set. These items require the inclusion of an entire volume of definitions. Even so, reabstracting studies performed by CCPDS showed that coding disagreements at comprehensive cancer centers were primary site, 6 percent; histology, 14 percent; and stage, 23 percent (Feigl et al., 1981). It seems likely that less thoroughly supervised registry systems have error rates that are at least as large.

Timeliness of record completion is often a problem. Medical record libraries or other organizational units that house registries may be pressed to meet other business, office, and professional review functions, so that records often are not abstracted until many months after patients have been discharged from the hospital. Thus data are not current, and to the extent that they rely on fading memories they may not even be accurate.
Some record libraries are also so large or so inefficient that a return clinic visit note fails to find its way back to the chart promptly, thereby giving an inappropriate signal that the patient has not kept a return appointment. Reminder letters under such circumstances may reflect adversely on the competence of the institution.

Registry organizers may ask questions that are too numerous or too complex to be collected quickly by clerks, or require a level of detail that cannot be found in charts or cannot be used for analysis. Such pitfalls are well known in cancer registries; for example, many cancer registries once asked for extensive staging information as well as detailed information about treatment, including the drugs and doses used. Individual hospital and state tumor registries invested many years of work to abstract hundreds of thousands of cases of cancer with classifications of tumor stage that have never been carefully studied and classifications of treatment that have never been subjected to critical analysis. This was a matter of fundamental purposes, objectives, and methods, not a quality control problem; even if the abstracting could have been done to perfection, it would still have been of limited value. A system that has such encumbrances cannot be monitored for its quality, cannot be operated by clerks, and cannot be justified in terms of its cost.

The process of organizing cancer registries as part of hospital cancer programs does serve a cohesive function that is important but difficult to quantify. These benefits have only recently been recognized and are being addressed by the network of hospitals served by the registry program of the Joint Commission on Accreditation of Hospitals and its agent in this matter, the American College of Surgeons, and by the Surveillance, Epidemiology, and End Results Program (Demographic Analysis Section, 1976; Feigl et al., 1981; Young et al., 1976).

Roos (1979) illustrates the thoughtfulness and multiplicity of methods that good analyses of nonexperimental data require, although each situation may require special approaches not transferable to other studies or other data bases.
Roos found that 75 to 95 percent of ordinary tonsillectomy patients could not meet the criteria for a current randomized trial of this operation. Several different analyses, including a simulation, led to different answers because some analyses tended, on their face, to give less credit to the operation than others. The overall effect of the several analyses was to suggest that during the year following surgery, the average patient experiences between 0.1 and 0.8 fewer episodes of respiratory disease than if the operation had not been performed. Because the operation has some hazards and offers modest gains, Roos concluded that physicians should take a very conservative approach to tonsillectomy.

The costs of operating a registry may be calculated in widely different ways: with or without the costs of lifetime follow-up of registered patients; with or without the salaries of staff (who also have other duties); with or without the costs of using registry data for various purposes (such as computer costs of data analysis); with or without overhead and fringe benefits, etc. This explains in part the very wide divergence in quoted costs, ranging from a few dollars to perhaps $150 per registered case. There would be much value in a study that would assess, on a uniform basis and in a variety of settings, the real costs of specific registry activities, including extra chart handling, abstracting, follow-up, and data analysis (manual or computerized). Despite the uncertainties, real marginal costs might be estimated at about $65 per case for a registry recording 300 new cases per year, as indicated in Table 3-5. Total costs for the SEER program, which include costs of analysis, lifetime follow-up, coordination of many diverse registries, and intensive quality control, probably exceed $100 per case entered (Pollack, 1982b).
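The marginal cost estimate of about $65 per case can be reproduced from the budget lines of Table 3-5. The Python sketch below is illustrative only; the names `TABLE_3_5_BUDGET` and `cost_per_case` are hypothetical, and the dollar amounts are those printed in the table (1982 dollars).

```python
# Budget lines as printed in Table 3-5 (1982 dollars) for a registry
# recording 300 new cases per year. The names used here are hypothetical.
TABLE_3_5_BUDGET = {
    "Salary and Benefits": 15_000,
    "Space and Equipment": 1_000,
    "Printing and Postage": 1_500,
    "Telephone": 1_200,
    "Miscellaneous": 500,
}
NEW_CASES_PER_YEAR = 300


def cost_per_case(budget: dict[str, int], cases_per_year: int) -> float:
    """Divide the total annual budget by the number of new cases per year."""
    if cases_per_year <= 0:
        raise ValueError("cases_per_year must be positive")
    return sum(budget.values()) / cases_per_year


total = sum(TABLE_3_5_BUDGET.values())
per_case = cost_per_case(TABLE_3_5_BUDGET, NEW_CASES_PER_YEAR)
print(f"total annual cost: ${total:,}")
print(f"cost per new case: ${per_case:.2f}")
```

The table total of $19,200 spread over 300 cases comes to $64 per case, consistent with the rounded estimate in the text. The same function could be applied to another registry's budget lines, bearing in mind the text's caution that quoted costs vary widely with what is counted.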
TABLE 3-5 Approximate Costs of Registry for 300 New Cases per Year

Budget Item              Cost (1982 dollars)
Salary and Benefits      $15,000 (a)
Space and Equipment        1,000
Printing and Postage       1,500
Telephone                  1,200
Miscellaneous                500
Total                    $19,200

(a) Varies with experience and geographic location.
SOURCE: Kindly provided by the American College of Surgeons (1974).

The capabilities and limitations of data bases are similar to those of registers except that data bases offer richer possibilities for detailed study, generally at greater real cost. Perhaps the greatest challenge in organizing a data base system is to gain agreement as to its elements. Many persons naively think that a broad-based data set on all patients, with records for each interesting event over time, could be quite usefully browsed and queried. Such an approach is not practical for large-scale studies that depend on retrospective searches through data bases, particularly when the data records have been compiled and entered by a variety of observers from hospital records that were not regularly and consistently collected for each patient of the type being studied.

Each data base should have extensive input from persons with a variety of interests, beginning with the initial organization phase and continuing through the life of the data base. These persons should represent the data users; the persons who will supervise data collection and analysis; and the administrators who are responsible for providing space, money, and other resources. These individuals should be authorized to speak for their colleagues in the case of an institutional venture, particularly because registries require long-term commitments. Legal advice also is needed when questions of data access are discussed, for the system must be designed to protect patient and doctor confidentiality, with access governed by established guidelines. There should be an executive committee that sets policy for the use of data, as has been described for the CCPDS (Feigl et al., 1981).
In the case of federally sponsored programs, particularly those funded by contract, special (although simple) systems are needed to protect the confidentiality of data that might be misused by persons who request access thereto through the Freedom of Information Act. (The Act includes special provision for the protection of individual privacy, but not for statistical summaries.) This seems not to be a problem for grant-funded research data, which remain the property of the investigator.

Registries are most likely to survive if they are attached to other successful programs (such as tumor clinics) and if the leaders of those programs are willing to help in acquiring the needed financial and other resources. The chances of success are improved for registries that are part of more extensive data bases that become an accepted part of the hospital and clinic environment, that are used often by physicians, that can be used to evaluate systems of health care or new technologies, and that are reimbursable as part of the medical care system. Stand-alone registries and data bases can exist for research purposes, but their future as permanent parts of the medical scene is no more secure than that of the research grants that support them (Laszlo, 1984).

The recruitment, training, and supervision of registry staff depend on the type, size, and scope of the register; but general requirements include (1) meaningful physician supervision for education and morale of staff as well as to enlist the necessary cooperation of other physicians, (2) able and motivated clerical staff as registrars, (3) locally designed as well as regional or national training programs for initial and continuing training, (4) membership of the chief registrar on the hospital cancer committee, (5) hospital by-laws that specify the functions of the registry, and (6) a close functional relationship (but not necessarily physical proximity) with the medical record library, which must provide and later refile charts as needed. Hospital administrators need to understand and support the registry, and the support of the chief record librarian can be invaluable.

The simpler registry file systems are being supplemented and/or supplanted by more extensive data bases with multiple file capabilities. These are of many varieties and represent major new trends as physicians discover their capability to manipulate medical information in computers. The next decade may see a large increase in the number of such systems, and those that survive will be those used to improve patient care, to assess medical technology, or both.

Strengthening Uses of Registers and Data Bases

Roos (1979) states: "So many different types of analyses are possible given these data bases that some dialogue with reviewers and readers is necessary. Researchers obviously have their own emphases, priorities, and biases. Without suggestions from others at each stage of the research process, significant opportunities for improving both our methodology and our substantive understanding will be lost." Clearly some education about possible analyses will help the medical community to make more appropriate use of data bases for the comparison of therapies. Research on ways to improve such usage might be helpful.

Aspects of patient confidentiality sometimes constrain the sharing of data among institutions. Thus a patient may be followed by two or more hospital registries that do not communicate with each other, especially if the patient has not given permission to share data and there is no centralized (state or regional) registry.
Although considerable effort has been devoted to examining data sharing problems on multi-institutional research (Boruch and Cecil, 1979; National Academy of Sciences, 1985), action and research on ways to reduce these problems are needed.

Because data bases ordinarily contain information from patients whose treatment was chosen in an uncontrolled manner and delivered in an uncontrolled and poorly monitored fashion, groups receiving different treatments cannot be expected to be similar in prognosis. Attempts to use data to compare the effects of different treatments must therefore use analytical devices to attempt to remove the effects of biases. Such devices are not entirely satisfactory, and both new methods and a better understanding of old methods are needed.

SAMPLE SURVEYS*

*This section was drafted by Dorothy Rice with assistance from John C. Bailar III.

Sample surveys (of, for example, medical records, health care providers, or the population) are useful for evaluating medical technologies in clinical use. Surveys can estimate disease incidence and prevalence rates; quantify the use of medical care services and procedures; and estimate costs and benefits of drugs, procedures, and other technologies. Population-based health surveys collect information on health status, functional capacity, medical care utilization, and costs or charges. The information may be obtained in person, by telephone, or by mail. Surveys of health care providers capture information about patients' demographic and medical characteristics, the medical diagnoses associated with contacts in the health care delivery system, the tests and procedures carried out, and charges for the services provided.

Sample surveys of vital records (birth, death, and stillbirth certificates) can also yield information useful for technology assessment. Files of registered births and deaths, for example, provide lists of events for which considerable additional information may be obtained from various sources (perhaps on a sample basis), including hospitals and physicians who provided care to the individual whose record is selected for the study.

The focus in this section is on national sample surveys that are designed to provide accurate representation of the universe that provided the samples. Such surveys are very large undertakings that must serve many diverse and sometimes competing purposes, not only that of technology assessment. They may therefore not exactly suit the needs of the technology assessor with a specific question, but they have many other important strengths such as ready availability of past as well as current data, large sample sizes, and the immense but unquantifiable advantages of data quality that come from having a large, full-time, experienced staff capable of state-of-the-art methods.

Uses of Sample Surveys

A few of the ways in which sample survey data have been used in relation to technology assessment are shown below; many others could be cited.

· Scitovsky (1979) used data from the National Health Interview Survey (NHIS) conducted by the National Center for Health Statistics to develop national estimates of visits to physicians' offices and then combined these results with data from the National Ambulatory Medical Care Survey (NAMCS) to estimate the percentage of physician visits that had a laboratory procedure ordered or provided.
· A report from the National Center for Health Care Technology (NCHCT) (1981) estimated from NAMCS data that there are in the United States 80,000 to 110,000 new candidates each year for surgical treatment of coronary artery disease.
· The Office of Technology Assessment (OTA) report (1982c) "Technology and Handicapped People" used NHIS data to estimate the number of persons with selected impairments.
· Much of the data for OTA's report (1983b), "Variations in Hospital Length of Stay: Their Relationship to Health Outcomes," came from the National Hospital Discharge Survey.

NCHS Surveys

Under the National Health Survey Act of 1956, the National Center for Health Statistics (NCHS) has developed a model set of population-based sample surveys for government and private uses. The purpose is to present a broad picture of the nation's health status and of the use of health resources, and to show various aspects of health and health resources in relation to each other. The surveys draw their information from the people, the institutions and professions that provide health services, and from vital records. These cross-sectional surveys assess the health status of the population at specific points in time and allow examination of changes over time. They encompass and produce statistics on the extent and nature of illness and disability in the United States; the determinants of health; environmental, social, and other health hazards; and health care costs and financing.

Several national sample surveys conducted by NCHS have special functions concerned with health care technology; these are briefly described below.

The National Hospital Discharge Survey (NHDS) has collected data about discharges from nonfederal short-stay hospitals continuously since it began in 1964 (NCHS, 1980b). NHDS can provide a wide range of data on both trends in diagnoses associated with hospitalization and the treatments received by patients in hospitals. While the survey is designed primarily to produce national estimates, it can also produce some estimates for the four major geographic regions of the country, as well as data by characteristics of hospitals such as type of ownership or number of beds.

Technology diffusion in hospitals has been rapid and extensive over the past few decades. New equipment and techniques are introduced, and old techniques are used differently or discarded (Russell, 1979). The NHDS is an excellent source of data for monitoring such trends.

NHDS data on deliveries by cesarean section show how surveys can detect and illuminate trends in certain procedures and diagnoses over the past decade (NCHS, 1980e). In 1970 cesareans were 5.5 percent of all deliveries; by 1978 the percentage was 15.2; in 1981 the NHDS reported that 17.9 percent of all deliveries were by cesarean section (NCHS, 1983e). A report on hospital discharge diagnoses showed that in 1977 the discharge rate for sterilization of healthy females age 15-44 was over five times as great as in 1970. Most were for tubal sterilizations, usually by laparoscopy (NCHS, 1981a).

There also have been rapid changes in diagnostic tools and surgery for heart disease, including steady increases in cardiac catheterization, the use of angiograms, and bypass surgery (Grossman, 1981). Between 1979 and 1981 cardiac catheterization increased 97 percent for men 65 years of age and over and 34 percent for men 45 to 64 years of age. During the same period, coronary bypass surgery increased 27 percent for men 45 to 64 years of age (from 3.0 to 3.8 per 1,000) and 89 percent for older men (from 1.8 to 3.4 per 1,000) (NCHS, 1983c).

The NHDS has documented increases in lens extraction (cataract surgery) and in lens implantation following cataract surgery.
Lens extraction among the elderly increased about 30 percent between 1979 and 1982; 57 percent of these procedures were accompanied by the insertion of a prosthetic lens in 1981, compared with 36 percent in 1979 (NCHS, 1980d).

The use of computerized axial tomography (CT scan) among hospitalized persons increased from 0.8 to 1.8 per 1,000 population between 1979 and 1981 (NCHS, 1983c).

The National Ambulatory Medical Care Survey, a survey of physicians in private office-based practice, began in 1973. Physicians selected into the NAMCS sample are asked to complete a patient record for each sampled patient seen during a 1-week survey period. That form includes information on the physician's diagnosis and on diagnostic services ordered or provided during the visit. Prior to this reporting period physicians are inducted into the survey by a brief interview that obtains information about their training and about characteristics of their practice (NCHS, 1980b). The survey does not, but could, include queries about the physician's knowledge, attitudes, accessibility, and use of specified types of technologies. In 1980 and 1981, a special supplement on drug therapy was added to the NAMCS patient record after several feasibility tests and pretests. The drug data were coded using a system of therapeutic categories based on the Pharmacologic-Therapeutic Classification of the American Society of Hospital Pharmacists (NCHS, 1980d). A national estimate of 679,593,000 drug mentions (a physician's record of a pharmaceutical agent ordered or provided for the purpose of prevention, diagnosis, or treatment) resulted from the 1980 survey.

The survey also produces a listing of the 100 agents most frequently utilized by physicians in office practice (NCHS, 1983b). Periodic supplements of this type would permit study of changes in the use of certain types of medication.
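The percentage increases quoted above for the NHDS follow directly from the per-1,000 rates given alongside them. The short sketch below recomputes them; the function name and the dictionary labels are illustrative choices, not part of any NCHS publication, and the CT scan percentage is derived here rather than stated in the source.

```python
def percent_change(earlier_rate, later_rate):
    """Relative change between two rates, as a percentage of the earlier rate."""
    if earlier_rate <= 0:
        raise ValueError("earlier_rate must be positive")
    return (later_rate - earlier_rate) / earlier_rate * 100.0


# Rates per 1,000 population for 1979 and 1981, as quoted in the text.
nhds_rates = {
    "coronary bypass surgery, men 45 to 64": (3.0, 3.8),
    "coronary bypass surgery, men 65 and over": (1.8, 3.4),
    "CT scan, hospitalized persons": (0.8, 1.8),
}

for label, (rate_1979, rate_1981) in nhds_rates.items():
    change = percent_change(rate_1979, rate_1981)
    print(f"{label}: {change:.0f} percent increase")
```

The two bypass figures round to the 27 and 89 percent reported in the text. Because the published rates are themselves rounded to one decimal place, a recomputed percentage can differ slightly from one calculated on unrounded survey estimates.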
Followback surveys of NCHS also contribute to the measurement of technology diffusion. For a sample of registered births and deaths, considerable additional information is sought from various sources, including hospitals where care was received and physicians who gave care to the sampled person. The first National Mortality Survey covered a sample of deaths in 1961 (NCHS, 1983a). Since that time NCHS has conducted several mortality and natality surveys and an infant mortality survey.

The National Natality Survey of events occurring in calendar year 1980 includes an oversampling of low-birth-weight infants and a sample of fetal deaths. This survey shows how the followback mechanism can provide a range of data relevant to the use of health care technology. For sampled infants, information was requested from both the mothers and the providers of medical care (primarily the physician who delivered the baby) about prenatal care, delivery, and postpartum care. Emphasis was given to whether, during the 12-month period prior to the delivery, the mother had received x rays; ultrasound; thyroid tests; scans or uptakes (nuclear medicine); sonograms or deep heat diathermy; or microwave, short-wave, or radio-frequency treatments. Mothers were asked to identify the providers of each of these examinations or treatments, and the providers were asked for more specific information about the exact type of procedure, why it was performed, and the date and place where it was performed. Information is available about the delivery, including drugs or surgical procedures used to induce or maintain labor, type of delivery, anesthetics, and condition of the infant at birth, such as the Apgar score at 1 and 5 minutes for live births.
These data, in conjunction with demographic and socioeconomic information from the mother's questionnaire, her health habits during pregnancy, and information on many aspects of pre- and postnatal care, make this survey uniquely valuable for study of the ways that health care technology is used and how it affects the outcome of pregnancy and delivery.

Data from the survey show that in 1980 about 13 percent of mothers received an x ray during pregnancy; nearly one-third of all pregnant women received at least one ultrasound examination; 29 percent of mothers 35 years of age and over received amniocentesis; and almost one-half of mothers in the survey received fetal monitoring (NCHS, 1983f).

Similar surveys of deaths are in the early planning stages, including development of sample design specifications that deal with specific diagnoses listed on the death certificate. The samples for past National Mortality Surveys conducted by NCHS have been straightforward, without focus on any particular cause of death, but a future survey could be designed from a different perspective.

None of the surveys previously mentioned collect data on costs or charges needed to measure the cost and impact of medical technology. Two recent surveys were specifically designed to obtain information on health care utilization, the expenditures associated with medical care, and the sources of payment for the care received. The first of these, the National Medical Care Expenditure Survey (NMCES), was conducted in 1977 by the National Center for Health Services Research (NCHSR) in collaboration with NCHS (NCHS, 1981b). A somewhat modified version, the National Medical Care Utilization and Expenditure Survey (NMCUES), was conducted in 1980 by NCHS in collaboration with the Health Care Financing Administration (HCFA) (NCHS, 1983d).
The National Medical Care Utilization and Expenditure Survey (NMCUES) was a longitudinal panel study, having five contacts with each household in the initial sample. The contacts were spread over a period of about 15 months in order to obtain information for the entire calendar year. Each interview asked about health care received in a variety of locations, including dentists' offices, emergency rooms, hospital outpatient departments, and physicians' offices. Information was obtained about the type of provider, the reason for the visit, the specific condition, if any, for which the visit was made, whether certain kinds of tests were done, the total charge for the visit, and expected sources of payment for the bill. Information not available at the time of the first report about the health care encounter was sought at subsequent interviews.

Analyses of specific conditions or groups of conditions will examine the total range of services received and the expenses associated with those services. The NMCES and NMCUES data should go a long way toward answering questions about charges for various types of services and how those charges are paid.

Other NCHS surveys produce data that can be used to assess medical technology, or they could be modified to enable such use. A guiding tenet of the NCHS is that the statistical data it gathers should be available to all interested users as promptly as resources permit. The principal forms used are published reports, special and unpublished tabulations, and public use data tapes. NCHS policy is to release public use data tapes from all its surveys in a manner that will not in any way compromise the confidentiality guaranteed to the respondents who supplied the original data (NCHS, 1980a). These public use data tapes are a major resource for analyses by health services researchers, including those involved in technology assessment.

The potential uses of NCHS sample surveys and the associated data collection methodologies for technology assessment are great, despite the limitations outlined above and the lack of any systematic effort to assess needs and capabilities.
There is considerable interest in making connections between sample surveys and controlled experiments in an attempt to combine the generalizability that surveys offer with the accuracy that experiments offer. This suggestion seems especially pertinent to longitudinal surveys (Boruch, 1985).

Health Care Financing Administration Data

Through its administration of the Medicare and Medicaid programs, the Health Care Financing Administration routinely receives data on such items as its beneficiary population, providers certified to deliver care to the beneficiaries, the use of services, and reimbursements to providers. These materials can be studied by either census (100 percent) methods or sample methods. While HCFA generally uses census methods, these files are described here because we believe that many nongovernment investigators will prefer to use samples.

The Medicare Statistical System (MSS) was designed to provide data to measure and evaluate the operation and effectiveness of the Medicare program. The statistical system is a by-product of three centrally maintained administrative record systems:

· Health Insurance Master File contains a record on each person who is enrolled in Medicare. Data elements for each individual include the Medicare claim number, age, sex, race, place of residence, and reason for entitlement. This file provides enrollment statistics and denominators for calculating all Medicare utilization rates.
· Provider of Service File contains information on each hospital, skilled nursing facility, home health agency, independent laboratory, or other institutional provider that has been certified to participate in the program.
· Utilization File contains Medicare billing information, including such data elements as copayments, deductibles, and spells of illness. For a sample of bills (approximately 20 percent of the hospital bills), the MSS obtains more extensive information, for example, the nature of the hospital episode (diagnostic and surgical procedures).

Because each record in the utilization file contains the beneficiary's claim number and the provider's identification number, utilization records can be matched to enrollment and provider records to determine population-based statistics or provider-based statistics. These record systems are extremely large: about 12 million inpatient hospital bills, 30 million outpatient hospital bills, and 150 million physician payment records were received and processed in 1981 (Lave, 1983).

HCFA data bases have not been widely disseminated, but they have been utilized by HCFA grantees and contractors. An example of an innovative use of Medicare data for assessing variations in health outcomes is Wennberg's analysis of small area variations in health care delivery in Vermont (Wennberg and Gittelsohn, 1973). The use of HCFA data undoubtedly will grow and may be bolstered by the development of general purpose public use tapes that provide summary information on Medicare enrollment and utilization.

Capabilities and Limitations of Sample Surveys

Sample surveys have many potential advantages over studies based on complete enumeration. If the universe surveyed is large or geographically widespread, sampling can be economical. Nonresponses can usually be handled more effectively, the data can be processed more quickly, and the quality of responses can almost always be improved because of the greater opportunity for individual handling of problems in smaller data sets. These advantages are not always realized, however, and each reported survey must be studied in detail by prospective users to determine whether the data really will serve their needs.
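The record matching described earlier for the Medicare Statistical System, in which the claim number and provider identification number on each utilization record tie it to the enrollment and provider files, can be sketched in a few lines. This is only an illustration under stated assumptions: the records, field names, and values below are invented for the example and do not reflect the actual MSS file layouts.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
enrollment = {  # keyed by beneficiary claim number
    "A001": {"age": 67, "sex": "F", "residence": "VT"},
    "A002": {"age": 72, "sex": "M", "residence": "VT"},
    "A003": {"age": 80, "sex": "F", "residence": "NH"},
}
providers = {  # keyed by provider identification number
    "P10": {"type": "hospital"},
    "P20": {"type": "skilled nursing facility"},
}
utilization = [  # one entry per bill
    {"claim_number": "A001", "provider_id": "P10"},
    {"claim_number": "A001", "provider_id": "P10"},
    {"claim_number": "A002", "provider_id": "P20"},
    {"claim_number": "A003", "provider_id": "P10"},
]


def bills_per_enrollee(enrollment, utilization, group_field):
    """Population-based statistic: bills per enrolled person in each group.

    The enrollment file supplies the denominator; linked bills supply
    the numerator.
    """
    enrolled = defaultdict(int)
    for person in enrollment.values():
        enrolled[person[group_field]] += 1
    bills = defaultdict(int)
    for bill in utilization:
        person = enrollment.get(bill["claim_number"])
        if person is not None:  # bills that fail to link are skipped
            bills[person[group_field]] += 1
    return {group: bills[group] / count for group, count in enrolled.items()}


def bills_by_provider_type(providers, utilization):
    """Provider-based statistic: number of bills for each provider type."""
    counts = defaultdict(int)
    for bill in utilization:
        provider = providers.get(bill["provider_id"])
        if provider is not None:
            counts[provider["type"]] += 1
    return dict(counts)


print(bills_per_enrollee(enrollment, utilization, "residence"))
print(bills_by_provider_type(providers, utilization))
```

In a real linkage study the count of bills that fail to match would itself be worth reporting, since incomplete linkage is one of the biases that can undermine results.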
The national sample survey data collected and produced by the NCHS and HCFA are useful and reliable sources for some aspects of technology assessment, provided their scope and limitations are understood. For example, NHDS data are abstracted from the face sheets of medical records; any diagnostic or therapeutic information in the medical record that is not cited on the face sheet is lost to analysis.

NCHS national sample survey data are designed to measure the rate of diffusion, cost, and medical impact of medical technologies when they are reasonably well established and in general use. When new, experimental, or emerging technologies are first introduced there is little impact that these data programs can measure, but as they become established they are incorporated into various classification and coding schemes, and there is a chance to quantify their effects.

The capacity to characterize the recipients of a given procedure in terms of additional variables is perhaps the major advantage of surveys over routine record sources. For instance, it was possible on the basis of NCHS's 1980 National Natality Survey to compare the characteristics of pregnant women on whom amniocentesis, ultrasound, and electronic fetal monitoring were performed (NCHS, 1983c). It is also possible in principle to examine combinations of procedures. This could potentially increase our understanding of the diffusion process and affect policy regarding technology transfer.

Data now are limited by the coding schemes used. Current Procedural Terminology (CPT) is generally used in the United States to code procedures for reimbursement purposes. Diagnoses, surgical operations, and procedures may be coded to the most recent revision of the International Classification of Diseases (ICD), which permits comparison with similar data from other countries. The approximately decennial revision in the ICD may damage the comparability of data over time, although periodic changes in code systems are essential to permit introduction of new terms and procedures that reflect changes in the delivery of medical care. Efforts are made to preserve the same code designations, but special problems are associated with the recent change from coding procedures according to surgical specialty to coding based on the various body systems. When the NHDS made the conversion to ICD-9-CM beginning with the 1979 data year, NCHS increased the maximum number of diagnoses carried on its data tapes from five to seven; similarly, the maximum number of procedures and surgical operation codes assigned was increased from three to four.

The type and character of respondents profoundly affect the design of and instructions for survey questionnaires. Thus, surveys of members of the general population must be limited to the level of medical detail that most lay persons can provide in response to survey questionnaires. Lay respondents may in some instances provide information of little value for the assessment of medical technologies.

The Office of Technology Assessment (1980a) recently polled data analysts who conduct cost-effectiveness and cost-benefit analyses of various health care technologies and found that unmet data and information needs were considered a significant problem by all respondents. Information needs more often were reported to affect the cost of a given study than such factors as complexity of the problem being studied or the stage of development of the technology (NCHCT, 1981). Thus, investigators need to rely on primary data collected for other purposes as well as on secondary data analyses.
Strengthening Uses of Sample Surveys

The scarcity of resources for the design of new specialized surveys is likely to require that established national sample surveys continue to be a major resource for monitoring trends in the incidence, prevalence, diffusion, cost, and medical impact of technologies in clinical use. Several changes could strengthen the use of these surveys for technology assessment.

Sample surveys are conducted by many public and private organizations for a variety of uses (Mullner et al., 1983). Standardization of data elements across programs would facilitate cross-survey comparisons. The National Committee on Vital and Health Statistics sponsored the development of three minimum data sets dealing with hospital discharges, ambulatory medical care, and long-term health care (NCHS, 1980c,f; 1981c). These may form a useful beginning for development of a broader collection of minimum data sets for the multiple data collection systems needed in technology assessment.

Data systems in the public and private sectors could be utilized more fully. Analysis of existing data, if adequate for technology assessment, is generally less expensive than collection of new data. Existing national sample surveys might be expanded, where feasible, to assist in monitoring trends in health status, medical care utilization, and health outcomes. Questions could be added to existing continuing surveys; information can sometimes be inexpensively "piggy-backed" onto ongoing sample surveys. Sample surveys can often be used for follow-up studies, in which a specific population group is studied again, and for followback surveys, in which the past experience of a group is analyzed.

A major problem in monitoring the diffusion of new technologies is the fixed character of most classifications and coding schemes. It takes several years for classifications to be modified so that new procedures can be distinguished from older procedures. A mechanism needs to be developed to make classification more responsive to emerging technologies.

Health data are collected by a variety of organizations. These data could be shared more widely among agencies, often without introducing problems of confidentiality, if other problems of data compatibility and record linkage can be solved. To encourage sharing of data, methods should be developed that will increase the capacity to integrate data sets. Linkage of data files should be encouraged when there is a good reason to believe that the results of a specific linkage program will be sufficiently complete for the purpose and that biases and other limitations of linkage studies will not be so severe as to vitiate results.

Longitudinal surveys, in which a sample of the population is followed through time, can systematically record changes in health status. Careful evaluation of the health of persons in these samples and their use of drugs, surgical procedures, and other medical technologies could be useful for technology assessment.

EPIDEMIOLOGIC METHODS*

In this section will be discussed several kinds of data sources, including registers and data bases, surveillance systems, and sample surveys. Data from these sources are rarely usable in their original form to assess technologies. Analytic methods, often designated collectively as epidemiology, may be applied to data from these and other sources.

* This section was drafted by John C. Bailar III.

Epidemiologic methods, narrowly defined, deal with diseases as they are observed in defined populations. An extended definition of epidemiologic methods might take in the entire range of observational studies of such things as rates of disease incidence or mortality, causes and risk factors for specific conditions, characteristics of screening and diagnostic tools (including sensitivity and specificity), the efficacy of preventive measures, follow-up, and outcome. Epidemiology is essentially an observational science; intervention by an investigator is uncommon, and controlled intervention is rare. Thus in most instances in which these methods apply, bias in data is a greater threat to interpretation than is random error. Epidemiology (in contrast to clinical medicine) is not ordinarily concerned with identified individuals except as the identification is needed to match records or assign persons to the right population group.

Clinical studies of the effects of treatment share much with epidemiologic methods in the overlap of problems, methods of attack, analytic tools, and problems of interpretation. However, epidemiology itself is not ordinarily concerned with the responses of a disease to treatment, although more general responses of the host (e.g., increased susceptibility to other diseases) may be included.

Epidemiologic investigations of acute illness (infections, trauma, or acute chemical toxicity) are quite different from those of chronic or long-delayed illness (e.g., cancer, chronic obstructive lung disease, or schizophrenia). The differences are not only in the problems studied but include the methods of research study, the nature of various impediments to sound conclusions, the kinds of risk factors including past applications of technologies, and the means for evaluating those technologies. Examples can be found in the range of differences between studies of acute organ damage by some chemical substance and studies of the carcinogenicity of the same substance.

Another important dichotomy separates input variables from output variables. Input variables, somewhat loosely defined, are those observed prior to some effect, event, or use of a technology to be studied and might be more roughly labeled causes and correlates of causes. Output variables are those observed after the use of a technology or other intervention or event and include at least all those features that are effects. Thus the definitions of input and output variables are somewhat broader than the usual concepts of independent and dependent variables.

To illustrate, consider a study of a new treatment for hypertension. Input variables include such things as the nature, causes, severity, and pretreatment complications of the disease; age, race, occupation, and other demographic variables; concomitant illnesses and family history; and, of course, the treatment itself. Output variables include such things as vital status and survival time, change in the disease process, and complications of treatment.

Three very broad categories of epidemiologic methods will be discussed: cohort studies, case-control studies, and cross-sectional studies.

Both cohort studies and case-control studies focus on relations between input and output variables, but differ in a critical aspect of the way subjects are selected for investigation: in cohort studies, selection of subjects is based on input variables, while in case-control studies, selection is based on output variables (White and Bailar, 1956).
Thus one might examine the relation between polio immunization and the later incidence of polio by:

· comparing a sample or series of persons immunized with a similar sample not immunized, to see how many in each group develop polio (a cohort study, because the groups are defined on the basis of treatment, an input variable);
· comparing a sample or series of new polio patients with a control series free of such infection to see how many in each group had been immunized (a case-control study, because the two groups are defined by the presence or absence of polio, an output variable).

Critical to both approaches is the concept of cause and effect, which implies the observation of or reporting of change over time: change in the patient, in a disease, in physiologic or biochemical features, or even in features of whole populations or social milieus.

Cross-sectional studies are rather different and not as precisely defined. Their characteristic feature is the search for correlations from which inferences (including cause-effect inferences) can be made, rather than the search for change over time. Studies of many familial diseases are cross-sectional; so are many population-based studies of disease screening.

Some epidemiologic studies are strictly cross-sectional and do not involve change, but do permit strong inferences about change (i.e., cause and effect) and may be treated as if they were ordinary cohort or case-control studies (Bailer et al., 1984). An example is the reported correlation (Needleman, 1979, 1985) between behavioral changes in school children and the lead content of deciduous teeth, which are presumed to reflect lead levels some years earlier when the teeth were formed.

Uses of Epidemiologic Methods

Cohort Studies

Cohort-based epidemiologic studies can play a substantial role in technology evaluation. Consider the following example:
(1978) reported an unex- pectedly high incidence of gonadal dys- function in African boys who had received cytotoxic drugs as treatment for Burkitt's lymphoma. The study was strictly observa- tional; there was no treatment protocol, no parallel control group, no prior hypothesis, not even a well-defined population from

118 which the subjects were a describable sam- ple. Various features of the report, how- ever, suggested that the observation was quite solidly based and established a previ- ously unrecognized complication of an im- portant drug technology that is used, with variations, for many other neoplastic dis- eases. This is considered a cohort study be- cause subjects were selected on the basis of input variable (age, sex, and geographic location as well as disease and treatment) rather than because they did or did not have known gonadal dysfunction. The study had no internal controls, but the au- thors cited results in untreated, well North American boys (presumably supported by the author's own knowledge of physiology in adolescent Africans) to establish that there was indeed a high risk. Case-Control Studies Case-control stud- ies of medical technologies are often an outgrowth of the unstructured observation of something odd. If the oddity is suffi- ciently striking, there may be little need for an explicit control group in the initial re- port on some possible technologic effect. The function of controls at this initial stage of identifying and reporting a possible problem may be filled by earlier case se- ries, reports in the literature, or even com- mon knowledge. However, more definitive work, including quantitative estimates of such things as the frequency or degree of effect, will nearly always demand careful attention to controls. For example, Herbst et al. (1971) re- ported a remarkable cluster of cases of a very rare disease (adenocarcinoma of the vagina in adolescents). They also reported data for a control group and established that the occurrence of the disease was closely linked to maternal use of diethyl- stilbestrol (DES) before the offspring were born. 
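The distinction drawn earlier between selecting subjects on an input variable (cohort) and on an output variable (case-control) determines which summary measure a study can support. The sketch below contrasts the two standard measures, the risk ratio and the odds ratio. The counts are hypothetical, chosen only to make the arithmetic easy to follow, and are not taken from any study cited here.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Cohort design: groups are fixed by the input variable (exposure).

    Each group's risk is directly observable, so absolute risks and
    their ratio can both be estimated.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Case-control design: groups are fixed by the output variable (disease).

    The investigator chooses how many cases and controls to study, so
    absolute risk cannot be read from the table; only the odds ratio of
    exposure is estimable.
    """
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical cohort: 1,000 immunized and 1,000 not immunized persons,
# of whom 2 and 20 respectively later develop the disease.
print(risk_ratio(2, 1000, 20, 1000))

# Hypothetical case-control series: 40 cases (10 immunized, 30 not)
# and 40 controls (25 immunized, 15 not).
print(odds_ratio(10, 30, 25, 15))
```

In the cohort example the absolute risks (2 per 1,000 and 20 per 1,000) are available alongside the ratio. In the case-control example the equal numbers of cases and controls reflect the sampling plan rather than how common the disease is, which is why a design of this kind cannot by itself show what proportion of exposed persons develop the disease.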
This initial identification of a technologic effect was based on sampling an output variable (all subjects had cancer; the controls did not), not the relevant input variable (DES exposure). Later, more tightly designed studies confirmed the association and left little doubt that it was cause and effect. Fortunately, only a very small proportion of "DES daughters" developed the cancer, a quantitative result that could not have been established by Herbst's original approach and was in fact unexpectedly reassuring.

Cross-Sectional Studies

The object of cross-sectional studies is to understand some state of nature rather than to study changes in state. Examples of cross-sectional studies include the identification and description of new (or newly recognized) diseases, determinations of disease severity or extent, establishing normal ranges for laboratory tests, and investigation of pathophysiologic mechanisms. Although data over time may be collected in such studies, change is used only as a tool of investigation; it is only when change itself is the object of study that the work is classified as cohort or case-control.

Cross-sectional studies are commonly used in the evaluation of medical technologies. For example, new techniques for the diagnosis and staging of disease are nearly all of this type. Re et al. (1978) provide an illustration. Narrowing of a renal artery is sometimes a cause of hypertension; this cause is likely when plasma renin activity (PRA) is at least 1.5 times as high in blood from one renal vein as in blood from the other. Re et al. (1978) proposed a modification in technology (administration of a substance called CEI) to increase the sensitivity of the test. They found that the mean PRA ratio was 2.94 before and 8.36 after the administration of CEI in seven unselected patients with known renal artery disease, while in patients without that condition CEI changed the mean PRA ratio from 1.99 to 1.17.
The authors conclude that CEI increases the diagnostic accuracy of the PRA test, but they are careful not to imply that their study is definitive.

Capabilities and Limitations of Epidemiologic Methods

Epidemiologic methods have many uses in technology assessment besides the identification and characterization of health outcomes. These include studies of:

· the prevalence of use of technologies, such as surgery, in various types of communities;
· the prevalence of various health outcomes that may be technologically related, such as Reye syndrome or complications of elective abortion;
· the prevalence of disease (e.g., patient load), risk factors, or other input variables, without study of output variables, such as a survey to determine potential demand for an artificial heart;
· the distribution of availability of a technology or of its actual use and rate of diffusion;
· data on technology-related costs or charges.

Although these items may have some evaluative uses in themselves, results are more likely to be passed on to other evaluation methods discussed elsewhere in this book (e.g., cost-effectiveness analysis/cost-benefit analysis).

Epidemiologic studies of health technologies rarely are planned before a specific need is noted, usually in relation to concerns about some undesirable effect. Most are attempts to gather information after astute clinical observation, or other methods discussed elsewhere in this book, have shown that there may be a problem.

Epidemiologic assessments are done most often in academia, sometimes in government, but rarely by manufacturers, providers, or third-party payers, health maintenance organizations (HMOs) being a strong exception. This may introduce bias in the selection of topics and specific methods, types of patients available, etc.

Epidemiology used for technology assessment can have the capability of: (1) exploiting populations that come to hand; (2) deriving and testing inferences (see, e.g., U.S.
Public Health Service [1964] for a thoughtful discussion); and (3) providing more satisfactory modes of postmarketing surveillance.

On the other hand, epidemiologic findings are often constrained because (1) data often are limited, biased, and in inappropriate forms, sometimes because the data are a by-product of a process that is conducted largely for other purposes (such as death certification); (2) study performance may be damaged by problems of access to data (concerns about confidentiality, etc.), the chronic shortage of trained epidemiologists, and perhaps other structural problems not related to particular diseases, particular patient groups, or particular technologies; and (3) good epidemiology is almost always expensive and time-consuming.

Strengthening Uses of Epidemiologic Methods

The greatest opportunity may lie simply in extending the application of epidemiologic methods to what are called outcome studies. The aim of such studies is to learn about patient status after considerable time has elapsed following treatment. Topics include matters like 1-year survivorship after a certain operation or after discharge from intensive care or coronary care units. Such data are valuable in making individual decisions and in designing health care policy. Surprising findings can arise when such studies are done (Gerber et al., 1984; Wennberg, 1984). More outcome studies are needed.

The methodology for outcome studies is at a relatively early stage of development and more research will be needed. There is a need for faster, easier, less expensive, and more accurate measurement of many kinds of outcomes (e.g., the carcinogenic effects of drugs or the outcome of treatment for many kinds of chronic diseases).

Problems and gaps both in present data and present applications should be better identified, and ways to improve the techniques and uses of epidemiologic evaluation for purposes of health technology assessment (new knowledge, new organization, etc.) should be developed.

Epidemiologists have been in critically short supply for at least 30 years; the supply of those trained for (and interested in) technology assessment may be even tighter (National Academy of Sciences, 1978, 1981). Efforts to increase the supply of epidemiologists should be expanded.

SURVEILLANCE*

Surveillance was widely used by health departments in the nineteenth century,† but modern nationwide surveillance was substantially strengthened some 30 years ago when the Centers for Disease Control (CDC) began its efforts to monitor and control outbreaks of infectious disease (American Public Health Association [APHA], 1981; Bell and Smith, 1982; CDC, 1982a; White and Axnick, 1975). A national morbidity reporting system is based on data forwarded from the states to CDC for aggregation. There are detailed surveillance procedures and report forms for some 30 diseases using a variety of sources. Some of these efforts are laboratory based, such as identification of strains of salmonella. Some concern hospital-acquired infections. Some are practitioner based, such as the reporting by neurologists of Guillain-Barré neuropathy following swine flu immunization. No direct information is available on the costs and benefits of such programs; indeed it would be difficult to approach such issues in this manner (see below). However, certain benefits of the surveillance system have recently been summarized (Kimball, 1980).

* This section is based on material prepared by Jeffrey Koplan and John C. Bailar III.
† John Graunt reported surveillance using the bills of mortality in the seventeenth century.
Reports of morbidity and mortality are generated weekly, monthly, and annually by the CDC. Its weekly reports have worldwide distribution as the Morbidity and Mortality Weekly Report (MMWR).

Surveillance is the continuing scrutiny of the occurrence, spread, and course of a disease for aspects that may be pertinent to its effective control (APHA, 1981). Surveillance traditionally has been associated with communicable disease epidemiology, but now it is applied to noncommunicable diseases, health indicators, environmental and occupational hazards, and other health problems and conditions. Included in disease surveillance are the systematic collection and evaluation of a broad range of epidemiologic data, such as (1) morbidity and mortality reports; (2) results of special field investigations; (3) laboratory reports; and (4) data concerning the availability, use, and untoward effects of a variety of substances and devices used in disease control, such as vaccines, drugs, and surgical procedures. The concept and practice of surveillance can be extended from standard features of health and disease to medical practices, such as surveillance of surgical procedures or vaccine use, of adverse reactions to drugs, and of health risk factors such as smoking or environmental hazards.

Surveillance may be passive or active. Passive, voluntary surveillance puts the burden of reporting on health care providers or institutions, who, as cases occur, respond by mail or telephone to a health department or other central repository of surveillance data. Such voluntary reporting may be useful but provides incomplete data biased by the individual reporter's experience.

Active surveillance, which might apply to the same disease or practice, supplements or replaces passive reporting with systematic, specific inquiries directed to persons or institutions that may be able to supply data. For example, passive surveillance of adverse reactions to rabies vaccine might involve a periodic flyer to practicing physicians, along with a package insert asking them to report such information to the manufacturer, state health department, or a federal health agency. Active surveillance might involve tracking by state health officials of all doses of rabies vaccine distributed and periodic telephone calls to physicians and/or patients asking for specific information on circumstances of use and adverse reactions. Systematic reporting is likely to be more comprehensive, better standardized, and better able to detect previously unknown effects (good or bad), but it requires a more expensive and complex system.

Under the postmarketing surveillance program of the Food and Drug Administration (FDA) for drugs approved for general use, the sponsor is required to forward to FDA any reports of adverse effects of the drug. Additional information about adverse drug reactions comes to the FDA from several sources. These sources include reports from physicians, pharmacists, and hospitals; a monthly literature review; studies in special populations (largely, but not entirely, passive surveillance); registries and data bases; and the World Health Organization (WHO).

Uses of Surveillance

Evaluation of technologies such as vaccines has always relied considerably on surveillance methods. Even when the technique has been evaluated independently, as by a randomized controlled trial (RCT), surveillance provides for continued assessment of the vaccine and the disease it prevents.
The reporting of measles, combined with reports on the use of measles vaccine and possible adverse reactions, permits a quantitative assessment of vaccine efficacy and safety (CDC, 1982a; White and Axnick, 1975).

Surveillance has an important role in supplementing clinical studies of rare events that may not be observed in studies with modest sample sizes. For example, if a drug therapy causes blurred vision in 1 person per 1,000 persons treated, a clinical trial including 200 treated patients has only a .18 probability of observing one or more such events. Increasing the sample to 2,000 raises this probability to .86 and a sample of 5,000 to .993. Thus one argument for postmarketing surveillance is that comparative premarketing trials of modest size may not detect rare but important adverse reactions, whereas continued study after the drug is released for general use increases the probability of detection (OTA, 1982a).

Reporting may not be as rigorous as during a clinical trial, so that an adverse event that occurs in 1 individual in 1,000 might be properly reported less than half the time. A factor of one-half would affect the numbers given above as follows:

Probability of Observing at Least One Adverse Drug Reaction

Sample size    Full reporting    Fifty percent reporting
200            .18               .10
2,000          .86               .63
5,000          .993              .918

Whether omissions are more or less frequent than 50 percent seems not to be known. Furthermore, the incomplete and perhaps inaccurate reporting of adverse events, while unlikely to establish a causal link to the therapy, can increase the vigilance of the medical community and initiate more extensive and more reliable studies (Finney, 1965, 1966).

Postmarketing surveillance information may lead the FDA to remove a drug from the market or constrain its advertising or labeling. In recent years FDA received an average of about 11,500 reports of adverse events per year, with 71 percent reported through manufacturers.

When adverse drug effects are long delayed, some form of postmarketing study may be required, because few randomized clinical trials can be continued for many years. More generally, methods other than clinical trials are used to detect many kinds of adverse outcomes (Bell and Smith, 1982).

Surveillance, even in the crude form of national incidence figures, can offer insight into changes in the epidemiology of a disease, including changes caused by effective disease control measures and programs.

Assessment of a vaccine, drug, or device requires data on frequency of use, characteristics of persons treated, frequency and types of adverse sequelae, etc., all of which are amenable to collection by surveillance methods. Thus, disease surveillance monitored the progress of the World Health Organization's successful smallpox eradication program at the same time that it identified where cases were occurring and improved the targeting of control activities (Foege, 1971).

Surveillance can provide data useful for decision analysis, including cost-benefit and cost-effectiveness analysis (CBA/CEA). Benefit-risk studies of smallpox vaccination based on surveillance of vaccine use and reactions and worldwide disease occurrence led to the conclusion and subsequent federal government recommendation that routine smallpox vaccination was no longer warranted (Lane, 1969). Recent studies of CBA/CEA of vaccination programs for measles (White and Axnick, 1975), pertussis (Koplan et al., 1979), mumps (Koplan and Preblud, 1982), and hepatitis (Mulley et al., 1982) have used surveillance data for disease incidence, disease complication rates, rates of vaccine usage, and rates of adverse vaccine reactions.

Evaluations of screening procedures also require estimates of disease prevalence.
For example, routine surveillance has provided data crucial for assessments of prenatal cytogenetic screening (Hook et al., 1981), maternal serum alpha-fetoprotein screening for neural tube defects (Layde et al., 1979), screening for rubella immunity (Farber and Finkelstein, 1979), and lead screening (Berwick and Komaroff, 1982).

As the scope of epidemiology has broadened, so has the subject matter found suitable for surveillance. The Boston Collaborative Drug Surveillance Program collects data on the uses and adverse effects of various pharmaceuticals for use in epidemiologic studies of benefits and risks (Boston Collaborative Drug Surveillance Program, 1973). Surveillance of techniques for contraception and pregnancy termination has led to increased information about their benefits and risks (CDC, 1979), such as recognition of the association between a particular intrauterine device (the Dalkon Shield) and pelvic inflammatory disease. Programs designed to identify and modify medical procedures associated with high rates of nosocomial (hospital-fostered) infections have relied on surveillance methodologies (Dixon, 1978; Haley et al., 1980). Surveillance of environmental and occupational hazards and disease, including injuries, is helpful in the assessment of technologies, including those aimed at controlling environmental hazards and reducing illness and injuries in the workplace (U.S. Consumer Product Safety Commission, 1982).

The surveillance of local and regional trends in the use of surgical procedures, such as surgical sterilizations, can raise questions to be answered by more directed analytic studies. Data from the National Survey of Family Growth and from the Centers for Disease Control (1982b) have been used to estimate the cumulative prevalence of hysterectomy and tubal sterilization among women of reproductive age in the United States. The cumulative prevalence of tubal sterilization for women aged 15-44 years more than doubled from 1971 to 1978, and the rate at least tripled for women under 30 years old. Proportionate increases in the cumulative prevalence of hysterectomy were not as great, but by 1978, 19 percent of women 40-44 years old had undergone hysterectomy.

Although the main methods for postmarketing surveillance are cohort studies, case-control studies, and voluntary reporting by physicians, Wennberg has proposed some additional types of population studies that use insurance claims and hospital discharges. He and his colleagues use such data to study per capita use of procedures, hospital expenditures, use of beds and personnel, and outcomes of care in various subpopulations of a state or region (Wennberg and Gittelsohn, 1973; Wennberg et al., 1980; Wennberg, 1981; Vermont Health Utilization Survey, unpublished report, 1973). They used surveillance methods to show that rates of tonsillectomy and adenoidectomy varied across 32 hospital areas in Rhode Island, Maine, and Vermont by a factor of 6, while hysterectomy varied by 3.6, prostatectomy by 4, and herniorrhaphy by about 1.5. The point is not that reductions are required, because a comparison of rates does not in itself tell whether the high rates provide corresponding health benefits, but rather that such large variations in usage deserve study and explanation.

When physicians become aware of such variations their practice may change. Gittelsohn and Wennberg (1977) found that in the highest-rate area of Vermont, the cumulated rate of tonsillectomy during childhood dropped from 65 percent to 8 percent when the Vermont Medical Society provided a new information program for local physicians. Earlier studies by Lembcke (1956) and Dyke et al. (1974) showed similar effects from surveillance on the number of hysterectomies in Canada.
Such methods also can be applied to study the costs, health benefits, and mortality associated with various policies about the use of procedures. For example, in the United States a strategy of low use of prostatectomy has been projected to lead to 1,900 deaths annually while a high-use strategy would lead to 6,800, a ratio of about 3.6 (Wennberg, 1981).

Surveillance also can be used to study longer-term survival and thus to contribute information about the risks of various medical procedures. Wennberg proposes using insurance claims data as a kind of surveillance mechanism to study rates of complications. For example, Schaffarzick et al. (1973) found that various types of complications following intraocular lens transplant were resolved in 42 percent by lens removal alone, in 9 percent by lens removal and replacement, and in 50 percent without lens removal.

Some new and existing technologies might be usefully evaluated in a surveillance system. Although maternal serum alpha-fetoprotein appears to be a valuable screening test for neural tube defects, its performance in the field could be better assessed by developing a surveillance system in which laboratory accuracy and standards (including false positives and false negatives), distribution of services, patient follow-up, interpretation of results, and actions taken are all monitored in a regular and systematic manner. The recently available hepatitis B virus vaccine is being evaluated by surveillance directed toward determining both its efficacy and any adverse effects (Wennberg, 1981).

Capabilities and Limitations of Surveillance

Surveillance mechanisms may be as carefully designed and controlled as, for example, the RCT, but surveillance is intended to serve different functions, including a role in disease control. Skilled staff can institute surveillance on a routine basis relatively quickly, and surveillance can provide data useful for technology assessment from diverse geographical areas and over long periods of time. Jick et al. (1979) and Remington (1976) believe that more general mechanisms are both needed and feasible. Finney (1965) has outlined many aspects of a good monitoring system.

Although surveillance data often are incomplete, they can be used to evaluate disease trends if the manner of data collection is consistent and variations in the completeness of reporting are small. When the reporting fraction varies over time, conclusions drawn from reported trends can be erroneous. During the World Health Organization's Smallpox Eradication Program, graphs of smallpox incidence revealed peaks that reflected improved surveillance and the reporting of cases that previously went unreported rather than an increase in incidence (Foege et al., 1971). Similarly, changes in disease definitions, professional interest and activity (reporting physicians, clinics, etc.), and incentives (economic, political, social) can influence disease reporting to create spurious trends in disease incidence. A map showing the incidence of syphilis by state could reflect differences in such matters as case-finding activity, availability of public clinic facilities, program priorities, and reporting practices, as well as actual differences in incidence (CDC, 1982a).

Passive surveillance systems usually provide information at little cost, but they may be neither timely nor accurate and have problems of underreporting and ascertainment bias. A passive surveillance system in Washington, D.C., failed to detect an epidemic of diarrheal disease caused by a drug-resistant strain of Shigella sonnei (Kimball et al., 1980).

Strengthening Uses of Surveillance

Various ways to improve postmarketing surveillance have been suggested (IMS America Ltd., 1978). For example, the
For example, the ASSESSING MEDICAL TECHNOLOGY Joint Commission on Prescription Drug Use recommended that comprehensive postmarketing drug surveillance be devel- oped for the United States to detect serious adverse events that occur more often than 1 in 1,000 patients and that methods be de- veloped to detect rarer adverse events and delayed events as well as to evaluate bene- fits. It proposed a private, nonprofit center to aid in these developments. One possibility is to divide the nation into geographic regions and to release new drugs and other technologies to some re- gions and not others, recognizing in the analysis the possibility of region-to-region variations in frequency of use for various purposes (OTA, 1978a). This approach would fit well with some aspects of decen- tralized decision making (e.g., to states or to third-party carriers) but would have to be developed carefully to avoid problems of both practicality and ethics. The regions should be selected on logical grounds after careful study, rather than by simple conti- guity (such as northeast south midwest, and west). The use of population samples in comparable regions could avoid having the whole country be the guinea pig for all drugs, and a means for rotating first mar- keting regions would assure that no one re- gion would always bear the burdenor reap the benefits of first marketing. Postmarketing surveillance mechanisms for drugs are now extensive and rather well developed, arid surveillance of major infec- tious disease appears adequate, but other areas lag behind. Substantial well-targeted surveillance systems might be quite helpful in such areas as iatrogenic illness, environ- mental health hazards, and long-term oc- cupational risks. 
Research is needed on ways to develop surveillance programs for such health problems in the context of budget restraints, growing concerns for individual privacy, and a general lack of incentives to develop and preserve the necessary records (e.g., of occupational exposure to potentially toxic chemicals).
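
The detection probabilities tabulated under "Uses of Surveillance," and the 1-in-1,000 threshold cited by the Joint Commission on Prescription Drug Use, rest on simple arithmetic. If each treated patient independently experiences the reaction with probability p, and a fraction r of occurrences is reported, the chance of at least one report among n patients is 1 - (1 - pr)^n. The sketch below is illustrative only: the function names, the independence assumption, and the 95 percent target in the last line are assumptions made for this example, not part of the source.

```python
import math


def prob_at_least_one(n, rate, reporting=1.0):
    """Chance of observing one or more reported events among n patients.

    Assumes (a simplification made for this sketch) that events strike
    patients independently with probability `rate`, and that each event
    is reported with probability `reporting`.
    """
    return 1.0 - (1.0 - rate * reporting) ** n


def patients_needed(rate, target, reporting=1.0):
    """Smallest n for which prob_at_least_one(n, ...) reaches `target`."""
    per_patient = rate * reporting
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_patient))


if __name__ == "__main__":
    rate = 1 / 1000  # one adverse reaction per 1,000 treated, as in the text
    print("n      full   50% reporting")
    for n in (200, 2000, 5000):
        full = prob_at_least_one(n, rate)
        half = prob_at_least_one(n, rate, reporting=0.5)
        print(f"{n:<6} {full:.3f}  {half:.3f}")
    # Hypothetical follow-up question: how many treated patients give a
    # 95 percent chance of at least one report under full reporting?
    print(patients_needed(rate, 0.95))
```

Under these assumptions the calculation agrees with the tabulated values after rounding, and it makes plain why premarketing trials of a few hundred patients cannot be expected to reveal reactions at the 1-in-1,000 level.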

QUANTITATIVE SYNTHESIS METHODS (META-ANALYSIS)*

Meta-analysis is a statistical method for obtaining quantitative answers to specific questions from collections of primary articles dealing with the same subject. From information obtained from each article, a synthesis is made which may produce a much stronger conclusion than any of the separate articles can provide.

There are many methods of combining information from several sources, and some of these are discussed here. Louis et al. (1985) give a large collection of examples of meta-analyses in public health.

* Clifford Goodman contributed this section.

Uses of quantitative synthesis methodology, especially for obtaining overall significance levels from groups of studies, have expanded steadily since the 1930s (e.g., Fisher, 1938, 1948; Mosteller, 1954; Pearson, 1938, 1950). Although applications of synthetic methods have been most prominent in the social sciences, the utility of these methods has been demonstrated in applications to data from medical trials. Certain of these applications are described as examples in this section.

Two models are often applied. The first supposes that each study measures the same true quantity, although with differing precision for the different studies. The second, usually more realistic, model assumes that each study estimates a somewhat different quantity, and that we want to assess the properties of the distribution of these different quantities. For example, the gains from a treatment might be different in different institutions, not just because of sampling errors, but because the characteristics of the patients and procedures at institutions differ. In spite of these differences, we want to know how well a treatment works when we aggregate across institutions, perhaps by averaging. The methods of Cochran (1954), Gilbert et al.
(1977), and DerSimonian and Laird (1982) offer ways to estimate the means and the variability of the true effects.

There are several major benefits to quantitative synthesis of groups of trials. One is the increase in power, the ability to detect significant differences between treatment and control groups. The larger the sample size of patients (or other subjects) assumed to be drawn from a common distribution, the more likely that a certain effect will be detected as statistically significant.

A second benefit of quantitative synthesis is in obtaining improved estimates of effect size (usually defined as the difference in means divided by the standard deviation of single observations in the control group). Where effect sizes from each of a group of studies are assumed to be estimates of a single true effect size, averaging (or otherwise cumulating) effect sizes may provide a better estimate.

A third benefit is in describing the form of a relationship. Combining studies provides more data or a greater range of data with which to describe relationships among variables.

A fourth benefit is in detecting contradictions or discrepancies among groups of studies. Faced with a collection of studies in a particular area of research, a reviewer may analyze and compare subgroups of studies with, for instance, divergent findings to detect mediating factors of study design, treatment, context, measurement, or analysis that otherwise may not have appeared noteworthy (Pillemer and Light, 1980).

A prevalent criticism of quantitative synthesis of findings across trials is that there may exist unpublished negative (or zero-effect) trials that are not available for pooling, thus biasing the sample of trials included in the synthesis. In their quantitative synthesis of 345 studies on interpersonal effects (discussed below, under "Cumulation of Significance Levels"), Rosenthal and Rubin (1978b) address this "file drawer" problem, demonstrating that the number of studies averaging a zero effect that would be required to reduce significant findings of that synthesis to insignificance (p > 0.05) is in the tens of thousands. Of course, the "file drawer" could have negative results as well. Devine and Cooke (1983) compared results in journals and books with those of unpublished dissertations. They were studying reductions in length of hospital stay associated with patient education. They found the reductions in the published articles to be 17 percent while those in the unpublished dissertations were 14 percent. The published effects are thus a little larger than the unpublished ones, as many critics expect.

Of course, studies should not be indiscriminately grouped for achieving the benefits of synthesis. As discussed throughout this section, the utility of these quantitative synthesis procedures requires making certain assumptions about similarities among grouped studies.

As is the case in any synthesis, the reviewer should specify the criteria for selection of studies included in the calculation of average effect size. The notion of scope of studies selected for synthesis is most important. If a wide-scope synthesis shows small differences among treatments designed to achieve the same effect, then hypotheses about differences among these treatments may be of minor importance. If the wide-scope synthesis shows large differences among treatments, then the synthesis can be applied on a smaller scope, so as to identify important differences among treatments.

Deciding which studies should be included in synthesis is a controversial topic. Glass (1976) and Hunter (1982) suggest that restriction of scope of studies considered for synthesis should be more topical than methodological.
They suggest that when a reviewer excludes studies because of methodological deficiencies, valuable information may be wasted based on an assertion (of deficiency) which is not tested. Glass proposes grouping studies on some measure of quality as an alternative to dropping studies thought to be poorly done. If no relationship is found between quality and outcome, there is no reason to drop the studies with low measures of quality. If a relationship is found, then the reviewer and those agreeing with the definition of quality should weight the better studies more heavily (Glass, 1976; Rosenthal and Rubin, 1978b). For studies of psychotherapy, Glass notes that the "mass of 'good, bad, and indifferent' reports" show almost exactly the same results.

Glass's procedure for study selection has been met with criticism by those who favor rejection of studies which utilize insufficient methodological controls (Eysenck, 1978). Pillemer and Light (1980) note that agreement on what constitutes a good study, or which of two measurement procedures provides better information, may be difficult to attain. They suggest that, although it seems sensible to exclude studies that do not meet basic methodological standards, variation in study designs may be an asset, as follows.

    An analyst actually may want to insure that several major types of designs are represented, not to combine them blindly, but so that the outcomes can be compared. He or she may choose to stratify by type of design and/or general type of outcome measure, and then randomly select a number of studies from each stratum. This would build diversity.

Voting Methods

Several voting methods use tests to determine whether the result of a voting method is statistically significant (Hunter, 1982). One set of methods rests on the assumption that if a null hypothesis is true, then the correlation between treatment and effect is zero (or if significance levels are used, half would be expected to be larger than 0.50 and half smaller than 0.50). Statistical tests are used to determine whether the observed frequency of findings is significantly different from the 50/50 split of positive and negative correlations expected if the null hypothesis were true. Thus, if 12 of 15 results are consistent in either direction, the sign test indicates that results so rare as this occur by chance only 3.6 percent of the time (Rosenthal, 1978a). In another approach to vote testing, the proportion of positive significant findings can be tested against the proportion expected under the null hypothesis (typically, p < 0.05 or p < 0.01). These methods do not take magnitude of effect into account.

Another set of voting methods estimates effect sizes across studies, given the sample sizes of the studies used and the relative proportion of studies showing positive and negative effects (Hedges and Olkin, 1980). Confidence intervals may also be constructed around the overall effect sizes. These methods rest on the assumption that there is one true effect size for the treatment and that each study represents a sample of a distribution of measurements taken of the true effect size. (This assumption, and pooling effect sizes, is discussed below.) Because these voting methods for estimating effect size do not use information about effect-size magnitudes from individual studies, their estimates of overall effect size are not as good as certain methods that do make use of such information when it is available (Hunter, 1982).

Cumulation of Significance Levels (p-Values)

Several methods may be used to cumulate significance levels across studies to produce an overall significance level for the set of studies as a whole. This enables
Cumulating significance levels enables an overall test of a common null hypothesis, which is generally that the compared groups have the same population mean outcome. Where this overall significance level is small enough, it is concluded that the treatment effect exists. A primary reason for cumulating significance levels is to increase power. The increased sample size of combined studies may detect differences which could not be detected by individual studies.

Rosenthal (1978a) has summarized and provided guidelines for using nine methods for cumulating significance levels across studies. The major advantages of these techniques are their computational simplicity, low informational requirements, and few formal assumptions. The general caution offered by Rosenthal is that the studies should have tested the same directional hypothesis. In general, these methods are helpful when the individual studies can be considered independent and random samples. Where the studies exhibit a split of significantly positive and significantly negative study outcomes, there may be systematic differences between some studies and others. In the case of a conflict, combining these studies might lead to a false conclusion, e.g., of no effect when there are different true effects. When conflicts arise, an explanation should be sought, and other more sensitive methods should be used (Pillemer and Light, 1980).

Other Statistical Synthesis Procedures

A variety of other methods exist for statistically combining the results of studies. Pillemer and Light (1980) summarize approaches for investigating interactions and comparing similarly labeled treatments that are more analytic than synthetic in that they pull together studies to search for variation and discrepancies. Rosenthal and Rubin (1978b) present a blocking technique involving comparisons of study outcomes using analysis of variance techniques. Light and Smith (1971) present a cluster approach where similarly labeled subgroups of studies must be screened before they are combined. The screening involves determining that the means, variances, relationships between dependent variables and covariates, subject-by-treatment interactions, and contextual effects are similar across subgroups. Once it is determined that subgroups are similar, or that proper statistical adjustments have been made to correct for explained differences, subgroups may be combined for overall tests (e.g., of significance). In addition to increasing confidence in overall findings, the major benefit of such approaches may be in identifying conflicts and discrepancies among studies which can provide clues to previously unknown variables moderating effects.

For summaries and reviews of the quantitative methods discussed in this section, see especially the collection in Light (1983) as well as Cook and Leviton (1980), Glass (1976), Glass et al. (1981), Hunter (1982), Jackson (1980), Light and Smith (1971), Pillemer and Light (1980), Rosenthal (1978a), Wortman (1983), and Wortman and Saxe (1982a). The review of validity considerations for the synthesis of medical evidence in these last two references is especially instructive.

The following example of synthesizing groups of studies by Baum et al. (1981) pools similar pairs of treatment and control groups progressively through time. This pooling has the effect of increasing overall sample size so that statistically significant differences between the pooled treatment and control groups are observed, where such differences had been observed in only a minority of the nonpooled groups.
Example of Synthesis

A synthesis was performed on 26 randomized control trials, published between 1965 and 1980, comparing the effects of antibiotic prophylaxis and no-treatment controls on postoperative wound infection and mortality following colon surgery. The synthesis cumulated treatment and control groups, respectively, year by year. (Using a technique described in Gilbert et al. [1977], the authors found that the pooled estimate of the within-trial variance of the difference between infection rates was larger than the estimate among trials, and likewise for the estimates of the variance of the difference in death rates. Given these findings, the authors judged the degree of homogeneity among trials sufficient for pooling.) For each additional year, cumulative effect sizes (i.e., differences between treatment and control groups) were calculated, and confidence intervals were determined around each. As early as 1975, pooled data indicate a noteworthy and significant difference in effect size between antibiotic prophylaxis and no-treatment controls for both infection and mortality rates.

The authors used the Mantel-Haenszel (1959) procedure to test the difference in infection rate and in mortality rate between treated patients and untreated controls. The difference between both cumulative infection and mortality rates was analyzed by the Z-test for the difference between proportions, and 95 percent confidence intervals for the true difference were calculated. The 95 percent confidence intervals were a 14 ± 6 percent difference for infection (36 percent for the control group versus 22 percent for the treatment group) and a 6.7 ± 4.4 percent difference for mortality (11.2 percent for the control group versus 4.5 percent for the treatment group).
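The Z-test and confidence interval for a difference between two proportions, as used in the analysis just described, can be sketched as follows. This is an illustrative calculation, not the authors' code. The infection rates (36 and 22 percent) come from the text, but the group sizes of 500 each are an assumption made purely for illustration, since only percentages are reported here. The function name and its parameters are likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_of_proportions(p_control, n_control, p_treated, n_treated, level=0.95):
    """Return (difference, CI half-width, z, two-sided p) for two proportions."""
    std_normal = NormalDist()
    diff = p_control - p_treated

    # Unpooled standard error, used for the confidence interval.
    se_ci = sqrt(p_control * (1 - p_control) / n_control
                 + p_treated * (1 - p_treated) / n_treated)
    half_width = std_normal.inv_cdf(0.5 + level / 2) * se_ci

    # Pooled standard error, used for the Z-test of "no difference".
    pooled = (p_control * n_control + p_treated * n_treated) / (n_control + n_treated)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treated))
    z = diff / se_test
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return diff, half_width, z, p_value


# Rates from the text; group sizes of 500 are hypothetical.
diff, half_width, z, p_value = diff_of_proportions(0.36, 500, 0.22, 500)
print(f"difference = {diff:.2f} +/- {half_width:.3f}, z = {z:.2f}, p = {p_value:.2g}")
```

With these assumed group sizes the half-width comes out between 5 and 6 percentage points. The interval narrows as pooled group sizes grow, which is why year-by-year cumulation can reach significance before any single trial does.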
By 1970, the significance level of the difference between treatment and control groups had reached p < 0.01, even though only one of the six studies conducted through 1970 was statistically significant.

Among the 14 RCTs conducted after 1975 (11 of which were significant), the 95 percent confidence intervals for the difference between treatment and control

groups were 26 ± 6 percent for infection and 5.3 ± 3.4 percent for mortality. The synthesis made significant findings for mortality possible, since no single study had enough data to demonstrate a significant difference between treatment and control for mortality.

The authors conclude that continued use of no-treatment control groups is inappropriate and that future studies should compare various prophylaxes for the surgery, using a previously proven standard of comparison. The authors note that this poses a "scientific dilemma" of diminishing returns, because it would take an RCT with more than 1,000 patients to demonstrate a reduction in infection rate from 10 to 5 percent. Nevertheless, 1,019 patients were involved in the 12 trials studied for 1965-1975, and 1,033 were involved in the 14 trials studied for 1978-1980.

GROUP JUDGMENT METHODS*

* Clifford Goodman contributed this section.

Group judgment efforts for evaluating medical technologies reflect interests in bridging gaps and resolving disparity among research findings, defining state of the art, and establishing medical and payment policies. Since 1977, the National Institutes of Health Consensus Development Program has conducted 50 conferences on a wide variety of biomedical technologies. The Clinical Efficacy Assessment Project of the American College of Physicians has evaluated more than 60 medical procedures and tests since 1981. The Diagnostic and Therapeutic Technology Assessment Program of the American Medical Association was initiated in 1982 to answer questions regarding the safety and effectiveness of medical technologies. The Health Care Financing Administration, Blue Cross and Blue Shield Association and member plans, and other third-party payers have expert panels for resolving coverage issues.
Evidence from well-designed clinical studies is unavailable for certain drugs (some of those first used prior to 1962), medical devices, and most medical and surgical procedures. The scope even of the best clinical studies rarely touches upon matters of cost, cost-effectiveness, and social and other issues relevant to policymaking of health care delivery and payment. Most clinical and payment policies rely on the implicit consensus of standard and accepted practice. However, clinicians, payers, and others increasingly seek a more explicit consensus. Thus, for many new technologies, and some others that have questionable utility, they establish panels of experts to distill available evidence, add informed opinion, and render findings that will guide policy. These panels weigh, integrate, and interpret evidence, experiences, beliefs, and values and then formulate guidelines, recommendations, or other findings. The evidence may consist of a sparse patchwork of contradictory research results of varying quality. The technologies being assessed may be rapidly evolving. Panelists are subject to biases and errors of reasoning; experts and nonexperts alike are subject to oversimplification, empiricism, case-selection biases, incentives and advocacy (Eddy, 1982).

The better group judgment efforts spell out their assumptions and identify inconsistencies, contradictions, and gaps in research. They provide participants the opportunity to learn from the perspectives and insights of others and have the means to disseminate their findings effectively. But they do not generate new scientific findings. (As discussed earlier in this chapter, some quantitative methods are available that, under specified conditions, may be used to combine research results to add precision to and strengthen significance of findings.) The finding of an expert panel that a procedure is "standard and accepted

practice" does not constitute new evidence of safety, efficacy, or cost-effectiveness.

This section describes two categories of group judgment methods. The first category includes two formal methods that have been applied in many fields, including health, and for which a considerable amount of literature describes methodology and applicability: the Delphi process and nominal group technique. The second category includes newer group methods designed specifically to develop documents for use by health practitioners and policymakers, e.g., NIH Consensus Development, Glaser's state-of-the-art process, and computerized knowledge bases.

Formal Group Methods

Delphi Technique

The Delphi technique is an interactive survey process that uses controlled feedback to isolated, anonymous (to each other) participants. The process normally includes (1) obtaining anonymous opinions from members of an expert group by formal questionnaire or individual interview; (2) obtaining several rounds of systematic modifications/criticisms of the summarized anonymous feedback provided to the groups; and (3) obtaining a group response by aggregation (often statistical) of individual opinions on the final round.

The Delphi technique was developed by the Rand Corporation to integrate expert opinion in making predictions for national defense needs (Dalkey, 1969). It is currently used in many fields to obtain predictions of events and estimates where empirical data are unavailable or uncertain. It is also used to generate forecasts, plans, problem definitions, programs and policies, summaries of current knowledge, and selections from alternatives (Olsen, 1982).
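The three-step Delphi process described above (anonymous estimates, rounds of summarized feedback, statistical aggregation on the final round) can be sketched for the simple case of a numerical estimate. This is a minimal illustration under stated assumptions, not a description of any published Delphi protocol: the function names are hypothetical, the median and quartiles are assumed as the feedback statistics, and the `revise` callback is an invented stand-in for how a real panelist might respond to feedback.

```python
from statistics import median, quantiles


def summarize_round(estimates):
    """Anonymous feedback for one round: median and quartiles of all estimates."""
    q1, q2, q3 = quantiles(estimates, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3}


def run_delphi(initial_estimates, revise, rounds=3):
    """Run feedback rounds, then aggregate the final estimates by their median.

    `revise(own_estimate, feedback)` models one participant's revision; in a
    real study this is the panelist's own judgment, not a formula.
    """
    estimates = list(initial_estimates)
    history = []
    for _ in range(rounds):
        feedback = summarize_round(estimates)  # no names attached
        history.append(feedback)
        estimates = [revise(e, feedback) for e in estimates]
    return median(estimates), history


def halfway_to_median(own_estimate, feedback):
    # Hypothetical behavior: each panelist moves halfway toward the group median.
    return own_estimate + 0.5 * (feedback["median"] - own_estimate)


final_estimate, history = run_delphi([10, 20, 30, 40, 100], halfway_to_median)
print("group response:", final_estimate)
for round_number, feedback in enumerate(history, start=1):
    print(round_number, feedback)
```

Using the median rather than the mean for the group response is a design choice made here so that a single extreme estimate has limited influence. The narrowing spread across rounds in this sketch follows from the assumed revision rule and should not be read as evidence about real panels.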
For health issues, the Delphi technique has been used to obtain estimates, where empirical data are insufficient, of influenza epidemic parameters (Schoenbaum et al., 1976b) and of incidence of disease and rates of adverse drug reactions in preventive treatment of tuberculosis (Koplan and Farer, 1980). The method has also been used in developing national drug abuse policy (Jillson, 1975; Policy Research Incorporated, 1975a,b), identifying statewide health problems (Moscovice et al., 1977), designing a statewide health policy research and development system (Gustafson et al., 1975), reaching consensus on physician practice criteria (Romm and Hulka, 1979), and examining the implications of advances in biomedical technology (Policy Research Incorporated, 1977).

The primary advantages attributed to the technique are that participants generally have no direct contact with each other, so that variables of professional status and personality have little chance to influence opinions as they might in face-to-face meetings; that the process can obtain opinions at low cost from geographically isolated participants; and that panelists may complete their questionnaires within their own time constraints (Delbecq et al., 1975; Olsen, 1982).

The Delphi technique does not provide the opportunity for clarification of ideas and other benefits of face-to-face interaction. Insights to be gained by considering conflicting or minority viewpoints may be obscured or lost through pooling of responses and ranking procedures (Delbecq et al., 1975; Linstone and Turoff, 1975). Although the method's reliability may increase with the numbers of participants and iterations, so does its cost. Participation also drops off with more iterations. Considerable concern has been voiced that the consensus achieved in some applications has been "forced," or is "artificial" (Sackman, 1975).
Although advantages and disadvantages of the Delphi technique have been cited by many, few of these are substantiated by rigorous study. Their relevance likely depends on the type of problem under consideration (Herbert and Yost, 1979).

Nominal Group Technique

The nominal group technique (NGT) is a group decision process developed by Delbecq et al. (1975). The product of an NGT is a list of ideas or statements rank-ordered according to importance. The process usually involves the following.

1. Participants generate silently, in writing, responses (opinions, rankings, views) to a given problem.
2. The responses are collected and posted, but not identified by the author, for all to see.
3. Responses are clarified by participants; a round-robin format may be used.
4. Further iterations of silent, written response, posting, and clarification may follow.
5. A final set of responses is established by voting/ranking.

Like the Delphi technique, NGT benefits from pooling opinion and certain group interactions, while postponing evaluation and criticism and minimizing certain effects of individual status and personality that may skew individual participation in less structured group interactions. Unlike the Delphi technique, NGT allows for immediate clarification of responses. Because it requires bringing participants together, NGT may be costly in certain situations.

NGT has been used to formulate priorities for quality assurance activity in multispecialty group clinics and their associated hospitals (Horn and Williamson, 1977; Williamson, 1978). Policy Research Incorporated (1975b) used an NGT to develop ranked national objectives and strategies against drug abuse. NGT was compared to a Delphi process for developing procedures for handling emergency medical services cases. Physician participants were assigned randomly to the NGT and Delphi groups to develop the procedures. Six months later, they were surveyed for their opinions regarding their respective groups' conclusions.
Although the degree of consensus reached by the nominal and Delphi groups was comparable, the NGT participants changed their opinions to a significantly greater extent than did the Delphi group participants (Thornell, 1981).

Group Methods Designed for Health Issues

NIH Consensus Development Conferences

The primary purpose of the National Institutes of Health Consensus Development Conferences is to evaluate the available scientific information on biomedical technologies and to produce consensus statements for use by health professionals and the public. Examples of technologies assessed are breast cancer screening, intraocular lens implantation, coronary bypass surgery, the Reye Syndrome, liver transplantation, and diagnostic ultrasound imaging in pregnancy.

Conference panels of 8 to 16 members usually address five or six predetermined questions. Panelists include research investigators in the relevant field; health professionals who use the technology; methodologists or evaluators such as epidemiologists and biostatisticians; and public representatives such as ethicists, lawyers, theologians, economists, public interest group representatives, and patients.

Consensus conferences are open meetings that usually last 2 1/2 days. The first 1 1/2 days are devoted to a plenary session in which experts or representatives of task forces present information on the state of the science and the safety and efficacy of the technology. These presentations are followed by an open discussion involving

speakers, panelists, and members of the audience. Following the plenary session, the panel drafts consensus answers to the predetermined questions. This draft document is read to the audience on the morning of the third day for further comment and discussion among the panel and audience. The panel may incorporate comments received during this session for inclusion in the final consensus statement. The process concludes with a press conference. Consensus statements may reflect opposing or alternative opinions if these exist; however, few statements thus far have reflected lack of consensus.

The NIH consensus format is not designed to limit problems associated with face-to-face interaction (e.g., relative dominance of viewpoints due to social or hierarchical factors) in group settings, as are Delphi, nominal group, or other group processes. The program has been exploring the use of decision analysis models as a reference framework to assist panelists in formulating consensus (S. G. Pauker, Tufts-New England Medical Center, personal communication, 1985).

The NIH Consensus Development program is one of few medical technology assessment programs that has undergone formal evaluation. An impact study of the 1978 consensus conference on supportive therapy in burn care indicated high clinician awareness of the conference's recommendations (Burke et al., 1981). The Office of Medical Applications of Research (OMAR) conducted a survey to measure physician awareness of two consensus conferences (computed tomography [CT] scan of the brain, 1981; and hip joint replacement, 1982) and their results, how that awareness was obtained, and the relative effectiveness of various means of information dissemination.
Among the samples of five physician specialties targeted for the CT scan conference, awareness that the conference was being held ranged between 11 and 37 percent, and awareness of its conclusions ranged from 1 to 15 percent. Among the samples of six physician specialties targeted for the hip joint conference, awareness that the conference was being held ranged between 7 and 21 percent, and awareness of its conclusions ranged from 0 to 10 percent. The study concluded that there is significant room for improvement in conveying information about the program, the individual conferences, and their results (Jacoby, 1983). The NIH program has implemented a number of recommendations made in a study conducted by the University of Michigan of the NIH consensus development process (Wortman and Vinokur, 1982b). The Rand Corporation is conducting a study, to be completed by 1985, of how consensus conferences have affected the knowledge, attitudes, and practices of health care professionals (Rand Corporation, 1983).

The NIH program has successfully established its role as provider of information, rather than as government regulator dictating methods of clinical practice. The issuance of consensus statements has not precipitated a flurry of malpractice actions based on consensus panel findings, and there is no evidence that the program has stifled innovation (Perry and Kalberer, 1980). Currently, at least one conference question solicits the panelists' opinion on directions for future research.

Other concerns have been at least partially alleviated by modifications in conference preparations and format. The earlier absence of biostatisticians, epidemiologists, and other methodologists who could speak on the validity of scant and/or controversial scientific evidence has been addressed by current panel representation requirements.
Concern has been voiced that consensus statements are prone to consist of generalities representing the lowest common denominator of discussion, i.e., the only points on which panel members can fully agree (e.g., Rennie, 1981).

Group judgment efforts such as the NIH

program often seek to bridge gaps in and otherwise make sense of available research, so as to provide guidance for clinical practice. In so doing, expert panels may render recommendations relying to some extent on suggestive but not rigorously founded clinical evidence, e.g., derived from weaker epidemiologic studies as opposed to randomized clinical trials. One recommendation of the NIH consensus panel on lowering blood cholesterol (to lower dietary cholesterol for all Americans age two onward) may have been such an instance. (See Kolata [1985], Lenfant et al. [1985], and Steinberg [1985] for discussion.) This is a methodological concern of any group judgment effort and is best addressed with documentation of group judgment methodology and the characteristics of the research considered and assumptions made by the panelists.

A number of consensus statements have fallen short of directly addressing certain prominent issues. In the 1981 Reye Syndrome consensus statement, none of the 15 conference questions addressed the controversial role of aspirin, although limitations of studies indicating its association with the Reye Syndrome were cited.

Consensus development conferences usually examine only the safety and effectiveness of medical technology. By not addressing such matters as cost and availability of other resources, many consensus statements may be of limited value in suggesting guidelines for use of technologies. The consensus statement on liver transplantation does not address the reimbursement of this very expensive procedure; the existing and potential demand for the procedure; or how such demand could be met in terms of available transplant teams, facilities, and donor organs.
At issue is the conflict between the intent to avoid conference topics for which insufficient data are available for reaching scientifically valid conclusions and pressure to hold conferences on controversial issues, as was the case with the liver transplantation conference.

State-of-the-Art Diagnosis and Care of COPD

Glaser (1980) coordinated a broad-based consensus development effort to compose a state-of-the-art journal article on the diagnosis and care of chronic obstructive pulmonary disease (COPD). This project, which took approximately 2 years from initiation to publication, involved a project team of 11 physician researcher-practitioners in the COPD field, a project facilitator, and a large network of reviewers in the field. The project consisted of the writing, review, and revision of 13 drafts. The first 5 drafts were composed and revised by a single project team author, incorporating the detailed review and critique of each draft by the other team members. Drafts 6 through 13 were reviewed by groups of reviewers chosen from a pool of 120 experts. The product, published in the Journal of the American Medical Association (Hodgkin et al., 1975), invited further critique for use in a revised and expanded state-of-the-art monograph. This monograph, which was also composed by the project team using outside reviewers, was published in 1979 by the American College of Chest Physicians (Hodgkin, 1979).

Medical Practice Information Project

The Medical Practice Information Demonstration Project conducted by Policy Research Incorporated (1979a,b) used expert teams to reach consensus on four aspects (epidemiology, diagnostic validity, therapeutic efficacy, and economics) of three health problems: bipolar disorder (characterized by manic and depressive states), malignant melanoma, and rheumatic heart disease.

The process involved health problem expert teams and research validation teams.
For each health problem, expert teams completed two instruments. In the first,

expert team members provided certain health problem data (e.g., incidence) and cited the sources of that data (e.g., empirical study, extrapolation from empirical study, or assumption). An independent research validation team assessed the documentary literature cited by the expert teams to determine how well the research design and statistical work of the sources supported the data.

Using the first instrument summary and the report of the research validation team, the four expert teams completed second iteration surveys of health problem data and sources and rated the probable validity of the cited data, i.e., the degree of certainty with which they held the data to be valid. The final report included narrative summaries for each of the four aspects of the health problem, best sources of information, documentation and rated validity of information, and policy implications.

Rand-UCLA Health Services Utilization

The Rand-University of California, Los Angeles (UCLA), study of health services utilization, scheduled for completion in 1985, uses consensus panels to (1) compile indications for performing selected medical and surgical procedures (e.g., coronary artery bypass surgery and upper gastrointestinal endoscopy), (2) select the indications that account for the majority of procedures, and (3) evaluate the relationship between frequently used reasons for performing the procedure and patients' health. Panelists participate both independently and as a convened group and have tasks of reviewing the literature, amending lists of indications for the use of a procedure, rating, and estimation. The panels have nine members, including specialists and internists, family practitioners, or other generalists. Prior to a 1-day meeting, panelists review staff-prepared literature reviews and sets of indications for a procedure.
Panelists may amend the list of indications and then rate the clinical appropriateness of each and select the most frequently used of the lot. At the consensus meeting, panelists discuss and rate in at least two rounds those indications that showed high disagreement in their initial ratings of clinical appropriateness and that account for the majority of the procedures. Panelists estimate the proportion of procedures for which each of the frequently used indications is responsible in a high- and low-use area of the country. Finally, panelists will rate each indication in terms of the improvement to be expected by use of the procedure (A. Fink and J. Kosecoff, Fink and Kosecoff, Inc., personal communication, 1983).

Computer Knowledge Bases

The National Library of Medicine (NLM) Lister Hill National Center for Biomedical Communications developed a computerized synthesis of information about a specific disease: viral hepatitis (Bernstein et al., 1980). This was a process for establishing and updating a state-of-the-knowledge base for viral hepatitis. The project was active from 1977 through 1983, when the NLM Knowledge Base Program was discontinued.

Because of the infeasibility of sifting through the massive amount of literature (16,000 publications on hepatitis in English over a 10-year period, drawn from 3,000 serials indexed by the NLM), the project's initial information sources were limited to 40 current review articles recommended by a few authorities in the field.

The initial knowledge base was formulated by consensus of a panel of 10 experts who reviewed a draft synthesis of the 40 selected review articles prepared by one person. Each expert reviewed the entire initial draft and reviewed in detail one-tenth of the draft. The experts identified weaknesses, inaccuracies, and missing information and made suggestions for changes. Decisions on inclusion or modification of

content were made by vote of the expert group, and areas of unresolved conflict were noted. Generally, when there were two or more (out of the 10-person panel) dissenting views, the paragraph under consideration was modified or reconsidered by the entire group. Developers of the Hepatitis Knowledge Base have initiated a new, similar project to develop and update a gastroenterology data base (L. M. Bernstein, Knowledge Systems, Inc., personal communication, 1983).

One of the noteworthy aspects of the Hepatitis Knowledge Base project was the use of a computer conferencing network as the principal medium of communication linking the geographically dispersed experts and project staff. This was the electronic information exchange system (EIES) under development and study by the National Science Foundation and operated by the New Jersey Institute of Technology (see, e.g., Hiltz and Turoff, 1978, and Siegel, 1980). EIES and other forms of computer support are under further development and application for group judgments in a number of fields (Turoff and Hiltz, 1982).

The advantages of computer conferencing (participants' independence of space and time constraints and automatic recording of data, communications, and other transactions) may be quite suitable for engaging group judgment efforts in medical technology assessment and for documenting, evaluating, and improving the group deliberative process.

Strengthening Group Judgment Efforts

The current level of group judgment efforts to determine how best to put medical technologies to use is not commensurate with the effort and care devoted to developing these technologies. Although we do quite well at assembling experts, we often provide them with inadequate, largely untested means for drawing upon their expertise and for organizing and weighing the evidence.
Group judgment methods currently used are difficult to validate, because they do not adequately document evidence and provide rationale for findings or provide estimates of the effects of adopting their recommendations. The following are guidelines that may be helpful in implementing and evaluating a successful group judgment effort.

1. The procedures and criteria for selecting topics and panel members should be documented.
2. Sponsors and panelists should agree on the nature and technical understanding of the intended users and the means used in disseminating findings.
3. A chair/facilitator should be selected who is a skillful moderator and working-group coordinator, with standing in the relevant field but not necessarily expert on the particular topic, and having no particular position on the topic.
4. Panelists should be chosen whose interests can be served by working on and using the findings of the process. They should represent the relevant medical specialties, general practitioners, methodologists such as epidemiologists and biostatisticians, economists, administrators, and others who can provide important perspectives.
5. The questions to be addressed by the panel should be specific and manageable, i.e., commensurate with the available data, the time available for the process, and other resources. Panelists should be able to participate in specifying the questions to be addressed, responsibilities for tasks, and project format.
6. An operational definition of consensus should be specified (e.g., full agreement, majority agreement, etc.), as well as how to present less than full agreement in the panel's findings (e.g., cite minority opinions).
7. Panelists should be provided with the most comprehensive scientific data possible. A summary description of the available studies (topic, study design, findings) should be provided and cited in the final panel statement.
8. Process methodology, facts, assumptions, estimates, criteria for findings, and rationale should be documented. Findings should include estimates of outcomes expected if the panel's recommendations are followed. This documentation should enhance internal consistency, allow others to check reasoning, and provide the basis for reassessment in light of new developments.
9. The panel should recommend research needed to resolve those issues concerning which it could not reach full agreement and to otherwise advance understanding of the topic.
10. The dissemination and effects of panel findings should be evaluated to enable improvement of the process. Panel members and other participants should be apprised of the evaluation findings.

Group judgment processes could be improved by answering several types of research questions.

• What conditions inhibit and enhance panelists' participation in group judgment?
• Do group judgment processes achieve increased understanding and convergence of opinion or lowest common denominator views of panelists?
• Are some processes better than others at achieving consensus?
• How effective are group judgment processes in modifying policies and practices concerning medical technology?
• What factors (scientific findings, ethical considerations, stature of other panelists, personal experience, etc.) are most important in influencing panelists' decisions?
• What factors (stature of panelists, identity of sponsoring organization, documentation of groups' reasoning, media used to disseminate findings, etc.) are most important in influencing the adoption of group judgment findings?
COST-EFFECTIVENESS AND COST-BENEFIT ANALYSES*

* This section was contributed by Morris Collen and Clifford Goodman.

The Office of Technology Assessment defined the terms cost-effectiveness analysis (CEA) and cost-benefit analysis (CBA) as formal analytic techniques for comparing the negative and positive consequences from the use of alternative technologies (OTA, 1980a,b). A continuum of such analyses is thus possible, involving measurements of the costs of using a technology, the effectiveness of the technology in achieving its intended objectives, and determination of the positive and negative benefits from both intended and unintended consequences (Arnstein, 1977).

The principal distinction between a CEA and a CBA is in the valuation of the effects/benefits of using the technologies. In measuring benefits, a CBA requires that all important effects and benefits be expressed in monetary (dollar) terms. Thus some estimates are required of the monetary value of all benefits gained so that they can be compared with all dollar costs expended (Cooper and Rice, 1976). CEA avoids the requirement of attributing a monetary value to life by simply counting the lives (or years of life) saved or lost. An attempt to assess the quality of life of the years saved usually weighs differences in health status (e.g., from a value of "0" for the state of death up a positive scale of values for decreasing disability and increasing health status).

The burgeoning interest in health care CEA/CBA dates from the late 1970s. This interest is derived largely from provider, payer, and consumer concern over increasing health care costs and governmental spending for health care services. (For an overview of the history of CEA/CBA and health care CEA/CBA concepts, methods, uses, applications, and relevant references, see OTA, 1980a,b.)

OTA (1980b) reported on a study that classified health care CEA/CBAs by medical function and year (1966-1978). This report (Table 3-6) shows not only the accelerating increase in the use of CEA/CBA but the change from the early emphasis on prevention studies to the current dominance of studies on diagnostic and treatment technologies.

An extensive survey and classification of the growth and composition of the CEA/CBA literature for the same period, 1966-1978, is also reported by Warner and Luce (1982) and in abbreviated form by Warner and Hutton (1980). This survey found over 500 references addressing CEA/CBA for health services. It identified a number of significant trends in the literature (Table 3-7). It reported that the considerable growth in the CEA/CBA literature over the period surveyed was more rapid in medical than in nonmedical journals and that a preference appeared to be emerging for CEA over CBA. This study also found that while the number of all types of CEA/CBA studies increased, those related to diagnosis and treatment technologies showed considerable increases in prominence relative to those on preventive health. The decision orientation of the studies shifted away from organizational and societal decision makers to those of individual practitioners. The authors observed that the rapid growth of the CEA/CBA literature over the period was not matched by adequate skill in methodology, noting a higher proportion of technically low-quality analyses in the later years than in the earlier years of the period surveyed.

In conducting a CEA/CBA, OTA (1978a, 1980b) recommended a series of steps to follow:

1. Define the problem for which the technology is used.
The problem, which should be stated as clearly and explicitly as possible, may be in clinical disease or treatment, preventive medicine, or in a health care process or service.

2. State the intended objectives for using the technology. These objectives may be expressed in terms of patient outcomes (e.g., decreasing mortality) or in terms of health care processes (e.g., decreasing costs).

3. Identify any alternative technology that can be used to achieve the stated objectives. Usually the analysis compares a new or modified technology with old or currently used technologies.

4. Analyze the effects and benefits resulting from the use of the technology. "Effectiveness" is generally expressed as the extent to which intended objectives are actually achieved in ordinary practice and is distinguished from "efficacy," which is usually defined as the probability or extent of achieving the objectives under ideal conditions (OTA, 1978a).

A wide variety of evaluative approaches, including randomized clinical trials and epidemiological studies, form the basis for assessing effectiveness of medical technology. Effects of a diagnostic technology may be expressed in terms such as the percentage of correct diagnoses achieved, time and cost to complete the diagnostic process, or cost per true-positive test. Effects of a treatment technology may be expressed in terms of disability, mortality, patient well-being or reassurance, or time and cost to complete the treatment process. Effects of a supporting/coordinating technology such as an information system may be expressed in terms of data error rates, response times to queries, or cost for retrieval per information unit.

TABLE 3-6 Numbers of Health Care CEA/CBAs by Medical Function and Year (1966-1978)

Year     Prevention   Diagnosis   Treatment   Other(a)
1966         0.0         0.0         0.0          5
1967         0.0         0.3         1.7          3
1968         2.5         3.0         3.5          6
1969         1.5         0.5         2.0          2
1970         3.0         2.0         3.0          8
1971         6.5         3.5         4.0         11
1972         7.0         2.0         4.0         14
1973        14.5         4.0        10.5         15
1974         2.5         5.0        14.0         22
1975         5.0        10.0        14.5         22
1976        15.0        16.0        28.0         33
1977        12.5        17.0        37.5         35
1978        18.0        25.5        18.5         31
Total       88.0        88.8       141.2        207

(a) Includes mixes of all three functions (prevention, diagnosis, and treatment), administration, general, and unknown.
SOURCE: Office of Technology Assessment (OTA, 1980b).

TABLE 3-7 Trends in Health Care CBA/CEA, 1966-1973 and 1974-1978(a)

Trend                                            1966-1973   1974-1978
Average annual number of publications               17.0        73.0
Publications in medical journals as percent
  of total journal publications                     40.2        62.7
CEAs as percent of CEAs + CBAs                      42.1        63.2
Percent of articles on:
  Prevention                                        44.7        22.0
  Diagnosis                                         18.8        30.9
  Treatment                                         36.5        47.2
Percent of articles with orientation of:
  Individual                                         8.3        15.8
  Organization                                      21.3        10.8
  Society                                           70.4        73.4

(a) All differences significant at p = 0.05.
SOURCE: Warner and Luce (1982).
All intended consequences (effects/benefits) should be studied, and all important unintended consequences should be identified and assessed (OTA, 1980b). Some effects/benefits will be positive (i.e., desirable), some will be negative (i.e., undesirable), and some may be indeterminate. Generally in CEA/CBA, all important effects/benefits should be considered to whomsoever they accrue (Klarman, 1973). Included are those affecting the individual patients, effects upon other health care resources/services, and effects on the family/society/employer. For some technologies it is not possible to ascertain final patient outcomes, so as a compromise one measures intermediate outcomes, such as the diagnostic accuracy of clinical testing procedures or the resultant changes in patients' smoking habits from a smoking cessation program.

5. Analyze costs associated with the use of the technology. Costs of the health care process should include all expenses to all participants resulting from the use of required resources (personnel, facilities, equipment, supplies), including the direct controllable costs and overhead uncontrollable costs (Klarman, 1973). Patient-consumer costs should be identified, including charges for services received, time and earnings lost from work, transportation costs, and any other expenses. For a technology system or program, opportunity cost should be considered as an estimate of the value of other opportunities that are forgone because of the investment in the specific technology selected. In CEA/CBA all negative costs, i.e., savings attributed to the use of the technology, are considered to be effects or benefits.

All future costs and monetary values of future benefits should be discounted to their present value in order for them to be compared appropriately with one another. The discount rate attempts to adjust for what a dollar invested today would earn in interest. For long-term projections, low discount rates tend to favor projects whose benefits accrue in the distant future (OTA, 1980a,b); accordingly, the selection of an appropriate discount rate is often controversial and is usually subjected to sensitivity analysis (redoing the calculations with different rates). OTA (1980b) provides an example of how the particular discount rate chosen can have a substantial impact on the outcome of the analysis, because investment in health programs often means spending present money (which is not discounted) for future benefits (which are). In such programs, the higher the discount rate, the less attractive the program appears.

As an example, suppose one spends $1,000 today, expecting to save $2,000 in medical costs 10 years later. In order to compare the expected benefit ($2,000 savings) with the costs of the program ($1,000), one must discount the benefit to its estimated "present value." Consider the varied results using different annual discount rates with a cost of $1,000 (in present dollars) and a benefit of $2,000 (in year 10).
Discount     Present value     Present value of
rate (%)     of benefit        net benefit (B - C)
    0          $2,000              $1,000
    5           1,228                 228
    7           1,017                  17
   10             771                -228

And, if the benefit were not realized for 20 years, the results would be:

Discount     Present value     Present value of
rate (%)     of benefit        net benefit (B - C)
    0          $2,000              $1,000
    5             754                -246
    7             517                -483
   10             297                -703

This example shows the power of discounting and the resultant importance of the choice of the discount rate.

6. Differentiate the perspective of the analysis. Since the explicit objectives sought may vary somewhat from the viewpoint of the patient, the physician, the administrator, and the policymaker, a comprehensive CEA/CBA may be very complex if the aim is to satisfy all participants. Objectives, benefits, and costs differ for each of these participants. Public societal benefits and costs often differ substantially from private benefits and costs. For the public policymaker, societal benefits sought may be primarily in cost reduction or improved accessibility of health care services. From the viewpoint of the private hospital administrator, the cost-effective capital-intensive technology may be that with the highest financial return on investment. The health care provider will seek the technology that minimizes his costs or maximizes the desired patient outcomes. From the viewpoint of the patient-consumer, the primary benefits desired are improved health outcome at an affordable cost, yet other important considerations are length of time to complete the care process and satisfaction with the process.

7. Analyze uncertainties. Relevant retrospective data for a CEA/CBA are often uncertain as to their accuracy, and sometimes they are entirely unavailable. Timely prospective data for predicting future events are rarely available. In such instances of uncertainty, a sensitivity analysis for important variables should be performed to test the sensitivity of the analytic results to potentially important variations in the data used. By a variety of techniques, such as consensus development among experts selected to minimize bias, estimates can be derived that can be used as substitutes (valid or not) for valid primary data. Usually a series of scenarios is tested in which various assumptions are specified for critical uncertain variables.

8. Interpret results. The results of the analysis should be discussed in terms of validity, sensitivity to changes in assumptions, likely variations in benefits and effects over time (e.g., by discounting), and implications for policymaking. If it is not possible to arrive at a single decision or recommendation, the important consequences from using the alternative technologies studied should be presented in order to decrease the uncertainty of decision making.

Important ethical, legal, or societal issues should be identified and their implications discussed. Strictly on grounds of efficiency, for a CEA the alternative with the lowest cost-effectiveness ratio would be preferable because it could achieve the desired objectives at the lowest cost. Similarly, for a CBA the alternative with the greatest net of benefits minus costs should be preferred, except that the monetary values attributed to years of life may make the results controversial. In any actual decision, however, policymakers should consider also social effects such as equity and political importance (Banta et al., 1981).
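The discounting arithmetic of step 5 and the sensitivity analysis of step 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OTA's procedure: the function names are hypothetical, annual compounding is assumed, and the only inputs are the $1,000 cost, the $2,000 benefit, and the discount rates of the example in step 5. Rounded net values may differ by a dollar from printed figures, depending on whether the figures are rounded or truncated.

```python
def present_value(amount, rate, years):
    """Discount a future amount to today's dollars (annual compounding assumed)."""
    return amount / (1.0 + rate) ** years


def net_present_benefit(benefit, cost, rate, years):
    """Present value of a future benefit minus a cost paid today (B - C)."""
    return present_value(benefit, rate, years) - cost


def break_even_rate(benefit, cost, years):
    """Discount rate at which the discounted benefit exactly equals the cost."""
    return (benefit / cost) ** (1.0 / years) - 1.0


COST = 1_000.0      # spent today, so it is not discounted
BENEFIT = 2_000.0   # expected future savings
RATES = (0.0, 0.05, 0.07, 0.10)

if __name__ == "__main__":
    # Sensitivity analysis: redo the calculation for each rate and time horizon.
    for years in (10, 20):
        print(f"Benefit realized in year {years}")
        for rate in RATES:
            pv = present_value(BENEFIT, rate, years)
            net = net_present_benefit(BENEFIT, COST, rate, years)
            print(f"  rate {rate:4.0%}  PV of benefit {pv:8,.0f}  net benefit {net:8,.0f}")
        print(f"  break-even rate {break_even_rate(BENEFIT, COST, years):.2%}")
```

The break-even rate is one compact way to report the sensitivity result: at any discount rate above it, the program no longer pays for itself in present-value terms.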
Social and ethical consequences of medical technology are increasingly being questioned in such applications as support systems for prolonging life in incurable terminal patients, organ transplants (e.g., heart, liver, and kidney) for which the demand exceeds the supply, and artificial organs (e.g., heart and kidney) where equity of funding and distribution will always be an issue.

An example of a CBA is given in Appendix 3-A of this chapter to illustrate the analytic process.

Uses of CEA/CBA

The uses of CEA/CBA for technology assessment can be categorized by (1) the type of technology (i.e., drugs, devices, procedures, instruments, equipment, or a group of these components combined into a system) and (2) the application of the technology (i.e., for medical diagnosis, medical treatment, preventive medicine, or for supporting/coordinating functions of medical services).

Drugs, chemicals, vaccines, and similar agents have been studied using CEA/CBA, with special consideration of their efficacy and safety (OTA, 1978a). OTA (1980a) proposed a hypothetical CEA model for assessing a drug's cost-effectiveness if the efficacy and safety of the drug could be quantified in measurable units of "net health effect." Then the "net cost of achieving a desired net health effect" (e.g., specified reduction in morbidity and mortality) could be derived by determining the cost of the drug and of the treatment of any of its side effects and subtracting the savings from the use of the drug. A cost-effectiveness ratio for the drug would be the net cost divided by units of net health effect. Similarly, the cost-effectiveness ratios could be derived and compared for alternative drugs or existing treatment modalities.

CEA/CBA have been applied to immunizations, such as for pneumococcal pneumonia (Patrick and Woolley, 1981; Willems et al., 1980), influenza (OTA, 1981a), and rubella (Schoenbaum et al., 1976a).
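The hypothetical drug model described above reduces to two lines of arithmetic, sketched here in Python. Everything in the block is an assumption made for illustration: the class and field names are invented, the dollar amounts and effect units are made-up numbers rather than data from any study, and the comparison rule (prefer the lowest ratio) is the efficiency criterion stated for CEA, not a complete decision procedure.

```python
from dataclasses import dataclass


@dataclass
class DrugAssessment:
    """Hypothetical inputs for one drug; all figures are illustrative only."""
    name: str
    drug_cost: float          # cost of the drug itself
    side_effect_cost: float   # cost of treating its side effects
    savings: float            # savings attributed to use of the drug
    net_health_effect: float  # units of net health effect achieved

    def net_cost(self):
        # Net cost = drug cost + side-effect treatment cost - savings.
        return self.drug_cost + self.side_effect_cost - self.savings

    def cost_effectiveness_ratio(self):
        # Ratio = net cost per unit of net health effect.
        if self.net_health_effect <= 0:
            raise ValueError("net health effect must be positive")
        return self.net_cost() / self.net_health_effect


# Two made-up alternatives aimed at the same objective.
drug_a = DrugAssessment("drug A", 500.0, 100.0, 200.0, 4.0)
drug_b = DrugAssessment("drug B", 900.0, 50.0, 350.0, 5.0)

# On efficiency grounds alone, the lowest ratio is preferred.
preferred = min((drug_a, drug_b), key=DrugAssessment.cost_effectiveness_ratio)

if __name__ == "__main__":
    for drug in (drug_a, drug_b):
        print(drug.name, drug.net_cost(), drug.cost_effectiveness_ratio())
    print("preferred on efficiency grounds:", preferred.name)
```

The ratio is meaningful only when the alternatives share the same unit of net health effect, which is why the sketch refuses a zero or negative effect rather than returning a number.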

CEA/CBA have been used for a variety of devices, instruments, machines, and equipment. Such assessments require detailed analysis of the process, technical procedures, and personnel using the devices; and they employ different analytic methods for diagnostic, therapeutic, or coordinating/supporting applications.

A notable example is the case of computed tomographic scanning (OTA, 1978b, 1981b). This study generally followed the traditional model for the economic evaluation of diagnostic procedures, and it considered outcomes, benefits, and effects. Usually the assessment of diagnostic technology separates the evaluation of the cost-effectiveness of the process in achieving its diagnostic objectives from the cost-effectiveness of subsequent treatment technologies, which have a different set of specific objectives (McNeil, 1979). A variety of diagnostic and screening tests have been studied, including those for hypertensive renovascular disease (McNeil et al., 1975), cancer (Eddy, 1980), multiphasic screening (Collen et al., 1970, 1973, 1977; Collen, 1979b,d), lead screening (Berwick and Komaroff, 1982), mammography (Collen, 1979a), diagnostic x-rays (Collen, 1983), and endoscopy (Showstack et al., 1981).

CEA/CBA for treatment technologies have been reported for a wide variety of therapeutic devices and procedures, such as surgery (Bunker et al., 1977), psychotherapy (OTA, 1980c), hemodialysis for end-stage renal disease (Stange and Sumner, 1978; OTA, 1981c), preoperative antimicrobial prophylaxis (Shapiro et al., 1983), and for therapeutic decision making in general (Pauker and Kassirer, 1975).

CEA/CBA have been used for assessment of multiple devices aggregated into complex systems (Collen, 1979c), such as medical information systems (Drazen and Metzger, 1981; Richart, 1974), and alternative health care programs such as ambulatory versus inpatient care (Berk and Chalmers, 1981).
See Appendix 3-A of this chapter for a more detailed example.

Capabilities and Limitations

OTA (1980a) has emphasized that CEA/CBA should not serve as the sole or primary determinant of a health care decision, but the CEA/CBA process could improve decision making by considering not only whether the technology is effective but also whether it is worth the cost.

In general, a CEA is most useful for making a choice as to the lowest cost technology to achieve a specified objective, benefit, or effect; and CBA is most useful for making a choice between technologies producing various objectives, benefits, or effects as to which could produce the highest value for the costs expended.

A CEA is especially useful for assessing the past performance of a technology when specific limited objectives are defined and reliable data are available to achieve these same defined objectives. Such retrospective analysis can be relatively simple and inexpensive and can be used to support rational decision making to the extent that the CEA does permit comparison of costs per unit of effectiveness among competing alternatives for achieving the same objectives. Still, the accurate determination of actual costs of resources used, or of appropriate associated incremental costs, is not always readily obtainable, and charges or fees for services are often substituted that may not be directly related to true costs (Finkler, 1982).

When a CEA extends the analysis to study unintended consequences from using alternative technologies, the analysis becomes more complex and expensive. Uncertain or missing data are then an important problem, and a sensitivity analysis becomes necessary. CEA does not permit comparison of complex technologies having different or multiple objectives associated with different process or outcome measures unless uniform composite indexes of outcome measures are used.

A CBA is capable of assessing the values of technologies that have differing objectives by converting all of their effects/benefits to dollar values. Thus, as has been emphasized, a CBA requires a dollar determination of added years of life, quality of life, etc., so that costs expended can be compared to the dollar value of benefits gained. The dilemma for a CBA of valuing life and death in monetary terms can be avoided by a CEA; however, a CEA is not as useful for setting policy priorities among different types of technologies because expressing all of the effects/benefits in equivalent units is usually possible only with dollars.

OTA (1980b) emphasizes that there are certain technical considerations that can significantly alter how a CBA is interpreted, as, for example, the use of net benefit (that is, benefit minus cost) rather than the cost-benefit ratio as a criterion to compare programs. The former (net benefit) approach is usually preferred, especially when the alternative programs are widely variant in scope. As an illustration, OTA (1980b) considered two programs. Program A costs $2,000 and reaps gross benefits of $4,000; program B costs $2 million and reaps gross benefits of $3 million. A net benefit approach yields the following results.

Program A    $4,000 - $2,000 = $2,000
Program B    $3 million - $2 million = $1 million

Clearly, program B is preferred, given the ability to finance the project and setting aside for the example all considerations of equity and distributional effects. However, a benefit-cost ratio (B/C) would yield the following results:

Program A    $4,000 / $2,000 = 2
Program B    $3 million / $2 million = 1.5

Now, program A is clearly preferred. Notice that the ratio gives the reader no indication of the size of the expected benefits, nor the size of the program.
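The reversal between the two criteria in OTA's illustration can be checked directly. The short Python sketch below uses only the costs and gross benefits stated for programs A and B; the dictionary layout and function names are assumptions made for the example.

```python
# Costs and gross benefits (in dollars) as stated in OTA's illustration.
programs = {
    "A": {"cost": 2_000, "benefit": 4_000},
    "B": {"cost": 2_000_000, "benefit": 3_000_000},
}


def net_benefit(program):
    """Benefit minus cost (B - C)."""
    return program["benefit"] - program["cost"]


def benefit_cost_ratio(program):
    """Benefit divided by cost (B/C)."""
    return program["benefit"] / program["cost"]


# Rank the same two programs under each criterion.
best_by_net = max(programs, key=lambda name: net_benefit(programs[name]))
best_by_ratio = max(programs, key=lambda name: benefit_cost_ratio(programs[name]))

if __name__ == "__main__":
    for name, program in programs.items():
        print(name, net_benefit(program), benefit_cost_ratio(program))
    print("preferred by net benefit:", best_by_net)
    print("preferred by B/C ratio:", best_by_ratio)
```

Reporting both figures side by side, rather than a single ratio, keeps the scale of each program visible to the decision maker.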
Also, although program A gives a better rate of return for the money invested, there is no reason to believe that it can be increased in scale and still maintain the high rate of return. Sometimes a marginal analysis (i.e., the additional benefit derived from adding one more unit of expense) may help determine the optimal size of a program and the point at which a given technology is no longer cost-effective.

Because CEA/CBA are primarily economic types of analysis and most useful for cost-containment decisions, they are limited in their ability to help with policy decisions that affect primarily the quality of care. Valid quantitative measures of effects and benefits of quality of care are not available, and the validity of the estimates of any such variables used is controversial. Similarly, social values, ethical considerations, and political realities may well take precedence over analytical economic results (Banta et al., 1981). OTA (1980a) has noted the conflict between equity and efficiency as an important issue in the use of CEA/CBA and cites the difficulties of measuring a person's worth; of rating better or worse welfare states; of assigning values to equity, fairness, and justice; and of valuing lives.

Although significant advances have been made in rational clinical decision making (Weinstein and Fineberg, 1980), OTA (1980a) has pointed out that CEA/CBA has had little relevance to decision making in practice because the primary focus of CEA/CBA is cost-effectiveness from a societal or policymaking viewpoint. In addition, since the physician's major responsibility is to the patient-consumer, the perspective of the physician is often very different from that of the policymaker.

The stage of development of the technology is an important factor affecting the validity of the analysis. Often, for a new technology when an assessment might be especially useful, insufficient time has elapsed to permit adequate reliable evaluative data to have accumulated.

CEA/CBA can be a useful tool for planning for the future, and prospective analytic simulation models can attempt to predict costs and effects/benefits of competing alternative programs. OTA (1980a) emphasizes the importance of sensitivity analysis to cope with the problem of missing data and the uncertainties about the future by testing a range of discount rates, varying the weights used to compute quality-adjusted life expectancy, and testing all important variables over a range from best to worst cases.

OTA (1980a) emphasized the infinite number of unintended consequences (also called externalities, second-order effects, side effects, spillovers, or unintentional effects from using the technology), such as the effects on technical manpower and the training programs needed for a new technology. The costs of such important effects should be estimated and included in the CEA/CBA.

Strengthening Use of CEA/CBA

It is an important question as to what extent CBA/CEA are actually used by policy decision makers. Certainly, the usefulness of CEA/CBA will depend upon the importance of the technology in affecting medical care costs and patient outcomes. Accordingly, the criteria for selecting medical technology for CEA/CBA should recognize that approximate analyses of timely technology can be more useful than certain analysis for unimportant technology.
Also, the usefulness of CEA/CBA for decreasing the uncertainty of policy decision making for cost-containment or budgetary planning can be enhanced by judicious application of sensitivity analysis. For missing or uncertain data, an appropriate group of experts, selected carefully to minimize bias, can use consensus development techniques to provide credible estimates of missing data. Then by studying a variety of assumptions for important variables and by using middle, low, and high values to appropriately express realistic, optimistic, and pessimistic scenarios, the policymaker may estimate the limits of errors in projected costs and establish minimum, maximum, and break-even costs for the program.

Better methods also are needed for measuring the health status of individuals and of groups and for valuing changes in health status. Any important ethical, legal, societal, and political implications of using the technology will need to be considered in the process of making policy decisions.

A problem of all comparative secondary analysis, including CEA/CBA, is the lack of standardization of primary component evaluations so that data from different sources cannot be appropriately combined. The development of better organized and standardized data collection methods would greatly facilitate CEA/CBA, and the promulgation of standard preferred methods for analysis would encourage their wider use (Institute of Medicine, 1981).

The usefulness of CBA/CEA can be increased by improved analytic methodology. Better methods are needed for imputing or substituting for missing data. Methodology used should be understandable by the policymakers who need the information to make their decisions, and conclusions or recommendations should be supported by the best data available.

The analytic methodology and the data used should be credible and presented in a form understandable to the decision makers. Data interpretation and recommendations should be separated from data analysis so decision makers can review the data and minimize possible biases introduced by the evaluator's conclusions. OTA (1980a) emphasizes that many methodological weaknesses of CEA/CBA may be hidden by the process of deriving a numerical cost-benefit or cost-effectiveness ratio and encourages arraying all the elements that are included in the decisions. Thus, sometimes a tabular array of the data can enable useful comparisons and inferences.

Recommendations

Cost-effectiveness analysis and cost-benefit analysis are assessment methods for an economic analysis of the positive and the negative consequences from the use of alternative technologies. A formal series of steps is usually followed in conducting a CEA/CBA: define the problem, determine the objectives for using the technology, identify the alternative technologies, analyze the intended effects and benefits and also all the important unintended consequences, analyze all costs, differentiate the prospective user of the analysis (i.e., policymaker, health care providers, patient), analyze uncertainties, and finally, interpret the results in a manner to decrease the uncertainty of decision making.

CEAs are more commonly done than CBAs. CEAs are more useful for making a choice as to the lowest cost technology to achieve a specified objective, benefit, or effect. CBAs are more difficult to do because all effects and benefits must be expressed in monetary terms. However, a CBA is most useful in making a choice between technologies producing different objectives, benefits, or effects. CEA/CBA are useful for aiding in policy-level decision making but have little relevance to clinical decision making in medical practice.
CEA/CBA can be useful for planning and usually employ sensitivity analysis for uncertainties of the future, such as by testing the effects of a range of discount rates on results. Better methodology is still needed for such tasks as valuing changes in health status.

The usefulness of CEA/CBA should be improved through studies to develop better methods for expressing the value of changes in health status and measuring the quality of life during years saved by the use of the technologies.

TECHNOLOGY ASSESSMENT: THE ROLE OF MATHEMATICAL MODELING*

* This section was drafted by David M. Eddy.

A model is a representation of the real world. A mathematical model is characterized by the use of mathematics to represent the parts of the real world that are of interest in a particular problem and the relationships between those parts. With respect to technology assessment, mathematical models can help describe the relationship between a technology and the clinical conditions it is intended to affect and predict how the use of that technology will affect medically important outcomes.

Mathematical models have proved useful in a broad range of applications pertinent to the assessment of medical technologies. The analytical methods of statistics, economics, decision analysis, epidemiology, and cost-effectiveness analysis are all built on mathematical models. This section will focus on another category of applications: the use of mathematical models to describe the natural history of a medical condition and how the natural history is affected by the medical procedure. In this chapter, the term mathematical model will be restricted to this category of applications.

Mathematical models have been used successfully to assess a wide variety of medical technologies. Examples include an analysis of treatment and prevention of myocardial infarctions (Cretin, 1977), the value of continued stay certification (Averill and McMahon, 1977), a comparison of hysterectomy and tubal ligation for sterilization (Deane and Ulene, 1977), vaccination for swine influenza (Schoenbaum et al., 1976b), screening and treatment of hypertension (Weinstein and Stason, 1976; Stason and Weinstein, 1977), and screening for cancer (Eddy, 1980; Shwartz, 1978). Additional applications are described in several collections and reviews (Bunker et al., 1977; OTA, 1981d; Warner and Luce, 1982). Used properly, mathematical models can be powerful tools in the assessment of medical technologies.

Background

Because the use of mathematical models in medical technology assessment is comparatively new, it is important to understand how they relate to more traditional methods of technology assessment.

The task of technology assessment is to estimate the consequences of using a technology in a particular setting. Ideally, this would be accomplished by conducting an experiment, that is, applying the technology in the setting of interest and observing the results. For a number of reasons, however, this is not possible. Most important, there are too many possible settings. A medical technology is not a static item; it takes a variety of forms depending on who is using it, on whom, when, and how. A diagnostic test can be preceded or followed by other tests; can be used at different times in the course of a patient's condition; can be used on patients with different types of problems, different ages, and different risk factors; can be used with different techniques; can be interpreted against different criteria; and can be followed by different therapies. Therapeutic and other types of medical technologies can present in equally diverse ways.
To study with traditional experimental methods only one manifestation of a technology in one particular setting is difficult, time-consuming, and expensive; to study all of its potential modes of use is impossible.

Even when a study is designed for one particular setting, other problems arise. One may have to wait years for results, leaving the question of what to do today. Furthermore, the disease or the technology could change while the study is in progress, raising new questions about the interpretation and applicability of the results.

Because of these problems, a technology assessment is usually conducted in two steps. In the first, the investigator gathers the available information about the performance of the technology, focusing in particular on its performance in circumstances that are related as closely as possible to the circumstances of interest. In the second, that information is processed to estimate how the technology would perform if it were applied in the actual circumstances of interest.

Many methods are available for gathering information about the impact of a technology in a particular setting. These include randomized controlled trials (RCTs), community trials, case-control studies, and other experimental and epidemiological methods that are discussed elsewhere in this chapter. Each of these techniques makes observations and gathers primary data about how the technology behaved in a particular set of circumstances, but they do not tell how the technology will behave in new settings. To learn that requires the second step of a technology assessment. The main role of mathematical modeling in technology assessment is to assist in that step: to help investigators process the observations made in experimental and epidemiological studies to estimate what would be expected to happen in circumstances that either have not been or cannot be observed.

An Example

As an example (see Appendix 3-B of this chapter for more detail), suppose one wanted to assess the value of Pap smears for asymptomatic women in San Diego in the mid-1980s. The list of possible circumstances in which the Pap smear could be used is long. Should it be done at all? If so, should it be done on women starting at age 18, 20, 25, 30, or any other age? Should it be done every 6 months, every year, every 5 years? Should the ages or frequencies be different depending on a woman's family history, sexual practices, smoking habits, age, or medical condition? Should the examination be performed by nurses, internists, or gynecologists? Should the examination be done in offices, special clinics, or mobile units? At what age might screening be stopped? Once all these questions are answered for San Diego, they can be reasked for Dallas. And so forth. These are all important questions; one way or another, implicitly or explicitly, correctly or poorly, each one must be and will be answered every time a recommendation to perform Pap smears is made.

It is clearly impossible to study all the possible applications of a Pap smear with experimental and epidemiological methods. For example, only to compare the effects of annual and triennial Pap smears on mortality in a randomized controlled trial (without trying to learn anything about the ages of screening, risk factors, or any other variables) would require a sample size of about one million women, followed for about two decades.

Because of these limitations, the assessment of the Pap smear today must be pieced together from information that exists, derived from many different sources. One source consists of more than a dozen studies of what happened when the Pap smear was introduced in large populations.
For this source, there are usually no concurrent (much less randomized) controls, and issues like age, risk factors, and the type of delivery system, even issues like which women are getting the test and how frequently, can rarely be studied with any precision. Other sources of information include scores of studies on age-specific incidence rates, risk factors, the natural history of the disease, the sensitivity and specificity of the Pap smear, the proportion of lesions detected in different stages in different programs, survival rates, mortality from other causes, the cost of the test, and the cost of treatment. By default, statements about the value of the Pap smear in San Diego and policies about the ages, risk categories, and frequency of screening must be based on an integration of all these pieces of information.

The Role of Models

Processing or integrating information from different sources requires a model, some method for representing how all the information fits together and what it implies about the value of the technology.

Mental Models. By far the most common model is the mental model, in which the person who is assessing the technology thinks about the pertinent information and mentally estimates the consequences of using the technology in the circumstances of interest. A common name for this is clinical judgment. The mental model may be a very simple one (for example, the assessor may be willing to assume that what happened in a Pap smear-screening program in Louisville, Kentucky, in the 1960s will apply to San Diego in the 1980s, and may be willing to ignore factors like age, risk, and technique), but it is still a model. Any physician or policymaker who recommends a particular program for a Pap smear must have considered some observations and must have made some estimates, however crude, of what would happen if that recommendation were followed.
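As a small illustration of what integrating separate pieces of information involves, the following sketch combines three of the ingredients listed above (the prevalence of disease and the sensitivity and specificity of a test) into the probability that a positive result reflects true disease. It is not drawn from any of the studies discussed here; the function name and every number in it are hypothetical placeholders chosen only to show the arithmetic.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of true disease given a positive test (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only (not from any cited study).
    assumed_sensitivity = 0.80
    assumed_specificity = 0.95
    for assumed_prevalence in (0.001, 0.005, 0.02):
        ppv = positive_predictive_value(
            assumed_prevalence, assumed_sensitivity, assumed_specificity
        )
        print(f"prevalence {assumed_prevalence:.3f}: "
              f"P(disease | positive test) = {ppv:.3f}")
```

Even with only three inputs, the way the answer shifts with prevalence is not easy to track mentally, which is the difficulty taken up next.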

Mathematical Models. The drawbacks of mental models are obvious: the complexity of most medical technology assessment problems simply exceeds the capacity of the unaided human mind. It is impossible to keep all the factors and numbers straight and to perform all the calculations correctly in one's head. This raises the need for mathematical models. A mathematical model is a formalization of mental modeling. While not inherently different from mental models in intent or general approach, mathematical models have qualities that make them useful in the assessment of medical technologies. First, they can encompass a large number of variables, they permit the expression of complicated relationships between the variables, and they provide rules to ensure that calculations are correct. With the use of computers there is virtually no limit to the number of factors that can be included, the complexity of the formulas, or the number of computations. Second, because mathematical models are explicit, they force one to be precise in making definitions, stating assumptions, and stating numbers. Furthermore, they permit others to review the factors, assumptions, numbers, and reasoning. But the most important feature of mathematical models is that they transform the essential features of a problem into a symbolic language that, unlike English, can be manipulated to gain insights and see conclusions that are otherwise invisible. To appreciate the power of mathematical models compared with mental models, consider estimating your income tax without using addition or multiplication.

Uses of Mathematical Models in Technology Assessment

Estimating Outcomes. The most important use of a mathematical model is to help integrate the results of more traditional methods of experimental and epidemiological studies to estimate the consequences or outcomes of applying a technology in a particular setting.
Its potential for this use covers a broad spectrum, depending on the questions being asked and the number and quality of available studies. Toward one end of the spectrum, a mathematical model can extend the results of a particular research project to examine its implications for a new setting that differs only slightly from the setting of the original project. For example, the Health Insurance Plan of Greater New York (HIP) conducted a randomized controlled trial of breast cancer screening, providing direct observations of the effect of an annual mammogram and an annual physical examination on a specific population of women in New York in the late 1960s. If one wanted to assess the value of breast cancer screening today in a 50-year-old woman in Oregon whose mother had breast cancer, or the value of doing a breast physical examination only, or the value of a biennial mammogram, it is possible to build a mathematical model that uses the observations of the HIP study to study these new issues. Indeed, mathematical modeling may be the best way to address these issues, being faster and less expensive than a new RCT and more accurate than simply assuming that what happened in New York 15 years ago will happen in Oregon today (ignoring factors such as age, risk factors, and mammography technique).

Toward the other end of the spectrum, mathematical models can be used to study assessment problems that have never been the subject of any comprehensive experimental studies. In these cases, for which there are no results from closely related studies to examine (like the HIP study), the only available approach is to try to integrate the results of a variety of studies about particular parts of the problem. The assessment of the frequency of the Pap smear is a good example. A mathematical model can integrate information about dozens of factors to provide estimated outcomes that never have been observed in any study, such as the increase in probability of death from cervical cancer, long-term costs of screening and treatment, and so forth. A great number of medical technologies present assessment problems of this type and have been successfully addressed in studies such as those previously cited.

Additional Uses. Although the main use of a mathematical model is to estimate the outcomes of applying a technology in various settings, there are other important uses. These include the analysis of disease dynamics, hypothesis testing, research planning, and communication.

First, mathematical models can use information from carefully designed experimental and epidemiological studies to improve our understanding of the etiology and natural history of diseases. For example, the duration and reversibility of carcinoma in situ of the cervix are important determinants of screening, treatment, and prognosis for that disease. But neither the duration nor the reversibility can be observed directly. With a mathematical model it is possible to estimate the pertinent parameters for these variables from observable data (Shur, 1981).

A second, similar function of models is that they can be used to test or validate hypotheses about the natural history of a disease or the effects of a technology on the disease. In addressing such problems, an investigator typically faces a collection of observations and must formulate hypotheses about the underlying dynamics of the disease and the impact of the technology that explain the observations. Mathematical models can be created to describe the hypothesized dynamics, parameters can be fitted, and results can be predicted. The extent to which the values predicted by the model fit the observations provides evidence about the validity of the hypothesis.
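The fit-and-compare cycle just described can be sketched in a few lines. The example below is only an illustration under stated assumptions: it hypothesizes that the fraction of lesions remaining in a preclinical state declines exponentially at a constant rate, fits that rate to a set of invented observations by a simple grid search, and reports how far the fitted predictions lie from the observations. The data, the exponential form, and all names are hypothetical and are not taken from Shur (1981) or any other study.

```python
import math


def predicted_fraction(rate, years):
    """Hypothesized dynamics: fraction still in the state after `years`."""
    return math.exp(-rate * years)


def sum_squared_error(rate, observations):
    """Misfit between the model's predictions and the observed fractions."""
    return sum((predicted_fraction(rate, t) - observed) ** 2
               for t, observed in observations)


def fit_rate(observations, candidate_rates):
    """Pick the candidate rate whose predictions best match the observations."""
    return min(candidate_rates, key=lambda r: sum_squared_error(r, observations))


if __name__ == "__main__":
    # Invented (years, fraction remaining) pairs, for illustration only.
    observations = [(1, 0.90), (2, 0.82), (5, 0.61), (10, 0.37)]
    candidate_rates = [i / 1000 for i in range(1, 501)]

    best = fit_rate(observations, candidate_rates)
    print(f"fitted rate per year: {best:.3f}")
    for t, observed in observations:
        print(f"year {t:>2}: observed {observed:.2f}, "
              f"predicted {predicted_fraction(best, t):.2f}")
```

Large, systematic gaps between the predicted and observed columns would count as evidence against the hypothesized form; a close fit is only weak support, for reasons discussed under validation.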
Third, when models are used to estimate the outcomes of applying a technology in different settings, an investigator can explore the value of collecting additional information by noting the sensitivity of various outcome measures to variations in assumptions and input values and by identifying areas of a problem that deserve more research. By comparing the value of additional information about a parameter with the cost of obtaining that information, research priorities can be set.

Finally, irrespective of their value in calculating estimates of outcomes, mathematical models can be powerful communication tools. Mathematical models force investigators to be explicit and precise, to define their terms, and to express their ideas in unambiguous terms. Furthermore, the entire exercise is open to view and criticism. A related use of models is to provide a framework for consensus formation. It is often desirable to have many experts from a variety of backgrounds concentrate together on an assessment. Mathematical models can focus this energy, forcing participants to agree on such basic ingredients to an assessment as the objectives, options, definitions, structure of the problem, basic facts, and values, or to identify explicitly their differences of opinion (e.g., Barron and Richart, 1981; Eddy, 1981; Galliher, 1981; Richart, 1981).

Types of Mathematical Models

The principles of mathematical modeling are simple and follow closely the intuitive process that forms the basis of mental models. The first step is to identify the important factors or variables that determine the value of the technology. The next step is to define the relationships between those variables that determine how a change in one variable affects another. The distinguishing feature of a mathematical model is the use of mathematics to define the relationships between variables. Simple examples are the balance sheet of a bank account and formulas such as distance = rate × time, or total cost = unit cost × number of units.

Many different types of mathematical models can be used to assess a medical technology, depending on whether the problem can be modeled as discrete or continuous, deterministic or probabilistic, or static or dynamic, and depending on other modeling decisions such as the appropriate number of dimensions or distributional assumptions. The particular methods will not be cataloged here but range from techniques as simple as traditional "back of the envelope" arithmetic to far more complicated models that require a page, a pad, or a computer to store the variables and perform the calculations.

Like experimental and epidemiological methods, mathematical models can have varying degrees of detail and complexity, and their development can require different amounts of time and money. For example, to study the question of breast cancer screening in high-risk women, one might use a very simple mathematical model such as assuming that a positive family history of breast cancer implies a relative risk of two, and then multiplying the pertinent results of the HIP study by two. On the other hand, to address the same question, a much more complicated model could be developed, involving a detailed analysis of age-specific incidence rates in women with particular risk factors, incidence rates for other nonmalignant conditions in these women, participation rates of high-risk women in screening programs, compliance rates of such women with postscreening recommendations, response to treatment, and so forth.
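The very simple model described above amounts to a single multiplication, and a sketch shows how easily its key assumption can be varied. In the example below, the baseline benefit is a hypothetical placeholder, not a result of the HIP study, and the function names are illustrative only. Recomputing the estimate over a range of assumed relative risks is a minimal one-way sensitivity analysis.

```python
def benefit_in_high_risk_women(baseline_benefit, relative_risk):
    """Simple proportional model: benefit scales with the relative risk."""
    return baseline_benefit * relative_risk


def one_way_sensitivity(baseline_benefit, relative_risks):
    """Recompute the estimate for each assumed value of the relative risk."""
    return {rr: benefit_in_high_risk_women(baseline_benefit, rr)
            for rr in relative_risks}


# Hypothetical placeholder: deaths averted per 1,000 women screened in an
# average-risk trial population. This is NOT a figure from the HIP study.
BASELINE_BENEFIT = 1.0

if __name__ == "__main__":
    results = one_way_sensitivity(BASELINE_BENEFIT, [1.5, 2.0, 3.0])
    for relative_risk, benefit in results.items():
        print(f"assumed relative risk {relative_risk:.1f}: "
              f"estimated benefit {benefit:.2f} per 1,000 women")
```

The more complicated model sketched in the text would replace the single multiplication with separate terms for incidence, participation, compliance, and response to treatment, each with its own data source and uncertainty.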
As in the choice of an appropriate experimental or epidemiological study, the choice of an appropriate mathematical model depends on judgments about the likelihood that different methods will yield different conclusions and the expected importance of different conclusions in terms of the actions they imply and the consequences of those actions. These judgments about which factors should be included in a mathematical model, and how the relationships between the factors should be translated into the language of mathematics, form the art of mathematical modeling.

Validation of Mathematical Models

It is important to have some measure of how well a given model can predict a set of outcomes. The most obvious requirement is that the structure of the model makes sense to people who have good knowledge of the problem. Factors they consider to be important should be included in the model; the mathematical functions used should appeal to their intuitions. They should agree that the data sources are reasonable, and so forth. The concurrence of experts, therefore, might be considered a first-order validation.

The next approach is to compare estimates made by a model with actual observations. However, this is far more complicated than it appears because most good models are built from actual observations. Since the structure and parameters of the models are estimated to predict the observations, it should be no surprise when they do. Nonetheless, not all models pass this test, and it is reasonable to define a second-order validation: any model should be able to match the data used to estimate parameters. Failure to pass this test strongly suggests that the structure of the model is faulty.

A third-order validation could be made by comparing the predictions of a model with observations that were never used to construct the model. In theory, a model can be constructed using one set of existing data and tested against a different set of existing data (e.g., Shwartz, 1978). However, there may be a trade-off between using all the available data to construct the model, which yields a more accurate model (in the sense that it can replicate the observed data more closely) but prohibits this type of validation, and using only part of the data to construct the model (which may reduce its accuracy) and saving the remaining data for validation. Note that for validation, one might use only part of the data. Once that assessment is completed, and the investigator is satisfied with the method, the whole data set can be used to better estimate the required parameters for future work.

First- and second-order model validations are made even more complicated by two facts. First, some observations are far easier to match than others. It is possible to vary some model parameters drastically and still have the model generate some estimates that are always close to some observations. A close fit in such instances is almost meaningless, and the weight to be placed on a first- or second-order validation will depend not only on the number of observations the model can predict and the accuracy of the predictions, but also on the sensitivity of predictions to the model parameters about which there is the greatest uncertainty. The second fact is that observations themselves could be wrong in the sense that they do not represent the population mean.

A fourth-order validation could be defined by comparing the outcomes predicted by a model for a new and previously unobserved program with the actual outcomes of that program when it eventually is conducted. Unfortunately, this too may not be meaningful because the actual conditions under which a program is eventually conducted can be quite different from the operating conditions assumed when the model was constructed.
Changes in the technology itself; the age, risk, and behavior of the patients; the institutional setting; and many other factors can make comparisons meaningless. Beyond this, the random component to the outcomes of any clinical trial can prevent the predicted and observed outcomes from matching, even if a model is perfect.

In brief, as important as this problem is, there is no simple and universally applicable procedure for validating a model. Each case must be considered by itself. In many cases only a first-order validation will be possible, and only in very rare cases will a fourth-order validation be possible. This should not, however, prohibit the use of models. The decision to use a model should be based on a comparison with the validity of the other techniques that might be used to assess the technology. For example, what is the validity of the mental models or clinical judgments that form the basis for the overwhelming majority of assessments?

Limitations

First, unlike the techniques for gathering primary information discussed in earlier chapters, a mathematical model does not provide any new observations. Because of this, a mathematical model cannot assess or validate a technology in the sense of documenting its impact with calibrated observations.

Second, to the extent that a model is based on subjective clinical judgments about the pathophysiology and clinical dynamics of a problem, a mathematical model will perpetuate any errors in these judgments, a variation on the theme of "garbage in, garbage out." For example, a mathematical model based on the testimony of eighteenth century experts would have "confirmed" the value of leeching. Building models can expose gaps, inconsistencies, and errors in reasoning, but to the extent that current clinical knowledge is incorrect, the errors can appear in the models and an erroneous model will pass a first-order validation. Because of this, to the greatest extent possible, models should be based on observations from well-designed studies rather than subjective judgments, and the results of a mathematical model should never be preferred to the results of actual clinical experiments, when they are available. Needless to say, this problem is even more severe for mental models, which rely almost exclusively on subjective judgments.

Third, mathematical models can be poorly designed. Most medical technology assessment problems, especially those that require mathematical modeling, are complicated. Creating a mathematical model of such problems requires a good knowledge of medicine, technology, mathematics, and modeling. One must be able to sense the structure of the problem, identify the important factors, appreciate what simplifications are appropriate and what are not, and write reasonable equations. It is easy to make mistakes. The most common error is to make unreasonable simplifications. Any model must simplify reality; this by itself does not detract from a model's value, and indeed one of the main purposes of a model is to help separate the important from the unimportant. The problem arises not with simplification but with oversimplification, which can render a model not only useless, but harmful. The most common causes of oversimplification are to omit important variables and to attempt to squeeze a problem into a familiar or convenient mathematical form, rather than to create a form to fit the problem.

Fourth, the results even of a good model can be misinterpreted or misused. One of the most common errors is to take the results of a model too literally, failing to appreciate the degree of uncertainty that surrounds its results.
It can be hard to resist the urge to construct a model, look around for data, insert some numbers when the data cannot be found, clearly state that these assumptions are made only to demonstrate the performance of the model, and then believe the output. Even if the author of the model remembers its weaknesses, others may not. Another error is to ignore the specifications and assumptions of a model and apply its results to situations it was not intended to address. Still another error is to assume that the outcomes addressed by the model are the only ones that need to be considered in making a decision about the technology. It should be recognized that misinterpretation and misuse are not problems inherent to models; they are problems with those who use the models. The solution is not to withdraw the model but to educate those who would use its results.

Finally, the accuracy of the results of a model is limited by the accuracy of the data it uses. It is important, however, not to overstate this limitation. First, this too is not a problem with models as such; it is a data problem. The structure of a model can accurately represent reality; it is the use of the model that will be limited by the poor data. Second, this limitation is not restricted to mathematical models. Whatever method is used to estimate the outcomes of applying a technology, the accuracy of its conclusions will be limited by the accuracy of the available data. A model does not by itself create the need for data that would not otherwise be important. A model does make the data needs explicit and does focus attention on poor data (which might cause discomfort), but this is not a weakness of models; it is a strength. Ignoring important factors about which there are few good data does not make those factors unimportant; it merely ignores them.
Third, models have several properties that make them the preferred method for studying problems for which the data are poor: (1) The explicitness of models focuses attention on gaps and biases in the information, raising cautions about conclusions that might otherwise pass unscathed. (2) Given that data problems cannot be willed away, models are still the best method to gain insights and make estimates based on the best information available. (3) Through sensitivity analyses, models can indicate the importance of uncertainty or poor data about a variable. (4) Models can be used to estimate the value of conducting research to get better data. While poor data spoil the quality of conclusions drawn by any method, the solution is not to discard models but to use models to squeeze the most information out of the data that do exist and to collect better data for the next application. In general, the worse the data, the greater the need for a model.

In judging the seriousness of these limitations, it is important to recall that all methods of technology assessment require judgments and simplifications, all methods can deliver wrong answers, all methods can be misused, and all methods depend on the quality of the available data. While a mathematical model can never be perfect, it can still increase our ability to understand a problem and make decisions.

Strengthening the Technique

The techniques of mathematical modeling (and the related techniques of computing) already have been developed in other fields to a high level of complexity. Mathematical models have been used for centuries in other fields with great success. Today mathematical models are used to help build bridges, design airplane wings, forecast weather, create video games, plan highways, analyze radio waves, refract lenses, guide satellites, route shipments, search for oil, generate electrocardiographs, plan crops, control floods, compute tomograms, and carry out thousands of other activities. The results of this research already are available for application to the evaluation of medical technologies. In medicine the main needs are not to improve the techniques, but to apply them responsibly.
This suggests several priorities. First, efforts must be made to define and demonstrate the role of mathematical models in the technology assessment process. Clinicians, researchers, statisticians, health planners, and policymakers should be exposed to examples of technology assessments that demonstrate the strengths and weaknesses of mathematical models and that demonstrate how they fit with more traditional methods. In the end, the use of the method will depend on its helpfulness to decision makers; the first step is to provide decision makers with opportunities to make that assessment.

Second, there is a need for more education in the application of mathematical models to medical problems. Modelers must know more than a small number of methods; they must understand at a deep theoretical level the assumptions behind and the limitations of their methods, and they must be capable of modifying those methods to fit a particular problem. They must also learn how to communicate with people in medicine to develop a realistic model and to describe how it can be used. On the other side, people who want to use the results of models must learn their strengths and weaknesses.

Third, work is needed in the quality control of models and their applications. For example, mathematical models present special problems for the editors and readers of medical journals. The description of most models is too long to fit in the usual methods section of a paper, and few reviewers could understand them if they did. Yet the form of a model can drastically affect its validity and usefulness. Related issues are the need to control misinterpretation and misuse, the need for a system for validating models, and the need to calibrate the probability that a model's results accurately represent reality.

A start toward these goals can be made by asking that each report of a technology assessment employing a mathematical model contain the following elements:

1. a statement of the problem;
2. a description of the relevant factors and outcomes;
3. a description of the model;
4. a description of data sources (including subjective estimates), with a description of the strengths and weaknesses of each source;
5. a list of assumptions pertaining to:
   a. the structure of the model (e.g., factors included, relationships, and distributions),
   b. the data;
6. a list of the parameter values that will be used for a base case analysis, and a list of the ranges in those values that represent appropriate confidence limits and that will be used in a sensitivity analysis;
7. the results derived from applying the model for the base case;
8. the results of the sensitivity analyses;
9. a discussion of how the modeling assumptions might affect the results, indicating both the direction of the bias and the approximate magnitude of the effect;
10. a description of the validation method and results;
11. a description of the settings to which the results of the analysis can be applied and a list of main factors that could limit the applicability of the results; and
12. a description of research in progress that could yield new data that could alter the results of the analysis.

If the analysis recommends a policy, the report should also contain:

13. a list of the outcomes that required value judgments;
14. a description of the values assessed for those outcomes;
15. a description of the sources of those values;
16. the policy recommendation;
17. a description of the sensitivity of the recommendation to variations in the values; and
18. a description of the settings to which the recommendations apply.

Finally, greater care should be taken in the collection of data.
A tremendous amount of research is conducted by thousands of investigators on hundreds of clinically important questions every year. The fact that good data do not exist for building mathematical models, or even for constructing simpler structures like decision trees, is testimony that many of those conducting the research do not have a clear model in their minds of precisely how the data they are collecting should contribute to the analysis of the problem they are addressing. Because a model is the tool that converts data into insights, one can argue that every experimental and epidemiological study should be preceded by a model, every datum collected should have a place in that model, and attempts should be made to collect all the data needed for the model.

Conclusion

Mathematical models provide a method for synthesizing existing information to estimate the consequences of applying a technology in a particular set of circumstances. Mathematical models should not be viewed as an isolated technique that may or may not be used in a particular assessment, or as an alternative to, or worse, as a competitor of clinical judgment or experimental and epidemiological studies. Any assessment of any technology will require integrating information from experimental and epidemiological studies to estimate how a technology will perform in a particular setting. By their explicitness, power, and precision, mathematical models can provide a powerful aid to human judgment in the interpretation of data from clinical research.

SOCIAL AND ETHICAL ISSUES IN TECHNOLOGY ASSESSMENT*

* This section was written by Morris Collen and Lincoln Moses.

A little-emphasized aspect of technology assessment is the examination of the social, ethical, and legal questions raised by the use of technology in clinical practice. Although these questions do not always lend themselves to quantitative measurement and analysis, they can be systematically identified and evaluated. The methods for accomplishing this will not be covered in detail here, but the following discussion will serve to illustrate possible approaches. Questions to be considered include the following: Who is affected or not affected by a technology? What ethical principles are involved in testing and use of a technology? What might be the unintended consequences or side effects of a technology? How does the technology fit into larger cultural and political contexts? What values affect the application of the results?

An inquiry into the consequences of the use of a medical technology on social groups and relationships will require a study of the patient as a member of a family, of an organization, and of a community. Although the sociopolitical aspects of policy decision making have long been recognized, the importance of the increasing influence of the consumer/patient in policy decisions affecting the diffusion of medical technology has been seen only recently. Toffler describes a rising "third wave" in our society bringing a great increase in self-help and do-it-yourself activity that will powerfully affect our traditional health care delivery systems. Ferguson (1980) extrapolates from consumerism to a new "paradigm of health" in
which the public increasingly embraces "holistic" or "alternative" medicine that employs less technology and uses the placebo effect, biofeedback, meditation, visualization, and forms of body manipulation as modes of self-therapy. Naisbitt (1982) explains that the more that machine-like technology is introduced into society, the more people value the human qualities, thus accounting for the trend to forms of home care rather than institutional care.

Another type of social consideration in the introduction of new technology has to do with its potential new manpower requirements. For instance, the change from manual to automated clinical laboratory methods required a major change in the training of laboratory technologists. The advent of coronary care units required the training and employment of highly specialized nurses. An Institute of Medicine study (Sanders, 1979) concluded that new technology often has important effects upon manpower in the community. It requires consideration of the need for new physicians, assistants, and technicians for the use of new equipment-embodied technology. It may call for an increase in the training of new specialties, but also for a decrease in training and employment opportunities of outmoded specialties.

A consideration of social benefits and costs for a medical technology should include its opportunity costs: alternative uses for the money. Current examples of expensive programs that raise questions of opportunity costs include Medicare's coverage of kidney transplantation or lifelong dialysis for patients with end-stage renal disease. Organ transplantation generally poses cost as a major social consequence, which also overlaps substantially with ethical and legal ramifications.

Various of society's adjustments and accommodations in matters of health and safety affect assessments of technologies by altering their costs either in dollars or emotional stress or both. Structures of all kinds, and almost any transportation system, can be modified for use by disabled persons; is the real question cost-effectiveness? There is nothing technologically difficult in removing nonsmokers from the effluvium of smokers, but there can arise serious questions of how far to carry the effort. Technology assessment in any of these matters of social, ethical, and legal import has great difficulties in determining net benefits and costs. Once the basic demands of humanitarianism have been met, much of the rationale for technology assessment is in the purview of economics. However, softening of that economic edge is a task for the components of assessment that are concerned with social and ethical issues.

OTA (1980a,b) observed that society has collective objectives that stem from its underlying values and traditions, objectives that are not strictly economic and not directly related to health status. These objectives may be concerned with the equitable distribution of medical care (ensuring that the poor have adequate access to health services) or with protecting the rights of the unborn, the mentally ill, or the comatose patient.

An economic approach to the problems of health and medical care is firmly rooted in three fundamental observations, according to Fuchs (1974): (1) resources are scarce in relation to human wants, (2) resources have alternative uses, and (3) people have different wants, with considerable variation in the relative importance they attach to them. The basic economic problem identified by Fuchs is "how to allocate scarce resources so as to best satisfy human wants."

Constraints on economic resources will necessitate decisions as to resource allocation and resource rationing, which, in turn, will raise ethical and related issues.
Evans (1983) believes that in the future the major issues confronting not only medicine but this society as a whole will be the social, ethical, and legal implications of resource allocation and rationing. The resources available to meet the demand for health care already are limited; decisions already are being made; and priorities are being set as physicians allocate their time, hospitals ration beds, and fiscal intermediaries devise straitened reimbursement policies, he contends. It is only because those decisions are not publicized that they have not become a social issue, according to Evans, who suggests that within a society that has failed to come to grips with the meaning of death and the essence of life, rationing decisions will seem unusually cruel. Yet, when these decisions are acknowledged as inescapable, he believes this society, this culture, will be more prepared to deal with the one event that is truly inevitable: death.

Once it is apparent that all who are in need cannot be treated, the rationing process attempts to determine which potential recipients are likely to derive the greatest benefits. This usually requires (1) the development of acceptable criteria for withholding treatment on a condition-by-condition basis and (2) identifying those who make the decisions about whom to treat.

End-stage renal disease (ESRD) provides an example of a medical condition for which there is a relatively long history of decisions about eligibility for treatment. The first successful treatment was hemodialysis. During the early years of dialysis, when very few machines were available, patient selection was made by physicians or community committees. At that time it was decided that although all patients with ESRD had a terminal condition, some had better prospects for treatment than others.
The preferred candidates were selected on the basis of such criteria as age, medical suitability, mental acuity, family involvement, criminal record, economic status (income, net worth), employment record, availability of transportation, willingness to cooperate in the treatment regimen, likelihood of vocational rehabilitation, psychiatric status, marital status, educational background, occupation, and future potential. The eventual decision to extend Social Security disability benefits to patients with ESRD resolved the rationing problem by removing financial limits on treatment.

In general, how are criteria for the rationing of limited resources likely to be developed? As described elsewhere in this report, cost-effectiveness and cost-benefit analyses can be useful to compare various health care programs and determine which program could yield the greatest benefit at the least cost. For example, hemodialysis could be compared with heart transplantation to see which has the greater benefits per dollar expended. At some time, society will have to make some basic decisions about the allocation of economic resources between the aged and those younger. Even though it is based on explicit and even rational criteria, any plan that is eventually adopted will certainly be debatable from the perspectives of others.

To adopt a set of criteria including age of patients is to make a decision about limiting treatment. On the other hand, to treat all patients with a given disorder or within a given disease category, regardless of derived benefits, necessarily implies the withholding of treatment from patients with other disorders. The question is one of priorities. Data can be used to set priorities, but human judgment must be exercised to determine which priorities will hold. The conscious development of explicit allocation criteria, as a first step in the direction of wisely using limited resources, probably will strain our society as few issues have. Many of us will remember when such decisions did not have to be made and, short of cataclysm, will not understand a new imperative of calculated neglect.

Decisions must be made concerning which patients will best benefit from expensive health care technology.
The problem, however, is that, in many respects, social and medical criteria are inextricably intertwined. People of low socioeconomic status are likely to be in poorer health and have multiple diseases. In part, this reflects poor nutritional habits, detrimental lifestyle, and the historical lack of resources to obtain proper health care. Consequently, if medical criteria were to be the basis on which rationing decisions are made, they might exclude the poor and disadvantaged because health and socioeconomic status are highly interdependent. For example, it is not unusual to find that of those persons with ESRD, those of lower socioeconomic status are likely to have multiple associated conditions such as diabetes, hepatitis, and hypertension. Not only are these patients less desirable candidates for dialysis and transplantation, but also they are among the more expensive patients to treat. Without careful planning and evaluation, the gulf between the haves and have-nots, as evidenced by formal selection criteria, is likely to widen.

In the above examples, interprogram analyses were applied only to health care programs. But such analysis also can be used to compare the expenditure of health care resources with other socially desirable uses of resources, such as a public assistance program. This requires conducting a cost-benefit analysis in which all expenditures and benefits are converted to monetary terms, after which direct comparisons can be made among diverse programs. The results of such an analysis may indicate that resources should be reallocated from social and other publicly financed programs to support health programs, or vice versa.

An analysis of benefits and costs of a medical technology to a community or a population group often involves political considerations. OTA (1980a) suggests that if benefits from a technology are controversial, nonscientific negotiations and compromise may be the best course for policymakers. The political process may respond better to community needs than the most careful cost-benefit analysis.

Decisions can be made on the basis of cost-effectiveness or cost-benefit analyses, or by political activities influenced by different lobbying groups. In any case, the first decision likely will be as to which patient groups will receive support (i.e., the resource allocation decision); then, as resources continue to dwindle, allocations will be made within programs and decisions will be made as to how clinicians might ration the limited resources made available to them. Increasingly, it is apparent that this scenario approximates the situation of the kidney disease program today.

The Institute of Medicine (1981) suggested that the public would accept controls on the diffusion of a technology until its effectiveness was proved if it were made clear that such controls ultimately would increase the overall quality of medical care, that lack of control could decrease the quality of care, and that these controls would be applied equitably.

Experience in Addressing Ethical Issues

Although no permanently established group currently addresses the ethical and social consequences of technology, several bodies have in the past been specially constituted to address those issues. For example, the National Commission for the Protection of Human Subjects was directed under section 203 of P.L. 93-348 to conduct a "special study" of the ethical, social, and legal implications of advances in biomedical and behavioral research and technology. This commission and its successor, the President's Commission for the Study of Ethical Problems in Medicine and Biomedical and Behavioral Research, attended explicitly to the ethical, social, and legal implications of advances in technology.
The National Commission for the Protection of Human Subjects (NCPHS) (1978) used several methods for assessing the social and ethical questions raised by technological innovation. In one approach investigators used the Delphi technique to examine such matters as systematic control of behavior, reproductive engineering, genetic screening, extension of life, and data bank-computer technology. In a second approach researchers used a case-study method for a colloquium to develop a historical and sociological perspective on recent advances in biomedical and behavioral research and services. Their colloquium explored the social impact of advances, of legal and institutional constraints, and of incentives governing the introduction of new technologies into medical practice. Finally the colloquium reviewed current knowledge about the public's understanding of and attitudes toward advances and their implications.

The President's Commission for the Study of Ethical Problems in Medicine and Biomedical and Behavioral Research (1983) approached its analysis of medical care by applying three basic principles:

· that the well-being of people be promoted;
· that people's value preferences and choices be respected; and
· that people be treated equitably.

However, they cautioned that medicine and research touch too many beliefs central to human existence to be summed up in a few principles. The commission's overall task was to help clarify the issues and highlight the facts that appear to be most relevant for informed decision making, to suggest improvements in public policy, and to offer guidance for the people who are making decisions. They issued 13 reports on issues in health care and biomedical and behavioral research, including the definition of death, life-sustaining treatments, genetic engineering, and compensation of subjects injured in research.

Although the NCPHS and the Presidential Commission were especially constituted to address ethical, social, and legal issues, there are other forums where these matters can be considered. The Office of Technology Assessment (OTA), for example, has taken up these issues in some of its reports (OTA, 1978a; 1982b).

These examples are evidence that there is a desire and some effort to carry out technology assessment from ethical points of view. Nevertheless, in the committee's opinion the best methodologies for exploring such dimensions are still not well defined and more work is needed in this area.

Ethics of Investigation

Trials of new drugs, diagnostic procedures, and therapeutic maneuvers are keys to progress in health care. At the same time, these steps involve uncertainty and therefore risks. The risks are borne by patients in whom the new, uncertain methods are tried out, by the professionals who conduct these pioneer efforts, and by unknown future patients who may receive inferior care if the results of the investigative efforts are misleading. That can occur if an assessment lends support to a defective new idea or fails to reveal the worth of a genuinely good innovation. When an enterprise imposes risks on people who have differing interests, ethical issues are surely involved. Assessment of medical technology is such an enterprise.

Our consideration here of ethics of investigation is informed by our acceptance of two principles: First, it is unethical to exploit one person for the benefit of another. Second, to waste information that can benefit future patients is unethical, especially if that information has been obtained under conditions of risk.

We find it convenient to treat the ethical issues as those that attach to three temporal phases: initiation, conduct, and termination.
Initiation. When is a new intervention promising enough to justify applying it to people in an experimental way? Who should judge that question and decide? What standards are applicable? Some would argue that a patient's own physician is the only one with ethical standing to decide. Others might prefer the advantages that can accompany collegial action and recourse to written protocol.

On whom shall the novel intervention be tried? Human subject committees, informed consent procedures, and written protocols all address this matter, but only where the novel intervention is acknowledged to be part of an investigation. Some may find an ethical anomaly in the lack of any such parallel protections where the patient is simply undergoing treatment with this same novel intervention.

Conduct. Is the study so conducted that it must yield cogent information? Or is it so designed that on completion little trustworthy information can be salvaged? The ethical content of these questions sometimes leads to a policy of seating research-design experts on human subject committees.

Termination. There are two ways to go wrong, and both are injurious to the interests of patients. If a trial is continued unnecessarily long, then unnecessarily many patients will receive an inferior treatment (the innovation, if it is inferior; the standard, if the innovation is an improvement). The same difficulty can arise if investigation of an innovation is carried forward in too long a sequence of separate studies, as Baum et al. (1981) have reported in a meta-analysis of studies of prophylactic antibiotic therapy for colon surgery.

The second way to go wrong in termination is to quit too soon, before an obtainable conclusive answer is in hand. Some of the controversy over the University Group Diabetes Program (UGDP) concerned the timing of its termination. The ethical issue in early termination is complicated by the need to weigh the relative responsibility of the investigators for the patients in the study and for all others with the disease in question. The decision was especially difficult for the UGDP investigators because they often had the dual role of physician to the patients in the study. To avoid that conflict most large-scale studies now have a data-monitoring committee of clinicians, biostatisticians, and laymen to decide when to inform the investigators that a decision to stop needs to be made.

We conclude this brief review of some ethical aspects of medical investigation with three general observations.

First, the problems cannot be avoided by some shortcut like setting a policy of only trying out good innovations and not trying out poor ones. Gilbert et al. (1977) reviewed 32 randomized trials of innovations in surgery and anesthesia. They found that these well-tested innovations were beneficial in 49 percent of the studies. There is no shirking the inconvenient fact that theory and opinion in medicine are not reliable guides to the value of new interventions; they must be tried, and in ways that can produce cogent answers.

Second, weak studies are not good enough. Many authors have found that the weaker the controls in a study, the better the innovation appears. Weak studies are not ethically sufficient to the task of helping beneficial new technologies enter the health care system. For example, Grace et al. (1966) found that in investigations of the portocaval shunt operation, the enthusiasm of the investigator at completion of the study was lower in those studies that were better controlled. In poorly controlled studies, 72 percent of the investigators reported "marked" improvement in patients. In well-controlled studies, the investigators were split 50-50 between "moderate" improvement and "none." Hugo Muench (Bearman et al., 1974), in a parody of statistical laws based on a lifetime of biostatistical consulting, says, essentially, that nothing improves the performance of an innovation as much as lack of controls. Gilbert et al. (1977) found in poorly controlled trials that 64 percent of the innovations appeared to represent improvements as compared with 49 percent in well-controlled trials. Thus strong trials are needed lest the worth of an innovation be exaggerated.

Third, the scientific attitude of withholding judgment, of remaining skeptical, in the presence of inadequate evidence is commendable in medical investigations. We sometimes see controversies where adherents of one view insist that therapy A is better than therapy B for certain patients and will use only A, while adherents to the opposite view will only use B in such patients. This is an egregious failure of tempering opinion with science; it is ethically unsatisfactory and should constitute a warrant for the conduct of a controlled study. But the same theme (the ethical desirability of withholding judgment) arises in early termination and in deferring widespread adoption of new methods until careful studies justify it.

CONCLUSIONS AND RECOMMENDATIONS

This chapter was begun with the point that technology assessment is important because it provides the bridge between basic research and development and prudent practical application of medical technology. Experience, not theory, must be the controlling factor in deciding whether to use a technology. Learning from experience requires formal plans, records, and analysis, not casual observation, and progress in health care depends on such learning.

To summarize, the foundation for assessing medical technology exists in the assembly of methodologies and the assessments that are available. But much work remains to be done before the enterprise is complete. Much of that work consists of research. In nearly every section of this chapter, research needs have been pointed out. Sometimes the needs are well formulated, as with the list of six questions concerning group judgment methods, and sometimes almost implicit, as with the need to more fully exploit sample surveys of the NCHS for technology assessment. But beyond the research problems of special methodologies, there is the special problem of assembling information from a variety of sources and integrating the results. We need to improve and widen the application of techniques like meta-analysis that can combine information from a number of studies intended to answer a common question about the safety and efficacy of a clinical practice. Also needed are improved methods for weighing information about clinical benefits along with economic and social consequences of medical practice, as in the techniques of cost-effectiveness analysis, cost-benefit analysis, and technology assessment. These many needs justify three recommendations (in italics).

· Increase research activity to improve and strengthen the varied methods that are applicable to the assessment of medical technology.

· Increase resources for training research workers in medical technology, both for advancing methodology and for applying those methods to the many unevaluated technologies. (The reader is reminded that epidemiology and biostatistics have been and remain personnel shortage areas.)
It should be remarked at this point that need for another kind of training also flows from the underdeveloped state of medical technology assessment: biomedical personnel need training in the main ideas of technology assessment even if they are not carrying out the assessment themselves, because they must be able to appraise the strengths and merits of studies.

· Invest greater effort and resources into obtaining evaluative primary data about medical technology in use.

It can be seen that again and again not enough solid primary data are at hand to support cogent assessment. Recall that all respondents to OTA's (1980b) survey of CEA/CBA practitioners raised this complaint. Similarly, perusal of OHTA-OMAR reviews repeatedly points to the paucity of randomized clinical trials or other cogent primary data (NIH, 1983, 1984; Fink et al., 1984; K. E. Lasch, Synthesizing in HRST Reports, unpublished report, Harvard School of Public Health, 1985). Drawing up priorities for information building and then applying resources to the task are urgent needs of the U.S. health care system.

APPENDIX 3-A: EXAMPLE OF COST-BENEFIT ANALYSIS*

* This appendix was prepared by Morris Collen and Clifford Goodman.

This example of cost-benefit analysis is adapted from one given by Swain (1981). It compares three alternatives for the reduction of lead poisoning in fictional "Kleen City." Lead poisoning of children under 5 due to ingestion of lead from painted surfaces is a major cause of death and severe brain damage for children in this age class. The three candidate programs for reducing lead poisoning in Kleen City are

1. child screening and child treatment only;
2. house testing and house deleading only;
3. both 1 and 2.

The CBA is based on the following assumptions:

· Planning period: 15 years
· Population aged 1 to 5: 17,000
· Annual births: 3,500
· Annual deaths of 1 to 5 population: 100
· Residences in the city: 10,000
· Proportion of residences with significant amounts of lead-painted surfaces accessible to children: 1/7
· Children with lead poisoning (levels of lead in the blood of greater than 50 micrograms per 100 milliliters): 6 percent
· Of those children with lead poisoning, those requiring chelation therapy to achieve adequate reductions in the level of lead: 35 percent
· Discount rate: 8 percent

The notation (PV i%, n) is a cash flow conversion factor used to determine the present value of n periodic $1 payments at discount rate i. This is:

$$(PV\ i\%,\ n) = \frac{(1 + i)^{n} - 1}{(1 + i)^{n}\, i}$$

For example, the present value of 5 yearly $1 payments at a discount rate of 10 percent is:

$$(PV\ 10\%,\ 5) = \frac{(1 + 0.1)^{5} - 1}{(1 + 0.1)^{5}\,(0.1)} = \$3.791$$

For this CBA example, we will use the cash flow conversion factor (PV 8%, 15) to determine the present value of 15 yearly (our planning period) $1 payments at our chosen discount rate, 8 percent. This is equal to:

$$(PV\ 8\%,\ 15) = \frac{(1 + 0.08)^{15} - 1}{(1 + 0.08)^{15}\,(0.08)} = \$8.559$$

For children aged 1 to 5 with a minimum of 50 micrograms per 100 milliliters, the likely outcomes are

0.003 will die as a result of lead intoxication;
0.025 will exhibit permanent, severe brain damage as a result of lead intoxication;
0.072 will exhibit permanent, moderate brain damage as a result of lead intoxication; and
0.900 will return to acceptable lead levels with no signs of permanent damage under proper care.
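The present-value conversion factor defined above can be checked numerically. The following is a minimal sketch, not part of the original appendix; the function name `pv_annuity_factor` is an illustrative choice, and it assumes payments arrive at the end of each year, which is what the closed-form expression implies.

```python
def pv_annuity_factor(rate: float, periods: int) -> float:
    """Present value of `periods` yearly $1 payments at discount rate `rate`.

    Implements ((1 + i)^n - 1) / ((1 + i)^n * i), the (PV i%, n) factor.
    """
    growth = (1.0 + rate) ** periods
    return (growth - 1.0) / (growth * rate)


# The 10 percent, 5-year illustration and the 8 percent, 15-year
# factor used throughout the Kleen City example.
example_factor = pv_annuity_factor(0.10, 5)
planning_factor = pv_annuity_factor(0.08, 15)

print(f"(PV 10%, 5) = {example_factor:.3f}")
print(f"(PV 8%, 15) = {planning_factor:.3f}")
```

Because the factor depends only on the discount rate and the planning period, rerunning the function with other rates is a simple way to see how sensitive the later cost and benefit totals are to the choice of 8 percent.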
The costs of child screening and child treatment of lead poisoning are estimated to be

$8 for locating and testing an individual child,
$9 for follow-up of children with excessive but not extreme lead levels, and
$1,000 for chelation therapy of children found to have extreme levels of lead present in the blood.

The costs of house testing and house deleading are estimated to be

$50 to test a house for lead paint
$900 per dwelling deleaded by treatment of all surfaces found to have significant amounts of lead

The benefits associated with the results of lead-poisoning control programs fall into two categories.

1. Benefits due to the averted costs of treatment for children who would otherwise have been afflicted with the effects of lead poisoning. Given at a present value when discounted at 8 percent, these are

$600 for children who would have died of lead poisoning;
$130,000 for children who would have sustained severe, permanent brain damage;
$17,430 for children who would have sustained moderate, permanent brain damage; and
$1,800 for children with no permanent brain damage.

2. Benefits due to the increased income that can be gained by children who would otherwise have been afflicted by the effects of lead poisoning. Given at a present value when discounted at 8 percent, these are

$17,000 for children who would have died;
$17,000 for children who would have sustained severe, permanent brain damage;
$2,500 for children who would have sustained moderate, permanent brain damage; and
$1,600 for children who would have sustained no permanent damage.

Program 1: Child Screening and Child Treatment Only. Screening must be repeated for the entire population for each of the 15 years. Assuming that there are approximately 100 deaths per year of children in the 1 to 5 age range, then the population of Kleen City will remain nearly constant during the 15-year period in the age range of concern. During each of the years, there will be approximately 17,000 children who must be screened for lead intoxication. Of that population, 6 percent will exhibit high lead levels and be subjected either to chelation therapy or follow-up testing during the year. Since under this program there is no significant removal of the original lead sources, each of the children will be subject to rescreening in the subsequent year, unless they have left the population group being studied.

Costs (1): In each year of the program, 17,000 children will be screened ($8 each), 6 percent of whom will be treated (of treated: 65 percent with follow-up and 35 percent with chelation therapy).
The present value of these screening and treatment costs is:

Costs (1) = (PV 8%, 15) × 17,000 × {$8 + 0.06[(0.65 × $9) + (0.35 × $1,000)]}
= (PV 8%, 15) × $498,950
= $4,270,513

Costs (1) = $4,270,513 (A)

Benefits (1): The average benefits per child screened in the first year due to the averted costs of treatment and the averted lost future income for the four outcomes can be combined into one expression:

Average benefits per child = 0.06 × [0.003(17,000 + 600) + 0.025(17,000 + 130,000) + 0.072(2,500 + 17,430) + 0.900(1,600 + 1,800)]

Average benefits per child = $493.3656 (B)

The benefits over the 15-year period can be calculated in two parts. For the 17,000 children screened in the first year, the benefits are:

Benefits in first year = 17,000 × $493.3656 = $8,386,440 (C)

In each following year, a new group of 3,500 children will be screened. The benefits accruing to each of these groups are:

Benefits for each successive year = 3,500 × $493.3656 = $1,726,780 (D)

To determine the present value of these benefits accruing over the 15 years, we do not multiply this figure by 15, since the present value of the benefits of each successive year decreases. Thus, the present value of these benefits over the remainder of the planning period must be calculated:

Benefits for successive years = (PV 8%, 15) × $1,726,780 = $14,780,370 (D)
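The Program 1 arithmetic above can be retraced in a few lines. This is a sketch under stated assumptions rather than the appendix's own procedure: all variable names are illustrative, the inputs are the figures quoted in the text, and no intermediate value is rounded, so the totals it prints differ slightly from the printed ones (which appear to round intermediate results such as the $498,950 annual cost).

```python
def pv_annuity_factor(rate: float, periods: int) -> float:
    """(PV i%, n): present value of `periods` yearly $1 payments."""
    growth = (1.0 + rate) ** periods
    return (growth - 1.0) / (growth * rate)


# Inputs quoted in the appendix (names are illustrative).
CHILDREN = 17_000          # population aged 1 to 5
ANNUAL_BIRTHS = 3_500      # new children entering each year
POISONED_SHARE = 0.06      # share with lead poisoning
CHELATION_SHARE = 0.35     # share of poisoned children needing chelation
SCREEN_COST = 8.0
FOLLOW_UP_COST = 9.0
CHELATION_COST = 1_000.0

# (probability, averted treatment cost, averted lost income) per outcome.
OUTCOMES = [
    (0.003, 600, 17_000),      # death
    (0.025, 130_000, 17_000),  # severe permanent brain damage
    (0.072, 17_430, 2_500),    # moderate permanent brain damage
    (0.900, 1_800, 1_600),     # no permanent damage
]

pv = pv_annuity_factor(0.08, 15)

# Yearly screening plus treatment cost, then its present value.
treatment_per_case = ((1 - CHELATION_SHARE) * FOLLOW_UP_COST
                      + CHELATION_SHARE * CHELATION_COST)
annual_cost = CHILDREN * (SCREEN_COST + POISONED_SHARE * treatment_per_case)
costs_1 = pv * annual_cost

# Expected benefit per screened child, item (B) in the text.
benefit_per_child = POISONED_SHARE * sum(
    p * (treatment + income) for p, treatment, income in OUTCOMES
)

first_year_benefits = CHILDREN * benefit_per_child
later_year_benefits = pv * ANNUAL_BIRTHS * benefit_per_child
benefits_1 = first_year_benefits + later_year_benefits

print(f"benefit per child: {benefit_per_child:.4f}")
print(f"costs (1):    {costs_1:,.0f}")
print(f"benefits (1): {benefits_1:,.0f}")
print(f"benefit/cost: {benefits_1 / costs_1:.2f}")
```

Keeping the outcome probabilities and dollar values in one table makes it straightforward to vary a single assumption, for example the 6 percent poisoning rate, and observe the effect on the benefit-cost ratio.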

    Benefits (1) = $8,386,440 (C) + $14,780,370 (D) = $23,166,810

    Net Gain (1) = Benefits (1) - Costs (1)
                 = $23,166,810 - $4,270,513
                 = $18,896,297

    Benefit (1)/Cost (1) ratio = $23,166,810 / $4,270,513 = 5.42

Program 2: House Testing and House Deleading Only

For this program, it is assumed that house testing and deleading will be completed in the first year of the planning period. Given that assumption, all the benefits of the house-screening process will be received after 1 year. A conservative estimate of the benefits would allow for the fact that, without any child screening and treatment, all of the initial child population (17,000) might be susceptible to lead poisoning during the first year. Thus, the population receiving the benefits of house testing and deleading would be the children entering the 1 to 5 year age category after 1 year, i.e., 3,500 each year.

Costs (2): The costs associated with lead removal from residences are the cost of testing the 10,000 residences for lead ($50 each) plus the cost of removing lead from those houses found to have leaded surfaces ($900 each, for one out of every seven houses).

    Costs (2) = 10,000 × [$50 + (1/7) × $900] = $1,785,714      (E)

Benefits (2): The benefits of averted cost of treatment and averted lost future income of the yearly group of 3,500 children are

    Benefits (2) = (PV 8%, 15) × 3,500 × $493.3656 [from (B) above]
                 = (PV 8%, 15) × $1,726,780
                 = $14,780,370                                  (F)

Note that Benefits (2), the benefits of the home testing/deleading program (F), are the same as the benefits of the child screening/treatment program for successive years (D). As a result of Program 2 house testing/deleading in the first year, 3,500 children per year benefit by averting the costs of treatment and lost future income. The same 3,500 children per year achieve the same benefits from Program 1 annual screening/treatment.

    Net Gain (2) = Benefits (2) - Costs (2)
                 = $14,780,370 - $1,785,714
                 = $12,994,656

    Benefit (2)/Cost (2) ratio = $14,780,370 / $1,785,714 = 8.28

Program 3: Combined Program

The program combines child screening and treatment with house testing and deleading. Under this program, child screening/treatment needs to be carried out only until the removal of lead from houses is completed. If this task is completed by the end of the first year, then the only cost for child screening is that of screening the current population in the first year. This cost has been determined to be $498,950 [see (A)]. The benefits accruing to the current population from the child screening and treatment are estimated to be $8,386,440 [see (C)]. Since the house testing and deleading has been assumed to affect the new population, its benefits can be added to those for the single year of child screening to give a total set of benefits over the 15-year period of $23,166,810. The combined cost of the two programs will be $2,284,664.

Costs (3):

    Costs (3) = (cost of child screening/treatment in first year) + (cost of house testing/deleading)
              = $498,950 (A) + $1,785,714 (E)
              = $2,284,664
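As a cross-check on the Program 2 figures and the combined cost, the following minimal sketch recomputes them from the inputs stated above. The variable names are illustrative assumptions, and the benefit figure (F) and the first-year screening cost are taken as printed rather than recomputed.

```python
# Costs (2): test every residence and delead one residence in seven.
residences = 10_000
test_cost = 50.0
delead_cost = 900.0
delead_fraction = 1 / 7

costs_2 = residences * (test_cost + delead_fraction * delead_cost)  # (E)
benefits_2 = 14_780_370.0  # (F), as printed in the text

net_gain_2 = benefits_2 - costs_2
ratio_2 = benefits_2 / costs_2

# Costs (3): one year of child screening/treatment plus house testing/deleading.
first_year_screening_cost = 498_950.0  # undiscounted first-year figure, as printed
costs_3 = first_year_screening_cost + costs_2

print(f"Costs (2)     = ${costs_2:,.0f}")
print(f"Net Gain (2)  = ${net_gain_2:,.0f}")
print(f"Benefit/Cost  = {ratio_2:.2f}")
print(f"Costs (3)     = ${costs_3:,.0f}")
```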

Benefits (3):

    Benefits (3) = (benefits of child treatment/screening for first-year population [17,000]) + (benefits of house testing/deleading)
                 = $8,386,440 (C) + $14,780,370 (F)
                 = $23,166,810

    Net Gain (3) = Benefits (3) - Costs (3)
                 = $23,166,810 - $2,284,664
                 = $20,882,146

    Benefit (3)/Cost (3) ratio = $23,166,810 / $2,284,664 = 10.14

Comparison of Programs

The following comparison shows that the combination program of child screening/treatment and house testing/deleading has a greater net gain as well as a higher benefit/cost ratio than either individual program. A choice between Programs 1 and 2 would depend on whether one prefers the program with the greater net gain (Program 1) or the one with the greater benefit/cost ratio (Program 2).

    Program                           Net Gain       Benefit/Cost Ratio
    1. Child screening/treatment      $18,896,297          5.42
    2. House testing/deleading        $12,994,656          8.28
    3. Combination of 1 and 2         $20,882,146         10.14

APPENDIX 3-B: AN EXAMPLE OF A MATHEMATICAL MODEL OF MEDICAL TECHNOLOGY*

This appendix illustrates some of the points raised in this chapter by describing briefly a mathematical model developed to assess the value of cancer screening tests.

* This appendix was prepared by David M. Eddy.

As an example, imagine an asymptomatic, average-risk, 40-year-old woman who had a Pap smear a year ago, and suppose we wanted to estimate the effect of repeating the Pap smear today on the chance she will die of cervical cancer, or on her life expectancy. How much difference would it make to wait 2 more years?
To estimate the effect of a Pap smear on those and similar outcomes requires estimating a chain of probabilities: (1) the probability such a woman has a cervical cancer or precancerous lesion (dysplasia or carcinoma in situ) that could potentially be detected; (2) the probability that a Pap smear would detect such a lesion if it were present; (3) the probability that such a lesion would be detected in various stages; (4) the probability that if a cancer is not detected at this screening examination, a cancer will cause signs and symptoms in the interval before the next scheduled examination (and the probability that event will occur at various times in the interval); (5) the probability of any interval-detected cancer occurring in various stages; (6) case-survival rates that describe the woman's prognosis, given the stage in which the lesion is detected; and (7) the probabilities that the woman will die of other causes each year in the future. All these probabilities must be calculated conditional on the fact that this woman is a 40-year-old, average-risk, asymptomatic individual who had a negative Pap smear a year ago; the probabilities would change if she were a different age, had high-risk factors, had symptoms, or had had a negative Pap smear at another time in the past.

The power of mathematical models lies in the fact that formulas can be written for each of these probabilities. For example, the first probability is given approximately by

$$\int_{-\infty}^{\infty} \bigl[1 - P(-t)\bigr] \, \bigl\{ \bigl[F(t + I) - F(t)\bigr] + \bigl[1 - F(t + I)\bigr] \, \mathrm{FN} \bigr\} \, r(t) \exp\!\left[-\int^{t} r(x)\,dx\right] dt, \qquad (1)$$

where I is the interval of time since the last Pap smear (in this example, 1 year), F(t) is the cumulative distribution for the length of time from the moment a lesion is first detectable by a Pap smear until it becomes an invasive cancer, P(t) is the cumulative distribution for the length of time from the first moment of invasion to the appearance of signs and symptoms that would cause the patient to seek care in the absence of screening, r(t) is the instantaneous incidence rate of invasive cancers [r(0) is the rate in 40-year-old average-risk women], and FN is the random false-negative rate of the Pap smear.

Each of the elements in Equation 1 has an intuitive interpretation. The variable of integration, t, denotes the possible times that the woman might develop an invasive cancer of the cervix (t = 0 is now). By integrating from negative infinity to positive infinity, this formula considers all the possible times that an invasive cancer might occur. For any particular time that an invasive cancer might occur (call this time t'), the expression 1 - P(-t') gives the probability that the woman is currently asymptomatic and has not yet detected or sought care for signs or symptoms of the cancer. F(t' + I) - F(t') gives the probability that the cancer was not potentially detectable until after the last Pap smear was done a year ago. The expression 1 - F(t' + I) gives the probability that the lesion was detectable before last year's Pap smear. This last expression must be multiplied by FN, the chance that that Pap smear was falsely negative and missed it. The expression $r(t')\exp[-\int^{t'} r(x)\,dx]$ is the probability that this woman will in fact develop an invasive cancer at the time t'.
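To make the structure of Equation 1 concrete, the sketch below evaluates an integral of that form numerically with the trapezoid rule. Everything quantitative in it is a hypothetical assumption chosen for illustration, not a value from the appendix: exponential forms for F and P with the means shown, a constant incidence rate r, a false-negative rate of 0.2, a truncated integration range, and a lower limit of `t_start` for the inner integral. The function names are likewise invented for this example.

```python
import math


def exp_cdf(t: float, mean: float) -> float:
    """Exponential cumulative distribution; zero for non-positive durations."""
    return 1.0 - math.exp(-t / mean) if t > 0 else 0.0


def equation_1(interval: float = 1.0, fn_rate: float = 0.2, rate: float = 2e-4,
               mean_f: float = 17.0, mean_p: float = 2.0,
               t_start: float = -40.0, t_end: float = 60.0,
               step: float = 0.01) -> float:
    """Trapezoid-rule estimate of an Equation 1 style integral (hypothetical inputs)."""

    def integrand(t: float) -> float:
        # 1 - P(-t): woman is still asymptomatic now.
        asymptomatic = 1.0 - exp_cdf(-t, mean_p)
        # F(t + I) - F(t): lesion became detectable only after the last smear.
        new_since_last = exp_cdf(t + interval, mean_f) - exp_cdf(t, mean_f)
        # [1 - F(t + I)] * FN: lesion was detectable earlier but was missed.
        missed_last = (1.0 - exp_cdf(t + interval, mean_f)) * fn_rate
        # r(t) * exp(-integral of r): with constant r, measured from t_start.
        invasive_at_t = rate * math.exp(-rate * (t - t_start))
        return asymptomatic * (new_since_last + missed_last) * invasive_at_t

    n = int(round((t_end - t_start) / step))
    total = 0.5 * (integrand(t_start) + integrand(t_end))
    for k in range(1, n):
        total += integrand(t_start + k * step)
    return total * step


if __name__ == "__main__":
    for fn in (0.0, 0.2, 0.4):
        print(f"FN = {fn:.1f}: probability = {equation_1(fn_rate=fn):.6f}")
```

Because the integrand is linear in FN with a nonnegative coefficient, a higher false-negative rate for the previous smear can only raise the estimated probability that a detectable lesion is present today, which the loop at the bottom illustrates.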
A formula for the second probability is the same as Equation 1 except that Equation 1 must be multiplied by 1 - FN, the probability that the Pap smear will not be falsely negative. In similar fashion, formulas can be written for the other important probabilities. These formulas are more complicated if one wants to consider the use of more than one type of test, a series of previous examinations done at various frequencies, and other factors, but the concepts are similar.

FIGURE 3-6 Effect of Pap test frequency on financial cost and three measures of benefit for a 20-year-old average-risk woman. [Plot not reproduced. Horizontal axis: financial costs. Points are annotated by frequency (years between tests).] Main assumptions are as follows: (1) testing is begun at age 20; (2) a woman will have a checkup every 3 years for other malignant diseases from ages 20 to 40, and then annually thereafter; (3) the marginal cost of a Pap test is $10; (4) Pap test-detectable dysplasia and carcinoma in situ precede invasive cervical carcinoma by an average of 17 years (range, 0 to 34 years); (5) 2.5 percent of invasive cervical cancers develop very rapidly, requiring less than 2 years to pass through dysplasia and CIS; (6) no cases of dysplasia or CIS regress spontaneously; (7) no Pap tests are falsely read as positive or suspicious; and (8) 5-year relative survival rates from time of detection (lead time adjusted) are dysplasia and CIS, 98 percent; local invasive, 78 percent; and regional invasive, 43 percent. If a woman must also pay a $25 office visit fee for the separate visits for the Pap test, the costs increase to about $700 for an annual Pap test and $1,700 for a biannual Pap test (Eddy, 1981).

To estimate the value of a Pap smear done at various frequencies, one can apply formulas to calculate the probabilities of important clinical and economic outcomes relating to cervical cancer for each year in a woman's life, constantly updating the parameters of the formulas to keep track of the woman's changing age and screening history. The calculations can be performed for each screening strategy being evaluated: for example, no screening at all, screening every year, screening every 3 years, screening every year for three negative examinations and then every 3 years, and so forth. Parameters for the equations, such as age-specific incidence rates [r(t)] and parameters for the functions P(t) and F(t), are estimated from the data collected in clinical and epidemiological studies.

The results of an analysis using parameter values estimated from such studies are illustrated in Figure 3-6 (Eddy, 1981). This figure shows the estimated effect of screening a woman with a Pap smear at various frequencies from age 20 to 75. The figure indicates three measures of benefit: the decrease in the probability that the woman will die of cervical cancer; the increase in her life expectancy, given that the woman is destined to get invasive cancer; and the increase in life expectancy for the average-risk woman who may (with about a 1 percent probability) or may not get invasive cervical cancer. The horizontal axis gives the present value (at age 20) of a lifetime series of screening examinations minus the present value of expected savings in treatment costs.
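The cost axis described above rests on a present-value calculation. As a rough illustration only, the sketch below discounts a lifetime series of Pap tests back to age 20 for several screening intervals. The 5 percent discount rate, the function name `pv_of_screening`, and the omission of treatment savings and survival probabilities are assumptions made for this example; the $10 marginal cost per test and the age range of 20 to 75 come from the text. It is not the model that produced Figure 3-6.

```python
def pv_of_screening(interval_years: int, cost_per_test: float = 10.0,
                    start_age: int = 20, end_age: int = 75,
                    discount_rate: float = 0.05) -> float:
    """Present value at start_age of one test every interval_years years."""
    return sum(
        cost_per_test / (1.0 + discount_rate) ** (age - start_age)
        for age in range(start_age, end_age + 1, interval_years)
    )


if __name__ == "__main__":
    for interval in (1, 2, 3, 5):
        print(f"every {interval} yr: ${pv_of_screening(interval):,.2f}")
```

Setting `discount_rate=0.0` reduces the function to a simple count of tests times the unit cost, which is a convenient sanity check before comparing discounted totals across intervals.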
The calculations indicate that a Pap smear every 3 years is about 99 percent as effective as an annual Pap smear. If the 40-year-old, average-risk woman in the original example postponed her Pap smear another 2 years, the increased annual risk she would run of dying of cervical cancer would be on the order of 1 per 100,000, about the same as the risk of death from one round-trip transcontinental airplane flight.

REFERENCES

American College of Surgeons, Commission on Cancer. 1974. Cancer Program Manual. Chicago: American College of Surgeons.
American Public Health Association. 1981. Control of Communicable Diseases in Man, A. Beneson, ed. Washington, D.C.
Arnstein, S. R. 1977. Technology assessment: Opportunities and obstacles. IEEE Trans. Syst. Man and Cybern. SM-7:571-582.
Averill, R. F., and L. F. McMahon. 1977. A cost-benefit analysis of continued stay certification. Med. Care 15:158.
Bailar, J. C. 1970a. Periodical incidence surveys: I. Organization. Seminar on Cancer Registries in Latin America, Pan American Health Organization-World Health Organization, 41-100.
Bailar, J. C. 1970b. Periodical incidence surveys: II. Basis for the selection of survey areas. Seminar on Cancer Registries in Latin America, Pan American Health Organization-World Health Organization, 101-110.
Bailar, J. C., III, T. A. Louis, P. W. Lavori, and M. Polansky. 1984. A classification for biomedical research reports. N. Engl. J. Med. 311:1482-1487.
Banta, D. H., C. J. Behney, and J. S. Willems. 1981. Costs and their evaluation. In Toward Rational Technology in Medicine. New York: Springer Publishing.
Barron, B. A., and R. M. Richart. 1981. Screening protocols for cervical neoplastic disease. Gynecol. Oncol. 12:S156.
Baum, M. L., D. S. Anish, T. C. Chalmers, et al. 1981. A survey of clinical trials of antibiotic prophylaxis in colon surgery: Evidence against further use of no-treatment controls. N. Engl. J. Med. 305:795-798.
Bearman, J. E., R. B. Lowenson, and W. H. Gullen. 1974. Muench's Postulates, Laws and Corollaries, Biometrics Note #4, National Eye Institute, DHEW, Bethesda.
Bell, R. L., and E. O. Smith. 1982. Clinical trials in post-marketing surveillance of drugs. Controlled Clinical Trials 3:61-68.

Berk, A. A., and T. C. Chalmers. 1981. Cost and efficacy of the substitution of ambulatory for inpatient care. N. Engl. J. Med. 304:393-397.
Bernstein, L. M., E. R. Siegel, and C. M. Goldstein. 1980. The hepatitis knowledge base: A prototype information transfer system. Ann. Intern. Med. 93:169-181.
Berwick, D. M., and A. L. Komaroff. 1982. Cost effectiveness of lead screening. N. Engl. J. Med. 306:1392-1398.
Blum, R. L. 1982. Discovery, confirmation, and incorporation of causal relationships from a large time-oriented clinical data base: The RX project. Comp. Biomed. Res. 15:164-187.
Boruch, R. F. 1985. Enhancing the usefulness of longitudinal data by coupling longitudinal surveys and randomized experiments. Draft Report for DOL, Center for Statistics and Probability, Northwestern University.
Boruch, R. F., and J. S. Cecil. 1979. Assuring confidentiality of social research data. Philadelphia: University of Pennsylvania Press.
Boston Collaborative Drug Surveillance Program. 1973. Oral contraceptives and venous thromboembolic disease, surgically confirmed gallbladder disease and breast tumors. Lancet 1:1399-1404.
Braakman, R. 1978. Data bank of head injuries in three countries. Scott. Med. J. 23:107-108.
Bruce, R. A., et al. 1974. Seattle Heart Watch: Initial clinical circulation and electrocardiographic responses to maximal exercise. Am. J. Cardiol. 33:459-469.
Bruce, R. A., et al. 1981. A computer terminal program to evaluate cardiovascular functional limits and estimate coronary event rates. West. J. Med. 135:342-350.
Bunker, J. P., B. J. Barnes, and F. Mosteller. 1977. Costs, Risks and Benefits of Surgery. New York: Oxford University Press.
Burke, J. F., H. S. Jordan, C. B. Boyle, and E. Vanner. 1981. An Impact Study of the 1978 Consensus Conference on Supportive Therapy in Burn Care. Massachusetts Health Research Institute.
Byar, D. P., R. M. Simon, W. T. Friedewald, et al. 1976. Randomized clinical trials: Perspectives on some recent ideas. N. Engl. J. Med. 295:74-80.
Campbell, D. T., and J. C. Stanley. 1963. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally College Publishing.
Centers for Disease Control. 1979. Abortion Surveillance 1977. Atlanta, Ga.: Department of Health, Education, and Welfare.
Centers for Disease Control. 1982a. Annual Summary, 1981. Reported morbidity and mortality in the United States. Morbid. Mortal. Weekly Rep. 30(54).
Centers for Disease Control. 1982b. Annual Summary, 1981. Morbid. Mortal. Weekly Rep. 30:126-127.
Centor, R. A., and J. S. Schwartz. In press. Calculation of the area under a ROC curve using microcomputers. Med. Decision Making.
Chaitman, B. R., et al. 1981. Angiographic prevalence of high-risk coronary artery disease in patient subsets (CASS). Circulation 64:360-367.
Chalmers, T. C. 1975. Randomizing the first patient. Med. Clin. North Am. 59:1035-1038.
Chalmers, T. C. 1981. The Clinical Trial. Milbank Mem. Fund Q. 59:324-339.
Cochran, W. G. 1954. The combination of estimates from different experiments. Biometrics 10:101-129.
Cohen, S. N., M. F. Armstrong, R. L. Briggs, et al. 1974. Computer-based monitoring and reporting of drug interactions. Medinfo 1974:889-894.
Collen, J. F., R. Feldman, A. Siegelaub, and D. Crawford. 1970. Dollar cost per positive test for automated multiphasic screening. N. Engl. J. Med. 283:459-463.
Collen, M. F. 1979a. A case study of mammography. In Medical Technology and the Health Care System: A Study of the Diffusion of Equipment-Embodied Technology. Prepared by the Committee on Technology and Health Care, National Academy of Sciences, Washington, D.C.
Collen, M. F. 1979b. A study of multiphasic health testing. In Medical Technology and the Health Care System: A Study of the Diffusion of Equipment-Embodied Technology. Prepared by the Committee on Technology and Health Care, National Academy of Sciences, Washington, D.C.
Collen, M. F. 1979c. A guideline matrix for technological system evaluation. J. Med. Systems 2:249-254.
Collen, M. F. 1979d. Cost effectiveness of automated laboratory testing. In Clinician and Chemist, D. S. Young, et al., eds., Proceedings of the First A.O. Beckman Conference in Clinical Chemistry, American Association of Clinical Chemists, Washington, D.C.
Collen, M. F. 1983. Utilization of Diagnostic X-ray Examinations. HHS Pub. FDA 83-8208. Washington, D.C.: U.S. Government Printing Office.
Collen, M. F., L. G. Dales, G. D. Friedman, et al. 1973. Multiphasic health checkup evaluation study. 4. Preliminary cost benefit analysis for middle-aged men. Prev. Med. 2:236-246.
Collen, M. F., S. R. Garfield, R. H. Richart, et al. 1977. Cost analyses of alternative health examination modes. Arch. Int. Med. 137:73-79.
Commission on Cancer. 1974. Cancer Program Manual. Chicago: American College of Surgeons.
Cook, T. D., and L. C. Leviton. 1980. Reviewing the literature: A comparison of traditional methods with meta-analysis. J. Pers. 48:449-472.
Cooper, B. S., and D. P. Rice. 1976. The economic cost of illness revisited. Soc. Secur. Bull. 39:21-36.

Cowley, R. A., W. J. Sacco, W. Gill, et al. 1974. A prognostic index for severe trauma. J. Trauma 14:1029-1035.
Cox, E. B., and W. Stanley. 1979a. Schema driven time-oriented record on minicomputer. Comp. Biomed. Res. 12:503-516.
Cox, E. B., J. Laszlo, and A. Freiman. 1979b. Classification of cancer patients: Beyond TNM. J. Am. Med. Assoc. 242:2691-2695.
Cretin, S. 1977. Cost-benefit analysis of treatment and prevention of myocardial infarction. Health Serv. Res. 12:174.
Cutler, S. J., J. Scotto, S. S. Devesa, and R. R. Connelly. 1974. Third National Cancer Survey: An overview of available information. J. Natl. Cancer Inst. 53:1565-1575.
Dalkey, N. C. 1969. The Delphi Method: An Experimental Study of Group Opinion. Santa Monica, Calif.: Rand Corporation.
Dannenberg, A. L., R. Shapiro, and J. F. Fries. 1979. Enhancement of clinical predictive ability by computer consultation. Meth. Inform. Med. 18:10-14.
Deane, R., and A. Ulene. 1977. Hysterectomy or tubal ligation for sterilization: A cost-effectiveness analysis. Inquiry 14:73.
Delbecq, A., A. H. Van de Ven, and D. H. Gustafson. 1975. Group Techniques for Program Planning. Glenview, Ill.: Scott, Foresman.
Demographic Analysis Section, Biometry Branch, National Cancer Institute. 1976. Code Manual: The SEER Program. DHEW Pub. No. (NIH)-79-1999. Bethesda, Md.: National Cancer Institute.
DerSimonian, R., L. J. Charrette, B. McPeek, and F. Mosteller. 1982. Reporting on methods in clinical trials. N. Engl. J. Med. 306:1332-1337.
DerSimonian, R., and N. Laird. 1982. Evaluating the effectiveness of coaching for SAT exams, a meta-analysis. Pp. 1-15 in Harvard Educational Review.
Devine, E. C., and T. D. Cook. 1983. Effects of psychoeducational intervention on length of hospital stay: A meta-analytic review of 34 studies, American Journal of Nursing. Reproduced in R. J. Light, ed. 1983. Evaluation Studies, Review Annual, Vol. 8, pp. 417-432. Beverly Hills, Calif.: Sage Publications.
Dixon, R. E. 1978. Effects of infections on hospital care. Ann. Intern. Med. 89 (part 2):749-753.
Drazen, E., and J. Metzger. 1981. Methods for evaluating costs of automated hospital information systems. DHHS Publ. No. (PHS) 81-3283. Washington, D.C.: U.S. Government Printing Office.
Dyke, F. J., F. A. Murphy, J. K. Murphy, et al. 1974. Effect of surveillance on the number of hysterectomies in the province of Saskatchewan. N. Engl. J. Med. 296:1326-1328.
Eddy, D. M. 1980. Screening for Cancer: Theory, Analysis and Design. Englewood Cliffs, N.J.: Prentice-Hall.
Eddy, D. M. 1981. Appropriateness of cervical cancer screening. Gynecol. Oncol. 12:S168.
Eddy, D. M. 1982. Clinical policies and the quality of clinical practice. N. Engl. J. Med. 307:343-347.
Emerson, J. D., B. McPeek, and F. Mosteller. 1984. Reporting clinical trials in general surgical journals. Surgery 95:572-579.
Evans, R. W. 1983. Health care technology and the inevitability of resource allocation and rationing decisions. J. Am. Med. Assoc. 149:2208-2219.
Evenson, R. C., H. Altman, I. W. Sletten, and D. W. Cho. 1975. Accuracy of actuarial and clinical predictions for length of stay and unauthorized absence. Dis. Nerv. Syst. 36:250-252.
Eysenck, H. J. 1978. An exercise in mega-silliness. Am. Psychol. 33:517.
Farber, M. E., and S. N. Finkelstein. 1979. A cost-benefit analysis of a mandatory premarital rubella-antibody screening program. N. Engl. J. Med. 300:856-859.
Feigl, P., N. E. Breslow, J. Laszlo, et al. 1981. The U.S. centralized cancer patient data system for uniform communication among cancer centers. J. Natl. Cancer Inst. 67:1017-1024.
Feinstein, A. R. 1977. Clinical Biostatistics, pp. 214-226. St. Louis: C. V. Mosby.
Feinstein, A. R. 1978. Clinical biostatistics XLIV. Clin. Pharmacol. Ther. 24:117-25.
Ferguson, M. 1980. The Aquarian Conspiracy. Los Angeles: J. P. Tarcher, Inc.
Fineberg, H. V. 1979. Gastric freezing: A study of diffusion of a medical innovation. Pp. 173-200 in Medical Technology and the Health Care System: A Study of the Diffusion of Equipment-Embodied Technology. Washington, D.C.: National Academy Press.
Fineberg, H. V., R. Bauman, and M. Sosman. 1977. Computerized cranial tomography: Effect on diagnostic and therapeutic plans. J. Am. Med. Assoc. 238:224-230.
Fink, A., J. Kosecoff, M. Chassin, and R. H. Brook. 1984. Consensus methods: Characteristics and guidelines for use. Am. J. Public Health 74:979-983.
Finkler, S. A. 1982. The distinction between cost and charges. Ann. Intern. Med. 96:102-109.
Finney, D. J. 1965. The design and logic of a monitor of drug use. J. Chronic Dis. 18:77-98.
Finney, D. J. 1966. Monitoring adverse reactions to drugs: its logic and its weakness. Pp. 198-207 in Medical Foundation, International Congress Series No. 115. Proceedings of the European Society for the Study of Drug Toxicity, Vol. VII.
Fisher, R. A. 1938. Statistical Methods for Research Workers. London: Oliver & Boyd.
Fisher, R. A. 1948. Combining independent tests of significance. Am. Stat. 2:30.
Fletcher, R. H., and S. W. Fletcher. 1979. Clinical research in general medical journals: A 30 year perspective. N. Engl. J. Med. 301:108-183.
Foege, W. H., J. D. Millar, and J. M. Lane. 1971. Selective epidemiologic control in smallpox eradication. Am. J. Epidemiol. 94:311-315.
Freiman, J. A., T. C. Chalmers, H. Smith, Jr., and R. R. Kuebler. 1978. The importance of the type II error and the sample size in the design and interpretation of the randomized control trial. N. Engl. J. Med. 299:690-694.
Friedman, G. D. 1972. Screening criteria for drug monitoring: The Kaiser-Permanente drug reaction monitoring system. J. Chronic Dis. 25:11-20.
Friedman, G. D. 1983a. Rauwolfia and breast cancer: No relation found in long term users age fifty and over. J. Chronic Dis. 36:367-370.
Friedman, G. D., and H. K. Ury. 1983b. Screening for possible drug carcinogenicity: Second report of findings. J. Natl. Cancer Inst. 71:1165-1175.
Friedman, L. F., C. D. Furberg, and D. L. DeMets. 1981. Fundamentals of Clinical Trials, ix, 225. Boston: John Wright, PSG.
Fries, J. F., S. Weyl, and H. R. Holman. 1974. Estimating prognosis in systemic lupus erythematosus. Am. J. Med. 57:561-565.
Fuchs, V. R. 1974. Who Shall Live? Health, Economics and Social Choice. New York: Basic Books.
Galbraith, S. L. 1978. Prognostic factors already known. Scott. Med. J. 23:108-109.
Galliher, H. P. 1981. Optimizing ages for cervical smear examinations in followed healthy individuals. Gynecol. Oncol. 12:S188.
Garber, A. M., V. R. Fuchs, and J. F. Silverman. 1984. Case mix, costs, and outcomes: differences between faculty and community services in a university hospital. N. Engl. J. Med. 310:1231-1237.
Gershman, S. T., H. Barrett, J. T. Flannery, et al. 1976. Development of the Connecticut tumor registry. Conn. Med. 40:697-701.
Gilbert, J. P., B. McPeek, and F. Mosteller. 1977. Progress in surgery and anesthesia: Benefits and risks of innovative therapy. In Costs, Risks, and Benefits of Surgery, J. Bunker, B. Barnes, and F. Mosteller, eds. New York: Oxford University Press.
Gittelsohn, A. M., and J. Wennberg. 1977. On the incidence of tonsillectomy and other common surgical procedures. Pp. 91-106 in Costs, Risks and Benefits of Surgery, J. P. Bunker, B. A. Barnes, and F. Mosteller, eds. New York: Oxford University Press.
Glaser, E. M. 1980. Using behavioral science strategies for defining the state-of-the-art. J. Appl. Behav. Sci. 16:79-82.
Glass, G. V. 1976. Primary, secondary and meta-analysis of research. Educ. Res. 5:351-379.
Glass, G. V., B. McGaw, and M. L. Smith. 1981. Meta-Analysis in Social Research. Beverly Hills, Calif.: Sage Publications.
Globe, S., G. N. Levy, and C. M. Schwartz. 1967. Science, Technology, and Innovation. Contract NSF-C667 (Battelle-Columbus Laboratories). Washington, D.C.: National Science Foundation.
Grace, N. D., H. Muench, and T. C. Chalmers. 1966. The present status of shunts for portal hypertension in cirrhosis. Gastroenterology 50:684.
Graham, G. F., and F. J. Wyllie. 1979. Prediction of gall-stone pancreatitis by computer. Br. Med. J. 1:515-517.
Groot, L. M. J. 1982. Advanced and Expensive Medical Technology in the Member States of the European Community: Legislation, Policy and Costs. Commissioned by the European Community. Mimeograph; Roermond, The Netherlands.
Grossman, L. B. 1981. Beads and balloons. Pp. 2-6 in Pacemaker, Vol. 6, No. 3. Detroit: Harper Grace Hospitals.
Gustafson, D. H., A. L. Delbecq, M. Hansen, and R. F. Myers. 1975. Design of a health policy research and development system. Inquiry 12:251-262.
Haley, R. W., D. Quade, H. E. Freeman, et al. 1980. The SENIC Project: Study on the efficacy of nosocomial infection control. Am. J. Epidemiol. 111:472-485.
Hanley, J. A., and B. J. McNeil. 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Diag. Radiol. 143:29-36.
Health and Social Service Journal, May 30, 1980, pp. 702-704.
Hedges, L. V., and I. Olkin. 1980. Vote-counting methods in research synthesis. Psychol. Bull. 88:359-369.
Herbert, T. T., and E. B. Yost. 1979. A comparison of decision quality under nominal and interacting consensus group formats: The case of the structured problem. Decision Sci. 10:358-370.
Herbst, A. L., M. Ulfelder, and D. C. Poskaner. 1971. Association of maternal stilbestrol therapy with tumor appearance in young women. N. Engl. J. Med. 284:878-881.
Heyman, A., J. G. Burch, R. Rosati, et al. 1979. Use of a computerized information system in the management of patients with transient cerebral ischemia. Neurology 29:214-221.
Hiltz, S. R., and M. Turoff. 1978. The Network Nation. Reading, Mass.: Addison-Wesley.
Hoaglin, D. C., R. J. Light, B. McPeek, et al. 1982. Data for Decisions. Cambridge, Mass.: Abt Books.
Hodgkin, J. E., ed. 1979. Chronic Obstructive Pulmonary Diseases: Current Concepts in Diagnosis and Comprehensive Care. Park Ridge, Ill.: American College of Chest Physicians.
Hodgkin, J. E., O. J. Balchum, I. Kass, et al. 1975. Chronic obstructive airway diseases: Current concepts in diagnosis and comprehensive care. J. Am. Med. Assoc. 232:1243-1260.

Hook, E. B., D. M. Schreinemachers, and P. K. Cross. 1981. Use of prenatal cytogenic diagnosis in New York State. N. Engl. J. Med. 305:1410-1413.
Horn, S. D., and J. W. Williamson. 1977. Statistical methods for reliability and validity testing: An application to nominal group judgments in health care. Med. Care 15:922-928.
Horrocks, J. C., D. E. Lambert, W. A. McAdams, et al. 1976. Transfer of computer aided diagnosis of dyspepsia from one geographical area to another. Gut 17:640-644.
Hunter, J. E. 1982. Meta-Analysis: Cumulating Research Findings Across Studies. Beverly Hills, Calif.: Sage Publications.
IMS America Ltd. 1978. Report of the Joint Commission on Prescription Drug Use, Contract 223-78-3007. Department of Health, Education, and Welfare, Food and Drug Administration.
Inglefinger, J. A., F. Mosteller, L. A. Thibodeau, and J. H. Ware. 1983. Biostatistics in Clinical Medicine, Section 11-6, pp. xiii, 316. New York: Macmillan.
Institute of Medicine. 1981. Evaluating Medical Technologies in Clinical Use. Washington, D.C.: National Academy Press.
Jackson, G. B. 1980. Methods for integrative reviews. Rev. Educ. Res. 50:438-460.
Jacoby, I. 1983. Biomedical technology: Information dissemination and the NIH consensus development process. Knowledge: Creation, Diffusion, Utilization 5:245-261.
Jeans, W. D., and A. F. Morris. 1976. The accuracy of radiological and computer diagnoses in small bowel examinations in children. Br. J. Radiol. 49:665-669.
Jennett, B., G. Teasdale, R. Brackman, et al. 1976. Predicting outcome in individual patients after severe head trauma. Lancet 1:1031-1034.
Jick, H., A. M. Walker, and C. Spriet-Pourra. 1979. Postmarketing follow-up. J. Am. Med. Assoc. 242:2310-2314.
Jillson, I. A. 1975. The national drug-abuse policy delphi: Progress report and findings to date. In The Delphi Method: Techniques and Applications, H. Linstone and M. Turoff, eds. Reading, Mass.: Addison-Wesley.
Kennedy, M. M. 1979. Generalizing from single case studies. Eval. Q. 3:661-678.
Kimball, A. M., S. B. Thacker, and M. E. Levy. 1980. Shigella surveillance in a large metropolitan area: Assessment of a passive reporting system. Am. J. Public Health 70:164-166.
Klarman, H. E. 1973. Application of cost-benefit analysis of health systems technology. In Technology and Health Care Systems in the 1980's, M. F. Collen, ed. DHEW Publ. (HMS) 73-3016. Washington, D.C.: U.S. Government Printing Office.
Knill-Jones, R. 1978. New statistical approach to prediction. Scott. Med. J. 23:102-110.
Kolata, G. 1985. Heart panel's conclusions questioned. Science 227:40-41.
Koplan, J. P., and L. S. Farer. 1980. Choice of preventive treatment for Isoniazid-resistant tuberculous infection. J. Am. Med. Assoc. 244:2736-2740.
Koplan, J. P., and S. R. Preblud. 1982. A benefit-cost analysis of mumps vaccine. Am. J. Dis. Child. 136:362-364.
Koplan, J. P., S. C. Schoenbaum, M. C. Weinstein, et al. 1979. Pertussis vaccine: An analysis of benefits, risks and costs. N. Engl. J. Med. 301:906-911.
Kruskal, W., and F. Mosteller. 1980. Representative sampling IV: The history of the concept in statistics, 1895-1939. Int. Stat. Rev. 48:169-195.
Kurland, L. T., and C. A. Molgaard. 1981. The patient record in epidemiology. Sci. Am. 245:54-63.
Lane, J. M., F. L. Ruben, J. M. Neff, et al. 1969. Complications of smallpox vaccination, 1968. National surveillance in the United States. N. Engl. J. Med. 281:1201-1208.
Laszlo, J. 1984. In press. Two long-range clinical data bases are terminated: bang, whimper, or ripple? J. Clin. Oncol.
Laszlo, J. Healthy registry and clinical data base terminology, with special reference to cancer-registries. Submitted for publication.
Laszlo, J., C. Angle, and E. Cox. 1976. The hospital tumor registry: Present status and future prospects. Cancer 38:395.
Lave, J., A. Dobson, and C. Walton. 1983. The potential use of Health Care Financing Administration data tapes for health care services research. Health Care Financing Rev. 5:93-98.
Layde, P. M., S. D. Von Allman, and G. P. Oakley. 1979. Maternal serum alpha-fetoprotein screening: A cost-benefit analysis. Am. J. Publ. Health 69:566-573.
Lembcke, P. A. 1956. Medical auditing by scientific methods, illustrated by major female pelvic surgery. J. Am. Med. Assoc. 162:646-655.
Lenfant, C., B. Rifkind, and I. Jacoby. 1985. Heart panel's conclusions (letter). Science 227:582-583.
Light, R. J., ed. 1983. Evaluation Studies Review Annual, Vol. 8. Beverly Hills: Sage Publications.
Light, R. J., and P. V. Smith. 1971. Accumulating evidence: Procedures for resolving contradictions among different research studies. Harv. Ed. Rev. 41:429-471.
Linstone, H., and M. Turoff, eds. 1975. The Delphi Method: Techniques and Applications. Reading, Mass.: Addison-Wesley.
Lips, K. J. M., V. D. S. Veer, A. Struyvenberg, and R. A. Geerdink. 1982. Genetic predisposition to cancer in man. Am. J. Med. 73:305-307.

Louis, P. 1836. Researches on the Effects of Blood Letting on Some Inflammatory Disease and on the Influence of Tartarized Antimony and Vesication in Pneumonia (translated by C. G. Putman). Boston: Hilliard Gray and Company.
Louis, T. A., H. Fineberg, and F. Mosteller. 1985. Findings for public health from meta-analysis, pp. 1-20. Annual Review of Public Health. Palo Alto, California: Annual Reviews, Inc.
Lusted, L. B. 1969. Perception of roentgen image: Applications of signal detection theory. Rad. Clinics N. Am. 7:435-445.
Maclennan, R., C. Muir, and A. Winkler. 1978. Cancer Registration and Its Techniques. Lyon: International Association for Cancer Research.
Mantel, N., and W. Haenszel. 1959. Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 22:719-748.
McNeil, B. J. 1979. Pitfalls in and Requirements for Evaluations of Diagnostic Technologies in Medical Technology. DHEW Publ. No. (PHS) 79-3254. Urban Institute Conference.
McNeil, B. J., P. D. Varady, B. A. Burrows, and S. J. Adelstein. 1975. Cost-effectiveness calculations in the diagnosis and treatment of hypertensive renovascular disease. N. Engl. J. Med. 293:216-221.
McShane, D. J., J. Porta, and J. F. Fries. 1978. Comparison of therapy in severe systemic lupus erythematosus employing stratification techniques. J. Rheumatol. 5:51-58.
Meier, P. 1975. Statistics and medical experimentation. Biometrics 31:511-530.
Metz, C. E. 1978. Basic principles of ROC analysis. Semin. Nucl. Med. 7:283-298.
Metz, C. E., D. J. Goodenough, and K. Rossman. 1973. Evaluation of receiver operating characteristic curve data in terms of information theory, with applications in radiology. Radiology 109:297-303.
Moscovice, I., P. Armstrong, S. Shortell, and R. Bennet. 1977. Health services research for decision-makers: The use of the Delphi technique to determine health priorities. J. Health Politics, Policy and Law 2:388-410.
Mosteller, F. M., and R. R. Bush. 1954. Selected quantitative techniques. In Handbook of Social Psychology, Vol. I, Theory and Method, G. Lindzey, ed. Cambridge, Mass.: Addison-Wesley.
Mulley, A. G., M. D. Silverstein, and J. L. Dienstag. 1982. Indications for use of hepatitis B vaccine, based on cost-effectiveness analysis. N. Engl. J. Med. 307:644-652.
Mullner, R. M., C. S. Byre, and C. L. Killingsworth. 1983. An inventory of the U.S. health care data bases. Review of Public Data Use 11:79-192.
Naisbitt, J. 1982. Megatrends. New York: Warner Books.
Nance, F. C., and I. Cohn, Jr. 1969. Surgical judgment in the management of stab wounds of the abdomen. A retrospective and prospective analysis based on a study of 600 stabbed patients. Annals of Surgery 170:569.
National Academy of Sciences. 1978. Personnel Needs and Training for Biomedical and Behavioral Research Personnel. Washington, D.C.: National Academy Press.
National Academy of Sciences. 1981. Personnel Needs and Training for Biomedical and Behavioral Research. Washington, D.C.: National Academy Press.
National Academy of Sciences/Institute of Medicine. 1983. Personnel Needs and Training for Biomedical and Behavioral Research. Washington, D.C.: National Academy Press.
National Academy of Sciences. 1985. Sharing Research Data, S. E. Fienberg, M. E. Martin, and M. L. Straf, eds. Washington, D.C.: National Academy Press.
National Center for Health Care Technology. 1981. Coronary artery bypass surgery. J. Am. Med. Assoc. 246:1643.
National Center for Health Statistics. 1980a. Catalog of Public Use Data Tapes from the National Center for Health Statistics. Public Health Service, DHHS Publ. No. (PHS) 81-1213. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1980b. Data Systems of the National Center for Health Statistics. Pp. 28-30 in Public Health Service, DHHS Publ. No. (PHS) 80-1247. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1980c. Long-Term Health Care: Minimum Data Set. Report of the National Committee on Vital and Health Statistics. Public Health Service, DHHS Publ. No. (PHS) 80-1158. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1980d. The collection and processing of drug information, National Ambulatory Medical Care Survey, United States. Prepared by H. Koch. In Vital and Health Statistics, Series 2, No. 90. Public Health Service. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1980e. Trends and variations in cesarean section delivery. Prepared by P. J. Placek, S. M. Taffel, and J. C. Kleinman. Pp. 73-36 in Health, United States, 1980. Public Health Service, DHHS Publ. No. (PHS) 81-1232. Washington: U.S. Government Printing Office.
National Center for Health Statistics. 1980f. Uniform Hospital Discharge Data: Minimum Data Set. Report of the National Committee on Vital and Health Statistics. Public Health Service, DHEW Publ. No. (PHS) 80-1157. Hyattsville, Md.: U.S. Government Printing Office.
National Center for Health Statistics. 1981a. Inpatient utilization of short-stay hospitals by diagnosis, United States, 1978. Prepared by E. McCarthy. P. 24 in Vital and Health Statistics, Series 13, No. 55. Public Health Service, DHHS Publ. No. (PHS) 81-1716. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1981b. NMCES Household Interview Instruments: Instruments and Procedures 1. Prepared by G. S. Bonham and L. T. Corder. Public Health Service, DHHS Publ. No. (PHS) 81-3280. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1981c. Uniform Ambulatory Medical Care: Minimum Data Set. Report of the National Committee on Vital and Health Statistics. Public Health Service, DHHS Publ. No. (PHS) 81-1161. Hyattsville, Md.: U.S. Government Printing Office.
National Center for Health Statistics. 1983a. Data Systems of the National Center for Health Statistics, pp. 63-75. Public Health Service, DHHS Publ. No. (PHS) 80-1247. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1983b. Drug utilization in office-based practices: Summary of findings. National Ambulatory Medical Care Survey. United States, 1980. Prepared by H. Koch. In Vital and Health Statistics, Series 13, No. 65. Public Health Service, DHHS Publ. No. (PHS) 83-1726. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1983c. Health, United States, 1983, p. 13. Public Health Service, DHHS Publ. No. (PHS) 84-1232. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1983d. Procedures and questionnaires of the National Medical Care Utilization and Expenditure Survey. Prepared by G. S. Bonham. In Series A, Methodological Report No. 1. Public Health Service, DHHS Publ. No. 83-20001. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1983e. Utilization of short-stay hospitals: United States, 1981 annual summary. In Vital and Health Statistics, Series 13, No. 72. Public Health Service, DHHS Publ. No. (PHS) 83-1733. Washington, D.C.: U.S. Government Printing Office.
National Center for Health Statistics. 1983f. Variation in use of obstetric technology. Prepared by J. C. Kleinman, M. Cooke, S. Machlin, and S. S. Kessel. Pp. 63-75 in Health, United States, 1983. Public Health Service, DHHS Publ. No. (PHS) 84-1232. Washington, D.C.: U.S. Government Printing Office.
National Commission for the Protection of Human Subjects. 1978. Report and Recommendations. Washington, D.C.: U.S. Government Printing Office.
National Institutes of Health. 1983. Liver transplantation consensus development. Conference Summary. Volume 4, Number 7.
National Institutes of Health. 1984. Diagnostic ultrasound imaging in pregnancy. Consensus Development Conference. Volume 5, Number 1.
NHLBI Angioplasty, Kent et al. 1982. Percutaneous transluminal coronary angioplasty: Report from the registry of the National Heart, Lung, and Blood Institute. Am. J. Cardiol. 49:2011-2018.
Needleman, H. L., S. K. Geiger, and R. Frank. 1985. Lead and IQ scores: a reanalysis. Science 227:701-702.
Needleman, H. L., C. Gunnoe, A. Leviton, et al. 1979. Deficits in psychologic and classroom performance of children with elevated dentine lead levels. N. Engl. J. Med. 300:689-695.
Neustadt, R., and H. V. Fineberg. 1983. The Epidemic That Never Was: Decision Making in the Swine Flu Scare. New York: Vintage Books.
Neutra, R. R., S. E. Fienberg, S. Greenland, and E. A. Friedman. 1978. Effect of fetal monitoring on neonatal death rates. N. Engl. J. Med. 299:324-326.
Office of Health Research Statistics and Technology. 1981. U.S. Department of Health and Human Services. Transsexual surgery. Assessment Report Series 1(4). Washington, D.C.: U.S. Department of Health and Human Services.
Office of Technology Assessment, U.S. Congress. 1978a. Assessing the efficacy and safety of medical technologies. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1978b. Policy implications of the computed tomography (CT) scanner. GPO Stock No. 052-003-00565-4. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1980a. The implications of cost-effectiveness analysis of medical technology. Stock No. 051-003-00765-7. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1980b. The implications of cost-effectiveness analysis of medical technology/background paper #1: Methodological issues and literature review. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1980c. The implications of cost-effectiveness analysis of medical technology/background paper #3: The efficacy and cost effectiveness of psychotherapy. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1981a. Cost effectiveness of influenza vaccination. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1981b. The implications of cost-effectiveness analysis of medical technology/background paper #2: Case studies of medical technologies/case study #2: The feasibility of economic evaluation of diagnostic procedures: The case of CT scanning. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1981c. The implications of cost-effectiveness analysis of medical technology/background paper #2: Case studies of medical technologies/case study #1: Formal analysis, policy formulation, and end-stage renal disease. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1981d. The implications of cost-effectiveness analysis of medical technology/background paper #2: Case studies of medical technologies/case study #3: Screening for colon cancer: A technology assessment. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1982a. Postmarketing Surveillance of Prescription Drugs. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1982b. Strategies for Medical Technology Assessment. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1982c. Technology and Handicapped People, p. 22. OTA-H-179. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1983a. Abstracts of Case Studies in the Health Technology Case Study Series. OTA-P-225. Washington, D.C.: U.S. Government Printing Office.
Office of Technology Assessment, U.S. Congress. 1983b. Variations in Hospital Length of Stay: Their Relationship to Health Outcomes. Health Technology Case Study 24. OTA-HCS-24. Washington, D.C.: U.S. Government Printing Office.
Olsen, S. A. 1982. Group Planning and Problem Solving Methods in Engineering Management. New York: John Wiley & Sons.
Patrick, K. M., and R. Woolley. 1981. A cost-benefit analysis of immunization for pneumococcal pneumonia. J. Am. Med. Assoc. 245:473-477.
Pauker, S., and J. Kassirer. 1975. Therapeutic decision making: A cost-benefit analysis. N. Engl. J. Med. 293:229-234.
Pearson, E. S. 1938. The probability integral transformation for testing goodness of fit and combining independent tests of significance. Biometrika 30:134-148.
Pearson, E. S. 1950. On questions raised by the combination of tests based on discontinuous distributions. Biometrika 37:383-398.
Perry, S., and J. T. Kalberer. 1980. The NIH Consensus-Development Program and the assessment of health-care technologies: The first two years. N. Engl. J. Med. 303:169-172.
Pillemer, D. B., and R. J. Light. 1980. Synthesizing outcomes: How to use research evidence from many studies. Harvard Educ. Rev. 50:176-195.
Pocock, S. J. 1976. The combination of randomized and historical controls in clinical trials. J. Chronic Dis. 29:175-188.
Policy Research Incorporated. 1975a. National Drug Abuse Policy Delphi Study: Questionnaires and Summaries of Results. Baltimore.
Policy Research Incorporated. 1975b. National Drug Abuse Policy Report: Final Report. Baltimore.
Policy Research Incorporated. 1977. A Comprehensive Study of the Ethical, Legal, and Social Implications of Advances in Biomedical and Behavioral Research and Technology: Summary of the Final Report. Baltimore.
Policy Research Incorporated. 1979a. Medical Practice Information Demonstration Project: Depression Project: First Series of Instruments. Baltimore.
Policy Research Incorporated. 1979b. Medical Practice Information Demonstration Project: Bipolar Disorder: A State-of-the-Science Report. Baltimore.
Pollack, E. S. 1982a. Monitoring cancer incidence and patient survival in the United States. Proceedings, Social Statistics Section. Washington, D.C.: American Statistical Association.
Pollack, E. S. 1982b. SEER cost study. Printed for private distribution, National Cancer Institute.
President's Commission for the Study of Ethical Problems in Medicine and Biomedical and Behavioral Research. 1983. Summing up: Final Report on Studies of the Ethical and Legal Problems in Medicine and Biomedical and Behavioral Research. Washington, D.C.: U.S. Government Printing Office.
Principal investigators of CASS and their associates. 1981. National Heart, Lung, and Blood Institute Coronary Artery Surgery Study. Circulation 63(suppl. I):1-81.
Queram, C. J. 1977. Cancer Registries and Reporting Systems in the United States. Madison, Wis.: Department of Health and Social Services, Bureau of Health Statistics, Division of Health.
Rand Corporation. 1983. Submission to the Office of Management and Budget of Supporting Statement and Data Collection Instruments for Assessing the Effectiveness of the NIH Consensus Development Program.
Ransohoff, D. F., and A. R. Feinstein. 1978. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N. Engl. J. Med. 299:926-930.
Re, R., R. Novelline, M. T. Escourrou, et al. 1978. Inhibition of angiotensin-converting enzyme for diagnosis of renal artery stenosis. N. Engl. J. Med. 298:582-586.
Remington, R. D. 1976. Recommendations. Pp. 141-150 in Assessing Drug Reactions: Adverse and Beneficial, Vol. 7, Philosophy and Technology of Drug Assessment, F. N. Allen, ed. Washington, D.C.: The Interdisciplinary Communications Associates.
Rennie, D. 1981. Consensus Statements. N. Engl. J. Med. 304:665-666.
Richart, R. H. 1974. Evaluation of a hospital computer system, in Hospital Computer Systems, M. Collen, ed. New York: John Wiley & Sons, Inc.
Richart, R. M. 1981. Discussion of session III: Screening of cervical neoplasia. Gynecol. Oncol. 12:S212.
Robertson, A., and M. H. Zweig. 1981. Use of receiver operating characteristic curves to evaluate the clinical performance of analytical systems. Clin. Chem. 77:1568-1574.
Romm, F. J., and B. S. Hulka. 1979. Developing criteria for quality of care assessment: Effect of the Delphi technique. Health Services Res. 14:309-312.
Roos, L. L., Jr. 1979. Alternative designs to study outcomes. Med. Care 17:1069-1087.
Rosati, R. A., A. G. Wallace, and E. A. Stead. 1973. The way of the future. Arch. Intern. Med. 131:285.
Rosenthal, R. 1978a. Combining results of independent studies. Psychol. Bull. 85:185-193.
Rosenthal, R., and D. B. Rubin. 1978b. Interpersonal expectancy effects: The first 345 studies. Behavioral and Brain Sciences 3:377-415.
Russell, L. B. 1979. Technology in Hospitals: Medical Advances in Their Diffusion. Washington, D.C.: The Brookings Institution.
Sackman, H. 1975. Delphi Critique. Lexington, Mass.: Lexington Books.
Sanders, C. A. 1979. Medical technology and the health care system: A study of the diffusion of equipment-embodied technology, by the Committee on Technology and Health Care. Washington, D.C.: National Academy of Sciences.
Schaffarzick, R., cited by Wennberg and Gittelsohn, 1973.
Schneiderman, M. A. 1966. Looking backward: Is it worth the crick in the neck? Or: pitfalls in using retrospective data. Am. J. Roentgenol. Radium Ther. Nucl. Med. 96:230-235.
Schoenbaum, S., J. N. Hyde, L. Bartoshesky, and K. Crampton. 1976a. Benefit-cost analysis of rubella vaccination policy. N. Engl. J. Med. 294:306-310.
Schoenbaum, S. C., B. J. McNeil, and J. Kavet. 1976b. The swine-influenza decision. N. Engl. J. Med. 295:759-765.
Schwartz, J. S., P. J. Weinbaum, C. Nesler, et al. 1983. Assessment of tests to identify infants at high risk of respiratory distress syndrome (RDS) using receiver operating characteristic (ROC) curve analysis. Med. Decision Making 3:365.
Scitovsky, A. A. 1979. Changes in the use of ancillary services for 'common' illness. Pp. 39-56 in Medical Technology: The Culprit Behind Health Costs? Proceedings of the 1977 Sun Valley Forum on National Health Insurance, Stuart H. Altman and Robert Blendon, eds. DHEW Publ. No. (PHS) 79-3216. New York: John Wiley.
Shapiro, M., et al. 1983a. Benefit-cost analysis of antimicrobial prophylaxis in abdominal and vaginal hysterectomy. J. Am. Med. Assoc. 249:1290-1294.
Shapiro, S. H., and T. A. Louis. 1983. Clinical Trials: Issues and Approaches, pp. ix, 209. New York: Marcel Dekker.
Sherins, R. J., C. L. Olivery, and J. L. Ziegler. 1978. Gynecomastia and gonadal dysfunction in adolescent boys treated with combination chemotherapy for Hodgkin's disease. N. Engl. J. Med. 299:12-16.
Showstack, J., et al. 1981. Evaluating the costs and benefits of a diagnostic technology: The case of upper gastrointestinal endoscopy. Med. Care 19:498-509.
Shur, R. D. 1981. Estimating the natural history of cancer from cross-sectional data. Ph.D. Dissertation, Engineering-Economic Systems, Stanford University.
Shwartz, M. 1978. An analysis of the benefits of serial screening for breast cancer based upon a mathematical model of the disease. Cancer 51:1550.
Siegel, E. R. 1980. Use of computer conferencing to validate and update NLM's hepatitis knowledge base, in The Future of Electronic Communications, M. M. Henderson, and M. J. MacNaughton, eds. Washington, D.C.: American Association for the Advancement of Science.
Snow, J. 1847. On the Inhalation of the Vapor of Ether in Surgical Operations. (Reproduced by Lea & Febiger, Philadelphia, 1959.)
Stange, P., and A. T. Summer. 1978. Predicting treatment costs and life expectancy for end-stage renal disease. N. Engl. J. Med. 298:372-378.
Stason, W., and M. Weinstein. 1977. Allocation of resources to manage hypertension. N. Engl. J. Med. 296:732.
Steinberg, D. 1985. Heart panel's conclusions (letter). Science 227:582.
Swain, R. W. 1981. Health Systems Analysis. Columbus, Ohio: Grid Publishing.
Swets, J. A. 1979. ROC analysis applied to the evaluation of medical imaging techniques. Invest. Radiology 14:109-121.
Swets, J. D., and R. M. Pickett. 1982. Evaluation of Diagnostic Systems. New York: Academic Press.
Tatro, D. S., T. N. Moore, and S. N. Cohen. 1979. Computer-based system for adverse drug detection and prevention. Am. J. Hosp. Pharm. 36:198-201.

Teasdale, G. 1978. Prediction in action. Scott. Med. J. 23:111.
The Lancet Commission on Anesthesia. 1893. Lancet 1:629-638, 693-708, 761-776, 899-914, 971-978, 1236-1240, 1479-1498.
Thornell, C. A. 1981. Comparison of Strategies for the Development of Process Measures in Emergency Medical Services, NCHSR Research Summary Series. Hyattsville, Md.: National Center for Health Services Research.
Turoff, M., and S. R. Hiltz. 1982. Computer support for group versus individual decisions. IEEE Trans. Comm. COM-30(1):82-91.
U.S. Consumer Product Safety Commission. 1982. National Electronic Injury Surveillance System. NEISS Data Highlights 6:1-4.
U.S. Public Health Service. 1964. Smoking and Health. Report of the Advisory Committee to the Surgeon General of the Public Health Service. Washington, D.C.: U.S. Department of Health, Education, and Welfare.
Warner, K. E., and R. C. Hutton. 1981. Cost-benefit and cost-effectiveness analysis in health care. Med. Care 19:498-509.
Warner, K. E., and B. R. Luce. 1982. Cost-Benefit and Cost-Effectiveness Analysis in Health Care. Ann Arbor, Mich.: Health Administration Press.
Weinstein, M., and W. Stason. 1976. Hypertension: A Policy Perspective. Cambridge, Mass.: Harvard University Press.
Weinstein, M. C., and H. V. Fineberg. 1980. Clinical Decision Analysis. Philadelphia: W. B. Saunders.
Weinstein, M. C., and B. Stason. 1977. Foundations of cost-effectiveness analysis for health and medical practices. N. Engl. J. Med. 296:716-721.
Wennberg, J. 1984. Dealing with medical practice variations: A proposal for action. Health Affairs 3:6-32.
Wennberg, J., and A. Gittelsohn. 1973. Small area variations in health care delivery. Science 182:1102-1108.
Wennberg, J. E. 1981. A strategy for the postmarketing surveillance and evaluation of health care technology. Department of Community and Family Medicine, Dartmouth Medical School.
Wennberg, J. E., J. P. Bunker, and B. Barnes. 1980. The need for assessing the outcome of common medical practices. Ann. Rev. Public Health 1:277-295.
White, C., and J. C. Bailar. 1956. Retrospective and prospective methods of studying association in medicine. Am. J. Public Health 46:35-44.
White, J. J., and N. W. Axnick. 1975. The benefits from 10 years of measles immunization in the United States. Public Health Rep. 90:205-207.
Willems, J. S., R. Sanders, A. Riddiough, and C. Bell. 1980. Cost-effectiveness of vaccination against pneumococcal pneumonia. N. Engl. J. Med. 303:553-559.
Williamson, J. W. 1978. Formulating priorities for quality assurance activity. J. Am. Med. Assoc. 239:631-637.
Wilson, D. H. 1979a. The acute abdomen in the accident and emergency department. Practitioner 222:480-485.
Wilson, S. 1979b. Explorations of the usefulness of case study evaluations. Evaluation Q. 3:446-59.
World Health Organization. 1976a. WHO Handbook for Standardized Cancer Registrations. Geneva.
World Health Organization. 1976b. ICD-O, International Classification of Diseases for Oncology, 1st ed. Geneva.
Wortman, P. M. 1983. Evaluation research: A methodological perspective. Ann. Rev. Psychol. 34:223-260.
Wortman, P. M., and L. Saxe. 1982a. Assessment of medical technology: Methodological considerations (Appendix C). In Strategies for Medical Technology Assessment, prepared by the Office of Technology Assessment, U.S. Congress. Washington, D.C.: U.S. Government Printing Office.
Wortman, P. M., and A. Vinokur. 1982b. Evaluation of NIH Consensus Development Process. Phase I: Final Report. Ann Arbor, Mich.: Center for Research on Utilization of Scientific Knowledge, University of Michigan.
Yin, R. K. 1984. Case Study Research, Design and Methods. Beverly Hills: Sage Publications.
Young, J. L., Jr., A. Asire, and E. Pollock. 1976. SEER Program: Cancer Incidence and Mortality in the United States 1973-1976. DHEW Pub. No. (NIH) 78-1837. Bethesda, Md: National Cancer Institute.
Young, J. L., C. L. Percy, A. J. Asire, eds. 1981. SEER Program: Incidence and Mortality Data, 1973-77. National Cancer Institute Monograph 57, USG 80, NIH Pub. No. 81-2330. Washington, D.C.: National Cancer Institute.


Assessing Medical Technologies (1985)

Chapter: 3. Methods of Technology Assessment
