2 Design of Small Clinical Trials

The design and conduct of any type of clinical trial require three considerations: first, the study should examine valuable and important biomedical research questions; second, it must be based on a rigorous methodology that can answer the specific research question being asked; and third, it must be based on a set of ethical considerations, adherence to which minimizes the risks to the study participants (Sutherland, Meslin, and Till, 1994). The choice of an appropriate study design depends on a number of considerations, including:

    the ability of the study design to answer the primary research question;
    whether the trial is studying a potential new treatment for a condition for which an established, effective treatment already exists;
    whether the disease for which a new treatment is sought is severe or life-threatening;
    the probability and magnitude of risk to the participants;
    the probability and magnitude of likely benefit to the participants;
    the population to be studied (its size, availability, and accessibility); and
    how the data will be used (e.g., to initiate treatment or as preliminary data for a larger trial).
Because the choice of a study design for any particular trial will depend on these and other factors, no general prescription can be offered for the design of clinical trials. However, certain key issues are raised when randomized clinical trials (RCTs) with adequate statistical power are not feasible and when studies with smaller populations must be considered. The utility of such studies may be diminished, but not completely lost, and in other ways may be enhanced. To understand what is lost or gained in the design and conduct of studies with very small numbers of participants, it is important to first consider the basic tenets of clinical trial design (Box 2-1).

BOX 2-1 Important Concepts in Clinical Trial Design

    Does the trial measure efficacy or effectiveness?
    A method of reducing bias (randomization and masking [blinding])
    Inclusion of control groups
        Placebo concurrent controls
        Active treatment concurrent controls (superiority versus equivalence trial)
        No-treatment concurrent controls
        Dose-comparison concurrent controls
        External controls (historical or retrospective controls)
    Use of masking (blinding) or an open-label trial
        Double-blind trial
        Single-blind trial
    Randomization
        Use of randomized versus nonrandomized controls
    Outcomes (endpoints) to be measured: credible, validated, and responsive to change
    Sample size and statistical power
    Significance tests to be used

KEY CONCEPTS IN CLINICAL TRIAL DESIGN

Judgments about the effectiveness of a given intervention ultimately rest on an interpretation of the strength of the evidence arising from the data collected. In general, the more controlled the trial, the stronger is the evidence. The study designs for clinical trials can take several forms, most of which are based on an assumption of accessible sample populations. Clinical trials of efficacy ask whether the experimental treatment works under ideal conditions. In contrast, clinical trials of effectiveness ask whether the experimental treatment works under ordinary circumstances. Often, trials of efficacy are not as sensitive to issues of access to care, the generalizability of the results from a study with a highly selective sample of patients and physicians, and the level of adherence to treatment regimens. Thus, when a trial of efficacy is done with a small sample of patients, it is not clear whether the experimental intervention will be effective when a broader range of providers and patients use the intervention. On the other hand, trials of effectiveness can be problematic if they produce a negative result, in which case it will be unclear whether the experimental intervention would fail under any circumstances. Thus, the issue of what is preferred in a small clinical study, a trial of efficacy or a trial of effectiveness, is an important consideration.

In the United States, the Food and Drug Administration (FDA) oversees the regulation and approval of drugs, biologics, and medical devices. Its review and approval processes affect the design and conduct of most new clinical trials. Preclinical testing of an experimental intervention is performed before investigators initiate a clinical trial. These studies are carried out in the laboratory and in studies with animals to provide preliminary evidence that the experimental intervention will be safe and effective for humans. FDA requires preclinical testing before clinical trials can be started. Safety information from preclinical testing is used to support a request to FDA to begin testing the experimental intervention in studies with humans.

Clinical trials are usually classified into four phases. Phase I trials are the earliest-stage clinical trials used to study an experimental drug in humans, are typically small (fewer than 100 participants), and are often used to determine the toxicity and maximum safe dose of a new drug.
They provide an initial evaluation of a drug's safety and pharmacokinetics. Such studies also usually test various doses of the drug to obtain an indication of the appropriate dose to be used in later studies. Phase I trials are commonly conducted with nondiseased individuals (healthy volunteers). Some phase I trials, for example, studies of treatments for cancer, are performed with individuals with advanced disease who have failed all other standard treatments (Heyd and Carlin, 1999).

Phase II trials are often aimed at gathering preliminary data on whether a drug has clinical efficacy and usually involve 100 to 300 participants. Frequently, phase II trials are used to determine the efficacy and safety of an intervention in participants with the disease for which a new intervention is being developed.

Phase III trials are advanced-stage clinical trials designed to show conclusively how well a drug works. Phase III trials are usually larger, frequently multi-institutional studies, and typically involve from a hundred to thousands of participants. They are comparative in nature, with participants usually assigned by chance to at least two arms, one of which serves as a control or a reference arm and one or more of which involve new interventions. Phase III trials generally measure whether a new intervention extends survival, or improves the health of participants receiving the intervention and has fewer side effects.

Some phase II and phase III trials are designed as pivotal trials (sometimes also called confirmatory trials), which are adequately controlled trials in which the hypotheses are stated in advance and evaluated. The goal of a pivotal trial is to attempt to eliminate systematic biases and increase the statistical power of a trial. Pivotal trials are intended to provide firm evidence of safety and efficacy.

Occasionally, FDA requires phase IV trials, usually performed after a new drug or biologic has been approved for use. These trials are postmarketing surveillance studies aimed at obtaining additional information about the risks, benefits, and optimal use of an intervention. For example, a phase IV trial may be required by FDA to study the effects of an intervention in a new patient population or for a stage of disease different from that for which it was originally tested. Phase IV trials are also used to assess the long-term effects of an intervention and to reveal rare but serious side effects.

One criticism of the classification of clinical trials presented above is that it focuses on the requirements for the regulation of pharmaceuticals, leaving out the many other medical products that FDA regulates. For example, new heart valves are evaluated by FDA on the basis of their ability to meet predetermined operating performance characteristics.
Another example is the intraocular lens, whose performance must satisfy a prespecified grid. Medical device studies, however, rely on a great deal of information about the behavior of the control group that often cannot be obtained or that is very difficult to obtain in small clinical trials because of the small number or lack of control participants.

A much more inclusive and general approach that subsumes the four phases of clinical trials is put forth by Piantadosi (1997), who defines the four phases as (1) early-development studies (testing the treatment mechanism), (2) middle-development studies (treatment tolerability), (3) comparative (pivotal, confirmatory) studies, and (4) late-development studies (extended safety or postmarketing studies). This approach is broader than trials of pharmaceuticals alone; it includes trials of vaccines, biological and gene therapies, screening devices, medical devices, and surgical interventions.

The ethical conduct of a clinical study of the benefits of an intervention requires that it begin in a state of equipoise. Equipoise is defined as the point at which a rational, informed person (whether patient, provider, or researcher) has no preference between two (or more) available treatments (Freedman, 1987; Lilford and Jackson, 1995). When used in the context of research, equipoise describes a state of genuine uncertainty about whether the experimental intervention offers greater benefit or harm than the control intervention. Equipoise is advocated as a means of achieving high scientific and ethical standards in randomized trials (Alderson, 1996). True equipoise might be more of a challenge in small clinical trials, because the degree of uncertainty might be diminished by the nature of the disorder, the lack of real choices for treatment, or insufficient data to make a judgment about the risks of one treatment arm over another.

A primary purpose of many clinical trials is evaluation of the efficacy of an experimental intervention. In a well-designed trial, the data that are collected and the observations that are made will eventually be used to overturn the equipoise. At the end of a trial, when it is determined whether an experimental intervention has efficacy, the state of clinical equipoise has been eliminated. Central principles in proving efficacy, and thereby eliminating equipoise, are avoiding bias and establishing statistical significance. This is ideally done through the use of controls, randomization, blinding of the study, credible and validated outcomes responsive to small changes, and a sufficient sample size. In some trials, including small clinical studies, the elimination of equipoise in such a straightforward manner might be difficult. Instead, estimation of a treatment effect as precisely as necessary may be sufficient to distinguish the effect from zero. It is a more nuanced approach, but one that should be considered in the study design.

Adherence to an ethical process, whereby risks are minimized and voluntary informed consent is obtained, is essential to any research involving humans and is particularly important in small clinical trials, in which the sample population might be easily identified and potentially more vulnerable. Study designs that incorporate an ethical process may help in reducing concerns about some of the problems in design and interpretation that naturally accompany small clinical trials.
Reducing Bias

Bias in clinical trials is the potential of any aspect of the design, conduct, analysis, or interpretation of the results of a trial to lead to conclusions about the effects of an intervention that are systematically different from the truth (Pocock, 1984). It is both a scientific and an ethical issue. It is relatively easy to identify potential sources of bias in clinical trials, but investigators have a limited ability to effectively remove the effects of bias. It is often difficult to even determine the net direction and effect of bias on the study results. Randomization and masking (blinding) are the two techniques generally used to minimize bias and to maximize the probability that the test intervention and control groups are similar at the start of the study and are treated similarly throughout its course (Pocock, 1984). Clinical trials with randomized controls and with blinding, when practical and appropriate, represent the standard for the evaluation of therapeutic interventions. Improper randomization or imperfect masking may result in bias. However, bias may work in any direction (Hauck and Anderson, 1999). In addition, the data for participants who withdraw or are lost from the trial can bias the results.

Alternative Types of Control Groups

A control group in a clinical trial is a group of individuals used as a comparison for a group of participants who receive the experimental treatment. The main purpose of a control group is to permit investigators to determine whether an observed effect is truly caused by the experimental intervention being tested or by other factors, such as the natural progression of the disease, observer or participant expectations, or other treatments (Pocock, 1996). The experience of the control group lets the investigator know what would have happened to study participants if they had not received the test intervention or what would have happened with a different treatment known to be effective. Thus, the control group serves as a baseline.

There are numerous types of control groups, some of which can be used in small clinical trials. FDA classifies clinical trial control groups into five types: placebo concurrent controls, active-treatment concurrent controls, no-treatment concurrent controls, dose-comparison concurrent controls, and external controls (Food and Drug Administration, 1999). Each type of control group has its strengths and weaknesses, depending on the scientific question being asked, the intervention being tested, and the group of participants involved.
In a trial with placebo concurrent controls, the experimental intervention is compared with a placebo. Participants are randomized to receive either the new intervention or a placebo. Most placebo-controlled trials are also double blind, so that neither the participants nor the physician, investigator, or evaluator knows who is assigned to the placebo group and who will receive the experimental intervention. Placebo-controlled trials also allow a distinction between adverse events due to the intervention and those due to the underlying disease or other potential interference, if they occur sufficiently frequently to be detected with the available sample size.

It is generally accepted that a placebo-controlled trial would not be ethical if an established, effective treatment that is known to prevent serious harm, such as death or irreversible injury, is available for the condition being studied (World Medical Association, 1964). There may be some exceptions, however, such as cases in which the established, effective treatment does not work in certain populations or it has such adverse effects that patients refuse therapy. The most recent version of the Declaration of Helsinki (October 2000 [World Medical Association, 2000]) argues that use of a placebo is unethical regardless of the lack of severity of the condition and regardless of whether the best possible treatment is available in the setting or location in which the trial is being conducted. The benefits, risks, burdens, and effectiveness of a new method should be tested against those of the best current prophylactic, diagnostic, and therapeutic methods. At present, many U.S. scientists (including those at FDA) disagree with that point of view. The arguments are complex and need additional discussion and time before a consensus can be achieved if this new direction or another one similar to it is to replace the previous recommendation.
Although placebos are still the most common control used in pharmaceutical trials, it is increasingly common to compare an experimental intervention with an existing established, effective treatment. Active-treatment concurrent control trials are extremely useful in cases in which it would not be ethical to give participants a placebo because doing so would pose undue risk to their health or well-being. In an active-control study, participants are randomly assigned to the experimental intervention or to an alternative therapy, the active-control treatment. Such trials are usually double blind, but this is not always possible. For example, many oncology studies are considered impossible to blind because of different regimens, different routes of administration, and different toxicities (Heyd and Carlin, 1999). Despite the best intentions, some treatments have unintended effects that are so specific that their occurrence will inevitably identify the treatment received to both the patient and the medical staff. It is particularly important to do everything possible to have blinded interpretation of outcome variables or critical endpoints when the type of treatment is obvious. In a study in which an active control is used, it may be difficult to determine whether any of the treatments has an effect unless the effects of the treatments are obvious, a placebo control is included, or a placebo-controlled trial has previously demonstrated the efficacy of the active control.

Active treatment-controlled trials can take two forms: a superiority trial, in which the new drug is evaluated to determine if it is superior to the active control, and an equivalence trial (a noninferiority trial), in which the new drug is tested to determine if it is equivalent to, or at least not inferior to, the active control (Hauck and Anderson, 1999). Equivalence trials are designed to show that the new intervention is as effective or nearly as effective as the established effective treatment. For diseases for which an established, effective treatment is available and in use, a common design randomizes participants to receive either an experimental intervention or the established, effective treatment. It is not scientifically possible to prove that two different interventions are exactly equivalent, only that they are nearly equivalent.

In a trial with no-treatment concurrent controls, a group receiving the experimental intervention is compared with a group not receiving the treatment or placebo. The randomized no-treatment control trial is similar to the placebo-controlled trial. However, since it often cannot be fully blinded, several aspects of the trial may be affected, including retention of participants, patient management, and all aspects of observation (Food and Drug Administration, 1999). A no-treatment concurrent control trial is usually used when blinding is not feasible, such as when a sham surgery would have to be used or when the side effects of the experimental intervention are obvious.
No-treatment concurrent control trials can also be used when the effects of the treatment are obvious and there is a small placebo effect. To reduce bias when a no-treatment control is used, it is desirable that those responsible for clinical assessment remain blinded.

In a dose-comparison concurrent control trial, participants are assigned to one of several dose groups so that the effects of different doses of the test drug (dose-response) can be compared. Most dose-response-controlled trials are randomized and double blind. They may include a placebo group or an active control group or both. It is not uncommon, for example, to show no difference between doses in a dose-response study. Unless the action of the drug is obvious, inclusion of a placebo group is extremely useful to determine if the drug being tested has no effect at all or a constant positive effect above the minimum dose.

There are several advantages to using a dose-response control instead of a placebo control. When an experimental intervention has pharmacological effects that could break the blinding, it may be easier to preserve blinding in a dose-response study than in a placebo-controlled trial (Food and Drug Administration, 1999). Also, if the optimally safe and effective dose of an experimental intervention is not known, it may be more useful to study a range of doses than to choose a single dose that may be suboptimal or toxic (Pocock, 1996). Sometimes the optimal dose of a drug has unacceptable toxicity and a lower dose, even though it is not optimal for the treatment of the disease, is safer. In this case, a dose-response-controlled trial can be used to optimize the effective dose while minimizing the concomitant toxicity. However, the same ethical issues related to withholding an established, effective treatment from participants in placebo-controlled trials are relevant in a dose-response study (Clark and Leaverton, 1994).

In an external control trial, participants receiving the intervention being tested are compared with a group of individuals who are separate from the population tested in the trial. The most common type of external control is a historical control (sometimes called a retrospective control) (Gehan, 1982). Individuals receiving the experimental intervention are compared with a group of individuals tested at an earlier time. For example, the results of a prior clinical trial published in the medical literature may serve as a historical control. The major problem with historical controls is that one cannot ensure that the comparison is fair because of the variability in patient selection and the experimental environment. If historical controls are obtained from a previous trial conducted in the same environment or by the same investigators, there is a greater chance of reducing the potential bias (Pocock, 1984). Studies have shown that externally controlled trials tend to overestimate the efficacies of experimental treatments (Sacks, Chalmers, and Smith, 1982), although one example has found the treatment effect to be underestimated (Farewell and D'Angio, 1981). Therefore, when selecting an external control, it is extremely important to try to control for these biases by selecting the control group before testing of the experimental intervention and ensuring that the control group is similar to the experimental group in as many ways as possible.

Trials with external controls sometimes compare the group receiving the experimental intervention with a group tested during the same time period but in another setting. A variation of an externally controlled trial is a baseline-controlled trial (e.g., a before-and-after trial). In a baseline-controlled trial, the health condition of the individuals before they received the experimental intervention is compared with their condition after they have received the intervention.
It is increasingly common for studies to have more than one type of control group, for example, both an active control and a placebo control. In those trials the placebo control serves as an internal control to provide evidence that the active control had an effect. Some trials compare several doses of a test drug with several doses of an active control drug, all of which may then be compared with a placebo.

In some instances, the only practical way to design a clinical trial is as an uncontrolled trial. Uncontrolled trials are usually used to test new experimental interventions for diseases for which no established, effective treatments are available and the prognosis is universally poor without therapy. In uncontrolled trials, there is no control group for comparison, and it is not possible to use blinding and randomization to minimize bias. Uncontrolled trials are similar to externally controlled trials, in the sense that the outcomes for research participants receiving the experimental intervention are compared with the outcomes before the availability of the intervention. Therefore, the scientific grounds for the experimental intervention must be strong enough and its effects must be obvious enough for the positive results of an uncontrolled trial to be accepted. History is replete with examples of failed uncontrolled trials, such as those for the drug laetrile and the anticancer agent interferon (Pocock, 1984).

Matching and Stratification

In many cases investigators may be faced with a situation in which they have a potentially large historical control sample that they want to compare with a small experimental sample in terms of one or more endpoints. This is typically a problem in observational studies in which the individuals have not been randomized to the control and experimental groups. The question is, how does one control for the bias inherent in the observational nature of these data?
Perhaps the experimental participants have in some way been self-selected for their illness or the intervention that they have received. This is not a new issue. In fact, it is closely related to statistical thinking and research on analysis of observational data and causal inference. For example, as early as 1968, William G. Cochran considered the use of stratification and subclassification as a tool for removing bias in observational studies. In a now classic example, Cochran examined the relationship between mortality and smoking using data from a large medical database (Cochran, 1968). The first row of Table 2-1 shows that cigarette smoking is unrelated to mortality, but pipe smoking appears to be quite lethal. The result of this early data-mining exercise could have easily misled researchers for some time at the early stages of scientific discovery. It turns out that, at least at the time that these data were collected, pipe smokers were on average much older than cigarette smokers, hence the false association with an increased rate of mortality in the nonstratified group. Cochran (1968) illustrated the effect that stratification (i.e., by age) has on the direction and ultimate interpretation of the results, revealing the association between cigarette smoking and mortality (Table 2-1).

TABLE 2-1 Smoking and Mortality

                                      Mortality Rate per 1,000 Person-Years
Stratification or subclassification   Nonsmokers   Cigarette Smokers   Pipe and Cigar Smokers
One (all ages in database)            13.5         13.5                17.4
Two                                   13.5         16.4                14.9
Three                                 13.5         17.7                14.2
Ten                                   13.5         21.2                13.7

SOURCE: Cochran (1968).

It might be argued that a good data analyst would never have made this mistake because such an analyst would have tested for relevant interactions with important variables such as age. However, the simple statistical solution to this problem can also be misleading in an analysis of observational data. For example, nothing in the statistical output alerts the analyst to a potential nonoverlap in the marginal distributions. An investigator may be comparing 70-year-old smokers with 40-year-old nonsmokers, whereas traditional statistical approaches assume that the groups have the same covariate distributions and the statistical analyses are often limited to linear adjustments and extrapolation. Cochran illustrated that some statistical approaches (e.g., stratification or subclassification) produced more robust solutions when they were applied to naturalistic data than when they were applied to other types of data.

Rosenbaum and Rubin (1983) extended the notion of subclassification to the multivariate case (i.e., more than one stratification variable) by introducing the propensity score.
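The way age stratification removes a spurious association of the kind Cochran described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the person-years and death counts below are invented for the example (they are not Cochran's data), the two age strata and the names `DATA`, `crude_rate`, and `adjusted_rate` are hypothetical, and the adjustment shown is simple direct standardization to the pooled age distribution rather than the exact procedure used by Cochran.

```python
# Hypothetical (person_years, deaths) by age stratum and group.
# These numbers are made up for illustration; they are NOT Cochran's data.
DATA = {
    "young": {"nonsmoker": (8000, 40), "pipe": (2000, 10)},
    "old": {"nonsmoker": (2000, 60), "pipe": (8000, 240)},
}


def crude_rate(group):
    """Deaths per 1,000 person-years, ignoring age entirely."""
    person_years = sum(DATA[s][group][0] for s in DATA)
    deaths = sum(DATA[s][group][1] for s in DATA)
    return 1000.0 * deaths / person_years


def adjusted_rate(group):
    """Age-adjusted rate: stratum-specific rates weighted by the
    pooled (all groups combined) share of person-years in each stratum."""
    total = sum(py for s in DATA for (py, _) in DATA[s].values())
    rate = 0.0
    for s in DATA:
        stratum_py = sum(py for (py, _) in DATA[s].values())
        weight = stratum_py / total
        py, deaths = DATA[s][group]
        rate += weight * 1000.0 * deaths / py
    return rate


if __name__ == "__main__":
    for g in ("nonsmoker", "pipe"):
        print(g, crude_rate(g), adjusted_rate(g))
```

With these invented numbers the crude rates are 10 and 25 deaths per 1,000 person-years, yet the two groups have identical rates within each age stratum, so both adjusted rates equal 17.5. The apparent excess mortality in the older group disappears once age is held fixed, which is the pattern shown for pipe smokers in Table 2-1.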
Propensity score matching allows the matching of cases and controls in terms of their propensities or probabilities of receiving the intervention on the basis of a number of potentially confounding variables. The result is a matched set of cases and controls that are, in terms of probability, equally likely to have received the treatment. The limitation is that the results from such a comparison will be
OCR for page 49
to assert that the best is among a group of three of the interventions, although they are not sure which one is the best. Subsequent studies can then focus on choosing the best of the three interventions. A key criterion in selection trials is the probability of selection of the “correct” treatment. Even more intriguing criteria have been proposed for the selection of a superior treatment. In a review of the second edition of Peter Armitage's book Sequential Medical Trials, Frank Anscombe introduced what has been called the “ethical cost” function, which considers the number of patients given an inferior treatment and the severity of such treatment errors (Lai, Levin, Robbins, et al., 1980).

Consider again the finite patient horizon of N patients to be treated over the course of a given time period. Suppose n pairs of patients (for a total of 2n patients) are to be considered in the trial phase, with treatment A or treatment B randomly allocated within pairs. After the trial phase, the remaining N − 2n patients will all be given the apparently superior treatment identified in the trial phase. The ethical cost function is the total number of patients given the truly inferior treatment multiplied by the magnitude of the treatment efficacy difference. If (AD) denotes the absolute difference in average endpoint levels between the two treatments, then the ethical cost is (AD)n if the truly superior treatment is selected in the trial phase and (AD)(N − n) if the truly superior treatment is not selected. It is simple to implement a sequential version of the trial phase; it also has the virtue of achieving a substantially lower average ethical cost than that which can be achieved with a fixed sample size in the trial phase. A surprising feature of a large class of reasonable sequential stopping rules for the trial phase is that they can reduce the average ethical cost below that of a fixed-sample-size trial phase, even when the fixed sample size is chosen to minimize the ethical cost for a given value of (AD).
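The two cases of the ethical cost function can be written out directly. The following Python sketch is only an illustration: the function names and the idea of averaging over a probability of correct selection (`p_correct`) are assumptions introduced here for the example, and the numerical inputs in the usage note are hypothetical.

```python
def ethical_cost(ad, N, n, superior_selected):
    """Ethical cost for a horizon of N patients with n pairs in the trial phase.

    ad is the absolute difference (AD) in average endpoint levels.
    In the trial phase, n of the 2n patients receive the inferior treatment.
    If the wrong treatment is selected, the remaining N - 2n patients also
    receive it, for a total of n + (N - 2n) = N - n patients.
    """
    if n < 0 or 2 * n > N:
        raise ValueError("need 0 <= 2n <= N")
    patients_on_inferior = n if superior_selected else N - n
    return ad * patients_on_inferior


def expected_ethical_cost(ad, N, n, p_correct):
    """Average ethical cost when the superior treatment is selected
    with probability p_correct (a hypothetical input for this sketch)."""
    return (p_correct * ethical_cost(ad, N, n, True)
            + (1.0 - p_correct) * ethical_cost(ad, N, n, False))
```

For instance, with hypothetical values (AD) = 0.2, N = 100, and n = 10, the cost is 0.2 × 10 = 2 if the superior treatment is selected and 0.2 × 90 = 18 if it is not. The asymmetry makes the cost of a wrong selection explicit, which is the quantity a conventional Type I error rate does not capture.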
For example, one such sequential stopping rule reaches a decision in the trial phase with n no more than one-sixth of N. The main point for consideration in small trials, however, is that it may not be obvious how one rationalizes the trade-off between the number of patients put at risk in the trial and an ultimately arbitrary Type I error rate in a conventional trial. On the other hand, it may be much more desirable to design a selection trial with an ethical cost function that directly incorporates the number of patients given inferior treatment.

Adaptive Design

Adaptive designs have been suggested as a way to overcome the ethical dilemmas that arise when the early results from an RCT clearly begin to favor one intervention over another. An adaptive design seeks to skew assignment probabilities to favor the better-performing treatment in a trial that is under way (Rosenberger, 1996). Adaptive designs are attractive to mathematicians and statisticians because they impose dependencies that require the full arsenal of techniques and stochastic processes (Rosenberger, 1996). An assortment of adaptive designs has been developed over the past few decades, including a variety of urn models that govern the sampling mechanism. Adaptive design can be associated with complex analytical problems. If the sample size is small enough, an exact analysis by exhaustive enumeration of all sample paths is one way to provide an answer. If the sample size is larger but still not large, a Monte Carlo simulation can provide an accurate analysis. If the sample size is large, then standard likelihood-based methods can be used. An example of an adaptive design is described in Box 2-5.

A major advantage of adaptive design is that over time more patients will be assigned to the more successful treatment. Stopping rules and data analysis for these types of designs are complicated (Hoel, Sobel, and Weiss, 1975), and more research is needed in this area. As with sequential designs, the disadvantage of adaptive designs is that in most trials, patients are heterogeneous with respect to the important prognostic factors, and these methods do not protect against bias introduced by changes in the types of patients entering into a trial over time. Moreover, for patients with chronic diseases, responses are usually delayed so long that the advantages of this approach are often lost. Also, multiple endpoints are usually of interest, and therefore, the entire allocation process should not be based on a single response.
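The randomized play-the-winner urn of Box 2-5 (Zelen, 1969) lends itself to the kind of Monte Carlo analysis mentioned above. The urn starts with one ball labeled A and one labeled B; a success adds a ball for the treatment just given, and a failure adds a ball for the other treatment. The Python sketch below is a minimal illustration under simplifying assumptions that are not part of the source: each patient's result is taken to be available before the next patient is assigned, outcomes are binary with fixed success probabilities, and the function names and parameters are hypothetical.

```python
import random


def play_the_winner(p_success, n_patients, rng):
    """Simulate one trial under the randomized play-the-winner urn.

    p_success maps each arm ("A", "B") to an assumed success probability.
    Returns the list of assignments and the final urn contents.
    """
    urn = {"A": 1, "B": 1}  # one ball of each type to start
    assignments = []
    for _ in range(n_patients):
        # Draw a ball at random (with replacement) to assign treatment.
        arm = rng.choices(["A", "B"], weights=[urn["A"], urn["B"]])[0]
        assignments.append(arm)
        # Simplification: the response is observed immediately.
        success = rng.random() < p_success[arm]
        if success:
            urn[arm] += 1  # reward the successful treatment
        else:
            other = "B" if arm == "A" else "A"
            urn[other] += 1  # a failure adds a ball for the other arm
    return assignments, urn


def mean_fraction_on_a(p_success, n_patients, n_sims, seed=0):
    """Monte Carlo estimate of the average fraction of patients given A."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        assignments, _ = play_the_winner(p_success, n_patients, rng)
        total += assignments.count("A") / n_patients
    return total / n_sims
```

Calling `mean_fraction_on_a({"A": 0.8, "B": 0.2}, 30, 1000)` estimates how strongly the allocation is skewed toward the better treatment in a 30-patient trial. Delayed responses, which the text identifies as a practical obstacle, would require updating the urn only when each result becomes available.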
Play-the-winner rules can be useful in certain specialized medical situations in which ethical challenges are strong and one can be reasonably certain that time trends and patient heterogeneity are unimportant. These rules can be especially beneficial when response times are short compared with the times between patient entries into a study. An example of this is the development of extracorporeal membrane oxygenation (Truog, 1992; Ware, 1989).
BOX 2-5 Play-the-Winner Rule as an Example of Adaptive Design
A simple randomized version of the play-the-winner rule follows. An urn contains two balls; one is labeled A and the other is labeled B. When a patient is available for treatment assignment, a ball is drawn at random and replaced. If the ball is type A, the patient is assigned to treatment A; if it is type B, the patient is assigned to treatment B. When the results for a patient are available, the contents of the urn are changed according to the following rule: if the result is a success, an additional ball labeled with the successful treatment is added to the urn. If the result is a failure, a ball with the opposite label is added to the urn (Zelen, 1969).
Risk-Based Allocation Design
Risk-based allocation, a nonrandomized design, has a very specific purpose: to allow individuals at higher risk or with greater disease severity to benefit from a potentially superior experimental treatment. Because the design is nonrandomized, its use should be considered only in situations in which an RCT would not be possible. For example, when a therapy is readily available outside the study protocol or when a treatment has been in use for a long time and is perceived to be efficacious, even though it has never been subjected to a randomized trial, a nonrandomized risk-based allocation approach may be useful. Bone marrow transplantation for the treatment of advanced breast disease is an illustration. A nationwide, multicenter randomized trial was designed to test the efficacy of harvesting bone marrow before aggressive chemotherapy followed by bone marrow transplantation with the patient's own (autologous) bone marrow for women with at least 10 axillary nodes with tumor involvement. The comparison group received the standard therapy at that time, which omitted the bone marrow transplantation procedure. Bone marrow transplantation was widely available outside the clinical trial, and women were choosing that therapy in large numbers, drastically slowing patient enrollment in the trial. It took more than 7 years (between 1991 and 1998) to achieve the target sample size of 982 women, whereas more than 15,000 off-protocol bone marrow transplantation procedures were administered during that time period. If only half of the women receiving off-protocol bone marrow transplantation had been enrolled in the trial, the target sample size would have been reached in less than 2 years.
The difficulty was that when participants were informed that they faced a 50 percent chance of being randomized to the comparison group, they withheld consent to obtain bone marrow transplantation elsewhere, often just across town. The final result of the trial was that there was no survival benefit to this approach. A risk-based allocation design might have reached the same conclusion much sooner, saving many women from undergoing a very painful, expensive, and, ultimately, questionable surgical procedure. Other examples of desperately ill patients or their caregivers seeking experimental treatments and refusing to be randomized include patients with
AIDS in the early days of trials of drugs for the treatment of human immunodeficiency virus infection and caregivers of premature infants in trials of extracorporeal membrane oxygenation. Other therapies, such as pulmonary artery (Swan-Ganz) catheter placement, estrogen treatment for Alzheimer's disease, or radical surgery for prostate cancer, have been nearly impossible to test in randomized trials because participants, convinced of their therapeutic benefits, did not want to receive the placebo or the standard therapy. These therapies have been cited in the news media because of the extreme difficulty in recruiting participants into randomized trials of the therapies (Altman, 1996; Brody, 1997; Kolata, 1995, 1997; Kolata and Eichenwald, 1999). A risk-based allocation design attempts to circumvent these problems by ensuring that all of the sickest patients will receive the experimental treatment. The design is sometimes called an “assured allocation design” (Finkelstein, Levin, and Robbins, 1996a, b). It has also been called the “regression-discontinuity design,” although that name presupposes a specific statistical analysis that is not always appropriate. The design has three novel features. First, it requires a quantitative measure of risk, disease severity, or prognosis, which is observed at or before enrollment in the study, together with a prespecified threshold for receiving the experimental therapy. All participants above the threshold receive the experimental (new) treatment, whereas all participants below the threshold receive the standard (old) treatment. The second novel feature of the risk-based design is the goal of the trial: to estimate the difference in average outcome for high-risk individuals who received the new treatment compared with that for the same individuals if they had received the old treatment. Thus, in the bone marrow transplantation example, women eligible for the randomized trial had to have 10 or more nodes of involvement.
In a risk-based allocation trial, all of these high-risk women would have been given bone marrow transplantation, whereas women with fewer affected nodes would have been recruited and given the standard therapy. The treatment effect to be estimated in the assured allocation design would be the survival difference for women with at least 10 nodes given bone marrow transplantation compared with that for the same group of women if they had received the standard therapy. The risk-based allocation creates a biased allocation, and the statistical analysis appropriate for estimation of the treatment effect is not a simple comparison of the mean outcomes for the two groups, as it would be in a randomized trial. One analytical method comes from the theory of general empirical Bayes estimation, originally introduced by Herbert Robbins in the
1950s in a series of landmark papers (Lai and Siegmund, 1985; Robbins, 1956, 1977). Robbins applied this approach first to estimation problems, then to prediction problems, and later to risk-based allocation (Robbins, 1993; Robbins and Zhang, 1988, 1989, 1991). If one gives up randomization (because the trial would be impossible to carry out), one needs another principle to achieve a scientifically valid estimate of treatment effect. Therefore, the third feature of the risk-based design is the requirement for a model that can be used to predict what outcomes the sicker patients would have had if they had been given the standard treatment. A prototypic example of the appropriate statistical analysis required is shown in Box 2-6. Thus, there is good rationale for using a risk-based allocation design to compare the outcomes for high-risk patients who receive the new treatment with the predicted outcome for the same patients if they had received the standard therapy. One requires a model for the standard treatment (but only the standard treatment) that relates the average or expected outcome to specific values of the baseline measure of risk used for the allocation. Only the functional form of the model, not specific values of the model parameters, is required. This is because the parameters used in the model will be estimated from the concurrent control data and extrapolated to the high-risk patients. This is an advantage over historically controlled studies. One need not rely on historical estimates of means or proportions of the expected outcome, which are notoriously untrustworthy. All one needs to assume for the risk-based design is that the mathematical form of the model relating outcome to risk is correctly specified throughout the entire range of the risk measure. This is a strong assumption, but with sufficient experience and prior data on the standard treatment, the form of the model can be validated.
In the same way that an engineer can build a bridge without being completely agnostic about the laws of gravity and the tensile strength of steel, so progress can be made without randomization if one has a model that predicts the outcomes of a standard treatment. In addition, the validity of the predictive model can always be checked against the concurrent control data in the risk-based trial. The usual problem of extrapolation beyond the range of data does not arise here for three reasons. First, one assumes that the mathematical form of the model relating outcome to risk is correctly specified throughout the entire range of the risk measure. If one does not know what lies beyond the range of data, then extrapolation is risky. Thus, in this situation one should assume a validated model for standard treatment that covers the whole range of the risk measure, including data for those high-risk patients that form part of the observed data. Estimation of the model parameters from a por-
BOX 2-6 Example of General Empirical Bayes Estimation
Suppose one collects data on the number of traffic accidents that each driver in a population of motorists had during a 1-year baseline period. Most drivers will have no accidents, some will have one, some will have two, and so on. If one focuses on the subgroup of drivers who had no accidents during the baseline period, one can then ask the following question: assuming that traffic conditions and driving habits remain stable, how many accidents in total would the same drivers with no accident in the baseline year be predicted to have in the next year? A model is needed to make a prediction. A reasonable statistical model is that the number of accidents that a single driver has in a 1-year period follows a Poisson distribution, the standard probability law governing the occurrence of rare events. Subject-to-subject variability requires one to assume that the mean parameter of the Poisson distribution (the number of accidents expected per year) varies from driver to driver: some drivers have very safe driving habits with a small expected number of accidents per year, whereas others have less safe driving habits. A key feature of a general empirical Bayes analysis is that no assumption about the distribution of the Poisson mean parameters in the population of drivers needs to be made. In this case, the term “general empirical Bayes” does not mean empirical Bayes generally but, rather, refers to the kind of empirical Bayes method that does not make assumptions about the prior distribution (in contrast to the parametric variety used by Robbins). Robbins proved that an unbiased and asymptotically optimal predictor of the number of accidents next year by the drivers who had no accidents in the baseline year is the number of drivers who had exactly one accident in the baseline year.
The proof of this assertion is based only on the assumption of the form of the model for outcomes (Poisson distribution), without any parametric assumption about how the model parameter is distributed among participants in the population. What is amazing, and the reason that this example is presented, is that information about one group of people (the drivers with no accidents) can be consistently and asymptotically optimally predicted on the basis of information about an entirely different group of people (the drivers with one accident), which is characteristic of empirical Bayes methods. There is no question that the two groups are different: even though the group of drivers with no accidents includes some unsafe drivers who had no accidents by good fortune, the drivers in that group are, nevertheless, safer drivers on average than the drivers in the group with one accident, even though the latter group includes some safe drivers who were unlucky. This illustrates that the complete homogeneity and comparability of two groups so avidly sought after in randomized comparisons is actually not necessary to make valid comparisons, given adequate model assumptions and an appropriate (not naïve) statistical analysis. Finally, one can observe the number of accidents next year among those with no accidents in the baseline year and compare that number with the predicted number using a 95 percent prediction interval based on the baseline data. An approximate 95 percent prediction interval is given by the predicted number plus or minus 1.96 times the square root of twice the number of drivers with either exactly one accident or exactly two accidents (Finkelstein and Levin, 1990). If the observed number is found to differ markedly from the predicted number, there are grounds to reject the starting assumption that driving conditions and habits remained the same. See the section Statistical Analyses in Appendix A for further discussion of risk-based allocation.
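The prediction rule and the approximate prediction interval described in Box 2-6 can be written out in a few lines. This is a minimal sketch under the box's own assumptions (Poisson counts per driver, stable conditions from one year to the next). The function name and the baseline data are hypothetical, and truncating the lower limit at zero is a convenience added here, not part of the source.

```python
import math
from collections import Counter


def predict_accidents_for_zero_group(baseline_counts):
    """General empirical Bayes prediction for drivers with no baseline accidents.

    baseline_counts holds one integer per driver: the number of accidents
    that driver had in the baseline year.
    Returns (predicted_total, lower_limit, upper_limit), where predicted_total
    is the predicted number of accidents next year among the drivers who had
    zero accidents in the baseline year.
    """
    tally = Counter(baseline_counts)
    n_one = tally[1]  # drivers with exactly one baseline accident
    n_two = tally[2]  # drivers with exactly two baseline accidents
    predicted = n_one  # the predictor described in Box 2-6
    # Approximate 95 percent half-width: 1.96 * sqrt(2 * (n_one + n_two)).
    half_width = 1.96 * math.sqrt(2 * (n_one + n_two))
    lower = max(0.0, predicted - half_width)  # a count cannot be negative
    upper = predicted + half_width
    return predicted, lower, upper


if __name__ == "__main__":
    # Hypothetical baseline year: 100 drivers with no accidents, 20 with one,
    # 5 with two, and 1 with three.
    baseline = [0] * 100 + [1] * 20 + [2] * 5 + [3]
    predicted, lower, upper = predict_accidents_for_zero_group(baseline)
    print(f"Predicted accidents next year: {predicted} ({lower:.1f} to {upper:.1f})")
```

If the number of accidents actually observed next year among the accident-free drivers falls outside the interval, that is evidence against the assumption that driving conditions and habits remained stable.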
Page 55 tion of the data and then use of the model to predict responses for high-risk patients is not equivalent to extrapolation of the data into some unknown region of the sample data. Second, the model can be validated with the observed data, which increases confidence in the model over the unobserved data. Third, the effect of extrapolation is accurately reflected by the standard errors, but the effect is not some wild inflation into unknown territory. This third assumption is an important one, and identification of the appropriate model must be accomplished before a risk-based trial can be undertaken. Once the necessary model is developed, there are no other hidden assumptions. The reliability of the available data is important to this approach. A clinical example from Finkelstein, Levin, and Robbins (1996b) is given in Box 2-7. That example uses a simple linear model to relate how much the level of total serum cholesterol was reduced from the baseline to the end of follow-up on the basis of a preliminary measurement of the cholesterol level among a group of cholesteremic, sedentary men in the placebo arm of a well-known randomized trial of the cholesterol-lowering compound cholestyramine. If the trial had been designed as a risk-based allocation trial, the actually observed lowering of the cholesterol level among the highest-risk (the most cholesteremic) men given the active drug could have been compared on the basis of a simple linear model with the lowering predicted BOX 2-7 Potential Effectiveness of Replacing Randomized Allocation with Risk-Based (Assured) Allocation (in Which All Higher Risk Participants Receive the New Treatment) High levels of cholesterol (at least the low-density lipoprotein component) are generally regarded as a risk factor for heart disease. A primary prevention trial was conducted in which 337 participants were randomly assigned to treatment arms to evaluate the ability of cholestyramine to lower total plasma cholesterol levels. 
The group with high cholesterol levels (>290 mg/dl) had an average reduction of 34.42 mg/dl with a treatment effect (the reduction in the cholestyramine-treated high-cholesterol subgroup minus the reduction in the high-cholesterol placebo controls) of 29.40 ± 3.77 mg/dl (standard error) (Lipid Research Clinical Program, 1984). The results also suggest that the drug is less effective in absolute terms for participants with lower initial total plasma cholesterol levels (<290 mg/dl). By applying a risk-based allocation model to the same data, the treatment effect is estimated for participants at higher risk (>290 mg/dl of total plasma cholesterol) to be 30.76 ± 8.02 mg/dl, which is close to the RCT result of 29.40 mg/dl. Thus, for the high-risk patients, the results from the trial with a risk-based allocation design are virtually identical to those of the trial with the conventional design (Finkelstein, Levin, and Robbins, 1996b).
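The kind of calculation summarized in Box 2-7 can be sketched as follows. This is an illustration only, not the analysis of Finkelstein, Levin, and Robbins. It assumes a simple linear model relating outcome to baseline risk under the standard treatment and, as a further simplifying assumption made here, a common residual variance in both groups. The function name and the data in the usage example are hypothetical.

```python
import math


def risk_based_effect(controls, treated):
    """Estimate the treatment effect for high-risk participants.

    controls: (baseline_risk, outcome) pairs for lower-risk participants
              given the standard treatment.
    treated:  (baseline_risk, outcome) pairs for higher-risk participants
              given the new treatment.
    Returns (effect, standard_error).
    """
    n_c, n_t = len(controls), len(treated)
    mean_xc = sum(x for x, _ in controls) / n_c
    mean_yc = sum(y for _, y in controls) / n_c
    sxx = sum((x - mean_xc) ** 2 for x, _ in controls)
    sxy = sum((x - mean_xc) * (y - mean_yc) for x, y in controls)

    # Least-squares line fitted to the concurrent controls only.
    slope = sxy / sxx
    intercept = mean_yc - slope * mean_xc
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in controls)
    sigma2 = sse / (n_c - 2)  # residual variance estimate

    # Extrapolate the standard-treatment model to the high-risk group.
    mean_xt = sum(x for x, _ in treated) / n_t
    mean_yt = sum(y for _, y in treated) / n_t
    predicted = intercept + slope * mean_xt
    effect = mean_yt - predicted

    # The last term grows with the distance between the two groups' mean
    # risk: this is the cost of extrapolation that appears in the standard error.
    variance = sigma2 * (1 / n_t + 1 / n_c + (mean_xt - mean_xc) ** 2 / sxx)
    return effect, math.sqrt(variance)


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    controls = [(0, 0.0), (1, 1.0), (2, 1.0), (3, 3.0)]
    treated = [(5, 8.0), (6, 9.0)]
    effect, se = risk_based_effect(controls, treated)
    print(f"Estimated effect: {effect:.2f} (standard error {se:.2f})")
```

The extrapolation term in the variance is one way to see why a risk-based design yields a larger standard error than a balanced randomized design with the same number of participants.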
for the same men while they were receiving a placebo. The example illustrates that the risk-based approach would have arrived at nearly the same estimate of treatment effect for those at higher risk as the RCT did. Some cautions must be observed when risk-based allocation is used. The population of participants entering a trial with a risk-based allocation design should be the same as that for which the model was validated so that the form of the assumed model is correct. Clinicians enrolling patients into the trial need to be comfortable with the allocation rule, because protocol violations raise difficulties just as they do in RCTs. Finally, the standard error of estimates will reflect the effect of extrapolation of the model predictions for the higher-risk patients on the basis of the data for the lower-risk patients. Because of this, a randomized design with balanced arms will have smaller standard errors than a risk-based design with the same number of patients. In the example of the study of cholestyramine in Box 2-7, the standard error for the risk-based design was slightly more than double that for the randomized design (8.02 versus 3.77 mg/dl). What do these ideas have to do with small clinical trials? Consider the example of bone mineral density loss among astronauts. An obvious risk factor that correlates with bone mineral density loss is the duration of the mission in space: the longer the mission, the greater the bone mineral density loss. What would be required in a risk-based study design is the mathematical form of this relationship for some standard countermeasures (countermeasure is the term that the National Aeronautics and Space Administration uses for a preventive or therapeutic intervention that mitigates bone mineral density loss or other physiological adaptations to long-duration space travel). Astronauts who will be on extended future missions on the International Space Station will be at higher risk than those who have shorter stays.
If those on the longer missions (who are at higher risk) were to receive new experimental countermeasures, their bone mineral density losses could be compared on a case-by-case basis to a prediction of what their bone mineral density loss would have been by use of the standard countermeasures. Such comparisons of observed versus expected or predicted outcomes are familiar in other studies with small sample sizes, such as studies searching for associations of rare cancer with a variety of toxic exposures. Finally, any trial conducted in an unblinded manner has a potential bias. In some cases a trial with a risk-based allocation design need not be conducted in an unblinded manner; for example, patients may be assured of receiving an active experimental treatment together with a placebo standard treatment if they are at high risk or a placebo experimental treatment together with an active standard treatment if they are lower risk. The trial may
be conducted in a blinded manner if the risk measure is not obvious to the patient. In many cases, however, the trial, such as a surgical intervention trial, would have to be unblinded. The issue is nothing new. Solid endpoints unaffected by investigator bias and careful protocols for permitted concomitant behavior are the best safeguard in unblinded trials.
SUMMARY
Scientific research has a long history of well-established, well-documented, and validated methods for the design, conduct, and analysis of clinical trials. An appropriate study design is one with a sufficient sample size, adequate statistical power, and proper control of bias to allow a meaningful interpretation of the results. The committee strongly reaffirms that, whenever feasible, clinical trials should be designed and performed so that they have adequate statistical power. When the clinical context does not provide a sufficient number of research participants for a trial with adequate statistical power but the research question has great clinical significance, the committee understands that, by necessity for the advancement of human health, research will proceed. Bearing in mind the statistical power, precision, and validity limitations of studies with small sample sizes, the committee notes that there are innovative design and analysis approaches that can improve the quality of such trials. In small clinical trials, it is more likely that the sample population will share several unique characteristics, for example, disease, exposures, or environment. Thus, it might be more practical in some small clinical trials than in large clinical trials to involve the participants in the design of the trial. By doing so, the investigator can increase the likelihood of compliance, adherence to the regimen, and willingness to participate in monitoring and follow-up activities.
It is also important for investigators to consider confidentiality and privacy in disseminating the results of studies whose sample populations are easily identified. Investigators should also keep in mind opportunities for community discussion and consultation during the planning and conduct of all clinical trials.
RECOMMENDATIONS
Because of the constraints of trials with small sample sizes, for example, trials with participants with unique or rare diseases or health conditions, it is particularly important to define the research questions and select outcome measures that are going to make the best possible use of the available participants while minimizing the risks to those individuals.
RECOMMENDATION: Define the research question. Before undertaking a small clinical trial it is particularly important that the research question be well defined and that outcomes and conditions to be evaluated be selected in a manner that will most likely help clinicians make therapeutic decisions.
RECOMMENDATION: Tailor the design. Careful consideration of alternative statistical design and analysis methods should occur at all stages in the multistep process of planning a clinical trial. When designing a small clinical trial, it is particularly important that the statistical design and analysis methods be customized to address the clinical research question and study population.
Clinical researchers have proposed alternative trial designs, some of which have been applied to small clinical trials. For a smaller trial, when the anticipated effect is not great, researchers may encounter a difficult tension between scientific purity and pragmatic necessity. One approach might be to focus on a simple, streamlined hypothesis (not multiple ones) and choose one means of statistical analysis that does not rely on any complicated models and that can be widely validated. An alternative approach is to choose a model-dependent analysis, effectively surrendering any pretense of model validation, knowing that there will not be enough information to validate the model, a risk that could compromise the scientific validity of the trial. The committee believes that the research base in this area requires further development. Alternative designs have been proposed in a variety of contexts; however, they have not been adequately examined in the context of small studies.
RECOMMENDATION: More research on alternative designs is needed.
Appropriate federal agencies should increase support for expanded theoretical and empirical research on the performances of alternative study designs and analysis methods that can be applied to small studies. Areas worthy of more study may include theory development, simulated and actual testing including comparison of existing and newly developed or modified alternative designs and methods of analysis, simulation models, study of limitations of trials with different sample sizes, and modification of a trial during its conduct.
Because of the limitations of small clinical trials it is especially important that the results be reported with accompanying details about the sample size, sample characteristics, and study design. The details necessary to combine evidence from several related studies, for example, measurement methods, main outcomes, and predictors for individual participants, should be published. There are two reasons for this: first, it allows the clinician to appropriately interpret the data within the clinical context, and second, it paves the way for future analyses of the study, for example, as part of a sequential design or a meta-analysis with other small clinical trials. In the clinical setting, the consequences might be greater if one misinterprets the results. In the research setting, insufficiently described design strategies and methods diminish the study's value for future analyses.
RECOMMENDATION: Clarify methods in reporting of results of clinical trials. In reporting the results of a small clinical trial, with its inherent limitations, it is particularly important to carefully describe all sample characteristics and methods of data collection and analysis for synthesis of the data from the research.