functionally interoperable with the clinical-trial registration database, ClinicalTrials.gov.
Researchers encounter barriers to the reporting of sex-specific biomedical research results well before the publication stage, said session moderator Jon Levine, director of the Wisconsin National Primate Research Center and editor-in-chief of Frontiers in Neuroendocrinology. Challenges emerge in designing experiments, applying for grants, and making the most of limited funding inasmuch as these activities build on the existing knowledge base, which is historically biased toward males.
The Politics of Sex Differences
Biases against studying females are embedded in the research culture, and there are numerous misconceptions, said Larry Cahill, professor of neurobiology and behavior at the University of California, Irvine. In neuroscience, for example, some think that if there is no behavioral difference between the sexes, there is no brain difference. It is known, however, that identical behaviors can be manifested through different neurobiologic mechanisms. Others assert that consideration of sex differences makes things more complicated. But analyzing data by sex can sometimes provide clarity.
Cahill offered an example of sex differences in brain function from his work on emotional memory. He discovered that the amygdala operates differently in men and in women when they watch the same emotional event; activity in the left-hemisphere amygdala is more predictive of memory of a given event in women, while activity in the right-hemisphere amygdala is more predictive of memory of the same event in men.
The greatest obstacle to moving forward, Cahill said, is the profound biases that exist against the consideration of sex differences. Such biases may be even greater in studies of the brain. Sex differences in the liver or kidneys are not particularly controversial, but sex differences in the brain can become a political issue. Cahill said that researchers need
to be bold and assert that male-only studies are not good enough anymore. How many false conclusions have been published as a result of failure to consider sex differences? he asked.
Male Bias in Animal Studies
One argument for the preferential use of males in animal studies, said Rae Silver, Kaplan Professor of Natural and Physical Sciences at Barnard College and Columbia University, is that females are more variable than males, partly because of cyclic reproductive hormones. There is evidence that some behaviors exhibit cycle-related variations, but in most instances there is little or no evidence that such variations make female models inappropriate.
But the arguments persist. One commentary cited by Silver described how a particular rat model of arthritis was more reproducible in male rats and that therefore far fewer males than females were needed to achieve statistically significant results. The researcher asserted, however, that the results were applicable to both sexes.
One argument that does hold is that the cyclic nature of female sex hormones necessitates larger samples and more test groups in rodent work. Studying females requires more time, is more labor-intensive, and is more expensive than studying only males. Researchers must often justify the cost, as well as the increased use of animals, to their administration or institutional animal-care-and-use committee.
Silver questioned whether it would be possible to require the animal-research community to include both males and females when appropriate, as has been done for humans. Workshop participant Vivian Pinn, director of ORWH, responded that it takes great effort for NIH to monitor the mandated inclusion of women and minorities in clinical trials, and it could become overwhelming to monitor the sex of animals in studies in the same way. It would be more practical, and probably as effective, if researchers knew that information on the sex of animals was desired or required when submitting the results of studies for publication.
Sex Differences Across the Full Spectrum of Research
Denise Faustman, director of the Immunobiology Laboratory at Massachusetts General Hospital, noted that three large phase 3 clinical trials of type 1 diabetes products had recently failed; together, they were estimated to have cost over $3 billion to conduct. In two of those trials, she said, enrollment of males and females was fairly well balanced—
about 60–70% men. The preclinical data that informed the human trials, however, were obtained solely in female mice. She asked how these large, expensive trials might have been designed differently if pharmacokinetics or responsiveness or the stage of the disease had been studied in both male and female animals. The blame for failed clinical trials is shared equally by the clinical researchers who design and conduct the trials and the basic researchers who continue to publish data on only males or only females because it is easier. Sex differences must be considered and reported across the whole spectrum of research, Faustman said.
Subpopulations of Males and Females
A participant pointed out that males and females constitute broad subpopulations that can each be divided. For example, women in the follicular phase are different from women in the luteal phase; prepubertal women are different from postpubertal women; women taking hormone-replacement therapy are different from women who are not; and women taking estradiol and progesterone are different from women taking Premarin with medroxyprogesterone acetate. Similarly, men taking androgens are different from men who are not. Those who understand or study reproduction or endocrinology are more aware of these issues, but researchers in other fields often are not. A challenge is how to make researchers more aware. Cahill concurred, noting that in his early work on emotional memory he simply divided subjects into men and women, but he later discovered that the division had led to false conclusions. He failed to find enhancing effects of stress hormones on memory in women, he explained, because he had not accounted for menstrual cycle or the use of hormonal contraception.
Analysis and Interpretation of Subgroups
Clinical-trial data reflect groups of participants, explained John B. Wong, chief of the Division of Clinical Decision Making at Tufts Medical Center, but each patient that a physician sees is a unique individual with unique risk factors, genetic profile, experiences, and medications. The question is which of the participants in a randomized controlled trial is the same as the patient about to be treated. That is the driving force for subgroup analysis.
Wong offered a cautionary tale about subgroup analysis. The International Study of Infarct Survival, a randomized controlled trial of thousands of patients, found an overall statistical benefit of aspirin over placebo in prevention of death (ISIS-2, 1988). Sleight (2000) conducted an analysis of 12 subgroups and identified two that had a nonsignificant adverse effect. Those two subgroups, Wong revealed, were participants whose astrologic signs were Gemini and Libra. That is amusing at first, but Sleight, a noted statistician, stressed in his publication that “when clinicians believe such subgroup analyses, there is real danger of harm to the individual patient” (Sleight, 2000, p. 25).
Frequentist Statistics and Null-Hypothesis Errors
From the frequentist statistical perspective, hypothesis testing begins with the null hypothesis: the position that a drug and a placebo are equivalent. Given that assumption, any observed differences in results would be due to chance. Under the alternative hypothesis that the drug and the placebo differ, observed differences in results would be attributed to real differences between the drug and the placebo, but the null hypothesis is easier to test.
Wong described the problems of type I and type II errors, noting that concern about the former is often greater, and the problem of statistical power, in which an inadequate sample size increases the chance of a type II error. Two types of errors can occur in testing the hypothesis that there is no difference between drug and placebo (Table 1): the drug is truly beneficial or it is not, and the study suggests either that the drug is beneficial or that it is not. A type I error occurs when the study results show that the drug is beneficial but in fact it is not (a false positive). By convention, the probability of such an error is held below 0.05 (α = 0.05) under the assumption that the drug is equivalent to the placebo. A type II error occurs when the study results show that the drug is not beneficial but in fact it is (a false negative); the accepted probability of such an error is usually 0.1–0.2 (β = 0.1 or β = 0.2).
The consequence of these two kinds of errors in subgroup analysis is multiplicity. For a type II error, if a drug is truly beneficial (the unknown truth is that it works), the probability that the study will erroneously find the drug to be not beneficial is about 20% [1 – 80% = 20%]. Assuming that each subgroup is independent, if two subgroups are analyzed, the probability of erroneously finding the drug to be not beneficial in at least one subgroup increases to 36% [1 – (80%)(80%) = 36%]. With 12 subgroups, there is a 93% chance of an erroneous finding that the drug is not effective in at least one subgroup [1 – (80%)^12 = 93%].

For a type I error, if the drug is truly not beneficial, the probability that the study will erroneously find it to be beneficial is 5% if there are no subgroups [1 – 95% = 5%], 10% if there are two independent subgroups [1 – (95%)(95%) = 10%], and 46% if there are 12 subgroups [1 – (95%)^12 = 46%].

TABLE 1 Errors of Hypothesis Testing

| Study Result | Drug Truly Beneficial | Drug Truly Not Beneficial |
| --- | --- | --- |
| Study finds benefit | Correct (power): 1 – β = 0.80 | Type I error: α = 0.05 |
| Study finds no benefit | Type II error: β = 0.20 | Correct: 1 – α = 0.95 |

SOURCE: Wong, 2011, Slide 6.
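The multiplicity arithmetic above can be reproduced in a few lines of Python. This is a minimal illustrative sketch (not part of Wong's presentation), and it assumes, as the text does, that the subgroups are independent:

```python
# Multiplicity in subgroup analysis: with independent subgroups, the chance
# of at least one erroneous finding grows quickly with the number of subgroups.

def prob_at_least_one_error(per_test_error: float, n_subgroups: int) -> float:
    """Probability of at least one erroneous subgroup finding, assuming independence."""
    return 1.0 - (1.0 - per_test_error) ** n_subgroups

beta = 0.20   # per-subgroup type II error (false-negative) rate
alpha = 0.05  # per-subgroup type I error (false-positive) rate

# Type II: drug truly works, but at least one subgroup "shows" no benefit.
print(round(prob_at_least_one_error(beta, 1), 2))   # 0.2
print(round(prob_at_least_one_error(beta, 2), 2))   # 0.36
print(round(prob_at_least_one_error(beta, 12), 2))  # 0.93

# Type I: drug truly does not work, but at least one subgroup "shows" benefit.
print(round(prob_at_least_one_error(alpha, 2), 2))   # 0.1
print(round(prob_at_least_one_error(alpha, 12), 2))  # 0.46
```

The same function reproduces every figure in the two paragraphs above, which is the point of the example: the per-test error rates stay fixed while the family-wise error rate compounds.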
Having described the general concerns about subgroup analysis, Wong suggested Bayesian statistical inference as one possible approach to reporting of sex-based subgroups. Bayesian inference is a method of showing how knowledge or belief is altered by data (for further background, see Goodman, 1999). It provides a framework for combining prior belief or evidence with current evidence. The FDA guidance on using Bayesian methods for medical-device clinical trials, Wong said, describes it as “learning from evidence as it accumulates” (FDA, 2010, p. 5).
To illustrate the use of Bayesian inference, Wong asked: What is the probability that an asymptomatic woman 40–50 years old with a positive mammogram has breast cancer? Prior knowledge is that about 0.8% of asymptomatic 40- to 50-year-old women have breast cancer. In other words, of 1,000 asymptomatic women, based on prior knowledge of prevalence, eight (0.8%) would have breast cancer. Seven of those eight (90%) would have positive mammograms. However, 69 (7%) of the remaining 992 women who do not have breast cancer would have positive mammograms. The Bayes rule, or a Bayesian interpretation, Wong explained, would suggest that the probability of breast cancer in those with positive mammograms is 7 of the total positive mammograms (7 + 69), or 9%, because so many more women do not have breast cancer
than have breast cancer.4 Most physicians, Wong noted, guess that the likelihood is over 90%.
Another way to look at the data is with what Wong referred to as a likelihood ratio. If a patient has a positive mammogram, the likelihood that she has breast cancer is 90%, and the likelihood that she does not is 7%. Hence, the patient is 13 times as likely to have breast cancer as not if she has a positive mammogram (90% ÷ 7% = 13).
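Wong's mammogram arithmetic can be checked directly. The sketch below uses the prevalence (0.8%), sensitivity (90%), and false-positive rate (7%) stated above; it carries fractional counts rather than the rounded whole-women counts in the text, so the posterior comes out near 9%, matching Wong's figure:

```python
# Bayes-rule arithmetic for the mammogram example: 1,000 asymptomatic women
# aged 40-50, prevalence 0.8%, sensitivity ~90%, false-positive rate ~7%.

prevalence = 0.008
sensitivity = 0.90          # P(positive mammogram | breast cancer)
false_positive_rate = 0.07  # P(positive mammogram | no breast cancer)

n = 1000
with_cancer = prevalence * n                                # 8 women
true_positives = sensitivity * with_cancer                  # ~7 women
false_positives = false_positive_rate * (n - with_cancer)   # ~69 women

# Posterior probability of cancer given a positive mammogram (Bayes rule)
posterior = true_positives / (true_positives + false_positives)
print(round(posterior, 2))  # 0.09, i.e. about 9%

# Likelihood ratio of a positive mammogram
likelihood_ratio = sensitivity / false_positive_rate
print(round(likelihood_ratio))  # 13
```

The counterintuitively low posterior arises because the 992 cancer-free women generate roughly ten times as many positive mammograms as the 8 women with cancer.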
Wong also described the Bayes factor, which compares how well competing hypotheses predict the data (for further background, see Goodman, 1999). All information from a clinical trial is taken into account in the Bayes factor, Wong noted, and it indicates the likelihood of an effect, as discussed above. In essence, it is the probability of the data given the null hypothesis vs the probability of the data given the alternative hypothesis. In contrast with the frequentist statistical perspective discussed above, there is a separation between the probability of error, which is tied to the null hypothesis, and the weight of the evidence from a particular clinical trial, which is the Bayes factor. In other words, a Bayesian integration gains strength from prior information, whereas a frequentist approach cannot incorporate it.
A Bayesian approach formally integrates prior knowledge with data (“sequential learning”). However, it requires a subjective prior belief or evidence; conclusions depend on the prior evidence, and different investigators may use different prior evidence (which may actually help to determine how robust the conclusions are). A Bayesian approach can be used for hierarchical modeling, which combines results or “borrows strength” from different studies. For example, if the national prevalence of diabetes in the United Kingdom is 2% with a standard deviation of 0.5% and, in a local sample of 1,000 patients in a given city, 1.5% have diabetes, with the Bayesian framework the national and local data could be integrated to estimate that 1.7% of the patients in the city have diabetes (with a 95% credible interval of 1.2–2.4%). In contrast, the frequentist approach could not integrate the national data and would estimate that 1.5% of the patients in the city have diabetes (with a 95% confidence interval of 0.8–2.5%). It has also been suggested that a Bayesian approach can be used in the design and conduct of clinical trials and would facilitate flexibility, including adaptive randomization and stopping criteria (Berry, 2005).
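The “borrowing strength” idea in the diabetes example can be sketched with a conjugate normal-normal update. The exact model behind the figures quoted above is not given in the text; the normal approximation below is an assumption made for illustration, and it recovers a posterior mean of about 1.7%, between the national prior (2%) and the local estimate (1.5%):

```python
import math

# Normal-normal conjugate sketch of hierarchical "borrowing strength":
# national prior N(2.0%, sd 0.5%) combined with a local sample in which
# 15 of 1,000 patients (1.5%) have diabetes.

prior_mean, prior_sd = 0.020, 0.005
p_local, n_local = 0.015, 1000
local_se = math.sqrt(p_local * (1 - p_local) / n_local)  # binomial standard error

# Precision-weighted combination (the conjugate normal posterior)
w_prior = 1.0 / prior_sd**2
w_local = 1.0 / local_se**2
post_mean = (w_prior * prior_mean + w_local * p_local) / (w_prior + w_local)
post_sd = math.sqrt(1.0 / (w_prior + w_local))

print(round(post_mean, 3))  # 0.017, i.e. about 1.7%
```

The posterior sits closer to the local estimate than to the prior because the sample of 1,000 carries more precision (a smaller standard error) than the national prior does.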
4Wong referred participants to Calculated Risks by Gerd Gigerenzer (2002) for further discussion.
Wong pointed out, however, how assumptions about prior evidence can affect interpretation of a new study and have large effects on the conclusions drawn. Berlin suggested the need for a “research czar” who could help to facilitate some level of consistency among similar studies, for example, in common terminology and definitions. Wong noted that the Patient-Centered Outcomes Research Institute has a methodology committee that is attempting to address some of the issues, such as methodological standards, that would help facilitate assessment of data among studies.
From an industry perspective, Berlin said, a barrier to sharing clinical-trial data is that participants sign an agreement that dictates whether and how their information can be shared. He said that sponsors should develop participant agreements to facilitate sharing.
Frank Davidoff, editor emeritus of Annals of Internal Medicine, suggested risk-stratification analysis as an alternative to Bayesian statistics for applying clinical-trial results to an individual patient, and he referred participants to the work of Kent and Hayward (2007a,b). When multiple risk factors are used to segregate a sizable study population into risk subgroups, the difference in rates of outcomes can be as great as a factor of 50, he said. For example, a drug that demonstrates an overall beneficial effect may have virtually no beneficial effect in some subgroups, probably because their baseline risk is small to begin with. At the other extreme, the intervention may have a large clinical effect in patients who have a high baseline risk. For many researchers, risk stratification is less difficult to grasp than Bayesian analysis, and Davidoff suggested that it is statistically robust. Risk-stratification analysis can be applied to existing trials to look for differences in intervention effects among different groups, including sex. There are methodologic challenges to risk stratification, he noted, including the need for an independent determination of the risk groups, and there is a potential for type I and type II errors.
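The risk-stratification point, that absolute benefit can vary enormously across risk subgroups even when the relative effect is uniform, follows from simple arithmetic. The baseline risks and relative risk below are hypothetical numbers chosen for illustration, not values from any trial Davidoff cited:

```python
# Hypothetical illustration of risk stratification: with a constant 25%
# relative risk reduction, the absolute benefit across baseline-risk
# strata can differ by a factor of 50.

relative_risk = 0.75  # assumed uniform treatment effect (25% relative reduction)
baseline_risks = {"low-risk stratum": 0.001, "high-risk stratum": 0.05}

# Absolute risk reduction per stratum = baseline risk x (1 - relative risk)
arr = {name: risk * (1 - relative_risk) for name, risk in baseline_risks.items()}
for name, reduction in arr.items():
    print(f"{name}: absolute risk reduction = {reduction:.4%}")

ratio = arr["high-risk stratum"] / arr["low-risk stratum"]
print(round(ratio))  # 50
```

This is why an intervention with an overall benefit can show virtually no absolute benefit in low-baseline-risk subgroups while producing a large clinical effect in high-risk patients.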
A Role for Journals
Silver referred participants to the report of a 2010 IOM workshop, Sex Differences and Implications for Translational Neuroscience
Research, which focused on defining roles for industry, government, academe, and journals in the translation of sex differences in neuroscience from bench to bedside. One of the suggestions raised at that workshop was that journal publishers set standards “for the inclusion of sex-related subject information in all publications, including sex of origin of tissues, cell lines, etc.” and “establish guidelines to encourage authors to analyze data by sex and to report sex differences, or the lack thereof” (IOM, 2011, p. 77). It was noted, however, that it is not possible to study everything all the time, and one of the challenges raised at the 2010 workshop was to set priorities.
Silver cited the work of Beery and Zucker (2011), who analyzed the distribution of animal and human male and female subjects in published studies in journals in diverse biologic disciplines. The sex of subjects was not specified in a number of journals; in many cases in which sex was noted, there was a male bias. Silver noted that in nearly every discipline the male bias was more pronounced in nonhuman studies than in human studies. Silver quoted five of the recommendations of Beery and Zucker (2011, p. 570) that were based on their findings:
If male and female models are thought to differ in response to an intervention, then the study must be designed with adequate sample size to answer the question for each sex.
If prior research strongly indicates that there are no significant sex differences between male and female animals, then sex need not be a criterion in subject selection, but study of both males and females remains feasible and is encouraged.
If information about the existence of sex differences is absent or equivocal, then both sexes should be studied in numbers sufficient to permit valid analysis.
Outreach training activities offering practical suggestions and additional sources of information should be made available by the NIH to help investigators design studies that fully incorporate female animals….
The review process for extramural funding should treat inclusion of females as a matter of scientific merit that affects funding.
Journal policies determine manuscript reporting requirements, Silver said, and if journal editors believe that it is important to know the sex of origin of a cell type that is being studied or the sex of animal or human participants, investigators will have to include that information.
Cahill suggested that for studies (of a non–sex-specific issue) in which only one sex has been used, journal editors should make the last two words of the article title “in males” or “in females.” In addition to providing immediate clarity to basic researchers as they refer to the literature, this truth-in-advertising policy would raise awareness and would be a powerful statement that sex matters. Davidoff noted that that is similar to what was done in the mid-1990s in publications of randomized controlled trials. Such publications were not always easily identifiable, partly because “randomized controlled trial” was not included in the title and partly because the articles were not indexed as trials in the U.S. National Library of Medicine’s online database (MEDLINE®/PubMed®). Proper titling and indexing of papers allow researchers to study the frequency with which types of studies are published, and allow meta-analyses to be done more quickly, easily, and completely.
Sex-Based Comparisons vs Reporting of Participant Sex
Judith Lichtman, associate professor in the Department of Epidemiology and Public Health at Yale University School of Medicine, suggested that in considering standardization of journal policies for sex-specific reporting, it is important to remember that there are studies that are designed to assess sex-based differences, or of which such assessment is a natural extension, and studies in which sex-related data would be interesting to know but are not necessarily the focus. Studies designed to analyze by sex and studies that simply note the sex of participants as an observation present different methodologic issues. The extent to which sex is considered affects the focus of the work, the analyses, and often the length of the resulting paper. She suggested that requiring sex-based analysis takes study-design decisions out of the hands of the authors and peer reviewers and that comparisons drawn from studies that were not designed to assess sex differences may not be robust and could be misleading.
Sex-specific analysis presents methodologic and analytic challenges. For example, sample size is important. There must be enough data for adequate statistical power and useful comparisons. When the events being studied are very rare, there can be unintentional bias in enrollment or a disproportionate blend of women or men among study sites. There may also be differences in prevalence or risk factors between males and females, and differences in psychosocial factors may come into play in comparisons. Lichtman added that older datasets that do not have the desired distribution of men and women can still be of value
even though they may not have adequate power: relationships may be apparent, and they can help in generating hypotheses.
Lichtman described her quick survey of August 2011 issues of the Annals of Internal Medicine, JAMA, and the New England Journal of Medicine. Of 11 original contributions, four included some level of sex stratification of data, five that she thought probably should have included sex-specific analysis did not, and in the remaining two it was not clear whether stratification would have been appropriate (for example, an investigation of a nationwide outbreak of Salmonella infections associated with peanuts). She stressed that it is important to consider when sex-specific analysis makes sense and when it does not.
Other Subgroups: Race and Age
There is no question that sex is an important difference and one that has been underreported in the literature, Lichtman said, but differences are also associated with race and age, and reporting policies may need to be extended to those categories. When sex, age, and race are all considered, however, data presentation and interpretation can become complicated, and careful thought must be given to which comparisons are most useful.
Workshop participant Pinn pointed out that the law requires NIH to include women and minorities and their subpopulations in clinical research. Analysis by race can be challenging, and researchers are often confused about how to address subpopulations. Although ORWH focuses primarily on women, the NIH National Institute on Minority Health and Health Disparities (NIMHD) focuses on minorities and other health-disparity populations. Both ORWH and NIMHD report data by race and by sex.
Data on Sex-Specific Reporting
Pinn stressed that in looking at data on sex-specific reporting, it is important to know what studies the data are based on, for example, whether the data are only for clinical trials, or for clinical trials and observational studies, or whether the data are for studies funded by NIH or for all studies. She noted that NIH has been conducting analyses of clinical research and in looking at 12,000 protocols in FY 2010 found that 56% of the 23.3 million participants were women. When sex-specific studies of diseases that affect only women or only men were excluded from the analysis, 51.6% of the participants in NIH-funded extramural