This session of the workshop was designed to explore the ways in which standardization might impact research that incorporates animal models. Perspectives were provided by three speakers who discussed the challenges of developing, implementing, and disseminating standards along with the potential benefits and risks. Andrew Holmes described some of the controversy surrounding standardization of behavioral models and shared examples of interlaboratory standardization studies that have led to differing results. Timothy Bussey described the development of automated testing methods that would reduce interference introduced by experimenters in both animal and human studies. Lennart Mucke presented several examples of successful translation of findings in animal models to humans and offered reasons why translation sometimes fails. He also provided his perspective on how optimization of experimental procedures through best practices for preclinical research might be an alternative to standardization of models.
As background for the discussion, session moderator Walter Koroshetz, deputy director of the National Institute of Neurological Disorders and Stroke (NINDS), referred to several recent reports that raise concerns about the reproducibility of published scientific data. For example, researchers at the pharmaceutical company Bayer reported that of 67 projects the company acquired based on “exciting published data,” two-thirds were abandoned in the target validation stage because Bayer scientists could not sufficiently replicate the published data (Prinz et al., 2011). Another report suggests that many of the published findings of positive effects in animal models of potential treatments for amyotrophic
lateral sclerosis are most likely “noise … as opposed to actual drug effect” (Scott et al., 2008).
Koroshetz offered his own perspective on some of the goals of standardization:
• Improve best laboratory practices to decrease the publication of spurious results.
• Facilitate the reproducibility of results.
• Facilitate the dissemination of valuable animal models into more laboratories.
• Improve comparability across studies using “identical” animal models (requires knowledge of laboratory-to-laboratory variability).
• Restore trust before disaster strikes.
He also suggested several potential risks to keep in mind:
• The increased burden posed by over-standardization could stifle innovation.
• Research might gravitate toward standardized models, thereby restricting development of better models or the testing of multiple models.
• There could be decreased generalizability due to convergence of studies on a limited number of standardized models.
Standardization is not an “all-or-none” question but rather a “when and how much” consideration, Koroshetz said, and he referred workshop participants to recent recommendations from NINDS for experimental design, minimizing bias, results reporting, and results interpretation.1
CHALLENGES TO STANDARDIZATION OF BEHAVIORAL MODELS
Andrew Holmes, chief of the Laboratory of Behavioral and Genomic Neuroscience at the National Institute on Alcohol Abuse and Alcoholism, discussed standardization of behavioral models in the context of preclinical models and assays of anxiety. Holmes described several tests that have been the basis for much of the preclinical research in anxiety
over the past 60 years, including the open field test, elevated plus-maze, and light/dark box. He explained that these approach/avoidance tests are based on the simple premise that small prey animals such as rats and mice have an innate aversion to exploring open, brightly lit areas where the risk of predation is presumably high, yet at the same time they have a natural drive to explore novel, potentially fruitful environments where they might find food, mates, or new territory.
The conceptual framework of these tests is straightforward, but each laboratory conducting anxiety testing uses what it believes to be the best apparatus and testing approach. The question then, Holmes said, was whether this variability affects the ability to reproduce findings across laboratories and across studies. To illustrate the complexities of this issue, Holmes highlighted three studies. As background, he noted that it has been known for many decades that genetically inbred, isogenic strains of mice differ in various phenotypes, including measures related to anxiety. Using these inbred strains restricts the amount of variability in the population and presumably increases the ability to detect influences due to an environmental or a procedural difference.
The first study Holmes described compared the results of standard tests and assays for anxiety across four different laboratories involved in a consortium project (Mandillo et al., 2008). It was acknowledged that differences in equipment and apparatus could be a possible confound in standardization. Each laboratory was allowed to use the apparatuses already in place, and there were no attempts to equate variables such as housing or the vendor from which the mice were purchased. In one test, for example, using the percentage of time spent in the center of the open field as a measure of anxiety-like behavior, there were marked differences between mouse strains within a laboratory. Yet, although the magnitude of the differences varied among laboratories, trends were preserved. The authors concluded that despite differences in equipment, vendors, and housing across laboratories, the results were reproducible and robust. They also suggested possible confounds that might limit tighter replication, including experimenter experience, animal husbandry, apparatus differences, and clarity of the standard operating procedure used.
In the second case highlighted by Holmes, the investigators went to “extraordinary lengths to equate the test apparatus, protocols, and all possible features of animal husbandry” that they could control (Crabbe et al., 1999, p. 1670). Across a battery of different tests, the sites sought to
ensure that testing was done in the same order, on animals of the same age, at the same time of day, etc.
In one test, mice were assessed for time spent in the open arms of the elevated plus-maze. The authors found that at one site, for example, BALB/c mice spent more time in the open arms than C57BL/6 mice, indicative of a lower level of anxiety-like behavior in the BALB/c mice. A different site found the exact opposite effect, with C57BL/6 mice showing lower levels of anxiety on the same test. In contrast, in another test measuring voluntary alcohol consumption, the results for both strains at these two sites were remarkably consistent.
Crabbe and colleagues concluded that despite their efforts to equate the testing environments across laboratories, there were still significant effects of site for nearly all variables. They also noted the challenges of behavioral research standardization because there are many differing opinions on the “best” way to assay behavior. As a result, they went on to say that “it is not clear whether standardization of behavioral assays would markedly improve future replication of results across laboratories” (Crabbe et al., 1999, p. 1672). This statement was quite provocative and distressed many researchers, Holmes noted.
A third study Holmes described asked whether, if standardization is not beneficial, systematic nonstandardization might paradoxically improve reproducibility. To test this, Richter et al. (2010) attempted to mimic different laboratory environments in their own setting. They ran four different experiments, systematically varying two potentially influential factors in each. For example, they compared C57BL/6 and BALB/c mice in the open field test under standardized conditions and under similar conditions but with two factors varied: the size of the housing cage and the illumination level during the test (small cage, high light; large cage, high light; small cage, low light; and large cage, low light).
They found that under the standardized conditions, the magnitude of the difference between strains was variable across the experiments, whereas in the experiments where select parameters were systematically varied, they observed remarkable consistency in the strain differences. This was also quite a provocative result and led to a lot of debate in the field, Holmes said. The authors reasoned that standardized experiments can generate spurious results because over-standardization can artificially inflate the sensitivity of the procedure to the point where researchers are more likely to find false-positive or false-negative results. So while the results may be clean, the generalizability of that result to other situations may be limited. The authors suggested that varying some conditions
(e.g., age of animals, housing conditions) may improve reliability and generalizability of results. They also noted that this may apply more broadly, beyond behavioral studies.
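The logic of the heterogenization argument can be illustrated with a toy simulation. In the sketch below, the true strain difference depends on environmental conditions; a laboratory that fixes its conditions estimates a condition-specific effect, while one that systematically varies conditions estimates the average effect. All effect sizes, sample sizes, and function names are invented for illustration, not taken from Richter et al. (2010).

```python
import random
import statistics

random.seed(0)

# Assumed true strain difference (arbitrary units) under each of the 2x2
# conditions (cage size x light level); the effect varies with environment.
TRUE_EFFECT = {
    ("small", "high"): 2.0,
    ("small", "low"): 0.5,
    ("large", "high"): 1.5,
    ("large", "low"): -0.5,
}
CONDITIONS = list(TRUE_EFFECT)
N_PER_GROUP = 12  # animals per experiment


def measure(effect, n):
    """Simulated per-animal strain-difference scores with measurement noise."""
    return [effect + random.gauss(0, 1.0) for _ in range(n)]


def standardized_experiment():
    # One fixed condition for the whole experiment (differs lab to lab).
    cond = random.choice(CONDITIONS)
    return statistics.mean(measure(TRUE_EFFECT[cond], N_PER_GROUP))


def heterogenized_experiment():
    # Animals balanced across all four conditions within one experiment.
    scores = []
    for cond in CONDITIONS:
        scores += measure(TRUE_EFFECT[cond], N_PER_GROUP // 4)
    return statistics.mean(scores)


std_estimates = [standardized_experiment() for _ in range(200)]
het_estimates = [heterogenized_experiment() for _ in range(200)]

# Between-experiment spread of the estimated strain difference: the
# heterogenized design yields more stable (replicable) estimates.
print("standardized SD: ", round(statistics.stdev(std_estimates), 2))
print("heterogenized SD:", round(statistics.stdev(het_estimates), 2))
```

Under these assumptions, the standardized design looks "cleaner" within any one experiment but its estimates scatter widely across experiments, which is the spurious-precision concern the authors raised.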
These three studies all have some merit and add appreciably to the debate about standardization, Holmes said. As noted by Crabbe et al. (1999), standardization will be difficult because there are many ideas about what “best” practice is. Moving forward, Mandillo et al. (2008) suggested that taking the experimenter out of the experiment would remove one source of variability; that is, automated equipment may help to reduce subjectivity in scoring. Even if standardization improves the reproducibility of behavioral tests, Holmes concluded, we also need to develop novel endpoints that might be less liable to these issues.
DEVELOPING TRANSLATABLE COGNITIVE ASSAYS
Timothy Bussey, professor in the Department of Experimental Psychology at the University of Cambridge, focused his comments on behavioral cognitive assessment in animal models.
Bussey offered his opinion on an ideal cognitive testing method:
• Automated: Advantages of automation include high throughput or the ability to test large cohorts of animals simultaneously across multiple behavioral measures; minimal experimenter contact with animals during testing; labor saving; consistency and accuracy of task parameters and measures; data saved automatically; standardization.
• Non-aversive and low-stress: Stress and/or aversive stimuli can affect behavioral testing. Minimizing both when they are not specifically part of the study is important.
• Multidimensional: Standardize all tasks so that they are carried out in the same apparatus, using the same stimuli and rewards and requiring the same responses. In this way, an animal can be tested on a battery of cognitive tasks to establish a cognitive profile of that animal.
• Translational: Make tasks as similar as possible to those used to test human populations.
One approach to increasing translation of results from animal models to clinical trials is to start by looking at how humans are tested. Increasingly, automated tests are used for human cognitive testing; for example,
the Cambridge Neuropsychological Test Automated Battery (CANTAB) uses a touchscreen. This approach, Bussey noted, benefits from all of the advantages of automation described above. Touchscreens offer tight contiguity between stimuli and responses, increasing learning and minimizing confounds compared to approaches where a person must divide attention between the computer screen and keyboard. The CANTAB battery also uses nonverbal stimuli that could conceivably be presented to an animal in a similar testing situation.
In fact, researchers are using cognitive methods that present computer-graphic stimuli to animals. Bussey described a study on Huntington’s disease in which a mouse is presented with two pictures on a computer screen that can detect a touch by the animal’s nose. The task is called visual discrimination learning, which is simply the discrimination between two stimuli, one novel and one learned, on the computer screen. A specialized apparatus incorporates a computer monitor, touchscreen, and food magazine that dispenses a pellet when the animal responds to the challenge correctly.
In another example, Bussey described a test of spatial and non-spatial learning and memory, visual reversal learning, and attention using the triple transgenic Alzheimer’s disease (3xTgAD) mouse model that showed attention impairment compared to wild type mice (Romberg et al., 2011).
Finally, Bussey shared data from paired associate learning tests used to distinguish among mouse strains with knockout mutations of scaffolding proteins associated with the N-methyl D-aspartate (NMDA) receptor (Nithianatharajah et al., 2013). In this test, the animal must learn that a particular shape belongs in a particular location. Although some knockout strains perform no differently from the wild type control animals, one strain (with a knockout mutation in postsynaptic density protein 93 or PSD-93) never achieved performance above chance levels.
The next step was to translate these mouse testing methods to humans. Most of the human subjects did eventually learn the task over the course of many trials, Bussey said. However, study participants who were known to have deletions of the disks large homolog 2 gene (DLG2) that codes for the PSD-93 scaffold protein, some of whom had schizophrenia, were generally unable to learn the task.
Although these are preliminary studies, they demonstrate the potential for relevant translation between the animal models and the human clinical studies by using an automated, standardized apparatus and methods.
STANDARDIZATION FROM THE PERSPECTIVE OF ALZHEIMER’S DISEASE MODELS
Experimental models are used to better understand nature, began Lennart Mucke, director of the Gladstone Institute of Neurological Disease at the University of California, San Francisco. Although mice are clearly not people, they face similar challenges (e.g., parenting, finding food, navigation). Experimental models need not simulate every aspect of a disease or disorder, but do need to have some critical features in common, he said.
Alzheimer’s disease is a very complex condition that Mucke described as a multifactorial proteinopathy including, but not limited to, different assembly states of amyloid beta peptides, mislocalization of tau and alpha-synuclein, localization of apolipoprotein E (ApoE) both inside and outside of cells, inflammatory changes, and vascular changes. Models of Alzheimer’s disease, including transgenic mouse models, have been very informative in dissecting this complexity, Mucke said. Much Alzheimer’s research has focused on the structural alterations found in the human disease (e.g., amyloid plaque formation), as well as network disruption, synaptic deficits, and network failure.
Extrapolation from Animal Models to Humans
Mucke described several tests of learning and memory that can be used to evaluate therapeutic manipulations in animal models, including the Morris water maze for spatial learning and memory, novel object recognition, and passive avoidance learning. As an example, a study by Cissé et al. (2011) compared control mice and human amyloid precursor protein (hAPP) transgenic mice and assessed the effect of hippocampal injection of a lentiviral vector overexpressing EphB2, a receptor tyrosine kinase that is depleted in hAPP mice and in humans with Alzheimer’s disease. In all tests, untreated hAPP mice demonstrated learning and memory deficits, while hAPP mice treated with EphB2 performed similarly to controls. However, Mucke noted, these behavioral measures are sensitive to interference. For example, when the light/dark cycle in the animal housing facility fails and the lights stay on for extended periods of time, the animals become stressed and will not perform.
To study how navigational deficits relate to human dementia, Mucke and colleagues created a human maze in the hallways of their facility. Patients were tasked with route learning (forward and reverse), landmark
recognition, and photograph location and ordering. They found that approximately 70 percent of patients with early-stage Alzheimer’s disease and 50 percent of patients with mild cognitive impairment got lost on reverse routing, a navigation deficit that could not be predicted from mini-mental state exam scores of these groups (deIpolyi et al., 2007). Mucke observed smaller right posterior hippocampal and parietal volumes in patients who got lost. Interestingly, this same right-posterior hippocampal region has been shown to be expanded in London cab drivers who have been on the job for a long time, together suggesting a strong association of this region with human navigation.
In comparing mice and humans, Mucke made several observations in the mice that had not been described previously in the human condition. For example, in the dentate gyrus of the hAPP mice, there was decreased calbindin and overexpression of collagen VI compared to controls. There was also activation of group IVA cytosolic phospholipase A2 (IVA cPLA2) and increased met-enkephalin in the hippocampus, and decreases in specific sodium channel subunits in the parietal cortex. Mucke subsequently looked for and found these same molecular abnormalities in humans with Alzheimer’s disease, supporting the potential predictive value of these animal models (Verret et al., 2012).
One example of extrapolation of therapeutic findings from a mouse model to the human condition was the finding that immunization against amyloid beta clears amyloid plaques (Nicoll et al., 2006; Schenk et al., 1999). As another example, Mucke described studies in mice showing that the antiepileptic drug levetiracetam, when given chronically to hAPP mice, normalizes their long-term potentiation deficits in the hippocampus. There are also significant improvements, although not complete reversal, in performance in the Morris water maze. Studies in humans have shown that this drug also has beneficial effects in people with amnestic mild cognitive impairment.2
Increasing Success Through Best Practices
Having presented several examples of successful translation of findings in animal models to human Alzheimer’s disease, Mucke offered a list of possible reasons why translation may fail:
2Discussed further by Gallagher in Chapter 5.
• species differences;
• aging issues (neurodegenerative diseases develop over decades in humans, much longer than the lifespan of a mouse);
• human disease is more complex than what is modeled in animals;
• general problems with human and animal studies; and/or
• faulty hypothesis underlying the model.
Mucke highlighted some particular problems with the translation between animal and human studies of Alzheimer’s disease. First, the genetic heterogeneity of the patient population may obscure treatment effects. A better understanding of this heterogeneity might allow identification of subpopulations that respond to a particular treatment unlike the general population. Mucke added that animal models, in contrast to humans, typically have predictable genetic backgrounds and many other variables that can be controlled.
Another issue is that Alzheimer’s disease is a multifactorial condition and the success of a cause-specific treatment depends on the relative impact of the cause. In other words, if a disease has multiple contributing factors with different weights of impact on overall pathogenesis, it may matter which factor is blocked and it may be necessary to block more than one to see a significant impact. If a transgenic mouse model simulates only one of the causes, a treatment effect may be observed, but in the larger context of the human disease, there may only be partial benefit.
Mucke referred workshop participants to a recent review on best practices for preclinical animal studies in Alzheimer’s disease (Shineman et al., 2011). He offered his own list of what, in his experience, are important basic best practices for animal models:
• Blind-code all analyses.3
• Carefully match experimental and control groups (e.g., sex, age, other characteristics).
• Apply rigorous statistical approaches.
• Reproduce experimental results in independent cohorts at different times.
• Use multiple outcome measures, including measures that are functionally relevant to humans.
• Regularly test animal models for quality control (e.g., genetic drift, loss of phenotype).
3Coding of data by someone other than the researchers so that analysis can be performed in an unbiased manner.
• Validate across models and in the human condition.
• To be useful, negative data require sensitive positive controls. False negatives are easy to obtain and they are as misleading as false positives.
• Do not ignore or suppress data that might contradict dogma.
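The blind-coding practice in the list above can be sketched in a few lines: a third party replaces group labels with random codes before analysis and holds the key until scoring is complete. The function and sample names below are illustrative, not from the workshop.

```python
import random


def blind_code(samples, seed=None):
    """Return (coded_ids, key): the analyst sees only coded_ids, never groups.

    samples: list of (sample_id, group) tuples, e.g. ("mouse-1", "hAPP").
    """
    rng = random.Random(seed)
    codes = [f"S{i:03d}" for i in range(1, len(samples) + 1)]
    rng.shuffle(codes)  # decouple code order from collection order
    coded = []
    key = {}
    for (sample_id, group), code in zip(samples, codes):
        coded.append(code)               # handed to the analyst
        key[code] = (sample_id, group)   # held by the coder, not the analyst
    return coded, key


# Hypothetical usage: four animals, two groups.
samples = [("mouse-1", "hAPP"), ("mouse-2", "control"),
           ("mouse-3", "hAPP"), ("mouse-4", "control")]
coded, key = blind_code(samples, seed=42)
print(coded)  # four shuffled codes carrying no group information
# After analysis is finished, the key is used to unblind:
assert key[coded[0]] == ("mouse-1", "hAPP")
```

The essential design point is the separation of roles: whoever generates the key must not be the person scoring the outcomes.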
With regard to standardization, Mucke stated that standardization makes sense for well-established basic principles. However, until truly perfect models and assays have been developed, optimization trumps standardization. Premature or overzealous standardization efforts can prevent progress, he concluded.
Following the presentations, participants expanded on the topics of reproducibility, statistics, and mouse strains as they relate to standardization and the use of animal models.
In response to the Bayer paper described by Koroshetz, many participants, from both academic and industrial laboratories, reported difficulty reproducing results. Different assays, models, and compounds all affect reproducibility. Mucke noted that complex sets of data are often hard to reproduce, and success may require sending researchers to the other laboratory to learn protocols. Certain aspects of these protocols may not be trivial, and failure to reproduce another laboratory’s data may simply mean that the methods were not adopted as carefully as they should have been.
A participant cautioned against making generalizations about the inability to repeat work from academic laboratories. Mucke concurred and added that it would be helpful if there was a way to bring together the researchers who obtained discrepant findings and have them work through the discrepancies. Koroshetz noted that for spinal cord injury studies, when NINDS was not able to obtain the same results as the original investigators, they brought those investigators into the NINDS laboratories to help reproduce the results.
Mucke expressed concern that, while an academic researcher is usually committed to studying one area, such as behavior, for the length of
his or her career, industry researchers are often reassigned to new projects as company needs dictate. This turnover of scientists may impact the depth of expertise with animal models and/or methods used in a particular field of research. A participant with a pharmaceutical background countered that pharmaceutical researchers are equally dedicated scientists who work hard to maintain their expertise.
A participant pointed out that this workshop is focused on translation and exploring avenues to reduce clinical trial failures. Several participants went on to say that if it takes so much effort for two laboratories to reproduce the same finding, the likelihood of being able to then take that finding and reproduce it in a clinical trial would seem very low. While researchers may come together so that two laboratories using a model or test are doing everything exactly the same way, with every variable controlled for, this might not be possible in subsequent clinical trials. Perhaps the focus should be on profoundly reproducible findings and not those for which all experimental conditions need to be carefully fine-tuned.
A concern was raised that most basic science graduate students and postdoctoral fellows do not learn statistics. In addition, journal editors face challenges finding statisticians who are available and able to review basic cell biology in manuscripts. Mucke emphasized the value of involving a biostatistician in both the planning of a study and in the data analysis.
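One concrete planning question a biostatistician helps answer is how many animals per group are needed to detect a given effect. The sketch below uses the standard normal-approximation sample-size formula for a two-sided, two-sample comparison; the effect sizes and error rates are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def norm_ppf(p):
    """Inverse standard-normal CDF via bisection on math.erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided two-sample comparison.

    effect_size: standardized difference between group means (Cohen's d).
    """
    z_alpha = norm_ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = norm_ppf(power)           # ~0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Even a "large" effect (d = 1.0) needs 16 animals per group at 80% power;
# a medium effect (d = 0.5) needs 63.
print(n_per_group(1.0))  # 16
print(n_per_group(0.5))  # 63
```

Underpowered designs are one route to the false negatives Mucke warned about; running this arithmetic before the study, rather than after, is precisely where the biostatistician adds value.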
Animal Models with Uniform Versus Outbred Backgrounds
Holmes noted that in the early days of transgenic animal models, there were concerns that genetically isogenic backgrounds would be artificially homogeneous and therefore not relevant. These concerns were countered by claims that it would be difficult to see positive results against a background of high noise.
Across all behavioral domains, researchers continue to use inbred isogenic strains, but, increasingly, are using only one or two specific background strains, with the vast majority of studies done using
C57BL/6 mouse strains. This does raise questions about generalizability beyond the model.
Complex genetic strains have been derived from crossing a dozen or so different inbred strains, producing a population that is genetically as complex as a human population, Holmes said. These strains have a place in research for answering specific questions, such as the genetic basis of a particular phenotype. The choice of strain depends on the question being asked. When studying the mechanism of a disease or the effect of a drug, there is value in controlling the genetic background of the animals. It was also noted that the same strain of animal (e.g., C57BL/6) from different breeders can have both genetic and behavioral differences.