*David H. Kaye, M.A., J.D., is Distinguished Professor of Law and Weiss Family Scholar, The Pennsylvania State University, University Park, and Regents’ Professor Emeritus, Arizona State University Sandra Day O’Connor College of Law and School of Life Sciences, Tempe.*

*David A. Freedman, Ph.D., was Professor of Statistics, University of California, Berkeley.*

[Editor’s Note: Sadly, Professor Freedman passed away during the production of this manual.]


Reference Guide on Statistics
David H. Kaye and David A. Freedman
Contents
I. Introduction
A. Admissibility and Weight of Statistical Studies
B. Varieties and Limits of Statistical Expertise
C. Procedures That Enhance Statistical Testimony
1. Maintaining professional autonomy
2. Disclosing other analyses
3. Disclosing data and analytical methods before trial
II. How Have the Data Been Collected?
A. Is the Study Designed to Investigate Causation?
1. Types of studies
2. Randomized controlled experiments
3. Observational studies
4. Can the results be generalized?
B. Descriptive Surveys and Censuses
1. What method is used to select the units?
2. Of the units selected, which are measured?
C. Individual Measurements
1. Is the measurement process reliable?
2. Is the measurement process valid?
3. Are the measurements recorded correctly?
D. What Is Random?
III. How Have the Data Been Presented?
A. Are Rates or Percentages Properly Interpreted?
1. Have appropriate benchmarks been provided?
2. Have the data collection procedures changed?
3. Are the categories appropriate?
4. How big is the base of a percentage?
5. What comparisons are made?
B. Is an Appropriate Measure of Association Used?
C. Does a Graph Portray Data Fairly?
1. How are trends displayed?
2. How are distributions displayed?
D. Is an Appropriate Measure Used for the Center of a Distribution?
E. Is an Appropriate Measure of Variability Used?
IV. What Inferences Can Be Drawn from the Data?
A. Estimation
1. What estimator should be used?
2. What is the standard error? The confidence interval?
3. How big should the sample be?
4. What are the technical difficulties?
B. Significance Levels and Hypothesis Tests
1. What is the p-value?
2. Is a difference statistically significant?
3. Tests or interval estimates?
4. Is the sample statistically significant?
C. Evaluating Hypothesis Tests
1. What is the power of the test?
2. What about small samples?
3. One tail or two?
4. How many tests have been done?
5. What are the rival hypotheses?
D. Posterior Probabilities
V. Correlation and Regression
A. Scatter Diagrams
B. Correlation Coefficients
1. Is the association linear?
2. Do outliers influence the correlation coefficient?
3. Does a confounding variable influence the coefficient?
C. Regression Lines
1. What are the slope and intercept?
2. What is the unit of analysis?
D. Statistical Models
Appendix
A. Frequentists and Bayesians
B. The Spock Jury: Technical Details
C. The Nixon Papers: Technical Details
D. A Social Science Example of Regression: Gender Discrimination in Salaries
1. The regression model
2. Standard errors, t-statistics, and statistical significance
Glossary of Terms
References on Statistics

I. Introduction
Statistical assessments are prominent in many kinds of legal cases, including
antitrust, employment discrimination, toxic torts, and voting rights cases.1 This
reference guide describes the elements of statistical reasoning. We hope the expla-
nations will help judges and lawyers to understand statistical terminology, to see
the strengths and weaknesses of statistical arguments, and to apply relevant legal
doctrine. The guide is organized as follows:
• Section I provides an overview of the field, discusses the admissibility
of statistical studies, and offers some suggestions about procedures that
encourage the best use of statistical evidence.
• Section II addresses data collection and explains why the design of a study
is the most important determinant of its quality. This section compares
experiments with observational studies and surveys with censuses, indicat-
ing when the various kinds of study are likely to provide useful results.
• Section III discusses the art of summarizing data. This section considers the
mean, median, and standard deviation. These are basic descriptive statistics,
and most statistical analyses use them as building blocks. This section also
discusses patterns in data that are brought out by graphs, percentages, and
tables.
• Section IV describes the logic of statistical inference, emphasizing founda-
tions and disclosing limitations. This section covers estimation, standard
errors and confidence intervals, p-values, and hypothesis tests.
• Section V shows how associations can be described by scatter diagrams,
correlation coefficients, and regression lines. Regression is often used to
infer causation from association. This section explains the technique, indi-
cating the circumstances under which it and other statistical models are
likely to succeed—or fail.
• An appendix provides some technical details.
• The glossary defines statistical terms that may be encountered in litigation.
1. See generally Statistical Science in the Courtroom (Joseph L. Gastwirth ed., 2000); Statistics
and the Law (Morris H. DeGroot et al. eds., 1986); National Research Council, The Evolving Role
of Statistical Assessments as Evidence in the Courts (Stephen E. Fienberg ed., 1989) [hereinafter The
Evolving Role of Statistical Assessments as Evidence in the Courts]; Michael O. Finkelstein & Bruce
Levin, Statistics for Lawyers (2d ed. 2001); 1 & 2 Joseph L. Gastwirth, Statistical Reasoning in Law
and Public Policy (1988); Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in
Law and Litigation (1997).
A. Admissibility and Weight of Statistical Studies
Statistical studies suitably designed to address a material issue generally will be
admissible under the Federal Rules of Evidence. The hearsay rule rarely is a
serious barrier to the presentation of statistical studies, because such studies may
be offered to explain the basis for an expert’s opinion or may be admissible under
the learned treatise exception to the hearsay rule.2 Because most statistical methods
relied on in court are described in textbooks or journal articles and are capable
of producing useful results when properly applied, these methods generally satisfy
important aspects of the “scientific knowledge” requirement in Daubert v. Merrell
Dow Pharmaceuticals, Inc.3 Of course, a particular study may use a method that is
entirely appropriate but that is so poorly executed that it should be inadmissible
under Federal Rules of Evidence 403 and 702.4 Or, the method may be inappro-
priate for the problem at hand and thus lack the “fit” spoken of in Daubert.5 Or
the study might rest on data of the type not reasonably relied on by statisticians or
substantive experts and hence run afoul of Federal Rule of Evidence 703. Often,
however, the battle over statistical evidence concerns weight or sufficiency rather
than admissibility.
B. Varieties and Limits of Statistical Expertise
For convenience, the field of statistics may be divided into three subfields: prob-
ability theory, theoretical statistics, and applied statistics. Probability theory is the
mathematical study of outcomes that are governed, at least in part, by chance.
Theoretical statistics is about the properties of statistical procedures, including
error rates; probability theory plays a key role in this endeavor. Applied statistics
draws on both of these fields to develop techniques for collecting or analyzing
particular types of data.
2. See generally 2 McCormick on Evidence §§ 321, 324.3 (Kenneth S. Broun ed., 6th ed. 2006).
Studies published by government agencies also may be admissible as public records. Id. § 296.
3. 509 U.S. 579, 589–90 (1993).
4. See Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999) (suggesting that the trial court
should “make certain that an expert, whether basing testimony upon professional studies or personal
experience, employs in the courtroom the same level of intellectual rigor that characterizes the practice
of an expert in the relevant field.”); Malletier v. Dooney & Bourke, Inc., 525 F. Supp. 2d 558, 562–63
(S.D.N.Y. 2007) (“While errors in a survey’s methodology usually go to the weight accorded to the
conclusions rather than its admissibility, . . . ‘there will be occasions when the proffered survey is so
flawed as to be completely unhelpful to the trier of fact.’”) (quoting AHP Subsidiary Holding Co. v.
Stuart Hale Co., 1 F.3d 611, 618 (7th Cir.1993)).
5. Daubert, 509 U.S. at 591; Anderson v. Westinghouse Savannah River Co., 406 F.3d 248 (4th
Cir. 2005) (motion to exclude statistical analysis that compared black and white employees without
adequately taking into account differences in their job titles or positions was properly granted under
Daubert); Malletier, 525 F. Supp. 2d at 569 (excluding a consumer survey for “a lack of fit between the
survey’s questions and the law of dilution” and errors in the execution of the survey).
Statistical expertise is not confined to those with degrees in statistics. Because
statistical reasoning underlies many kinds of empirical research, scholars in a
variety of fields—including biology, economics, epidemiology, political science,
and psychology—are exposed to statistical ideas, with an emphasis on the methods
most important to the discipline.
Experts who specialize in using statistical methods, and whose professional
careers demonstrate this orientation, are most likely to use appropriate procedures
and correctly interpret the results. By contrast, forensic scientists often lack basic
information about the studies underlying their testimony. State v. Garrison6 illus-
trates the problem. In this murder prosecution involving bite mark evidence, a
dentist was allowed to testify that “the probability factor of two sets of teeth being
identical in a case similar to this is, approximately, eight in one million,” even
though “he was unaware of the formula utilized to arrive at that figure other than
that it was ‘computerized.’”7
At the same time, the choice of which data to examine, or how best to model
a particular process, could require subject matter expertise that a statistician lacks.
As a result, cases involving statistical evidence frequently are (or should be) “two
expert” cases of interlocking testimony. A labor economist, for example, may
supply a definition of the relevant labor market from which an employer draws
its employees; the statistical expert may then compare the race of new hires to
the racial composition of the labor market. Naturally, the value of the statistical
analysis depends on the substantive knowledge that informs it.8
C. Procedures That Enhance Statistical Testimony
1. Maintaining professional autonomy
Ideally, experts who conduct research in the context of litigation should proceed
with the same objectivity that would be required in other contexts. Thus, experts
who testify (or who supply results used in testimony) should conduct the analysis
required to address in a professionally responsible fashion the issues posed by the
litigation.9 Questions about the freedom of inquiry accorded to testifying experts,
6. 585 P.2d 563 (Ariz. 1978).
7. Id. at 566, 568. For other examples, see David H. Kaye et al., The New Wigmore: A Treatise
on Evidence: Expert Evidence § 12.2 (2d ed. 2011).
8. In Vuyanich v. Republic National Bank, 505 F. Supp. 224, 319 (N.D. Tex. 1980), vacated, 723
F.2d 1195 (5th Cir. 1984), defendant’s statistical expert criticized the plaintiffs’ statistical model for an
implicit, but restrictive, assumption about male and female salaries. The district court trying the case
accepted the model because the plaintiffs’ expert had a “very strong guess” about the assumption, and
her expertise included labor economics as well as statistics. Id. It is doubtful, however, that economic
knowledge sheds much light on the assumption, and it would have been simple to perform a less
restrictive analysis.
9. See The Evolving Role of Statistical Assessments as Evidence in the Courts, supra note 1, at 164 (recommending that the expert be free to consult with colleagues who have not been retained by any party to the litigation and that the expert receive a letter of engagement providing for these and other safeguards).

as well as the scope and depth of their investigations, may reveal some of the
limitations to the testimony.

2. Disclosing other analyses
Statisticians analyze data using a variety of methods. There is much to be said for
looking at the data in several ways. To permit a fair evaluation of the analysis that
is eventually settled on, however, the testifying expert can be asked to explain
how that approach was developed. According to some commentators, counsel
who know of analyses that do not support the client’s position should reveal them,
rather than presenting only favorable results.10

3. Disclosing data and analytical methods before trial
The collection of data often is expensive and subject to errors and omissions.
Moreover, careful exploration of the data can be time-consuming. To minimize
debates at trial over the accuracy of data and the choice of analytical techniques,
pretrial discovery procedures should be used, particularly with respect to the
quality of the data and the method of analysis.11

II. How Have the Data Been Collected?
The interpretation of data often depends on understanding “study design”—the
plan for a statistical study and its implementation.12 Different designs are suited to
answering different questions. Also, flaws in the data can undermine any statistical
analysis, and data quality is often determined by study design.

In many cases, statistical studies are used to show causation. Do food additives
cause cancer? Does capital punishment deter crime? Would additional disclosures
10. Id. at 167; cf. William W. Schwarzer, In Defense of “Automatic Disclosure in Discovery,” 27
Ga. L. Rev. 655, 658–59 (1993) (“[T]he lawyer owes a duty to the court to make disclosure of core
information.”). The National Research Council also recommends that “if a party gives statistical data
to different experts for competing analyses, that fact be disclosed to the testifying expert, if any.” The
Evolving Role of Statistical Assessments as Evidence in the Courts, supra note 1, at 167.
11. See The Special Comm. on Empirical Data in Legal Decision Making, Recommendations
on Pretrial Proceedings in Cases with Voluminous Data, reprinted in The Evolving Role of Statistical
Assessments as Evidence in the Courts, supra note 1, app. F; see also David H. Kaye, Improving Legal
Statistics, 24 Law & Soc’y Rev. 1255 (1990).
12. For introductory treatments of data collection, see, for example, David Freedman et al.,
Statistics (4th ed. 2007); Darrell Huff, How to Lie with Statistics (1993); David S. Moore & William
I. Notz, Statistics: Concepts and Controversies (6th ed. 2005); Hans Zeisel, Say It with Figures (6th
ed. 1985); Zeisel & Kaye, supra note 1.
in a securities prospectus cause investors to behave differently? The design of
studies to investigate causation is the first topic of this section.13
Sample data can be used to describe a population. The population is the
whole class of units that are of interest; the sample is the set of units chosen for
detailed study. Inferences from the part to the whole are justified when the sample
is representative. Sampling is the second topic of this section.
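The part-to-whole logic can be sketched in a few lines of Python. The population size, prevalence, and sample size below are hypothetical, chosen only for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# A hypothetical population of 10,000 units, 30% of which have the
# attribute of interest (1 = has it, 0 = does not).
population = [1] * 3000 + [0] * 7000

# A simple random sample of 400 units drawn without replacement.
sample = random.sample(population, 400)

true_proportion = sum(population) / len(population)  # 0.30 by construction
sample_proportion = sum(sample) / len(sample)

# When the sample is drawn at random, the sample proportion is usually
# close to the population proportion; for n = 400 it typically falls
# within a few percentage points of 0.30.
```

Section IV takes up how close such an estimate can be expected to be, via standard errors and confidence intervals.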
Finally, the accuracy of the data will be considered. Because making and
recording measurements is an error-prone activity, error rates should be assessed
and the likely impact of errors considered. Data quality is the third topic of this
section.
A. Is the Study Designed to Investigate Causation?
1. Types of studies
When causation is the issue, anecdotal evidence can be brought to bear. So can
observational studies or controlled experiments. Anecdotal reports may be of
value, but they are ordinarily more helpful in generating lines of inquiry than in
proving causation.14 Observational studies can establish that one factor is associ-
13. See also Michael D. Green et al., Reference Guide on Epidemiology, Section V, in this
manual; Joseph Rodricks, Reference Guide on Exposure Science, Section E, in this manual.
14. In medicine, evidence from clinical practice can be the starting point for discovery of
cause-and-effect relationships. For examples, see David A. Freedman, On Types of Scientific Enquiry, in
The Oxford Handbook of Political Methodology 300 (Janet M. Box-Steffensmeier et al. eds., 2008).
Anecdotal evidence is rarely definitive, and some courts have suggested that attempts to infer causa-
tion from anecdotal reports are inadmissible as unsound methodology under Daubert v. Merrell Dow
Pharmaceuticals, Inc., 509 U.S. 579 (1993). See, e.g., McClain v. Metabolife Int’l, Inc., 401 F.3d 1233,
1244 (11th Cir. 2005) (“simply because a person takes drugs and then suffers an injury does not show
causation. Drawing such a conclusion from temporal relationships leads to the blunder of the post hoc
ergo propter hoc fallacy.”); In re Baycol Prods. Litig., 532 F. Supp. 2d 1029, 1039–40 (D. Minn. 2007)
(excluding a meta-analysis based on reports to the Food and Drug Administration of adverse events);
Leblanc v. Chevron USA Inc., 513 F. Supp. 2d 641, 650 (E.D. La. 2007) (excluding plaintiffs’ experts’
opinions that benzene causes myelofibrosis because the causal hypothesis “that has been generated by
case reports . . . has not been confirmed by the vast majority of epidemiologic studies of workers being
exposed to benzene and more generally, petroleum products.”), vacated, 275 Fed. App’x. 319 (5th
Cir. 2008) (remanding for consideration of newer government report on health effects of benzene);
cf. Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309, 1321 (2011) (concluding that adverse event
reports combined with other information could be of concern to a reasonable investor and therefore
subject to a requirement of disclosure under SEC Rule 10b-5, but stating that “the mere existence of
reports of adverse events . . . says nothing in and of itself about whether the drug is causing the adverse
events”). Other courts are more open to “differential diagnoses” based primarily on timing. E.g., Best v.
Lowe’s Home Ctrs., Inc., 563 F.3d 171 (6th Cir. 2009) (reversing the exclusion of a physician’s opinion
that exposure to propenyl chloride caused a man to lose his sense of smell because of the timing in this
one case and the physician’s inability to attribute the change to anything else); Kaye et al., supra note
7, §§ 8.7.2 & 12.5.1. See also Matrixx Initiatives, supra, at 1322 (listing “a temporal relationship” in a
single patient as one indication of “a reliable causal link”).
ated with another, but work is needed to bridge the gap between association and
causation. Randomized controlled experiments are ideally suited for demonstrat-
ing causation.
Anecdotal evidence usually amounts to reports that events of one kind are
followed by events of another kind. Typically, the reports are not even sufficient
to show association, because there is no comparison group. For example, some
children who live near power lines develop leukemia. Does exposure to electrical
and magnetic fields cause this disease? The anecdotal evidence is not compelling
because leukemia also occurs among children without exposure.15 It is necessary
to compare disease rates among those who are exposed and those who are not.
If exposure causes the disease, the rate should be higher among the exposed and
lower among the unexposed. That would be association.
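With hypothetical counts (invented for illustration, not drawn from any study), the rate comparison looks like this:

```python
# Hypothetical cohort: 10,000 exposed and 10,000 unexposed children.
exposed_cases, exposed_total = 30, 10_000
unexposed_cases, unexposed_total = 20, 10_000

exposed_rate = exposed_cases / exposed_total        # 30 per 10,000
unexposed_rate = unexposed_cases / unexposed_total  # 20 per 10,000

# The rate ratio (relative risk) compares the two groups directly:
# a value above 1 means the disease is more common among the exposed.
rate_ratio = exposed_rate / unexposed_rate
```

A ratio above 1 shows association; whether that association is causal is the separate question taken up in the rest of this section.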
The next issue is crucial: Exposed and unexposed people may differ in ways
other than the exposure they have experienced. For example, children who live
near power lines could come from poorer families and be more at risk from other
environmental hazards. Such differences can create the appearance of a cause-and-
effect relationship. Other differences can mask a real relationship. Cause-and-effect
relationships often are quite subtle, and carefully designed studies are needed to
draw valid conclusions.
An epidemiological classic makes the point. At one time, it was thought that
lung cancer was caused by fumes from tarring the roads, because many lung cancer
patients lived near roads that recently had been tarred. This is anecdotal evidence.
But the argument is incomplete. For one thing, most people—whether exposed
to asphalt fumes or unexposed—did not develop lung cancer. A comparison of
rates was needed. The epidemiologists found that exposed persons and unexposed
persons suffered from lung cancer at similar rates: Tar was probably not the causal
agent. Exposure to cigarette smoke, however, turned out to be strongly associated
with lung cancer. This study, in combination with later ones, made a compelling
case that smoking cigarettes is the main cause of lung cancer.16
A good study design compares outcomes for subjects who are exposed to
some factor (the treatment group) with outcomes for other subjects who are
15. See National Research Council, Committee on the Possible Effects of Electromagnetic Fields
on Biologic Systems (1997); Zeisel & Kaye, supra note 1, at 66–67. There are problems in measur-
ing exposure to electromagnetic fields, and results are inconsistent from one study to another. For
such reasons, the epidemiological evidence for an effect on health is inconclusive. National Research
Council, supra; Zeisel & Kaye, supra; Edward W. Campion, Power Lines, Cancer, and Fear, 337 New
Eng. J. Med. 44 (1997) (editorial); Martha S. Linet et al., Residential Exposure to Magnetic Fields and Acute
Lymphoblastic Leukemia in Children, 337 New Eng. J. Med. 1 (1997); Gary Taubes, Magnetic Field-Cancer
Link: Will It Rest in Peace?, 277 Science 29 (1997) (quoting various epidemiologists).
16. Richard Doll & A. Bradford Hill, A Study of the Aetiology of Carcinoma of the Lung, 2 Brit.
Med. J. 1271 (1952). This was a matched case-control study. Cohort studies soon followed. See
Green et al., supra note 13. For a review of the evidence on causation, see 38 International Agency
for Research on Cancer (IARC), World Health Org., IARC Monographs on the Evaluation of the
Carcinogenic Risk of Chemicals to Humans: Tobacco Smoking (1986).
not exposed (the control group). Now there is another important distinction to
be made—that between controlled experiments and observational studies. In a
controlled experiment, the investigators decide which subjects will be exposed
and which subjects will go into the control group. In observational studies, by
contrast, the subjects themselves choose their exposures. Because of self-selection,
the treatment and control groups are likely to differ with respect to influential
factors other than the one of primary interest. (These other factors are called lurk-
ing variables or confounding variables.)17 With the health effects of power lines,
family background is a possible confounder; so is exposure to other hazards. Many
confounders have been proposed to explain the association between smoking and
lung cancer, but careful epidemiological studies have ruled them out, one after
the other.
Confounding remains a problem to reckon with, even for the best observa-
tional research. For example, women with herpes are more likely to develop cer-
vical cancer than other women. Some investigators concluded that herpes caused
cancer: In other words, they thought the association was causal. Later research
showed that the primary cause of cervical cancer was human papilloma virus
(HPV). Herpes was a marker of sexual activity. Women who had multiple sexual
partners were more likely to be exposed not only to herpes but also to HPV.
The association between herpes and cervical cancer was due to other variables.18
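A small simulation makes the mechanism vivid. In this hypothetical sketch, a lurking variable raises the chance of both the exposure and the outcome, while the exposure itself has no effect on the outcome at all; the groups nevertheless show different outcome rates:

```python
import random

random.seed(0)  # fixed seed for reproducibility

def simulate_person():
    # The lurking variable drives both exposure and outcome.
    lurking = random.random() < 0.5
    exposure = random.random() < (0.7 if lurking else 0.1)
    outcome = random.random() < (0.4 if lurking else 0.05)  # ignores exposure
    return exposure, outcome

people = [simulate_person() for _ in range(100_000)]

def outcome_rate(group):
    return sum(1 for _, outcome in group if outcome) / len(group)

exposed = [p for p in people if p[0]]
unexposed = [p for p in people if not p[0]]

# Despite the absence of any causal link, the outcome is markedly more
# common among the exposed: the lurking variable does all the work.
```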
What are “variables”? In statistics, a variable is a characteristic of units in a
study. With a study of people, the unit of analysis is the person. Typical vari-
ables include income (dollars per year) and educational level (years of schooling
completed): These variables describe people. With a study of school districts, the
unit of analysis is the district. Typical variables include average family income of
district residents and average test scores of students in the district: These variables
describe school districts.
When investigating a cause-and-effect relationship, the variable that repre-
sents the effect is called the dependent variable, because it depends on the causes.
The variables that represent the causes are called independent variables. With a
study of smoking and lung cancer, the independent variable would be smoking
(e.g., number of cigarettes per day), and the dependent variable would mark the
presence or absence of lung cancer. Dependent variables also are called outcome
variables or response variables. Synonyms for independent variables are risk factors,
predictors, and explanatory variables.
17. For example, a confounding variable may be correlated with the independent variable and
act causally on the dependent variable. If the units being studied differ on the independent variable,
they are also likely to differ on the confounder. The confounder—not the independent variable—could
therefore be responsible for differences seen on the dependent variable.
18. For additional examples and further discussion, see Freedman et al., supra note 12, at 12–28,
150–52; David A. Freedman, From Association to Causation: Some Remarks on the History of Statistics, 14
Stat. Sci. 243 (1999). Some studies find that herpes is a “cofactor,” which increases risk among women
who are also exposed to HPV. Only certain strains of HPV are carcinogenic.
2. Randomized controlled experiments
In randomized controlled experiments, investigators assign subjects to treatment
or control groups at random. The groups are therefore likely to be comparable,
except for the treatment. This minimizes the role of confounding. Minor imbal-
ances will remain, due to the play of random chance; the likely effect on study
results can be assessed by statistical techniques.19 The bottom line is that causal
inferences based on well-executed randomized experiments are generally more
secure than inferences based on well-executed observational studies.
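Random assignment’s tendency to balance the groups can be seen in a short simulation (the subjects and their ages are hypothetical):

```python
import random

random.seed(2)  # fixed seed for reproducibility

# Hypothetical subjects with one background characteristic (age).
subjects = [{"age": random.gauss(50, 10)} for _ in range(2000)]

# Random assignment: shuffle, then split into equal-sized groups.
random.shuffle(subjects)
treatment, control = subjects[:1000], subjects[1000:]

def mean_age(group):
    return sum(s["age"] for s in group) / len(group)

# With random assignment the group means differ only by chance; for
# groups of this size the difference is typically a fraction of a year.
difference = mean_age(treatment) - mean_age(control)
```

The same balancing operates on characteristics nobody measured, which is why randomization, rather than matching on known variables alone, minimizes confounding.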
The following example should help bring the discussion together. Today, we
know that taking aspirin helps prevent heart attacks. But initially, there was some
controversy. People who take aspirin rarely have heart attacks. This is anecdotal
evidence for a protective effect, but it proves almost nothing. After all, few people
have frequent heart attacks, whether or not they take aspirin regularly. A good
study compares heart attack rates for two groups: people who take aspirin (the
treatment group) and people who do not (the controls). An observational study
would be easy to do, but in such a study the aspirin-takers are likely to be dif-
ferent from the controls. Indeed, they are likely to be sicker—that is why they
are taking aspirin. The study would be biased against finding a protective effect.
Randomized experiments are harder to do, but they provide better evidence. It
is the experiments that demonstrate a protective effect.20
In summary, data from a treatment group without a control group generally
reveal very little and can be misleading. Comparisons are essential. If subjects are
assigned to treatment and control groups at random, a difference in the outcomes
between the two groups can usually be accepted, within the limits of statistical
error (infra Section IV), as a good measure of the treatment effect. However, if
the groups are created in any other way, differences that existed before treatment
may contribute to differences in the outcomes or mask differences that otherwise
would become manifest. Observational studies succeed to the extent that the treat-
ment and control groups are comparable—apart from the treatment.
3. Observational studies
The bulk of the statistical studies seen in court are observational, not experi-
mental. Take the question of whether capital punishment deters murder. To
conduct a randomized controlled experiment, people would need to be assigned
randomly to a treatment group or a control group. People in the treatment
group would know they were subject to the death penalty for murder; the
19. Randomization of subjects to treatment or control groups puts statistical tests of significance
on a secure footing. Freedman et al., supra note 12, at 503–22, 545–63; see infra Section IV.
20. In other instances, experiments have banished strongly held beliefs. E.g., Scott M. Lippman
et al., Effect of Selenium and Vitamin E on Risk of Prostate Cancer and Other Cancers: The Selenium
and Vitamin E Cancer Prevention Trial (SELECT), 301 JAMA 39 (2009).
controls would know that they were exempt. Conducting such an experiment
is not possible.
Many studies of the deterrent effect of the death penalty have been conducted,
all observational, and some have attracted judicial attention. Researchers have cata-
logued differences in the incidence of murder in states with and without the death
penalty and have analyzed changes in homicide rates and execution rates over the
years. When reporting on such observational studies, investigators may speak of
“control groups” (e.g., the states without capital punishment) or claim they are “con-
trolling for” confounding variables by statistical methods.21 However, association is
not causation. The causal inferences that can be drawn from analysis of observational
data—no matter how complex the statistical technique—usually rest on a foundation
that is less secure than that provided by randomized controlled experiments.
That said, observational studies can be very useful. For example, there is strong
observational evidence that smoking causes lung cancer (supra Section II.A.1). Gen-
erally, observational studies provide good evidence in the following circumstances:
• The association is seen in studies with different designs, on different kinds of
subjects, and done by different research groups.22 That reduces the chance
that the association is due to a defect in one type of study, a peculiarity in
one group of subjects, or the idiosyncrasies of one research group.
• The association holds when effects of confounding variables are taken into
account by appropriate methods, for example, comparing smaller groups
that are relatively homogeneous with respect to the confounders.23
• There is a plausible explanation for the effect of the independent variable;
alternative explanations in terms of confounding should be less plausible
than the proposed causal link.24
21. A procedure often used to control for confounding in observational studies is regression
analysis. The underlying logic is described infra Section V.D and in Daniel L. Rubinfeld, Reference
Guide on Multiple Regression, Section II, in this manual. But see Richard A. Berk, Regression
Analysis: A Constructive Critique (2004); Rethinking Social Inquiry: Diverse Tools, Shared Standards
(Henry E. Brady & David Collier eds., 2004); David A. Freedman, Statistical Models: Theory and
Practice (2005); David A. Freedman, Oasis or Mirage, Chance, Spring 2008, at 59.
22. For example, case-control studies are designed one way and cohort studies another, with
many variations. See, e.g., Leon Gordis, Epidemiology (4th ed. 2008); supra note 16.
23. The idea is to control for the influence of a confounder by stratification—making compari-
sons separately within groups for which the confounding variable is nearly constant and therefore has
little influence over the variables of primary interest. For example, smokers are more likely to get lung
cancer than nonsmokers. Age, gender, social class, and region of residence are all confounders, but
controlling for such variables does not materially change the relationship between smoking and cancer
rates. Furthermore, many different studies—of different types and on different populations—confirm
the causal link. That is why most experts believe that smoking causes lung cancer and many other
diseases. For a review of the literature, see International Agency for Research on Cancer, supra note 16.
24. A. Bradford Hill, The Environment and Disease: Association or Causation?, 58 Proc. Royal
Soc’y Med. 295 (1965); Alfred S. Evans, Causation and Disease: A Chronological Journey 187 (1993).
Plausibility, however, is a function of time and circumstances.
than 1%, the result is highly significant. The p-value is also called the observed
significance level. See significance test; statistical hypothesis.
parameter. A numerical characteristic of a population or a model. See prob-
ability model.
percentile. To get the percentiles of a dataset, array the data from the smallest
value to the largest. Take the 90th percentile by way of example: 90% of the
values fall below the 90th percentile, and 10% are above. (To be very precise:
At least 90% of the data are at the 90th percentile or below; at least 10% of the
data are at the 90th percentile or above.) The 50th percentile is the median:
50% of the values fall below the median, and 50% are above. On the LSAT,
a score of 152 places a test taker at the 50th percentile; a score of 164 is at
the 90th percentile; a score of 172 is at the 99th percentile. Compare mean;
median; quartile.
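Under the "at least" definition above, a percentile can be computed by sorting the data and counting (a sketch; software packages differ on interpolation conventions):

```python
def percentile(data, p):
    """Smallest value with at least p% of the data at or below it."""
    values = sorted(data)
    n = len(values)
    k = -(-p * n // 100)  # ceiling of p * n / 100
    return values[max(int(k), 1) - 1]

data = list(range(1, 101))   # the numbers 1 through 100
print(percentile(data, 90))  # 90: at least 90% of the values fall at or below it
print(percentile(data, 50))  # 50, the median
```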
placebo. See double-blind experiment.
point estimate. An estimate of the value of a quantity expressed as a single num-
ber. See estimator. Compare confidence interval; interval estimate.
Poisson distribution. A limiting case of the binomial distribution, when the
number of trials is large and the common probability is small. The parameter
of the approximating Poisson distribution is the number of trials times the
common probability, which is the expected number of events. When this
number is large, the Poisson distribution may be approximated by a normal
distribution.
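The approximation can be checked numerically (the trial count and common probability below are arbitrary illustrations):

```python
from math import comb, exp, factorial

n, p = 1000, 0.002   # many trials, small common probability
lam = n * p          # Poisson parameter: the expected number of events

def binomial_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

# The two distributions nearly coincide.
for k in range(5):
    print(k, round(binomial_pmf(k), 4), round(poisson_pmf(k), 4))
```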
population. Also, universe. All the units of interest to the researcher. Compare
sample; sampling frame.
population size. Also, size of population. Number of units in the population.
posterior probability. See Bayes’ rule.
power. The probability that a statistical test will reject the null hypothesis. To
compute power, one has to fix the size of the test and specify parameter values
outside the range given by the null hypothesis. A powerful test has a good
chance of detecting an effect when there is an effect to be detected. See beta;
significance test. Compare alpha; size; p-value.
practical significance. Substantive importance. Statistical significance does not
necessarily establish practical significance. With large samples, small differ-
ences can be statistically significant. See significance test.
practice effects. Changes in test scores that result from taking the same test
twice in succession, or taking two similar tests one after the other.
predicted value. See residual.
predictive validity. A skills test has predictive validity to the extent that test
scores are well correlated with later performance, or more generally with
outcomes that the test is intended to predict. See validity. Compare reliability.
predictor. See independent variable.
prior probability. See Bayes’ rule.
probability. Chance, on a scale from 0 to 1. Impossibility is represented by 0,
certainty by 1. Equivalently, chances may be quoted in percent; 100% cor-
responds to 1, 5% corresponds to .05, and so forth.
probability density. Describes the probability distribution of a random variable.
The chance that the random variable falls in an interval equals the area below
the density and above the interval. (However, not all random variables have
densities.) See probability distribution; random variable.
probability distribution. Gives probabilities for possible values or ranges of
values of a random variable. Often, the distribution is described in terms of a
density. See probability density.
probability histogram. See histogram.
probability model. Relates probabilities of outcomes to parameters; also, statis-
tical model. The latter connotes unknown parameters.
probability sample. A sample drawn from a sampling frame by some objective
chance mechanism; each unit has a known probability of being sampled. Such
samples minimize selection bias, but can be expensive to draw.
psychometrics. The study of psychological measurement and testing.
qualitative variable; quantitative variable. A qualitative variable describes
qualitative features of subjects in a study (e.g., marital status—never-married,
married, widowed, divorced, separated). A quantitative variable describes
numerical features
of the subjects (e.g., height, weight, income). This is not a hard-and-fast
distinction, because qualitative features may be given numerical codes, as
with a dummy variable. Quantitative variables may be classified as discrete
or continuous. Concepts such as the mean and the standard deviation apply
only to quantitative variables. Compare continuous variable; discrete variable;
dummy variable. See variable.
quartile. The 25th or 75th percentile. See percentile. Compare median.
R-squared (R2). Measures how well a regression equation fits the data. R-squared
varies between 0 (no fit) and 1 (perfect fit). R-squared does not measure the
extent to which underlying assumptions are justified. See regression model.
Compare multiple correlation coefficient; standard error of regression.
random error. Sources of error that are random in their effect, like draws made
at random from a box. These are reflected in the error term of a statistical
model. Some authors refer to random error as chance error or sampling error.
See regression model.
random variable. A variable whose possible values occur according to some
probability mechanism. For example, if a pair of dice are thrown, the total
number of spots is a random variable. The chance of two spots is 1/36, the
chance of three spots is 2/36, and so forth; the most likely number is 7, with
chance 6/36.
The expected value of a random variable is the weighted average of
the possible values; the weights are the probabilities. In our example, the
expected value is
(1/36) × 2 + (2/36) × 3 + (3/36) × 4 + (4/36) × 5 + (5/36) × 6 + (6/36) × 7
+ (5/36) × 8 + (4/36) × 9 + (3/36) × 10 + (2/36) × 11 + (1/36) × 12 = 7
In many problems, the weighted average is computed with respect to the
density; then sums must be replaced by integrals. The expected value need
not be a possible value for the random variable.
Generally, a random variable will be somewhere around its expected value,
but will be off (in either direction) by something like a standard error (SE)
or so. If the random variable has a more or less normal distribution, there is
about a 68% chance for it to fall in the range expected value – SE to expected
value + SE. See normal curve; standard error.
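For the two-dice example, the weighted average can be computed directly (exact fractions avoid rounding error):

```python
from fractions import Fraction

# Chance of each total for a pair of dice: 1/36 for 2, rising to 6/36 for 7.
chances = {total: Fraction(6 - abs(total - 7), 36) for total in range(2, 13)}

expected_value = sum(p * total for total, p in chances.items())
print(expected_value)  # 7

# The SE is the square root of the weighted average squared deviation.
variance = sum(p * (total - expected_value) ** 2 for total, p in chances.items())
print(round(float(variance) ** 0.5, 2))  # about 2.42
```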
randomization. See controlled experiment; randomized controlled experiment.
randomized controlled experiment. A controlled experiment in which sub-
jects are placed into the treatment and control groups at random—as if by a
lottery. See controlled experiment. Compare observational study.
range. The difference between the biggest and the smallest values in a batch of
numbers.
rate. In an epidemiological study, the number of events, divided by the size of
the population; often cross-classified by age and gender. For example, the
death rate from heart disease among American men ages 55–64 in 2004 was
about three per thousand. Among men ages 65–74, the rate was about seven
per thousand. Among women, the rate was about half that for men. Rates
adjust for differences in sizes of populations or subpopulations. Often, rates
are computed per unit of time, e.g., per thousand persons per year. Data
source: Statistical Abstract of the United States tbl. 115 (2008).
regression coefficient. The coefficient of a variable in a regression equation.
See regression model.
regression diagnostics. Procedures intended to check whether the assumptions
of a regression model are appropriate.
regression equation. See regression model.
regression line. The graph of a (simple) regression equation.
regression model. A regression model attempts to combine the values of certain
variables (the independent or explanatory variables) in order to get expected
values for another variable (the dependent variable). Sometimes, the phrase
“regression model” refers to a probability model for the data; if no qualifica-
tions are made, the model will generally be linear, and errors will be assumed
independent across observations, with common variance. The coefficients in
the linear combination are called regression coefficients; these are parameters.
At times, “regression model” refers to an equation (“the regression equation”)
estimated from data, typically by least squares.
For example, in a regression study of salary differences between men and
women in a firm, the analyst may include a dummy variable for gender,
as well as statistical controls such as education and experience to adjust for
productivity differences between men and women. The dummy variable
would be defined as 1 for the men and 0 for the women. Salary would be
the dependent variable; education, experience, and the dummy would be the
independent variables. See least squares; multiple regression; random error;
variance. Compare general linear model.
relative frequency. See frequency.
relative risk. A measure of association used in epidemiology. For example, if
10% of all people exposed to a chemical develop a disease, compared to 5%
of people who are not exposed, then the disease occurs twice as frequently
among the exposed people: The relative risk is 10%/5% = 2. A relative risk of
1 indicates no association. For more details, see Leon Gordis, Epidemiology
(4th ed. 2008). Compare odds ratio.
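The arithmetic in the example is simple enough to sketch (the figures are the ones from the entry):

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Incidence among the exposed divided by incidence among the unexposed."""
    return risk_exposed / risk_unexposed

# 10% of exposed people develop the disease, versus 5% of the unexposed.
print(relative_risk(0.10, 0.05))  # 2.0 -- the disease occurs twice as often
```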
reliability. The extent to which a measurement process gives the same results on
repeated measurement of the same thing. Compare validity.
representative sample. Not a well-defined technical term. A sample judged to
fairly represent the population, or a sample drawn by a process likely to give
samples that fairly represent the population, for example, a large probability
sample.
resampling. See bootstrap.
residual. The difference between an actual and a predicted value. The predicted
value comes typically from a regression equation, and is better called the fit-
ted value, because there is no real prediction going on. See regression model;
independent variable.
response variable. See independent variable.
risk. Expected loss. “Expected” means on average, over the various datasets that
could be generated by the statistical model under examination. Usually, risk
cannot be computed exactly but has to be estimated, because the parameters
in the statistical model are unknown and must be estimated. See loss func-
tion; random variable.
risk factor. See independent variable.
robust. A statistic or procedure that does not change much when data or assump-
tions are modified slightly.
sample. A set of units collected for study. Compare population.
sample size. Also, size of sample. The number of units in a sample.
sample weights. See stratified random sample.
sampling distribution. The distribution of the values of a statistic, over all pos-
sible samples from a population. For example, suppose a random sample is
drawn. Some values of the sample mean are more likely; others are less likely.
The sampling distribution specifies the chance that the sample mean will fall
in one interval rather than another.
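A sampling distribution can be approximated by simulation (the population and sample size here are invented for illustration):

```python
import random
import statistics

random.seed(0)
population = list(range(1, 1001))  # population mean is 500.5

# Draw many random samples of 25 units and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 25)) for _ in range(2000)
]

# The sample means scatter around the population mean; their spread
# reflects the chance variability of the estimate from sample to sample.
print(round(statistics.mean(sample_means), 1))   # near 500.5
print(round(statistics.stdev(sample_means), 1))  # roughly 57
```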
sampling error. A sample is part of a population. When a sample is used to
estimate a numerical characteristic of the population, the estimate is likely to
differ from the population value because the sample is not a perfect micro-
cosm of the whole. If the estimate is unbiased, the difference between the
estimate and the exact value is sampling error. More generally,
estimate = true value + bias + sampling error
Sampling error is also called chance error or random error. See standard error.
Compare bias; nonsampling error.
sampling frame. A list of units designed to represent the entire population as
completely as possible. The sample is drawn from the frame.
sampling interval. See systematic sample.
scatter diagram. Also, scatterplot; scattergram. A graph showing the relation-
ship between two variables in a study. Each dot represents one subject. One
variable is plotted along the horizontal axis, the other variable is plotted along
the vertical axis. A scatter diagram is homoscedastic when the spread is more
or less the same inside any vertical strip. If the spread changes from one strip
to another, the diagram is heteroscedastic.
selection bias. Systematic error due to nonrandom selection of subjects for
study.
sensitivity. In clinical medicine, the probability that a test for a disease will give
a positive result given that the patient has the disease. Sensitivity is analogous
to the power of a statistical test. Compare specificity.
sensitivity analysis. Analyzing data in different ways to see how results depend
on methods or assumptions.
sign test. A statistical test based on counting and the binomial distribution. For
example, a Finnish study of twins found 22 monozygotic twin pairs where
1 twin smoked, 1 did not, and at least 1 of the twins had died. That sets up
a race to death. In 17 cases, the smoker died first; in 5 cases, the nonsmoker
died first. The null hypothesis is that smoking does not affect time to death,
so the chances are 50-50 for the smoker to die first. On the null hypothesis,
the chance that the smoker will win the race 17 or more times out of 22 is
8/1000. That is the p-value. The p-value can be computed from the binomial
distribution. For additional detail, see Michael O. Finkelstein & Bruce Levin,
Statistics for Lawyers 339–41 (2d ed. 2001); David A. Freedman et al.,
Statistics 262–63 (4th ed. 2007).
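The p-value in the twin study can be computed directly from the binomial distribution:

```python
from math import comb

pairs, smoker_first = 22, 17

# Null hypothesis: each pair is a 50-50 toss, so the chance that the
# smoker dies first in 17 or more of the 22 pairs is a binomial tail sum.
p_value = sum(comb(pairs, k) for k in range(smoker_first, pairs + 1)) / 2**pairs
print(round(p_value, 3))  # 0.008, the 8/1000 reported in the entry
```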
significance level. See fixed significance level; p-value.
significance test. Also, statistical test; hypothesis test; test of significance. A signifi-
cance test involves formulating a statistical hypothesis and a test statistic, com-
puting a p-value, and comparing p to some preestablished value (α) to decide
if the test statistic is significant. The idea is to see whether the data conform
to the predictions of the null hypothesis. Generally, a large test statistic goes
with a small p-value; and small p-values would undermine the null hypothesis.
For example, suppose that a random sample of male and female employees
were given a skills test and the mean scores of the men and women were
different—in the sample. To judge whether the difference is due to sampling
error, a statistician might consider the implications of competing hypotheses
about the difference in the population. The null hypothesis would say that
on average, in the population, men and women have the same scores: The
difference observed in the data is then just due to sampling error. A one-sided
alternative hypothesis would be that on average, in the population, men score
higher than women. The one-sided test would reject the null hypothesis if
the sample men score substantially higher than the women—so much so that
the difference is hard to explain on the basis of sampling error.
In contrast, the null hypothesis could be tested against the two-sided
alternative that on average, in the population, men score differently than
women—higher or lower. The corresponding two-sided test would reject the
null hypothesis if the sample men score substantially higher or substantially
lower than the women.
The one-sided and two-sided tests would both be based on the same
data, and use the same t-statistic. However, if the men in the sample score
higher than the women, the one-sided test would give a p-value only half as
large as the two-sided test; that is, the one-sided test would appear to give
stronger evidence against the null hypothesis. (“One-sided” and “one-tailed”
are synonymous; so are “two-sided” and “two-tailed.”) See p-value; statistical
hypothesis; t-statistic.
significant. See p-value; practical significance; significance test.
simple random sample. A random sample in which each unit in the sampling
frame has the same chance of being sampled. The investigators take a unit at
random (as if by lottery), set it aside, take another at random from what is
left, and so forth.
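The lottery-style procedure corresponds to sampling without replacement (the frame size and sample size here are arbitrary):

```python
import random

random.seed(7)
frame = list(range(1, 501))  # a sampling frame of 500 numbered units

# Each unit has the same chance of being drawn; a drawn unit is set
# aside, so no unit can appear twice.
sample = random.sample(frame, 20)
print(len(sample), len(set(sample)))  # 20 20
```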
simple regression. A regression equation that includes only one independent
variable. Compare multiple regression.
size. A synonym for alpha (α).
skip factor. See systematic sample.
specificity. In clinical medicine, the probability that a test for a disease will give
a negative result given that the patient does not have the disease. Specificity
is analogous to 1 – α, where α is the significance level of a statistical test.
Compare sensitivity.
spurious correlation. When two variables are correlated, one is not necessarily
the cause of the other. The vocabulary and shoe size of children in elementary
school, for example, are correlated—but learning more words will not make
the feet grow. Such noncausal correlations are said to be spurious. (Originally,
the term seems to have been applied to the correlation between two rates with
the same denominator: Even if the numerators are unrelated, the common
denominator will create some association.) Compare confounding variable.
standard deviation (SD). Indicates how far a typical element deviates from the
average. For example, in round numbers, the average height of women age
18 and over in the United States is 5 feet 4 inches. However, few women
are exactly average; most will deviate from average, at least by a little. The
SD is sort of an average deviation from average. For the height distribution,
the SD is 3 inches. The height of a typical woman is around 5 feet 4 inches,
but is off that average value by something like 3 inches.
For distributions that follow the normal curve, about 68% of the elements
are in the range from 1 SD below the average to 1 SD above the average.
Thus, about 68% of women have heights in the range 5 feet 1 inch to 5 feet
7 inches. Deviations from the average that exceed 3 or 4 SDs are extremely
unusual. Many authors use standard deviation to also mean standard error.
See standard error.
standard error (SE). Indicates the likely size of the sampling error in an esti-
mate. Many authors use the term standard deviation instead of standard error.
Compare expected value; standard deviation.
standard error of regression. Indicates how actual values differ (in some aver-
age sense) from the fitted values in a regression model. See regression model;
residual. Compare R-squared.
standard normal. See normal distribution.
standardization. See standardized variable.
standardized variable. Transformed to have mean zero and variance one. This
involves two steps: (1) subtract the mean; (2) divide by the standard deviation.
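The two steps translate directly into code (the sample figures are invented):

```python
import statistics

def standardize(values):
    """Step 1: subtract the mean. Step 2: divide by the standard deviation."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; conventions vary
    return [(v - m) / sd for v in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))  # 0.0 1.0
```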
statistic. A number that summarizes data. A statistic refers to a sample; a parameter
or a true value refers to a population or a probability model.
statistical controls. Procedures that try to filter out the effects of confounding
variables on non-experimental data, for example, by adjusting through statisti-
cal procedures such as multiple regression. Variables in a multiple regression
equation. See multiple regression; confounding variable; observational study.
Compare controlled experiment.
statistical dependence. See dependence.
statistical hypothesis. Generally, a statement about parameters in a probability
model for the data. The null hypothesis may assert that certain parameters have
specified values or fall in specified ranges; the alternative hypothesis would
specify other values or ranges. The null hypothesis is tested against the data with
a test statistic; the null hypothesis may be rejected if there is a statistically sig-
nificant difference between the data and the predictions of the null hypothesis.
Typically, the investigator seeks to demonstrate the alternative hypothesis;
the null hypothesis would explain the findings as a result of mere chance,
and the investigator uses a significance test to rule out that possibility. See
significance test.
statistical independence. See independence.
statistical model. See probability model.
statistical test. See significance test.
statistical significance. See p-value.
stratified random sample. A type of probability sample. The researcher divides
the population into relatively homogeneous groups called “strata,” and draws
a random sample separately from each stratum. Dividing the population into
strata is called “stratification.” Often the sampling fraction will vary from
stratum to stratum. Then sampling weights should be used to extrapolate
from the sample to the population. For example, if 1 unit in 10 is sampled
from stratum A while 1 unit in 100 is sampled from stratum B, then each unit
drawn from A counts as 10, and each unit drawn from B counts as 100. The
first kind of unit has weight 10; the second has weight 100. See Freedman et
al., Statistics 401 (4th ed. 2007).
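The weighting in the example can be sketched as follows (the measured values are invented; the 1-in-10 and 1-in-100 sampling fractions are those in the entry):

```python
# Stratum A is sampled at 1 in 10 (weight 10); stratum B at 1 in 100
# (weight 100). Each sampled unit "stands for" weight-many population units.
sample = [
    ("A", 12, 10),   # (stratum, measured value, weight) -- values invented
    ("A", 15, 10),
    ("B", 20, 100),
]

# Weighted extrapolation from the sample to the population total.
estimated_total = sum(value * weight for _, value, weight in sample)
print(estimated_total)  # (12 + 15) * 10 + 20 * 100 = 2270
```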
stratification. See independent variable; stratified random sample.
study validity. See validity.
subjectivist. See Bayesian.
systematic error. See bias.
systematic sample. Also, list sample. The elements of the population are num-
bered consecutively as 1, 2, 3, . . . . The investigators choose a starting point
and a “sampling interval” or “skip factor” k. Then, every kth element is
selected into the sample. If the starting point is 1 and k = 10, for example, the
sample would consist of items 1, 11, 21, . . . . Sometimes the starting point
is chosen at random from 1 to k: this is a random-start systematic sample.
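A sketch of the selection rule (with k = 10, as in the example):

```python
import random

def systematic_sample(population_size, k, random_start=False):
    """Select every k-th element; optionally start at random from 1 to k."""
    start = random.randint(1, k) if random_start else 1
    return list(range(start, population_size + 1, k))

print(systematic_sample(100, 10))  # [1, 11, 21, ..., 91]
random.seed(3)
print(len(systematic_sample(100, 10, random_start=True)))  # 10 either way
```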
t-statistic. A test statistic, used to make the t-test. The t-statistic indicates how
far away an estimate is from its expected value, relative to the standard error.
The expected value is computed using the null hypothesis that is being tested.
Some authors refer to the t-statistic, others to the z-statistic, especially when
the sample is large. With a large sample, a t-statistic larger than 2 or 3 in abso-
lute value makes the null hypothesis rather implausible—the estimate is too
many standard errors away from its expected value. See statistical hypothesis;
significance test; t-test.
t-test. A statistical test based on the t-statistic. Large t-statistics are beyond the
usual range of sampling error. For example, if t is bigger than 2, or smaller
than –2, then the estimate is statistically significant at the 5% level; such values
of t are hard to explain on the basis of sampling error. The scale for t-statistics
is tied to areas under the normal curve. For example, a t-statistic of 1.5 is not
very striking, because 13% = 13/100 of the area under the normal curve is
outside the range from –1.5 to 1.5. On the other hand, t = 3 is remarkable:
Only 3/1000 of the area lies outside the range from –3 to 3. This discussion is
predicated on having a reasonably large sample; in that context, many authors
refer to the z-test rather than the t-test.
Consider testing the null hypothesis that the average of a population equals
a given value; the population is known to be normal. For small samples, the
t-statistic follows Student’s t-distribution (when the null hypothesis holds)
rather than the normal curve; larger values of t are required to achieve sig-
nificance. The relevant t-distribution depends on the number of degrees of
freedom, which in this context equals the sample size minus one. A t-test is
not appropriate for small samples drawn from a population that is not normal.
See p-value; significance test; statistical hypothesis.
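The statistic itself is straightforward to compute (the data below are invented; the degrees-of-freedom lookup is omitted):

```python
import statistics
from math import sqrt

def t_statistic(sample, null_mean):
    """(Sample mean - expected value under the null) / standard error."""
    se = statistics.stdev(sample) / sqrt(len(sample))  # SE of the sample mean
    return (statistics.mean(sample) - null_mean) / se

data = [5.1, 4.9, 5.6, 5.2, 5.4, 5.8, 5.3, 5.5]  # invented measurements
t = t_statistic(data, null_mean=5.0)
print(round(t, 2))  # about 3.4 -- hard to explain as sampling error
```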
test statistic. A statistic used to judge whether data conform to the null hypoth-
esis. The parameters of a probability model determine expected values for the
data; differences between expected values and observed values are measured
by a test statistic. Such test statistics include the chi-squared statistic (χ²) and
the t-statistic. Generally, small values of the test statistic are consistent with
the null hypothesis; large values lead to rejection. See p-value; statistical
hypothesis; t-statistic.
time series. A series of data collected over time, for example, the Gross National
Product of the United States from 1945 to 2005.
treatment group. See controlled experiment.
two-sided hypothesis; two-tailed hypothesis. An alternative hypothesis
asserting that the values of a parameter are different from—either greater than
or less than—the value asserted in the null hypothesis. A two-sided alterna-
tive hypothesis suggests a two-sided (or two-tailed) test. See significance test;
statistical hypothesis. Compare one-sided hypothesis.
two-sided test; two-tailed test. See two-sided hypothesis.
Type I error. A statistical test makes a Type I error when (1) the null hypothesis
is true and (2) the test rejects the null hypothesis, i.e., there is a false positive.
For example, a study of two groups may show some difference between
samples from each group, even when there is no difference in the population.
When a statistical test deems the difference to be significant in this situation,
it makes a Type I error. See significance test; statistical hypothesis. Compare
alpha; Type II error.
Type II error. A statistical test makes a Type II error when (1) the null hypoth-
esis is false and (2) the test fails to reject the null hypothesis, i.e., there is a
false negative. For example, there may not be a significant difference between
samples from two groups when, in fact, the groups are different. See signifi-
cance test; statistical hypothesis. Compare beta; Type I error.
unbiased estimator. An estimator that is correct on average, over the pos-
sible datasets. The estimates have no systematic tendency to be high or low.
Compare bias.
uniform distribution. For example, a whole number picked at random from 1
to 100 has the uniform distribution: All values are equally likely. Similarly, a
uniform distribution is obtained by picking a real number at random between
0.75 and 3.25: The chance of landing in an interval is proportional to the
length of the interval.
validity. Measurement validity is the extent to which an instrument measures
what it is supposed to, rather than something else. The validity of a standard-
ized test is often indicated by the correlation coefficient between the test
scores and some outcome measure (the criterion variable). See content valid-
ity; differential validity; predictive validity. Compare reliability.
Study validity is the extent to which results from a study can be relied
upon. Study validity has two aspects, internal and external. A study has high
internal validity when its conclusions hold under the particular circumstances
of the study. A study has high external validity when its results are gener-
alizable. For example, a well-executed randomized controlled double-blind
experiment performed on an unusual study population will have high internal
validity because the design is good; but its external validity will be debatable
because the study population is unusual.
Validity is used also in its ordinary sense: assumptions are valid when they
hold true for the situation at hand.
variable. A property of units in a study, which varies from one unit to another,
for example, in a study of households, household income; in a study of
people, employment status (employed, unemployed, not in labor force).
variance. The square of the standard deviation. Compare standard error; covariance.
weights. See stratified random sample.
within-observer variability. Differences that occur when an observer measures
the same thing twice, or measures two things that are virtually the same.
Compare between-observer variability.
z-statistic. See t-statistic.
z-test. See t-test.
References on Statistics
General Surveys
David Freedman et al., Statistics (4th ed. 2007).
Darrell Huff, How to Lie with Statistics (1993).
Gregory A. Kimble, How to Use (and Misuse) Statistics (1978).
David S. Moore & William I. Notz, Statistics: Concepts and Controversies (2005).
Michael Oakes, Statistical Inference: A Commentary for the Social and Behavioral
Sciences (1986).
Statistics: A Guide to the Unknown (Roxy Peck et al. eds., 4th ed. 2005).
Hans Zeisel, Say It with Figures (6th ed. 1985).
Reference Works for Lawyers and Judges
David C. Baldus & James W.L. Cole, Statistical Proof of Discrimination (1980
& Supp. 1987) (continued as Ramona L. Paetzold & Steven L. Willborn,
The Statistics of Discrimination: Using Statistical Evidence in Discrimination
Cases (1994)) (updated annually).
David W. Barnes & John M. Conley, Statistical Evidence in Litigation: Methodol-
ogy, Procedure, and Practice (1986 & Supp. 1989).
James Brooks, A Lawyer’s Guide to Probability and Statistics (1990).
Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers (2d ed. 2001).
Modern Scientific Evidence: The Law and Science of Expert Testimony (David
L. Faigman et al. eds., Volumes 1 and 2, 2d ed. 2002) (updated annually).
David H. Kaye et al., The New Wigmore: A Treatise on Evidence: Expert Evi-
dence § 12 (2d ed. 2011) (updated annually).
National Research Council, The Evolving Role of Statistical Assessments as Evi-
dence in the Courts (Stephen E. Fienberg ed., 1989).
Statistical Methods in Discrimination Litigation (David H. Kaye & Mikel Aickin
eds., 1986).
Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in Law and
Litigation (1997).
General Reference
Encyclopedia of Statistical Sciences (Samuel Kotz et al. eds., 2d ed. 2005).