the relationships among the variables, and it is likely that no model is absolutely correct in every detail. Nevertheless, an informative and biologically appropriate model extends the informativeness of data; a poor model may obscure true relationships between outcome and predictors; and both the efficiency and validity of inferences suffer if a model is seriously incorrect from either the biologic or statistical perspective.
Regression models, including linear regression, can be used to examine a specific proposed functional relation between a risk factor and an outcome. If the proposed form of the relation is biologically inappropriate, the model findings may be misleading or incorrect.
In the past, linear models have been widely used to assess the effects of environmental agents. Analysis of variance (ANOVA) and linear-regression models generally assume that the outcome varies linearly with functions of the risk factors, that the individual observations are statistically independent, and that the random deviations from the model (the errors) all have the same distribution, although models are available that relax each of these assumptions. For example, if the outcome seems to be approximately log-normally distributed, the investigator may assume that the natural logarithm of the outcome varies (approximately) linearly with continuous risk factors and that the errors of that model are (approximately) normally and independently distributed on the logarithmic scale. The outcome measures and risk factors are often assumed to be measured without error, although this assumption can also be relaxed. In any case, adherence to the assumptions underlying the chosen statistical model should be tested, because violations can affect sample-size calculations when studies are designed and confidence bounds and hypothesis tests when data are analyzed.
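The log-normal example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, and the names `exposure`, `outcome`, and `fit_simple_ols`, as well as the chosen coefficients, are hypothetical and do not come from any study discussed here. The sketch fits an ordinary least-squares line to the natural logarithm of the outcome and keeps the residuals so that the model assumptions can be examined afterward.

```python
import math
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated (hypothetical) data: log(outcome) is linear in exposure,
# with independent normal errors on the logarithmic scale.
rng = random.Random(0)
TRUE_INTERCEPT, TRUE_SLOPE, SIGMA = 1.0, 0.05, 0.3
exposure = [rng.uniform(0.0, 50.0) for _ in range(500)]
outcome = [
    math.exp(TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, SIGMA))
    for x in exposure
]

# Fit the linear model on the logarithmic scale.
log_outcome = [math.log(y) for y in outcome]
intercept, slope = fit_simple_ols(exposure, log_outcome)

# Residuals on the log scale; these are what the normality, independence,
# and equal-variance assumptions refer to.
residuals = [ly - (intercept + slope * x) for x, ly in zip(exposure, log_outcome)]

print(f"estimated intercept: {intercept:.3f}")
print(f"estimated slope:     {slope:.4f}")
print(f"multiplicative change per unit exposure: {math.exp(slope):.4f}")
```

Because the model is linear on the logarithmic scale, `exp(slope)` is read as a multiplicative change in the median outcome per unit increase in the risk factor, rather than as an additive change.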
Thus, the linear-regression analyses that have often been used in studies of environmental agents typically carry strong underlying assumptions about the distribution of the data and the nature of the relationships being examined. There is a critical tradeoff here: stronger assumptions, if they are nearly correct, allow more to be learned from a specific set of observations, but they also increase the risk of a critical failure in one or more of the assumptions. Fortunately, most currently available statistical programs incorporate approaches to test compatibility with these assumptions, and some techniques allow analysis of data that violate one or more of them. Some of these methods are described below, with examples of their use, drawn largely from studies of the health effects of air pollution.
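As a rough illustration of what testing compatibility with these assumptions can look like, the following sketch computes two simple residual diagnostics on simulated data. The Durbin-Watson statistic is a standard check for serial correlation in residuals ordered in time (values near 2 are consistent with independence). The `variance_ratio` helper is a hypothetical, deliberately crude stand-in for formal equal-variance tests and is not a method taken from the text; all data and names here are assumptions made for the example.

```python
import random


def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    den = sum(e * e for e in residuals)
    return num / den


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def variance_ratio(x, residuals):
    """Residual variance in the upper half of x divided by that in the lower half."""
    pairs = sorted(zip(x, residuals))
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[half:]]
    return sample_variance(high) / sample_variance(low)


rng = random.Random(1)
n = 1000

# Case 1: independent, equal-variance errors (assumptions hold).
x = [rng.uniform(0.0, 1.0) for _ in range(n)]
independent = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Case 2: serially correlated errors, AR(1) with correlation 0.8.
correlated = [rng.gauss(0.0, 1.0)]
for _ in range(n - 1):
    correlated.append(0.8 * correlated[-1] + rng.gauss(0.0, 1.0))

# Case 3: error standard deviation grows with the predictor.
unequal = [rng.gauss(0.0, xi) for xi in x]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
ratio_equal = variance_ratio(x, independent)
ratio_unequal = variance_ratio(x, unequal)

print(f"Durbin-Watson, independent errors: {dw_independent:.2f}")
print(f"Durbin-Watson, correlated errors:  {dw_correlated:.2f}")
print(f"variance ratio, equal variance:    {ratio_equal:.2f}")
print(f"variance ratio, unequal variance:  {ratio_unequal:.2f}")
```

In practice an analyst would apply such checks to the residuals of the fitted model and would typically rely on the formal tests and diagnostic plots provided by a statistical package rather than on hand-written helpers like these.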
Generally, counts, or discrete data, are assumed to follow some version of the Poisson distribution. The Poisson distribution does approach