5
PCCL to CCL: Classification Algorithm

INTRODUCTION

The intrinsic difficulty of identifying potentially harmful agents for resource-intensive scrutiny such as the selection of a drinking water contaminant from a preliminary Drinking Water Contaminant Candidate List (PCCL) for inclusion onto a CCL raises the question of what kind of process or method is best suited to this judgment. As discussed in Chapter 2, the sorting of perhaps thousands of PCCL contaminants into two discrete sets, one (the CCL) that probably will undergo research or monitoring of some sort preparatory to an eventual regulatory decision and another much larger set that will not, is an exercise in classification.

The committee considered three broad types of strategies for accomplishing this task: expert judgment, rule-based systems, and prototype classifiers. It is important to note at the outset that there are no sharp boundaries between these three classification strategies since all are based on data to some extent and expert judgments factor into all to some degree. The main differences are the specific mix of data and expert judgment and the extent to which explicit or implicit judgments come together to produce predictable results. This chapter begins with a general overview of classification schemes. It is followed by an example application, framework, and related recommendations for how the U.S. Environmental Protection Agency (EPA) might develop its own prototype classification scheme for use in the creation of future CCLs.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration 5 PCCL to CCL: Classification Algorithm INTRODUCTION The intrinsic difficulty of identifying potentially harmful agents for resource-intensive scrutiny such as the selection of a drinking water contaminant from a preliminary Drinking Water Contaminant Candidate List (PCCL) for inclusion onto a CCL raises the question of what kind of process or method is best suited to this judgment. As discussed in Chapter 2, the sorting of perhaps thousands of PCCL contaminants into two discrete sets, one (the CCL) that probably will undergo research or monitoring of some sort preparatory to an eventual regulatory decision and another much larger set that will not, is an exercise in classification. The committee considered three broad types of strategies for accomplishing this task: expert judgment, rule-based systems, and prototype classifiers. It is important to note at the outset that there are no sharp boundaries between these three classification strategies since all are based on data to some extent and expert judgments factor into all to some degree. The main differences are the specific mix of data and expert judgment and the extent to which explicit or implicit judgments come together to produce predictable results. This chapter begins with a general overview of classification schemes. It is followed by an example application, framework, and related recommendations for how the U.S. Environmental Protection Agency (EPA) might develop its own prototype classification scheme for use in the creation of future CCLs.

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration OVERVIEW OF CLASSIFICATION SCHEMES Expert Judgments Many decisions (and their associated classifications) are made on the basis of the collective experience of experts. EPA has made extensive and regular use of expert committees (including this National Research Council [NRC] committee), advisory panels, peer review committees, and the like to help its staff (another assemblage of experts) make important policy decisions. Clearly the composition of any expert group is critical to the outcome, and in recent years there has been a concerted effort within EPA to include a wide spectrum of “stakeholder” opinion in the expertise solicited (e.g., the September 1999 workshop to discuss EPA’s draft CCL Research Plan; AWWARF, 2000). This is a recognition that where a particular expert stands on the issue is often influenced by where he or she sits in the real world (e.g., organizational affiliations). Of course, the outcome of any expert panel may be influenced equally, though to a largely unknown extent, by the absence of persons who could not participate in such meetings. Often the reasons for the absence of a particular perspective are random or accidental, as when schedules or timing do not permit participation at the last minute or result in a substitution for similar reasons. Yet even when matters external to the question at hand are set aside, the dynamics of expert committees often influences the outcome in crucial ways. Thus, when a particular subject comes up (e.g., at the beginning of a meeting versus the end of a long day), who advocates for or against a position (involving questions of articulateness, seniority, and status) or the juxtaposition or context of agenda items can lead to very different outcomes for reasons unconnected with the content of the issues at hand. The presence or absence of EPA experts and the relative force-fulness of their participation can frequently alter the direction of a discussion in important ways, for example, directing it toward or away from regulatory and policy concerns that may not be apparent or uppermost in the minds of non-agency experts. For the same reasons, discussions of committees comprised solely of EPA staff and consultants are likely to have a different character than those with significant or predominant participation from experts outside the agency. One strategy to neutralize these adventitious effects is to use a formal

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration Delphi procedure. The Delphi technique was introduced more than 30 years ago to limit interaction among participants and thereby optimize the quality of decisions (Dalkey, 1969; Linstone and Turoff, 1975). It is one of the formal group judgment methods used to obtain consensus and define disagreements on a specific question (Webler et al., 1991). The method assists in identifying important issues and needs (NRC, 1996, 1999c), integrating data and judgments (NRC, 1988, 1992), informing policy processes (NRC, 1996), structuring deliberations, modifying classification schemes, and ranking priorities (IOM, 1988; NRC, 1992). It may be used to obtain consensus on a time-limited issue or to facilitate longer-term building of common ground on issues. For example, the technique has been used by committees of the Institute of Medicine (IOM, 1992) and is being used at the U.S. Army’s Tooele Chemical Agent Disposal Facility to identify current issues to be included in a stakeholder survey (NRC, 1999c). Briefly, the Delphi process includes the solicitation and compilation of responses to a specific question. The answers are summarized (typically displayed with the range and mean of all responses) and provided to the participants, with each person invited to reconsider his or her position in light of the compilation. The answers received from this second request are again compiled. This process is repeated until group consensus is obtained (IOM, 1995; NRC, 1988). The technique is traditionally conducted with participants responding remotely and independently of each other. The Delphi method has been implemented with mailed or faxed surveys, e-mail instruments, face-to-face meetings or workshops, and iterations of closed computer networks and group dialogues. The means selected to implement the Delphi process involve trade-offs of cost, time, and commitment of the participants; the impact of influential individuals; confidentiality; duration and intensity of the process; response rate; and clarity of the responses (IOM, 1995). The Delphi technique may be implemented alone or as part of a multistep process such as Renn and colleague’s three-step procedure discussed in Chapter 2 (Renn et al., 1993), or it may be conducted for separate, dissimilar groups, commonly referred to as a double-Delphi system (NRC, 1992). For example, the same question could be asked of experts and nonexperts convened in two separate but simultaneous panels. The double-Delphi approach may be employed in workshops where the groups use separate computer networks in separate rooms and then share their consensus in reporting-out sessions with all groups present. Although there has been relatively little evaluation of the effective-

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration ness of the Delphi process, there are indications that it may not result in decisions much improved over those obtained through less structured processes (IOM, 1992). Its use for selecting candidates for the CCL, therefore, is questionable, but it remains a technique that potentially avoids some of the traditional pitfalls of expert opinion methods of classification. Rule-Based Methods Rule-based schemes take as inputs various features or parameters of an object and weigh and combine these features according to an algorithm that is decided upon in advance—usually as a result of some expert judgment. One characteristic of rule-based and expert systems is that their classification strategy is what is often called “Aristotelian,” that is, objects (in this case potential drinking water contaminants) are assigned attributes (e.g., toxicity or solubility), and a set of rules is used to determine which class they are in (Bowker and Star, 1999). They are, thus, expert opinion strategies, in which the opinions are embedded into a fairly rigid algorithm. Their rigidity can be considered both their strength (they are objective and consistent, and they allow for high throughput) and their weakness (they do not easily allow for additional nuanced judgment). The weights and modes of combination used in a rule-based scheme are presumably determined using some preexisting idea or objective of what the scheme intends to capture; however, this is often not done in any systematic or explicit fashion. The committee’s first report reviewed 10 rule-based ranking schemes whose objective was to use data about chemicals to establish a priority for regulation or attention (NRC, 1999a). As noted in that report, there are no formal schemes that might be considered for prioritizing microbial contaminants. However, EPA and the American Water Works Association Research Foundation have jointly sponsored a series of expert workshops beginning in 1996 to develop a decision process for prioritizing emerging waterborne pathogens. Although the final report of these workshops is nearing completion, it was not available for review at the time this report was written. At the request of EPA, the rule-based systems were evaluated for their potential to help select contaminants (including microorganisms) that are already on a CCL for future action. All of the systems reviewed had at least one shortcoming, but of special concern was the extent to which (often) arbitrary and nonexplicit expert

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration judgments were intrinsically embodied in what appeared to be objective ranking schemes. For example, various kinds of information (such as chemical persistence, solubility, and toxicity) had to be weighted and combined to produce a single number that allowed an ordinal ranking at the end of the process. The methods of combination (e.g., additive or multiplicative) and the weights given the various factors were all matters of judgment that were in place prior to the input of any data to the ranking system. In short, the committee concluded that a ranking process that attempts to sort contaminants in a specific order is not appropriate for the selection of drinking water contaminants already on a CCL for regulation, research, or monitoring activities (NRC, 1999a). In the absence of complete information, the output of the prioritization schemes was found to be so uncertain (though this uncertainty is generally not stated) that they are of limited use in making more than preliminary risk management decisions about drinking water contaminants. The committee concluded, however, that they may provide a (semi)quantitative means for preliminarily screening and sorting large numbers of contaminants. Even with this limited use, however, the aforementioned considerations can still become serious issues. Based on its collective experience, the committee emphasizes that there are some common and often troublesome problems that occur in developing new ranking systems and revising existing ranking systems to suit specific purposes. Many of these problems are similar to those noted earlier as related to the formation and deliberations of expert panels. For example, sometimes an expert group contains individuals who interpret the charge or understand the client’s intended purpose differently, because they bring different disciplines, experiences, and/or motives to the process. A perhaps surprisingly common problem is that many such groups think they have a consensus about the purpose of the system on which they are working, but in fact do not. Even when a goal is clearly stated and mutually understood, group members may envision different types of ranking processes and final products as appropriately meeting the goal. Some group members may believe a simple ranking process is adequate, whereas others believe the process should be somewhat to very complex. The final product may be conceived as quantitative, semiquantitative, or qualitative, with a whole range of outputs from a few to many and from categorical to specific. Some groups have had long debates about what process and form will best fit the purpose, and some have come to near deadlock over how much scoring or ranking is

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration enough or too much. Others have worked for months before they realized that they were trying to achieve different visions. Another situation in which groups that are trying to develop ranking schemes have encountered difficulties is where science and policy issues meet. When there is a lack of clarity about which decisions are within the group’s charge and which are better left to policy settings, groups have often gotten focused on narrow scientific issues, either not recognizing that there is a policy issue in the matter or inadvertently believing that there is a scientific “solution” to a policy problem. Groups have also gotten bogged down in specific scientific issues when it has been too difficult to address broader points of disagreement. A report sponsored by the Society of Environmental Toxicology and Chemistry is available that might help reduce such problems because it provides an expert panel consensus framework along with principles and guidelines to promote consistency in the development and use of chemical ranking and scoring systems (Swanson and Socha, 1997). Prototype Classification Methods An alternative classification strategy is sometimes referred to as “prototype classification” (Bowker and Star, 1999). This strategy recognizes that in ordinary practice and discourse one does not usually classify objects on the basis of a fixed algorithm, but instead uses criteria based on prior classification of examples or prototypes. The classic example is character recognition of an individual letter in handwriting, most commonly recognized by its similarity to an idealized example rather than by any fixed features such as height-to-width ratios. Prototype classifiers take advantage of the prototyping activity at which humans generally (and intuitively) excel. Prototype classification schemes usually take the form of neural networks, clustering algorithms, machine learning classifiers, and their hybrids. These methods start with a known classification of prototypes (a “training set”) that embody the kinds of outcomes one might wish to achieve. These prototypes are used to discern an algorithm that maps prototype features or attributes into classification outcomes. The prototype-based algorithm can then be used for classification of new objects. The neural network paradigm is discussed more fully later in this chapter.

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration Application of a prototype scheme for constructing the CCL would consist of a training set of chemicals, microorganisms, and other types of (potential) drinking water contaminants that would clearly belong on the CCL, such as currently regulated chemicals (if they were not already regulated), and those that clearly do not, such as food additives generally recognized as safe by the U.S. Food and Drug Administration. Each contaminant’s “features” or “attributes” must be extracted and characterized. In this case, these would include parameters such as solubility, various measures of toxicity (quantitative or categorical), and occurrence data, if any. Using this as a training set, the neural network constructs both the mode of combination and the weighting factors that seem best to differentiate between the two categories. It is important to note that the distinction between Aristotelian and prototype classification strategies is not sharp but is more a matter of emphasis. What in the ranking scheme was done a priori by experts is done instead on the basis of the data that differentiated prototype examples. This a posteriori determination of weights on the basis of features sets these methods apart from rule-based methods, but it should be clear that rule-based methods also depend on the data in a fundamental way. Conversely, prototype schemes emphasize past classification decisions in the construction of a classification algorithm but necessarily involve some expert judgment in determining which features should be used to characterize the prototypes. The committee notes that prototype strategies have found use in commerce and other sciences, although their use for constructing something like the CCL would be innovative. Keller et al. (1995) have compiled papers describing studies of neural network applications in hazardous waste disposal, environmental monitoring, and reliability analysis. The Federal Aviation Administration has used a neural network for analysis of radiation directed at travelers’ luggage at security check-points; numerous financial analysts have found neural networks to be useful in predicting stock performance and foreign exchange rates; and neural networks have been successful in the prediction of educational performance. These examples are well described by Garson (1998) and the references cited therein. Which Strategy to Use? To date, EPA has relied extensively on use of expert judgment and to

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration a lessor extent on rule-based prioritization schemes to identify and rank drinking water contaminants for regulatory and research activities. It is clear that contaminant-by-contaminant consideration by panels of experts as to whether something should be placed on a CCL is not possible if the entire universe of potential drinking water contaminants is to be considered, as recommended in the committee’s second report (NRC, 1999b) and again in this one. Some efficient screening method is required. Until now, this has usually meant the use of an a priori ranking or classifying system. However, none of the existing schemes for ranking chemicals that the committee previously reviewed (NRC, 1999a) seem able to fulfil this function readily, so one would have to be significantly modified or an entirely new one created. As discussed above, the difficulties are formidable, and there is no obvious reason why a new attempt will solve the problems of previous ones. Use of a prototype classification strategy and neural network technology would seem to be an innovative new approach. Furthermore, since the objective is one of identification, not regulation per se, the method does not have to withstand regulatory challenge. The PCCL contaminants that it helps identify for inclusion on a CCL can be evaluated in conventional ways by EPA staff and external experts for further justification of why they do or do not belong on the CCL. The recommended system merely identifies and offers likely candidates in a systematic way. The committee notes that a potentially serious drawback of using this technology is its perceived lack of “transparency” in that the neural network can easily appear as a black box, with little obvious indication of how it is “working.” The importance of transparency in decision-making is discussed extensively in Chapter 2. However, the committee reiterates that transparency is not necessarily synonymous with simplicity. A CCL decision-making process that uses complex classification modeling can be made relatively transparent by emphasizing that the classification is based on prototypes of past regulatory decisions and should, thus, be readily defensible. The difference in this regard from more conventional methods is perhaps misleading. The committee emphasizes that there is little that is “transparent” or easily reproducible about expert judgment, for example. The “black box” in this case is the human brain. Such expert judgments still must be justified for regulatory purposes, but this is no different than what will be required as output from a neural network approach. Ranking systems are only superficially transparent, in that the weights and modes of combination are explicit and open for all to see,

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration but how the weights are arrived at and the consequences of the modes of combination usually are not. As for the neural network approach, its outputs must also be justified for regulatory purposes. The committee recommends, therefore, that EPA give careful consideration to and actively experiment with developing a prototype classification approach using neural network or similar methods (in conjunction with expert judgment) for identifying appropriate PCCL candidates for inclusion on the CCL. Further, EPA should use several training sets to gauge the sensitivity of the adopted method. As discussed in Chapter 2, the committee cautions that disputes may occur (e.g., among decision-makers and between decision-makers and the public) regarding the issue of what minimum levels of sensitivity and specificity the approach must have in order to be deemed “adequate.” A robust result over different training sets would clearly lend support for this approach. Additional details of the recommended approach to the development of future CCLs are presented later in this chapter. ILLUSTRATIVE EXAMPLE OF A PROTOTYPE CLASSIFICATION SCHEME FOR CCL CONTAMINANTS The remainder of this chapter presents a demonstration of prototype schemes for classifying drinking water contaminants. As previously noted, prototype classification methods require a training data set. For illustrative purposes, the committee constructed a training data set based on contaminants that are presumed worthy of regulatory consideration and contaminants for which the committee can presume regulatory consideration is not necessary. The contaminants included in the training data set were then assigned values for the five health effect and occurrence attributes discussed in Chapter 4. For purposes of mathematical modeling, the report uses the following notation to represent the values of the five attributes for each contaminant: X1 is severity, X2 is potency, X3 is prevalence, X4 is magnitude, and X5 is persistence-mobility. Depending on the category of the contaminant, each contaminant is assigned a value of the binary classification variable, the target (T). The committee uses T=1 for contaminants that are “presumed worthy of regulatory consideration” and T=0 for contaminants “presumed not worthy of regulatory consideration.” The prototype classification approach does not require that the contaminant attributes be those specified by the committee. One could use

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration more attributes or fewer. They can be different attributes entirely. Redundant attributes can be integrated prior to processing. For example, in the absence of complete occurrence information one might use Boolean logic to combine prevalence and magnitude or persistence-mobility. Multidimensional attributes could be expanded. For example, the Henry’s law constant and aqueous solubility for chemicals could be used directly rather than combined in a single persistence-mobility score. Also, the attributes can be ordinal or categorical. The objective is to come up with contaminant features that comprehensively encompass the information that determines a regulatory action decision. The overall objective is to mathematically represent the mapping between the contaminant attributes (the values of Xis) and the target value. This is expressed mathematically as Y=f({Xi};θ), (5–1) where Y is the predicted value of the classification variable and θ collectively describes the parameters of the mapping function f(). The most commonly used metric to evaluate the function’s ability to capture the training data set is the mean squared error (mse): (5–2) where N is the total number of contaminants in the training data set and the subscript j denotes contaminant j. There is a great deal of choice in selection of the mapping function f(). In this report, the committee presents two alternatives. One is based on a linear function, and the other derives from a neural network architecture. The training data set is used to “calibrate” the mapping function (i.e., to estimate the optimal values of the parameters). The final step is to use an appropriate criterion to determine the optimal threshold value for Y that separates data into the two classes. One possible criterion is to set the threshold at that value of Y that minimizes the number of objects in the training data set that are misclassified. The resulting mapping can then be used to predict the classification for new contaminants to which values of Xi have been assigned. A predicted

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration value of Y greater than the threshold would indicate that the contaminant belongs in the T=1 category, and a predicted value of Y less than the threshold would indicate that the contaminant belongs in the T=0 category. The following sections of this chapter present the details of such an analysis. The committee emphasizes that the classification analysis presented here is not intended to lead to definitive conclusions about specific contaminants. The committee was limited in time and resources; hence the training data set used for this analysis was smaller than ideal and the attribute values are far from certain. Nonetheless, the committee believes that this analysis provides a valid demonstration of the methodology and could serve as a framework for a similar analysis that EPA might conduct to help classify PCCL contaminants for inclusion on the corresponding CCL. The Training Data Set The committee constructed a training data set consisting of 80 chemical and microbial contaminants, 63 of which were assigned to the “presumed worthy of regulatory action” category (T=1) and 17 of which were assigned to the “presumed not worthy of regulatory action” category (T=0). The contaminants included in the training data set are listed in Table 5–1. Those in the T=1 category were selected from among currently regulated drinking water contaminants that have enforceable maximum contaminant levels (MCLs; see Chapter 1 for further information). Those in the T=0 category are considered to be safe for human ingestion. Some of these were taken from the list of substances generally recognized as safe (GRAS) by the U.S. Food and Drug Administration. In discussing this illustrative example, the committee reiterates that it was constrained by time and data resources, so the list of contaminants in the T=0 category is relatively small and includes only inorganic and organic chemicals. If the EPA conducts a similar analysis, the committee strongly recommends that it increase the size and types of drinking water contaminants (e.g., radionuclides) included in the training data set to improve predictive capacity.

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration FIGURE 5–4 Classification error as a function of threshold value, in which classification error is defined as the number of misclassified contaminants linear classifier). FIGURE 5–5 Classification error as a function of threshold value, in which classification error is defined as the percentage of the number of contaminants within a category that are misclassified (linear classifier).

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration The ability of the classification scheme to separate the training data set is one way to estimate the classification error that is expected when used for prediction. With a threshold value of 0.55, the error in misclassifying T=1 contaminants is 8 percent (5 out of 63) and the error in misclassifying T=0 contaminants is 24 percent (4 out of 17). The larger error in misclassifying T=0 contaminants reflects the fact that this portion of the training data set was smaller, so there is less confidence that the contaminants used are representative of the population of contaminants in this category. These errors can be interpreted as follows: If this classification algorithm is used for prediction, 92 percent of the time a contaminant that should be on the CCL will be classified correctly. This represents the “sensitivity” of the classification. The other characteristic is the “specificity,” which is that 76 percent of the time a contaminant that should not be on the CCL will be classified correctly. That is, the probability of a false negative is 8 percent and the probability of a false positive is 24 percent. This estimation of errors is essentially a demonstration of what the committee discussed in Appendix A of its first report (NRC, 1999a). Classification Results Using a Neural Network Classifier The neural network used for this classification problem has two layers connected in a feed-forward configuration (Figure 5–6). The first is a hidden layer containing two nodes (neurons). A hyperbolic tangent sigmoidal function was chosen as the transfer function for each node in this layer. This transfer function is one of several commonly used nonlinear functions. The second is the output layer with a single node and a linear transfer function. All nodes have biases. Other possible network architectures could have been used for this problem (e.g., additional nodes in the hidden layer or different transfer functions). The committee chose this architecture because it is one of the simplest but has sufficient flexibility to capture nonlinear dependencies. The training algorithm used to calibrate the neural network was a conjugate gradient method that constitutes a family of efficient search algorithms. This search method converges much faster than the simpler gradient descent methods, but as is true with all nonlinear optimization problems, care must be taken to ensure robustness in the solution by running several training sessions with different initial conditions. Network

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration performance is measured according to the minimum of mse (Equation 5–2). Using the committee’s training data set, the resulting mean squared error was 0.018, which is considerably smaller than that achieved using the linear classifier. This clearly demonstrates the improvement in fitting capabilities of a neural network model over a linear model. The classification results using the neural network model are shown in Figure 5–7. The enhanced ability of the neural network classifier to separate the training data is clearly seen by comparing the histograms in Figures 5–3 and 5–7. The improved performance of the neural network is rather remarkable considering that the neural network architecture (Figure 5–6) is one of the simplest possible for this problem. Classification error analysis indicated that the optimum threshold value of Y for the neural network classifier is 0.55, which coincidentally is the same as the resulting optimal threshold for the linear classifier. The predicted classification error for the neural network classifier is 3 percent for false negatives (2 out of 63). Because there were no misclassifications of the T=0 contaminants, the false-positive rate is estimated to be near zero. FIGURE 5–6 Multilayer neural network used for contaminant classification. Weights are not shown, for simplicity.

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration FIGURE 5–7 Histogram of Yi values for the training data set using the neural network classifier. For the linear classifier it was possible to examine the values of the weights and their statistical significance to determine the relative importance of the different attributes in determining the classification outcome. Because of the greater complexity of the neural network model, there is less transparency in the functional mapping. In most computer programs, the weights can be examined, but it is very difficult to extrapolate these weights mentally to readily understandable rules. For example, in this neural network, there are 10 weights in the first layer, 2 weights in the second layer, and a total of 3 bias weights. Because these parameters relate to cross-products and nonlinear functions of the attributes, it is not immediately obvious what the values mean. One way to probe the importance of various attributes in a neural network is to leave out certain attributes in the training data set and examine the effect on the classification of the training data. Indeed, this practice is discussed in Chapter 2 as a means to enhance the transparency of the approach, and the committee recommends that EPA conduct such a sensitivity analysis on any classification model it develops and intends to use to assist in the creation of future CCLs.

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration DEMONSTRATED USE OF THE TRAINED CLASSIFIER Examination of Misclassified Contaminants Misclassification of contaminants in the training data set can be interpreted in three ways: Either (1) the training data (i.e., the attribute scores) do not capture the information that determines regulatory action for drinking water contaminants; (2) the model relating the input to the output does not adequately capture the process by which this information is used in regulatory decision-making; or (3) the target values are wrong, implying that some of the regulatory decisions made in the past are inconsistent with regulatory decisions for the bulk of drinking water contaminants. If one operates on the presumption that the training data are accurate, precise, and complete, then classification errors can be reduced by exploring modeling alternatives. That is precisely what was done in the analysis in this chapter. A large number of classification errors resulted from the linear model, so a neural network model was used that greatly reduced classification error. One can continue to try to eliminate all classification errors by using more and more elaborate neural networks (or other classification models such as support vector machines, cluster algorithms, radial basis functions). However, as mentioned previously, this endeavor may lead the analyst down the potentially dangerous path of overfitting the data. Thus, an analyst must question whether persistent errors are indicative of true classification errors. If one operates on the assumption that the attribute scores should combine in a linear fashion to determine classification output, then some insights can be gained by examining which of the contaminants in the training data set were misclassified using the linear classifier. These are listed in Table 5–3. Total coliforms and heterotrophic plate count (HPC) are misclassified because these contaminants are not inherently hazardous, but rather are indicators of the potential presence of hazardous microorganisms. For this reason, the committee believes it is appropriate for these two contaminants to fall below the threshold. The linear classifier also misclassified toluene, o-dichlorobenzene, and trans-1,2-dichloroethylene, which are currently regulated chemicals. These chemicals score relatively low on severity and potency because their health effects are smaller than those of other regulated drinking water contaminants that are of greater concern possibly because of their carcinogenic potential or their potential to cause health effects at low

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration TABLE 5–3 Contaminants in the Training Data Set That Were Misclassified Using the Linear Classifier Misclassified T=1 Contaminants Yi Misclassified T=0 Contaminants Yi o-Dichlorobenzene 0.47 Ethanol 0.57 trans-1,2-Dichloroethylene 0.54 Folic acid 0.72 Toluene 0.43 Olestra 0.63 HPC 0.28 Saccharin 0.70 Total coliforms 0.42     concentrations. Toluene is a chemical that is generally known to be rather prevalent, but the occurrence data available were insufficient to represent this fact. In general, the data used to score prevalence for all contaminants in the training data set were found to be lacking in several respects (see Chapter 4). These misclassifications may be indicative of a paucity of good information on which to base occurrence metrics. Of the T=0 contaminants that were misclassified in the linear classifier, the committee notes that ethanol, folic acid, and saccharin have potentially severe health impacts, and Olestra has significant potential for persistence and mobility in the environment. Like other substances in the T=0 category, however, these are substances that many people purposefully and regularly ingest. Rather than placing emphasis on the misclassifications obtained using the linear classifier, one can operate on the assumption that the underlying process by which the attribute scores determine classification output is complex and requires nonlinear functionalities to represent it mathematically. This makes the case for the use of neural networks rather than linear classifiers. The misclassified contaminants for the neural network classifier are listed in Table 5–4. The neural network classifier had a much smaller number of misclassified contaminants, indicating that it was able to capture the data mapping process mathematically. It is interesting to note that the misclassified contaminants in this case are not a subset of those misclassified in the linear classification. Here, ethylbenzene has been misclassified, whereas it had not been misclassified in the linear classification. This indicates the strong underlying nonlinearity that is being modeled by the neural network.

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration TABLE 5–4 Contaminants in the Training Data Set That Were Misclassified Using the Neural Network Classifier Misclassified T=1 Contaminants Yi Misclassified T=0 Contaminants Ethylbenzene 0.50 — HPC 0.08 — Validation Test Cases The contaminants in the training data set in the T=1 category did not include all those that have MCLs. Five such chemical and microbial contaminants were withheld as validation test cases to examine the predictive accuracy of the classification algorithm as required in the second phase of study (see Preface to this report). Details of how health effect and occurrence attributes were scored for these regulated contaminants are discussed earlier in this chapter. The predicted values of Yi for the five validation test cases are listed in Table 5–5. All contaminants are correctly classified as being in the T=1 category because their Yi values are well above the thresholds. These correct predictions, albeit few in number, provide additional supporting evidence of the validity of the classification algorithm. EPA should make every effort to increase the number of both types of validation test cases (especially for T=0 contaminants) to assess more thoroughly the predictive accuracy of any classification algorithm developed for use in the creation of future CCLs. Prediction for Interesting Test Cases The committee selected five potential drinking water contaminants for which data were available that may be of interest in the future (Table 5–6). Two of these, aluminum and methyl-t-butyl ether (MTBE), are currently included on the 1998 CCL (EPA, 1998b). The linear classifier predicts that all five should be considered for regulatory action. However, the neural network classifier predicts that silver, chloroform, aspirin, and MTBE should be placed on the CCL, but aluminum should not (based on data available at this time). The committee emphasizes that these results do not indicate a recommendation that any of these con-

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration TABLE 5–5 Classification Prediction Accuracy for Validation Test Cases Validation Test Cases Predicted Yi Using Linear Classifier (Threshold = 0.55) Predicted Yi Using Neural Network (Threshold = 0.55) Arsenic 1.2 1.00 Nitrate 1.1 1.00 Atrazine 0.79 1.00 Tetrachloroethylene 0.76 0.99 Giardia lamblia 1.1 1.00 TABLE 5–6 Classification Prediction for Interesting Test Cases Interesting Cases Predicted Yi Using Linear Classifier (Threshold=0.55) Predicted Yi Using Neural Network (Threshold=0.55) Aluminum 0.61 0.08 Silver 1.2 1.00 Chloroform 0.8 1.00 Aspirin 1.0 1.00 MTBE 0.98 1.00 taminants be placed on a future CCL or removed from the 1998 CCL. Rather, the results are intended to demonstrate how the EPA can develop and use a classification scheme to help select PCCL contaminants for inclusion on the CCL. The committee also believes that this demonstrates the sensitivity of the results to choices made in constructing the classification algorithm. SUMMARY: CONCLUSIONS AND RECOMMENDATIONS The purpose of this analysis was to demonstrate how existing contaminant data and past regulatory decisions could be used to develop a prototype classification algorithm to determine, in conjunction with expert judgment, whether a particular drinking water contaminant is of regulatory concern or not. More specifically, the committee has demon-

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration strated a prototype classification approach that must first be trained (calibrated) using a training data set containing prototype contaminants and can then be used in conjunction with expert judgment to predict whether a new (PCCL) contaminant should be placed on the CCL or not. The use of the majority of currently regulated drinking water contaminants in the training data set can be described simply as “making decisions that are consistent with and build upon what has been done in the past.” If this is deemed defensible, the prototype classification approach can be regarded as valid. It is important to realize that this approach is then constrained by the data (i.e., past regulatory decisions). What has been presented in this chapter is a framework and demonstration of how EPA might develop its own prototype classification scheme for use in the creation of future CCLs. In this regard, the committee makes the following recommendations: A linear model and a neural network are discussed and demonstrated for potential use in a prototype classification scheme. Although the neural network performed better than the linear model (with respect to minimizing the number of misclassified contaminants), at this time the committee cannot make a firm recommendation as to which model EPA should use because of the aforementioned uncertainties in the training data set. Thus, the committee recommends that EPA explore alternative model formulations and be cognizant of the dangers of overfitting and loss of generalization. To adopt and implement the recommended approach for the creation of future CCLs, EPA will have to employ or work with persons knowledgeable of prototype classification methods and devote appreciable time and resources to develop and maintain a comprehensive training data set. In this regard, the committee recommends strongly that EPA greatly increase the size of the training data set that is used illustratively in this chapter to improve predictive capacity. One way in which EPA can expand the training data set and classification algorithm is to allow for the expected case of missing data—that is, to purposefully include in the training data set contaminants for which values of some of the attributes are unknown and develop a scheme that allows prediction of contaminants for which some of the attributes are unknown. EPA will also have to accurately and consistently assign attribute scores for all contaminants under consideration. To do this, it will have to collect and organize available data and research for each PCCL contaminant and document the attribute scoring scheme used to help ensure

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration a transparent and defensible process, the importance of which is discussed in Chapter 2. As recommended in Chapter 3, the creation of a consolidated database that would provide a consistent mechanism for recording and retrieving information on the contaminants under consideration would be of benefit. EPA will also have to withhold contaminants from inclusion in the training data set to serve as validation test cases that can assess the predictive accuracy of any classification algorithm developed. While the committee was able to withhold five contaminants presumed worthy of regulatory consideration (T=1) for this purpose (see Table 5–5), it had insufficient numbers of contaminants presumed not worthy of regulatory consideration (T=0) to similarly withhold. All withheld validation contaminants were classified correctly as belonging in the T=1 category and such results provide (albeit limited) additional support for the validity of the classification algorithm approach. EPA should make every effort to increase the number of both types of validation test cases (especially for T=0 contaminants) to assess more thoroughly the predictive accuracy of any classification algorithm developed for use in the creation of future CCLs. If neural networks are used for prototype classification, the transparency in understanding which contaminant attributes determine the category of a contaminant will be less than that of a linear model or a more traditional rule-based scheme. However, if one acknowledges that the underlying process that maps attributes into categorical outcomes is very complex, then there is little hope that an accurate rule-based classification scheme can be constructed. The fact that the nonlinear neural network performed better than the linear classifier is a strong indicator that the underlying mapping process is complex, and it would be a difficult task for a panel of experts to accurately specify the rules and conditions of this mapping. Furthermore, the loss in transparency in using a neural network is not inherent, but rather derives from the difficulty in elucidating the mapping. The underlying mapping in a neural network classifier can be examined just as one would conduct experiments to probe a physical system in a laboratory. Through numerical experimentation, one can probe a neural network to determine the sensitivity of the output to various changes in input data. Although a sensitivity analysis was not conducted because of time constraints, the committee recommends that EPA should use several training data sets to gauge the sensitivity of the method as part of its analysis and documentation if a classification ap-

OCR for page 107
Classifying Drinking Water Contaminants for Regulatory Consideration proach is ultimately adopted and used to help create future CCLs. Finally, EPA should realize that the committee is recommending a prototype classification scheme to be used in conjunction with expert judgment for the future selection of PCCL contaminants for inclusion on a CCL. Thus, transparency is less crucial (though no less desired) at this juncture than when selecting contaminants from the CCL for regulatory activities as discussed in the committee’s first report (NRC, 1999a).