Data Collection and Use
Large-scale modeling is critically dependent on good data, available statistics, and, when necessary, carefully encoded expert opinions. As the PWS Study notes: “Without reliable and statistically valid data, safety shortcomings cannot be identified with clarity, and once safety programs are in place, they cannot be evaluated to determine if they are effective and whether resources committed to safety are being used wisely” (PWS Study 4.1). Unfortunately, the PWS Study does not inspire confidence that statistically valid data were available, acquired, and used in the calculations, although it is clear that the PWS Study team made substantial efforts to find complete data.
The necessary data included information on traffic patterns, the environment (weather, sea conditions, visibility, ice), and operational performance. Data on traffic patterns were used to develop the traffic simulation model, but in spite of diligent efforts, the environmental data were not as complete as the PWS Study team desired. Operational data in PWS were also inadequate and had to be supplemented with worldwide data. The NRC committee questions the appropriateness of some of these supplemental data.
GEOGRAPHIC AND TRAFFIC DATABASE
The PWS Study used a reasonable geographic representation of the PWS. Although the maritime charts were old, they probably did not influence the outcome substantially. An extensive traffic database was developed, based on (in the judgment of the PWS Study team) reliable records and on discussions with shippers and the U.S. Coast Guard. This database was the basis for the PWS Study’s traffic model. The same traffic patterns were used for the simulation and MARCS models. All models assumed that TAPS vessel operations were conducted according to the PWS maritime system rules, such as staying in designated transit channels and operating at designated vessel speeds. The models also make questionable assumptions regarding the distribution of vessel tracks through the channels
(e.g., that vessel tracks do not cross). Although the general traffic patterns of non-TAPS vessels (e.g., fishing vessels, tour boats, cruise ships) are reasonably well understood, their speeds and tracks are highly variable. Therefore, modeling these vessels is very difficult, and the consequences of the simplifying assumptions made about them are not known.
One weakness of the input traffic data is that they cover only one year, 1995, which was an atypical year because coordination was poor among the U.S. Coast Guard, the Alaska Department of Fish and Game, and the fishing industry at the time. As a result, a large number of fishing boats operated near tankers in 1995. This problem was rectified in 1996 through improved coordination among these organizations for setting dates for the fishing season (PWS Study, Exec. Sum., p. 5) and “the 1996 fishing season was free from incidents” (PWS Study 7.37).
The PWS Study notes that “accurate historical weather data for the Sound is [sic] not readily available” (PWS Study 2.3). Weather data from several buoys at three locations (not the most critical locations) were used. The buoys were out of service for part of 1995, the reference year. Data were also collected from SERVS vessels at three locations, but the vessels returned to port during the most severe weather. The data were not in a common format, were not all of high quality, and were often incomplete, and they did not cover weather at the two most critical sites in the PWS, the Narrows and the Hinchinbrook Entrance. The PWS Study notes that “the measured weather data poorly represents site-specific conditions being used to make closure conditions…This may result in the system being closed when it should be open, and open when it should be closed” (PWS Study 8.7).
Some members of the NRC committee noted that even though the weather data used by the PWS Study team were poor, they may have been more complete than the data available for other areas of the world. In any case, these members were reasonably certain that the effect of the weather was not significant enough to cast doubt on the results. Other members of the committee felt that the events concerned were rare, high-consequence events for which outlier data were important considerations that should have been included. They also felt that the weather in PWS was important to the results because rapid changes in the weather there can affect the transit of tankers and escorts.
The PWS Study cited a precedent for using visibility readings at the Valdez airport for both Port Valdez and PWS, although visibility readings in these two locations “are often quite different” (PWS Study 2.4). The PWS Study used airport visibility data unless better data were available because “accurate visibility records for the Sound are scarce, (although) some data does [sic] exist” (PWS Study 2.4).
Ice is an important consideration in the tanker lanes and “can be expected in the vicinity of the tanker lanes in the Valdez Arm and the Central Sound approximately 40 days a
year” (TD 5.3:1.76). However, the base year of 1995 does not reflect these conditions because in 1995 the “presence of significant ice in the traffic lanes was…a relatively rare occurrence” (PWS Study 7.35). The study states that “[b]oth the system simulation and the fault tree predict a significant increase in the risk of grounding and collision when ice is present” (PWS Study 7.34–7.35). In 1989, the Exxon Valdez deviated from the lane and changed course to avoid ice and ran aground catastrophically on Bligh Reef (TD 5.3:1.76).
INCIDENTS/ACCIDENTS (OPERATIONAL) DATABASE
Historically, accurate data on marine incidents relating to failures and accidents have been lacking in risk assessments of the marine transport of oil. The need for improved safety data has been indicated in a variety of NRC studies (NRC 1990, 1991, 1994). One of the first tasks of the PWS risk assessment team was to develop a process for gathering data on marine accidents for the TAPS oil tankers. Data were collected on incidents involving groundings, collisions, allisions, steering and propulsion failures, electrical and mechanical failures, navigation equipment failures, structural failures, and fires and explosions.
If local data were available and deemed reliable, they were used. If local data were insufficient (e.g., on founderings and fires and explosions), worldwide or other regional (e.g., North Sea) data were used. However, the study does not discuss the effect of using non-PWS data on the results. The PWS Study states that all events in the database were verified by two independent data sources and that filling gaps in the event database usually required the reconciliation of archival data from several sources.
Modeling depends on appropriate data on the performance of ships and people. The PWS Study developed an apparently large database on PWS traffic but had difficulty finding operational performance data. Eventually, using confidential information from the shippers, the PWS Study team compiled a set of 50 databases, which were supplied to the committee in a letter dated February 18, 1997. For proprietary reasons, only 32 databases were listed in the PWS Study (PWS Study 4.9).
The PWS incidents/accidents (operational) database was constructed of a mix of public and private databases, including confidential company information. Data collection also involved questionnaires (discussed in some detail below), surveys, company audits, reviews of public record (such as the PWS Vessel Traffic Service data), interviews with local community organizations, and maritime accident data (both domestic and international).
The following organizations were involved in the data collection process:
U.S. Coast Guard
U.S. Department of the Interior, Minerals Management Service
Prince William Sound RCAC
Alaska Department of Environmental Conservation
National Transportation Safety Board
Republic of Liberia’s Transportation Research Board
International Maritime Organization
local libraries and libraries at maritime academies
Alyeska Pipeline Service Company/SERVS
TAPS shipping companies (proprietary information)
The letter from the PWS Study team mentions 27 public and 23 private databases and lists 27 public sources and 8 reports from the Alyeska Pipeline Service Company. The other sources are summarized as “9 shipping company proprietary databases” and “4 databases from private Alaska citizens…private maritime organization databases of waterways events, incidents, accidents, etc.” (letter from Martha Grabowski, Feb. 18, 1997). Unfortunately, the key private databases, especially those of the shippers,1 were not available for review. All the data were synthesized into one database, creating a valuable resource for examining alternatives in PWS. However, under the PWS Study team’s agreement with the shippers, this database was destroyed at the end of the study, and the proprietary components were returned to the providers. Consequently, the committee was unable to confirm the quality of the data, how they were collected, how well they were incorporated into the modeling, and whether significant anomalies were explored.
A notable weakness in the data was the poor response to questionnaires by non-TAPS companies. Fifty-five questionnaires were sent,2 but responses were received from only eight tanker companies, two tug/barge operators, one ferry operator, and one passenger vessel operator. In other words, 43 of the 55 operators surveyed did not respond (PWS Study 4.21–4.22). Experience has shown that direct contact with experts and investigators elicits a better response.
Finally, a potentially more serious omission, which is common to most maritime safety studies, is that “no near-miss data were available in the system, thus, no near-miss analysis was conducted during the Risk Assessment” (PWS Study 4.11). The PWS Study notes that its “results point out the importance of considering risk reduction measures which interrupt the causal chain of errors before the occurrence of an accident…" (PWS Study 4.17). In other technical operations, such as nuclear reactors, near-misses are called precursors, and collecting data on precursors is important to identifying the potential causes of accidents, which can then be addressed.3
The PWS Study team used recognized procedures for its company audits, ship visits, and survey questionnaires. DNV performed the management audits at the headquarters of eight shippers, followed by visits to one or two ships of each company to “verify the
existence and degree of implementation of management systems and procedures intended to be in place on board the ship” (PWS Study 4.36).4 Each company was then rated on its performance. After examining the results from the questionnaires, the PWS Study states that “the degree of agreement between different groups in the maritime industry can only be described as remarkable” (PWS Study 4.30). However, no data were presented to support this statement.
FAILURE RATE REPORTING
The TAPS trade shippers all collect failure data, which were used as much as possible. However, companies use different processes and quality standards to collect, analyze, and report failures. The PWS Study team found that more failure events were reported by high-performing shippers (i.e., safer shippers) than by low-performing shippers. This apparent contradiction was assumed to reflect the higher standards of better-performing companies for identifying and reporting system failures. The PWS Study team, therefore, “decided to base the calculation of failure rates on the companies with the best management scores, having established that those companies had the most stable and mature systems for failure reporting and analysis” (PWS Study 4.38). The NRC committee questions the wisdom of this approach.
The PWS Study team determined the quality of company management through management audits carried out at the head offices of several shippers. Management audits, however, do not always reveal the degree to which stated policies and procedures are actually put into practice. A committed and dedicated management organization, motivated employees, and adequate funding are all necessary for successful performance. To verify that policies and procedures were actually in place aboard ship, auditors inspected a few ships of each participating company. The committee was not convinced that the time spent aboard ship by a single auditor was adequate, or that audits were performed under a suitable regime (e.g., no-notice audits). The information was used to calibrate certain relative results to establish absolute numerical results. The method was based on the unsupported assumption that the scores of the management audits and the failure data were inversely proportional.
Based on substantial work on the contributing role of human factors in accidents in other fields, such as aviation safety and nuclear reactor safety, the committee has some serious concerns about how human factors were treated in the PWS Study. The study restated the widespread belief that 80 percent of all failures are caused by human error. The conclusions and recommendations of the study, therefore, focus on reducing human errors as the best way of reducing risk. Furthermore, the simulation model is based on the probabilities of incidents and accidents calculated from the questionnaires. For all of these
reasons, the NRC review committee decided to examine the questionnaire approach in detail.
The PWS Study misuses expert human judgment in the gathering of data for the risk assessment and, therefore, falls short in its treatment of human factors. Both failings are attributable to the dearth of causal data for accidents. The lack of data, however, is no excuse for making invalid assumptions about the value of expert judgment for estimating the probabilities of incidents and accidents attributable to structural, mechanical, or human error.
Expert judgments have long been used to assess relative probabilities in studies of risk, but the usefulness of expert judgments depends on the experts’ ability to make judgments and the analysts’ ability to aggregate these opinions properly. In the PWS Study, experts were expected to make judgments about the likelihood that failures would occur in specific situations. However, the experts were not reliable judges of human factors as causes of specific failures.
In Questionnaire IV, for example, experts were asked to compare the likelihood that certain categories of human error would cause incidents. The categories were very broad: poor judgment, poor decision making, poor communications, lack of knowledge. Experts can be expected to give useful opinions about whether certain errors in judgment or communication will increase the likelihood of an accident, but the error categories in this questionnaire were extremely vague. Experts were not told the circumstances surrounding poor judgment or what the poor communication was about because it was assumed that the error itself would put the operator and others involved into the accident sequence of events. The experts were not told what the hypothetical crew knew or perceived to be true.
The vague error categories told the experts nothing about the particulars that started the accident sequence. Experts could easily have given different answers in good faith based on their interpretations of the questions. Considering that the experts were given 150 scenarios to judge, and that their answers required very little expert knowledge or experience, the NRC committee questions how seriously the experts took this task. Lack of stamina (as discussed in TD II, 2.7) is not the only human limitation to weigh in assessing the value of expert judgments; the experts’ feelings about the relevance of the questionnaires to their particular realms of expertise may also have affected the quality of their responses. If they felt that many of the questions did not reflect an appreciation for their expertise, they may have treated the questionnaire lightly, even in relevant areas.
Expert opinions tell you what experts believe, not necessarily what is true. Data from expert opinions should be interpreted as indicative of the experts’ prominent concerns. The interpretation of data always depends on the method of data collection. If one asks questions that are not justified by clearly stated criteria, then one must question the value of the answers. If even sparse data exist, they should be compared to the expert opinions. It would have been valuable if experts had been asked to state the criteria upon which they based their opinions and to check whether their opinions were consistent with the evidence. The criteria might then have been used to design better ways of gathering data in the future. (One reason human performance data in existing accident and incident databases are sparse is that human factors have been poorly coded.)
Effective elicitation has become an important element in risk assessments of complex
systems, such as nuclear power plants and high-level waste repositories. The PWS Study team might have been more effective if it had consulted risk assessments by agencies such as the U.S. Nuclear Regulatory Commission and the U.S. Department of Energy, among others.
Even assuming that the expert judgments were meaningful, the questionnaire data are clustered around ratings that indicate little or no difference between scenario pairs, indicating either that there is little or no difference in risk or that the experts felt the comparisons were irrelevant.
A few models and tools have been developed using experimental cognitive psychology data to identify and evaluate human factors associated with complex human-machine systems independent of the environment. The U.S. Army and the National Aeronautics and Space Administration have a joint applied research program, called Aircrew/Aircraft Integration, to develop software tools and methods that improve the human engineering design process for advanced technology crew stations. The program’s major product, MIDAS (man-machine integration design and analysis system), provides system designers with a 3-D prototyping and task analysis environment that asks “what if” questions about crew performance to correct problems before the hardware is developed. MIDAS uses embedded models of human performance (e.g., vision, memory, and decision making, including functions for simulating remembering, forgetting, and interrupting activities) to perform various analyses. The data used as input for MIDAS might have been useful in the PWS Study for determining the relative probabilities of human errors. However, these probabilities were not taken into account, and only the results of the human errors were analyzed.
The expert judgments could have been used to answer the following questions:
Could they see themselves behaving in a certain way?
Could they see personnel of a “weaker” company behaving that way?
Did they consider certain actions likely with existing safeguards in place (alarms, for example)?
Another approach to using expert opinion focuses on obtaining expert information rather than expert opinions (Kaplan, 1992). The literature suggests that other approaches are also possible (Spetzler and von Holstein, 1984).
Because of the lack of essential objective data, the PWS Study team found it necessary to elicit and analyze expert judgments to complete their models. A major element of the PWS Study was long questionnaires (requiring up to two hours to complete) given to 162 people considered to be experts in relevant areas, including pilots, tanker officers, and others. The questionnaires are described and discussed immediately below, with a general discussion following. In one set of questions, the respondents were asked to estimate (on a 17-point scale) which of two situations was more likely to cause an accident, based on “waterways attributes” and given an incident. Examples of situations used in the study include ice or no
ice in the lane, propulsion failure, an inbound tanker with two escort vessels, and a tug with a tow in the lane. (The questionnaires can be found in TD 2.7.)
Expert Survey I: Assessing the Likelihood of a Vessel Operational Incident due to Human Error on a TAPS Tanker
The respondents were asked to compare the likelihood that a human error would occur for two different vessels. The vessels were described by nine characteristics (such as the year the ship was built and the amount of officer training). There was only one difference between the vessels in each set of questions, and the respondent was asked to decide which of the two vessels was more likely to experience a human error and how likely the error was. The causes of human error were: diminished ability; hazardous shipboard environment; lack of knowledge, skills, experience, or training; poor management practices; and faulty perceptions or understanding. The respondents were asked to address 101 comparisons.
Expert Survey II: Assessing the Likelihood of a Mechanical Reliability Incident on a TAPS Tanker
Using seven vessel attributes (four of them the same as in Survey I) and the same rating scale, respondents were asked which vessel would be more likely to experience a particular type of reliability incident and how likely that would be. The types of incidents were propulsion failures, steering failures, operational systems failures (loss of all radar, global positioning system, radio, etc.), and structural failures. Respondents were asked to address 160 comparisons.
Expert Survey III: Assessing the Likelihood of an Accident Caused by a Mechanical Reliability Incident on a TAPS Tanker
Respondents were told that one of three types of reliability failures had occurred: propulsion, steering, or all critical operating systems. Each pair of scenarios differed in only one of the following characteristics: location, the proximity of traffic (within 2 miles, 2 to 10 miles, nothing within 10 miles), the type of traffic (inbound or outbound tanker and less than or greater than 150,000 DWT), the number of escort vessels (0, 1, 2 or more), wind speed, wind direction, visibility, and ice conditions. The same scale was used as in the other surveys. Three types of accidents were listed: collisions, groundings, and founderings. The respondents were asked to address 275 comparisons.
Expert Survey IIIB: Assessing the Likelihood of a Collision between a TAPS Tanker and a Nearby Vessel
The other vessel was within 10 miles and had either a human error or a reliability failure. The respondents were asked to estimate in which situation a collision would be
more likely to occur, and how likely, on the 17-point scale. A collision was defined as a TAPS tanker being struck by a nearby vessel that was either under way or drifting. The attributes were the same as in Survey III. The respondents were asked to consider 54 comparisons.
Expert Surveys IVA and IVB: Assessing the Likelihood of an Accident Caused by a Human Error on a TAPS Tanker
The respondents were told that one of four types of human error had occurred and asked in which case an accident would be more likely to occur and, using the same scale as before, how likely that would be. The accidents were, as before, collisions, groundings, and founderings. The human errors were poor decision making, poor judgment, lack of knowledge, and poor communications (within the bridge team or between the vessel and vessel traffic system). In Survey IVA the respondents were asked to consider 160 comparisons involving poor decision making and poor judgment; in IVB, 164 comparisons involving lack of knowledge and poor communications.
Another set of questions used paired comparisons to develop the relative probabilities of incidents. Respondents were asked which of the two scenarios was more likely to happen, given two different vessel types, including differences in manning characteristics. These questions were used to develop relative probabilities for vessel reliability failures and relative probabilities for human error (PWS Study 4.34).
Expert Survey V: Assessing the Likelihood of a Mechanical Reliability Incident or Vessel Operational Incident on TAPS Tankers in PWS
Two vessels were described in terms of 11 characteristics (e.g., flag, management, vessel size). Eight scenarios and five types of human errors were given. The respondents were asked to check the combination of tanker plus error that was most likely to occur in PWS. For each scenario, there were six comparisons (only three for one scenario). The second part of this survey asked for similar comparisons for failures of propulsion, steering, operational systems, or structure. Five scenarios were given, four with six comparisons and one with four.
Expert Survey VI: Assessing the Likelihood of an Accident Caused by a Mechanical Reliability Incident or Vessel Reliability Incident in Various Situations in PWS
The respondents were asked to check which scenario was most likely. Scenarios were presented with the conditions used in Expert Survey III. Part A asked about operational failures, and Part B asked about reliability failures. The respondents were asked which was more likely to occur, given the type of incident that had already occurred. Part A had 16 scenarios, with six comparisons each; Part B had 11 scenarios, also with six comparisons each.
Finally, an open-ended questionnaire was given, in which respondents were asked general questions, such as how comfortable they were making the comparisons and what errors they believed were most likely to lead to oil outflow. Unfortunately, the answers to the first set of questions in this questionnaire were not given in the report, and the answers to the other questions were only tabulated; no actual responses were provided. The PWS Study team informed the NRC committee that these questions were used to check for anything missing from the list of risk reduction measures developed by the study team in conjunction with the PWS steering committee (personal communication from J. Harrald, March 5, 1997).
GENERAL DISCUSSION OF THE QUESTIONNAIRES
The answers to the questionnaires were used to develop conditional probabilities, and these were translated into the absolute probabilities required for the model. The PWS Study team used two techniques. First, the results of Surveys IV and V were used to determine the probability of an accident after an incident (PWS Study 4.34). Second, the management audits were compared with lost time injury rates (LTIRs), which were thought to be well reported for all ships. Failure rates were determined on the basis of correlations (e.g., better management audit scores correlated with lower LTIRs) (PWS Study 3.13, 3.59, and briefing on Jan. 6, 1997).
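The first technique described above, chaining an incident's base rate with an elicited conditional probability of escalation, can be sketched as follows. All event names and numerical values here are hypothetical illustrations, not values from the PWS Study.

```python
# Sketch of translating conditional probabilities into absolute ones:
# an absolute accident probability is the product of an incident's base
# rate and the elicited conditional probability that the incident
# escalates into an accident. All numbers are illustrative only.

# Hypothetical per-transit incident base rates (from historical data).
incident_rate = {
    "propulsion_failure": 1e-3,
    "steering_failure": 5e-4,
}

# Hypothetical conditional probabilities P(accident | incident), of the
# kind derived from the Survey IV/V paired comparisons.
p_accident_given_incident = {
    "propulsion_failure": 0.02,
    "steering_failure": 0.05,
}

def absolute_accident_rate(incident: str) -> float:
    """Per-transit accident rate: P(incident) * P(accident | incident)."""
    return incident_rate[incident] * p_accident_given_incident[incident]

# Combined per-transit accident rate across incident types.
total = sum(absolute_accident_rate(i) for i in incident_rate)
```

The quality of the result is bounded by the weakest factor: if the elicited conditional probabilities are unreliable, no precision in the base rates can rescue the product.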
USE OF EXPERT JUDGMENTS
The use of expert judgments raises a number of questions that are not unique to the PWS Study. The PWS Study team appears to have tried to minimize the difficulties by making adjustments for experts’ fatigue, level of understanding, individual scale bias, and variabilities in responses to the questionnaires. The NRC committee still has fundamental concerns, however, about issues such as community bias or viewpoint and the consistency of responses. The application of sophisticated statistical techniques to expert responses tends to mask these problems. In the case of the PWS Study, further complications were created by a subjective “worst case” approach. Mixing worst-case scenarios with probabilities makes interpreting the results extremely difficult.
The questionnaires and the analytic models were not always consistent. For example, some questions were generic, even though the simulation analysis models were for specific TAPS vessels. The study team assumed that, because the experts spanned a wide area of expertise, the responses of groups of experts (e.g., pilots, tanker officers) would yield consistent information. However, the combination of shared experiences, membership in a tightly bound community, and the desire of individuals to give “correct” responses may have skewed the results collectively. Individual experiences of events may have left some experts with vivid memories and lessons learned, which may have contributed to a variability in responses beyond the specified range of the questionnaires that was never adequately resolved. One method of assessing and calibrating variability is to ask benchmark questions to which the answers are known and to compare the responses. According to the PWS Study team, this was not done because of a lack of time and a lack of data from which to formulate benchmark questions.
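The benchmark-question idea mentioned above can be sketched in code: ask each expert a few questions whose answers are known, estimate the expert's systematic bias on a log scale, and remove that bias from the expert's other judgments. The quantities and values below are entirely hypothetical.

```python
# Sketch of benchmark calibration for expert elicitation: estimate an
# expert's average log-scale bias on questions with known answers, then
# debias the expert's judgment on a question with no known answer.
# All values are hypothetical, not PWS Study data.
import math

# Known benchmark quantities (e.g., documented failure-rate ratios).
benchmark_truth = {"q1": 2.0, "q2": 4.0}

# One expert's answers to the benchmarks, and to a real survey question.
expert_benchmark = {"q1": 3.0, "q2": 5.0}
expert_judgment = 6.0  # elicited ratio for the real question

# Average log-scale bias across the benchmark questions.
bias = sum(
    math.log(expert_benchmark[q]) - math.log(benchmark_truth[q])
    for q in benchmark_truth
) / len(benchmark_truth)

# Debiased estimate for the real question.
calibrated = math.exp(math.log(expert_judgment) - bias)
```

Here the expert overestimates both benchmarks, so the calibration shrinks the elicited judgment toward lower values; with no benchmark data, as in the PWS Study, no such correction is possible.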
The PWS Study does not document the criteria used for selecting experts and judging their qualifications. Because experts responded to the questionnaires anonymously, the NRC committee could not make a judgment about the qualifications of the experts. The PWS Study gives the impression that all of the experts were more or less equally qualified. A fundamental difficulty in using experts is always the question of their qualifications.
The use of expert judgments involves paired comparisons that are statistically treated to yield a ranking structure. Using the value scheme of the questionnaires, the PWS Study team developed a weighted ranking structure (relative values of frequencies of incidents and accidents under different conditions). This is an excellent technique, but it does not address the problem of calibrating the ranking to an absolute scale so that the information can be used to calculate usable “quasi-real” frequencies.
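The statistical treatment of paired comparisons can be sketched under a simplifying assumption: each response encodes how much more likely one scenario is than another on a log scale (a stand-in for the study's 17-point scale). With a complete, skew-symmetric comparison matrix, the least-squares score of each scenario is simply its row mean. Scenario names and values are illustrative.

```python
# Sketch of turning paired-comparison judgments into a relative ranking.
# Assumption: each response r[(i, j)] is a log-ratio judgment of how much
# more likely scenario i is than scenario j (a simplification of the
# study's 17-point scale). Data are illustrative, not PWS Study data.
import math

responses = {
    ("ice", "no_ice"): 1.5,  # ice judged ~e^1.5 times riskier than no ice
    ("ice", "fog"): 0.5,
    ("fog", "no_ice"): 1.0,
}
scenarios = ["ice", "fog", "no_ice"]

# Build the full skew-symmetric log-ratio matrix.
r = {(a, b): 0.0 for a in scenarios for b in scenarios}
for (a, b), v in responses.items():
    r[(a, b)] = v
    r[(b, a)] = -v

# Least-squares scores are row means; exponentiate for relative frequencies.
score = {a: sum(r[(a, b)] for b in scenarios) / len(scenarios) for a in scenarios}
relative = {a: math.exp(score[a]) for a in scenarios}

ranking = sorted(scenarios, key=lambda a: score[a], reverse=True)
```

Note that the resulting `relative` values are defined only up to a common scale factor, which is exactly the calibration problem discussed above: the ranking by itself yields no absolute frequencies.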
Incident calibrations in the PWS Study were based on one failure type, namely, propulsion failures. The PWS Study team argued that the propulsion failure database was extensive and reliable and that the quality of other failure data was not nearly as high. Although this might have been the most practical approach, it creates potential problems in calibrating absolute values because of uncertainties in the propulsion failure database and because of potential errors caused by the nonlinearities associated with using a single calibration “point” (i.e., propulsion data). The relative frequencies of accidents were calibrated against the MARCS model using two accident scenarios (collisions in the central PWS and in the Gulf of Alaska). Similar questions and uncertainties could be raised about the MARCS analysis. No verification of these processes was reported.
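Single-anchor calibration of the kind described above can be sketched as a one-line rescaling. The sketch also makes the committee's concern concrete: any error in the anchor's absolute rate propagates directly into every calibrated value. The rates below are hypothetical, not PWS Study data.

```python
# Sketch of calibrating relative frequencies against a single anchor
# event, as the PWS Study team did with propulsion failures. All
# numbers are hypothetical illustrations.

relative = {                    # relative frequencies from the expert ranking
    "propulsion_failure": 1.0,
    "steering_failure": 0.4,
    "structural_failure": 0.1,
}

# Absolute rate of the anchor event, assumed known from a reliable
# database (hypothetical value, per 1,000 transits).
anchor = "propulsion_failure"
anchor_rate = 2.5

# Scale every relative frequency so the anchor matches its absolute rate.
scale = anchor_rate / relative[anchor]
absolute = {event: rel * scale for event, rel in relative.items()}
```

Because the scaling is multiplicative, a 20 percent error in `anchor_rate` produces a 20 percent error in every entry of `absolute`, which is why uncertainty in the propulsion failure database matters for all of the calibrated results.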
Dependency on Data
All of the methods described above are highly dependent on appropriately selected databases that accurately represent the local situation. The models, however, reflect limitations in the data. Like many other marine areas, the PWS lacks data suitable for implementing these methods. The PWS Study team used some creative and imaginative procedures to develop the requisite data and relationships: using expert judgments; using worldwide data and data from other areas (e.g., the North Sea); making assumptions about the similarity of operations in the PWS and elsewhere; and making assumptions about how behavior in one aspect of operations (e.g., company management quality) or one parameter (e.g., loss of crew time) correlates with another area (e.g., operations safety).
Although worldwide data were used selectively, much of those data are influenced by location or environmental conditions. For example, it was generally assumed that certain mechanical failures were independent of location. In fact, however, mechanical failures often depend on factors like duty cycles or maintenance procedures, which, in turn, depend on the particular service in which the vessel is employed. The PWS Study briefly discusses uncertainties associated with using different data reporting systems, the limited participation of the PWS oil transportation system community, the lack of an accessible, independent, reliable data source, and the distrust among members of the PWS community.
The sparse database and the relatively large differences between real experience in PWS and the data used for the study make for less than credible results. The worldwide data used to fill gaps, although selected to approximate PWS operations, were nevertheless not representative of them. Some data, such as propulsion failure rates, were derived from shipping company databases, but every company collects and reports data differently, which could compromise the accuracy and precision of the analysis. Weather data were often incomplete because the number and locations of collecting stations did not cover weather at the two most critical sites in the PWS, the Narrows and the Hinchinbrook Entrance. Expert judgments were also used to fill gaps and augment weather data. Although attempts were made to minimize errors from expert judgments, they are inherently subject to distortion and bias.