Letters to the Committee
Notes toward a Theory of Accident Precursors and Catastrophic System Failure
ROBERT A. FROSCH
John F. Kennedy School of Government
For want of a nail the shoe is lost,
for want of a shoe the horse is lost,
for want of a horse the rider is lost,
for want of a rider the battle is lost,
for want of a battle the kingdom is lost,
all for the loss of a horseshoe nail.
Benjamin Franklin, Poor Richard’s Almanack (based on a rural saying collected by the English poet George Herbert and published in 1640)
So, naturalists observe, a flea
Hath smaller fleas that on him prey;
And these have smaller still to bite ‘em;
And so proceed ad infinitum.
Jonathan Swift, On Poetry: A Rhapsody (1733)
Given enough layers of management, catastrophe need not be left to chance.
Norman Augustine, Augustine’s Laws (as recollected by this author)
In these notes, I make some observations about accidents and system failures drawn from parts of the literature that are commonly consulted for this subject. Although I do not formulate a complete and connected theory, my intent is to point out the directions from which a theory may be developed. I place this possible theory in the context of complexity and the statistical mechanics of physical phase change.
Machines and organizations are designed to be fractal. Machines are made of parts, which are assembled into components, which are assembled into subassemblies, which are assembled into subsystems, and so on, until finally they are assembled into a machine. Hierarchical organizations are specifically designed to be fractal, as any organization diagram will show. Neither a hierarchical organization nor a machine is, of course, a regular, mathematically precise fractal; they may be described as “heterogeneous fractals.” Nevertheless, we expect the distribution of the number of parts of a machine vs. the masses (or volumes) of those parts to follow an inverse power law, where the power describes the dimension of the fractal (Mandelbrot, 1982). For machines, I would expect the dimension to be between two and three. For organizations, I would expect the dimension to be between one and two.
Because they are fractal, both machines and organizations are approximately (heterogeneously) scale free; that is, they look the same at any scale. (Scale free is used in the sense that f(kx)/f(x) is not a function of x. Most functions are not scale free. Power laws are scale-free functions, since (kx)^a/x^a = k^a.) In the case of the machine, a shrinking engineer will be surrounded by machinery at any scale; in the case of an organization, all levels of local organization are similar. (One boss has n assistant bosses, each of whom has n assistant-assistant bosses, and so on to [(assistant)^m] bosses.)
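The scale-free property of power laws can be checked in a few lines. This sketch (function names and constants are mine, purely illustrative) contrasts a power law, whose ratio f(kx)/f(x) is independent of x, with an exponential, whose ratio is not:

```python
import math

def f(x, a=2.5):
    """A power law, f(x) = x**a."""
    return x ** a

# For a power law the ratio f(kx)/f(x) depends only on k, never on x:
k = 3.0
ratios = [f(k * x) / f(x) for x in (1.0, 10.0, 100.0)]
print(ratios)  # every entry equals k**2.5 -- the scale-free signature

# An exponential, by contrast, is not scale free: the ratio varies with x.
exp_ratios = [math.exp(k * x) / math.exp(x) for x in (1.0, 10.0, 100.0)]
print(exp_ratios)
```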
Many natural (and human) systems appear to develop to a self-organized critical (SOC) state (e.g., Barabási, 2003), in which they have a scale-free fractal structure and are on the edge of catastrophe (Bak, 1996; Buchanan, 2001). Such systems appear to undergo disasters of all scales, from the minuscule to the completely destructive. The structure of these systems is fractal, and the distribution of catastrophe sizes (number of occurrences of a given size vs. size) follows an inverse power law in the vicinity of catastrophe. Examples include sandpiles (more correctly, rice piles), earthquakes, pulsar glitches, turbidite layers (fossil undersea avalanches), solar flares, sounds from the volcano Stromboli, fossil genera life spans, traffic jams, variations in cotton futures, people killed in deadly conflicts, and research paper citations. This is also the case for the ranking of words in the English language by frequency and the ranking of cities by size (Zipf, 1949).
For reasons of economy and efficiency, engineered systems (which I will loosely refer to as machines) and organizations (including those in which the design and operations of the machines are embedded) are designed to be as close to catastrophe as the designer dares. In the case of machines, the “distance” from envisioned catastrophes, called the factor of safety, varies depending upon the stresses predicted to be placed on the machine during its operating life. Organizations (as operating machines) are designed to be as lean (and mean and cheap) as seems consistent with performing their functions in the face of their operating environments. In this sense, these fractal systems may be described as design-organized critical (DOC). I argue that the physics that applies to phase changes in natural SOC systems may also be applied to DOC systems.
I now introduce percolation theory, which embodies the use of the renormalization group (mean field theory) and has been used as a theoretical framework for natural phase change (Grimmett, 1999; Stauffer and Aharony, 1994). I assert that percolation theory provides a suitable “spherical cow” or “toy model” of disaster in machines and organizations. There are a number of possible percolation models, such as lattices of any dimension. I will use percolation on a Bethe lattice (Cayley tree), although percolation on other lattices gives similar results. A Bethe lattice is a multifurcation diagram. (The simplest nontrivial case, in which multi = 2, consists of a tree of repeated bifurcations at the end of each branch.) The asymptotically infinite case can be solved exactly. It has been shown (both by approximation and by computation) that in finite cases the phenomena, particularly around the critical value (see below), approximate the proven phenomena for the infinite case.
In our model, a link between two nodes may be conducting or nonconducting. If conducting, we regard it as a failure of that link. Strings of connected link failures are interpreted as accidents of various sizes, and a string of link failures reaching the central (or origin) node is interpreted as a complete catastrophe. We examine the probability of catastrophe and the distribution of lesser failures as p (the probability of failure of any link) increases from zero (Grimmett, 1999; Stauffer and Aharony, 1994).
We first would like to know the percolation threshold: the value p = pc for which there is at least one infinite path through the infinite lattice. This may be shown to be pc = 1/(z - 1), where z is multi, the number of links at each node.
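The threshold can be seen in a quick Monte Carlo sketch (my illustration, not from the text): grow the cluster of failed links from the origin of a Bethe lattice and watch the probability of a very large cluster switch on as p crosses pc = 1/(z - 1):

```python
import random

def cluster_size(p, z=3, cap=10_000):
    """Size of the connected cluster of failed (conducting) links grown
    from the origin of a Bethe lattice with coordination number z.
    Growth stops at `cap`, which stands in for an infinite cluster."""
    size = 1
    stack = [z]            # origin has z outward links; interior nodes z - 1
    while stack and size < cap:
        for _ in range(stack.pop()):
            if random.random() < p:   # this link conducts, i.e., fails
                size += 1
                if size >= cap:
                    return size
                stack.append(z - 1)
    return size

def p_large_cluster(p, trials=400, cap=10_000):
    """Estimated probability that the origin belongs to a 'huge' cluster."""
    return sum(cluster_size(p, cap=cap) >= cap for _ in range(trials)) / trials

random.seed(0)
below = p_large_cluster(0.3)   # below pc = 0.5 for z = 3
above = p_large_cluster(0.7)   # above pc
print(below, above)            # near zero below threshold, well above zero past it
```

For z = 3 the threshold is pc = 1/2: at p = 0.3 the runaway cluster essentially never forms, while at p = 0.7 it appears in most trials.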
Next, we would like to know P, the probability that the origin (or any other arbitrarily selected site) belongs to the infinite (or catastrophic) cluster. Stauffer and Aharony (1994) prove the result for z = 3: P = 0 for p below pc = 1/2, while above the threshold P/p = 1 - [(1 - p)/p]^3, rising continuously from zero at pc.
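That closed form can be evaluated directly: for z = 3, Stauffer and Aharony give P/p = 1 - [(1 - p)/p]^3 above pc = 1/2 and zero below it. A minimal sketch (function name mine):

```python
def P_over_p(p):
    """Infinite-cluster strength on a z = 3 Bethe lattice
    (Stauffer and Aharony, 1994): zero up to pc = 1/2,
    then 1 - ((1 - p)/p)**3 above the threshold."""
    if p <= 0.5:
        return 0.0
    return 1.0 - ((1.0 - p) / p) ** 3

# P vanishes below pc and rises continuously from zero above it:
for p in (0.4, 0.5, 0.6, 0.8, 1.0):
    print(p, round(P_over_p(p), 4))
```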
Further, it may be shown that ns(p), the average number of clusters (per site) containing s sites, goes asymptotically to ns(pc) ~ s^(-τ). More generally, the mean cluster size S near the critical threshold (p = pc) goes as S ~ |p - pc|^(-γ), where γ is a constant. This distribution of cluster sizes describes cluster distributions near phase change in many physical systems, including the Ising model of magnetization and clusters of water molecules near the phase change from steam to liquid water.
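The divergence of S at pc can be made concrete by treating cluster growth as a branching process (my simplification, not the chapter's derivation): each of the origin's z links fails with probability p and roots a subtree of expected size 1/(1 - (z - 1)p), so S blows up as p approaches pc = 1/(z - 1):

```python
def mean_cluster_size(p, z=3):
    """Mean cluster size S at the origin of a Bethe lattice, below the
    threshold pc = 1/(z - 1): the origin, plus z links each failing with
    probability p and rooting a subtree of expected size 1/(1 - (z-1)p)."""
    m = (z - 1) * p          # mean number of onward failed links per node
    assert m < 1.0, "valid only below the percolation threshold"
    return 1.0 + z * p / (1.0 - m)

# S grows roughly tenfold each time the distance to pc = 0.5 shrinks tenfold,
# consistent with S ~ |p - pc|**(-gamma) with gamma = 1 on the Bethe lattice.
for p in (0.4, 0.45, 0.49, 0.499):
    print(p, round(mean_cluster_size(p), 1))
```

For z = 3 this reduces to S = (1 + p)/(1 - 2p), which matches the Stauffer–Aharony result below threshold.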
Catastrophic system failures are what they seem to be: phase changes, for example, from an organized shuttle to rubble (a “liquid”). I note also that the percolation model suggests that the Heinrich diagram, or occurrence pyramid, is likely to be correct (Corcoran [p. 79] and Hart [p. 147] in this volume).
It is also interesting to note that Reason’s “Swiss cheese model” of accidents is a percolation model, although he does not call it that or formally develop its statistical implications (Reason, 1997). Any occurrence may be an accident precursor.
What can one do? Clearly, as Hendershot has suggested in another paper in these proceedings (p. 103 in this volume), we may redesign the system to be simpler and to function without subsystems or components that are likely to fail in ways that lead to accidents. If this is not feasible, the system must be strengthened, that is, moved farther away from its breaking point. For machines, this can be done by strengthening the elements that appear most likely to fail, and whose failure is likely to lead to disaster, by introducing redundancy: more (or larger, which amounts to more) mechanical strength elements, redundant sensors, controls, actuators, etc. The trick is to find the elements most likely to fail, singly or in combination, so that only they are strengthened and no “unnecessary” redundancies are introduced. The critical elements are found by engineering intuition, engineering analysis, and/or some kind of probabilistic risk assessment.
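The payoff of redundancy in the percolation picture can be made concrete: k redundant elements, assumed to fail independently, fail together only with probability p^k, which can pull a link from above the critical failure probability back below it (the numbers here are illustrative assumptions, not from the text):

```python
def effective_failure(p, k):
    """Failure probability of a link backed by k redundant elements,
    assuming independent failures: the link fails only if all k fail."""
    return p ** k

pc = 0.5              # threshold for a z = 3 Bethe lattice, pc = 1/(z - 1)
p_single = 0.6        # a failure-prone link, above the critical value
p_doubled = effective_failure(p_single, 2)   # one redundant element added

# Doubling the element moves the link from above pc to safely below it.
print(p_single > pc, p_doubled < pc)
```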
In organizations, one can add people or redundant organizational elements intended to increase strength against mistakes of various kinds. These may include safety organizations, inspectorates, or auditors. However, other organizational means may also provide the necessary redundancy. In an organization with a reasonable atmosphere of trust among its members and echelons, juniors formally or informally bring their problems and troubles to their peers and their seniors, who may have other and/or broader means of attacking them. Seniors are attuned and attentive to rumors and concerns of both peers and juniors. These communication channels bridge portions of the organizational tree and strengthen it; they may be likened to bringing in reinforcements. In this theoretical framework, these horizontal and vertical means of communication are strengthening elements that move the organizational structure away from pc without adding suborganizations or people. In the scientific and engineering communities, peer review plays this communication role. The prime purpose of peer review is not to provide confirmation of excellence but to find errors and omissions that might be damaging or catastrophic.
In their work on high-reliability organizations, LaPorte, Roberts, and Rochlin (cited in Reason, 1997) describe the field reorganization of hierarchies into small teams whose members communicate directly with each other, particularly when warning of danger. For example, usually highly hierarchical Navy crews, when working together as flight deck teams on an aircraft carrier during flight operations, become a flat, highly communicating group, in which authority comes from knowledge and the perception of problems rather than from organizational position.
In summary, the statistical properties of designed machines and organizations are similar to those of natural SOC systems, and we should expect the same theoretical framework that applies to them, and to statistically similar physical
phase changes, to apply to machines and organizations. Therefore, we can expect to predict the general statistical properties of accident precursors and catastrophic system failures in human-made systems from well-known theoretical structures. The results also suggest why commonly used means of strengthening systems work: they move the system state away from the critical pc.
Further development and application of this theory will require applying it specifically to machine and organizational system accidents and testing this framework against real system data.
Bak, P. 1996. How Nature Works. New York: Springer-Verlag.
Barabási, A.-L. 2003. Linked. New York: Penguin Books.
Buchanan, M. 2001. Ubiquity. New York: Three Rivers Press.
Grimmett, G. 1999. Percolation. New York: Springer-Verlag.
La Porte, T.R., K.H. Roberts, and G.I. Rochlin. 1987. The self-designing high-reliability organization: aircraft carrier flight operations at sea. Naval War College Review 40(4): 76–90.
Mandelbrot, B. 1982. The Fractal Geometry of Nature. New York: W.H. Freeman and Co.
Reason, J. 1997. Managing the Risks of Organizational Accidents. Aldershot, U.K.: Ashgate Publishers.
Stauffer, D., and A. Aharony. 1994. Introduction to Percolation Theory. London and New York: Routledge.
Zipf, G.K. 1949. Human Behavior and the Principle of Least Effort. Cambridge, Mass: Addison-Wesley.
Corporate Cultures as Precursors to Accidents
RON WESTRUM
Department of Sociology
Eastern Michigan University
I strongly believe that precursor situations vary with the corporate culture. In my previous work on cultures and the kinds of accidents they encourage, I have argued that accidents typically have a dominating feature: (1) violations; (2) neglect; (3) overload; or (4) design flaws (Westrum, forthcoming). Although these categories are admittedly impressionistic, I believe every accident has a dominant character, even though combinations certainly exist (Turner and Pidgeon, 1997). Let me elaborate on these dominant characters a bit more.
Violations are actions taken in blatant disregard of rules. Of course, rules are often bent to a certain extent, but violations are different. At the 2001 Australian Aviation Psychologists’ Symposium, Bob Helmreich quoted the remark “checklists are for the lame and the weak,” which captures this attitude well. Tony Kern (1999) explores the subject in his book Darker Shades of Blue.
Neglect involves the dominance of a known but unfixed problem in an accident configuration. Reason’s classic “latent pathogen” probably belongs in this category (Reason, 1990). In this scenario, fixes for problems are ignored, deferred, dismissed, or incomplete. A “dress rehearsal” incident may even take place before an actual accident occurs.
Overload occurs when there is a mismatch between tasks and the resources required to address them. Overload may occur as the result of an organization taking on too large a task. Even though everyone works hard, there is too much work for too few people. Mergers, expansions, downsizing, and reshuffling can all generate overloads. An overload can also occur spontaneously when a team or an individual decides to accept a task that requires more time or skill than they have.
Design flaws occur when conscious decisions lead to unsuspected consequences. Unlike neglect, in which a conscious decision is made to ignore a problem, here the problem is unseen, so no one plans for it. Design flaws are insidious, and eliminating all of them is very difficult. A design flaw can occur through a failure of imagination or through poor execution of a design.
Any dominant factor can shape an accident for an organization, but I believe their frequency varies systematically with the corporate culture. In my previous work, I have proposed that organizational cultures can be ranged along a spectrum of information flow from pathological to bureaucratic to generative (Westrum, 1994; Turner and Pidgeon, 1997). Organizations with a pathological culture, for instance, have an atmosphere of fear and intimidation, which often reflects intense conflicts or power struggles. The bureaucratic culture, by contrast, is oriented toward following rules and protecting the organization’s “turf,” or domain of responsibility. Generative organizations are oriented toward high performance and have the most effective information flow.
To me, it follows naturally from the nature of information flow in these cultures that each has particular vulnerabilities (i.e., accident precursors). For instance, a pathological environment encourages overt and covert violations of safety policies. Rogues or “cowboys” pretty much do what they want, no matter what the rules are. By contrast, accidents caused by safety violations are rare in a generative environment. Furthermore, generative accidents typically do not show the marks of neglect. Overload is a possibility, but what often trips up generative organizations are design flaws, problems created by conscious decisions whose consequences are not recognized until they have played out in reality. Bureaucratic organizations (the most frequently represented in the “systems accident” population) typically fail because they have neglected potential problems or have taken on tasks that they lack the resources to do well.
I believe that these tendencies have profound consequences for dealing with accident precursors. The Reason model provides a good general approach to the problem of latent pathogens, but I believe we can do better. One implication of these special vulnerabilities is that even the nature of latent pathogens may differ from one kind of culture to another. By recognizing how cultures differ, we may have a better idea of where to look for problems.
The challenge of a pathological environment is that the culture does not promote safety. In this environment, safety personnel mostly put out fires and make local fixes. The underlying problems are unlikely to be fixed, however. In fact, the pathological organizational environment encourages the creation of new pathogens and rogue behavior. The best techniques in the world can never be enough in an environment where managers practice or encourage unsafe behavior.
In bureaucratic environments, the challenge is a lack of conscious awareness. In the neglect scenario, problems are present, and may even be recognized, but the will to address them is absent. Bureaucratic organizations need to develop
a consciousness of common cause, of mutual effort, and of taking prompt action to eliminate latent pathogens. In an overload situation, which has a great potential for failure, the organization needs outside help to cut tasks down to a size it can handle. Groupthink is an ever-present danger in neglect or overload situations, because it can mask problems that need to be faced.
Generative organizations may seem to be accident-proof, but they are not. Generative organizations do not do stupid things, but design flaws are insidious. In the cases of violations, neglect, and overload, the environment provides clear indications to an outside analyst that something is wrong. You can measure the sickness or inefficiency of the culture by tests, observations, and analysis. For instance, there are “symptoms” of groupthink. By contrast, design flaws can be present even when a culture shows no overt symptoms of pathology. Design flaws often come from a failure of what I have called “requisite imagination,” an inability to imagine what might go wrong. Even generative cultures suffer from design flaws. In a recent paper, Tony Adamski and I have suggested how requisite imagination can be increased (Adamski and Westrum, 2003). Yet, I believe no system is really capable of predicting all negative consequences. Hence, requisite imagination is more of an art than a science.
We can now look at the differences in confronting latent pathogens in the three different cultural situations. In a generative environment, pointing out (or even discovering) a latent pathogen is usually sufficient to get it fixed. Things are very different in a bureaucratic environment, where inertia or organizational commitments stand in the way of fixing the latent pathogen. When an organization has an overload problem, fixing the problem can be very difficult. In a pathological environment, pointing out a latent pathogen is personally dangerous and may result in the spotter, rather than the pathogen, getting “fixed.” I believe that knowing the specific types of failure and their typical generating conditions can help organizations eliminate latent pathogens. If pathogenic situations vary with the environment, then maybe our approach to cleaning them up ought to vary, too.
These observations are impressionistic, but they can be a starting point for further inquiry into links between organizational cultures and the dynamics of latent pathogens. Latent pathogens are constantly generated and constantly removed. Accident reports contain voluminous information about the production of latent pathogens, but we do not know enough about the processes for removing them. The characteristics of healthy environments might be a good topic for a future workshop.
Adamski A., and R. Westrum. 2003. Requisite Imagination: The Fine Art of Anticipating What Might Go Wrong. Pp. 187–220 in Handbook of Cognitive Task Design, E. Hollnagel, ed. Mahwah, N.J.: Lawrence Erlbaum Associates.
Kern, T. 1999. Darker Shades of Blue: The Rogue Pilot. New York: McGraw-Hill.
Reason, J. 1990. Human Error. Cambridge, U.K.: Cambridge University Press.
Turner, B.A., and N.F. Pidgeon. 1997. Man-Made Disasters, 2nd ed. Oxford, U.K.: Butterworth-Heinemann.
Westrum, R. 1994. Cultures with Requisite Imagination. Pp. 401–416 in Verification and Validation of Complex Systems: Human Factors, J. Wise et al., eds. Berlin: Springer Verlag.
Westrum, R. Forthcoming. Forms of Bureaucratic Failure. Presented at the Australian Aviation Psychologists Symposium, November 2001, Sydney, Australia.