Three Mile Island

March 27, 2004

Note 1 May 2020: I've made several minor edits to this while re-posting from the original blog.  I don't think I'm changing anything critical; just readability. 

Another note: the story of TMI here borrows heavily from Inviting Disaster: Lessons from the Edge of Technology, by James Chiles.

Final note: the end of this post is really disappointing.  I actually have a lot to say on the subject of "operator error," and - by extension - "root cause analysis."  As I recall, I had a lot to say about those things when I wrote this post, too; I just didn't actually say them.  If I do start blogging again, it'll be about this.

March 28 marks the 25th anniversary of the accident at Three Mile Island! In honor of the occasion, I'd like to go on for a bit about it.

If we learn anything from history, we learn the most from historic failures.  History provides us with many spectacular failures, and it is imperative that we learn from them.

The nuclear power industry provides us with some very spectacular failures indeed, and I'd like to ramble for a bit on a very important one: the accident at Three Mile Island.

The near-catastrophe at Three Mile Island (hereafter TMI) started in Unit 2 of the plant on March 28, 1979, and the resulting drama gripped the nation for weeks.  As pregnant women were fleeing the area, President Jimmy Carter toured the plant while two tiny pumps, designed for other tasks, worked to keep the core of the plant from melting (one of them eventually failed).

Unit 2 at TMI had a lot of problems at the end of 1978, when it was set to be started.  Nuclear plants are complex, though, so startup problems are not so unusual.  Adding to the normal complexity, however, was the fact that the maintenance crews were overworked at the time of the accident - crew sizes had been reduced in an effort to save money.  The unit was plagued with various problems, and was shut down several times.  (After the accident, in 1982, the utility owner, Metropolitan Edison, sued the reactor's builder, Babcock and Wilcox, for building a faulty reactor; B&W fired back with a lawsuit charging that the Edison employees were not competent to run the reactor.)

Most commercial nuclear reactors have two cooling systems.  The primary system contains water at high pressure and temperature that circulates through the core where the reaction takes place.  This water goes to the steam generator, where it flows around tubes which circulate water in the secondary cooling system.  The resulting heat transfer keeps the core from overheating, and the heat from the secondary system generates steam that runs the turbine.  The accident started in the secondary system.

Water in the primary system is highly radioactive, but water in the secondary system is not.  However, the water in the secondary system must be very pure, because its steam drives turbine blades in a system that is built to extremely tight tolerances.  Contaminants in the water must be removed by the condensate polisher system, to avoid gunking up the system and lowering its efficiency, or (worse) causing premature wear and failure of the turbine blades.  The condensate polisher design used at TMI was a somewhat brittle design, and this particular one had failed three times in the few months the reactor had been in operation.  On March 28, 1979, after operating for about 11 hours, the turbine "tripped" (stopped) at 4:00am.  The plant operators did not know why at the time, but the turbine tripped because a small amount of water (maybe a cupful) leaked through a seal into the instrument air system of the plant.  This system drives several of the plant's instruments.  When water leaked into the system, it interrupted the air pressure on two valves at two feedwater pumps.  Normally, a loss of air pressure at these two pumps would mean that something was wrong, so the pumps are designed to stop when it happens.  This time nothing was wrong that should have made the pumps stop, but they stopped anyway: the behavior was by design; the driving signal was the problem.

But without the pumps, cold water was no longer flowing into the steam generator to cool the reactor, so the turbine shut down automatically.  The system that did this is known as an Automatic Safety Device (ASD).

Stopping the turbine in a nuclear reactor doesn't make it safe, though.  The core was still hot, and it needed to be cooled down.  No problem, there was another ASD for this purpose -- emergency feedwater pumps started; they pulled water from an emergency storage tank and ran it through the secondary cooling system.

Or at least they were supposed to.  But the pipes from the emergency feedwater pumps were blocked -- a valve in each pipe had been left closed after maintenance two days earlier.  So the operator verified that the pumps came on as they should, but he did not know that they were not pumping water.

There were two indicators on TMI's control panel that showed that the valves were closed (though one was obscured by a repair tag hanging from the switch above it).  But at this point, there was no reason to suspect a problem with the valves.  Eight minutes later, when the operators were otherwise baffled by the operation of the plant, they discovered the light, but at that point it was too late.

But we're not there yet.  Now, just as the emergency feedwater pumps have started, things should be fine, but since there was actually no coolant circulating in the secondary coolant system, the steam generator boiled dry.  Because of this, no heat was being removed from the reactor core, so the reactor "scrammed."  (When the reactor scrams, control rods drop into the core to absorb neutrons and stop the reaction.  In early experiments with nuclear power, so the lore says, the procedure was to "drop the rods and scram" -- hence the name.)

But that's not enough to avert catastrophe.  The decaying radioactive materials still produce heat (in this case, enough to generate electricity for over 15,000 homes for a useful amount of time); this "decay heat" builds up a large amount of pressure in the 40 foot tall stainless steel vessel that houses the reactor.  Normally, there are thousands of gallons of water to draw off this heat, and it would cool down within a few days.  But - of course - in this case, the cooling system was not working.

Thankfully, there are more ASDs to handle this problem.  The first is called the Pilot Operated Relief Valve (PORV), which relieves pressure in the core by channeling water from the core through a large vessel called a pressurizer, and then out the top of it into a drain pipe (the "hot leg"), and from there down into a sump.  This water would be radioactive and very hot.  By design, the PORV should only be open long enough to relieve excess pressure.  The liquid water coolant is only a liquid because it's under pressure; if the valve stays open for too long, the pressure in the core drops enough that the remaining water can boil and turn into steam.  The steam bubbles (called "steam voids") that would result in the core and primary cooling pipes would restrict the flow of coolant, and allow some spots (in particular, some spots around the uranium rods) to overheat.

The PORV at TMI was manufactured by Dresser Industries; in the aftermath of the accident, Dresser ran TV ads claiming that Jane Fonda, who was then starring in the movie The China Syndrome, was more dangerous than nuclear plants.  The valve, per its specification, had a Mean Time Between Failure (MTBF) of fifty usages; this seems pretty low, but in this case it was seen as OK because, with any amount of luck, it would almost never be used.  Contrary to this, the President's Commission on the TMI accident turned up at least eleven instances of the same PORV failing in other nuclear plants (much to the surprise of the Nuclear Regulatory Commission and B&W, who only knew of four), including two earlier failures at TMI's Unit 2.  It also failed this time, when the emergency feedwater pumps' block valves were closed, the condensate pumps were out of order, and one important indicator light was obscured: after opening, the PORV failed to reseat properly when it was supposed to close.
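To put an MTBF of fifty usages in perspective, here is a rough sketch.  It assumes (my simplification, not anything from the valve's specification) that every use fails independently with a probability of 1 in 50; the function name is made up for illustration.

```python
def prob_at_least_one_failure(uses: int, mtbf_uses: float = 50.0) -> float:
    """Chance of one or more failures across `uses` independent demands."""
    p_fail = 1.0 / mtbf_uses  # assumed constant per-use failure probability
    return 1.0 - (1.0 - p_fail) ** uses


for n in (1, 10, 50):
    chance = prob_at_least_one_failure(n)
    print(f"{n:>2} uses: {chance:.0%} chance of at least one failure")
```

Under that assumption, the odds of at least one failure climb to roughly 18% over ten uses and about 64% over fifty, so "almost never used" is carrying a lot of weight.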

This meant that the reactor core, which was heating up rapidly, had a big hole in the top - the stuck valve.  The coolant remaining in the core was under very high pressure, and was shooting out of the stuck valve into the hot leg pipe, which went down to its drain tank.  When all was said and done, 32,000 gallons (about 1/3 the capacity of the core) went out of the core through that pipe.  This was bad news.

But engineers are always thinking.  Since it was already known that the valve was a bit touchy (in all fairness, it's hard to make something reliable under such extreme circumstances), an indicator had been recently added to warn operators if it did not reseat.  Sadly, that indicator was also broken - it indicated that all was well.  This is the worst sort of failure: if there had been no status indicator, someone might have actually taken a walk and checked to see if the valve had reseated, and some problems might have been avoided.  It's just one of those cases where it might have been better to have no indicator at all -- after all, if you can't believe your warning lights, what good are they?

The indicator said that the valve had shut, so the operators waited for the reactor pressure to rise again (it had dropped sharply, as one would expect, when the valve opened).  The valve stayed stuck, partially open, for another 2 hours and 20 minutes until a new shift supervisor, taking a fresh look at the problems, discovered it.
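For a sense of scale, the figures above give a back-of-the-envelope leak rate.  This sketch assumes that all 32,000 gallons left during the 2 hours and 20 minutes the valve was stuck, and that the flow was steady; neither is established here, so treat the result as an average at best.

```python
GALLONS_LOST = 32_000        # total coolant lost through the stuck valve
MINUTES_OPEN = 2 * 60 + 20   # valve stuck open for 2 hours 20 minutes


def average_leak_gpm(gallons: float, minutes: float) -> float:
    """Average flow in gallons per minute, assuming a steady leak."""
    return gallons / minutes


rate = average_leak_gpm(GALLONS_LOST, MINUTES_OPEN)
print(f"average loss: about {rate:.0f} gallons per minute")
```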

But we're not there yet -- incredibly, with this cavalcade of broken features and minor screwups we're still just thirteen seconds into the accident.  Just for review: in these thirteen seconds, the relevant problems were:

  1. A false signal caused the condensate pumps to turn off
  2. Two valves for emergency cooling were closed instead of opened
  3. One of the indicators that would have alerted operators to #2 was obscured
  4. The PORV opened properly, then failed to reseat
  5. The indicator that would indicate the PORV's failure also failed

... only #3 in this list could possibly lead to blaming the operators.  It should also be pointed out that no single failure there suggests a link to other failures; in fact they are all on different parts of the system.  Additionally, none of the problems by itself would be a big deal - it is what is sometimes called a "Swiss cheese catastrophe," where holes in all of the layers of protection just happen to line up so that a catastrophic failure occurs.
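The "holes lining up" idea can be sketched with a little arithmetic.  Every probability below is invented purely for illustration (none of them is a measured value from TMI or anywhere else), and the calculation assumes the layers fail independently of one another.

```python
from math import prod

# Hypothetical per-layer failure probabilities, invented for illustration.
# They are NOT measured values from TMI or any other plant.
LAYER_FAILURE_PROBS = {
    "false signal stops the pumps": 0.01,
    "emergency cooling valves left closed": 0.01,
    "valve indicator obscured": 0.05,
    "PORV fails to reseat": 0.02,
    "PORV indicator fails": 0.05,
}


def prob_all_layers_fail(probs) -> float:
    """Chance that every layer fails at once, assuming independence."""
    return prod(probs)


p = prob_all_layers_fail(LAYER_FAILURE_PROBS.values())
print(f"all five holes line up: about 1 in {1 / p:,.0f}")
```

With numbers like these the combined chance looks vanishingly small, which is a big part of why nobody is watching for the combination when it finally arrives.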

Adding to the complexity of the situation, it was later found out that the radioactive water from the hot leg was not even flowing into the tank designed to hold it; it had been misrouted to another tank, which ended up overflowing onto the floor in an auxiliary building.

But let's rewind to our point 13 seconds into the event.  The PORV was open, and would be for another two hours and twenty minutes; coolant from the reactor core was squirting out of it, which meant that reactor pressure dropped.  This is dangerous.  If the pressure goes down, the superheated water (around 600°F, and kept liquid only by a pressure of roughly 2,000 psi) will turn into steam, which does not work to cool the reactor; steam bubbles also block liquid flow in the coolant pipes.

But (thank goodness!) yet another ASD kicked in.  One of two reactor coolant pumps restarted automatically, and the other was manually started by the operators (this was quick action: remember we're still around 13 seconds into the event).  For a few minutes, this appeared to do the trick; pressure appeared to be stabilizing in the core.

But it actually wasn't stabilizing; the indicators only said it was.  Why?  The operators were not aware that the steam generators were not getting water.  When they boiled dry, the reactor coolant started heating up again because the secondary cooling system was not removing heat from the primary one, which removes heat from the core.  And since the core was losing water, pressure in the coolant system dropped sharply.

Two minutes into the accident, yet another ASD (whew!) came on.  This one - a hail-Mary pass - is called High Pressure Injection (HPI), and it forces water into the core at an extremely high rate.  This is a key moment in the accident, and the operators' actions here earned them the unreasonable distinction of being the "cause" of the accident.  They let the HPI go full blast for about two minutes, and then cut it way back.  When they cut it back, it stopped replacing the water that was boiling out through the (still open) PORV, so the core was steadily being uncovered.  This is the worst possible case in a nuclear plant, because if the core is uncovered it will melt through the vessel and may release radiation into the open.

Let's get a bit of background on HPI and its relation to the problem.  HPI involves the injection of a lot of cold water at a very high pressure into the very hot reactor core in order to cool it.  The water goes in at about 1,000 gallons per minute (this would fill an average-sized suburban swimming pool in about 20 minutes).  It is viewed by many as risky, because the thermal shock could cause cracks in the core vessel.  It could also cause problems if the core vessel fills with water.  The dangers here weren't well understood, and there was a lot of disagreement at the time over whether or not it was a good idea to throttle back the HPI.  As an aside - two years later, the NRC issued a report disclosing that thirteen reactors, some of them only three years old, showed degrees of core vessel brittleness because the radioactive bombardment of the vessel was greater than predicted; in these cases, the injection of cold water into a brittle vessel would be very likely to crack it.  Fortunately, however, the TMI reactor had only been in operation at full power for about 40 days, so HPI did not pose this particular risk.
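The flow figures are easy to sanity-check.  This is only a sketch of the arithmetic: the 1,000 gallons per minute and the 20-minute pool are the approximate figures quoted above, the 32,000 gallons is the coolant lost through the stuck valve, and the helper name is made up.

```python
HPI_GPM = 1_000  # approximate HPI flow rate, in gallons per minute


def minutes_to_deliver(volume_gallons: float, flow_gpm: float = HPI_GPM) -> float:
    """Minutes needed to deliver a volume at a steady flow rate."""
    return volume_gallons / flow_gpm


# Pool size implied by "fills a pool in about 20 minutes".
implied_pool_gallons = HPI_GPM * 20
print(f"implied pool size: {implied_pool_gallons:,} gallons")

# Time for full-rate HPI to deliver as much water as was lost via the PORV.
print(f"32,000 gallons at full HPI: {minutes_to_deliver(32_000):.0f} minutes")
```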

The more widely disputed problem with HPI involves the pressurizer.  Remember that the pressurizer is connected directly to the core vessel via the PORV.  The pressurizer, under normal conditions, contains about 800 cubic feet of water, resting under about 700 cubic feet of steam.  This ratio is controlled by the use of heaters in the tank.  The idea is that the steam acts as a shock absorber: liquid water is incompressible, but steam is not, so if there is a substantial pressure surge in the core, the cushion provided by the steam would prevent coolant pipes from bursting.  A burst coolant pipe is one source of a Loss Of Coolant Accident (LOCA), which could cause a core meltdown, so this is seen as a critical system.  Under HPI, the incoming water could increase pressure in the pressurizer by flooding it with water too quickly for heaters to manage the steam/water split, so it would fill with water and no longer act as a shock absorber.  The general consensus at the time was that allowing the pressurizer to be filled with water ("going solid" - solid water with no steam) was a Bad Thing and should be avoided.  In fact, the TMI operating manual said, with unusual clarity, that the pressurizer "must not be filled with coolant to solid conditions (400 inches) at any time except as required for system hydrostatic tests." (emphasis mine)

So: the operators of the plant were trained to avoid going solid in the pressurizer by both the manufacturer (B&W) and the owner (Metropolitan Edison).  There was never any instruction that suggested that going solid in the pressurizer might be OK -- this would be like telling your kids "sometimes it's OK to put a plastic bag over your head."  In fact, such an instruction (that going solid in the pressurizer during HPI could be acceptable) was considered and rejected by B&W after an earlier accident at a different plant.

But at this point, about two minutes into the incident, going solid in the pressurizer would have been a good risk to take, because - unbeknownst to the operators - the reactor core was about to be uncovered.

When HPI was activated, the operators were looking primarily at two dials, which were close to each other on the huge indicator panel.  One indicated that the pressure in the reactor was falling, but the other indicated that the pressure in the pressurizer was dangerously high and rising.  This was weird, because the two are directly connected by a large pipe, and the two dials always moved together.  After all: the pressurizer is there to control the pressure in the coolant system.  That's what it's for.  The pressures should always be the same, and to see the dials moving differently meant, in a reasonable world, that one of them was probably wrong.  Given TMI's short and troubled upbringing, a faulty gauge would not be too surprising.

But which one was wrong?  If the reactor dial was correct, there must be a huge problem, because plenty of water was entering the reactor vessel through the reactor cooling pumps (which were still running), and more obviously, HPI - which was still running.  Even if there was a small pipe break somewhere, the reactor cooling pumps would easily keep the core covered.  On the other hand, since the emergency feedwater pumps were on, the operators thought that the secondary cooling system should be cooling the core, so the core pressure should really be falling.  If that was true, then the HPI signal had been wrong.  So perhaps the reactor pressure dial was wrong.

The pressurizer pressure dial, though, was a serious cause for concern.  High pressure in the pressurizer eliminated a safety margin, and all instruction that the operators had said that the pressurizer should never be flooded.  The pressurizer was the first line of defense between the operators and a LOCA.  It was easy to see the connection between HPI and rising pressure in the pressurizer - HPI was flooding the core and sending water up to flood the pressurizer.  Given the information at hand, this seemed pretty obvious.

So the operators cut the HPI way back ("throttled back on the makeup valves").  Pressure in the pressurizer started coming back down, relieving the danger of going solid.  This was Good.

What the operators didn't know was that the emergency feedwater pumps didn't have any water to pump (remember the closed valves?), and also the PORV was stuck open: they already had a significant LOCA, but not from a pipe break.  The rise in pressure in the pressurizer was probably due to steam voids which were rapidly forming because the core was close to becoming uncovered.  The operators thought they were avoiding a LOCA by throttling back HPI, but in fact they were already in one, and throttling back HPI only made it worse.  With the PORV stuck open, the danger of going solid in the pressurizer was reduced because the open valve would provide some relief.  But nobody in the control room knew it was open.

The Kemeny Commission thought that the operators should have known all of this - instead, the report says, they were "oblivious" to the danger; the two readings "should have clearly alerted" them to the LOCA; "the major cause of the accident was due to inappropriate actions by those who were operating the plant."

It's easy to make those sorts of judgments after the fact.  It's easy for us, decades later, to pontificate about what the operators should have known, and how they could have known it.  Back in 1979, though, the events unfolded the way they did, and the damage was done.  Anyway, about 4 or 5 minutes into the event, another more pressing problem arose.

The reactor coolant pumps that had turned on started thumping and shaking.  They could be heard and felt from the control room, which was pretty far away.  Should the pumps be turned off?  A hasty conference was called, and they were turned off.  In retrospect, the noise was a sign of further dangers ahead: the pumps were cavitating, because they were not getting enough coolant flowing through them to function correctly.

At this point, a few minutes into the event, there were three audible alarms sounding in the control room.  Many of the 1,600 warning indicators (lights and rectangular displays with code numbers and letters on them) were on or blinking.  The operators wanted to turn off the main alarm Klaxon, but they couldn't because turning it off would also cancel some of the warning lights, and they needed those to be correct.  So it was a little hard to concentrate: a Klaxon going, hundreds of warning lights flashing, a dot matrix printer printing line after line of status information.  Also remember that this was the 1970s: there were no smartphones, and there was no internet - the control room had one phone line to the outside world, and that was the only way to ask for help (if you could hear it over the Klaxon).  And computers?  There was a status monitoring computer, but it was not in the control room.  It was housed in a server room.  It was pretty powerful for its day, and capable of recording information about hundreds of alarm inputs as fast as they came in, but its output went to that dot matrix printer in the control room, which could only print fifteen lines of information per minute.  The printer fell more than two hours behind at one point in the event.
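The printer problem is a simple queueing one, and a sketch makes the scale clearer.  The fifteen lines per minute comes from the description above; the arrival rate in the example is a made-up number, since the actual rate at which alarm lines came in isn't given here.

```python
PRINT_RATE = 15  # lines per minute the printer could manage


def backlog_for_lag(lag_minutes: float, print_rate: float = PRINT_RATE) -> float:
    """Lines waiting in the queue when the printer is `lag_minutes` behind."""
    return lag_minutes * print_rate


def minutes_until_lag(lag_minutes, arrival_rate, print_rate=PRINT_RATE):
    """Minutes of steady alarm traffic before the printer is that far behind.

    Returns None if the printer can keep up with the arrival rate.
    """
    if arrival_rate <= print_rate:
        return None
    surplus = arrival_rate - print_rate  # lines added to the queue each minute
    return backlog_for_lag(lag_minutes, print_rate) / surplus


print(backlog_for_lag(120))        # lines queued when two hours behind
print(minutes_until_lag(120, 30))  # hypothetical 30 alarm lines per minute
```

Being two hours behind means something like 1,800 lines were waiting to print, and whatever line the operators were reading described the plant as it had been two hours earlier.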

In the meantime, radiation alarms were starting to go off in various parts of the reactor facility, and the control room was slowly filling with experts.  By the end of the day, adding to the din, there were almost 40 people in the control room.

But rewind again.  Two hours and twenty minutes after the start of the incident, a new shift came on.  It was at this point that the new shift supervisor decided to check the PORV, and the operators discovered that the valve was stuck.  They closed a block valve to shut off the flow to the PORV.  An operator testifying before the Kemeny Commission hearings said that it was more of an act of desperation than understanding to shut off the block valve: after all, you don't casually block off a safety system.

That act of desperation was, in retrospect, well done.  But another problem was brewing.

The fuel rods, 36,816 in this reactor, contain enriched uranium in little pills, all stacked within a thin liner of zirconium.  Water circulates through the 12 foot stacks of rods and cools the liner ("cladding") so it won't melt.  If the rods get too hot, though, the cladding can react with the water: a zirconium-water reaction.  The zirconium takes the oxygen from the water, releasing the hydrogen.  The hydrogen bubbles form pockets of hydrogen gas, and these pockets coalesced to form the famous hydrogen bubble that threatened the integrity of the plant for the next few days.  Any spark could have ignited the hydrogen and brought the entire plant down to a glowing, fiery pile of Chernobyl-esque rubble.  There were several warning signs that a hydrogen bubble was being produced, and that a smaller one had already exploded (the explosion caused a pressure spike that reached half the design limit of the building).  With more and more hydrogen being produced, the gas might have found ways to be vented from the core (whose condition was unknown) into the containment building, where a spark from (for example) starting a pump could have ignited it; if that happened near heavy equipment, the shrapnel could have broken through the core or otherwise injured people.  Three years after the incident, investigators found that the huge crane required to lift off the top of the reactor vessel had been damaged by missiles from the small explosion of the first hydrogen bubble; two engineers protested that the crane was not safe enough to use and were fired.
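For the chemically inclined, the zirconium-water reaction is usually written as the balanced equation below.  This is the textbook form, added here for reference; it isn't taken from any of the accident reports.

```latex
\mathrm{Zr} + 2\,\mathrm{H_2O} \;\rightarrow\; \mathrm{ZrO_2} + 2\,\mathrm{H_2}
```

Each zirconium atom that oxidizes leaves behind two molecules of hydrogen gas.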

At this point, we've demonstrated the sitcom version of relativity by breezing through a hot but complex topic in just a couple of pages.  There's a lot to say about TMI.  In particular, there's a lot to say about what it tells us about government procurement, the commercial nuclear power industry in the US, the safety of complex systems, human behavior, and ... really lots of stuff.  TMI was, if nothing else, a goldmine of research opportunity, which was mostly squandered by greedy, shortsighted, intellectually lazy people.

That said, there is still much to learn about the interaction of failures in complex systems here.  There is also much to learn about the dangers of implementing such a complex system whose failure could cause huge problems.

My hope is that we also learn that blaming the operators, while often a good PR move, solves no problems.  More on this maybe later.  Pleasant dreams, all.