On August 14, 2003, a cascading failure of the power grid plunged more than 50 million people into darkness in the northeast US and Canada. It was the most significant power outage ever in North America, with an economic impact north of ten billion dollars. Calamities like this don’t happen in a bubble, and there were many human factors, political aspects, and organizational issues that contributed to the blackout. But, this is an engineering channel, and a bilateral task force of energy experts from the US and Canada produced this in-depth 240-page report on all of the technical causes of the event that I’ll try to summarize here. Even though this is kind of an older story, and many of the tough lessons have already been learned, it’s still a nice case study to explore a few of the more complicated and nuanced aspects of operating the electric grid, essentially one of the world’s largest machines.
Nearly every aspect of modern society depends on a reliable supply of electricity, and maintaining this reliability is an enormous technical challenge. I have a whole series of videos on the basics of the power grid if you want to keep learning after this, but I’ll summarize a few things here. And just a note before we get too much further, when I say “the grid” in this video, I’m really talking about the Eastern Interconnection that serves the eastern two-thirds of the continental US plus most of eastern Canada.
There are two big considerations to keep in mind concerning the management of the power grid. One: supply and demand must be kept in balance in real-time. Storage of bulk electricity is nearly non-existent, so generation has to be ramped up or down to follow the changes in electricity demands. Two: In general, you can’t control the flow of electric current on the grid. It flows freely along all available paths, depending on relatively simple physical laws. When a power provider agrees to send electricity to a power buyer, it simply increases the amount of generation while the buyer decreases their own production or increases their usage. This changes the flow of power along all the transmission lines that connect the two. Each change in generation and demand has effects on the entire system, some of which can be unanticipated.
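To make that second point concrete, here is a minimal sketch of how a transfer divides across parallel lines. It assumes a simplified lossless ("DC") approximation in which each line's share is proportional to one over its reactance. The function name `split_flow` and all of the numbers are hypothetical, chosen only for illustration.

```python
def split_flow(total_mw, reactances):
    """Split a power transfer across parallel lines (lossless DC approximation).

    Each line carries a share proportional to 1/reactance, so the
    lowest-impedance path picks up the most power.
    """
    susceptances = [1.0 / x for x in reactances]
    total_b = sum(susceptances)
    return [total_mw * b / total_b for b in susceptances]


# Hypothetical numbers: 300 MW moving over two parallel lines.
both_lines = split_flow(300.0, [0.1, 0.2])  # both lines in service
one_line = split_flow(300.0, [0.2])         # the 0.1 line has tripped

print(both_lines)
print(one_line)
```

Nobody chose the split in the first case. It falls out of the line impedances, and when one path disappears, the whole transfer lands on whatever is left.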
Finally, we should summarize how the grid is managed. Each individual grid is an interconnected network of power generators, transmission operators, retail energy providers, and consumers. All these separate entities need guidance and control to keep things running smoothly. Things have changed somewhat since 2003, but at the time, the North American Electric Reliability Council (or NERC) oversaw ten regional reliability councils who operated the grid to keep generation and demands in balance, monitored flows over transmission lines to keep them from overloading, prepared for emergencies, and made long-term plans to ensure that bulk power infrastructure would keep up with growth and changes across North America. In addition to the regional councils, there were smaller reliability coordinators who performed the day-to-day grid management and oversaw each control area within their boundaries.
August 14th was a warm summer day that started out fairly ordinarily in the northeastern US. However, even before any major outages began, conditions on the electric grid, especially in northern Ohio and eastern Michigan, were slowly degrading. Temperatures weren’t unusual, but they were high, leading to an increase in electrical demands from air conditioning. In addition, several generators in the area weren’t available due to forced outages. Again, not unusual. The Midwest Independent System Operator (or MISO), the area’s reliability coordinator, took all this into account in their forecasts and determined that the system was in the green and could be operated safely. But, three relatively innocuous events set the stage for what would follow that afternoon.
The first was a series of transmission line outages outside of MISO’s area. Reliability coordinators receive lots of real-time data about the voltages, frequencies, and phase angles at key locations on the grid. There’s a lot that raw data can tell you, but there are also a lot of things it can’t. Measurements have errors and uncertainties, and they aren’t always perfectly synchronized with each other. So, grid managers often use a tool called a state estimator to process all the real-time measurements from instruments across the grid and convert them into the likely state of the electrical network at a single point in time, with all the voltages, current flows, and phase angles at each connection point. That state estimation is then used to feed displays and make important decisions about the grid.
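The core idea can be sketched with a toy example. Real state estimators solve a large nonlinear problem over the whole network, so treat this as an illustration only: it uses weighted least squares to combine two noisy meters reading the same quantity. The function name `wls_estimate` and the meter values are assumptions, not anything from MISO's software.

```python
def wls_estimate(readings, sigmas):
    """Weighted least-squares estimate of one quantity from redundant meters.

    Each reading is weighted by 1/sigma^2, so precise meters count more.
    Returns the estimate and the weighted sum of squared residuals.
    """
    weights = [1.0 / (s * s) for s in sigmas]
    estimate = sum(w * z for w, z in zip(weights, readings)) / sum(weights)
    mismatch = sum(w * (z - estimate) ** 2 for w, z in zip(weights, readings))
    return estimate, mismatch


# Hypothetical: two meters on the same line (MW), with standard errors
# of 1 MW and 3 MW. The answer should sit much closer to the better meter.
estimate, mismatch = wls_estimate([100.0, 110.0], [1.0, 3.0])

print(estimate, mismatch)
```

The second return value is the useful part for operators. When the weighted mismatch is large, it's a sign that the measurements and the model of the network don't agree with each other.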
But, on August 14th, MISO’s state estimator was having some problems. More specifically, it couldn’t converge on a solution. The state estimator was saying, “Sorry. All the data that you’re feeding me just isn’t making sense. I can’t find a state that matches all the inputs.” And the reason it was saying this is that twice that day, a transmission line outside MISO’s area had tripped offline, and the state estimator didn’t have an automatic link to that information. Instead it had to be entered manually, and it took a bunch of phone calls and troubleshooting to realize this in both cases. So, starting around noon, MISO’s state estimator was effectively offline.
Here’s why that matters: The state estimator feeds into another tool called a Real-Time Contingency Analysis or RTCA that takes the estimated state and does a variety of “what ifs.” What would happen if this generator tripped? What would happen if this transmission line went offline? What would happen if the load increased over here? Contingency analysis is critical because you have to stay ahead of the game when operating the grid. NERC guidelines require that each control area manage its network to avoid cascading outages. That means you have to be okay, even during the most severe single contingency, for example, the loss of a single transmission line or generator unit. Things on the grid are always changing, and you don’t always know what the most severe contingency would be. So, the main way to ensure that you’re operating within the guidelines at any point in time is to run simulations of those contingencies to make sure the grid would survive. And MISO’s RTCA tool, which was usually run after every major change in grid conditions (sometimes several times per day), was offline on August 14th up until around 2 minutes before the start of the cascade. That means they couldn’t see their vulnerability to outages, and they couldn’t issue warnings to their control area operators, including FirstEnergy, the operator of a control area in northern Ohio including Toledo, Akron, and Cleveland.
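A contingency screen can be sketched in a few lines if you accept a very simplified model. This one assumes a set of parallel lines carrying a fixed transfer, with flow dividing in proportion to one over each line's reactance (a lossless approximation). It removes one line at a time and checks the others against their ratings. The names `n_minus_1_violations` and `split_flow`, the line data, and the ratings are all hypothetical. A real RTCA tool runs full power flow studies on the estimated network state.

```python
def split_flow(total_mw, reactances):
    """Share a transfer across parallel lines in proportion to 1/reactance."""
    susceptances = [1.0 / x for x in reactances]
    total_b = sum(susceptances)
    return [total_mw * b / total_b for b in susceptances]


def n_minus_1_violations(total_mw, lines):
    """Return {outaged line: [overloaded lines]} for every single-line outage.

    `lines` maps a name to (reactance, rating_mw). Only contingencies that
    overload at least one remaining line are reported.
    """
    violations = {}
    for outage in lines:
        remaining = {name: v for name, v in lines.items() if name != outage}
        flows = split_flow(total_mw, [x for x, _ in remaining.values()])
        overloaded = [
            name
            for (name, (_, rating)), flow in zip(remaining.items(), flows)
            if flow > rating + 1e-9  # small tolerance for float rounding
        ]
        if overloaded:
            violations[outage] = overloaded
    return violations


# Hypothetical system: three parallel lines carrying 300 MW in total.
lines = {"A": (0.1, 200.0), "B": (0.2, 120.0), "C": (0.2, 120.0)}
violations = n_minus_1_violations(300.0, lines)

print(violations)
```

In this made-up case, everything is within limits with all three lines in service, and losing B or C is survivable. Losing A is not, because B and C would each have to carry more than their rating. That is the kind of hidden vulnerability a contingency analysis exists to find.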
That afternoon, FirstEnergy was struggling to maintain adequate voltage within their area. All those air conditioners use induction motors that spin a magnetic field using coils of wire inside. Inductive loads do a funny thing to the power on the grid. Some of the electricity used to create the magnetic field isn’t actually consumed, but just stored momentarily and then returned to the grid each time the current switches direction (that’s 120 times per second in the US). This causes the current to lag behind the voltage, reducing its ability to perform work. It also reduces the efficiency of all the conductors and equipment powering the grid because more current has to flow than the useful work alone would require. This concept is kind of deep in the weeds of electrical engineering, but we normally simplify things by dividing bulk power into two parts: real power (measured in watts) and reactive power (measured in var). On hot summer days, grid operators need more reactive power to balance the increased inductive loads on the system caused by millions of air conditioners running simultaneously.
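The relationship between the two parts is often drawn as a right triangle: real power on one leg, reactive power on the other, and the total (called apparent power) on the hypotenuse. Here is a small sketch of that arithmetic. The function name `power_triangle` and the load values are hypothetical.

```python
import math


def power_triangle(real_mw, reactive_mvar):
    """Return (apparent power in MVA, power factor) for a load.

    Apparent power is the hypotenuse of the real/reactive triangle, and
    power factor is the fraction of it that does useful work.
    """
    apparent_mva = math.hypot(real_mw, reactive_mvar)
    power_factor = real_mw / apparent_mva
    return apparent_mva, power_factor


# Hypothetical load: 80 MW of real power plus 60 Mvar of reactive power.
apparent, pf = power_triangle(80.0, 60.0)

print(apparent, pf)
```

In this example, the lines and transformers have to be able to carry 100 MVA even though only 80 MW is doing useful work.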
Real power can travel long distances on transmission lines, but it’s not economical to import reactive power from far away because transmission lines have their own inductance that consumes the reactive power as it travels along them. With only a few running generators within the Cleveland area, FirstEnergy was importing a lot of real power from other areas to the south, but voltages were still getting low on their part of the grid because there wasn’t enough reactive power to go around. Capacitor banks are often used to help bring current and voltage back into sync, providing reactive power. However, at least four of FirstEnergy’s capacitor banks were out of service on the 14th. Another option is to over-excite the generators at nearby power plants so that they create more reactive power, and that’s just what FirstEnergy did.
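To get a feel for the scale of local reactive support involved, here is a rough sketch of the standard power factor correction calculation. It asks how much reactive power has to be supplied locally to raise a load's power factor from one value to another. The function name `capacitor_mvar_needed` and every number here are hypothetical, not FirstEnergy's actual figures.

```python
import math


def capacitor_mvar_needed(real_mw, pf_now, pf_target):
    """Reactive power (Mvar) to supply locally to raise the power factor.

    Uses Q = P * tan(acos(pf)) for the reactive demand at each power
    factor and returns the difference.
    """
    q_now = real_mw * math.tan(math.acos(pf_now))
    q_target = real_mw * math.tan(math.acos(pf_target))
    return q_now - q_target


# Hypothetical: a 100 MW load at 0.80 power factor, corrected to 0.95.
needed = capacitor_mvar_needed(100.0, 0.80, 0.95)

print(round(needed, 1))
```

Whatever part of that support isn't available locally has to come from somewhere else, which in practice means nearby generators.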
At the Eastlake coal-fired plant on Lake Erie, operators pushed the number 5 unit to its limit, trying to get as much reactive power as they could. Unfortunately, they pushed it a little too hard. At around 1:30 in the afternoon, its internal protection circuit tripped and the unit was kicked offline – the second key event preceding the blackout. Without this critical generator, the Cleveland area would have to import even more power from the rest of the grid, putting strain on transmission lines and giving operators less flexibility to keep voltage within reasonable levels.
Finally, at around 2:15, FirstEnergy’s control room started experiencing a series of computer failures. The first thing to go was the alarm system designed to notify operators when equipment had problems. This probably doesn’t need to be said, but alarms are important in grid operations. People in the control room don’t just sit and watch the voltage and current levels as they move up and down over the course of a day. Their entire workflow is based on alarms that show up as on-screen or printed notifications so they can respond. All the data was coming in, but the system designed to get an operator’s attention was stuck in an infinite loop. The FirstEnergy operators were essentially driving on a long country highway with their fuel gauge stuck on “full,” not realizing they were nearly out of gas. With MISO’s state estimator out of service, Eastlake 5 offline, and FirstEnergy’s control room computers failing, the grid in northern Ohio was operating on the bleeding edge of the reliability standards, leaving it vulnerable to further contingencies. And the afternoon was just getting started.