Think that forgetting a synchronized isn't so bad?
About eight weeks after the blackout, the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds.
"There was a couple of processes that were in contention for a common data structure, and through a software coding error in one of the application processes, they were both able to get write access to a data structure at the same time," says Unum. "And that corruption led to the alarm event application getting into an infinite loop and spinning."
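For anyone who hasn't been bitten by this yet, here is roughly what "two writers, one data structure" looks like in Java. This is only a minimal sketch: `AlarmLog`, `record` and `runWriters` are names made up for illustration and have nothing to do with the actual alarm system code, which the article does not show. The only point is that `synchronized` lets one writer into the shared list at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: AlarmLog is an invented name, not code from the real system.
public class AlarmLog {
    // ArrayList is not thread-safe, so concurrent add() calls need external locking.
    private final List<String> events = new ArrayList<>();

    // synchronized: only one thread at a time may run record() or size() on a given AlarmLog.
    public synchronized void record(String event) {
        events.add(event);
    }

    public synchronized int size() {
        return events.size();
    }

    // Starts `writers` threads that each record `perWriter` events,
    // waits for all of them to finish, and returns the number of stored events.
    static int runWriters(int writers, int perWriter) {
        AlarmLog log = new AlarmLog();
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < writers; w++) {
            final int id = w;
            Thread t = new Thread(() -> {
                for (int i = 0; i < perWriter; i++) {
                    log.record("writer-" + id + " event-" + i);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for writers", e);
            }
        }
        return log.size();
    }

    public static void main(String[] args) {
        System.out.println(runWriters(2, 100_000)); // prints 200000
    }
}
```

Take `synchronized` off `record` and the two writers can both be inside `add` at once. The count may then come up short, or the list may throw an `ArrayIndexOutOfBoundsException`, but not necessarily on every run, which is exactly why this kind of bug can hide through a lot of testing.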
The failure was listed in the final report as one of the direct causes of a blackout that eventually cut off electricity to 50 million people in eight states and Canada.
The company did everything it could, says Unum. "We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug," says Unum. "I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software."
And...is this an excuse for MS?
"If we see a system that's behaving abnormally well, we should probably be suspicious, rather than assuming that it's behaving abnormally well."