A pointer "blew up" and cost the company more than $60 million

On January 15, 1990, AT&T's New Jersey operations center detected a widespread system failure as red warnings spread across its network displays.

Despite attempts to troubleshoot, the network outage persisted for nine hours, resulting in a call connection failure rate of 50%.

AT&T lost more than $60 million as a result, and more than 60,000 Americans lost their phone service entirely.

In addition, 500 flights were delayed, affecting 85,000 people.

Under normal conditions, AT&T's long-distance network was a model of efficiency, using advanced electronic switching and signaling systems to handle most of the nation's calls. The system typically routed a call within seconds.

On this day, however, the entire network failed, starting with a switch in New York. The cause was a software bug in a recent update that had been installed on all 114 switches in the network. When the New York switch reset and signaled the other switches, the bug set off a domino effect that caused widespread network outages.

Notably, this software had not been tested. Because the code change was minor, testing was bypassed at management's request.

where the problem lies

The cause was traced to a coding error in a software update deployed to the network switches.

The error occurred in a C program and involved a misplaced `break` statement inside an `if` statement nested within a `switch`, resulting in overwritten data and a system reset.

pseudocode

1  while (ring receive buffer not empty 
          and side buffer not empty):

2    Initialize pointer to first message in side buffer
     or ring receive buffer

3    get copy of buffer

4    switch (message):

5       case (incoming_message):

6             if (sending switch is out of service):

7                 if (ring write buffer is empty):

8                     send "in service" to status map

9                 else:

10                    break // The error was here!

                  END IF

11           process incoming message, set up pointers to
             optional parameters

12           break
       END SWITCH


13   do optional parameter work

problem analysis

  • If the ring write buffer is not empty, the `if` condition on line 7 is false, so control goes to the `else` branch and executes the `break` statement on line 10.
  • In C, a `break` inside an `if` does not leave the `if`; it exits the enclosing `switch`. For the program to run properly, line 11 should have been executed, but the `break` skips it.
  • When the `break` statement is executed, the incoming message is not processed and the pointers to optional parameters are not set up. Instead, the data (the pointers that should have been retained) is overwritten.
  • The error correction software recognizes that the data has been overwritten and shuts the switch down so it can reset. The problem was compounded by the fact that all switches in the network ran the same flawed software, causing a chain reaction of resets that ultimately paralyzed the entire network.
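The control flow described above can be sketched in C. This is a minimal illustration under stated assumptions, not AT&T's code: `msg_kind`, `switch_state`, `handle_buggy`, and `handle_fixed` are hypothetical names, the boolean fields stand in for the real buffers and status map, and `handle_fixed` shows only one possible way to remove the stray `break`, since the article does not say how the code was actually corrected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds; the real message set is not given in the article. */
enum msg_kind { INCOMING_MESSAGE, OTHER_MESSAGE };

/* Hypothetical stand-in for the state touched by pseudocode lines 6 to 11. */
struct switch_state {
    bool sender_out_of_service;   /* condition tested on line 6 */
    bool ring_write_buffer_empty; /* condition tested on line 7 */
    bool status_map_updated;      /* effect of line 8 */
    bool message_processed;       /* effect of line 11 */
};

/* Mirrors the pseudocode, including the misplaced break on line 10. */
void handle_buggy(enum msg_kind kind, struct switch_state *s)
{
    assert(s != NULL);
    switch (kind) {
    case INCOMING_MESSAGE:
        if (s->sender_out_of_service) {
            if (s->ring_write_buffer_empty) {
                s->status_map_updated = true;   /* line 8 */
            } else {
                break; /* line 10: exits the switch, not just the if */
            }
        }
        s->message_processed = true;            /* line 11: skipped by the break */
        break;                                  /* line 12 */
    default:
        break;
    }
}

/* One possible correction (an assumption): no early exit, so line 11 always runs. */
void handle_fixed(enum msg_kind kind, struct switch_state *s)
{
    assert(s != NULL);
    switch (kind) {
    case INCOMING_MESSAGE:
        if (s->sender_out_of_service && s->ring_write_buffer_empty) {
            s->status_map_updated = true;       /* line 8 */
        }
        s->message_processed = true;            /* line 11 */
        break;                                  /* line 12 */
    default:
        break;
    }
}
```

With `sender_out_of_service` set to true and `ring_write_buffer_empty` set to false, `handle_buggy` returns with `message_processed` still false, while `handle_fixed` sets it to true. When the ring write buffer is empty, both versions behave the same, which is consistent with a bug that only surfaces under a specific combination of conditions.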

Despite rigorous testing and a resilient network design, a single line of code brought down major communications lines across half the country.

repair

It took engineers nine hours to fully restore AT&T's systems. They did this mostly by rolling the switches back to a previous working version of the code.

It then took software engineers two weeks of rigorous code reading, testing, and reproduction to figure out what had actually gone wrong.

in conclusion

Unfortunately for AT&T, this wasn't their biggest system meltdown of the '90s. Later in the decade, they encountered more problems.

Today's companies have better processes, but even so, bugs can still slip through. Google wrote an excellent retrospective on 20 years of site reliability engineering, which reflected on YouTube's first global outage in 2016.

At large companies, the scale of failures is huge, and each failure teaches a lesson. For most companies, however, failures come down to human error and process gaps.

Original text: https://engineercodex.substack.com/p/how-one-line-of-code-caused-a-60
Reprinted from: https://www.jdon.com/69737.htm

Origin www.oschina.net/news/266368/one-line-of-code-caused-a-60