More than 10,000 U.S. flights delayed due to system glitch

On the morning of January 11, Eastern Time, a key Federal Aviation Administration (FAA) system failed, grounding flights across the United States. The FAA later announced on social media that the nationwide ground stop had been lifted, but the failure of this critical system has prompted U.S. aviation authorities to re-examine how resilient the country's air traffic system really is.


A video board shows flight delays and cancellations at Ronald Reagan Washington National Airport in Arlington, Va., Jan. 11, 2023.

Patrick Semansky/AP

The FAA said it was continuing a comprehensive assessment to determine the root cause of the failure, but an initial investigation found that the cause was a "corrupted database file" and there was no evidence of a cyber attack, according to Bloomberg.

From wiki/2023_FAA_system_outage and other related reports, the circumstances of the failure can be roughly outlined as follows.

1. The faulty system is NOTAM (Notice to Air Missions). NOTAM is the system the Federal Aviation Administration uses to report abnormal conditions affecting flights, including airspace closures, bad weather, and runway closures. For longer international flights, the NOTAM package can run to 200 pages and include information such as runway closures, general bird hazard warnings, or low-altitude construction obstacles. U.S. media quoted industry insiders as saying that the NOTAM system had never before suffered a nationwide failure.

As an IT practitioner outside the aviation industry, Brother G's judgment is that although NOTAM is not the system that directs flights, it is still a flight-critical link because it delivers notices of abnormal events. If a notice is not received, the risk is very high: a flight could cross into closed airspace or land on the wrong runway, with unpredictable consequences.

2. When U.S. air traffic control officials realized on the evening of Tuesday, January 10, Eastern Time, that there was a problem with the Notice to Air Missions (NOTAM) system, they came up with a plan to restart the system on Wednesday morning to minimize disruption to flights within the United States. Ultimately, however, the plan resulted in an outage and massive flight delays, and the FAA then grounded flights across the United States.

At 20:30 on January 10, pilots on some flights could no longer obtain NOTAM information over the Internet. The FAA urgently opened a telephone hotline as a backup, which allowed evening flights to operate normally. As the new day began and the number of flights grew, the backup could no longer keep up. At around 7:30 a.m. Eastern Time on January 11, the FAA ordered that no flights in the United States take off before 9:00; aircraft already in the air were not affected. In fact, the system gradually recovered after 8:30.


Passengers wait for flights to resume at O'Hare International Airport on Jan. 11, 2023, after the Federal Aviation Administration (FAA) ordered airlines to suspend all U.S. domestic flights due to a system outage.

Jim Vondruska/Reuters

3. How the fault was introduced: the NOTAM system stopped processing updates at 3:28 p.m. after an engineer mistakenly replaced one file with another during routine scheduled system maintenance.

The ground stop and the FAA system glitch that affected thousands of flights across the U.S. on Wednesday morning appeared to be the result of errors during routine scheduled system maintenance, according to a senior official with knowledge of the internal review.

An engineer "substituted one file for another" without realizing he had made a mistake, the official said. As the system began to malfunction and eventually fail, FAA staff frantically tried to figure out what was wrong. The engineer who made the mistake didn't realize what was happening. "This was an inadvertent mistake that cost the country millions of dollars."

Earlier on Wednesday, the FAA said normal operations were "gradually resuming" after it had ordered a nationwide suspension of all domestic flight departures until 9 a.m., following a computer glitch that caused delays and cancellations across the country.

4. Southwest, which canceled thousands of flights after Christmas following a systemwide meltdown, was hit hard, with more than 400 canceled flights. About 10% of Southwest's Wednesday flights had been canceled and about half delayed as of 6 p.m. ET. In summary, the impact was: Southwest Airlines alone had 10% of its flights canceled and half delayed, and the incident as a whole cost millions of dollars.

5. The article "A Warning to Us from NOTAM Faults" notes that the FAA's IT operations and maintenance are astonishingly chaotic, and that this is not the first time NOTAM has had a major problem: the failure in 2008 was more serious. At that time, the database NOTAM used to record abnormal events was still an Oracle database running on Sun SPARC servers (I don’t know whether the "overdue" NOTAM system refers to this one, but it seems quite likely).

That failure was also caused by chaotic operations and maintenance, and it led to widespread flight disruption from the afternoon of May 22, 2008 through the whole day of May 23, 2008. The trigger was equally mundane. NOTAM's maintenance procedures required hard drives that had been in service for a long time to be replaced on schedule. This should have been a very simple operation: insert the new disk, pull out the old one, and let the rebuild complete automatically. This time, however, the system performed very poorly after the replacement, and the series of ad hoc operations that followed went wrong. By the time the operators had crashed the primary system and were about to switch to the backup system, they found that the fault had already propagated to the backup, so both were damaged. They could only export whatever data was still readable, build a new database, and import the data. It took almost a day to get the system working again, and even after it resumed operation, data inconsistencies remained.

Although the two failures are more than ten years apart, they unfolded in almost exactly the same way. I really wonder whether the same group of people was behind both incidents.

6. Some officials compared the current outage to the crisis that paralyzed Southwest Airlines during the holidays: outdated software in critical IT networks was overdue for replacement but had not been replaced, so when one component fails, the whole system can be brought down.

It is not difficult to see that the U.S. aviation industry is also carrying an "unbearable burden" in terms of outdated IT systems, the corresponding technical architecture, operation and maintenance capabilities, and emergency response. Several lessons stand out.

1. The expected business impact of changes made in production.

If it is a business change, roll it out in grayscale (canary) stages: seed users first, then internal users, and only then a gradually increasing share of external users. If it is a non-business change, such as a background file backup, a data backup, or a facility switchover, it should be planned so that its impact on the online business is imperceptible or stays within a controllable range.
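The staged rollout described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Rollout` class, the stage names, and the hash-based bucketing are hypothetical and are not taken from any FAA system or from a specific feature-flag library.

```python
import hashlib
from dataclasses import dataclass, field


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in [0, 100) so decisions are repeatable."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


@dataclass
class Rollout:
    """Hypothetical grayscale plan: seed users -> internal users -> external users."""
    seed_users: set = field(default_factory=set)
    internal_users: set = field(default_factory=set)
    stage: str = "seed"          # one of "seed", "internal", "external"
    external_percent: int = 0    # share of external users exposed in the last stage

    def enabled_for(self, user_id: str) -> bool:
        # Seed users see the change in every stage.
        if user_id in self.seed_users:
            return True
        # Internal users join once the rollout has reached the internal stage.
        if self.stage in ("internal", "external") and user_id in self.internal_users:
            return True
        # External users are admitted gradually, by stable hash bucket.
        if self.stage == "external":
            return bucket(user_id) < self.external_percent
        return False
```

Because the bucket is derived from the user id, raising `external_percent` only ever adds users; nobody who already has the change loses it, which keeps each stage comparable with the previous one.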

2. The availability and capacity of the backup system. According to the reports, at 20:30 on January 10 pilots on some flights could no longer obtain NOTAM information over the Internet, and the FAA urgently opened a telephone hotline as a backup so that evening flights could operate normally. From this it seems that NOTAM lacks multi-region, multi-data-center capability and could only be propped up temporarily by telephone. This is a very high risk: a critical system went down and there was nothing to switch over to at a moment's notice.
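The point about backup capacity can be made concrete with a small sketch. Everything here is assumed for illustration: the `Backend` type, the backend names, and the capacity figures are hypothetical and say nothing about how NOTAM is actually deployed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str
    healthy: bool
    capacity_rps: int  # requests per second it can sustain (made-up figure)


def pick_backend(backends: Sequence[Backend], demand_rps: int) -> Optional[Backend]:
    """Return the first healthy backend that can carry the demand, else None."""
    for b in backends:
        if b.healthy and b.capacity_rps >= demand_rps:
            return b
    return None


# Hypothetical situation: the primary is down and only a small manual backup exists.
backends = [
    Backend("primary", healthy=False, capacity_rps=1000),
    Backend("phone-hotline", healthy=True, capacity_rps=50),
]
evening = pick_backend(backends, demand_rps=30)    # light evening traffic
morning = pick_backend(backends, demand_rps=500)   # morning peak
```

With these made-up numbers the hotline can absorb the evening load but nothing can absorb the morning peak, which mirrors the pattern in the timeline: a backup that is available but undersized only postpones the outage.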

3. Attitude toward failure. An engineer "replaced one file with another" without realizing that he had made a mistake. As the system began to malfunction and eventually failed, FAA staff scrambled to work out what was wrong, while the engineer who made the mistake did not realize what was happening. "It was an inadvertent mistake that cost the country millions of dollars." Even if that remark was only rhetoric, it still reflects a bureaucracy's perfunctory attitude toward technical failures. According to The Checklist Manifesto, most mistakes are errors of ignorance or errors of ineptitude, but neither can be used as an excuse.

Why would a single file-replacement error make the entire system inaccessible? Fault isolation should have been part of the design. And when a file is replaced, how do you ensure that the result is what you expected, and how do you verify it? Judging from what has been disclosed so far, chaos engineering and failure drills do not appear to have been part of the picture.
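One way to answer the verification question is to check the file before and after the swap and keep a copy to roll back to. The following is a minimal sketch under assumptions: the function names, the `.bak` and `.staged` suffixes, and the idea that an expected checksum is recorded in the change ticket are all hypothetical, not a description of the FAA's procedure.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def replace_verified(new_file: Path, target: Path, expected_sha256: str) -> None:
    """Replace target with new_file only if it matches the approved checksum."""
    # 1. Pre-check: refuse to deploy a file other than the one that was approved.
    if sha256_of(new_file) != expected_sha256:
        raise ValueError("new file does not match the expected checksum")
    # 2. Keep a copy of the current file so the change can be rolled back.
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)
    # 3. Stage next to the target, then swap in a single rename.
    staged = target.with_name(target.name + ".staged")
    shutil.copy2(new_file, staged)
    os.replace(staged, target)
    # 4. Post-check: verify what is actually in place, and roll back if it is wrong.
    if sha256_of(target) != expected_sha256:
        os.replace(backup, target)
        raise RuntimeError("post-replace verification failed; rolled back")
```

A checksum only proves that the intended file was deployed, not that the file itself is sound, so a real change process would still need a functional check of the service after the swap.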

For the key systems of high-risk businesses, it is necessary to ensure high reliability and high availability, to design service isolation and monitoring well, and to manage changes rigorously, including upgrading the operations and maintenance system.

If you have suggestions or would like to discuss, you are welcome to reply.


Origin blog.csdn.net/u013527895/article/details/129483741