Delta and other air carriers show how not to do disaster recovery
The August 8 systemwide outage suffered by Delta Air Lines, attributed to a power failure at the company’s primary data center, is merely the most recent in a string of technology-related problems that have significantly affected the operations of major air carriers. In just the past 13 months, United (July 2015) and Southwest (July 2016) lost the use of their computer systems due to problems blamed on faulty network routers, and Delta last week joined JetBlue (January 2016) in experiencing a data center power failure. The predominant response from the IT industry has been both surprise and disappointment that mission-critical airline operations systems do not seem to have reliable or effective continuity of operations or failover capabilities in place, whether in the form of backup power generation in data centers or redundant hardware or software systems. All of the recent outages expose single points of failure at the airlines, which not only reflect poor design but also seem completely unnecessary given modern computing resources.
Conventional disaster recovery and business continuity planning begins by assessing the criticality of the business processes that information technology systems, networks, and infrastructure support. Alternate processing facilities (like secondary data centers) are categorized as “cold”, “warm”, or “hot” sites according to how rapidly the alternate facility can take over from the primary facility, or restore at least some level of business operations, when the primary has an outage. Hot failover is the most immediate, often entailing fully redundant or mirrored systems in two or more data centers that can work together or individually to keep systems available. In addition to alternate processing capabilities, most modern data centers, whether owned and operated by companies themselves (like Delta) or by outsourcing providers like Dell, HP, or Verizon (the last of which is used by JetBlue), have redundant network and power connections as well as battery and generator-based backup power to try to avoid precisely the type of failure that seems to have affected Delta. Various news reports of the Delta outage have noted that many of Delta’s key operational systems, including the ones that failed on August 8, run out of a single Atlanta data center, dubbed the Technology Command Center, and have speculated that Delta chose not to implement an alternate processing site. Based on what happened on August 8, it seems fair to say that either Delta does not in fact have a secondary facility or, if it does, the automated failover procedures designed to shift operations to that facility did not work as intended. Whether due to poor planning, misplaced financial priorities, or lack of disaster recovery testing, the events of the day provide clear evidence that Delta’s systems are neither reliable nor resilient in the face of unanticipated problems in the Atlanta facility.
There seems to be some disagreement as to whether a power outage or an equipment malfunction was actually the cause of the outage, but neither of those issues should have brought Delta’s systems down if the company had implemented the sort of IT redundancy that is common among major commercial enterprises. Even when redundancy has been built in, the importance of testing cannot be overstated; without regular disaster recovery testing, companies may operate under a false sense of security until they actually encounter a problem and find that their failover mechanisms don’t work. This was apparently the case in the Southwest Airlines outage, which was blamed on a network router that began functioning improperly but did not actually go offline, with the result that existing backup systems were not activated to take over for the malfunctioning router.
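The Southwest failure mode, a device that is degraded but still answering, is worth making concrete. The following Python sketch is purely illustrative and does not describe any airline's actual systems: the names ProbeResult, liveness_only and should_fail_over, along with the 5% loss threshold and the three-probe window, are assumptions chosen for the example. It contrasts a failover trigger that only asks "is the device up?" with one that also asks "is the device working?".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """One hypothetical health probe of a network device."""
    reachable: bool    # did the device answer at all (liveness)?
    loss_rate: float   # fraction of test packets dropped, 0.0 to 1.0


def liveness_only(results: List[ProbeResult], window: int = 3) -> bool:
    """Trigger failover only if the last `window` probes got no answer."""
    if len(results) < window:
        return False
    return all(not r.reachable for r in results[-window:])


def should_fail_over(results: List[ProbeResult],
                     max_loss: float = 0.05,   # assumed threshold
                     window: int = 3) -> bool:
    """Trigger failover if the last `window` probes were all unhealthy,
    where unhealthy means unreachable OR dropping too much traffic."""
    if len(results) < window:
        return False
    return all((not r.reachable) or r.loss_rate > max_loss
               for r in results[-window:])


# A router that still responds but drops 40% of its traffic.
degraded = [ProbeResult(reachable=True, loss_rate=0.4)] * 3

print(liveness_only(degraded))     # False: the backup never takes over
print(should_fail_over(degraded))  # True: degraded service triggers failover
```

Requiring several consecutive bad probes is one simple way to avoid failing over on a single transient glitch; the right window and threshold would have to come from testing against the real environment, which is exactly the kind of exercise regular disaster recovery testing is meant to force.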
The apparent fragility of air carrier IT systems has raised concerns within the federal government, as seen this week in a letter from Senators Edward Markey and Richard Blumenthal, both members of the Senate Commerce, Science and Transportation Committee, to Delta CEO Ed Bastian (the letter was also sent to executives at a dozen other airlines). The letter asks for information about the state and general resilience of the airlines’ IT systems, their potential susceptibility to failure due to power or technology issues or to cyber-attack, and the effect on traveling members of the public when outages occur. Commercial air carriers’ IT systems are not explicitly considered part of the nation’s critical infrastructure (although aviation is part of the Transportation Systems Sector defined as critical infrastructure by the Department of Homeland Security), but Sens. Markey and Blumenthal emphasize the responsibility that Delta and other carriers have to ensure the reliability and resilience of their IT systems, especially in light of the large-scale consolidation of U.S. airlines. Many industry observers point to airline mergers as a contributing factor, and in particular to the need for merged carriers to integrate disparate IT systems, many of which rely on “legacy” technologies that may not have been designed for, or easily adapted to, high-availability deployments. It seems quite likely that the diversity of systems and technology characterizing many carriers post-merger makes their systems more vulnerable and makes business continuity planning more complicated than it would be with a more homogeneous IT environment, but there is nothing in the recent airline outages to suggest that merger-related IT integration had anything to do with the problems that brought flights to a standstill.