It’s good to have a contingency plan for Web site traffic surges

The record snowfall in the Washington, D.C. area since last weekend has been notable for the widespread closings it has caused, and it came with an unanticipated side effect for the federal government: the unavailability of its official operating status page on the Web. The Office of Personnel Management (OPM) provides an Operating Status page on its agency Web site, to which many federal employees turn to see if the government will be open (or, in a non-weather-related example, to check whether the president has closed the government early on Christmas Eve or another holiday). The volume of visitors to the Web site spiked to such a degree on Monday evening that the site was rendered unavailable; according to a story in the Washington Post, Web traffic during the afternoon and evening hours on February 8 was approximately 4000 percent of the average daily volume. In response, OPM configured its Web server to redirect traffic to a copy of the operating status notice posted on servers at OMB’s data.gov site instead. This serves as an example of quick thinking and suggests some pretty good contingency planning, although it’s unclear whether the need for an alternate Web hosting site was anticipated in advance.

As a mini-case study in contingency planning (or incident response, since this was an organic denial-of-service), OPM’s actions demonstrate one approach among multiple alternatives. The agency chose to stand up a backup site using existing data center capacity made available to it, so this was a sort of warm-site failover. Another approach would have been to mirror the primary site to an alternate and configure front-end routers or load balancers to automatically re-route traffic to the alternate site whenever volume exceeded a given threshold; the threshold would properly be tied to the existing Web server capacity, so no estimate of traffic spike levels would be necessary. A third option would be to scale the capacity of the existing Web server environment to accommodate spikes in traffic. This option requires the ability to make good estimates of maximum traffic levels, or else at some point availability would still suffer. Still another option would be to replicate key Web pages to a content distribution network provider, such as Akamai, so that user requests for popular content wouldn’t hit the OPM server at all. The content replication approach has been used successfully in the government in the past. For instance, when the Centers for Disease Control and Prevention (CDC) experienced an unprecedented surge in volume to its Web site due to concerns over the anthrax attacks in the fall of 2001, the agency quickly contracted with Akamai to replicate most of its public Web content (which at the time was all static HTML), while it re-engineered its infrastructure to accommodate higher demand.
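
To make the threshold-based re-routing option a little more concrete, here is a minimal Python sketch of the decision a front-end load balancer might make. It is only an illustration, not a description of anything OPM or any particular product actually does: the class name OverflowRouter, the "primary" and "mirror" labels, and the capacity figure are all hypothetical. The point it tries to show is that the threshold comes from what the primary server can handle, so no forecast of the size of the spike is needed.

```python
from collections import deque


class OverflowRouter:
    """Hypothetical sketch: send requests to the primary site until its
    known capacity is reached, then send the overflow to a mirror."""

    def __init__(self, primary_capacity_rps, window_seconds=1.0):
        # The threshold is derived from the primary server's capacity,
        # not from any estimate of how large a traffic spike might be.
        self.capacity = primary_capacity_rps * window_seconds
        self.window = window_seconds
        self.recent = deque()  # timestamps of requests sent to the primary

    def route(self, now):
        # Forget requests that have aged out of the sliding window.
        while self.recent and now - self.recent[0] >= self.window:
            self.recent.popleft()
        if len(self.recent) < self.capacity:
            self.recent.append(now)
            return "primary"
        return "mirror"


# Assumed example: a primary that can serve 3 requests per second.
router = OverflowRouter(primary_capacity_rps=3)
decisions = [router.route(t) for t in (0.0, 0.1, 0.2, 0.3, 0.4, 1.05)]
print(decisions)
# ['primary', 'primary', 'primary', 'mirror', 'mirror', 'primary']
```

In this sketch the fourth and fifth requests overflow to the mirror, and traffic returns to the primary once the earliest request has aged out of the one-second window. A real deployment would more likely express the same rule in the load balancer's own configuration than in application code.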

In many cases, it’s simply not cost-effective to build infrastructure to accommodate exceptional loads, but it’s foolish for any large organization to assume that traffic will never exceed its capacity, so a plan for traffic surges is an important element of any business continuity plan. Choosing the appropriate option often depends on whether the rise in traffic volume is a one-time (or very infrequent) event (as in OPM’s case), or whether the spike corresponds to an ongoing increased demand (as it did for the CDC).