A Bottom-Up Approach to Weathering the Worst
The events of April 27, 2011 left an indelible mark on most of us. While the city of Tuscaloosa suffered terribly, along with so many individuals, families, and businesses, the UA campus stood stunned but generally unharmed. Such a near miss heavily underscored the importance and urgency of work already well underway in OIT at the time. That work is generally referred to as Continuity of Operations or Disaster Recovery planning and preparation.
For OIT, Continuity of Operations refers to our ability to keep as much of the technology as close to full operational status as possible in the face of whatever events may occur. Disaster Recovery (DR) refers to returning to full and normal operational status for those systems as soon as possible after a disaster event occurs. These two concepts are different but very closely related. We have been considering them both over the last two years and have made significant progress in addressing them in meaningful ways. To be brief, we often refer to the whole effort as DR.
Thinking the Worst
We started thinking about how to be better prepared by considering the kinds of scenarios we might encounter, identifying the most likely disaster scenarios and their relative impact on IT systems and services. While there are many scenarios that could damage a part of the campus network, those with the biggest impact would be events involving the primary data center. That facility houses the servers, storage, and primary network equipment that are essential to delivering IT and telecommunications services and applications across the whole campus. While our general network projects continue to improve network resiliency across the campus through redundant connections, there has been no general redundancy model for the data center itself, making it the point of highest risk in a disaster.
What Matters Most
We began planning by surveying the services and applications that OIT provides and supports. Our approach has been to do this in a bottom-up fashion, looking at the most basic services first, those on which all other services and applications depend. Starting at the foundation, a host of specialized equipment supports various services required for the network to operate. We have included all of these foundational services as well as primary communications and messaging services (Internet, email, and VPN) in the first phase of our Continuity of Operations project.
We have also identified mission-critical applications and determined the physical equipment (servers, storage, etc.) that would be needed to support them. This equipment has also been included in our initial phase of preparedness work.
A Home Away From Home
The right solution to mitigating the kind of risks we have identified is to create an alternative base of IT operations somewhere outside the greater Tuscaloosa area – a DR site. We wanted it to be far enough away to reduce the likelihood of the same event affecting both Tuscaloosa and the DR site but close enough that personnel could get to the site within a few hours if necessary. Since our primary and backup Internet connection circuits currently run through a secured, environmentally hardened, DR colocation facility in Atlanta, it was most practical and economical to select that facility for our DR site.
Recently, we completed the implementation of the foundation equipment at the Atlanta DR site. This includes three fully loaded equipment racks containing routers, switches, firewalls, load balancers, spam filters, servers, and storage. The equipment has been selected and configured to support the full range of foundation services and ultimately to host those mission-critical applications that are capable of being hosted off-site.
Making It Work
Having the equipment in place is one thing. Ensuring that it will actually work and provide the intended protections is an entirely different matter. We are beginning now to set up and document the intricate network routing configurations required to enable a fail-over process of network and security functions. Some of the services run in real-time redundant mode. Others will require the execution of an automated or semi-automated procedure to transition them in an event. We will be developing testing protocols for each layer of services.
Once the network transition is assured, we will begin deploying instances of the applications to the DR site and devising the procedures for operational fail-over and fail-back for them. This work is expected to continue throughout 2012.
In the meantime, we are realizing benefits from having the DR site in place. It is already serving as the off-site backup location for our primary storage backup processes, replacing a much less efficient tape backup model. Our email services are also now protected by redundant services in the DR site, providing improved operational reliability for that platform. Later, we plan to use the equipment at the DR site for application development and testing and possibly as a reporting platform to ease the load on local systems.