How the University of Illinois System Conducted a Massive Failover Test to Reduce Risk Exposure

Case Study

Authors:: A. J. O'Connell
Published:: Thursday, August 10, 2023
Columns:: Case Studies Cybersecurity and Privacy


After discovering that it had outdated disaster recovery plans and enormous risk exposure in 2018, the University of Illinois system embarked on a five-year plan culminating in a massive failover test.

*Image credit: Muslianshah Masrie / Shutterstock.com © 2023*

Institutional Profile

The University of Illinois System (U of I System) comprises three large public research universities in Urbana-Champaign, Chicago, and Springfield; it also includes the University of Illinois Hospital & Health Sciences System (UI Health), numerous regional and satellite campuses, and the Discovery Partners Institute, a workforce development institute. Together they enroll more than 94,000 students and employ more than 36,000 staff and faculty. The U of I System Office provides a range of operational services, including Administrative Information Technology Services.

The Challenge/Opportunity

Early on a Saturday morning in February 2023, a team at the U of I System AITS office quietly shut down its main data center, cutting access to 340 servers running 900 software applications used for payroll, procurement, financial aid, fundraising, research, student information, communication, collaboration, and many other services. The team had just commenced a weeklong live simulation of a massive disaster, the first step of which was, essentially, to turn everything off. The fifty people conducting the test then went to work on the second stage: seeing if they could get all those applications and data working in the backup and recovery environment without a serious delay.

A test like this wouldn't have been possible five years earlier, when the U of I System realized it had outdated disaster recovery plans and enormous risk exposure. When an institution hosts its own servers, it typically has an emergency plan in the event of a disaster striking its primary data center. For example, if there's a power outage or structural damage to a data center, the system is "failed over" by engineers switching it to a backup site to limit interruptions to the applications. Later, when the damaged data center is restored, the system is "failed back."

But in 2018, AITS discovered an imminent risk of disaster at its primary data center, located in a structure built in 1947 and purchased in the 1950s by the U of I System to house its first computers. Aging HVAC systems in the building had been a known issue, so the U of I System had always taken particular care of the pipes that moved glycol through the cooling system. The pipes were coated with a chemical to slow decay, and a team of engineers inspected them regularly.

However, the 2018 inspection revealed that the deterioration had accelerated dramatically, eroding the metal from the inside. The pipes were razor-thin in places, and some were falling apart, shedding chunks of metal externally. If a pipe burst, the facility would overheat, bringing down the enterprise systems and likely damaging the servers.

Nyle Bolliger, senior assistant vice president of IT services and operations for AITS, called repairing the pipes a nightmare scenario. "The cooling system was in the basement, and the cooling air handlers were on the top of the building," he said. "The pipes wound through the walls and ceilings of four stories in a very old building, and there were no diagrams of where all the pipes went."

Bolliger gathered some of the pieces of metal that had fallen off the pipes (see figure 1) and went to the system's executive leadership team to request funds to fix the plumbing. "I remember holding these pieces, and the CFO was saying, 'Why don't we just fail over our systems to the backup data center in Urbana?' I had to tell them we couldn't."

**Figure 1. Decaying Cooling Pipes** (pieces of rusted pipe next to a quarter)

The university system's disaster recovery plan had been written in the early 2000s, when the system's digital footprint was much smaller. Under the failover plan in place at the time, recovery at a facility on the Urbana-Champaign campus, which could then handle only data backups to tape, would have taken weeks. "Our 1,000-page disaster recovery plan wasn't adequate," Bolliger said. "A disaster at that building would have made us infamous. The CFO kept shaking his head and saying, 'How can this be? We're a multi-billion-dollar organization.'"

Budget cuts over the years meant that the IT disaster recovery plan hadn't been a top priority, said Bolliger. Updates to IT infrastructure were put on the back burner, retiring disaster recovery coordinators had not been replaced, and capital projects had been delayed. The CFO declared this situation to be unacceptable; the U of I System would make a substantial investment in its IT infrastructure and disaster recovery plan to end this risk exposure. That began a years-long transformation, eventually leading to the massive failover test in February 2023.

The Process

The first step was an enormous challenge: replacing the coolant pipes in the Chicago facility during a winter without interrupting services. "It was the equivalent of replacing the radiator in your car while the engine's running, without overheating," said Bolliger.

To be ready to respond if power failed during this nine-month project, Bolliger had engineers onsite 24 hours a day, as well as multiple emergency backup systems, including generators and chillers on flatbed trucks. Even with these precautions in place, there were some bumps in the road. Cooling went out at 2 a.m. one day, but when the backups were engaged, they failed as well. "We had to use the backup to the backup," said Bolliger. "The data center hit 108 degrees in 15 minutes. We were that close to catastrophe."

Meanwhile, the U of I System began the process of rebuilding its disaster recovery protocols and an IT infrastructure capable of safely failing over the main servers to the backup data center in Urbana. This required a capital investment of $4.5 million in hardware and infrastructure, and Huron Consulting Group was brought in to help redevelop the U of I System's IT disaster procedures, plans, and policies.

This involved more than a rewrite of the disaster plan. A new methodology for handling digital risk was needed. The IT organization was restructured so that disaster recovery personnel were higher in the administrative hierarchy. The new plan covers multiple disaster scenarios, and the IT organization runs annual tabletop exercises to simulate specific threats, such as ransomware attacks.

While the disaster recovery protocols were updated, the U of I System's backup data center in Urbana was outfitted with new infrastructure able to replicate the systems and data in the main data center in Chicago. The goal was to be able to start the failover process for all systems with the flip of a switch, to have them running after only a few hours of interruption, to have them be able to run for an extended period of time without breaking, and then to return them—fail back—to the main facility in Chicago.

Achieving that goal would require a failover test larger than what anyone had heard of being conducted at a higher education institution. A team from within the U of I System office was assembled, including system administrators, storage engineers, application administrators, analysts, data center staff, and disaster recovery specialists, with additional support from IT staff from the individual universities. For several weeks prior to the failover test, this team ran smaller tests, as well as rehearsals of the full failover. For example, the team failed over the systems into an isolated recovery environment, testing applications and learning the nuances of switching applications from one system to another.

The start of the ultimate test of the new systems was scheduled for February 11, the Saturday of Super Bowl weekend, because the team thought use of the institutional systems would be lighter that weekend. The Facilities and Services team set up temporary backup generators and had electricians on-site in case they were needed. They flipped the switch at 6 a.m., and any of the institution's 100,000 students and employees who were working that morning started seeing "down for maintenance" messages. The race was on to get the backup systems online within their target of six hours.

"As the morning wore on and we reached the point where we were starting up services, we were all nervous," Bolliger said. "Even when the systems started coming up and were working, thoughts started going through my mind like, 'Are these really our systems in DR [disaster recovery] and not somehow our old production systems?' The system admins assured me our normal production data center was offline and we really were running in DR."

By noon, all systems were running on the backup servers, and they continued to run without incident for the test period of 18 hours. The failover had been a success. The next stage, starting Sunday morning, was to fail it all back to the Chicago facility.

That's when the test finally stretched the limits of their planning. "We discovered we had much more activity in our systems overnight and the data replication going back was taking much longer than we anticipated," said Bolliger. Super Bowl weekend wasn't so slow after all.

As the day dragged on, it became clear that the restore stage wasn't going to be completed within the 18-hour goal. The team decided to use the primary center to run the systems restored so far and to continue using the backup environment to run the remaining systems that hadn't yet synced. This amounted to an impromptu test of an "extended mixed mode," which AITS had built into the architecture. The mixed mode would need to run for another week and a half, until the next open window to fail back the remaining systems.
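The mixed-mode decision described above can be illustrated with a minimal sketch. The article does not describe how AITS implemented this, so everything here is an assumption made for illustration: the `System` record, the `replication_synced` flag, the `plan_mixed_mode` function, and the application names are all hypothetical. The sketch only shows the rule of thumb in the text: systems whose data has finished replicating back run from the primary data center, and the rest stay in the backup environment.

```python
from dataclasses import dataclass
from enum import Enum


class Site(Enum):
    PRIMARY = "primary"  # main data center
    BACKUP = "backup"    # disaster recovery (DR) data center


@dataclass
class System:
    name: str                 # hypothetical application name
    replication_synced: bool  # True once data has fully replicated back to primary


def plan_mixed_mode(systems: list[System]) -> dict[str, Site]:
    """Assign each system to a site: synced systems fail back, the rest stay in DR."""
    return {
        s.name: Site.PRIMARY if s.replication_synced else Site.BACKUP
        for s in systems
    }


# Hypothetical inventory; real sync status would come from replication tooling.
systems = [
    System("payroll", replication_synced=True),
    System("student-information", replication_synced=False),
    System("procurement", replication_synced=True),
]

placement = plan_mixed_mode(systems)
for name, site in placement.items():
    print(f"{name}: run from {site.value}")
```

In practice, a decision like this would also have to account for dependencies between applications, which this sketch deliberately ignores.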

"It worked flawlessly," Bolliger said. "This challenge ended up adding far more value to our exercise than we ever anticipated. We now know we can effectively move portions of our workloads between data centers as needed instead of having to do all our systems in a wholesale cutover. We also ran a substantial number of systems for almost two full weeks from our DR data center without experiencing performance issues under normal busy production loads."

In 2018, a disaster would have left most of the U of I System's technology infrastructure offline for weeks. "We went from weeks to recover and rebuild to six hours," said Bolliger. The failover test had been a success.

Outcomes and Lessons Learned

While failover tests are a common practice in business settings, Bolliger has never heard of a test this large having been attempted by a college or university system. "Banks and other businesses have to do this because they need to have continuity of business," he said. "In higher education, we all want to get there and have this resiliency. When we talk with our peers across the Big Ten, they all worry about system failures too. It's just tough to invest and actually do it."

In fact, many colleges and universities are migrating their environments and applications to the cloud, something AITS is working toward as well. But AITS determined that simply having systems in the cloud does not provide adequate disaster recovery capabilities. It also decided that the migration to the cloud is a long journey and that the U of I System needs to have its own disaster recovery capabilities that complement the cloud strategy.

Bolliger and his team learned several other lessons from the project. First, IT disaster recovery must be an organizational goal within the higher education institution. "Formalize the effort to build a disaster program and infrastructure," he said. "This effort was made a strategic objective for the organization, which created both visibility across the organization and resource availability." The project was managed just like any other critical business project, using project managers, leadership updates, resource plans, and regular meetings.

Another lesson learned was that disaster recovery needs adequate ongoing capital and operational funding to keep up with the U of I System's growing dependence on technology and its changing risk exposure. "The program needs care and feeding and must be exercised regularly," said Bolliger. "We conduct annual tabletop exercises, review plans, educate and train staff, and conduct failover tests. We also invest in improvements in capabilities and methodologies and continue to have projects dedicated to those efforts."

AITS plans to conduct failovers annually to ensure readiness. And now that the plumbing for the cooling system in the Chicago data center is fixed, the attention in 2024 will be focused on the aging electrical system.

Where to Learn More

Case Study Contributor: Nyle Bolliger, Senior Assistant Vice President of IT Services and Operations for Administrative Information Technology Services (AITS), University of Illinois System

This case study is one in a series being produced by EDUCAUSE Communities and Research. A. J. O'Connell is a writer with McGuire Editorial & Consulting specializing in education technology.

ParentTopics:: Business Continuity Planning Disaster Recovery Planning Risk Management Vulnerability Assessment and Management