April showers bring May flowers, but they also bring lively thunderstorms to central Kentucky. Thunderstorms can bring power outages, and power outages can bring a keen desire to revisit past choices in hardware maintenance contracts. Let me explain.
On a late April evening in 2015, a storm caused an extended power outage that drained the backup batteries serving a rack of network switch equipment on our small university campus. Once power resumed and everything came back up, one blade in one switch no longer functioned properly. Diagnosing the error messages led us to a help article describing one of those special cases: if you have this particular model of switch with this certain type of memory, and it has been running for longer than a certain period, then an issue may occur at the next power cycle of the hardware, no matter the cause. A "next business day" (NBD) maintenance contract covered the switch. We diagnosed the problem and called it in at 1:00 a.m.; the big question then was, "At one o'clock in the morning, when is Next Business Day?"
For this particular vendor and contract, NBD turned out to be "tomorrow," meaning the next calendar day, not the business day about to begin. This definition meant that the replacement part would ship sometime during the following calendar day and arrive sometime the day after that, meaning at least 55 hours until the problem would be resolved, and likely more if the part did not arrive first thing in the morning.
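The arithmetic behind that 55-hour figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the calendar date and the 8:00 a.m. delivery time are hypothetical, the function names are made up for this example, and weekends and holidays are ignored.

```python
from datetime import datetime, time, timedelta


def earliest_arrival(call_time, delivery_time=time(8, 0)):
    """Earliest arrival under the NBD terms described above.

    The part ships on the calendar day after the call and arrives the
    day after that via next-day shipping. The 8:00 a.m. delivery time
    is an assumption, and weekends and holidays are ignored.
    """
    ship_date = call_time.date() + timedelta(days=1)
    arrival_date = ship_date + timedelta(days=1)
    return datetime.combine(arrival_date, delivery_time)


def hours_between(start, end):
    """Elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


# Hypothetical date; the article says only "late April 2015" at 1:00 a.m.
call = datetime(2015, 4, 29, 1, 0)
arrival = earliest_arrival(call)

nbd_hours = hours_between(call, arrival)
print(nbd_hours)      # 55.0
print(nbd_hours - 4)  # 51.0 hours longer than a four-hour replacement
```

Any delivery later than first thing in the morning pushes the total past 55 hours, which is why that number is a floor rather than an estimate.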
Naturally, more questions arose. For starters, why does "next business day" mean only that parts will ship, not that they will arrive NBD? And hey, it's 1:00 a.m. — why does NBD have to mean the next calendar day when we've got a whole business day ahead of us?
Why can't the parts ship today, meaning any time during the business day about to begin? And why isn't same-day shipping the standard? Without it, doesn't NBD automatically turn into 2×NBD?
Most importantly, when did we decide that this was OK?
When thinking about disaster recovery, we often envision worst-case scenarios. "What would we do if our entire network was down or destroyed?" or "How would we recover if all of our hardware was suddenly gone?" are common questions as we try to plan for those horrific events. But disasters come in many shapes and sizes, including a small hardware failure after a power outage, so it is important to think about how to mitigate these "mini" disasters, too.
Budget pressures are another common issue that IT shops face when making decisions about disaster recovery. When you need to find a few thousand dollars in the budget for a much-needed upgrade, an NBD service contract suddenly looks more appealing than its 24×7×4 counterpart (24 hours a day, seven days a week, with a four-hour response time). If you combine thinking about only large disasters with routine budget squeezing, then you might find yourself rationalizing, "Well, if something big does happen, we'll be down for more than a day, anyway." With that mindset, you would surely be OK with an NBD contract, which might bite you when a small disaster strikes.
Reevaluating IT Crisis-Response Approaches
While we waited for the replacement part to arrive, we were able to provide limited service. The affected blade served fiber optic connections, but a remaining blade offered copper connections, so we used some spare media converters to link them up. Meanwhile, we issued regular communications to the campus to keep faculty and students in the loop.
After resolving this incident, we reviewed all maintenance contracts on critical systems. (Regularly reviewing contract terms is an important component of an annual disaster recovery review for all organizations.) In general, we do not have hot backup or live redundancy systems. Our budget simply cannot cover that level of redundancy, so we rely on maintenance contracts to protect our systems. We have looked at cloud options for replication and redundancy, but have found the cost similar to having the capabilities on site.
With the vendor in this case, NBD and 24×7×4 contracts look very different: 24×7×4 means having a replacement part in your hands within four hours, no matter when you call. In contrast, NBD means the part will ship out the next business day (in practice, the next calendar day) by next-day shipping, so our 1:00 a.m. scenario meant that at least two business days would go by before parts arrived. At last quote, the 24×7×4 option cost about 60 percent more than NBD, a difference of a few thousand dollars in the budget. Paying that difference could have saved us more than 48 hours of downtime. Would that gain have been worth it? Would that pricier 24×7×4 option have covered its cost over several years?
Eventually, we looked at the gray market for network gear, which has many vendors. Simple math told us that the price difference between the NBD and 24×7×4 contracts, accumulated over two years, would buy us a used unit of the same network switch that we could keep on a shelf. We expect this device to operate for at least another two years, so we decided to stick with the NBD contract and purchase a used gray market device as a backup. We feel good about this decision because it gives us hardware on site to replace any failed parts quickly, while still allowing us to have the original device repaired under contract.
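For readers who want to run the same comparison, here is a minimal sketch of that simple math. The dollar amounts are hypothetical placeholders, not our actual quotes, and the sketch assumes contracts are priced annually; only the 60 percent premium and the two-year horizon come from the numbers above.

```python
def contract_premium(nbd_annual_cost, premium_percent, years):
    """Extra amount paid for 24x7x4 coverage over NBD across `years`."""
    return nbd_annual_cost * premium_percent / 100 * years


def spare_fits_in_premium(spare_price, nbd_annual_cost, premium_percent, years):
    """True if a used spare costs no more than the accumulated premium."""
    return spare_price <= contract_premium(nbd_annual_cost, premium_percent, years)


NBD_ANNUAL_COST = 5000    # hypothetical annual NBD contract price
PREMIUM_PERCENT = 60      # 24x7x4 cost about 60 percent more than NBD
YEARS = 2                 # expected remaining service life of the switch
USED_SWITCH_PRICE = 5500  # hypothetical gray market price

premium = contract_premium(NBD_ANNUAL_COST, PREMIUM_PERCENT, YEARS)
print(premium)  # 6000.0
print(spare_fits_in_premium(USED_SWITCH_PRICE, NBD_ANNUAL_COST,
                            PREMIUM_PERCENT, YEARS))  # True
```

Swapping in your own quotes and a realistic service life shows quickly whether a shelf spare or the pricier contract is the better use of the same money.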
Different vendors have different levels of service available. For another device that we reviewed, the vendor offered a "cold" backup copy of its network appliance at a reasonable price. But, without paying for licensing to make this spare device "hot" so that we would have both in operation, we'd have to call in during normal business hours to swap the license from our primary appliance to the cold backup. We went for this option, since it got the replacement hardware on site ahead of time, and we would be OK with waiting until NBD to have this appliance active again.
Another vendor offered no cold backup hardware for its device and only one level of replacement service that didn't require sending the original device back first. The options it offered were to pay for a second device plus licensing to have redundant, load-balanced gear on site, or to pay for what it called "expedited" hardware replacement, which equates to 24×7×8 service. We opted for the expedited service in this case simply because we could not fit the redundant gear plus licensing costs into our budget.
Weigh Your Options
The lesson learned from this springtime outage is to review your contracts and weigh your options, keeping in mind that not every disaster will be a large one. Next business day might work just fine for some devices, especially when combined with purchasing backup hardware, while 24×7×4 service could be a must for your most critical systems. Your institutional context will determine how you manage IT risk across campus.
Jason Whitaker serves as the vice president for information technology at Transylvania University in Lexington, Kentucky, after 12 years at IBM as an IT specialist. Whitaker has driven several technology initiatives at Transylvania including server and desktop virtualization and migration to Google Apps for Education. He has steered enhancements to Transylvania's Ellucian ERP system to incorporate an admissions CRM and analytics software. In addition to his BA in computer science from Transylvania, Jason holds an MS in computer science from the University of Kentucky.
© 2016 Jason Whitaker. The text of this EDUCAUSE Review blog is licensed under Creative Commons BY 4.0 International.