Tuesday, October 18, 2011

Avoiding catastrophic outages by automating change management

There are plenty of examples of catastrophe these days, but in the IT world, catastrophic losses are typically the result of anticipated technology failures combined with unanticipated human failures.

The best-laid plans often go awry because people naturally take shortcuts. Repetitive processes and time-consuming steps get skipped. It's a human problem, and when the result is a catastrophic outage, nobody can hide.

Our sales team recently relayed a pair of horror stories in which major data facilities lost both their main and backup power. Power loss is an anticipated issue in data centers. What made the events catastrophic was the human failure, over time, to adhere to change management processes.

When power was restored, recovery of key pieces of network infrastructure was hampered because the most recent configuration files were not available. Without the ability to return devices to their previous operating state, mean time to repair (MTTR) rose steadily as administrators tried to recover them.

The power of automation
Clearly, these stories show the importance of having clear network configuration and change management (NCCM) procedures that are followed to the letter. They also show the need for automation to take the human tendency toward shortcuts out of the process.

Quantifying how big an issue this is, Gartner says "80% of unplanned downtime is caused by people and process issues, including poor change management practices."

Uplogix localized management can help you address these issues by automating configuration management processes so they run the same way every time. For change management, Uplogix delivers the following:

  • Facilitating changes | Administrators log into the Uplogix platform to access only those devices for which they have privileges. Changes can be staged locally on Uplogix (especially useful for low-bandwidth or short-change-window situations) and scheduled in advance. Previous configurations and OS files are also saved locally on the Uplogix platform.
  • Verifying changes | When a change is pushed from Uplogix to a device, it is validated. If the change fails, the Uplogix SurgicalRollback process backs out the exact changes, returning the device to its previous state.
  • Auditing changes | All changes are recorded by Uplogix, 24x7, capturing both commands and device responses.
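
To make the stage, verify, roll back, and audit cycle concrete, here is a minimal Python sketch of the general pattern. It is not Uplogix code or the Uplogix API: the Device and ChangeManager classes, the apply_change method, and the dictionary-based configuration are all hypothetical stand-ins invented for illustration, and a real device would be reached over a console or network session rather than held in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Device:
    """Hypothetical in-memory stand-in for a managed network device."""
    name: str
    config: Dict[str, str] = field(default_factory=dict)


@dataclass
class ChangeManager:
    """Hypothetical manager: snapshot, apply, validate, roll back, audit."""
    audit_log: List[str] = field(default_factory=list)

    def apply_change(
        self,
        device: Device,
        change: Dict[str, str],
        validate: Callable[[Device], bool],
    ) -> bool:
        # Save the last known-good configuration before touching anything.
        previous = dict(device.config)
        self.audit_log.append(f"{device.name}: applying {change}")
        device.config.update(change)

        if validate(device):
            self.audit_log.append(f"{device.name}: change validated")
            return True

        # Back out only the keys this change touched, restoring prior values
        # and removing keys that did not exist before the change.
        for key in change:
            if key in previous:
                device.config[key] = previous[key]
            else:
                device.config.pop(key, None)
        self.audit_log.append(f"{device.name}: validation failed, rolled back")
        return False
```

In this sketch, a caller would build a Device, pass a dictionary of settings and a validation function to apply_change, and then read audit_log to see what was attempted and whether it was rolled back. The design point is that the snapshot, the validation, and the audit entry happen on every change, so none of them depends on someone remembering to do them.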

So how good are your plans to avoid catastrophe? It's likely that the biggest risk isn't power or weather, but the day-to-day activities that send your best-laid plans awry.