Wednesday, August 14, 2013

Perfect storms of impossible events

When planning for a perfect storm, automation is a critical component of engineering for resiliency.
The recent failure of two network switches in a data center in Utah rippled across the network, bringing down four major US web hosting firms and affecting millions of their customers. The breadth of the impact demonstrates some of the risks of consolidation in the hosting industry. Cloud computing has seen similar troubles, with power issues at an Amazon data center causing headaches for some high-profile customers.

From an internal networking perspective, as businesses move more critical functions to the cloud, they rely more and more on their own network infrastructure to reach those functions. To a large degree (though obviously not completely), reliability in the cloud will be there. The more likely risk is the local network. And it's not just the black-and-white question of whether the network is up or down. Cloud applications need the network to perform at specified levels, or the savings and promise of the cloud will be drowned out by complaints about slow applications that kill productivity and cost the business money. When planning and operating these networks, administrators of branch offices can learn some lessons from their cloud providers.

With data center issues, it's often a "perfect storm of impossible events" that leads to unprecedented downtime. Jesse Robbins of Amazon has been a frequent speaker on the topic and says the solution is to try out contingency plans by breaking stuff on purpose and seeing how people respond to unexpected trouble. Then do it again and again.

Of course, people are important factors in responding to issues, but they tend to be the cause behind issues as well. People get busy, distracted... any number of things that can extend a problem or provide a window for a small problem to grow into something bigger. Robbins says that automation is a critical component of engineering for resiliency.

Uplogix provides solutions for the increasingly important network infrastructure component that is often overlooked in the cloud discussion. A Local Manager can automatically detect common WAN problems, including outages and flapping circuits, and provide an instant diagnosis, with supporting trending or configuration data, to speed recovery, document outages, or facilitate carrier resolution.
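To make "flapping circuit" concrete, here is a minimal Python sketch of one way a monitor might tell a flapping link from a clean outage: count up/down transitions inside a sliding time window. This is not Uplogix code or its API. The class name, the threshold and the window length are all assumptions made up for illustration.

```python
from collections import deque


class FlapDetector:
    """Hypothetical helper: flags a circuit as flapping when its state
    changes too many times within a sliding time window."""

    def __init__(self, max_transitions=4, window_seconds=60.0):
        # Both thresholds are illustrative assumptions, not vendor defaults.
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._transitions = deque()  # timestamps of recent state changes
        self._last_state = None

    def observe(self, timestamp, is_up):
        """Record one link-state sample; return 'up', 'down', or 'flapping'."""
        if self._last_state is not None and is_up != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = is_up

        # Forget state changes that have aged out of the window.
        while (self._transitions
               and timestamp - self._transitions[0] > self.window_seconds):
            self._transitions.popleft()

        if len(self._transitions) >= self.max_transitions:
            return "flapping"
        return "up" if is_up else "down"


if __name__ == "__main__":
    detector = FlapDetector(max_transitions=3, window_seconds=10.0)
    samples = [(0, True), (1, False), (2, True), (3, False), (20, False)]
    for timestamp, is_up in samples:
        print(timestamp, detector.observe(timestamp, is_up))
```

Separating "flapping" from "down" matters because the two usually call for different responses: a hard outage is something to report to the carrier, while a flapping circuit is something to document with timestamps before anyone argues about whether it really happened.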

Many common faults can be solved without any human intervention. A robust automation framework makes it possible for end users to modify prepackaged recovery procedures, or define their own sequential and conditional ones, so that they align with the run book. For example, a problem with a device could trigger the following automated steps, with a check for successful recovery after each action: Clear Service Module... Cycle Interface... Show Tech... Reboot... Cycle Power.
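The escalating sequence above can be sketched in a few lines of Python. This is only an illustration of the pattern (try the least disruptive action first, check for recovery, then move to the next), not the Uplogix framework itself. The function names and the simulated device are hypothetical, and the sketch assumes for the sake of the demo that only a reboot clears the fault.

```python
def run_recovery(steps, is_healthy):
    """Run recovery actions in order, checking health after each one.

    steps      -- list of (name, callable) pairs, least disruptive first
    is_healthy -- callable returning True once the device has recovered
    Returns (recovered, attempted_step_names).
    """
    attempted = []
    for name, action in steps:
        action()
        attempted.append(name)
        if is_healthy():
            return True, attempted
    return False, attempted


class SimulatedDevice:
    """Stand-in for a real device (an assumption for this demo):
    only a reboot or power cycle clears the fault."""

    def __init__(self):
        self.healthy = False

    def no_effect(self):
        pass  # the action runs but does not fix the problem

    def restart(self):
        self.healthy = True


def build_steps(device):
    # Step names are taken from the example sequence in the text.
    return [
        ("Clear Service Module", device.no_effect),
        ("Cycle Interface", device.no_effect),
        ("Show Tech", device.no_effect),
        ("Reboot", device.restart),
        ("Cycle Power", device.restart),
    ]


if __name__ == "__main__":
    device = SimulatedDevice()
    recovered, attempted = run_recovery(build_steps(device),
                                        lambda: device.healthy)
    if recovered:
        print("Recovered after:", ", ".join(attempted))
    else:
        print("Escalate to a human. Steps attempted:", ", ".join(attempted))
```

Returning the list of attempted steps is a deliberate choice: if the ladder runs out, whoever picks up the escalation can see exactly what has already been tried.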

Of course, if the automated recovery can't fix the problem, it's time to escalate it to the human experts. Uplogix will provide them with a secure out-of-band connection through the device's console port for remote troubleshooting, which is what a tech would use if they were onsite.

With automation, a small squall of a network issue should dissipate before building into a major storm. Read more about Uplogix automation.