Tuesday, May 14, 2013

MTBF and MTTR are important, but so is MTTI

MTBF (mean time between failures) and MTTR (mean time to recovery) are important measurements that usually factor into the creation of SLAs (service level agreements). Another measurement IT groups should weigh when choosing management tools is MTTI, or mean time to innocence: when there is a problem, how quickly can you tell whether it is your fault or lies elsewhere? It's easier to enforce SLAs when you can cut through finger pointing early in the event.

Wikipedia tells us that mean time between failures (MTBF) is the predicted (average) time between failures of a system during operation, and that mean time to repair (MTTR) represents the average time required to repair a failed component. A tool like Local Management from Uplogix can improve (that is, lengthen) MTBF by automating routine network management tasks, which removes opportunities for human error. Detailed monitoring combined with rules and alerts can notify administrators of a potential problem, letting them intervene and potentially avoid a failure.
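
To make the two definitions concrete, here is a minimal sketch of how they could be computed from an outage log. The log entries, dates, and function names are invented for illustration and are not Uplogix output. MTBF is taken here as the average uptime between the end of one outage and the start of the next, which is one common way to apply the definition to a repairable system.

```python
from datetime import datetime, timedelta

# Hypothetical outage log: (failure detected, service restored) pairs,
# sorted by time. These values are made up for the example.
outages = [
    (datetime(2013, 5, 1, 2, 0), datetime(2013, 5, 1, 3, 0)),
    (datetime(2013, 5, 6, 14, 0), datetime(2013, 5, 6, 14, 30)),
    (datetime(2013, 5, 11, 9, 0), datetime(2013, 5, 11, 11, 0)),
]


def mean_time_to_repair(log):
    """Average duration of an outage (restored minus detected)."""
    downtime = sum((end - start for start, end in log), timedelta())
    return downtime / len(log)


def mean_time_between_failures(log):
    """Average uptime between one restoration and the next failure."""
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(log, log[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)


mttr = mean_time_to_repair(outages)
mtbf = mean_time_between_failures(outages)
print("MTTR:", mttr)
print("MTBF:", mtbf)
```

In this toy log, shortening each outage lowers MTTR, while preventing an outage altogether widens the gaps and raises MTBF.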

MTTR is reduced with Uplogix in multiple ways. First, the direct connection to managed devices over the console port means more frequent and more detailed monitoring. Our default polling interval is 30 seconds, so Uplogix will know very quickly that there is a problem and specifically what the problem is. And since we are not monitoring devices over the network, we'll still be able to talk to all the devices in the rack and report back over an out-of-band connection on exactly what the situation is. If Uplogix can't fix an issue automatically, it will already have tried your initial run book steps, so you won't have to start at #1.

This factors into mean time to innocence. In the traditional SNMP model when there are network issues, polling goes down. Is it a carrier problem? A last-mile issue? Something in your branch office infrastructure? The downtime clock is running and each stakeholder starts troubleshooting from page one of the run book. Or worse, finger pointing begins. Tick tick tick.

MTTI doesn't necessarily protect you from downtime, but it can focus the recovery efforts on where the problem lies, directly reducing MTTR. This is also helpful for enforcing SLAs. Of course, the goal is not to have to collect on missed SLAs. As Andy Gotlieb said in a recent article:
"If the carrier violates the terms of the SLA, its biggest penalty is that it will owe you a portion of your monthly bill back. The more "generous" SLAs will say that if the outage lasts for too long a period of time, they'll refund your entire month's bill. The problem, of course, is that you don't want a free month's service – you want to avoid the very high cost of downtime to your enterprise. But no carrier will give you an SLA where they commit to compensate you for what that lost connectivity time is worth to you and your firm."
So as you worry about MTBF and MTTR, consider the impact that Uplogix Local Management can have on these metrics, and the way it gives you a mean time to innocence, or MTTI. It's always nice to be able to show when it's not your fault: the ever-popular CYA metric.