Thursday, August 21, 2014

High resolution network device monitoring

As the summer season winds down, let's pull an analogy from a summer tradition: the great American road trip. Driving across the vast Western US, miles can go by without seeing another car, let alone an exit. You enjoy the scenery, only infrequently needing to consult a map because hours can pass before anything changes with your route. Then, later in the trip, you pull into a large city, trying to find your hotel for the night. If you consulted your map at the same rate as when you were on the open road, you'd miss the destination for sure. Instead, you are actively on the lookout: constantly checking the map, reading street signs and looking for landmarks to ensure you are ready to react when it's time.

Monitoring in network management works much the same way. Centralized monitoring tools poll remote devices over the network, generating traffic and placing a load on the managed devices. To lower this impact, administrators reduce sampling frequencies, sacrificing how quickly they'll know about a problem when it occurs.

With network-independent, local connections to devices over the console port, Uplogix takes the default sampling interval from a standard 15 minutes or more down to 30 seconds. The high-resolution monitoring conducted by an Uplogix LM means that problems can be detected and recovered from before SLAs kick in and the customer calls. This combination of monitoring frequency and depth with reliable automation of most level-1 runbook steps is like having an administrator with a crash cart plugged into network devices 24/7.
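To put rough numbers on the difference between those two intervals, here is a minimal Python sketch. It is back-of-the-envelope arithmetic, not a description of how Uplogix measures anything: the function names are made up for illustration, and the average delay assumes a fault is equally likely to start at any moment between two samples.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Intervals taken from the text: a standard 15 minutes versus 30 seconds.
CENTRAL_INTERVAL = 15 * 60  # seconds
LOCAL_INTERVAL = 30         # seconds


def samples_per_day(interval_seconds: int) -> int:
    """Number of samples collected per device per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


def worst_case_detection_delay(interval_seconds: int) -> int:
    """A fault that starts just after a sample waits a full interval to be seen."""
    return interval_seconds


def mean_detection_delay(interval_seconds: int) -> float:
    """Average wait, assuming fault onset is uniformly distributed (an assumption)."""
    return interval_seconds / 2


if __name__ == "__main__":
    for label, interval in (("15-minute polling", CENTRAL_INTERVAL),
                            ("30-second polling", LOCAL_INTERVAL)):
        print(f"{label}: {samples_per_day(interval)} samples/day, "
              f"worst-case delay {worst_case_detection_delay(interval)} s, "
              f"mean delay {mean_detection_delay(interval):.0f} s")
```

Under these assumptions, the shorter interval collects 30 times as many samples and cuts the worst-case wait before a fault is even visible from 15 minutes to 30 seconds.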

The clear benefit is increased uptime and decreased time to problem resolution, whether the issue is solved automatically or the initial troubleshooting steps are taken automatically within minutes of the problem. In the latter case, technicians at the NOC start working on the problem not from step one, but deeper into the runbook, both confident that earlier steps did not resolve the issue and knowing exactly where the problem lies because Uplogix has updated their dashboard and ticketing systems automatically.

In a multi-vendor network, issues often kick off a chain of finger pointing as everyone tries to isolate the problem and find out who is to blame. With local monitoring of devices, Uplogix can tell exactly where the issue is (with the carrier, in the network stack or downstream), ending the finger pointing and reducing what has been called the Mean-Time-to-Innocence—that period of time nobody likes when everyone is hoping someone else is to blame.

