Tuesday, October 18, 2011

Inside looking out: Local info for centralized network management

The device-level data used by traditional centralized management tools includes device statistics collected via ICMP, SNMP get requests and SNMP traps based on predetermined thresholds and rules. These tools gather and receive this important information from devices over the network. If the network isn’t available, the data isn’t either, and centralized network management cannot do its job.

The problem is that disruptive incidents are tough for IT departments to prevent because there are so many variables to control in complex and highly distributed network environments. Just some of the many possibilities include the following:
  • Service provider’s WAN link connecting a remote office goes down
  • Technician’s simple error on a remote switch drops that device offline later that night
  • Kernel panic on a device in a remote location causes the device to stop responding
Each of these events occurs in even the best-managed networks, and on the most reliable platforms.
The resulting disruptions often limit visibility into the remote locations impacted. This restricts IT’s ability to troubleshoot, track configurations, enforce security policies, and provide timely fault assistance.
Adding Local Management to Centralized Tools
With uninterrupted access to device-level data, centralized tools can extend capabilities beyond the moment of outage. As examples, here is the impact on the above scenarios:
  • WAN link disruption could be confirmed as the service provider’s fault in seconds or minutes instead of hours or days; service provider triangulation actions could be executed and reported, and the remote office infrastructure would remain accessible for hands-on management
  • Technician error on a remote switch could be rolled back immediately as an uncommitted configuration change
  • An unresponsive device could be automatically power cycled in the first few minutes of the outage
Guaranteed and consistent data improves monitoring, makes troubleshooting easier, allows configurations to be tracked and security policies to be maintained, and accelerates fault resolution.

Uplogix makes your existing tools work better
Uplogix ensures that device-level data keeps flowing even if the primary network is temporarily unavailable. Uplogix Local Managers (LMs) collect device statistics and SYSLOG messages out-of-band, via a device’s console port. This information is stored locally on each LM and delivered at regular intervals to the Uplogix Control Center in the NOC and forwarded to centralized tools where it can be used to replace or augment statistics that might be missing or incomplete. Uplogix generates traps based on standard rules and policies. Traps generated by Uplogix Local Managers are forwarded to centralized tools even when the network isn’t available.
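
The store-and-forward behavior described above can be illustrated with a minimal Python sketch. It is not Uplogix code: the class name `StoreAndForwardBuffer`, the `send` callback, and the record layout are all hypothetical, and a real Local Manager would persist records to disk rather than hold them in memory. The idea shown is only that a record leaves the local buffer after delivery is confirmed, so nothing is lost while the link is down.

```python
from collections import deque
from typing import Callable, Deque, Dict

Record = Dict[str, object]  # hypothetical record shape, e.g. {"device": ..., "stat": ...}


class StoreAndForwardBuffer:
    """Hold collected records locally until the send function confirms delivery."""

    def __init__(self, send: Callable[[Record], bool]) -> None:
        # `send` is an assumed callback: it returns True only when the
        # central collector has acknowledged the record.
        self._send = send
        self._pending: Deque[Record] = deque()

    def collect(self, record: Record) -> None:
        """Store a record locally, whether or not the network is up."""
        self._pending.append(record)

    def flush(self) -> int:
        """Try to deliver pending records in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # link is down: keep this record and everything after it
            self._pending.popleft()
            delivered += 1
        return delivered

    def pending(self) -> int:
        """Number of records still waiting for confirmed delivery."""
        return len(self._pending)
```

Calling `flush()` at a regular interval gives the delivery pattern described: while the network is unavailable the call delivers nothing and the backlog grows, and the first successful call afterward drains the backlog in the order it was collected.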

Uplogix collects and stores information commonly used by centralized tools:
  • Device statistics | These are the raw statistics about device state, such as error frames, CPU status, or carrier transitions. Traditionally these statistics are requested via SNMP get requests across the network at regular intervals (e.g., every 15 minutes), and this information is delivered on a best-effort basis by the device.

    With Uplogix, these same statistics are collected directly from devices at much more frequent intervals (e.g., every 30 seconds), without impacting network or system performance. Uplogix stores and forwards this data to centralized tools on a guaranteed delivery basis. So not only does local management provide more granular device statistics at tighter intervals, it does so with less network overhead and with guaranteed delivery, even during disruptions.
  • SYSLOG messages | These are the unsolicited event records reported by devices. They generally tell you that something has happened, where it’s happening, and what it is related to, such as “port 16 is having problems with duplex mismatch,” or “port 4 dropped a packet -- out of memory.”

    Uplogix gathers SYSLOG messages over the console port (the most reliable method) and stamps them with the time and date in UTC. They are used along with device statistics to generate traps.
  • Traps | These are alarms based on predetermined rules and thresholds using device statistics and SYSLOG data, such as “send an alarm if more than three malloc errors occur in five minutes” or “send an alarm if utilization > 50%.” Centralized tools receive traps via SNMP on a best-effort basis, which means a device can generate and send a trap, but doesn’t know if it arrives. Likewise, traditional management tools don’t know that a device has sent a trap unless it is actually received.

    With Uplogix, traps are generated by the LM rules engine, which uses parameters and thresholds modeled on Cisco TAC best practices. Uplogix LMs store and forward traps on a guaranteed delivery basis. So when a trap is generated, centralized tools will receive it even if the network is not available.
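
A rule such as “more than three malloc errors occur in five minutes” can be evaluated with a sliding time window. The sketch below is only an illustration in Python of that kind of threshold rule, not the LM rules engine itself: the class name `WindowedCountRule` and its parameters are hypothetical, and timestamps are assumed to be seconds supplied in increasing order.

```python
from collections import deque
from typing import Deque


class WindowedCountRule:
    """Fire when more than `max_events` events fall within `window_seconds`."""

    def __init__(self, max_events: int = 3, window_seconds: float = 300.0) -> None:
        self._max_events = max_events
        self._window = window_seconds
        self._times: Deque[float] = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one matching event; return True if a trap should be raised."""
        self._times.append(timestamp)
        cutoff = timestamp - self._window
        # Discard events that are a full window old or older.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self._max_events


rule = WindowedCountRule()  # defaults: more than 3 events in 300 seconds
fired = [rule.observe(t) for t in (0, 60, 120, 180)]
```

Here the fourth malloc error arrives within five minutes of the first, so only the last call to `observe` reports that a trap should be raised. In a store-and-forward design, the resulting trap would then be queued locally rather than sent once and forgotten.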
For more information, read the Local Management Technical White Paper.