Thursday, March 24, 2011

Zenoss: Stairway to Events

Although it is possible (and easy to replicate, especially with zenossYAMLTool) to create separate alerting rules for each individual event, I make it a practice not to. Instead of creating an alerting rule for each severity of each event, I try to lump them together as much as possible. For instance, in an operations group, this is probably the normal escalation procedure:
  1. send an email to the entry level support
  2. send an email to higher level support
  3. send an SMS to higher level support
One way to handle this would be to create three separate alerting rules that send out notifications according to the number of times the event has been duplicated. Another way is to escalate the severity of the event as well, so that the above actions are associated with three global alerting rules, one for each of the following severities:
  1. Warning
  2. Error
  3. Critical
Anything lower in severity just appears in the Event Console but does not send out any alerts.
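
As a rough sketch of that mapping, assuming the usual Zenoss numeric severities (3 = Warning, 4 = Error, 5 = Critical). The action labels and the action_for helper are hypothetical names used only for illustration, not anything Zenoss defines:

```python
# Zenoss numeric severities, from least to most severe.
SEVERITY_NAMES = {0: "Clear", 1: "Debug", 2: "Info",
                  3: "Warning", 4: "Error", 5: "Critical"}

# Hypothetical labels for the three global alerting rules described above.
ESCALATION_ACTIONS = {
    3: "email entry-level support",
    4: "email higher-level support",
    5: "SMS higher-level support",
}

def action_for(severity):
    """Return the notification tied to a severity, or None when the
    event should only appear in the Event Console."""
    return ESCALATION_ACTIONS.get(severity)
```

With this table, an Info event (severity 2) maps to no action at all, which matches the "Event Console only" behavior.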

However, suppose I wanted the escalation to occur over certain periods of time instead of duplication counts (which may or may not occur in 5-minute intervals). Then I would need to use the Python API within a Zenoss transform. While researching how to do this, I came upon many similar posts with varying degrees of success. I am posting mine now in the hope that others will benefit from all the trial and error it took to clean this up and make it work for me:

import time

# Look up the original Info-level (severity 2) event to find when the problem first appeared.
em = dmd.Events.getEventManager()
mydedupid = '|'.join([ evt.device, evt.component, evt.eventClass, evt.eventKey, '2' ])
try:
    ed = em.getEventDetail(dedupid=mydedupid)
    first = int(time.mktime(time.strptime(ed.firstTime, '%Y/%m/%d %H:%M:%S.000')))
except Exception:
    # No Info-level event found, so treat this as the first occurrence.
    first = 0

# Walk from Critical (5) down to Info (2) and stop at the most severe event that exists.
for sev in range(5,1,-1):
    mydedupid = '|'.join([ evt.device, evt.component, evt.eventClass, evt.eventKey, str(sev) ])
    try:
        ed = em.getEventDetail(dedupid=mydedupid)
        mycount = ed.count
        last = int(time.mktime(time.strptime(ed.lastTime, '%Y/%m/%d %H:%M:%S.000')))
    except Exception:
        mycount = 0
        last = 1
    if mycount > 0:
        break

# Escalate according to how long (in seconds) the problem has persisted.
diff = last - first
if first == 0:
    evt.severity = 2
elif diff > 3600:
    evt.severity = 5
elif diff > 1800:
    evt.severity = 4
elif diff > 900:
    evt.severity = 3

The original event should be set to Info (severity=2) and will escalate to Warning after 15 minutes, Error after 30 minutes, and Critical after an hour. Where you place this transform in the Event Class tree depends on how the events should be affected by this suppression/escalation logic.
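
To try the threshold logic outside of Zenoss, here is a minimal standalone sketch. The function names parse_event_time and escalated_severity are my own for illustration and are not part of the Zenoss API. It assumes timestamps in the same 'YYYY/MM/DD HH:MM:SS.000' format the transform parses, and the sample timestamps are made up:

```python
import time

INFO, WARNING, ERROR, CRITICAL = 2, 3, 4, 5

def parse_event_time(stamp):
    """Convert a 'YYYY/MM/DD HH:MM:SS.000' string to epoch seconds (local time)."""
    return int(time.mktime(time.strptime(stamp, '%Y/%m/%d %H:%M:%S.000')))

def escalated_severity(first, last, current):
    """Mirror the if/elif ladder of the transform.

    first   -- epoch seconds of the first occurrence, or 0 if none was found
    last    -- epoch seconds of the most recent occurrence
    current -- severity to keep when no threshold is crossed
    """
    if first == 0:
        return INFO
    diff = last - first
    if diff > 3600:
        return CRITICAL
    if diff > 1800:
        return ERROR
    if diff > 900:
        return WARNING
    return current

# Made-up timestamps 20 minutes apart: past 15 minutes, short of 30.
first = parse_event_time('2011/03/24 10:00:00.000')
last = parse_event_time('2011/03/24 10:20:00.000')
new_severity = escalated_severity(first, last, INFO)
```

Note that the comparisons are strict, so an event that is exactly 900 seconds old has not yet escalated.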

Other tricks that I implement in combination with this are:
  • Combine load-balanced devices into a single event by using the DeviceClass short name:
    evt.device = evt.DeviceClass.split('/')[-1]
  • For SNMP Traps, set the Event Key (and thus the deduplication id) to the second tab-delimited field to combine certain traps into the same event:
    evt.eventKey = evt.fullLogLine.split('\t')[1]
Once they escalate, they will trigger the appropriate alerting rule and send out the proper notification.
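
To see what these two tricks do to the deduplication id, here is a small sketch on plain strings. The helper names and all sample values (device class path, log line, event class) are hypothetical, and it assumes the log line is separated by real tab characters. The id is joined the same way the transform above builds mydedupid:

```python
def device_class_shortname(device_class):
    """Return the last path segment of a device class, e.g. '/A/B/C' -> 'C'."""
    return device_class.split('/')[-1]

def second_tab_field(log_line):
    """Return the second tab-delimited field of a log line."""
    return log_line.split('\t')[1]

def build_dedupid(device, component, event_class, event_key, severity):
    """Join the fields the same way the escalation transform does."""
    return '|'.join([device, component, event_class, event_key, str(severity)])

# Hypothetical sample values, for illustration only.
device = device_class_shortname('/Server/Linux/WebFarm')
event_key = second_tab_field('trap-source\tlinkDown\textra detail')
dedupid = build_dedupid(device, 'eth0', '/Net/Link', event_key, 2)
```

Because every member of the load-balanced group ends up with the same device value, their events collapse into a single dedupid and escalate together.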
