ikenticus: tech: March 2011

Friday, March 25, 2011

Zenoss: Automated LDAP Authentication via twill

Call me a nitpicker, but after getting my Zenoss core packages installed via an automated configuration management tool, like cfengine or puppet, I dislike having to manually click through the Zope UI in order to attach an external user validation module, like the LDAP Authenticator. Luckily, since I was using Zenoss Enterprise, the enterprise ZenPacks for Zenoss include both the LDAP Authentication plugin and the Synthetic Web Transaction plugin, which bundles twill. By stepping through the process in twill-sh, I was able to create a twill sequence that would allow me to add LDAP Authentication with the same script that I used to install Zenoss and the subsequent ZenPacks.

If you execute the twill commands before you go through the Getting Started UI, then you can also take advantage of the default zenoss username and password:

# Set all your custom variables here:
setlocal ldapserver1 <primary_ldap_server_hostname>
setlocal ldapserver2 <secondary_ldap_server_hostname>
setlocal ldapserver3 <tertiary_ldap_server_hostname>
setlocal ldapouuser <ldap_base_ou_users>
setlocal ldapougroup <ldap_base_ou_groups>
setlocal ldapgrouptype <group_object_class>
setlocal ldapzenmanager <ldap_group_zenmanager>

go localhost:8080
fv 1 __ac_name admin
fv 1 __ac_password zenoss
submit

go /zport/acl_users/manage_addProduct/LDAPMultiPlugins/addLDAPMultiPlugin
fv 1 id LDAP
fv 1 title "OpenLDAP Login"
fv 1 LDAP_server $ldapserver1
fv 1 users_base $ldapouuser
fv 1 groups_base $ldapougroup
fv 1 roles ZenUser
submit

go /zport/acl_users/LDAP/manage_activateInterfacesForm
fv 1 interfaces:list IAuthenticationPlugin
fv 1 interfaces:list ICredentialsResetPlugin
fv 1 interfaces:list IPropertiesPlugin
fv 1 interfaces:list IGroupsPlugin
fv 1 interfaces:list IRolesPlugin
fv 1 interfaces:list IUserEnumerationPlugin
fv 1 interfaces:list IGroupEnumerationPlugin
fv 1 interfaces:list IRoleEnumerationPlugin
submit

go /zport/acl_users/LDAP/acl_users/manage_main
fv 1 obj_classes $ldapgrouptype
submit
fv 3 host $ldapserver2
submit
fv 3 host $ldapserver3
submit

go /zport/acl_users/LDAP/acl_users/manage_grouprecords
fv 3 group_name $ldapzenmanager
fv 3 role_name Manager
submit

Here is a simple bash snippet from my installation script that runs this ldap_zenoss.tw with the twill-sh bundled in whatever version of Zenoss you are using:

twshell=$(find $ZENHOME/ZenPacks -iname twill-sh)
PYTHONPATH=${twshell%/bin/twill-sh}/lib $twshell $ZENHOME/bin/ldap_zenoss.tw
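
If your configuration management tool can run Python, the placeholder values at the top of ldap_zenoss.tw could be rendered from data rather than edited by hand. This is only a minimal sketch: the function name `render_twill_header` and the sample hostnames are hypothetical, and it produces nothing more than the `setlocal` lines from the script above.

```python
def render_twill_header(settings):
    """Return the twill 'setlocal' header for a list of (name, value) pairs.

    Assumption: values contain no spaces; anything with spaces would need
    quoting, as the title "OpenLDAP Login" is quoted in the script above.
    """
    lines = ["# Set all your custom variables here:"]
    for name, value in settings:
        lines.append("setlocal %s %s" % (name, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = [
        ("ldapserver1", "ldap1.example.com"),
        ("ldapserver2", "ldap2.example.com"),
        ("ldapouuser", "ou=people,dc=example,dc=com"),
    ]
    print(render_twill_header(example))
```

The rendered header would then be prepended to the rest of the twill commands before handing the file to twill-sh.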

And voila, an automated installation with no GUI clicking required...

Thursday, March 24, 2011

Zenoss: Stairway to Events

Even though it is doable (and incredibly easy to replicate, especially when using zenossYAMLTool), I make it a practice not to create multiple alerting rules for each individual event. So, instead of creating numerous alerting rules for each severity for each event, I try to lump them together as much as possible. For instance, in an operations group, this is probably the normal escalation procedure:

send an email to the entry level support
send an email to higher level support
send an SMS to higher level support

One way to handle this would be to create three separate alerting rules that send out notifications according to the number of times the event has been duplicated. Another way would be to escalate the severity of the event itself, so that the above actions are tied to three global alerting rules with the following severities:

Warning
Error
Critical

Anything lower in severity just appears in the Event Console but does not send out any alerts.

However, suppose I wanted the escalation to occur over certain periods of time instead of duplication counts (which may or may not occur in 5 minute intervals). Then I would need to utilize the python API within a Zenoss transform. While researching how to do this, I came upon many similar posts with varying degrees of success. I am posting mine now in the hope that others will benefit from all the trial and error that it took to clean this up and make it work for me:

import time
em = dmd.Events.getEventManager()
mydedupid = '|'.join([ evt.device, evt.component, evt.eventClass, evt.eventKey, '2' ])
try:
    ed = em.getEventDetail(dedupid=mydedupid)
    first = int(time.mktime(time.strptime(ed.firstTime, '%Y/%m/%d %H:%M:%S.000')))
except:
    first = 0
for sev in range(5,1,-1):
    mydedupid = '|'.join([ evt.device, evt.component, evt.eventClass, evt.eventKey, str(sev) ])
    try:
        ed = em.getEventDetail(dedupid=mydedupid)
        mycount = ed.count
        last = int(time.mktime(time.strptime(ed.lastTime, '%Y/%m/%d %H:%M:%S.000')))
    except:
        mycount = 0
        last = 1
    if mycount > 0: break
diff = last - first
if first == 0: evt.severity = 2
elif diff > 3600: evt.severity = 5
elif diff > 1800: evt.severity = 4
elif diff >  900: evt.severity = 3

The original event should be set to Info (severity=2) and will escalate to Warning after 15 minutes, Error after 30 minutes, and Critical after an hour. Where you place this transform in the Event Class tree depends on how the events should be affected by this suppression/escalation logic.
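
To make the thresholds easier to reason about outside of Zenoss, the decision at the end of the transform can be pulled out into a plain function. This is just a sketch under my own naming (`to_epoch` and `escalated_severity` are hypothetical helpers, not part of the Zenoss API); it mirrors the 15/30/60 minute steps described above.

```python
import time

# Same timestamp layout that the transform above parses.
TIME_FORMAT = '%Y/%m/%d %H:%M:%S.000'


def to_epoch(stamp):
    """Convert an event timestamp string to integer epoch seconds (local time)."""
    return int(time.mktime(time.strptime(stamp, TIME_FORMAT)))


def escalated_severity(first, last, current):
    """Return the severity for an event first seen at `first`, last seen at `last`.

    first == 0 means no Info-level event exists yet, so start at Info (2).
    """
    if first == 0:
        return 2
    diff = last - first
    if diff > 3600:
        return 5  # Critical after an hour
    if diff > 1800:
        return 4  # Error after 30 minutes
    if diff > 900:
        return 3  # Warning after 15 minutes
    return current  # too early to escalate, leave as-is


if __name__ == "__main__":
    first = to_epoch('2011/03/24 10:00:00.000')
    last = to_epoch('2011/03/24 10:20:00.000')
    print(escalated_severity(first, last, 2))
```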

Other tricks that I implement in combination with this are:

Combine loadbalanced devices into a single event --- using the DeviceClass shortname:
```
evt.device = evt.DeviceClass.split('/')[-1]
```
For SNMP Traps, set the Event Key (and thus the deduplication id) to the second tab-delimited field to combine certain traps into the same event:
```
evt.eventKey = evt.fullLogLine.split('\t')[1]
```
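
To see what those two one-liners do, here is a small sketch run against a stand-in event object. The attribute values are made up for illustration, and `SimpleNamespace` merely imitates the `evt` object that Zenoss hands to a transform.

```python
from types import SimpleNamespace

# Hypothetical event; in a real transform Zenoss supplies `evt`.
evt = SimpleNamespace(
    device='web01.example.com',
    DeviceClass='/Server/Linux/WebFarm',
    fullLogLine='trap-header\tlinkDown\tinterface eth0',
    eventKey='',
)

# Collapse every device in the class into one logical "device".
evt.device = evt.DeviceClass.split('/')[-1]

# Use the second tab-delimited field as the Event Key.
evt.eventKey = evt.fullLogLine.split('\t')[1]

print(evt.device, evt.eventKey)
```

Since both fields feed the deduplication id, events from any member of the class with the same key would fold into a single event.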

Once they escalate, they will trigger the appropriate alerting rule and send out the proper notification.

Wednesday, March 23, 2011

Zenoss: Maintenance Windows

I have noticed that when I set a maintenance window for a device, it stops any related alerts for that device, but it does not stop the events from appearing with varying severity levels. In order to suppress the events to the same severity as the Maintenance Window notification, I use the following transform:

# Suppress any events during maintenance window
if evt.severity > 2:
    for mw in device.getMaintenanceWindows():
        if mw.started is not None:
            evt.summary = 'Maintenance: %s' % evt.summary
            evt.severity = 2
            # One active window is enough; avoid prefixing the summary twice
            break

As you can see, the transform uses the same object classes that the local python API uses, so you should be able to retrieve just about anything from the system. Stay tuned for next time, when I outline the time-based suppression/escalation code that I use for certain events.
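
If you want to try the suppression logic outside of Zenoss, it can be wrapped in a function and fed stand-in objects. This is a sketch only: `suppress_during_maintenance` is a hypothetical name, and the `SimpleNamespace` objects imitate just the `severity`, `summary` and `started` attributes that the transform touches.

```python
from types import SimpleNamespace


def suppress_during_maintenance(evt, windows):
    """Drop the event to Info (2) if any maintenance window is active."""
    if evt.severity > 2:
        for mw in windows:
            if mw.started is not None:
                evt.summary = 'Maintenance: %s' % evt.summary
                evt.severity = 2
                break  # stop at the first active window
    return evt


if __name__ == "__main__":
    active = SimpleNamespace(started=1300867200.0)   # made-up start time
    idle = SimpleNamespace(started=None)
    evt = SimpleNamespace(severity=5, summary='disk full')
    suppress_during_maintenance(evt, [idle, active])
    print(evt.severity, evt.summary)
```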

Friday, March 4, 2011

Zenoss: The SNMP Chinese Wall, Separating Integers from Strings

In Zenoss, one of the issues that one finds when one devotes oneself to a pure SNMP implementation, avoiding SSH altogether, is that the Zenoss SNMP DataSource only really handles numerical outcomes. That means that it can handle the INTEGER value from nsExtendResult as well as convert the purely numerical STRING value from nsExtendOutLine, but cannot deal with the stdout from an entire nsExtendOutputFull OID.

To remedy this, just simply create a DataSource that uses one of the INTEGER OIDs, then a Threshold that triggers a specific Event Class, and finally a Transform within that Event Class that will replace the INTEGER OID with a STRING OID and snmpget the entire stdout from the SNMP Extend script. Assuming that you have read my SNMP Extend post, let us continue using the same OID for remote_command as well as the same exact template as before. All we need now is a transform for the /Perf/Snmp Event Class that will look something like this:

alert = re.search('threshold of (\w+)_failure_output', evt.message)
if alert:
    ds = evt.eventKey.split('|')[0]
    for t in device.getRRDTemplates():
        for s in t.getRRDDataSources():
            if ds == '%s_%s' % (s.id, s.id):
                 # evt.summary = snmpget of (nsExtendResult OID
                 # rewritten as nsExtendOutputFull OID)
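
The matching part of that transform can be exercised on its own. In this sketch the message text and event key are assumptions about what a threshold event looks like (check a real event on your system), and `matching_datasource_id` is a hypothetical helper that takes plain DataSource ids instead of walking `device.getRRDTemplates()`.

```python
import re

THRESHOLD_RE = re.compile(r'threshold of (\w+)_failure_output')


def matching_datasource_id(message, event_key, datasource_ids):
    """Return the DataSource id whose datapoint raised the threshold, or None."""
    if not THRESHOLD_RE.search(message):
        return None
    # The part of the event key before '|' is '<dsName>_<dpName>'.
    ds = event_key.split('|')[0]
    for ds_id in datasource_ids:
        # DataSource and DataPoint share a name in this template.
        if ds == '%s_%s' % (ds_id, ds_id):
            return ds_id
    return None


if __name__ == "__main__":
    # Assumed message and key formats, for illustration only.
    msg = 'threshold of remote_failure_output exceeded: current value 1.00'
    key = 'remote_command_remote_command|remote_failure_output'
    print(matching_datasource_id(msg, key, ['other_check', 'remote_command']))
```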

As I was having trouble figuring out how to utilize the existing Zenoss python libraries (and not have to install one of the various proprietary pysnmp/snmppy modules) to create a pure python snmpget, I originally created a python-based subprocess call to the system's netsnmp tools:

from subprocess import Popen, PIPE
# Swap the nsExtendResult column for the nsExtendOutputFull column
outoid = s.oid.replace('8072.1.3.2.3.1.4','8072.1.3.2.3.1.2')
cmd = 'snmpget -Ov -%s -c%s %s %s' % (
    device.zSnmpVer, device.zSnmpCommunity,
    device.manageIp, outoid)
proc = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
# Keep only the first line of output, minus the type prefix
evt.summary = proc.stdout.readlines()[0].replace('STRING: ','')
break
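
A possible variation on the subprocess approach is to build the command as an argument list, which avoids `shell=True` and any quoting surprises in the community string. The sketch below only constructs the command and cleans a sample reply; the helper names are hypothetical and the IP, community and reply text are made up.

```python
def output_oid(result_oid):
    """Rewrite an nsExtendResult OID into the matching nsExtendOutputFull OID."""
    return result_oid.replace('8072.1.3.2.3.1.4', '8072.1.3.2.3.1.2')


def build_snmpget_cmd(snmp_ver, community, ip, result_oid):
    """Return an argv list for snmpget, suitable for Popen without a shell."""
    return ['snmpget', '-Ov', '-%s' % snmp_ver, '-c', community,
            ip, output_oid(result_oid)]


def clean_value(first_line):
    """Strip the type prefix that 'snmpget -Ov' puts in front of the value."""
    return first_line.replace('STRING: ', '').strip()


if __name__ == "__main__":
    cmd = build_snmpget_cmd('v2c', 'public', '10.0.0.5',
                            '1.3.6.1.4.1.8072.1.3.2.3.1.4.1')
    print(cmd)
    print(clean_value('STRING: CRITICAL: something broke\n'))
```

The resulting list could be passed straight to `Popen(cmd, stdout=PIPE, stderr=PIPE)` in place of the formatted string.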

Eventually, after some research and speaking with Zenoss support, this was the shortest pure python implementation I was able to figure out:

from twisted.internet import reactor
from pynetsnmp import twistedsnmp

def snmpget(proxy, oids):
    data = proxy.get(oids)
    data.addCallback(snmpvalue)

def snmpvalue(result):
    global snmpdata
    snmpdata = result
    reactor.stop()

where the evt.summary would be parsed like this:

proxy = twistedsnmp.AgentProxy(
    community=device.zSnmpCommunity,
    snmpVersion=device.zSnmpVer,
    ip=device.manageIp)
proxy.open()
oids = [s.oid.replace('8072.1.3.2.3.1.4','8072.1.3.2.3.1.2')]
reactor.callWhenRunning(snmpget, proxy, oids)
reactor.run()
proxy.close()
evt.summary = snmpdata[snmpdata.keys()[0]]
break

However, this produces the following zenperfsnmp output:

yyyy-mm-dd HH:MM:SS,123 ERROR zen.zenperfsnmp: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly.
]
Traceback (most recent call last):
  File "/opt/zenoss/Products/ZenHub/PBDaemon.py", line 382, in pushEvents
    driver.next()
  File "/opt/zenoss/Products/ZenUtils/Driver.py", line 64, in result
    raise ex
PBConnectionLost: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly.
]

which results in killing zenhub. When I posted this to the Zenoss Support portal, the Zenoss "engineers agree that the deferred is a bad thing to use inside the event transform." So, even though I find the subprocess method distasteful, it is the optimal (or only?) transform that can be performed. If someone knows otherwise, please feel free to comment.

So, either dump the transform into the GUI panel or inject it via the add_transforms feature in my zenossYAMLTool and you should be set. For the complete YAML, download my Result2Output.yaml and modify it accordingly --- it contains additional python code for the transform to handle SNMPv3 using zConfigurationProperties as well as some simple escalation parsing logic.

Thursday, March 3, 2011

Zenoss: Dr SNMP Extend or How I Learned to Stop SSHing and Love the OID

So there are all these scripts that you run on your remote hosts for monitoring. Since Zenoss has a built-in nagios parser, you basically run all of these scripts via SSH. What sucks is that establishing X number of SSH sessions for Y number of devices in Zenoss builds up a significant number of TCP connections and other wonderfully painful bottlenecks on your Zenoss system. Now, because all of these scripts already exist on the remote host, you can just as easily run them as root with a simple SNMP Extend line in your snmpd.conf. The line you would need to add would look something like this:

extend remote_command /customdir/customscript custom args -s 123

Do not forget to restart snmpd for the changes to take effect. The next step would be to create a Zenoss DataSource for this --- bearing in mind that Zenoss works better with OID numbers than MIB names --- the straightforward approach would simply be to walk the SNMP tree and convert to OID:

snmpwalk -v2c -cpublic hostname 'NET-SNMP-EXTEND-MIB::nsExtendResult."remote_command"' -On

Two things to note in that command:

If you want to see the MIB name, simply remove the -On
I used snmpwalk instead of snmpget because sometimes I like to walk the entire Extension tree using nsExtendObjects, and snmpget would just barf on that

Rather than painstakingly reproduce all that using screenshots, I will illustrate the template creation steps using zenossYAMLTool syntax, which is how I normally make changes to my Zenoss system (to avoid tons of images here as well as tediously clicking through the GUI). Assuming that I will not be graphing in this template, the YAML needed to create the template looks like this:

- action: add_template
  description: Result Threshold retrieving Output Summary
  targetPythonClass: Products.ZenModel.Device
  templateName: Result2Output
  templatePath: /Server/Linux/TestCase
  GraphDefs: []

Now that we have the OID, adding a DataSource is fairly easy:

DataSources:
  - dsName: remote_command
    cycletime: 300
    enabled: true
    eventClass: /Cmd/Fail
    oid: 1.3.6.1.4.1.8072.1.3.2.3.1.4.x.114.101.109.111.116.101.95.99.111.109.109.97.110.100
    parser: Auto
    severity: 3
    sourcetype: BasicDataSource.SNMP
    DataPoints:
    - dpName: remote_command
      isrow: true
      rrdtype: GAUGE

You will find that the number marked x in the OID above (highlighted in red in the original post) will increment as you add additional Extensions into the OID table. I originally had a script that would generate the OID value for a specified extend_command, but since I cannot predict which position it would appear in the OID table, we will have to rely on snmpget/snmpwalk -On.
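
One part of the OID that can be generated safely is the tail: the numbers after the x are simply the extend name spelled out as ASCII codes. The sketch below (with the hypothetical name `extend_name_to_oid_suffix`) reproduces only that tail; the number in front of it is still best read from snmpwalk -On as described above.

```python
def extend_name_to_oid_suffix(name):
    """Spell out an extend name as dot-separated ASCII codes."""
    return '.'.join(str(ord(ch)) for ch in name)


if __name__ == "__main__":
    print(extend_name_to_oid_suffix('remote_command'))
```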

In order for it to trigger an alert, create a MinMax Threshold that will trigger an event (to keep it simple, we will be assuming 0=success and 1=failure here):

Thresholds:
  - thresholdName: remote_failure_output
    enabled: true
    escalateCount: 0
    eventClass: /Perf/Snmp
    maxval: '0'
    minval: ''
    severity: 3
    dsnames:
    - remote_command_remote_command
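
As I understand it, a MinMax Threshold fires when the value falls outside the configured bounds, and an empty string means that side is unbounded. The sketch below encodes that assumption in a hypothetical `minmax_violated` helper so you can sanity-check the maxval/minval pair above; it is not the Zenoss implementation.

```python
def minmax_violated(value, minval='', maxval=''):
    """Return True if value is above maxval or below minval.

    Assumption: an empty string means "no bound on this side",
    mirroring the minval: '' entry in the YAML above.
    """
    if maxval != '' and value > float(maxval):
        return True
    if minval != '' and value < float(minval):
        return True
    return False


if __name__ == "__main__":
    # With maxval '0' and no minval: 0 (success) is quiet, 1 (failure) fires.
    print(minmax_violated(0, minval='', maxval='0'))
    print(minmax_violated(1, minval='', maxval='0'))
```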

For the complete YAML file, please read my next blog post on how to use transforms to extract the nsExtendOutputFull after the nsExtendResult triggers an event.

And that is it for now. Of course, for the die-hard nagios plugin fanatic, you may be apt to point out that using SSH enables you to pass DataPoints via the stdout as such:

STATUS: Some useful output message here|data1=100;;; data2=10;20;30

This would be added to Zenoss with a single DataSource that contains multiple DataPoints. For my zenossYAMLTool syntax, the YAML would look something like this:

DataSources:
  - dsName: remote_command
    cycletime: 300
    enabled: true
    eventClass: /Cmd/Fail
    parser: Auto
    severity: 3
    sourcetype: BasicDataSource.COMMAND
    usessh: true
    DataPoints:
    - {dpName: data1, isrow: true, rrdtype: GAUGE}
    - {dpName: data2, isrow: true, rrdtype: GAUGE}
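
For anyone unfamiliar with the format, here is roughly how a Nagios-style output line like the one shown earlier breaks down into a message and its DataPoints. This is a simplified sketch, not the Zenoss parser: `parse_nagios_output` is a hypothetical name, and it assumes plain numeric values with no unit suffixes.

```python
def parse_nagios_output(line):
    """Split 'message|label=value;warn;crit;...' into (message, {label: value})."""
    message, _, perfdata = line.partition('|')
    datapoints = {}
    for token in perfdata.split():
        label, _, rest = token.partition('=')
        # The value is the first ';'-separated field; thresholds follow it.
        datapoints[label] = float(rest.split(';')[0])
    return message.strip(), datapoints


if __name__ == "__main__":
    line = 'STATUS: Some useful output message here|data1=100;;; data2=10;20;30'
    print(parse_nagios_output(line))
```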

To recreate the same effect using SNMP Extend, you would need to add multiple DataSources, each with a single DataPoint, as the OIDs are mapped directly to each DataSource. The script should output the message and the data values on separate lines:

STATUS: Some useful output message here
100
10

And you would make use of the multiple nsExtendOutLine."remote_command".# OIDs to gather your DataSources:

DataSources:
  - dsName: remote_command
    cycletime: 300
    enabled: true
    eventClass: /Cmd/Fail
    oid: 1.3.6.1.4.1.8072.1.3.2.4.1.2.12.114.101.109.111.116.101.95.99.111.109.109.97.110.100.1
    parser: Auto
    severity: 3
    sourcetype: BasicDataSource.SNMP
    DataPoints:
    - dpName: remote_command
      isrow: true
      rrdtype: GAUGE
  - dsName: remote_data1
    cycletime: 300
    enabled: true
    eventClass: /Cmd/Fail
    oid: 1.3.6.1.4.1.8072.1.3.2.4.1.2.12.114.101.109.111.116.101.95.99.111.109.109.97.110.100.2
    parser: Auto
    severity: 3
    sourcetype: BasicDataSource.SNMP
    DataPoints:
    - dpName: remote_data1
      isrow: true
      rrdtype: GAUGE
  - dsName: remote_data2
    cycletime: 300
    enabled: true
    eventClass: /Cmd/Fail
    oid: 1.3.6.1.4.1.8072.1.3.2.4.1.2.12.114.101.109.111.116.101.95.99.111.109.109.97.110.100.3
    parser: Auto
    severity: 3
    sourcetype: BasicDataSource.SNMP
    DataPoints:
    - dpName: remote_data2
      isrow: true
      rrdtype: GAUGE

Yes, the YAML appears longer but it is not more complex, merely an exercise in copy-and-paste repetition. The multiple DataSources can easily be written as a loop in a simple generator script. So you have to decide which way you want to handle it --- though, I suppose, these examples make it seem like just a choice between configuration vs performance.
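
Such a generator script might look something like the sketch below. It uses plain string formatting so that it has no dependencies; `generate_datasources` is a hypothetical name, and the base OID is copied from the example above, so substitute whatever snmpwalk -On reports on your system.

```python
DATASOURCE_TEMPLATE = """\
  - dsName: {name}
    cycletime: 300
    enabled: true
    eventClass: /Cmd/Fail
    oid: {base_oid}.{line}
    parser: Auto
    severity: 3
    sourcetype: BasicDataSource.SNMP
    DataPoints:
    - dpName: {name}
      isrow: true
      rrdtype: GAUGE
"""

# nsExtendOutLine."remote_command" prefix, taken from the YAML above.
BASE_OID = ('1.3.6.1.4.1.8072.1.3.2.4.1.2.12.'
            '114.101.109.111.116.101.95.99.111.109.109.97.110.100')


def generate_datasources(base_oid, names):
    """Emit one DataSource per output line; line numbers start at 1."""
    blocks = [
        DATASOURCE_TEMPLATE.format(name=name, base_oid=base_oid, line=line)
        for line, name in enumerate(names, start=1)
    ]
    return 'DataSources:\n' + ''.join(blocks)


if __name__ == "__main__":
    print(generate_datasources(
        BASE_OID, ['remote_command', 'remote_data1', 'remote_data2']))
```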

Wednesday, March 2, 2011

Zenoss: The Straw That Broke...The YAMLs Back

I remember when my buddy over at Linux Dynasty first introduced me to Zenoss 1.x a few years back. At the time, I knew nothing about python or any of the open source projects related to it. I only knew that I was not a fan of Zope, because it had caused me a lot of grief at a previous job. After looking at Zenoss for a few minutes, I decided that the reporting aspect was severely lacking, so I spent a week hacking up some python and TAL code in order to create what I called zenoss-organized-graphs, a few dynamic reporting templates that allowed you to select what to graph from standard combo dropdowns. My employer decided not to use Zenoss, and when Zenoss launched 2.0, the whole backend (including the reporting engine) had changed completely, and my one-week-hack project died right then and there. Fast forward to last year when, right before I started my new position, the Operations VP had already decided on and purchased Zenoss Enterprise 2.5.x. Thus began my reignited love/hate relationship with Zenoss.

Originally, because of the nature of the hosting provider, we had to run Zenoss using purely SSH/Nagios commands and scripts. The performance and overhead of all those SSH connections, despite our attempts to keep all the scripts short and zippy, was a headache. One day, while trying to update the default behavior of the Solaris Device SSH templates, I discovered that deleting ZenPacks is BAD. So bad that the only way to recover from the situation it caused was a ZODB Restore. Mind you, this was not the first issue I had regarding ZenPacks, but it was the straw that broke the camel's back. I made a solemn vow to myself that I would avoid ZenPacks as much as possible.

Let me clarify that last statement just a bit. I believe that ZenPacks have a useful purpose, as I do install the Core and Enterprise ZenPacks. That purpose is to add tabs/menu views, like the Zeus ZenPack, or to add global functionality, like LDAP-Authentication or the Holt-Winters Prediction. But due to what I consider their volatile nature (of removal/undo), I simply do not think they are the best way to perform simple updates to templates, reports, devices, etc. Again Linux Dynasty, a much more devout follower of Zenoss than I, had a solution --- probably based loosely on our earlier carefree Zenoss 1.x python discussions --- and it came in the form of the Zenoss_Template_Manager. I am certain that it is a great tool, and if you have been using it and you love it, then read no further. However, even though I am a big fan of getopts and passing parameters to a script, the prospect of having to pass multiple DataSources, Thresholds and GraphDefs did not readily appeal to me. I needed a way to manipulate the fairly complex templates I had built using configuration files, which also serve (for me, at least) as a better backup technique than the ZenPack or the ZenBackup. Zenoss itself uses XML to import changes into the ZODB but, while I believe in the power of XML, there had to be a better way for humans to process the data quickly and easily enough to effect rapid changes for template duplication, etc. And thus began my conversion to the YAML religion.

Originally, I wrote my zenossYAMLTool for adding and removing devices and templates. Throughout the course of the year, I extended it to cover alerts, devices, event commands, event mappings, os processes, reports, templates, transforms, users/groups and maintenance windows, adding features for minor changes, reproductions and migrations as I needed them. I had mentioned this tool to a few people here and there when the topic of Zenoss came up, and it occurred to me that perhaps, by sharing my tool, I could get some feedback and identify any bugs I overlooked.

So here it is: zenossYAMLTool.py

Zenoss has all the python dependencies that are required to use this tool, except for the python YAML plugin. So install that and, if you want to make it convenient, place the script into $ZENHOME/bin and run it as the zenoss user. In order to understand the YAML variables that are used, you should simply use the appropriate export feature on an existing object on your Zenoss instance. Using my latest addition (-w) as an example, here is a snippet of what a YAML configuration for this tool looks like:

- action: add_window
  windowName: Overnight Maintenance
  windowPath: /Server/Linux/
  duration: 120
  repeat: Daily
  skip: 1
  start: 1299139200.0
  enabled: true
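
The `start` value in that snippet is an epoch timestamp. If you would rather think in calendar dates, a small helper can do the conversion. This sketch assumes the value is seconds since the epoch in UTC, which I have not verified against the Zenoss source, and `window_start` is a hypothetical name.

```python
import calendar
from datetime import datetime, timezone


def window_start(year, month, day, hour=0, minute=0):
    """Return epoch seconds (as a float) for a UTC date and time."""
    return float(calendar.timegm((year, month, day, hour, minute, 0)))


if __name__ == "__main__":
    start = window_start(2011, 3, 3, 8)
    print(start)
    # Round-trip back to a readable UTC timestamp.
    print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat())
```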

I actually use this tool to generate grouped Multi-Graph Reports from the existing DeviceClasses --- perhaps one day I will revitalize the zenoss-organized-reports into the 3.x release, as I am not so sure that I am 100% satisfied that the current Reporting engine managed to cover all the features I had with my zenoss-organized-graphs back then.

This Is Tech

In order to avoid munging all my thoughts together into a single blog, I have decided to create a tech blog where I will simply post up technobabble when I feel like sharing tidbits of technical issues that have caused me grief and how I resolved them.

This will mostly be stuff I come across moving forward. I will dredge up stuff from the past should I re-encounter it or if I find myself reminiscing about past jobs and past lives.