
iostat monitoring for zenoss and such

So I wanted to graph Linux storage metrics in a local zenoss environment, so I googled it, as you do. There's a bit out there, like the nagios check_iostat script, and even some perl scripts that keep their own averages of iostat metrics. Considering iostat is part of the sysstat package along with our old friend sar, it seems only right to use this tool to provide relevant and meaningful statistics for just causes.

Another complication is the range of platforms and storage/drivers. In my first instance it's an Emulex dual port HBA with mpathd, so the ability to regex the devices you want to capture for your filesystem is essential. I'm going to sum the I/O counter style metrics and average the service time style metrics across the matching devices.

Now as for zenoss, the data sources relevant to the task are snmp or command with ssh. snmp is great, but you can never expect it to be polled more often than every 5 minutes and in general the timing is inconsistent. The command style, on the other hand, has a configurable cycle time in zenoss. My problem is the SSH part: it messes up my wtmp, making it difficult to keep an eye on who's been logged in messing stuff up. My solution to everything is a net-snmp pass-through script to transport the data, and a local script munging snmpwalk into a nagios style output for zenoss to record, giving me my 1 minute polling.

Firstly you need to install sysstat your way and edit /etc/cron.d/sysstat. My examples are RedHat/CentOS style. We want sar to run every minute and record disk stats. If you are adding the -d you won't see the data immediately unless you delete the /var/log/sa/saXX file for today, or wait for tomorrow.
*/1 * * * * root /usr/lib64/sa/sa1 -d 1 1

So now you already have all the stats you'll ever need; get a tool like kSar and you can impress anyone. Next is to present this data via snmp. I already have this open between all my hosts and the network management host. net-snmp's snmpd has a pass option in its config that allows you to present your own data. As I learned along the way, snmp will ask either for an OID or for the next OID after it, and the script must support both types of query.
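To make the two query types concrete, here is a minimal sketch of a pass handler. It assumes the documented net-snmp pass convention: snmpd calls the program with -g OID (GET) or -n OID (GETNEXT), and expects three lines back (OID, type, value), or no output at all when there is nothing to return. The base OID, the value_for helper and its three values are made up purely for illustration.

```shell
#!/bin/bash
# Hypothetical base OID, for illustration only.
BASE=".1.3.6.1.4.1.2021.255.1"

# value_for IDX: print the made-up value stored under BASE.IDX,
# or nothing if that index does not exist.
value_for() {
  case "$1" in
    0) echo 67 ;;
    1) echo 16709 ;;
    2) echo 26 ;;
  esac
}

# pass_reply -g OID : answer for exactly that OID (GET)
# pass_reply -n OID : answer for the OID that follows it (GETNEXT)
pass_reply() {
  mode=$1
  req=$2
  case "$req" in
    "$BASE")
      # The base itself holds no value; only GETNEXT can be answered,
      # and the answer is the first leaf.
      [ "$mode" = "-n" ] || return 0
      idx=0 ;;
    "$BASE".*)
      idx=${req#"$BASE".}
      # Only plain numeric leaves without leading zeros are handled here.
      case "$idx" in ''|*[!0-9]*|0?*) return 0 ;; esac
      if [ "$mode" = "-n" ]; then idx=$((idx + 1)); fi ;;
    *)
      return 0 ;;
  esac

  val=$(value_for "$idx")
  # Unknown leaf, or GETNEXT past the last one: print nothing.
  [ -n "$val" ] || return 0

  printf '%s.%s\ninteger\n%s\n' "$BASE" "$idx" "$val"
}
```

Calling pass_reply -n with the base OID prints the .0 entry, and calling it with -n on the last leaf prints nothing, which is how snmpd learns the walk is finished. The complete script follows.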
#!/bin/bash
NO_ARGS=0
E_OPTERROR=85

if [ $# -eq "$NO_ARGS" ]
then
  echo "Usage: `basename $0` options (-mdo)"
  echo "Where option is -m minutes for average (ie. zenoss poll rate, 1 or 5 would make sense)"
  echo "                -d device (ie. dev253)"
  echo "                -o base oid (ie. .1.3.6.1.3.1)"
  exit $E_OPTERROR
fi

# -n and -g are supplied by snmpd: -n OID is GETNEXT, -g OID is GET
while getopts "m:d:o:n:g:" o
do
  case $o in
    m ) MINUTES=$OPTARG;;
    d ) DEVICE=$OPTARG;;
    o ) BASEOID=$OPTARG;;
    n ) NREQ=$OPTARG ; TYPE="NEXT";;
    g ) GREQ=$OPTARG ; TYPE="GET";;
  esac
done

case $TYPE in
  NEXT ) REQ=$NREQ;;
  GET ) REQ=$GREQ;;
esac

MINUTES=`expr $MINUTES + 1`

# Sum the counter style metrics, average the rest over all matching devices
/usr/bin/sar -d -s `date -d "${MINUTES} minutes ago" +%H:%M:%S` -e `date +%H:%M:%S` | egrep -e "Average.*${DEVICE}" | \
/bin/awk '
  { tps += $3 ; readss += $4 ; writess += $5 ; reqsize += $6 ; queuelength += $7 ; wait += $8 ; svct += $9 ; util += $10 ; count++ }
  END { printf "%d %d %d %d %d %d %d %d\n", tps, readss, writess, reqsize/count, queuelength/count, wait/count, svct/count, util/count }' >/tmp/sarsnmp.$$

read TPS READS WRITES REQSZ QUEUE AWAIT ASVC UTIL </tmp/sarsnmp.$$

# Each reply is three lines: OID, type, value. The empty one means "nothing to return".
OUTPUTBASEOID=""
OUTPUTBASEOID0=`printf "%s.0\ninteger\n%s\n" "$BASEOID" "$TPS"`
OUTPUTBASEOID1=`printf "%s.1\ninteger\n%s\n" "$BASEOID" "$READS"`
OUTPUTBASEOID2=`printf "%s.2\ninteger\n%s\n" "$BASEOID" "$WRITES"`
OUTPUTBASEOID3=`printf "%s.3\ninteger\n%s\n" "$BASEOID" "$REQSZ"`
OUTPUTBASEOID4=`printf "%s.4\ninteger\n%s\n" "$BASEOID" "$QUEUE"`
OUTPUTBASEOID5=`printf "%s.5\ninteger\n%s\n" "$BASEOID" "$AWAIT"`
OUTPUTBASEOID6=`printf "%s.6\ninteger\n%s\n" "$BASEOID" "$ASVC"`
OUTPUTBASEOID7=`printf "%s.7\ninteger\n%s\n" "$BASEOID" "$UTIL"`

# GETNEXT answers with the following OID, GET with the requested one
case $REQ in
  ${BASEOID} )   if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID0"; else echo "$OUTPUTBASEOID" ; fi ;;
  ${BASEOID}.0 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID1"; else echo "$OUTPUTBASEOID0"; fi ;;
  ${BASEOID}.1 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID2"; else echo "$OUTPUTBASEOID1"; fi ;;
  ${BASEOID}.2 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID3"; else echo "$OUTPUTBASEOID2"; fi ;;
  ${BASEOID}.3 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID4"; else echo "$OUTPUTBASEOID3"; fi ;;
  ${BASEOID}.4 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID5"; else echo "$OUTPUTBASEOID4"; fi ;;
  ${BASEOID}.5 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID6"; else echo "$OUTPUTBASEOID5"; fi ;;
  ${BASEOID}.6 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID7"; else echo "$OUTPUTBASEOID6"; fi ;;
  ${BASEOID}.7 ) if [ "$TYPE" = "NEXT" ]; then echo "$OUTPUTBASEOID"; else echo "$OUTPUTBASEOID7"; fi ;;
esac

rm -f /tmp/sarsnmp.$$
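The aggregation step is easier to follow in isolation. Below is a minimal sketch of it, run against canned input rather than live sar output. It assumes the older sysstat column layout for sar -d Average lines (DEV, tps, rd_sec/s, wr_sec/s, avgrq-sz, avgqu-sz, await, svctm, %util); the sample_sar lines and the aggregate_sar name are made up for illustration, and the match is done inside awk rather than with egrep.

```shell
#!/bin/bash
# Made-up "Average:" lines in the assumed older sysstat `sar -d` layout:
# Average: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
sample_sar() {
  cat <<'EOF'
Average:       dev8-0      2.00     10.00     20.00     15.00      0.10      3.00      2.00      1.00
Average:     dev253-0     40.00   9000.00     16.00    240.00      0.50      6.00      4.00     10.00
Average:     dev253-1     27.00   7709.00     10.00    260.00      0.70      8.00      6.00     22.00
EOF
}

# aggregate_sar REGEX: read sar -d output on stdin and, over the Average
# lines whose device matches REGEX, sum the counter style columns
# (tps, reads, writes) and average the remaining ones.
aggregate_sar() {
  awk -v dev="$1" '
    $1 == "Average:" && $2 ~ dev {
      tps += $3; reads += $4; writes += $5
      reqsz += $6; queue += $7; await += $8; svctm += $9; util += $10
      count++
    }
    END {
      if (count == 0) exit 1        # no matching device: avoid dividing by zero
      printf "%d %d %d %d %d %d %d %d\n",
        tps, reads, writes, reqsz/count, queue/count, await/count, svctm/count, util/count
    }'
}

sample_sar | aggregate_sar dev253
# prints: 67 16709 26 250 0 7 5 16
```

Note that %d truncates, so an average queue length of 0.6 is reported as 0. If sub-integer values matter for your graphs, scaling the value before printing is one option.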

The pass-through script takes a number of options of its own, but two come from snmpd: -g (GET) and -n (GET NEXT), each followed by the OID. The script is called by snmpd and, most importantly, configured in /etc/snmp/snmpd.conf as below:
pass .1.3.6.1.4.1.2021.255.1 /usr/local/bin/sar_iostat_snmp.sh -m 1 -d dev253 -o .1.3.6.1.4.1.2021.255.1

Now, I made the OID up, and unless you know otherwise, using that is the safest bet. The -m 1 flag is because I'm after a 1 minute average. -d is the device: run sar -d and you will see the device names. ls -la the /dev/sdX or /dev/mapper/mpath device and the device numbers will be obvious. Great. Let's test it from another host.
$ snmpwalk -v2c -c public datawarehouse .1.3.6.1.4.1.2021.255.1
UCD-SNMP-MIB::ucdavis.255.1.0 = INTEGER: 67
UCD-SNMP-MIB::ucdavis.255.1.1 = INTEGER: 16709
UCD-SNMP-MIB::ucdavis.255.1.2 = INTEGER: 26
UCD-SNMP-MIB::ucdavis.255.1.3 = INTEGER: 247
UCD-SNMP-MIB::ucdavis.255.1.4 = INTEGER: 0
UCD-SNMP-MIB::ucdavis.255.1.5 = INTEGER: 5
UCD-SNMP-MIB::ucdavis.255.1.6 = INTEGER: 5
UCD-SNMP-MIB::ucdavis.255.1.7 = INTEGER: 16

It's all working, but we need to use a command data source to get the granularity and consistency. I made a script and put it in $ZENOSS/libexec/
#!/bin/bash
NO_ARGS=0
E_OPTERROR=85

if [ $# -eq "$NO_ARGS" ]
then
  echo "Usage: `basename $0` options (-vcho)"
  exit $E_OPTERROR
fi

while getopts "v:c:h:o:" o
do
  case $o in
    v ) SNMPVER=$OPTARG;;
    c ) COMMUNITY=$OPTARG;;
    h ) HOST=$OPTARG;;
    o ) OID=$OPTARG;;
  esac
done

# Keep only the value after "= INTEGER:" from each line and join them onto one line
snmpwalk -v $SNMPVER -c $COMMUNITY $HOST $OID | cut -d"=" -f2 | cut -d":" -f 2 | awk '{ printf "%s ", $0 } END { print "" }' >/tmp/chksnmpetc.$$

read TPS READS WRITES REQSZ QUEUE AWAIT ASVC UTIL </tmp/chksnmpetc.$$

# nagios style output: status text, a pipe, then name=value performance data
printf "OK | tps=%s reads_sec=%s writes_sec=%s requests_size=%s queue_size=%s average_wait=%s average_service_time=%s utilisation=%s\n" $TPS $READS $WRITES $REQSZ $QUEUE $AWAIT $ASVC $UTIL

rm -f /tmp/chksnmpetc.$$
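The munging step can also be tried without a live snmpd. The following is a minimal sketch that feeds canned snmpwalk output (copied from the walk shown earlier) through a single awk call instead of cut and a temp file. The walk_to_perfdata and sample_walk names are hypothetical, and the UNKNOWN line with exit status 3 on short input is my assumption based on the usual nagios plugin convention, not something the script above does.

```shell
#!/bin/bash
# Canned snmpwalk output, same values as the walk shown earlier.
sample_walk() {
  cat <<'EOF'
UCD-SNMP-MIB::ucdavis.255.1.0 = INTEGER: 67
UCD-SNMP-MIB::ucdavis.255.1.1 = INTEGER: 16709
UCD-SNMP-MIB::ucdavis.255.1.2 = INTEGER: 26
UCD-SNMP-MIB::ucdavis.255.1.3 = INTEGER: 247
UCD-SNMP-MIB::ucdavis.255.1.4 = INTEGER: 0
UCD-SNMP-MIB::ucdavis.255.1.5 = INTEGER: 5
UCD-SNMP-MIB::ucdavis.255.1.6 = INTEGER: 5
UCD-SNMP-MIB::ucdavis.255.1.7 = INTEGER: 16
EOF
}

# walk_to_perfdata: read snmpwalk lines on stdin and print one nagios style
# line, pairing the values with the metric names in order.
walk_to_perfdata() {
  awk -F': ' '
    BEGIN {
      split("tps reads_sec writes_sec requests_size queue_size average_wait average_service_time utilisation", name, " ")
    }
    # The value is whatever follows "INTEGER: " on each line.
    NF >= 2 && n < 8 { n++; out = out " " name[n] "=" $NF }
    END {
      if (n == 8) {
        print "OK |" out
      } else {
        # Assumed convention: report UNKNOWN and exit 3 on incomplete data.
        print "UNKNOWN | expected 8 values, got " (n + 0)
        exit 3
      }
    }'
}

sample_walk | walk_to_perfdata
# prints: OK | tps=67 reads_sec=16709 writes_sec=26 requests_size=247 queue_size=0 average_wait=5 average_service_time=5 utilisation=16
```

In real use the input would come from snmpwalk itself, piped straight into the function.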

Awesome, now we're ready. Make a template, and in that make a command data source. Set the cycle time to 60 seconds and fill in the other required fields. The command will be:
/usr/local/zenoss/libexec/check_customiostat.sh -v 2c -c public -o .1.3.6.1.4.1.2021.255.1 -h ${here/manageIp}

If you test this against your host you should expect;


Executing command /usr/local/zenoss/libexec/check_customiostat.sh -v 2c -c public -o .1.3.6.1.4.1.2021.255.1 -h 172.1.3.4 against datawarehouse
OK | tps=61 reads_sec=15633 writes_sec=14 requests_size=253 queue_size=0 average_wait=7 average_service_time=6 utilisation=20
DONE in 0 seconds

Now, I haven't mucked with the OK status; I will build thresholds against the metrics instead. Add gauge data points for the variables passed and you can quickly get a graph that shows when your service or wait times blow out.
