bash / postfix health check / dev tcp

  • Written by
    Walter Doekes
  • Published on

Using /dev/tcp in bash for a health check? Here's an example.

I had a script that used netcat to connect to a Postfix email daemon to check its health status. To avoid SMTP command pipelining errors, I had it sleep before each write. The core looked somewhat like this:

messages=$(for x in \
        'EHLO localhost' \
        'MAIL FROM:<healthz@localhost>' \
        'RCPT TO:<postmaster@example.com>' \
        RSET \
        QUIT
    do sleep 0.1; echo "$x"; done |
        nc -v $HOST 25 2>&1 |
        tr -d '\r')

This works, but the sleeps make it slower than necessary, and more brittle: if the daemon is temporarily slow, the fixed delay is too short and we trigger a Postfix SMTP command pipelining error anyway.

Ideally, we want to read the responses, and act on them immediately instead.

Here's a script that uses bash instead of POSIX sh, because bash supports /dev/tcp redirections, which make network I/O easier.

Starting bash might be slightly costlier than starting a smaller POSIX sh like dash. But we avoid calling netcat and some other tools, so we win out not only in speed but also in resource usage.
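
For readers who have not used /dev/tcp before, here is a minimal sketch of the idea. The helper name first_line and its arguments are made up for illustration. Unlike the script below, which opens the descriptor with exec, this variant attaches fd 3 to a brace group so that it is closed automatically when the group ends.

```bash
#!/bin/bash
# Sketch: print the first line a TCP service sends, using bash's /dev/tcp.
# first_line is a hypothetical helper; host, port and timeout are examples.
first_line() {
    local host=$1 port=$2 timeout=${3:-3} line
    {
        # Read one line from the socket, giving up after $timeout seconds,
        # and print it without the trailing CR.
        IFS= read -r -t "$timeout" line <&3 &&
            printf '%s\n' "${line%$'\r'}"
    } 3<>"/dev/tcp/$host/$port"   # bash itself opens the TCP connection
}
```

Against a listener that greets immediately, a call such as `first_line 127.0.0.1 25` should print the greeting line. If the connection fails or nothing arrives in time, the function returns non-zero.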

#!/bin/bash
# Using bash (not POSIX sh) for /dev/tcp I/O.
: ${BASH_VERSION:?Use bash, not POSIX-sh}
set -u

DEBUG=0
READ_TIMEOUT=10
CHAR_CR=$'\r'

#HOST=mail.example.com
#HOST=1.2.3.4
HOST=127.0.0.1
PROXY_PROTOCOL=

if [ "$HOST" = 127.0.0.1 ]; then
    # This should succeed in mere milliseconds. But sometimes postfix
    # decides to take just over a second for the RCPT TO check.
    READ_TIMEOUT=3
    PROXY_PROTOCOL='PROXY TCP4 127.0.0.1 127.0.0.1 12345 25'  # (or empty)
fi

getresp() {
    local line status
    while :; do
        read -t $READ_TIMEOUT -r line
        [ $DEBUG -ne 0 ] && printf '%s\n' "<<< $line" >&2
        printf '%s\n' "${line%$CHAR_CR}"
        test -z "$line" && exit 65
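        status=${line:0:3}
        # Accept only 2xx. A pattern match also rejects non-numeric junk,
        # which "[ ... -lt ... ]" would only complain about on stderr.
        if [[ $status != 2[0-9][0-9] ]]; then
            exit 66
        elif [ "${line:3:1}" = ' ' ]; then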
            break   # "250 blah"
        elif [ "${line:3:1}" = '-' ]; then
            true    # "250-blah" (continue)
        else
            exit 67
        fi
    done
}

if ! exec 3<>/dev/tcp/$HOST/25; then  # open fd
    # Takes a looooot of time. 2m10 in the test case. You will want to wrap
    # this script in a timeout(1) call.
    # $0: connect: Connection timed out
    # $0: line 40: /dev/tcp/1.2.3.4/25: Connection timed out
    exit 1
fi

messages=$(for x in \
        "$PROXY_PROTOCOL" \
        'EHLO localhost' \
        'MAIL FROM:<healthz@localhost>' \
        'RCPT TO:<postmaster@example.com>' \
        RSET \
        QUIT
    do \
        [ -n "$x" -a $DEBUG -ne 0 ] && printf '>>> %s\n' "$x" >&2
        [ -n "$x" ] && printf '%s\r\n' "$x" >&3
        getresp <&3 || exit $?
    done)
ret=$?

exec 3>&- # close fd

ok=$(echo "$messages" | grep -xF '250 2.0.0 Ok')
fail=$(echo "$messages" | sed -e1d | grep -v ^2 | grep '')
if [ $ret -ne 0 -o -z "$ok" ]; then
    echo "Missing OK line (ret $ret). Got:" $messages
    false
elif [ -n "$fail" ]; then
    echo "$fail"
    false
else
    true
fi
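
The reply handling in getresp can be tried without a mail server. The following is a simplified sketch, not the function above: read_reply is a hypothetical helper that only recognises the "250-blah" (continuation) and "250 blah" (final) forms and prints the status code of the final line. It is fed a canned reply through a pipe.

```bash
#!/bin/bash
# Hypothetical helper (not part of the script above): read one complete
# SMTP reply from stdin and print its three-digit status code.
# Returns 1 on EOF or when a line does not look like an SMTP reply.
read_reply() {
    local line
    while IFS= read -r line; do
        line=${line%$'\r'}                  # strip the CR of the CRLF ending
        case $line in
            [0-9][0-9][0-9]-*) ;;           # "250-blah": more lines follow
            [0-9][0-9][0-9]" "*)            # "250 blah": final line
                printf '%s\n' "${line:0:3}"
                return 0 ;;
            *) return 1 ;;                  # malformed line
        esac
    done
    return 1                                # EOF before a final line
}

# A canned multi-line EHLO reply stands in for the server here.
printf '250-mail.example.com\r\n250-PIPELINING\r\n250 SMTPUTF8\r\n' | read_reply
```

Because the loop only returns once it sees a final line, the caller knows the server is done talking before it sends the next command. That is what makes the sleeps unnecessary.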

One can invoke this from something like an HAProxy agent-check script. This is also the place to add a timeout call.

#!/bin/sh
CHECK_SCRIPT=/usr/local/bin/postfix-is-healthy

if test -f /srv/in_maintenance; then
    echo 'drain stopped 0%'
elif error=$(timeout -k3 -s1 2 $CHECK_SCRIPT 2>&1); then
    echo 'ready up 100%'
else
    echo 'fail #' $error
fi
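
To see what the timeout wrapper does when the check hangs, here is a small sketch. It assumes GNU coreutils timeout, uses a made-up agent_reply function, shortens the limit to one second, and lets "sleep 5" stand in for a stuck check script.

```bash
#!/bin/bash
# Sketch: turn a command's outcome into an agent-check style reply.
# agent_reply is a hypothetical wrapper; assumes GNU coreutils timeout.
agent_reply() {
    local error
    # Send SIGHUP (-s1) after 1 second, SIGKILL 3 seconds later (-k3).
    if error=$(timeout -k3 -s1 1 "$@" 2>&1); then
        echo 'ready up 100%'
    else
        # $? still holds the status of the failed command substitution.
        echo "fail # status $?: $error"
    fi
}

agent_reply true      # prints: ready up 100%
agent_reply sleep 5   # prints a line starting with: fail # status 124
```

GNU timeout exits with status 124 when the command ran out of time, which makes a hung check easy to tell apart from one that failed on its own.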

And that script could be invoked from something like xinetd:

service postfix-haproxy
{
        flags           = REUSE
        socket_type     = stream
        type            = unlisted
        port            = 1025
        wait            = no
        user            = nobody
        server          = /usr/local/bin/postfix-haproxy-agent-check
        log_on_failure  += USERID
        disable         = no
        only_from       = 0.0.0.0/0 ::1
        per_source      = UNLIMITED
}
