Mon Dec 13 18:11:16 2010                        Michael Jennings (mej)

Initial check-in.
----------------------------------------------------------------------
Mon Dec 13 19:12:54 2010                        Michael Jennings (mej)

Work in progress:  Initial skeleton for node health check.
----------------------------------------------------------------------
Tue Dec 14 11:15:23 2010                        Michael Jennings (mej)

Completed driver script for health check.  Now to write checks and
test it.
----------------------------------------------------------------------
Wed Dec 15 17:01:43 2010                        Michael Jennings (mej)

Add packaging goop.
----------------------------------------------------------------------
Wed Dec 15 19:10:33 2010                        Michael Jennings (mej)

Added some checks and a sample config.  In the process of debugging.
----------------------------------------------------------------------
Wed Dec 15 20:51:38 2010                        Michael Jennings (mej)

Debugging errors in the FS check.
----------------------------------------------------------------------
Wed Dec 15 21:19:35 2010                        Michael Jennings (mej)

Thanks to Greg, I fixed the handling of subshell (pipeline)
vs. non-subshell while loops.  Everything appears to be working now.
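
For illustration, the subshell pitfall can be sketched as follows.
This is a minimal example rather than the actual NHC code, and both
function names are hypothetical.  It assumes bash's default settings
(the "lastpipe" option is off).

```bash
# Pipeline: the while loop runs in a subshell, so COUNT is lost.
count_piped_sketch() {
    local COUNT=0 LINE
    printf 'a\nb\nc\n' | while read -r LINE ; do
        COUNT=$((COUNT + 1))
    done
    echo "$COUNT"
}

# Redirection: the loop runs in the current shell, so COUNT survives.
count_redirected_sketch() {
    local COUNT=0 LINE
    while read -r LINE ; do
        COUNT=$((COUNT + 1))
    done <<< $'a\nb\nc'
    echo "$COUNT"
}

count_piped_sketch        # prints 0
count_redirected_sketch   # prints 3
```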
----------------------------------------------------------------------
Thu Dec 16 20:51:57 2010                        Michael Jennings (mej)

Allow for more readable config files by stripping surrounding
whitespace from target and check values.

Convert comment check to native bash to avoid spawning grep and a
subshell.

Fix bug with regexp targets in config.  Forgot to *actually* strip off
the slashes...
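
A rough sketch of how these three steps can be done with bash
built-ins only is shown below.  The function names are hypothetical
and the real implementation may differ.

```bash
# Strip surrounding whitespace using only parameter expansion.
nhc_trim_sketch() {
    local STR="$1"
    STR="${STR#"${STR%%[![:space:]]*}"}"    # remove leading whitespace
    STR="${STR%"${STR##*[![:space:]]}"}"    # remove trailing whitespace
    printf '%s\n' "$STR"
}

# Succeed if a config line is blank or a comment, without spawning grep.
nhc_is_comment_sketch() {
    local LINE="${1#"${1%%[![:space:]]*}"}"
    [[ -z "$LINE" || "${LINE:0:1}" == "#" ]]
}

# Remove the slashes that delimit a regexp target such as /^n[0-9]+$/.
nhc_strip_slashes_sketch() {
    local TARGET="$1"
    if [[ ${#TARGET} -ge 2 && "${TARGET:0:1}" == "/" && "${TARGET: -1}" == "/" ]]; then
        TARGET="${TARGET:1:${#TARGET}-2}"
    fi
    printf '%s\n' "$TARGET"
}
```

Calling these through $(...) would itself fork, so code that cares
about subprocess count would assign to a variable instead of printing.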
----------------------------------------------------------------------
Thu Dec 16 21:24:58 2010                        Michael Jennings (mej)

Fix output redirection and debugging.
----------------------------------------------------------------------
Thu Mar 31 15:50:46 2011                        Michael Jennings (mej)

Fix spec to package all installed files and directories.

Wrap I/O in "eval" to make sure $LOGFILE redirection symbols are used
properly.

Fix typo in $SILENT check.

Fix config file parsing to be compatible with bash >= 3.2.
----------------------------------------------------------------------
Thu Mar 31 17:46:42 2011                        Michael Jennings (mej)

Don't log timestamp; we're trying to avoid subprocesses.

On error, syslog the reason.
----------------------------------------------------------------------
Thu Apr 21 18:46:17 2011                        Michael Jennings (mej)

This is a work in progress.  I'm testing a bunch of stuff, so some or
all of it may end up not working.  We'll see.

 - Added routine to gather /etc/passwd data into arrays
 - Added userid-to-UID mapping function
 - Consolidated process checks to use a single spawning of "ps"
 - Added routine to gather list from TORQUE of users who currently
   have jobs running on the node
 - Added check for unauthorized processes running on the node
 - Added timeout in background to kill nhc if it hangs to avoid
   hanging pbs_mom
 - Eliminated several unnecessary forks
----------------------------------------------------------------------
Fri Apr 22 14:03:32 2011                        Michael Jennings (mej)

Still needs some debugging, but I've successfully eliminated all but 1
subprocess (the "ps" command).  Quite good given all the script does
so far.

Also added the beginnings of a test script for making sure the
individual functions work as advertised.
----------------------------------------------------------------------
Mon Apr 25 13:00:10 2011                        Michael Jennings (mej)

Final fixups for UID check.  Everything appears to be working well
now.
----------------------------------------------------------------------
Wed Apr 27 12:19:11 2011                        Michael Jennings (mej)

Added check to verify user processes descend from pbs_mom.

Added flexible regexp/glob match check.

Renamed utility functions to nhc_* so that only user-usable checks
start with check_*.

Added syslog function to save syslog messages until script
termination.
----------------------------------------------------------------------
Mon May  2 18:51:27 2011                        Michael Jennings (mej)

Added checks for CPU socket/core/thread counts and total/free
RAM/swap/memory.
----------------------------------------------------------------------
Tue May  3 19:07:02 2011                        Michael Jennings (mej)

Missed a file.
----------------------------------------------------------------------
Wed May  4 17:49:59 2011                        Michael Jennings (mej)

Minor cleanups to check_ps_kswapd().
----------------------------------------------------------------------
Fri May  6 13:12:24 2011                               Yong Qin (yqin)

Added check for Infiniband.
----------------------------------------------------------------------
Fri May  6 14:41:23 2011                               Yong Qin (yqin)

A minor bug fix.
----------------------------------------------------------------------
Tue May 10 01:26:14 2011                        Michael Jennings (mej)

Bump version.
----------------------------------------------------------------------
Tue May 10 13:01:16 2011                               Yong Qin (yqin)

Added checks for Myrinet and Ethernet. Minor bug fix.
----------------------------------------------------------------------
Thu May 12 08:12:48 2011                        Michael Jennings (mej)

Try alternate mechanism for IB port checks.
----------------------------------------------------------------------
Wed May 18 15:57:49 2011                        Michael Jennings (mej)

Fix parsing bug.  Due to bash not properly "escaping" expanded
variables inside ${VAR#...} constructs, config file lines must not
contain more than one occurrence of "||" any more.

Direct output to /dev/null, then redirect if $LOGFILE is set.
----------------------------------------------------------------------
Tue May 24 15:33:25 2011                        Michael Jennings (mej)

Support older single-core, non-HT CPUs in /proc/cpuinfo.
----------------------------------------------------------------------
Thu Sep  1 16:16:38 2011                        Michael Jennings (mej)

Fixed status reporting and added NHC label to offline message.
----------------------------------------------------------------------
Mon Sep 19 17:13:11 2011                        Michael Jennings (mej)

Bump version to 1.1.

This release adds the ability to detect previously-set notes for nodes
and not overwrite them.

It will also clear notes and online nodes if all checks pass for a
node that had previously had check errors.  It will only do this for
nodes whose notes begin with "NHC" to avoid bringing nodes online
which were manually offlined.  Nodes marked offline which have no note
are not distinguished from down nodes and may be brought online if the
error(s) clear.
----------------------------------------------------------------------
Thu Oct 13 16:04:20 2011                        Michael Jennings (mej)

Output onlining/offlining of nodes to log (with timestamp).

Log failure of health check to logfile as well as syslog.
----------------------------------------------------------------------
Wed Jan 25 15:40:58 2012                        Michael Jennings (mej)

Convert to autoconf/automake for build.
----------------------------------------------------------------------
Tue Feb  7 11:11:12 2012                        Michael Jennings (mej)

Various fixes and release of 1.1.4.
----------------------------------------------------------------------
Tue Mar 13 11:22:45 2012                        Michael Jennings (mej)

Bump version.  More consistency/cleanups.
----------------------------------------------------------------------
Fri May  4 11:31:47 2012                        Michael Jennings (mej)

Remove debugging stuff for UIDs > 100.  I don't really use it anyway,
and some people may want to run as other users.

Convert node online/offline scripts to use variables and $PATH to
identify where the "pbsnodes" command is and what arguments it should
take.

Add an "eval" to the execution of the check so that shell variables
can be used or altered in config files.
----------------------------------------------------------------------
Wed May  9 12:12:54 2012                        Michael Jennings (mej)

Always use [[ ]] instead of [ ] (primarily for consistency).

Add customization of resource manager daemon match expression and
greater control over pbsnodes commands in online/offline helpers.
----------------------------------------------------------------------
Wed May  9 12:22:20 2012                        Michael Jennings (mej)

Fix a couple conditional expressions from the last commit.
----------------------------------------------------------------------
Tue May 15 09:05:23 2012                        Michael Jennings (mej)

Make sure nodes with no job files still work.
----------------------------------------------------------------------
Wed May 16 13:50:20 2012                        Michael Jennings (mej)

Fix bug pointed out by Ole Holm Nielsen <ole.h.nielsen@fysik.dtu.dk>
which caused the new "eval" of config file lines to barf on the
regular expression with parentheses in the sample config.  Going
forward, users will need to take care to escape shell metacharacters
appropriately in config files.
----------------------------------------------------------------------
Fri Jun 22 15:26:40 2012                        Michael Jennings (mej)

Add stubs for unit test and benchmarking scripts.

Convert main NHC driver script to use functions so that it can be
loaded without needing to be executed and to facilitate testing of
some of its functionality.
----------------------------------------------------------------------
Fri Jun 29 17:52:30 2012                        Michael Jennings (mej)

I smell unit tests!
----------------------------------------------------------------------
Mon Jul  2 16:33:54 2012                        Michael Jennings (mej)

Move unit test framework to separate file.  Now called "SHUT."

Override output functions to suppress normal NHC I/O and exception
handling.

Major refactoring of test framework to allow named tests and progress
output.

Added lots more unit tests for main nhc script.
----------------------------------------------------------------------
Mon Jul  2 17:17:38 2012                        Michael Jennings (mej)

Initial test files for each check script.
----------------------------------------------------------------------
Thu Aug 16 18:12:36 2012                        Michael Jennings (mej)

More work on unit tests:
 - Report number of tests skipped, if any.
 - Add tests for "common.nhc" module.
 - Add tests for "ww_fs.nhc" module.
 - Fix typos in external match checks.
----------------------------------------------------------------------
Fri Aug 24 15:29:17 2012                        Michael Jennings (mej)

Finished hardware unit tests.
----------------------------------------------------------------------
Mon Aug 27 14:52:27 2012                        Michael Jennings (mej)

Unit tests are finally done!  Should be 100% coverage on end-user
checks too, though I don't know of any "gcov" equivalents for
bash....  ;-)

TODO:  More checks!
----------------------------------------------------------------------
Mon Aug 27 16:39:47 2012                        Michael Jennings (mej)

Build fixes, alternative skip syntax, and unit test changes to allow
"make test" in the spec file.  Tested on RHEL4, 5, and 6 and in chroot
jails and VNFS images.
----------------------------------------------------------------------
Tue Sep  4 13:01:10 2012                        Michael Jennings (mej)

Initial support for the nVidia HealthMon tool for checking the status
of nVidia CUDA GPU devices.  More information can be found with the
Tesla Deployment Kit version 3 (currently in RC status).
----------------------------------------------------------------------
Tue Sep  4 17:50:00 2012                        Michael Jennings (mej)

Add check for blacklisted processes.
----------------------------------------------------------------------
Wed Sep  5 16:39:59 2012                        Michael Jennings (mej)

New checks for filesystem size/used/free limits based on "df" output.
Refactored check_fs_mount() to only read /proc/mounts once and
populate central array set (just like all the other modules).
Refactored unit tests accordingly.
----------------------------------------------------------------------
Thu Sep  6 09:57:41 2012                        Michael Jennings (mej)

Added support for detached mode.  Runs all checks in the background,
saves state to filesystem and checks it on the next run.
----------------------------------------------------------------------
Fri Sep  7 14:14:41 2012                        Michael Jennings (mej)

Added unit tests for new disk space checks.  Tweaked detached mode to
detach sooner.  Fixed some faulty logic.
----------------------------------------------------------------------
Fri Sep  7 14:52:49 2012                        Michael Jennings (mej)

A couple minor bugfixes/cleanups.  This is now officially 1.2 beta.
----------------------------------------------------------------------
Wed Oct  3 16:09:05 2012                        Michael Jennings (mej)

Finalized 1.2 release.
----------------------------------------------------------------------
Thu Oct 25 18:29:13 2012                        Michael Jennings (mej)

Add support for NHC log rotation.
----------------------------------------------------------------------
Fri Oct 26 17:17:51 2012                        Michael Jennings (mej)

Add support and unit tests for an "authorized users" whitelist.
----------------------------------------------------------------------
Mon Oct 29 17:58:30 2012                        Michael Jennings (mej)

By default, don't touch nodes that are offline but have no note.  Not
every site uses notes as religiously as we do, nor wants to!
----------------------------------------------------------------------
Tue Oct 30 14:17:40 2012                        Michael Jennings (mej)

Fix job file location fallback handling, and look up userids for
processes where only UID is given as this may indicate a userid >8
characters rather than an unknown user.
----------------------------------------------------------------------
Tue Nov  6 13:28:13 2012                        Michael Jennings (mej)

Add nhc.cron script contributed by Ole Holm Nielsen
<Ole.H.Nielsen@fysik.dtu.dk> to help minimize excessive messages from
NHC when executed via cron.
----------------------------------------------------------------------
Wed Nov  7 14:23:44 2012                        Michael Jennings (mej)

Finalized 1.2.1 release.
----------------------------------------------------------------------
Wed Nov  7 15:24:26 2012                        Michael Jennings (mej)

Found a bug.  Re-releasing 1.2.1.
----------------------------------------------------------------------
Tue Nov 27 16:56:25 2012                        Michael Jennings (mej)

Despite being specified by POSIX, apparently bash's built-in "kill"
command doesn't support signaling process groups.  The watchdog timer
has been rewritten to just kill the nhc script itself.  Unit tests for
the watchdog timer were also added.
----------------------------------------------------------------------
Thu Nov 29 17:57:23 2012                        Michael Jennings (mej)

New check:  check_hw_mcelog

This check will run "mcelog --client" by default and fail if any
output is received.  If the mcelog daemon is not running, this will be
noted in the log file and syslog, but the check will pass.
----------------------------------------------------------------------
Mon Dec 17 11:03:13 2012                        Michael Jennings (mej)

Reset IFS in die() handler and add quotes to traps.  This should
prevent newlines being added to failure messages when certain
subcommands cause timeouts.
----------------------------------------------------------------------
Wed Jan 16 12:14:23 2013                        Michael Jennings (mej)

Patch from John Hanks <john.hanks@usu.edu> for basic pdsh-style node
range support.  Node ranges are now permitted and must be surrounded
by braces (e.g., "{n00[00-99].cluster}").  Multiple ranges may be
specified by separating them with commas (e.g., "{node[0-5],node8}"),
but commas may NOT be used inside the brackets ("{node[0-5,8]}").

This feature should be considered experimental at this point.  Please
report any mismatches.
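
To make the syntax concrete, here is a hypothetical matcher for the
range format described above.  It is a sketch, not the contributed
patch: the function name is invented, and the rule that the numeric
part must be as wide as the range bounds is an assumption.

```bash
# Succeed if HOST matches a brace-delimited, comma-separated list of
# literal names and bracketed numeric ranges.
nhc_range_match_sketch() {
    local HOST="$1" SPEC="$2" PART PREFIX SUFFIX LO HI MID
    local RANGE_RE='^([^][]*)\[([0-9]+)-([0-9]+)\](.*)$'
    local NUM_RE='^[0-9]+$'
    local -a PARTS

    SPEC=${SPEC#\{}
    SPEC=${SPEC%\}}
    IFS=',' read -r -a PARTS <<< "$SPEC"
    for PART in "${PARTS[@]}" ; do
        if [[ "$PART" =~ $RANGE_RE ]]; then
            PREFIX="${BASH_REMATCH[1]}" ; LO="${BASH_REMATCH[2]}"
            HI="${BASH_REMATCH[3]}" ; SUFFIX="${BASH_REMATCH[4]}"
            # Host must begin with the prefix and end with the suffix.
            [[ "$HOST" == "$PREFIX"*"$SUFFIX" ]] || continue
            MID="${HOST#"$PREFIX"}"
            MID="${MID%"$SUFFIX"}"
            [[ "$MID" =~ $NUM_RE ]] || continue
            (( ${#MID} >= ${#LO} && ${#MID} <= ${#HI} )) || continue
            # Force base 10 so zero-padded numbers are not read as octal.
            (( 10#$MID >= 10#$LO && 10#$MID <= 10#$HI )) && return 0
        elif [[ "$HOST" == "$PART" ]]; then
            return 0
        fi
    done
    return 1
}

nhc_range_match_sketch n0042.cluster '{n00[00-99].cluster}' && echo "in range"
```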
----------------------------------------------------------------------
Wed Jan 16 14:24:02 2013                        Michael Jennings (mej)

I've added fallback support for LDAP, NIS, etc. via "getent" based on
a suggestion and proposed patch from John Hanks <john.hanks@usu.edu>.

For users using any solution for passwd resolution other than local
/etc/passwd, there are now 2 possible alternatives.

One, you can override the use of /etc/passwd as the source of passwd
data.  You can reference any file on the filesystem that's locally
accessible by setting PASSWD_DATA_SRC to the filename you want NHC to
use.  This could be used to read from a cache file generated by, e.g.,
"ypcat passwd" or a similar command.  This should also work with
process substitution, so you can specify something like
PASSWD_DATA_SRC='<(ypcat passwd)' instead of using a file.  (Note,
though, that this will block.  Insert associated caveats here.)

Two, if reading from PASSWD_DATA_SRC fails, NHC will use "getent" on
an as-needed basis to populate its internal data structures.  Note
that this will mean 1 execution of getent PER missing userid or UID.
Once a particular passwd entry has been retrieved, the information
will be cached and used throughout that NHC run.  Subsequent
executions of NHC will have to execute getent again for each missing
entry.

I do not have such a system, so these changes are largely untested.
Feedback is humbly requested.  :-)
----------------------------------------------------------------------
Tue Jan 22 13:32:47 2013                        Michael Jennings (mej)

Bare parenthesized regular expressions are incompatible with bash 3 in
RHEL4 and RHEL5.  Quoted regular expressions are incompatible with
bash 4 in RHEL6.  Why?  Because someone decided in bash 4 to allow
parts of regular expressions to be quoted and matched as strings
instead.  Why?  Beats me.

Way to go, bash developers.  That's the kind of incompatibility I'd
expect from Python. :-/

Thankfully, storing the regular expressions in variables works just
fine, so we'll do that.
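
A minimal sketch of the workaround (the function name and the sample
pattern are illustrative only):

```bash
# Keep the pattern in a variable and expand it unquoted on the
# right-hand side of =~ so it is treated as a regular expression.
nhc_regex_match_sketch() {
    local STRING="$1" REGEX="$2"
    [[ "$STRING" =~ $REGEX ]]
}

nhc_regex_match_sketch 'n0042' '^n(00|01)[0-9]+$' && echo "match"
```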
----------------------------------------------------------------------
Tue Jan 22 16:56:03 2013                        Michael Jennings (mej)

1.2.2 has been released.
----------------------------------------------------------------------
Tue Jan 22 16:58:44 2013                        Michael Jennings (mej)

Setting up the tree for 1.2.3 development.
----------------------------------------------------------------------
Mon Feb  4 11:20:34 2013                        Michael Jennings (mej)

Fix from Ole Holm Nielsen <Ole.H.Nielsen@fysik.dtu.dk> for his
nhc.cron script to place transient files in /var/lib/nhc rather than
/tmp to avoid potential symlink issues.  Based on suggestions on the
hpc-monitoring Google Group from Stuart Barkley <google@4gh.net> and
Jesse Becker <hawson@gmail.com>.
----------------------------------------------------------------------
Mon Feb 11 14:22:29 2013                        Michael Jennings (mej)

Added variable $DETACHED_MODE_FAIL_NODATA which can be set to 1 to
cause detached mode to return failure by default, instead of success,
when no results file is present from a previous run.

Also made sure that results files which are older than /proc,
indicating a reboot since the last run, are considered stale and
removed.
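
The logic can be sketched roughly as below.  The function name, the
optional second argument, and the idea that the results file holds a
numeric exit status on its first line are all assumptions made for
the example; only $DETACHED_MODE_FAIL_NODATA comes from NHC itself.

```bash
nhc_check_previous_sketch() {
    local RESULTFILE="$1" REFERENCE="${2:-/proc}" RC

    # Older than the reference: treat as stale and remove it.
    if [[ -e "$RESULTFILE" && "$RESULTFILE" -ot "$REFERENCE" ]]; then
        rm -f "$RESULTFILE"
    fi
    # No data: succeed unless DETACHED_MODE_FAIL_NODATA is set to 1.
    if [[ ! -e "$RESULTFILE" ]]; then
        [[ "${DETACHED_MODE_FAIL_NODATA:-0}" == "1" ]] && return 1
        return 0
    fi
    # Assumption for this sketch: first line is the saved exit status.
    read -r RC < "$RESULTFILE"
    return "${RC:-0}"
}
```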
----------------------------------------------------------------------
Mon Mar 11 14:47:19 2013                        Michael Jennings (mej)

Add DMI data gatherer and corresponding checks.  Definitely not
speedy, but there's an awful lot of very valuable information that can
be gleaned from it.  And, as always, once you take the initial hit,
you can write as many DMI checks as you like with minimal added
overhead.

Still need to write the unit tests for the new checks.
----------------------------------------------------------------------
Mon Mar 11 17:43:40 2013                        Michael Jennings (mej)

Added unit tests and auto-fu for DMI stuff.
----------------------------------------------------------------------
Tue Mar 12 16:39:17 2013                        Michael Jennings (mej)

Merged and tweaked patch from Aleksey Senin <aleksey@senin.name> to
support Infiniband device names in check_hw_ib.
----------------------------------------------------------------------
Tue Mar 12 16:39:19 2013                        Michael Jennings (mej)

Add .gitignore file.
----------------------------------------------------------------------
Wed Mar 13 15:29:34 2013                        Michael Jennings (mej)

Auto-detect resource manager on startup for later use.
----------------------------------------------------------------------
Wed Mar 13 15:29:37 2013                        Michael Jennings (mej)

Move nhc_fs_[un]parse_size() to common for use in other checks.
----------------------------------------------------------------------
Wed Mar 13 15:29:39 2013                        Michael Jennings (mej)

Convert existing checks to support multiple resource managers via
$NHC_RM.
----------------------------------------------------------------------
Thu Mar 14 21:08:53 2013                        Michael Jennings (mej)

Merged in contributions from Dustin Rice <dustin@alaska.edu> to add
SLURM support to the node online/offline scripts.  Also added SGE
support (sort of, since it's not actually necessary).
----------------------------------------------------------------------
Thu Mar 14 21:14:11 2013                        Michael Jennings (mej)

Fix "make test" by assuming PBS for testing purposes.
----------------------------------------------------------------------
Fri Mar 15 16:38:28 2013                        Michael Jennings (mej)

Bump to version 1.3.
----------------------------------------------------------------------
Mon Mar 18 15:53:15 2013                        Michael Jennings (mej)

Simulate svnversion with git to fix RPM release numbers.
----------------------------------------------------------------------
Wed Mar 20 11:58:08 2013                        Michael Jennings (mej)

Fix issues with SLURM support found during testing.  Since $(...) does
not treat quotes as metacharacters, we'll need to hard-code the sinfo
arguments for obtaining the node status listing.  If someone has a
better way, I'm all ears (eyes?)!

As far as I can tell, SLURM support is now fully functional.
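
The quoting behaviour in question can be demonstrated in isolation.
The snippet below is only an illustration with placeholder arguments;
the array form is one possible alternative and is not what NHC does.

```bash
count_args_sketch() { echo "$#" ; }

# Quotes stored in a variable are ordinary characters after expansion,
# so the quoted string is split into two words.
ARGS_STRING='-o "%N %T" -h'
count_args_sketch $ARGS_STRING          # prints 4

# An array keeps each argument intact.
ARGS_ARRAY=(-o "%N %T" -h)
count_args_sketch "${ARGS_ARRAY[@]}"    # prints 3
```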
----------------------------------------------------------------------
Wed Mar 20 12:03:37 2013                        Michael Jennings (mej)

Oops; missed a few spots.
----------------------------------------------------------------------
Wed Mar 20 13:37:30 2013                        Michael Jennings (mej)

Preliminary work to integrate Grid Engine support into NHC.  This
currently still requires an external input loop, but I am planning to
do that in nhc as well.
----------------------------------------------------------------------
Wed Mar 20 14:41:55 2013                        Michael Jennings (mej)

Finish merging the Grid Engine wrapper into NHC.  The nhc script is
now directly callable as an SGE/UGE/*GE "load sensor."
----------------------------------------------------------------------
Thu Mar 21 11:48:40 2013                        Michael Jennings (mej)

Use "DRAIN" state instead of "DOWN" as the latter will terminate
running jobs on the node.  We don't want that.  :-)
----------------------------------------------------------------------
Thu Mar 21 11:48:42 2013                        Michael Jennings (mej)

More tweaks to handling of SLURM node states.
----------------------------------------------------------------------
Thu Mar 21 14:42:56 2013                        Michael Jennings (mej)

Another node state I didn't know about.
----------------------------------------------------------------------
Fri Mar 22 12:35:26 2013                        Michael Jennings (mej)

Add online/offline support for IBM Platform LSF.

NOTE:  This is based entirely on documentation and has not been tested
at all.  If you are an LSF user and are willing to help test, please
contact me!  I haven't found a way to have LSF run NHC on the nodes
yet, so for now it would need to run out of cron or similar.
----------------------------------------------------------------------
Mon Apr 01 16:27:08 2013                        Michael Jennings (mej)

Add support for both command line options and arbitrary environment
variable setting on the command line.  A few limited options are
available; see "nhc -h" for details.  For example, you can turn on
debugging and set an alternate config file location using:

    # nhc -d -c /etc/nhc/alternate.conf

All other configuration settings can be manipulated on the command
line using an env-like VAR=value syntax.  So for example, if you
wanted to disable marking online/offline of nodes and set the maximum
system UID to 499, you can now do:

    # nhc MARK_OFFLINE=0 MAX_SYS_UID=499

Note that these parameters WILL be overridden by the config file if
they're set there!
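
As a rough sketch of this style of parsing (not the actual nhc code;
the function name and the CONFFILE and DEBUG variable names are
assumptions made for the example):

```bash
nhc_parse_cmdline_sketch() {
    local OPTION ARG
    local VAR_RE='^[A-Za-z_][A-Za-z0-9_]*='
    local OPTIND=1

    while getopts ":c:d" OPTION ; do
        case "$OPTION" in
            c) CONFFILE="$OPTARG" ;;
            d) DEBUG=1 ;;
            *) echo "Bad or incomplete option -$OPTARG" >&2 ; return 1 ;;
        esac
    done
    shift $((OPTIND - 1))

    # Remaining arguments are treated as env-like VAR=value settings.
    for ARG in "$@" ; do
        if [[ "$ARG" =~ $VAR_RE ]]; then
            # Assign without eval so the value is never re-parsed.
            printf -v "${ARG%%=*}" '%s' "${ARG#*=}"
        else
            echo "Ignoring argument \"$ARG\"" >&2
        fi
    done
}

nhc_parse_cmdline_sketch -d -c /etc/nhc/alternate.conf MARK_OFFLINE=0
```

Because the settings are applied before the config file is read in
this sketch, a config file that sets the same variable would win, as
noted above.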
----------------------------------------------------------------------
Mon Apr 01 18:09:03 2013                        Michael Jennings (mej)

Added 3 new checks:  check_fs_inodes(), check_fs_ifree(), and
check_fs_iused().  They perform the same tasks as their
check_fs_{size,used,free}() counterparts except using inode count
instead of byte count.
----------------------------------------------------------------------
Mon Apr 01 18:09:06 2013                        Michael Jennings (mej)

Add support for byte suffixes to all
check_hw_{physmem,swap,mem}{,_free}() tests.
----------------------------------------------------------------------
Tue Apr 02 17:09:16 2013                        Michael Jennings (mej)

Added new check:  check_file_contents() will scan through a file
looking for matches to one or more patterns (regular expressions or
globs).  The check will succeed iff all patterns are successfully
matched against individual lines in the file.

Some real-world usage examples:
    check_file_contents /etc/passwd '/^root:x:0:0:[^:]*:/root:/bin/[a-z]*sh$/'
    check_file_contents /etc/passwd 'adminusr:*' 'slurm:*' 'sshd:*'
    check_file_contents /var/spool/torque/mom_priv/config '$pbsserver master'
    check_file_contents /etc/hosts '10.0.0.10*master'
    check_file_contents /proc/cgroups 'cpuset*1'
----------------------------------------------------------------------
Wed Apr 03 15:10:12 2013                        Michael Jennings (mej)

More minor issues found during testing.
----------------------------------------------------------------------
Wed Apr 03 15:10:14 2013                        Michael Jennings (mej)

I'll take "Things Missed by Unit Tests for $100, Alex."
----------------------------------------------------------------------
Wed Apr 03 15:10:17 2013                        Michael Jennings (mej)

Note to self:  Write more unit tests based on potential stupid
mistakes an admin could make rather than just real-world usage.
----------------------------------------------------------------------
Fri Apr 05 18:26:52 2013                        Michael Jennings (mej)

Release version 1.3.
----------------------------------------------------------------------
Wed Jun 19 11:57:55 2013                        Michael Jennings (mej)

SLURM doesn't prohibit nodes with subdomains, so don't forcibly
eliminate them.
----------------------------------------------------------------------
Fri Sep 27 14:10:07 2013                        Michael Jennings (mej)

Use "bsdtime" option instead of "time" to make sure we get a time
value we can easily parse (MMM:SS instead of DD-HH:MM:SS).  Thanks to
John Hanks <john.hanks@usu.edu> for pointing out this issue!
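
To show why the MMM:SS form is easy to handle, here is a small sketch
that converts it to seconds in pure bash.  The function name is
hypothetical, and whether NHC converts to seconds at all is not
stated here.

```bash
# Convert a "MMM:SS" CPU time string into seconds.  The 10# prefix
# forces base 10 so that values such as "07" are not read as octal.
nhc_bsdtime_to_secs_sketch() {
    local MINS="${1%%:*}" SECS="${1##*:}"
    echo $(( 10#$MINS * 60 + 10#$SECS ))
}

nhc_bsdtime_to_secs_sketch "123:45"    # prints 7425
```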
----------------------------------------------------------------------
Mon Oct 07 16:14:08 2013                        Michael Jennings (mej)

If logging to the logfile fails for some reason, syslog an error
message and redirect to /dev/null.  Thanks to Ole Holm Nielsen
<Ole.H.Nielsen@fysik.dtu.dk> for catching this issue!
----------------------------------------------------------------------
Wed Nov 06 13:55:39 2013                        Michael Jennings (mej)

Increment test count on failed test too.
----------------------------------------------------------------------
Wed Nov 06 13:55:41 2013                        Michael Jennings (mej)

Add new check:  check_ps_service [options] <service>

This check takes a service-oriented posture.  It's similar to
check_ps_daemon but allows for glob- and regexp-based matching along
with optionally restarting the service if it's not running.  The user
can also specify arbitrary commands to be run if the appropriate
service daemon isn't (or is) found.
----------------------------------------------------------------------
Tue Jan 14 07:41:12 2014                        Michael Jennings (mej)

Add new "test-debug" target for generating verbose debugging output
when running the unit test suite.

Work around bash regexp implementations which do not support the \b
"word boundary" binding operator.
----------------------------------------------------------------------
Tue Jan 14 08:15:39 2014                        Michael Jennings (mej)

Add support for timestamping in log/debug output based on bash
$SECONDS variable.  This adds a single fork() in order to get the
current UNIX time_t value via date(1).  Off by default unless
debugging.
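
The idea can be sketched as follows.  The variable and function names
are invented for the example and the real code may be organized
differently.

```bash
# One fork at startup to learn the current UNIX time...
NHC_START_TS_SKETCH=$(date '+%s')
SECONDS=0

# ...after which the current time is derived without spawning
# anything.  The result is stored in NOW_SKETCH rather than printed so
# that callers do not need a $(...) subshell.
nhc_now_sketch() {
    NOW_SKETCH=$((NHC_START_TS_SKETCH + SECONDS))
}

nhc_now_sketch
echo "[$NOW_SKETCH] starting checks"
```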
----------------------------------------------------------------------
Sun Feb 09 02:24:11 2014                        Michael Jennings (mej)

Properly handle missing script files.
----------------------------------------------------------------------
Sun Feb 09 02:24:18 2014                        Michael Jennings (mej)

Preliminary setup for allowing checks to be done in a non-fatal manner
for general monitoring purposes.
----------------------------------------------------------------------
Sun Feb 09 02:24:26 2014                        Michael Jennings (mej)

Add option (-a) and config variable (NHC_CHECK_ALL) to make individual
checks non-fatal.  This will cause NHC to continue running all checks,
even if one or more of them fail, until it finishes.  It then reports
how many checks failed and returns that number as its exit status.

This is intended for more general monitoring use (e.g., from cron).
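
A simplified sketch of the behavior, assuming each check is a command
string run via eval (the function name is hypothetical; only
NHC_CHECK_ALL is taken from NHC):

```bash
nhc_run_checks_sketch() {
    local CHECK FAIL_CNT=0
    for CHECK in "$@" ; do
        if ! eval "$CHECK" ; then
            if [[ "${NHC_CHECK_ALL:-0}" == "1" ]]; then
                # Non-fatal mode: count the failure and keep going.
                FAIL_CNT=$((FAIL_CNT + 1))
            else
                # Default mode: the first failure ends the run.
                return 1
            fi
        fi
    done
    return $FAIL_CNT
}
```

Note that an exit status is limited to 8 bits, so a count returned
this way is only reliable below 256.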
----------------------------------------------------------------------
Sun Feb 09 02:24:34 2014                        Michael Jennings (mej)

Don't assume device files will always be there.
----------------------------------------------------------------------
Mon Feb 10 16:57:54 2014                        Michael Jennings (mej)

Added two new options for check_ps_service (-s and -k) to stop/kill
services which *are* running.  Similar to check_ps_blacklist but
allows for blacklisted services to be actively terminated by NHC.
----------------------------------------------------------------------
Mon Feb 10 16:57:57 2014                        Michael Jennings (mej)

Support negated user (owner) matches in check_ps_service just like in
check_ps_blacklist.
----------------------------------------------------------------------
Mon Feb 10 16:58:01 2014                        Michael Jennings (mej)

This was going to be 1.3.1, but the changes are extensive enough that
it will need to be a 1.4 release.  Bump version accordingly.
----------------------------------------------------------------------
Mon Feb 10 17:01:53 2014                        Michael Jennings (mej)

Add another svnversion fallback to spec file and uncomment.
----------------------------------------------------------------------
Wed Feb 12 12:29:58 2014                        Michael Jennings (mej)

Remove the reference to $BASH_SUBSHELL.  I now realize what it does,
and it's not at all what I was intending.
----------------------------------------------------------------------
Wed Feb 12 13:29:09 2014                        Michael Jennings (mej)

Tweak how $OFFLINE_NODE and $ONLINE_NODE are invoked so that they can
be customized with additional capabilities instead of just specifying
a single command.  This is a potential alternative to customizing the
scripts themselves.
----------------------------------------------------------------------
Wed Feb 12 13:29:11 2014                        Michael Jennings (mej)

Mark pretty much everything, including nhc itself, as a config file so
that customizations don't get overwritten.  Conceivably any or all of
these elements could potentially acquire site-local modifications.

The caveat, of course, being that updates would then have to be done
by hand....
----------------------------------------------------------------------
Wed Feb 12 15:37:45 2014                        Michael Jennings (mej)

Better way of securing permissions.
----------------------------------------------------------------------
Wed Feb 19 15:26:18 2014                        Michael Jennings (mej)

Typo.
----------------------------------------------------------------------
Wed Feb 19 17:27:17 2014                        Michael Jennings (mej)

This should fix the issue spotted by Anthony DelSorbo
<adelsorb@csc.com> and Ken Nielson <knielson@adaptivecomputing.com>.

When NHC exits, it tries to terminate the watchdog timer process (a
bash process with a sleep process as a child).  Because of a "bug"
(lack of feature, but that feature is specified by POSIX!) in bash,
its internal implementation of the "kill" builtin is incapable of
sending a signal to an entire process group (see also:  NHC's SVN
r1201 commit).  So when we sent the signal to the watchdog (i.e.,
bash) process, it died, but its child (the sleep process) didn't!

This commit rewrites the watchdog timer (again) to try and make sure
that the sleep goes away when the bash goes away.

NOTE:  This was only an issue if the output of NHC was being piped
(i.e., read()) somewhere.  Unfortunately, that includes pbs_mom....
;-)  You can compare/verify this by running "nhc" by itself on the
command line vs. running "nhc 2>&1 | less"
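
One way to get that behavior, shown only as a sketch and not as the
code from this commit: have the watchdog subshell trap SIGTERM and
kill its own sleep child before exiting.  The function and variable
names are hypothetical.

```bash
# Hypothetical watchdog: after <timeout> seconds, send SIGTERM to <target>.
# Killing the watchdog itself also removes its sleep child.
start_watchdog() {
    local timeout=$1 target=$2
    (
        sleep_pid=""
        # Install the trap first so the sleep never outlives this subshell.
        trap '[[ -n "$sleep_pid" ]] && kill "$sleep_pid" 2>/dev/null; exit 0' TERM
        sleep "$timeout" &
        sleep_pid=$!
        # The wait builtin returns as soon as a trapped signal arrives,
        # so the trap above runs promptly.
        if wait "$sleep_pid"; then
            kill -TERM "$target" 2>/dev/null
        fi
    ) &
    WATCHDOG_PID=$!
}

# Typical use: arm the timer, do the work, then cancel the timer.
start_watchdog 30 $$
# ... checks would run here ...
kill "$WATCHDOG_PID" 2>/dev/null
```

Because the sleep is killed along with its parent, nothing is left
holding the write end of a pipe open when output is being read by
another process.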
----------------------------------------------------------------------
Wed Feb 19 17:31:24 2014                        Michael Jennings (mej)

Add EXIT for paranoia.
----------------------------------------------------------------------
Wed Mar 05 16:15:56 2014                        Michael Jennings (mej)

Add support for negating file content matches in
check_file_contents().  Prefixing the match expression with an
exclamation mark (!) will cause the check to fail if any line in the
file matches the expression.  Any combination of positive and negative
match expressions may be used in the same check.
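
The matching rule can be sketched as follows.  This is an
illustration only: file_matches is a made-up name, and the patterns
are assumed to be plain extended regular expressions, which may not
cover everything the real check accepts.

```bash
# Hypothetical helper: succeed only if every positive pattern matches at
# least one line and no "!"-prefixed pattern matches any line.
file_matches() {
    local file=$1 pat re line found
    shift
    for pat in "$@"; do
        re=${pat#\!}          # pattern with any leading "!" removed
        found=0
        while IFS= read -r line; do
            if [[ $line =~ $re ]]; then
                found=1
                break
            fi
        done < "$file"
        if [[ $pat == '!'* ]]; then
            (( found )) && return 1     # negated pattern matched: fail
        else
            (( found )) || return 1     # positive pattern missing: fail
        fi
    done
    return 0
}
```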
----------------------------------------------------------------------
Fri Mar 07 12:09:06 2014                        Michael Jennings (mej)

Patch from Eliot Eshelman <eliot.eshelman@6by9.net> to ensure that NHC
reads the correct exit code from nvidia-healthmon.
----------------------------------------------------------------------
Fri Mar 07 22:31:57 2014                        Michael Jennings (mej)

Further input from Eliot Eshelman <eliot.eshelman@6by9.net> led me to
reorder and rework the subcommand execution for the nVidia healthmon
check to ensure that command-line options were handled in the most
portable way possible.
----------------------------------------------------------------------
Fri Mar 07 22:56:51 2014                        Michael Jennings (mej)

What on earth was that??
----------------------------------------------------------------------
Fri Mar 14 10:10:54 2014                        Michael Jennings (mej)

Clarification of comment verbiage.
----------------------------------------------------------------------
Fri Mar 14 10:10:59 2014                        Michael Jennings (mej)

Added 4 new checks, all with similar syntax, for looking at process
resource consumption.  Each check looks for processes using more than
a specified amount of the resource and can take various actions when
they are found.  The checks and their respective resources are:

check_ps_cpu     - Percentage of CPU utilization
check_ps_mem     - Amount of total system memory (absolute size)
check_ps_physmem - Amount of physical RAM (absolute or percentage)
check_ps_time    - Total CPU time

Syntax is, e.g.:  check_ps_cpu [flags] <threshold>
Flags accepted:
   -0          Non-fatal; report on matches, but don't terminate
   -a          Find and alert on all matches; don't die after the 1st
   -e action   Execute a command if a match is found
   -f          Full match; match against the entire command line
   -k          Kill matching processes if found
   -l          Log processes found to the NHC log
   -m match    Specifies a command (or command line) to match
   -r value    Renice matching processes by the specified factor
   -s          Log processes found to the syslog
   -u [!]user  Match only processes owned (or not owned) by user

Thresholds are specified as percentages (percent sign is optional for
check_ps_cpu), sizes (in kB or with appropriate suffix), or time (in
seconds or XXXmYYs).

Examples:
  check_ps_cpu -r 19 -u '!root' 99
  check_ps_mem -k -u mej -m '/leakyprog/' 24g
  check_ps_physmem -l -s 90%
  check_ps_time -l 720m
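
As an illustration of the time threshold formats, a converter along
these lines would accept both plain seconds and the XXXmYYs form.
The function name is hypothetical and this is not the parser NHC
itself uses.

```bash
# Hypothetical converter: "90" or "90s" -> 90, "720m" -> 43200,
# "12m30s" -> 750.  Returns non-zero for anything else.
to_seconds() {
    local t=$1
    local re_min='^([0-9]+)m(([0-9]+)s?)?$'
    local re_sec='^([0-9]+)s?$'
    if [[ $t =~ $re_min ]]; then
        # 10# forces base 10 so values like "08" are not read as octal.
        echo $(( 10#${BASH_REMATCH[1]} * 60 + 10#${BASH_REMATCH[3]:-0} ))
    elif [[ $t =~ $re_sec ]]; then
        echo $(( 10#${BASH_REMATCH[1]} ))
    else
        return 1
    fi
}

to_seconds 720m
to_seconds 12m30s
```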
----------------------------------------------------------------------
Fri Mar 14 16:28:50 2014                        Michael Jennings (mej)

Slight efficiency improvement by reading directly from the file
instead of creating a subprocess.
----------------------------------------------------------------------
Fri Mar 14 16:28:52 2014                        Michael Jennings (mej)

Fix minor cosmetic bug when timestamps are turned on -- the completion
line in the log file gave the timestamp instead of the elapsed time.
----------------------------------------------------------------------
Fri Mar 14 17:18:12 2014                        Michael Jennings (mej)

Add check_loadavg() for looking at the 1-, 5-, and 15-minute load
averages on a system.  Any or all may be capped.  Syntax is:
  check_loadavg <limit_1m> <limit_5m> <limit_15m>

Blank limits are ignored.

This check was originally written in front of a live studio audience
at MoabCon 2013!  (See the video at:  http://go.lbl.gov/nhc-2013-mc)
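
A rough sketch of how such a comparison can be done in pure bash,
which has no floating-point arithmetic.  The function names are
hypothetical and the real check may be implemented differently.

```bash
# Convert a decimal such as "1.25" to integer hundredths (125), because
# bash arithmetic is integer-only.
hundredths() {
    local int=${1%%.*} frac=""
    [[ $1 == *.* ]] && frac=${1#*.}
    frac=${frac}00
    echo $(( 10#${int:-0} * 100 + 10#${frac:0:2} ))
}

# Usage: load_within_limits "<1m> <5m> <15m> ..." <lim_1m> <lim_5m> <lim_15m>
# Blank limits are skipped.
load_within_limits() {
    local -a current=($1)
    shift
    local i=0 limit
    for limit in "$@"; do
        if [[ -n $limit ]] &&
           (( $(hundredths "${current[i]}") > $(hundredths "$limit") )); then
            echo "load average ${current[i]} exceeds limit $limit"
            return 1
        fi
        i=$(( i + 1 ))
    done
    return 0
}

# On Linux the current values can be read from /proc/loadavg, e.g.:
#   load_within_limits "$(< /proc/loadavg)" 8 "" 4
```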
----------------------------------------------------------------------
Mon Mar 17 15:27:10 2014                        Michael Jennings (mej)

Added 2 new checks for looking at the results of bash's built-in
"test" command as well as file stat() values.

check_file_test() provides an interface to certain options of the
"test" command which examine file attributes without needing to shell
out to run the /bin/stat command.  For example, you can check to see
if a file is readable, or writable, or if it even exists at all.

check_file_stat() goes further by actually running the /bin/stat
command and allowing you to test its results against expected values.
You can verify the owner or group of a file, or check to see if the
last-modified-time for a file is newer or older than you think it
should be.

These checks both support a *ton* of options, so documenting them all
here would make for a humongous changelog entry, but here are a few
examples:

To make sure /tmp is writable:
check_file_test -w /tmp

To make sure the passwd file isn't empty or missing:
check_file_test -s /etc/passwd

To make sure /dev/null is a character special device:
check_file_test -c /dev/null

To do a full integrity check on /dev/null:
check_file_stat -m 0666 -u 0 -g 0 -t 1 -T 3 /dev/null

To make sure /var/log/messages has recent activity:
check_file_stat -n 7200 /var/log/messages

To verify access to a user's ~/.ssh/ tree:
check_file_stat -m 0700 -U someuser /home/someuser

Full documentation for these checks will be on the web once I have a
chance to write them all up!
----------------------------------------------------------------------
Tue Mar 18 16:18:20 2014                        Michael Jennings (mej)

Fix path to stat command.
----------------------------------------------------------------------
Tue Mar 18 16:18:23 2014                        Michael Jennings (mej)

Add optional "fudge" factor to mem/swap size checks.  This allows the
actual size to be within some percentage or specific number of kB of
the specified minimum/maximum and still pass the check.  If not
specified, obviously, no fudge factor is used.

Example:  To verify RAM size is 32GB +/- 10%:

check_hw_physmem 32g 32g 10%
  -OR-
check_hw_physmem 32g 32g 3200m
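
The arithmetic behind this can be sketched as below.  All sizes are
in kB, the function name is hypothetical, and a percentage fudge is
assumed to be taken relative to the limit being compared against.

```bash
# Hypothetical helper: is <actual> within [min - fudge, max + fudge]?
# The fudge is either an absolute number of kB or a percentage ("10%").
size_within_limits() {
    local actual=$1 min=$2 max=$3 fudge=${4:-0}
    local slack_lo slack_hi
    if [[ $fudge == *% ]]; then
        slack_lo=$(( min * ${fudge%\%} / 100 ))
        slack_hi=$(( max * ${fudge%\%} / 100 ))
    else
        slack_lo=$fudge
        slack_hi=$fudge
    fi
    (( actual >= min - slack_lo && actual <= max + slack_hi ))
}

# 32 GB is 33554432 kB; a node reporting 31000000 kB passes with 10% slack.
size_within_limits 31000000 33554432 33554432 10% && echo "RAM size OK"
```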
----------------------------------------------------------------------
Wed Mar 19 09:20:07 2014                        Michael Jennings (mej)

Add unit tests for fudge factor code.
----------------------------------------------------------------------
Wed Mar 19 13:54:01 2014                        Michael Jennings (mej)

Create /var/run/nhc and put run-time files (like results) in there
instead of directly in /var/run.

Make $RESULTFILE depend on $NAME.  Each named instance should have
independent results.

Don't have $CONFDIR or $HELPERDIR depend on $NAME; that requires a
duplicate of /etc/nhc and /usr/libexec/nhc per named instance.  I
think the overwhelming majority of users will want checks and helper
scripts to be universal and only have the configuration file(s) differ
(at most).  If anyone wants it the old way, it can still be overridden
via sysconfig or command line.
----------------------------------------------------------------------
Wed Mar 19 15:46:12 2014                        Michael Jennings (mej)

Add new script file ww_cmd.nhc for checks based on arbitrary
subcommands.
----------------------------------------------------------------------
Thu Mar 20 17:14:02 2014                        Michael Jennings (mej)

Initial implementations of command-based checks.  These may get
further refinement.
----------------------------------------------------------------------
Fri Mar 21 12:30:45 2014                        Michael Jennings (mej)

Patch from Eliot Eshelman <eliot.eshelman@6by9.net> (slightly
modified) for SLURM support in check_ps_unauth_users().
----------------------------------------------------------------------
Fri Mar 21 13:42:39 2014                        Michael Jennings (mej)

Command output matching is (preliminarily) working now.

Fixed some missing check name labels in error messages.
----------------------------------------------------------------------
Fri Mar 21 14:46:45 2014                        Michael Jennings (mej)

Redo the command output check a better way.
----------------------------------------------------------------------
Mon Mar 24 12:22:14 2014                        Michael Jennings (mej)

Additional unit tests for command checks.
----------------------------------------------------------------------
Mon Mar 24 12:22:17 2014                        Michael Jennings (mej)

Add stubs for Moab/TORQUE checks.
----------------------------------------------------------------------
Wed Mar 26 15:11:07 2014                        Michael Jennings (mej)

Added a flag to check_ps_service to just start (instead of restart)
the service.  Useful for cases like sshd where "restart" when the
daemon isn't running will kill user login sessions!
----------------------------------------------------------------------
Wed Mar 26 15:11:10 2014                        Michael Jennings (mej)

Fix typos in check_ps_service unit tests.
----------------------------------------------------------------------
Wed Mar 26 15:11:12 2014                        Michael Jennings (mej)

Add TORQUE/Moab-specific checks.  These are still preliminary, and
unit tests for them are still pending, but they're already at least
somewhat useful.

 check_moab_sched -t <timeout> -a <alert> -v <version> -m <match>

Checks the output of "mdiag -S -v" against the specified version,
alert, and/or arbitrary match expression(s).  If a matching alert is
found, if the versions don't match, or if any of the match expressions
(possibly negated) trigger, the check fails.  All parameters are
optional.  Multiple occurrences of -m are supported.

 check_moab_rm -t <timeout> -m <match>

Checks the output of "mdiag -R -v" against any specified match
expression(s).  It also looks for any RMs that are not in the "Active"
state.  If any RM is inactive, or if any of the match expressions
(possibly negated) trigger, the check fails.  All parameters are
optional.  Multiple occurrences of -m are supported.

 check_moab_torque -t <timeout> -m <match>

Checks the output of "mdiag -R -v" against any specified match
expression(s).  It also looks for the "scheduling" parameter to be
turned on.  If "scheduling" is false, or if any of the match
expressions (possibly negated) trigger, the check fails.  All
parameters are optional.  Multiple occurrences of -m are supported.
----------------------------------------------------------------------
Thu Mar 27 16:10:28 2014                        Michael Jennings (mej)

Skip a few tests on RHEL4 due to apparent bash bug.
----------------------------------------------------------------------
Thu Mar 27 17:01:46 2014                        Michael Jennings (mej)

Explicitly test $DEBUG for ==1, not !=0
----------------------------------------------------------------------
Sun Mar 30 16:19:36 2014                        Michael Jennings (mej)

Allow nhc.cron to pass options to nhc.

Fix missing flag in comment.
----------------------------------------------------------------------
Sun Mar 30 18:03:49 2014                        Michael Jennings (mej)

Rewrite check_file_test to match check_file_stat calling conventions.
----------------------------------------------------------------------
Wed Apr 09 09:39:03 2014                        Michael Jennings (mej)

check_loadavg() -> check_ps_loadavg()
----------------------------------------------------------------------
Fri Apr 18 14:59:17 2014                                      macabral

Merge branch 'NO_OFFLOAD'
----------------------------------------------------------------------
Fri May 02 13:31:01 2014                        Michael Jennings (mej)

Improved default config file for NHC.
----------------------------------------------------------------------
Tue Jun 17 11:59:26 2014                        Michael Jennings (mej)

Fix name of SLURM job parent daemon, as noted by Eliot Eshelman
<eliot.eshelman@6by9.net>.
----------------------------------------------------------------------
Wed Jun 18 17:06:03 2014                        Michael Jennings (mej)

Rewrote check_fs_mount() to newer, more flexible syntax.  Old syntax
is still supported, but the check now takes options.  New features
include support for multiple filesystems in a single check_fs_mount
invocation, ability to attempt mount of missing filesystems, ability
to negate match strings for source/type/options, support for multiple
match strings of each type, support for actions if found (or not), and
support for making the check non-fatal.  check_fs_mount_ro() and
check_fs_mount_rw() have also been updated to work with either syntax
(based on whether or not the first argument is a mountpoint).

Added logging of messages for non-fatal checks and logging of shell
actions for check_ps_service() and nhc_ps_check_res().
----------------------------------------------------------------------
Thu Jun 19 12:34:32 2014                        Michael Jennings (mej)

Adjust handling of updating SLURM node states.
----------------------------------------------------------------------
Thu Jun 19 12:40:03 2014                        Michael Jennings (mej)

Fix typo.
----------------------------------------------------------------------
Fri Jun 20 17:39:53 2014                        Michael Jennings (mej)

Use mount options if provided.  Add unit tests for new syntax.
----------------------------------------------------------------------
Wed Jun 25 16:46:25 2014                        Michael Jennings (mej)

Revert logfile redirection; bash doesn't support it that way.
----------------------------------------------------------------------
Wed Jun 25 16:46:28 2014                        Michael Jennings (mej)

Refactor I/O.  Now all check and subcommand executions' output is
redirected according to the value of $LOGFILE; the original
stdout/stderr are restored immediately before exit.  The ERROR message
output, if any, is forcibly routed to stdout to help ensure that
ERROR is the first thing the RM daemon sees.  (This is especially
important for TORQUE/PBS which only look at the first word of the
first line of output and ignore everything else.)

Along with this, the default execution is now quieter.  Status
information is only printed in VERBOSE mode (-v).
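
The save/redirect/restore pattern described above can be sketched
with plain file descriptor duplication.  The wrapper name and the
choice of descriptors 3 and 4 are assumptions made for illustration.

```bash
# Hypothetical wrapper: run a command with all of its output appended to
# a log file, then restore the original stdout/stderr and print any
# ERROR line there.
run_logged() {
    local logfile=$1 rc=0
    shift
    exec 3>&1 4>&2              # save the original stdout and stderr
    exec >>"$logfile" 2>&1      # from here on, output goes to the log
    "$@" || rc=$?
    exec 1>&3 2>&4 3>&- 4>&-    # restore the originals, close the copies
    if (( rc != 0 )); then
        # Goes to the real stdout, so it is the first thing a caller sees.
        echo "ERROR:  check failed with status $rc"
    fi
    return "$rc"
}
```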
----------------------------------------------------------------------
Wed Jun 25 16:46:30 2014                        Michael Jennings (mej)

Add new -l option for specifying log file on command line.  If value
of LOGFILE looks like a path instead of output redirection, assume
it's just a filename and convert to an append-redirect statement.  So
LOGFILE="/path/to/log" becomes LOGFILE=">>/path/to/log 2>&1" instead.
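
A sketch of that conversion follows.  The function name is made up,
and "looks like output redirection" is assumed here to mean a value
that begins with ">" or "|".

```bash
# Hypothetical normalizer: pass redirection strings through untouched and
# turn a bare filename into an append-redirect for stdout and stderr.
logfile_to_redirect() {
    local value=$1
    case $value in
        '>'*|'|'*) printf '%s\n' "$value" ;;
        *)         printf '>>%s 2>&1\n' "$value" ;;
    esac
}

logfile_to_redirect /path/to/log
```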
----------------------------------------------------------------------
Thu Jun 26 13:55:47 2014                        Michael Jennings (mej)

Properly handle multiple double-pipes on the check line.  This allows
things like:  * || check_name && something_else || :
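
The key point is that only the first "||" separates the target from
the check; any later ones belong to the check itself.  A minimal
sketch using bash parameter expansion, with hypothetical function
and variable names:

```bash
# Strip leading and trailing whitespace without spawning a subprocess
# for the string manipulation itself.
trim() {
    local s=$1
    s=${s#"${s%%[![:space:]]*}"}
    s=${s%"${s##*[![:space:]]}"}
    printf '%s\n' "$s"
}

# Split "<target> || <check...>" at the FIRST double-pipe only.
split_check_line() {
    local line=$1
    TARGET=$(trim "${line%%||*}")   # everything before the first "||"
    CHECK=$(trim "${line#*||}")     # everything after the first "||"
}

split_check_line '* || check_name && something_else || :'
printf 'target=[%s] check=[%s]\n' "$TARGET" "$CHECK"
```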
----------------------------------------------------------------------
Tue Jul 08 15:01:39 2014                        Michael Jennings (mej)

Update sample config with latest syntax and fix a couple of typos.
Correct yet another issue in the "overloading $SECONDS" vein.
----------------------------------------------------------------------
Tue Jul 08 15:01:42 2014                        Michael Jennings (mej)

Since the shell doesn't distinguish between an unset variable (like
"undef" in Perl) and an empty/null variable (like the empty string in
Perl), we must use "-" for -l/LOGFILE to represent "no redirection."
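
For illustration: the usual ${VAR:-default} idiom treats an unset
and an empty value alike, so once a default is applied an empty
string can no longer mean "no redirection."  The sketch below uses
a made-up function name and a made-up default path.

```bash
# Hypothetical resolver: empty or missing input falls back to a default,
# while the "-" sentinel explicitly selects no redirection at all.
resolve_logfile() {
    local value=${1:-'>>/var/log/nhc.log 2>&1'}   # default is an assumption
    [[ $value == "-" ]] && value=""
    printf '%s\n' "$value"
}

resolve_logfile ""     # falls back to the default
resolve_logfile "-"    # prints an empty line: no redirection
```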
----------------------------------------------------------------------
