We run System Reporter 2.6 on a Windows VM that connects to a physical host running Oracle.
From time to time the Oracle system is taken offline, which causes the "Sampler" service for 3PAR System Reporter to quit. It then stays down until someone notices it's not working anymore and manually restarts the service after the remote database is back online.
I tried to use the built-in Windows service Recovery tab to auto-restart the service when it fails... however, the keyword there is "fail". When the 3PAR Sampler service stops due to database connection issues, it terminates itself cleanly, which does not count as a service failure, so Windows leaves it alone. For Windows to restart the service, the service needs to return an error code other than 0. Sure enough, when the DB goes offline and the Sampler service quits, it returns a 0. *sigh*
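If you want to check this on your own box, the built-in sc tool can show the exit code the service last reported and the recovery actions configured for it. This is only a rough sketch: SamplerKeyName is a made-up placeholder, and I am assuming the name used in my script is the service's display name, so look up the real key name first.

```bat
rem Look up the service key name from its display name.
sc getkeyname "3PAR System Reporter sampler"
rem SamplerKeyName below is a placeholder; substitute the key name printed above.
rem After the sampler has quit, WIN32_EXIT_CODE in this output shows what it returned.
sc query SamplerKeyName
rem Show the recovery (failure) actions currently configured for the service.
sc qfailure SamplerKeyName
```

If sc getkeyname reports an error, the name is probably already the key name and can be passed to sc query directly.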
As a workaround I wrote the following quick script, and I run it with the Task Scheduler every 30 minutes on the System Reporter server. This way, System Reporter will automatically restart and reconnect to the database as needed. PsService is a free download from Microsoft, part of the PsTools kit. Change e:\3PAR_watchdog.log to whatever location you like.
Copy and paste this into a text file named 3PAR_watchdog.bat:
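rem Check whether the sampler service is running; find sets errorlevel 1 when "RUNNING" is absent.
psservice query "3PAR System Reporter sampler" | find "RUNNING" >nul
rem If it is not running, start it and append a timestamp to the log.
if errorlevel 1 (net start "3PAR System Reporter sampler" & echo %date% %time% >>e:\3PAR_watchdog.log)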
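For the scheduling part, the task can also be registered from the command line instead of clicking through the Task Scheduler GUI. A minimal sketch, assuming the script was saved as e:\3PAR_watchdog.bat and is run by an account called svc_3par; the path, the task name and the account name are all placeholders of mine, not anything System Reporter requires.

```bat
rem Register a task that runs the watchdog every 30 minutes.
rem e:\3PAR_watchdog.bat, "3PAR watchdog" and svc_3par are placeholders; adjust to your setup.
rem /rp * makes schtasks prompt for the account password instead of putting it on the command line.
schtasks /create /tn "3PAR watchdog" /tr "e:\3PAR_watchdog.bat" /sc minute /mo 30 /ru svc_3par /rp *
```

You can confirm the task afterwards with schtasks /query /tn "3PAR watchdog".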
Be advised, there is one gotcha: the first time each Windows user runs psservice, it pops up a GUI dialog asking you to agree to the license agreement. So you can't run this as Local System. You will need, at minimum, a local service account with permissions to start, stop and query services. Log in interactively as the service account you will be using to run the script via the scheduler, run psservice once and click "OK" on the EULA. After that, the script will run properly in the background.
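One way to sidestep the EULA prompt entirely might be to query the service with the built-in sc command instead of psservice. This is an untested variant of the script above, not what I actually run. sc query wants the service key name rather than the display name, and SamplerKeyName here is a made-up placeholder (sc getkeyname "3PAR System Reporter sampler" should print the real one).

```bat
@echo off
rem Watchdog variant using the built-in sc instead of psservice, so no EULA dialog appears.
rem SamplerKeyName is a placeholder; replace it with the real service key name.
sc query SamplerKeyName | find "RUNNING" >nul
rem find returns errorlevel 1 when "RUNNING" is not in the output: start the service and log the time.
if errorlevel 1 (net start SamplerKeyName & echo %date% %time% >>e:\3PAR_watchdog.log)
```

The account running the task still needs rights to query and start the service. If you would rather stay with psservice, Sysinternals tools generally accept an -accepteula switch that suppresses the dialog, but I have not checked that against every PsService release, so treat it as an assumption.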
I've suggested to 3PAR via my SE that they change the sampler process to use return codes other than zero for errors... but, hindsight being 20/20, I think it would be primo if the sampler service locally cached any metrics it was unable to post to the DB until the DB comes back online, whether that is an hour, a day or more later. An option to alert the admin via email if anything is wrong would be a bonus too.