Automate Nagios Monitoring with Ansible

Infrastructure monitoring is important and there are lots of tools for it.  I’m fond of Nagios: it’s very mature and customizable, but it’s also a giant pain in the ass to set up.  Here’s how to save time deploying and managing Nagios with Ansible.


What Does it Do?

  • Automated deployment of Nagios server on CentOS or RHEL
  • Generates hosts and service checks based on Ansible inventory
  • Generates configuration/services based on easy-to-manage jinja2 templates
  • Deploys comprehensive, configurable checks for all hosts/services
  • Sets up comprehensive checks for the Nagios server itself
  • Wraps Nagios interface with Apache/SSL
  • Sets up firewall rules for Nagios/HTTP
  • Currently monitors eight types of external resources:
    • generic Linux server (check ping, ssh, uptime, load, users, processes, disk space)
    • webservers (check ping, http, ssh, uptime, load, users, processes, disk space)
    • elasticsearch (same as Linux server plus elasticsearch)
    • ELK server (same as Linux server plus elasticsearch and Kibana)
    • network switches (check ssh, ping)
    • out-of-band interfaces (check ssh, ping, http)
    • Dell iDRAC server health monitoring (using SNMP, see example below)
      • (all configurable) CPU, DISK, RAID, TEMP, FANS, MEM, POWER
    • SuperMicro health monitoring (using IPMI)
      • Split out by SuperMicro server type and checks

Getting Started
First, clone the git repo locally.  You’ll need Ansible installed prior to this.

git clone https://github.com/sadsfae/ansible-nagios

Set Up Your Hosts Inventory
Next you’ll want to edit a few variables in your hosts (inventory) file.  Most importantly, change host-01 to the hostname of your Nagios server.

cd ansible-nagios
sed -i 's/host-01/yournagioshost/' hosts
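
If you would rather not edit the inventory blind, the same substitution can be wrapped in a small shell function.  This is only a sketch: set_nagios_host is a hypothetical helper name, not something shipped in the repo.  It keeps a .bak copy of the file and refuses to run if the host-01 placeholder is not present.

```shell
# Replace the placeholder host-01 in an inventory file, keeping a backup.
# Usage: set_nagios_host <inventory-file> <nagios-hostname>
set_nagios_host() {
    inventory=$1
    new_host=$2
    # Stop if the placeholder is missing, so a wrong file name or a second
    # run does not pass silently.
    if ! grep -q 'host-01' "$inventory"; then
        echo "host-01 not found in $inventory" >&2
        return 1
    fi
    # -i.bak edits in place and leaves the original as <file>.bak.
    sed -i.bak "s/host-01/$new_host/" "$inventory"
}
```

For example, running set_nagios_host hosts nagios.example.com (a placeholder hostname) from inside the ansible-nagios directory would rewrite the inventory and leave the original in hosts.bak.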

Add Resources to be Monitored (optional)
Add the servers/resources you want monitored to your inventory file and they’ll be added to the Nagios configuration automatically.  Servers/resources simply need to be reachable via SSH, have SSH keys set up for the root user, and have Python installed.

[nagios]
host-01

[webservers]
webserver01

[switches]
switch01 ansible_host=192.168.0.101
switch02 ansible_host=192.168.0.102

[oobservers]
webserver01-idrac ansible_host=192.168.0.105

[servers]
server01

[elasticsearch]

[elkservers]

[idrac]
database01-idrac ansible_host=192.168.0.106

[supermicro-6048r]

Any hosts you add under the [servers] or [webservers] groups will inherit a full battery of common service checks.

Note:  For switches, idrac and out-of-band interfaces you must also set the ansible_host=ip.address variable as illustrated above.  This is because most appliances will not have Python installed, so Ansible can’t gather facts from them.  Nagios will still use the alias name listed.
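
Because of how the configuration is templated, a host should appear in only one group (this comes up in the comments as well).  A quick way to catch mistakes before deploying is to scan the inventory for aliases listed under more than one group.  The sketch below assumes the simple INI layout shown above; find_duplicate_hosts is a hypothetical helper name, and sections such as [group:vars] or [group:children] are skipped on the assumption that they hold no host lines.

```shell
# Print every host alias that appears under more than one [group] in an
# INI-style Ansible inventory. Returns non-zero when a duplicate is found.
# Usage: find_duplicate_hosts <inventory-file>
find_duplicate_hosts() {
    awk '
        /^[ \t]*(#|;|$)/ { next }      # skip comments and blank lines
        /^\[/ {
            # Sections such as [group:vars] or [group:children] hold no hosts.
            group = ($1 ~ /:/) ? "" : $1
            next
        }
        group == "" { next }            # ignore lines outside a host group
        {
            host = $1                   # first field is the inventory alias
            if ((host in seen) && seen[host] != group) {
                printf "%s is in %s and %s\n", host, seen[host], group
                dup = 1
            } else {
                seen[host] = group
            }
        }
        END { exit (dup ? 1 : 0) }
    ' "$1"
}
```

Running find_duplicate_hosts hosts prints one line per offending alias and returns a non-zero status, so it can gate the playbook run in a script.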

Change Nagios Password and Options (optional)
Part of the setup is generating the Nagios admin user, and you’ll want to change its password from the default.  Configurable variables are located in install/group_vars/all.yml.  You can change this at any time, however.

sed -i 's/changeme/yourpasswordhere/' install/group_vars/all.yml
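
If you don’t want to invent a password by hand, something like the following sketch generates one for you.  set_random_nagios_password is a hypothetical helper name, and the 24-character alphanumeric format is an arbitrary choice made so the value needs no escaping inside the sed expression.

```shell
# Replace the default "changeme" value with a random 24-character
# alphanumeric password and print it once so it can be stored safely.
# Usage: set_random_nagios_password <path-to-all.yml>
set_random_nagios_password() {
    vars_file=$1
    # Do not print a password that was never written to the file.
    if ! grep -q 'changeme' "$vars_file"; then
        echo "changeme not found in $vars_file" >&2
        return 1
    fi
    # Take 1024 random bytes, keep only letters and digits, cut to 24 chars.
    new_pass=$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-24)
    sed -i "s/changeme/$new_pass/" "$vars_file"
    echo "$new_pass"
}
```

Calling set_random_nagios_password install/group_vars/all.yml prints the new password to the terminal; it is not saved anywhere except the variables file.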

If you want to receive external email alerts, change the notification settings below.  The contacts.cfg.j2 file is templated for further modification.

admin_name_01: nagiosadmin
admin_email_01: nagios@localhost

Lastly, the playbook creates a read-only guest user for Nagios.  You can turn this off by setting nagios_create_guest_user to false, or change the username and password as you see fit:

nagios_create_guest_user: true
nagios_ro_username: guest
nagios_ro_password: guest
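
In the same spirit as the sed one-liners above, the guest account can be switched off from the command line.  This is a minimal sketch that assumes the variable sits at the start of its own line in the file, as shown above; disable_guest_user is a hypothetical helper name.

```shell
# Turn off creation of the read-only guest account before running the playbook.
# Usage: disable_guest_user <path-to-all.yml>
disable_guest_user() {
    # Anchor on the start of the line so only the variable itself is changed.
    sed -i 's/^nagios_create_guest_user:.*/nagios_create_guest_user: false/' "$1"
}
```

Run it as disable_guest_user install/group_vars/all.yml before deploying.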

Run the Playbook
Now you’re ready to deploy Nagios.  Run the Ansible playbook.

ansible-playbook -i hosts install/nagios.yml
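
Before a real run it can be worth checking that the files are where the playbook expects them and that Ansible can parse everything.  The sketch below is one way to do that; preflight is a hypothetical helper name.  The ping step is limited to the nagios, servers and webservers groups from the sample inventory, on the assumption that switches and out-of-band interfaces lack Python and would fail Ansible’s ping module.

```shell
# Check the repository layout, then let Ansible validate the playbook and
# confirm the Python-capable hosts are reachable.
# Usage: preflight [repo-dir]   (defaults to the current directory)
preflight() {
    repo=${1:-.}
    for f in hosts install/nagios.yml install/group_vars/all.yml; do
        if [ ! -f "$repo/$f" ]; then
            echo "missing $repo/$f" >&2
            return 1
        fi
    done
    if ! command -v ansible-playbook >/dev/null 2>&1; then
        echo "ansible-playbook is not installed" >&2
        return 1
    fi
    # Parse the playbook without executing any task.
    ansible-playbook -i "$repo/hosts" "$repo/install/nagios.yml" --syntax-check || return 1
    # Confirm the selected hosts answer over SSH and have a usable Python.
    ansible 'nagios:servers:webservers' -i "$repo/hosts" -m ping
}
```

Run preflight from inside the ansible-nagios directory; a zero exit status means the syntax check and the ping both succeeded.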

Access your Nagios Instance
Now you should be able to access your Nagios interface via https://yourserver/nagios.  Below is an example I’ve set up locally on a few home servers.

[Screenshot: example Nagios dashboard]

If you haven’t set up any external hosts to monitor in your Ansible inventory, you’ll just see the localhost where Nagios is running and a slew of standard checks.

Known Issues
SELinux occasionally breaks checks as policy files are updated.  If you get the following error during your playbook run:

avc:  denied  { create } for  pid=8800 comm="nagios" name="nagios.qh

You need to do the following (or disable SELinux entirely):

cat /var/log/audit/audit.log | audit2allow -M mynagios
semodule -i mynagios.pp

Now restart Nagios and Apache:

systemctl restart nagios
systemctl restart httpd

I don’t want to make assumptions about your environment, which is why I’m not forcing setenforce 0 or the like.
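
Note that the audit2allow command above feeds the entire audit log into the new module, so it may end up allowing denials that have nothing to do with Nagios.  A narrower variant is sketched below: it filters the log down to AVC denials raised by the nagios process first.  nagios_denials is a hypothetical helper name, and the default log path is the one used above.

```shell
# Print only the AVC denials raised by the nagios process from an audit log.
# Usage: nagios_denials [audit-log]   (defaults to /var/log/audit/audit.log)
nagios_denials() {
    log=${1:-/var/log/audit/audit.log}
    # Keep AVC denial records, then keep those whose command name is nagios.
    grep -E 'avc: +denied' "$log" | grep 'comm="nagios"'
}
```

The output can be piped into the same tool as before, for example nagios_denials | audit2allow -M mynagios.  audit2allow also writes a human-readable mynagios.te next to mynagios.pp, which is worth reading before loading the module with semodule -i mynagios.pp.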

Deployment Video
I’ve put together a simple video showing the deployment.  It takes about 2 minutes to deploy a full Nagios stack and a set of remote server/resource checks.

Dell iDRAC Server Health Example
I’ve recently added support for monitoring all server health values via SNMP against the iDRAC interface, built upon someone else’s existing idrac check.  The big benefit is that you don’t need to stress the actual server at all or install NRPE, which helps in cases where you need to optimize performance but still want a battery of health checks in case something breaks.

These checks are all configurable in install/group_vars/all.yml.  Here’s what it looks like in the Nagios dashboard:

[Screenshot: iDRAC checks in the Nagios dashboard]

Here is what it looks like from the service details view (quite exhaustive).

[Screenshot: iDRAC service details view]

Further Automation and Extending
Remember that all the configurations are generated via jinja2 templates.  If you want to expand this to include different server types and checks, you can simply branch out your hostgrouptype.conf.j2 files, as I have done in a minimal fashion, and add the corresponding entries to your Ansible inventory.

We will be using this along with Ansible dynamic inventory and Foreman to auto-generate our list of monitored services for various tasks, one of which will be to keep an up-to-date Nagios monitoring system maintained automatically as hosts change or get added.

Questions or suggestions are welcome.  Feel free to add a comment below or file an issue on GitHub.

About Will Foster

hobo devop/sysadmin, all-around nice guy.
This entry was posted in open source, sysadmin.

14 Responses to Automate Nagios Monitoring with Ansible

  1. Wayne Rousey says:

    Very helpful – cheers!

  2. driveby says:

    Eager to give it a spin.
    What component I would add is livestatus for running integration tests against it.

    • Will Foster says:

      Hey @driveby, that would be useful. Right now I’m just focusing on expanding the templated checks (I’ll be adding IPMI / idrac environmental checks for servers soon). I also test everything extensively in VMs or against bare-metal (in case of the IPMI checks) before I push anything but there’s no replacement for proper integration testing for sure.

  3. GL says:

    Impressive. One thing I’m wondering, about how long did it take to setup this configuration? I’m working towards similar ends, but this is beyond. Very nice.

    • Will Foster says:

      Hi GL, the entire setup takes about 2-3 minutes with a few dozen hosts. Edit your inventory file and off it goes. Note that you shouldn’t put the same host into more than one group (e.g. servers, webservers etc) due to how it’s templated. If you see any check failures or issues with the Nagios service you may need to run setenforce 0 and change SELinux to permissive in /etc/selinux/config until you can apply the proper policy or label(s) needed.

  4. GL says:

    ps – i like how the entire package is self contained… cool

  5. jsingamsetty says:

    Hi, I’m seeing the error below:

    AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'

    TASK [nagios-client : Setup NRPE client configuration] ***************************************************************************
    fatal: [SErver02]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'"}

  6. AK says:

    Hi,
    Nice work done, please let me know how to add other centos versions and other types of servers to be monitored?

    • Will Foster says:

      Hi AK,

      Adding other server types is easy: you just need to add them to the “hosts” file under the right category. For example, if you have three Linux servers and three Linux web servers, you’d split them out under the proper category (note you cannot have the same server in more than one category).


      [servers]
      server01
      server02
      server03

      [webservers]
      server04
      server05
      server06

      If you want to extend the types of server checks, you’ll just need to write jinja2 templates for them and ensure they are generated in the main Nagios Ansible tasks. You can model a new one on one of the existing ones.

  7. Dheerendra says:

    Excellent work! Any suggestions on, how can i make it work for CentOS6.6. I was thinking of editing main.yml file.
    Thanks in Advance!
