Automate Nagios Monitoring with Ansible

Infrastructure monitoring is important and there are lots of tools for it. I’m fond of Nagios – it’s very mature and customizable, but it’s also a giant pain in the ass to set up. Here’s how to save time deploying and managing Nagios with Ansible.


What Does it Do?

  • Automated deployment of Nagios server on CentOS7 or RHEL7
  • Automated deployment of Nagios client on CentOS6/7/8, RHEL6/7/8, Fedora or FreeBSD
  • Generates hosts and service checks based on Ansible inventory
  • Generates configuration/services based on easy-to-manage jinja2 templates
  • Deploys comprehensive, configurable checks for all hosts/services
  • Sets up comprehensive checks for the Nagios server itself
  • Wraps Nagios interface with Apache/SSL
  • Sets up firewall rules for Nagios/HTTP
  • Currently monitors 11 types of resources:
    • Generic Linux server (check ping, ssh, uptime, load, users, processes, disk space, swap, zombie procs, mdadm raid)
    • Webservers (same as Linux server plus TCP/80)
    • Elasticsearch (same as Linux server plus elasticsearch)
    • FreeNAS Appliances (check ping, ssh, volume, disk and alert status)
    • ELK server (same as Linux server plus elasticsearch and Kibana)
    • Jenkins CI (same as Linux server plus TCP/8080 and optional reverse proxy with authentication)
    • DNS servers (same as Linux server plus DNS service checks)
    • Network switches (check ssh, ping)
    • Out-of-band interfaces (check ssh, ping, http)
    • Dell iDRAC server health monitoring (using SNMP, see example below)
      • (all configurable: CPU, DISK, RAID, TEMP, FANS, MEM, POWER)
    • SuperMicro health monitoring (using IPMI)
      • Split out by SuperMicro server type and checks

Getting Started
First, clone the git repo locally.  You’ll need Ansible installed prior to this.

git clone https://github.com/sadsfae/ansible-nagios

Set Up Your Hosts Inventory
Next you’ll want to edit a few variables in your hosts (inventory) file. Most importantly, change host-01 to the hostname of your Nagios server.

cd ansible-nagios
sed -i 's/host-01/yournagioshost/' hosts
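
If you’d rather keep a copy of the original inventory, a small wrapper along these lines does the same substitution but leaves a backup behind and prints what changed. This is only a sketch: set_nagios_host is a hypothetical helper name, not something shipped in the repo, and it assumes the placeholder is still the literal string host-01.

```shell
# Hypothetical helper, not part of the ansible-nagios repo.
# Usage: set_nagios_host monitor01 [inventory-file]
set_nagios_host() {
  new_host="$1"
  inventory="${2:-hosts}"   # defaults to the repo's "hosts" file

  # -i.bak edits in place and keeps the original as <inventory>.bak.
  # The attached-suffix form is accepted by both GNU and BSD sed.
  sed -i.bak "s/host-01/${new_host}/" "$inventory"

  # Print the change for review. diff exits non-zero when the files
  # differ, so its status is deliberately ignored.
  diff "${inventory}.bak" "$inventory" || true
}
```

Run it from inside the ansible-nagios directory, e.g. set_nagios_host yournagioshost, and delete hosts.bak once you’re happy with the result.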

Add Resources to be Monitored (optional)
Add the servers/resources you want monitored to your inventory file and they’ll be automatically added to the Nagios configuration. Servers/resources simply need to be reachable via SSH, have proper SSH keys set up for the root user, and have Python installed.

[nagios]
host-01

[webservers]
webserver01

[switches]
switch01 ansible_host=192.168.0.101
switch02 ansible_host=192.168.0.102

[oobservers]
webserver01-idrac ansible_host=192.168.0.105

[servers]
server01

[servers_with_mdadm_raid]

[dns]

[dns_with_mdadm_raid]

[freenas]

[elasticsearch]

[elkservers]

[jenkins]

[idrac]
database01-idrac ansible_host=192.168.0.106

[supermicro-6048r]

Any hosts you add under the [servers], [dns], [jenkins], or [webservers] groups will inherit a full battery of common service checks.

Each inventory hostgroup is a unique category. You cannot list a host under more than one, so pick the group that fits each host best.
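
Since a host listed in two groups is an easy mistake to make in a long inventory, a quick check before running the playbook can help. The sketch below assumes a plain INI-style inventory like the one above; find_duplicate_hosts is a hypothetical helper name and is not part of the repo.

```shell
# Hypothetical helper, not part of the ansible-nagios repo.
# Prints every host that appears under more than one [group] header.
# Usage: find_duplicate_hosts hosts
find_duplicate_hosts() {
  awk '
    /^[ \t]*$/   { next }               # skip blank lines
    /^[ \t]*[#;]/ { next }              # skip comment lines
    /^\[/ {                             # group header, e.g. [webservers]
      group = $0
      sub(/^\[/, "", group)
      sub(/\].*$/, "", group)
      skip = (group ~ /:/)              # ignore [group:vars] / [group:children]
      next
    }
    skip { next }
    {
      host = $1                         # first field is the host alias
      if (!((host, group) in seen)) {
        seen[host, group] = 1
        count[host]++
        groups[host] = groups[host] " " group
      }
    }
    END {
      for (host in count)
        if (count[host] > 1)
          print host ":" groups[host]
    }
  ' "$1"
}
```

No output means every host sits in exactly one group. Otherwise each offending host is printed followed by the groups it was found in.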

Note: For switches, idrac and out-of-band interfaces you must also set the ansible_host=ip.address variable as illustrated above. Most appliances will not have Python installed, so Ansible cannot gather facts from them. Nagios will still use the alias name listed.

Change Nagios Password and Options (optional)
Part of the setup generates the Nagios admin user, and you’ll want to change its password from the default. Configurable variables are located in install/group_vars/all.yml. You can change the password again at any time.

sed -i 's/changeme/yourpasswordhere/' install/group_vars/all.yml

If you wish to receive external email alerts, change the notification settings below. The contacts.cfg.j2 file is templated for further modification.

admin_name_01: nagiosadmin
admin_email_01: nagios@localhost

Lastly, the playbook creates a read-only guest user for Nagios. You can turn this off by setting nagios_create_guest_user to false, or change the other settings as you see fit:

nagios_create_guest_user: true
nagios_ro_username: guest
nagios_ro_password: guest

If you don’t want iptables or firewalld rules created automatically for either the Nagios server or the NRPE Nagios client, set these to false:

manage_firewall: true
manage_firewall_client: true

Run the Playbook
Now you’re ready to deploy Nagios.  Run the Ansible playbook.

ansible-playbook -i hosts install/nagios.yml
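
Before a real run it can be worth doing a dry pass that touches nothing on the remote hosts. The sketch below uses two standard ansible-playbook flags, --syntax-check and --list-hosts. The preflight function name and the ANSIBLE_PLAYBOOK override variable are my own assumptions for illustration, not something the repo provides.

```shell
# Hypothetical helper, not part of the ansible-nagios repo.
# Usage: preflight [inventory-file] [playbook]
preflight() {
  inventory="${1:-hosts}"
  playbook="${2:-install/nagios.yml}"
  # ANSIBLE_PLAYBOOK may point at a different binary or a stub
  runner="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

  # Parse the playbook and roles without executing any tasks
  "$runner" -i "$inventory" "$playbook" --syntax-check || return 1

  # Show which inventory hosts each play would target
  "$runner" -i "$inventory" "$playbook" --list-hosts || return 1
}
```

If preflight returns cleanly and the host list looks right, go ahead with the ansible-playbook command above.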

Access your Nagios Instance
Now you should be able to access your Nagios interface via https://yourserver/nagios. Below is an example I’ve set up locally on a few home servers.

[Screenshot: example Nagios dashboard]

If you haven’t set up any external hosts to monitor in your Ansible inventory you’ll just see the localhost where Nagios is running and a slew of standard checks.

Known Issues
SELinux occasionally breaks checks as policy files are updated.  If you get the following error during your playbook run:

avc:  denied  { create } for  pid=8800 comm="nagios" name="nagios.qh

You need to do the following (or disable SELinux entirely):

cat /var/log/audit/audit.log | audit2allow -M mynagios
semodule -i mynagios.pp
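
Feeding the whole audit log to audit2allow builds a policy module from every denial it contains, not just the Nagios ones. A narrower variant is to filter the log first. The helper below is a sketch under my own assumptions: nagios_denials is a hypothetical name, and it only keeps records whose command name is exactly "nagios", so denials logged for other processes (httpd or nrpe, for instance) would need the pattern adjusted.

```shell
# Hypothetical helper, not part of the ansible-nagios repo.
# Prints only the AVC denial records whose command name is "nagios".
nagios_denials() {
  audit_log="${1:-/var/log/audit/audit.log}"
  grep 'avc:.*denied' "$audit_log" | grep 'comm="nagios"'
}

# Usage (as root), feeding only those records to audit2allow:
#   nagios_denials | audit2allow -M mynagios
#   semodule -i mynagios.pp
```

Review the generated mynagios.te file before loading the module so you know what you are allowing.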

Now restart Nagios:

systemctl restart nagios
systemctl restart httpd

I don’t want to make assumptions about your environment, so the playbook does not force setenforce 0 or the like.

Deployment Video
I’ve put together a simple video showing the deployment. It takes about 2 minutes to deploy a full Nagios stack and set of remote server/resource checks.

Dell iDRAC Server Health Example
I’ve recently added support for monitoring all server health values via SNMP against the iDRAC interface, built upon someone else’s already-existing idrac check. The big benefit is that you don’t need to put any load on the actual server or install NRPE, which helps when you need to optimize performance but still want a battery of health checks in case something breaks.

These checks are all configurable in install/group_vars/all.yml. Here’s what it looks like in the Nagios dashboard:

[Screenshot: iDRAC health checks in the Nagios dashboard]

Here is what it looks like from the service details view (quite exhaustive).

[Screenshot: iDRAC service details view]

Further Automation and Extending
Remember that all the configurations are generated via jinja2 templates. If you want to expand this to include different server types and checks, you can branch out your hostgrouptype.conf.j2 files as I have done in a minimal fashion and add the corresponding entries to your Ansible inventory.

We will be using this along with Ansible dynamic inventory and Foreman to auto-generate our list of monitored services. One of the goals is to maintain an up-to-date Nagios monitoring system automatically as hosts change or get added.

Questions or suggestions are welcome. Feel free to add a comment below or file an issue on GitHub.

About Will Foster

hobo devop/sysadmin/SRE
This entry was posted in open source, sysadmin.

35 Responses to Automate Nagios Monitoring with Ansible

  1. Wayne Rousey says:

    Very helpful – cheers!

  2. driveby says:

    Eager to give it a spin.
    What component I would add is livestatus for running integration tests against it.

    • Will Foster says:

      Hey @driveby, that would be useful. Right now I’m just focusing on expanding the templated checks (I’ll be adding IPMI / idrac environmental checks for servers soon). I also test everything extensively in VMs or against bare-metal (in case of the IPMI checks) before I push anything but there’s no replacement for proper integration testing for sure.

  3. GL says:

    Impressive. One thing I’m wondering, about how long did it take to setup this configuration? I’m working towards similar ends, but this is beyond. Very nice.

    • Will Foster says:

      Hi GL, the entire setup takes about 2-3 minutes with a few dozen hosts. Edit your inventory file and off it goes, note that you shouldn’t put the same host into more than one group (e.g. servers, webservers etc) due to how it’s templated. If you see any check failures or issue with the Nagios service you may need to set setenforce 0 and change SELinux to permissive in /etc/selinux/config until you can apply the proper policy or label(s) needed.

  4. GL says:

    ps – i like how the entire package is self contained… cool

  5. jsingamsetty says:

    Hi i could see below error

    AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'

    TASK [nagios-client : Setup NRPE client configuration] ***************************************************************************
    fatal: [SErver02]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'"}

    • Will Foster says:

      Can you paste your ‘hosts’ file? Opening a Github issue is probably the best place to help with this.

    • Jcob says:

      Did you ever find a solution for this?

      • Will Foster says:

        Likely that this error happens when you try to add a host that doesn’t support Python (so no fact collection is possible) without using the ansible_host variable, e.g. switches, routers, out-of-band devices, etc.

        For those they have their own inventory group / format.

        Per the README example:

        [switches]
        switch01 ansible_host=192.168.0.100
        switch02 ansible_host=192.168.0.102

  6. AK says:

    Hi,
    Nice work done, please let me know how to add other centos versions and other types of servers to be monitored?

    • Will Foster says:

      Hi AK,

      Adding other server types is easy, you just need to add them underneath the “hosts” file for the right category, for example if you have three linux servers and three linux web servers you’d split them out underneath the proper category (note you cannot have the same server in more than one category).


      [servers]
      server01
      server02
      server03

      [webservers]
      server04
      server05
      server06

      If you wanted to extend the types of server checks you’ll just need to write jinja2 templates for them (under install/roles/nagios/templates) and ensure they are generated in the main Nagios Ansible tasks (install/roles/nagios/tasks/main.yml). You can model a new template on one of the existing ones.

  7. Dheerendra says:

    Excellent work! Any suggestions on, how can i make it work for CentOS6.6. I was thinking of editing main.yml file.
    Thanks in Advance!

  8. Narpet says:

    We use nagiosdev, nagiosqa and nagiosprod useraccounts instead of nagios i. Also the group names would be nagiosdev, nagiosqa and nagiosprod. How to do this.

    • Will Foster says:

      Hi Narpet, this is a one-line change here:

      https://github.com/sadsfae/ansible-nagios/blob/master/install/group_vars/all.yml#L17 in the setting nagios_username:

      Just change nagiosadmin to whatever you want the primary username to be per environment.

      After that just run the playbook once you’ve added each server underneath the proper Ansible inventory group that corresponds to what you want monitored (only put a server in one group).

      For groups, are you referring to Nagios groups?

      This playbook isn’t setup in a way to put all servers into the same groups, nor does Nagios by design want you to do this.
      Typically you want to put servers you are monitoring underneath the appropriate Ansible inventory group that corresponds with the server, device or service type you are monitoring.

      On the Nagios side this lets you organize sets of similar servers doing the same function into the same Nagios Host Group based on their function, but it’s also why you probably want a separate Nagios instance per environment. This has other added benefits like separate alerting, contact and notification behavior that you probably want to have between different environments (e.g. you probably don’t want full alerting on in development but you probably do want it in prod).

      In your case with different environments it might make more sense to have separate Nagios instances per environment: nagiosdev, nagiosqa and nagiosprod.

      You could collapse all of this if you really wanted to however but you’d need to edit and manage the hostgroup_name variable in the templates section of the playbook and do some restructuring of how that maps to the Ansible inventory groups:

      https://github.com/sadsfae/ansible-nagios/tree/master/install/roles/nagios/templates

      As things are set up this is designed to have Ansible inventory groups map to Nagios Host groups, adding each server underneath only one inventory group that describes it best. You can see that some of the generic health checks are inherited and only differentiate based on the primary service (e.g. jenkins, DNS, etc).

      I hope this helps.

  9. Narpet says:

    Thanks. This is great. I am starting to test this.

  10. Narpet says:

    How do i disable installing nagios core. Also i cannot access any repo outside the company. I have to install the linux-nrpe-agent.tar.gz package. How do i automate this method. Please advise.

    • Will Foster says:

      I think you’re better off pointing your system to some local RPM repository location if you cannot access public mirrors.

      You’d want to comment this out and lay down a repository config that points to some local location where the EPEL / Nagios RPMs are located.

      https://github.com/sadsfae/ansible-nagios/blob/master/install/roles/nagios/tasks/main.yml#L18:21

      I’d run through the playbook on a host you can use to have access externally to get a list of all the packages you’d need for that repository.
      Then you’d want to create an RPM repository out of that location.

      Lastly, you can use the Ansible yum_repository module:

      http://docs.ansible.com/ansible/latest/yum_repository_module.html

      As for nagios-core, I guess you mean nagios and nagios-common? If you don’t want to install certain packages you can edit the items array here, however I would advise against this because everything listed is required for proper Nagios operation.

      https://github.com/sadsfae/ansible-nagios/blob/master/install/roles/nagios/tasks/main.yml#L52

      Your best bet is to just ensure all needed packages are located somewhere locally inside your security requirements and install that way, managing a local RPM repository and copying the packages there is probably the easiest way.

      • Bivabari says:

        Hello ,I have already exists nagios core sever .
        Only wanted to deploy nrpe client and plugins ,could you pls suggest me?

      • Will Foster says:

        Hi Bivabari, the playbook I maintain has both server + client packages tightly integrated. You can decouple these and try running only the nagios-client playbook, but you may be better off just using Ansible to install the packages and drop any configs in a standalone way.

        e.g. to install them on a set of hosts, make an inventory file called hosts with a section called [nagios_clients]:

        ansible -i hosts nagios_clients -m yum -a "name=nrpe state=latest"

        Now if you wanted to drop the configs (assuming the same config can be deployed everywhere) have a copy of it locally called nrpe.cfg

        ansible -i hosts nagios_clients -m copy -a 'src=nrpe.cfg dest=/etc/nagios/nrpe.cfg'

        Now set nrpe to start and to start on boot

        ansible -i hosts nagios_clients -m service -a "name=nrpe state=started"

        ansible -i hosts nagios_clients -m service -a "name=nrpe enabled=yes"

  11. dontknow says:

    What if you just want to add a bunch of hosts to a already up and running nagios?

    • Will Foster says:

      This is really for a new installation because it makes a lot of changes to your Nagios configs and templates them, and the ServiceGroup and HostGroup changes may not jibe with what you’re currently using.

  12. Bivabari says:

    Hello ,
    Thanks for the update .
    ansible nagios_clients -m yum -a “name=nrpe state=latest”————In this need little clarity.
    nagios_clients——>is this a yml file .can you pls help what could be format inside nagios_clients

    And also ,Can I use your package in this ,by removing nagios core part only keeping nagios_client code.Its failing in my case.

    [root@goldfinch install]# cat nagios.yml

    #
    # Playbook to install nagios server, clients and
    # generate service checks based on Ansible inventory
    #

    # we need to collect facts from all hosts we reference
    # https://github.com/ansible/ansible/issues/9260
    # we skip switches/oobservers because they normally don't
    # have python installed.

    - hosts: all
      remote_user: "{{ ansible_system_user }}"
      tasks: []

    # role for nagios clients via NRPE
    - hosts: all
      remote_user: "{{ ansible_system_user }}"
      roles:
        - { role: nagios-client }
        - { role: firewall_client, when: manage_firewall_client }

    • Will Foster says:

      nagios_clients refers to a group in an inventory file containing the hosts you want to run Ansible against:

      https://docs.ansible.com/ansible/latest/user_guide/intro_inventory.html

      This is an INI-style file containing the destination hosts underneath a group header like this, with the filename being “hosts”:

      [nagios_clients]
      host01
      host02
      host03

      You can try this, but I would probably run this standalone using the ad-hoc module commands in my previous reply since all you’re doing is installing it, dropping a configuration file and starting the service. This would assume you have a config file already setup.

      ansible -i hosts nagios_clients -m yum -a "name=nrpe state=latest"
      ansible -i hosts nagios_clients -m copy -a 'src=nrpe.cfg dest=/etc/nagios/nrpe.cfg'
      ansible -i hosts nagios_clients -m service -a "name=nrpe state=started"
      ansible -i hosts nagios_clients -m service -a "name=nrpe enabled=yes"

      I don’t think I can help you much deviating from the scope of the playbook as the two are tightly integrated together, with the NRPE checks being generated/managed on the Nagios server side with the rest of the Ansible playbooks, however if you can get things to work this way that’s great.

  13. varun rayala says:

    I already have a Nagios server and now I need to write an Ansible playbook to schedule downtime and
    silence and unsilence checks on other hosts.
    I tried the config below but I’m not sure where I need to define the Nagios server, URL, user and password details:
    # set 30 minutes of apache downtime
    - nagios:
        action: downtime
        minutes: 30
        service: httpd
        host: '{{ inventory_hostname }}'

  14. LL says:

    Hi, thanks for this. I’ve noticed httpd and Nagios being installed on ALL servers too. Is it just my setup ?

    • Will Foster says:

      Hi LL, you should only have one server and everything else is a client. (unless you want separate Nagios servers per sub-domain/environment etc).

      e.g.

      [nagios]
      my-nagios-server

      Everything else would be a client of some sort below this in the example inventory file, corresponding to the checks or category they fit into – and only have one unique category/client/inventory group per host (don’t list it in more than one location, this is why there is some overlaps in checks like the “server” role being inherited plus other functionality).

      HTTPD is required because Nagios does not have its own web server, it’s just CGI.

      Hope this helps, if you have any more questions please let me know.

  15. txfunnkkvy says:

    Thank you very much. How can I log in?

  16. LL says:

    Thanks, any chance of adding PnP4Nagios ?
