Infrastructure monitoring is important and there are lots of tools for it. I’m fond of Nagios: it’s very mature and customizable, but it’s also a giant pain in the ass to set up. Here’s how to save time deploying and managing Nagios with Ansible.
What Does it Do?
- Automated deployment of Nagios server on CentOS7 or RHEL7
- Automated deployment of Nagios client on CentOS6/7/8, RHEL6/7/8, Fedora or FreeBSD
- Generates hosts and service checks based on Ansible inventory
- Generates configuration/services based on easy-to-manage jinja2 templates
- Deploys comprehensive, configurable checks for all hosts/services
- Sets up comprehensive checks for the Nagios server itself
- Wraps Nagios interface with Apache/SSL
- Sets up firewall rules for Nagios/HTTP
- Currently monitors the following types of resources:
  - Generic Linux server (check ping, ssh, uptime, load, users, processes, disk space, swap, zombie procs, mdadm raid)
  - Webservers (same as Linux server plus TCP/80 for webservers)
  - Elasticsearch (same as Linux server plus elasticsearch)
  - FreeNAS Appliances (check ping, ssh, volume, disk and alert status)
  - ELK server (same as Linux server plus elasticsearch and Kibana)
  - Jenkins CI (same as Linux server plus TCP/8080 and optional reverse proxy with authentication)
  - DNS servers (same as Linux server plus DNS service checks)
  - Network switches (check ssh, ping)
  - Out-of-band interfaces (check ssh, ping, http)
  - Dell iDRAC server health monitoring (using SNMP, see example below)
    - (all configurable: CPU, DISK, RAID, TEMP, FANS, MEM, POWER)
  - SuperMicro health monitoring (using IPMI)
    - Split out by SuperMicro server type and checks
Getting Started
First, clone the git repo locally. You’ll need Ansible installed prior to this.
git clone https://github.com/sadsfae/ansible-nagios
Setup your Hosts Inventory
Next you’ll want to edit a few variables in your hosts (inventory) file. Most importantly, change host-01 to whatever your Nagios server should be.
cd ansible-nagios
sed -i 's/host-01/yournagioshost/' hosts
Add Resources to be Monitored (optional)
Add the servers/resources you want monitored to your inventory file and they’ll be automatically added to monitoring in the Nagios configuration. Servers/resources simply need to be reachable via SSH, have proper SSH keys set up for the root user and have Python installed.
[nagios]
host-01

[webservers]
webserver01

[switches]
switch01 ansible_host=192.168.0.101
switch02 ansible_host=192.168.0.102

[oobservers]
webserver01-idrac ansible_host=192.168.0.105

[servers]
server01

[servers_with_mdadm_raid]

[dns]

[dns_with_mdadm_raid]

[freenas]

[elasticsearch]

[elkservers]

[jenkins]

[idrac]
database01-idrac ansible_host=192.168.0.106

[supermicro-6048r]
Any hosts you add under the [servers], [dns], [jenkins], or [webservers] groups will inherit a full battery of common service checks.
Each inventory hostgroup is a unique category. You cannot list a host under more than one, so pick the one that fits each of your hosts best.
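Because a host listed under two groups breaks the templating, it can be worth checking the inventory before a run. The following is a minimal sketch using only the Python standard library. It assumes a plain INI-style inventory like the one above (no :children or :vars sections), and the helper names parse_inventory and find_duplicate_hosts are hypothetical, not part of the playbook.

```python
def parse_inventory(text):
    """Map each [group] header to the list of host names listed under it."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            groups.setdefault(current, [])
        elif current is not None:
            # The host name is the first token; ansible_host=... may follow it.
            groups[current].append(line.split()[0])
    return groups


def find_duplicate_hosts(groups):
    """Return {host: [group, ...]} for hosts listed under more than one group."""
    membership = {}
    for group, hosts in groups.items():
        for host in hosts:
            seen = membership.setdefault(host, [])
            if group not in seen:
                seen.append(group)
    return {host: seen for host, seen in membership.items() if len(seen) > 1}


# Illustrative inventory: webserver01 is (wrongly) listed twice.
SAMPLE = """\
[nagios]
host-01

[webservers]
webserver01

[servers]
server01
webserver01
"""

if __name__ == "__main__":
    for host, in_groups in find_duplicate_hosts(parse_inventory(SAMPLE)).items():
        print(f"{host} is listed in more than one group: {', '.join(in_groups)}")
```

To check a real inventory, read the hosts file into a string and pass it to parse_inventory instead of SAMPLE.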
Note: For switches, idrac and out-of-band interfaces you must also use the ansible_host=ip.address variable as illustrated above. This is because most appliances will not have Python installed, so Ansible cannot gather facts from them. Nagios will still use the alias name listed.
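A forgotten ansible_host on one of these devices is easy to miss, so here is a small, hedged sketch that flags such entries. It assumes the group names used in the example inventory (switches, oobservers, idrac) and the same plain INI-style layout; the function name hosts_missing_ansible_host is hypothetical.

```python
# Groups that hold devices without Python (an assumption based on the
# example inventory; adjust to match your own group names).
AGENTLESS_GROUPS = {"switches", "oobservers", "idrac"}


def hosts_missing_ansible_host(inventory_text, groups=AGENTLESS_GROUPS):
    """Return (group, host) pairs in the given groups that lack ansible_host=..."""
    missing = []
    current = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current in groups:
            host, *params = line.split()
            if not any(p.startswith("ansible_host=") for p in params):
                missing.append((current, host))
    return missing


# Illustrative inventory: switch02 has no ansible_host set.
SAMPLE = """\
[switches]
switch01 ansible_host=192.168.0.101
switch02

[servers]
server01
"""

if __name__ == "__main__":
    for group, host in hosts_missing_ansible_host(SAMPLE):
        print(f"[{group}] {host} has no ansible_host set")
```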
Change Nagios Password and Options (optional)
Part of the setup is generating the Nagios admin user, and you’ll want to change its password from the default. Configurable variables are located in install/group_vars/all.yml, and you can change them at any time.
sed -i 's/changeme/yourpasswordhere/' install/group_vars/all.yml
You might want to change the notification settings if you wish to receive external email alerts. The contacts.cfg.j2 file is templated for further modification.
admin_name_01: nagiosadmin
admin_email_01: nagios@localhost
Lastly, it creates a read-only guest user for Nagios. You can turn this off by setting the following to false, or change the settings as you see fit:
nagios_create_guest_user: true
nagios_ro_username: guest
nagios_ro_password: guest
If you don’t want to automatically create iptables or firewalld rules for either the Nagios server or the NRPE Nagios client, modify these:
manage_firewall: true
manage_firewall_client: true
Run the Playbook
Now you’re ready to deploy Nagios. Run the Ansible playbook.
ansible-playbook -i hosts install/nagios.yml
Access your Nagios Instance
Now you should be able to access your Nagios interface via https://yourserver/nagios. Below is an example I’ve set up locally on a few home servers.
If you haven’t set up any external hosts to monitor in your Ansible inventory you’ll just see the localhost where Nagios is running and a slew of standard checks.
Known Issues
SELinux occasionally breaks checks as policy files are updated. If you get the following error during your playbook run:
avc: denied { create } for pid=8800 comm="nagios" name="nagios.qh
You need to do the following (or disable SELinux entirely):
cat /var/log/audit/audit.log | audit2allow -M mynagios
semodule -i mynagios.pp
Now restart Nagios:
systemctl restart nagios
systemctl restart httpd
I don’t want to make any assumptions about your environment, so the playbook does not force setenforce 0 or the like.
Deployment Video
I’ve put together a simple video showing the deployment. It takes about 2 minutes to deploy a full Nagios stack and set of remote server/resource checks.
Dell iDRAC Server Health Example
I’ve recently added support for monitoring all server health values via SNMP against the iDRAC interface, built upon someone else’s already-existing idrac check. The big benefit is that you don’t need to stress the actual server at all or install NRPE, which helps in cases where you need to optimize performance but still want a battery of health checks in case something breaks.
These checks are all configurable in install/group_vars/all.yml. Here’s what it looks like in the Nagios dashboard:
Here is what it looks like from the service details view (quite exhaustive).
Further Automation and Extending
Remember that all the configurations are generated via jinja2 templates. If you want to expand this to include different server types and checks, you can simply branch out your hostgrouptype.conf.j2 files, as I have done in a minimal fashion, and add the corresponding entries to your Ansible inventory.
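To make the idea concrete, here is a rough sketch of turning one inventory group into Nagios host definitions. It is not the playbook’s actual template: it uses Python’s standard-library string.Template as a stand-in for jinja2, the host names and addresses are made up, and the linux-server parent template is an assumption that must match a host template defined in your own Nagios configuration.

```python
from string import Template

# Nagios object definition for a single host. The "use" value is an
# assumption: it must name a host template that exists in your Nagios config.
HOST_DEFINITION = Template(
    "define host {\n"
    "    use         $parent\n"
    "    host_name   $name\n"
    "    alias       $name\n"
    "    address     $address\n"
    "    hostgroups  $group\n"
    "}\n"
)


def render_group(group, hosts, parent="linux-server"):
    """Render one host definition per (name, address) pair in an inventory group."""
    return "\n".join(
        HOST_DEFINITION.substitute(parent=parent, name=name, address=address, group=group)
        for name, address in hosts
    )


if __name__ == "__main__":
    # Hypothetical hosts and addresses, for illustration only.
    print(render_group("webservers", [("webserver01", "192.168.0.10"),
                                      ("webserver02", "192.168.0.11")]))
```

A new server type then amounts to a new template plus a new inventory group, which is the same shape the jinja2 templates follow.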
We will be utilizing this along with Ansible dynamic inventory and Foreman to auto-generate our list of monitored services to do various tasks, one of which will be to generate and maintain an up-to-date monitoring system with Nagios automatically as hosts change or get added.
Questions or suggestions are welcome. Feel free to add a comment below or file an issue on GitHub.
Very helpful – cheers!
Eager to give it a spin.
One component I would add is livestatus, for running integration tests against it.
Hey @driveby, that would be useful. Right now I’m just focusing on expanding the templated checks (I’ll be adding IPMI / idrac environmental checks for servers soon). I also test everything extensively in VMs or against bare-metal (in case of the IPMI checks) before I push anything but there’s no replacement for proper integration testing for sure.
Impressive. One thing I’m wondering: about how long did it take to set up this configuration? I’m working towards similar ends, but this is beyond. Very nice.
Hi GL, the entire setup takes about 2-3 minutes with a few dozen hosts. Edit your inventory file and off it goes. Note that you shouldn’t put the same host into more than one group (e.g. servers, webservers etc) due to how it’s templated. If you see any check failures or issues with the Nagios service you may need to run
setenforce 0
and change SELinux to permissive in /etc/selinux/config
until you can apply the proper policy or label(s) needed.
ah, i meant time to write the ansible / nagios configuration :) weeks? months?
I think it took a day or two. Once I figured out how to iterate configuration loops in Jinja2 with Ansible facts it came together shortly afterwards.
ps – i like how the entire package is self contained… cool
Hi, I’m seeing the error below:
AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'"}
TASK [nagios-client : Setup NRPE client configuration] ***************************************************************************
fatal: [SErver02]: FAILED! => {"changed": false, "failed": true, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'"}
Can you paste your ‘hosts’ file? Opening a GitHub issue is probably the best way to get help with this.
Did you ever find a solution for this?
This error likely happens when you try to add a host that doesn’t support Python (so no fact collection is possible) without using the ansible_host variable, e.g. switches, routers, out-of-band devices, etc.
Those have their own inventory group / format.
Per the README example:
[switches]
switch01 ansible_host=192.168.0.100
switch02 ansible_host=192.168.0.102
Hi,
Nice work! Please let me know how to add other CentOS versions and other types of servers to be monitored.
Hi AK,
Adding other server types is easy: you just need to add them to the “hosts” file under the right category. For example, if you have three Linux servers and three Linux web servers you’d split them out underneath the proper category (note you cannot have the same server in more than one category).
[servers]
server01
server02
server03
[webservers]
server04
server05
server06
If you wanted to extend the types of server checks, you’ll just need to write jinja2 templates for them and ensure they are generated in the main Nagios Ansible tasks. You can model a new template on one of the existing ones.
Excellent work! Any suggestions on how I can make it work for CentOS 6.6? I was thinking of editing the main.yml file.
Thanks in advance!
Hi Dheerendra,
I haven’t tested this on RHEL/CentOS6 and don’t really plan to support it, but you’ll at least need to edit two areas where we check for this:
- Edit main.yml and remove the OS/distribution check
- Edit main.yml and modify the EPEL RPM repository for CentOS6/RHEL6
You may also want to substitute systemctl commands with service commands.
Let me know how it goes.
We use nagiosdev, nagiosqa and nagiosprod user accounts instead of nagios. The group names would also be nagiosdev, nagiosqa and nagiosprod. How can we do this?
Hi Narpet, this is a one-line change here:
https://github.com/sadsfae/ansible-nagios/blob/master/install/group_vars/all.yml#L17 in the setting nagios_username:
Just change nagiosadmin to whatever you want the primary username to be per environment.
After that just run the playbook once you’ve added each server underneath the proper Ansible inventory group that corresponds to what you want monitored (only put a server in one group).
For groups, are you referring to Nagios groups?
This playbook isn’t set up to put all servers into the same groups, nor does Nagios by design want you to do this.
Typically you want to put servers you are monitoring underneath the appropriate Ansible inventory group that corresponds with the server, device or service type you are monitoring.
On the Nagios side this lets you organize sets of similar servers doing the same function into the same Nagios Host Group based on their function, but it’s also why you probably want a separate Nagios instance per environment. This has other added benefits like separate alerting, contact and notification behavior that you probably want to have between different environments (e.g. you probably don’t want full alerting on in development but you probably do want it in prod).
In your case with different environments it might make more sense to have separate Nagios instances per environment: nagiosdev, nagiosqa and nagiosprod.
You could collapse all of this if you really wanted to, but you’d need to edit and manage the hostgroup_name variable in the templates section of the playbook and restructure how that maps to the Ansible inventory groups:
https://github.com/sadsfae/ansible-nagios/tree/master/install/roles/nagios/templates
As things are set up this is designed to have Ansible inventory groups map to Nagios Host groups, adding each server underneath only one inventory group that describes it best. You can see that some of the generic health checks are inherited and only differentiate based on the primary service (e.g. jenkins, DNS, etc).
I hope this helps.
Thanks. This is great. I am starting to test this.
How do I disable installing Nagios core? Also, I cannot access any repo outside the company, so I have to install the linux-nrpe-agent.tar.gz package. How do I automate this method? Please advise.
I think you’re better off pointing your system to a local RPM repository location if you cannot access public mirrors.
You’d want to comment this out and lay down a repository config that points to a local location where the EPEL / Nagios RPMs are located.
https://github.com/sadsfae/ansible-nagios/blob/master/install/roles/nagios/tasks/main.yml#L18:21
I’d run through the playbook on a host that does have external access to get a list of all the packages you’d need for that repository.
Then you’d want to create an RPM repository out of that location.
Lastly, you can use the Ansible yum_repository module:
http://docs.ansible.com/ansible/latest/yum_repository_module.html
As far as nagios-core goes, I guess you mean nagios and nagios-common? If you don’t want to install certain packages you can edit the items array here, however I would advise against this because every package listed is required for proper Nagios operation.
https://github.com/sadsfae/ansible-nagios/blob/master/install/roles/nagios/tasks/main.yml#L52
Your best bet is to ensure all needed packages are located somewhere local that meets your security requirements and install that way. Managing a local RPM repository and copying the packages there is probably the easiest way.
Hello, I already have an existing Nagios core server.
I only want to deploy the NRPE client and plugins. Could you please suggest how?
Hi Bivabari, the playbook I maintain has both server + client packages tightly integrated. You can decouple these and try running only the nagios-client playbook, but you may be better off just using Ansible to install the packages and drop any configs in a standalone way.
e.g. to install them on a set of hosts, make an inventory file called hosts with a section called [nagios_clients]:
ansible -i hosts nagios_clients -m yum -a "name=nrpe state=latest"
Now if you wanted to drop the configs (assuming the same config can be deployed everywhere), have a copy of it locally called nrpe.cfg:
ansible -i hosts nagios_clients -m copy -a 'src=nrpe.cfg dest=/etc/nagios/nrpe.cfg'
Now set nrpe to start and to start on boot:
ansible -i hosts nagios_clients -m service -a "name=nrpe state=started enabled=yes"
What if you just want to add a bunch of hosts to an already up and running Nagios?
This is really for a new installation because it makes a lot of changes to your Nagios configs and templates them, and the ServiceGroup and HostGroup changes may not jibe with what you’re currently using.
Hello,
Thanks for the update.
I need a little clarity on this command: ansible nagios_clients -m yum -a "name=nrpe state=latest"
Is nagios_clients a yml file? Can you please help with what the format inside nagios_clients should be?
Also, can I use your package for this by removing the Nagios core part and keeping only the nagios_client code? It’s failing in my case.
[root@goldfinch install]# cat nagios.yml
---
#
# Playbook to install nagios server, clients and
# generate service checks based on Ansible inventory
#
# we need to collect facts from all hosts we reference
# https://github.com/ansible/ansible/issues/9260
# we skip switches/oobservers because they normally don't
# have python installed.
- hosts: all
  remote_user: "{{ ansible_system_user }}"
  tasks: []
# role for nagios clients via NRPE
- hosts: all
  remote_user: "{{ ansible_system_user }}"
  roles:
    - { role: nagios-client }
    - { role: firewall_client, when: manage_firewall_client }
nagios_clients refers to a group of hosts in the inventory file that you want to run Ansible against:
https://docs.ansible.com/ansible/latest/user_guide/intro_inventory.html
This is an INI-style file named “hosts” that lists the destination hosts underneath a group header like this:
[nagios_clients]
host01
host02
host03
You can try this, but I would probably run it standalone using the ad-hoc module commands from my previous reply, since all you’re doing is installing a package, dropping a configuration file and starting the service. This assumes you already have a config file set up.
ansible -i hosts nagios_clients -m yum -a "name=nrpe state=latest"
ansible -i hosts nagios_clients -m copy -a 'src=nrpe.cfg dest=/etc/nagios/nrpe.cfg'
ansible -i hosts nagios_clients -m service -a "name=nrpe state=started enabled=yes"
I don’t think I can help you much when deviating from the scope of the playbook, as the server and client are tightly integrated, with the NRPE checks being generated/managed on the Nagios server side along with the rest of the Ansible playbooks. However, if you can get things to work this way that’s great.
I already have a Nagios server, and now I need to write an Ansible playbook to schedule downtime and
silence and unsilence checks on other hosts.
I tried the config below, but I’m not sure where I need to define the Nagios server, URL, user and password details.
# set 30 minutes of apache downtime
- nagios:
    action: downtime
    minutes: 30
    service: httpd
    host: '{{ inventory_hostname }}'
My hunch is you’d want to look into the uri module for Ansible, which will let you perform POST requests. You’d want to model the API/URL for Nagios per service and probably pass the service and host(s) you want to silence or unsilence as variables when running the playbook.
https://docs.ansible.com/ansible/latest/modules/uri_module.html
This is something you’d write yourself and isn’t in the scope of the playbook here.
You might also want to look at interacting with the API directly, there’s a library or two floating around for this but I haven’t tried it.
https://pypi.org/project/nagios-api/
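Whichever transport you pick, what Nagios ends up processing for a service downtime is an external command line. As a rough sketch (not part of the playbook), this builds a SCHEDULE_SVC_DOWNTIME line in the documented Nagios external command format. The function name svc_downtime_command is hypothetical, and how the line gets delivered to your Nagios server (for example its external command file, whose path depends on the install) is left out as an assumption about your environment.

```python
import time


def svc_downtime_command(host, service, minutes, author, comment, now=None):
    """Build a SCHEDULE_SVC_DOWNTIME external command line for Nagios."""
    start = int(time.time()) if now is None else int(now)
    duration = int(minutes) * 60  # Nagios expects seconds
    end = start + duration
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(start),
        str(end),
        "1",  # fixed downtime: runs exactly from start to end
        "0",  # trigger_id 0: not triggered by another downtime
        str(duration),
        author,
        comment,  # must not contain ';', which is the field separator
    ]
    return f"[{start}] " + ";".join(fields)


if __name__ == "__main__":
    # Hypothetical host and author, matching the 30 minute httpd example.
    print(svc_downtime_command("webserver01", "httpd", 30, "nagiosadmin", "Scheduled maintenance"))
```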
Hi, thanks for this. I’ve noticed httpd and Nagios being installed on ALL servers too. Is it just my setup?
Hi LL, you should only have one server and everything else is a client. (unless you want separate Nagios servers per sub-domain/environment etc).
e.g.
[nagios]
my-nagios-server
Everything else would be a client of some sort below this in the example inventory file, corresponding to the checks or category they fit into. Only have one unique category/client/inventory group per host (don’t list it in more than one location; this is why there is some overlap in checks, like the “server” role being inherited plus other functionality).
HTTPD is required because Nagios does not have its own web server; the interface is just CGI.
Hope this helps, if you have any more questions please let me know.
Thank you very much. How can I log in?