Foreman is an awesome provisioning, utility and life-cycle management solution for Linux and UNIX-like server infrastructure. It unifies a lot of things like DHCP, DNS, PXE, and automated provisioning into one tool. Unfortunately, we’ve hit some issues scaling the Foreman-Proxy component past 15,000 VMs when it is used with ISC DHCPD.
Below is a quick, rather ugly workaround I’ve written in Python for now, along with how we should be fixing it permanently.
The Problem
When the Foreman API performs a provisioning action, it calls Foreman-Proxy, which scans the /var/lib/dhcpd/dhcpd.leases file. When this file is very large, Foreman-Proxy can time out on the request, causing Foreman to fail the provisioning request. More generally, a very large dhcpd.leases file may slow down Foreman’s UI and CLI responsiveness. Perhaps Foreman-Proxy shouldn’t be performing sequential reads of the file, and the data should instead be kept in a database? I don’t know, but I’ll bring it up with the friendly developers (who are very responsive, supportive and available upstream in #theforeman-dev on irc.freenode.net).
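To get a feel for how much of the file is made up of temporary leases before pruning anything, a small counter can help. This is a minimal Python 3 sketch, not part of Foreman or ISC DHCPD; the function name `count_leases_by_prefix` is hypothetical, and it assumes each lease entry opens with a line of the form `lease <ip> {`.

```python
from collections import Counter


def count_leases_by_prefix(lines):
    """Count 'lease <ip> {' entries, grouped by the first two octets of the IP."""
    counts = Counter()
    for line in lines:
        # Only lines that open a lease entry are counted.
        if line.startswith("lease "):
            parts = line.split()
            if len(parts) >= 2:
                octets = parts[1].split(".")
                counts[".".join(octets[:2])] += 1
    return counts
```

Passing it an open file handle for /var/lib/dhcpd/dhcpd.leases and calling `.most_common(5)` on the result would show which address ranges dominate the file.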
Hack/Workaround (for us.. for now)
In our case, the 15,000+ VMs that are aggressively spun up are all provisioned against a large 10.1.x /18 network, while our permanent entries are all fully qualified domain names. This made culling the file easy: we simply remove all 10.1 lease entries on a given interval.
Until I can address the issue upstream, the following tool culls your ‘/var/lib/dhcpd/dhcpd.leases’ file in-place and bounces services, making Foreman-Proxy (and therefore Foreman) responsive again when you are provisioning several thousand VMs in an aggressive, automated fashion.
#!/usr/bin/env python
# Prune the VM-created entries from the DHCP leases file.
# All of our permanent DHCP leases are FQDNs, so we prune every
# entry that starts with a 10.1 IP scheme.
# Purpose: in our R&D environments the number of temporary VM DHCP
# reservations cripples Foreman-Proxy over time.

import datetime
import fileinput
import shutil
import sys
from subprocess import call

LEASES = '/var/lib/dhcpd/dhcpd.leases'

# first, stop dhcpd temporarily
call(["service", "dhcpd", "stop"])

# back up the existing dhcpd.leases file with a timestamp suffix
shutil.copy2(LEASES, LEASES + '-' + datetime.datetime.now().strftime('%Y%m%d%H%M'))

# in-place edit dhcpd.leases; with inplace=True, stdout is written back to the file
ignore = False
for line in fileinput.input(LEASES, inplace=True):
    if not ignore:
        if line.startswith('lease 10.1.'):
            ignore = True
        else:
            # sys.stdout.write behaves the same on Python 2 and Python 3
            sys.stdout.write(line)
    # a blank line ends the entry being skipped
    if ignore and line.isspace():
        ignore = False

# start dhcpd back up again
call(["service", "dhcpd", "start"])

# bounce foreman-proxy for good measure
call(["service", "foreman-proxy", "restart"])
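For anyone who wants to try the pruning logic without touching a live dhcpd, here is a hedged Python 3 sketch of the same idea as a pure function. It is a variation, not the script above: it matches addresses against a network with the standard `ipaddress` module instead of a string prefix, and it treats a closing brace in column 0 as the end of a lease entry. The name `prune_leases` is hypothetical, and `10.1.0.0/18` is an assumed base address, since the post only describes the range as 10.1.x /18.

```python
import ipaddress

# Assumed base address for the temporary VM range (not stated in the post).
VM_NETWORK = ipaddress.ip_network("10.1.0.0/18")


def prune_leases(lines, network=VM_NETWORK):
    """Yield lease-file lines, dropping every lease entry whose IP is in `network`."""
    skipping = False
    for line in lines:
        if skipping:
            # A closing brace in column 0 ends the lease entry being dropped.
            if line.startswith("}"):
                skipping = False
            continue
        if line.startswith("lease "):
            parts = line.split()
            try:
                address = ipaddress.ip_address(parts[1])
            except (IndexError, ValueError):
                address = None  # not a parseable lease header; keep the line
            if address is not None and address in network:
                skipping = True
                continue
        yield line
```

One design difference worth noting: the `lease 10.1.` prefix match in the script covers every address from 10.1.0.0 to 10.1.255.255, which is wider than a single /18, whereas the network check here only drops addresses inside the given range.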
Proper Fix / Solution
The ideal fix is to move that 10.1.x /18 subnet onto its own DHCPD server, perhaps having requests from that network use the ‘next-server’ entry. The hack above works for now, but in a really ugly fashion.
As mentioned before, Foreman should ideally be able to scale out to accommodate this load on sub-components like ISC BIND and DHCPD. I know there are folks like CERN who have had success with load-balancing Foreman, and I’d love to know more about that.