Workaround for Foreman-Proxy Scale Issue: DHCPD and lots of VMs

Foreman is an awesome provisioning, utility, and life-cycle management solution for Linux and UNIX-like server infrastructure.  It unifies a lot of things like DHCP, DNS, PXE, and automated provisioning into one tool.  Unfortunately we’ve hit some issues scaling the Foreman-Proxy component past 15,000 VMs when it’s backed by ISC DHCPD.
Below is the quick, rather ugly workaround I’ve developed in Python for now, along with how we should be fixing it permanently.

The Problem
When the Foreman API performs a provisioning action it calls Foreman-Proxy, which scans the /var/lib/dhcpd/dhcpd.leases file.  When this file is very large, Foreman-Proxy can time out on the request, causing the provisioning request in Foreman to fail.  More generally, a very large dhcpd.leases file can slow down UI and CLI responsiveness throughout Foreman.  Perhaps Foreman-Proxy shouldn’t be performing sequential reads of the file, and the lease data should instead be kept in a database?  I don’t know, but I’ll bring it up with the friendly developers (who are very responsive, supportive, and available upstream in #theforeman-dev on irc.freenode.net).

Hack/Workaround (for us, for now)
In our case, the 15,000+ VMs that are aggressively spun up are all provisioned against a large 10.1.x/18 network, while our permanent entries are all fully qualified domain names.  This made culling the file easy: simply remove all 10.1 lease entries on a given interval.
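For context, ISC DHCPD records each lease in dhcpd.leases as a block that opens with a `lease <address> {` line and is followed by a blank line, which is what makes prefix-based culling straightforward.  A typical entry looks roughly like this (addresses and timestamps are illustrative):

lease 10.1.4.27 {
  starts 3 2015/06/10 14:02:11;
  ends 3 2015/06/10 20:02:11;
  binding state active;
  hardware ethernet 52:54:00:ab:cd:ef;
}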

Until I can address the issue upstream, the following tool culls your /var/lib/dhcpd/dhcpd.leases file in place and bounces the relevant services, making Foreman-Proxy (and therefore Foreman) responsive again when you are provisioning several thousand VMs in an aggressive, automated fashion.

#!/usr/bin/env python
# Prune the VM-created entries from the DHCP leases file.
# All of our permanent DHCP leases will be FQDNs, so we prune all
# entries that start with a 10.1 IP scheme.
# Purpose: in our R&D environments the number of temporary VM DHCP
# reservations cripples Foreman-Proxy over time.

import datetime
import fileinput
import shutil
from subprocess import call

LEASES = '/var/lib/dhcpd/dhcpd.leases'

# first, stop dhcpd temporarily so it doesn't rewrite the file under us
call(["service", "dhcpd", "stop"])

# back up the existing dhcpd.leases file with a timestamp suffix
shutil.copy2(LEASES,
             LEASES + '-' + datetime.datetime.now().strftime('%Y%m%d%H%M'))

# in-place edit dhcpd.leases: skip every 'lease 10.1.*' block, which
# runs from its opening 'lease' line to the next blank line
ignore = False
for line in fileinput.input(LEASES, inplace=True):
    if not ignore:
        if line.startswith('lease 10.1.'):
            ignore = True
        else:
            print(line, end='')
    if ignore and line.isspace():
        ignore = False

# start dhcpd back up again
call(["service", "dhcpd", "start"])

# bounce foreman-proxy for good measure
call(["service", "foreman-proxy", "restart"])
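If you want to sanity-check the culling logic before pointing it at a live leases file, the same filter can be expressed as a pure function over a string.  This is a minimal sketch (the `prune_leases` name is mine, not part of the script above); it assumes lease blocks end at a blank line, as they do in a normal dhcpd.leases file:

```python
def prune_leases(text, prefix='lease 10.1.'):
    """Drop every lease block whose opening line starts with `prefix`.

    A block runs from its 'lease ...' line through the next blank
    line, mirroring the in-place loop in the script above.
    """
    out = []
    skipping = False
    for line in text.splitlines(keepends=True):
        if not skipping:
            if line.startswith(prefix):
                skipping = True
            else:
                out.append(line)
        if skipping and line.isspace():
            skipping = False
    return ''.join(out)

sample = (
    "lease 192.168.0.5 {\n  binding state active;\n}\n\n"
    "lease 10.1.2.3 {\n  binding state active;\n}\n\n"
)
# keeps the 192.168.0.5 block, drops the 10.1.2.3 block
print(prune_leases(sample))
```

Running this against a copy of the real file first is a cheap way to confirm nothing permanent would be removed.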

Proper Fix / Solution
The ideal way to fix this is to scale that 10.1.0/18 subnet out onto its own DHCPD server, perhaps having requests from the 10.1.0/18 network use the ‘next-server’ entry.  Until then, the workaround above does the job, however ugly it is.
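A dedicated dhcpd.conf for that subnet might look something like the following sketch (all addresses and the filename are illustrative, not our actual values; a /18 corresponds to netmask 255.255.192.0):

# /etc/dhcp/dhcpd.conf on the dedicated DHCP server
subnet 10.1.0.0 netmask 255.255.192.0 {
  range 10.1.0.10 10.1.63.250;
  option routers 10.1.0.1;
  next-server 10.1.0.2;      # TFTP/PXE boot server for this pool
  filename "pxelinux.0";
}

Splitting the churn-heavy range out this way keeps the throwaway VM leases out of the leases file that Foreman-Proxy has to scan.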

As mentioned before, Foreman should ideally be able to scale out to accommodate this kind of load on sub-components like ISC BIND, DHCPD, etc.  I know there are folks like CERN who have had success load-balancing Foreman, and I’d love to know more about that.

About Will Foster

hobo devop/sysadmin, all-around nice guy.