How bad can a greedy regex really get?

Recently, a client approached us with a curious problem that manifested as excessive CPU usage on their Puppet Server. One of the surprising things was that the issue appeared out of the blue and was causing timeouts during Puppet runs on 650 nodes. None of the recent changes to the Puppet code coincided with the issue.

Let the troubleshooting begin!

We began the troubleshooting process by reviewing performance metrics, which confirmed that the previously predictable CPU usage had indeed spiked all of a sudden.

We then confirmed that the Puppet Server processes were consuming way too much CPU time. Since the server was under heavy load, the Puppet Server log files were full of timeout errors such as:

ERROR [p.s.r.request-handler-core] Error 503 on SERVER at /production/node/node.example.local: Attempt to borrow a JRuby instance from the pool timed out; Puppet Server is temporarily overloaded. If you get this error repeatedly, your server might be misconfigured or trying to serve too many agent nodes. Check Puppet Server settings: jruby-puppet.max-active-instances

These log entries were not that helpful, because they merely pointed out the obvious. Right, it was time to restart the Puppet Server with the log level set to DEBUG, hoping that the verbose log entries would give us an idea of what was going on. We also increased the number of JRuby instance workers, even though we were confident that the existing configuration was adequate for 650 nodes executing Puppet runs every 30 minutes.
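
For reference, the setting the error message points at lives in puppetserver.conf; here is a minimal sketch, with an illustrative value rather than the client's actual one:

# /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf
jruby-puppet: {
    # number of JRuby instances available to serve agent requests
    # (illustrative value; tune it to CPU cores and workload)
    max-active-instances: 6
}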

So, we restarted the Puppet Server and continued to observe the situation. Initially, the CPU usage stayed at the usual levels, but after a certain number of Puppet runs it would begin to rise. At some point, one worker process would start to consume 100% of CPU time; a few minutes later another worker would follow, then another, until finally all worker processes were consuming 100% of CPU time. The debug log entries, unfortunately, didn’t reveal anything helpful.

We then started to suspect that the Java heap was not big enough and that, under memory pressure, the garbage collector was causing the increased CPU usage. We decided to increase the heap size. Thankfully, we had enough spare memory on the server.
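
For the record, the Puppet Server heap is set through the JVM arguments in the service defaults file; the path and sizes below are illustrative and distribution-dependent:

# /etc/sysconfig/puppetserver (RHEL) or /etc/default/puppetserver (Debian/Ubuntu)
# Illustrative sizes only; match them to the memory you can actually spare
JAVA_ARGS="-Xms4g -Xmx4g"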

We restarted the Puppet Server once again, observed the situation, and again saw the same problem. OK, so the garbage collector was not the culprit.

Slowly getting there…

One of the theories we had was that maybe one of the nodes or a whole node group was somehow causing the Puppet Server to go crazy. To confirm this, we needed to observe the situation on the Puppet Server without the noise of all 650 Puppet nodes fighting for attention.

We used Cumin to stop the Puppet agents on all nodes and then trigger Puppet runs in batches of 30 nodes. Interestingly, the Puppet Server behaved normally the whole time. Just as we began to doubt that we would be able to reproduce the problem this way, we reached the batch of nodes that triggered the excessive CPU usage. Finally, we were getting somewhere!
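
If you have never used Cumin, the invocations for something like this look roughly as follows; the host query syntax is illustrative and depends on the backend you have configured, so treat it as a sketch rather than a copy-paste recipe:

# stop the scheduled agent runs everywhere
cumin '*.example.local' 'systemctl stop puppet'

# then trigger on-demand runs in batches of 30 nodes at a time
cumin --batch-size 30 '*.example.local' 'puppet agent --test'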

We compared the catalog compilation times of all the nodes in that batch and found that compilation for certain nodes took 60-120 seconds:

INFO  [puppet-server] Puppet Compiled catalog for node123.example.local in environment production in 0.81 seconds
INFO  [puppet-server] Puppet Compiled catalog for node456.example.local in environment production in 0.97 seconds
INFO  [puppet-server] Puppet Compiled catalog for xyz-node24.example.local in environment production in 122.96 seconds

On average, the catalog compilation should take no more than 1.2 seconds. Something was seriously wrong there.
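
Pulling those numbers out of the log is a quick one-liner; assuming the default log location, something like this sorts the slowest compilations to the top:

# list the slowest catalog compilations first
grep 'Compiled catalog' /var/log/puppetlabs/puppetserver/puppetserver.log \
  | awk '{print $(NF-1), $0}' | sort -rn | head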

We collected the hostnames of all the problematic nodes and immediately saw that they had one thing in common: the “xyz-” prefix in their hostnames (e.g. xyz-nodename.example.local). We knew that a hostname by itself couldn’t possibly cause this problem, so it had to be a Puppet module or class that is applied specifically to those nodes.
We enabled the Puppet Server profiler, triggered a Puppet run on one of the problematic nodes, and observed the logs. The following log entry caught our eye:

DEBUG [c.p.p.LoggingPuppetProfiler] [compiler evaluate_resource Class[Example::class]] (112440 ms) Evaluated resource Class[Example::class] 

112 seconds for evaluating a single class (Example::class)? That definitely isn’t right. We checked the Git history for that (in-house developed) Puppet module and noticed that one of the recent changes introduced a condition that specifically matches nodes with the “xyz-” prefix.

if ($facts['clientcert'] =~ /^xyz-(.+)*\.example\.local/) {
  file {'/path/to/directory':
    ensure => directory,
    owner  => 'root',
    ...
  }
}

What immediately caught our attention was the regex, which should not have had the asterisk after the (.+) expression. Nesting one unbounded quantifier inside another like that, (.+)*, forces a backtracking regex engine to try an enormous number of ways to split the consumed text between the inner + and the outer *, which is exactly the kind of catastrophic backtracking we were seeing. This was obviously an unintentional mistake, which had been copied to two more places in the Puppet code, and it ultimately caused the Puppet Server to break a sweat trying to process this greedy regex. As soon as the regex was corrected, the problem instantly disappeared. Yay!
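
For completeness, this is what the corrected condition looks like, with the stray asterisk removed (the resource body is abbreviated exactly as in the snippet above):

if ($facts['clientcert'] =~ /^xyz-(.+)\.example\.local/) {
  file {'/path/to/directory':
    ensure => directory,
    owner  => 'root',
    ...
  }
}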

Hey, but what about…

But one thing kept bothering us. Why didn’t this code change overload the Puppet Server as soon as it was deployed to production?

Well, it turned out that the Puppet runs on the problematic hosts had been spaced far enough apart that they didn’t overlap. Average CPU usage did rise slightly, but it wasn’t until the overall Puppet run splay was disrupted, some two weeks later, that the problem brought the Puppet Server to its knees!
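
For context, this spacing of agent runs is controlled by the runinterval, splay, and splaylimit settings in the agents’ puppet.conf; the values below are purely illustrative, not the client’s actual configuration:

# /etc/puppetlabs/puppet/puppet.conf (on the agents)
[agent]
runinterval = 30m    # agents check in every 30 minutes, as in this setup
splay       = true   # add a random delay before each scheduled run
splaylimit  = 30m    # upper bound for that random delay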

Share your regex horror story with us. Leave a comment or mention us on social media!
