Tackling cache invalidation

Tackling cache invalidation

Tackling cache invalidation

It’s a well-known fact that there are two difficult problems in computer science:

  1. naming things
  2. cache invalidation
  3. off by one error

In this post, we’re going to talk about the first one, cache invalidation.

Not a long time ago, we got a support request from one of our client’s developers with the following issue:

We deployed a new version of the app and are getting error 500, last time you changed something in the server configuration, and it did the trick.

A bit of a backstory

We’re talking about your standard LAMP stack here.

First of all, let’s take a look at the deployment process. What the devs used was symlink based deployment. The idea here is to have the document root of the application to be a symlink to the current version of the app. For example let the document root be /home/username/public_html/public which is symlink to /home/username/appversion1/

To change the version to /home/username/appversion2/ you need to change the symlink to point to the new version of the code. This is great, as the change is pretty much instantaneous, there is no downtime, and you don’t need to bother your local sysadmin for restarting the webserver or stuff like that.

Except when it’s not.

We had a similar issue before on the development server and figured that OPcache was the cause of the problem. If you’re not familiar with it, OPcache is a PHP module that stores precompiled bytecode of PHP scripts, thus eliminating the need to load and parse scripts on each request.

We figured that the older cached version of the code was interfering with some newer files and that caused the application to return a 500 error. We lowered the opcache.revalidate_freq parameter, which sets the cache validation interval, restarted PHP-FPM for the change to take effect and got rid of the 500 error.

We made the same change to the production server and called it a day.

Back to the present day

So what is clear now is that lowering the opcache.revalidate_freq parameter wasn’t the right solution, and it was probably the restart that did the trick.

We have two options now, restart PHP-FPM every time a new version of the application is deployed, which would flush all the possible caches or figure out what’s causing the problem. While the first one would solve the problem, it just doesn’t “feel” right. To make an analogy, it would be like turning off the engine of your car only to mute the radio. 

So we started digging a bit deeper. The debugging was going to take some time, so the devs reverted the site to a working version.

First of all, one of the things that may cause a 500 error is a misconfigured .htaccess file.

However, this wasn’t the case here. The file didn’t change between app versions, and there was nothing in it that might have caused the issue.

The next logical step was to take a look at the OPcache config file again.

Digging through it, we found the “opcache.use_cwd” parameter. When enabled, it makes OPcache append the current working directory to the cache key, which makes sense. If we cached /home/username/appversion1/index.php as index.php and we’re supposed to be using /home/username/appversion2/index.php, this might trigger the error.

So PHP-FPM restarted, asked the devs to switch versions and hope for the best.

500 Internal Server Error

##$%!&!!”*?!”#

So apparently this isn’t it.

The next thing was to set the opcache.revalidate_freq to 0. What this means is to validate the file for every request. We were sure that this was going to do the trick.

Well, it didn’t.

Looking through the config file, we didn’t find anything else that might cause problems. At this point we weren’t even sure anymore if OPcache was the cause of the issue, so we figured we’ll disable it and see what was going to happen. As you might have guessed, it would be a really dumb blog post if I told you that worked.

We were getting more and more convinced that there is a bug in the code as it obviously wasn’t an OPcache issue.

What about the code?

So, the next step would be to go through the application logs; however, for some unknown reason, there was nothing relevant in the logs, just older entries. We suspect the reason was that logging was disabled on the application level. However, as a policy, we don’t do any changes to clients code.

With no logs to tell us what’s going on here, the next step was to use strace to take a look “under the hood”. We were lucky here, as this wasn’t a high traffic website and it was already late, so there was very little activity on the server, which resulted in only 4 PHP-FPM child processes. The devs switched to the new version, and we started tracing one of the processes while sending HTTP requests. Once the request gets handled by the traced child, we can see what the code is doing.

After a bunch of gibberish strace output, we got to the interesting part. The application started to read a .php file, and then it crashes and returns a 500 error. We have now pinpointed the problematic part of the code. This was one of the files that had changed in the newer version. The code in this file was setting up some environment variables for the database connection. We took a look at the values of the variables, and they were all correct. However, a syntax or logic error wasn’t yet excluded. We told the devs about the findings and got a response soon that this was a standard way of doing thing in the PHP ecosystem and that there is nothing wrong with the code. The devs then copied the index.php file to index2.php and when requesting the index2.php everything worked fine. Figuring that developers might (definitely) know PHP programming a bit (lot) better than me and given the empirical proof that there was nothing wrong with the code, we were back where we started.

The devs rewrote the index.php file in the new version to echo “success”, told me how to switch version so we could test stuff without bothering them.

Cache invalidation

Back to square one

There are two things we know for sure now:

  1. it’s not the code
  2. it’s a caching issue

Ironically it’s the same thing we suspected at the begging, but at least we are sure now so I would call that progress. The only thing left to do was to check the other PHP configuration files. Looking through the php.ini file, we came across this thing:

; Duration of time, in seconds for which to cache realpath information for a given
; file or directory. For systems with rarely changing files, consider increasing this
; value.
; http://php.net/realpath-cache-ttl
;realpath_cache_ttl = 120

interesting…

A quick search about realpath cache will tell you two things:

  1. the default TTL is 120 seconds
  2. realpath is used to get the full path from a relative path.

We lowered the TTL to 2 seconds, deployed the new version and got a 200 response.

This is great! OPcache re-enabled, and everything worked as expected.

What was going on?

To get a better understanding of what’s going on, let’s take a look at a small example. First of all, we configure the web server as follows

ServerName example.com
ServerAlias www.example.com
DocumentRoot /home/example/public_html/public/app

In the /home/example/public_html we have the following situation

$ pwd && tree
/home/example/public_html
.
├── rel1
│   ├── app
│   │   ├── index.php
│   │   └── realpath.php
│   └── includeme.php
└── rel2
    ├── app
    │   ├── index.php
    │   └── realpath.php
    └── includeme.php

The index.php files are the same in rel1 and rel2 direcotries

<?php
include $_SERVER["DOCUMENT_ROOT"].'/../includeme.php';
?>

while the rel1/includeme.php prints “1” and rel2/includeme.php prints “2”.

We can now make the symlink:

ln -s /home/example/public_html/rel1/ ./public

If we now make an http request to http://example.com we would get the expected output.

andrea@andrea:~$ curl http://example.com
1

Let’s now change the symlink to point to rel2/

unlink public && ln -s /home/example/public_html/rel2/ ./public

What we would normally expect is the see it output 2, as rel2/includeme.php should be included. However, we get the output 1 meaning rel1/includeme.php got actually included.
The ones with the sharpest eye among you will notice that we also have a realpath.php script. It just dumps what’s currently in the realpath cache and the current time.

So after switching the symlink to rel2 and requesting realpath.php we get the following:

andrea@andrea:~$ curl http://example.com/realpath.php
13:35:34
[...]
path: /home/example/public_html/public/app/../includeme.php
realpath: /home/example/public_html/rel1/includeme.php
expire: 13:37:24

Here we can clearly see the realpath_cache_ttl in action.

What about performance?

The realpath cache definitely improves performance; however, we haven’t noticed a significant performance impact on the application after lowering the TTL. This is mostly because in normal circumstances resolving the real path is a swift operation and we also have to consider that we have OPcache here as well, which helps a lot.

If you are not a fan of lowering cache TTL you can use the PHP clearstatcache function which clears the realpath cache.

Just one thing to note is that the cache is process specific, meaning that if you use PHP-FPM you’ll have to clear the cache individually for each PHP process.

Do you have any questions or comments on this topic? Let us know in the comments below, or get in touch and we’ll be happy to help!

Share this post