Zero Downtime Deploys

Dusty Candland | Tue Dec 05 2023 | infrastructure, processes

How we implemented zero downtime deploys to increase team productivity

We had moved our API out of Rails into a Clojure service. The service was running great, but deploys were a pain. We couldn't have downtime when deploying, so we were taking servers offline, updating them, and then bringing them back online: the typical rolling deployment pattern.

That worked, but it was slow and took a lot of manual time. Being slow meant longer times to fix bugs. Taking a lot of manual time meant wasted time and a hesitancy to release often. Both are bad for the team and the business.

After a long day of pushing fixes and feeling frustrated with the deployment process, we decided we needed a better way.

Coming from Rails, I knew the Unicorn web server had a way to restart itself using Unix signals. I liked the way that worked and wanted our Clojure service to do the same thing.

I put together a quick proof of concept. It worked, but there were some issues to figure out.

First, the service had to be able to shut itself down gracefully. We implemented the Component system to ensure the service could shut down.
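To illustrate the idea of a lifecycle-managed shutdown, here is a minimal sketch in Python (not the Clojure code we ran). The `Component` and `System` classes and the component names are hypothetical stand-ins; the point is only that parts start in dependency order and stop in reverse.

```python
class Component:
    """Hypothetical lifecycle interface: each part of the service can start and stop."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.running = False

    def start(self):
        self.running = True
        self.log.append(f"start {self.name}")

    def stop(self):
        self.running = False
        self.log.append(f"stop {self.name}")


class System:
    """Starts components in the order given and stops them in reverse."""

    def __init__(self, components):
        self.components = list(components)
        self.started = []

    def start(self):
        for component in self.components:
            component.start()
            self.started.append(component)

    def stop(self):
        # Stop only what actually started, newest first, so the web server
        # goes away before the resources it depends on.
        while self.started:
            self.started.pop().stop()


log = []
system = System(Component(name, log) for name in ["db-pool", "job-queue", "web-server"])
system.start()
system.stop()
```

Because every part knows how to stop, a shutdown signal only has to call `system.stop()` once.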

Secondly, we didn't want to shut down until the new service had started. This presented an issue: the web service was listening on a given port, so the new version couldn't listen on that port until the first version shut down.
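The port conflict is easy to reproduce. The sketch below, in Python for brevity, assumes a plain TCP listener on localhost; `try_listen` is a hypothetical helper, and the port is whatever the OS hands out.

```python
import errno
import socket


def try_listen(port):
    """Return a listening socket on 127.0.0.1:port, or None if the port is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# The "old" version holds the port (port 0 asks the OS for any free port).
old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old.bind(("127.0.0.1", 0))
old.listen()
port = old.getsockname()[1]

blocked = try_listen(port)  # fails while the old listener is alive
old.close()
new = try_listen(port)      # succeeds once the old listener is gone
```

The second attempt only succeeds after the first listener closes, which is why the new version has to wait for the old one.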

Lastly, since Clojure runs on the JVM, the startup time was significant.

To accomplish the last two requirements, the service would listen for the restart signal. Once it received the signal, it would start the new version in a new process. The new process would start everything but the web server. Once the new process got to that point, it would send a kill signal to the current instance and wait for the port to become available. Then it would start its web server.
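The handoff at the end of that sequence can be sketched as follows. This is a simplified Python illustration, not our Clojure implementation: the choice of `SIGTERM`, the helpers `bind_when_free` and `take_over`, and the inline `OLD_INSTANCE` script are all assumptions, and for the demo the parent script plays the new version while a child process plays the old one. It relies on Unix signals, so it will not run on Windows.

```python
import os
import signal
import socket
import subprocess
import sys
import time

# Stand-in for the running (old) version: holds a port until it gets SIGTERM.
OLD_INSTANCE = r"""
import signal, socket, sys
signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))  # graceful stop
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))
sock.listen()
print(sock.getsockname()[1], flush=True)
while True:
    signal.pause()
"""


def bind_when_free(port, timeout=5.0, interval=0.01):
    """Retry binding until the port is released or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            sock.listen()
            return sock
        except OSError:
            sock.close()
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


def take_over(old_pid, port):
    """Called by the new version once everything but the web server is up."""
    os.kill(old_pid, signal.SIGTERM)  # ask the old version to shut down
    return bind_when_free(port)       # then grab the port as soon as it frees


old = subprocess.Popen(
    [sys.executable, "-c", OLD_INSTANCE], stdout=subprocess.PIPE, text=True
)
port = int(old.stdout.readline())
server = take_over(old.pid, port)
old.wait(timeout=5)
```

The slow part of startup happens before `take_over` is called, so the window in which nobody is listening is only as long as the old process takes to exit and the retry loop takes to notice.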

That's a lot! But it made sure the new version could start and allowed almost zero downtime restarts. The almost part (milliseconds) was handled by Nginx, so the net result was zero downtime deploys!

This gave the team more confidence in deploys, reduced the number of servers we needed to have running, removed the need to deploy manually, and cut deployment time from 30-40 minutes to a few minutes.
