Hygienic Upgrades

This post outlines a pattern that robust upgrades of running software systems have in common.

Upgrading systems

Real-world software systems can get very complex, but for the purposes of this blog post, it will suffice to describe a simpler system.

Imagine a simple distributed system with one client and one server. The client can send requests to the server and receive responses. The server cannot contact clients on its own. The system is functional when the client and server both function and can communicate effectively.

(Note that in the system described here, the client software is not obtained directly from the server, as might be the case on a website, but is instead an entirely separate piece of software.)

Because this is a software system, and not a physical product, the development team has decided they can "upgrade" (change) either the server or the client whenever they see fit.

The development team has decided they want to make a new version of the server. The current client would not function with this new server, so they also make a new version of the client.

Now comes the issue of deployment.

If they upgrade the server but not the client, the system breaks. Now the client has to be upgraded urgently. If anything goes wrong in the meantime, the system stays broken for an indeterminate amount of time.
If they upgrade the client but not the server, the system breaks. Now the server has to be upgraded urgently. If anything goes wrong in the meantime, the system stays broken for an indeterminate amount of time.
If they upgrade the client and the server at the same time, and something goes wrong, the system is broken for an indeterminate amount of time. Now the team has to produce a fix urgently.

If you remember only one thing from this blog post, let it be this:

Urgency is a symptom of a fragile system.

Goals

Ideally, we would like to have a situation in which:

All urgency is removed from the development process.
Something going wrong during the development process does not break the system.

Compatibility

The key to solving these issues is to introduce the notion of compatibility. It is possible (and necessary) to produce a client that can communicate with multiple versions of a server.

Similarly, it is possible to produce a server with which multiple versions of a client can communicate.

To remove the urgency in upgrading a client, we can make a server that can work with both the old and the new version of the client. To remove the urgency in upgrading a server, we can make a client that can work with both the old and the new version of the server.
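As a minimal sketch of what such a compatible server could look like, assume a hypothetical protocol in which version 1 requests carry a single `name` field and version 2 requests carry a `version` field plus `first_name` and `last_name`. None of these field names come from a real system; they only illustrate a server that dispatches on the version of the request it receives.

```python
def handle_v1(request: dict) -> dict:
    # Hypothetical old format: a single "name" field.
    return {"greeting": f"Hello, {request['name']}"}


def handle_v2(request: dict) -> dict:
    # Hypothetical new format: separate first and last name fields.
    full_name = f"{request['first_name']} {request['last_name']}"
    return {"version": 2, "greeting": f"Hello, {full_name}"}


def handle(request: dict) -> dict:
    # Assumption: old clients never send a "version" field,
    # so a missing field is treated as version 1.
    version = request.get("version", 1)
    if version == 1:
        return handle_v1(request)
    if version == 2:
        return handle_v2(request)
    return {"error": f"unsupported version {version}"}
```

With a server like this, an old client sending `{"name": "Ada"}` and a new client sending `{"version": 2, "first_name": "Ada", "last_name": "Lovelace"}` both get a valid response, so neither has to be upgraded in a hurry.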

Performing a hygienic upgrade

If we only produce a version 2 of both the client and the server, there is no way to perform an upgrade without any urgency.

Indeed, each of these options leads to urgency and a broken system:

- Upgrade the client to version 2 first, while the server is still at version 1.
- Upgrade the server to version 2 first, while the client is still at version 1.
- Try to upgrade both at the same time, when either upgrade can fail or the switch is not instantaneous.
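The options above can be checked mechanically. The following sketch assumes a hypothetical compatibility table in which only matching versions can communicate, and models each upgrade as a path of (client, server) states. The names `COMPATIBLE`, `works` and `broken_states` are illustrative only.

```python
# Assumption: only matching client and server versions can communicate.
COMPATIBLE = {("1", "1"), ("2", "2")}  # (client version, server version)


def works(client: str, server: str) -> bool:
    return (client, server) in COMPATIBLE


def broken_states(path: list) -> list:
    # Return every state on the upgrade path in which the system is broken.
    return [state for state in path if not works(*state)]


client_first = [("1", "1"), ("2", "1"), ("2", "2")]
server_first = [("1", "1"), ("1", "2"), ("2", "2")]
```

Both paths pass through a broken state. A simultaneous upgrade that is not instantaneous, or that fails halfway, passes through one of the same two intermediate states.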

We will need to use an in-between version of either the client or the server to bridge the breakage. Because the communication is only one-directional, we only need one of the two components to get an in-between version. (If the communication were bi-directional, such as in a peer-to-peer system, both would have to get an in-between version.) We choose to give the server an in-between version, called 1.5.

We make server version 1.5 such that both client version 1 and client version 2 are able to communicate with it. This way we can upgrade the server from version 1 to version 1.5 first. If anything goes wrong with this upgrade, the server can be rolled back without any urgency to upgrade the client.

In fact, after this upgrade, the rest of the process can happen at any later time. There is no hurry to go through the rest of the process at this point.

The next step consists of upgrading the client(s) from version 1 to version 2. If anything goes wrong with this upgrade, the clients can be rolled back as well.

At this point you could consider the upgrade complete, but server version 1.5 now probably contains compatibility code for client version 1 that is obsolete and can be removed. Version 2 of the server can remove this code to simplify the codebase, such that the upgrade of the server from version 1.5 to version 2 completes the process.

Now we find ourselves in the situation that we started in, ready for another hygienic upgrade cycle.

Appendix: Summary of the workflow

Perform the following changes to move from Client/Server at version 1 to version 2.

Initial situation: Client and server both at version 1.
First produce the following versions:
- Server version 2
- Server version 1.5: Compatible with client versions 1 and 2.
- Client version 2: Compatible with server versions 1.5 and 2.
Upgrade the server to version 1.5.
Upgrade the client to version 2.
Upgrade the server to version 2.
Final situation: Client and server both at version 2.
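
The workflow above can be sketched as a small check, assuming the compatibility relations listed in the summary. The table and the helper names `works` and `is_hygienic` are hypothetical and only serve to illustrate the idea that no step of the workflow passes through a broken state.

```python
# Assumed compatibility table, as (client version, server version) pairs.
COMPATIBLE = {
    ("1", "1"),
    ("1", "1.5"),  # server 1.5 still accepts the old client
    ("2", "1.5"),  # server 1.5 already accepts the new client
    ("2", "2"),
}

# The states the system passes through during the workflow.
WORKFLOW = [("1", "1"), ("1", "1.5"), ("2", "1.5"), ("2", "2")]


def works(client: str, server: str) -> bool:
    return (client, server) in COMPATIBLE


def is_hygienic(path: list) -> bool:
    # Every state on the path must work. A rollback returns to the
    # previous state on the path, which also works, so rolling back
    # any single step never breaks the system either.
    return all(works(client, server) for client, server in path)
```

Under these assumptions `is_hygienic(WORKFLOW)` holds, whereas a path that skips server version 1.5, such as `[("1", "1"), ("1", "2"), ("2", "2")]`, does not.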