High availability for software installation procedures

The best way to vouch for the correctness of a piece of software is to write a proof that shows it. Few people actually do that, except maybe for the awesome algorithms and distributed systems communities (e.g. Raft). Barring that, the next best thing is likely an exhaustive test suite.

However, even given a proof showing the correctness of an algorithm, implementation bugs can always occur, and the likelihood increases with implementation complexity. This problem encouraged the original Raft authors to make “understandability” one of the main design goals of their algorithm. Their work highlights the importance of incorporating the software’s developers and users into any assessment of implementation correctness. We extend this idea to the installation instructions for Haven, our software for securely backing up data storage servers.

An interesting observation is that it’s hard to show propositions about software-caused side effects without attaching procedures to the software. For example, a backup program can only ever provide me with a daily snapshot of all my data under the assumption that I run it at least once a day, or perhaps even just under the assumption that I install it correctly.
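
To make that concrete, here is a minimal sketch (not Haven’s actual code; the state file path and the 24-hour window are assumptions) of turning the scheduling assumption into something the software can check instead of silently rely on:

```python
# A minimal sketch, not Haven's actual code: the state file path and the
# 24-hour window below are assumptions, chosen only to show how the
# "runs at least once a day" assumption can be checked instead of trusted.

import sys
import time
from pathlib import Path

STAMP = Path("/var/lib/backup/last-run")  # hypothetical location of a run marker
MAX_AGE = 24 * 60 * 60  # the "daily snapshot" claim only holds within this window


def check_daily_assumption() -> None:
    """Abort loudly if the scheduling assumption has already been violated."""
    if not STAMP.exists():
        sys.exit("no previous run recorded; the daily-snapshot guarantee does not hold yet")
    age = time.time() - STAMP.stat().st_mtime
    if age > MAX_AGE:
        sys.exit(f"last run was {age / 3600:.1f}h ago; the daily-snapshot guarantee is broken")


def record_run() -> None:
    """Mark the current run so the next invocation can check the assumption."""
    STAMP.parent.mkdir(parents=True, exist_ok=True)
    STAMP.touch()
```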

However, as soon as a procedure has to be followed manually by the user to achieve correctness, human error becomes an issue. Humans make typos, forget information, are lazy, and are a general nuisance. It becomes an interesting question, then, how to provide safety under these common types of operator error.

Our approach borrows simple concepts from the field of high-availability engineering: fail-safe behaviour, positive acknowledgement, and redundancy to avoid flaky single points of failure.

Failing safely means that when encountering seemingly “strange” conditions, the system should abort and loudly yell at the user instead of trying to “keep on trucking” and potentially reporting a false success. In our case, this means that the backup process should assume that the user misconfigured everything, didn’t touch the defaults, and isn’t doing manual data integrity checks to test backups for correctness.
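
As a rough illustration of that stance, here is a sketch rather than Haven’s actual code, with `DEFAULT_TARGET` and `upload_snapshot` as made-up placeholders: the run refuses to proceed against an untouched default configuration and aborts loudly on anything unexpected:

```python
# A minimal sketch of the fail-safe stance, not Haven's actual code: assume the
# user left the shipped defaults untouched, refuse to run in that case, and abort
# loudly on anything unexpected instead of reporting a false success.

import sys

DEFAULT_TARGET = "CHANGE-ME"  # hypothetical placeholder shipped in a default config


def upload_snapshot(target: str) -> None:
    """Stand-in for the real upload step."""
    raise NotImplementedError("real backup logic goes here")


def run_backup(config: dict) -> None:
    target = config.get("target", DEFAULT_TARGET)
    if target == DEFAULT_TARGET:
        # Failing here is safer than silently backing up to a placeholder destination.
        sys.exit("refusing to run: the backup target is still the shipped default")
    try:
        upload_snapshot(target)
    except Exception as exc:
        # Any "strange" condition aborts with a non-zero exit instead of
        # keeping on trucking and pretending the backup worked.
        sys.exit(f"backup aborted: {exc}")
```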

It also immediately follows that the software has to provide positive “Yes, the backup worked”-style notifications to the user, since the absence of notifications could just as well mean that, e.g., the host running the backup software has lost power.
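
A dead-man’s-switch-style acknowledgement is one simple way to get this property. The sketch below is illustrative only, and the endpoint URL is a placeholder; the point is that the final step of a successful run is an explicit “it worked” message on a channel the user actually reads:

```python
# Illustrative sketch of positive acknowledgement; the URL is a placeholder.
# Silence can always also mean "the backup host lost power", so success is only
# ever signalled explicitly, and only as the very last step of a verified run.

import urllib.request

ACK_URL = "https://example.invalid/backup-ok"  # hypothetical dead-man's-switch endpoint


def acknowledge_success() -> None:
    """Send the positive "the backup worked" signal; call only after all checks pass."""
    urllib.request.urlopen(ACK_URL, timeout=10).read()
```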

Finally, by adding redundancy to the procedures given to the user, we can greatly reduce the chance that user errors accumulate to the point of rendering the backup useless. In our case, we provide the user with two completely separate sets of installation instructions for two very different software systems, each of which independently attempts to back up the user’s data. The user is instructed to install both systems, each according to its own instructions.
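
The orchestration around this can stay trivial. The sketch below uses made-up command names in place of the two installed systems; what matters is that each system is invoked and reported on independently, so a mistake in one installation cannot mask a failure in the other:

```python
# A sketch of the redundancy idea under assumed command names, not Haven's code:
# each independently installed system is invoked on its own and reports its own
# outcome, so one system's success is never taken as a proxy for the other's.

import subprocess

# Hypothetical commands standing in for the two separately installed backup systems.
SYSTEMS = {
    "system-a": ["backup-a", "--run"],
    "system-b": ["backup-b", "--run"],
}


def run_all() -> dict[str, bool]:
    results = {}
    for name, cmd in SYSTEMS.items():
        try:
            ok = subprocess.run(cmd).returncode == 0
        except FileNotFoundError:
            ok = False  # a missing installation counts as a failed backup
        results[name] = ok  # each outcome is reported independently
    return results
```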

Therefore, to incur data loss, the user has to make distinct errors within every set of instructions, compared to the usual requirement of just a single error. The same applies to bugs: for backups to be corrupted, there have to be bugs in all of the implementations, compared to just one bug in a single implementation.
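
With purely illustrative numbers rather than measurements: if a fatal mistake in one procedure happens with an independent probability of 5%, the chance of both procedures failing together drops to 0.25%:

```python
# Purely illustrative arithmetic; the 5% figure is made up, not measured.
p_single = 0.05               # hypothetical chance of a fatal error in one procedure
p_both = p_single * p_single  # data loss requires both independent procedures to fail
print(f"{p_single:.2%} -> {p_both:.2%}")  # prints "5.00% -> 0.25%"
```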

So far, Haven’s components have been used successfully in production for two years, backing up a dataset of around 600 GB to two independent cloud providers. Two disaster recoveries were performed, one of which revealed troubling behaviour in Duplicity’s OpenStack backend: under network congestion, it failed to detect aborted volume uploads.

The Haven source code is available on GitHub. Contributions of any kind are wholeheartedly welcome. If you’re using Haven, I’d love to hear from you.

Future work should focus on identifying more single points of failure (e.g. the host kernel and ZFS) and assessing the expected damage from each, given its probability of failure. Also, better management of user attention in case of significant failures or delays should be implemented to achieve a higher signal-to-noise ratio in user communications. Finally, a system for automatic verification of the stored data could use a “staged” recovery scenario to check for a wider variety of failure modes than what is covered by available tooling. This would uncover issues with verification tools (e.g. duplicity verify) that assume too much about local state on the system performing the recovery.