With the transition to a new operating system and a whole new backup technology, we need to review and codify our backup policy.
All files that are created by us and our users, or that are otherwise both valuable and hard to replace, should be backed up.
Our primary goal for backups is to defend against catastrophic total loss of one or more servers. An important additional goal is short-term file restoration, i.e. "I deleted an important file; can I get it back?"
Policies, and technological adaptations to those policies, should be simple so people can remember them. However, inevitably there will be special needs, and we need to balance functionality in daily use, ease of backing up the files, and ease of restoring them when needed.
Backup copies must survive for an agreed-upon time, and then go away on schedule. For legal reasons it's important for us to be able to say truthfully that we have backups to this date and no farther. And it's important that we document and publish our backup retention schedule.
Backups have two phases, both of which must be considered carefully: getting the files onto the backup media, and getting them from the media onto a new server in case of a blown disc. The latter step, we hope, will be very rare, but it has to work with high reliability when needed.
There is a lot of material that should not be backed up, either because it is easily reinstalled or because it is not valuable. We should work hard to keep backed-up and non-backed-up files separate, and to define policies that help us do that.
Traditionally, backups have been done of entire filesystems, because that's how the UNIX dump program works. Amanda under Linux can, and in our case does, dump named directories independently of filesystem boundaries.
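For example, an entry in Amanda's disklist names a host and a path rather than a whole device; the dumptype below comes from Amanda's sample configuration and is illustrative, not our actual setup:

    # Illustrative disklist entries: host, directory, dumptype.
    sunset   /h1      comp-user-tar
    malibu   /m1/h1   comp-user-tar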
There are two classes of backed-up files: large directories containing only backed-up files (example: a home directory partition), and relatively small special cases which, at least currently, are hiding among much larger volumes of non-backed-up files (though in a few cases the tail has started to wag the dog). Let's call these outliers. Examples of outliers are /etc (configuration files) on each machine, the web directories on special-purpose web servers, and e-mail directories. Strategies for backing up outliers need to be different, effective, and easy to administer.
This is jimc's proposed policy for backups:
We back up directories named /h[0-9]. On homedir servers these will be entire filesystems filled with home directories. When a non-server has a /h directory that we consider valuable, we back that up too. This does not include non-backed-up private data areas on workstations. We'll back up a directory not named /h[0-9] only after major arm-twisting.
On machines that are not homedir servers, it's common for /m1/h1 to exist. To the extent feasible we put outliers there, and we back it up without backing up all of /m1. It is normally reached under the name /h1 via a symlink or bind mount, and it counts as a /h[0-9] directory, although this policy does not specify whether it's dumped under the /h1 or the /m1/h1 name.
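For reference, either form of aliasing is a one-liner; a minimal sketch (the fstab bind syntax assumes a Linux 2.4 or later kernel):

    # bind mount, via a line in /etc/fstab:
    /m1/h1   /h1   none   bind   0 0
    # or the symlink equivalent:
    ln -s /m1/h1 /h1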
Outliers not already in a /h directory are copied, as a tgz file, to a /h directory, which is backed up. On non-homedir servers, a suitable subdirectory within /m1/h1 is the recommended place for these tgz files. As far as possible, outlier backups are automated by a special script which runs just before the /h directories are backed up.
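The copying step for one outlier amounts to this (a sketch; the destination subdirectory name is illustrative):

    # Copy one outlier into the backed-up area as a tgz file:
    tar czf /m1/h1/outliers/etc.tgz -C / etc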
At EOQ (end of quarter) we also back up some /m directories: /m1 on Sunset and Malibu, and possibly other hard-to-install software as valuable material appears. It's a matter for debate how sanguine we're going to be about software installation directories, and how we can dump them without having to dump all of the /m's every day.
Every directory that should be backed up will have a file called BACKMEUP; every /h[0-9] directory that should not be backed up will have a file called DONTBACKUP. Outlier directories that should be compressed and copied to a /h directory will have a file called BACKUPTAR (which signals DONTBACKUP also; you need one or the other). The contents of these files are not a policy issue, but in the case of BACKUPTAR the file is a convenient place for a list of filenames to exclude from the dump.
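A minimal sketch of how the marker files could drive the outlier script (the search roots and the destination are illustrative):

    #!/bin/sh
    # Tar up every directory flagged BACKUPTAR into the backed-up area.
    DEST=/m1/h1/outliers                   # assumed destination
    for marker in /etc/BACKUPTAR /m1/*/BACKUPTAR; do
        [ -f "$marker" ] || continue
        dir=$(dirname "$marker")
        name=$(echo "$dir" | tr / _ | sed 's/^_//')
        # Per the policy above, BACKUPTAR doubles as the exclude list.
        tar czf "$DEST/$name.tgz" --exclude-from="$marker" -C / "${dir#/}"
    done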
Currently, mail is backed up in two ways: as an outlier in a directory sunset:/m1/mailbackup, and by direct backups of /m1 on each homedir server. For political reasons we can't put either the system mailbox or the backup copy on the same filesystem as the user's homedir, i.e. /h1, where it would count towards the user's disc quota. Nor do we (jimc) want to back up /m1 just to get the mail. Therefore we will continue to back up the mail across the net to Sunset, but the backup directory will be on sunset:/h1 so it actually gets copied to the backup media, and the mail backup will be done just before filesystem backups. Mail acts as an outlier, but the manner of backing it up is special.
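A sketch of that nightly copy, assuming rsync over ssh; the spool path and the per-host subdirectory are assumptions:

    # Run on each homedir server just before dumps:
    rsync -a -e ssh --delete /var/spool/mail/ sunset:/h1/mailbackup/$(hostname)/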
Currently, /etc is backed up in two ways: as an outlier in a directory sunset:/h1/rootbackup, and by direct backups of the root (servers only). The issues are the same as for mail: we don't need to back up 3 to 5 GB of root just to catch /etc and possibly other outliers in /var.
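One plausible form of that copy (a sketch; the compression and file naming are illustrative):

    # Push /etc to the central root backup area:
    tar czf - -C / etc | ssh sunset "cat > /h1/rootbackup/$(hostname)-etc.tgz"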
Additional outliers identified so far are /m1/custom on all hosts (it should be moved to /h1 or /m1/h1), the namedaemon directory on slave servers (Sunset's is in /h1), and the Certificate Authority, which needs a special procedure.
Let's add a step to dump execution in which the scripts /usr/math/lib/daily.d/P* are executed on all hosts just before dumps occur (initiated from Ulanda). These P-series scripts do the outlier backups: /etc on all hosts, mail on homedir servers, and additional outliers where needed.
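A sketch of the driver for that step, as run from Ulanda (the host list file and the use of ssh are assumptions):

    #!/bin/sh
    # Run the P-series pre-dump scripts on every backed-up host.
    for host in $(cat /usr/math/lib/backup-hosts); do   # assumed host list
        ssh "$host" 'for p in /usr/math/lib/daily.d/P*; do
            [ -x "$p" ] && "$p"
        done'
    done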
To avoid the inevitable falling through the cracks, we need an automated auditing script to check that every /h directory has the appropriate file controlling backups, and that the requested backups are actually happening.
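A sketch of the first half of that audit (verifying that the backups actually happen would mean querying Amanda's logs, which is omitted here):

    #!/bin/sh
    # Verify that every /h directory carries at least one control file.
    for dir in /h[0-9] /m1/h1; do
        [ -d "$dir" ] || continue
        if [ ! -e "$dir/BACKMEUP" ] && [ ! -e "$dir/DONTBACKUP" ] \
                && [ ! -e "$dir/BACKUPTAR" ]; then
            echo "WARNING: $dir lacks BACKMEUP/DONTBACKUP/BACKUPTAR"
        fi
    done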
We will avoid bind mounts of /m1/h1 onto /h1, unless there is a real operational need. At present these bind mounts are in use:
On Laguna and Malibu, /h1/m1 exists and /m1 is a symlink to it. When time permits (i.e. at EOQ) we'll create an actual /m1 partition on these machines.
We need, and may not currently have, a backed-up copy of the partition table of every machine.
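sfdisk can produce a dump that it can later re-read; a sketch, with illustrative device and destination names:

    # Save the partition table where it will be backed up:
    sfdisk -d /dev/sda > /h1/sysbackup/sda.partitions
    # Restore it onto a replacement disc:
    sfdisk /dev/sda < /h1/sysbackup/sda.partitions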
Here's an outline of the procedure for restoring a server after a total loss:
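(The full procedure remains to be written up and tested; what follows is a plausible sketch assuming Amanda's standard restore tools and the saved partition tables above. Device, filesystem, and config names are illustrative.)

    # 1. Boot the replacement machine from rescue media.
    # 2. Recreate the partitions and filesystems:
    sfdisk /dev/sda < sda.partitions
    mkfs -t ext3 /dev/sda1                # filesystem type is illustrative
    # 3. Reinstall the operating system and the Amanda client.
    # 4. Recover each backed-up directory from the backup media:
    amrecover DailySet1                   # Amanda config name is illustrative
    # 5. Unpack the outlier tgz copies (e.g. /etc) from the /h directories.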
We haven't discussed retention periods, what goes to tape (versus sitting on Ulanda's RAID), or when this happens. We need to deal with those issues.