From time to time, users ask about a special backup solution for files on disc. I have a favorite scheme which is often suitable for them, so I'm putting it down in a web page. Here I'm setting sizes and frequencies according to the needs of one specific user, but the parameters can be adjusted for a range of needs.
Speed is not an issue. A USB enclosure with a 1TB or 2TB drive in it should be plenty, and it puts the drive under the user's control. It would be useful to have someone more expert recommend the currently favored brand and the variant (datacenter, desktop, whatever).
The backup will often run when she is not logged in. Also, it's rare but important that the user should be able to unmount and take away her own drive. A custom udev and/or fstab rule should take care of the permission problems.
For scheduling, keep it simple: her personal crontab on the host where the data resides (Varan).
Twice a day for a year is 730 instances. Clearly 730 x 70GB (about 51TB) is both impractical and unnecessary.
Differential or incremental dumps can greatly economize on space; but it's complicated to discover which dump has the most recent copy of the file to be restored. Let's put together a solution based on multiple full dumps.
Likely the data will compress well, but it's harder to audit the dump and to restore files if it's compressed, so I'm assuming we won't compress. Similarly, I doubt the payload is worth stealing, so we won't bother with encryption. These features could be added for other users for whom they are important.
Jimc's solution at home is 10 full dumps burned on CD in a modified Tower of Hanoi (binary) rotation, so the interval between retained dumps increases exponentially as they age.
Our user is not going to put 70GB on a CD; rather, the external drive will have N directories, each with a full dump in it. 2TB can hold 28 sets of 70GB, but N should be smaller than that, to allow for future expansion beyond 70GB.
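The arithmetic above is easy to check with shell integer math. This is only a sketch: the variable names are made up for illustration, and the 2TB drive is taken as 2000GB.

```shell
# Capacity arithmetic from the text, in whole gigabytes (integer math).
dump_gb=70                          # size of one full dump, from the text
disc_gb=2000                        # a 2TB drive, taken as 2000GB (assumption)
per_year=$((2 * 365))               # twice a day for a year
naive_gb=$((per_year * dump_gb))    # if every dump were kept as a full copy
sets=$((disc_gb / dump_gb))         # full dumps that fit on the drive
echo "$per_year dumps would need ${naive_gb}GB; the drive holds $sets sets"
```

Changing dump_gb shows how quickly the number of sets shrinks as the payload grows, which is the reason to leave some headroom.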
Each dump area will be a subdirectory of the external disc's root. Each will contain a dumpdate file created along these lines:
date "+%Y-%m-%d %H:%M $subdir" > "$subdir/dumpdate"
The name of the subdirectory appears after the date so these files can be turned into an ordered list of dumps; see details below.
The payload to be dumped should be in one or a few directories. Use rsync in local mode to dump it, like this. (Paths are for illustration only; adapt them to the real names and mount point.)
rsync -a -O -W --del --log-format="%o %f" \
/s1/payload /mnt/subdir28/ > /mnt/subdir28/dumplog 2>&1
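Explanation of rsync options:
-a: archive mode; recurse into directories and preserve symlinks, permissions, times, owner and group.
-O: omit directories when preserving modification times.
-W: copy whole files, without the delta-transfer algorithm.
--del: delete files from the destination that are gone from the source, during the transfer.
--log-format="%o %f": print one line per file with the operation (%o) and the filename (%f). Newer rsync versions call this option --out-format and keep --log-format as a deprecated synonym.
Why we copy the entire file (-W): rsync is mostly used to copy across the net, which is expensive, so it has a fancy and effective way to identify unaltered regions and skip sending them. But that triples the disc activity on the destination: it reads the old file to compute block checksums, reads it again to pick out the unaltered regions, and writes the new file. When both discs are local it's better to just write the whole thing at once.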
If rsync finds that the metadata (size and date) is the same on both ends, rsync assumes that the content is also the same, and does not transfer the file. 99.999% of the time this is true, unless someone was doing super arcane stuff involving jiggering the mtime. Comparison by checksum or bit-for-bit comparison can be configured but this would be a total waste of time.
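The rsync command and the dumpdate line can be combined into one function. This is a sketch, not a tested production script: the name dump_to and the check for a missing dump area are assumptions added for illustration, not part of the scheme as described.

```shell
# dump_to SRC DEST: one full dump of directory SRC into the dump area DEST.
# Give SRC without a trailing slash, so it lands in DEST/<name of SRC>/ and
# --del never touches dumplog or dumpdate at the top of DEST.
dump_to() {
    src=$1
    dest=$2
    # Refuse to run if the dump area is missing (e.g. the drive is not
    # mounted); otherwise the dump would land on whatever disc holds DEST.
    [ -d "$dest" ] || { echo "no dump area $dest" >&2; return 1; }
    rsync -a -O -W --del --log-format="%o %f" \
        "$src" "$dest/" > "$dest/dumplog" 2>&1 || return
    # Stamp the area only after rsync has succeeded.
    date "+%Y-%m-%d %H:%M $dest" > "$dest/dumpdate"
}
```

It would be called as, for example, dump_to /s1/payload /mnt/subdir28, using the illustrative paths from above. Writing dumpdate last means a failed dump keeps its old date, at the cost of the date not matching a half-overwritten area.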
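Scheduling: at 18:00 or a convenient "end of day" time, do the Tower of Hanoi thing. Determine the day number (UNIX time integer divided by 86400 secs, discarding the remainder) (this is UTC, which for us advances at 16:00 or 17:00 local time). Find the smallest i such that day/(2^i) is odd, and dump on hanoi$i. You need an upper bound: if i > 8, truncate it to 8. With i in 0..8, hanoi8 is rewritten every 256 days and hanoi7 halfway between, so the oldest retained dump is always between 128 and 256 days old. For a minimum of 512 days or 1.4 years of coverage, let i run over 0..10 and truncate at 10 instead.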
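The slot selection can be sketched in a few lines of shell. The function name hanoi_slot and its optional second argument (the upper bound, defaulting to 8) are assumptions for illustration; date +%s is assumed to be available, as it is with GNU and BSD date.

```shell
# hanoi_slot DAY [MAX]: smallest i such that DAY/(2^i) is odd, capped at MAX.
hanoi_slot() {
    day=$1
    max=${2:-8}
    i=0
    # Shift right until the low bit is set, or until the cap is reached.
    while [ "$i" -lt "$max" ] && [ $(( (day >> i) % 2 )) -eq 0 ]; do
        i=$((i + 1))
    done
    echo "$i"
}

# Day number: UNIX time divided by 86400 seconds (so it is the UTC day).
today=$(( $(date +%s) / 86400 ))
echo "tonight's dump area: hanoi$(hanoi_slot "$today")"
```

Odd days always land on hanoi0, days divisible by 2 but not 4 on hanoi1, and so on, which is what makes the gaps between retained dumps double with each higher slot.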
Then at 12:00 (midday) use a separately named group of subdirs. Let w = the day number modulo 7, which stands in for the weekday. Dump on daily$w. This keeps a midday dump for each of the last 7 days, in addition to the evening Hanoi dumps at their exponentially increasing intervals. If more is wanted, the modulus could be increased, and also the Hanoi upper bound. They need not be equal: the modulus sets how far back the daily dumps reach, and the Hanoi bound sets how far back the oldest dump reaches.
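The midday selection is simpler still. In this sketch the function name daily_slot, the script name dump.sh and its location are all hypothetical; the two crontab lines in the comments only show where the 18:00 and 12:00 times would go in her personal crontab.

```shell
# Hypothetical crontab entries (edit with crontab -e as the user):
#   0 18 * * *  $HOME/bin/dump.sh hanoi
#   0 12 * * *  $HOME/bin/dump.sh daily

# daily_slot DAY [MODULUS]: index of the midday dump area, DAY modulo MODULUS.
daily_slot() {
    echo $(( $1 % ${2:-7} ))
}

# Same UTC day number as for the Hanoi run.
day=$(( $(date +%s) / 86400 ))
echo "midday dump area: daily$(daily_slot "$day")"
```

Raising the modulus, say daily_slot "$day" 14, is all it takes to keep two weeks of midday dumps, provided the disc has room for the extra areas.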
To make a map of dump dates to subdirs, just do this:
sort -r -o /mnt/schedule.txt /mnt/hanoi*/dumpdate /mnt/daily*/dumpdate
-r causes the most recent one to be listed first.
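Because the list is newest first, finding the dump to restore from is a one-line scan. This sketch assumes the dumpdate layout shown earlier (a 16-character date and time, then the subdir); the function name dump_before is made up for illustration.

```shell
# dump_before WHEN [SCHEDULE]: print the newest schedule line whose date is
# not later than WHEN. WHEN uses the dumpdate layout, e.g. "2024-03-01 12:00".
dump_before() {
    # The first 16 characters are "YYYY-MM-DD HH:MM"; comparing them as
    # strings orders them by time. The file is newest first, so the first
    # match is the most recent dump at or before WHEN.
    awk -v when="$1" 'substr($0, 1, 16) <= when { print; exit }' \
        "${2:-/mnt/schedule.txt}"
}
```

For example, dump_before "2024-03-02 13:00" would name the last dump made before a file was damaged early that afternoon. It prints nothing if every retained dump is newer than the requested time.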