My net has several hosts. I want to be able to execute on any of them and to get my home directory: not a copy, but the actual files delivered over the network. I also have an immediate need for two people who log in to different machines to edit a big web document (set of web pages).
I have already tried several schemes where the home directory (or part of it) is copied between machines. None were really satisfactory. I do have a requirement that throws a monkey wrench into the network filesystem solution: my laptop needs to be functional without a network connection, e.g. inside an airplane. I see two solutions here: either the laptop itself holds the exported instance of the homedir (rather than a real fileserver), or I do the copying thing before and after my journey and temporarily use the laptop's local copy.
What network filesystems are possible? See my writeup on the Asus Transformer Pad Infinity (a tablet and mini-laptop) which has a discussion of the homedir issue.
I initially wanted to try sshfs, using this design:
Uses autofs (with which jimc is very familiar). A local directory like /net belongs to it. The remote directory is mounted on a mount point created on the fly. When the target is really local, autofs can do a bind mount.
Automounter unmounts an unused directory after a timeout (5 mins).
The automounter does not call a script to do the mount itself; rather, it calls a script (if so configured, per mount point) to find out what to mount on the mount point, then does the mount itself. With enough syntactic tweaks and backslashes it's apparently possible to make this work with sshfs; numerous people report success with static files specifying only one numeric UID and one or two remote directories.
For users' homedirs (recognized by a heuristic, i.e. everything in /home is a homedir) the program searches for and steals the owner's SSH key agent or Kerberos credential (TGT).
If not a homedir, or if no key agent, the script mounts the remote directory readonly as root using a remote execution key.
Chicken and egg: ~/.ssh/agent-$HOST and ~/.ssh/agent-$HOST-$DISPLAY save the ssh-agent PID and socket filename. But if we're going to use those files to mount the homedir, the homedir has to already be mounted. However, it should be possible to find the agent socket without looking in the homedir -- it's in /tmp/ssh-XXXX/agent.$PID owned by the user.
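Here is a minimal sketch of how the mount helper could locate the agent socket without touching the (not yet mounted) homedir; the script name is hypothetical and the socket layout assumes stock OpenSSH:
#!/bin/sh
# find-agent.sh (hypothetical helper): print a running ssh-agent socket
# belonging to the given user, found under /tmp rather than in the homedir.
USER_NAME="$1"
# OpenSSH creates its sockets as /tmp/ssh-XXXXXXXXXX/agent.<pid>, owned
# by the user; take the first one found.
find /tmp -maxdepth 2 -type s -user "$USER_NAME" -name 'agent.*' 2>/dev/null | head -n 1
The mount script would then export that path as SSH_AUTH_SOCK before running sshfs.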
Another disadvantage of sshfs is that it's a file overlayer, not a block overlayer. While 99% of my use of the files involves reading or writing the whole file at once, so the block vs. file style is irrelevant, the very few cases where I use a SQLite database will drive sshfs crazy.
I ended up abandoning the sshfs approach: too many kludges, too many ways for things to go wrong in the authentication area, too uncertain that an SSH agent or Kerberos ticket could be found reliably when needed.
Instead I imported the design I use at UCLA-Mathnet for Sun Microsystems' Network File System (NFS), with some improvements that Mathnet doesn't have. If $XDIR is the exported directory and $SERVER is the 1-component name of the server that exports it, the clients are going to find it as /net/$SERVER/$XDIR . I make this happen with the following files and steps. Of course many of the files could be renamed, but these are either standard or are the ones I picked.
Install and activate at boot time the packages nfs-kernel-server (nfs-server.service) and autofs. Package names are for OpenSuSE.
/etc/auto.master needs a line giving autofs control of the /net tree root:
/net /etc/auto.net
/etc/auto.net describes subdirectories created on the fly within /net by autofs, named after each file server that is currently in use. The first field can be an explicit path component (server), but in my design the single '*' matches any value. A '&' is replaced by the path component, and the -D option defines an environment variable with this value for use in the next step. This map file creates a recursive automount map defined by /etc/auto.net.generic .
* -fstype=autofs,-DSERVER=& file:/etc/auto.net.generic
/etc/auto.net.generic is used to interpret the third path component as a NFS mount point.
* -fstype=nfs ${SERVER}:/&
This all works reliably. Autofs mounts the directories when requested. It unmounts them if they aren't used. For local directories it does a bind mount rather than making the local NFS server send the file content to itself.
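For example (the hostname here is just illustrative), merely referencing a path under /net triggers the two levels of automounting:
ls /net/myserver/home            # autofs mounts myserver:/home on demand
mount | grep /net/myserver       # shows the NFS (or bind) mount autofs made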
The NFS server needs to be told which directories to export, and to whom, in /etc/exports; see exports(5) for all the details. Here is a sample stanza. You give one export per line (blank lines and #comments are ignored), but long lines can be wrapped with backslash-newline.
/s1 -rw,root_squash,async,no_subtree_check,mountpoint \
*.cft.ca.us \
2001:470:1f05:844::4:0/112 \
2001:470:1f05:844::2:0/112 \
192.9.200.128/26
root_squash makes the server map requests from the client's root (UID 0) to the anonymous user nobody. Recommended.
mountpoint makes the server silently ignore (not export) the directories on which nothing is mounted. On my net the virtual machines don't have a separate filesystem for /home, and so it has to be exported without the mountpoint option.
Instead of actually relying on the mountpoint parameter, I have a script that reads /etc/fstab and adds the mount points to /etc/exports, all with the same options. The policy, which gives the right exports on my net, is that only ext4 or btrfs filesystems are exported, the root filesystem is not exported, and /home is exported (without the mountpoint option) when it is not a separate filesystem, as is the case on my virtual machines. With NFS you export directories, and they don't have to be mount points.
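Here is a minimal sketch of such a generator, assuming the client list from the sample stanza above and a hypothetical script name; my real script carries more policy:
#!/bin/sh
# gen-exports.sh (hypothetical): rebuild /etc/exports from /etc/fstab.
OPTS='-rw,root_squash,async,no_subtree_check,mountpoint'
HOSTS='*.cft.ca.us'
{
  echo '# Generated from /etc/fstab -- do not edit by hand'
  awk '$3 == "ext4" || $3 == "btrfs" { print $2 }' /etc/fstab |
  while read MTPT; do
    [ "$MTPT" = / ] && continue        # never export the root filesystem
    echo "$MTPT $OPTS $HOSTS"
  done
  # On hosts where /home is not a separate filesystem, export it anyway,
  # but without the mountpoint option.
  awk '$2 == "/home"' /etc/fstab | grep -q . || echo "/home ${OPTS%,mountpoint} $HOSTS"
} > /etc/exports
exportfs -ra      # tell the running NFS server about the new list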
The NFS client negotiates several parameters with the server, guided by /etc/nfsmount.conf, which it reads at every mount. See nfsmount.conf(5) for details. The non-default values I picked were these:
Defaultvers=3; this is the protocol version it proposes first. The default is 4. There is quite a lot of difference and version 4 is a lot better, but in my context it causes mounts to fail about half the time. Rather than trying to debug it, I reverted to protocol 3, which has been reliable. Protocol 3 requires that the server and the client both use the same numeric UIDs; protocol 4 requires consistent alphabetic loginIDs, but they could translate to different UIDs on each machine (not recommended).
Defaultproto=tcp (the default). Back in 1985 the default was UDP, saving CPU cycles during intensive filesystem access, but 30 years later the machines have plenty of CPU and TCP is a lot more reliable.
Soft=True; this means that when an operation times out the NFS client will toss the data and report an error, which the client program may or may not be programmed to respond to effectively. Documentation recommends Soft=False, Hard=True, which means to retry forever. In other words, the user sees the application hang until he can get the sysop to reboot the server or take whatever other corrective action is needed. At that point the data is finally written, not lost, and the user's session can proceed, but everyone's temper is on edge by then. The user can't even use SIGINT to terminate the application, nor SIGSTOP; he needs SIGKILL from a separate session. (Don't set Hard=False; it turns into nohard, which is an invalid mount option.)
Several timeouts were shortened drastically. NFS timeouts are the biggest source of user complaints. I set Timeo=30; this is the generic timeout for each request to finish, in tenths of a second. Default 600 (1 minute).
Retry=1; keep trying to mount for this many minutes. Default is 2 minutes for soft mounts or 10000 (1 week) for hard mounts.
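Collected into nfsmount.conf syntax, the settings above come out something like this (the global section name is the one documented in nfsmount.conf(5)):
[ NFSMount_Global_Options ]
Defaultvers=3
Defaultproto=tcp
Soft=True
Timeo=30
Retry=1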
While NFS protocol version 4 handles the whole interaction multiplexed on port 2049, earlier protocols have a collection of service daemons on the server: rpc.mountd which handles mount requests, kernel module lockd which handles NFS file locking, rpc.statd which reestablishes or clears locks when either the server or the client reboots, and of course the NFS server itself. This means that the firewall needs to be open for these ports, both TCP and UDP. However, except for NFS itself the ports are randomly assigned when the daemon starts.
My first solution was to write a script that runs rpcinfo -p, extracts the actually assigned port numbers, and updates a firewall rule to let them through. This works, but is fragile and takes some systemd trickery to get the list updated if NFS were to be restarted.
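A sketch of the idea (not my exact script; the firewall update itself is left as a comment because it depends on which firewall you run):
#!/bin/sh
# nfs-ports.sh (illustrative): list the ports currently bound by the
# NFS-related RPC services, one "proto port" pair per line.
rpcinfo -p | awk '$5=="mountd" || $5=="nlockmgr" || $5=="status" || $5=="nfs" {print $3, $4}' | sort -u
# A wrapper would turn each line into a firewall rule and reload the firewall.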
My second solution was to do enough Google searches to find how to set the daemons' ports. This is for OpenSuSE Leap 42.1, but is identical in prior SuSE versions, and forum postings suggest that the same configuration file and variable names are used for the Red Hat family including CentOS, and possibly also Ubuntu. I assigned ports sort of arbitrarily, and added these variables to /etc/sysconfig/nfs:
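The variable names below are the usual ones for this file (the Red Hat family referred to in the forum postings uses the same names); the port numbers are only placeholders except for lockd's 903, which matches the modprobe alternative mentioned next. Substitute whatever unused ports you chose:
MOUNTD_PORT=20048
STATD_PORT=32765
LOCKD_TCPPORT=903
LOCKD_UDPPORT=903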
Then these ports were opened in the firewall. An alternative for the kernel module lockd is to create /etc/modprobe.d/50-lockd.conf containing this line:
options lockd nlm_udpport=903 nlm_tcpport=903
Also you need to add the fixed ports (except $STATD_PORT) to /etc/bindresvport.blacklist . When RPC ports are assigned at random, this prevents your fixed ports from being handed out to some other RPC service. Other services, like CUPS, also need their ports listed in this file.
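Continuing with the placeholder ports above, the file is just one port per line, optionally with a comment:
631       # cups
903       # lockd
20048     # mountd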
In NFS the server relies on the client's operating system (running as root) to not be hacked and to honestly report which user is trying to access the exported files, so the server can use normal access control methods like mode bits and POSIX ACLs. For protocol versions 3 and earlier the client reports a numerical UID, which therefore has to be in sync between the client and the server. The biggest can of worms with NFS version 4 is user mapping: the client's alphabetic loginID is sent to the server. So the loginIDs still have to be in sync, but not the UIDs; the server maps the loginID to a UID in its own way.
The main use case where UIDs can't be synced is where the client and server are in different organizations, e.g. client at home, server at work. Of course there's also the problem of syncing the loginIDs, but that is a little easier. However, you have a security issue: if the client is controlled by someone other than the server's organization, its sysop can configure whatever UID or loginID he pleases, and could steal files to which his organization is not entitled. So the server's sysop usually refuses to export to clients not under his control, i.e. clients that might dishonestly report the UID or loginID.
Work to home mounting is requested frequently, and Microsoft Windows can handle it: the client gets a Kerberos ticket for the server's realm by giving the foreign loginID and password. But unfortunately in the NFSv4 implementation in Linux, user mapping has turned out (for me) to not be reliable: the map daemon at random times gets into a mode where it maps all the users to nobody, and you have to reboot the server to recover; restarting all the daemons is not sufficient, nor is it effective to signal the kernel to clear its ID map cache.

Also on my home net, but not at work, when I use NFSv4 the server exports nothing about half the time, then recovers without intervention, whereas NFSv3 never fails. So I have reverted to protocol version 3.
I also wrote system monitor scripts. The one for autofs picks a partner that's up, and checks if /net/$PARTNER/$MTPT can be accessed, i.e. can be mounted. /home is always exported and it has a copy of /etc/exports at toplevel, so the script knows which filesystems to test. The script also checks autofs access from $HOSTNAME, i.e. from itself, and makes sure a bind mount is being used.
The test script for nfs-server picks a partner and uses SSH restricted execution to run itself there. The partner tries to mount each exported filesystem, and if this fails, the script restarts nfs-server.
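A sketch of the partner-side check (the script name is hypothetical, and the filename of the exports copy under /home is a guess; my real script is a bit more careful):
#!/bin/sh
# check-nfs.sh (hypothetical): run on a partner host.  Try each filesystem
# that $SERVER exports, via the automounter, and restart nfs-server on
# $SERVER if any of them cannot be reached.
SERVER="$1"
FAILED=0
# /home is always exported and carries a copy of /etc/exports at top level.
for XDIR in $(awk '/^\// {print $1}' "/net/$SERVER/home/exports"); do
    ls "/net/$SERVER$XDIR" > /dev/null 2>&1 || FAILED=1
done
[ "$FAILED" = 1 ] && ssh "$SERVER" systemctl restart nfs-server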