In a scientific research computing environment, various users each submit multiple long-running jobs to be executed on a plethora of hosts, often but not necessarily homogeneous and integrated into a cluster. It takes software infrastructure to make these jobs run. Our present job queueing system is many versions out of date, and we now have a new compute cluster, so we need either to pick a modern job queueing system or to refurbish the one we have. In addition, cloud virtualization is now feasible and offers significant advantages.
Our users want bulk CPU power and don't want a lot of sophisticated features that they have to learn. They will accept whatever we give them if they can just run their jobs.
When we have extremely long jobs (a month), we gain a lot in flexibility, reliability and responsiveness to the users if we run such jobs in a virtual machine. It isn't clear yet whether these VM's should be created for each job or should be permanent.
Creating a virtual machine is nontrivial but is not a prohibitive burden, if we go with creating virtual machines on the fly.
Cloud manager software exists for creating a public cloud of virtual machines. I doubt that we want a public cloud. We should create our virtual machines using ad-hoc scripting. Permanent virtual machines can be considered, versus creating them on the fly for each job.
Sun Grid Engine as we know it is expired; if we put in Open Grid Scheduler (its credible successor) it's equivalent to a whole new queueing system. The currently hot job queueing package, possibly the market leader, is SLURM.
We need to pick a hypervisor, and there are several credible choices; Xen does not win by default. Jimc sees KVM as the leading contender.
The users and jobs in our environment have these characteristics:
User training is a major issue. If the system is too complicated the users will not use it, and the sysadmins also have limited time to spend learning how to manage the cluster.
Our users have modest expectations on the flexibility of computational resources. A very small number of preset configurations will probably serve 99% of needs. However, the possibility should exist for custom configurations, likely with assistance from the sysadmin.
Our users' jobs tend to use massive amounts of CPU time, moderate amounts of memory (e.g. 1 GB to 2 GB is often enough, but important classes of jobs use more), and relatively little I/O. In other departments, though, the jobs can be much more I/O intensive. Our jobs should never swap.
Our users rarely use multiple threads per job. It is probably going to be sufficient, for parallel jobs, to provide multiple cores on the same physical host. In other words, we can cut corners in the area of coordinating job segments on different physical hosts.
Presently our cluster users do not need realtime or interactive high performance computing. However, the future possibility of interactive use should not be precluded by the design. The Teran group does major interactive work involving graphics on their dedicated super workstations.
Our users' executable programs and scripts are available via the NFS shared filesystem, and NFS is satisfactory for other file storage also. Nonetheless, we should avoid depending on the storage host staying running for long periods. In other words, it is best to copy the user's files onto the execution host (real or virtual), and copy the results back when the job is finished. But the use of NFS resources by the long-running job should not be forbidden, just discouraged.
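The copy-in/run/copy-out pattern described above can be sketched with ordinary rsync and ssh. The host and directory names here are hypothetical, and the function only prints the commands it would run rather than executing them:

```shell
#!/bin/sh
# Sketch: stage job files onto the execution host, run the job there, and
# fetch the results back, so a long-running job does not depend on the NFS
# server staying up. All names are illustrative placeholders.
stage_cmds() {
    exec_host=$1; jobdir=$2
    echo "rsync -a $HOME/$jobdir/ $exec_host:/scratch/$jobdir/"          # inputs and binaries in
    echo "ssh $exec_host /scratch/$jobdir/run.sh"                        # run the job locally there
    echo "rsync -a $exec_host:/scratch/$jobdir/out/ $HOME/$jobdir/out/"  # results back to NFS home
}
stage_cmds vmjob1234 myjob
```

Piping the printed commands into sh would perform the actual staging; keeping them as text makes the sketch safe to run anywhere.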
The longest jobs regularly encountered run for three to four weeks. This poses reliability issues and interferes with system administration (rebooting). On the other hand, it's important to deliver quick response when a researcher is developing an algorithm on a small test case.
Our users rarely or never need continuity beyond a single job. In contrast, in the business world it is common to put up a database server, web server, etc. in a virtual machine, which must persist reliably for many years. This means in particular that their virtual machines must receive regular patches, whereas we could cut corners in this area. On the other hand, the master images must be kept up to date in patches.
Our users are all known to us in advance and have accounts on the department net. Recharging expenses is not necessary. This is not a public cloud service.
Our users are semi-cooperative and semi-trustworthy. We do not need heroic measures to prevent rivals from snooping out information about each other's projects, as might happen on a commercial cloud service used by competing corporations. The security procedures necessary to meet our FERPA obligations and to exclude outside hackers are sufficient also for internal security. Our major security threats are:
We are able to accommodate the cluster's network topology within quite flexible limits. Access by the user to the host (virtual or real) running the job is not an important requirement but can be helpful in debugging. Access from off-site is even less important, but forbidding it is not a requirement, and the host is expected to resist attacks from inside or outside the local net, whatever the access possibilities end up being.
Whatever software infrastructure we pick, we as sysadmins want it to give these services to our users:
In the previous millennium, several job queueing systems were developed which accept a user's job and run it on one of a collection of hosts. In the present day, any relevant hardware can run a virtual machine, and several cloud managers are popular, whose job is to make a virtual machine appear at the user's request; he then runs his job there. Our initial decision is whether to stick with the job queueing model, adopt a cloud (virtualized) model, or use some hybrid. Here is a list of advantages and disadvantages of the cloud model.
Virtualization adds a big layer of complexity that the users do not want to deal with, for the most part. Neither do the sysadmins.
A virtual machine can be paused and later restarted on the same or a different physical host. This is impractical when running on physical hardware. (But see the discussion of Condor.) If you can pause your virtual machine you get these important advantages:
You can make periodic snapshots of the job, and if the physical host loses power or has a hardware failure, you avoid losing a month-long job. This is not too common for us but has happened. Other departments have more exposure in this area. For snapshots to work, all writeable files (e.g. output) have to be inside the virtual machine, not NFS mounted.
Along the same line, you can reboot the physical host, e.g. when applying a kernel patch. We frequently run afoul of this issue.
If every node has a long-running job, you can make room for a shorter job, e.g. for algorithm development and debugging. While at present we usually have unoccupied nodes, there have been incidents in the past where users could not get access for debugging, and justifiably complained.
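The pause/snapshot/resume cycle behind all three advantages above can be sketched with libvirt's virsh, assuming we end up with a libvirt-managed hypervisor such as KVM or Xen. The guest name and state-file path are hypothetical, and the function only prints the commands it would run:

```shell
#!/bin/sh
# Sketch: checkpoint a running guest and later resume it. 'virsh save'
# suspends the guest and writes its RAM and device state to a file;
# 'virsh restore' resumes it exactly where it stopped, possibly after the
# physical host has been rebooted for a kernel patch.
checkpoint_cmds() {
    vm=$1; state=/scratch/$1.state
    echo "virsh save $vm $state"      # guest stops; job loses no work
    # ...reboot the physical host, or run a short debugging job here...
    echo "virsh restore $state"       # guest resumes from the saved state
}
checkpoint_cmds job1234
```

Run as root on the execution host (with the echo wrappers removed), this is the whole mechanism; the remaining work is deciding when to trigger it.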
Virtual machines can be more uniform than the physical hardware. However in our environment I doubt that uniformity is all that useful.
Virtual machines are good at delivering a subset of the physical machine's resources, and limiting the user to a particular number of cores and size of memory. In the past our clusters have had one core per host (Nemo does not make effective use of dual cores), but the new cluster has 12 cores per host (with hyperthreading in addition) and they should be used effectively. While CPU accounting (cpusets, see /usr/src/linux/Documentation/cgroups/cpusets.txt) could be used on the bare metal, we would have to write this ourselves, and it is attractive to use the feature through a hypervisor (virtual machine manager).
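For scale, here is a minimal sketch of the do-it-yourself bare-metal alternative using the cpuset cgroup from the kernel documentation cited above. The cgroup name, core range, and PID are hypothetical, and the function only prints the commands rather than writing to /sys:

```shell
#!/bin/sh
# Sketch: confine one job to specific cores and a memory node with the
# cpuset cgroup, the bare-metal equivalent of a hypervisor's core limit.
cpuset_cmds() {
    job=$1; cpus=$2; pid=$3
    cg=/sys/fs/cgroup/cpuset/$job            # conventional mount point
    echo "mkdir -p $cg"
    echo "echo $cpus > $cg/cpuset.cpus"      # e.g. 0-1 = two cores only
    echo "echo 0 > $cg/cpuset.mems"          # memory node 0
    echo "echo $pid > $cg/tasks"             # move the job's process in
}
cpuset_cmds job1234 0-1 4242
```

The commands themselves are trivial; the part we would have to write ourselves is the bookkeeping that picks free cores and cleans up after each job, which a hypervisor gives us for free.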
I am coming to the conclusion that we should run our users' jobs in individual virtual machines created on the fly for each job. An alternative is to pre-create and boot the virtual machines, whereupon a traditional job queueing system can utilize them. Even pre-created machines can be paused and checkpointed, the same as on-the-fly ones, as long as the job queue manager does not become upset if it temporarily cannot contact the job shepherd process.
What would the user experience be like in such a regime? And how would the job queueing infrastructure be designed? The users would go through these steps:
The user pre-specifies the virtual machine configuration: number of cores, amount of memory, amount of disc scratch space, and an estimate of the job's duration (in normalized units that scale with the processor speed, not wall clock seconds). If the job runs out of memory or disc, it will die. Estimating resource needs is quite hard, and this step will probably be the biggest source of complaints from users.
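As an illustration, the pre-specification could be as simple as a shell-syntax spec file sourced by the submit script. These field names are invented for the example, not taken from any existing tool:

```shell
#!/bin/sh
# Hypothetical per-job resource specification. The submit script would
# source this file and build the virtual machine accordingly.
CORES=2          # cores, all on one physical host (we rarely span hosts)
MEMORY=2048      # MB; the job dies if it exceeds this, so err on the high side
SCRATCH=20       # GB of scratch disc inside the virtual machine
DURATION=400     # estimated duration in normalized units, not wall clock
echo "job wants $CORES cores, ${MEMORY}MB RAM, ${SCRATCH}GB scratch, $DURATION units"
```

A very small number of preset files like this one would cover the 99% case identified above, with custom values filled in by the sysadmin when needed.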
The user lists the files to be copied onto the virtual machine. NFS access should not be forbidden, but if the NFS server goes down the job will die. If outside files are being written, checkpoint copies of the virtual machine will likely be confused when reverting to a former state.
The cloud manager picks a physical host to start the job on.
The cloud manager assembles the virtual machine in these steps. The user sees nothing of this. This process sounds complicated but the main effort is in copying and extending the large image file, and copying the result to the execution host.
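Assuming a qemu/KVM-style image format, the assembly could use qcow2 copy-on-write overlays, which sidestep most of the cost of copying and extending the large image file. All names here are hypothetical, and the function only prints the commands it would run:

```shell
#!/bin/sh
# Sketch of per-job image assembly. The qcow2 overlay records only the
# job's changes against a shared read-only master image, so the "copy" is
# a small file rather than the whole master.
assemble_cmds() {
    job=$1; exec_host=$2; master=/images/master.img
    echo "qemu-img create -f qcow2 -b $master /scratch/$job.qcow2"   # thin overlay
    echo "qemu-img resize /scratch/$job.qcow2 +20G"                  # job's scratch space
    echo "scp /scratch/$job.qcow2 $exec_host:/var/lib/libvirt/images/"
    echo "ssh $exec_host virsh create /etc/libvirt/qemu/$job.xml"    # boot the guest
}
assemble_cmds job1234 node07
```

The master image stays put on shared storage; only the overlay travels to the execution host, which keeps the per-job cost small.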
At this point, typical cloud managers hand off the new machine to the user where he does whatever he wants. We may want to provide a similar service, but our main focus is on batch jobs. We would like the job queueing system to copy user files onto the virtual machine and then start the job, without user intervention.
The user would like a simple display of his virtual machines showing these aspects:
The user should at least be able to kill the job if necessary. It can really help the user if he can log in to the virtual machine and inspect partial output and scratch files, to detect if the algorithm is stuck. To make this work, the cloud manager has to assign a fixed IP and register the corresponding hostname in DNS.
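The DNS registration could be scripted with nsupdate. The zone, name server, and address below are placeholders, and the function just prints the update commands one would pipe into nsupdate:

```shell
#!/bin/sh
# Sketch: register a freshly created VM's fixed IP in DNS so the user can
# log in by name. All names and addresses are hypothetical.
dns_update() {
    vm=$1; ip=$2
    echo "server ns1.math.example.edu"
    echo "update add $vm.cluster.example.edu 3600 A $ip"
    echo "send"
}
dns_update vmjob1234 10.1.2.34
```

A matching `update delete` at job teardown keeps the zone from accumulating stale names as virtual machines come and go.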
When the job finishes, the job queueing system should retrieve the output, and dispose of the virtual machine. It may be useful to keep the image around for postmortem analysis, but it should be freed fairly promptly because it takes a lot of disc space.
What's being described here is the operation of a standard job queueing system, not a public cloud. While the operations needed to create a virtual machine for each job are fairly extensive, they are well within the reach of normal scripting.
At present there are several commercial cloud vendors, such as Amazon EC2 and Microsoft Azure. Their business model is to create a virtual machine for you and charge you proportional to the amount of time you use it (plus network communication charges). There are several commercial and open source cloud managers by which the owner of a hardware cluster can implement the same business model. But that is not exactly what we want to do with our cloud. Nonetheless, let's look at published evaluations of various cloud managers.
This article gives quite a lot of useful information about three popular cloud managers:
A Comparison and Critique of Eucalyptus, OpenNebula and Nimbus by Peter Sempolinski and Douglas Thain, Univ. of Notre Dame, dated probably 2009.
They approach these systems as cluster managers and in anticipation of picking a job queueing system for a new cluster. Here is their summary table:
|                   | Eucalyptus                     | OpenNebula                        | Nimbus                                            |
|-------------------|--------------------------------|-----------------------------------|---------------------------------------------------|
| Philosophy        | Imitate Amazon EC2             | Private customizable cloud        | For scientific research                           |
| Customization     | Not too much                   | Totally                           | Many parts                                        |
| Internal Security | Tight, root required           | Looser, can be raised             | Fairly tight                                      |
| User Security     | Custom credentials             | Log into head node                | Register X.509 cert                               |
| Ideal Use Case    | Many hosts, semi-trusted users | Fewer hosts, highly trusted users | Semi-trusted or untrusted users                   |
| Network Issues    | DHCP on head node only         | Manual (fixed IP?)                | Needs dhcpd on every node; Nimbus assigns the MAC |
These three cloud managers are oriented to providing virtual machines
for the users to run their jobs in, rather than running jobs on the bare metal.
Their emphasis is on providing
cloud computing on a local cluster. They
mention Sun Grid Engine, Condor and Amazon EC2 as systems with alternative
philosophies. In their setting, typical users need a fairly large number of
cores, and the users know how to handle coordination between multiple virtual
machines, e.g. via MPICH.
It is important to remember that a lot of infrastructure is needed in addition to the cloud manager. The authors identify these issues in particular:
The hardware needs to support
pure virtualization; if only
paravirtualization is supported, this limits the speed of the VM's and
the flexibility of choosing the OS. (Jimc says: if the guest OS supports
paravirtualization it can gain a speed advantage by using a proprietary
data path to the hypervisor. But a totally paravirtual solution like
User Mode Linux could have bottlenecks as well, as the authors say.)
Network management can be a royal pain, with virtual machines appearing and vanishing at random. The cloud manager needs to interact with DNS to give the VM's names and with DHCP to give them configuration parameters. (Jimc says: LDAP to register the virtual MAC addresses, and Avahi to advertise local services.)
Each of the cloud managers can support a variety of hypervisors or Virtual Machine Monitors (VMM). The ones mentioned are Xen, KVM and VMWare; VirtualBox may or may not be supported.
The sysadmin needs to provide disc images that will be useful to the clients. (Jimc says: a copy on write capability, like User Mode Linux has, can make deploying the images a lot neater. Generally more than one kind of image will be needed.)
The cloud managers vary in how useful their user front ends are. The authors note that all three cloud managers support extensive customization of their front ends.
The authors note, when testing cloud managers, that invariably the most frustrating aspect was to reconcile the software's assumptions about network organization with what the actual network was willing to provide.
Another useful resource is:
FermiCloud: Technology Evaluation and Pilot Service Deployment (slides from a talk, PDF or PowerPoint, dated 2011-03-22).
This slideshow describes the whole process of setting up a cloud service, in
which the cloud manager is only one part of the puzzle. They also evaluated
Eucalyptus, OpenNebula and Nimbus (one packed slide for each). They ended up
deploying all three. The distribution of emphasis: the physical nodes are dual
quad core Xeons, and they put 8 on OpenNebula, 7 on Nimbus and 3 on Eucalyptus.
The OpenNebula framework does the best at meeting our requirements, they
say. They will focus on OpenNebula in the future. The main positive aspect of
OpenNebula is that it can be adapted easily to their needs (which differ from
those of a commercial public cloud).
Here are summaries of the features of these three cloud managers:
ATS is, or recently was, investigating Eucalyptus as the cloud manager for one of its big clusters. See this Wikipedia article about Eucalyptus. Its major features are:
This cloud manager (OpenNebula) is suited to a medium-sized cluster in which the users are trustworthy. Its major features are:
This cloud manager is advertised as being for scientific computing. See this rather skimpy Wikipedia article about Nimbus. Its major features are:
I'm afraid that all of these cloud managers are overkill for us.
The steps to create a virtual machine have been detailed above, and the
professional cloud managers provide a lot of services beyond this
that are of little or no value to us. It would be prudent to at least
install them and try them out, but we may very well decide that this
layer of complexity is way overkill, and that we want to write our own
virtual machine generator.
Since about 2002 we have used Sun Grid Engine (SGE) as our job queueing system on the Sixpac and Bamboo clusters, later extended to the Nemo cluster. Prior to that we used PBS on the Ficus cluster. We now have a new cluster, called Joshua, and it is time to think about upgrading our job queueing system, or at the very least, refurbishing SGE with the latest version.
The relation of the virtual cloud to the job queueing system can take two forms.
First philosophy: Once the job queueing system has picked the physical host to run on, an integral part of starting the job is to create a virtual machine to contain it, and to dispose of this machine when the job is over.
Second philosophy: The job queueing system runs jobs not on physical hardware but on permanent virtual machines. The job queueing system neither knows nor cares that it's using virtual nodes. Administrative actions, such as pausing for a checkpoint or a reboot, are not taken into consideration by the job queueing system.
These are the job queueing systems extant in late 2011:
This is the devil that we know. But our version is almost 10 years old, and the installation is a mess, so we really should upgrade to the latest version. But in the aftermath of economic chaos, Sun Microsystems no longer exists, and the product is now known as Oracle Grid Engine. See this Wikipedia article about SGE for the history. Due to license and ownership changes the project has forked as follows:
We used this software on the Ficus cluster in the 1990's. Administrative experience was pretty bad. However, the command line interface of PBS has been enshrined in the POSIX standards, and in fact PBS does much of what you would want a job scheduler to do. It is not too clear how much interest there currently is in PBS, or the license status of the various forks, but one could consult this Wikipedia article about PBS. Apparently the fork that is currently in active use is TORQUE.
This software, from the University of Wisconsin at Madison, has been
around since the 1990's and is actively used and developed. It is subject
to the Apache license (open source). It is documented both for dedicated
clusters and for cycle scavenging on end-users' workstations when
otherwise idle, for which purpose it can do power management such as
wake-on-LAN and hibernation. If jobs are linked with Condor's hacked I/O
library they can be paused and restarted, possibly on a different host.
However, the developers have bowed to reality and can deal with non-pausable jobs.
This software, licensed under GPL2, is currently actively used and developed. It appears to be a no-frills queueing system but it has plugins for many extensions.
Given the license situation and development ambiguity I am inclined not to deal with Open Grid Scheduler; because it is ten years ahead of our present SGE version, I'm pretty sure that we would not just be upgrading SGE, we would be putting in a whole new job queueing system.
To my mind PBS (or its modern equivalent, Torque) is even less attractive given our poor experience with the original product.
Condor is intriguing, but we have no investment in this product and it may not be as flexible as we might want. I wouldn't object to investigating it, but am not advocating it either.
SLURM appears to be the up-and-coming job queueing system, and may actually
be the current market leader, given the number of
TOP 500 clusters that
use it. Information from their website suggests that it should be easy to try
out and that it is likely to meet our needs once tried. I recommend that we
start our search with this product.
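For a flavor of the user interface, a minimal SLURM batch file for one of our typical jobs might look like the following. The #SBATCH directives are standard SLURM options; the program name is made up:

```shell
#!/bin/sh
#SBATCH --job-name=mysim        # name shown in the queue display
#SBATCH --ntasks=1              # one task; we rarely span physical hosts
#SBATCH --cpus-per-task=2       # multiple cores on the same host
#SBATCH --mem=2048              # MB; moderate memory, typical of our jobs
#SBATCH --time=28-00:00:00      # four weeks, our longest regular jobs
./mysim input.dat > output.dat  # hypothetical user program
```

The user would submit this with `sbatch myjob.sh`, which matches our goal of a system the users can learn quickly.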
If we are going to have virtual machines, we need a hypervisor to control them. They come in two variants. In the bare metal style the hypervisor manages the hardware and one of the virtual machines, called Dom0 in Xen, has the management role and has special privileges with the hypervisor. In the userland style there is a normal kernel and operating system, and the hypervisor runs as a normal user process, possibly with special kernel support.
Here's a list of the currently popular hypervisors.
This is the prototype of the userland hypervisors. Unfortunately it seems to have lost momentum in the face of newer, better-supported alternatives. One of its very nice features was copy-on-write: it was possible to have a readonly filesystem image, used by multiple guests, overlain by a sparse file per guest containing only the changed sectors, which therefore was much smaller.
Jimc is disappointed that UML has gone the way of the dodo.
This was the first commercial product for virtualization. Earlier versions used the userland style, and likely this continues today. It is well regarded and widely supported by cloud managers. VMWare Workstation 8, the current version, costs $200 for a standard license, or $120 for academics. (Quantity discounts may exist but I did not investigate.) Each guest can have up to 8 cores, 2 TB of disc, and 64 GB of memory. It is not clear how many simultaneously running guests one license allows. The license appears to be per physical host, so it would cost $1920 at the academic price to cover the Joshua cluster, and more in the likely case that we upgrade the job queueing system on the existing clusters.
Jimc does not expect that $1920 will be forthcoming.
This is Microsoft's hypervisor, of the bare-metal style.
It requires Windows Server 2008 x86_64 to be the master instance (Dom0).
It can support up to 4 cores per guest, and up to 384 guests per host.
Guests can be 32-bit or 64-bit (x86_64). The guest OS can be Linux or
all variants of Windows (excluding
home versions). Hyper-V is
a no-charge option. Windows Server 2008 is not free, but I believe that
our site license covers it.
Jimc does not seriously expect Hyper-V to be adopted for our compute clusters, but is including this hypervisor for completeness.
VirtualBox is widely used, particularly in small-scale projects, and is well regarded. In a 2010 survey, more than 50% of respondents who used a virtualization product used this one. Jimc has had good experience with Sun's VirtualBox. See this Wikipedia article about VirtualBox. Here are its major features:
Supported guests include Windows, Linux, BSD, Solaris, etc.
Live migration is supported.
Xen uses the bare metal style. It has the major advantage that it is included in the main OpenSuSE distro, and Mathnet has limited and occasionally rocky experience with it. See this Wikipedia article about Xen. Here are some of its features:
Live migration and checkpointing are supported, in which a rough copy of the running image is made, and then the guest is paused for 60 to 300 msec while a final copy is made. The migrated instance on another machine then resumes. For a checkpoint, the original copy is the one that resumes.
KVM is currently in active use on
TOP 500 clusters and is
supported by all of the cloud managers considered.
It is provided in the main OpenSuSE distro, but nobody at Mathnet
has used it so far. See this
Wikipedia article about KVM.
Here is a summary of the hypervisors:
| Name | Style | Active | License | In Distro | Cloud | Used By | Used For                |
|------|-------|--------|---------|-----------|-------|---------|-------------------------|
| UML  | User  | No     | GPL     | No        | No    | jimc    | OS installation testing |
| Xen  | Bare  | Yes    | GPL     | Yes       | Yes   | charlie | WinXP remote desktop    |
| KVM  | User  | Yes    | GPL     | Yes       | Yes   | jimc    | Development, Turbo Tax  |
For virtualization it is very important to select a good hypervisor. Jimc is more inclined to pick the userland style, or at least to not pick Xen, because while we have not had actual incompatibilities with Xen, it feels safer to have our real kernel running on the bare metal. Also we have had odd networking effects with Xen, which may or may not be our fault and which may or may not work out better with another hypervisor.
I am familiar with VirtualBox, in the sense that it's the devil I know, and I'm pretty sure it would serve our needs. However, I mistrust ongoing support from Oracle, and if we go with one of the formal cloud managers, I'm not sure if they can handle VirtualBox.
KVM appears to be popular on major high performance clusters (as is Xen), definitely would serve our needs, and is supported by the formal cloud managers. I think we should start our evaluation of hypervisors with KVM.
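As a first step of that evaluation, a node can be smoke-tested for KVM readiness without installing anything beyond the stock kernel:

```shell
#!/bin/sh
# Quick KVM readiness check: does the CPU advertise hardware
# virtualization (vmx for Intel, svm for AMD), and is /dev/kvm present
# (i.e. is the kvm kernel module loaded)?
kvm_check() {
    if grep -q -E 'vmx|svm' /proc/cpuinfo 2>/dev/null
    then echo "cpu: hardware virtualization flag present"
    else echo "cpu: no vmx/svm flag; KVM would fall back to slow emulation"
    fi
    if [ -e /dev/kvm ]
    then echo "/dev/kvm: present (kvm module loaded)"
    else echo "/dev/kvm: absent; try modprobe kvm_intel or kvm_amd"
    fi
}
kvm_check
```

If both checks pass, installing the KVM packages from the distro and booting a test guest should be straightforward.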