Batch Jobs in the Cloud

James F. Carter <jimc@math.ucla.edu>, 2011-11-17

In a scientific research computing environment, various users each submit multiple long-running jobs to be executed on a plethora of hosts, often but not necessarily homogeneous and integrated into a cluster. It takes software infrastructure to make these jobs run. Our present job queueing system is way back version, and we now have a new compute cluster, so we need to either pick a modern job queueing system or to refurbish the one we have. In addition, cloud virtualization is now feasible and offers significant advantages.

Advance Conclusions

With links to the various sections.

Mathnet's Users and Jobs: Our users want bulk CPU power and don't want a lot of sophisticated features that they have to learn. They will accept whatever we give them if they can just run their jobs.
Cloud Computing Using Virtualization: When we have extremely long jobs (a month), we gain a lot in flexibility, reliability and responsiveness to the users if we run such jobs in a virtual machine. It isn't clear yet whether these VM's should be created for each job or should be permanent.
Workflow on a Virtual Machine: Creating a virtual machine is nontrivial but is not a prohibitive burden, if we go with creating virtual machines on the fly.
Available Cloud Managers: Cloud manager software exists for creating a public cloud of virtual machines. I doubt that we want a public cloud. We should create our virtual machines using ad-hoc scripting. Permanent virtual machines can be considered, versus creating them on the fly for each job.
Job Queueing Systems: Sun Grid Engine as we know it is expired; if we put in Open Grid Manager (its credible successor) it's equivalent to a whole new queueing system. The currently hot job queueing package, possibly the market leader, is SLURM.
Hypervisors: We need to pick a hypervisor, and there are several credible choices; Xen does not win by default. Jimc sees KVM as the leading contender.

Mathnet's Users and Jobs

The users and jobs in our environment have these characteristics:

User training is a major issue. If the system is too complicated the users will not use it, and the sysadmins also have limited time to spend learning how to manage the cluster.
Our users have modest expectations on the flexibility of computational resources. A very small number of preset configurations will probably serve 99% of needs. However, the possibility should exist for custom configurations, likely with assistance from the sysadmin.
Our users' jobs tend to use massive amounts of CPU time, moderate amounts of memory (e.g. 1Gb to 2Gb is often enough, but important classes of jobs use more), and relatively little I/O. In other departments, though, the jobs can be much more I/O intensive. Our jobs should never swap.
Our users rarely use multiple threads per job. It is probably going to be sufficient, for parallel jobs, to provide multiple cores on the same physical host. In other words, we can cut corners in the area of coordinating job segments on different physical hosts.
Presently our cluster users do not need realtime or interactive high performance computing. However, the future possibility of interactive use should not be precluded by the design. The Teran group do major interactive work involving graphics on their dedicated super workstations.
Our users' executable programs and scripts are available via the NFS shared filesystem, and NFS is satisfactory for other file storage also. Nonetheless, we should avoid depending on the storage host staying running for long periods. In other words, it is best to copy the user's files onto the execution host (real or virtual), and copy the results back when the job is finished. But the use of NFS resources by the long-running job should not be forbidden, just discouraged.
The longest jobs regularly encountered run for three to four weeks. This poses reliability issues and interferes with system administration (rebooting). On the other hand, it's important to deliver quick response when a researcher is developing an algorithm on a small test case.
Our users rarely or never need continuity beyond a single job. In contrast, in the business world it is common to put up a database server, web server, etc. in a virtual machine, which must persist reliably for many years. This means in particular that their virtual machines must receive regular patches, whereas we could cut corners in this area. On the other hand, the master images must be kept up to date in patches.
Our users are all known to us in advance and have accounts on the department net. Recharging expenses is not necessary. This is not a public cloud service.
Our users are semi-cooperative and semi-trustworthy. We do not need heroic measures to prevent rivals from snooping out information about each other's projects, as might happen on a commercial cloud service used by competing corporations. The security procedures necessary to meet our FERPA obligations and to exclude outside hackers are sufficient also for internal security. Our major security threats are:
- Having our machines 0wned by outsiders and used for nefarious purposes such as sending spam or fraud hosting.
- Vandalism of our users' data resources.
- Denial of service, either by accident or as part of vandalism.
We are able to accomodate the cluster's network topology within quite flexible limites. Access by the user to the host (virtual or real) running the job is not an important requirement but can be helpful in debugging. Access from off-site is even less important, but forbidding it is not a requirement and the host is expected to resist attacks from inside or outside the local net, whatever the access possibility ends up to be.

Whatever software infrastructure we pick, we as sysadmins want it to give these services to our users:

Pick a host to run the job on, preferably unoccupied or less loaded.
Assemble resources on that host, including the user's files, and start the job.
Tell the user how his job is progressing.
Tell the system administrator about the health, current load, and long-term utilization of the cluster.
Deliver the job's output to the user and dispose of resources.
Perform administrative operations such as:
- Make a host unavailable for running jobs.
- When the current job finishes, reboot the host.
- Move a job from one host to another.
- Make periodic snapshots of the jobs, that can be restarted if the host crashes.

Cloud Computing Using Virtualization

In the previous millenium, several job queueing systems were developed which accept a user's job and run it on one of a collection of hosts. In the present day, any relevant hardware can run a virtual machine, and several cloud managers are popular, whose job is to make a virtual machine appear at the user's request; he then runs his job there. Our initial decision should be, should we stick with the job queueing model, adopt a cloud (virtualized) model, or use some hybrid? Here is a list of advantages and disadvantages of the cloud model.

Virtualization adds a big layer of complexity that the users do not want to deal with, for the most part. Neither do the sysadmins.
A virtual machine can be paused and later restarted on the same or a different physical host. This is impractical when running on physical hardware. (But see the discussion of Condor.) If you can pause your virtual machine you get these important advantages:
- You can make periodic snapshots of the job, and if the physical host loses power or has a hardware failure, you avoid losing a month-long job. This is not too common for us but has happened. Other departments have more exposure in this area. For snapshots to work, all writeable files (e.g. output) have to be inside the virtual machine, not NFS mounted.
- Along the same line, you can reboot the physical host, e.g. when applying a kernel patch. We frequently run afoul of this issue.
- If every node has a long-running job, you can make room for a shorter job, e.g. for algorithm development and debugging. While at present we usually have unoccupied nodes, there have been incidents in the past where users could not get access for debugging, and justifiably complained.
Virtual machines can be more uniform than the physical hardware. However in our environment I doubt that uniformity is all that useful.
Virtual machines are good at delivering a subset of the physical machine's resources, and limiting the user to a particular number of cores and size of memory. In the past our clusters have had one core per host (Nemo does not make effective use of dual cores), but the new cluster has 12 cores per host (hyperthread in addition) and they should be used effectively. While CPU accounting (cpusets, see /usr/src/linux/Documentation/cgroups/cpusets.txt) could be used on the bare metal, we would have to write this ourselves, and it is attractive to use the feature through a hypervisor (virtual machine manager).

I am coming to the conclusion that we should run our users' jobs in individual virtual machines which are created on the fly for each job. An alternative is to pre-create and boot up the virtual machines, whereupon a traditional job queueing system can utilize them. But they can be paused and checkpointed, the same as for aleatory virtual machines, as long as the job queue manager does not become upset if it cannot contact the job shepherd process temporarily.

Workflow on a Virtual Machine

What would the user experience be like in such a regime? And how would the job queueing infrastructure be designed? The users would go through these steps:

The user pre-specifies the virtual machine configuration: number of cores, amount of memory, amount of disc scratch space, and an estimate of the job's duration (in normalized units, not wall clock seconds, which scales with the processor speed). If the job runs out of memory or disc, it will die. Estimating resource bloat is quite hard and this step will probably be the biggest source of complaints from users.
The user lists the files to be copied onto the virtual machine. NFS access should not be forbidden, but if the NFS server goes down the job will die. If outside files are being written, checkpoint copies of the virtual machine will likely be confused when reverting to a former state.
The cloud manager picks a physical host to start the job on.
The cloud manager assembles the virtual machine in these steps. The user sees nothing of this. This process sounds complicated but the main effort is in copying and extending the large image file, and copying the result to the execution host.
- Assigns a unique MAC address, hostname, and (maybe) IP address.
- Registers these host parameters with dynamic DNS and LDAP. A DHCP server may or may not be used to pass the IP address to the virtual machine; if so, its database must be updated.
- Copies a pre-made image of the operating system.
- Tweaks the virtual host definition file according to the requested memory, number of cores and MAC address.
- Appends and formats a partition for the requested scratch disc space and/or swap.
- Configures the image with its own hostname and IP address.
- Creates host keys for SSH and Kerberos, and implants them in the image (and in root's known_hosts file).
- Copies the image onto the initial physical host.
- Signals the hypervisor on that host to boot the virtual machine.
At this point, typical cloud managers hand off the new machine to the user where he does whatever he wants. We may want to provide a similar service, but our main focus is on batch jobs. We would like the job queueing system to copy user files onto the virtual machine and then start the job, without user intervention.
The user would like a simple display of his virtual machines showing these aspects:
- Hostname of the virtual machine (and physical machine also?)
- Whether the job is running, paused, or whatever other state.
- Amount of CPU time, memory and disc space used, and percent of the initial estimate.
- User provided progress indication and rate of progress.
The user should at least be able to kill the job if necessary. It can really help the user if he can log in to the virtual machine and inspect partial output and scratch files, to detect if the algorithm is stuck. To make this work, the cloud manager has to assign a fixed IP and register with DNS its relation with the hostname.
When the job finishes, the job queueing system should retrieve the output, and dispose of the virtual machine. It may be useful to keep the image around for postmortem analysis, but it should be freed fairly promptly because it takes a lot of disc space.

What's being described here is the operation of a standard job queueing system, not a public cloud. While the operations needed to create a virtual machine for each job are fairly extensive, they are well within the reach of normal scripting.

Available Cloud Managers

At present there are several commercial cloud vendors, such as Amazon EC2 and Microsoft Azure. Their business model is to create a virtual machine for you and charge you proportional to the amount of time you use it (plus network communication charges). There are several commercial and open source cloud managers by which the owner of a hardware cluster can implement the same business model. But that is not exactly what we want to do with our cloud. Nonetheless, let's look at published evaluations of various cloud managers.

This article gives quite a lot of useful information about three popular cloud managers:

A Comparison and Critique of Eucalyptus, OpenNebula and Nimbus by Peter Sempolinski and Douglas Thain, Univ. of Notre Dame, dated probably 2009.

They approach these systems as cluster managers and in anticipation of picking a job queueing system for a new cluster. Here is their summary table:

Item	Eucalyptus	OpenNebula	Nimbus
Philosophy	Imitate Amazon EC2	Private customizable cloud	For scientific research
Customization	Not too much	Totally	Many parts
Internal Security	Tight, root required	Looser, can be raised	Fairly tight
User Security	Custom credentials	Log into head node	Register X.509 cert
Ideal Use Case	Many hosts, semi-trusted users	Fewer hosts, highly trusted users	Semi-trusted or untrusted users
Network Issues	DHCP on head node only	Manual (fixed IP?)	Needs dhcpd on every node; Nimbus assigns the MAC

These three cloud managers are oriented to providing virtual machines for the users to run their jobs in, rather than running jobs on the bare metal. Their emphasis is on providing cloud computing on a local cluster. They mention Sun Grid Engine, Condor and Amazon EC2 as systems with alternative philosophies. In their setting, typical users need a fairly large number of cores, and the users know how to handle coordination between multiple virtual machines, e.g. via MPICH.

It is important to remember that a lot of infrastructure is needed in addition to the cloud manager. The authors identify these issues in particular:

The hardware needs to support pure virtualization; if only paravirtualization is supported, this limits the speed of the VM's and the flexibility of choosing the OS. (Jimc says: if the guest OS supports paravirtualization it can gain a speed advantage by using a proprietary data path to the hypervisor. But a totally paravirtual solution like User Mode Linux could have bottlenecks as well, as the authors say.)
Network management can be a royal pain, with virtual machines appearing and vanishing at random. The cloud manager needs to interact with DNS to give the VM's names and with DHCP to give them configuration parameters. (Jimc says: LDAP to register the virtual MAC addresses, and Avahi to advertise local services.)
Each of the cloud managers can support a variety of hypervisors or Virtual Machine Monitors (VMM). The ones mentioned are Xen, KVM and VMWare; VirtualBox may or may not be supported.
The sysadmin needs to provide disc images that will be useful to the clients. (Jimc says: a copy on write capability, like User Mode Linux has, can make deploying the images a lot neater. Generally more than one kind of image will be needed.)
The cloud managers vary in how useful their user front ends are. The authors note that all three cloud managers support extensive customization of their front ends.

The authors note, when testing cloud managers, that invariably the most frustrating aspect was to reconcile the software's assumptions about network organization with what the actual network was willing to provide.

Another useful resource is:

FermiCloud: Technology Evaluation and Pilot Service Deployment (slides from a talk, PDF or PowerPoint, dated 2011-03-22).

This slideshow describes the whole process of setting up a cloud service, in which the cloud manager is only one part of the puzzle. They also evaluated Eucalyptus, OpenNebula and Nimbus (one packed slide for each). They ended up deploying all three. The distribution of emphasis: the physical nodes are dual quad core Xeons, and they put 8 on OpenNebula, 7 on Nimbus and 3 on Eucalyptus. The OpenNebula framework does the best at meeting our requirements, they say. They will focus on OpenNebula in the future. The main positive aspect of OpenNebula is that it can be adapted easily to their needs (which differ from Mathnet's).

Here are summaries of the features of these three cloud managers:

Eucalyptus

ATS is, or recently was, investigating Eucalyptus as the cloud manager for one of its big clusters. See this Wikipedia article about Eucalyptus. Its major features are:

Imitates the Amazon EC2 cloud, so administrative tools and user training on EC2 can be useful.
Users only interact with the VM's through the provided tools, and they are forbidden to interact with the underlying hardware.
For user files it uses an imitation of Amazon S3 distributed storage.
The administrative design is suited for large numbers of hosts in multiple clusters.
It really helps to have a dedicated range of addresses for each cluster, to be assigned to virtual machines.

OpenNebula

This cloud manager is suited to a medium-sized cluster in which the users are trustworthy. Its major features are:

It has more opportunities for customization than any of the others.
Users can request arbitrary VM configurations (vs. a preset list).
Users log in to the head node to obtain service; this is how authentication and authorization are handled.
It normally uses NFS for all file transfers. It can also use SCP, or some different shared filesystem such as AFS.
Unencrypted NFS is snoopable.
Once the VM is running, the user can log into it.
Because of its great flexibility, the demands on user training are greater than for other cloud managers.
OpenNebula has a plug-in so it can take EC2-style user requests.

Nimbus

This cloud manager is advertised as being for scientific computing. See this rather skimpy Wikipedia article about Nimbus. Its major features are:

It is affiliated with Globus and uses some of its components, specifically for authentication with X.509 certificates.
For image file storage they use Globus GridFTP or Cumulus, which imitates Amazon S3.
There is a lot of scope for administrative customization, but the users have less tweakability than under OpenNebula.
The cloud manager software has to do SSH onto any compute node. Apparently this is a problem in some contexts.
It puts a DHCP server on each physical compute node, an unusual arrangement.
It has good support for capacity allocation, which is lacking in the other cloud managers.
It can divert jobs to other services including Amazon EC2.

Do It Yourself

I'm afraid that all of these cloud managers are overkill for us. The steps to create a virtual machine have been detailed above, and the professional cloud managers provide a lot of services beyond this that are of little or no value to us. It would be prudent to at least install them and try them out, but we may very well decide that this layer of complexity is way overkill, and that we want to write our own virtual machine generator.

Job Queueing Systems

Since about 2002 we have used Sun Grid Engine (SGE) as our job queueing system on the Sixpac and Bamboo clusters, later extended to the Nemo cluster. Prior to that we used PBS on the Ficus cluster. We now have a new cluster, called Joshua, and it is time to think about upgrading our job queueing system, or at the very least, refurbishing SGE with the latest version.

The relation of the virtual cloud to the job queueing system can take two forms.

First philosophy: Once the job queueing system has picked the physical host to run on, an integral part of starting the job is to create a virtual machine to contain it, and to dispose of this machine when the job is over.
Second philosophy: The job queueing system runs jobs not on physical hardware but on permanent virtual machines. The job queueing system neither knows nor cares that it's using virtual nodes. Administrative actions, such as pausing for a checkpoint or a reboot, are not taken into consideration by the job queueing system.

These are the job queueing systems extant in late 2011:

Sun Grid Engine (SGE)

This is the devil that we know. But our version is almost 10 years old, and the installation is a mess, so we really should upgrade to the latest version. But in the aftermath of economic chaos, Sun Microsystems no longer exists, and the product is now known as Oracle Grid Engine. See this Wikipedia article about SGE for the history. Due to license and ownership changes the project has forked as follows:

Oracle Grid Engine (version 2.6 and before are open source; current versions are commercial)
Univa Grid Engine (commercial)
Open Grid Scheduler (open source)

Portable Batch System (PBS)

We used this software on the Ficus cluster in the 1990's. Administrative experience was pretty bad. However, the command line interface of PBS has been enshrined in the POSIX standards, and in fact PBS does much of what you would want a job scheduler to do. It is not too clear how much interest there currently is in PBS, or the license status of the various forks, but one could consult this Wikipedia article about PBS. Apparently the fork that is currently in active use is TORQUE.

Condor

This software, from University of Wisconsin at Madison, has been around since the 1990's and is actively used and developed. It is subject to the Apache license (open source). It is documented for both cluster environments and cycle scavenging on end-users' workstations when otherwise idle, for which purpose it can do power management such as wake-on-LAN and hibernation. If jobs are linked with their hacked I/O library they can be paused and restarted, possibly on a different host. However, they have bowed to reality and can deal with non-pausable jobs.

Simple Linux Utility for Resource Management (SLURM)

This software, licensed under GPL2, is currently actively used and developed. It appears to be a no-frills queueing system but it has plugins for many extensions.

SLURM's main functions are: allocates nodes to users; starts and monitors jobs on these nodes; and arbitrates access to saturated resources.
It can coordinately allocate multiple nodes for a parallel job.
Nodes can be heterogeneous, up to 2^16 of them with multiple cores on each.
It can survive having the head node rebooted.
It knows how to suspend and resume jobs using SIGSTOP/SIGCONT. This is not really pausing the job, and the job still uses space for the image, swap, and disc scratch files.
MySQL is mentioned as the normal database backend for complete job logging, prioritization and accounting. PostgreSQL can be used with unspecified limitations. A text log file is sufficient for simple situations.
Both docs and user comments suggest that configuration is simple and straightforward for a similarly simple cluster, but plugins and scripting interfaces allow it to be adapted to more complex local needs.
It has a plugin for Open Grid Forum's Distributed Resource Management Application API (DRMAA), for job submission and control utilites. This API is also implemented by SGE, Condor, Torque, and about 7 other less popular job queueing systems. See this Wikipedia article about DRMAA.
They have a plugin that appears equivalent to our hostgroup utility. Apparently SLURM pays attention to hostgroups, and likely we could fairly easily get it to use our hostgroup table.
There is a PAM module that will keep users off nodes to which they have not been given leases.
It is used by many world-class computational clusters. Lots of glowing testimonials (selected) including from Lawrence Livermore Lab.
HP has adopted SLURM as the allocation manager in their XC cluster stack.

Given the license situation and development ambiguity I am inclined to not try to deal with Open Grid Scheduler; because it is ten years in advance of our present SGE version I'm pretty sure that we would not just be upgrading SGE, we would be putting in a whole new job queueing system.

To my mind PBS (or its modern equivalent, Torque) is even less attractive given our poor experience with the original product.

Condor is intriguing, but we have no investment in this product and it may not be as flexible as we might want. I wouldn't object to investigating it, but am not advocating it either.

SLURM appears to be the up-and-coming job queueuing system, and may actually be the current market leader, given the number of TOP 500 clusters that use it. Information from their website suggests that it should be easy to try out and that it is likely to meet our needs once tried. I recommend that we start our search with this product.

Hypervisors

If we are going to have virtual machines, we need a hypervisor to control them. They come in two variants. In the bare metal style the hypervisor manages the hardware and one of the virtual machines, called Dom0 in Xen, has the management role and has special privileges with the hypervisor. In the userland style there is a normal kernel and operating system, and the hypervisor runs as a normal user process, possibly with special kernel support.

Here's a list of the currently popular hypervisors.

User Mode Linux (UML)

This is the prototype of the userland hypervisors. Unfortunately it seems to have lost momentum in the face of recently appearing good and better supported alternatives. One of its very nice features was copy-on-write. It was possible to have a readonly filesystem image, used by multiple guests, which was overlain by a sparse file, one per guest, containing only changed sectors, and which therefore was much smaller.

Jimc is disappointed that UML has gone the way of the dodo.

VMWare

This was the first commercial product for virtualization. Earlier versions used the userland style, and likely this continues today. It is well regarded and widely supported by cloud managers. VMWare Workstation 8, the current version, costs $200 for a standard license, or $120 for academics. (Quantity discounts are possible but I did not investigate if they exist.) Each guest can have up to 8 cores, 2e12 bytes of disc, and 2^36 bytes (64Gb) memory. It is not clear how many simultaneously running guests your license allows. The license appears to be for one physical host, so it would cost $1920 to cover the Joshua cluster, and more in the likely case that we upgrade the job queueing system on the existing clusters.

Jimc does not expect that $1920 will be forthcoming.

Hyper-V or Viridian

This is Microsoft's hypervisor, of the bare-metal style. It requires Windows Server 2008 x86_64 to be the master instance (Dom0). It can support up to 4 cores per guest, and up to 384 guests per host. Guests can be 32-bit or 64-bit (x86_64). The guest OS can be Linux or all variants of Windows (excluding home versions). Hyper-V is a no-charge option. Windows Server 2008 is not free, but I believe that our site license covers it.

Jimc does not seriously expect Hyper-V to be adopted for our compute clusters, but is including this hypervisor for completeness.

Oracle VirtualBox

VirtualBox is widely used, particularly in small-scale projects, and is well regarded. In a 2010 survey, more than 50% of respondents who used a virtualization product used this one. Jimc has had good experience with Sun's VirtualBox. See this Wikipedia article about VirtualBox. Here are its major features:

It uses the userland style.
It was created by Innotek, purchased by Sun, and swallowed by Oracle.
Licensing has never been totally sanitary. As of version 4 the core package is GPL, but there are some proprietary extensions such as USB support and RDP.
The host OS can be Linux, OS-X, Windows (XP, Vista, Win7), Solaris and FreeBSD.
Supported guests include Windows, Linux, BSD, Solaris, etc.
Supported image formats include the VirtualBox native format, VMWare, and Viridian. It can also use iSCSI and can mount native host partitions.
These peripherals are supported. There are optional paravirtualized drivers (Guest Additions) that improve their performance.
- VESA compatible graphics; also OpenGL-2.0 and Direct3D
- Several Ethernet cards
- Sound: AC97, Intel HD, Soundblaster-16
- USB 1.1 pass-thru (USB 2.0 with proprietary extension)
Guests may have up to 32 cores.
Live Migration is supported.
The cloud managers reviewed here do not mention supporting VirtualBox.

Xen

Xen uses the bare metal style. It has the major advantage that it is included in the main OpenSuSE distro, and Mathnet has limited and occasionally rocky experience with it. See this Wikipedia article about Xen. Here are some of its features:

XenSource was the original organization for Xen (open source). It was bought by Citrix in 2007. As of 2009, Citrix committed to make all components of the Xen Cloud Platform open source, including enhancements that formerly were commercial.
Guests can be any operating system including Linux and Windows (any version).
Xen can handle up to 64 cores per host.
Paravirtual drivers (optional) can enhance efficiency and provide additional features.
It automatically starts Dom0 (the management virtual machine) on boot. This instance behaves like a normal bare-metal instance of the operating system.
Xen utilities can do live migration and checkpointing, in which a rough copy of the running image is made, and then it is paused for 60 to 300 msec while a final copy is made. The migrated instance on another machine then resumes. For a checkpoint the original copy is the one that resumes.
Linux for Dom0 used to require a kernel patch, and RHEL and Ubuntu withdrew their main distro support for it for that reason, but starting with kernel 2.6.37 these patches are included in mainline. 2.6.37 is the version in our distro, OpenSuSE 11.4.
The Wikipedia article lists a number of Xen Management Console programs.
Amazon EC2 uses Xen as its hypervisor.
The cloud managers reviewed here all support Xen.

Kernel-Based Virtual Machine (KVM)

KVM is currently in active use on TOP 500 clusters and is supported by all of the cloud managers considered. It is provided in the main OpenSuSE distro. But nobody at Mathnet has used it so far. See this Wikipedia article about KVM.

It uses the userland style.
License: GPL/LGPL.
KVM has been in the Linux kernel since 2.6.20. It's a loadable module on SuSE. This module gives access to hardware virtualization support on x86 and x86_64 processors.
While KVM was originally for x86 and x86_64, it has been ported to several other mainstream architectures, as hosts.
Qemu can be used to emulate foreign processors as guests.
Guests can have up to 16 cores and up to 2^43 (3.2e13) bytes memory.
These paravirtual drivers (optional) are available for both Linux and Windows guests:
- Ethernet card
- Disc controller
- Guest memory expander
- VGA graphics card
The Wikipedia article lists several management console packages.
The cloud managers reviewed here all support KVM.
Update in 2016: jimc has used KVM extensively with excellent results, for software development, OS installation testing, secure web hosting, and tax preparation under a Windows 10 guest (at home).

Here is a summary of the hypervisors:

Name	Style	Active	License	In Distro	Cloud	Used By	Used For
UML	User	No	GPL	No	No	jimc	OS installation testing
VMWare	User	Yes	$120	No	Yes	jimc	Evaluation
HyperV	Bare	Yes	Comml	No	No	none	none
VirtBox	User	Yes	~GPL	No	?	jimc	Win7 TurboTax
Xen	Bare	Yes	GPL	Yes	Yes	charlie	WinXP remote desktop
KVM	User	Yes	GPL	Yes	Yes	jimc	Development, Turbo Tax

For virtualization it is very important to select a good hypervisor. Jimc is more inclined to pick the userland style, or at least to not pick Xen, because while we have not had actual incompatibilities with Xen, it feels safer to have our real kernel running on the bare metal. Also we have had odd networking effects with Xen, which may or may not be our fault and which may or may not work out better with another hypervisor.

I am familiar with VirtualBox, in the sense that it's the devil I know, and I'm pretty sure it would serve our needs. However, I mistrust ongoing support from Oracle, and if we go with one of the formal cloud managers, I'm not sure if they can handle VirtualBox.

KVM appears to be popular on major high performance clusters (as is Xen), definitely would serve our needs, and is supported by the formal cloud managers. I think we should start our evaluation of hypervisors with KVM.