
OpenBLAS Installation at Mathnet

James F. Carter <jimc@math.ucla.edu>, 2015-03-28

BLAS means Basic Linear Algebra Subprograms. See this Wikipedia article about BLAS and its variants. It does basic operations such as matrix multiplication (DGEMM), which are building blocks for more complex operations like eigenvector extraction (DGEEV in LAPACK). At UCLA-Mathnet we have for a long time used the stock BLAS which is standard in the OpenSuSE distro: blas-devel-3.4.2 in OpenSuSE 13.1. However, this is the reference implementation: it is locally optimized but does not take into account details of the processor architecture such as the hierarchical cache and multiple cores.

It has been pointed out that we could get a considerable speed improvement by using one of the globally optimized BLAS implementations. These offer the same API (subroutine names and arguments) but the algorithms take cache characteristics into account.

Benchmark Program

My first step was to write a simplified version of the LAPACK benchmark. Following its lead, I devoted equal time to each of DGEMM (matrix multiplication), DGESV (solving linear equations), and DGEEV (eigenvectors and values). The time to solve these matrix problems scales as N³, where N is the dimension of the matrix, so an appropriate speed measure is the time (seconds) to do each matrix divided by N³. For each routine I do some preliminary tests to estimate the speed, then set N so that one big matrix takes up the rest of the assigned time, assuming the speed truly scales with N³, which is not always the case. Checking can also be turned on, to make sure that the subroutines are being called correctly.
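To illustrate the timing idea, here is a minimal sketch (not the actual benchmark linked below): time one large DGEMM and report seconds/N³. It assumes the usual Fortran calling convention (dgemm_ with a trailing underscore, arguments passed by reference) and uses a fixed, purely illustrative N.

/* Sketch only: time one DGEMM call and report sec/N^3.
 * Build: cc -O2 -o bench bench.c -lblas  */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fortran-convention prototype for double-precision matrix multiply. */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void)
{
    int n = 2000;                       /* matrix dimension N (illustrative) */
    double alpha = 1.0, beta = 0.0;
    double *a = malloc((size_t)n * n * sizeof *a);
    double *b = malloc((size_t)n * n * sizeof *b);
    double *c = malloc((size_t)n * n * sizeof *c);
    for (long i = 0; i < (long)n * n; i++) {
        a[i] = drand48();
        b[i] = drand48();
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("N=%d  DGEMM %.3f s  sec/N^3 = %.3e\n",
           n, sec, sec / ((double)n * n * n));
    free(a); free(b); free(c);
    return 0;
}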

Here are links to the program source and the Makefile.

Available BLAS Implementations

There are two optimized BLAS packages that are credible for Intel-type processors and available in OpenSuSE: ATLAS (Automatically Tuned Linear Algebra Software) and OpenBLAS (a descendant of GotoBLAS2).

For both of them, SuSE gives a version that performs well on the range of machines on which SuSE is likely to be used, but which is not optimized for any specific processor. You can, however, compile the source code on the target machine and get the best performance for that architecture. Documentation suggests 10% to 15% performance improvement if you do this. However, Mathnet has a variety of processors, and managing multiple architecture-specific versions would be a lot of work, so I decided, at least initially, to install the generic packages.

Another issue is multi-threading: SuSE offers a threaded version of OpenBLAS. Naively you would expect a 4x speedup on a quad-core machine, but recruiting a thread in Linux (and Solaris and other operating systems) is expensive, and you don't get full proportionality. I recommend that the programmer do threading at the highest possible level and not use BLAS threading. In fact, many researchers at Mathnet make effective use of low-tech manual threading: submit N single-thread jobs that do not interact, keeping the code very simple. There is contention for the memory bus and possibly I/O (as there would be with an internally threaded algorithm), so you don't speed up by a factor of N, but you usually come fairly close. Nonetheless, I did install the pthreaded version of OpenBLAS.
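If a program is linked with the threaded library but a particular run should stay serial, OpenBLAS can be capped: it honors the OPENBLAS_NUM_THREADS environment variable (e.g. OPENBLAS_NUM_THREADS=1 ./myprog), and it also exports openblas_set_num_threads(). A minimal sketch of the latter (the prototype is declared by hand here rather than relying on a header):

/* Sketch: cap the threaded OpenBLAS at one thread for this process. */
#include <stdio.h>

extern void openblas_set_num_threads(int num_threads);

int main(void)
{
    openblas_set_num_threads(1);    /* behave like the serial library */
    /* ... BLAS calls made after this point run single-threaded ... */
    puts("OpenBLAS capped at 1 thread for this run");
    return 0;
}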

Since ATLAS was the package requested, I tested it first. Unfortunately it got a segmentation fault in DGEMM, and since my benchmark runs error-free (including checking) with the stock BLAS on a range of Mathnet and home machines, I point the finger of blame at SuSE's ATLAS build, presumably some obscure compatibility issue. Debugging their code and/or configuration did not sound attractive.

Having read blog and forum postings (follow the link for one example) saying that ATLAS and OpenBLAS give generally similar improved performance, and OpenBLAS is easier to set up, I decided to give it a try first, before attempting to debug ATLAS. Indeed, OpenBLAS is spectacularly faster than the stock BLAS.

Installing and Linking

To install OpenBLAS do these steps:

zypper install libopenblas0 libopenblasp0
update-alternatives --config libblas.so.3 #And select libopenblas0.so.0

In the first step, I install both the single and multiple thread versions (libopenblasp0) even though not every problem benefits from threading. Mathnet has an enterprise mirror into which I have downloaded the RPM files. At other sites, since libopenblas0 is in an unstable sub-repo which you are unlikely to have turned on, you may need to download it to temporary storage and then give the full path name to Zypper. In OpenSuSE 13.2, OpenBLAS is in the main distro so the download step will not be necessary.

The second step causes a symbolic link to be made so libblas.so.3 really points to OpenBLAS and not to the stock BLAS. Thus when a user builds a program with -lblas on the linker command line, he will get the then-prevailing alternative, without needing to recompile if a different package is installed. I wrote a daily housekeeping script that reviews such non-default alternative settings (presently this is the only one) and complains if any machine is set contrary to policy.
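Since the alternative can change underneath an already-built program, it can be handy to check at run time which implementation -lblas actually delivered. One way (my suggestion, not part of the housekeeping script) is to probe for a symbol that only OpenBLAS exports, such as openblas_set_num_threads; a sketch:

/* Sketch: report whether libblas.so.3 resolved to OpenBLAS at run time.
 * Build: cc -o whichblas whichblas.c -lblas -ldl
 * The dasum_ call is only there so libblas.so.3 really gets loaded. */
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

extern double dasum_(const int *n, const double *x, const int *incx);

int main(void)
{
    double v[3] = { 1.0, -2.0, 3.0 };
    int n = 3, inc = 1;
    printf("dasum = %g\n", dasum_(&n, v, &inc, &inc ? &inc : &inc));  /* 6 */

    /* openblas_set_num_threads() exists in OpenBLAS but not in the stock BLAS. */
    if (dlsym(RTLD_DEFAULT, "openblas_set_num_threads"))
        puts("libblas.so.3 is OpenBLAS");
    else
        puts("libblas.so.3 is not OpenBLAS (probably the stock BLAS)");
    return 0;
}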

To link with the threaded version, replace -lblas with either -lopenblasp0 or the full path to the library, currently /usr/lib64/libopenblasp.so.0.
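The calling code does not change, only the link line. For example, a program that uses only BLAS calls can be built both ways from the same source (a minimal sketch; the file name and data are made up):

/* Sketch of the two link styles for the same source file:
 *   serial (whatever the libblas.so.3 alternative points at):
 *     cc -o dot dot.c -lblas
 *   threaded OpenBLAS, by full path:
 *     cc -o dot dot.c /usr/lib64/libopenblasp.so.0  */
#include <stdio.h>

/* Fortran BLAS dot product, double precision. */
extern double ddot_(const int *n, const double *x, const int *incx,
                    const double *y, const int *incy);

int main(void)
{
    double x[3] = { 1.0, 2.0, 3.0 };
    double y[3] = { 4.0, 5.0, 6.0 };
    int n = 3, inc = 1;

    printf("ddot = %g\n", ddot_(&n, x, &inc, y, &inc));   /* expect 32 */
    return 0;
}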

Performance Improvement

How much performance improvement is achieved? Here are some results on several representative machines; others were tested and gave generally similar results, scaled by their CPU speed.

                DGEMM                 DGESV                 DGEEV
Host      N     Sec/N³        N     Sec/N³        N     Sec/N³      BLAS
achilles  2529  6.536e-10     3626  1.819e-10     1078  4.604e-09   Stock
          5633  4.643e-11     7321  1.750e-11     1668  1.128e-09   OpenBLAS
          7278  1.494e-11     7299  6.179e-12     1321  1.317e-09   OpenBLAS pthread
joshua02  1976  1.191e-09     2893  3.555e-10      864  8.808e-09   Stock
          3454  1.826e-10     5046  6.166e-11     1228  2.447e-09   OpenBLAS
          6581  1.886e-11     8709  7.105e-12     1076  2.042e-09   OpenBLAS pthread
nemo01    1577  2.487e-09     2269  7.958e-10      687  1.918e-08   Stock
          2806  3.406e-10     3746  1.211e-10      906  7.384e-09   OpenBLAS
          3447  1.772e-10     4476  6.600e-11      838  7.658e-09   OpenBLAS pthread
Diamond   2212  8.592e-10     3076  2.825e-10      896  7.727e-09   Stock
          4384  9.928e-11     5949  3.550e-11     1404  1.808e-09   OpenBLAS
          4867  6.215e-11     6539  2.290e-11      920  3.531e-09   OpenBLAS pthread

How much speedup was there, as a ratio?

Host      DGEMM  DGESV  DGEEV  Comparison
achilles  14.08  10.39   4.08  OpenBLAS/stock
          43.75  29.44   3.50  pthread/stock
           3.11   2.83   0.86  pthread (4x)/1 core
joshua02   6.52   5.77   3.60  OpenBLAS/stock
          63.15  50.04   4.31  pthread/stock
           9.68   8.68   1.20  pthread (6x)/1 core
nemo01     7.30   6.57   2.60  OpenBLAS/stock
          14.03  12.06   2.50  pthread/stock
           1.92   1.83   0.96  pthread (1x)/1 core
Diamond    8.65   7.96   4.27  OpenBLAS/stock
          13.82  12.34   2.19  pthread/stock
           1.60   1.55   0.51  pthread (2x)/1 core

So in summary, DGEMM (matrix multiply) speeds up by roughly 7x to 14x, the newer and faster machines getting more benefit. DGESV (linear equations) speeds up by roughly 6x to 10x; for some reason joshua02 has the least benefit. DGEEV (eigenvectors) speeds up by 2.6x to 4x, the faster machines doing better.

As for threading, the most spectacular result was 63x on matrix multiplication on joshua02 (6 cores). OpenBLAS promises to avoid using Intel hyperthreads, that is, it should limit the number of threads to the number of physical cores (not doubled), to avoid the overhead of spawning threads that cannot bring any benefit; in my tests, however, the hyperthread CPUs were fully utilized. For DGEMM (matrix multiply) and DGESV (linear equations) the speedup versus serial OpenBLAS was roughly proportional to the number of cores (very roughly, and the overhead of recruiting threads is clear). In fact, joshua02 speeds up by more than its number of cores; likely OpenBLAS is not optimal for its cache configuration, and when the problem is sliced up for threading it becomes more cache-friendly. But for DGEEV (eigenvectors) only joshua02 ran a bit faster with threading, a disappointing result.

So my conclusion is, serial OpenBLAS is very successful as a drop-in replacement for the stock BLAS, and pthreaded OpenBLAS can give spectacular performance for particular kinds of problems. But the user needs to compare his or her particular algorithm with and without threading. Also, manual threading, i.e. multiple non-interacting single thread jobs, can be very effective with little effort.