Cholesky factorization

  1. Change into the Cholesky front-end test driver directory.

    > cd src/lapack/dec/chol/front/flamec/test/fla/

    This directory contains a test driver set up to perform a sequential (non-parallel) Cholesky factorization with libflame. That is, parallelism will be enabled neither at the higher algorithmic levels within libflame nor at the lower levels within the BLAS library (in this case, GotoBLAS).

  2. Modify the local makefile according to the locations of the libflame libraries and GotoBLAS. The variable assignments of interest should read:

    INST_PATH    := $(HOME)/flame
    LIB_PATH     := $(INST_PATH)/lib
    INC_PATH     := $(INST_PATH)/include
    FLAME_LIB    := $(LIB_PATH)/libflame.a
    BLAS_LIB     := $(LIB_PATH)/libgoto.a

    If you used an install path other than the default, change the value of INST_PATH to reflect this. In either case, you will need to update the value of BLAS_LIB to reflect the location of the GotoBLAS library you built earlier.

    Note: If you did not configure libflame with --enable-builtin-lapack-routines, you must also link against LAPACK. The easiest way to do this is to change BLAS_LIB so that your LAPACK library is listed before the BLAS library, for example (the exact path and filename of your LAPACK archive will differ):
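
    BLAS_LIB     := $(LIB_PATH)/liblapack.a $(LIB_PATH)/libgoto.a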

    Note: Though we highly recommend using libflame with libgoto, we allow users to use other BLAS libraries. If you wish to use a BLAS library other than GotoBLAS, you should go back and re-configure libflame with --disable-goto-interfaces, and then re-compile and re-install the library.
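
    A re-configuration might look roughly like the following; the --prefix option shown here is only illustrative and stands in for whatever options you used when you originally configured libflame:

    > ./configure --prefix=$HOME/flame --disable-goto-interfaces
    > make
    > make install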

  3. Though it is sometimes not strictly necessary, the LDFLAGS variable in the makefile should be set to the link flags shown in the libflame configuration summary. These link flags are also recorded in the post-configure.sh file, which is located relative to the top-level libflame directory at

    config/<host_string>/post-configure.sh

    where <host_string> is a string identifying your particular host architecture.
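
    If you are unsure of the host string, listing the config directory will show it. On a typical 64-bit GNU/Linux system, for example, it might look something like:

    > ls config/
    x86_64-unknown-linux-gnu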

  4. Set the compilers in the local makefile to be the same as the ones used to build libflame. If you don't remember which ones were used, refer to the post-configure.sh script. Typically GNU compilers (gcc and gfortran) are used on most architectures if they are present. (One exception: if libflame is configured on an Itanium system, the Intel compilers icc and ifort are given preference over GNU, assuming they are present.)
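
    If you used the GNU compilers, the relevant assignments in the local makefile will typically look something like the following; the exact variable names may differ slightly in your copy of the makefile:

    CC           := gcc
    FC           := gfortran
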
  5. Build the test driver.

    > make

    If you get error messages that look similar to the following:

    ld: i386 architecture of input file `/home/field/flame/lib/goto/libgoto_opteron-r1.22.a(saxpy.o)' is incompatible with i386:x86-64 output

    then you probably built a 32-bit GotoBLAS library when you needed a 64-bit version. Go back to your GotoBLAS directory, edit the Makefile.rule file to uncomment the BINARY64 = 1 line, and rebuild the library with make. Then return to the test driver directory and try building the test driver program again. If the above error was the only kind of error you encountered, the program should now link cleanly.
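
    If you are not sure whether your GotoBLAS archive was built as 32-bit or 64-bit, one way to check (assuming GNU binutils is installed) is to inspect the object files it contains; a 64-bit build should report a file format of elf64-x86-64 rather than elf32-i386:

    > objdump -f /home/field/flame/lib/goto/libgoto_opteron-r1.22.a | grep 'file format'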

  6. Run the test driver with default input parameters.

    > ./test_Chol.x < input

    Within the output you should see pairs of lines similar to the following:

    data_chol_l( 1, 1:5 ) = [ 100   2.545 0.00e+00   2.545 2.99e-14 ];
    data_chol_u( 1, 1:5 ) = [ 100   2.525 0.00e+00   2.506 5.69e-13 ];

    This output is in MATLAB format; that is, it can be fed into MATLAB in order to plot the performance results. The first row shows results for the lower triangular case and the second row shows results for the upper triangular case. The numbers of interest lie between the '[' ']' brackets: the first column is the matrix size, the second column is the performance of the reference implementation, the fourth column is the performance of the libflame implementation, and the fifth column is the maximum element-wise difference between the libflame result matrix and the result given by the reference implementation (in this case, netlib LAPACK). The third column may be ignored. Performance is given in gigaflops (billions of floating-point operations per second, or GFLOPS), so higher numbers are better. As long as the differences between the libflame and reference results are very small (less than 1.0e-08), the result should be numerically accurate enough for most purposes.
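
    If you want to collect only these data lines for later plotting, one simple approach is to filter them out of the driver's output into a file; the filename chol_results.m below is just an example:

    > ./test_Chol.x < input | grep data_chol > chol_results.m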

  7. You may also run the test driver interactively. Just enter values for each of the input parameters as they are prompted.

    > ./test_Chol.x
    % number of repeats: 3

    The number of repeats determines how many times the experiment is run. Here we're asking for three repeats, meaning the test driver will report the best results of three consecutive trials.

    % enter problem size first, last, inc: 100 1000 50

    Here the program prompts for the first problem size, the maximum problem size, and the increment. For the values shown above, the first experiment would use matrices 100-by-100 in size, and then 150-by-150, and so on, up to 1000-by-1000.

    % enter m (-1 means bind to problem size): -1

    Enter -1 here regardless of the previous input.
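
    These same three responses may instead be placed, one per line, in a file and redirected into the driver as in the previous step. For example, an input file matching the interactive session above would presumably look like this:

    3
    100 1000 50
    -1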

  8. If you are using a system with more than one CPU or processing core, you can get a performance boost by using SuperMatrix. If you are not sure how many cores are in your system, you can grep the cpuinfo file, which resides in the proc filesystem:

    > grep processor /proc/cpuinfo
    processor     : 0
    processor     : 1

    The number of lines of output indicates how many processing cores are available on the system. The output above shows that the system contains a total of two cores.
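
    Alternatively, grep can count the matching lines for you directly:

    > grep -c processor /proc/cpuinfo
    2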

  9. If you configured libflame with both multithreading and SuperMatrix enabled, then libflame also includes parallelized implementations of many common operations, including Cholesky factorization. Change to the following directory:

    > cd src/lapack/dec/chol/front/flamec/test/flash_sm/

    or, from the previous test directory:

    > cd ../flash_sm/
  10. Edit the makefile just as you did before in steps 2 through 4. Now edit the file named input. Its original contents should look like this:

    3
    100
    100 500 100
    -1
    2

    These values are the same as in the sequential case, except that the second and fifth lines are new. The second line specifies a blocksize: SuperMatrix uses hierarchical storage-by-blocks, and so the test driver must know what blocksize to use when creating the matrices. The fifth line specifies how many threads to use in the computation; running with two threads allows the computation to finish at most twice as quickly. Modify the number of threads to be an integer between 1 and the total number of processing cores on your system. Then you may run the test driver, redirecting the input file as before:

    > ./test_Chol.x < input

    The output should indicate that performance is rising beyond what was possible when libflame was being run sequentially:

    data_chol_l( 5, 1:5 ) = [ 500   3.339 0.00e+00   5.478 7.66e-13 ];
    data_chol_u( 5, 1:5 ) = [ 500   3.542 0.00e+00   6.305 2.26e-12 ];

    This is output from an experiment on a quad-core 2.4 GHz Opteron system using the above input values. The theoretical peak performance of one core on this machine is 4.8 GFLOPS. Here, we can see that both the lower and upper triangular cases of the Cholesky factorization provided by libflame exceed that peak because the computation is being divided between two threads. Performance usually improves for larger problem sizes; here is what we see on the same system for problem sizes of 1000-by-1000:

    data_chol_l( 10, 1:5 ) = [ 1000   3.465 0.00e+00   6.366 4.65e-11 ];
    data_chol_u( 10, 1:5 ) = [ 1000   3.593 0.00e+00   6.755 5.74e-12 ];

    Here, libflame with two threads is close to twice as fast as the sequential reference implementation.
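
    If your system has more than two cores, you can edit the fifth line of the input file to match your core count and re-run the driver. For example, on a four-core system the input file might read (problem sizes here are only illustrative):

    3
    100
    100 1000 100
    -1
    4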


Last Updated on 4 April 2008 by Field G. Van Zee.