John Gunnels (UT-Austin)
Greg Henry (Intel)
Robert van de Geijn (UT-Austin)
Release R1.1 for Intel Pentium (R) III
To be kept informed, sign up for the FLAME mailing list.
ITXGEMM is an implementation of matrix-matrix multiplication
that builds on some recent theoretical results of ours
that show how to take advantage of all layers of memory hierarchies
on modern microprocessors. The project is a collaboration
between Greg Henry at Intel (R) and John Gunnels and
Robert van de Geijn at
The University of Texas at Austin.
About the current implementation:
- It specifically targets the Intel Pentium (TM) III processor.
Don't expect good performance on a Pentium (TM) II or an Intel Celeron (TM).
- It is for the Linux operating system. Don't try
to use it under Windows (TM) yet.
- It only supports double-precision real (64-bit) arithmetic for now.
Obtaining the library
If you would like to try out this implementation, please
go through the following steps:
Get the auxiliary routines from Greg's web site:
These are assembly coded routines that implement
a matrix-matrix multiplication of submatrices that
are staged to take maximal advantage of the L1 cache.
Note: you need
libITXauxR1.0PIII.a for ITXGEMM release R1.0
Get our kernels that stage the computation to take
full advantage of the L2 and L1 caches.
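The staging idea can be sketched in plain C: a block of A is kept resident in a cache level while panels of B and C stream past it, and a small inner kernel does the actual multiply. The block sizes, function names, and plain-C inner kernel below are illustrative placeholders only, not the tuned values or the assembly-coded kernels ITXGEMM actually uses. Column-major storage, as in the BLAS.

```c
/* Sketch of a cache-blocked matrix-matrix multiply, C += A*B.
 * MB/KB/NB are illustrative block sizes, not ITXGEMM's tuned values. */
#include <stddef.h>

#define MB 64   /* rows of the A block kept resident in cache (illustrative) */
#define KB 64   /* shared dimension of the A block (illustrative) */
#define NB 8    /* columns of B streamed past the A block (illustrative) */

/* In ITXGEMM this innermost piece is assembly-coded; here it is plain C. */
static void inner_kernel(int m, int n, int k,
                         const double *A, int lda,
                         const double *B, int ldb,
                         double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++) {
            double b = B[p + j * (size_t)ldb];
            for (int i = 0; i < m; i++)
                C[i + j * (size_t)ldc] += A[i + p * (size_t)lda] * b;
        }
}

void blocked_dgemm(int m, int n, int k,
                   const double *A, int lda,
                   const double *B, int ldb,
                   double *C, int ldc)
{
    for (int p = 0; p < k; p += KB) {
        int kb = k - p < KB ? k - p : KB;
        for (int i = 0; i < m; i += MB) {
            int mb = m - i < MB ? m - i : MB;
            /* The mb-by-kb block of A now stays resident in cache
             * while panels of B and C stream through it. */
            for (int j = 0; j < n; j += NB) {
                int nb = n - j < NB ? n - j : NB;
                inner_kernel(mb, nb, kb,
                             &A[i + p * (size_t)lda], lda,
                             &B[p + j * (size_t)ldb], ldb,
                             &C[i + j * (size_t)ldc], ldc);
            }
        }
    }
}
```

The fringe handling (the `mb`/`nb`/`kb` minimums) is what lets a blocked code remain correct for odd-sized matrices, not just multiples of the block sizes.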
Add the following libraries when you link your code:
If you want to get our faster dgemm kernel, but
you want to link other BLAS routines from another
library, use the following order:
libITXGEMMR1.0PIII.a OtherLibrary.a libITXauxR1.0PIII.a
If you want to link with the sblas13d.a
library of BLAS for Linux, link
libITXGEMMR1.0PIII.a sblas13d.a libITXauxR1.0PIII.a
If you would like to link with the ATLAS BLAS, link
libITXGEMMR0.1PIII.a libatlas.a libITXauxR1.0PIII.a
Do not redistribute the library. Refer others
to this web page or Greg's web page instead.
Reference this work when you use it successfully
for your own research.
How to do your own performance evaluation.
Note: ATLAS has implementations of some
LAPACK routines as part of the library (e.g. dgetrf).
Thus, to do a fair comparison between ATLAS and ITXGEMM,
you will need to order the libraries upon linking as follows:
liblapack.a libITXGEMMR1.0PIII.a libatlas.a libITXauxR1.0PIII.a
This will force a routine like dgetrf to be taken from
lapack and then linked with the ITXGEMM matrix-matrix multiply.
After timing this, you should then link only with
liblapack.a libatlas.a
in that order and time to see how well the same LAPACK routine does
with the ATLAS matrix-matrix multiply. Finally, link with
libatlas.a liblapack.a
in that order and evaluate the ATLAS dgetrf routine.
Notice that we have an optimized version of
dgetrf that is faster than either ATLAS or LAPACK,
but it is not yet part of the ITXGEMM release.
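A timing driver along these lines can be used for such comparisons. The `multiply()` stand-in here is a plain C triple loop so the sketch stays self-contained; in practice you would call the dgemm from whichever library you linked in the orders above. The useful fact is that C += A*B costs 2mnk flops, so MFLOPS = 2mnk / time / 10^6.

```c
/* Simple timing driver for a matrix-multiply routine.
 * multiply() is a stand-in triple loop for C += A*B (column-major);
 * replace the call with the dgemm you actually linked. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void multiply(int m, int n, int k,
                     const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j * m] += A[i + p * m] * B[p + j * k];
}

/* Times one C += A*B and reports the rate in MFLOPS. */
double time_multiply(int m, int n, int k)
{
    double *A = malloc(sizeof(double) * m * k);
    double *B = malloc(sizeof(double) * k * n);
    double *C = calloc((size_t)m * n, sizeof(double));
    for (int i = 0; i < m * k; i++) A[i] = 1.0;
    for (int i = 0; i < k * n; i++) B[i] = 1.0;

    clock_t start = clock();
    multiply(m, n, k, A, B, C);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* 2*m*n*k flops; guard against a timer reading of zero. */
    double mflops = 2.0 * m * n * k / (seconds > 0 ? seconds : 1e-9) / 1e6;
    printf("m=%d n=%d k=%d: %.1f MFLOPS\n", m, n, k, mflops);
    free(A); free(B); free(C);
    return mflops;
}
```

Run it over a range of sizes, including odd ones, since blocked codes can behave very differently away from multiples of their block sizes.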
Naturally, ITXGEMM is fast.
Possibly the fastest by some measure.
In particular, test performance for odd-sized matrices.
Test if it is fast for your application and let us know!
Next time someone promotes another package,
ask for performance comparisons with ITXGEMM!
Performance results from the paper
presented at ICCS01
For related publications, see the FLAME
publication web page.
Commonly asked questions
Yes, we rely on assembly-coded kernels.
There really are only three such kernels, and
they are tiny by most measures. The rest is
all in C.
Our thesis: To unleash the true power
of a processor, one must code in assembly
at least, and at most,
the inner kernel, since compilers will
always lag behind.
Yes, there are many opportunities for optimization
left. We have only just begun.
Yes, we can do the same for other architectures.
No, we are not funded to do so.
Yes, we can add our techniques to other packages
to accelerate their performance.
No, we are not funded to do so.
Yes, we submitted an extended abstract on our techniques to SC00.
Lovely paper, if I do say so myself.
They chose not to accept it. Thus, you will
have to wait for the journal paper instead.
Yes, it is inconvenient to have
two ".a" files.
No, we do not have any bright ideas
about how to otherwise handle the intellectual
property rights questions raised by the UT and Intel lawyers.
Get on the FLAME mailing list
Please sign up for the FLAME mailing list
so we can keep you informed of new developments regarding ITXGEMM.
We have a full set of level-3 BLAS
implemented using FLAME. They attain
performance similar to that of the LU factorization
shown on the performance web page.
They will be released shortly.
Get on our mailing list to remain informed.
The IA-64 will be targeted next.
Please give us feedback by e-mail on how this kernel helps or hurts
performance for your application.
THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED
WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY,
NONINFRINGEMENT OF INTELLECTUAL PROPERTY, OR FITNESS FOR ANY
PARTICULAR PURPOSE. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS OR ITS
SUPPLIERS BE LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING, WITHOUT
LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, LOSS
OF INFORMATION) ARISING OUT OF THE USE OF OR INABILITY TO USE THE
MATERIALS, EVEN IF THE UNIVERSITY OF TEXAS HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES. BECAUSE SOME JURISDICTIONS PROHIBIT THE
EXCLUSION OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL
DAMAGES, THE ABOVE LIMITATION MAY NOT APPLY TO YOU. The University of
Texas further does not warrant the accuracy or completeness of the
information, text, graphics, links or other items contained within
these materials. The University of Texas may make changes to these
materials, or to the products described therein, at any time without
notice. The University of Texas makes no commitment to update these materials.
Back to FLAME page
This web page is maintained by
Robert van de Geijn
Last Updated: Dec. 14, 2000