SpiderMind Library VML

This page details performance results for a long vector math library written by Warren A Hunt. This library is designed to provide high-performance primitives for data-parallel environments. The library is a simple extension of his high-performance SpiderMind short vector library.

The results presented are in cycles per element (CPEs), taken over 4096 random 32-bit floating-point numbers drawn from the "working" range of each function. CPE numbers INCLUDE ALL loop overhead. The "working" range is defined as "all numbers that aren't extraordinary and don't produce an extraordinary result." This definition of "working range" is significantly broader than the working ranges Intel uses for each function. For example, Intel's working range for ARCTAN is 0.77..0.99999 and for ARCSIN is 4.0..1.0e+19, even though ARCSIN is only defined on -1..1. Intel's "working" ranges are also degenerate for ARCCOS and ARCCOSH. It is now suspected that Intel has switched the ranges (and potentially the performance results) between the arc trig functions and the arc hyperbolic trig functions.
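The measurement described above can be sketched roughly as follows. This is only an illustration, not the harness used for the numbers on this page: the helper names (random_inputs, cycles_per_element, sqrt_kernel) are hypothetical, the scalar square root stands in for a real vector kernel, and the conversion from wall-clock time to cycles assumes a clock rate supplied by the caller rather than a hardware cycle counter.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: n random floats drawn uniformly between lo and hi,
// standing in for the "working" range of one function.
std::vector<float> random_inputs(std::size_t n, float lo, float hi,
                                 unsigned seed = 1) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(lo, hi);
    std::vector<float> v(n);
    for (float& x : v) x = dist(gen);
    return v;
}

// Times `reps` full passes of kernel(in, out, n), loop overhead included,
// and converts seconds to cycles per element using an ASSUMED clock rate.
template <class Kernel>
double cycles_per_element(Kernel kernel, const std::vector<float>& in,
                          std::vector<float>& out, int reps, double clock_hz) {
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int r = 0; r < reps; ++r) kernel(in.data(), out.data(), in.size());
    std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() * clock_hz /
           (static_cast<double>(reps) * static_cast<double>(in.size()));
}

// Scalar stand-in for a long-vector routine such as Sqrt.
void sqrt_kernel(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::sqrt(in[i]);
}
```

A call such as cycles_per_element(sqrt_kernel, in, out, 1000, 2.4e9) with in = random_inputs(4096, 0.5f, 2.0f) would give a rough CPE figure for a machine assumed to run at 2.4 GHz. Many repetitions are needed because a single pass over 4096 elements is too short to time reliably with a wall clock.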

All SpiderMind Library Functions are guaranteed to be within 1 bit of precision for their working range with the following two exceptions:

The table includes CPE counts for my algorithms compiled for IA32 and x64, and Intel's advertised high-accuracy (HA) and low-accuracy (LA) numbers (Intel does not advertise which ISA they are using). It also contains a performance ratio between the SpiderMind x64 implementations and the Intel high-accuracy (HA) numbers. All of these results are for one core of a Core2 micro-architecture. SpiderMind leaves out certain functions because they are used infrequently, are easy composites of other functions, or are only 1-3 instructions long and not interesting.

The Intel performance numbers are usually several times faster than the x87 hardware implementations (per element) and about twice as fast as other vector implementations such as id Software's SSE library (idlib). The SpiderMind Library VML is approximately 2x faster than Intel's VML. The SpiderMind VML is estimated to execute a MUL on about 55% of machine cycles.

The SpiderMind VML uses the SpiderMind SSE wrapper classes, with no direct use of intrinsics and NO ASSEMBLY! It also uses no software lookup tables and eight or fewer constants per function, so it does not impact the cache or require any memory reads per element. (For some functions, x64 mode is required to prevent register spilling.) The high performance of this library is derived from very detailed knowledge of the underlying micro-architecture, from careful implementation, and from minimax polynomials, an error-reduction technique applied when computing a Taylor-series approximation.
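To illustrate why a minimax polynomial helps, here is a small sketch comparing a degree-1 Taylor approximation of exp(x) on [-1, 1] with the degree-1 minimax fit, which for a convex function has a closed form. This is a textbook example chosen for brevity, not one of the library's actual polynomials; the names (Linear, taylor_exp, minimax_exp, max_abs_error) and the interval are assumptions made for the illustration.

```cpp
#include <algorithm>
#include <cmath>

// Coefficients of the approximation c0 + c1 * x.
struct Linear { double c0, c1; };

// Degree-1 Taylor expansion of exp(x) about 0: 1 + x.
Linear taylor_exp() { return {1.0, 1.0}; }

// Degree-1 minimax fit of exp(x) on [-1, 1]. For a convex function the
// best line has the slope of the chord through the endpoints and is
// shifted so the error at both endpoints and at the interior point c
// (where exp'(c) equals that slope) has equal size and alternating sign.
Linear minimax_exp() {
    const double a = -1.0, b = 1.0;
    double slope = (std::exp(b) - std::exp(a)) / (b - a);
    double c = std::log(slope);  // exp'(c) == slope
    double intercept =
        0.5 * (std::exp(a) + std::exp(c)) - 0.5 * slope * (a + c);
    return {intercept, slope};
}

// Largest |exp(x) - (c0 + c1 * x)| sampled on a uniform grid over [-1, 1].
double max_abs_error(double c0, double c1) {
    const int steps = 2000;
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        double x = -1.0 + 2.0 * i / steps;
        worst = std::max(worst, std::fabs(std::exp(x) - (c0 + c1 * x)));
    }
    return worst;
}
```

With the same two constants and the same amount of arithmetic per element, the Taylor line has a worst-case error of about 0.718 (at x = 1), while the minimax line has about 0.279, spread evenly across the interval. Higher-degree minimax fits generally have no closed form and are usually computed numerically, for example with the Remez algorithm.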

All values are CPEs. Xfaster is Intel HA divided by SML x64. A "-" marks a function with no SpiderMind result.

Function   SML x64   SML IA32  Intel HA  Intel LA  Xfaster
Inv        1.14      1.08      4.38      1.98      3.84
Div        1.21      1.11      4.34      2.44      3.58
Sqrt       1.46      1.4       5.49      2.5       3.76
InvSqrt    1.33      1.28      3.86      2.19      2.9
Cbrt       -         -         10.66     7.28      -
InvCbrt    -         -         14.1      9.7       -
Pow        11.47     11.72     24.93     24.93     2.17
Powx       11.47     11.72     23.92     23.86     2.08
Exp        5.06      4.96      6.88      6.62      1.35
Ln         5.64      5.84      11.26     9.56      1.99
Log10      6         6.43      10.71     9.5       1.78
Cos        4.27      4.17      9.23      7.2       2.16
Sin        4.08      3.98      9.58      6.29      2.34
SinCos     -         -         18.77     10.16     -
Tan        6.3       6.2       19.58     10.26     3.1
Acos       11        10.93     15.74     8.94      1.43
Asin       10.79     10.8      10.7      8.57      0.99
Atan       10.62     10.99     15.63     7.95      1.47
Atan2      10.62     10.99     36.36     14.35     3.42
Cosh       6.08      6.39      12.28     9.48      2.01
Sinh       6.2       6.38      18.04     9.6       2.9
Tanh       6.9       7         16.91     12.96     2.45
Acosh      7.91      8.02      13.93     13.93     1.76
Asinh      7.88      8.05      14.47     13.82     1.83
Atanh      7.5       8.02      20.14     14.94     2.68
Erf        -         -         16.85     13.1      -
Erfc       -         -         29.14     22.8      -
ErfInv     -         -         24.01     14.48     -
Hypot      2.29      2.11      12.01     12.01     5.24
Floor      -         -         2.27      2.27      -
Ceil       -         -         2.33      2.33      -
Trunc      -         -         2.26      2.26      -
Round      -         -         2.89      2.89      -
Rint       -         -         2.24      2.24      -
NearbyInt  -         -         2.22      2.22      -
Modf       1.6       1.51      2.79      2.79      1.74

Intel performance results were obtained from:
http://www.intel.com/software/products/mkl/data/vml/functions/_performanceall.html