This page details performance results for a long-vector math library written by Warren A. Hunt. The library is designed to provide high-performance primitives for data-parallel environments. It is a simple extension of his high-performance SpiderMind short-vector library.
The results presented are in cycles per element (CPE), measured over 4096 random 32-bit floating-point numbers drawn from the "working" range of each function. CPE numbers INCLUDE ALL loop overhead. The "working" range is defined as "all numbers that aren't extraordinary and don't produce an extraordinary result." This definition of "working range" is significantly broader than the working ranges Intel uses for each function. For example, Intel's working range for ARCTAN is 0.77..0.99999 and for ARCSIN is 4.0..1.0e+19, even though ARCSIN is only defined on -1..1. Intel's "working" ranges are also degenerate for ARCCOS and ARCCOSH. It is now suspected that Intel has switched the ranges (and potentially the performance results) between the arc trig functions and the arc hyperbolic trig functions.
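The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for the published numbers: the page does not say how cycles were counted, so the sketch assumes wall-clock timing multiplied by a caller-supplied clock rate. The names `random_inputs`, `cycles_per_element`, `measure_cpe` and `sqrt_kernel` are hypothetical, and `sqrt_kernel` is a plain scalar stand-in rather than a SpiderMind function.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Fill a buffer with uniformly distributed floats taken from an assumed
// "working" range [lo, hi] for the function under test.
std::vector<float> random_inputs(std::size_t count, float lo, float hi, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(lo, hi);
    std::vector<float> values(count);
    for (float& v : values) v = dist(rng);
    return values;
}

// Convert elapsed time to cycles per element. clock_hz is an assumption
// supplied by the caller (for example 2.0e9 for a 2 GHz core).
double cycles_per_element(double seconds, double clock_hz, std::size_t elements) {
    return seconds * clock_hz / static_cast<double>(elements);
}

// Stand-in kernel: out[i] = sqrt(in[i]) over the whole array.
void sqrt_kernel(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = std::sqrt(in[i]);
}

// Time `repeats` full passes of the kernel. The timed region contains the
// kernel's own loop, so loop overhead is included in the result.
template <typename Kernel>
double measure_cpe(Kernel kernel, const std::vector<float>& in,
                   std::vector<float>& out, int repeats, double clock_hz) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) kernel(in.data(), out.data(), in.size());
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    return cycles_per_element(seconds, clock_hz,
                              in.size() * static_cast<std::size_t>(repeats));
}
```

A call such as `measure_cpe(sqrt_kernel, random_inputs(4096, 0.5f, 2.0f, 1234u), out, 1000, 2.0e9)` would time 1000 passes over 4096 elements. Repeating the pass many times keeps the timer's resolution from dominating the result.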
All SpiderMind Library Functions are guaranteed to be within 1 bit of precision for their working range with the following two exceptions:
The table includes CPE counts for my algorithms compiled for IA32 and x64, and Intel's advertised high-accuracy (HA) and low-accuracy (LA) numbers (Intel does not advertise which ISA they are using). It also contains a performance ratio between the SpiderMind x64 implementations and the Intel high-accuracy (HA) numbers. All of these results are for (one core of) a Core2 micro-architecture. SpiderMind leaves out certain functions because they are used infrequently, are easy composites of other functions, or are only 1-3 instructions long and not interesting.
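As a worked example of the performance ratio, the figures in the table are consistent with dividing the Intel HA CPE by the SpiderMind x64 CPE and rounding to two decimals. The sketch below reproduces that arithmetic; the `Row` type and the `speedup` and `round2` helpers are hypothetical names introduced only for illustration.

```cpp
#include <cmath>

// One table row: function name plus the two CPE figures that feed the ratio.
struct Row {
    const char* name;
    double sml_x64;   // SpiderMind x64 cycles per element
    double intel_ha;  // Intel high-accuracy cycles per element
};

// Ratio above 1.0 means the SpiderMind x64 version needs fewer cycles.
double speedup(const Row& row) {
    return row.intel_ha / row.sml_x64;
}

// Round to two decimal places, matching the precision used in the table.
double round2(double value) {
    return std::round(value * 100.0) / 100.0;
}
```

For instance, Inv gives 4.38 / 1.14, which rounds to 3.84, and Asin gives 10.7 / 10.79, which rounds to 0.99: the one row where Intel HA is marginally faster.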
Intel's implementations are usually several times faster (per element) than the x87 hardware implementations and about twice as fast as other vector implementations such as id Software's SSE library (idlib). The SpiderMind Library VML is approximately 2x faster than Intel's VML. It is estimated to execute a MUL on about 55% of machine cycles.
The SpiderMind VML uses the SpiderMind SSE wrapper classes, with no direct use of intrinsics and NO ASSEMBLY! It also uses no software lookup tables and eight or fewer constants per function, so it does not impact the cache or require any memory reads per element. (For some functions, x64 mode is required to prevent register spilling.) The high performance of this library is derived from very detailed knowledge of the underlying micro-architecture, careful implementation, and the use of minimax polynomials, an error-reduction technique for Taylor-series-style polynomial approximations.
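To illustrate why spreading the error across the interval helps, the sketch below compares a degree-4 Taylor polynomial for exp(x) on [-1, 1] with a degree-4 polynomial obtained by Chebyshev interpolation. Chebyshev interpolation is used here as a simple near-minimax stand-in; it is not a true minimax (Remez) fit. The coefficients, the choice of function and interval, and all names are assumptions made for this example and are not taken from the SpiderMind library.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

constexpr int kTerms = 5;  // degree-4 polynomial, so five constants

// Degree-4 Taylor polynomial of exp(x) about 0, evaluated in Horner form.
double exp_taylor(double x) {
    return 1.0 + x * (1.0 + x * (0.5 + x * (1.0 / 6.0 + x * (1.0 / 24.0))));
}

// Chebyshev coefficients of the polynomial that interpolates exp at the
// five Chebyshev nodes cos(pi * (k + 0.5) / 5) on [-1, 1].
std::array<double, kTerms> chebyshev_coefficients() {
    const double pi = std::acos(-1.0);
    std::array<double, kTerms> c{};
    for (int j = 0; j < kTerms; ++j) {
        double sum = 0.0;
        for (int k = 0; k < kTerms; ++k) {
            const double theta = pi * (k + 0.5) / kTerms;
            sum += std::exp(std::cos(theta)) * std::cos(j * theta);
        }
        c[j] = 2.0 * sum / kTerms;
    }
    return c;
}

// Evaluate c0/2 + sum_{j>=1} c_j * T_j(x) using the recurrence
// T_{j+1}(x) = 2 x T_j(x) - T_{j-1}(x).
double exp_chebyshev(const std::array<double, kTerms>& c, double x) {
    double t_prev = 1.0;
    double t_cur = x;
    double result = 0.5 * c[0] + c[1] * x;
    for (int j = 2; j < kTerms; ++j) {
        const double t_next = 2.0 * x * t_cur - t_prev;
        result += c[j] * t_next;
        t_prev = t_cur;
        t_cur = t_next;
    }
    return result;
}

// Largest absolute error against std::exp on a uniform grid over [-1, 1].
template <typename Approx>
double max_abs_error(Approx approx) {
    double worst = 0.0;
    for (int i = 0; i <= 2000; ++i) {
        const double x = -1.0 + i / 1000.0;
        worst = std::max(worst, std::fabs(approx(x) - std::exp(x)));
    }
    return worst;
}
```

Both polynomials use the same number of constants and the same number of multiplies once expanded, but the Taylor error is concentrated at the ends of the interval while the Chebyshev-based fit distributes it, which lowers the worst case. A true minimax fit would lower it slightly further.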
| Function | SML x64 | SML IA32 | Intel HA | Intel LA | Speedup (Intel HA / SML x64) | Function |
| Inv | 1.14 | 1.08 | 4.38 | 1.98 | 3.84 | Inv |
| Div | 1.21 | 1.11 | 4.34 | 2.44 | 3.58 | Div |
| Sqrt | 1.46 | 1.4 | 5.49 | 2.5 | 3.76 | Sqrt |
| InvSqrt | 1.33 | 1.28 | 3.86 | 2.19 | 2.9 | InvSqrt |
| Cbrt | | | 10.66 | 7.28 | | Cbrt |
| InvCbrt | | | 14.1 | 9.7 | | InvCbrt |
| Pow | 11.47 | 11.72 | 24.93 | 24.93 | 2.17 | Pow |
| Powx | 11.47 | 11.72 | 23.92 | 23.86 | 2.08 | Powx |
| Exp | 5.06 | 4.96 | 6.88 | 6.62 | 1.35 | Exp |
| Ln | 5.64 | 5.84 | 11.26 | 9.56 | 1.99 | Ln |
| Log10 | 6 | 6.43 | 10.71 | 9.5 | 1.78 | Log10 |
| Cos | 4.27 | 4.17 | 9.23 | 7.2 | 2.16 | Cos |
| Sin | 4.08 | 3.98 | 9.58 | 6.29 | 2.34 | Sin |
| SinCos | | | 18.77 | 10.16 | | SinCos |
| Tan | 6.3 | 6.2 | 19.58 | 10.26 | 3.1 | Tan |
| Acos | 11 | 10.93 | 15.74 | 8.94 | 1.43 | Acos |
| Asin | 10.79 | 10.8 | 10.7 | 8.57 | 0.99 | Asin |
| Atan | 10.62 | 10.99 | 15.63 | 7.95 | 1.47 | Atan |
| Atan2 | 10.62 | 10.99 | 36.36 | 14.35 | 3.42 | Atan2 |
| Cosh | 6.08 | 6.39 | 12.28 | 9.48 | 2.01 | Cosh |
| Sinh | 6.2 | 6.38 | 18.04 | 9.6 | 2.9 | Sinh |
| Tanh | 6.9 | 7 | 16.91 | 12.96 | 2.45 | Tanh |
| Acosh | 7.91 | 8.02 | 13.93 | 13.93 | 1.76 | Acosh |
| Asinh | 7.88 | 8.05 | 14.47 | 13.82 | 1.83 | Asinh |
| Atanh | 7.5 | 8.02 | 20.14 | 14.94 | 2.68 | Atanh |
| Erf | | | 16.85 | 13.1 | | Erf |
| Erfc | | | 29.14 | 22.8 | | Erfc |
| ErfInv | | | 24.01 | 14.48 | | ErfInv |
| Hypot | 2.29 | 2.11 | 12.01 | 12.01 | 5.24 | Hypot |
| Floor | | | 2.27 | 2.27 | | Floor |
| Ceil | | | 2.33 | 2.33 | | Ceil |
| Trunc | | | 2.26 | 2.26 | | Trunc |
| Round | | | 2.89 | 2.89 | | Round |
| Rint | | | 2.24 | 2.24 | | Rint |
| NearbyInt | | | 2.22 | 2.22 | | NearbyInt |
| Modf | 1.6 | 1.51 | 2.79 | 2.79 | 1.74 | Modf |
Intel performance results were obtained from:
http://www.intel.com/software/products/mkl/data/vml/functions/_performanceall.html