## Unit3.5.4Practical implementation of Strassen's algorithm

Achieving a high performing, practical implementation of Strassen's algorithm, which was discussed in Unit 2.5.2, is nontrivial:

• As the number of levels of Strassen increases, the extra overhead associated with the additions becomes nontrivial.

• Now that you understand the cost of moving data between memory layers, you can also appreciate that there is cost related to moving data associated with the extra additions, if one is not careful.

• Accommodating matrices with sizes that are not a power of two (or a multiple of a power of two) becomes a can of worms.

Imagine trying to deal with the complexity of this recursive algorithm as you try to apply what you have learned so far in the course...

Until recently, the approach was to only perform a few levels of Strassen's algorithm, and to call a high-performance conventional matrix-matrix multiplication for the multiplications at the leaves of the recursion.

• If one performs one level of Strassen, this would reduce the number of flops required by $1 - 7/8 = 1/8 = 0.125 \text{.}$

• If one performs two levels of Strassen, this would reduce the number of flops required by $1 - (7/8)^2 \approx 0.234 \text{.}$

This yields some potential for improved performance. It comes at the expense of additional workspace due to the many intermediate matrices that need to be computed, which reduces the largest problem that can fit in memory.

In the recent paper

• [14] Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn, Strassen's Algorithm Reloaded, International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), 2016.

it was shown that the additions and subtractions that appear in Unit 2.5.2 can be incorporated into the packing that is inherent in the optimizations of matrix-matrix multiplication that you mastered so far in the course! This observation makes Strassen's algorithm much more practical than previously thought. Those interested in the details can read that paper.

If that paper interests you, you may also want to read

• [12] Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn, Generating Families of Practical Fast Matrix Multiplication Algorithms, in Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS17), Orlando, FL, May 29-June 2, 2017.

which discusses a family of Strassen-like algorithms.

The code discussed in that paper can be found at https://github.com/flame/fmm-gen.