Those who delve deeper into how to achieve high performance for matrix-matrix multiplication find that it is specifically the rank-k update, the case of $C := \alpha A B +\beta C$ in which the inner dimension $k$ is small, that achieves high performance. The blocked LU factorization that we discussed in Unit 5.5.2 takes advantage of this by casting most of its computation in terms of the matrix-matrix multiplication $A_{22} := A_{22} - A_{21} A_{12} \text{,}$ a rank-k update in which $k$ equals the block size. The question then becomes: how do I find blocked algorithms that cast most of their computation in terms of rank-k updates?
and various other publications that can be found on the FLAME project publication web site http://www.cs.utexas.edu/~flame/web/FLAMEPublications.html.
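To make the role of the rank-k update concrete, here is a minimal sketch of a right-looking blocked LU factorization in plain Python. It is an illustration under stated assumptions, not the implementation from Unit 5.5.2: it performs no pivoting (so it assumes every pivot is nonzero), it overwrites $A$ with the unit lower triangular factor below the diagonal and the upper triangular factor on and above it, and the names `unblocked_lu`, `blocked_lu`, and `reconstruct` are hypothetical helpers introduced only for this example. The final loop nest in each iteration is the update $A_{22} := A_{22} - A_{21} A_{12}$ with $k$ equal to the block size `b`.

```python
def unblocked_lu(A, lo, hi):
    """In-place LU without pivoting of the diagonal block A[lo:hi, lo:hi]."""
    for j in range(lo, hi):
        for i in range(j + 1, hi):
            A[i][j] /= A[j][j]                    # multiplier l_ij
            for c in range(j + 1, hi):
                A[i][c] -= A[i][j] * A[j][c]


def blocked_lu(A, b):
    """Right-looking blocked LU (no pivoting) with block size b, in place."""
    n = len(A)
    for lo in range(0, n, b):
        hi = min(lo + b, n)

        # A11 := L11 U11 (small unblocked factorization)
        unblocked_lu(A, lo, hi)

        # A12 := L11^{-1} A12 (unit lower triangular solve, column by column)
        for c in range(hi, n):
            for i in range(lo, hi):
                for p in range(lo, i):
                    A[i][c] -= A[i][p] * A[p][c]

        # A21 := A21 U11^{-1} (upper triangular solve, row by row)
        for r in range(hi, n):
            for j in range(lo, hi):
                for p in range(lo, j):
                    A[r][j] -= A[r][p] * A[p][j]
                A[r][j] /= A[j][j]

        # A22 := A22 - A21 A12: the rank-k update, with k = hi - lo
        for r in range(hi, n):
            for c in range(hi, n):
                s = 0.0
                for p in range(lo, hi):
                    s += A[r][p] * A[p][c]
                A[r][c] -= s
    return A


def reconstruct(LU):
    """Multiply the packed factors back together: returns L times U."""
    n = len(LU)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else LU[i][p]  # unit diagonal of L
                out[i][j] += l_ip * LU[p][j]
    return out
```

The explicit loops are there to expose the structure, not to be fast. In a high-performance setting the last loop nest would be a single call to an optimized matrix-matrix multiplication, and since it touches the entire trailing matrix in every iteration, it accounts for most of the floating point operations when `b` is small relative to `n`.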