## Unit 2.5.2 Strassen's algorithm

So far, we have discussed algorithms for computing $C := A B + C$ via a triple-nested loop, which performs $2 m n k$ flops. A natural question is whether this is the best we can do.

A classic result regarding this is Strassen's algorithm [26]. It shows that, for problems of size $m = n = k = 2^r$ for some integer $r \text{,}$ matrix-matrix multiplication can be computed in time $O( n^{\log_2 7} ) \approx O( n^{2.807}) \text{.}$ Since Strassen proposed his algorithm, a succession of results has further improved upon the exponent. In this discussion, we stick to the original result.

How can this be? For simplicity, assume $m = n = k$ are all even. Partition

\begin{equation*} C = \left( \begin{array}{c c} C_{00} \amp C_{01} \\ C_{10} \amp C_{11} \end{array} \right) , A = \left( \begin{array}{c c} A_{00} \amp A_{01} \\ A_{10} \amp A_{11} \end{array} \right) , B = \left( \begin{array}{c c} B_{00} \amp B_{01} \\ B_{10} \amp B_{11} \end{array} \right), \end{equation*}

where $X_{ij}$ are $n/2 \times n/2$ for $X \in \{ C, A, B \}$ and $ij \in \{ 00, 01, 10, 11 \} \text{.}$ Now that you understand how partitioned matrix-matrix multiplication works, you know that the following computations compute $C := A B + C \text{:}$

\begin{equation*} \begin{array}{l c l} C_{00} \amp:=\amp A_{00} B_{00} + A_{01} B_{10} + C_{00} \\ C_{01} \amp:=\amp A_{00} B_{01} + A_{01} B_{11} + C_{01} \\ C_{10} \amp:=\amp A_{10} B_{00} + A_{11} B_{10} + C_{10} \\ C_{11} \amp:=\amp A_{10} B_{01} + A_{11} B_{11} + C_{11} . \end{array} \end{equation*}
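The partitioned computation can be sketched in code as follows. This is a minimal pure-Python illustration, not part of the text; the helper names (`madd`, `mmul`, `split`, `join`, `blocked_gemm`) are invented here, matrices are lists of lists, and the dimension is assumed even:

```python
# One level of 2x2 blocked C := A B + C via eight quadrant products.

def madd(X, Y):
    """Entrywise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mmul(X, Y):
    """Classical matrix-matrix product via the triple-nested loop."""
    n = len(X)
    return [[sum(X[i][p] * Y[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]

def split(X):
    """Partition X into four n/2 x n/2 quadrants X00, X01, X10, X11."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def join(X00, X01, X10, X11):
    """Reassemble four quadrants into one matrix."""
    top = [r0 + r1 for r0, r1 in zip(X00, X01)]
    bot = [r0 + r1 for r0, r1 in zip(X10, X11)]
    return top + bot

def blocked_gemm(A, B, C):
    """Compute C := A B + C one quadrant at a time (eight products)."""
    A00, A01, A10, A11 = split(A)
    B00, B01, B10, B11 = split(B)
    C00, C01, C10, C11 = split(C)
    C00 = madd(madd(mmul(A00, B00), mmul(A01, B10)), C00)
    C01 = madd(madd(mmul(A00, B01), mmul(A01, B11)), C01)
    C10 = madd(madd(mmul(A10, B00), mmul(A11, B10)), C10)
    C11 = madd(madd(mmul(A10, B01), mmul(A11, B11)), C11)
    return join(C00, C01, C10, C11)
```

Since the quadrant updates are just the scalar algorithm applied to blocks, the result agrees exactly (in exact arithmetic) with the unblocked triple-nested loop.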

Each of the eight matrix-matrix multiplications requires $2 (n/2)^3 = \frac{1}{4} n^3$ flops, and hence the total cost is our usual $2 n^3$ flops.

Surprisingly, the following computations also compute $C := A B + C \text{:}$

\begin{equation*} \begin{array}{l @{\hspace{1pt}} c @{\hspace{1pt}} l l} M_0 \amp=\amp ( A_{00} + A_{11} ) ( B_{00} + B_{11} ); \amp C_{00} +\!\!= M_0; \; C_{11} +\!\!= M_0; \\ M_1 \amp=\amp ( A_{10} + A_{11} ) B_{00}; \amp C_{10} +\!\!= M_1; \; C_{11} -\!\!= M_1; \\ M_2 \amp=\amp A_{00} ( B_{01} - B_{11} ); \amp C_{01} +\!\!= M_2; \; C_{11} +\!\!= M_2; \\ M_3 \amp=\amp A_{11} ( B_{10} - B_{00} ); \amp C_{00} +\!\!= M_3; \; C_{10} +\!\!= M_3; \\ M_4 \amp=\amp ( A_{00} + A_{01} ) B_{11}; \amp C_{01} +\!\!= M_4; \; C_{00} -\!\!= M_4; \\ M_5 \amp=\amp ( A_{10} - A_{00} ) ( B_{00} + B_{01} ); \amp C_{11} +\!\!= M_5; \\ M_6 \amp=\amp ( A_{01} - A_{11} ) ( B_{10} + B_{11} ); \amp C_{00} +\!\!= M_6 . \end{array} \end{equation*}

If you count carefully, this requires 22 additions of $n/2 \times n/2$ matrices and 7 multiplications of $n/2 \times n/2$ matrices. Adding two matrices requires $O( n^2 )$ flops, which is insignificant when $n$ is large. Ignoring the additions, the cost now becomes
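One level of the scheme above can be sketched in code. This is again a hypothetical pure-Python illustration (the names `strassen_level`, `madd`, `msub`, `mmul`, `split` are invented here): each $M_i$ is formed with the classical algorithm, and each $C_{ij}$ quadrant collects its updates.

```python
# One level of Strassen's algorithm for C := A B + C.
# Seven products M_0, ..., M_6, computed classically at this level.

def madd(X, Y):
    """Entrywise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def msub(X, Y):
    """Entrywise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mmul(X, Y):
    """Classical matrix-matrix product via the triple-nested loop."""
    n = len(X)
    return [[sum(X[i][p] * Y[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]

def split(X):
    """Partition X into four n/2 x n/2 quadrants."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen_level(A, B, C):
    """Compute C := A B + C with seven half-size multiplications."""
    A00, A01, A10, A11 = split(A)
    B00, B01, B10, B11 = split(B)
    C00, C01, C10, C11 = split(C)
    M0 = mmul(madd(A00, A11), madd(B00, B11))
    M1 = mmul(madd(A10, A11), B00)
    M2 = mmul(A00, msub(B01, B11))
    M3 = mmul(A11, msub(B10, B00))
    M4 = mmul(madd(A00, A01), B11)
    M5 = mmul(msub(A10, A00), madd(B00, B01))
    M6 = mmul(msub(A01, A11), madd(B10, B11))
    # Quadrant updates: C00 += M0 + M3 - M4 + M6, etc.
    C00 = madd(C00, madd(msub(madd(M0, M3), M4), M6))
    C01 = madd(C01, madd(M2, M4))
    C10 = madd(C10, madd(M1, M3))
    C11 = madd(C11, madd(msub(madd(M0, M2), M1), M5))
    top = [r0 + r1 for r0, r1 in zip(C00, C01)]
    bot = [r0 + r1 for r0, r1 in zip(C10, C11)]
    return top + bot
```

Making `strassen_level` call itself (instead of `mmul`) for the seven products, down to some base case, yields the recursive algorithm discussed next.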

\begin{equation*} 7 \times 2 ( n/2 )^3 = 2 \left( \frac{7}{8} \right) n^3 \end{equation*}

flops. The cost of the matrix-matrix multiplication is now $7/8$ of what it was before!

But it gets better! Each of the matrix-matrix multiplications can themselves be computed via this scheme. If you do that, applying the idea at two levels, the cost is reduced to

\begin{equation*} 7 \times ( 7 \times 2 (n/4)^3 ) = 2 \left( \frac{7}{8} \right)^2 n^3 \end{equation*}

flops. How many times can we do this? If $n = 2^r \text{,}$ we can halve the size of the matrices $r = \log_2( n )$ times. If we do so, the cost becomes

\begin{equation*} 2 \left( \frac{7}{8} \right)^r n^3 {\rm ~flops}. \end{equation*}
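This flop count can be checked with a short recurrence (a sketch, not from the text; `strassen_flops` is a name made up here). Cross-multiplying by $8^r$ keeps the comparison in exact integer arithmetic:

```python
# Flop count of fully recursive Strassen for C := A B + C on n = 2^r,
# ignoring the O(n^2) matrix additions, checked against 2 (7/8)^r n^3.

def strassen_flops(n):
    """Multiplication flops when recursing all the way to 1x1 blocks."""
    if n == 1:
        return 2  # a 1x1 "product plus update" is one multiply, one add
    return 7 * strassen_flops(n // 2)

r = 10
n = 2 ** r
# strassen_flops(n) = 2 (7/8)^r n^3, verified without floating point:
assert strassen_flops(n) * 8 ** r == 2 * 7 ** r * n ** 3
```

Each halving trades a factor of $8$ (eight classical quadrant products) for a factor of $7$, which is exactly where the $(7/8)^r$ comes from.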

Now,

\begin{equation*} \left( {7}/{8} \right)^{\log_2(n)} 2 n^3 = n^{\log_2(7/8)} \, 2 n^3 = n^{\log_2 7 - 3} \, 2 n^3 = 2 n^{\log_2 7} \approx 2 n^{2.807} . \end{equation*}

Here we ignored the cost of the additions. However, a more careful analysis shows that the recursive approach, additions included, still requires only $O( n^{2.807} )$ flops.