Algorithm used for PLA_Gemm_C: For the case where C <- alpha * A * B + beta * C is to be computed, we partition A = ( A_0 ... A_(K-1) ) and B = / B_0 \ | : | \ B_(k-1) / and iterate updating C = alpha * A_j * B_j + C. In other words, the computation is set up as a sequence of rank-k updates, with k = nb_alg. More precisely, the algorithm is given by ****************************************************************** C <- beta * C Partition A = ( A_F || A_L ) and B = / B_F \ | === | \ B_L / where A_F is m x 0 and B_F is 0 x n while A_L is not m x 0 determine block size b Partition ( A_F || A_L ) = ( A_0 || A_1 | A_2 ) where A_0 = A_F and A_1 has width b and / B_F \ / B_0 \ | === | = | === | \ B_L / | B_1 | | --- | \ B_2 / where B_0 = B_F and B_1 has length b Update C <- alpha * A_1 * B_1 + C (rank-b update) Continue with ( A_F || A_L ) = ( A_0 | A_1 || A_2 ) and / B_F \ / B_0 \ | === | = | --- | \ B_L / | B_1 | | === | \ B_2 / endwhile ****************************************************************** Appropriate changes need to be made depending on transa and transb