Algorithm used for PLA_Gemm_A:

For the case where C <- alpha * A * B + beta * C is to be computed,
we partition C = ( C_0 ... C_(K-1) ) and B = ( B_0 ... B_(K-1) )
into K column panels, scale C <- beta * C once, and then iterate
computing C_j <- alpha * A * B_j + C_j.  In other words, the
computation is set up as a sequence of matrix-panel (of columns)
multiplies.  More precisely, the algorithm is given by

******************************************************************

C <- beta * C
Partition B = ( B_F || B_L ) and C = ( C_F || C_L )
    where B_F is k x 0 and C_F is m x 0
while C_L is not m x 0
    determine block size b
    Partition ( B_F || B_L ) = ( B_0 || B_1 | B_2 )
        where B_0 = B_F and B_1 has width b
    and ( C_F || C_L ) = ( C_0 || C_1 | C_2 )
        where C_0 = C_F and C_1 has width b
    Update C_1 <- alpha * A * B_1 + C_1      (matrix-panel mult.)
    Continue with
        ( B_F || B_L ) = ( B_0 | B_1 || B_2 ) and
        ( C_F || C_L ) = ( C_0 | C_1 || C_2 )
endwhile

******************************************************************

Appropriate changes need to be made depending on transa and transb.
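
The loop structure described above can be illustrated with a small serial sketch. It is not the PLAPACK implementation: the function name gemm_panel, the fixed block size b, and the use of plain nested lists are assumptions made for illustration only. It ignores data distribution and handles only the case where neither A nor B is transposed.

```python
def gemm_panel(alpha, A, B, beta, C, b=2):
    """Serial sketch of C <- alpha*A*B + beta*C, one column panel at a time.

    A is m x k, B is k x n, C is m x n, all stored as lists of rows.
    The name, the argument order and the fixed block size b are
    illustrative assumptions; C is updated in place and returned.
    """
    m = len(C)
    n = len(C[0]) if m else 0
    k = len(B)

    # C <- beta * C, done once before the panel loop.
    for i in range(m):
        for j in range(n):
            C[i][j] *= beta

    # 'done' is the width of C_F (and B_F); it starts at 0.
    done = 0
    while done < n:                   # while C_L is not m x 0
        width = min(b, n - done)      # block size, clipped at the last panel
        # Update C_1 <- alpha * A * B_1 + C_1 (matrix-panel multiply).
        for j in range(done, done + width):
            for i in range(m):
                s = sum(A[i][p] * B[p][j] for p in range(k))
                C[i][j] += alpha * s
        done += width                 # move C_1 and B_1 into C_F and B_F
    return C


A = [[1, 2], [3, 4]]
B = [[1, 0, 2], [0, 1, 3]]
C = [[1, 1, 1], [1, 1, 1]]
print(gemm_panel(2, A, B, 3, C, b=2))  # prints [[5, 7, 19], [9, 11, 39]]
```

With n = 3 and b = 2 the loop runs twice: first on a panel of width 2, then on a final panel of width 1. The result does not depend on the block size chosen.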