Algorithm used for PLA_Gemm_A: 

For the case where C <- alpha * A * B + beta * C is to be computed,
we partition C = ( C_0 ... C_(K-1) ) and B = ( B_0 ... B_(K-1) )
into panels of columns and, after scaling C by beta, iterate computing
C_j <- alpha * A * B_j + C_j for j = 0, ..., K-1.  In other words,
the computation is set up as a sequence of matrix-panel (of columns)
multiplies.

More precisely, the algorithm is given by

  ******************************************************************
      
      C <- beta * C
      Partition  B = ( B_F || B_L ) and C = ( C_F || C_L ) 
           where B_F is k x 0 and C_F is m x 0
      while C_L is not m x 0 
         determine block size b
         Partition ( B_F || B_L ) = ( B_0 || B_1 | B_2 ) 
                   where B_0 = B_F and B_1 has width b
              and  ( C_F || C_L ) = ( C_0 || C_1 | C_2 ) 
                   where C_0 = C_F and C_1 has width b
         Update C_1 <- alpha * A * B_1 + C_1    (matrix-panel mult.)
         Continue with
                   ( B_F || B_L ) = ( B_0 | B_1 || B_2 )
              and  ( C_F || C_L ) = ( C_0 | C_1 || C_2 )
      endwhile
              
  ******************************************************************

  The algorithm above covers the case where neither A nor B is
  transposed.  Appropriate changes need to be made depending on
  transa and transb.
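
  The loop above can be illustrated with a small sequential sketch in C.
  This is not the PLA_Gemm_A implementation and uses no PLAPACK calls: it
  assumes plain column-major arrays with leading dimensions lda, ldb, ldc,
  a fixed maximum block size nb, and no transposition.  The names
  panel_update and gemm_by_panels are hypothetical and exist only for
  this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: C1 <- alpha * A * B1 + C1 (matrix-panel multiply).
   A is m x k, B1 is k x b, C1 is m x b, all stored column-major. */
static void panel_update(size_t m, size_t b, size_t k, double alpha,
                         const double *A, size_t lda,
                         const double *B1, size_t ldb,
                         double *C1, size_t ldc)
{
    for (size_t j = 0; j < b; j++)
        for (size_t p = 0; p < k; p++) {
            double t = alpha * B1[p + j * ldb];
            for (size_t i = 0; i < m; i++)
                C1[i + j * ldc] += t * A[i + p * lda];
        }
}

/* Hypothetical driver: C <- alpha * A * B + beta * C, computed as a
   sequence of matrix-panel multiplies over column panels of B and C. */
void gemm_by_panels(size_t m, size_t n, size_t k, double alpha,
                    const double *A, size_t lda,
                    const double *B, size_t ldb,
                    double beta, double *C, size_t ldc, size_t nb)
{
    if (nb == 0)
        nb = 1;                 /* assumption: guard against a zero block size */

    /* C <- beta * C (beta == 0 overwrites C instead of scaling it) */
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            C[i + j * ldc] = (beta == 0.0) ? 0.0 : beta * C[i + j * ldc];

    size_t done = 0;            /* width of B_F and C_F; initially 0 */
    while (done < n) {          /* while C_L is not m x 0 */
        /* determine block size b (the last panel may be narrower) */
        size_t b = (n - done < nb) ? (n - done) : nb;

        /* B_1 and C_1 are the next b columns of B_L and C_L */
        panel_update(m, b, k, alpha, A, lda,
                     B + done * ldb, ldb,
                     C + done * ldc, ldc);

        done += b;              /* move B_1, C_1 into B_F, C_F */
    }
}
```

  Scaling C by beta once, before the loop, is what allows each panel
  update to use the simpler form C_1 <- alpha * A * B_1 + C_1.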