Algorithm used for PLA_Gemm_B: 

For the case where C <- alpha * A * B + beta * C is to be computed,
we partition C = /   C_0   \ and A = /   A_0   \
                 |    :    |         |    :    |
                 \ C_(K-1) /         \ A_(K-1) /
and iterate, computing C_i <- alpha * A_i * B + beta * C_i.  In other
words, the computation is set up as a sequence of row panel-matrix
multiplies.

More precisely, the algorithm is given by

  ******************************************************************
      
      C <- beta * C
      Partition  A = / A_F \ and C = / C_F \
                     | === |         | === |
                     \ A_L /         \ C_L /
             where A_F is 0 x k and C_F is 0 x n
      while A_L is not 0 x k 
         determine block size b
         Partition 
              / A_F \    / A_0 \       / C_F \    / C_0 \
              | === | =  | === |  and  | === | =  | === |
              \ A_L /    | A_1 |       \ C_L /    | C_1 |
                         | --- |                  | --- | 
                         \ A_2 /                  \ C_2 /
                   where A_0 = A_F and A_1 has b rows
                         C_0 = C_F and C_1 has b rows
         Update C_1 <- alpha * A_1 * B + C_1    (panel-matrix mult.)
         Continue with
              / A_F \    / A_0 \       / C_F \    / C_0 \
              | === | =  | --- |  and  | === | =  | --- |
              \ A_L /    | A_1 |       \ C_L /    | C_1 |
                         | === |                  | === | 
                         \ A_2 /                  \ C_2 /
      endwhile
              
  ******************************************************************

The algorithm above assumes that neither A nor B is transposed.
Appropriate changes need to be made depending on transa and transb.
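
The loop above can be illustrated with a minimal sequential sketch in
Python.  This is not PLAPACK code: the function name gemm_row_panels and
the block_size parameter are hypothetical, matrices are plain lists of
rows, the distribution of data among processes is ignored, and only the
case without transposes is covered.  The row index `first` plays the role
of the boundary between A_F and A_L (and between C_F and C_L).

```python
def gemm_row_panels(alpha, A, B, beta, C, block_size=2):
    """Compute C <- alpha * A * B + beta * C one row panel at a time.

    A is m x k, B is k x n, C is m x n, each a non-empty list of rows.
    C is updated in place and also returned.  Illustrative only.
    """
    m, k, n = len(A), len(B), len(C[0])

    # C <- beta * C
    for row in C:
        for j in range(n):
            row[j] *= beta

    first = 0  # number of rows in A_F and C_F (0 rows at the start)
    while first < m:  # while A_L is not 0 x k
        # Determine block size b (the last panel may be shorter).
        b = min(block_size, m - first)

        # A_1 and C_1 are rows first .. first + b - 1.
        # Update C_1 <- alpha * A_1 * B + C_1 (panel-matrix multiply).
        for i in range(first, first + b):
            for j in range(n):
                acc = 0.0
                for p in range(k):
                    acc += A[i][p] * B[p][j]
                C[i][j] += alpha * acc

        # Continue with: A_1 and C_1 move into A_F and C_F.
        first += b
    return C


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
result = gemm_row_panels(2.0, A, B, 0.5, C, block_size=2)
print(result)  # [[2.5, 4.5], [6.5, 8.5], [10.5, 12.5]]
```

With three rows and a block size of 2, the loop runs twice: first on a
panel of two rows, then on the remaining single row.  The result does not
depend on the block size, which only controls how the rows are grouped.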