Algorithm used for PLA_Gemm_C: 

For the case where C <- alpha * A * B + beta * C is to be computed,
we partition A = ( A_0 ... A_(K-1) ) and B = /   B_0   \
                                             |    :    |
                                             \ B_(k-1) /
and iterate updating C = alpha * A_j * B_j + C.  In other words,
the computation is set up as a sequence of rank-k updates, with 
k = nb_alg.

More precisely, the algorithm is given by

  ******************************************************************
      
      C <- beta * C
      Partition  A = ( A_F || A_L ) and B = / B_F \
                                            | === |
                                            \ B_L /
     	     where A_F is m x 0 and B_F is 0 x n
      while A_L is not m x 0 
         determine block size b
         Partition ( A_F || A_L ) = ( A_0 || A_1 | A_2 ) 
                   where A_0 = A_F and A_1 has width b
              and  / B_F \    / B_0 \
                   | === | =  | === |
                   \ B_L /    | B_1 |
                              | --- | 
                              \ B_2 /
                   where B_0 = B_F and B_1 has length b
          Update C <- alpha * A_1 * B_1 + C    (rank-b update)
          Continue with
                   ( A_F || A_L ) = ( A_0 | A_1 || A_2 ) 
    
              and  / B_F \    / B_0 \
                   | === | =  | --- |
                   \ B_L /    | B_1 |
                              | === | 
                              \ B_2 /
      endwhile
              
  ******************************************************************

  Appropriate changes need to be made depending on transa and transb