Unit 4.4.4 Parallelizing multiple loops

In Section 4.3, we only considered parallelizing one loop at a time. When one has a processor with many cores, and hence must use many threads, it may become beneficial to parallelize multiple loops. The reason is that there is only so much parallelism to be had in any one of the \(m\), \(n\), or \(k\) sizes. For example, if only the second loop around the micro-kernel is parallelized, at most \(n / n_R\) iterations are available to be distributed, which may be fewer than the number of threads a many-core processor requires. At some point, computing with matrices that are small in some dimension starts affecting the ability to amortize the cost of moving data.

OpenMP allows for "nested parallelism." Thus, one possibility is to read up on how OpenMP lets one control the number of threads used in a parallel section, and to then use that mechanism to extract parallelism from multiple loops. Another possibility is to create a single parallel section (with #pragma omp parallel rather than #pragma omp parallel for) when one first enters the "five loops around the micro-kernel," and to then explicitly control which thread does what work at appropriate points in the nested loops. Both possibilities are sketched below.
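To make these two possibilities concrete, here are minimal sketches of each, under the hypothetical assumption that 60 threads are split as 4 ways of parallelism for the third loop around the micro-kernel (over blocks of \(m_C\) rows) and 15 ways for the second loop (over blocks of \(n_R\) columns). The blocking parameters MC and NR, the thread counts, and the routine names MyGemmNested and MyGemmExplicit are illustrative choices, not the implementation developed in this course. First, nested parallelism, where OpenMP creates a new team of threads for each inner parallel region:

    #include <omp.h>

    #define MC 96   /* hypothetical blocking parameters */
    #define NR  4

    /* Sketch of nested parallelism: 4 threads for the third loop
       (over i, in steps of MC) and, within each, 15 threads for the
       second loop (over j, in steps of NR), for 4 x 15 = 60 in total. */
    void MyGemmNested( int m, int n, int k, double *A, int ldA,
                       double *B, int ldB, double *C, int ldC )
    {
      omp_set_max_active_levels( 2 );  /* allow two levels of parallel regions */

      #pragma omp parallel for num_threads( 4 )      /* third loop */
      for ( int i = 0; i < m; i += MC ) {
        /* ... pack the current block of A ... */
        #pragma omp parallel for num_threads( 15 )   /* second loop */
        for ( int j = 0; j < n; j += NR ) {
          /* ... pack the micro-panel of B if needed; call the loop
                 around the micro-kernel for this partition of C ... */
        }
      }
    }

Second, a single parallel section around the loops, in which each thread derives, from its thread number, which iterations of the two loops it owns:

    /* Sketch of explicit control: one parallel section, 60 threads viewed
       as a 4 x 15 grid.  The 4 "teams" split the third loop round-robin;
       the 15 threads within a team split the second loop round-robin. */
    void MyGemmExplicit( int m, int n, int k, double *A, int ldA,
                         double *B, int ldB, double *C, int ldC )
    {
      #pragma omp parallel num_threads( 60 )
      {
        int tid  = omp_get_thread_num();
        int itid = tid / 15;   /* team for the third loop:  0, ..., 3  */
        int jtid = tid % 15;   /* slot within the team:     0, ..., 14 */

        for ( int i = itid * MC; i < m; i += 4 * MC ) {     /* third loop */
          /* ... the 15 threads of a team cooperatively pack the block of A,
                 synchronizing before any of them reads the packed buffer ... */
          for ( int j = jtid * NR; j < n; j += 15 * NR ) {  /* second loop */
            /* ... call the loop around the micro-kernel for this partition ... */
          }
        }
      }
    }

Note that in the second sketch some synchronization (among only the threads that share a packed block, not all 60) is needed after cooperative packing; such details are exactly what one gains explicit control over, and avoiding the repeated creation of nested thread teams can reduce overhead.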

In [25]

  • Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee, Anatomy of High-Performance Many-Threaded Matrix Multiplication, in the proceedings of the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014), 2014.

the parallelization of various loops is discussed, as is the need to parallelize multiple loops when targeting "many-core" architectures like the Intel Xeon Phi (KNC) and IBM PowerPC A2 processors. The Intel Xeon Phi has 60 cores, each of which has to execute four "hyperthreads" in order to achieve the best performance. Thus, the implementation has to exploit 240 threads...