In Section 4.3, we considered parallelizing only one loop at a time. On a processor with many cores, and hence many threads, it can become beneficial to parallelize multiple loops. The reason is that there is only so much parallelism to be had in any one of the $m \text{,}$ $n\text{,}$ or $k$ sizes: splitting a single loop over many threads leaves each thread with matrices that are small in that dimension. At some point, computing with matrices that are small in some dimension starts to hamper the ability to amortize the cost of moving data.