
Unit 2.6.2 Summary

Subsubsection The week in pictures

Figure 2.6.1. A simple model of the memory hierarchy, with registers and main memory.
Figure 2.6.2. A simple blocking for registers, where micro-tiles of \(C\) are loaded into registers.
Figure 2.6.3. The update of a micro-tile with a sequence of rank-1 updates.
Figure 2.6.4. Mapping the micro-kernel to registers.
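The rank-1 update view named in Figure 2.6.3 can be written out as follows. This is a sketch in assumed notation that does not appear in the captions: the micro-tile \(C\) is taken to be \(m_R \times n_R\), \(a_p\) denotes column \(p\) of the \(m_R \times k\) block of \(A\), and \(b_p^T\) denotes row \(p\) of the \(k \times n_R\) block of \(B\).

```latex
\[
C := A B + C
  = \begin{pmatrix} a_0 & a_1 & \cdots & a_{k-1} \end{pmatrix}
    \begin{pmatrix} b_0^T \\ b_1^T \\ \vdots \\ b_{k-1}^T \end{pmatrix} + C
  = a_0 b_0^T + a_1 b_1^T + \cdots + a_{k-1} b_{k-1}^T + C .
\]
```

Each term \(a_p b_p^T\) is one rank-1 update, so the micro-tile can stay in registers while \(p\) runs from \(0\) to \(k-1\).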

Subsubsection Useful intrinsic functions

From Intel's Intrinsics Reference Guide:

  • __m256d _mm256_loadu_pd (double const * mem_addr)

    Load 256 bits (4 packed double-precision (64-bit) floating-point elements) from memory into dst, the returned vector. mem_addr does not need to be aligned on any particular boundary.

  • __m256d _mm256_broadcast_sd (double const * mem_addr)

    Broadcast a single double-precision (64-bit) floating-point element from memory to all four elements of dst, the returned vector.

  • __m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c)

    Multiply the packed double-precision (64-bit) floating-point elements in a and b, add the intermediate result to the packed elements in c, and return the result in dst. Elementwise, dst = a * b + c, computed as a fused multiply-add with a single rounding.
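As an illustration of how these three intrinsics might combine, here is a minimal sketch of a 4 x 4 micro-kernel that updates a micro-tile of \(C\) with a sequence of rank-1 updates. Several things are assumptions rather than part of the summary above: the function name `microkernel_4x4`, the column-major storage with leading dimensions `ldA`, `ldB`, `ldC`, the fixed 4 x 4 tile size, and the use of `_mm256_storeu_pd` (a real intrinsic, but not in the list above) to write the result back. The scalar fallback is only there so the sketch still compiles where AVX2 and FMA are not enabled.

```c
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#define USE_AVX2_FMA 1
#else
#define USE_AVX2_FMA 0
#endif

/* Hypothetical 4 x 4 micro-kernel (name and interface are assumptions):
 * computes C := A B + C, where A is 4 x k, B is k x 4, C is 4 x 4,
 * all stored in column-major order with leading dimensions ldA, ldB, ldC. */
void microkernel_4x4(int k, const double *A, int ldA,
                     const double *B, int ldB,
                     double *C, int ldC)
{
#if USE_AVX2_FMA
    /* Load the four columns of the micro-tile of C into vector registers. */
    __m256d c0 = _mm256_loadu_pd(&C[0 * ldC]);
    __m256d c1 = _mm256_loadu_pd(&C[1 * ldC]);
    __m256d c2 = _mm256_loadu_pd(&C[2 * ldC]);
    __m256d c3 = _mm256_loadu_pd(&C[3 * ldC]);

    for (int p = 0; p < k; p++) {
        /* Column p of A: four consecutive doubles. */
        __m256d a_p = _mm256_loadu_pd(&A[p * ldA]);

        /* Rank-1 update: column j of C += a_p * B(p, j).
         * Each B(p, j) is broadcast to all four lanes first. */
        c0 = _mm256_fmadd_pd(a_p, _mm256_broadcast_sd(&B[p + 0 * ldB]), c0);
        c1 = _mm256_fmadd_pd(a_p, _mm256_broadcast_sd(&B[p + 1 * ldB]), c1);
        c2 = _mm256_fmadd_pd(a_p, _mm256_broadcast_sd(&B[p + 2 * ldB]), c2);
        c3 = _mm256_fmadd_pd(a_p, _mm256_broadcast_sd(&B[p + 3 * ldB]), c3);
    }

    /* Write the updated micro-tile back to memory. */
    _mm256_storeu_pd(&C[0 * ldC], c0);
    _mm256_storeu_pd(&C[1 * ldC], c1);
    _mm256_storeu_pd(&C[2 * ldC], c2);
    _mm256_storeu_pd(&C[3 * ldC], c3);
#else
    /* Scalar fallback with the same loop structure: for each p,
     * apply the rank-1 update C += a_p * b_p^T. */
    for (int p = 0; p < k; p++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                C[i + j * ldC] += A[i + p * ldA] * B[p + j * ldB];
#endif
}
```

With GCC or Clang, the vector path is typically enabled by compiling with `-mavx2 -mfma` (or `-march=native` on a machine that supports both). Keeping the four columns of \(C\) in registers for the whole loop over `p` is the point of the design: each pass loads one column of \(A\) and four elements of \(B\), while \(C\) is read and written only once.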