## Unit 3.4.1 Alignment (easy and worthwhile)

The vector intrinsic routines that load and/or broadcast vector registers are faster when the data being loaded is aligned in memory.

Conveniently, loads of elements of $A$ and $B$ are from buffers into which the data was packed. By allocating those buffers so that they are aligned, we can ensure that all these loads are aligned. Intel's intrinsic library has special memory allocation and deallocation routines specifically for this purpose: `_mm_malloc` and `_mm_free`. (The `align` argument of `_mm_malloc` should be chosen to equal the length of a cache line, in bytes: 64.)
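As a minimal sketch of how such buffers might be created, assuming an x86 compiler that provides `_mm_malloc` and `_mm_free` through `immintrin.h`: the helper names (`alloc_packed_buffer`, `free_packed_buffer`, `is_cache_line_aligned`, `demo_aligned_buffers`), the buffer names `Atilde` and `Btilde`, and the block sizes `mc`, `kc`, `nc` are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <immintrin.h>  /* declares _mm_malloc and _mm_free */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64  /* cache line length assumed in the text */

/* Allocate room for n doubles starting on a cache-line boundary.
   Returns NULL if the allocation fails. */
static double *alloc_packed_buffer(size_t n)
{
    return (double *)_mm_malloc(n * sizeof(double), CACHE_LINE_BYTES);
}

/* Memory obtained from _mm_malloc must be released with _mm_free,
   not with free. */
static void free_packed_buffer(double *buf)
{
    if (buf != NULL)
        _mm_free(buf);
}

/* Nonzero if the address is a multiple of the cache line length. */
static int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_BYTES) == 0;
}

/* Hypothetical driver: create aligned packing buffers for an mc x kc
   block of A and a kc x nc panel of B, then release them.
   Returns 1 on success and 0 if an allocation failed. */
static int demo_aligned_buffers(size_t mc, size_t kc, size_t nc)
{
    double *Atilde = alloc_packed_buffer(mc * kc);
    double *Btilde = alloc_packed_buffer(kc * nc);
    int ok = (Atilde != NULL && Btilde != NULL);

    if (ok) {
        assert(is_cache_line_aligned(Atilde));
        assert(is_cache_line_aligned(Btilde));
        /* ... pack A and B into the buffers and compute here ... */
    }

    free_packed_buffer(Atilde);
    free_packed_buffer(Btilde);
    return ok;
}
```

Only the allocation of the packing buffers changes; the packing routines themselves are untouched. Note that an aligned buffer start makes each individual vector load aligned only if every load offset within the buffer is itself a multiple of the vector length in bytes.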