

# VECTORIZATION CASE STUDY: INTEL® ADVISOR – VECTORIZATION ADVISOR

Mike Voss, Principal Engineer, Intel

Special thanks to Alex Shinsel (Consulting Engineer, Intel)

## SIMD => Single Instruction Multiple Data

#### VLP / Vectorization

- Scalar
  - one instruction produces one result

#### SIMD processing

- one instruction can produce multiple results (SIMD)
- e.g. vaddpd / vaddps (p => packed)
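As a rough illustration of the difference, the loop below adds two float arrays element by element. Every iteration is independent, so a compiler targeting AVX is free to emit packed adds (such as `vaddps`) that produce several results per instruction instead of one scalar add per element. The function name `add_arrays` is made up for this sketch, and whether the loop is actually vectorized depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration: element-wise c[i] = a[i] + b[i].
// The source is written with scalar semantics (one result per iteration);
// because iterations do not depend on each other, a vectorizing compiler
// may process several elements per instruction.
std::vector<float> add_arrays(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```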





### **Outline from Previous Lectures on Vectorization**

- What is vectorization and why is it important
- The different ways we can vectorize our code
- The two main challenges in vectorization
  - Determining that vectorization is legal (the results will be the same)
    - Dependence analysis
    - Obstacles to vectorization and how to deal with them
  - Optimizing performance
    - Memory issues (alignment, layout)
    - Telling the compiler what you know (about your code & about your platform)
- Using compiler intrinsics
- Using OpenMP\* simd pragmas
- A case study



### In previous lectures on vectorization:





Optimization Notice Copyright © 2018, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.




### USING INTEL<sup>®</sup> ADVISOR TO ASSIST WITH VECTORIZATION

Based on a presentation by Alex Shinsel



# **VECTORIZATION ADVISOR & ROOFLINE**


## Vectorization Advisor Workflow

- **Survey** is the bread and butter of Vectorization Advisor! All else builds on it!
- **Trip Counts** adds onto Survey and enables the Roofline.
- **Dependencies** determines whether it's safe to force a scalar loop to vectorize.
- **Memory Access Patterns (MAP)** diagnoses vectorization inefficiency caused by poor memory striding.



#### What Am I Looking At?

*Figure: annotated screenshot of the Intel Advisor GUI. Labeled regions: Workflow, Filters, Toggle Buttons, Search, Smart Mode Loop Display (does not work for Threading Advisor), Report tabs, Primary Pane, Secondary Pane tabs, Secondary Pane.*

### Survey Vectorization Advisor

#### Function/Loop Icons

Each row's icon identifies it as one of:

- Scalar Function
- Vector Function
- Scalar Loop
- Vector Loop

#### Tip:

For vectorization, you generally only care about loops. Set the type dropdown to "Loops".

Vectorizing a loop is usually best done on innermost loops. Since it effectively divides duration by vector length, you want to target loops with high self time.

Efficiency is important! Efficiency = Gain / Vector Length, so 100% efficiency means the gain equals the vector length.

The black arrow is 1x. Gray means you got less than that. Gold means you got more. You want to get this value as high as possible!

| Function Call Sites and Loops | Vector Issues | Self Time | Total Time | Type | Why No Vectorization? | Vector ISA | Efficiency | Gain | VL |
|---|---|---|---|---|---|---|---|---|---|
| [loop in main at example.cpp:38] | 1 Assumed depend… | 0.391s | 0.391s | Scalar | vector depen… | | | | |
| [loop in main at example.cpp:64] | 1 Possible inefficien… | 0.297s | 0.297s | Vector | | AVX2 | 2% | 0.37x | 16 |
| [loop in main at example.cpp:51] | 1 Possible inefficien… | 0.094s | 0.094s | Vector | 1 vectorizatio… | AVX2 | 8% | 1.23x | 16 |
| [loop in main at example.cpp:26] | | 0.030s | 0.030s | Vector | | AVX2 | 100% | 7.98x | 8 |
| [loop in main at example.cpp:14] | Assumed depend… | 0.000s | 0.000s | Scalar | vector depen… | | | | |
| [loop in main at example.cpp:23] | | 0.000s | 0.030s | Scalar | inner loc… | | | | |
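The Efficiency, Gain, and VL columns in the table above are consistent with efficiency being the measured gain divided by the vector length. A minimal sketch of that arithmetic follows; the function name `vector_efficiency` is hypothetical and the relation is inferred from the table values, not taken from Advisor's documentation.

```cpp
#include <cassert>

// Hypothetical helper: vectorization efficiency as a fraction (1.0 == 100%),
// computed as measured gain divided by vector length.
double vector_efficiency(double gain, int vector_length) {
    assert(vector_length > 0);
    return gain / static_cast<double>(vector_length);
}
```

For the loop at example.cpp:26, `vector_efficiency(7.98, 8)` is 0.9975, which displays as 100%. For the loop at example.cpp:64, `vector_efficiency(0.37, 16)` is about 0.023, matching the reported 2%: this loop is vectorized but runs slower than its scalar version would.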

Expand a vectorized loop to see it split into body, peel, and remainder (if applicable).

Advisor *advises* you on potential vector issues. This is often your cue to run MAP or Dependencies. Click the icon to see an explanation in the bottom pane.

The Intel Compiler embeds extra information that Advisor can report in addition to its sampled data, such as why loops failed to vectorize.


### Let's look at an example...



## **Trip Counts**

- Trip Counts extends the Survey results. It must be run separately because it has higher overhead that would interfere with timing measurements.
- Vectorization is most effective on inner loops with high iteration counts.
  - It may be beneficial to swap small inner loops and larger outer loops.
  - For maximum performance, iteration counts that are a multiple of the vector length are ideal.
- Trip Counts is useful in diagnosing data alignment and padding problems in loops that traverse multidimensional arrays.
  - In such cases, the trip counts on peel and remainder loops may change as rows/columns push each other out of alignment.
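The effect of alignment and padding on peel and remainder trip counts can be sketched with a simplified model. Everything here is an assumption made for illustration: the names `LoopSplit`, `split_loop`, and `padded_length` are hypothetical, the array is assumed to start on an aligned boundary, and real compilers decide peeling with their own heuristics.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model (not Advisor's or any compiler's actual algorithm):
// split a loop over `n` contiguous elements, beginning at element index
// `start` of an array whose element 0 is aligned, for a vector length of
// `vl` elements.
struct LoopSplit {
    std::size_t peel;       // scalar iterations before the first aligned element
    std::size_t body;       // full vector iterations
    std::size_t remainder;  // scalar iterations left over at the end
};

LoopSplit split_loop(std::size_t start, std::size_t n, std::size_t vl) {
    assert(vl > 0);
    std::size_t peel = (vl - start % vl) % vl;  // distance to the next multiple of vl
    if (peel > n) peel = n;
    std::size_t rest = n - peel;
    return {peel, rest / vl, rest % vl};
}

// Round a row length up to the next multiple of vl (padding each row).
std::size_t padded_length(std::size_t cols, std::size_t vl) {
    assert(vl > 0);
    return (cols + vl - 1) / vl * vl;
}
```

With a vector length of 8 and hypothetical rows of 4381 elements stored back to back, row 0 starts at element 0 and splits into peel 0, body 547, remainder 5. Row 1 starts at element 4381, so it needs a peel of 3 and ends with a remainder of 2: each row pushes the next one out of alignment. Padding the row stride to `padded_length(4381, 8)`, which is 4384, puts every row back on an aligned start, and the split becomes the same for all rows.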

| Function Call Sites and Loops | Type | Average | Min | Max | Call Count | Iteratio… | Loop I… |
|---|---|---|---|---|---|---|---|
| [loop in main at dat… | Vectorized (Body; … | 4374; 5; 4 | 4374; 1; 1 | 4375; 7; 7 | 400000; 34… | < 0.001s | |
| [loop in main at… | Vectorized (Body) | 4374 | 4374 | 4375 | 400000 | < 0.001s | < 0.001s |
| [loop in main at… | Remainder | 5 | 1 | 7 | 348000 | < 0.001s | < 0.001s |
| [loop in main at… | Peeled | 4 | 1 | 7 | 352000 | < 0.001s | < 0.001s |

### ... and FLOPS Part of the Trip Counts Collection

- Trip Counts and FLOPS are the same collection type, but can be toggled independently using the checkboxes in the workflow or command line flags.
- FLOPS collects information about <u>Floating Point Operations</u>, or FLOPs. This is used with Survey data to calculate FLOPS, <u>Floating Point Operations Per Second</u>.
- It also collects some memory data, so it can calculate Arithmetic Intensity.
- Arithmetic Intensity is a measurement of FLOPs/Byte accessed. This is a trait of the algorithm of a function/loop itself.

| GFLOPS | AI | GFLOP | Memory, GB | Elapsed Time | Total El… |
|---|---|---|---|---|---|
| 3.917 | 0.179 | 53.120 | 296.831 | 13.562s | 13.562s |
| 1.756 | 0.045 | 13.280 | 296.831 | 7.563s | 7.563s |
| 7.249 | 0.134 | 53.120 | 395.775 | 7.328s | 7.328s |
| 19.999 | 0.179 | 53.120 | 296.831 | 2.656s | 2.656s |
| 7.264 | 0.045 | 13.280 | 296.831 | 1.828s | 1.828s |
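The columns in this table are related by simple ratios, which the sketch below reproduces. The function names are hypothetical; the inputs are the GFLOP, Memory, and Elapsed Time values from the first row.

```cpp
#include <cassert>

// Arithmetic intensity: floating-point operations per byte accessed.
// Passing GFLOP and GB works because the "giga" prefixes cancel.
double arithmetic_intensity(double gflop, double memory_gb) {
    assert(memory_gb > 0.0);
    return gflop / memory_gb;
}

// Achieved GFLOPS: floating-point operations divided by elapsed seconds.
double gflops(double gflop, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return gflop / elapsed_seconds;
}
```

For the first row, `arithmetic_intensity(53.120, 296.831)` is about 0.179 and `gflops(53.120, 13.562)` is about 3.917, matching the AI and GFLOPS columns. The fourth row has the same operation count and memory traffic, hence the same AI, but finishes in 2.656s and so reaches roughly 20 GFLOPS.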



### Let's look at our example again...



### What is a Roofline Chart?

A Roofline Chart plots application performance against hardware limitations.

- Where are the bottlenecks?
- How much performance is being left on the table?
- Which bottlenecks can be addressed, and which should be addressed?
- What's the most likely cause?
- What are the next steps?



- Roofline first proposed by University of California at Berkeley: <u>Roofline: An Insightful Visual Performance Model for Multicore Architectures</u>, 2009
- Cache-aware variant proposed by University of Lisbon: <u>Cache-Aware Roofline Model: Upgrading the Loft</u>, 2013

### **Roofline Metrics**

Roofline is based on Arithmetic Intensity (AI) and FLOPS.

- Arithmetic Intensity: FLOP / Byte Accessed
  - This is a characteristic of your algorithm



- FLOPS: <u>Fl</u>oating-Point <u>Op</u>erations / <u>S</u>econd
  - Is a measure of an implementation (it achieves a certain FLOPS)
  - And there is a maximum that a platform can provide
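In the basic roofline model these two metrics combine into a single bound: at a given arithmetic intensity, attainable performance is the smaller of the platform's peak FLOPS and the product of arithmetic intensity and memory bandwidth. A minimal sketch, where the function name and the platform numbers in the usage note are hypothetical:

```cpp
#include <algorithm>
#include <cassert>

// Roofline bound in GFLOPS for a kernel with arithmetic intensity `ai`
// (FLOP/byte) on a platform with the given compute peak (GFLOPS) and
// memory bandwidth (GB/s).
double roofline_bound(double ai, double peak_gflops, double bandwidth_gb_per_s) {
    assert(ai >= 0.0);
    return std::min(peak_gflops, ai * bandwidth_gb_per_s);
}
```

On an imagined platform with a 100 GFLOPS peak and 50 GB/s of bandwidth, a kernel with AI 0.5 is limited to 25 GFLOPS by memory (the sloped part of the roof), while a kernel with AI 4.0 is limited to 100 GFLOPS by compute (the flat part).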





### Classic vs. Cache-Aware Roofline

Intel<sup>®</sup> Advisor uses the Cache-Aware Roofline model, which has a different definition of Arithmetic Intensity than the original ("Classic") model.

#### **Classic Roofline**

- Traffic measured from one level of memory (usually DRAM)
- AI may change with data set size
- AI changes as a result of memory optimizations

#### **Cache-Aware Roofline**

- Traffic measured from all levels of memory
- AI is tied to the algorithm and will not change with data set size
- Optimization does not change AI\*, only the performance

\*Compiler optimizations may modify the algorithm, which may change the AI.



### **Ultimate Performance Limits**





### **Sub-Roofs and Current Limits**





## The Intel<sup>®</sup> Advisor Roofline Interface

- Roofs are based on benchmarks run before the application.
  - Roofs can be hidden, highlighted, or adjusted.
- Intel<sup>®</sup> Advisor has size- and color-coding for dots.
  - Color code by duration or vectorization status
  - Categories, cutoffs, and visual style can be modified.



## **Identifying Good Optimization Candidates**

Focus optimization effort where it makes the most difference.

- Large, red loops have the most impact.
- Loops far from the upper roofs have more room to improve.




## **Identifying Potential Bottlenecks**

Final roofs *do* apply; sub-roofs *may* apply.

- Roofs above indicate potential bottlenecks
- Closer roofs are the most likely suspects
- Roofs below may contribute but are generally not primary bottlenecks
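The reading rule above can be sketched as a small ranking: evaluate each roof at the loop's arithmetic intensity, keep the roofs that lie above the loop's dot, and order them closest first. The `Roof` type, the function names, and all roof values in the usage note are hypothetical; Advisor draws this picture for you, so the code only illustrates the reasoning.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Roof {
    std::string name;
    bool is_bandwidth;  // true: value is GB/s (sloped roof); false: value is GFLOPS (flat roof)
    double value;
};

// Height of a roof (in GFLOPS) at arithmetic intensity `ai`.
double roof_limit(const Roof& r, double ai) {
    return r.is_bandwidth ? r.value * ai : r.value;
}

// Names of the roofs lying above the dot (ai, gflops), closest roof first.
std::vector<std::string> roofs_above(std::vector<Roof> roofs, double ai, double gflops) {
    assert(ai > 0.0);
    roofs.erase(std::remove_if(roofs.begin(), roofs.end(),
                               [&](const Roof& r) { return roof_limit(r, ai) <= gflops; }),
                roofs.end());
    std::sort(roofs.begin(), roofs.end(),
              [&](const Roof& a, const Roof& b) { return roof_limit(a, ai) < roof_limit(b, ai); });
    std::vector<std::string> names;
    for (const Roof& r : roofs) names.push_back(r.name);
    return names;
}
```

For a loop at AI 0.25 achieving 6 GFLOPS, with made-up roofs of 20 GB/s (DRAM), 300 GB/s (L1), 10 GFLOPS (scalar add), and 80 GFLOPS (vector FMA), the DRAM roof sits at 5 GFLOPS, below the dot, and is dropped. The remaining roofs come back in the order scalar add, L1 bandwidth, vector FMA, so the scalar add peak is the first suspect.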



### Back to the example...



### **Overcoming the Scalar Add Peak**

- Survey and Code Analytics tabs indicate vectorization status with colored icons, one for scalar loops and one for vectorized loops.
- "Why No Vectorization" tab and column in Survey explain what prevented vectorization.
- Recommendations tab may help you vectorize the loop.
- Dependencies determines if it's safe to force vectorization.



**Problems and Messages**

| ID | Type | Sources | Modules | Site Name | State |
|---|---|---|---|---|---|
| P3 | Read after write dependency | lbpGET.cpp | slbe.exe | loop_site_5 | New |

**Read after write dependency: Code Locations**

| ID | Instruction | Description | Function | Source | Variable references | Module | State |
|---|---|---|---|---|---|---|---|
| X4 | 0x140088772 | Read | fsBGKShanChen | lbpGET.cpp:155 | register XMM5 | slbe.exe | New |
| X5 | 0x140088772 | Write | fsBGKShanChen | lbpGET.cpp:155 | register XMM5 | slbe.exe | New |




### Dependencies Analysis

Vectorization Advisor


- Generally, you don't need to run Dependencies analysis unless Advisor tells you to. It recommends doing so when it detects loops that remained unvectorized because the compiler was playing it safe with autovectorization.
- Use the survey checkboxes to select which loops to analyze.
- If no dependencies are found, it's safe to force vectorization.
- Otherwise, use the reported variable read/write information to see if you can rework the code to eliminate the dependency.
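To make the distinction concrete, here are two loops: one with a real loop-carried read-after-write (RAW) dependency, and one that a compiler might only assume is dependent. Both functions are hypothetical examples written for this sketch. The `#pragma omp simd` line takes effect only when the compiler's OpenMP SIMD support is enabled; otherwise it is ignored and the loop is left to the auto-vectorizer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Real loop-carried RAW dependency: iteration i reads x[i - 1], which
// iteration i - 1 has just written. Forcing this loop to vectorize as
// written would change the results.
void prefix_sum(std::vector<double>& x) {
    for (std::size_t i = 1; i < x.size(); ++i) {
        x[i] += x[i - 1];
    }
}

// No dependency between iterations, as long as `out` and `in` do not
// partially overlap. From two raw pointers the compiler may be unable to
// prove that and may stay scalar; once you know the arrays are distinct,
// the pragma tells the compiler to vectorize anyway.
void scale(double* out, const double* in, std::size_t n, double factor) {
    assert(out != nullptr && in != nullptr);
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = factor * in[i];
    }
}
```

A Dependencies run on the first loop would be expected to report a RAW dependency, which means the algorithm has to be reworked before vectorizing. For the second loop, a clean report is what justifies adding the pragma.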

| Site Location | Loop-Carried Dependencies |
|---|---|
| [loop in main at example.c… | No dependencies found |
| [loop in main at example.c… | RAW:1 |

Recommendation: Confirm dependency is real

Confidence: Need More Data

There is no confirmation that a real (proven) dependency is present in the loop. To confirm: Run a Dependencies analysis.



### Back to our example...




### Memory Access Patterns Analysis: Collecting a MAP

- If you have low vector efficiency, or see that a loop did not vectorize because it was deemed "possible but inefficient", you may want to run a MAP analysis.
- Advisor will also recommend a MAP analysis if it detects a possible inefficient access pattern. In the Survey report this shows up in the Vector Issues column as "1 Possible inefficient memory access patterns present".



- Select the loops you want to run the MAP on using the checkboxes. It may be helpful to reduce the problem size, as MAP only needs to detect patterns, and has high overhead.
  - Note that if changing the problem size requires recompiling, you will need to recollect the survey before running MAP.



### Memory Access Patterns Analysis: Reading a MAP

- MAP is color coded by stride type. From best to worst:
  - **Blue** is unit/uniform (stepping by 1 or 0)
  - **Yellow** is constant (stepping a set distance)
  - **Red** is variable (a changing step distance)
- Click a loop in the top pane to see a detailed report below.
  - The strides that contribute to the loop are broken down in this table.
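The three stride categories correspond to familiar access patterns in source code. The functions below are hypothetical examples, one per category, assuming a row-major table for the constant-stride case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unit stride: consecutive elements, the pattern vector loads handle best.
double sum_unit(const std::vector<double>& a) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i];
    return s;
}

// Constant stride: walking down one column of a row-major rows x cols
// table, so each access is `cols` elements after the previous one.
double sum_column(const std::vector<double>& table, std::size_t rows,
                  std::size_t cols, std::size_t col) {
    assert(table.size() == rows * cols && col < cols);
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r) s += table[r * cols + col];
    return s;
}

// Variable stride: indirect access through an index array, so the distance
// between consecutive accesses changes from iteration to iteration.
double sum_indexed(const std::vector<double>& a, const std::vector<std::size_t>& idx) {
    double s = 0.0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] < a.size());
        s += a[idx[i]];
    }
    return s;
}
```

A common remedy for the constant-stride case is to change the data layout or swap loops so that the innermost loop walks along a row instead of down a column.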

| Site Location | Strides Distribution | Access Pattern | Max. Site Footprint | Recommendations |
|---|---|---|---|---|
| [loop in main… | 76% / 0% / 24… | Mixed stri… | 64KB | |
| [loop in main… | 76% / 0% / 24… | Mixed stri… | 64KB | |
| [loop in main… | 70% / 6% / 24… | Mixed stri… | 5664MB | |
| [loop in main… | 100% / 0% / 0… | All unit str… | 70KB | |
| [loop in main… | 33% / 67% / 0… | Mixed stri… | 616MB | 1 Inefficient memory access pa… |

**Memory Access Patterns Report**

| ID | Stride | Type | Source | Modules | Nested Function | Variable references |
|---|---|---|---|---|---|---|
| P1 | 36000 | Constant stride | stride.cpp:49 | stride.exe | | tableA, tableB |
| P2 | 36000 | Constant stride | stride.cpp:49 | stride.exe | | results |
| P7 | | Parallel site info… | stride.cpp:47 | stride.exe | | |
| P19 | 0 | Uniform stride | stride.exe:0x… | stride.exe | _svml_atan4 | |
| P1… | -12; -8; … | Variable stride | svml_dispmd… | svml_dispmd.dll | _svml_atan2 | |
| P1… | -12; -8; … | Variable stride | svml_dispmd… | svml_dispmd.dll | _svml_atan2 | |



## Legal Disclaimer & Optimization Notice

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <u>www.intel.com/benchmarks</u>.

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

# ALIGNMENT, PADDING AND PEEL/REMAINDER

### The original 1D table in the peel example





### The original 1D table when aligned





### The original 1D table when aligned, padded





### The 2D case




