CS 377P: Programming for Performance
Assignment 6: OpenMP programming
Due date: May 1st, 2018, 11:59 PM
You can work independently or in groups of two.
Late submission policy: submissions may be at most one day
late, with a 10% penalty.
See the bottom for additional notes for this assignment.
In this assignment, you will use OpenMP pragmas and functions to
parallelize the Bellman-Ford algorithm for single-source
shortest-path computation, and study the effect of the loop
schedule on load balance. You may use classes from the
C++ STL and Boost libraries if you wish. Read the entire
assignment before starting work: you will be incrementally
changing your code in each section of the assignment, so it is
useful to see the overall structure of what you are being asked
to do.
Before you start, note the following:
- You can use your code in assignment 5 for this assignment.
- Input graphs and source nodes for the input graphs are the
same as in assignment 5. Please refer to post @105 on Piazza for
accessing the input graphs without copying them.
- For the graph format and how to handle the node-numbering
differences between DIMACS and CSR, see the specification of
assignment 5.
- You should verify the correctness of your codes as in
assignment 5.
Parallelize Bellman-Ford Algorithm with
OpenMP:
(a) (20 points) Make a copy of your serial implementation of the
Bellman-Ford algorithm from assignment 5. Modify the copy so that
a node's distance is represented by std::atomic<int> and updated
with a CAS operation. This is your starting point for
parallelizing the Bellman-Ford algorithm with OpenMP; it should
not contain any pthread constructs or the arrays you used to
avoid false sharing.
Your code will look similar to the following:
converged = false;
start_time = ...
/* Bellman-Ford computation over nodes */
while (!converged) {
    for each node n in g {
        for each edge e from n {
            /* update distance(e.dst) using CAS */
        }
    }
}
end_time = ...
exec_time = end_time - start_time;
(b) (20 points) Parallelize the code you derived in (a) with OpenMP.
Distribute nodes among threads in a round-robin fashion with a chunk
size of 1, i.e. with the clause schedule(static,1) on the loop
that iterates over all nodes. Measure the runtime of the
Bellman-Ford computation only, i.e. excluding graph construction,
thread creation/join, initialization, and printing of results. To
achieve this, your code should look like the following:
#pragma omp parallel [other clauses]
{
    start_time = ...
    /* your Bellman-Ford code from part (a) */
    end_time = ...
    exec_time = end_time - start_time;
}
(c) (10 points) Change the schedule clause in (b) to the following
parameters: (static,8), (static,32), (static,128),
(static,512), (dynamic,1), (dynamic,8),
(dynamic,32), (dynamic,128), and (dynamic,512).
Do you see any difference in runtimes? Why is that?
Hint: You can set the
environment variable OMP_SCHEDULE to choose the loop schedule at
run time while keeping only one copy of the OpenMP code; this
requires schedule(runtime) in the pragma. See details at OpenMP
Loop Scheduling.
(d) (20 points) Distributing nodes to threads may result in load
imbalance if the input graph is a power-law graph (why?). To address
this issue, we can distribute edges to threads so that the edges of a
high-degree node can be handled by different threads. First, we need
a version that iterates directly over edges:
- Add an array to your CSR graph representation that keeps track
of the source node of each edge.
- Change your SSSP code from (a) so that it loops over all edges
directly, similar to the following:
converged = false;
start_time = ...
/* Bellman-Ford computation over edges */
while (!converged) {
    for each edge e in g {
        /* update distance(e.dst) using CAS */
    }
}
end_time = ...
exec_time = end_time - start_time;
(e) (30 points) Parallelize the code you derived in (d) with OpenMP.
Measure the runtime with the following loop schedules over edges: (static,1),
(static,8), (static,32), (static,128),
(static,512), (dynamic,1), (dynamic,8),
(dynamic,32), (dynamic,128), and (dynamic,512).
Do you see any difference in runtimes? Why is that? Again, the
measured runtime should include only the Bellman-Ford computation,
i.e. exclude graph construction, thread creation/join,
initialization, and printing of results. Your code will be similar
to the following:
#pragma omp parallel [other clauses]
{
    start_time = ...
    /* your Bellman-Ford code from part (d) */
    end_time = ...
    exec_time = end_time - start_time;
}
What to turn in:
- The code for all subproblems.
- Draw a bar graph showing the serial runtimes for all 4 input
graphs.
- For each input graph, draw four x-y scatter graphs, where the
x-axis shows the number of threads used and the y-axis the
speedup you observed. In the first scatter graph, report the
speedup curves for all 5 chunk sizes using static loop schedules
and nodes as work units. In the second, report the speedup
curves for all 5 chunk sizes using dynamic schedules and nodes
as work units. Produce another two scatter graphs in the same
way but using edges as work units. Speedup is computed as
(runtime of serial code in assignment 5) / (runtime of OpenMP
code with N threads).
- Explain the runtime and speedup numbers you get in terms of the
input graph, work unit, loop schedule, true/false sharing, and
required synchronization.
Submission
Submit to Canvas a .tar.gz file with your code for each subproblem
and a report in PDF format. In the report, clearly state both
teammates' names, and include all the figures and analysis. Include
a Makefile so that I can compile your code with make [PARAMETER].
Include a README.txt explaining how to compile your code, how to run
your program, and what the outputs will be.
Notes
- [04/19/2018] You should run your versions of OpenMP codes with 1, 2, 4, and 8 threads and compute the speedup numbers.