CS 377P: Programming for Performance
Assignment 6: OpenMP programming
Due date: May 1st, 2018, 11:59 PM
You can work independently or in groups of two.
Late submission policy: submissions may be at most one day
late, with a 10% penalty.
See the bottom for additional notes for this assignment.
In this assignment, you will use OpenMP pragmas and functions to
parallelize the Bellman-Ford algorithm for single-source
shortest-path computation, and study the effect of the loop
schedule on load balance. You may use classes from the
C++ STL and Boost libraries if you wish. Read the entire
assignment before starting work: you will be incrementally
changing your code in each section of the assignment, so it is
useful to see the overall structure of what you are being asked
to do.
Before you start, note the following:
- You can use your code in assignment 5 for this assignment.
- Input graphs and source nodes for the input graphs are the
same as in assignment 5. Please refer to post @105 on Piazza for
accessing the input graphs without copying them.
- For the graph format and how to handle the node-numbering
differences between DIMACS and CSR, see the specification of
assignment 5.
- You should verify the correctness of your codes as in
assignment 5.
Parallelize Bellman-Ford Algorithm with
OpenMP:
(a) (20 points) Make a copy of your serial implementation of the
Bellman-Ford algorithm from assignment 5. Modify the copy so that
a node's distance is represented by std::atomic<int> and updated
with a CAS operation. This is your starting point for
parallelizing the Bellman-Ford algorithm with OpenMP; it should
not contain any pthread constructs or the arrays you used to
avoid false sharing.
Your code will look similar to the following:
converged = false;
start_time = ...
/* Bellman-Ford computation over nodes */
while (!converged) {
    for each node n in g {
        for each edge e from n {
            /* update distance(e.dst) using CAS */
        }
    }
}
end_time = ...
exec_time = end_time - start_time;
(b) (20 points) Parallelize the code you derived in (a) with OpenMP.
Distribute nodes among threads in a round-robin fashion with a chunk
size of 1, i.e. with the clause schedule(static,1) on the loop
that iterates over all nodes. Measure the runtime of the
Bellman-Ford computation only, i.e. excluding graph construction,
thread creation/join, initialization, and printing of results. To
achieve this, your code should look like the following:
#pragma omp parallel [other clauses]
{
    start_time = ...
    /* your Bellman-Ford code from part (a) */
    end_time = ...
    exec_time = end_time - start_time;
}
(c) (10 points) Change the schedule clause in (b) to the following
parameters: (static,8), (static,32), (static,128),
(static,512), (dynamic,1), (dynamic,8),
(dynamic,32), (dynamic,128), and (dynamic,512).
Do you see any difference in runtimes? Why is that?
Hint: You can set the
environment variable OMP_SCHEDULE to choose the loop schedule at
run time while keeping only one copy of the OpenMP code; this
requires schedule(runtime) in the pragma. See details at OpenMP
Loop Scheduling.
(d) (20 points) Distributing nodes to threads may result in load
imbalance if the input graph is a power-law graph (why?). To address
this issue, we can distribute edges to threads so that the edges of a
high-degree node can be handled by different threads. First, we need
a version that iterates directly over edges:
- Add an array to your CSR graph representation that keeps track
of the source node of each edge.
- Change your SSSP code from (a) so that it loops over all edges
directly, similar to the following:
converged = false;
start_time = ...
/* Bellman-Ford computation over edges */
while (!converged) {
    for each edge e in g {
        /* update distance(e.dst) using CAS */
    }
}
end_time = ...
exec_time = end_time - start_time;
(e) (30 points) Parallelize the code you derived in (d) with OpenMP.
Measure the runtime with the following loop schedules over edges: (static,1),
(static,8), (static,32), (static,128),
(static,512), (dynamic,1), (dynamic,8),
(dynamic,32), (dynamic,128), and (dynamic,512).
Do you see any difference in runtimes? Why is that? Again, the
measured runtime should include only the Bellman-Ford computation,
i.e. exclude graph construction, thread creation/join,
initialization, and printing of results. Your code will be similar
to the following:
#pragma omp parallel [other clauses]
{
    start_time = ...
    /* your Bellman-Ford code from part (d) */
    end_time = ...
    exec_time = end_time - start_time;
}
What to turn in:
- The code for all subproblems.
- Draw a bar graph showing the serial runtimes for all 4 input
graphs.
- For each input graph, draw four x-y scatter graphs, where the
x-axis shows the number of threads used and the y-axis the
speedup you observed. In the first scatter graph, report the
speedup curves for all 5 chunk sizes using static loop schedules
and nodes as work units. In the second, report the speedup
curves for all 5 chunk sizes using dynamic schedules and nodes
as work units. Produce another two scatter graphs in the same
way but using edges as work units. Speedup is computed as
(runtime of serial code in assignment 5) / (runtime of OpenMP
code with N threads).
- Explain the runtime and speedup numbers you get in terms of the
input graph, work unit, loop schedule, true/false sharing, and
required synchronization.
Submission
Submit to Canvas a .tar.gz file with your code for each subproblem
and a report in PDF format. In the report, clearly state both
teammates' names, and include all the figures and analysis. Include
a Makefile so that I can compile your code with make [PARAMETER].
Include a README.txt explaining how to compile your code, how to run
your program, and what the outputs will be.
Notes
- [04/19/2018] You should run your versions of OpenMP codes with 1, 2, 4, and 8 threads and compute the speedup numbers.