**Late submission policy:** Submissions can be at most 2
days late. There will be a 10% penalty for each day after the due
date (cumulative).

**Clarifications** to the assignment will be posted at the
bottom of the page.

- A sequential program for performing the numerical integration is available here. It is an adaptation of the code I showed you in class. The main difference is that it is sequential and it performs the numerical integration in the range [0.0, 0.5] rather than [0.0,1.0) since this gives more accurate results. The code includes some header files that you will need in the rest of the assignment. Read this code and run it. It prints the estimate for pi and the running time in nanoseconds.

- Use your knowledge of basic calculus to explain briefly why this code provides an estimate for pi, and why integrating in the range [0.0,0.5] gives more accurate results than integrating in the range [0.0,1.0).

- In this part of the assignment, you will study the effect of *true-sharing* on performance. Modify the sequential code given to you as follows, to compute the estimate for pi in parallel using pthreads. Your code should create some number of threads and divide the responsibility for performing the numerical integration between these threads. You can use the round-robin assignment of points in the code I showed you in class. Whenever a thread computes a value, **it should add it directly to the global variable** *pi*. Use a pthreads mutex to ensure that *pi* is updated atomically.

- Find the running times for one, two, four and eight threads and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- You can avoid the mutex in the previous part by using atomic
instructions to add contributions from threads to the global
variable sum. C++ provides a rich set of atomic
instructions for this purpose. Here is one way to use them for
your numerical integration program. The code below creates
an object pi that contains a field of type double on which
atomic operations can be performed. This field is initialized to
0, and its value can be read using method load(). The
routine add_to_pi atomically adds the value passed to it to this
field. You should read the definition of compare_exchange_weak
to make sure you understand how it works. The while loop
iterates until this operation succeeds. Use this approach
to implement the numerical integration routine in a lock-free
manner.

- As before, find the running times for one, two, four and eight threads and plot the running times and speedups you observe. Do you see any improvements in running times compared to the previous part in which you used mutexes? How about speedups? Explain your answers briefly. What value of pi is computed by your code when it is run on 8 threads?

```cpp
#include <atomic>

std::atomic<double> pi{0};

void add_to_pi(double bar) {
  auto current = pi.load();
  // Retry until no other thread has changed pi between the load and the CAS.
  while (!pi.compare_exchange_weak(current, current + bar))
    ;
}
```

- In this part of the assignment, you will study the effect of *false-sharing* on performance. Create a global array *sum* and have each thread *t* **add its contribution directly into sum[t]**. At the end, thread 0 can add the values in this array to produce the estimate for pi.

- Find the running times for one, two, four and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- In this part of the assignment, you will study the performance
benefit of eliminating both true-sharing and false-sharing. Run
the code given in class, in which each thread keeps its running
sum in a local variable and then writes its final contribution
to the *sum* array. At the end, thread 0 adds up the values in
the array to produce the estimate for pi.

- Find the running times for one, two, four and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- Write a short summary of your results in the previous parts,
using phrases like "true-sharing" and "false-sharing" in your
explanation.

**Parallel Bellman-Ford implementation:**

Recall that the Bellman-Ford algorithm
solves the single-source shortest path problem. It is a
topology-driven algorithm, so it makes a number of sweeps over the
nodes of the graph, terminating sweeps when node labels do not
change in a sweep. In each sweep, it visits all the nodes of the
graph, and at each node, it applies a push-style relaxation
operator to update the labels of neighboring nodes.

One way to parallelize Bellman-Ford is to
create some number of threads (say *t*), and divide the
nodes more or less equally between threads in blocks of (*N/t*)
where *N* is the number of nodes in the graph. In each
sweep, a thread applies the operator to all the nodes assigned to
it. You can also assign nodes to threads in a round-robin way.
Giving all threads roughly equal numbers of nodes may not give you
good load-balance for power-law graphs (why?) but we will live
with it. Feel free to invent more load-balanced ways of assigning
nodes to threads.

The main concurrency correctness issue you need
to worry about is ensuring that updates to node labels are done
atomically. The lecture slides show how you can use a CAS
operation to accomplish this. Read the C++ documentation to see how to
implement this in C++.
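
For example, the atomic label update might be sketched as follows (the names `dist` and `relax_edge`, and the use of `int` labels, are our assumptions, not part of the handout):

```cpp
#include <atomic>
#include <vector>

// Atomically lowers dist[d] to min(dist[d], new_dist) with a CAS loop.
// Returns true if this call actually lowered the label.
bool relax_edge(std::vector<std::atomic<int>>& dist, int d, int new_dist) {
  int old = dist[d].load();
  while (new_dist < old) {
    // The CAS succeeds only if dist[d] still equals old; on failure,
    // old is refreshed with the current value and we re-test.
    if (dist[d].compare_exchange_weak(old, new_dist))
      return true;
  }
  return false;   // another thread already made the label as small or smaller
}
```

The re-test after a failed CAS matters: if a concurrent relaxation has already written a smaller label, this thread must give up rather than overwrite it.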

Input graphs: use rmat15, rmat23,
road-FLA and road-NY.

Source nodes: **node 1 for rmat graphs,
node 140961 for road-NY, node 316607 for road-FLA**. These
are the nodes with the highest degree.

The output for the SSSP algorithm
should be produced as a text file containing one line for each
node, specifying the number of the node and the label of that
node. You can check your sssp solution by comparing your
results with the ones you found in assignment 3.

**What to turn in**:

- Submit the sssp output produced by your program when it is run
on 8 threads for each input graph.

- Find the running times and speedups for one, two, four and
eight threads for each input graph, and plot them. You can use
different plots for the running times for different input graphs
since the sizes and therefore the running times will be very
different. Use a single plot for the speedups.

- Do you observe good speedups for rmat-style graphs? How about road networks?

**Graph formats**

Input graphs will be given to you in DIMACS format (described below).
- You can find all graphs for this assignment on Stampede here:
/work/01131/rashid/class-inputs .

- We have provided the following graphs for sssp: power-law graphs *rmat15* and *rmat23*, and road networks *road-FLA* (Florida road network) and *road-NY* (New York road network). Graphs like rmat23 are quite big, so do not do any runs with them until your code has been debugged on some small graphs that you have constructed.

**DIMACS format for graphs**

One popular format for representing *directed* graphs as
text files is the DIMACS
format (undirected graphs are represented as a directed graph by
representing each undirected edge as two directed edges). Files
are assumed to be well-formed and internally consistent so it is
not necessary to do any error checking. A line in a file
must be one of the following.

**Comments.** Comment lines give human-readable information about the file and are ignored by programs. Comment lines can appear anywhere in the file. Each comment line begins with the lower-case character **c**. For example:

    c This is an example of a comment line.

**Problem line.** There is one problem line per input file. The problem line must appear before any node or edge descriptor lines. The problem line has the following format:

    p FORMAT NODES EDGES

The lower-case character `p` signifies that this is the problem line. The `FORMAT` field should contain a mnemonic for the problem, such as sssp. The `NODES` field contains an integer value specifying *n*, the number of nodes in the graph. The `EDGES` field contains an integer value specifying *m*, the number of edges in the graph. These two fields tell you how much storage to allocate for the CSR representation of the graph.

**Edge Descriptors.** There is one edge descriptor line for each edge in the graph, each with the following format. Each edge *(s,d,w)* from node *s* to node *d* with weight *w* appears exactly once in the input file.

    a s d w

The lower-case character `a` signifies that this is an edge descriptor line. (The "a" stands for arc, in case you are wondering.)
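
As an illustration (the function and struct names here are ours, not part of the format), one way to dispatch on the three line types with `sscanf`:

```cpp
#include <cstdio>

struct Edge { int s, d, w; };

// Returns true and fills e if line is an edge descriptor ("a s d w").
// A problem line ("p FORMAT n m") fills n and m; comment lines ("c ...")
// and anything unrecognized are skipped.
bool parse_dimacs_line(const char* line, int* n, int* m, Edge* e) {
  char fmt[16];
  switch (line[0]) {
    case 'c':                                   // comment: ignore
      return false;
    case 'p':                                   // problem line
      sscanf(line, "p %15s %d %d", fmt, n, m);
      return false;
    case 'a':                                   // edge descriptor
      sscanf(line, "a %d %d %d", &e->s, &e->d, &e->w);
      return true;
    default:
      return false;
  }
}
```

A first pass over the problem line tells you how big the CSR arrays must be; a second pass over the edge descriptors fills them in.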

**Notes:**

- Because of the generator used for rmat
graphs, the files for some of the graphs may have
multiple edges between the same pair of nodes. When
building the CSR representation in memory, keep only the
edge with the largest weight. For example, if you find
edges (s d 1) and (s d 4), keep only the edge with
weight 4. In principle, you could keep the smallest-weight
edge or follow some other rule, but I want everyone to
follow the same rule to make grading easier.

- [4:39PM, 9th April]: When you compute speedup for numerical integration, the numerator should be the running time of the **serial code** I gave you, and the denominator should be the running time of the parallel code on however many threads you used. The speedup will be different for different numbers of threads. Note that the running time of the serial code will be different from the running time of your parallel code running on one thread, because of the overhead of synchronization in the parallel code even when it is running on one thread. I'll go into this in more detail in class.

- [4:50PM, 9th April]: Lane Kolbly and
Tongliang have a thread on Piazza about how to use the
right version of gcc. Make sure you read it. Summary: you
need to do
*module load gcc/4.9.1* to load