**Late submission policy:** Submissions can be at most 2
days late. There will be a 10% penalty for each day after the due
date (cumulative).

**Clarifications** to the assignment will be posted at the
bottom of the page.

- A sequential program for performing the numerical integration is available here. It is an adaptation of the code I showed you in class. The main difference is that it is sequential and it performs the numerical integration in the range [0.0, 0.5] rather than [0.0,1.0) since this gives more accurate results. The code includes some header files that you will need in the rest of the assignment. Read this code and run it. It prints the estimate for pi and the running time in nanoseconds.

- Use your knowledge of basic calculus to explain briefly why this code provides an estimate for pi, and why integrating in the range [0.0,0.5] gives more accurate results than integrating in the range [0.0,1.0).

- In this part of the assignment, you will study the effect of *true-sharing* on performance. Modify the sequential code given to you as follows, to compute the estimate for pi in parallel using pthreads. Your code should create some number of threads and divide the responsibility for performing the numerical integration between these threads. You can use the round-robin assignment of points in the code I showed you in class. Whenever a thread computes a value, **it should add it directly to the global variable** *pi*. Use a pthreads mutex to ensure that *pi* is updated atomically.

- Find the running times for one, two, four and eight threads and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- You can avoid the mutex in the previous part by using atomic
instructions to add contributions from threads to the global
variable sum. C++ provides a rich set of atomic
instructions for this purpose. Here is one way to use them for
your numerical integration program. The code below creates
an object pi that contains a field of type double on which
atomic operations can be performed. This field is initialized to
0, and its value can be read using method load(). The
routine add_to_pi atomically adds the value passed to it to this
field. You should read the definition of compare_exchange_weak
to make sure you understand how it works. The while loop
iterates until this operation succeeds. Use this approach
to implement the numerical integration routine in a lock-free
manner.

- As before, find the running times for one, two, four and eight threads and plot the running times and speedups you observe. Do you see any improvements in running times compared to the previous part in which you used mutexes? How about speedups? Explain your answers briefly. What value of pi is computed by your code when it is run on 8 threads?

```cpp
#include <atomic>

std::atomic<double> pi{0};

void add_to_pi(double bar) {
  auto current = pi.load();
  // Retry until no other thread has changed pi between the load and the CAS.
  while (!pi.compare_exchange_weak(current, current + bar))
    ;
}
```

- In this part of the assignment, you will study the effect of *false-sharing* on performance. Create a global array *sum* and have each thread *t* **add its contribution directly into sum[t]**. At the end, thread 0 can add the values in this array to produce the estimate for pi.

- Find the running times for one, two, four and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- In this part of the assignment, you will study the performance
benefit of eliminating both true-sharing and false-sharing. Run
the code given in class, in which each thread keeps its running
sum in a local variable and then writes its final contribution
to the *sum* array. At the end, thread 0 adds up the values in
the array to produce the estimate for pi.

- Find the running times for one, two, four and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- Write a short summary of your results in the previous parts,
using phrases like "true-sharing" and "false-sharing" in your
explanation.

**Parallel Bellman-Ford implementation:**

Recall that the Bellman-Ford algorithm
solves the single-source shortest path problem. It is a
topology-driven algorithm, so it makes a number of sweeps over the
nodes of the graph, terminating sweeps when node labels do not
change in a sweep. In each sweep, it visits all the nodes of the
graph, and at each node, it applies a push-style relaxation
operator to update the labels of neighboring nodes.

One way to parallelize Bellman-Ford is to
create some number of threads (say *t*), and divide the
nodes more or less equally between threads in blocks of (*N/t*)
where *N* is the number of nodes in the graph. In each
sweep, a thread applies the operator to all the nodes assigned to
it. You can also assign nodes to threads in a round-robin way.
Giving all threads roughly equal numbers of nodes may not give you
good load-balance for power-law graphs (why?) but we will live
with it. Feel free to invent more load-balanced ways of assigning
nodes to threads.

The main concurrency correctness issue you need
to worry about is ensuring that updates to node labels are done
atomically. The lecture slides show how you can use a CAS
operation to accomplish this. Read the C++ documentation to see how to
implement this in C++.
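
For example, the atomic label update might be sketched as follows (the names `dist` and `relax_edge`, and the use of `int` labels, are our assumptions, not part of the handout):

```cpp
#include <atomic>
#include <vector>

// Atomically lowers dist[d] to min(dist[d], new_dist) with a CAS loop.
// Returns true if this call actually lowered the label.
bool relax_edge(std::vector<std::atomic<int>>& dist, int d, int new_dist) {
  int old = dist[d].load();
  while (new_dist < old) {
    // The CAS succeeds only if dist[d] still equals old; on failure,
    // old is refreshed with the current value and we re-test.
    if (dist[d].compare_exchange_weak(old, new_dist))
      return true;
  }
  return false;   // another thread already made the label as small or smaller
}
```

The re-test after a failed CAS matters: if a concurrent relaxation has already written a smaller label, this thread must give up rather than overwrite it.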

Input graphs: use rmat15, rmat23,
road-FLA and road-NY.

Source nodes: **node 1 for rmat graphs,
node 140961 for road-NY, node 316607 for road-FLA**. These
are the nodes with the highest degree.

The output for the SSSP algorithm
should be produced as a text file containing one line for each
node, specifying the number of the node and the label of that
node. You can check your sssp solution by comparing your
results with the ones you found in assignment 3.

**What to turn in**:

- Submit the sssp output produced by your program when it is run
on 8 threads for each input graph.

- Find the running times and speedups for one, two, four and
eight threads for each input graph, and plot them. You can use
different plots for the running times for different input graphs
since the sizes and therefore the running times will be very
different. Use a single plot for the speedups.

- Do you observe good speedups for rmat-style graphs? How about road networks?

**Graph formats**

Input graphs will be given to you in DIMACS format (described below).
- You can find all graphs for this assignment on Stampede here:
/work/01131/rashid/class-inputs .

- We have provided the following graphs for sssp: power-law graphs *rmat15* and *rmat23*, and road networks *road-FLA* (Florida road network) and *road-NY* (New York road network). Graphs like rmat23 are quite big, so do not do any runs with them until your code has been debugged on some small graphs that you have constructed.

**DIMACS format for graphs**

One popular format for representing *directed* graphs as
text files is the DIMACS
format (undirected graphs are represented as a directed graph by
representing each undirected edge as two directed edges). Files
are assumed to be well-formed and internally consistent so it is
not necessary to do any error checking. A line in a file
must be one of the following.

**Comments.** Comment lines give human-readable information about the file and are ignored by programs. Comment lines can appear anywhere in the file. Each comment line begins with the lower-case character **c**. For example:

    c This is an example of a comment line.

**Problem line.** There is one problem line per input file. The problem line must appear before any node or edge descriptor lines. The problem line has the following format:

    p FORMAT NODES EDGES

The lower-case character `p` signifies that this is the problem line. The `FORMAT` field should contain a mnemonic for the problem, such as sssp. The `NODES` field contains an integer value specifying *n*, the number of nodes in the graph. The `EDGES` field contains an integer value specifying *m*, the number of edges in the graph. These two fields tell you how much storage to allocate for the CSR representation of the graph.

**Edge Descriptors.** There is one edge descriptor line for each edge in the graph, each with the following format. Each edge *(s,d,w)* from node *s* to node *d* with weight *w* appears exactly once in the input file.

    a s d w

The lower-case character `a` signifies that this is an edge descriptor line. (The "a" stands for arc, in case you are wondering.)
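
As an illustration (the function and struct names here are ours, not part of the format), one way to dispatch on the three line types with `sscanf`:

```cpp
#include <cstdio>

struct Edge { int s, d, w; };

// Returns true and fills e if line is an edge descriptor ("a s d w").
// A problem line ("p FORMAT n m") fills n and m; comment lines ("c ...")
// and anything unrecognized are skipped.
bool parse_dimacs_line(const char* line, int* n, int* m, Edge* e) {
  char fmt[16];
  switch (line[0]) {
    case 'c':                                   // comment: ignore
      return false;
    case 'p':                                   // problem line
      sscanf(line, "p %15s %d %d", fmt, n, m);
      return false;
    case 'a':                                   // edge descriptor
      sscanf(line, "a %d %d %d", &e->s, &e->d, &e->w);
      return true;
    default:
      return false;
  }
}
```

A first pass over the problem line tells you how big the CSR arrays must be; a second pass over the edge descriptors fills them in.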

**Notes:**

- Because of the generator used for rmat
graphs, the files for some of the graphs may have
multiple edges between the same pair of nodes. When
building the CSR representation in memory, keep only the
edge with the largest weight. For example, if you find
edges (s d 1) and (s d 4), keep only the edge with
weight 4. In principle, you could keep the smallest-weight
edge or follow some other rule, but I want everyone to
follow the same rule to make grading easier.

- [4:39PM, 9th April]: When you compute speedup for numerical integration, the numerator should be the running time of the **serial code** I gave you, and the denominator should be the running time of the parallel code on however many threads you used. The speedup will be different for different numbers of threads. Note that the running time of the serial code will be different from the running time of your parallel code running on one thread, because of the overhead of synchronization in the parallel code even when it is running on one thread. I'll go into this in more detail in class.

- [4:50PM, 9th April]: Lane Kolbly and
Tongliang have a thread on Piazza about how to use the
right version of gcc. Make sure you read it. Summary: you
need to do
*module load gcc/4.9.1* to load