**Late submission policy:** Submissions can be at most 2
days late. There will be a 10% penalty for each day after the due
date (cumulative).

**Clarifications** to the assignment will be posted at the
bottom of the page.

- A sequential program for performing the numerical integration is available here. It is an adaptation of the code I showed you in class. The main difference is that it is sequential and it performs the numerical integration in the range [0.0, 0.5] rather than [0.0,1.0) since this gives more accurate results. The code includes some header files that you will need in the rest of the assignment. Read this code and run it. It prints the estimate for pi and the running time in nanoseconds.

- Use your knowledge of basic calculus to explain briefly why this code provides an estimate for pi, and why integrating in the range [0.0,0.5] gives more accurate results than integrating in the range [0.0,1.0).

- In this part of the assignment, you will study the effect of *true-sharing* on performance. Modify the sequential code given to you as follows, to compute the estimate for pi in parallel using pthreads. Your code should create some number of threads and divide the responsibility for performing the numerical integration between these threads. You can use the round-robin assignment of points in the code I showed you in class. Whenever a thread computes a value, **it should add it directly to the global variable** *pi*. Use a pthreads mutex to ensure that *pi* is updated atomically.

- Find the running times for one, two, four and eight threads and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- You can avoid the mutex in the previous part by using atomic
instructions to add contributions from threads to the global
variable sum. C++ provides a rich set of atomic
instructions for this purpose. Here is one way to use them for
your numerical integration program. The code below creates
an object pi that contains a field of type double on which
atomic operations can be performed. This field is initialized to
0, and its value can be read using method load(). The
routine add_to_pi atomically adds the value passed to it to this
field. You should read the definition of compare_exchange_weak
to make sure you understand how it works. The while loop
iterates until this operation succeeds. Use this approach
to implement the numerical integration routine in a lock-free
manner.

- As before, find the running times for one, two, four and eight threads and plot the running times and speedups you observe. Do you see any improvements in running times compared to the previous part in which you used mutexes? How about speedups? Explain your answers briefly. What value of pi is computed by your code when it is run on 8 threads?

```cpp
#include <atomic>

std::atomic<double> pi{0};

void add_to_pi(double bar) {
  auto current = pi.load();
  // Retry until no other thread has changed pi between the load and the CAS.
  while (!pi.compare_exchange_weak(current, current + bar))
    ;
}
```

- In this part of the assignment, you will study the effect of *false-sharing* on performance. Create a global array *sum* and have each thread *t* **add its contribution directly into sum[t]**. At the end, thread 0 can add the values in this array to produce the estimate for pi.

- Find the running times for one, two, four and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- In this part of the assignment, you will study the performance
benefit of eliminating both true-sharing and false-sharing. Run
the code given in class, in which each thread keeps its running
sum in a local variable and then writes its final contribution
to the *sum* array. At the end, thread 0 adds up the values in
the array to produce the estimate for pi.

- Find the running times for one, two, four and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?

- Write a short summary of your results in the previous parts,
using phrases like "true-sharing" and "false-sharing" in your
explanation.

**Parallel Bellman-Ford implementation:**

Recall that the Bellman-Ford algorithm
solves the single-source shortest path problem. It is a
topology-driven algorithm, so it makes a number of sweeps over the
nodes of the graph, terminating sweeps when node labels do not
change in a sweep. In each sweep, it visits all the nodes of the
graph, and at each node, it applies a push-style relaxation
operator to update the labels of neighboring nodes.

One way to parallelize Bellman-Ford is to
create some number of threads (say *t*), and divide the
nodes more or less equally between threads in blocks of (*N/t*)
where *N* is the number of nodes in the graph. In each
sweep, a thread applies the operator to all the nodes assigned to
it. You can also assign nodes to threads in a round-robin way.
Giving all threads roughly equal numbers of nodes may not give you
good load-balance for power-law graphs (why?) but we will live
with it. Feel free to invent more load-balanced ways of assigning
nodes to threads.

The main concurrency correctness issue you need
to worry about is ensuring that updates to node labels are done
atomically. The lecture slides show how you can use a CAS
operation to accomplish this. Read the C++ documentation to see how to
implement this in C++.
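
For example, the atomic label update might be sketched as follows (the names `dist` and `relax_edge`, and the use of `int` labels, are our assumptions, not part of the handout):

```cpp
#include <atomic>
#include <vector>

// Atomically lowers dist[d] to min(dist[d], new_dist) with a CAS loop.
// Returns true if this call actually lowered the label.
bool relax_edge(std::vector<std::atomic<int>>& dist, int d, int new_dist) {
  int old = dist[d].load();
  while (new_dist < old) {
    // The CAS succeeds only if dist[d] still equals old; on failure,
    // old is refreshed with the current value and we re-test.
    if (dist[d].compare_exchange_weak(old, new_dist))
      return true;
  }
  return false;   // another thread already made the label as small or smaller
}
```

The re-test after a failed CAS matters: if a concurrent relaxation has already written a smaller label, this thread must give up rather than overwrite it.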

Input graphs: use rmat15, rmat23,
road-FLA and road-NY.

Source nodes: **node 1 for rmat graphs,
node 140961 for road-NY, node 316607 for road-FLA**. These
are the nodes with the highest degree.

The output for the SSSP algorithm
should be produced as a text file containing one line for each
node, specifying the number of the node and the label of that
node. You can check your sssp solution by comparing your
results with the ones you found in assignment 3.

**What to turn in**:

- Submit the sssp output produced by your program when it is run
on 8 threads for each input graph.

- Find the running times and speedups for one, two, four and
eight threads for each input graph, and plot them. You can use
different plots for the running times for different input graphs
since the sizes and therefore the running times will be very
different. Use a single plot for the speedups.

- Do you observe good speedups for rmat-style graphs? How about road networks?

**Graph formats**

Input graphs will be given to you in DIMACS format (described below).
- You can find all graphs for this assignment on Stampede here:
/work/01131/rashid/class-inputs .

- We have provided the following graphs for sssp: power-law graphs *rmat15* and *rmat23*, and road networks *road-FLA* (Florida road network) and *road-NY* (New York road network). Graphs like rmat23 are quite big, so do not do any runs with them until your code has been debugged on some small graphs that you have constructed.

**DIMACS format for graphs**

One popular format for representing *directed* graphs as
text files is the DIMACS
format (undirected graphs are represented as a directed graph by
representing each undirected edge as two directed edges). Files
are assumed to be well-formed and internally consistent so it is
not necessary to do any error checking. A line in a file
must be one of the following.

**Comments.** Comment lines give human-readable information about the file and are ignored by programs. Comment lines can appear anywhere in the file. Each comment line begins with the lower-case character **c**. For example:

    c This is an example of a comment line.

**Problem line.** There is one problem line per input file. The problem line must appear before any node or edge descriptor lines. The problem line has the following format:

    p FORMAT NODES EDGES

The lower-case character `p` signifies that this is the problem line. The `FORMAT` field should contain a mnemonic for the problem, such as sssp. The `NODES` field contains an integer value specifying *n*, the number of nodes in the graph. The `EDGES` field contains an integer value specifying *m*, the number of edges in the graph. These two fields tell you how much storage to allocate for the CSR representation of the graph.

**Edge Descriptors.** There is one edge descriptor line for each edge in the graph, each with the following format. Each edge *(s,d,w)* from node *s* to node *d* with weight *w* appears exactly once in the input file.

    a s d w

The lower-case character `a` signifies that this is an edge descriptor line. (The "a" stands for arc, in case you are wondering.)
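
As an illustration (the function and struct names here are ours, not part of the format), one way to dispatch on the three line types with `sscanf`:

```cpp
#include <cstdio>

struct Edge { int s, d, w; };

// Returns true and fills e if line is an edge descriptor ("a s d w").
// A problem line ("p FORMAT n m") fills n and m; comment lines ("c ...")
// and anything unrecognized are skipped.
bool parse_dimacs_line(const char* line, int* n, int* m, Edge* e) {
  char fmt[16];
  switch (line[0]) {
    case 'c':                                   // comment: ignore
      return false;
    case 'p':                                   // problem line
      sscanf(line, "p %15s %d %d", fmt, n, m);
      return false;
    case 'a':                                   // edge descriptor
      sscanf(line, "a %d %d %d", &e->s, &e->d, &e->w);
      return true;
    default:
      return false;
  }
}
```

A first pass over the problem line tells you how big the CSR arrays must be; a second pass over the edge descriptors fills them in.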

**Notes:**

- Because of the generator used for rmat
graphs, the files for some of the graphs may have
multiple edges between the same pair of nodes. When
building the CSR representation in memory, keep only the
edge with the largest weight. For example, if you find
edges (s d 1) and (s d 4), keep only the edge with
weight 4. In principle, you could keep the smallest-weight
edge or follow some other rule, but I want everyone to
follow the same rule to make grading easier.

- [4:39PM, 9th April]: When you compute speedup for numerical integration, the numerator should be the running time of the **serial code** I gave you, and the denominator should be the running time of the parallel code on however many threads you used. The speedup will be different for different numbers of threads. Note that the running time of the serial code will be different from the running time of your parallel code running on one thread, because of the overhead of synchronization in the parallel code even when it is running on one thread. I'll go into this in more detail in class.

- [4:50PM, 9th April]: Lane Kolbly and
Tongliang have a thread on Piazza about how to use the
right version of gcc. Make sure you read it. Summary: you
need to do
*module load gcc/4.9.1* to load