CS 377P: Programming for Performance
Assignment 5: Shared-memory parallel programming
Due date: April 14th, 2021, 10:00 PM
Late submission policy: Submissions can be at most 2 days late, with a 10% penalty for each day.
In this assignment, you will implement parallel programs to compute an approximation to pi using the numerical integration program discussed in class. You will implement several variations of this program to understand factors that affect performance in shared-memory programs. Read the entire assignment before starting work, since you will be incrementally changing your code in each section of the assignment, and it will be useful to see the overall structure of what you are being asked to do.
Numerical integration to compute an estimate for pi:
- A sequential program for performing the numerical integration is available here. It is an adaptation of the code I showed you in class. The code includes some header files that you will need in the rest of the assignment. Read this code and run it. It prints the estimate for pi and the running time in nanoseconds.
What to turn in:
- Use your knowledge of basic calculus to explain briefly why this code provides an estimate for pi.
- Modify the sequential code as follows to compute the estimate for pi in parallel using pthreads. Your code should create some number of threads and divide the responsibility for performing the numerical integration among these threads. You can use the round-robin assignment of points in the code I showed you in class. Whenever a thread computes a value, it should add it directly to the global variable pi without any synchronization.
What to turn in:
- Find the running times (of only computing pi) for one, two, four, and eight threads, and plot the running times and speedups you observe. What values are computed by your code for different numbers of threads? Why would you expect these values not to be accurate estimates of pi?
- In this part of the assignment, you will study the effect of true-sharing on performance. Modify the code in the previous part by using a pthread mutex to ensure that the global variable pi is updated atomically.
What to turn in:
- Find the running times (of only computing pi) for one, two, four, and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?
- You can avoid the mutex in the previous part by using atomic instructions to add contributions from threads to the global variable pi. C++ provides a rich set of atomic instructions for this purpose. Here is one way to use them for your numerical integration program. The code below creates an object pi that contains a field of type double on which atomic operations can be performed. This field is initialized to 0, and its value can be read using the method load(). The routine add_to_pi atomically adds the value passed to it to this field. You should read the definition of compare_exchange_weak to make sure you understand how it works. The while loop iterates until this operation succeeds. Use this approach to implement the numerical integration routine in a lock-free manner.
What to turn in:
- As before, find the running times (of only computing pi) for one, two, four, and eight threads, and plot the running times and speedups you observe. Do you see any improvements in running times compared to the previous part, in which you used mutexes? How about speedups? Explain your answers briefly. What value of pi is computed by your code when it is run on 8 threads?
std::atomic<double> pi{0.0};

void add_to_pi(double bar) {
  auto current = pi.load();
  // On failure, compare_exchange_weak refreshes current with the latest
  // value of pi, so each retry uses up-to-date data.
  while (!pi.compare_exchange_weak(current, current + bar))
    ;
}
- In this part of the assignment, you will study the effect of false-sharing on performance. Create a global array sum and have each thread t add its contribution directly into sum[t]. At the end, thread 0 can add the values in this array to produce the estimate for pi.
What to turn in:
- Find the running times (of only computing pi) for one, two, four, and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?
- The code used in the previous part used pthread_join. Replace this with a barrier and run your code again.
What to turn in:
- Find the running times (of only computing pi) for one, two, four, and eight threads, and plot the running times and speedups you observe. What value of pi is computed by your code when it is run on 8 threads?
- Write a short summary of your results in the previous parts, using phrases like "atomic operations," "true-sharing," and "false-sharing" in your explanation.
Notes:
- When you compute speedup for numerical integration, the numerator should be the running time of the serial code I gave you, and the denominator should be the running time of the parallel code on however many threads you used. The speedup will be different for different numbers of threads. Note that the running time of the serial code may be different from the running time of your parallel code running on one thread, because of the overhead of synchronization in the parallel code even when it is running on one thread.