CS 378: Programming for Performance

Assignment 4-6: Parallel Single Source Shortest Path

Part 1 due date: November 5th

Part 2 due date: November 12th

Part 3 due date: November 19th

You can do this assignment alone or with someone else from class.
Each group can have a maximum of two students.
Each group should turn in one submission.

Week 1: Round-Based Parallelism

This week you will write a parallel implementation of SSSP using pthreads. The goal is to learn to use pthreads and to apply synchronization correctly.

1. Dijkstra

Implement Dijkstra's algorithm for SSSP. You may use a priority queue that does not support changing the priority of an item already in the queue (an operation Dijkstra's algorithm normally relies on), so long as the result is correct.
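
One standard workaround for the missing decrease-key operation is "lazy deletion": push a fresh (distance, node) pair whenever a distance improves, and skip entries that are stale when popped. A minimal sketch, assuming a simple adjacency-list representation (all names here are illustrative):

    #include <cstdint>
    #include <functional>
    #include <limits>
    #include <queue>
    #include <utility>
    #include <vector>

    // Lazy-deletion Dijkstra: instead of decreasing a key, push a fresh
    // (distance, node) pair and ignore stale entries when they are popped.
    // adj[u] holds pairs (neighbor, weight).
    std::vector<uint64_t> dijkstra(
        const std::vector<std::vector<std::pair<uint32_t, uint32_t>>>& adj,
        uint32_t source) {
      const uint64_t INF = std::numeric_limits<uint64_t>::max();
      std::vector<uint64_t> dist(adj.size(), INF);
      using Entry = std::pair<uint64_t, uint32_t>;  // (distance, node)
      std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
      dist[source] = 0;
      pq.push({0, source});
      while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d != dist[u]) continue;  // stale: a shorter path was already found
        for (auto [v, w] : adj[u]) {
          if (d + w < dist[v]) {
            dist[v] = d + w;
            pq.push({dist[v], v});  // re-push rather than decrease-key
          }
        }
      }
      return dist;
    }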

The input will be a DIMACS .gr file. You may use the graph data structure of your choice, though we recommend you write it yourself, since you must know exactly what is happening in order to implement the synchronization correctly in the next part. We recommend either Compressed Row Storage or an object-based representation in which each node is an object that maintains a list of edges. The graphs are weighted, so you will need to store the edge weights.
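
For reference, a Compressed Row Storage graph can be as simple as three arrays (the field names here are just one possible choice):

    #include <cstdint>
    #include <vector>

    // Compressed Row Storage: the outgoing edges of node u occupy indices
    // [row_start[u], row_start[u+1]) of the dest and weight arrays.
    struct CSRGraph {
      std::vector<uint64_t> row_start;  // length num_nodes + 1
      std::vector<uint32_t> dest;       // length num_edges
      std::vector<uint32_t> weight;     // length num_edges
    };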

Implement a function that verifies the computed result is correct. To do this, check for every edge that source dist + edge weight >= dest dist, and check that the source node's distance is zero. Since every computed distance corresponds to an actual path (and is therefore an upper bound on the true distance), showing that no edge can be relaxed any further implies the distances are exact.
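
A sketch of such a verifier, using the CSRGraph layout above (again, just one possible shape for the code):

    #include <cstdint>
    #include <cstdio>
    #include <limits>
    #include <vector>

    // Check that no edge can still be relaxed and the source distance is 0.
    bool verify(const CSRGraph& g, const std::vector<uint64_t>& dist,
                uint32_t source) {
      const uint64_t INF = std::numeric_limits<uint64_t>::max();
      if (dist[source] != 0) return false;
      for (size_t u = 0; u + 1 < g.row_start.size(); ++u) {
        if (dist[u] == INF) continue;  // unreachable: no constraint to check
        for (uint64_t e = g.row_start[u]; e < g.row_start[u + 1]; ++e) {
          if (dist[u] + g.weight[e] < dist[g.dest[e]]) {
            std::fprintf(stderr, "edge %zu -> %u can still be relaxed\n",
                         u, g.dest[e]);
            return false;
          }
        }
      }
      return true;
    }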

2. Bellman-Ford (Node-based)

Implement Bellman-Ford in parallel. You will use pthreads to create a set of worker threads. Your implementation should divide the nodes of the graph evenly between threads, have each thread relax the edges of all nodes assigned to it, then decide whether any thread changed the graph and, if so, iterate. You will need to protect the nodes you update with a lock; you must use a pthread mutex to do so. You will also need to protect the computation and communication involved in deciding whether to iterate with some synchronization. Finally, you should use a pthread barrier to make sure all threads are done with relaxation before deciding whether to iterate.
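
The synchronization pattern can be hard to see from prose alone, so here is one possible skeleton of a worker's round loop (the shared variables, the per-node mutexes, and the barrier-based reset of the shared flag are all just one way to structure it):

    #include <cstdint>
    #include <limits>
    #include <pthread.h>
    #include <vector>

    static const uint64_t INF = std::numeric_limits<uint64_t>::max();

    pthread_barrier_t bar;        // initialized for num_threads
    pthread_mutex_t changed_lock = PTHREAD_MUTEX_INITIALIZER;
    bool changed = false;         // did any thread lower a distance this round?

    // Each worker owns nodes [begin, end); node_lock has one mutex per node.
    void worker_rounds(const CSRGraph& g, std::vector<uint64_t>& dist,
                       std::vector<pthread_mutex_t>& node_lock,
                       size_t begin, size_t end) {
      bool again = true;
      while (again) {
        bool local_changed = false;
        for (size_t u = begin; u < end; ++u) {
          pthread_mutex_lock(&node_lock[u]);
          uint64_t du = dist[u];           // snapshot u's distance safely
          pthread_mutex_unlock(&node_lock[u]);
          if (du == INF) continue;
          for (uint64_t e = g.row_start[u]; e < g.row_start[u + 1]; ++e) {
            uint32_t v = g.dest[e];
            pthread_mutex_lock(&node_lock[v]);
            if (du + g.weight[e] < dist[v]) {
              dist[v] = du + g.weight[e];
              local_changed = true;
            }
            pthread_mutex_unlock(&node_lock[v]);
          }
        }
        if (local_changed) {
          pthread_mutex_lock(&changed_lock);
          changed = true;
          pthread_mutex_unlock(&changed_lock);
        }
        pthread_barrier_wait(&bar);   // all relaxations for this round done
        again = changed;              // every thread reads the decision...
        if (pthread_barrier_wait(&bar) == PTHREAD_BARRIER_SERIAL_THREAD)
          changed = false;            // ...then exactly one thread resets it
        pthread_barrier_wait(&bar);   // reset is visible before the next round
      }
    }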

3. Round-based Relaxation

Bellman-Ford relaxes every edge in the graph even though many, if not most, edges do not need it. Implement an optimization that relaxes only the nodes that need it. To compute this, every time a relaxation lowers a distance, add the destination node to a list maintained by that thread. After all threads are done relaxing their nodes, merge the per-thread lists. If the merged list is not empty, distribute its nodes evenly between threads and iterate; if it is empty, no node needs relaxing and the algorithm is done. You may merge the lists serially or in parallel. Proper synchronization is still required, but it should be identical to that in part 2.
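
Because each thread appends only to its own list, the lists themselves need no locking; only the merge must wait until all threads reach the barrier. A minimal sketch of a serial merge step (all names illustrative):

    #include <cstdint>
    #include <vector>

    std::vector<std::vector<uint32_t>> next_work;  // one list per thread
    std::vector<uint32_t> current;                 // merged list for next round

    // Run by exactly one thread, between two barriers.
    void merge_lists() {
      current.clear();
      for (auto& list : next_work) {
        current.insert(current.end(), list.begin(), list.end());
        list.clear();
      }
      // 'current' may contain duplicates; removing them is optional but
      // reduces wasted work. If 'current' is empty, the algorithm is done.
    }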

Deliverable

Submit your source code and a short write-up. Measure running times for each implementation, using 1-12 threads for parts 2 and 3, and compare. Also measure the number of relaxations attempted (edges examined) and relaxations performed (node distances lowered) by each implementation at each thread count. This instrumentation has significant overhead, so do not use the instrumented code for the timing runs. Perform all measurements on the USA road network and the random graph.
If you have bandwidth limits, you can use the following links to download the graphs. These are hosted on CS servers, so they should bypass bandwidth-limit issues.
USA_Road_Network_CS Server link
Random_Graph_CS_Server_Link

Disk Quota Problems

If you have disk quota problems, you can use the following location to access the graphs directly without downloading them. The files there are already extracted, so you can use them directly when running your program.
Location: /u/pingali/public_html/CS378/2012fa/2012fa-assignments/graphs

If you turn in your results using the random graph (2^26 nodes) instead of the USA road network, you will still get full credit.

Week 2: Chaotic SSSP with Worklists

This week you will implement a chaotic worklist-based parallel SSSP. You will implement spinlocks, worklists, stealing, and a lock-free version of SSSP.

Implement Spinlocks

Implement a spinlock using the atomic intrinsics available in gcc/icc/clang. Modify the round-based parallel implementation to use this spinlock instead of a pthread mutex.
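
A minimal test-and-set spinlock built on the GCC/Clang __atomic builtins might look like this (the inner read-only spin, "test and test-and-set", is an optional refinement to reduce cache-line traffic):

    // Test-and-set spinlock using GCC/Clang atomic builtins.
    struct SpinLock {
      int locked = 0;
      void lock() {
        while (__atomic_exchange_n(&locked, 1, __ATOMIC_ACQUIRE)) {
          // Spin on plain loads while held, to avoid bouncing the cache line.
          while (__atomic_load_n(&locked, __ATOMIC_RELAXED)) { }
        }
      }
      void unlock() { __atomic_store_n(&locked, 0, __ATOMIC_RELEASE); }
    };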

Implement Stealing Worklists

The round-based implementation is a special case of a worklist. A worklist holds the work items that remain to be done; in Dijkstra's algorithm, the worklist was the priority queue. Implement per-thread worklists with stealing. Each thread should have a local priority queue (or an equivalent data structure, at your discretion). A thread pushes and pops work from its local queue. When that queue is empty, the thread "steals" from another queue; stealing is simply performing a pop on a non-local worklist. Thus if one thread is out of work, it searches for another thread's worklist that has work and pops from there. Because the worklists may be accessed by multiple threads, they should be protected with the spinlock you wrote for the first part of this week's assignment.
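
A sketch of the pop-or-steal logic, using the SpinLock above and a plain vector as the local container (a priority queue would work the same way; all names are illustrative):

    #include <cstdint>
    #include <vector>

    struct WorkList {
      SpinLock lock;
      std::vector<uint32_t> items;
      void push(uint32_t v) {
        lock.lock();
        items.push_back(v);
        lock.unlock();
      }
      bool try_pop(uint32_t& out) {
        lock.lock();
        bool ok = !items.empty();
        if (ok) { out = items.back(); items.pop_back(); }
        lock.unlock();
        return ok;
      }
    };

    // Local work first; if the local queue is empty, steal from any other.
    bool pop_or_steal(std::vector<WorkList>& wl, size_t tid, uint32_t& out) {
      if (wl[tid].try_pop(out)) return true;
      for (size_t i = 0; i < wl.size(); ++i)
        if (i != tid && wl[i].try_pop(out)) return true;
      return false;  // found nothing now; this does NOT mean the work is done
    }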

In this implementation you should no longer operate in rounds; instead, let each thread execute until there is no more work in the system. This is a chaotic-relaxation implementation, since nodes may not be processed in Dijkstra's order.

One difficulty is knowing when the algorithm is done. Just because a thread cannot find work to steal does not mean that all work is done: some threads may be actively executing work and will generate new work. The algorithm is finished only when every thread is simultaneously out of work.
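
One way to make termination exact (an assumption, not the required scheme) is a global counter of outstanding work items: increment it before every push, and decrement it only after a popped node's relaxations, including the pushes they trigger, are complete. A count of zero then really does mean no work exists anywhere. A sketch combining this with the worklists above:

    #include <atomic>
    #include <cstdint>
    #include <vector>

    std::atomic<long> pending{0};  // items pushed but not fully processed

    void worker(const CSRGraph& g, std::vector<uint64_t>& dist,
                std::vector<SpinLock>& node_lock,
                std::vector<WorkList>& wl, size_t tid) {
      uint32_t u;
      while (pending.load(std::memory_order_acquire) != 0) {
        if (!pop_or_steal(wl, tid, u)) continue;  // spin: work may still appear
        node_lock[u].lock();
        uint64_t du = dist[u];                    // snapshot u's distance
        node_lock[u].unlock();
        for (uint64_t e = g.row_start[u]; e < g.row_start[u + 1]; ++e) {
          uint32_t v = g.dest[e];
          bool lowered = false;
          node_lock[v].lock();
          if (du + g.weight[e] < dist[v]) {
            dist[v] = du + g.weight[e];
            lowered = true;
          }
          node_lock[v].unlock();
          if (lowered) {
            pending.fetch_add(1);   // count the new item BEFORE pushing it
            wl[tid].push(v);
          }
        }
        pending.fetch_sub(1, std::memory_order_release);  // u fully processed
      }
    }
    // Setup: dist[source] = 0; pending = 1; wl[0].push(source);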

Lock-free relax-edge

Rather than locking an edge's destination node to relax it, perform the relaxation as an atomic-minimum operation. Implement atomic-min and use it in relax-edge. Your implementation should report whether the distance was lowered, so you know whether to add the node to the worklist. Add this optimization to the worklist-based chaotic-relaxation implementation from part 2 of this week's assignment.
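
A compare-and-swap loop is the usual way to build atomic-min from the GCC/Clang builtins; the return value tells the caller whether to enqueue the destination:

    #include <cstdint>

    // Atomically sets *addr = min(*addr, val). Returns true iff this call
    // lowered the value, i.e. the destination node should be pushed.
    bool atomic_min(uint64_t* addr, uint64_t val) {
      uint64_t old = __atomic_load_n(addr, __ATOMIC_RELAXED);
      while (val < old) {
        if (__atomic_compare_exchange_n(addr, &old, val, /*weak=*/true,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED))
          return true;   // we installed the smaller value
        // on failure, 'old' was reloaded; loop exits if someone went lower
      }
      return false;
    }

Relax-edge then reduces to: if atomic_min(&dist[v], du + w) returns true, push v onto the worklist.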

Deliverables

Present your spinlock and atomic-minimum code and argue that each is correct. Compare performance and scaling for each section of this assignment with the best implementation from last week, using the input graph(s) of your choice.

Week 3: Delta-stepping

This week you will implement the delta-stepping algorithm for SSSP. This algorithm mixes the worklist and round-based parallelism styles from the previous weeks.

Delta-Stepping

SSSP is very sensitive to the order in which you process nodes. Dijkstra's algorithm follows a work-optimal schedule and does no more work than necessary. Chaotic relaxation minimizes synchronization by allowing all threads to schedule work independently, but this lets the aggregate schedule differ significantly from a work-optimal one. Delta-stepping attempts to achieve the best of both worlds: it tries to minimize synchronization while still keeping all threads processing high-priority work. It achieves this by partitioning the priority space into blocks and executing all work within a block chaotically. The size of a block is the 'delta' parameter. Thus in one step, all work with priority below delta is executed; in the next step, work with priority between delta and 2*delta; and so on. In SSSP, processing a node with distance D will not generate work with priority less than D, so we are guaranteed that when the work below delta is exhausted, no work below delta remains anywhere.

Implement delta-stepping. There will be an outer loop that walks through the delta ranges, with a barrier after each range. Within a range, use chaotic relaxation with work stealing. Since priority scheduling is already achieved by processing work in delta-sized blocks, you can use a fast unordered queue for the local queues (such as an std::deque).
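
One possible shape for the worker, reusing the WorkList, pop_or_steal, and atomic_min sketches from last week (all names illustrative; the inner drain loop is simplified, so a thread that transiently fails to steal may leave a few in-range items behind, which are then simply carried into the next step at the cost of an extra step, not correctness):

    #include <atomic>
    #include <cstdint>
    #include <deque>
    #include <pthread.h>
    #include <vector>

    std::atomic<int> have_work{0};  // threads with leftover or deferred work
    pthread_barrier_t bar;          // initialized for num_threads

    void delta_worker(const CSRGraph& g, std::vector<uint64_t>& dist,
                      uint64_t delta, std::vector<WorkList>& wl,
                      std::deque<uint32_t>& far,  // this thread's deferred nodes
                      size_t tid) {
      for (uint64_t bound = delta; ; bound += delta) {
        // Phase 1: drain work with distance below 'bound' chaotically.
        uint32_t u;
        while (pop_or_steal(wl, tid, u)) {
          uint64_t du = __atomic_load_n(&dist[u], __ATOMIC_RELAXED);
          if (du >= bound) { far.push_back(u); continue; }  // beyond this range
          for (uint64_t e = g.row_start[u]; e < g.row_start[u + 1]; ++e) {
            uint64_t nd = du + g.weight[e];
            if (atomic_min(&dist[g.dest[e]], nd)) {
              if (nd < bound) wl[tid].push(g.dest[e]);
              else far.push_back(g.dest[e]);
            }
          }
        }
        pthread_barrier_wait(&bar);   // everyone is done with this range
        // Phase 2: promote deferred nodes that fall into the next range.
        // No relaxations run between the barriers, so plain reads are safe.
        for (size_t i = 0; i < far.size(); ) {
          if (dist[far[i]] < bound + delta) {
            wl[tid].push(far[i]);
            far[i] = far.back();
            far.pop_back();
          } else ++i;
        }
        if (!wl[tid].items.empty() || !far.empty()) have_work.fetch_add(1);
        pthread_barrier_wait(&bar);   // all contributions counted
        bool again = have_work.load() != 0;
        if (pthread_barrier_wait(&bar) == PTHREAD_BARRIER_SERIAL_THREAD)
          have_work.store(0);         // one thread resets for the next step
        if (!again) break;
      }
    }
    // Setup: dist[source] = 0; wl[0].push(source);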

Deliverables

Write up performance and scaling for delta-stepping and compare with your implementations from previous weeks. Vary the value of delta and plot the best parallel runtime (which should be at the maximum thread count) vs. the delta value for different inputs. Measure how many nodes each thread processes in each step and check that the load balance is reasonable (do not collect these counts during the timing runs). Make sure you measure in a way that does not itself introduce contention, e.g., by using per-thread counters.

Sample Timings

599 ms (updated) for the USA road network at 12 threads.
5.2 seconds for the random graph at 12 threads.
New optimized time for the random graph: 3889 ms.
