CS 377P: Programming for Performance

Assignment 3: Operator formulation of algorithms

Due date: March 7th, 2017

Late submission policy: Submissions can be at most 2 days late. There will be a 10% penalty for each day after the due date (cumulative).

Clarifications to the assignment are posted at the bottom of the page.

Description

This assignment introduces you to the operator formulation of algorithms. The motto introduced in class is Algorithm = Operator + Schedule, and in this assignment, you will implement sequential algorithms for the single-source shortest-path (sssp) problem to understand this motto. Read the entire assignment before starting your coding, and get started early: this assignment requires more programming than previous assignments.

Key concepts

Recall that we classify algorithms into topology-driven and data-driven algorithms.

Topology-driven algorithms make a number of sweeps over the graph. At the start of the algorithm, node labels are initialized as needed by the algorithm (for example, for sssp, the label of the source node is initialized to zero and the labels of all other nodes are initialized to $\infty$). In each sweep, the operator is applied to all nodes. The algorithm terminates when a sweep does not modify the label of any node. In some problems, particularly those in which labels are floating-point numbers, we may never get to exact convergence so we terminate the algorithm when node updates are below some threshold or when some upper bound on the number of iterations is reached.

Data-driven algorithms maintain a work-list of active nodes. The work-list can be considered to be an abstract data type (class) that supports two methods: put and get. Active nodes are added to the work-list by invoking the put method with the set of active nodes. The work-list can be maintained either as a set (so no duplicates are allowed) or as a multi-set (duplicates are allowed). In this assignment, work-lists can be implemented as multi-sets so you do not need to check for duplicates. The get method returns an active node from the work-list if it is not empty, and removes it from the work-list. If there are multiple active nodes in the work-list, the schedule determines which one is returned. Applying the operator to an active node may change the labels of other nodes in the graph; if so, these nodes become active and are added to the work-list. For problems in which labels are floating-point numbers, we may choose not to activate a node if the change to its label is below some threshold. Data-driven algorithms terminate when the work-list is empty and all active nodes have been processed.
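One way to picture the work-list abstract data type is as a small C++ interface. The sketch below is an assumption about how put and get might be declared (the class names, the bool return of get, and the FIFO schedule are illustrative choices, not something the assignment prescribes).

```cpp
#include <deque>
#include <vector>

// Hypothetical interface for the work-list abstract data type.
class WorkList {
public:
    virtual ~WorkList() = default;
    // Adds a set of active nodes; duplicates are kept (multi-set semantics).
    virtual void put(const std::vector<int>& nodes) = 0;
    // Removes one active node and stores it in `node`.
    // Returns false if the work-list is empty.
    virtual bool get(int& node) = 0;
};

// One possible schedule: first-in, first-out.
class FifoWorkList : public WorkList {
public:
    void put(const std::vector<int>& nodes) override {
        items_.insert(items_.end(), nodes.begin(), nodes.end());
    }
    bool get(int& node) override {
        if (items_.empty()) return false;
        node = items_.front();
        items_.pop_front();
        return true;
    }
private:
    std::deque<int> items_;
};
```

Changing the schedule then means supplying a different class with the same two methods; the code that calls put and get stays the same.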

Graph formats

Input graphs will be given to you in DIMACS format, which is described at the end of this assignment. The output for each algorithm should be produced as a text file containing one line for each node, specifying the number of the node and the label of that node.

You can find all graphs for this assignment on Stampede here: /work/01131/rashid/class-inputs
We have provided the following graphs for sssp: power-law graphs rmat15, rmat20, rmat22, and rmat23, and road networks road-FLA (Florida road network) and road-NY (New York road network). Graphs like rmat22 and rmat23 are quite big so do not do any runs with them until your code has been debugged on some small graphs that you have constructed.

Coding

I/O routines for graphs: These routines will be important for debugging your programs so make sure they are working before starting the rest of the assignment.

Write a C++ routine that reads a graph in DIMACS format from a file, and constructs a Compressed-Sparse-Row (CSR) representation of that graph in memory. Node and edge labels can be ints for the graphs we are dealing with.
Write a C++ routine that takes a graph in CSR representation in memory, and prints it out to a file in DIMACS format.
Write a C++ routine that takes a graph in CSR representation in memory, and prints node numbers and node labels, one per line.
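A minimal sketch of the first routine, reading DIMACS text into CSR arrays, might look like the following. The struct layout, the function name read_dimacs, and the decision to keep the file's node numbers unchanged (with one spare slot so numbering from either 0 or 1 works) are assumptions for this example. Parallel edges are collapsed to the one with the largest weight, following the rule given in the notes at the end of this assignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical CSR layout: out-edges of node u are at positions
// row_start[u] .. row_start[u+1]-1 of dst and weight.
struct CsrGraph {
    int num_nodes = 0;
    std::vector<int> row_start;  // size num_nodes + 2 (node ids 0..num_nodes)
    std::vector<int> dst;
    std::vector<int> weight;
};

CsrGraph read_dimacs(std::istream& in) {
    CsrGraph g;
    std::vector<std::tuple<int, int, int>> edges;  // (s, d, w)
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == 'c') continue;  // blank or comment line
        std::istringstream fields(line);
        char tag = '\0';
        fields >> tag;
        if (tag == 'p') {
            std::string format;
            int m = 0;
            fields >> format >> g.num_nodes >> m;
            edges.reserve(static_cast<std::size_t>(m));
        } else if (tag == 'a') {
            int s = 0, d = 0, w = 0;
            fields >> s >> d >> w;
            edges.emplace_back(s, d, w);
        }
    }
    // Sort by (s, d, w): parallel edges become adjacent, largest weight last.
    std::sort(edges.begin(), edges.end());
    g.row_start.assign(static_cast<std::size_t>(g.num_nodes) + 2, 0);
    for (std::size_t i = 0; i < edges.size(); ++i) {
        const int s = std::get<0>(edges[i]);
        const int d = std::get<1>(edges[i]);
        // Skip this edge if the next one joins the same pair of nodes;
        // the next one has an equal or larger weight.
        if (i + 1 < edges.size() && std::get<0>(edges[i + 1]) == s &&
            std::get<1>(edges[i + 1]) == d) {
            continue;
        }
        g.dst.push_back(d);
        g.weight.push_back(std::get<2>(edges[i]));
        ++g.row_start[s + 1];  // count out-edges of s
    }
    // Prefix sum turns per-node counts into starting offsets.
    for (std::size_t u = 1; u < g.row_start.size(); ++u) {
        g.row_start[u] += g.row_start[u - 1];
    }
    return g;
}
```

To read from a file, a std::ifstream can be passed as the stream argument. Taking a std::istream rather than a file name also makes it easy to test the routine on a small string.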

Data-driven algorithms: Implement a routine that takes a graph G and a work-list w of active nodes as input, and performs a data-driven sssp computation on graph G. By passing different work-lists to this routine as described below, you can implement different data-driven algorithms for sssp without changing the code in your routine. Instrument your code to count the number of node and edge relaxations.

Graph initialization: read in the graph from the file, create the graph in CSR format in memory, and initialize node labels so that the source node has label 0 and all other nodes are initialized to a large positive number (you can use INT_MAX).
Chaotic relaxation sssp algorithm:

Implement a work-list called bag. The get method for this work-list should select a random active node from the nodes in the work-list.
You can use the rand function in C++ to generate random numbers; this webpage shows you how to generate random numbers within a particular range (http://www.cplusplus.com/reference/cstdlib/rand/). By using different seeds, you can generate different sequences of random numbers.
Chaotic relaxation can take a very long time even for small graphs for some schedules of node relaxations. Your code should terminate the computation if the number of relaxations exceeds some bound that depends on the size of the graph.
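A bag with a random get can be kept in a plain vector: pick a random index, swap that element to the back, and pop it. The sketch below is one possible shape for this, with the class name, method signatures, and seeding in the constructor all chosen for illustration. Note that std::srand sets global state, so constructing a second Bag reseeds the generator for both.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical bag work-list: get removes a uniformly chosen entry
// (up to the small bias of rand() % size).
class Bag {
public:
    explicit Bag(unsigned seed) { std::srand(seed); }

    // Duplicates are allowed (multi-set semantics).
    void put(int node) { items_.push_back(node); }

    // Stores a randomly chosen active node in `node` and removes it.
    // Returns false if the bag is empty.
    bool get(int& node) {
        if (items_.empty()) return false;
        const std::size_t i =
            static_cast<std::size_t>(std::rand()) % items_.size();
        std::swap(items_[i], items_.back());  // O(1) removal from a vector
        node = items_.back();
        items_.pop_back();
        return true;
    }

    bool empty() const { return items_.empty(); }

private:
    std::vector<int> items_;
};
```

The swap-and-pop step avoids shifting the remaining elements, so both put and get take constant time regardless of which entry is selected.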

Delta-stepping sssp algorithm:

Implement a work-list as a sequence of bags in which the first bag contains nodes with labels in the interval $[0,Δ)$, the second bag contains nodes with labels in the interval $[Δ,2Δ)$, etc. The get method should return a random node from the first non-empty bag. The value of Δ should be a parameter to the constructor for your work-list. For efficiency, your work-list can keep track of the first non-empty bag instead of searching the bags one at a time to find the first non-empty bag.
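The bucket structure can be sketched as a vector of bags indexed by label / Δ. In this illustration the class name, the put signature that takes a node together with its current label, and the first_ cursor are assumptions; random selection uses std::rand, which the caller may seed with std::srand.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical delta-stepping work-list: bag b holds nodes whose label was
// in [b * delta, (b + 1) * delta) at the time they were added.
class DeltaWorkList {
public:
    explicit DeltaWorkList(int delta) : delta_(delta) {}

    // Adds `node` to the bag selected by its non-negative `label`.
    void put(int node, int label) {
        const std::size_t b = static_cast<std::size_t>(label / delta_);
        if (b >= bags_.size()) bags_.resize(b + 1);
        bags_[b].push_back(node);
        if (b < first_) first_ = b;  // an earlier bag became non-empty
    }

    // Removes a random node from the first non-empty bag.
    // Returns false if every bag is empty.
    bool get(int& node) {
        while (first_ < bags_.size() && bags_[first_].empty()) ++first_;
        if (first_ == bags_.size()) return false;
        std::vector<int>& bag = bags_[first_];
        const std::size_t i =
            static_cast<std::size_t>(std::rand()) % bag.size();
        std::swap(bag[i], bag.back());
        node = bag.back();
        bag.pop_back();
        return true;
    }

private:
    int delta_;
    std::size_t first_ = 0;  // no non-empty bag has a smaller index
    std::vector<std::vector<int>> bags_;
};
```

A node whose label later decreases may still have a stale entry in a higher-numbered bag. With multi-set semantics that entry is simply processed again, which is harmless for correctness but adds relaxations. The number of bags grows with the largest label divided by Δ, so small values of Δ use more memory.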

Dijkstra's algorithm:

Setting Δ to one in the delta-stepping algorithm gives you Dijkstra's algorithm. You may get better performance by using a heap to implement the work-list but you do not need to implement this.

Experiments

Data-driven sssp algorithms

graphs: rmat15, rmat20, rmat22, rmat23, road-NY, road-FL.
[updated] source node for sssp computation: node 1 for all rmat graphs, node 140961 for road-NY, node 316607 for road-FL. These are the nodes with the highest degree.
Draw two small graphs with roughly 5 nodes and 20 edges, and generate files for them in DIMACS format. You should use these graphs to debug your code before using the bigger graphs we have provided to you.

Submit these two graphs with your report.

Write a routine that traverses a graph in CSR format and determines the number of the node with the largest out-degree. This is an exercise to check that you understand the CSR format and know how to use it for graph algorithms.

Report this node number for each of the graphs given to you (you should check that this is the same as the source node for sssp described above).
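In CSR form, the out-degree of node u is the difference between two consecutive row offsets, so the traversal only needs the offset array. The sketch below assumes an offset vector named row_start with one entry per node plus a final entry holding the total edge count, and it breaks ties in favor of the lowest-numbered node; both choices are assumptions for this example.

```cpp
#include <vector>

// Returns the index of the node with the largest out-degree, or -1 if the
// graph has no nodes. row_start[u] .. row_start[u+1]-1 are the positions of
// the out-edges of node u.
int max_out_degree_node(const std::vector<int>& row_start) {
    const int n = static_cast<int>(row_start.size()) - 1;
    int best = -1;
    int best_degree = -1;
    for (int u = 0; u < n; ++u) {
        const int degree = row_start[u + 1] - row_start[u];
        if (degree > best_degree) {  // strict: ties keep the earlier node
            best = u;
            best_degree = degree;
        }
    }
    return best;
}
```

For offsets {0, 2, 2, 5, 6} the out-degrees are 2, 0, 3, and 1, so the function returns node 2.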

Chaotic relaxation:

Experiment with three different seeds for the random number generator.

Report the running times, the number of node relaxations, and the number of edge relaxations for rmat15. If your code timed out, put some symbol like "*" in the table for that experiment.

Dijkstra's algorithm:

Run Dijkstra's algorithm on rmat15 and road-NY.

Report the number of node relaxations.
Compute analytically what this number should be, and compare it with the number from your experiment.
Output the final node labels for both graphs in the format specified in the Graph Formats section of this assignment.

Delta-stepping:

Determine experimentally the optimal values of $Δ$ for rmat15 and for road-NY, and report these in your submission.
Output the final node labels for both graphs.
Use the $Δ$ value you found for rmat15 to perform sssp for all the rmat graphs. Plot a graph in which the x-axis is the number of nodes in the rmat graph and the y-axis is the running time.
Plot a similar graph for the number of node relaxations.

Submission

Submit (in canvas) your code and all the items listed in the experiments above.

Grading

Code: 50 points

Experiments: 50 points

DIMACS format for graphs

One popular format for representing directed graphs as text files is the DIMACS format (undirected graphs are represented as a directed graph by representing each undirected edge as two directed edges). Files are assumed to be well-formed and internally consistent so it is not necessary to do any error checking. A line in a file must be one of the following.

Comments. Comment lines give human-readable information about the file and are ignored by programs. Comment lines can appear anywhere in the file. Each comment line begins with a lower-case character c.
```
c This is an example of a comment line.
```
Problem line. There is one problem line per input file. The problem line must appear before any node or edge descriptor lines. The problem line has the following format.
```
p FORMAT NODES EDGES
```
The lower-case character p signifies that this is the problem line. The FORMAT field should contain a mnemonic for the problem such as sssp. The NODES field contains an integer value specifying n, the number of nodes in the graph. The EDGES field contains an integer value specifying m, the number of edges in the graph. These two fields tell you how much storage to allocate for the CSR representation of the graph.
Edge Descriptors. There is one edge descriptor line for each edge in the graph, each with the following format. Each edge (s,d,w) from node s to node d with weight w appears exactly once in the input file.
```
a s d w
```
The lower-case character "a" signifies that this is an edge descriptor line. The "a" stands for arc, in case you are wondering.

Notes added after assignment was posted:

(2/21, 2:13 PM): You may use classes from the C++ STL and boost libraries if you wish.
(2/22, 5:36 PM): I changed the definition of edges in the DIMACS format. Edges in the file start with "a" (for arc).
(2/25: 12:09PM): Because of the generator used for rmat graphs, the files for some of the graphs may have multiple edges between the same pair of nodes. When building the CSR representation in memory, keep only the edge with the largest weight. For example, if you find edges (s d 1) and (s d 4), keep only the edge with weight 4. In principle, you can keep the smallest weight edge or follow some other rule, but I want everyone to follow the same rule to make grading easier. This has been discussed twice in piazza as well but feel free to post there if this is not clear.
(3/3: 6:04PM): Source nodes for SSSP computations have been updated above and in Piazza.
(3/4: 2:00PM/8:10PM): Here is the solution to the rmat15 sssp problem.