Project ideas for CS 378

You can either propose a project of your own or pick one of the following projects. In either case, you must let the TA know your choice by March 31, 2009.

General guidelines for the final project:

Project 1: Cache-oblivious dense MMM for the Power architecture

Supervisor: Keshav Pingali

Goal:
One approach to memory-hierarchy optimization is to use cache-oblivious algorithms. These algorithms are based on a divide-and-conquer strategy and are usually implemented using recursion. To produce an efficient implementation, it is necessary to stop the recursion once the problem becomes small enough, and to invoke a micro-kernel: straight-line code, typically obtained by fully unrolling the recursion, that multiplies matrices small enough for the computation to be performed in registers. Implement a cache-oblivious MMM for the Power architecture, and compare the performance of your code with that of GotoBLAS MMM or ATLAS-generated MMM.
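
To make the recursive structure concrete, here is a minimal, untuned sketch (all names and the cutoff value are placeholders, not part of the assignment; the micro-kernel is shown as a plain loop nest, whereas your version should be unrolled, register-tiled straight-line code):

    #define CUTOFF 32  /* illustrative: stop recursing below this size */

    /* C[m x n] += A[m x k] * B[k x n], row-major; lda/ldb/ldc are the
       leading dimensions of the full matrices, so submatrices can be
       addressed in place. */
    static void microkernel(int m, int n, int k,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc) {
        /* stand-in for the unrolled, register-tiled straight-line code */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int p = 0; p < k; p++)
                    C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
    }

    static void mmm(int m, int n, int k,
                    const double *A, int lda,
                    const double *B, int ldb,
                    double *C, int ldc) {
        if (m <= CUTOFF && n <= CUTOFF && k <= CUTOFF) {
            microkernel(m, n, k, A, lda, B, ldb, C, ldc);
        } else if (m >= n && m >= k) {       /* split rows of A and C */
            mmm(m/2, n, k, A, lda, B, ldb, C, ldc);
            mmm(m - m/2, n, k, A + (m/2)*lda, lda, B, ldb,
                C + (m/2)*ldc, ldc);
        } else if (n >= k) {                 /* split columns of B and C */
            mmm(m, n/2, k, A, lda, B, ldb, C, ldc);
            mmm(m, n - n/2, k, A, lda, B + n/2, ldb, C + n/2, ldc);
        } else {                             /* split k: two updates of C */
            mmm(m, n, k/2, A, lda, B, ldb, C, ldc);
            mmm(m, n, k - k/2, A + k/2, lda, B + (k/2)*ldb, ldb, C, ldc);
        }
    }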

Requirements:

Readings:

Project 2: CUDA implementation of sorting on NVIDIA Tesla C870

Supervisor: Martin Burtscher

Goal:
Figure out a good algorithm for sorting integers and floats on the NVIDIA Tesla C870. You should be familiar with conventional sorting algorithms such as quicksort and mergesort. There are also sorting networks: fixed sequences of compare-exchange operations whose data-independent structure makes them well suited to parallel hardware. The first reference below describes sorting algorithms and sorting networks in detail.
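
As a starting point, here is a minimal CUDA sketch of bitonic sort, one classic sorting network (names and launch configuration are illustrative and untuned; it assumes n is a power of two and a multiple of the block size):

    #include <cuda_runtime.h>

    /* One compare-exchange step of a bitonic sorting network over n
       elements; the full sort launches this kernel O(log^2 n) times. */
    __global__ void bitonicStep(float *d, int j, int k) {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int partner = i ^ j;        /* element compared with i */
        if (partner > i) {
            bool ascending = ((i & k) == 0); /* direction of this run */
            if ((d[i] > d[partner]) == ascending) {
                float tmp = d[i];            /* swap out-of-order pair */
                d[i] = d[partner];
                d[partner] = tmp;
            }
        }
    }

    /* Host driver: d_data is a device pointer; n must be a power of two
       and a multiple of the block size. */
    void bitonicSort(float *d_data, int n) {
        int threads = 256;
        int blocks = n / threads;
        for (int k = 2; k <= n; k <<= 1)
            for (int j = k >> 1; j > 0; j >>= 1)
                bitonicStep<<<blocks, threads>>>(d_data, j, k);
        cudaThreadSynchronize();             /* wait for the final step */
    }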

Requirements:

Readings:

Project 3: CUDA implementation of Jacobi's method for the Poisson equation on NVIDIA Tesla C870

Supervisor: Martin Burtscher

Goal:
As discussed in class, Jacobi's method can be used to solve a finite-difference approximation of the Poisson equation, using a 5-point stencil in the two-dimensional case. This code is simple to parallelize by keeping "old" and "new" values at each grid point. However, it can also be parallelized by updating grid points in place, in arbitrary order, using whichever neighboring values (old or new) happen to be available. This avoids synchronization overhead, so each step may be faster, but the method may take more steps to converge. This project asks you to implement both versions on a GPU.
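
For reference, a minimal CUDA sketch of one sweep of the first, synchronized version might look as follows (all names are illustrative; the host swaps the two arrays between kernel launches, and the random-update variant would instead read and write a single array in place):

    #include <cuda_runtime.h>

    /* One synchronized Jacobi sweep of the 5-point stencil on the
       N x N interior of an (N+2) x (N+2) grid with a boundary halo;
       f is the right-hand side of the discretized Poisson equation and
       h is the grid spacing. Launch with, e.g., 16 x 16 thread blocks
       covering the interior; swap oldv and newv between launches. */
    __global__ void jacobiStep(const float *oldv, float *newv,
                               const float *f, int N, float h) {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1; /* skip halo */
        int i = blockIdx.y * blockDim.y + threadIdx.y + 1;
        if (i <= N && j <= N) {
            int w = N + 2;                   /* row width incl. halo */
            newv[i*w + j] = 0.25f * (oldv[(i-1)*w + j] + oldv[(i+1)*w + j]
                                   + oldv[i*w + j-1] + oldv[i*w + j+1]
                                   - h * h * f[i*w + j]);
        }
    }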

Requirements:

Reading:

Project 4: Partitioned Parallel Delaunay Refinement

Supervisor: Milind Kulkarni

Goal:
As we discussed in class, there are many approaches one can take to parallelize Delaunay mesh refinement. One approach (which we developed in class) is to partition a Delaunay mesh among multiple processors and treat each partition as a separate mesh that can then be refined independently. In this approach, fixing a badly shaped triangle near the boundary of a partition requires "splitting" edges at the partition boundary. To ensure that the resulting mesh is consistent, these split edges must be communicated to the neighboring partition (see [1] for an explanation of this, or come talk to Milind). In this project, you will implement a shared-memory parallel, partitioned version of Delaunay mesh refinement that uses this approach. We will provide a sequential implementation of Delaunay mesh refinement (in Java), code that will partition the mesh for you (i.e., given an input mesh, it will assign its triangles to different partitions), and a variety of input data sets.
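
The project itself builds on the Java code we provide; purely to illustrate the communication structure, here is a small runnable C++ sketch of the per-partition "mailbox" through which split boundary edges are passed (everything here is illustrative, and a split is reduced to the integer id of the split edge, since the real mesh types come from the provided code):

    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    /* Per-partition inbox for splits arriving from neighbors. */
    struct Inbox {
        std::mutex m;
        std::queue<int> splits;
        void push(int e) { std::lock_guard<std::mutex> g(m); splits.push(e); }
        bool pop(int &e) {
            std::lock_guard<std::mutex> g(m);
            if (splits.empty()) return false;
            e = splits.front(); splits.pop();
            return true;
        }
    };

    int main() {
        Inbox inboxOfP1;
        /* Partition 0: refining bad triangles near the boundary splits
           edges that partition 1 must also apply for consistency. */
        std::thread p0([&] {
            for (int edge = 0; edge < 4; edge++) inboxOfP1.push(edge);
        });
        /* Partition 1: drains its inbox between refinement steps. */
        std::thread p1([&] {
            int e, applied = 0;
            while (applied < 4)
                if (inboxOfP1.pop(e)) {
                    std::cout << "partition 1 splits edge " << e << "\n";
                    applied++;
                }
        });
        p0.join();
        p1.join();
        return 0;
    }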

Requirements:

References:
[1] A paper which, in section 2.1, describes the type of parallelism you will exploit
[2] A description of the basic Delaunay mesh refinement algorithm

Project 5: Parallel AVI Simulation

Supervisor: Milind Kulkarni

Goal:
In class, we have discussed finite-element analysis as a means of simulating complex physical systems. Asynchronous Variational Integrators (AVIs) are a means of solving certain kinds of finite-element problems. At a very high level, the general structure of AVIs is as follows. The finite elements are represented as a graph, with each element represented by a node, and edges between adjacent elements (think of how we represented the mesh in Delaunay mesh refinement). Each element has a timestamp associated with it; updating an element requires reading its neighbors and advancing its timestamp. An element can be updated if its timestamp is less than those of all of its neighbors. The algorithm terminates when all elements' timestamps reach a final value.

In [1], Huang et al. present PAVI, an approach to running AVI algorithms in parallel. It finds a set of elements whose timestamps are local minima, updates them in parallel, and then finds the next set of elements that can be updated. In this project, you will implement a parallel AVI framework based on the algorithm in [1]. We will provide a synthetic workload to use in evaluating your framework.
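
To illustrate the schedule (not the physics), here is a small runnable C++ sketch that repeatedly finds the elements whose timestamps are local minima and advances them; the graph, time step, and final time are illustrative stand-ins for the synthetic workload we will provide:

    #include <iostream>
    #include <vector>

    int main() {
        const double tEnd = 1.0, dt = 0.25;   /* illustrative values */
        /* adjacency list of a tiny element graph */
        std::vector<std::vector<int> > nbr = {{1,2},{0,2},{0,1,3},{2}};
        std::vector<double> t = {0.0, 0.1, 0.2, 0.3}; /* timestamps */

        for (;;) {
            std::vector<int> frontier;        /* local-minimum elements */
            for (int e = 0; e < (int)t.size(); e++) {
                if (t[e] >= tEnd) continue;   /* already finished */
                bool isMin = true;
                for (int n : nbr[e])
                    /* tie-break equal stamps by id so that no two
                       frontier elements are adjacent to each other */
                    isMin = isMin && (t[e] < t[n] ||
                                      (t[e] == t[n] && e < n));
                if (isMin) frontier.push_back(e);
            }
            if (frontier.empty()) break;      /* all reached tEnd */
            for (int e : frontier) {          /* independent, so the real
                                                 framework updates these
                                                 in parallel */
                t[e] += dt;  /* a real update also reads the neighbors */
                std::cout << "update element " << e << " -> " << t[e] << "\n";
            }
        }
        return 0;
    }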

Requirements:

References:
[1] J. Huang, X. Jiao, R. Fujimoto, and H. Zha. DAG-Guided Parallel Asynchronous Variational Integrators with Super-Elements.

Project 6: GPU Computing for the SWAMP Sequence Alignment

Supervisor: Zifei Zhong

Goal:
Sequence alignment is used to discover structural, and hence functional, similarities between biological sequences. SWAMP (Smith-Waterman using Associative Massive Parallelism) is a suite of alignment algorithms that combines the high-sensitivity approach first used by Smith-Waterman with the associative parallel computing model known as ASC, using techniques designed to exploit ASC's strengths. You are asked to implement SWAMP on a GPU.
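
For orientation, here is a minimal runnable C++ sketch of the Smith-Waterman recurrence that SWAMP builds on, with a linear gap penalty (the sequences and scores are illustrative; note that the cells along each anti-diagonal of H are independent, which is the parallelism SWAMP, and a GPU port, would exploit):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::string a = "ACACACTA", b = "AGCACACA";
        const int match = 2, mismatch = -1, gap = 1; /* illustrative */
        int m = (int)a.size(), n = (int)b.size(), best = 0;
        /* H[i][j]: best score of a local alignment ending at (i, j) */
        std::vector<std::vector<int> > H(m + 1, std::vector<int>(n + 1, 0));
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++) {
                int s = (a[i-1] == b[j-1]) ? match : mismatch;
                H[i][j] = std::max({0, H[i-1][j-1] + s,
                                    H[i-1][j] - gap, H[i][j-1] - gap});
                best = std::max(best, H[i][j]);
            }
        std::cout << "best local alignment score: " << best << "\n";
        return 0;
    }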

Requirements:

References:

  1. SWAMP: Smith-Waterman using Associative Massive Parallelism
  2. A Local Sequence Alignment Algorithm Using an Associative Model of Parallel Computation
  3. GPU Computing for the SWAMP Sequence Alignment