Project ideas for CS 378

You can either propose a project of your own or pick one of the following projects. In either case, you must let the TA know your choice by March 31, 2009.

General guidelines for the final project:

Project 1: Cache-oblivious dense MMM for the Power architecture

Supervisor: Keshav Pingali

Goal:
One approach to memory-hierarchy optimization is to use cache-oblivious algorithms. These algorithms are based on a divide-and-conquer strategy and are usually implemented using recursion. To produce an efficient implementation, it is necessary to stop the recursion once the problem becomes small enough, and to invoke a micro-kernel: straight-line code, typically obtained by fully unrolling the recursion, that multiplies matrices small enough for the computation to be performed in registers. Implement a cache-oblivious MMM for the Power architecture, and compare the performance of your code with that of GotoBLAS MMM or ATLAS-generated MMM.
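
To make the recursive structure concrete, here is a minimal, untuned sketch (all names and the cutoff value are placeholders, not part of the assignment; the micro-kernel is shown as a plain loop nest, whereas your version should be unrolled, register-tiled straight-line code):

    #define CUTOFF 32  /* illustrative: stop recursing below this size */

    /* C[m x n] += A[m x k] * B[k x n], row-major; lda/ldb/ldc are the
       leading dimensions of the full matrices, so submatrices can be
       addressed in place. */
    static void microkernel(int m, int n, int k,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc) {
        /* stand-in for the unrolled, register-tiled straight-line code */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int p = 0; p < k; p++)
                    C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
    }

    static void mmm(int m, int n, int k,
                    const double *A, int lda,
                    const double *B, int ldb,
                    double *C, int ldc) {
        if (m <= CUTOFF && n <= CUTOFF && k <= CUTOFF) {
            microkernel(m, n, k, A, lda, B, ldb, C, ldc);
        } else if (m >= n && m >= k) {       /* split rows of A and C */
            mmm(m/2, n, k, A, lda, B, ldb, C, ldc);
            mmm(m - m/2, n, k, A + (m/2)*lda, lda, B, ldb,
                C + (m/2)*ldc, ldc);
        } else if (n >= k) {                 /* split columns of B and C */
            mmm(m, n/2, k, A, lda, B, ldb, C, ldc);
            mmm(m, n - n/2, k, A, lda, B + n/2, ldb, C + n/2, ldc);
        } else {                             /* split k: two updates of C */
            mmm(m, n, k/2, A, lda, B, ldb, C, ldc);
            mmm(m, n, k - k/2, A + k/2, lda, B + (k/2)*ldb, ldb, C, ldc);
        }
    }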

Requirements:

Readings:

Project 2: CUDA implementation of sorting on NVIDIA Tesla C870

Supervisor: Martin Burtscher

Goal:
Figure out a good algorithm for sorting integers and floats on the NVIDIA Tesla C870. You should be familiar with conventional sorting algorithms such as quicksort and mergesort. There are also sorting networks: fixed sequences of compare-exchange operations whose data-independent structure makes them well suited to parallel hardware. The first reference below describes sorting algorithms and sorting networks in detail.
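
As a starting point, here is a minimal CUDA sketch of bitonic sort, one classic sorting network (names and launch configuration are illustrative and untuned; it assumes n is a power of two and a multiple of the block size):

    #include <cuda_runtime.h>

    /* One compare-exchange step of a bitonic sorting network over n
       elements; the full sort launches this kernel O(log^2 n) times. */
    __global__ void bitonicStep(float *d, int j, int k) {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int partner = i ^ j;        /* element compared with i */
        if (partner > i) {
            bool ascending = ((i & k) == 0); /* direction of this run */
            if ((d[i] > d[partner]) == ascending) {
                float tmp = d[i];            /* swap out-of-order pair */
                d[i] = d[partner];
                d[partner] = tmp;
            }
        }
    }

    /* Host driver: d_data is a device pointer; n must be a power of two
       and a multiple of the block size. */
    void bitonicSort(float *d_data, int n) {
        int threads = 256;
        int blocks = n / threads;
        for (int k = 2; k <= n; k <<= 1)
            for (int j = k >> 1; j > 0; j >>= 1)
                bitonicStep<<<blocks, threads>>>(d_data, j, k);
        cudaThreadSynchronize();             /* wait for the final step */
    }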

Requirements:

Readings:

Project 3: CUDA implementation of Jacobi's method for the Poisson equation on NVIDIA Tesla C870

Supervisor: Martin Burtscher

Goal:
As discussed in class, Jacobi's method can be used to solve a finite-difference approximation of the Poisson equation, using a 5-point stencil in the two-dimensional case. This code is simple to parallelize by keeping "old" and "new" values at each grid point. However, it can also be parallelized by updating grid points in place, in arbitrary order, using whichever neighboring values (old or new) happen to be available. This avoids synchronization overhead, so each step may be faster, but the method may take more steps to converge. This project asks you to implement both versions on a GPU.
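
For reference, a minimal CUDA sketch of one sweep of the first, synchronized version might look as follows (all names are illustrative; the host swaps the two arrays between kernel launches, and the random-update variant would instead read and write a single array in place):

    #include <cuda_runtime.h>

    /* One synchronized Jacobi sweep of the 5-point stencil on the
       N x N interior of an (N+2) x (N+2) grid with a boundary halo;
       f is the right-hand side of the discretized Poisson equation and
       h is the grid spacing. Launch with, e.g., 16 x 16 thread blocks
       covering the interior; swap oldv and newv between launches. */
    __global__ void jacobiStep(const float *oldv, float *newv,
                               const float *f, int N, float h) {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1; /* skip halo */
        int i = blockIdx.y * blockDim.y + threadIdx.y + 1;
        if (i <= N && j <= N) {
            int w = N + 2;                   /* row width incl. halo */
            newv[i*w + j] = 0.25f * (oldv[(i-1)*w + j] + oldv[(i+1)*w + j]
                                   + oldv[i*w + j-1] + oldv[i*w + j+1]
                                   - h * h * f[i*w + j]);
        }
    }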

Requirements:

Reading:

Project 4: Partitioned Parallel Delaunay Refinement

Supervisor: Milind Kulkarni

Goal:
As we discussed in class, there are many approaches one can take to parallelize Delaunay mesh refinement. One approach (which we developed in class) is to partition a Delaunay mesh among multiple processors and treat each partition as a separate mesh that can then be refined independently. In this approach, fixing a badly shaped triangle near the boundary of a partition requires "splitting" edges at the partition boundary. To ensure that the resulting mesh is consistent, these split edges must be communicated to the neighboring partition (see [1] for an explanation of this, or come talk to Milind). In this project, you will implement a shared-memory parallel, partitioned version of Delaunay mesh refinement that uses this approach. We will provide a sequential implementation of Delaunay mesh refinement (in Java), code that will partition the mesh for you (i.e., given an input mesh, it will assign its triangles to different partitions), and a variety of input data sets.
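
The project itself builds on the Java code we provide; purely to illustrate the communication structure, here is a small runnable C++ sketch of the per-partition "mailbox" through which split boundary edges are passed (everything here is illustrative, and a split is reduced to the integer id of the split edge, since the real mesh types come from the provided code):

    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    /* Per-partition inbox for splits arriving from neighbors. */
    struct Inbox {
        std::mutex m;
        std::queue<int> splits;
        void push(int e) { std::lock_guard<std::mutex> g(m); splits.push(e); }
        bool pop(int &e) {
            std::lock_guard<std::mutex> g(m);
            if (splits.empty()) return false;
            e = splits.front(); splits.pop();
            return true;
        }
    };

    int main() {
        Inbox inboxOfP1;
        /* Partition 0: refining bad triangles near the boundary splits
           edges that partition 1 must also apply for consistency. */
        std::thread p0([&] {
            for (int edge = 0; edge < 4; edge++) inboxOfP1.push(edge);
        });
        /* Partition 1: drains its inbox between refinement steps. */
        std::thread p1([&] {
            int e, applied = 0;
            while (applied < 4)
                if (inboxOfP1.pop(e)) {
                    std::cout << "partition 1 splits edge " << e << "\n";
                    applied++;
                }
        });
        p0.join();
        p1.join();
        return 0;
    }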

Requirements:

References:
[1] A paper which, in section 2.1, describes the type of parallelism you will exploit
[2] A description of the basic Delaunay mesh refinement algorithm

Project 5: Parallel AVI Simulation

Supervisor: Milind Kulkarni

Goal:
In class, we have discussed finite-element analysis as a means of simulating complex physical systems. Asynchronous Variational Integrators (AVIs) are a means of solving certain kinds of finite-element problems. At a very high level, the general structure of AVIs is as follows. The finite elements are represented as a graph, with each element represented by a node, and edges between adjacent elements (think of how we represented the mesh in Delaunay mesh refinement). Each element has a timestamp associated with it; updating an element requires reading its neighbors and advancing its timestamp. An element can be updated if its timestamp is less than those of all of its neighbors. The algorithm terminates when all elements' timestamps reach a final value.

In [1], Huang et al. present PAVI, an approach to running AVI algorithms in parallel. It finds a set of elements whose timestamps are local minima, updates them in parallel, and then finds the next set of elements that can be updated. In this project, you will implement a parallel AVI framework based on the algorithm in [1]. We will provide a synthetic workload to use in evaluating your framework.
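
To illustrate the schedule (not the physics), here is a small runnable C++ sketch that repeatedly finds the elements whose timestamps are local minima and advances them; the graph, time step, and final time are illustrative stand-ins for the synthetic workload we will provide:

    #include <iostream>
    #include <vector>

    int main() {
        const double tEnd = 1.0, dt = 0.25;   /* illustrative values */
        /* adjacency list of a tiny element graph */
        std::vector<std::vector<int> > nbr = {{1,2},{0,2},{0,1,3},{2}};
        std::vector<double> t = {0.0, 0.1, 0.2, 0.3}; /* timestamps */

        for (;;) {
            std::vector<int> frontier;        /* local-minimum elements */
            for (int e = 0; e < (int)t.size(); e++) {
                if (t[e] >= tEnd) continue;   /* already finished */
                bool isMin = true;
                for (int n : nbr[e])
                    /* tie-break equal stamps by id so that no two
                       frontier elements are adjacent to each other */
                    isMin = isMin && (t[e] < t[n] ||
                                      (t[e] == t[n] && e < n));
                if (isMin) frontier.push_back(e);
            }
            if (frontier.empty()) break;      /* all reached tEnd */
            for (int e : frontier) {          /* independent, so the real
                                                 framework updates these
                                                 in parallel */
                t[e] += dt;  /* a real update also reads the neighbors */
                std::cout << "update element " << e << " -> " << t[e] << "\n";
            }
        }
        return 0;
    }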

Requirements:

References:
[1] J. Huang, X. Jiao, R. Fujimoto, and H. Zha. DAG-Guided Parallel Asynchronous Variational Integrators with Super-Elements.

Project 6: GPU Computing for the SWAMP Sequence Alignment

Supervisor: Zifei Zhong

Goal:
Sequence alignment is used to discover structural, and hence functional, similarities between biological sequences. SWAMP (Smith-Waterman using Associative Massive Parallelism) is a suite of alignment algorithms that combines the high-sensitivity approach first used by Smith-Waterman with the associative parallel computing model known as ASC, using techniques designed to exploit ASC's strengths. You are asked to implement SWAMP on a GPU.
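
For orientation, here is a minimal runnable C++ sketch of the Smith-Waterman recurrence that SWAMP builds on, with a linear gap penalty (the sequences and scores are illustrative; note that the cells along each anti-diagonal of H are independent, which is the parallelism SWAMP, and a GPU port, would exploit):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::string a = "ACACACTA", b = "AGCACACA";
        const int match = 2, mismatch = -1, gap = 1; /* illustrative */
        int m = (int)a.size(), n = (int)b.size(), best = 0;
        /* H[i][j]: best score of a local alignment ending at (i, j) */
        std::vector<std::vector<int> > H(m + 1, std::vector<int>(n + 1, 0));
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++) {
                int s = (a[i-1] == b[j-1]) ? match : mismatch;
                H[i][j] = std::max({0, H[i-1][j-1] + s,
                                    H[i-1][j] - gap, H[i][j-1] - gap});
                best = std::max(best, H[i][j]);
            }
        std::cout << "best local alignment score: " << best << "\n";
        return 0;
    }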

Requirements:

References:

  1. SWAMP: Smith-Waterman using Associative Massive Parallelism
  2. A Local Sequence Alignment Algorithm Using an Associative Model of Parallel Computation
  3. GPU Computing for the SWAMP Sequence Alignment