Project ideas for CS 378


You can either propose a project of your own or you can pick one of the following projects.
In either case, you must let Sachin know your choice by March 20th.

General guidelines for the final project:

Project 1: Optimized dense MMM for the Power architecture

Supervisor: Keshav Pingali

Goal: Implement an optimized MMM for the Power architecture, and compare the performance
of your code with the performance of library code.

Requirements:
Readings:

Project 2: Cache-oblivious dense MMM for the Power architecture

Supervisor: Keshav Pingali

Goal: One approach to memory hierarchy optimization is to use cache-oblivious algorithms.
These algorithms are based on a divide-and-conquer strategy and are usually implemented
using recursion. To produce an efficient implementation, it is necessary to stop the recursion once
the problem size becomes small enough, and invoke a micro-kernel: straight-line
code that multiplies matrices small enough that the computation can be performed in
registers. Implement a cache-oblivious MMM for the Power architecture, and compare the performance
of your code with the performance of GotoBLAS MMM or ATLAS-generated MMM.

Requirements:

Readings:

Project 3: CUDA implementation of sparse MVM for NVIDIA Tesla C870

Supervisor: Martin Burtscher

Goal: As we saw in class, sparse matrix-vector multiplication (MVM) is one of the most important kernels in computational science. In this project, you will use the CUDA programming model to implement a highly optimized MVM on the NVIDIA Tesla C870 GPU. There are many representations of sparse matrices that are used in practice. Three of the simplest ones are compressed row storage (CRS), compressed column storage (CCS), and coordinate storage (COO). In this project, you must explore how to implement sparse MVM for sparse matrices stored in different formats. The Sparsity webpage and the last reference in the readings below have lots of sparse matrix data sets.

Requirements:
Readings:
Project 4: Cell implementation of sparse MVM

Supervisor: Sid Chatterjee

Goal: As we saw in class, sparse matrix-vector multiplication (MVM) is one of the most important kernels in computational science. In this project, you will implement sparse MVM on the Cell processor. There are many representations of sparse matrices that are used in practice. Three of the simplest ones are compressed row storage (CRS), compressed column storage (CCS), and coordinate storage (COO). In this project, you must explore how to implement sparse MVM for sparse matrices stored in different formats. The Sparsity webpage and the last reference in the readings below have lots of sparse matrix data sets.

Requirements:
Readings:

Project 5: CUDA implementation of sorting on NVIDIA Tesla C870

Supervisor: Martin Burtscher

Goal: Figure out a good algorithm for sorting integers and floats on the NVIDIA Tesla C870. You should be familiar with conventional sorting algorithms such as quicksort and mergesort. There are also data-oblivious sorting algorithms, known as sorting networks, that are designed for parallel computers. The first reference below describes sorting algorithms and networks in detail.

Requirements:
Readings:

Project 6: CUDA implementation of n-body methods on NVIDIA Tesla C870

Supervisor: Martin Burtscher

Goal: As you saw in lecture, there are two broad categories of algorithms for physical simulations: particle methods and continuous methods. The simplest particle method computes all pairwise interactions at each
time step, and is therefore O(n^2) in complexity, where n is the number of particles. More efficient approximate methods are available; for example, in lecture you were introduced to the Barnes-Hut method, which approximates clusters of distant particles by their center of mass. In this project, you must implement Barnes-Hut in CUDA and run it on the NVIDIA Tesla C870.

Readings:

Project 7: Visualization for mambo

Supervisor: Sid Chatterjee

Goal: The current output facilities for mambo are statistics-oriented, but
there are enough hooks inside the simulator to enable richer visualization.
The goal of this project is to build a visualizer for mambo that will allow
the user to obtain a more detailed understanding of program performance
through a GUI.

Requirements:
You must produce a standalone visualizer that will work with the mambo
simulation infrastructure and allow graphical visualization of interesting
simulation events, such as the behavior of cache misses or pipeline stalls
as a function of time. Send mail to Sid Chatterjee (sc@us.ibm.com) for more
details.



Project 8: Performance optimizations in multicore libraries

Supervisor: Sid Chatterjee

Goal: As we saw in the discussion of the Goto BLAS, high-performance
libraries typically use a number of different performance optimizations in
combination. The goal of this project is to take a well-tuned multicore
library (for either Cell or x86), analyze the various optimizations used in
it, and write up a report based on this analysis.

Requirements:
You must select a library from a non-numerical domain (e.g., not BLAS, FFT,
or the like). Choose a library for which you have source code available. Devise
a set of experiments to extract the components of performance. Write up a
final report based on the results of your experiments.