Project ideas for CS 378
You can either propose a project of your own or you can pick one of the
following projects.
In either case, you must let Sachin know your choice by March 20th.
General guidelines for final project:
- Each project has a supervisor. Work closely with your supervisor;
you should meet once a week to assess progress.
- Each project has a reading list. These should be used as starting
points for a more thorough literature search.
- At the end of the project, you must turn in a final report that
describes the problem you addressed, summarizes the approach you took
to solve the problem, and gives experimental results from your
implementation. It is due on the day of the last lecture.
Project 1: Optimized dense MMM for the Power architecture
Supervisor: Keshav Pingali
Goal: Implement an optimized MMM for the Power architecture, and compare
the performance of your code with the performance of library code.
Requirements:
- Your code can follow the structure of ATLAS-generated MMM or that
of the Goto BLAS.
- You must implement register tiling, as well as L1 and L2 cache
tiling.
- You must perform an empirical search to find the best tile sizes
and unroll factors.
- You must implement data copying to reduce conflict misses.
- You should investigate whether instruction scheduling and
software pipelining improve performance.
- You need to implement clean-up code so that your program can be
called with matrices of any size.
- You must compare the performance of your code with that of the MMM
code produced by ATLAS or provided in the GotoBLAS.
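A minimal C sketch of the tiling structure these requirements describe. The tile sizes NB, MU, and NU below are illustrative placeholders (the project requires finding good values by empirical search), and the clean-up code for arbitrary matrix sizes is omitted:

```c
#include <stddef.h>

/* Placeholder tile sizes -- in the project these must come from an
 * empirical search, not fixed constants. */
#define NB 64   /* L1 cache tile (assumption) */
#define MU 2    /* register tile rows (assumption) */
#define NU 2    /* register tile cols (assumption) */

/* One level of cache tiling plus a 2x2 register tile for C += A*B on
 * square n-by-n row-major matrices. Assumes n is a multiple of NB and
 * NB is a multiple of MU and NU; clean-up code for arbitrary n is left
 * as part of the project, as are L2 tiling and data copying. */
void mmm_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
    for (size_t jj = 0; jj < n; jj += NB)
    for (size_t kk = 0; kk < n; kk += NB)
        for (size_t i = ii; i < ii + NB; i += MU)
        for (size_t j = jj; j < jj + NB; j += NU) {
            /* register tile: keep an MU x NU block of C in scalars */
            double c00 = C[i*n + j],     c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j], c11 = C[(i+1)*n + j + 1];
            for (size_t k = kk; k < kk + NB; k++) {
                double a0 = A[i*n + k], a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
            }
            C[i*n + j] = c00;     C[i*n + j + 1] = c01;
            C[(i+1)*n + j] = c10; C[(i+1)*n + j + 1] = c11;
        }
}
```

The register tile is what instruction scheduling and software pipelining would then operate on.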
Readings:
Project 2: Cache-oblivious dense MMM for the Power architecture
Supervisor: Keshav Pingali
Goal: One approach to memory hierarchy optimization is to use
cache-oblivious algorithms. These algorithms are based on a
divide-and-conquer strategy and are usually implemented using recursion.
To produce an efficient implementation, it is necessary to stop the
recursion once the problem size becomes small enough, and invoke a
recursive micro-kernel: straight-line code that multiplies matrices
small enough that the computation can be performed in the registers.
Implement a cache-oblivious MMM for the Power architecture, and compare
the performance of your code with the performance of GotoBLAS MMM or
ATLAS-generated MMM.
Requirements:
- You must implement an optimized micro-kernel as described in the
Yotov et al paper.
- You must perform an empirical search to find the best
micro-kernel size.
- You must compare the performance of your code with that of the MMM
code produced by ATLAS or provided in the GotoBLAS.
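A minimal C sketch of the recursive structure, assuming power-of-two sizes. The simple triple loop in the base case merely stands in for the optimized micro-kernel of the Yotov et al paper, and CUTOFF is a placeholder for the empirically searched micro-kernel size:

```c
#include <stddef.h>

/* Placeholder cutoff below which the recursion stops; the project
 * requires searching for the best size and replacing the loop below
 * with an optimized straight-line micro-kernel. */
#define CUTOFF 16

/* Divide-and-conquer C += A*B, where A, B, and C are n-by-n submatrices
 * of row-major arrays with leading dimension ld; n is assumed to be a
 * power of two here for simplicity. */
static void mmm_rec(size_t n, size_t ld,
                    const double *A, const double *B, double *C)
{
    if (n <= CUTOFF) {
        /* stand-in for the optimized micro-kernel */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    size_t h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
    /* eight recursive multiplies: C11 += A11*B11 + A12*B21, etc. */
    mmm_rec(h, ld, A11, B11, C11); mmm_rec(h, ld, A12, B21, C11);
    mmm_rec(h, ld, A11, B12, C12); mmm_rec(h, ld, A12, B22, C12);
    mmm_rec(h, ld, A21, B11, C21); mmm_rec(h, ld, A22, B21, C21);
    mmm_rec(h, ld, A21, B12, C22); mmm_rec(h, ld, A22, B22, C22);
}
```

Because every recursive call halves the problem, the working set eventually fits in each level of the cache hierarchy without any cache-specific tile sizes.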
Readings:
Project 3: CUDA implementation of sparse MVM for NVIDIA Tesla C870
Supervisor: Martin Burtscher
Goal: As we saw in class, sparse matrix-vector multiplication (MVM) is
one of the most important kernels in computational science. In this
project, you will use the CUDA programming model to implement a highly
optimized MVM on the NVIDIA Tesla C870 GPU. There are many
representations of sparse matrices that are used in practice. Three of
the simplest ones are compressed row storage (CRS), compressed column
storage (CCS), and coordinate storage (COO). In this project, you must
explore how to implement sparse MVM for sparse matrices stored in
different formats. The Sparsity webpage and the last reference in the
readings below have lots of sparse matrix data sets.
Requirements:
- You must implement sparse MVM for at least the CRS and CCS formats.
The Owens et al paper is a survey that has pointers to other papers on
sparse MVM implementations on GPUs.
- Many sparse matrices have small dense blocks within them, and
exploiting these dense blocks can boost performance. This is a major
focus of the SPARSITY project. Figure out how to exploit dense blocks
in your code.
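For reference, a sequential C sketch of MVM over the CRS format; a CUDA version would typically map one thread or one warp to each row, but that mapping (and memory coalescing of the index and value arrays) is left to the project:

```c
#include <stddef.h>

/* y = A*x for an m-row sparse matrix in compressed row storage (CRS):
 * val holds the nonzeros row by row, col_idx their column indices, and
 * row_ptr[i]..row_ptr[i+1] delimits row i. Each row's dot product is
 * independent, which is what a GPU version would exploit. */
void spmv_crs(size_t m,
              const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i+1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```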
Readings:
Project 4: Cell implementation of sparse MVM
Supervisor: Sid Chatterjee
Goal: As we saw in class, sparse matrix-vector multiplication (MVM) is
one of the most important kernels in computational science. In this
project, you will implement sparse MVM on the Cell processor. There are
many representations of sparse matrices that are used in practice.
Three of the simplest ones are compressed row storage (CRS), compressed
column storage (CCS), and coordinate storage (COO). In this project,
you must explore how to implement sparse MVM for sparse matrices stored
in different formats. The Sparsity webpage and the last reference in
the readings below have lots of sparse matrix data sets.
Requirements:
- You must implement sparse MVM for at least the CRS and CCS formats.
- Many sparse matrices have small dense blocks within them, and
exploiting these dense blocks can boost performance. This is a major
focus of the SPARSITY project. Figure out how to exploit dense blocks
in your code.
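For reference, a sequential C sketch of MVM over the CCS format. Note that the updates scatter into y rather than reducing into a single sum per row, so a parallel implementation (for example across Cell SPEs) would have to deal with the resulting write conflicts, unlike the row-oriented CRS case:

```c
#include <stddef.h>

/* y = A*x for a matrix in compressed column storage (CCS): val holds
 * the nonzeros column by column, row_idx their row indices, and
 * col_ptr[j]..col_ptr[j+1] delimits column j. Column j contributes
 * x[j] * (its nonzeros) scattered into y. */
void spmv_ccs(size_t m, size_t ncols,
              const size_t *col_ptr, const size_t *row_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < ncols; j++)
        for (size_t k = col_ptr[j]; k < col_ptr[j+1]; k++)
            y[row_idx[k]] += val[k] * x[j];
}
```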
Readings:
Project 5: CUDA implementation of sorting on NVIDIA Tesla C870
Supervisor: Martin Burtscher
Goal: Figure out a good algorithm for sorting integers and floats on
the NVIDIA Tesla C870. You should be familiar with conventional sorting
algorithms such as quicksort and mergesort. There is also a class of
sorting algorithms, called sorting networks, that are designed for
parallel computers. The first reference below describes sorting
algorithms and sorting networks in detail.
Requirements:
- You must implement at least one conventional sorting algorithm
such as quicksort or mergesort. We will not accept dumb algorithms like
bubblesort.
- You must implement at least one sorting network such as bitonic
sort.
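As an illustration of the sorting-network idea, here is a serial C sketch of bitonic sort for power-of-two n. Each pass of the inner loop over i is one stage of independent compare-exchange operations, which is what makes the network attractive on a GPU (roughly one thread per comparator, with a synchronization between stages):

```c
#include <stddef.h>

/* Compare-exchange: order a[i] and a[j] according to direction. */
static void cmp_exchange(int *a, size_t i, size_t j, int ascending)
{
    if ((a[i] > a[j]) == ascending) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Bitonic sorting network for n a power of two (assumption). The outer
 * loop builds bitonic sequences of size k; the middle loop halves the
 * comparator stride j; the inner loop is a stage of n/2 independent
 * comparators, found by pairing each i with i ^ j. */
void bitonic_sort(int *a, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1)
        for (size_t j = k >> 1; j > 0; j >>= 1)
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i)
                    cmp_exchange(a, i, partner, (i & k) == 0);
            }
}
```

The network always performs O(n log^2 n) comparisons regardless of the input, a useful property on hardware where data-dependent control flow is expensive.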
Readings:
Project 6: CUDA implementation of n-body methods on NVIDIA Tesla C870
Supervisor: Martin Burtscher
Goal: As you saw in lecture, there are two broad categories of
algorithms for physical simulations: particle methods and continuous
methods. The simplest particle method computes all pairwise
interactions at each time step and is O(n^2) in complexity, where n is
the number of particles. More efficient approximate methods are
available; for example, in lecture you were introduced to the
Barnes-Hut method, which approximates clusters of distant particles by
their center of mass. In
this project, you must implement Barnes-Hut in CUDA and run it on the
NVIDIA Tesla C870.
Readings:
Project 7: Visualization for mambo
Supervisor: Sid Chatterjee
Goal: The current output facilities for mambo are statistics-oriented,
but there are enough hooks inside the simulator to enable richer
visualization. The goal of this project is to build a visualizer for
mambo that will allow the user to obtain a more detailed understanding
of program performance through a GUI.
Requirements:
You must produce a standalone visualizer that will work with the mambo
simulation infrastructure and allow graphical visualization of
interesting simulation events, such as the behavior of cache misses or
pipeline stalls as a function of time. Send mail to Sid Chatterjee
(sc@us.ibm.com) for more details.
Project 8: Performance optimizations in multicore libraries
Supervisor: Sid Chatterjee
Goal: As we saw in the discussion of the Goto BLAS, high-performance
libraries typically use a number of different performance optimizations
in combination. The goal of this project is to take a well-tuned
multicore library (for either Cell or x86), analyze the various
optimizations used in it, and write up a report based on this analysis.
Requirements:
You must select a library from a non-numerical domain (i.e., no BLAS,
FFT, etc.). Choose a library for which you have source code available.
Devise a set of experiments to extract the components of performance.
Write up a final report based on the results of your experiments.