Project ideas for CS 378
You can either propose a project of your own or you can pick one of the
following projects.
In either case, you must let Sachin know your choice by March 20th.
General guidelines for final project:
- Each project has a supervisor. Work closely with your supervisor;
you should meet once a week to assess progress.
- Each project has a reading list. These should be used as starting
points for a more thorough literature search.
- At the end of the project, you must turn in a final report that
describes the problem you addressed, summarizes the approach you took
to solve the problem, and gives experimental results from your
implementation. It is due on the day of the last lecture.
Project 1: Optimized dense MMM for the Power architecture
Supervisor: Keshav Pingali
Goal: Implement an optimized MMM for the Power architecture, and compare
the performance of your code with the performance of library code.
Requirements:
- Your code can follow the structure of ATLAS-generated MMM or that
of the Goto BLAS.
- You must implement register tiling, as well as L1 and L2 cache
tiling.
- You must perform an empirical search to find the best tile sizes
and unroll factors.
- You must implement data copying to reduce conflict misses.
- You should investigate whether instruction scheduling and
software pipelining improve performance.
- You need to implement clean-up code so that your program can be
called with matrices of any size.
- You must compare the performance of your code with that of the MMM
code produced by ATLAS or provided in the GotoBLAS.
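A minimal C sketch of the tiling structure these requirements describe. The tile sizes NB, MU, and NU below are illustrative placeholders (the project requires finding good values by empirical search), and the clean-up code for arbitrary matrix sizes is omitted:

```c
#include <stddef.h>

/* Placeholder tile sizes -- in the project these must come from an
 * empirical search, not fixed constants. */
#define NB 64   /* L1 cache tile (assumption) */
#define MU 2    /* register tile rows (assumption) */
#define NU 2    /* register tile cols (assumption) */

/* One level of cache tiling plus a 2x2 register tile for C += A*B on
 * square n-by-n row-major matrices. Assumes n is a multiple of NB and
 * NB is a multiple of MU and NU; clean-up code for arbitrary n is left
 * as part of the project, as are L2 tiling and data copying. */
void mmm_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
    for (size_t jj = 0; jj < n; jj += NB)
    for (size_t kk = 0; kk < n; kk += NB)
        for (size_t i = ii; i < ii + NB; i += MU)
        for (size_t j = jj; j < jj + NB; j += NU) {
            /* register tile: keep an MU x NU block of C in scalars */
            double c00 = C[i*n + j],     c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j], c11 = C[(i+1)*n + j + 1];
            for (size_t k = kk; k < kk + NB; k++) {
                double a0 = A[i*n + k], a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
            }
            C[i*n + j] = c00;     C[i*n + j + 1] = c01;
            C[(i+1)*n + j] = c10; C[(i+1)*n + j + 1] = c11;
        }
}
```

The register tile is what instruction scheduling and software pipelining would then operate on.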
Readings:
Project 2: Cache-oblivious dense MMM for the Power architecture
Supervisor: Keshav Pingali
Goal: One approach to memory hierarchy optimization is to use
cache-oblivious algorithms. These algorithms are based on a
divide-and-conquer strategy and are usually implemented using recursion.
To produce an efficient implementation, it is necessary to stop the
recursion once the problem size becomes small enough, and invoke a
recursive micro-kernel: straight-line code that multiplies matrices
small enough that the computation can be performed in the registers.
Implement a cache-oblivious MMM for the Power architecture, and compare
the performance of your code with the performance of GotoBLAS MMM or
ATLAS-generated MMM.
Requirements:
- You must implement an optimized micro-kernel as described in the
Yotov et al paper.
- You must perform an empirical search to find the best
micro-kernel size.
- You must compare the performance of your code with that of the MMM
code produced by ATLAS or provided in the GotoBLAS.
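A minimal C sketch of the recursive structure, assuming power-of-two sizes. The simple triple loop in the base case merely stands in for the optimized micro-kernel of the Yotov et al paper, and CUTOFF is a placeholder for the empirically searched micro-kernel size:

```c
#include <stddef.h>

/* Placeholder cutoff below which the recursion stops; the project
 * requires searching for the best size and replacing the loop below
 * with an optimized straight-line micro-kernel. */
#define CUTOFF 16

/* Divide-and-conquer C += A*B, where A, B, and C are n-by-n submatrices
 * of row-major arrays with leading dimension ld; n is assumed to be a
 * power of two here for simplicity. */
static void mmm_rec(size_t n, size_t ld,
                    const double *A, const double *B, double *C)
{
    if (n <= CUTOFF) {
        /* stand-in for the optimized micro-kernel */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    size_t h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
    /* eight recursive multiplies: C11 += A11*B11 + A12*B21, etc. */
    mmm_rec(h, ld, A11, B11, C11); mmm_rec(h, ld, A12, B21, C11);
    mmm_rec(h, ld, A11, B12, C12); mmm_rec(h, ld, A12, B22, C12);
    mmm_rec(h, ld, A21, B11, C21); mmm_rec(h, ld, A22, B21, C21);
    mmm_rec(h, ld, A21, B12, C22); mmm_rec(h, ld, A22, B22, C22);
}
```

Because every recursive call halves the problem, the working set eventually fits in each level of the cache hierarchy without any cache-specific tile sizes.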
Readings:
Project 3: CUDA implementation of sparse MVM for NVIDIA Tesla C870
Supervisor: Martin Burtscher
Goal: As we saw in class, sparse matrix-vector multiplication (MVM) is
one of the most important kernels in computational science. In this
project, you will use the CUDA programming model to implement a highly
optimized MVM on the NVIDIA Tesla C870 GPU. There are many
representations of sparse matrices that are used in practice. Three of
the simplest ones are compressed row storage (CRS), compressed column
storage (CCS), and coordinate storage (COO). In this project, you must
explore how to implement sparse MVM for sparse matrices stored in
different formats. The Sparsity webpage and the last reference in the
readings below have lots of sparse matrix data sets.
Requirements:
- You must implement sparse MVM for at least the CRS and CCS formats.
The Owens et al paper is a survey that has pointers to other papers on
sparse MVM implementations on GPUs.
- Many sparse matrices have small dense blocks within them, and
exploiting these dense blocks can boost performance. This is a major
focus of the SPARSITY project. Figure out how to exploit dense blocks
in your code.
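For reference, a sequential C sketch of MVM over the CRS format; a CUDA version would typically map one thread or one warp to each row, but that mapping (and memory coalescing of the index and value arrays) is left to the project:

```c
#include <stddef.h>

/* y = A*x for an m-row sparse matrix in compressed row storage (CRS):
 * val holds the nonzeros row by row, col_idx their column indices, and
 * row_ptr[i]..row_ptr[i+1] delimits row i. Each row's dot product is
 * independent, which is what a GPU version would exploit. */
void spmv_crs(size_t m,
              const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i+1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```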
Readings:
Project 4: Cell implementation of sparse MVM
Supervisor: Sid Chatterjee
Goal: As we saw in class, sparse matrix-vector multiplication (MVM) is
one of the most important kernels in computational science. In this
project, you will implement sparse MVM on the Cell processor. There are
many representations of sparse matrices that are used in practice.
Three of the simplest ones are compressed row storage (CRS), compressed
column storage (CCS), and coordinate storage (COO). In this project,
you must explore how to implement sparse MVM for sparse matrices stored
in different formats. The Sparsity webpage and the last reference in
the readings below have lots of sparse matrix data sets.
Requirements:
- You must implement sparse MVM for at least the CRS and CCS formats.
- Many sparse matrices have small dense blocks within them, and
exploiting these dense blocks can boost performance. This is a major
focus of the SPARSITY project. Figure out how to exploit dense blocks
in your code.
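For reference, a sequential C sketch of MVM over the CCS format. Note that the updates scatter into y rather than reducing into a single sum per row, so a parallel implementation (for example across Cell SPEs) would have to deal with the resulting write conflicts, unlike the row-oriented CRS case:

```c
#include <stddef.h>

/* y = A*x for a matrix in compressed column storage (CCS): val holds
 * the nonzeros column by column, row_idx their row indices, and
 * col_ptr[j]..col_ptr[j+1] delimits column j. Column j contributes
 * x[j] * (its nonzeros) scattered into y. */
void spmv_ccs(size_t m, size_t ncols,
              const size_t *col_ptr, const size_t *row_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < ncols; j++)
        for (size_t k = col_ptr[j]; k < col_ptr[j+1]; k++)
            y[row_idx[k]] += val[k] * x[j];
}
```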
Readings:
Project 5: CUDA implementation of sorting on NVIDIA Tesla C870
Supervisor: Martin Burtscher
Goal: Figure out a good algorithm for sorting integers and floats on
the NVIDIA Tesla C870. You should be familiar with conventional sorting
algorithms such as quicksort and mergesort. There is also a class of
sorting algorithms, called sorting networks, that are designed for
parallel computers. The first reference below describes sorting
algorithms and sorting networks in detail.
Requirements:
- You must implement at least one conventional sorting algorithm
such as quicksort or mergesort. We will not accept dumb algorithms like
bubblesort.
- You must implement at least one sorting network such as bitonic
sort.
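As an illustration of the sorting-network idea, here is a serial C sketch of bitonic sort for power-of-two n. Each pass of the inner loop over i is one stage of independent compare-exchange operations, which is what makes the network attractive on a GPU (roughly one thread per comparator, with a synchronization between stages):

```c
#include <stddef.h>

/* Compare-exchange: order a[i] and a[j] according to direction. */
static void cmp_exchange(int *a, size_t i, size_t j, int ascending)
{
    if ((a[i] > a[j]) == ascending) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Bitonic sorting network for n a power of two (assumption). The outer
 * loop builds bitonic sequences of size k; the middle loop halves the
 * comparator stride j; the inner loop is a stage of n/2 independent
 * comparators, found by pairing each i with i ^ j. */
void bitonic_sort(int *a, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1)
        for (size_t j = k >> 1; j > 0; j >>= 1)
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i)
                    cmp_exchange(a, i, partner, (i & k) == 0);
            }
}
```

The network always performs O(n log^2 n) comparisons regardless of the input, a useful property on hardware where data-dependent control flow is expensive.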
Readings:
Project 6: CUDA implementation of n-body methods on NVIDIA Tesla C870
Supervisor: Martin Burtscher
Goal: As you saw in lecture, there are two broad categories of
algorithms for physical simulations: particle methods and continuous
methods. The simplest particle method computes all pairwise
interactions at each time step and is O(n^2) in complexity, where n is
the number of particles. More efficient approximate methods are
available; for example, in lecture you were introduced to the
Barnes-Hut method, which approximates clusters of distant particles by
their center of mass. In
this project, you must implement Barnes-Hut in CUDA and run it on the
NVIDIA Tesla C870.
Readings:
Project 7: Visualization for mambo
Supervisor: Sid Chatterjee
Goal: The current output facilities for mambo are statistics-oriented,
but there are enough hooks inside the simulator to enable richer
visualization. The goal of this project is to build a visualizer for
mambo that will allow the user to obtain a more detailed understanding
of program performance through a GUI.
Requirements:
You must produce a standalone visualizer that will work with the mambo
simulation infrastructure and allow graphical visualization of
interesting simulation events, such as the behavior of cache misses or
pipeline stalls as a function of time. Send mail to Sid Chatterjee
(sc@us.ibm.com) for more details.
Project 8: Performance optimizations in multicore libraries
Supervisor: Sid Chatterjee
Goal: As we saw in the discussion of the Goto BLAS, high-performance
libraries typically use a number of different performance optimizations
in combination. The goal of this project is to take a well-tuned
multicore library (for either Cell or x86), analyze the various
optimizations used in it, and write up a report based on this analysis.
Requirements:
You must select a library from a non-numerical domain (i.e., no BLAS,
FFT, etc.). Choose a library for which you have source code available.
Devise a set of experiments to extract the components of performance.
Write up a final report based on the results of your experiments.