General guidelines for final project:
Goal:
One approach to memory hierarchy optimization is to
use cache-oblivious algorithms. These algorithms are based on
a divide-and-conquer strategy, and they are usually implemented using
recursion. To produce an efficient implementation, it is necessary to
stop the recursion once the problem size becomes small enough, and
invoke a micro-kernel: straight-line code that multiplies matrices
of small enough size that the computation can be performed in the
registers. Implement a cache-oblivious MMM for the Power
architecture, and compare the performance of your code with the
performance of GotoBLAS MMM or ATLAS-generated MMM.
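As a rough illustration, here is a minimal Java sketch of the recursive
structure; the cutoff CUT, the names, and the triple-loop base case are
our placeholders, and a real micro-kernel for Power would be unrolled,
register-blocked, and tuned rather than written this way:

    public class CacheObliviousMMM {
        static final int CUT = 32;  // base-case cutoff; tune per machine

        // Computes an m x n block of C += A * B, starting at offsets
        // (i0, j0, k0), with all matrices stored row-major with leading
        // dimension ld. Recursively halves the largest dimension.
        static void mmm(double[] A, double[] B, double[] C, int ld,
                        int i0, int j0, int k0, int m, int n, int k) {
            if (m <= CUT && n <= CUT && k <= CUT) {
                microKernel(A, B, C, ld, i0, j0, k0, m, n, k);
            } else if (m >= n && m >= k) {        // split rows of A and C
                mmm(A, B, C, ld, i0,         j0, k0, m / 2,     n, k);
                mmm(A, B, C, ld, i0 + m / 2, j0, k0, m - m / 2, n, k);
            } else if (n >= k) {                  // split columns of B and C
                mmm(A, B, C, ld, i0, j0,         k0, m, n / 2,     k);
                mmm(A, B, C, ld, i0, j0 + n / 2, k0, m, n - n / 2, k);
            } else {                              // split the k dimension
                mmm(A, B, C, ld, i0, j0, k0,         m, n, k / 2);
                mmm(A, B, C, ld, i0, j0, k0 + k / 2, m, n, k - k / 2);
            }
        }

        // Stand-in for the micro-kernel: a plain loop nest. The real one
        // is straight-line code sized so operands stay in registers.
        static void microKernel(double[] A, double[] B, double[] C, int ld,
                                int i0, int j0, int k0, int m, int n, int k) {
            for (int i = 0; i < m; i++)
                for (int kk = 0; kk < k; kk++) {
                    double a = A[(i0 + i) * ld + (k0 + kk)];
                    for (int j = 0; j < n; j++)
                        C[(i0 + i) * ld + (j0 + j)] += a * B[(k0 + kk) * ld + (j0 + j)];
                }
        }
    }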
Requirements:
Readings:
Goal:
Figure out a good algorithm for sorting integers and floats on the
NVIDIA Tesla C870. You should be familiar with conventional sorting
algorithms such as quicksort and mergesort. There is also a class of
sorting algorithms, called sorting networks, that is designed for
parallel computers. The first reference below
describes sorting algorithms and networks in detail.
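For concreteness, below is a minimal sequential Java sketch of one
classic sorting network, bitonic sort (our choice of example; the names
are ours, and it assumes a power-of-two input length). Every
compare-exchange in the innermost loop is independent of the others,
which is the property that lets such networks assign one comparator per
GPU thread:

    public class BitonicSort {
        // Sorts a float array of power-of-two length into ascending order.
        static void sort(float[] a) {
            int n = a.length;
            for (int size = 2; size <= n; size <<= 1)                // merge stage
                for (int stride = size >> 1; stride > 0; stride >>= 1)  // pass
                    for (int i = 0; i < n; i++) {     // independent compare-exchanges
                        int j = i ^ stride;           // partner of i in this pass
                        if (j > i) {
                            boolean ascending = (i & size) == 0;  // direction of i's run
                            if ((a[i] > a[j]) == ascending) {     // out of order: swap
                                float t = a[i]; a[i] = a[j]; a[j] = t;
                            }
                        }
                    }
        }
    }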
Requirements:
Readings:
Goal:
As discussed in class, Jacobi's method can be used to solve a
finite-difference approximation of the Poisson equation using a
5-point stencil in the two-dimensional case. This code is simple to
parallelize using "old" and "new" values on each grid point. However,
it can also be parallelized by updating grid points at random using
old or new neighboring values. This approach avoids synchronization
overhead, so each step may run faster, but the iteration may take more
steps to converge.
This project asks you to implement both versions on a GPU.
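A minimal sequential Java sketch of the two update disciplines is
below; the grid layout (an (n+2) x (n+2) row-major array whose outer
ring holds boundary values) and all names are our assumptions:

    public class PoissonJacobi {
        // Version 1: pure Jacobi with separate old/new arrays. h is the
        // grid spacing; f is the right-hand side of the Poisson equation.
        static void jacobiStep(double[] uOld, double[] uNew, double[] f,
                               int n, double h) {
            int w = n + 2;
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    uNew[i * w + j] = 0.25 * (uOld[(i - 1) * w + j]
                                            + uOld[(i + 1) * w + j]
                                            + uOld[i * w + j - 1]
                                            + uOld[i * w + j + 1]
                                            - h * h * f[i * w + j]);
        }

        // Version 2: in-place ("chaotic") update. Each read sees whatever
        // mix of old and new neighbor values the update order produces,
        // so no second array -- and, in a parallel version, no barrier --
        // is needed.
        static void chaoticStep(double[] u, double[] f, int n, double h) {
            int w = n + 2;
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++)
                    u[i * w + j] = 0.25 * (u[(i - 1) * w + j]
                                         + u[(i + 1) * w + j]
                                         + u[i * w + j - 1]
                                         + u[i * w + j + 1]
                                         - h * h * f[i * w + j]);
        }
    }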
Requirements:
Goal:
As we discussed in class, there are many approaches one can take to
parallelize Delaunay Mesh Refinement. One approach (which we
developed in class) is to partition a Delaunay mesh among multiple
processors and treat each partition as a separate mesh which can then
be refined independently. In this approach, fixing a badly shaped
triangle near the boundary of a partition requires "splitting" edges
at the partition boundary. To ensure that the resulting mesh is
consistent, these split edges must be communicated to the neighboring
partition (See [1] for an explanation of this, or
come talk to Milind). In this project, you will implement a
shared-memory parallel, partitioned version of Delaunay mesh
refinement that uses this approach. We will provide a sequential
implementation of Delaunay mesh refinement (in Java), code that will
partition the mesh for you (i.e., given an input mesh, it will assign
triangles in the mesh to different partitions), and a variety of input
data sets.
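To make the structure concrete, here is a hypothetical Java skeleton of
a per-partition worker; every type and method name below is a
placeholder of ours, not a name from the provided implementation, and a
real version also needs a global termination protocol:

    import java.util.List;
    import java.util.Queue;

    // Hypothetical placeholder types, not names from the provided code.
    interface Edge {}
    interface Triangle {}
    interface EdgeSplit { Edge edge(); }

    interface Partition {
        void applyBoundarySplit(EdgeSplit s);   // replay a neighbor's split
        Triangle nextBadTriangle();             // null when worklist is empty
        List<EdgeSplit> refine(Triangle t);     // splits of shared edges, if any
        Queue<EdgeSplit> neighborInbox(Edge e); // inbox of the partition across e
    }

    class RefinementWorker implements Runnable {
        final Partition mine;
        final Queue<EdgeSplit> inbox;

        RefinementWorker(Partition mine, Queue<EdgeSplit> inbox) {
            this.mine = mine;
            this.inbox = inbox;
        }

        public void run() {
            while (true) {
                // Replay splits that neighbors made on shared edges,
                // keeping both sides of each boundary consistent.
                for (EdgeSplit s; (s = inbox.poll()) != null; )
                    mine.applyBoundarySplit(s);

                Triangle bad = mine.nextBadTriangle();
                if (bad == null) break;  // real code must also check that
                                         // no splits are still in flight

                // Fixing a bad triangle near the boundary may split shared
                // edges; forward those splits to the partitions across them.
                for (EdgeSplit s : mine.refine(bad))
                    mine.neighborInbox(s.edge()).add(s);
            }
        }
    }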
Requirements:
Goal:
In class, we have discussed finite-element analysis as a means of
simulating complex physical systems. Asynchronous Variational
Integrators (AVIs) are a means of solving certain kinds of finite
element problems. At a very high level, the general structure of AVIs
is as follows. The finite elements are represented as a graph, with
each element represented by a node, and edges between adjacent
elements (think of how we represented the mesh in Delaunay mesh
refinement). Each element has a timestamp associated with it;
updating an element requires reading its neighbors and updating its
timestamp. An element can be updated if its timestamp is less than
those of all of its neighbors. The algorithm terminates when all
elements' timestamps
reach a final value.
In [1], Huang et al. present an approach to running AVI algorithms in parallel called PAVI. It finds a set of elements whose timestamps are local minima, updates them in parallel, and then finds the next set of elements that can be updated. In this project, you will implement a parallel AVI framework based on the algorithm in [1]. We will provide a synthetic workload to use in evaluating your framework.
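A minimal Java sketch of this outer loop follows; the array-based graph
representation, the id-based tie-breaking, and the trivial advance()
update are our assumptions, not details taken from [1]:

    import java.util.ArrayList;
    import java.util.List;

    public class PaviSketch {
        // time[e] is element e's timestamp; nbrs[e] lists its neighbors.
        static void run(double[] time, int[][] nbrs, double tFinal) {
            while (true) {
                // Select every element whose timestamp is a local minimum
                // among its neighbors (ties broken by element id, so two
                // adjacent elements are never selected together).
                List<Integer> minima = new ArrayList<>();
                for (int e = 0; e < time.length; e++) {
                    if (time[e] >= tFinal) continue;   // already finished
                    boolean isMin = true;
                    for (int m : nbrs[e])
                        if (time[m] < time[e] || (time[m] == time[e] && m < e)) {
                            isMin = false;
                            break;
                        }
                    if (isMin) minima.add(e);
                }
                if (minima.isEmpty()) break;           // all at tFinal

                // Selected elements are pairwise non-adjacent, so they can
                // be updated in parallel without synchronization.
                minima.parallelStream().forEach(e -> time[e] = advance(time, nbrs, e));
            }
        }

        // Placeholder: a real AVI solver would integrate the element
        // forward in time; here we only step the timestamp.
        static double advance(double[] time, int[][] nbrs, int e) {
            return time[e] + 0.1;
        }
    }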
Requirements:
Goal:
Sequence alignment is used to discover structural, and hence
functional, similarities between biological sequences. SWAMP
(Smith-Waterman using Associative Massive Parallelism) is a suite of
algorithms that builds on the high-sensitivity Smith-Waterman approach
and is designed for the parallel computation model known as ASC, with
techniques tailored to ASC's strengths to maximize the algorithms'
efficiency. You are asked to implement SWAMP on a GPU.
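As background, here is the Smith-Waterman recurrence that SWAMP builds
on, as a plain sequential Java sketch; the scoring constants and names
are illustrative, and this is the textbook local-alignment algorithm,
not SWAMP itself. Note that the cells of each anti-diagonal depend only
on the two preceding anti-diagonals, which is the parallelism a GPU
implementation can exploit:

    public class SmithWaterman {
        // Returns the score of the best local alignment of a and b
        // under a linear gap penalty.
        static int score(String a, String b) {
            final int MATCH = 2, MISMATCH = -1, GAP = 1;  // illustrative values
            int[][] H = new int[a.length() + 1][b.length() + 1];
            int best = 0;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int s = (a.charAt(i - 1) == b.charAt(j - 1)) ? MATCH : MISMATCH;
                    int h = Math.max(0, H[i - 1][j - 1] + s);  // (mis)match
                    h = Math.max(h, H[i - 1][j] - GAP);        // gap in b
                    h = Math.max(h, H[i][j - 1] - GAP);        // gap in a
                    H[i][j] = h;
                    best = Math.max(best, h);                  // track local maximum
                }
            return best;
        }
    }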
Requirements:
References: