: Roshan Dathathri
Affine loop nests are arbitrarily nested loop nests in which the
array accesses and the loop bounds are affine functions (linear
functions with a constant offset) of the program parameters.
They form the compute-intensive core of scientific computations
like linear-algebra kernels and stencil-style computations.
Compilers (using the polyhedral model) can statically analyze
these loop nests and generate parallel and tiled code without
programmer intervention. However, the performance of the
generated code is highly sensitive to the tile size, which is
dependent on the machine architecture (but not the problem
sizes). So, the programmer has to choose the right tile sizes to
get the best performance. Exhaustively searching for the best
tile sizes (auto-tuning) is time-consuming.
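To make the idea of tiling concrete, here is a minimal sketch of a
tiled matrix multiplication. It is written in plain Python only to
show the loop structure; the function name `matmul_tiled` and the
use of a single square tile size are assumptions made for
illustration, and you should not expect Python itself to show the
cache effects that make tile size matter in compiled code.

```python
def matmul_tiled(A, B, n, tile):
    """Compute C = A * B for n x n matrices (lists of lists) using square tiles."""
    C = [[0.0] * n for _ in range(n)]
    # The three outer loops step through the iteration space in blocks of `tile`.
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # The inner loops stay inside one tile; min() handles a
                # partial tile when n is not a multiple of the tile size.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Any positive tile size gives the same result; only the order in
which the iterations are visited changes, and that order is what
determines data reuse in cache.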
Analytical models have been proposed to compute the best tile
sizes. These models carefully analyze both machine architecture
features and program features to derive an analytical function
for the optimal tile sizes in terms of the values of those
features. Yotov et al. in PLDI 2003 proposed
an analytical model for determining optimal tile sizes for
matrix multiplication. They showed that their analytical model
can do almost as well as exhaustive search. However, their
analytical model is specific to one problem: matrix
multiplication. Extending it to other affine loop nests would
require analyzing each of the problems independently.
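As a rough illustration of what "an analytical function for the
tile size" can look like, here is a deliberately simple
capacity-based heuristic. It makes no attempt to reproduce the
Yotov et al. model. It only assumes (hypothetically) that some
fixed number of square tiles, by default one each from the three
matrices in MMM, should fit in the cache at the same time. The
function name and parameters are made up for this sketch.

```python
import math


def capacity_tile_size(cache_bytes, elem_bytes=8, tiles_resident=3):
    """Largest T such that tiles_resident * T * T elements fit in the cache.

    Assumption for illustration only: the cache behaves as if it were
    fully associative and holds nothing but the tiles.
    """
    elems = cache_bytes // elem_bytes          # cache capacity in elements
    t = math.isqrt(elems // tiles_resident)    # floor(sqrt(capacity / tiles))
    return max(t, 1)
```

For a 32 KiB cache and 8-byte elements this gives a tile size of
36, since 3 * 36 * 36 = 3888 elements fit in 4096 but
3 * 37 * 37 = 4107 do not. Note that the function depends on the
machine and the element size but not on the problem size.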
Recent work has tried to use machine learning techniques to
learn optimal tile sizes automatically. Cummins et al. in ADAPT
2016 use classifiers and regressors to learn tile sizes for
stencil GPU kernels. The advantage is that the best tile sizes
for different stencil problems can be learned automatically.
Neural networks seem well suited to learn tile sizes because the
analytical functions for optimal tile sizes are usually not
The goal of this project is to build a machine learning system -
possibly based on deep neural networks - that takes affine loop
nest features and architecture features as inputs and predicts
the best tile size for that loop nest. To begin with, you can
restrict yourself to perfectly nested loops. Here are things to
Project deliverables and
- Loop nest features and architectural features relevant
for optimal tile size determination: You can look at
the Yotov et al. analytical model for MMM to see what
features they used. The Cummins et al. machine
learning model might also provide insight. You can also
investigate techniques from machine learning for finding
relevant features automatically.
- Training data: Once you have decided on features,
you will need training data. If F1 and F2 are the features,
the goal of training is to learn a function TS: F1xF2 ->
TileSize. Therefore, your training data will consist of
tuples of the form (f1,f2,t) where f1 and f2 are values for
the features, and t is the optimal tile size for those
features. To create these tuples, you will need to determine
optimal tile sizes for a variety of loop nests and a variety
of architectures. For matrix multiplication, you can use
ATLAS to find optimal tile sizes. For stencil codes, you can
use a polyhedral compiler like Pluto to generate code for a
specific tile size.
- Machine learning system: You are free to use any
system you like. M5 is one possibility. Deep neural networks
are popular nowadays, so you can also consider these.
- You can do this project in stages. For example, a
first step might be to restrict yourself to MMM, train
using MMM data, and see if the function your system learns
resembles the analytical one from Yotov et al. Similarly,
you can train another system for stencil codes and compare
with Cummins et al. Of course, the ultimate goal is to build
a single tile size predictor that can handle any affine loop
nest.
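The training-data and learning steps described above can be
sketched as follows. This is only an outline under stated
assumptions: each feature is reduced to a single number, the
`measure` callback stands in for whatever you actually time (for
example, code generated for a specific tile size), and a
1-nearest-neighbour lookup stands in for the learned function TS.
All function names here are hypothetical.

```python
def label_by_search(candidates, cost):
    """Return the candidate tile size with the lowest cost (exhaustive search)."""
    return min(candidates, key=cost)


def build_training_set(configs, candidates, measure):
    """Build (f1, f2, t) tuples.

    configs:    iterable of (f1, f2) feature values
    candidates: tile sizes to try for every configuration
    measure:    hypothetical callback measure(f1, f2, t) -> cost,
                e.g. the running time observed for tile size t
    """
    data = []
    for f1, f2 in configs:
        best = label_by_search(candidates, lambda t: measure(f1, f2, t))
        data.append((f1, f2, best))
    return data


def predict_tile_size(data, f1, f2):
    """Stand-in for TS: F1xF2 -> TileSize using the nearest training tuple."""
    nearest = min(
        data,
        key=lambda row: (row[0] - f1) ** 2 + (row[1] - f2) ** 2,
    )
    return nearest[2]
```

In a real system you would replace `predict_tile_size` with the
model you choose to train, and you would probably want to scale the
features so that no single one dominates the distance.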
- (Nov 1) A clear statement in English describing your
- (Nov 8) A survey of analytical models, polyhedral
compilers, and neural networks.
- (Dec 6) A tool that takes loop nest features and machine
features as input and outputs the tile size to use in the
code for that machine.
- (Dec 6) A project report, written in the style of an ACM
conference paper, that summarizes the work you did.
Is Search Really Necessary to Generate High-Performance BLAS?
Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David
Padua, Keshav Pingali, Paul Stodghill. PLDI 2003.
A Practical Automatic Polyhedral Parallelizer and Locality
Optimizer. Uday Bondhugula, A. Hartono, J. Ramanujam,
P. Sadayappan. ACM SIGPLAN Programming Languages Design and
Implementation (PLDI), Jun 2008, Tucson, Arizona.
Autotuning OpenCL Workgroup Size for Stencil Patterns. Chris
Cummins, Pavlos Petoumenos, Michel Steuwer, Hugh Leather. In
Proceedings of the 6th International Workshop on Adaptive
Self-tuning Computing Systems (ADAPT'16).
to Neural Networks. Yaser Abu-Mostafa. An online
course lecture featured on edX.