Lecture Material

Basic material

(1) Course overview

Parallel architectures, parallel algorithms, parallel data structures
Slides: Introduction to CS 395T
(1) Moore's Law paper, Electronics, 1965.
(2) Static Power Model for Architects, Butts and Sohi, Micro 2000.
(3) Introduction to the Cell processor, Kahle et al, IBM J. Res. & Dev., July 2005
(4) Amorphous Data-Parallelism, Pingali et al., 2011

(2) Parallelism in Regular and Irregular Algorithms
Sources of Parallelism and Locality in Computational Science Algorithms
Ordinary differential equations (ODEs), finite differences, systems of ODEs,
partial differential equations (PDEs), finite elements, n-body methods (Barnes-Hut)
Slides: Some computational science algorithms
(1) Mathematica tutorial on numerical methods for solving PDEs

Sources of Parallelism and Locality in Irregular Algorithms
(1) Delta-stepping: A Parallel Single-Source Shortest Path Algorithm, Meyer and Sanders, ESA 1998
(2) A Work-efficient Parallel BFS Algorithm, Leiserson and Schardl, SPAA 2010

(3) Abstractions for regular algorithms and machines, static scheduling
Dependence graphs, PRAM model, DAG scheduling

Slides: Algorithm and machine abstractions: dependence graphs and PRAM model
Control dependence computation
(1) Dependence graphs and compiler optimizations, Kuck et al., POPL 1981
(2) The program dependence graph and its use in optimization, Ferrante, Ottenstein, Warren, TOPLAS, 1987
(3) Optimal control dependence computation, Pingali and Bilardi, TOPLAS, 1997
(4) Experimental evaluation of list scheduling, Cooper et al, Rice TR, 1998
(5) From control flow to dataflow, Beck et al., JPDC 1989
(6) A bridging model for parallel computation, Valiant, CACM, August 1990

(4) Dynamic scheduling
Slides: Dynamic scheduling
(1) Load Balancing literature survey
(2) Scheduling multi-threaded computations by work-stealing, Blumofe and Leiserson, JACM, 1999.

(5) Locality(I): Temporal and spatial locality, caches, blocked algorithms

Slides: Cache models for locality
(1) Anatomy of high-performance matrix multiplication, Goto et al, ACM TOMS, May 2008.

Locality(II): Cache-oblivious algorithms
Slides: Cache-oblivious Programs
(1) Cache-oblivious algorithms, Frigo et al, FOCS 99
(2) An experimental comparison of cache-oblivious and cache-conscious programs, Yotov et al, SPAA 2007

(6) Architecture: Multicore architectures, cache coherence
Slides: Coherent caches
Slides: Memory consistency models

(7) GPUs and GPU programming
Here are the slides from the HiPEAC tutorial:

The OpenGPU set (which has more on advances in programming models) is here:

(1) A survey of general-purpose computation on graphics hardware, Owens et al, Eurographics 2005.

Advanced topics

(8) Combining parallelism and locality

(1) New abstractions for data-parallel programming, Brodman et al, HotPar 2009
(2) Distributed Dense Numerical Linear Algebra: DPLASMA, Bosilca et al, University of Tennessee Technical report
(3) Cilk, an efficient multithreaded runtime system, Blumofe et al, PPoPP 1995

(9) Advanced parallel data structures
Lock-free synchronization
(1) Transactional Memory: Architectural Support for Lock-Free Data Structures, Maurice Herlihy and J. Eliot B. Moss, ISCA 1993.
(2) An efficient heuristic procedure for partitioning graphs, Kernighan and Lin, Bell System Technical Journal, 1970.
(3) A fast and high quality multilevel scheme etc., Karypis and Kumar, SIAM J. Sci. Comput., 1998.

(10) Compiler analysis and parallelization
Slides: Loop parallelization using compiler analysis
Dependences and Transformations
Tutorial on points-to analysis, Michael Hind
(1) The Omega test, Pugh, Supercomputing 91
(2) Analysis of programs with pointers
(3) Shape analysis, Ghiya and Hendren, POPL 1996

(11) Auto-tuning
Slides: Optimizing MMM and the ATLAS code generator
(1) Optimizing matrix multiply using PHiPAC, Bilmes et al, LAPACK Working Note 111.
(2) Is search really necessary to generate high-performance BLAS?, Yotov et al, Proceedings of IEEE, March 2005.
(3) FFTW Homepage

(12) Large-scale data analysis
Slides: MapReduce: Simplified Data Processing on Large Clusters
(1) Map-reduce: simplified data processing on large clusters, Dean and Ghemawat, OSDI 2004.
(2) HDFS - http://www.cs.utexas.edu/~nikhil/The_Hadoop_Distributed_File_System.pdf
(3) Hadoop

(13) Synthesis of parallel programs
(1) SPIRAL: code generation for DSP transforms, Püschel et al, Proceedings of IEEE, 2005
(2) Tensor-contraction engine, Baumgartner et al, Proceedings of IEEE, 2005
(3) Designing a stencil compiler for the CM-5, Brickner et al, Los Alamos Tech report

(14) Approximate computing
(1) Green: A Framework for Supporting Energy-conscious Programming using Controlled Approximation, Baek and Chilimbi, PLDI 2010.
(2) Exploiting the Forgiving Nature of Applications for Scalable Parallel Execution, Meng, Raghunathan, Chakradhar, and Byna, IPDPS 2010.
(3) Dynamic knobs for power-aware computing, Hoffmann et al, ASPLOS 2011.