--------------------

(1) Course overview

Parallel architectures, parallel algorithms, parallel data structures

Slides: Introduction to CS 395T

Readings:

(1) Cramming More Components onto Integrated Circuits (the Moore's Law paper), Moore, Electronics, 1965.

(2) A Static Power Model for Architects, Butts and Sohi, MICRO 2000.

(3) Introduction to the Cell processor, Kahle et al., IBM J. Res. & Dev., July 2005

(4) Amorphous Data-Parallelism, Pingali et al., 2011

(2) Parallelism in Regular and Irregular Algorithms

Sources of Parallelism and Locality in Computational Science Algorithms

Ordinary differential equations (ODEs), finite differences, systems of ODEs, partial differential equations (PDEs), finite elements, n-body methods (Barnes-Hut)

Slides: Some computational science algorithms

Readings:

(1) Mathematica tutorial on numerical methods for solving PDEs

Sources of Parallelism and Locality in Irregular Algorithms

Readings:

(1) Delta-stepping: A Parallel Single-Source Shortest Path Algorithm, Meyer and Sanders, ESA 1998

(2) A Work-efficient Parallel BFS Algorithm, Leiserson and Schardl, SPAA 2010
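The level-synchronous structure that these two readings parallelize can be sketched sequentially. Below is a minimal sketch of frontier-based BFS; the graph is a hypothetical toy adjacency list, and the comments indicate which steps the SPAA 2010 algorithm runs in parallel.

```python
def bfs_levels(graph, source):
    """Level-synchronous BFS: expand the frontier one level at a time.
    A parallel version expands each level's frontier concurrently."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:            # parallel loop in the work-efficient algorithm
            for v in graph[u]:
                if v not in dist:     # done with an atomic update in parallel
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return dist

# Hypothetical toy graph: a small undirected graph as an adjacency list.
g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
```

The available parallelism per level is the frontier size, which is why frontier shape (diameter, degree distribution) governs scalability in the irregular setting.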

(3) Abstractions for regular algorithms and machines, static scheduling

Dependence graphs, PRAM model, DAG scheduling

Slides: Algorithm and machine abstractions: dependence graphs and PRAM model

Control dependence computation

Readings:

(1) Dependence graphs and compiler optimizations, Kuck et al., POPL 1981

(2) The program dependence graph and its use in optimization, Ferrante, Ottenstein, and Warren, TOPLAS, 1987

(3) Optimal control dependence computation, Pingali and Bilardi, TOPLAS, 1997

(4) Experimental evaluation of list scheduling, Cooper et al, Rice TR, 1998

(5) From control flow to dataflow, Beck et al., JPDC 1989

(6) A bridging model for parallel computation, Valiant, CACM, August 1990
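A minimal greedy list scheduler over a task DAG, in the spirit of the list-scheduling evaluation in reading (4): whenever a processor is free, it takes the ready task with the highest priority. The longest-task-first priority and the toy task set below are hypothetical illustrative choices.

```python
def list_schedule(durations, preds, num_procs):
    """Greedy list scheduling of a task DAG: repeatedly assign the ready
    task with the longest duration to the earliest-free processor."""
    succs = {t: [] for t in durations}
    indeg = {t: 0 for t in durations}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)
            indeg[t] += 1
    ready_at = {t: 0.0 for t in durations if indeg[t] == 0}
    proc_free = [0.0] * num_procs         # next free time of each processor
    finish = {}
    while ready_at:
        task = max(ready_at, key=lambda t: durations[t])   # priority: longest task
        p = min(range(num_procs), key=lambda i: proc_free[i])
        start = max(proc_free[p], ready_at.pop(task))
        finish[task] = start + durations[task]
        proc_free[p] = finish[task]
        for s in succs[task]:             # release successors whose preds are done
            indeg[s] -= 1
            if indeg[s] == 0:
                ready_at[s] = max(finish[q] for q in preds[s])
    return finish

# Hypothetical toy DAG: c needs a; d needs a and b.
durations = {"a": 2, "b": 3, "c": 1, "d": 2}
preds = {"c": ["a"], "d": ["a", "b"]}
finish = list_schedule(durations, preds, 2)   # makespan is max(finish.values())
```

List scheduling is a heuristic, not optimal (DAG scheduling with makespan minimization is NP-hard), which is what makes the empirical comparison of priority functions interesting.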

(4) Dynamic scheduling

Slides: Dynamic scheduling

Readings:

(1) Load Balancing literature survey

(2) Scheduling multi-threaded computations by work-stealing, Blumofe and Leiserson, JACM, 1999.
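The randomized work-stealing policy from the Blumofe-Leiserson paper can be illustrated with a toy single-threaded simulation: each worker owns a deque, works from the bottom of its own deque, and when idle steals one task from the top of a random victim. The round-based timing is a hypothetical simplification; real schedulers run workers concurrently and steal with atomic operations.

```python
import random
from collections import deque

def simulate_work_stealing(num_workers, tasks, seed=0):
    """Toy round-based simulation of randomized work stealing."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    for t in tasks:                   # all work starts on worker 0
        deques[0].append(t)
    executed = [[] for _ in range(num_workers)]
    while any(deques):
        for w in range(num_workers):
            if deques[w]:
                executed[w].append(deques[w].pop())          # own work: bottom
            else:
                victims = [v for v in range(num_workers) if deques[v]]
                if victims:
                    v = rng.choice(victims)
                    executed[w].append(deques[v].popleft())  # steal: top
    return executed

executed = simulate_work_stealing(2, list(range(6)))
```

Stealing from the top tends to grab the largest unexplored subcomputations, which is central to the paper's bounds on expected execution time and communication.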

(5) Locality (I): Temporal and spatial locality, caches, blocked algorithms

Slides: Cache models for locality

Readings:

(1) Anatomy of high-performance matrix multiplication, Goto and van de Geijn, ACM TOMS, May 2008.
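A minimal pure-Python sketch of the tiling idea behind blocked algorithms: operate on bs-by-bs tiles so each tile of A, B, and C is reused many times while it is cache-resident. The block size bs is a hypothetical tuning parameter; a real implementation (as in the Goto paper) chooses it to match the cache hierarchy.

```python
def blocked_matmul(A, B, n, bs):
    """Blocked (tiled) n-by-n matrix multiply, C = A * B."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

# Hypothetical toy operands.
A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
C = blocked_matmul(A, B, 4, 2)
```

Blocking changes only the iteration order, not the arithmetic, so the result matches the naive triple loop while the number of capacity misses drops roughly by a factor of the block size.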

Locality (II): Cache-oblivious algorithms

Slides: Cache-oblivious Programs

Readings:

(1) Cache-oblivious algorithms, Frigo et al, FOCS 99

(2) An experimental comparison of cache-oblivious and cache-conscious programs, Yotov et al, SPAA 2007
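The divide-and-conquer style of the Frigo et al. paper can be sketched with matrix transposition: recursion halves the larger dimension until a subproblem fits in cache at some level, without the algorithm ever naming a cache size. The cutoff below is only a hypothetical base-case size to keep Python recursion shallow.

```python
def co_transpose(A, B, r0, r1, c0, c1, cutoff=16):
    """Cache-oblivious transpose of A[r0:r1][c0:c1] into B, so B[j][i] = A[i][j].
    No cache parameters appear: recursion adapts to every cache level."""
    if (r1 - r0) * (c1 - c0) <= cutoff:
        for i in range(r0, r1):
            for j in range(c0, c1):
                B[j][i] = A[i][j]
    elif r1 - r0 >= c1 - c0:
        rm = (r0 + r1) // 2          # split the taller dimension
        co_transpose(A, B, r0, rm, c0, c1, cutoff)
        co_transpose(A, B, rm, r1, c0, c1, cutoff)
    else:
        cm = (c0 + c1) // 2          # split the wider dimension
        co_transpose(A, B, r0, r1, c0, cm, cutoff)
        co_transpose(A, B, r0, r1, cm, c1, cutoff)

# Hypothetical toy input: an 8-by-5 matrix transposed into a 5-by-8 one.
A = [[i * 10 + j for j in range(5)] for i in range(8)]
B = [[0] * 8 for _ in range(5)]
co_transpose(A, B, 0, 8, 0, 5)
```

The contrast with the blocked version is the point of the Yotov et al. comparison: cache-conscious code tunes an explicit block size per machine, while this recursion is machine-independent.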

(6) Architecture: Multicore architectures, cache coherence

Slides: Coherent caches

Slides: Memory consistency models

(7) GPUs and GPU programming

Slides:

Here are the slides from the HiPEAC tutorial:

http://www.

The OpenGPU slide set (which has more on advances in programming models) is here:

http://www.opengpu.net/EN/

Readings:

(1) A survey of general-purpose computation on graphics hardware, Owens et al., Eurographics 2005.

Advanced topics

---------------------------------

(8) Combining parallelism and locality

Slides:

Readings:

(1) New abstractions for data-parallel programming, Brodman et al, HotPar 2009

(2) Distributed Dense Numerical Linear Algebra: DPLASMA, Bosilca et al, University of Tennessee Technical report

(3) Cilk: An Efficient Multithreaded Runtime System, Blumofe et al., PPoPP 1995

(9) Advanced parallel data structures

Slides: Lock-free synchronization

Readings:

(1) Transactional Memory: Architectural Support for Lock-Free Data Structures, Herlihy and Moss, ISCA 1993.

(2) An efficient heuristic procedure for partitioning graphs, Kernighan and Lin, Bell System Technical Journal, 1970.

(3) A fast and high quality multilevel scheme for partitioning irregular graphs, Karypis and Kumar, SIAM J. Sci. Comput., 1998.

(10) Compiler analysis and parallelization

Slides: Loop parallelization using compiler analysis

Slides: Dependences and Transformations

Slides: Tutorial on points-to analysis, Michael Hind

Readings:

(1) The Omega test, Pugh, Supercomputing 91

(2) Analysis of programs with pointers

(3) Shape analysis, Ghiya and Hendren, POPL 1996

(11) Auto-tuning

Slides: Optimizing MMM and the ATLAS code generator

Readings:

(1) Optimizing matrix multiply using PHiPAC, Bilmes et al., LAPACK Working Note 111.

(2) Is search really necessary to generate high-performance BLAS?, Yotov et al, Proceedings of IEEE, March 2005.

(3) FFTW Homepage

(12) Large-scale data analysis

Slides: MapReduce: Simplified Data Processing on Large Clusters

Readings:

(1) MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, OSDI 2004.

(2) HDFS - http://www.cs.utexas.edu/~nikhil/The_Hadoop_Distributed_File_System.pdf

(3) Hadoop
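The word-count example from the Dean-Ghemawat paper can be sketched as a sequential simulation of the map, shuffle, and reduce phases. The function names here are hypothetical; the real system distributes map and reduce tasks across a cluster and stores intermediate data in files.

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(w, 1) for w in doc.split()]

def mapreduce(docs, mapper, reducer):
    """Sequential sketch of the MapReduce dataflow: map every input,
    shuffle values by key, then reduce each key's list of values."""
    groups = defaultdict(list)
    for doc in docs:
        for k, v in mapper(doc):      # map tasks run in parallel on a cluster
            groups[k].append(v)       # the shuffle groups values by key
    return {k: reducer(k, vs) for k, vs in groups.items()}

counts = mapreduce(["a b a", "b c"], map_phase, lambda k, vs: sum(vs))
```

The programming model's restriction to side-effect-free map and associative reduce functions is what lets the runtime handle partitioning, load balancing, and fault tolerance automatically.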

(13) Synthesis of parallel programs

Slides:

Readings:

(1) SPIRAL: Code Generation for DSP Transforms, Puschel et al., Proceedings of the IEEE, 2005

(2) Tensor-contraction engine, Baumgartner et al., Proceedings of the IEEE, 2005

(3) Designing a stencil compiler for the CM-5, Brickner et al., Los Alamos Tech report

(14) Approximate computing

Slides:

Readings:

(1) Green: A Framework for Supporting Energy-conscious Programming using Controlled Approximation, Baek and Chilimbi, PLDI 2010.

(2) Exploiting the Forgiving Nature of Applications for Scalable Parallel Execution, Meng, Raghunathan, Chakradhar, and Byna, IPDPS 2010.

(3) Dynamic knobs for power-aware computing, Hoffmann et al., ASPLOS 2011