# Concurrency: Honors

Welcome to cs378h

Chris Rossbach

#### Outline for Today

- Questions?
- Administrivia
- Course Overview
- Course Details and Logistics
- Concurrency & Parallelism Basics

Acknowledgments: some materials in this lecture borrowed from:

- Emmett Witchel, who borrowed them from: Kathryn McKinley, Ron Rockhold, Tom Anderson, John Carter, Mike Dahlin, Jim Kurose, Hank Levy, Harrick Vin, Thomas Narten, and Emery Berger
- Mark Silberstein, who borrowed them from: Blaise Barney, Kunle Olukotun, Gupta

#### Course Details

| Course Name: | CS378H – Concurrency: Honors |
|-----------------|----------------------------------------------------------|
| Unique Number: | 52670 |
| Lectures: | T-Th 9:30-11:00AM, Zoom |
| Class Web Page: | http://www.cs.utexas.edu/users/rossbach/cs378h |
| Instructor: | Chris Rossbach |
| TA: | Karthik Velayutham |
| Text: | Principles of Parallel Programming, Calvin Lin and Lawrence Snyder (ISBN-10: 0321487907) |

Please read the syllabus!

...More on this shortly...

- Concurrency is super-cool, and super-important
- You'll learn important concepts and background
- Have fun programming cool systems
  - GPUs! (optionally) FPGAs!
  - Modern programming languages: Go! Rust!
  - Interesting synchronization primitives (not just boring old locks)
  - Programming tools people use to program *super-computers* (ooh...)
#### Two perspectives:

- The "just eat your kale and quinoa" argument
- The "it's going to be fun" argument

## My current computer

[Figure: system diagram with labels CPU, CPU, CPU, Storage, Applications]

Too boring...

#### Wait!

- What's concurrency?
- What's parallelism?

How much parallel and concurrent programming have you learned so far?

[Figure: system diagram with labels CPU(s), GPU, Image, DSP, Crypto, ...]

- Concurrency/parallelism can't be avoided anymore (want a job?)
- A program or two playing with locks and threads isn't enough
- I've worked in industry a lot; I know

Course goal is to expose you to lots of ways of programming systems like these

...So "you should take this course because it's good for you" (eat your #\$(\*& kale!)

#### Goal: Make Concurrency Your Close Friend

Method: Use Many Different Approaches to Concurrency

| Abstract | Concrete |
|-----------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| Locks and Shared Memory Synchronization | Prefix Sum with pthreads |
| Language Support | Go lab: condition variables, channels, goroutines<br>Rust lab: 2PC |
| Parallel Architectures | GPU Programming Lab<br>(Optional) FPGA Programming Lab |
| HPC | Optional MPI lab |
| Distributed Computing / Big Data | Rust 2PC / MPI labs |
| Modern/Advanced Topics | <ul> <li>Specialized Runtimes / Programming Models</li> <li>Auto-parallelization</li> <li>Race Detection</li> </ul> |
| Whatever Interests YOU | Project |

## Logistics Reprise

| Course Name: | CS378H – Concurrency: Honors |
|-----------------|----------------------------------------------------------|
| Unique Number: | 52670 |
| Lectures: | TTh 9:30-11:00AM, WAG 420 |
| Class Web Page: | http://www.cs.utexas.edu/users/rossbach/cs378h |
| Instructor: | Chris Rossbach |
| TA: | Karthik Velayutham |
| Text: | Principles of Parallel Programming (ISBN-10: 0321487907) |

Seriously, read the syllabus! Also, start Lab 1!

- Inclusivity and respect are absolute musts
- Don't make your repos public or look at other people's public repos

[Figure: two processor diagrams]

- One instruction at a time (apparently)
- Multiple instructions in parallel

#### Free lunch...

[Figure: 35 Years of Microprocessor Trend Data]

Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted line extrapolations by C. Moore.

#### Free lunch is over ☹

Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten. Dotted line extrapolations by C.
Moore

# Flynn's Taxonomy

| SISD | SIMD |
|-----------------------------|-----------------------------|
| Single Instruction stream | Single Instruction stream |
| Single Data stream | Multiple Data stream |
| MISD | MIMD |
| Multiple Instruction stream | Multiple Instruction stream |
| Single Data stream | Multiple Data stream |

- SISD: normal serial program
- MISD: uncommon architecture (fault tolerance, pipeline parallelism)

## Execution Models: Flynn's Taxonomy

- Example (SIMD): vector operations (e.g., Intel SSE/AVX, GPU)

## MIMD

- Example: multi-core CPU

- Decomposition: Domain v. Functional
- Domain Decomposition
  - SPMD
  - Input domain
  - Output domain
  - Both
- Functional Decomposition
  - MPMD
  - Independent Tasks
  - Pipelining

# Game of Life

- Given a 2D Grid:
- $v_t(i,j) = F\big(v_{t-1}(\text{all of its neighbors})\big)$

### What model fits "best"?

| SISD | SIMD |
|-----------------------------|-----------------------------|
| Single Instruction stream | Single Instruction stream |
| Single Data stream | Multiple Data stream |
| MISD | MIMD |
| Multiple Instruction stream | Multiple Instruction stream |
| Single Data stream | Multiple Data stream |

Each CPU gets part of the input

### Issues?
- Accessing Data
  - Can we access v(i+1, j) from CPU 0 as in a "normal" serial program?
  - Shared memory? Distributed?
  - Time to access v(i+1,j) == Time to access v(i-1,j)?
  - Scalability vs Latency
- Control
  - Can we assign one vertex per CPU?
  - Can we assign one vertex per process/logical task?
  - Task Management Overhead
- Load Balance
- Correctness
  - order of reads and writes is non-deterministic
  - synchronization is required to enforce the order
  - locks, semaphores, barriers, conditionals....

How could we do a functional decomposition?
## Lab #1

- Basic synchronization
- http://www.cs.utexas.edu/~rossbach/cs378/lab/lab0.html

Start early!!!

# Questions?
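One way to make the Game of Life issues raised earlier concrete is the minimal sketch below. It assumes shared memory, a domain decomposition into contiguous row bands (one band per worker), cells off the grid edge counting as dead, and Conway's rule as one possible choice of F, since the slides leave F unspecified. The names `life_rule`, `count_neighbors`, `serial_step`, and `parallel_life` are illustrative and do not come from the course materials. Python is used only for brevity: CPython threads will not give a real speedup here, so treat this as a picture of the synchronization structure rather than a performance recipe.

```python
import threading


def life_rule(alive, neighbors):
    """Conway's rule, used here as one assumed choice of F."""
    return 1 if neighbors == 3 or (alive and neighbors == 2) else 0


def count_neighbors(grid, i, j):
    """Count live cells among the up-to-8 neighbors; off-grid cells count as dead."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di or dj) and 0 <= i + di < rows and 0 <= j + dj < cols:
                total += grid[i + di][j + dj]
    return total


def serial_step(grid):
    """Reference version: one generation computed by a single thread."""
    return [[life_rule(grid[i][j], count_neighbors(grid, i, j))
             for j in range(len(grid[0]))]
            for i in range(len(grid))]


def parallel_life(grid, steps, num_workers):
    """Run `steps` generations with one thread per band of rows."""
    rows, cols = len(grid), len(grid[0])
    # Double buffering: step t reads bufs[t % 2] and writes bufs[(t + 1) % 2].
    bufs = [[row[:] for row in grid], [[0] * cols for _ in range(rows)]]
    barrier = threading.Barrier(num_workers)

    def worker(w):
        # Domain decomposition: worker w owns rows [lo, hi).
        lo = w * rows // num_workers
        hi = (w + 1) * rows // num_workers
        for t in range(steps):
            cur, nxt = bufs[t % 2], bufs[(t + 1) % 2]
            for i in range(lo, hi):
                for j in range(cols):
                    # Reads may touch rows owned by neighboring workers.
                    nxt[i][j] = life_rule(cur[i][j], count_neighbors(cur, i, j))
            # No worker starts step t + 1 until every band of step t is written.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return bufs[steps % 2]
```

Without the barrier, a fast worker could begin the next generation and overwrite cells that a slower neighbor is still reading, which is the non-deterministic read/write ordering listed under Correctness. Bands of equal height are only balanced if every cell costs the same, which is where the Load Balance question comes in.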