CS378 Assignment #1

The goal of this assignment is to familiarize yourself with basic shared-memory synchronization concepts and primitives, get some experience predicting, measuring, and understanding the performance of concurrent programs. In this lab you will write a program that parallelizes a seemingly basic task: incrementing counters. The task is algorithmically quite simple, and the synchronization required to preserve correctness is not intended to be a major challenge. In contrast, understanding (and working around) the performance subtleties introduced by practical matters such as code structure, concurrency management primitives, and the hardware itself can be non-trivial. At the end of this lab you should have some familiarity with concurrency primitives, and some awareness of performance considerations that will come up repeatedly in later sections of the course.

Specifically, your task is to write a program in which the main thread creates (forks) a parameterizable number of worker threads, and waits for them all to complete (join). The core task of each worker thread is to execute something like the psuedo-code below, where the counter variable is global and shared across all threads, while the my_increment_count is local, and tracks the number of times each individual worker increments the variable:

We will assume the following correctness property: the sum of all increment operations performed by each worker must equal the value of the counter itself (no lost updates). It may not surprise you to hear that some synchronization is required to preserve that condition. In this lab, we'll learn how to use thread and locking APIs to do this, look at some different ways of preserving correctness, measure their performance, and do some thinking about those measurements.

We strongly recommend you do this lab using C/C++ and pthreads. However, it is not a hard requirement--if you wish to use another language, it is fine as long as it supports or provides access to thread management and locking APIs similar to those exported by pthreads: you'll need support for creating and waiting for threads, creating/destroying and using locks, and for using hardware-supported atomic instructions. Using language-level synchronization support (e.g. synchronized or atomic keywords) is not acceptable--it can preserve correctness obviously, but it sidesteps the point of the lab. You can meet these requirements in almost any language, but it is worth talking to me or the TA to be sure if you're going to use something other than C/C++.

Deliverables will be detailed below, but the focus is on a writeup that provides performance measurements as graphs, and answers (perhaps speculatively) a number of questions. Spending some time setting yourself up to quickly and easily collect and visualize performance data is a worthwhile time investment since it will come up over and over in this lab and for the rest of the course.

Step 1: Creating Threads, Unsynchronized Counting

In step 1 of the lab, you will write a program that accepts command-line parameters to specify the following:

It is not critical that you actually parse "--maxcounter" and "--workers" and it's fine if you want to write your program to be invoked for example as:

Where the position of "10000" and "4" by convention means maxcounter and workers respectively. However, there are many tools to make command line parsing easy and it is worthwhile to learn to do it well, so some additional tips and pointers about it are in the hints section of this document.

Your program will fork "workers" number of worker threads, which will collectively increment the shared counter and save the increment operations it performs somewhere accessible to the main thread (e.g. global array of worker counts) before returning. The main thread will wait on all the workers and report the final value of the counter and the sum and values of the local counters before exiting. Note that you expressly will NOT actually attempt to synchronize the counter variable yet! The results may not preserve the stated correctness conditions.

For your writeup, perform the following (please answer all the questions but keep in mind many of them represent food for thought and opportunities to speculate--some of the questions may not have definitive or easily-obtained answers):

Step 2: Synchronization

Next, we'll add some synchronization to the counter increment to ensure that we have no lost updates. If you're doing this lab with pthreads, this should involve using pthread mutex, spinlocks, and calls to initialize and destroy them. In each case, verify that you no longer have lost updates before proceeding. For your write up, include the following experiments and graphs. If you're comfortable doing so, you can merge the data from different experiments into a single graph--but it is fine to graph things separately.

Step 3: Load Balance

In this step, we will take some steps toward addressing load imbalance. There are many potential sources of load imbalance, and we'll start by considering that our program currently has no way to control how threads are distrubuted across processors. If you're using pthreads on Linux, start by reading the documentation for pthread_set_affinity_np, and get_nprocs_conf. If you are using other languages or platforms, rest assured similar APIs exist and can be easily found. You will use these functions to try to control the distribution of threads across the physical cores on your machine. You can choose to use one of the mutex, spinlock, or atomic versions above, or better yet, compare them all. The differences are quite dramatic.

Step 4: Reducing Contention

The most fundamental reason performance generally degrades with increasing parallelism for a shared counter is that every access involves an update, which on modern architectures with deep cache heirarchies causes the cache line containing the shared counter to bounce around from cache to cache, dramatically decreasing the benefit of caching. However, if most threads don't actually change the counter value, (i.e. the data are predominantly read-shared rather than write-shared), cache coherence traffic can be significantly reduced, enabling the additional parallelism to yield performance benefits. In this final section, we will change the worker function such that a parameterizable fraction of its operations are reads and the remaining ones increment the counter. To do this, modify your program to accept an additional command line parameter specifying the fraction of operations that should be reads (or writes), and change the worker's thread proc to use this parameter along with rand() to conditionally make an update. Since this introduces some non-determinism into the program, and we prefer to preserve the goal of splitting fixed work across different numbers of workers, the termination condition must change as well, such that each worker performs a fixed fraction of the number of operations specified by the --target parameter. Consequently, the final value of the counter will no longer be a fixed target, but the correctness condition (no lost updates) remains the same. A revised version of the psuedo-code above is below:

/* PSUEDO-CODE--don't just copy/paste!
   dWriteProbability is type double 
   in the range [0..1] specifying the 
   fraction of operations that are updates.
*/

bool operation_is_a_write() {
  if(static_cast<double>(rand()) / static_cast<double>(RAND_MAX)) < dWriteProbability)
     return true;
  return false; 
}

worker_thread() {
  int my_operations = 0;
  int my_operation_count = maxcounter/num_workers;
  while(my_operations < my_operation_count) {
    int curval = counter;
    if(operation_is_a_write()) {
       counter++;
       my_increment_count++;
    }
    my_operations++;
  }
}

What to Turn In

Hints

Dealing with command-line options in C/C++ can be painful if you've not had much experience with it before. For this lab, it is not critical that you actually use the long form option flags (or any flags at all). (e.g. If you want to just assume that the argument at position 1 is always workers, and position 2 is always maxcounter that is fine. However, getting used to dealing with command line options in a principled way is something that will serve you well, and we strongly encourage you to consider getopt or boost/program_options. They will take a little extra time to learn now, but will save you a lot of time in the future.

When you measure performance, be sure you've turned debugging flags off and enabled optimizations in the compiler. If you're using gcc and pthreads, it's simplest to just turn off "-g", and turn on"-O3". In fact, the "-g" option just controls debugging symbols, which do not fundamentally impact performance: for the curious, some deeper consideration of gcc, debug build configurations, and performance can be found here.

It is very much worth spending some time creating scripts to run your program and tailoring it to generate output (such as CSV) that can easily be imported into graphing programs like R. For projects of this scale, I often just collect CSV output and import it into Excel. When engaged in longer term empirical efforts, a greater level of automation is often highly desirable, and I tend to prefer using bash scripts to collect CSV, along with Rscript to automatically use R's read.csv and the ggplot2 package which has functions like ggsave to automatically save graphs as PDF files. The includegraphics LaTeX macro used in the write up template works with PDF files too!

CS378: Concurrency