CS378: Concurrency

Lab #4

The goal of this assignment is exposure to GPU programming. You will solve the same problem you solved in Lab 1, k-means, using CUDA. Recall the following background about K-Means from Lab 1:

K-Means

K-Means is a machine-learning algorithm most commonly used for unsupervised learning. Suppose you have a data set where each data point has a set of features, but you don't have labels for them, so training a classifier to bin the data into classes cannot rely on supervised algorithms (e.g. Support Vector Machines, which learn hypothesis functions to predict labels given features).

One of the most straightforward things we can do with unlabeled data is to look for groups of data in our dataset which are similar: clusters. K-Means is a "clustering" algorithm. K-Means stores k centroids that define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid. K-Means finds these centroids by alternately assigning data points to clusters based on the current version of the centroids, and then re-computing the centroids based on the current assignment of data points to clusters. The behavior of the algorithm can be visualized as follows:
[Figure: the progression of K-Means. Panels show the initial input; three random centers being chosen; each point mapped to its nearest centroid; and new centroids computed as the mean of all points mapping to them. Iterations move the centroids closer to their final positions until they stop moving, at which point each point is labeled with its nearest centroid.]

The Algorithm

In the clustering problem, we are given a training set x(1),...,x(m), and want to group the data into cohesive "clusters." We are given feature vectors for each data point x(i) encoded as floating-point vectors in D-dimensional space. But we have no labels y(i). Our goal is to predict k centroids and a label c(i) for each datapoint. Here is some pseudo-code implementing k-means:

kmeans(dataSet, k) {

  // initialize centroids randomly
  numFeatures = dataSet.getNumFeatures();
  centroids = randomCentroids(numFeatures, k);

  // book-keeping
  iterations = 0;
  oldCentroids = null;
  done = false;

  // core algorithm
  while(!done) {

    oldCentroids = centroids;
    iterations++;

    // labels is a mapping from each point in the dataset
    // to the nearest (Euclidean distance) centroid
    labels = findNearestCentroids(dataSet, centroids);

    // the new centroids are the average
    // of all the points that map to each
    // centroid
    centroids = averageLabeledCentroids(dataSet, labels, k);
    done = iterations > MAX_ITERS || converged(centroids, oldCentroids);
  }

  return centroids;
}
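The per-point assignment step in the pseudocode above is a natural fit for a CUDA kernel with one thread per data point. Below is a minimal sketch of such a kernel; the names, row-major array layout, and parameters are our own illustrative choices, not a prescribed interface:

```cuda
#include <cfloat>  // FLT_MAX

// One thread per point: find the index of the nearest centroid
// (squared Euclidean distance suffices for the argmin).
__global__ void findNearestCentroids(const float *points,    // n x dims, row-major
                                     const float *centroids, // k x dims, row-major
                                     int *labels, int n, int dims, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; c++) {
        float dist = 0.0f;
        for (int d = 0; d < dims; d++) {
            float diff = points[i * dims + d] - centroids[c * dims + d];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
    }
    labels[i] = best;
}
```

A host-side loop would launch this kernel once per iteration (e.g., with a block size of 256 and `(n + 255) / 256` blocks), then recompute the centroids from the labels.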

K-Means in CUDA

You will write the same program you wrote for lab 1, with the same input options and the same single-threaded CPU baseline, but you will parallelize using CUDA rather than pthreads for Steps 2 and beyond. Your single-threaded CPU-based K-Means should be the baseline against which you measure all the variants of your GPU implementation below.

Step 1: Resurrect your sequential CPU solution

From lab 2, you should already have a single-threaded CPU-based program that accepts command-line parameters to specify the following:

Your program's output should include:

Step 2: Parallelize your algorithm

Write a basic version of K-Means in CUDA. The most natural parallelization is to assign a CUDA thread per point in the input. The CS department has set up a handful of machines with GPUs and CUDA installed on them: eldar-2 through eldar-4. The CUDA-8.0 runtime and tools are installed in /opt/cuda-8.0. There are a number of ways to make it work, but putting the /opt/cuda-8.0/bin on your path should suffice to let you use NVCC, the CUDA compiler. There are 2 GPUs per machine. If you want to select a particular GPU, see the cudaGetDeviceCount() and cudaSetDevice() functions in the CUDA Device Management API.
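As a starting point, the following hedged sketch shows how a host program might enumerate the two GPUs on an eldar machine and select one; the choice of device index here is arbitrary, and error handling is omitted for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("found %d CUDA device(s)\n", count);

    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, %d SMs\n", d, prop.name, prop.multiProcessorCount);
    }

    // e.g., use the second GPU; subsequent CUDA calls on this
    // host thread will target the selected device
    if (count > 1)
        cudaSetDevice(1);
    return 0;
}
```

Compile with `nvcc` from /opt/cuda-8.0/bin, e.g. `nvcc -o devinfo devinfo.cu`.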

Using the random-n2048-d16-c16.txt, random-n16384-d24-c16.txt, and random-n65536-d32-c16.txt sample inputs, --iterations 20, and --threshold 0.0000001, use your implementation to find 16 centroids. Create a graph of the scalability of your solution against input size. Please normalize your measurements against the single-threaded solution from Step 1.

Step 3: Use Shared Memory

Use CUDA shared memory to implement private partial aggregations per thread group.
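One plausible shape for this (a sketch under our own assumptions, not a prescribed design) is a kernel in which each thread block accumulates partial per-cluster sums and counts in shared memory, then flushes one partial aggregate per block to global memory, rather than issuing a global atomic per point. This assumes `k * dims` floats plus `k` ints fit in shared memory:

```cuda
#include <cuda_runtime.h>

// Each block builds private partial aggregates in shared memory,
// then merges them into the global sums/counts with one pass of atomics.
__global__ void accumulateClusters(const float *points, const int *labels,
                                   float *sums,   // k x dims, zeroed before launch
                                   int *counts,   // k, zeroed before launch
                                   int n, int dims, int k)
{
    extern __shared__ float smem[];           // k*dims partial sums...
    int *scounts = (int *)&smem[k * dims];    // ...followed by k partial counts

    // cooperatively zero the block-private accumulators
    for (int j = threadIdx.x; j < k * dims; j += blockDim.x) smem[j] = 0.0f;
    for (int j = threadIdx.x; j < k; j += blockDim.x) scounts[j] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = labels[i];
        for (int d = 0; d < dims; d++)
            atomicAdd(&smem[c * dims + d], points[i * dims + d]); // shared-memory atomic
        atomicAdd(&scounts[c], 1);
    }
    __syncthreads();

    // one flush per block to global memory instead of one per point
    for (int j = threadIdx.x; j < k * dims; j += blockDim.x) atomicAdd(&sums[j], smem[j]);
    for (int j = threadIdx.x; j < k; j += blockDim.x) atomicAdd(&counts[j], scounts[j]);
}
```

The launch must pass the dynamic shared-memory size, e.g. `k * dims * sizeof(float) + k * sizeof(int)` as the third launch-configuration argument.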

Create a similar graph of scalability versus input size for your optimized solution. In this case, include bars for the fastest multi-threaded solution you measured in lab 2, labeled to indicate the number of threads and the combination of locking primitives used to implement it.

Step 4: Extra credit -- pursue better performance

In this step, you may, for extra credit, explore other ways to make your GPU k-means faster. Can you use other architectural support (read-only memory, ballot instructions, CAS) or other parallelization techniques? K-Means can be decomposed as a GroupBy-Aggregate, which can be implemented relatively easily in CUDA using thrust::sort and prefix-scan primitives. Is this approach faster or slower than BSP-style domain decomposition? As with lab 2, extra credit will be given for any solution that undertakes this section, as long as the solution is still correct (you may find it hard to improve scalability). Major kudos (and extra points) go to the solution that is fastest in absolute terms and the one that is most scalable.
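To make the GroupBy-Aggregate idea concrete, here is one hedged sketch using Thrust, shown for a single feature dimension to keep it short. A real solution must aggregate all D dimensions (for example with one pass per dimension over a saved copy of the labels, or with zipped/strided iterators); the function and variable names are our own:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// GroupBy-Aggregate for one coordinate: sort points by cluster label,
// then sum each run of equal labels. Note that sort_by_key reorders
// both the labels and this feature column in place.
void aggregateOneDim(thrust::device_vector<int> &labels,    // per-point labels
                     thrust::device_vector<float> &feature, // one coordinate per point
                     thrust::device_vector<int> &outLabels, // size >= k
                     thrust::device_vector<float> &outSums) // size >= k
{
    // GroupBy: bring equal labels together
    thrust::sort_by_key(labels.begin(), labels.end(), feature.begin());

    // Aggregate: one sum per run of equal labels
    thrust::reduce_by_key(labels.begin(), labels.end(), feature.begin(),
                          outLabels.begin(), outSums.begin());
}
```

Dividing each per-cluster sum by the corresponding cluster size (obtainable the same way with a vector of ones) yields the new centroids.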

Deliverables

You should use canvas to submit, along with your code, Makefiles, and measurement scripts, a brief writeup with the scalability graphs requested above. Since the goal is to compare the performance of different implementations, it is fine to include all measurements on the same graph, as long as they are well-labeled. Either way, be sure that your writeup includes sufficient text to enable us to understand which graphs are which.

Your writeup should additionally answer the following questions. In cases where we ask you to explain performance behavior, it is fine to speculate, but be clear about whether your observations are empirical or speculative.

Handy links: