CS380P: Parallel Systems

Lab #2: KMeans with CUDA

The goal of this assignment is exposure to GPU programming. You will solve k-means, using CUDA.

K-Means

K-Means is a machine-learning algorithm most commonly used for unsupervised learning. Suppose you have a data set where each data point has a set of features, but you don't have labels for them, so training a classifier to bin the data into classes cannot rely on supervised algorithms (e.g. Support Vector Machines, which learn hypothesis functions to predict labels given features).

One of the most straightforward things we can do with unlabeled data is to look for groups of data in our dataset which are similar: clusters. K-Means is a "clustering" algorithms. K-Means stores k centroids that define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid. K-Means finds these centroids by alternately assigning data points to clusters based on a current version of the centroids, and then re-computing the centroids based on the current assignment of data points to clusters. The behavior the algorithm can be visualized as follows:
Initial input Choose three random centers Map each point to its nearest centroid New centroid is mean of all points
mapping to it. Iterations move
the centroids closer to their destination.
Centroids stop moving.
Each point labeled with its nearest centroid.

The Algorithm

In the clustering problem, we are given a training set x(1),...,x(m), and want to group the data into cohesive "clusters." We are given feature vectors for each data point x(i) encoded as floating-point vectors in D-dimensional space. But we have no labels y(i). Our goal is to predict k centroids and a label c(i) for each datapoint. Here is some pseudo-code implementing k-means:

kmeans(dataSet, k) {

  // initialize centroids randomly
  numFeatures = dataSet.getNumFeatures();
  centroids = randomCentroids(numFeatures, k);

  // book-keeping
  iterations = 0;
  oldCentroids = null;

  // core algorithm
  while(!done) {

    oldCentroids = centroids;
    iterations++;

    // labels is a mapping from each point in the dataset 
    // to the nearest (euclidean distance) centroid
    labels = findNearestCentroids(dataSet, centroids);

    // the new centroids are the average 
    // of all the points that map to each 
    // centroid
    centroids = averageLabeledCentroids(dataSet, labels, k);
    done = iterations > MAX_ITERS || converged(centroids, oldCentroids);
}

For detailed instructions on how to write your code and submit it for evaluation, please see these instructions.

You should include a report that answers the following questions. In cases where we ask you to explain performance behavior, it is fine to speculate, but be clear whether your observations are empirical or speculation.