Disjoint Set Data Structure

Suppose we have n items (student records, bank account records, whatever) each with unique keys from 1..n. We want to keep the items in a collection of sets (disjoint sets) such that an item must occur in exactly one of those sets. For example, we want to partition a set of students into "students with GPA >= 2.0" and "students with GPA < 2.0." Or we might want the collection to have many sets, e.g. different income levels of people. The point is, the sets are mutually exclusive and include all the items.

In disjoint sets, each set is identified by a representative that is some member of the set. For convenience, we can choose this element to be the one with the smallest key, but any member will do, as long as every member of the set agrees on the same representative. It will soon become convenient to think of the representative as the parent of the other items in the set, like in a tree.

In many situations in computer science, problems involving disjoint sets naturally arise such that the sets grow dynamically (i.e., during the course of an algorithm, sets change by merging) and two important operations are:

Find (x) - determine which set an item with key x is in, i.e., return the key of the representative of the set x is in. Using this operation, one can tell whether two elements are in the same set: you just do a Find on both of them and compare the return values. If they are the same, then the two items are in the same set.
Union (x, y) - unite the sets containing x and y. (Note: union is a C keyword; don't write a function called union in your C or C++ program!)

Here is an example of an algorithm that uses these operations. It computes the connected components of an undirected graph G=(V,E). Recall that the connected components of a graph are its maximal subgraphs in which every pair of vertices is joined by a path:

Connected-Components (G)
	for each vertex v in V do
		// each vertex is initially in its own component
		make v a singleton set with a unique key from 1..|V|
	end for
	for each edge (u, v) in E do
		// if u and v are connected by an edge, then
		// everything in u's component is connected
		// to everything in v's component, so Union the sets
		if Find (u) != Find (v) then Union (u, v)
	end for

Now, if we want to find out whether two vertices are in the same connected component, we just call Find on both vertices and see if the result is the same. If so, then they are connected.
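As a rough illustration, Connected-Components can be sketched in Python. The function name connected_components, the example graph, and the simple parent-array Find/Union used inside it are assumptions made for this sketch (the parent array is the representation described later in these notes); this is not a tuned implementation.

```python
def connected_components(num_vertices, edges):
    """Label each vertex 1..num_vertices with its component's representative."""
    # p[v] == v means v is a representative; index 0 is unused padding
    p = list(range(num_vertices + 1))

    def find(x):
        # follow parent links until reaching an item that points to itself
        while p[x] != x:
            x = p[x]
        return x

    def union(x, y):
        # make y's representative the parent of x's representative
        p[find(x)] = find(y)

    for u, v in edges:
        if find(u) != find(v):
            union(u, v)

    return [None] + [find(v) for v in range(1, num_vertices + 1)]


# hypothetical graph: edges 1-2, 2-3, 4-5; vertex 6 is isolated
rep = connected_components(6, [(1, 2), (2, 3), (4, 5)])
print(rep[1] == rep[3])  # True: 1 and 3 are connected through 2
print(rep[1] == rep[4])  # False: different components
```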

There is an easy algorithm that implements the disjoint set operations. The idea is that we have an array p[1..n] with an element for each item that is in some set. If the item is the representative (parent) of its set, then the value of this element is its own index; otherwise the value is the index of another item in the same set, giving rise to a linked list eventually ending in the representative. For example, suppose we have the following disjoint sets (indicated by their unique 1..n keys):

{ 6 14 1 } { 2 3 13 } { 5 12 7 8 10 } { 9 11 4 15 }

Then we might have the following lists giving the set relationships:

               1 <-- 6 <-- 14

               2 <-- 3 <-- 13

               5 <-- 12 <-- 7 <-- 8 <-- 10

               9 <-- 11 <-- 4 <-- 15

To find the representative of an element, we just traverse the list until we reach the parent (which points to itself).

Of course, nothing (except for ASCII graphics) prevents us from having tree-like structures with this representation, where more than one item points to the same parent item, e.g.:

                      />1<\
                     /     \
                    14      6              13 -->2<-- 3

           5<\
              \                 />9<\
            />12<\             /  ^  \
           /  ^   7           /   |   \
          10  |              4    11  15
              8

An array representing this forest would look like this:

i    1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
p[i] 1  2  2  9  5  1  12 12 9  12  9   5   2   1   9
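
To check the table against the sets listed earlier, here is a small Python sketch that copies the p[i] row into a list and groups the keys by the representative reached from each one. The helper name representative and the padding entry at index 0 are choices made for this sketch only.

```python
# parent array copied from the table above; index 0 is unused so keys run 1..15
p = [0, 1, 2, 2, 9, 5, 1, 12, 12, 9, 12, 9, 5, 2, 1, 9]

def representative(x):
    # walk parent links until we reach an item that is its own parent
    while p[x] != x:
        x = p[x]
    return x

# group every key under the representative of its set
sets = {}
for i in range(1, 16):
    sets.setdefault(representative(i), []).append(i)

for rep, members in sorted(sets.items()):
    print(rep, members)
```

The four groups printed should be the same four sets as in the example, with 1, 2, 5 and 9 as the representatives.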

Let's see algorithms to implement this linked-list Union/Find. We'll assume that p[1..n] is initialized to p[i] = i so each item is in its own singleton set and is its own set's representative.

// return the key of the representative of this set
Find (x)
	if p[x] != x then
		return Find (p[x])
	end if
	return x

// join two sets containing items x and y together
Union (x, y)
	a = Find (x)	// a is x's representative
	b = Find (y)	// b is y's representative
	p[a] = b	// now b is a's parent, and Find (x) would return b

These algorithms are very simple. In the worst case, Find will take O(n) operations; this worst case would occur when we have one big set represented as a long linked list and try to Find the representative of the last item. Union is bound by the same worst case time, since it calls Find O(1) times.
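
The worst case is easy to provoke. The Python sketch below is a direct transcription of the simple Find and Union, except that find is written as a loop (to stay clear of Python's recursion limit) and also returns how many parent links it followed; that hop counter and the choice n = 1000 are additions for illustration.

```python
def find(p, x):
    """Return (representative, number of parent links followed)."""
    hops = 0
    while p[x] != x:
        x = p[x]
        hops += 1
    return x, hops

def union(p, x, y):
    a, _ = find(p, x)  # a is x's representative
    b, _ = find(p, y)  # b is y's representative
    p[a] = b           # now b is a's parent

n = 1000
p = list(range(n + 1))  # p[i] = i; index 0 is unused
for i in range(1, n):
    union(p, i, i + 1)  # builds the chain 1 --> 2 --> ... --> n

root, hops = find(p, 1)
print(root, hops)  # 1000 999
```

Finding the representative of item 1 walks the whole list, n - 1 links in all.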

One obvious improvement would be to have all set members point directly to their representatives, instead of to arbitrary other members of the set. One way to do this is called path compression: each time we do a Find on a set element, we make its parent the representative element. We "compress" the path from leaf to root; after a Find, instead of many levels between a leaf and the root, there is just one. And, every item along the path from leaf to root is also directly connected to the root:

Find (x)
	if p[x] != x then
		p[x] = Find (p[x])
	end if
	return p[x]
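
Here is the same Find as a Python sketch, run on a made-up five-item chain so the effect on the array is visible. The chain is hypothetical, and because the function is recursive, a very long chain could exceed Python's default recursion limit; an iterative version would avoid that.

```python
def find(p, x):
    # path compression: point x directly at the representative on the way back
    if p[x] != x:
        p[x] = find(p, p[x])
    return p[x]

# hypothetical chain 1 <-- 2 <-- 3 <-- 4 <-- 5 (index 0 is unused)
p = [0, 1, 1, 2, 3, 4]
print(find(p, 5))  # 1
print(p[1:])       # [1, 1, 1, 1, 1]
```

After the single call, every item on the path from 5 up to the root points straight at 1.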

The next time a Find is done on any element along this path, the representative will be returned in O(1) time. This yields very good performance in practice, since any long chain will likely be broken quickly.

However, we still have to worry about Union: each time we do a Union, we push some tree down a level. This means that the next Find done on a member of this subtree will take a little longer than it would have before the Union (although the path is compressed during the Find, the damage is already done with the time spent doing the Find). We can minimize the impact of this situation using a heuristic called union by size (often loosely called union by rank, although strictly speaking union by rank tracks a bound on each tree's height rather than its element count). With this method, we keep track of how many elements are in each tree, and make the root of the smaller tree a child of the root of the larger tree. This ensures that most of the items in the new tree are unaffected by the Union in terms of how long it takes to find the representative. With each item x we associate a count count[x] that contains the number of items in the tree rooted at x; only the counts stored at roots are kept up to date. When the sets first start out as singletons, they each have a count of one.

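Union (x, y)
	a = Find (x)
	b = Find (y)
	// nothing to do if x and y are already in the same set;
	// without this check, the count below would be doubled
	if a == b then return
	if count[a] > count[b] then
		// a has more items; make b its child
		p[b] = a
		// and update the count of a to include b's items
		count[a] += count[b]
	else
		// or vice-versa
		p[a] = b
		count[b] += count[a]
	end if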

(Your book uses a different, approximate method in order to make the analysis of the algorithm easier.)
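
To see what the count heuristic buys on its own, the Python sketch below deliberately leaves path compression out and repeats the chain-building sequence Union (1, 2), Union (2, 3), ..., which produces one long list under the simple Union. The function name union_by_size, the hop counter, and n = 1000 are assumptions of this sketch; it also returns early when both items already share a root so that counts are never added twice.

```python
def find(p, x):
    """Plain Find without path compression; returns (root, links followed)."""
    hops = 0
    while p[x] != x:
        x = p[x]
        hops += 1
    return x, hops

def union_by_size(p, count, x, y):
    a, _ = find(p, x)
    b, _ = find(p, y)
    if a == b:
        return  # already in the same set
    if count[a] > count[b]:
        p[b] = a              # a's tree is bigger; b becomes its child
        count[a] += count[b]
    else:
        p[a] = b              # otherwise a becomes b's child
        count[b] += count[a]

n = 1000
p = list(range(n + 1))   # index 0 is unused
count = [1] * (n + 1)
for i in range(1, n):
    union_by_size(p, count, i, i + 1)

deepest = max(find(p, i)[1] for i in range(1, n + 1))
print(deepest)  # 1
```

Each new singleton is hung directly under the root of the big tree, so no item ends up more than one link away from its representative.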

Here is an example of doing some Union operations on sets with indices 1..6 (c[i] is count[i]):

i	 1  2  3  4  5  6 
p[i]	 1  2  3  4  5  6 
c[i]	 1  1  1  1  1  1 

Union (3, 5)
i	 1  2  3  4  5  6 
p[i]	 1  2  5  4  5  6 
c[i]	 1  1  1  1  2  1 

Union (4, 2)
i	 1  2  3  4  5  6 
p[i]	 1  2  5  2  5  6 
c[i]	 1  2  1  1  2  1 

Union (2, 6)
i	 1  2  3  4  5  6 
p[i]	 1  2  5  2  5  2 
c[i]	 1  3  1  1  2  1 

Union (1, 4)
i	 1  2  3  4  5  6 
p[i]	 2  2  5  2  5  2 
c[i]	 1  4  1  1  2  1 

Union (3, 6)
i	 1  2  3  4  5  6 
p[i]	 2  2  5  2  2  2 
c[i]	 1  6  1  1  2  1

(We'll see the trees on the board in class.)
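
The tables above can also be reproduced mechanically. This Python sketch combines Find with path compression and the count-based Union (with an early return if both items already share a root, which this sequence never triggers) and replays the five Union calls; the list layout with an unused index 0 is a convenience of the sketch.

```python
def find(p, x):
    # Find with path compression
    if p[x] != x:
        p[x] = find(p, p[x])
    return p[x]

def union(p, c, x, y):
    a, b = find(p, x), find(p, y)
    if a == b:
        return            # same set already; leave the counts alone
    if c[a] > c[b]:
        p[b] = a
        c[a] += c[b]
    else:
        p[a] = b
        c[b] += c[a]

p = list(range(7))        # p[i] = i for i = 1..6; index 0 is unused
c = [0] + [1] * 6         # every singleton has count 1
for x, y in [(3, 5), (4, 2), (2, 6), (1, 4), (3, 6)]:
    union(p, c, x, y)
    print(f"Union ({x}, {y})  p = {p[1:]}  c = {c[1:]}")
```

The last line printed should match the final table: p = [2, 2, 5, 2, 2, 2] and c = [1, 6, 1, 1, 2, 1]. Note that c is only meaningful at roots; the 2 left in c[5] is stale once 5 stops being a representative.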

The analysis of this Union/Find algorithm is beyond the scope of an undergraduate class. However, the results are pretty amazing: a Union or Find operation takes O(lg* n) time amortized over all the operations (i.e., one particular instance may take longer, but overall, each one averages out to O(lg* n)). This lg* is the iterated logarithm function; it's the number of times you have to take the log base 2 of a number before the result is 1 or less. This is a very slowly growing function; lg* 10^20 is only 5, and lg* n is at most 5 for every n up to 2^65536, a number with nearly 20,000 decimal digits. You will never need to do Union/Find on sets with that many elements. So for all practical purposes, Union/Find can be considered to run in constant time.
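
For a feel of how slowly lg* grows, here is a short Python sketch. It assumes the usual definition, counting how many times log base 2 must be applied before the value drops to 1 or less; the function name lg_star is made up for this example.

```python
import math

def lg_star(n):
    """Iterated logarithm: applications of log2 needed to bring n down to <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(lg_star(16))      # 3  (16 -> 4 -> 2 -> 1)
print(lg_star(65536))   # 4  (65536 -> 16 -> 4 -> 2 -> 1)
print(lg_star(10**20))  # 5
```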