Sorting

Sorting means putting unordered data into order. The typical situation is where we have items in an array and would like to sort them by some key so that we can e.g. print them out nicely or do fast binary searches on them.

There are many different sorting algorithms. Sorting has been one of the most heavily researched areas of computer science. We will see just a few sorting algorithms.

In most of the discussion, we will assume we want to sort an array of n floats in ascending numerical order. We could just as easily sort them in descending order, or sort integers, or sort records by some key, or sort disk nodes, but floats will keep things simple.

Bubble Sort

This is one of the simplest sorting algorithms. It is something you can easily memorize (but might not want to, as we'll see later). The idea is that you go through the array, comparing adjacent elements. If they are out of order, you swap them. You keep repeating this process until you can go through the array without doing any swaps. At this point, no elements are out of order and the array is sorted. Here it is in C:

/* function to swap two elements in an array */
void swap (float v[], int i, int j) {
        float   t;

        t = v[i];
        v[i] = v[j];
        v[j] = t;
}

/* bubble sort function, sorts elements v[0..n-1] */
void bubble_sort (float v[], int n) {
        int     i,              /* array index */
                swapped;        /* true if we have swapped */

        do {
                /* we have not swapped yet */
                swapped = 0;

                /* go through array, looking for out of order elements */
                for (i=1; i<n; i++)

                        /* if v[i-1] and v[i] are out of order... */
                        if (v[i-1] > v[i]) {

                                /* swap them */
                                swap (v, i, i-1);

                                /* and remember to go through the loop again */
                                swapped = 1;
                        }
        } while (swapped);
}

Bubble sort is so named because the "lighter" (i.e., smaller) elements "bubble" to the top (i.e., lower indices) of the array, while the "heavier" elements sink to the bottom. An interesting experiment is to code up bubble sort to display the contents of the array each time through the do/while loop and watch the lighter elements bubble to the top.
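
The experiment suggested above might be sketched as follows. The names print_array and bubble_sort_trace, and the returned pass count, are additions made for this sketch and are not part of the original code.

```c
#include <stdio.h>

/* print v[0..n-1] on one line */
static void print_array (const float v[], int n) {
        int     i;

        for (i=0; i<n; i++) printf ("%g ", v[i]);
        printf ("\n");
}

/* bubble sort that prints the array after every pass of the do/while
 * loop; returns the number of passes made (an addition for this sketch) */
int bubble_sort_trace (float v[], int n) {
        int     i, swapped, passes = 0;
        float   t;

        do {
                swapped = 0;
                for (i=1; i<n; i++)
                        if (v[i-1] > v[i]) {
                                t = v[i-1];
                                v[i-1] = v[i];
                                v[i] = t;
                                swapped = 1;
                        }
                passes++;
                print_array (v, n);
        } while (swapped);
        return passes;
}
```

Called on the array 3 2 1, this makes three passes: the array becomes 2 1 3, then 1 2 3, and a final pass finds nothing to swap.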

Let's analyze this algorithm. For comparison sorts, where the algorithm does things based on comparisons of array elements, time is usually measured in terms of how many comparisons are done. There are three cases:

The best case. In the best case, the array is already sorted and bubble_sort simply checks that it is sorted and exits. This is n-1 = Θ(n) comparisons.
The worst case. In the worst case, the element that is supposed to be first is actually last, so that the do/while loop must run n times while the value "bubbles" up to the first array index. Each time through the do/while loop, n-1 comparisons are done. This is O(n(n-1)) = O(n²) comparisons.
The average case. This is the average running time, averaged over all possible initial orderings of the array. The analysis is somewhat more complicated, but the result is still O(n²) comparisons.

Bubble sort is pretty inefficient because of this quadratic running time; we can do a lot better than O(n²).
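
To see the best and worst case counts concretely, one could instrument the sort to count comparisons. This is only a sketch; bubble_sort_count is a hypothetical name, and the swap is written inline so that the fragment stands alone.

```c
/* bubble sort on v[0..n-1] that returns the number of element
 * comparisons it made */
long bubble_sort_count (float v[], int n) {
        int     i, swapped;
        long    comparisons = 0;
        float   t;

        do {
                swapped = 0;
                for (i=1; i<n; i++) {
                        comparisons++;
                        if (v[i-1] > v[i]) {
                                t = v[i-1];
                                v[i-1] = v[i];
                                v[i] = t;
                                swapped = 1;
                        }
                }
        } while (swapped);
        return comparisons;
}
```

With n = 8, an already sorted array costs 7 = n-1 comparisons, while an array in descending order costs 56 = n(n-1) comparisons, matching the two cases above.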

Selection Sort

See lecture 2 for a presentation and analysis of selection sort. Basically, it is just as bad as bubble sort.
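
Lecture 2 is not reproduced here. For reference, one common formulation of selection sort is sketched below; it may differ in detail from the version presented in that lecture.

```c
/* selection sort, sorts elements v[0..n-1] */
void selection_sort (float v[], int n) {
        int     i, j, min;
        float   t;

        for (i=0; i<n-1; i++) {

                /* find the index of the smallest element in v[i..n-1] */
                min = i;
                for (j=i+1; j<n; j++)
                        if (v[j] < v[min]) min = j;

                /* move it to position i */
                t = v[i];
                v[i] = v[min];
                v[min] = t;
        }
}
```

The inner loop always runs to the end of the array, so this version does n(n-1)/2 comparisons no matter how the input is ordered.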

Insertion Sort

Insertion sort "inserts" each element of the array into a sorted sub-array of increasing size. Initially, the subarray is of size 0. It depends on the insert function shown below to insert an element somewhere into a sorted array, shifting everything to the right of that element over one space. Here is the code:

/* insert k into sorted array v[0..n-1], moving everything into v[0..n] */

void insert (float v[], int n, float k) {
        int     i;

        /* test i >= 0 first so that v[-1] is never read */
        for (i=n-1; (i >= 0) && (k < v[i]); i--) v[i+1] = v[i];
        v[++i] = k;
}

void insertion_sort (float v[], int n) {
        int     i;

        for (i=0; i<n; i++) insert (v, i, v[i]);
}

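The best and worst case analysis comes out the same as for bubble sort, and will be done in class: a linear number of comparisons on an already sorted array, and O(n²) comparisons in the worst case. So insertion sort, like bubble sort and selection sort, takes O(n²) comparisons in the worst case.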
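
As a rough check on these counts, the key comparisons can be tallied. The sketch below uses hypothetical names (less, insert_counted, insertion_sort_count) and tests i >= 0 before reading v[i], so that no element outside the array is touched.

```c
static long comparisons;        /* number of key comparisons so far */

/* compare two keys, counting the comparison */
static int less (float a, float b) {
        comparisons++;
        return a < b;
}

/* insert k into sorted v[0..n-1], checking the index before the key */
static void insert_counted (float v[], int n, float k) {
        int     i;

        for (i=n-1; (i >= 0) && less (k, v[i]); i--) v[i+1] = v[i];
        v[i+1] = k;
}

/* insertion sort on v[0..n-1]; returns the number of comparisons made */
long insertion_sort_count (float v[], int n) {
        int     i;

        comparisons = 0;
        for (i=0; i<n; i++) insert_counted (v, i, v[i]);
        return comparisons;
}
```

With n = 8, a sorted array costs 7 = n-1 comparisons and an array in descending order costs 28 = n(n-1)/2.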

Can We Do Better?

So far, we have seen three sorting algorithms that operate using O(n²) comparisons in the worst case. Does there exist a better sorting algorithm? Let's try to find a strict lower bound on the number of comparisons needed to sort an array.

If we ignore the procedural aspects of an algorithm and look only at the data being sorted, we see that each comparison results in at most one change in the order of the array: at any one step, either two elements are swapped or nothing happens.

Without loss of generality, let's assume that each array element is different. This makes the analysis easier and is often not far from the truth.

We can think of this process as a search through a binary tree, called a decision tree, where each node is a permutation (a particular order) of the array. The root of this tree is the order of the array as the algorithm initially encounters it.

What we're searching for is the node where the permutation of elements is sorted. The left and right children of a node are the two resulting permutations when the comparison is "less than" and "greater than," respectively. It is up to the algorithm which two elements to compare. For example, the following decision tree shows the movement of data in the bubble sort algorithm performed on three items (the tree is not complete; it is large):

                            { a b c }
                       /                 \
                      /                   \
                     /                     \
            a < b   /                       \   a > b
                   /                         \
               { a b c }                   { b a c }
        b < c  /       \ b > c      a < c  /       \ a > c
              /         \                 /         \
         { a b c }   { a c b }        { b a c }   { b c a }
                       /   \            /   \       /   \

A general purpose sort is a sorting algorithm that works on any kind of ordered data. You provide the algorithm with an ordering on the data, and the algorithm sorts them for you.

For a comparison sort, that ordering takes the form of a way to compare two items of data. The standard C function qsort is a good example of a general purpose sort:

#include <stdlib.h>
void qsort(void *base, size_t nel, size_t width,
     int (*compar) (const void *, const void *));

base is a pointer to the first element of the array to sort.
nel is the number of elements in the array.
width is the size of an individual element of the array; for example, for an array of doubles, you would write sizeof (double) for width.
compar is a pointer to a function that compares any two items from the array through pointers to the elements. You have to write this comparison function yourself. It should return a positive integer if the first element is greater than the second, a negative integer if it is less than the second, and 0 if they are equal. For an array of fixed-width character strings, the strcmp function works fine; for an array of pointers to strings, compar receives pointers to those pointers, so strcmp needs a small wrapper that dereferences them first.

qsort is typically implemented as a version of Quicksort (randomized on some systems) with very good performance, although the C standard does not require any particular algorithm.
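
For the array of floats used throughout these notes, a call to qsort might look like the sketch below. The names compare_floats and sort_floats are made up for this example; only qsort itself comes from the library.

```c
#include <stdlib.h>

/* comparison function for qsort: orders two floats ascending */
static int compare_floats (const void *a, const void *b) {
        float   x = *(const float *) a;
        float   y = *(const float *) b;

        /* yields 1, -1 or 0 */
        return (x > y) - (x < y);
}

/* sort v[0..n-1] in ascending order using the library qsort */
void sort_floats (float v[], size_t n) {
        qsort (v, n, sizeof (float), compare_floats);
}
```

The comparison is written with two tests rather than as x - y because the result must be an int; converting a fractional difference such as 0.5 to int would give 0 and wrongly report the two elements as equal.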

So sorting is like a search from the initial permutation (root) to the sorted permutation (some node in the tree). In the worst case, the sorted permutation may be a leaf node, requiring a number of comparisons proportional to the height of the tree. So a worst case lower bound on comparison sorting is the height of this decision tree. If our algorithm is clever, its decision tree will be an almost-complete binary tree. The height of such a tree with m nodes is

Θ(ln m),

and no binary tree with m nodes is asymptotically shorter.

How many nodes are there in the decision tree for an array of size n? Since there must be a node for every permutation of the array, there are at least n! nodes (i.e., n-factorial, n * (n-1) * (n-2) * (n-3) * ... * 1 nodes). So the height of the decision tree is Ω(ln (n!)). A lower bound on the factorial function (known as Stirling's approximation) is:

(2πn)^(1/2) (n/e)ⁿ <= n!

for all n. If we take logarithms on both sides and use the properties that log ab = log a + log b and log a/b = log a - log b, and some asymptotic notation to hide constants, we get:

Θ(1) + (1/2) ln n + n ln n - n <= ln (n!)

which, because the n ln n term dominates the others, works out to simply

ln (n!) = Θ(n ln n)

(The matching upper bound is easier to see: n! <= nⁿ, so ln (n!) <= ln (nⁿ) = n ln n, i.e., ln (n!) = O(n ln n).)

So the height of the decision tree has a lower bound of Ω(n ln n). In the worst case, the sorting algorithm will have to "search" all the way down to a leaf node, so Ω(n ln n) comparisons is the best a comparison sort can be expected to do. Since every comparison costs at least one operation, the running time is at least proportional to the number of comparisons, so this is also a lower bound on the worst case time-complexity of any comparison sort.
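
For small n the bound can be computed exactly. The sketch below uses the usual leaf-counting form of the argument, which is a slight restatement of the node-counting version in these notes: a comparison sort must be able to distinguish all n! initial orderings, and a binary tree of height h has at most 2^h leaves. The function name min_comparisons is hypothetical.

```c
/* smallest h such that a binary tree of height h can have n! leaves,
 * i.e. the smallest h with 2^h >= n!; valid for 0 <= n <= 20, where n!
 * still fits in an unsigned long long */
int min_comparisons (int n) {
        unsigned long long fact = 1, leaves = 1;
        int     i, h = 0;

        /* compute n! */
        for (i=2; i<=n; i++) fact *= i;

        /* double the leaf count until it reaches n! */
        while (leaves < fact) {
                leaves *= 2;
                h++;
        }
        return h;
}
```

This gives 3 comparisons for n = 3, 5 for n = 4, 7 for n = 5 and 22 for n = 10, growing in proportion to n ln n as the analysis predicts.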

This all suggests that we might be able to sort in time Θ(n log n). Can we? Yes. For example, consider inserting all the array elements into a binary search tree, then traversing the tree, placing the elements back into the array in order, e.g.:

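/* assumes v is the (global) array being refilled and tree is a binary
 * search tree node with fields left, right and key k */
int i = 0;      /* next free position in v */
void traverse (tree *t) {
        if (!t) return;
        traverse (t->left);     /* smaller keys first */
        v[i++] = t->k;
        traverse (t->right);    /* then larger keys */
}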

Inserting takes Θ(n log n) time, traversing takes less time, so this is a Θ(n log n) sorting algorithm. However, it suffers from the degeneracy that plagues binary search trees in general (we might get a linked-list looking tree and it would just turn into insertion sort), and it wastes a lot of memory. Next time we will see Θ(n log n) sorting algorithms that use no extra memory.
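
Putting the pieces together, the tree-based sort described above might be sketched as follows. The node layout and the names tree_insert, traverse_free and tree_sort are assumptions made for this example; the tree is not balanced, so it shares the degeneracy just mentioned (on sorted input it becomes a linked list and the recursion is n levels deep).

```c
#include <stdlib.h>

/* binary search tree node; field names follow the traverse fragment */
typedef struct tree {
        float           k;
        struct tree     *left, *right;
} tree;

/* insert k into the tree rooted at *t; returns 0 if out of memory */
static int tree_insert (tree **t, float k) {
        /* walk down to the empty spot where k belongs */
        while (*t) t = (k < (*t)->k) ? &(*t)->left : &(*t)->right;
        *t = malloc (sizeof (tree));
        if (!*t) return 0;
        (*t)->k = k;
        (*t)->left = (*t)->right = NULL;
        return 1;
}

/* in-order traversal: copy keys into v starting at *i, freeing nodes */
static void traverse_free (tree *t, float v[], int *i) {
        if (!t) return;
        traverse_free (t->left, v, i);
        v[(*i)++] = t->k;
        traverse_free (t->right, v, i);
        free (t);
}

/* sort v[0..n-1]; returns 1 on success, 0 if memory ran out (in which
 * case only the elements already inserted have been sorted) */
int tree_sort (float v[], int n) {
        tree    *root = NULL;
        int     i, ok = 1;

        for (i=0; i<n && ok; i++) ok = tree_insert (&root, v[i]);
        i = 0;
        traverse_free (root, v, &i);
        return ok;
}
```

Each element needs its own heap-allocated node, which is the extra memory cost referred to above.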