Fast Searching with Hash Tables

This is a summary of the material presented in the lab last time.

A searching algorithm looking for a key k in a set of keys S typically must wade through many keys in S before finding k. We can quantify this behavior in terms of the number of comparisons of keys in S with k.

With linear searching in, for example, linked lists and unordered arrays, the number of comparisons is Θ(n), where n = |S|. With binary search in, for example, binary search trees and ordered arrays, this can be improved to Θ(log n) comparisons. B-Trees let us lessen the impact of disk input/output on searching, but the Ω(log n) lower bound on the number of comparisons remains.

Hashing is a technique that gets us to the right key in the set with essentially O(1) comparisons. We keep the keys in an array called a hash table. When we want to find the location (index) of a certain key k, we call a hash function, which, as if by magic, computes from k itself where k should be in the array. Then we go to the array and look for k in that immediate area, taking (if we arrange everything right) only O(1) work.

In a hash table with N array elements, the hash function must return a valid array index, which in C means a value between 0 and N-1. Ideally, the hash function tells us exactly where the key is located in the array. In reality, more than one key may hash to the same location (i.e., have the same hash function value). When two keys have the same hash value, this is called a collision. Much of the design of a hash table comes down to deciding what to do in case of a collision.
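
For instance, with a hypothetical table of N = 100 slots and the simple hash function k % 100, the distinct keys 1234 and 5634 both hash to index 34, so they collide:
#include <stdio.h>

int main (void) {
	/* two different keys, one hash value: a collision */
	printf ("%d\n", 1234 % 100);	/* prints 34 */
	printf ("%d\n", 5634 % 100);	/* prints 34 */
	return 0;
}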

Collision Policies

A collision occurs when two keys k1 and k2 hash to the same array index i. There are two main techniques used to resolve collisions encountered in a hash table:

Chaining: each array element is a "bucket," typically a linked list, holding every key that hashes to that index (as sketched just below).

Probing (also called open addressing): every key is stored in the array itself; on a collision we step forward through the array from the hash index until we find an open slot (when inserting) or the key we want (when searching).
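
Here is a minimal sketch of chaining, assuming integer keys, with a made-up node type and table size (this is separate from the student-record example later on):
#include <stdlib.h>	/* for malloc */

#define NBUCKETS 1000	/* table size for this sketch only */

typedef struct _node {
	int		key;
	struct _node	*next;	/* next record in this bucket's list */
} node;

node	*buckets[NBUCKETS];	/* each entry heads a linked list; globals start out NULL */

/* insert a key at the front of its bucket's list */
void chain_insert (int key) {
	node	*n = malloc (sizeof (node));

	n->key = key;
	n->next = buckets[key % NBUCKETS];
	buckets[key % NBUCKETS] = n;
}

/* walk the bucket's list looking for the key */
node *chain_search (int key) {
	node	*n;

	for (n = buckets[key % NBUCKETS]; n != NULL; n = n->next)
		if (n->key == key) return n;
	return NULL;
}
The worked example below uses probing instead, so all the records stay in one flat array.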

Properties of a Well-Designed Hash Table

A good hash table should have the following properties:

There should be enough slots in the table to comfortably hold the number of keys we expect to store.

The hash function should spread the keys as evenly as possible over those slots.

As a counterexample, a poorly designed hash table for storing words from a dictionary would use a hash function assigning indices 0..25 to words beginning with a..z, respectively. This is poor because we will probably have many more words than letters (thus large buckets or an overflowing probing table) and because some letters occur more frequently than others (leading to efficient searching for 'xylophone', perhaps, but not for 'echidna').
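
By contrast, a hash function that mixes in every character of the word spreads dictionary words far more evenly over the table. Here is a minimal sketch; the multiplier 31 and the table size are arbitrary choices for illustration:
#define WORD_TABLE_SIZE 49999	/* a prime table size, chosen for this sketch only */

/* hash an entire word, not just its first letter */
int word_hash (const char *word) {
	unsigned long	h = 0;

	while (*word != '\0') {
		h = h * 31 + (unsigned char) *word;	/* mix in each character */
		word++;
	}
	return (int) (h % WORD_TABLE_SIZE);
}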

Let's look at an example of hashing with probing. Suppose we have student records like this:

typedef struct _rec {
	int	id;		/* student ID */
	char	name[100];	/* student name */
	float	gpa;		/* student gpa */
} rec;
We are expecting to have maybe 10,000 students, so we'll have our hash table accommodate 30,000 records:
#define N 30000
rec	table[N];
An obvious first idea for a hash function is the last four digits of the student ID. However, this gives us a number only in the range 0..9999, wasting two thirds of the table and inviting collisions. So we'll use the student ID modulo N instead:
int hash (int id) {
	return id % N;
}
We first initialize the table so that each id field holds an impossible value (no real student ID is negative) meaning "this slot is available":
void init_table (void) {
	int	i;

	for (i=0; i<N; i++) 
		table[i].id = -1;
}
Here is the function for inserting with probing:
void insert (rec r) {
	int	i;

	/* find hash value */

	i = hash (r.id);

	/* look for an available slot (assumes the table never fills up completely) */

	while (table[i].id != -1) {
		i++;

		/* wrap around so we don't go off the end of the array */

		if (i == N) i = 0; 
	}
	table[i] = r;
}
And here is the function for searching by student ID, returning a pointer to the record found, or NULL if the record isn't in the array:
rec *search (int id) {
	int	i;

	/* just like inserting... */

	i = hash (id);

	/* ...except we're looking for the id, not -1 */

	while (table[i].id != id) {

		/* an empty slot means the id can't be in the table */

		if (table[i].id == -1) return NULL;

		i++;
		if (i == N) i = 0; 
	}
	return &table[i];
}
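Here is a short usage sketch tying the pieces together; the student ID, name, and GPA are made up, and the sketch assumes <stdio.h> and <string.h> are included at the top of the file:
int main (void) {
	rec	r, *p;

	init_table ();

	/* insert one made-up student record */
	r.id = 1234567;
	strcpy (r.name, "A. Student");
	r.gpa = 3.5;
	insert (r);

	/* look it up again by ID */
	p = search (1234567);
	if (p != NULL)
		printf ("%s has GPA %.1f\n", p->name, p->gpa);

	/* an ID that was never inserted comes back NULL */
	if (search (7654321) == NULL)
		printf ("no record for 7654321\n");

	return 0;
}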