Hashing

A hash function converts a number in a large range into a number in a smaller range. This smaller range corresponds to the index numbers in an array.

arrayIndex = hugeNumber % arraySize

An array into which data is inserted using a hash function is called a hash table. A collision occurs when two keys map to the same index. There are two common ways to resolve collisions:

Open Addressing
Separate Chaining
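
To make the idea of a collision concrete, here is a minimal sketch. The array size of 10 and the three keys are arbitrary values chosen for illustration, not taken from any particular table.

```python
array_size = 10          # assumed table size, for illustration only
keys = [12, 25, 32]      # assumed sample keys

# Apply arrayIndex = hugeNumber % arraySize to each key.
indices = {key: key % array_size for key in keys}

# 12 and 32 both map to index 2, so they collide; 25 maps to index 5.
colliding = [key for key in keys if indices[key] == 2]
print(indices)     # {12: 2, 25: 5, 32: 2}
print(colliding)   # [12, 32]
```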

Open Addressing - when a data item cannot be placed at the index calculated by the hash function, another location in the array is sought. There are three ways of choosing that location:

Linear Probing
Quadratic Probing
Double Hashing

In Linear Probing we search sequentially for vacant cells. As more items are inserted in the array, clusters of filled cells grow larger. This is not a problem when the array is half full, and still not bad when it is two-thirds full. Beyond this, however, the performance degrades seriously as the clusters grow larger and larger. The performance is determined by the Load Factor. The Load Factor is the ratio of the number of items in a table to the table's size.

loadFactor = nItems / arraySize
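
The following is a minimal sketch of linear probing for non-negative integer keys. The class name LinearProbeTable and its methods are hypothetical names chosen for this example, and deletion is left out. The insert method returns the number of cells it had to examine, which shows how a cluster lengthens the probe.

```python
class LinearProbeTable:
    """Illustrative open-addressing table using linear probing."""

    def __init__(self, size):
        self.size = size
        self.cells = [None] * size    # None marks a vacant cell
        self.n_items = 0

    def load_factor(self):
        return self.n_items / self.size

    def insert(self, key):
        """Store key and return how many cells were examined."""
        if self.n_items == self.size:
            raise RuntimeError("table is full")
        index = key % self.size
        probes = 1
        while self.cells[index] is not None:
            index = (index + 1) % self.size   # next cell, wrapping around
            probes += 1
        self.cells[index] = key
        self.n_items += 1
        return probes

    def find(self, key):
        """Return the index holding key, or None if it is absent."""
        index = key % self.size
        for _ in range(self.size):
            if self.cells[index] is None:     # a vacant cell ends the search
                return None
            if self.cells[index] == key:
                return index
            index = (index + 1) % self.size
        return None


table = LinearProbeTable(10)
# 20, 30 and 40 all hash to index 0 and form a cluster; 21 then runs into it.
probe_counts = [table.insert(key) for key in (20, 30, 40, 21)]
print(probe_counts)          # [1, 2, 3, 3]
print(table.load_factor())   # 0.4
```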

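If x is the position in the array where the collision occurs, Quadratic Probing examines the cells x + 1, x + 4, x + 9, x + 16, and so on, wrapping around the end of the array. The distance from x is the square of the probe number. The problem with Quadratic Probing is that it gives rise to secondary clustering: all keys that hash to the same index follow the same probe sequence.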
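
A short sketch of the cells a quadratic probe visits. The function name quadratic_probe_positions, the array size of 23 and the collision index of 3 are assumptions made for this example.

```python
def quadratic_probe_positions(x, array_size, max_probes):
    """Cells examined after a collision at index x: x + 1, x + 4, x + 9, ..."""
    return [(x + probe * probe) % array_size
            for probe in range(1, max_probes + 1)]


# Any key that collides at index 3 follows exactly this sequence,
# which is the source of secondary clustering.
positions = quadratic_probe_positions(3, 23, 5)
print(positions)   # [4, 7, 12, 19, 5]
```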

Double Hashing or rehashing: Hash the key a second time, using a different hash function, and use the result as the step size. For a given key the step size remains constant throughout a probe, but it is different for different keys. The secondary hash function must not be the same as the primary hash function and it must not output 0 (zero).

stepSize = constant - ( key % constant )

The constant is a prime number and smaller than the array size. Double hashing requires that the size of the hash table is a prime number. A prime array size cannot be divided evenly by any step size smaller than itself, so the probe sequence will eventually check every cell. To see what goes wrong otherwise, suppose the array size is 15 (indices from 0 to 14) and the constant is 5. The keys 15, 30, 45, 60, 75, 90, 105 all hash to an initial index of 0 with a step size of 5. The probe sequence will then be 0, 5, 10, 0, 5, 10, and so on, repeating endlessly without ever visiting the other twelve cells.

If the array size were 13 and the keys were [13, 26, 39, 52, 65, 78, 91], then with the same constant of 5 the step sizes would be [2, 4, 1, 3, 5, 2, 4]. For a key with a step size of 5 that starts at index 0, the probe sequence would be [0, 5, 10, 2, 7, 12, 4, 9, 1, 6, 11, 3, 8], which visits every cell before it repeats. If there is even one empty cell, the probe will find it.
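
The difference between the two array sizes can be checked with a small sketch. The function names probe_sequence and step_size are hypothetical, and the constant of 5 is the value implied by the step sizes in the example above.

```python
def step_size(key, constant=5):
    """Secondary hash: always in the range 1..constant, never 0."""
    return constant - (key % constant)


def probe_sequence(start, step, array_size):
    """Cells visited before the probe returns to a cell it has already seen."""
    visited = []
    index = start
    while index not in visited:
        visited.append(index)
        index = (index + step) % array_size
    return visited


# Array size 15 is divisible by the step size 5, so only three cells are reached.
print(probe_sequence(0, 5, 15))        # [0, 5, 10]

# Array size 13 is prime, so the same step size reaches all 13 cells.
print(len(probe_sequence(0, 5, 13)))   # 13
```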

In Separate Chaining a data item's key is hashed to the index in the usual way, and the item is inserted into the linked list at that index. Other items that hash to the same index are simply added to the linked list. In separate chaining it is normal to put N or more items into an N-cell array. Finding the initial cell takes fast O(1) time, but searching through a list takes time proportional to the number of items on the list - O(m). In separate chaining the load factor can rise above 1 without hurting performance very much. It is not important to make the table size a prime number.
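
A minimal sketch of separate chaining follows. The class name ChainedTable is hypothetical, and Python lists stand in for the linked lists to keep the example short. Eight items are stored in a four-cell array, giving a load factor of 2.

```python
class ChainedTable:
    """Illustrative hash table that resolves collisions with per-cell chains."""

    def __init__(self, size):
        self.size = size
        # One chain per cell; a Python list is used here in place of a linked list.
        self.chains = [[] for _ in range(size)]
        self.n_items = 0

    def load_factor(self):
        return self.n_items / self.size

    def insert(self, key, value):
        chain = self.chains[key % self.size]     # O(1) to reach the chain
        for position, (existing_key, _) in enumerate(chain):
            if existing_key == key:              # key already present: overwrite
                chain[position] = (key, value)
                return
        chain.append((key, value))
        self.n_items += 1

    def find(self, key):
        # Searching the chain takes time proportional to its length.
        for existing_key, value in self.chains[key % self.size]:
            if existing_key == key:
                return value
        return None


table = ChainedTable(4)
for key in range(8):
    table.insert(key, f"item{key}")

print(table.load_factor())   # 2.0
print(table.find(6))         # item6
```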

Buckets: Another approach similar to separate chaining is to use an array at each location in the hash table instead of a linked list. Such arrays are called buckets. This approach is not as efficient as the linked list approach, however, because of the problem of choosing the size of the buckets. If they are too small they may overflow, and if they are too large they waste memory.
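
The sizing problem can be seen in a small sketch. The class name BucketTable, the table size of 5 and the bucket size of 2 are assumptions made for this example.

```python
class BucketTable:
    """Illustrative hash table with a fixed-size array (bucket) at each cell."""

    def __init__(self, size, bucket_size):
        self.size = size
        self.bucket_size = bucket_size
        self.buckets = [[None] * bucket_size for _ in range(size)]
        self.counts = [0] * size      # number of filled slots in each bucket

    def insert(self, key):
        """Return True if the key was stored, False if its bucket overflowed."""
        index = key % self.size
        if self.counts[index] == self.bucket_size:
            return False
        self.buckets[index][self.counts[index]] = key
        self.counts[index] += 1
        return True

    def unused_slots(self):
        return self.size * self.bucket_size - sum(self.counts)


table = BucketTable(5, 2)
# 10, 15 and 20 all hash to bucket 0, which holds only two items.
results = [table.insert(key) for key in (10, 15, 20, 3)]
print(results)                # [True, True, False, True]
print(table.unused_slots())   # 7
```

One bucket has overflowed while seven of the ten slots are still empty, which is the trade-off described above.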

Hash Functions: A good hash function is simple so that it can be computed quickly. A perfect hash function maps every key into a different table location. Use a prime number as the array size.

Hashing Strings: We can convert short strings to key numbers by multiplying letter codes by powers of a constant. With the codes a = 1, b = 2, and so on up to z = 26, the three-letter word "ace" could be turned into a number by calculating

key = 1 * 26² + 3 * 26¹ + 5 * 26⁰

This approach has the desirable attribute of involving all the characters in the input string. The calculated key value can then be hashed into an array index in the usual way:

index = key % arraySize

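def hashFunc1 ( key, arraySize ):
  # assumes key contains only the lowercase letters a-z
  hashVal = 0
  pow26 = 1

  # work from the last letter to the first, so the last letter is multiplied by 26^0
  for j in range (len(key) - 1, -1, -1):
    letter = ord (key[j]) - 96   # ord('a') is 97, so a = 1, b = 2, ..., z = 26
    hashVal += pow26 * letter
    pow26 *= 26

  return hashVal % arraySize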

The hashFunc1() method is not as efficient as it might be. Other than the character conversion, there are two multiplications and an addition inside the loop. We can eliminate one multiplication by using Horner's method:
a₄x⁴ + a₃x³ + a₂x² + a₁x¹ + a₀ = ( ( ( a₄x + a₃ ) x + a₂ ) x + a₁ ) x + a₀
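
As an illustration of the identity, the sketch below evaluates the same polynomial both ways. The function names evaluate_naive and evaluate_horner are hypothetical. The coefficients [1, 3, 5] with x = 26 correspond to the word "ace".

```python
def evaluate_naive(coefficients, x):
    """Sum a_i * x^i term by term; coefficients run from highest power to constant."""
    degree = len(coefficients) - 1
    return sum(a * x ** (degree - i) for i, a in enumerate(coefficients))


def evaluate_horner(coefficients, x):
    """Horner's method: one multiplication and one addition per coefficient."""
    result = 0
    for a in coefficients:
        result = result * x + a
    return result


print(evaluate_naive([1, 3, 5], 26))    # 759
print(evaluate_horner([1, 3, 5], 26))   # 759
```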

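The hashFunc1() function also copes poorly with long strings. In a language with fixed-width integers hashVal exceeds the size of an int and overflows; in Python, where integers grow without bound, it does not overflow, but the intermediate value becomes very large. Notice that the final hash value always ends up being less than the array size. In Horner's method we can apply the modulo (%) operator at each step in the calculation. This gives the same result as applying the modulo operator once at the end, but keeps every intermediate value small.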

def hashFunc2 ( key, arraySize ):
  hashVal = 0
  for j in range (len(key)):
    letter = ord (key[j]) - 96
    hashVal = (hashVal * 26 + letter ) % arraySize
  return hashVal
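
The claim that applying the modulo at every step gives the same result as applying it once at the end can be checked with a short sketch. The names hash_once and hash_stepwise are hypothetical stand-ins that mirror the two approaches. Both assume the key contains only the lowercase letters a-z, and the array size of 13 is an arbitrary choice.

```python
def hash_once(key, array_size):
    """Build the full base-26 number first, then reduce it once at the end."""
    hash_val = 0
    for ch in key:
        hash_val = hash_val * 26 + (ord(ch) - 96)
    return hash_val % array_size


def hash_stepwise(key, array_size):
    """Horner's method with the modulo applied after every letter."""
    hash_val = 0
    for ch in key:
        hash_val = (hash_val * 26 + (ord(ch) - 96)) % array_size
    return hash_val


# "ace" has the key 759, and 759 % 13 is 5.
print(hash_once("ace", 13))       # 5
print(hash_stepwise("ace", 13))   # 5

# The two approaches also agree on a much longer string.
long_key = "hashing" * 10
print(hash_once(long_key, 13) == hash_stepwise(long_key, 13))   # True
```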