A simple introduction to hash tables
Introduction
A hash table is a type of data structure. You apply a Maths formula to a piece of data that you want to store. This generates a number, which corresponds to a position in a hash table. Hash tables are very fast because you can calculate and then go to a piece of data you want to find, without having to search sequentially through lots of other pieces of data to get to it.
An example of how hash tables work
A database of pupils is going to be organised using a hash table. This file structure has been chosen because the database will be used to deal with pupil/parent enquiries and so needs to retrieve each record quickly. The secretary dealing with enquiries will type in a pupil's surname to get their data back. The database designer has selected the following maths formula, known as a ‘hash algorithm’:
When a surname is typed in, convert each letter of the surname into a number, and then add the numbers together to give an address. Convert letters to numbers using A=1, B=2, C=3 ... X=24, Y=25, Z=26.
For example, when the secretary types Jones, this is converted into 10 + 15 + 14 + 5 + 19 = 63. The computer goes to memory address 63 and there's the start of the record! If you wanted to store ‘Jones’ instead, you would work out the memory address you are going to store the data in by applying the hashing algorithm to the data. This is a much-simplified example but does illustrate the basic idea. It is a way of converting a request for a file into an address, so the record can be retrieved immediately, without having to go through other records.
Good hashing algorithms
Creating a hash file is an excellent way of creating a fast access file structure. You do need, of course, a direct access storage device, not a magnetic tape, for example. You also need to ensure that you have a 'good' hashing algorithm. The one above is very poor indeed because there will be lots of different surnames that give the same memory address! This is called a 'clash' or 'collision'. You need to design a hashing algorithm that minimises clashes because they slow down access times. On the other hand, an algorithm might also spread out the data so much that large areas of storage are used up! Having large areas of storage that aren't used efficiently is known as ‘redundancy’. A further consideration is the hashing algorithm itself. It mustn’t be so complicated that it takes ages to calculate anything!
A good hashing algorithm will:
- Minimise clashes.
- Ensure that the hash codes of data aren't spread too far apart, wasting memory.
- Be quick to calculate.
Dealing with clashes
Some clashes (collisions) are inevitable. When they do happen, there are two techniques that can be used to deal with the situation:
1) The computer can still go to the memory address that was calculated by the hashing algorithm. It then starts searching sequentially from that point to either find the data that it was searching for or to store the data it wanted to store.
2) The computer can set-up an overflow area. When there is a clash, the computer can jump to this special area to either find the data that it was searching for or to store the data it wanted to store.