Different uses of hashing
Introduction
The word 'hashing' is used in many different areas of computing. The basic idea behind hashing, however, is that you design an algorithm that creates a unique identifier for some data using the data itself. You then store these 'hash codes' in a table. You can then use a particular piece of data in a record you are trying to find, for example, to identify a hash code, so that you can then get back the entire file quickly, rather than sorting through an entire file one by one.
Random files (also known as hash files)
Organising and finding records in a computer file can be a complicated business and can take a long time. Different methods of file organisation, including serial, sequential and index sequential searching, involves varying degrees of searching through other records to get to the one you want. With a random file structure, you go straight to the record you want! This is a very fast access method. In a random file, some search data is entered for the system to use to get back the whole record. You then apply a maths algorithm (or formula) to this search data. This transforms the search data into an address. The computer can then go and get the record! Let's look at an example of this.
An example of how random files work
A database of pupils is going to be organised using a random file. This file structure has been chosen because the database will be used to deal with pupil/parent enquiries and so needs to retrieve each record quickly.
The secretary dealing with enquiries will type in a pupil's surname to get their data back. The database designer has selected the following maths formula, known as a ‘hash algorithm’:
-
- When a surname is typed in, convert each letter of the surname into a
- number, and then add the numbers together to give an address. Convert
- letters to numbers using A=1, B=2, C=3 ... X=24, Y=25, Z=26.
For example, when the secretary types Jones, this is converted into 10 + 15 + 14 + 5 + 19 = 63. The computer goes to memory address 63 and there's the start of the record!
If you wanted to store ‘Jones’ instead, you would work out the memory address you are going to store the data in by applying the hashing algorithm to the data.
This is a much-simplified example but does illustrate the basic idea. It is a way of converting a request for a file into an address, so the record can be retrieved immediately, without having to go through other records.
Good hashing algorithms
Creating a random file is an excellent way of creating a fast access file structure. You do need, of course, a direct access storage device, not a magnetic tape, for example. You also need to ensure that you have a 'good' hashing algorithm. The one above is very poor indeed because there will be lots of different surnames that give the same memory address! This is called a 'clash' or 'collision'. You need to design a hashing algorithm that minimises clashes because they slow down access times. On the other hand, an algorithm might also spread out the data so much that large areas of storage are used up! Having large areas of storage that aren't used efficiently is known as ‘redundancy’. A further consideration is the hashing algorithm itself. It mustn’t be so complicated that it takes ages to calculate anything!
A good hashing algorithm will:
-
- Minimise clashes.
- Ensure that the hash codes of data aren't spread too far apart, wasting memory.
- Be quick to calculate.
Dealing with clashes
Some clashes (collisions) are inevitable. When they do happen, there are two techniques that can be used to deal with the situation:
- The computer can still go to the memory address that was calculated by the hashing algorithm. It then starts searching sequentially from that point to either find the data that it was searching for or to store the data it wanted to store.
- The computer can set-up an overflow area. When there is a clash, the computer can jump to this special area to either find the data that it was searching for or to store the data it wanted to store.
Hash codes in cryptographic hashing, downloading files and password generators
Another use of hash files is to confirm that a file you have downloaded from a website has been downloaded correctly. The idea behind this kind of hashing is that if you take a block of data or file and apply a standard cryptographic hashing algorithm such as MD5 (Message Direct algorithm 5), then it will always return a fixed size hash value. You can try this for yourself here:
http://www.miraclesalad.com/webtools/md5.php
If you have ever downloaded a really big file, such as a free operating system like Linux or a large application like GIMP, you may have noticed that you can download a hash value for the file as well. This has been generated by the person who has made the file available, form a confirmed, working example of the file. If you download it, you can use the hash file to check that the file you downloaded is absolutely identical to the file where you downloaded it from. In other words, the file hasn't been corrupted during transmission across the Internet. This kind of checking occurs in peer-to-peer networks, to confirm that you have downloaded a file correctly and in sites that sells music and films. If you are using Windows, you can generate an MD5 value for a file you have just downloaded from here:
http://www.md5summer.org/download.html or http://www.winmd5.com/
or you could watch a YouTube video on exactly how to use and check MD5 files, such as the one here: https://www.youtube.com/watch?v=EKTk27Bpcy4
You can then compare the MD5 value generated for a file you've just downloaded with the one given on a website for the file.
Passwords
Another use of hashes is to generate complex passwords from simple phrases. Instead of using a password such as Is your Cat's name Fred? you could use a hash code generator to convert it into something far more complex. If you ever need to get back the MD5 password, you can use an MD5 hash generator such as the one given earlier to do just that.