The data dictionary
Introduction
The data dictionary holds information about data. Any system needs data to make it work. The Systems Analyst must construct a dictionary of all the data items used in the system, because this information will be needed by the people who actually build the new system and write the software. This point of reference for information about data items is known as the ‘data dictionary’. It tells the reader the form of the data and how each data item is actually made up.
Data dictionaries are often best presented as a table, using the following headings:
- The name of the data item.
- What synonyms there are for the data item.
- Whether it is a primary key or foreign key.
- Data type (whether it is a real number, an integer, text, a character, a Boolean, a date and so on).
- Validation rules that apply (e.g., the range of allowable values for integers, the number of allowable characters for text, the allowable characters for a character, the way that a date has to be entered using an input mask, the number of decimal places allowed, whether the item is required or not and so on).
- Examples of typical data entries.
- The origin of the data, where it comes from, how it is generated in the first place, where it is stored.
- What exactly the data item is used for, what happens to it, why it is part of the system at all.
- Specification of access rights – who can view, edit or delete the data item.
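As a minimal sketch, one row of such a table could be modelled as a small record type. This is only an illustration: the class name DataItem, its field names and the sample entry for an order quantity are all assumptions made for this example, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataItem:
    """One row of a data dictionary; one attribute per heading listed above."""
    name: str
    data_type: str
    synonyms: list = field(default_factory=list)
    key: str = ""            # "primary", "foreign" or "" for neither
    validation: str = ""
    examples: list = field(default_factory=list)
    origin: str = ""
    purpose: str = ""
    access: dict = field(default_factory=dict)   # action -> roles allowed


# Hypothetical entry, invented purely to show the shape of a row.
order_quantity = DataItem(
    name="OrderQuantity",
    data_type="Integer",
    synonyms=["Qty"],
    validation="Required; whole number from 1 to 999",
    examples=["1", "12", "250"],
    origin="Typed in by sales staff from the customer order form",
    purpose="Used to work out the order total and to update stock levels",
    access={"view": ["sales", "warehouse"], "edit": ["sales"], "delete": ["manager"]},
)

# The whole data dictionary is then simply a collection of such rows.
data_dictionary = {order_quantity.name: order_quantity}
```

Keeping every heading as a named attribute means an incomplete entry is easy to spot, for example by looking for rows where validation or origin is still empty.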
The Systems Analyst will start the data dictionary at the beginning of the project and, like the list of problems, will add to it as new information becomes available. Some of this information may come from interviews, but much may come from existing documentation. One reason why collecting documents from an existing system is important is that it shows the Systems Analyst what data is needed in the current system, with examples of the data, where data comes from, how it is used and so on.
An example of creating a data dictionary
A file is to be created about the dogs that some owners have. A typical set of records will look like this:
We would need to create a data dictionary to hold the data about the data in the file. This 'data about data' includes, for example, the data type of each item, how many bytes to allow and any validation rules that apply, and is often called 'metadata'. A data dictionary would typically go in a table, but we will simply list ours, as it is quite small and a list lets us see the logic behind each decision we make.
1) ID. This field is not a number but is an ID code. Therefore we will not use data type Integer but will use data type text instead. We will assume that the maximum number of dogs that will ever be in this file is 5000, so an ID code 4 characters long will be fine. We will allow 4 bytes for the ID code, one byte for each character.
2) Name. We know that some people give their dogs very long names. It is difficult to judge what to allow so we will allow plenty of room for error. We will allow 50 bytes, data type text (string).
3) Type. Let us assume that we have identified 203 different breeds of dog. The longest breed name is 28 characters long. If we allow that much space in every record, a lot of it will be wasted. For example, ‘Poodle’ only needs six bytes, not 28. Because there is a fixed number of choices, we will code up the breeds. If we use one letter, we can represent 26 breeds. If we use 2 letters, we can represent 26 x 26 = 676 breeds. This is more than enough. We could give Poodle the code PO, Alsatian AL and so on. We will therefore make this field data type text and allow 2 bytes.
4) Date of birth. This needs to be in the format DD/MM/YY. This is therefore data type Date and requires 6 bytes, one for each digit, assuming that the slashes are part of the display format and are not stored.
5) Gender. This is data type Boolean because a dog can only be either male or female. Allow 1 bit.
6) Weight. This is a real number. We only need 1 decimal place and the range of numbers is small, so we will allow 2 bytes. (Real numbers will be represented using the floating-point system.)
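7) Owner's telephone number. This is treated as text rather than as a number, because telephone numbers may include leading zeros and spaces. We will allow 6 bytes, enough to hold a 12-digit number in packed BCD format, which stores two digits per byte. Note that BCD holds digits only, so any spaces would be dropped when the number is stored.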
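The arithmetic behind these decisions can be checked with a short sketch. It is written in Python purely for illustration; the names FIELDS, record_size_bits, letters_needed and pack_bcd are invented for this example, and the sizes are simply the ones chosen in the list above.

```python
import math

# (field name, data type, size in bits), as decided in the list above
FIELDS = [
    ("ID", "text", 4 * 8),
    ("Name", "text", 50 * 8),
    ("Type", "text", 2 * 8),
    ("Date of birth", "date", 6 * 8),
    ("Gender", "Boolean", 1),
    ("Weight", "real", 2 * 8),
    ("Owner's telephone number", "text", 6 * 8),
]


def record_size_bits(fields):
    """Add up the space allowed for every field in one record."""
    return sum(bits for _name, _data_type, bits in fields)


def letters_needed(choices, alphabet=26):
    """Smallest number of letters whose combinations cover `choices` values."""
    length = 1
    while alphabet ** length < choices:
        length += 1
    return length


def pack_bcd(digits):
    """Pack a string of decimal digits into bytes, two digits per byte."""
    if not all(ch in "0123456789" for ch in digits):
        raise ValueError("BCD can only hold the digits 0-9")
    if len(digits) % 2:
        digits = "0" + digits          # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


total_bits = record_size_bits(FIELDS)          # 561 bits
total_bytes = math.ceil(total_bits / 8)        # 71 bytes once rounded up
breed_code_length = letters_needed(203)        # 2 letters cover 203 breeds
phone = pack_bcd("012345678901")               # a made-up 12-digit number
print(total_bits, total_bytes, breed_code_length, len(phone))
```

With the sizes above, one record comes to 561 bits, which rounds up to 71 bytes, and a 12-digit telephone number does fit in the 6 bytes allowed. Passing a number containing spaces to pack_bcd raises an error, which reflects the fact that BCD stores digits only.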