BNF
Backus Naur Form
When you write a program, each instruction you use must be constructed according to strict rules. You must pay great attention to getting the syntax of each instruction perfect or you will not be able to translate your program!
Once you have written the program in the programming language of your choice, you need to translate it into code that the computer understands. The process can be represented by the following diagram:

Compilation
We have seen that to run a program, you must turn the source code into object code. We have already seen in an earlier chapter that translation can be split up into three sections:
The lexical analysis stage
-
- The source code is looked at and any unnecessary parts of it are removed. Unnecessary parts would include comments and spaces.
- The code is then grouped into tokens. A token is:
-
-
-
- A keyword like FOR, PRINT and so on.
- A symbol that has got a fixed meaning, for example, +, *.
- Numerical constants.
- User-defined variable names.
-
-
The syntax analysis stage
The syntax of the string of tokens created in the lexical analysis stage is checked to see if they obey the syntax rules for the language. The entire stream of tokens is split up (or 'parsed') into expressions and then these are checked against a database of syntax rules. There are a number of different ways to represent syntax definitions, such as using Backus Naur Form (BNF) or syntax diagrams. We are about to see these ways in action!
Code generation and optimisation
Code suitable for running on a CPU is generated and then checked for optimisation in this stage.
Backus Naur Form (BNF)
Assuming that lexical analysis has taken place, and the source code written by the programmer has been turned into a string of tokens, how does the computer check the tokens, to ensure that the rules for the language have been followed? This is where BNF comes in!
How are syntax rules represented, so that syntax analysis can take place during compilation?
There are three basic symbols you need to know:

This last definition simply means that the variable num can either be a 1 symbol or a 2 symbol or 3 and so on. The 1, 2, 3 etc in the last example have a special name. They are called ‘terminal symbols’ because they cannot be broken down any further.
An example using BNF
How can you define a positive integer? A positive integer could be a number of any length. E.g. the following numbers are positive integers: 2, 243, 7199, 23244, 12222346 and so on. It can be done as follows:

But this will only allow you to represent single-digit numbers. How can you represent integers of any length? It can be done using a technique that has its roots in recursion.
Consider the number 1524.
This is a digit (1) followed by an integer (524).
But (524) is a digit (5) followed by an integer (24).
But (24) is a digit (2) followed by an integer (4).
But (4) is a single digit.

An integer is defined as a digit followed by an integer. The recursion will stop when the final integer is a single digit. This is why we need the OR symbol in BNF, to enable us to drop out of the recursion. The final definition of an integer is:
Write out these definitions in English.

This definition will allow us to define any unsigned integer. Perhaps a better name for the meta variable would be unsigned_integer so we now have:

Suppose we wanted to represent signed integers, i.e. an integer with either a plus or a minus sign in front of it. How could we do this? There are a number of ways but one way is as follows:

We simply defined a new meta variable called ‘sign’. Then we included it in our definition of signed_integer.
Another example
Consider this definition of a membership ID:

We started building up this definition of membership_number by defining the meta variables that didn’t use recursion because they are the simplest ones to deal with. Then we defined the more complex meta variables, the ones that did use recursion. Finally, we created a meta variable for the membership ID from the meta variables we just created.
-
- &BAEE427 is a valid membership ID.
- #D56177354 is a valid membership ID.
- &EEE8 is a valid ID.
- #AAEF52 is not a valid ID because F is not a valid letter.
- #7233 is not a valid ID because you must have a letter from A to E after the # or &.
- &aC3655 is not valid because a small ‘a’ is not allowed according to the definition of letter.
- &BBE593433 is not valid because we are not allowed the digit 9.
- CE67 is not valid because it doesn’t begin with # or &.
Now test each other. Create membership IDs and see if a friend can tell you correctly whether they are valid or not.