Codes and character sets
Character sets
When you press a key on your computer's keyboard, a code is generated. Every lower-case letter, every capital letter, every digit, every 'special' character such as the pound sign and @, and every other key such as the space bar and the ENTER key has its own binary code.
When you write an email, for example, you press keys on the keyboard. Each key press is converted into a binary code. Together, the codes are saved to form a message. You might then send this message to your friend. When your friend receives the message, their computer has to be using the same codes as yours to 'decode' the binary codes back into letters. Imagine what would happen if the codes for 'Hello' were different: according to your friend's codes, the message might be 'r%P}+'! Clearly, this is not good. The reason that computers can talk meaningfully to each other is that they all (more or less) use the same set of codes. There are three sets that you should know about. In addition, you should look up ANSI for yourself and find out how it differs from these three.
- ASCII
- EBCDIC
- Unicode
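The idea that sender and receiver must share the same codes can be sketched in a few lines of Python. This is only an illustration: `encode`, `decode` and `mismatched_table` are made-up names, and the "wrong" table is a hypothetical one invented here (every printable code shifted by 13 places), not a real character set.

```python
def encode(message):
    """Turn each character into its numeric code, as a key press would."""
    return [ord(ch) for ch in message]


def decode(codes, table=chr):
    """Turn numeric codes back into characters using the given code table."""
    return "".join(table(code) for code in codes)


def mismatched_table(code):
    """Hypothetical receiver whose printable codes are shifted by 13 places."""
    return chr((code - 32 + 13) % 95 + 32)


codes = encode("Hello")
print(codes)                            # [72, 101, 108, 108, 111]
print(decode(codes))                    # Hello
print(decode(codes, mismatched_table))  # Uryy|
```

The same five numbers give 'Hello' with the shared table and nonsense with the mismatched one.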
Standard ASCII and Extended ASCII
The most important set of codes to represent all of the possible key presses on a computer keyboard is the American Standard Code for Information Interchange, or ASCII (pronounced 'ass-key'). It is the set of codes used by Personal Computers.
In Standard ASCII, each character on the keyboard is represented by a 7 bit code. There are 95 printable characters (codes 32 to 126, including the space) and 33 control codes (0 to 31, plus 127 for DEL) that are used for controlling devices, e.g. printing. In Standard ASCII, for example, the letter 'A' is represented by the 7 bit code 1000001 (65 in decimal), the letter 'a' is 1100001 (97 in decimal), the '?' is represented by 0111111 (63 in decimal), a space is 32 in decimal and Null is decimal code 0. All of the different possible codes together make up what is known as the ASCII character set. If you are using 7 bits to represent a code, you have a total of 2^7, or 128, possible combinations. That means you can represent 128 different characters!
0 NUL | 16 DLE | 32 SP | 48 0 | 64 @ | 80 P | 96 ` | 112 p |
1 SOH | 17 DC1 | 33 ! | 49 1 | 65 A | 81 Q | 97 a | 113 q |
2 STX | 18 DC2 | 34 " | 50 2 | 66 B | 82 R | 98 b | 114 r |
3 ETX | 19 DC3 | 35 # | 51 3 | 67 C | 83 S | 99 c | 115 s |
4 EOT | 20 DC4 | 36 $ | 52 4 | 68 D | 84 T | 100 d | 116 t |
5 ENQ | 21 NAK | 37 % | 53 5 | 69 E | 85 U | 101 e | 117 u |
6 ACK | 22 SYN | 38 & | 54 6 | 70 F | 86 V | 102 f | 118 v |
7 BEL | 23 ETB | 39 ' | 55 7 | 71 G | 87 W | 103 g | 119 w |
8 BS | 24 CAN | 40 ( | 56 8 | 72 H | 88 X | 104 h | 120 x |
9 HT | 25 EM | 41 ) | 57 9 | 73 I | 89 Y | 105 i | 121 y |
10 LF | 26 SUB | 42 * | 58 : | 74 J | 90 Z | 106 j | 122 z |
11 VT | 27 ESC | 43 + | 59 ; | 75 K | 91 [ | 107 k | 123 { |
12 FF | 28 FS | 44 , | 60 < | 76 L | 92 \ | 108 l | 124 | |
13 CR | 29 GS | 45 - | 61 = | 77 M | 93 ] | 109 m | 125 } |
14 SO | 30 RS | 46 . | 62 > | 78 N | 94 ^ | 110 n | 126 ~ |
15 SI | 31 US | 47 / | 63 ? | 79 O | 95 _ | 111 o | 127 DEL |
The standard ASCII character set – decimal values and the character or control code they represent.
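The values quoted above can be checked with a short Python sketch. Python's built-in `ord` gives the decimal code of a character and `format(code, "07b")` writes it as 7 binary digits; `to_7bit` is just a helper name chosen for this example.

```python
def to_7bit(ch):
    """Return the 7 bit Standard ASCII pattern for a single character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a Standard ASCII character")
    return format(code, "07b")


for ch in ["A", "a", "?", " "]:
    print(repr(ch), ord(ch), to_7bit(ch))
# 'A' 65 1000001
# 'a' 97 1100001
# '?' 63 0111111
# ' ' 32 0100000

print(2 ** 7)  # 128 possible 7 bit codes
```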
Standard ASCII uses 7 bits for each code. Programmers and computer people like to work in nice easy packets of 8 bits (called a byte). That means that we have an extra, 8th bit to play with! We can use this extra 8th bit in Standard ASCII for some error checking, looking for errors when bits are sent from one place to another. When bits are being transmitted, there is a real possibility that they get 'corrupted'. In other words, the bits and therefore the codes change and so the message changes, too! It is necessary to check for errors when data is transmitted. The error-checking that takes place using the 8th bit is known as ‘parity checking’. This is dealt with elsewhere in the book.
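As a rough illustration of how the 8th bit might be used, here is a minimal Python sketch. It assumes even parity (the extra bit is chosen so that the total number of 1s in the byte is even) and assumes the parity bit is placed in the leftmost position; the function names are invented for this example, and real systems may use odd parity or a different layout.

```python
def add_even_parity(code7):
    """Put an even-parity bit in front of a 7 bit code, giving 8 bits."""
    ones = bin(code7).count("1")
    parity = ones % 2              # 1 only if the 7 bit code has an odd number of 1s
    return (parity << 7) | code7


def check_even_parity(byte):
    """Return True if the 8 bit value still has an even number of 1s."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))   # 'C' is 1000011, three 1s, so parity bit is 1
print(format(sent, "08b"))         # 11000011
print(check_even_parity(sent))     # True

corrupted = sent ^ 0b00000100      # one bit flipped during transmission
print(check_even_parity(corrupted))  # False
```

A single flipped bit is detected because the count of 1s becomes odd. Two flipped bits would cancel out and go unnoticed, which is one limitation of this simple check.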
An alternative is to use all 8 bits for a code instead of 7 bits, so you would have a total of 2^8, or 256, different combinations in your character set. In other words, you can represent 128 more characters than in 7 bit ASCII. You can have a code for the letters that appear in other languages but not in the English alphabet, or for graphics symbols, for example. All of the 8-bit codes together are known as the Extended ASCII character set. (In practice there are several such 8-bit sets, which agree on the first 128 codes but differ in the extra 128.) Most computers today use Extended ASCII so extra characters can be represented. There is another character set, however, that is used in large commercial systems.
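The following Python sketch shows one 8-bit set in action. It assumes Latin-1 (ISO 8859-1) as the example of an extended set; that choice is ours for illustration, and other 8-bit sets give different characters to the codes from 128 to 255.

```python
print(2 ** 8)                      # 256 possible 8 bit codes

data = "café".encode("latin-1")    # one byte per character
print(list(data))                  # [99, 97, 102, 233]
print(format(data[3], "08b"))      # 11101001, the 8th bit is needed for 'é'

# The same letter has no code at all in 7 bit Standard ASCII.
try:
    "é".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)               # False
```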
EBCDIC
The Extended Binary Coded Decimal Interchange Code is a character set used by older 'mainframes'. A mainframe computer is simply a computer that can be accessed from many terminals. It is often the preferred type of computer for businesses that process a lot of data. This character set uses different codes from ASCII, so a PC cannot directly 'talk' to a mainframe that uses it, although it can if a special program is written to translate between the two sets of codes.
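To see how different the codes are, and what a translating program might do, here is a small Python sketch. It assumes the `cp037` codec that ships with Python, which is one common variant of EBCDIC (there are others); `ebcdic_to_ascii` is a name made up for this example.

```python
for ch in "Aa0 ":
    ascii_code = ch.encode("ascii")[0]
    ebcdic_code = ch.encode("cp037")[0]
    print(repr(ch), ascii_code, ebcdic_code)
# 'A' 65 193
# 'a' 97 129
# '0' 48 240
# ' ' 32 64


def ebcdic_to_ascii(data):
    """Translate bytes received from an EBCDIC (cp037) system into ASCII bytes."""
    return data.decode("cp037").encode("ascii")


from_mainframe = "HELLO".encode("cp037")
print(list(from_mainframe))             # [200, 197, 211, 211, 214]
print(ebcdic_to_ascii(from_mainframe))  # b'HELLO'
```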
It is unfortunate that there are two widely used character sets. This situation has come about because, in the early days of computer development, manufacturers each tried to make their own character set the standard one to use, and a lot of different ones were promoted. ASCII and EBCDIC both emerged from this period and now we must live with them!
Unicode and UCS (Universal Coded Character Set) – ISO 10646
When we are talking about the ASCII or EBCDIC codes we are viewing the world as a place where everyone speaks and uses English, using the characters and symbols we are all familiar with. There is a problem!
Many languages do not use the 26 letters of the English alphabet. There are literally thousands of symbols used, for example, to write Chinese. Then there are Japanese symbols, characters used in the Russian alphabet, Greek, Thai, Runic, Bengali, Tamil, Telugu, Arabic, Malay, Lao, Khmer, Tibetan, Ethiopian, Gujarati, Cherokee, Mongolian, Yi and the list goes on and on and on. It goes further than that, however. There are also many mathematics symbols in use all over the world and all kinds of other symbols, such as the scripts used by the writer Tolkien (of Lord of the Rings fame)! And of course there may be many new scripts and symbols added in the future. If there is to be a way for users of software to access the characters in any language, and if we all want to access the greatest possible range of symbols used in the world, then clearly we are going to have to think a little bit bigger than ASCII. This is especially true for companies who do business globally or who need to create multi-lingual documents.
We have seen that Extended ASCII is simply a list of 256 numbers (8 bits), each number being allocated a character or symbol, such as those you can see on the keyboard in front of you. We have to do this because computers can only store numbers, not characters. Unicode uses exactly the same process as ASCII. It is a list of numbers, each number being allocated a particular character or symbol. However, Unicode is a much bigger list than the 256 numbers available in Extended ASCII. The standard that defines UCS, called ISO 10646, was originally designed around 31 bits, which gives about 2000 million codes. Unicode originally used 16 bits, giving 65,536 unique codes. Both standards have since agreed on the same range of codes, 0 to 10FFFF in hexadecimal, which gives 1,114,112 possible codes, of which only a fraction have so far been allocated to symbols.
Don’t be confused by the two common standards ‘Unicode’ and ‘UCS’. They started out as two different standards, but the two organisations saw the light and decided that one system would be better for all concerned. They are still separate standards but are, in practical terms, interchangeable. Also note that Unicode has incorporated ASCII: the first 128 Unicode codes are the same as the Standard ASCII codes.
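A short Python sketch can make the 'bigger list of numbers' idea concrete. Python strings are Unicode, so `ord` returns the Unicode number of any character; the sample characters and the `U+` hexadecimal notation are simply choices made for this illustration.

```python
for ch in ["A", "Δ", "™", "中"]:
    print(ch, ord(ch), f"U+{ord(ch):04X}")
# A 65 U+0041
# Δ 916 U+0394
# ™ 8482 U+2122
# 中 20013 U+4E2D

print(2 ** 16)   # 65536 codes with 16 bits
print(2 ** 31)   # 2147483648 codes with 31 bits

# Unicode incorporates ASCII: codes 0 to 127 mean the same thing in both.
same_as_ascii = all(chr(code).encode("ascii")[0] == code for code in range(128))
print(same_as_ascii)  # True
```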
Unicode, HTML and web browsers
Have you ever written some HTML code for a website? Suppose you want to display a special symbol such as a trademark symbol on the website. You would use &#8482; in your code. This is because the Unicode code for the trademark symbol, ™, is 8482 in decimal. Similarly, &#916; will display the Greek letter delta, Δ. You can easily find a Unicode reference for your website by doing a search for ‘HTML Unicode’.
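HTML lets you write any Unicode character as an ampersand, a hash, the decimal Unicode number and a semicolon. The Python sketch below builds such references and turns them back into characters using the standard library's `html.unescape`; `to_reference` is a helper name invented for this example.

```python
import html


def to_reference(ch):
    """Build the decimal HTML character reference for one character."""
    return f"&#{ord(ch)};"


print(to_reference("™"))          # &#8482;
print(to_reference("Δ"))          # &#916;

# A browser does the reverse when it displays the page.
print(html.unescape("&#8482;"))   # ™
print(html.unescape("&#916;"))    # Δ
```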
Another point to make about Unicode is that if web browsers are to be used by people all over the world, then they need to understand more than just ASCII codes. Modern web browsers make use of Unicode and can therefore be used universally. (You may need to set up your web browser properly or install the appropriate fonts if you are having problems displaying Unicode characters. There is plenty of help on the Internet.)