Previous, Math 511 Home, Contents, Next

Error Correcting Codes

 

When humans have to record lengthy identification numbers, the potential for typographical errors is high. A publisher doesn’t want to ship the wrong books because someone typed in one wrong digit from a long list of identification numbers. In such situations, check digits are usually added to the identification number. These check digits can be used to verify that the number is entered correctly. For example, the ISBN number of Theory and Practice of Error Control Codes by Blahut is 0-201-10102-5. In this case, the initial 0 denotes the language (English), the 201 denotes the publisher (Addison-Wesley), and the 10102 is picked by the publisher to denote the particular book. The last digit, 5, is the check digit. It is computed according to the following rule. If the first nine digits are d1d2d3d4d5d6d7d8d9, then the tenth digit is given by the formula

d10 = 1´d1 + 2´d2 + 3´d3 + 4´d4 + 5´d5 + 6´d6 + 7´d7 + 8´d8 + 9´d9 (mod 11)

You can check that this procedure does yield 5 in the given example. You can check that an ISBN number is correct by computing

1´d1 + 2´d2 + 3´d3 + 4´d4 + 5´d5 + 6´d6 + 7´d7 + 8´d8 + 9´d9 + 10´ d10 (mod 11).

Since 10 = -1  (mod 11), the result should be 0.

 

If any single digit in an ISBN number is miscopied, or if any two numbers are transposed, the result will no longer be 0. We can check this as follows. Suppose we have a correct ISBN number d1d2d3d4d5d6d7d8d9d10 and we accidentally replace the digit d4 by e4. We compare

            1´d1 + 2´d2 + 3´d3 + 4´d4 + 5´d5 + 6´d6 + 7´d7 + 8´d8 + 9´d9 + 10´ d10

   -       1´d1 + 2´d2 + 3´d3 + 4´e4 + 5´d5 + 6´d6 + 7´d7 + 8´d8 + 9´d9 + 10´ d10

         ----------------------------------------------------

                                            4 ´ (d4 - e4)

Since d1d2d3d4d5d6d7d8d9d10 is a valid ISBN number, the top checksum is 0. Now since Z11 is a field, the only way for 4´(d4-e4) to be 0 is for d4 = e4. So any change in the 4th (or any other) digit will change the checksum, and so the bottom checksum won’t be 0. We can similarly show that any single transposition (switching the position of two digits) will also change the checksum.

 

When the data entry clerk enters the number, the computer carries out the computation above. If the computation doesn’t come up 0, the computer can beep and the data entry person can re-enter the ISBN number, or perhaps refer to the title of the book if the number has been miscopied onto an order form. One possible problem with this scheme is that since the computation is done mod 11, the check digit could turn out to be 10, rather than a single digit 0-9. The solution to this can be inferred from the fact that the ISBN number of Lie Algebras and Lie Groups by Serre is 0-8053-8703-X.

 

This is only one of many check digit schemes (though it is the most efficient scheme I know). Credit card numbers, the bar codes on products, all sorts of identification numbers use various check digit schemes as a means of detecting errors. These schemes all take the same basic form. Carry out some algebraic computation on the digits in the identification number in some algebraic structure. Record the answer as the check digit. If the algebraic structure is well chosen, then any single change in the data will produce a change in the check digit. Of course, several errors might cancel each other out, but you can't expect a single check digit to do everything. The advantage of using some finite algebraic structure is you want the check digit to take on just one of a finite number of possibilities (like 0-9 and X) so you can record it using a single digit. You will want pick a structure where it is easy to check if a number is correct and where several different types of errors (typos, transpositions, etc.) all can be caught, as in the ISBN example.

 

An early place where check digits were used was in computers. Let’s start with a short review of terminology. Computers work with binary, base 2, numbers. These numbers are built up from Binary digITS, called bits. Each bit is either 0 or 1. We gather 8 bits together into a unit called a byte. 10011010 is a byte. If you change one bit, programs probably won’t work anymore, so it is vital to detect when errors have been made. The earliest approach to catching errors was the use of a parity bit in each byte.

 

In our (old-fashioned) byte, we will have seven data bits and one parity bit. Suppose our data bits are 1011010. If we are using even parity, we will choose the 8th bit so that there are an even number of 0’s and 1’s in the byte. Since we have four 1’s and three 0’s in our seven data bits, we will choose a 0 for the parity bit (chosen to be the bit on the left) so we end up with even numbers of both 0’s and 1’s in the final eight-bit byte, 01011010. After we receive a byte, we can check that the parity is even. If there are an odd number of 0’s and 1’s in the received byte, we will know an error has been made and can ask for the byte to be retransmitted. This approach will detect any single error in transmitting a byte. Going back to our example, suppose we receive 01111010. We count three 0’s and five 1’s in this byte. Since three and five are odd, we know an error has occurred.

 

To put this in the same framework as the ISBN check digit scheme, note that if we have data bits b1b2b3b4b5b6b7, then we can write the formula for the parity bit b0 (with even parity) as

b0 = b1 + b2 + b3 + b4 + b5 + b6 + b7  (mod 2)

We then test that the parity is even by computing

b0 + b1 + b2 + b3 + b4 + b5 + b6 + b7 (mod 2).

Since 1 = -1 (mod 2), the answer should be 0.

 

If you recall when we discussed ASCII encoding, the standard way to encode letters into binary in computer systems was to use the starting three bits 010 for capital letters and 011 for lower case letters, with the letters from A to Z listed in the last five bits from 00001 to 11010 (1 - 26 in binary). Some of you may have wondered why the highest bit was 0 in both cases. The reason is that when ASCII was developed, you only used seven data bits, and the highest order bit was the parity bit. So ASCII used the two highest order data bits to denote what type of character was denoted. Completing the list given above, 11 is primarily lower case letters, 10 is primarily upper case letters, 01 is for other printable characters (0100000 was the space for example) and 00 is for carriage control (0001000 is a tab for example). As the use of one parity bit and seven data bits was replaced by the use of eight data bits during the 80’s, ASCII changed to always having a 0 in the highest order bit for the standard characters, and using bytes starting with a 1 to encode extra symbols, like ¸ and ´.

 

As noted, computers rarely (if ever) use the original seven data bit with one parity bit method of encoding anymore. Memory and bandwidth (the amount of data transferred per unit time) being valuable, programmers wanted to use all 8 bits for data, and soon enough did so. PCs typically use eight bits for data and have a separate ninth parity bit. As memory circuits have gotten more reliable, it has become cheaper and easier just to stick to eight data bits and no parity bits for the internal operation of machines, which is the case for some PCs today. When transmitting data between different machines, more sophisticated error detecting and error correcting codes are now used. The first example of an error correcting code was the Hamming code.

 

Parity checks are an example of an error-detecting code; they give us a way to detect if an error has occurred in recording a byte of data. But often just knowing that an error has occurred isn't sufficient; you want to know what the error was and how to fix it. This requires an error-correcting code. Richard Hamming at Bell Labs was having a problem with parity errors in the 50's while working at Bell Labs. Because the only computer (back in the 50's computers weren't household items) was busy with high priority projects during the week, Hamming could only submit his jobs to be run over the weekend. For several weeks, each Monday when he came in he got the message that the program had terminated with a parity error and to please try again. Of course, trying again meant waiting till the next weekend. Hamming thought there had to be a way to set things up to use parity to correct a single error, rather than just detecting it. He worked out a method to do just that, now called the Hamming code.

 

 The (7,4) Hamming code uses four data bits and three parity bits (for a total of seven bits). Suppose the data bits are b1b2b3. We set the parity bit p134 so b1b3b4p134 has even parity. Then we set the parity bit p123 so b1b2b3p123 has even parity. Finally we set the parity bit p234 so b2b3b4p234 has even parity. We then encode our results as b1b2b3b4p134p123p234. We can write this in terms of matrices as

where the computation is done mod 2. We will call the 4´7 matrix listed above G.

 

Let's look at an example. Suppose our four data bits are 0110. Then we compute the parity bits p134 = 1, p123 = 0, and p234 = 0, and we get our seven-bit result 0110100. To test this sequence for errors, we check the parity of the sets b1b3b4p134, b1b2b3p123, and b2b3b4p234, which in this case are 0101, 0110, and 1100 respectively. Since all three have even parity, we detect no errors. Now suppose an error changes our seven-bit sequence to 0111100. This time when we look at the sets b1b3b4p134, b1b2b3p123, and b2b3b4p234, we get 0111, 0110, and 1110 respectively. Since two of these have odd parity, we detect an error. Looking at which sequences have errors, we see that we have detected no error in b1b2b3p123, but we have detected an error in b1b3b4p134, and b2b3b4p234. Assuming that there is only a single error (which we hope is more likely than there being two errors), we can conclude that the error is in bit b4, since it is the only bit that is in both sets b1b2b4p124 and b1b3b4p134 but not in b1b2b3p123. So we switch bit b4 in the errant sequence 0111100 from a 1 to a 0 to get back to the correct seven-bit sequence 0110100. Extracting the four data bits, we finally get 0110, which is what we started this paragraph with.

 

Since every bit, both data and parity, is in at least one of the sets b1b3b4p134, b1b2b3p123, and b2b3b4p234, any change in a single bit will foul up at least one of the three parity checks. On the other hand, since no two bits are in exactly the same parity checks, we can always determine which single bit was changed by examining which parity checks fail. So we can always correct a single error by solving a logic problem. Unfortunately, computers prefer computational algorithms to methods such as "solving a logic problem." Fortunately, we can use matrix algebra to find a nice algorithm for fixing single errors in the Hamming code.

 

The first thing we will want is to find a quick way of checking if a seven-bit word is a valid codeword. In order for (b1b2b3b4p134p123p234) to be a valid codeword, the four-bit sets b1b2b4p134, b1b2b3p123, and b2b3b4p234 must all have even parity. Since our computations are all done mod 2, this is the same as asking b1 + b3 + b4 + p134 = 0, b1 + b2 + b3 + p123 = 0, and b2 + b3 + b4 + p234 = 0. We can write this as a matrix equation

 

where H is the 3´7 matrix and all computations are done mod 2. We can use matrix computations to check that any valid codeword will give use a (0 0 0) vector after multiplying by H. Recall that a valid codeword must be of the form (b1b2b3b4)G. Then we compute ((b1b2b3b4)G)H = (b1b2b3b4)(GH) = (b1b2b3b4)0 = (0 0 0) where you can check that GH is the matrix of all zeros, which we've denoted 0. For any seven-bit word w, we call wH the syndrome of the word. If w is a valid codeword, then the syndrome is (0 0 0). If the syndrome is not (0 0 0), then an error has been made. We will next consider how to find which single bit could have caused the error from the syndrome.

 

Let’s look at what happens to the syndrome if we have an error in our seven-bit codeword. We first compute our valid codeword (b1b2b3b4)G. Suppose now an error is made in the third bit. This gives us our new seven-bit word (b1b2b3b4)G + (0 0 1 0 0 0 0). If we now check this invalid codeword using the H matrix we find

((b1b2b3b4)G + (0 0 1 0 0 0 0))H          = (b1b2b3b4)GH + (0 0 1 0 0 0 0)H

= (0 0 0) + (1 1 1)

= (1 1 1).

So an error in the third bit means the syndrome will be (0 0 1 0 0 0 0)H, which is the third row of the H matrix. Similarly, an error in the jth bit will make the syndrome the jth row of H. So to check a seven-bit word w encoded using the Hamming (7,4) code, we compute the syndrome wH. If wH is (0 0 0), then the word is valid and we can just read the data bits from the first four places. If the syndrome is something else, then we check which row of H is the same as the syndrome and switch that bit in the word to get a valid codeword, from which we then read off the data bits. You can check that every possible non-zero three-bit syndrome is a row of H, so this method always finds a single error that can be fixed to give a valid codeword.

 

The Hamming code illustrates the general approach to error-correcting codes. You choose an algebraic structure (in this case Z2) and carry out several calculations with the number you want to send. You then send both the number and the results of the computations. At the other end, the person (or computer) who gets your number can do a little algebra to check that everything came through all right, or to put it right if it didn't. The question becomes what is the best choice of algebraic structure and computation. You want it to be easy to implement the computations, without requiring too much work to check if a word is a valid codeword or to find the corrected codeword if the received word is invalid. At the same time, a good error-correcting code should be efficient, correcting as many errors as possible while requiring as few extra numbers be transmitted as possible. The Hamming code is easy to implement, but is not very efficient. While theoretical considerations showed more efficient codes must be possible, it took nearly a decade to develop better error-correcting codes than the Hamming code and so the Hamming code was actually used in computer systems through the 60's. Nowadays, there are several families of more efficient codes that are used in practice. In order to develop one of these more efficient codes, the Reed-Solomon codes that are used on CDs, we will study some new algebraic structures, polynomial domains and finite fields.

 

 

 


Please report any problems with this page to bennett@math.ksu.edu
©2000 Andrew G. Bennett