A Tour Through the Theory of Error-Correcting Codes

Mars, Pixels, and Code

Suppose you are visiting Mars. You want to take some pictures and send them to your parents on Earth. To transmit a picture to Earth, it must first be divided into small squares called pixels. Each pixel can be represented by its degree of blackness on a scale from 0 to 31. These pixels are then converted into binary and sent to Earth via some electronic device. However, the interplanetary void between Mars and Earth may contain stray electromagnetic waves that could flip one or several bits in your signal. Assume no more than one out of any three consecutive bits will be flipped. How can you transmit these pixels?

One approach to this problem is to simply send each bit 5 times. Suppose you want to send the pixel 13. Since there are 32 = 2⁵ distinct pixel values, each pixel is represented in 5 bits; the binary representation of 13 is 01101. Instead of sending 01101, you send 0000011111111110000011111.

Your parents would break your message into blocks of length 5 and replace each block with the bit that occurs most frequently in it. For example, if the 4th, 7th, and 10th bits of the signal have been flipped, your parents would receive 0001010110111110000011111, break it into 5 blocks of length 5 as 00010|10110|11111|00000|11111, replace each block by its most frequent bit (00010 → 0, 10110 → 1, 11111 → 1, 00000 → 0, 11111 → 1), and finally recover the pixel 01101.

Formally, what we did was use a code called the 5-repetition code. The code is correct because the error bits can’t dominate any of the blocks. However, whenever you want to transmit a pixel (5 bits of information), you need to send 5 × 5 = 25 bits. Such a cost is far from negligible, given that electrical energy is hard to come by on Mars, so we need a more efficient code! One improvement is to use the 3-repetition code instead, that is, to repeat each bit three times. Because there is at most one error bit in each block, the improved code is still correct. Moreover, this time you only need to send 3 × 5 = 15 bits to transmit a pixel! At this point, you may feel an urge to formalize the notions of "code", "correctness", and "efficiency".
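The repetition scheme above is easy to simulate. Here is a minimal Python sketch (the names `encode` and `decode` are ours, not from any library) that encodes the pixel 13 with the 3-repetition code, flips at most one bit per block of three, and recovers the pixel by majority vote:

```python
def encode(bits, k=3):
    """Encode with the k-repetition code: repeat each bit k times."""
    return "".join(b * k for b in bits)

def decode(signal, k=3):
    """Split the signal into blocks of length k and majority-vote each block."""
    blocks = [signal[i:i + k] for i in range(0, len(signal), k)]
    return "".join("1" if b.count("1") > k // 2 else "0" for b in blocks)

pixel = format(13, "05b")   # '01101', the 5-bit representation of pixel 13
sent = encode(pixel)        # 15 bits on the wire
# Flip one bit in three of the five blocks (within the channel's error limit).
received = list(sent)
for i in (0, 4, 8):
    received[i] = "0" if received[i] == "1" else "1"
assert decode("".join(received)) == pixel   # the pixel is recovered intact
```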

The World of Code

Let’s first formalize codes and the objects associated with them:

Definition 1 (Alphabet, String, and Language). An alphabet Σ is simply a finite set of elements called symbols. Any finite sequence over Σ is called a string. A set of strings over Σ is called a language. If Σ = ℤ/qℤ, a language L over Σ is called a q-ary language.

Definition 2 (Code). A code C of length n is a subset of Σⁿ for some alphabet Σ. If Σ = ℤ/qℤ, then C is called a q-ary code. Elements of C are called codewords.

Definition 3 (Coding Schema). A coding schema for a language L over Σ₁ with a code C over Σ₂ is a bijective map S : Σ₁ → C; a string in L is encoded symbol by symbol. We say C encodes L via S. We will omit the coding schema when it is clear from context.

Example 1. The 5-repetition code we discussed earlier is a binary code of length 5 consisting of the two codewords {00000, 11111}. We used it to encode a language over {0, 1} that describes the pixels in a picture.

In the following, we will only present results for q-ary codes, because codes over other alphabets are easily transformed into this class. We now have enough vocabulary to define the efficiency of a code:

Definition 4 (Information Rate). Let C be a q-ary code of length n that encodes some language. Its information rate is defined as

R(C) = log_q(|C|) / n.

Remark 1. Information rate measures the efficiency of a q-ary code. The quantity |C| is exactly the number of symbols of the language being encoded. Representing these symbols in the q-ary number system requires log_q(|C|) digits. Thus, the numerator is the number of digits necessary to encode a symbol and can be interpreted as the amount of information contained in one codeword. It follows that log_q(|C|)/n captures the average amount of information contained in each digit of a codeword. The more information each digit contains, the more efficient the code is.

Example 2. The 3-repetition code we discussed earlier has information rate log₂(2)/3 = 1/3, while the 5-repetition code has information rate log₂(2)/5 = 1/5, so the former is more efficient than the latter.
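These rates can be computed directly from Definition 4. A small sketch (the function name is ours, chosen for illustration):

```python
import math

def information_rate(q, n, size):
    """log_q(|C|) / n: information carried per digit of a codeword."""
    return math.log(size, q) / n

# A k-repetition binary code has |C| = 2 codewords of length k.
assert information_rate(2, 3, 2) == 1 / 3   # 3-repetition code
assert information_rate(2, 5, 2) == 1 / 5   # 5-repetition code
```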

Unlike efficiency, the "correctness" of a code can’t be expressed by a single formula. This notion is captured by the error-detection and error-correction capacities of the code. But first we need a way to measure the distance between two words.
If we think of (ℤ/qℤ)ⁿ as a vector space, it is natural to define some geometry on it:

Definition 5 (Hamming Distance and Minimum Distance). The Hamming distance dH(x,y) between any two words in Σⁿ is the number of positions in which they differ. The minimum distance of C, denoted by d(C), is the smallest possible Hamming distance between any two distinct codewords in C.
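Both quantities translate directly into code; a minimal sketch:

```python
from itertools import combinations

def hamming(x, y):
    """Hamming distance: number of positions in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def min_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(hamming(x, y) for x, y in combinations(code, 2))

assert hamming("01101", "01011") == 2
assert min_distance(["00000", "11111"]) == 5   # the 5-repetition code
```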

Now we can formalize the notion of "detection" and "correction":

Definition 6 (Error-Detection and Error-Correction). Let C be a code of length n that encodes some language. We say C can detect up to e error digits in any word received if there exists an algorithm ψ such that ψ(w,C) returns true iff w ≠ c, for any c ∈ C and w ∈ Σⁿ with dH(c,w) ≤ e. We say C can correct up to e error digits in any word received if there exists an algorithm ϕ such that ϕ(w,C) returns c, for any c ∈ C and w ∈ Σⁿ with dH(c,w) ≤ e.

Although we won’t give a formal definition of the term algorithm, readers can take it to mean deterministic Turing machines, or instructions written in any popular programming language. The following theorems show that the minimum distance of a code determines the range in which the code can detect or correct errors.

Theorem 1. A code C can detect up to e error digits in any word received as long as d(C) ≥ e + 1.

Proof. Let c be the codeword sent and w the word received, and let e denote the number of error digits in w. Suppose d(C) ≥ e + 1. Then dH(c,w) = e < e + 1 ≤ d(C). It follows that w ≠ c′ for all c′ ∈ C − {c}. Therefore, w fails to match any codeword if and only if e > 0. Let ψ(w,C) be the algorithm that searches C for a match of w, returning true if no match is found and false otherwise. Since e > 0 iff no match is found, the algorithm returns true iff w ≠ c. ◻
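The algorithm ψ from the proof is just a membership test; a minimal sketch:

```python
def detect(w, code):
    """psi(w, C): return True iff w matches no codeword.
    Reliable as long as at most d(C) - 1 digits of w were corrupted."""
    return w not in set(code)

code = ["000", "111"]            # the 3-repetition code, d(C) = 3
assert detect("010", code)       # a single flipped bit is detected
assert not detect("000", code)   # an intact codeword raises no alarm
```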

Theorem 2. A code C can correct up to e error digits in any word received as long as d(C) ≥ 2e + 1.

Proof. Let c be the codeword sent and w the word received. Suppose d(C) ≥ 2e + 1, where e is the number of error digits in w. For any codeword c′ ≠ c, the triangle inequality gives dH(c′,w) ≥ dH(c′,c) − dH(c,w) ≥ d(C) − e ≥ (2e+1) − e = e + 1 > e = dH(c,w). This implies c is the unique closest codeword to w. Let ϕ(w,C) be the algorithm that computes dH(w,c′) for all c′ ∈ C and returns the codeword closest to w. Since c is the unique closest codeword to w, the algorithm always returns c. ◻
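The algorithm ϕ is nearest-codeword (minimum-distance) decoding; a minimal sketch:

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def correct(w, code):
    """phi(w, C): return the codeword closest to w in Hamming distance.
    Reliable as long as at most (d(C) - 1) // 2 digits were corrupted."""
    return min(code, key=lambda c: hamming(c, w))

code = ["00000", "11111"]                 # the 5-repetition code, d(C) = 5
assert correct("01001", code) == "00000"  # two flipped bits are corrected
assert correct("11011", code) == "11111"
```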

Example 3. We can apply these theorems to the k-repetition code, where k is some positive odd number. The minimum distance of the code is dH(00...0, 11...1) = k, which tells us the code can detect up to k − 1 error bits and correct up to (k−1)/2 error bits.

A Puzzle

Now we (perhaps) have all the tools to deal with codes! It’s time to face the suuuuuuuuuper-generalized version of the Mars–Earth problem:

Problem 1. Given a base q, a codeword length n, and a lower bound d on the minimum distance (that is, we require d(C) ≥ d), what is the maximum information rate we can obtain, and how do we construct a code that maximizes it?

Since q and n are fixed, maximizing the information rate log_q(|C|)/n is equivalent to maximizing |C|, the size of the code. Thus, the problem can be reformulated as follows:

Problem 2. Let S(q,n,d) denote the maximum number of codewords in a q-ary code C of length n with minimum distance d(C) ≥ d. What is S(q,n,d), and how do we construct a code of size S(q,n,d)?

This problem is still open today for tons of choices of q, n, and d. However, the value of S(q,n,d) can be determined by adding some constraints on its parameters:

Theorem 3. If d = 1, then S(q,n,d) = qⁿ.

Proof. Since the distance between any two distinct codewords is at least 1, setting d to 1 is equivalent to removing the constraint on the minimum distance of C. Thus, to reach the optimum, we can simply put all elements of (ℤ/qℤ)ⁿ into C. As a result, we have |C| = |(ℤ/qℤ)ⁿ| = qⁿ. ◻

Theorem 4. If n = d, then S(q,n,d) = q.

Proof. Let C be the n-repetition code in base q, that is, C = {(a, a, ..., a) : a ∈ ℤ/qℤ}. Any two distinct codewords differ in all n positions, so the minimum distance of C is n = d. Hence |C| = q is a lower bound for S(q,n,d). Now let C′ be any q-ary code of length n with d(C′) ≥ n. Since the maximum distance between any two words in (ℤ/qℤ)ⁿ is n, we have d(C′) = n. Suppose, for contradiction, that C′ contains at least q + 1 codewords. By the pigeonhole principle, there must be two codewords x, y in C′ that start with the same symbol. Since x and y have length n and agree in the first position, they differ in at most n − 1 positions, that is, dH(x,y) ≤ n − 1, which contradicts d(C′) = n. Therefore C′ contains at most q codewords, and q is an upper bound for S(q,n,d). We have shown q ≤ S(q,n,d) ≤ q, so S(q,n,d) = q. ◻
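Theorem 4 can be checked computationally for small parameters; a sketch (the helper names are ours):

```python
from itertools import combinations

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def repetition_code(q, n):
    """The n-repetition code in base q: {(a, a, ..., a) : a in Z/qZ}."""
    return [(a,) * n for a in range(q)]

q, n = 3, 4
code = repetition_code(q, n)
assert len(code) == q                                              # |C| = q
assert min(hamming(x, y) for x, y in combinations(code, 2)) == n   # d(C) = n
```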

Theorem 5. If n = 4 and d = 3, then S(q,n,d) ≤ q², with equality if there exists a pair of orthogonal Latin squares of order q.

Proof. Let C be a q-ary code of length 4 with d(C) ≥ 3. Suppose, for contradiction, that the size of C is at least q² + 1. Since 3 ≤ d(C) ≤ n = 4, d(C) is either 3 or 4. If d(C) = 4, then as shown before its maximum possible size is q < q² + 1. Thus d(C) = 3. Since |C| ≥ q² + 1, by the pigeonhole principle there exist two codewords x, y ∈ C whose first two digits are the same, so that dH(x,y) ≤ 2 < 3 = d(C), which contradicts d(C) ≤ dH(x,y). Thus the size of C is at most q², and S(q,4,3) ≤ q².
Now assume there exists a pair of orthogonal Latin squares A, B of order q over ℤ/qℤ. Let aᵢⱼ and bᵢⱼ denote the (i,j)th entries of A and B, respectively. We construct the code C as follows:

C = {(i, j, aᵢⱼ, bᵢⱼ) : i, j ∈ ℤ/qℤ}.

Pick any two distinct codewords (s,t,aₛₜ,bₛₜ) and (u,v,aᵤᵥ,bᵤᵥ). We have (s,t) ≠ (u,v), and since A and B are orthogonal, (aₛₜ,bₛₜ) ≠ (aᵤᵥ,bᵤᵥ). It follows that the minimum distance of C is at least 2. Suppose, for contradiction, that the minimum distance of C equals 2, so some pair of codewords agrees in exactly two positions. They cannot agree in both of the first two positions (since (s,t) ≠ (u,v)), nor in both of the last two (by orthogonality), so they agree in exactly one position of each pair. Hence either s = u and aₛₜ = aᵤᵥ, or s = u and bₛₜ = bᵤᵥ, or t = v and aₛₜ = aᵤᵥ, or t = v and bₛₜ = bᵤᵥ. In the first case, aₛₜ and aᵤᵥ are equal entries in the same row of A, contradicting that A is a Latin square. In the second case, bₛₜ and bᵤᵥ are equal entries in the same row of B, contradicting that B is a Latin square. In the third and fourth cases, equal entries appear in the same column of A or of B, again a contradiction. Since all cases lead to a contradiction, the minimum distance of C is at least 3, that is, d(C) ≥ 3 = d. Thus C satisfies the minimum distance constraint. Because |C| = q² reaches the upper bound for S(q,4,3), we conclude that S(q,4,3) = q². ◻
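For q = 3, the squares A[i][j] = i + j and B[i][j] = i + 2j (mod 3) form an orthogonal pair, so the construction can be checked directly. A sketch (this particular pair of squares is our choice for illustration, not forced by the proof):

```python
from itertools import combinations

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

q = 3
# A[i][j] = i + j and B[i][j] = i + 2j (mod 3) are orthogonal Latin squares.
C = [(i, j, (i + j) % q, (i + 2 * j) % q) for i in range(q) for j in range(q)]

assert len(C) == q ** 2                                        # |C| = q^2 = 9
assert min(hamming(x, y) for x, y in combinations(C, 2)) == 3  # d(C) = 3
```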

Although the result of Theorem 5 may not look impressive by itself, its proof suggests a deep relationship between codes and Latin squares, and that is a good place to finish our tour.
