
The Danger of Overemphasizing Theory

This article explains the danger of product design engineers placing too much emphasis on information and coding theory results and not enough emphasis on practical design issues such as miscorrection.


Introduction

There is no dispute that theory is valuable and important, but sound implementation strategies are also important.

It’s good that, in our society, we have expert theoretical mathematicians, physicists and information theorists, but it is also good that we have expert carpenters, plumbers, electricians and auto mechanics.

I think we all know that sometimes theoreticians can be impractical and overlook important practical issues.

Around the mid-1990s, some information theorists started actively promoting Low Density Parity Check (LDPC) codes because, for a given level of error correction performance, the amount of redundancy they require is close to meeting a minimum theoretical bound. Generally speaking, in the minds of information theorists, getting closer and closer to that theoretical bound is the most important thing. Theoreticians are aggressively and relentlessly seeking new error-correcting codes which come ever closer to meeting the theoretical bound of minimum redundancy or, in other words, to meeting Shannon’s capacity limit. There’s nothing wrong with that.

It has been proven that, apart from trivial cases, Hamming codes and the Golay codes (the best known being the triple-bit error-correcting binary Golay code) are the only “perfect” error-correcting block codes that exist. “Perfect” means they meet the sphere-packing bound exactly: for a given block length and correction power, they require the minimum possible amount of redundancy. They are “perfect” from a mathematical point of view.
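
To see concretely what “minimum redundancy” means, the short Python sketch below (illustrative only; the function names are made up for this article) checks the sphere-packing bound for the (7,4) Hamming code and the (23,12) binary Golay code, and shows that a typical BCH code does not meet it:

    import math

    def hamming_ball(n, t):
        """Number of n-bit words within Hamming distance t of a fixed word."""
        return sum(math.comb(n, i) for i in range(t + 1))

    def is_perfect(n, k, t):
        """A t-error-correcting (n, k) binary code is perfect when the balls of
        radius t around its 2^k codewords exactly tile the space of 2^n words."""
        return (2 ** k) * hamming_ball(n, t) == 2 ** n

    print(is_perfect(7, 4, 1))    # (7,4) Hamming code: True
    print(is_perfect(23, 12, 3))  # (23,12) binary Golay code: True
    print(is_perfect(15, 7, 2))   # (15,7) double-error-correcting BCH code: False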

It’s natural and logical for the average person to think that perfect codes would be widely implemented since they are perfect.

However, perfect codes are not widely implemented because “perfect” from a mathematical point of view does not necessarily mean “best” from a practical implementation point of view.

Perfect codes have the attribute that any decoder designed for them will never “fail” to decode. That is, every possible error pattern will either be corrected or miscorrected. By “miscorrected” we mean the decoder “believes” it has corrected the error pattern that actually occurred when, in fact, it has inserted extra errors; it doesn’t know it has done that and therefore cannot signal that fact. Miscorrection is a bad thing. That miscorrection property is a property of *all* decoders for “perfect” codes no matter how the decoders are designed. For perfect codes, it is impossible to detect error patterns which exceed the capability of the code to correct. In other words, if the code is designed to correct t errors but t+i errors actually occur, every decoder will make a mistake and “correct” t or fewer errors in the wrong positions, typically adding up to t new errors to the t+i errors already in the received word, so you can end up with as many as 2t+i errors that you know nothing about.

Most product design engineers who use Hamming codes - which are perfect - add an extra bit of redundancy so that their decoders are able to detect double-bit errors and signal that two errors have occurred rather than miscorrect two-bit error patterns as a pure, perfect Hamming code decoder would be forced to do.

You run into this same miscorrection problem whenever you use any error-correcting code with a bare minimum amount of redundancy.
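
As a concrete illustration, here is a minimal Python sketch (the all-zero codeword is used for simplicity, and the function names are invented for this article). It builds the standard (7,4) Hamming code, shows that a straight Hamming decoder silently miscorrects every double-bit error, and shows that the single extra parity bit mentioned above (the extended-Hamming, or SECDED, construction) turns those miscorrections into detected failures:

    import itertools

    def syndrome(word):
        """Syndrome of a 7-bit word for the (7,4) Hamming code whose parity-check
        columns are the numbers 1..7; a nonzero syndrome names the (1-based) bit
        position that a single-error decoder will flip."""
        s = 0
        for pos, bit in enumerate(word, start=1):
            if bit:
                s ^= pos
        return s

    def hamming_decode(word):
        """Pure Hamming decoder: always 'corrects' by flipping the bit named by the
        syndrome.  It can never report failure -- that is the point."""
        w = list(word)
        s = syndrome(w)
        if s:
            w[s - 1] ^= 1
        return w

    # Every double-bit error applied to the all-zero codeword is miscorrected:
    miscorrected = 0
    for i, j in itertools.combinations(range(7), 2):
        received = [0] * 7
        received[i], received[j] = 1, 1
        decoded = hamming_decode(received)
        if decoded != [0] * 7 and syndrome(decoded) == 0:
            miscorrected += 1          # the output is a codeword, just the wrong one
    print(miscorrected, "of 21 double-bit errors miscorrected")    # 21 of 21

    def secded_decode(word8):
        """Extended (8,4) Hamming / SECDED decoder: corrects single errors and
        signals (rather than miscorrects) double errors, returning None on failure."""
        data, overall = list(word8[:7]), sum(word8) % 2
        s = syndrome(data)
        if s and overall == 0:
            return None                # even number of flips: double error detected
        if s:
            data[s - 1] ^= 1           # odd number of flips: correct the single error
        return data

    detected = sum(
        secded_decode([1 if k in (i, j) else 0 for k in range(8)]) is None
        for i, j in itertools.combinations(range(8), 2)
    )
    print(detected, "of 28 double-bit errors detected by SECDED")  # 28 of 28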

The following two true stories further illustrate this point.


True Story A

I started studying error-correcting codes in 1973. At that time, single-burst error-correcting codes were just beginning to be used in hard disk drives. I was working at Control Data Corporation (CDC) which was one of the largest computer companies in the United States at that time. CDC was competing against IBM.

CDC designed and manufactured hard disk drives and hard disk drive controllers. CDC’s design engineers were proud they had chosen a single-burst error-correcting code that could correct a single 11-bit burst error by using only 32 bits of redundancy per sector while IBM could only correct a single 4-bit burst error by using 58 bits of redundancy.

However, CDC soon began receiving complaints from the Air Force Logistics Systems in Texas that CDC's disk drives were too frequently miscorrecting data. It caused a panic in CDC, and the person I was learning ECC from (Dr. Matt Ulstad) was called in to help solve the problem.

What Matt concluded was that the parity check matrix for the CDC code was not dense enough. That is, it was too sparse with too few nonzero elements. The effect of a low density parity check matrix was frequent miscorrections.

CDC used a search algorithm on a supercomputer to find a better generator polynomial and reduced the length of bursts that were corrected.


True Story B

In the 1979-1982 time frame, while I was a design engineer at what is now Seagate, I proposed that a Reed-Solomon code be implemented in disk drives in place of the standard single-burst binary error-correcting code.

At that time, ICs were nowhere near the density they are today; we were lucky to get 8 flip-flops in one IC. So, in order for an RS decoder to be practical to implement, I proposed that a fairly low gate-count, burst-trapping decoder be implemented in hardware (the type used for correcting single bursts with binary single-burst error-correcting codes) and that, if burst trapping failed, we would do error correction in firmware/software to correct multiple symbols in error.
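
The burst-trapping idea itself fits in a few lines. The Python below is only a toy model of that hardware (integer bit-twiddling in place of shift registers, with invented function names), and it assumes the cyclic code can genuinely correct single bursts of length b; when trapping fails it returns None, which is exactly where the firmware/software multi-symbol decoder described above would take over:

    def gf2_mod(a, g):
        """Remainder of the binary polynomial a modulo g (integers, bit i = coefficient of x^i)."""
        dg = g.bit_length() - 1
        while a and a.bit_length() - 1 >= dg:
            a ^= g << (a.bit_length() - 1 - dg)
        return a

    def burst_span(v):
        """Length of the shortest window of consecutive bit positions covering all 1-bits of v."""
        low = (v & -v).bit_length() - 1
        return v.bit_length() - low

    def trap_single_burst(received, g, n, b):
        """Error-trapping decoder sketch for a binary cyclic (n, k) code with generator
        polynomial g: cyclically shift the syndrome until its 1-bits fit inside a window
        of b positions; that window is then the burst.  Returns the corrected n-bit word,
        or None to signal decoding failure."""
        r = g.bit_length() - 1                   # number of check bits, n - k
        mask = (1 << n) - 1
        s = gf2_mod(received, g)                 # syndrome of the error polynomial
        if s == 0:
            return received                      # no detectable error
        for i in range(n):
            if burst_span(s) <= b:
                # s equals the error pattern cyclically shifted by i, so rotate it back.
                error = ((s >> i) | (s << (n - i))) & mask
                return received ^ error
            s <<= 1                              # multiply the syndrome by x ...
            if (s >> r) & 1:
                s ^= g                           # ... and reduce modulo g
        return None                              # not a single burst of length <= b: fail

    # Tiny demo using the (7,4) Hamming code (g = x^3 + x + 1) as a burst-length-1 code:
    print(trap_single_burst(0b0100000, 0b1011, n=7, b=1))    # prints 0: the error is trapped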

After that system was implemented, we discovered that, as in True Story A, data was sometimes being miscorrected.

I had to implement what Elwyn Berlekamp called an error pattern “rejection strategy” whereby, if the error pattern determined by the decoder could not be interpreted as two bursts of length b or less, then that error pattern was rejected and the decoder asserted that it had failed.
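
A minimal Python sketch of that rejection test is shown below (it assumes the decoder reports its corrections as bit or symbol positions, it ignores cyclic wraparound of bursts for simplicity, and the wrapped decode function is a hypothetical stand-in):

    def fits_two_bursts(error_positions, b):
        """Rejection-strategy check: can the error locations produced by the decoder be
        covered by at most two bursts, each spanning no more than b consecutive positions?
        If not, the pattern should be rejected and the decoder should declare failure."""
        pos = sorted(error_positions)
        if not pos:
            return True

        def span(chunk):
            return (chunk[-1] - chunk[0] + 1) if chunk else 0

        # Try every split of the sorted locations into a left burst and a right burst.
        return any(span(pos[:j]) <= b and span(pos[j:]) <= b
                   for j in range(len(pos) + 1))

    def decode_with_rejection(decode, received, b):
        """Wrap a (hypothetical) decoder: accept its output only if the implied error
        pattern passes the two-burst test; otherwise signal an uncorrectable error."""
        corrected = decode(received)
        if corrected is None:
            return None
        errors = [i for i, (x, y) in enumerate(zip(received, corrected)) if x != y]
        return corrected if fits_two_bursts(errors, b) else None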


What Should we Learn from these Stories?

We should learn that it is prudent to have more redundancy than is absolutely needed so that if error patterns occur which exceed the correction capability of the code, they will usually be detected and not miscorrected.

From an implementation point of view, it is good to have significantly more redundancy than is absolutely needed because then the decoder will almost always signal failure, rather than miscorrect, when severe error patterns occur. In most practical situations, it doesn't make much difference if a code requires 20% redundancy or 22% redundancy. Memory and bandwidth are getting cheaper and cheaper. The extra redundancy is important because it allows severe error patterns to be detected even though they cannot be corrected, and that is a very important and good thing.

The extra redundancy required by binary BCH codes is a good thing because it allows the vast majority of error patterns that exceed the correction capability of the code to be detected.
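
The benefit of those extra check bits can be quantified with a standard back-of-the-envelope estimate: for a bounded-distance decoder with r check bits, roughly V(n,t)/2^r of all error patterns land on a decodable syndrome, where V(n,t) is the number of patterns of t or fewer errors in n bits; an uncorrectable pattern that does so is miscorrected, and everything else is detected. The Python sketch below uses illustrative (not product) parameters to show how quickly a few extra check bits drive that fraction down:

    import math

    def miscorrection_fraction(n, r, t):
        """Rough fraction of random error patterns that a bounded-distance decoder will
        miscorrect: decodable syndromes / all syndromes = V(n, t) / 2^r, where
        V(n, t) = sum of C(n, i) for i = 0..t."""
        ball = sum(math.comb(n, i) for i in range(t + 1))
        return ball / 2 ** r

    # A length-1023 binary code correcting t = 8 random errors:
    n, t = 1023, 8
    for r in (80, 90, 100):    # 80 check bits is the usual binary BCH cost (8 x 10); then a few more
        print(r, miscorrection_fraction(n, r, t))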

It’s good to know if extremely severe error patterns are occurring so that corrective action can be taken. If that happens frequently, something is wrong. Maybe a chip needs to be replaced or some other corrective action can be taken. It’s not good to not know that error patterns with more than t errors are occurring.

The only people on planet earth who really care if an error-correcting code meets the Shannon capacity limit are coding theorists. Almost nobody else cares except maybe engineers who are designing error-correction systems for the space channel where power is limited.

Consider the analogy of the miles per gallon your car gets. As long as the price of gasoline is low enough, not many people care if their car gets 25 or 30 miles per gallon. Storage and communications bandwidth are cheap and getting cheaper and cheaper. The number of bits a fiber optic cable can transmit per second is astronomical with advanced wavelength division multiplexing. Now we are seeing multi-terabyte capacity disk drives. So what difference does it make if an error-correcting code requires 30% redundancy instead of 25%? I don’t think it makes much difference to anyone except coding theorists.

Suppose someone discovered that cars could theoretically be designed to get 75 miles per gallon. Does that mean that no one should be happy or content or think they own a good car unless it gets 75 miles per gallon? I don’t think so.

Suppose that in order for cars to be designed to actually get the theoretical limit of 75 miles per gallon they necessarily must also be designed with more parts and are therefore more unreliable and cost more than cars that get lower mileage. Then the decision as to which car to buy becomes more difficult - similar to the decision to implement an error correcting code with the minimum amount of redundancy required or to implement one that detects more uncorrectable error patterns. When it comes to designing practical products that people everywhere love, meeting a theoretical information theory limit may be completely unimportant and irrelevant.

To me the most important characteristics of an error-correcting system are ...
  • Is it powerful?
  • Is it practical?
  • Does it solve the problem it was supposed to solve?
  • Does it perform well?
  • Is it easy to understand and implement?
  • Does it enhance the value of the products it is in?
  • Does everyone like or love the performance of the product it is in?

Suppose someone designed a product, millions of copies were manufactured, all of the purchasers and users loved the product, and then a theoretician came along and said, “Your design is not theoretically optimal.” Most likely the designers, manufacturers, purchasers and users would say, “So what? Who cares? We love the product and we don’t care whether it is theoretically optimal! Whether or not it is theoretically optimal is, in our minds, irrelevant! We love the product the way it is, and we do not want you to change it!”

Comments from Elwyn Berlekamp

I agree that LDPC codes are being promoted primarily by theoretical people who have little practical experience in implementation. But that's not to say that they aren't interesting, and might not be useful in some situations. Usually, the length at which they become competitive is very, very long. Sometimes delay or memory constraints make them infeasible.

They also usually assume that the channel is memoryless (i.e., burst-free). Character-based codes such as Reed-Solomon do relatively poorly on such channels relative to the performance they attain when B-bit bursts strike only a much smaller number of characters. So I continue to think that channel burstiness, high code rates, and high ratios of channel speed to processor speed all tend to push the designer towards algebraic codes such as Reed-Solomon, whereas slow memoryless channels with low signal-to-noise ratios and soft decision information (e.g., a real number output estimate for each +1 or -1 bit transmitted) tend to push the designer towards Viterbi or LDPC codes or Turbocodes or some combination thereof. The deep space channel is a classic example of the latter. Most optical and/or magnetic memories have tended to be nearer to the former.

Most coding theorists or practitioners are highly specialized. Each has an approach which he seeks to apply to all problems he encounters, even those which succumb better to other approaches.

LDPC Codes will not work as the Horizontal/Lateral Code in RAID Systems

All RAID systems use some type of error correcting code to span across the physical devices. I call this code the “horizontal” or “lateral” code.

LDPC codes will not work as the lateral code in RAID systems because LDPC codes require very long block lengths in order to be effective, and the horizontal block length in RAID is very short. RS codes, on the other hand, work well, as discussed on this web page: Finite Fields, RS Codes and RS RAID.
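
As a small illustration of why short-block Reed-Solomon-style codes fit RAID so naturally, the Python sketch below computes RAID-6-style P and Q parity bytes over GF(2^8) for one byte position across a handful of data strips (a toy model with made-up data, not the design described on the page linked above):

    def gf_mul(a, b, poly=0x11D):
        """Multiply two elements of GF(2^8) using the common field polynomial
        x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
        p = 0
        while b:
            if b & 1:
                p ^= a
            a <<= 1
            if a & 0x100:
                a ^= poly
            b >>= 1
        return p

    def lateral_parity(strips):
        """RAID-6-style lateral code: P is a plain XOR of the data bytes, Q is a
        Reed-Solomon-style weighted sum with generator element g = 2 in GF(2^8)."""
        p = q = 0
        g = 1
        for d in strips:
            p ^= d
            q ^= gf_mul(g, d)
            g = gf_mul(g, 2)          # next power of the generator
        return p, q

    # Toy example: four data drives, one byte each.  With P and Q, any two erased
    # strips at known positions can be recovered by solving two equations in GF(2^8).
    print(lateral_parity([0x37, 0xA2, 0x5C, 0x01]))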


ECC Tek’s Binary BCH Designs

ECC Tek recently announced the immediate availability of ultra-high-performance binary BCH encoders and decoders for correcting large numbers of random errors in MLC NAND Flash memories. ECC Tek has developed numerous binary BCH designs that are programmable to correct up to 72 bits in error in short (512+ Byte) and long (1024+ Byte) pages and can easily develop designs that correct more than 100 bits in error in longer pages.

ECC Tek has nearly 8 years of experience designing binary BCH encoders and decoders and has incorporated many gate-saving and performance-enhancing improvements into its latest designs. All of ECC Tek’s binary BCH decoders can correct the maximum number of errors in continuous incoming pages with no slowdown of the input.

Binary BCH designs that correct a large number of random errors probably require approximately 1/3 the number of gates and 1/3 the power of an equivalent-performance LDPC decoder. In addition, with binary BCH codes, you can be 100% certain that all error patterns with t or fewer errors will be corrected; LDPC decoders cannot make that claim. Also, properly designed binary BCH decoders will, with very high probability (close to 100%), signal failure if more than t errors occur. Again, LDPC decoders cannot do that.

ECC Tek can easily customize its binary BCH designs for each licensee. The level of parallelism can be adjusted in each component. Higher levels of parallelism increase performance and gate count. Inputs can be any number of bits wide. Current implementations input and output bytes.

All of ECC Tek’s decoders feature variable latency, so the time to decode depends only on the number of errors that actually occur. Most other decoder designs do not, and probably cannot, provide this feature. Fixed latency can also easily be implemented.

ECC Tek’s binary BCH designs are solid, sound, proven, field-tested and millions are currently in mass production. If you use a digital camera, thumb drive or MP3 player, there’s a chance you are already using ECC Tek’s designs. In over 20 years of licensing ECC designs, no licensee has had any problems with ECC Tek’s designs, and all licensees have been happy with the results. In most cases, ECC Tek’s designs have significantly outperformed the licensee’s requirements and expectations.

Binary BCH encoders and decoders that correct from 8 to 72 bits in error in short and long pages are immediately available. Existing designs can easily be modified to correct more than 72 bits in error.