ECC refers to Error Correction Codes. Error happens whenever there is a bit flips & information is read incorrectly. The bit flip can be a single or double bit causing single bit errors or double bit errors. The bit flips can occur because of hard Errors or soft Errors.
The hard errors are due to inherent defects of circuits during manufacturing, Temperature-variance and general Wear and Tear. These issues often cause stuck at faults where bit is permanently stuck at 0 or 1. Soft Errors occur due to gamma rays colliding with bits resulting in bit flips. The temporary bit flips are also caused due to noisy environment where electronic interference is high for E.g if a circuit happens to be close to power supply.
There are 3 key parts to the ECC :
ECC Generation:
ECC generation is basically a process of applying an algorithm to calculate extra bits that would be stored with Data. The algorithm is an XOR logic where each ECC bit is derived from XOR of several bits including few of ECC bits. These bits are stored along with Data into memory or array & then retrieved back for Detection and generation. The number of ECC bits for generation is dependent on size of the data & can be calculated using below formula :
SECDED : 2^n+1: where n+1 = number of ECC bits.
DECTED : 2^n+2: Where n+2 = number of ECC bits.
For E.g :
For 8 Bits of Data with single bit correction and double bit detection (SECDED) we would need 3 ECC bits i.e from 2^(2+1).
For 8 Bits of Data with double bit correction and Triple bit detection (DECTED) we would need 4 ECC bits i.e from 2^(2+2).
ECC Detection:
The detection is basically a method to know whether there was an Error. For detection, at the minimum we should know that whether there was a bit flip i.e polarity change and also in some cases also how many bits were flipped. These are the 2 important parts of information that allows us to make decision whether Error can be corrected.
For detection, ECC bits are re-generated with same XOR formula that was used in generation and then these bits are compared against the original ECC bits retrieved from the Memory/Array. If the XOR result of the original and regenerated ECC bits is not 0, then there is a evidence of Error syndrome & hence Error is detected.
In order to understand how many bits were in Error, an Error Syndrome is used. An Error Syndrome is basically a list of codes that are stored and used as a reference. Whenever Error is detected, XOR result is referenced against these stored Codes. If there is a match then the Error can be corrected otherwise Error cannot be corrected even though it was detected.
ECC Correction:
Correction is a process of restoring the data to its original state. This is done by using Error Syndrome. The Error syndrome is unique per bit and if the XOR result match is found then that particular bit is in Error. For E.g for SECDED protection on 8-bit data with 3 bits of ECC, there will be 11 Error Syndrome Codes for detecting single bit Error on each 11 bit positions. In order to correct that bit, if the syndrome matches then the polarity of that particular bit is flipped i.e from 0 -> 1 or 1-> 0 & original value is restored.
It is also important to note that if the XOR result is not 0 (i.e if Error was detected) but no matching Error Syndrome was found then Error cannot be corrected. This typically happens whenever there are more bit flips than the ECC bits can correct. For E.g if there 2 bit flips in SECDED type of protection on Data.