Error-Correcting Codes For Semiconductor Memory Applications: A State-of-the-Art Review
C. L. Chen
M. Y. Hsiao
This paper presents a state-of-the-art review of error-correcting codes for computer semiconductor memory applications. The
construction of four classes of error-correcting codes appropriate for semiconductor memory designs is described, and for each class
of codes the number of check bits required for commonly used data lengths is provided. The implementation aspects of error
correction and error detection are also discussed, and certain algorithms useful in extending the error-correcting capability for the
correction of soft errors such as α-particle-induced errors are examined in some detail.
Introduction
In recent years error-correcting codes (ECCs) have been used increasingly to enhance the system reliability and the data integrity of computer semiconductor memory subsystems. As the trend in semiconductor memory design continues toward higher chip density and larger storage capacity, ECCs are becoming a more cost-effective means of maintaining a high level of system reliability [1-4].

A memory system can be made fault tolerant with the application of an error-correcting code; i.e., the mean time between "failures" of a properly designed memory system can be significantly increased with ECC. In this context, a system "fails" only when the errors exceed the error-correcting capability of the code. Also, in order to optimize data integrity, the ECC should have the capability of detecting the most likely of the errors that are uncorrectable.

Error-correcting codes used in early computer memory systems were of the class of single-error-correcting and double-error-detecting (SEC-DED) codes invented by R. W. Hamming [5]. A SEC-DED code is capable of correcting one error and detecting two errors in a codeword. The double-error-detecting capability serves to guard against data loss. In 1970, a new class of SEC-DED codes called odd-weight-column codes was published by Hsiao [6]. With the same coding efficiency, the odd-weight-column codes provide improvements over the Hamming codes in speed, cost and reliability of the decoding logic. As a result, odd-weight-column codes have been widely implemented by IBM and the computer industry worldwide [7-10]. Examples of systems which incorporate these codes are the IBM 158, 168, 303X, 308X, and 4300 series, Cray I, Tandem, etc. There are also various standard part numbers of these codes offered by many semiconductor manufacturers [11] (for example, the AM2960 and AMZ8160 of Advanced Micro Devices, the MC68540 of Motorola, the MB1412A of Fujitsu, and the SN54/74 LS630, LS631 of Texas Instruments).

The number of errors generated in the failure of a memory chip is largely dependent on the chip failure type. For example, a cell failure may cause one error, while a line failure or a total chip failure in general causes more than one error. For ECC applications, the memory array chips are usually organized so that the errors generated in a chip failure can be corrected by the ECC. In the case of SEC-DED codes, the one-bit-per-chip organization is the most effective design. In this organization, each bit of a codeword is stored in a different chip; thus, any type of failure in a chip can corrupt, at most, one bit of the codeword. As long as the errors do not line up in the same codeword, multiple errors in the memory are correctable.

Memory array modules are generally packaged on printed-circuit cards with current semiconductor memory technology, and usually a group of bits from the same card form a portion of an ECC codeword, as illustrated in Figure 1. With this
© Copyright 1984 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
C. L. CHEN AND M. Y. HSIAO IBM J. RES. DEVELOP. • VOL. 28 • NO. 2 • MARCH 1984
multiple-bit-per-card type of organization, a failure at the card-support-circuit level would result in a byte error, where the size of the byte is the number of bits feeding from the card to a codeword. In this type of configuration, it is important for data integrity that the ECC be able to detect byte errors [12]. A SEC-DED code is in general not capable of detecting all single-byte errors. However, a class of SEC-DED codes capable of detecting all single-byte errors can be constructed [13, 14]. These are called single-error-correcting double-error-detecting single-byte-error-detecting (SEC-DED-SBD) codes.

Figure 1 A 4-bit-per-card memory array.

There are certain design applications where the memory array cannot be organized in one-bit-per-chip fashion because of cost or other reasons such as system granularity or power restrictions. As chip density increases, it becomes more difficult to design a one-bit-per-chip memory system. For a multiple-bit-per-chip type of memory organization, a single-byte-error-correcting double-byte-error-detecting (SBC-DBD) code [15-20] would be more effective in error correction and error detection.

System reliability generally tends to decrease as the capacity of a memory system increases. To maintain the same high level of reliability, a double-error-correcting triple-error-detecting (DEC-TED) code may be used. However, this type of code requires a larger number of check bits than a SEC-DED code and more complex hardware to implement the functions of error correction and error detection [8, 15, 16].

An error-correcting code can be used to correct "soft" errors as well as hard errors. Soft errors are temporary errors such as α-particle-induced errors that disappear during the next memory write operation. With a maintenance strategy that allows the accumulation of hard errors, a high soft error rate would cause a high uncorrectable error (UE) rate. To reduce the UE rate that involves soft errors, a SEC-DED code can be modified to correct two hard errors or a combination of one hard and one soft error [21-25].

In this paper we review the current status of error-correcting codes for semiconductor memory applications and present the state of the art by describing the construction of four classes of error-correcting codes suitable for this type of design application. These four classes are SEC-DED codes, SEC-DED-SBD codes, SBC-DBD codes, and DEC-TED codes. For each class of code we provide the number of check bits required for commonly used data lengths, information that is particularly useful to designers for system planning. We also discuss the implementation aspects of error correction and error detection for these classes of error control codes. In addition, we describe a number of algorithms useful in extending the error-correcting capability of codes for the correction of soft errors such as α-particle-induced errors and other temporary errors.

Binary linear block codes
A binary $(n,k)$ linear block code is a $k$-dimensional subspace of a binary $n$-dimensional vector space [8, 15, 16]. An $n$-bit codeword of the code contains $k$ data bits and $r = n - k$ check bits. An $r \times n$ parity check matrix $H$ is used to describe the code. Let $V = (v_1, v_2, \ldots, v_n)$ be an $n$-bit vector. Then $V$ is a codeword if and only if

$$H \cdot V^{t} = 0, \tag{1}$$

where $V^{t}$ denotes the transpose of $V$, and all additions are performed modulo 2.

The encoding process of a code consists of generating $r$ check bits for a set of $k$ data bits. To facilitate encoding, the $H$ matrix is expressed as

$$H = [P, I_r], \tag{2}$$

where $P$ is an $r \times k$ binary matrix and $I_r$ is the $r \times r$ identity matrix. Then the first $k$ bits of a codeword can be designated as the data bits, and the last $r$ bits can be designated as the check bits. Furthermore, the $i$th check bit can be explicitly calculated from the $i$th equation of the set of $r$ equations in (1). A code specified by an $H$ matrix of (2) is called a systematic code.

Any binary $r \times n$ matrix $H$ of rank $r$ can always be transformed into the systematic form of (2). Since the rank of $H$ is $r$, there exists a set of $r$ linearly independent columns. The columns of the matrix can be reordered so that the rightmost $r$ columns are linearly independent. Applying elementary row operations [16] on the resultant matrix, a matrix of (2) is obtained. The systematic code obtained is equivalent to the code defined by the original $H$ matrix. Figure 2(a) is an example of the parity check matrix of a (26,20) code in a nonsystematic form. Note that the last six columns of the matrix are linearly independent. The submatrix of the six columns can be inverted. The multiplication of the inverse of the submatrix and the transpose of the parity check matrix results in a matrix of systematic form shown in Figure 2(b).
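As a minimal sketch of Eqs. (1) and (2), the following Python fragment generates check bits from a systematic parity check matrix and evaluates $H \cdot V^{t}$ modulo 2. The matrix `P` is a hypothetical (7,4) example chosen for brevity, not one of the codes of Figure 2, and the function names are illustrative assumptions.

```python
# Sketch of systematic encoding with H = [P, I_r] over GF(2).
# P is a hypothetical 3 x 4 example (a (7,4) code), not a matrix from the paper.
P = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]

def parity_check_matrix(P):
    """Return H = [P, I_r] as a list of rows."""
    r = len(P)
    return [row + [1 if i == j else 0 for j in range(r)]
            for i, row in enumerate(P)]

def encode(data, P):
    """Append r check bits; check bit i is the mod-2 sum of the data bits selected by row i of P."""
    checks = [sum(p & d for p, d in zip(row, data)) % 2 for row in P]
    return list(data) + checks

def syndrome(word, H):
    """Compute H * word^t modulo 2."""
    return [sum(h & v for h, v in zip(row, word)) % 2 for row in H]

H = parity_check_matrix(P)
codeword = encode([1, 0, 1, 1], P)
```

For this example `syndrome(codeword, H)` is all zeros, and inverting a single bit of `codeword` yields a syndrome equal to the corresponding column of `H`.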
pairs of codewords. For a linear code, the minimum distance of the code is equal to the minimum of the weights of all nonzero codewords [8, 15, 16]. A code is capable of correcting $t$ errors and detecting $t + 1$ errors if and only if $d > 2t + 1$.
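The relation between minimum distance and error-control capability can be illustrated with a brute-force computation. The sketch below assumes a hypothetical (8,4) code given by four generator rows (it is not taken from the paper's figures) and finds its minimum nonzero codeword weight, then the largest $t$ satisfying $d > 2t + 1$.

```python
from itertools import product

# Generator rows of a hypothetical (8,4) code: 4 data bits followed by 4 check bits.
G = [
    [1, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1, 1, 1],
]

def min_distance(generator_rows):
    """Minimum weight over all nonzero codewords spanned by the rows (brute force)."""
    k = len(generator_rows)
    n = len(generator_rows[0])
    best = n
    for coeffs in product([0, 1], repeat=k):
        if not any(coeffs):
            continue  # skip the all-zero codeword
        word = [0] * n
        for c, row in zip(coeffs, generator_rows):
            if c:
                word = [a ^ b for a, b in zip(word, row)]
        best = min(best, sum(word))
    return best

def capability(d):
    """Largest t with d > 2t + 1: correct t errors and detect t + 1."""
    return (d - 2) // 2

d = min_distance(G)
t = capability(d)
```

With this example `d` is 4 and `t` is 1, the SEC-DED case. Enumeration is only practical for small $k$; it is used here purely for illustration.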
Figure 3 Parity check matrix of some SEC-DED codes: (a) (22,16) code (IBM System/3); (b) (40,32) code (IBM 8130); (c) (72,64) code (IBM 3033); (d) (72,64) code (IBM 3081).
any set of three columns of the $H$ matrix are linearly independent. Thus, the $H$ matrix of a SEC-DED code must satisfy the following conditions:

A1. The column vectors of the $H$ matrix are nonzero and are distinct.
A2. The sum of two columns of the $H$ matrix is nonzero and is not equal to a third column of the $H$ matrix.

Note that the sum of two odd-weight $r$-tuples is an even-weight $r$-tuple. A SEC-DED code with $r$ check bits can be constructed with its $H$ matrix consisting of distinct nonzero $r$-tuples of odd weights. This is an odd-weight-column code of Hsiao [6].

The maximum code length of an odd-weight-column code with $r$ check bits is $2^{r-1}$, for there are $2^{r-1}$ possible distinct odd-weight $r$-tuples. This maximum code length is the same as that of a SEC-DED Hamming code. The maximum number of data bits $k$ of a SEC-DED code must satisfy $k \le 2^{r-1} - r$. Table 2 lists the number of check bits required for a set of data bits. Figure 3 shows examples of SEC-DED codes used in some IBM systems.

Table 2 Number of check bits required for SEC-DED codes.

Data bits   Check bits
8           5
16          6
32          7
64          8
128         9
256         10

Most of the SEC-DED codes for semiconductor memory applications are shortened codes in that the code length is less than the maximum for a given number of check bits. There are various ways of shortening a maximum-length SEC-DED code. Usually a code designer constructs a shortened code to meet certain objectives for a particular application. These objectives may include the minimization of the number of circuits, the amount of logic delay, the number of part numbers, or the probability of miscorrecting triple errors [6].

In a write operation, check bits are generated simultaneously by processing the data bits in a parallel manner according to Eqs. (1) and (2). In a read operation, syndrome bits are generated simultaneously from the word read according to Eq. (3). Typically the same XOR tree is used to generate both the check bits and the syndrome bits (see Figure 4).
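The bound on the number of data bits and the odd-weight-column construction can be sketched in a few lines of Python. The function names are illustrative assumptions, and listing the columns lightest first is only one plausible ordering; a designer shortening a code would select columns according to the objectives mentioned above.

```python
from itertools import combinations

def min_check_bits(k):
    """Smallest r with k <= 2**(r-1) - r, the SEC-DED bound stated in the text."""
    r = 2
    while 2 ** (r - 1) - r < k:
        r += 1
    return r

def odd_weight_columns(r):
    """All distinct r-tuples of odd weight, lightest first."""
    cols = []
    for w in range(1, r + 1, 2):
        for ones in combinations(range(r), w):
            cols.append(tuple(1 if i in ones else 0 for i in range(r)))
    return cols

check_bits = {k: min_check_bits(k) for k in (16, 32, 64, 128, 256)}
# check_bits == {16: 6, 32: 7, 64: 8, 128: 9, 256: 10}

columns = odd_weight_columns(8)   # 2**7 = 128 candidate columns for r = 8
```

For $r = 8$ there are 128 odd-weight columns, of which a (72,64) code uses 72.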
Figure 4 Generation of check bits and syndrome bits.

Figure 5 Error detection and correction block diagram.

An algorithm for correcting single errors and detecting multiple errors is described as follows:

1. Test whether $S$ is 0. If $S$ is 0, the word is assumed to be error-free.
2. If $S \ne 0$, try to find a perfect match between $S$ and a column of the $H$ matrix. The match can be implemented in $n$ $r$-way AND gates.
3. If $S$ is the same as the $i$th column of $H$, the $i$th bit of the word is in error.
4. If $S$ is not equal to any column of $H$, the errors are detected as uncorrectable (UE).

This algorithm applied to a SEC-DED code corrects all single errors and detects all double errors. Multiple-bit errors may be detected or falsely corrected. The extent of multiple errors detected depends on the structure of the code.

As shown in Figure 5, hardware implementation of the error correction and detection mainly consists of an $r$-way OR gate for testing nonzero syndrome, $n$ $r$-way AND gates for decoding syndromes, an $n$-way NOR gate for generating UE signal, and $n$ two-way XOR gates for inverting the code bit in error. Additionally, an $n$-bit data register and control logic for timing are required.

A UE signal can also be generated based on the logical OR of the minterms of all UE syndromes. A subset of all UE syndromes is the set of even-weight syndromes caused by even numbers of errors. This subset of syndromes can be recognized by an $r$-way XOR gate.

The failure of a common logic support in the memory may result in an all-ones or an all-zeros pattern in a codeword. In this case, the error vector in general contains a multiple number of errors that are not detectable by a SEC-DED code. To prevent this kind of data loss, the code can be constructed or modified so that an all-ones or an all-zeros $n$-tuple is not a codeword. For example, if the check bits are inverted before the codeword is written into the memory, then all the codewords stored in the memory are nonzero. In general, the detection of all-ones and all-zeros errors can be achieved by inverting a subset of the check bits [9].

SEC-DED-SBD codes
In some applications it is required that the memory array chips be packaged in a $b$-bits-per-chip organization. A chip failure or a word-line failure in this case would result in a byte-oriented error that contains from 1 to $b$ erroneous bits. Byte errors can also be caused by the failures of the supporting modules at the memory card level. The class of SEC-DED codes that are capable of detecting all single-byte errors (SEC-DED-SBD codes) may be used to maintain data integrity in these applications.

The $H$ matrix of a SEC-DED-SBD code can be divided into $N$ blocks of $r \times b$ submatrices, $B_1, B_2, \ldots, B_N$, where $B_i$ represents the parity checks for byte position $i$. From (3), the syndrome of a byte error at position $i$ is a sum of the columns of $B_i$ that correspond to the bit error positions within the byte. The syndromes of all possible byte errors at position $i$ are the sum of all possible combinations of the columns of $B_i$. Let $\langle B_i \rangle$ denote the sums of all possible nonzero linear combinations of the columns of $B_i$. Each member of $\langle B_i \rangle$ should be nonzero and should not be equal to a column of $B_j$, for $j \ne i$. Otherwise, the byte error at position $i$ will be mistaken as no error or as a correctable single error at position $j$. Thus, the $H$ matrix of a SEC-DED-SBD code must satisfy the conditions A1 and A2 given previously, as well as the following condition:

A3. Each vector of $\langle B_i \rangle$ is nonzero and is not equal to a column vector of $B_j$, for $j \ne i$.

For $b \le 4$, most of the SEC-DED codes for practical applications can be reconfigured to detect single-byte errors. The reconfiguration involves the regrouping or rewiring of the bit positions of the original code. Since the same encoding and decoding hardware can be used, no additional hardware is required if a SEC-DED code can be reconfigured for single-byte error detection. Figure 6 illustrates some examples of SEC-DED-SBD codes. The codes in Figs. 6(a) and (b) are obtained from those in Figs. 3(b) and (d) by reconfiguration, and the code in Fig. 6(c) is the same as that in Fig. 3(c). The (72,64) codes of Fig. 6 are those used in IBM systems 3081 and 3033.

Figure 6 Examples of SEC-DED-SBD codes: (a) (40,32) code, b = 4; (b) (72,64) code, b = 4; (c) (72,64) code, b = 3 and b = 4.

Table 3 Code length in bytes for some SEC-DED-SBD codes.

b+1     2     2     3     3     3     3     3
b+2     5     6     7     8     9    10    11
b+3    10    12    15    16    18    20    22
b+4    21    26    31    36    41    46    51
b+5    42    52    63    72    82    92   102
b+6    85   106   127   148   169   190   211
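The four-step algorithm for correcting single errors and detecting multiple errors, described earlier in this section, can be sketched in software as follows. The code used is a hypothetical (8,4) odd-weight-column code assumed for illustration, not one of the codes of Figure 3 or Figure 6, and the list lookup stands in for the $n$ $r$-way AND gates of the hardware decoder.

```python
# Columns of a hypothetical (8,4) odd-weight-column SEC-DED code: four weight-3
# data columns followed by the four weight-1 (identity) check columns.
H_COLUMNS = [
    (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1),
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
]

def syndrome(word):
    """Mod-2 sum of the columns of H selected by the ones in the word."""
    s = [0, 0, 0, 0]
    for bit, col in zip(word, H_COLUMNS):
        if bit:
            s = [a ^ b for a, b in zip(s, col)]
    return tuple(s)

def decode(word):
    """Return (word after correction, status) following steps 1 to 4."""
    s = syndrome(word)
    if not any(s):                      # step 1: zero syndrome
        return list(word), "no error"
    if s in H_COLUMNS:                  # steps 2 and 3: match a column of H
        i = H_COLUMNS.index(s)
        corrected = list(word)
        corrected[i] ^= 1
        return corrected, f"corrected bit {i + 1}"
    return list(word), "UE"             # step 4: no match

codeword = [1, 0, 1, 1, 0, 0, 1, 0]     # data 1011 with check bits 0010
```

A double error gives an even-weight syndrome, which cannot equal any odd-weight column, so it is reported as UE rather than miscorrected.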
Figure 7 (10,7) SBC-DBD code with b = 3.

Table 5 Number of check bits required for SBC-DBD codes.

Bits per byte    Data bits per ECC word
                 16     32     64     128
2                8      10     10     12
3                9      12     12     12
4                12     12     14     16
b ≥ 5            3b     3b     3b     3b

code can also be defined by the parity check matrix $H$ of (1) and (2), with the components of the matrices and vectors considered elements of $GF(2^b)$. Let $h_i$, $1 \le i \le N$, be the column vectors of the $H$ matrix. The SBC-DBD code must satisfy the following conditions:

B1. $h_i \ne X \cdot h_j$ for $i \ne j$, $X \in GF(2^b)$.
B2. $h_i + X_1 \cdot h_j \ne X_2 \cdot h_k$, for distinct $i, j, k$ and $X_1, X_2 \in GF(2^b)$.

Let $r$ be the number of check bytes of an SBC-DBD code over $GF(2^b)$. For $r = 3$, a code of length $N = 2 + 2^b$ bytes can be constructed by extending a Reed-Solomon code of length $2^b - 1$ [15-19]. The parity check matrix of the code can be expressed as

$$H = \begin{bmatrix} I & I & I & \cdots & I & I & O & O \\ I & T & T^{2} & \cdots & T^{2^b-2} & O & I & O \\ I & T^{2} & T^{4} & \cdots & T^{2(2^b-2)} & O & O & I \end{bmatrix} \tag{4}$$

where $I$ is the $b \times b$ identity matrix, $O$ is a $b \times b$ all-zero matrix, $T$ is the $b \times b$ companion matrix of $X$, and $X$ is a primitive element of $GF(2^b)$ [15, 16]. If $X$ is a root of the primitive polynomial $P(X) = a_0 + a_1 X + a_2 X^2 + \cdots + a_{b-1} X^{b-1} + X^b$, the companion matrix of $X$ is

$$T = \begin{bmatrix} 0 & 0 & \cdots & 0 & a_0 \\ 1 & 0 & \cdots & 0 & a_1 \\ 0 & 1 & \cdots & 0 & a_2 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & a_{b-1} \end{bmatrix}$$

For example, with $b = 3$ and $P(X) = 1 + X + X^3$,

$$T = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

and the $H$ matrix for a (10,7) SBC-DBD code with $b = 3$ is shown in Figure 7.

Using the $H$ matrix of Eq. (4), the last three column positions of $H$ can be designated as the positions of check bytes and the other column positions of $H$ can be designated as data byte positions. The check bytes can be generated with an XOR tree just as in the case of SEC-DED codes. The syndrome can also be generated with the same XOR tree. For decoding, the syndrome $S$ is divided into three parts, $S_1$, $S_2$, $S_3$. Each $S_i$ consists of $b$ bits and represents the parity check equations for the $i$th row of (4). From (3), if $E$ is a single-byte error pattern at data byte position $i$, then $E$ is a unique solution to the following three equations:

$$S_1 = E^{t},$$
$$S_2 = T^{i} \cdot E^{t},$$
$$S_3 = T^{2i} \cdot E^{t}.$$

On the other hand, if $E$ is a byte error pattern at check byte position $i$, where $i$ = 1, 2, or 3, then $E = S_i^{t}$ and the other two subsyndromes are zeros. The following steps can be taken to find the correctable single-byte error patterns and to detect multiple uncorrectable byte errors.

1. If $S$ is a zero vector, assume that there is no error. If $S$ is nonzero, go to step 2.
2. If one of the subsyndromes $S_i \ne 0$, and the other two subsyndromes are zero, $i$ = 1, 2, 3, the check byte position $i$ with error pattern $S_i$ is assumed. Otherwise, go to step 3.
3. Assume that $E = S_1^{t}$. Find $i$ that satisfies $0 \le i \le N - 4$, $T^{i} \cdot E^{t} = S_2$, and $T^{2i} \cdot E^{t} = S_3$. If $i$ has a solution, the byte error with pattern $E$ at data byte position $i$ is assumed. If $i$ has no solution, then an uncorrectable error is detected.

A block diagram for the generation of the error pointers for the code of Fig. 7 is shown in Figure 8.

The extended Reed-Solomon codes defined in Eq. (4) are optimal in that no other SBC-DBD codes with three check bytes contain more data bytes. However, there exists only one code for a given byte size $b$. When $b$ is small, the code may be too short for memory applications. For example, the code for $b = 2$ can only accommodate six data bits. This code certainly is not practical for most applications. In order to increase the code length for a given $b$, additional check bits are required.
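The three decoding steps for the (10,7) SBC-DBD code with $b = 3$ can be sketched as follows. This is an illustrative software model under stated assumptions, not the hardware of Figure 8: bytes are lists of three bits, `T` is the 3 x 3 companion matrix for $b = 3$, data byte positions are numbered 0 to 6 as in step 3, and all function names are hypothetical.

```python
B = 3                     # bits per byte
N_DATA = 7                # data byte positions i = 0 .. N - 4, with N = 10
T = [[0, 0, 1],
     [1, 0, 1],
     [0, 1, 0]]           # companion matrix for b = 3

def mat_vec(m, v):
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in m]

def t_power(i, v):
    """Return T^i * v over GF(2)."""
    for _ in range(i):
        v = mat_vec(T, v)
    return list(v)

def xor(u, v):
    return [a ^ b for a, b in zip(u, v)]

def syndromes(word):
    """word: 7 data bytes then 3 check bytes. Returns (S1, S2, S3) per Eq. (4)."""
    s1, s2, s3 = [0] * B, [0] * B, [0] * B
    for i in range(N_DATA):
        s1 = xor(s1, word[i])
        s2 = xor(s2, t_power(i, word[i]))
        s3 = xor(s3, t_power(2 * i, word[i]))
    return xor(s1, word[7]), xor(s2, word[8]), xor(s3, word[9])

def encode(data):
    """Choose the check bytes so that all three subsyndromes are zero."""
    zero = [0] * B
    s1, s2, s3 = syndromes(list(data) + [zero, zero, zero])
    return list(data) + [s1, s2, s3]

def decode(word):
    """Return (status, position, error pattern) following steps 1 to 3."""
    s = syndromes(word)
    nonzero = [j for j in range(3) if any(s[j])]
    if not nonzero:                               # step 1
        return "no error", None, None
    if len(nonzero) == 1:                         # step 2: check byte 1, 2, or 3
        j = nonzero[0]
        return "check byte", j + 1, s[j]
    e = s[0]                                      # step 3: assume E = S1
    for i in range(N_DATA):
        if t_power(i, e) == s[1] and t_power(2 * i, e) == s[2]:
            return "data byte", i, e
    return "UE", None, None
```

The search over `i` in step 3 is written as a loop for clarity; a hardware decoder would evaluate all positions in parallel.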
Figure 8 Generation of error vectors for a (10,7) SBC-DBD code with b = 3.

DEC-TED codes
A memory system with a large capacity or with high chip failure rates may use a double-error-correcting and triple-error-detecting (DEC-TED) code to meet its reliability requirements. A DEC-TED code is also attractive for a memory with a high soft error rate. Although there are schemes [21-25], to be discussed in a subsequent section, for a SEC-DED code to correct hard-hard and hard-soft types of double errors, these schemes cannot correct double soft errors and they require the interruption of a normal memory read operation. With a DEC-TED code, any combination of hard and soft double errors, including double soft errors, can be corrected automatically without system interruption.

$$H = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & X & X^{2} & \cdots & X^{N-2} \\ 1 & X^{3} & X^{6} & \cdots & X^{3(N-2)} \end{bmatrix} \tag{5}$$

The $H$ matrix defined by (5) can be transformed into the systematic form of (2) for the generation of check bits (see

$$H = T \, H1. \tag{6}$$

The generation of check bits from matrix H1 can be implemented with an XOR tree. For decoding, it is convenient to define the syndrome $S$ from (3) with the $H$ matrix instead of the H1 matrix. The syndrome can be generated using an XOR tree associated with the $H$ matrix. Thus, two separate XOR trees are used to generate check bits and syndrome bits. The syndrome can also be generated by first generating S1 from Eq. (3) with the H1 matrix, then multiplying matrix $T$ by S1. Using this approach, the same XOR tree can be used to generate check bits and S1. The validity of this procedure follows directly from Eq. (6).

The syndrome $S$ can be divided into three parts, $S_0$, $S_1$, and $S_2$, where $S_0$ consists of one bit, and $S_1$ and $S_2$ consist of $m$ bits. Let the bit positions of the code be assigned as the powers of $X$. Assume that $E_1$ and $E_2$ are the positions of two erroneous bits. Then $S_0 = 0$ and $S_1 = E_1 + E_2$, $S_2 = E_1^3 + E_2^3$. Since $S_1^3 + S_2 = E_1^2 E_2 + E_1 E_2^2 = E_1 E_2 S_1$, the error positions $E_1$ and $E_2$ are roots of the quadratic equation

$$S_1 y^2 + S_1^2 y + (S_1^3 + S_2) = 0. \tag{7}$$

On the other hand, if there is only one error, then $S_0 = 1$ and the error position is the root of the linear equation $y + S_1 = 0$. In this case $S_1 = E$ and $S_2 = E^3$ for a single error at position $E$, so $S_1^3 + S_2 = 0$ and Eq. (7) reduces to $S_1 y (y + S_1) = 0$.

The major part of the error correction is to find the error positions from the syndrome. Once the error positions are known, the errors are corrected by inverting the data bits at the error positions. The error positions are determined by solving Eq. (7). If $S_0 = 0$, and Eq. (7) has two solutions, then the solutions are the positions of two errors. If $S_0 = 1$, and Eq. (7) degenerates to a linear equation, then the solution is the position of a single error. Uncorrectable errors are detected if Eq. (7) has no solution when $S_0 = 0$, or Eq. (7) does not degenerate into a linear equation when $S_0 = 1$.

There are various schemes for solving Eq. (7) [36-38]. The equation can be solved algebraically using hardware that implements finite-field operations as in [36]. It can also be solved by substituting all possible solutions into the equation, as in [38]. Another approach is to store the error positions of correctable errors in a table. The syndrome is used as the address to the table of error positions [37].

Extended error correction
Errors in semiconductor memory can be broadly divided into hard errors and soft errors [24, 25]. Hard errors are caused by stuck faults or by permanent physical damage to the memory devices. Soft errors are temporary errors or α-particle-induced errors that will be erased during the next data storage operation. For this discussion, the errors that will stay in their locations during the next few write cycles are considered hard errors.

Error-correcting codes can be used to correct hard as well as soft errors. However, the maintenance strategy for a system may allow the hard errors to accumulate. The presence of errors in the memory increases the probability of uncorrectable errors (UE) due to the lineup of multiple errors in a codeword. The UE rate can be reduced by repair service scheduled periodically. It can also be reduced by extending the conventional error correction to some of the otherwise uncorrectable errors. The latter approach is especially attractive when the soft error rates are high, because it does not require the replacement of memory components. The extended error-correction schemes are discussed in this section.

The errors for which locations but not values are known are called erasures [15, 16]. Erasures are easier to correct than random errors. Let $t$ and $e$ be the number of random errors and erasures, respectively, that a code is capable of correcting; then the minimum distance $d$ of the code must satisfy [15, 16]

$$2t + e < d. \tag{8}$$

For example, a SEC-DED code is capable of correcting one random error and one erasure.

In memory applications, the hard errors can be considered erasures if their locations can be identified. To locate the erasures of a particular word in the memory, we may apply some test patterns to the memory. Assume that any binary pattern can be written into the memory. An example is shown in Table 7 for finding the locations of erasures with two test patterns, $T_1$ and $T_2$, of length 8, where $T_2$ is the complement of $T_1$. Before the test patterns are written into and read out of the memory, the word originally stored in the memory is read out and stored in a temporary storage. The erasure vector is obtained by the complement of $T_1$(READ) + $T_2$(READ). The locations of the erasures are indicated by the ones in the erasure vector. Since $T_1$ can be arbitrarily chosen, we may also use the word originally stored in the memory as $T_1$. This approach for locating the erasures, known as the double complement algorithm, saves one write and one read operation. An example of the algorithm is shown in Table 8.

Table 7 Example of locating erasures.

Direction of stuck faults     1 0 - - - - - 1
T1 (WRITE)                    1 1 0 0 1 1 0 0
T1 (READ)                     1 0 0 0 1 1 0 1
T2 (WRITE)                    0 0 1 1 0 0 1 1
T2 (READ)                     1 0 1 1 0 0 1 1
T1 (READ) + T2 (READ)         0 0 1 1 1 1 1 0
Erasure error                 1 1 0 0 0 0 0 1

Some system designs permit only the codewords to be written into the memory [21, 22, 25]. If the complement of a
codeword is not a codeword, then the approaches just described for the identification of erasures are not applicable. In this case, one solution is to design codes with some special properties [21, 22]. Another solution is to employ three test patterns in locating the erasures [25]. The test patterns are chosen in such a way that they contain at least one 1 and one 0 in every bit position. It can be shown that three test patterns are sufficient to satisfy this condition for any linear code.

Table 8 Example of double complement algorithm.

Original word = T1 (WRITE)              1 1 0 0 1 1 0 0
Hard and soft errors                    H - - - - - S -
T1 (READ)                               0 1 0 0 1 1 1 0
T2 (WRITE) = complement of T1 (READ)    1 0 1 1 0 0 0 1
T2 (READ)                               0 0 1 1 0 0 0 1
T1 (READ) + T2 (READ)                   0 1 1 1 1 1 1 1
Erasure error                           1 0 0 0 0 0 0 0
T3 = complement of T2 (READ)            1 1 0 0 1 1 1 0
Soft error = T3 + T1 (WRITE)            0 0 0 0 0 0 1 0

Once the locations of the erasures are identified, algorithms can be designed to correct the hard and soft errors, provided that the number of errors satisfies Eq. (8). Assume that the double complement algorithm is applicable for locating the erasures. The following procedure can be used to correct up to two hard errors or a combination of one hard and one soft error for a SEC-DED code:

Note that double soft errors are not correctable by this procedure. All single errors are corrected at the normal speed. The correction of hard-hard and hard-soft types of double errors takes more time because additional write and read operations are involved. The procedure can be modified or refined to correct additional multiple hard errors [21, 24] at the expense of speed and cost. The procedure can also be extended to correct multiple errors beyond the random error-correcting capability of SBC-DBD codes and DEC-TED codes.

The procedure just described derives the information on erasures at the time when the double error occurs. A different method is to store the information on the erasure errors in a table [22]. This approach increases the speed of correcting double errors. However, the table has to be constantly updated to reflect the true status of the erasures in the memory.

There are other schemes for the correction of multiple erasures [39-41]. These schemes involve the design of codes with additional check bits, which are used to mask the erasures in decoding. For example, a (76,64) code can be designed to correct double erasures and single random errors, and to detect double random errors [40].

Acknowledgment
Contributions made by D. C. Bossen are gratefully acknowledged.

References
1. L. Levine and W. Myers, "Semiconductor Memory Reliability with Error Detecting and Correcting Codes," Computer 9, 43-50 (October 1976).
2. B. Richard, "Automatic Error Correction in Memory Systems," Computer Design 15, 179-182 (May 1976).
3. P. K. Lala, "Error Correction in Semiconductor Memory Systems," Electron. Eng. 18, 49-53 (January 1979).
4. A. V. Ferris-Prabhu, "Improving Memory Reliability through Error Correction," Computer Design 18, 137-144 (July 1979).
5. R. W. Hamming, "Error Detecting and Error Correcting Codes," Bell Syst. Tech. J. 29, 147-160 (April 1950).
6. M. Y. Hsiao, "A Class of Optimal Minimum Odd-Weight-Column SEC-DED Codes," IBM J. Res. Develop. 14, 395-401 (July 1970).
7. M. Y. Hsiao, W. C. Carter, J. W. Thomas, and W. R. Stringfellow, "Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress," IBM J. Res. Develop. 25, 453-465 (September 1981).
IBM J. RES, DEVELOP. *V0L- 28 • NO. 2 * MARCH 1984 C. L, CHE.N AND M. Y, HSIAO
8. S. Lin and D. J. Costello, Jr., Error Control Coding: Fundamentals and Applications, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1983.
9. G. R. Basham, "New Error-Correcting Technique for Solid-State Memories Saves Hardware," Computer Design 15, 110-113 (October 1976).
10. D. Morris, "ECC Chip Reduces Error Rate in Dynamic RAMS," Computer Design 19, 137-142 (October 1980).
11. D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press, Digital Equipment Corporation, Bedford, MA, 1982.
12. D. C. Bossen, L. C. Chang, and C. L. Chen, "Measurement and Generation of Error Correcting Codes for Package Failures," IEEE Trans. Computers C-27, 201-204 (March 1978).
13. S. M. Reddy, "A Class of Linear Codes for Error Control in Byte-per-Package Organized Memory Systems," IEEE Trans. Computers C-27, 455-458 (May 1978).
14. C. L. Chen, "Error Correcting Codes with Byte Error Detection Capability," IEEE Trans. Computers C-32, 615-621 (July 1983).
15. E. R. Berlekamp, Algebraic Coding Theory, McGraw-Hill Book Co., Inc., New York, 1968.
16. W. W. Peterson and E. J. Weldon, Jr., Error Correcting Codes, 2nd ed., MIT Press, Cambridge, MA, 1972.
17. I. S. Reed and G. Solomon, "Polynomial Codes over Certain Finite Fields," J. Soc. Ind. Appl. Math. 8, 300-304 (June 1960).
18. T. Kasami, S. Lin, and W. W. Peterson, "Some Results on Cyclic Codes Which are Invariant under the Affine Group and Their Applications," Info. Control 11, 475-496 (November 1967).
19. J. K. Wolf, "Adding Two Information Symbols to Certain Nonbinary BCH Codes and Some Applications," Bell Syst. Tech. J. 48, 2405-2424 (1969).
20. D. C. Bossen, "b-Adjacent Error Correction," IBM J. Res. Develop. 14, 402-408 (July 1970).
21. W. C. Carter and C. E. McCarthy, "Implementation of an Experimental Fault-Tolerant Memory System," IEEE Trans. Computers C-25, 557-568 (June 1976).
22. C.-E. W. Sundberg, "Erasure and Error Decoding for Semiconductor Memories," IEEE Trans. Computers C-27, 696-705 (August 1978).
23. P. K. Lala, "An Adaptive Double Error Correction Scheme for Semiconductor Memory Systems," Digital Processes 4, 237-243 (1978).
24. R. Nelson, "Effortless Error Management," Computer Design 21, 163-168 (February 1982).
25. D. C. Bossen and M. Y. Hsiao, "A System Solution to the Memory Soft Error Problem," IBM J. Res. Develop. 24, 390-397 (May 1980).
26. W. K. Mikhail, R. W. Bartoldus, and R. A. Rutledge, "The Reliability of Memory with Single-Error Correction," IEEE Trans. Computers C-31, 560-564 (June 1982).
27. C. L. Chen and R. A. Rutledge, "Fault-Tolerant Memory Simulator," IBM J. Res. Develop. 28, 184-195 (1984, this issue).
28. M. R. Libson and H. E. Harvey, "A General-Purpose Memory Reliability Simulator," IBM J. Res. Develop. 28, 196-205 (1984, this issue).
29. S. J. Hong and A. M. Patel, "A General Class of Maximal Codes for Computer Applications," IEEE Trans. Computers C-21, 1322-1331 (December 1972).
30. T. T. Dao, "Design and Implementation of a Non-Binary Code for Byte-Organized Memory with Binary and Quaternary Logics," Proceedings of the 8th IEEE International Symposium on Multi-Valued Logic, Rosemont, IL, May 1973, pp. 24-26.
31. S. Kaneda and E. Fujiwara, "Single Byte Error Correcting Double Byte Error Detecting Codes for Memory Systems," IEEE Trans. Computers C-31, 596-602 (July 1982).
32. R. C. Bose and D. K. Ray-Chaudhuri, "On a Class of Error-Correcting Binary Group Codes," Info. Control 3, 68-79 (March 1960).
33. A. Hocquenghem, "Codes Correcteurs d'Erreurs," Chiffres 2, 147-156 (1959).
34. J. M. Goethals, "On the Golay Perfect Binary Code," J. Comb. Theory 11, 178-186 (September 1971).
35. C. L. Chen, "On Shortened Finite Geometry Codes," Info. Control 20, 216-221 (April 1972).
36. T. H. Howell, G. E. Gregg, and L. Rabins, "Table Lookup Direct Decoder for Double-Error Correcting BCH Codes Using a Pair of Syndromes," U.S. Patent No. 4,030,067, June 14, 1977.
37. J. T. Yamato and T. K. Tama, "Error Correcting and Controlling System," U.S. Patent No. 4,107,652, August 15, 1978.
38. P. Golan and J. Hlavicka, "New Method for Parallel Decoding of Double-Error Correcting Group Codes," Proceedings of the 13th International Conference on Fault-Tolerant Computing, Milan, Italy, June 1983, pp. 338-341.
39. B. S. Tsybakov, "Defects and Error Correction," Problemy Peredachi Informatsii 11, 21-30 (1975).
40. A. V. Kuznetsov, T. Kasami, and S. Yamamura, "An Error Correcting Scheme for Defective Memory," IEEE Trans. Info. Theory IT-24, 712-718 (November 1978).
41. C. L. Chen, "Linear Codes for Masking Memory Defects," presented at the IEEE International Symposium on Information Theory, St. Jovite, Quebec, Canada, September 26-30, 1983.

Received June 30, 1983; revised September 26, 1983

C. L. (Jim) Chen IBM Data Systems Division, P.O. Box 390, Poughkeepsie, New York 12602. Dr. Chen is a senior engineer working on error-correcting codes and fault-tolerant memory systems. Before joining IBM in 1974, he held a postdoctoral position at the University of Hawaii and was a faculty member of the University of Illinois. He received his Ph.D. degree in electrical engineering from the University of Hawaii. Dr. Chen is a member of the Institute of Electrical and Electronics Engineers. He has received three IBM Invention Achievement Awards and one IBM Outstanding Innovation Award for his work on error-correcting codes.

M. Y. (Ben) Hsiao IBM Data Systems Division, P.O. Box 390, Poughkeepsie, New York 12602. Dr. Hsiao is a senior technical staff member and manager of the Laboratory Engineering Analysis Department. His current professional interests include research and development in computer reliability, availability, serviceability, error-correcting codes, error detection, failure-isolation techniques, and system engineering analysis. He joined IBM in Poughkeepsie in the Advanced Reliability Technology Department in 1960. From 1965 to 1967, he was on educational leave to the University of Florida, after which he returned to IBM as advisory engineer in the Reliability and Diagnostic Engineering Department. In 1969, he was promoted to senior engineer and manager of the Reliability Technology Department. He assumed his present position in 1979. Dr. Hsiao received his B.S. in electrical engineering in 1956 from Taiwan University, Taipei, his M.S. in mathematics in 1960 from the University of Illinois, and his Ph.D. in electrical engineering in 1967 from the University of Florida. He has seven IBM Invention Achievement Awards, two IBM Outstanding Innovation Awards, and a Corporate Award in the areas of error-correction codes, error detection, and failure-isolation techniques. He has authored and co-authored two books published in 1964 and 1968. Dr. Hsiao is a Fellow of the Institute of Electrical and Electronics Engineers and a member of the Fault Tolerant Computing Committee and IFIPS Committee on Reliable Computing and Fault Tolerance.
C. L. CHEN AND M. Y. HSIAO, IBM J. RES. DEVELOP., VOL. 28, NO. 2, MARCH 1984