Design of Fault-Tolerant Versions of the Main Linear Algebra Alg

UDС 681.325

O. Maslennikov

Department of Electronics, Technical University of Koszalin

Partyzantow 17, 75-411 Koszalin, Poland

E-mail: [email protected]

Design of Fault-Tolerant Versions of the

Main Linear Algebra Algorithms

The modification of the weighted checksum method, which allows to derive the fault tolerant versions of most linear algebra algorithms is proposed. The purpose is detection and correction of calculation errors occurred due to transient hardware faults. Using the proposed method, the fault-tolerant version of Faddeeva algorithm is designed in this paper. The computational complexity of new algorithm is increased approximately on O(N2) multiply-add operations in comparison with the original one. However, new algorithm enables to detect and to correct a single error in an arbitrary row or column of input data matrices at the each algorithm step. Finally, the results of experimental verification of the proposed algorithm are represented.

Key words: algorithm-based fault-tolerance, weighted checksum method, linear algebra algorithms.

Introduction

The methods of linear algebra (LA) make a basis for mathematical models in various fields of science, engineering and technology. However, most of LA algorithms are characterized by a high computational complexity (O(N3) multiply-add operations, where N is the order of input data matrix) and regularity [1-4]. Therefore, both homogeneous and heterogeneous parallel systems are most suitable for effective realization of these algorithms.

Application areas of parallel systems demand a large degree of reliability of output results. However, the probability of physical failures increases along with increasing of the algorithm and target computing system complexity. Since a single temporary or permanent failure in a processor can break down an entire computing system, fault tolerance should be provided in these cases on hardware or (and) software levels. The most known methods for providing fault tolerance use hardware or time redundancy, which increase the cost or degrade the performance of computational systems. Therefore, the algorithm-based fault tolerance (ABFT) methods are more suitable for such systems.

© O. Maslennikov

ABFT is an error detection, localization and correction scheme, which uses redundant computations within the algorithms to detect and correct errors caused by transient failures in the hardware, concurrently with normal operation [5-8, 14]. In ABFT, the input data are encoded in the form of error detecting or (and) correcting codes. The algorithm is modified to operate on encoded data and produce encoded outputs, from which useful information can be recovered very easily. Thus, ABFT methods establish the rules of the original applied algorithms and input data arrays modification. From this, it is clear that these methods are not a general mechanism as some other methods (e.g. the triple modular TMR or triple time redundancy TTR methods), because they may be varied from algorithm to algorithm [14]. However, when the modified algorithm is actually executed on target system, the overheads are required to be minimal in comparison with other known methods.

Module-level faults are assumed in the algorithm-based fault tolerance. A module (the processor unit or the computer among a local network) is allowed to produce arbitrary logical errors under physical failure mechanism. This assumption is quite general, since it does not assume any technology-dependent fault model. Without loss of generality, a single module error is assumed in this paper.

The most known ABFT method called weighted checksum (WCS) one, which is specially tailored to matrix algorithms, has been proposed by Abraham et al [5-7]. However, the original WCS method is not suitable for most LA algorithms, since a single transient fault in arbitrary processor unit of a parallel system during computation might cause multiple output errors, which can not be located and corrected. Therefore, in the papers [8-10], we propose the modified WCS method and FT versions of Gauss elimination, Cholesky, Householder reflections and Givens rotations algorithms. In this paper, we establish the sufficient conditions to use the modified WSC-method and then apply it for design fault-tolerant versions of Faddeeva algorithm. Finally, the verification results of the proposed algorithm are represented.

Weighted checksum (WCS) method and its modification for designing of fault-tolerant versions of main LA algorithms

The WCS-code has been adopted by Jou and Abraham [5] in matrix arithmetic operations for algorithm-based fault tolerance. The idea is to compress the information contained in the row/column elements of matrix into a single element, which is named as a check element. For example, a WCS encoded data vector v(N) with Hamming distance equals three, which can correct a single error (SEC) and can be expressed as

EMBED Equation.3, (1)

where vi is an element of a data vector v(N),

EMBED Equation.3, (2)

and p(N), q(N) — are encoder vectors.

Possible choices for vector pairs p and q are, for example, [11]

EMBED Equation.3 and EMBED Equation.3 (3)

(where q is named the exponential weighted encoder vector), or [12]

EMBED Equation.3 and EMBED Equation.3 (4)

(where q is named the linear weighted encoder vector).

The difficulty with the first choice is a loss of the numerical accuracy due to large weights, while the second choice leads to larger extra computations necessary to correct an error.

Moreover, the following more advanced encoder vector pairs were proposed in the [7]:

average and weighted average encoder vectors:

pT = [ 1/N 1/N … 1/N ] and qT = [ 1/N 2/NN/N ]; (5)

2) normalized encoder vectors:

pT = [ c/a2 c/a2c/a2 ] and

qT = [1⋅c/A 2⋅c/A … Nc/A], (6)

where A is the average value of matrix column (or row) Euclidean norms and c is a constant fixed by user.

Experimental evaluation of numerical error for proposed encoder vectors also was researched in [7] for the set of random generated matrices. The main results are as follows: when round errors are the larger problem, one should use normalized encoder vectors; for overflow problems, one should use average encoder vectors.

Based on the linear encoding vector (4), for example, a matrix A(M, N) can be encoded as either row encoded matrix AR given by

EMBED Equation.3, (7)

a column encoded matrix AC,

EMBED Equation.3, (8)

or a full encoded matrix ARC [11, 12] given by

EMBED Equation.3. (9)

For example, for matrix multiplication A(M, N) · B(N, K) = C(M, K), the column encoded matrix AC of a form (8) is used. Then, the following expression is computed:

EMBED Equation.3.

To verify the computation, syndromes S1 and S2 for the j-th column of matrix C should be calculated (j = 1, …, K):

EMBED Equation.3 and EMBED Equation.3.

In order to correct a single error, the following procedure (10) is used:

if S1 = S2 = 0, then no error has been detected;

if S1 0 and S2 = 0, then PCSj is inconsistent;

if S2 0 and S1 = 0, then QCSj is inconsistent;

if S1 0 and S2 0, then i = round (S2/S1) and element cij is erroneous,

and the correction procedure is:

EMBED Equation.3, (10)

when round (S2/S1) is the nearest integer number to the value S2/S1.

Thus, the computational complexity of the modified version of the matrix multiplication algorithm increases only on the 2N2 operations and is equal to (N3 + 2N2) multiply-add operations. This version allows to detect and to correct the single error among elements of each column of an input matrix A(M, N) occurred during algorithm implementation. Consequently, it is possible to correct up to N single errors during solving the whole task.

However, the original WCS method is not suitable for such LA algorithms as, for example, Gauss elimination, Jordan-Gauss, Faddeeva and Cholesky algorithms, Householder reflections and Givens rotations algorithms, etc. In fact, the common property of all above-mentioned algorithms is the computation at the i-th algorithm step (may be not one time) the elements of leading (i-th) row or/and column of matrix Ai = { EMBED Equation.3 } and then modification of other matrix rows (columns) by means of leading ones. The example of corresponding fragment of such algorithms with leading column computations is represented by means of a construction (11), were values of variables K, K1, K2 and functions g1, g2 are dependent on the selected algorithm. Note, that in this example the input matrix A = A1 = {aji} is recursively modified during K computation steps to obtain the resulting matrix AK+1.

for i := 1 to K do

begin

{Phase 1: computation of the leading column elements EMBED Equation.3 }

for j := i + 1 to K1 do

EMBED Equation.3 := g1( EMBED Equation.3 , EMBED Equation.3 ); (11)

{Phase 2: computation of the elements of the matrix Ai+1}

for j := i + 1 to K1 do

for k := i + 1 to K2 do

EMBED Equation.3 := g2( EMBED Equation.3 , EMBED Equation.3 , EMBED Equation.3 );

end

As seen in from the (11), if at the i-th algorithm step the element EMBED Equation.3 of leading (i-th) column is calculated wrong, then errors will appear in all elements EMBED Equation.3 of j-th row of Ai+1. Analogously, if any element EMBED Equation.3 of the leading (i-th) row was calculated wrong, then errors appear in the all elements of j-th column of Ai+1. In both cases, these errors can not be located and corrected by WCS method. If the correction of elements ajk is performed during calculations, then the computational complexity of the original algorithm increases more than twice.

For removing these defects by means of modification of the original WCS method, the following confirmations were proved for all above-mentioned algorithms (see [8-10] and the next paper section):

— if during i-th step of computations the element EMBED Equation.3 is calculated wrong, then errors will not appear among other elements of matrix Ai+1, while j-th row isn’t the leading one (i.e. while ij);

— if the element EMBED Equation.3 (j = i, i + 1, …, N) was calculated wrong several times q (q < i) before performing the i-th step of algorithm (11), then it is possible to correct it using the WCS method for the row encoded matrix AR (7) at the beginning of the i-th step of the algorithm;

— if an element EMBED Equation.3 (j = i, i + 1, …, N) was calculated wrong during executing the first phase of the i-th step of algorithm (11), then it is possible to correct it using the WCS method for the column encoded matrix AC (8) after executing this phase.

The main consequence of these confirmations is the possibility of performing the detection and correction procedures during each i-th algorithm step among only elements of the leading (i-th) row and leading (i-th) column of the matrix Ai. Based on these confirmations, the modification of the origin WCS method was performed. The main idea of the proposed unified WCS method (scheme) destined for main linear algebra algorithms is the performing of its check procedures concurrently with algorithm computations or more exactly, the performing of the detection and correction procedures:

— at each i-th algorithm step;

— among only elements of the leading (i-th) row and leading (i-th) column of matrix Ai.

Note, that proposed modified method may be used to design the fault-tolerant version of an arbitrary matrix algorithm for which above-mentioned confirmations are corrected.

Therefore, these confirmations may be considered as sufficient conditions for using the modified WCS method.

As a result, the proposed checksum scheme increases the computational complexity of original algorithm (11) approximately on O(N2) operations (such as multiply-add operations). However, the proposed uniform scheme enables to correct one error among elements of an arbitrary column (or row) of an input matrix A(M, N) on any from K steps of algorithm implementation. Consequently, it is possible to correct up to K (where K = (N – 1) for case M = N) errors during solving the whole LA task.

Design of the fault tolerant version of Faddeeva algorithm

Starting with N × N, N × K, P × N and P × R input matrices A, B, C and D, respectively, Faddeeva algorithm is intended [3, 13] for solving matrix equations of the type

X = C A–1 B + D, (12)

were the four input matrices form an (N + P) × (N + R) joint matrix EMBED Equation.3when arranged in the following way:

EMBED Equation.3. (13)

The idea of Faddeeva algorithm consists of reduction the lower left quadrant of the matrix EMBED Equation.3(i.e. C-matrix) to zero matrix, while in the lower right quadrant of the matrix EMBED Equation.3 is formed the resultant P × R matrix X. In order to perform above-stated operations with A being a non-singular matrix, the Gauss elimination algorithm is used. Hence, in the course of computations, the joint matrix EMBED Equation.3 is being transformed into the following matrix:

EMBED Equation.3, (14)

where U is the upper triangular matrix.

The main practical advantage of Faddeeva algorithm is its versatility. This stems from the fact that expression (12) allows to solve a set of problems, for example (15):

— solving a system AX = B of linear algebraic equations with one or more right-hand sides (depending upon the numbers of columns in B), i.e.

X = A–1 B for EMBED Equation.3,

where I is the identity matrix;

— matrix multiplication X = C B for A = I, D = 0;

— matrix multiply — add operation X = C B + D for A = I; (15)

— matrix inversion X = A–1 for C = B = I, D = 0;

— adaptive filtering algorithms X = C · A–1 + D for B = I.

To provide a numerical stability of Faddeeva algorithm, Gauss elimination with partial pivoting within columns [1-4] is usually used. The described above version of Faddeeva algorithm can be expressed by the following Pascal-like form:

for i := 1 to N do

begin

for j := i + 1 to N do

begin

{pivoting}

if abs(fii ) < abs(fji)

then begin s := fii ; fii := fji; fji := s; vji : = 1; end

else vji := 0;

{row interchanges}

for k := i + 1 to N + R do

if vji := 1 then begin temp := fik; fik := fjk; fjk := temp; end;

end {j};

{calculation of the elements mji}

for j := i + 1 to N + P do (16)

mji := – fji / fii;

{elimination}

for j := i + 1 to N + P do

for k := i + 1 to N + R do

fjk := fjk + mji · fik;

end.

It is followed from the construction (16), that if during i-th computation step the element mji is calculated wrong, then errors will appear in all elements EMBED Equation.3 of j-th row of EMBED Equation.3. Moreover, if any element EMBED Equation.3 of the leading row is calculated wrong, then errors appear in all elements of k-th column of EMBED Equation.3. In both cases, these errors cannot be corrected by the original WCS-method. Therefore, in order to derive a fault-tolerant version of this algorithm, the proposed modified WCS-method should be used. However, the conditions represented in the previous section should be true. For algorithm (16) these conditions are transformed in the theorems 1, 2 and 3, respectively.

THEOREM 1. If during the i-th step of the algorithm (16) the element EMBED Equation.3 was calculated wrong, then errors do not appear among other elements of matrix EMBED Equation.3, while the j-th row isn’t the pivoting one (i.e. ij).

The proof of this theorem follows directly from the algorithm (16), where each element EMBED Equation.3 takes part in calculations of only elements EMBED Equation.3 , EMBED Equation.3 , …, EMBED Equation.3 , where (i + g) ≤ j and (i + g) ≤ k.

THEOREM 2. Assume that the element EMBED Equation.3 was calculated wrong q times (q < i) before executing the i-th step of algorithm (16). Then it is possible to correct its value only once, using WCS method for row encoded matrix EMBED Equation.3, at the beginning of the i-th step of the corresponding algorithm.

PROOF. Without the loss of a generality, we assume that i < j, i < k and q = 2 for element EMBED Equation.3 . Let the element EMBED Equation.3 was calculated wrong at the (i – 1)-th step of algorithms (16). Then its value will be equal to

EMBED Equation.3 = EMBED Equation.3 + EMBED Equation.3 ,

where EMBED Equation.3 is the calculation error. Then, in accordance with (16), after performing the next algorithm step, we obtain:

EMBED Equation.3 = EMBED Equation.3 + mji EMBED Equation.3 = EMBED Equation.3 + EMBED Equation.3 + mji EMBED Equation.3 .

We assume now, that last expression was also calculated wrong. In similar way, we obtain that the value of the element EMBED Equation.3 will be equal to

EMBED Equation.3 = EMBED Equation.3 + EMBED Equation.3 = EMBED Equation.3 + EMBED Equation.3 + EMBED Equation.3 + mji EMBED Equation.3 = EMBED Equation.3 + zjk + mji EMBED Equation.3 ,

where zjk = EMBED Equation.3 + EMBED Equation.3 .

Thus, the computation errors of the element fjk are accumulated in the variable zjk. Consequently, the wrong calculated element fjk may be corrected only at the beginning of the j-th step of the algorithm (16), i.e. when the j-th row will become the leading row (j = i).

THEOREM 3. Values of the checksum CSi and the weighted checksum WCSi of the i-th column of the matrix M are respectively equal to values of the checksum EMBED Equation.3 and the weighted checksum EMBED Equation.3 of the i-th column of matrix EMBED Equation.3, i.e. equals to values of the checksum and the weighted checksum of i-th column of matrix EMBED Equation.3 after performing the i-th step of the algorithms (16).

PROOF. At the beginning of the i-th step of algorithm (16) the values EMBED Equation.3 and EMBED Equation.3 of the matrix FC in accordance with (9) are equal to the following expressions:

EMBED Equation.3 = EMBED Equation.3 + EMBED Equation.3 + …+ EMBED Equation.3

and

EMBED Equation.3 = i fiii + (i + 1) ⋅ EMBED Equation.3 + … + (N + P) ⋅ EMBED Equation.3 ,

respectively.

After performing the i-th step of the algorithm (16) with the column encoded matrix FC, these values will be equal to

PCSi(i+1) = EMBED Equation.3 / EMBED Equation.3 and EMBED Equation.3 QCSi(i+1) = EMBED Equation.3 / EMBED Equation.3 .

Оn the other hand, values of the checksum CSi and the weighted checksum WCSi of the i-th column of the matrix M in accordance with the expression (8) and algorithm (16) are equal to following expressions:

CSi = 1 + m(i+1)i + m(i+2)i + …+ m(N+P)i = 1 + EMBED Equation.3 / EMBED Equation.3 + …+ EMBED Equation.3 / EMBED Equation.3 = EMBED Equation.3 / EMBED Equation.3

and

WCSi = i + (i + 1) ⋅ m(i+1)i + (i + 2) ⋅ m(i+2)i + … + (N + P) ⋅ m(N+P)i = EMBED Equation.3 / EMBED Equation.3 .

Thus, the correctness of conditions represented in the section 2 is proved. Therefore, in order to derive a fault-tolerant version of this algorithm, the proposed modified WCS method checksum scheme may be used.

However, we should be certain that the elements of i-th column of matrix Fi were calculated correctly at the (i – 1) step of the algorithm (16). It is easy proved that correctness of these elements may be verified using WCS-scheme for i-th column of matrix EMBED Equation.3 (analogously to the proof of theorem 3). Finally, the fault-tolerant version of Faddeeva algorithm without pivoting consists of execution of the following stages:

1. The original matrix F is represented in the form of the fully encoded matrix FRC:

EMBED Equation.3.

2. For i =1, 2, …, N –1, stages 3–7 are repeated.

3. At the beginning of the i-th algorithm step, error detection and correction procedure within elements belong to the i-th column row of the matrix EMBED Equation.3 is performed in accordance with the expressions (10).

4. The error detection and correction procedure within elements, which belong to the i-th row of EMBED Equation.3 matrix is performed in accordance with the procedure (10).

5. The elements mji are calculated.

6. The error detection and correction procedure for the elements mji is performed in accordance with the procedure (10).

7. The elements of matrix EMBED Equation.3 are calculated.

Note, that realization of the detection and correction procedures for elements of i-th column of matrix EMBED Equation.3and elements mji require to perform 2·(N i + P) multiply-add operations and 2·(N i + P) additions. For realization of the detection and correction procedure for the elements of i-th (leading) row of the matrix EMBED Equation.3 it is necessary to perform (N i + R) operations of multiplication with addition and (Ni + R) operations of addition. Moreover, the resulting elements xjp are not correct during computations. Therefore, these elements should be checked and corrected after algorithm implementation by means of the original checking procedure of the WCS method. For realization of this stage, it is necessary to perform N·R operations of multiplication with addition and N·R operations of addition. This means, that the computational complexity of the whole FT Faddeeva algorithm increases on O(N2) multiply-add operations and O(N2) additions. Besides, due to increasing input matrix sizes, the computational complexity of the proposed algorithm is also increased on O(N2) multiply-add operations (see Тable) in comparison with the original algorithm. However, algorithm enables to detect and to correct one error in an arbitrary row or column of the matrix EMBED Equation.3. This means, that it is possible to correct up to N errors during whole algorithm realization.

In the Pascal-like form the fault-tolerant version of Faddeeva algorithm without pivoting may be represented by the following construction (17), were δ is a small value, named a tolerance, so that a row (column) of resulting matrix will be accepted as error-free if the difference between the computed row (column) sum and checksum is less than δ. The variable ε is a machine-dependent constant with a small (roundoff) value. Note, that in the case when a single error was occurred during computations, the difference (round (S2/S1) – S2/S1) is determined by only roundoff values and therefore has a very small value. At the same time, if more errors were occurred within one column (row) computations, this difference has not a small value. Therefore, the variable ε is used here for searching multiply errors and halting program execution.

for i := 1 to N do

begin

{Error detection and correction within elements of the i-th column of EMBED Equation.3}

PCSi := 0; QCSi := 0;

for j := i to N+P do begin PCSi := PCSi + fji; QCSi := QCSi + j* fji; end;

S1 := PCSi fN+P+1,i; S2 := QCSi fN+P+2,i;

if abs (S1) > δ and abs (S2) < δ then fN+P+1,i := PCSi;

if abs (S2) > δ and abs (S1) < δ then fN+P+2,i := QCSi;

if abs (S1) > δ and abs (S2) > δ then

if abs(round (S2/S1) – S2/S1) < ε then begin j := round (S2/S1); fji := fji S1; end

else halt;

{Error detection and correction within elements of the i-th row of EMBED Equation.3}

PRSi := 0; QRSi := 0;

for k := i to N + R do begin PRSi := PRSi + fji; QRSi := QRSi + k· fji; end;

S1 := PRSi fi, N+R+1; S2 := QRSi – fi,N+R+2;

if abs (S1) > δ and abs (S2) < δ then fi,N+R+1 := PRSi;

if abs (S2) > δ and abs (S1) < δ then fi,N+R+2 := QRSi;

if abs (S1) > δ and abs (S2) > δ then

if abs (round (S2/S1) – S2/S1) < ε then begin k := round (S2/S1); fik := fik S1; end

else halt;

{Computation of elements mji of the matrix M}

for j := i + 1 to N + P + 2 do

mji := fji / fii ;

{Error detection and correction within elements of the i-th column of matrix M}

CSi := 1; WCSi := i;

for j := i +1 to N + P do begin CSi := CSi + mji; WCSi := WCSi + j* mji; end; (17)

S1 := CSi mN+P+1,i; S2 := WCSi mN+P+2,i;

if abs (S1) > δ and abs (S2) < δ then mN+P+1,i := CSi;

if abs (S2) > δ and abs (S1) < δ then mN+P+2,i := WCSi;

if abs (S1) > δ and abs (S2) > δ then

if abs (round (S2/S1) – S2/S1) < ε then begin j := round (S2/S1); mji := mji S1; end



Страницы: 1 | 2 | Весь текст


Предыдущий:

Следующий: