
Report No. 424
PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS 

by 
Yoichi Muraoka 



February, 1971 







DEPARTMENT OF COMPUTER SCIENCE 
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 



URBANA, ILLINOIS 






PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS 



BY 



YOICHI MURAOKA 
B.Eng., Waseda University, 1965
M.S., University of Illinois, 1969 



THESIS 



Submitted in partial fulfillment of the requirements 

for the degree of Doctor of Philosophy in Computer Science 

in the Graduate College of the 

University of Illinois at Urbana-Champaign, 1971 



Urbana, Illinois 






ACKNOWLEDGEMENT 

The author would like to express his deepest gratitude to Professor David J. Kuck of the Department of Computer Science of the University of Illinois, whose encouragement and good advice have led this work to successful completion. Paul Kraska also read the thesis and provided valuable comments.

Special thanks go to Mrs. Linda Bridges, without whose excellent typing the final form would never have come out. Thanks are also extended to Mrs. Diana Mercer, who helped in getting the thesis finished on time.






TABLE OF CONTENTS 



Page

1. INTRODUCTION 1

2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS 11
   2.1 Introduction 11
   2.2 Summation of n Numbers 14
   2.3 Computation of Powers 23
   2.4 Computation of Polynomials 31
       2.4.1 Computation of a Polynomial on an Arbitrary Size Machine 31
             2.4.1.1 k-th Order Horner's Rule 32
             2.4.1.2 Estrin's Method 32
             2.4.1.3 Tree Method 33
             2.4.1.4 Folding Method 35
             2.4.1.5 Comparison of Four Methods 38
       2.4.2 Polynomial Computation by the k-th Order Horner's Rule 39

3. TREE HEIGHT REDUCTION ALGORITHM 52
   3.1 Introduction 52
   3.2 Tree Height and Distribution 53
   3.3 Holes and Spaces 63
       3.3.1 Introduction 63
       3.3.2 Holes 70
       3.3.3 Space 76
   3.4 Algorithm 85
       3.4.1 Distribution Algorithm 86
       3.4.2 Implementation 91
   3.5 Discussion 94
       3.5.1 The Height of a Tree 94
       3.5.2 Introduction of Other Operators 98
             3.5.2.1 Subtraction and Division 98
             3.5.2.2 Relational Operators 99

4. COMPLETE PROGRAM HANDLING 100
   4.1 Back Substitution - A Block of Assignment Statements and an Iteration 100
   4.2 Loops 110
   4.3 Jumps 113
   4.4 Error Analysis 114

5. PARALLELISM BETWEEN STATEMENTS 122
   5.1 Program 122
   5.2 Equivalent Relations Between Executions 125

6. PARALLELISM IN PROGRAM LOOPS 135
   6.1 Introduction 135
       6.1.1 Replacement of a for Statement with Many Statements 135
       6.1.2 A Restricted Loop 141
   6.2 A Loop With a Single Body Statement 143
       6.2.1 Introduction 143
       6.2.2 Type 1 Parallelism 146
             6.2.2.1 General Case 146
             6.2.2.2 A Restricted Loop 153
             6.2.2.3 Temporary Locations 156
       6.2.3 Type 2 Parallelism 160
       6.2.4 Conclusion 167
   6.3 A Loop With Many Body Statements 167
       6.3.1 Introduction 167
       6.3.2 Parallel Computation with Respect to a Loop Index 171
       6.3.3 Separation of a Loop 173
             6.3.3.1 Introduction 173
             6.3.3.2 The Ordering Relation (e) and Separation of a Loop 174
             6.3.3.3 Temporary Storage 179
       6.3.4 Parallelism Between Body Statements 182
             6.3.4.1 Introduction 182
             6.3.4.2 The Statement Dependence Graph and the Algorithm 184
       6.3.5 Discussion 190

7. EQUALLY WEIGHTED TWO-PROCESSOR SCHEDULING PROBLEM 192
   7.1 Introduction 192
   7.2 Job Graph 196
   7.3 Scheduling of a Tight Graph 199
   7.4 Scheduling of a Loose Graph 214
   7.5 Supplement 225

8. CONCLUSION 230

LIST OF REFERENCES 233

VITA 236



LIST OF TABLES

Table                                                                    Page

2.1. The Parallel Computation Time for Summation, Power and Polynomial 12
2.2. The Number of Steps Required to Compute Σ_{i=1}^{n} a_i on P(m), h_a(m,n), for n ≤ 10 18
2.3. Computation of p_n(x) by Folding Method 38
2.4. The Number of Steps Required to Compute p_n(x), h_P(m,n), for n ≤ 10 48
4.1. Comparison of Back Substituted and Non-Back Substituted Computation - Iteration Formulas 104
4.2. Comparison of Back Substituted and Non-Back Substituted Computation - General Cases 108



LIST OF FIGURES

Figure                                                                   Page

1.1. Statement Dependence Relation 3
1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d) 5
1.3. Trees for a + b × c + d and b × c + a + d 5
1.4. Trees for a(bcd + e) and abcd + ae 6
2.1. The Minimum Number, M, of PE's Required to Add Numbers in the Minimum Time 22
2.2. Computation of x^i (1) 26
2.3. Computation of x^i (2) 27
2.4. Computation of x^i (3) 29
2.5. Computation of x^i (4) 30
2.6. Computation of a_i x^i 34
2.7. A Tree for p_{s+t-j}(x) 35
2.8. A Tree for p_{s+t}(x) 37
2.9. Comparison of the Four Parallel Polynomial Computation Schemes 40
2.10. k-th Order Horner's Rule 41
2.11. The Number of Steps, h_P(m,n), to Compute p_n(x) on P(m) by the m-th Order 44
2.12. The Minimum Number, M, of PE's Required to Compute p_n(x) in the Minimum Time 51
3.1. An Arithmetic Expression Tree (1) 52
3.2. An Arithmetic Expression Tree (2) 56
3.3. Free Nodes 63
3.4. Free Nodes in a Tree 64
3.5. An Example of F_1 and F_2 66
3.6. Elimination of a Free Node 66
3.7. A Minimum Height Tree 68
3.8. Attachment of T[t'] to a Free Node 74
3.9. An Example of Space (1) 77
3.10. An Example of Space (2) 78
3.11. Distribution of t' over A 81
3.12. Tree Height Reduction by Hole Creation 82
3.13. Stacks for an Arithmetic Expression 91
4.1. A Back Substituted Tree 102
4.2. Loop Analysis 112
4.3. A Tree with a Boolean Expression 114
4.4. Trees for a(bc+d) + e and abc + ad + e 118
5.1. Conditions for the Output Equivalence 127
6.1. E_0 148
6.2. E[I] 148
6.3. Conditions of Parallel Computation in a Loop 150
6.4. An Illustration of t 158
6.5. Wave Front 161
6.6. Wave Front Travel 162
6.7. An Illustration for Theorem 4 164
6.8. An Execution by a Wave Front 166
6.9. Simultaneous Execution of Body Statements 170
6.10. Execution of P_B 173
6.11. An Introduction of Temporary Locations 180
6.12. Wave Front for Simultaneous Execution of Body Statements 183
6.13. A Wave Front for Example 10 187
7.1. Computation of Nondistributed and Distributed Arithmetic Expressions on P(2) 194
7.2. Common Expression 195
7.3. A Loose Graph and a Tight Graph 197
7.4. A Graph G 201
7.5. An Illustration for Lemma 3 204
7.6. An Example of a Tight Graph Scheduling 209
7.7. An Illustration for Lemma 11 212
7.8. A Loose Node 214
7.9. The p-line Relation in B_n 217
7.10. An Example of the Maximum p-connectable Distance 220
7.11. An Illustration for Lemma 13 224
7.12. An Example for A_p^t(-) 226
7.13. An Example for p-connectivity Discovery 228



1. INTRODUCTION

1.1 Introduction

The purpose of this research is to study compiling techniques for parallel processing machines.

Due to remarkable innovations in technology today, such as the introduction of LSI, it has become feasible to introduce more hardware into computer systems to attain otherwise impossible speeds. For example Winograd [42] showed that the minimum amount of time required to add two t-bit numbers is ⌈log_2 t⌉·d (⌈x⌉ denotes the smallest integer not smaller than x), where we assume that an adder consists of two-input binary logic elements, e.g. AND or OR gates, and d is the delay time per gate. An adder which realizes this speed requires a huge number of gates, e.g. approximately 1300 gates for t = 6 [12], and it has been out of the question to build such an adder. However, the introduction of LSI has reduced the cost of a gate significantly; e.g. it has been anticipated that by 1974 the cost of LSI would be reduced to 0.7 cent per gate [33]. Another example is the class of parallel processing machines. The Illiac IV [7], the CDC 6600 [4] and the D825 [41] are included in this class. A machine in this class has, e.g., many arithmetic units to allow simultaneous execution of arithmetic operations. As an extreme case it has been suggested to include special arithmetic units, e.g. a log-taking unit (ln x) and an exponent unit (e^x) [16]. (Such being the case, this decade may be marked as a "computer architecture" race, reminiscent of the cycle-time and multiprogramming races of the 60's [12].) We shall not go into the details of machines further. An extensive survey of parallel processing machines is found in [30].



Having a parallel processing machine which is capable of processing many operations simultaneously, we are faced with the problem of exploiting parallelism from a program so that computational resources are kept as busy as possible to process the program in the shortest time. We now discuss the problem in detail.

In this thesis, by a parallel (processing) machine P we understand a set of arbitrarily many identical elements called processing elements (PE). A PE is assumed to be capable of performing any binary arithmetic operation, e.g. addition and multiplication, in the same amount of time. Furthermore we assume that data can be transferred between any PE's instantaneously. Also we write P(m) if P has only m PE's. A machine of this nature may be considered as a generalization of the Illiac IV.

To date two types of parallelism exploitation techniques are known to compile a program written in a conventional programming language (e.g. ALGOL) for the parallel processing machine [36]. They may be termed the intra-statement and the inter-statement parallelism exploitation techniques. The first technique is to analyze the parallelism which exists within a statement, e.g. an arithmetic expression; this has been explored by Stone [40], Squire [39], Hellerman [20], and Baer and Bovet [6]. For example consider the arithmetic expression:

a + b + c + d + e + f + g + h

and a syntactic tree for it:



[syntactic tree of height 3: the operands are added pairwise on level 1 and the partial sums are combined on levels 2 and 3]




The tree is such that operations on the same level may be done in parallel. The height of a tree is the maximum level of the tree and indicates the number of steps required to evaluate an arithmetic expression in parallel. Note that there may be many different syntactic trees for an arithmetic expression, and among them the tree with the minimum height should be chosen to attain the minimum parallel computation time. Baer and Bovet's algorithm is claimed to achieve this end, i.e. build the minimum height syntactic tree for an arithmetic expression [6].
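To make the notion of a minimum height tree concrete, the following sketch (mine, not from the thesis; Python is used here purely for illustration) builds a tree for a chain of one associative and commutative operator by pairing operands level by level, which yields the minimum height ⌈log_2 n⌉ for n operands.

```python
import math

def min_height_tree(operands, op="+"):
    """Pair operands level by level (the log sum / tree method).

    Returns (tree, height); operations on the same level of the
    returned tree can be executed in parallel.
    """
    level, height = list(operands), 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append((op, level[i], level[i + 1]))  # one parallel step
        if len(level) % 2:                            # odd operand carried up
            nxt.append(level[-1])
        level, height = nxt, height + 1
    return level[0], height

tree, h = min_height_tree(list("abcdefgh"))
assert h == math.ceil(math.log2(8))   # 3 steps for a + b + ... + h
```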

Exploitation of inter-statement parallelism has also been studied [10], [37]. An outcome of these works is an algorithm (the dependence relation detection algorithm [10]) which detects the dependence relation between statements in a loop- and jump-free sequence of statements. The dependence relation between S and S' holds if S precedes S' in a sequence and S' uses the output of S as an input to S'. For example the algorithm detects that the statement S1 in Figure 1.1 must be computed before the statement S2, but it may be computed simultaneously with S3.

S1: x := f1(y);
S2: u := f2(x);
S3: v := f3(w);
S4: z := f4(v,u);

(a) program (b) dependence relation

Figure 1.1. Statement Dependence Relation
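As an illustration of the dependence relation (a sketch of the idea only, not the algorithm of [10]), suppose each statement is abstracted as an output variable and a set of input variables; then S' depends on S exactly when S precedes S' and S' reads the output of S:

```python
def dependences(statements):
    """statements: list of (output_var, set_of_input_vars), in program order.

    Returns the pairs (i, j) such that statement j must follow statement i.
    """
    deps = []
    for i, (out_i, _) in enumerate(statements):
        for j in range(i + 1, len(statements)):
            if out_i in statements[j][1]:      # S_j reads the output of S_i
                deps.append((i, j))
    return deps

# Figure 1.1: S1: x:=f1(y); S2: u:=f2(x); S3: v:=f3(w); S4: z:=f4(v,u)
prog = [("x", {"y"}), ("u", {"x"}), ("v", {"w"}), ("z", {"v", "u"})]
print(dependences(prog))   # [(0, 1), (1, 3), (2, 3)] -- S1 and S3 are independent
```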

Since in a real program the major part of the execution time is spent within loops if it is executed sequentially, the major effort should be directed toward detecting inter-statement parallelism in loops. For example we would like to find out that all fifty statements, A[1] := f(B[1]), ..., A[50] := f(B[50]), in a loop

E: for I := 1 step 1 until 50 do
   A[I] := f(B[I])

may be executed simultaneously to reduce the computation time to one fiftieth of the original. A technique available now which detects inter-statement parallelism inside a loop requires a loop to be first replaced with (expanded to) a sequence of statements, e.g. E in the above example must be replaced with the sequence of fifty statements, A[1] := f(B[1]), ..., A[50] := f(B[50]), so that the dependence relation detection algorithm can be applied [10]. Obviously this approach obscures an advantage of the introduction of loops into a program, because essentially all loops are required to be removed from a program and replaced with straight-line programs so that the dependence relation detection algorithm can be applied to them.

The techniques described above find parallelism inside and between statements as they are presented. If the size of a machine (i.e. the number of PE's) is unlimited, however, then it becomes necessary to exploit more parallelism from a program than the above approaches provide. One obvious strategy is to write a completely new program using e.g. parallel numerical methods [52], [38]. The other approach, which we will pursue here, is to transform a given program to "squeeze" more parallelism from it. While the first approach requires programmers (or users) to reanalyze problems and reprogram, the second approach tries to accept existing sequential programs written in e.g. ALGOL and execute them in parallel. First we study parallel computation of an arithmetic expression more carefully along this line.

For the sake of argument let us assume that an arithmetic expression consists of additions, multiplications and possibly parentheses. Then the associative, the commutative and the distributive laws hold. The first and second laws have already been used to exploit more parallelism from an arithmetic expression. For example the associative law allows one to compute the arithmetic expression a + b + c + d as ((a+b) + (c+d)) in two steps rather than as (((a+b)+c)+d), which requires three steps.





[Figure 1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d)]



Also it has been recognized that the commutative law together with the associative law gives a lower height tree. For example ((a + b × c) + d) requires three steps while (b × c + (a + d)) requires two [39].

[Figure 1.3. Trees for a + b × c + d and b × c + a + d]

Now we turn our interest to the third law, i.e. the distributive law, and see if it can help speed up computation. As we can readily see there are cases when distribution helps. For example a(bcd + e) requires four steps while its equivalent abcd + ae, which is obtained by distributing a over bcd + e, can be computed in three steps.

[Figure 1.4. Trees for a(bcd + e) and abcd + ae]

However, distribution does not always speed up computation. For example the undistributed form ab(c+d) can be computed in fewer steps than the distributed form abc + abd. Hence nondiscriminative distribution is not the solution to the problem. Chapter 3 of this thesis studies this situation and gives an algorithm which we call the distribution algorithm. Given an arithmetic expression A, the distribution algorithm derives an equivalent arithmetic expression A' by distributing multiplications over additions properly so that the height of A' (we write h[A'] for this) is minimized. The algorithm works from the innermost parenthesis level to the outermost parenthesis level of an arithmetic expression and requires only one scan through the entire arithmetic expression. Chapter 3 concludes by giving a measure of the height of the minimum height tree for A as well as A' as a function of fundamental values such as the number of single variable occurrences in A.

The idea is extended to handle a sequence of assignment statements in Chapter 4. The distribution algorithm is applied to the arithmetic expression which is obtained by back substituting one statement into another. Suppose we have a sequence of n assignment statements A_1, A_2, ..., A_n, and we get the assignment statement A from this sequence by back substitution. If the sequence is computed sequentially, i.e. one statement after another, but each statement is computed in parallel, then it will take h[A_1] + h[A_2] + ... + h[A_n] steps to compute the sequence (where h[A_i] is the height of the minimum height tree for A_i). Instead we may compute the back substituted statement A in parallel, which requires h[A] steps. Obviously h[A_1] + ... + h[A_n] ≥ h[A] holds. Chapter 4 discusses cases when the strict inequality in the above relation holds. The cases include iteration formulas such as x_{i+1} := a × x_i + b.

Next we study inter-statement parallelism in terms of program loops. Chapter 6 first establishes a new algorithm which detects inter-statement parallelism in a loop. The algorithm is such that it only examines index expressions and the way index values vary in a loop to detect parallel computability. For example the algorithm checks the index expressions I and I + 1 as well as the clause "I := 1 step 1 until 20" in the loop

for I := 1 step 1 until 20 do
   A[I] := A[I+1] + B

and detects that all twenty statements, A[1] := A[2] + B, ..., A[20] := A[21] + B, may be computed simultaneously. Thus it is not necessary to expand a loop into a sequence of statements, as was required before, to check inter-statement parallelism. In general, the amount of work (i.e. the time) required by the algorithm is proportional to the number of index expression occurrences in statements in a loop.

Having established the algorithm, Chapter 6 further introduces two techniques which help to exploit more inter-statement parallelism in loops. These are the introduction of temporary locations and the distribution of a loop. The second technique resembles the idea introduced in Chapter 3, i.e. reduction of tree height for an arithmetic expression by distribution. Let us write

I, J, K(S1, S2, S3)

for an ALGOL-like program

for I := i_1, i_2, ..., i_m do
   for J := j_1, j_2, ..., j_n do
      for K := k_1, k_2, ..., k_p do
         begin S1; S2; S3 end.

Furthermore, by e.g. [I, J], K(S1, S2, S3) we understand a loop*

for (I, J) := (i_1, j_1), (i_1, j_2), ..., (i_1, j_n), (i_2, j_1), ..., (i_m, j_n) do
   for K := k_1, k_2, ..., k_p do
      begin S1; S2; S3 end.
Then as in the case of arithmetic expressions we may establish the following:

(a) Association: Introduction of brackets, e.g. I, [J,K](S1, S2, S3).

(b) Commutation: Change of the order of I, J, K, e.g. I, K, J(S1, S2, S3).

(c) Distribution: Distribution of I, J, K over S1, S2, S3, e.g. I,J,K(S1), I,J,K(S2,S3).

Then while the associative law always holds, e.g. I,J,K(S) = [I,J],K(S), the commutative and the distributive laws do not necessarily hold for all loops, e.g. I,J(S) ≠ J,I(S) if I,J(S) represents the loop

for I := 1, 2, 3 do
   for J := 1, 2, 3 do
      A[I,J] := A[I+1,J-1].



* This is equivalent to a TRANQUIL expression [2]:

for (I,J) seq ((i_1,j_1), (i_1,j_2), ..., (i_m,j_n)) do.



In short, Chapter 6 shows that commutation indicates the possibility of computing a loop in parallel as it is, and distribution indicates the possibility of introducing more parallelism into a program. For example if I,J,K(S) = K,I,J(S), then S can be computed simultaneously for all values of K while I and J vary sequentially. Next suppose a loop I(S1,S2) cannot be computed in parallel for all values of I. Then in a certain case it is possible to distribute and obtain two loops I(S1), I(S2) which are equivalent to the original loop I(S1,S2), and execute each of the two loops in parallel for all values of I separately. Chapter 6 gives an algorithm to distribute to attain this end.
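The failure of commutation in the example above can be checked by brute force; the sketch below (illustrative only, not a technique from the thesis) executes S: A[I,J] := A[I+1,J-1] under both index orders and compares the resulting arrays.

```python
from itertools import product

def run(order, n=3):
    """Execute A[I,J] := A[I+1,J-1] over I,J in 1..n in the given index order."""
    A = {(i, j): 10 * i + j for i in range(0, n + 2) for j in range(0, n + 2)}
    for outer, inner in product(range(1, n + 1), repeat=2):
        i, j = (outer, inner) if order == "IJ" else (inner, outer)
        A[(i, j)] = A[(i + 1, j - 1)]
    return A

print(run("IJ") == run("JI"))   # False: I,J(S) != J,I(S), so the loop is not commutable
```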

The thesis, thus, introduces new techniques which transform a given program to expose hidden parallelism. All results in this thesis are also readily applicable to another type of machine, i.e. machines with a pipe-line arithmetic unit such as the CDC STAR [18] (we regard this type of machine as a special type of parallel machine and call them serial array machines). Each stage of a pipe-line unit may be regarded as an independent PE in the sense that an operation being processed in one stage of a pipe-line unit must not depend on an operation being processed in a different stage. Hence exploiting parallelism results in keeping many stages busy at once.

Two more chapters are included in this thesis to make it complete. Chapter 2 studies parallel computation of special cases of arithmetic expressions, e.g. powers and polynomials, in detail to give a measure of the power of a parallel processing machine.

As was mentioned before, unless specially mentioned, it will be assumed that there are a sufficient number of PE's available to perform the desired task. In reality, however, that may not be the case and nontrivial scheduling problems may arise. To give some insight into this problem, Chapter 7 discusses a solution to the two-processor, equally weighted job scheduling problem.

We conclude this chapter by defining the following symbols:

⌈x⌉ ... the smallest integer not smaller than x,
⌊x⌋ ... the largest integer not larger than x, and
⟨x⟩ ... the smallest power of 2 not smaller than x.

Also, unless specified, the base of logarithms is assumed to be 2, e.g. log n is log_2 n.






2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS

2.1 Introduction

In this chapter, we study the parallel computation of summations, powers and polynomials. We first assume that m processors (PE) are available. Then the parallel computation times for the summation (Σ_{i=1}^{n} a_i) and the power (x^n) evaluation are given as functions of m and n. The minimum time to evaluate Σ_{i=1}^{n} a_i or x^n, as well as the minimum number of PE's required to attain it, is also derived.

Polynomial computation is first studied assuming the availability of an arbitrary number of PE's. The lower bound on the computation time for a polynomial of degree n (p_n(x)) is presented. A scheme which computes p_n(x) in less time than any known scheme is obtained. Because of its simplicity in scheduling, the k-th order Horner's rule is studied further in detail. It is shown that for this algorithm the availability of more PE's sometimes increases the computation time.

Table 2.1 summarizes a part of the results of this chapter.

Before we go further a few comments are in order. The base of logarithms in this chapter is 2, e.g. log n is actually log_2 n. The following lemma will be frequently referred to in the text.



Computation                    Number of steps on P(m)
-----------------------------------------------------------------------------------
Summation                      n - 1                                    (m = 1)
  Σ_{i=1}^{n} a_i              ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋m)⌉        (2 ≤ m ≤ ⌊n/2⌋)
                               ⌈log n⌉                                  (m ≥ ⌊n/2⌋)

Power x^n                      ⌊log n⌋ + N - 1 (near minimum)           (m = 1)
                               ⌈log n⌉                                  (m ≥ 2)

All powers x^i (1 ≤ i ≤ n)     n - 1                                    (m = 1)
                               ⌊log m⌋ + 1 + ⌈(n - 2^{⌊log m⌋+1})/m⌉    (2 ≤ m < max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}))
                               ⌈log n⌉                                  (otherwise)

Polynomial p_n(x)              2n                                       (m = 1)
  (m-th order Horner's rule)   2⌈log m⌉ + 2⌊n/m⌋ + 1                    (2 ≤ m ≤ n)
                               ⌈log n⌉ + ⌈log(n+1)⌉ + 1                 (m ≥ n + 1)

Table 2.1. The Parallel Computation Time for Summation, Power and Polynomial

Lemma 1:

(1) ⌈log a⌉ - ⌈log b⌉ ≤ ⌈log a - log b⌉,

where a and b are nonzero positive integers.

(2) ⌈a + b⌉ = ⌈a⌉ + b,
    ⌊a + b⌋ = ⌊a⌋ + b,
    ⌈b - a⌉ = b - ⌈a⌉ + 1, and
    ⌊b - a⌋ = b - ⌊a⌋ - 1,

where b is a positive integer and a is a positive real number (not an integer).

(3) a + 1 > ⌈a⌉ ≥ a and b - 1 < ⌊b⌋ ≤ b,

where a and b are nonzero positive real numbers.

Proof:

(1) Let a = 2^h + k and b = 2^f + g where 0 ≤ k ≤ 2^h - 1 and 0 ≤ g ≤ 2^f - 1. Now the proof is divided into four cases, i.e. (i) k, g > 0, (ii) k = g = 0, (iii) k = 0, g > 0, and (iv) k > 0, g = 0.

(i): k, g > 0. Then ⌈log a⌉ - ⌈log b⌉ = h - f. Also let log a = h + x and log b = f + y where 0 < x, y < 1. Then ⌈log a - log b⌉ ≥ h - f. Thus ⌈log a⌉ - ⌈log b⌉ ≤ ⌈log a - log b⌉.

The other three cases may be proved similarly and the details are omitted.

(2)(3) Proofs for (2) and (3) follow from the definitions.

(Q.E.D.)






2.2 Summation of n Numbers 



Theorem 1:

The minimum number of steps, h_a(m,n), to add n numbers on P(m) is

h_a(m,n) = (1) n - 1                                 (m = 1)
           (2) ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋m)⌉     (⌊n/2⌋ ≥ m ≥ 2)
           (3) ⌈log n⌉                               (m ≥ ⌊n/2⌋)

Proof:

(1) is self-evident. (3) uses the so-called log sum method [22] or the tree method (see Theorem 1 of Chapter 3). It is clear that ⌈log n⌉ steps suffice and also that n numbers cannot be added in fewer steps.

Now we prove (2). First each PE adds ⌊n/m⌋ numbers independently. This takes ⌊n/m⌋ - 1 steps and produces m partial sums. Then there will be m + (n - ⌊n/m⌋·m) numbers left. Clearly m + n - ⌊n/m⌋·m ≤ 2m. Those numbers are then added by the log sum method, which takes ⌈log(m + n - ⌊n/m⌋m)⌉ steps.

(Q.E.D.)
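The formulas of Theorem 1 are easy to evaluate directly; the following sketch (mine, not from the thesis) computes h_a(m,n) and checks the n = 17 example discussed below.

```python
from math import ceil, log2

def h_a(m, n):
    """Minimum steps to add n numbers on P(m), per Theorem 1."""
    if m == 1:
        return n - 1
    if m >= n // 2:                       # enough PEs: pure log sum method
        return ceil(log2(n))
    serial = n // m - 1                   # each PE adds floor(n/m) numbers
    leftover = m + n - (n // m) * m       # partial sums plus unassigned numbers
    return serial + ceil(log2(leftover))  # at most 2m numbers, finished by log sum

assert h_a(8, 17) == 5 and h_a(6, 17) == 5   # both reach the minimum ceil(log 17)
assert h_a(1, 17) == 16
```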
Now we show that for a fixed n, ⌊n/2⌋ ≥ m > m' implies that h_a(m,n) ≤ h_a(m',n). To prove this it is enough to show that h_a(m+1,n) ≤ h_a(m,n) where m + 1 ≤ ⌊n/2⌋. There are two cases:

(1) ⌊n/m⌋ = ⌊n/(m+1)⌋ = k ≥ 1.

Let n = km + p (p < m). Then

m + n - ⌊n/m⌋m = m + p

and

(m + 1) + n - ⌊n/(m+1)⌋(m + 1) = m + p + 1 - k.

Hence we have

⌈log(m + n - ⌊n/m⌋m)⌉ ≥ ⌈log((m+1) + n - ⌊n/(m+1)⌋(m+1))⌉,

or h_a(m,n) ≥ h_a(m+1,n).

(2) ⌊n/m⌋ > ⌊n/(m+1)⌋.

Let ⌊n/m⌋ = k and ⌊n/(m+1)⌋ = k - g, where k, g ≥ 1. Then

n = km + p  (p < m)  (1)

and

n = (k - g)(m + 1) + p'  (p' < m + 1).  (2)

Suppose h_a(m,n) < h_a(m+1,n), i.e.

⌊n/m⌋ - 1 + ⌈log(m + p)⌉ < ⌊n/(m+1)⌋ - 1 + ⌈log((m+1) + n - ⌊n/(m+1)⌋(m+1))⌉.  (3)

By substituting Eq. (1) and (2) into Eq. (3) and rearranging, we get

g < ⌈log(m + 1 + p')⌉ - ⌈log(m + p)⌉.  (4)

By Lemma 1(1), Eq. (4) implies

g < ⌈log((m + 1 + p')/(m + p))⌉.  (5)

Eq. (5) holds only if

2^g < (m + 1 + p')/(m + p)  (6)

holds. Since

2 + 1/m > (m + 1 + p')/(m + p),  (7)

Eq. (6) holds only if g = 1 (remember that g ≥ 1). By letting g = 1 in Eq. (2), we get

p' = km + p - (k - 1)(m + 1).

Then by substituting this and g = 1 into Eq. (6) and rearranging properly, we get

p < 2 - k.

This only holds if k = 1 and p = 0, which implies that n = m (see Eq. (1)) and contradicts our assumption that m + 1 ≤ ⌊n/2⌋. Hence h_a(m,n) < h_a(m+1,n) never holds.

The above two cases (1) and (2) prove that h_a(m+1,n) ≤ h_a(m,n). Thus we have the following lemma.

Lemma 2:

h_a(m',n) ≥ h_a(m,n) if m' ≤ m ≤ ⌊n/2⌋.

The above lemma may seem insignificant. In Section 2.4.2, however, it will be shown that for a certain algorithm to compute an n-th degree polynomial on P(m), the computation time, h_P(m,n), is not a nonincreasing function of m, i.e., m > m' does not necessarily imply that h_P(m,n) ≤ h_P(m',n). As will be described later, the algorithm is such that all PE's are forced to participate in the computation. It is true that if we are allowed to "turn off" some PE's, then we always get h_P(m,n) ≤ h_P(m',n) if m > m'. A question is then how many PE's are to be turned off. These problems will be studied in Section 2.4.2.

It should be noted that the minimum number of PE's required to achieve the minimum computation time is not necessarily ⌊n/2⌋. For example let n = 17. Then ⌊17/2⌋ = 8 and h_a,min(17) = h_a(8,17) = 5. But also h_a(6,17) = 5.

As we know, the minimum computation time to add n numbers is ⌈log n⌉. Now we present the minimum number of PE's, M, which achieves this bound.





Theorem 2:

For a fixed n, let

M = μ_m(⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋m)⌉ = ⌈log n⌉).*

Then

M = (1) 1 + ⌊(n-1)/2⌋ - 2^{k-2}    (2^k < n ≤ 2^k + 2^{k-1})
    (2) n - 2^k                    (2^k + 2^{k-1} < n ≤ 2^{k+1})

where k = ⌊log(n-1)⌋.

* μ_m(condition) denotes the minimum value of m which satisfies the condition.

Proof:

For k ≤ 3, direct examination shows that the theorem holds (see Table 2.2). Therefore we assume that k > 3.

The proof is divided into two parts, i.e. (1) 2^k < n ≤ 2^k + 2^{k-1} and (2) 2^k + 2^{k-1} < n ≤ 2^{k+1}. In either case we first prove that h_a(M,n) = ⌈log n⌉. Then for m < M we show that h_a(m,n) > ⌈log n⌉.

It should be clear that in both cases 2^k < n ≤ 2^{k+1} and ⌈log n⌉ = k + 1.

(1) 2^k < n ≤ 2^k + 2^{k-1}.

Let n = 2^k + p  (1 ≤ p ≤ 2^{k-1}).  (8)

Now we first show that h_a(M,n) = ⌈log n⌉ where

M = 1 + ⌊(n-1)/2⌋ - 2^{k-2}.  (9)

        m
 n      1   2   3   4   5      k   M   Case
 2      1   1   1   1   1      0   1   (2)
 3      2   2   2   2   2      1   1   (1)
 4      3   2   2   2   2      1   2   (2)
 5      4   3   3   3   3      2   2   (1)
 6      5   3   3   3   3      2   2   (1)
 7      6   4   3   3   3      2   3   (2)
 8      7   4   4   3   3      2   4   (2)
 9      8   5   4   4   4      3   3   (1)
10      9   5   4   4   4      3   3   (1)

(1): 2^k < n ≤ 2^k + 2^{k-1}
(2): 2^k + 2^{k-1} < n ≤ 2^{k+1}
where k = ⌊log(n-1)⌋.
n: the number of numbers to be added
m: the number of PE's
M: the minimum number of PE's

Table 2.2. The Number of Steps Required to Compute Σ_{i=1}^{n} a_i on P(m), h_a(m,n), for n ≤ 10.



Now

h_a(M,n) = ⌊n/M⌋ - 1 + ⌈log(M + n - ⌊n/M⌋M)⌉.  (10)

We then show that ⌊n/M⌋ = 3 for all n. Then we get M + n - ⌊n/M⌋M = n - 2M. The value of n - 2M is evaluated in three ways, i.e. (i) p = 2g, (ii) p = 2g + 1 (g ≥ 1) or (iii) p = 1. In any case n - 2M ≤ 2^{k-1} holds. Thus we can prove that

⌈log(M + n - ⌊n/M⌋M)⌉ ≤ k - 1

and

h_a(M,n) ≤ k + 1 = ⌈log n⌉.

Then we prove that for all m < M, h_a(m,n) > ⌈log n⌉. This is proved by showing that h_a(M - 1, n) > ⌈log n⌉. Then by Lemma 2, h_a(m,n) > ⌈log n⌉ for all m < M.

Now let us show the details. First we prove that h_a(M,n) = ⌈log n⌉. From Eq. (8) and (9), we get

M = 2^{k-2} + 1 + ⌊(p-1)/2⌋,  (11)

and by Lemma 1(2), we get

⌊n/M⌋ = 3 - ⌊P⌋  (12)

where

P = (4⌊(p-1)/2⌋ + 4 - p) / (2^{k-2} + 1 + ⌊(p-1)/2⌋).

By Lemma 1(3), we get

P ≤ P' = (2p - 4)/(2^{k-1} + p - 1).

Now we show that for all p (1 ≤ p ≤ 2^{k-1}), P' < 1. Since

dP'/dp = (2^k + 2)/(2^{k-1} + p - 1)² > 0,

we have

max P' = (2^k - 4)/(2^k - 1) < 1  for 1 ≤ p ≤ 2^{k-1}  (k ≥ 4).

Thus P < 1 and by Eq. (12) we have ⌊n/M⌋ = 3. Substituting this into Eq. (10), we get h_a(M,n) = 2 + ⌈log(n - 2M)⌉. Now, subtracting two times Eq. (11) from Eq. (8), we have

n - 2M = 2^{k-1} + p - 2 - 2⌊(p-1)/2⌋.  (13)

Eq. (13) is evaluated in three different ways according to the value of p, i.e. (i) p = 2g, (ii) p = 2g + 1 (g ≥ 1) or (iii) p = 1 (in every case g is an integer).

(i) p = 2g (from Eq. (8), g ≥ 1):  n - 2M = 2^{k-1}.

(ii) p = 2g + 1 (g ≥ 1):  n - 2M = 2^{k-1} + 2g + 1 - 2 - 2g < 2^{k-1}.

(iii) p = 1:  n - 2M = 2^{k-1} + 1 - 2 < 2^{k-1}.

Hence in any case n - 2M ≤ 2^{k-1}, or ⌈log(n - 2M)⌉ ≤ k - 1. Thus

h_a(M,n) ≤ k + 1 = ⌈log n⌉.

This proves the first part of (1). Next we prove the latter part, i.e. for m < M, h_a(m,n) > h_a(M,n).

First we show that h_a(M - 1, n) > h_a(M,n). From Eq. (8) and (9), and using Lemma 1(2), we get

⌊n/(M-1)⌋ = 3 - ⌊Q⌋  (14)

where

Q = (4⌊(p-1)/2⌋ - p) / (2^{k-2} + ⌊(p-1)/2⌋).

As we showed for P, we can also prove that for all p (1 ≤ p ≤ 2^{k-1}), Q < 1. From Eq. (14), we have ⌊n/(M-1)⌋ = 3. Then h_a(M - 1, n) = 2 + ⌈log(n - 2(M - 1))⌉. From Eq. (8) and (11), we get

n - 2(M - 1) = 2^{k-1} + p - 2⌊(p-1)/2⌋ ≥ 2^{k-1} + 1,

or ⌈log(n - 2(M - 1))⌉ ≥ k. Hence

⌈log n⌉ = k + 1 < k + 2 ≤ h_a(M - 1, n).

Thus for all m < M, h_a(m,n) > ⌈log n⌉ by Lemma 2, and this proves (1).

(2) 2^k + 2^{k-1} < n ≤ 2^{k+1}.

Let n = 2^k + 2^{k-1} + p  (1 ≤ p ≤ 2^{k-1}).  (15)

We first sketch the proof. We first show that h_a(M,n) = ⌈log n⌉. To show this we prove that ⌊n/M⌋ = 2 for all n (2^k + 2^{k-1} < n ≤ 2^{k+1}). Then, using this, we get M + n - ⌊n/M⌋M = n - M. We further show that ⌈log(n - M)⌉ = k. Thus we get h_a(M,n) = k + 1 = ⌈log n⌉. Then we prove that h_a(M - 1, n) > h_a(M,n), which together with Lemma 2 completes the proof of (2).



[Figure 2.1. The Minimum Number, M, of PE's Required to Add Numbers in the Minimum Time]



The details are similar to (1) and will not be given here.

(Q.E.D.)

Theorems 1 and 2 and Lemma 2 also apply to the case of multiplication of n numbers; to avoid duplication, the corresponding lemmas for multiplication shall not be presented.

Next let us consider the power computation, e.g. x^n (n ≥ 2).

2.3 Computation of Powers

Lemma 3: [23]

Let N be the number of ones in the binary representation of n. Then the near minimum number of steps to compute x^n on P(1), h_e(1,n), is

h_e(1,n) = ⌊log n⌋ + N - 1.

Up to now, there is no result about the minimum computation time to evaluate x^n [23]. Thus we shall settle for an approximation. For example let n = 15. Then h_e(1,15) = ⌊log 15⌋ + 4 - 1 = 6. On the other hand x^15 can be evaluated in fewer steps, e.g.

(x)(x) = x², (x²)(x) = x³, (x³)(x²) = x⁵, (x⁵)² = x^10, (x^10)(x⁵) = x^15.

This takes only 5 steps. For n ≤ 70, this lemma gives the correct values for more than 70% of the cases.
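The count of Lemma 3 is the cost of the usual binary (square-and-multiply) method; a quick check (mine, not from the thesis) against the n = 15 factor-chain example above:

```python
def h_e_serial(n):
    """Steps of the binary (square-and-multiply) method on one PE."""
    return n.bit_length() - 1 + bin(n).count("1") - 1   # floor(log n) + N - 1

print(h_e_serial(15))   # 6, although x^15 is reachable in 5 multiplications
```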

While we cannot give the definite answer for the sequential case, we can prove the following.

Theorem 3:

The minimum number of steps to compute x^n on P(m) (m ≥ 2), h_e(m,n), is

h_e(m,n) = ⌈log n⌉.

Proofs of Lemma 3 and Theorem 3:

Let α = ⌊log n⌋ and β = ⌈log(n+1)⌉ for convenience. Let l_j be the j-th most significant bit in the binary representation of n. If m = 1, then x^n is computed as follows. First let us write

X_j = (x^{2^j})^{l_{j+1}}.

We first compute all x^i (i = 2, 2², ..., 2^α) in α steps. Then

x^n = (X_0) × (X_1) × ... × (X_α)

and this computation takes N - 1 steps (note that if l_j = 0, then X_{j-1} = 1). Thus in total ⌊log n⌋ + N - 1 steps are needed.

If two PE's are used, then x^n is computed as follows. Again let us write

x^n = (X_0) × (X_1) × ... × (X_α).

Now this can be computed by the following two recursive equations:

t_1^{(1)} = x,
t_k^{(1)} = t_{k-1}^{(1)} × t_{k-1}^{(1)},
t_k^{(2)} = t_{k-1}^{(2)} × X_{k-1},

and x^n is the final t^{(2)}. Two PE's are required for the simultaneous computation of t_k^{(1)} and t_k^{(2)}.

That the above process for P(2) is optimum is clear, because x^{2^{⌊log n⌋}} cannot be computed in less than ⌊log n⌋ steps and at least ⌊log n⌋ + 1 (= ⌈log n⌉) steps are required to compute x^n.

(Q.E.D.)

From the above discussion, we have the following corollary.

Corollary:

h_e(m,n) = h_e(2,n) for all m ≥ 2.
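The two-PE scheme can be sketched as follows (my reconstruction of the schedule, not verbatim from the proof): one PE produces the squares x, x², x⁴, ..., while the second multiplies each required square into a running product one step behind, so x^n is complete after ⌈log_2 n⌉ steps.

```python
def parallel_power(x, n):
    """Evaluate x**n on two PEs in ceil(log2 n) steps (a sketch of Theorem 3).

    PE1 performs one squaring per step, producing x**(2**j) at step j;
    PE2 multiplies each completed power whose bit of n is set into the
    partial product, trailing PE1 by one step.
    """
    bits = bin(n)[2:][::-1]           # bits[j] is l_j, least significant first
    powers = [x]                      # powers[j] = x**(2**j)
    for _ in range(len(bits) - 1):    # PE1: one squaring per step
        powers.append(powers[-1] ** 2)
    acc = 1
    for j, b in enumerate(bits):      # PE2: one multiplication per set bit
        if b == "1":
            acc *= powers[j]
    return acc

assert parallel_power(3, 13) == 3 ** 13
```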

Now let us study simultaneous computation of all x^i (i = 1, 2, ..., n).

Theorem 4:

The minimum number of steps, h_w(m,n), required for simultaneous evaluation of all x^i (i = 1, 2, ..., n) on P(m) is

h_w(m,n) = (1) n - 1  (m = 1)
           (2) ⌊log m⌋ + 1 + ⌈(n - 2^{⌊log m⌋+1})/m⌉
               (max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}) > m ≥ 2)
           (3) ⌈log n⌉  (m ≥ max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}))



Proof:

(1) is obvious. (3) is illustrated in Figure 2.2. At the k-th step, the x^i (i = 2^{k-1} + 1, 2^{k-1} + 2, ..., 2^k) are computed using the results of earlier steps, e.g. x^{2^k} = x^{2^{k-1}} × x^{2^{k-1}}. The number of PE's required at this step is then 2^k - (2^{k-1} + 1) + 1 = 2^{k-1}.

[Figure 2.2. Computation of x^i (1): at step k (k < ⌈log n⌉) the powers x^{2^{k-1}+1}, ..., x^{2^k} are formed, requiring 2^{k-1} PE's; at the last step the powers x^{2^{⌈log n⌉-1}+1}, ..., x^n are formed, requiring n - 2^{⌈log n⌉-1} PE's]

The maximum number of PE's required is the larger of n - 2^{⌈log n⌉-1} (the number of PE's required at the last step) and 2^{⌈log n⌉-2} (the number of PE's required at the (⌈log n⌉ - 1)-th step). This proves (3). Clearly this procedure is optimum in the sense that it gives the minimum computation time.



Next suppose that the number of PE's available, m, is less than max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}). Then first all x^i (1 ≤ i ≤ 2^{⌊log m⌋+1}) are computed in ⌊log m⌋ + 1 steps in the same manner as the above procedure.

[Figure 2.3. Computation of x^i (2): the first ⌊log m⌋ + 1 steps build x^1, ..., x^{2^{⌊log m⌋+1}}; the remaining powers are then computed m at a time]

Now there are n - 2^{⌊log m⌋+1} of the x^i left (2^{⌊log m⌋+1} < i ≤ n) to be computed. This takes ⌈(n - 2^{⌊log m⌋+1})/m⌉ steps on P(m). Clearly at each step, all necessary data to perform the operations are available. To show this, let us take two successive steps. Assume that the first step computes x^a ~ x^b where b - a + 1 = m. Then the second step computes x^{b+1} ~ x^{b+m}. Since b + m = 2b + 1 - a ≤ 2b (a ≥ 1), all inputs required at the second step are available from the first step. Thus in total ⌊log m⌋ + 1 + ⌈(n - 2^{⌊log m⌋+1})/m⌉ steps are required. This proves (2).

(Q.E.D.)

Clearly for fixed n, m > m' implies that h_w(m,n) ≤ h_w(m',n). Thus we have:

Lemma 4:

For fixed n, h_w(m,n) is a non-increasing function of m.

We again call the reader's attention to the fact that the number of PE's required to compute all x^i in the minimum number of steps, i.e. ⌈log n⌉, is not necessarily max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}). For example, let n = 18. Then max(18 - 2^4, 2^3) = 8, and ⌈log 18⌉ = 5. Yet P(5) achieves the same result, i.e. h_w(5,18) = 5.

Lemma 5:

The minimum number of PE's, M, necessary to compute all x^i (1 ≤ i ≤ n) simultaneously in the shortest time is

M = (1) n - 2^{⌈log n⌉-1}           (n - 2^{⌈log n⌉-1} ≥ 2^{⌈log n⌉-2})
    (2) ⌈(n - 2^{⌈log n⌉-2})/2⌉     (otherwise)

Proof:

Let α = ⌈log n⌉.

(1) n - 2^{α-1} ≥ 2^{α-2}.

Suppose that there are only m PE's where m < n - 2^{α-1}.

[Figure 2.4. Computation of x^i (3): with only m PE's, the powers x^{2^{α-1}+m+1}, ..., x^n are left out at the α-th step]

Then the x^i (2^{α-1} + m < i ≤ n) cannot be computed at the α-th step. Also, none of them can be computed at an earlier step, because their inputs are not ready. This proves (1).

(2) n - 2^{α-1} < 2^{α-2}.

[Figure 2.5. Computation of x^i (4): the powers x^{2^{α-2}+m+1}, ..., x^{2^{α-1}} are delayed from the (α-1)-th step to the α-th step]

Then we delay the computation of all x^i (2^{α-2} + m + 1 ≤ i ≤ 2^{α-1}) which are originally scheduled to be computed at the (α-1)-th step till the next step, i.e. the α-th step (the last step). At the α-th step we compute all remaining x^i (2^{α-2} + m + 1 ≤ i ≤ n). The value of m has been chosen in such a way that the number of remaining powers does not exceed m, i.e.

m = ⌈(n - (2^{α-2} + 1) + 1)/2⌉ = ⌈(n - 2^{α-2})/2⌉.

Also, since

2(2^{α-2} + m) ≥ 2(2^{α-2} + n/2 - 2^{α-3}) = n + 2^{α-2} > n,

all inputs required at the α-th step are ready at the (α-1)-th step or earlier steps.

(Q.E.D.)

2.4 Computation of Polynomials

In this section, we study polynomial computation. First we assume that there are arbitrarily many PE's. Then four schemes are studied and compared. Two of them are known as Estrin's method [16] and the k-th order Horner's rule [15]. Two new methods are also introduced. They are called the tree method (see Chapter 3) and the folding method. It is shown that if there are arbitrarily many PE's, then the folding method gives a faster computation time than any known method.

Then we study the case where only a limited number of PE's are available. Because of the simplicity of scheduling, the k-th order Horner's rule is studied in detail. It is shown that on P(m) the m-th order Horner's rule does not necessarily guarantee the fastest computation, i.e., there is a case where the m'-th order Horner's rule (m' < m) gives a better result. Thus availability of more computational resources does not necessarily "speed up" the computation for a certain class of feasible parallel computation algorithms.

2.4.1 Computation of a Polynomial on an Arbitrary Size Machine

Definition:

We write p_n(x) for a polynomial of degree n:

p_n(x) = a_n x^n + a_{n-1} x^{n-1} + ... + a_0.



2.4.1.1 k-th Order Horner's Rule [15]

The details will be presented in Section 2.4.2. Theorem 5 shows that the minimum time required to compute p_n(x) by this method is

h_P,min = ⌈log n⌉ + ⌈log(n+1)⌉ + 1.

2.4.1.2 Estrin's Method [15][16]

We first compute

C_i^0 = a_i + x·a_{i+1},  i = 0, 2, ..., 2⌊n/2⌋.

Then successively compute

C_i^1 = C_i^0 + x²·C_{i+2}^0,  i = 0, 4, ..., 4⌊n/4⌋,
C_i^2 = C_i^1 + x⁴·C_{i+4}^1,  i = 0, 8, ..., 8⌊n/8⌋,
...
C_i^m = C_i^{m-1} + x^{2^m}·C_{i+2^m}^{m-1},  i = 0, 2^{m+1}, ..., 2^{m+1}⌊n/2^{m+1}⌋,

where m = ⌊log n⌋, and moreover

p_n(x) = C_0^m.

The procedure may be illustrated (for n = 15) by

p_n(x) = a_0 + a_1 x + x²(a_2 + a_3 x)
       + x⁴(a_4 + a_5 x + x²(a_6 + a_7 x))
       + x⁸(a_8 + a_9 x + x²(a_10 + a_11 x) + x⁴(a_12 + a_13 x + x²(a_14 + a_15 x))).

Now notice that for each j all C_i^j may be computed simultaneously.

Assuming the availability of an arbitrary number of PE's, it is easy to show that the minimum number of steps to compute p_n(x) by this method is

h_E,min = 2⌈log(n+1)⌉.
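Run directly, the recurrence gives a simple evaluator; the sketch below (mine, not from the thesis) pads with zero coefficients where a block has no partner and squares the attached power of x at each level. All combinations within one level are independent, which is where the parallelism lies.

```python
def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) by Estrin's method.

    Level 0 pairs neighbours: C0_i = a_i + x*a_{i+1}; each later level m
    combines blocks with the power x**(2**m), squared between levels.
    """
    c = list(coeffs)
    xp = x                        # x**(2**m) at level m
    while len(c) > 1:
        if len(c) % 2:            # a block without a partner stands alone
            c.append(0)
        c = [c[i] + xp * c[i + 1] for i in range(0, len(c), 2)]
        xp = xp * xp
    return c[0]

coeffs = [7, -2, 0, 5, 1]         # 7 - 2x + 5x^3 + x^4
assert estrin(coeffs, 3) == sum(a * 3**i for i, a in enumerate(coeffs))
```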



2.4.1.3 Tree Method

In this method a tree similar to the one for arithmetic expressions is built for a polynomial p_n(x). For example, p_3(x) is computed as a tree whose leaves are the products a_i x^i, summed pairwise [tree omitted in scan].

Computation by this method consists of two stages: computation of b_i = a_i x^i (0 ≤ i ≤ n) and computation of Σ_{i=0}^{n} b_i.

(1) Computation of b_i = a_i x^i (0 ≤ i ≤ n).

This requires ⌈log(i+1)⌉ steps by Theorem 3 (see Figure 2.6).

(2) Computation of Σ_{i=0}^{n} b_i.

As soon as the b_i become available they are added in the log sum way. Suppose b_i becomes available at the k-th step. Having a variable at the k-th step is equivalent to having 2^{k-1} variables originally at the first step (cf. the effective length in Chapter 3). Thus, for example, two variables at the third step are reduced to 8 variables at the first step. Repeating this procedure we get

n' = Σ_{i=1}^{α-1} 2^{i-1}·2^i + 2^α(n - 2^{α-1} + 1)

variables on the first level, where α = ⌈log(n+1)⌉. To add these n' numbers by the log sum method takes ⌈log n'⌉ steps, which is therefore the number of steps required by the tree method.

Step               Terms Computed                                            No. of Terms Computed
1                  a_0                                                       1
2                  a_1 x                                                     1
3                  a_2 x², a_3 x³                                            2 (= 2^1)
4                  a_4 x⁴, a_5 x⁵, a_6 x⁶, a_7 x⁷                            4 (= 2^2)
...
⌈log(n+1)⌉         ...                                                       2^{⌈log(n+1)⌉-2}
⌈log(n+1)⌉ + 1     a_{2^{⌈log(n+1)⌉-1}} x^{2^{⌈log(n+1)⌉-1}}, ..., a_n x^n   n - 2^{⌈log(n+1)⌉-1} + 1

Figure 2.6. Computation of a_i x^i

2.4.1.4 Folding Method

This is a method which computes p_n(x) in shorter time than any known method.

Assume that p_{t-1}(x) can be computed in h - 1 steps, the p_i(x) (t ≤ i ≤ s - 1) are computed in h steps, and p_s(x) can be computed in h + 1 steps. Then we show that all p_j(x) (s ≤ j ≤ s + t - 1) can be computed in h + 1 steps and, further, that p_{s+t}(x) can be computed in h + 2 steps.

Steps     Degree
h - 1     ~ t - 1
h         t ~ s - 1
h + 1     s ~ s + t - 1
h + 2     s + t ~



(1) First we show that p_{s+t-j}(x) (1 ≤ j ≤ t) can be computed in h + 1 steps.

[Figure 2.7. A Tree for p_{s+t-j}(x)]

We write p_{s+t-j}(x) as

p_{s+t-j}(x) = (a_{s+t-j} x^{t-j} + ... + a_s) x^s + a_{s-1} x^{s-1} + ... + a_0
             = p_{t-j}(x)·x^s + p_{s-1}(x).

Now we show that x^s can be computed in less than h steps. From Theorem 3 we know that x^s can be computed in ⌈log s⌉ steps. Suppose that the computation of x^s takes longer than h, i.e.

h < ⌈log s⌉.  (16)

Now, from the assumption, p_{s-1}(x) takes h steps to compute. Also (see Section 2.4.1.5)

h ≥ ⌈log(2(s-1) + 1)⌉ = ⌈log(2s - 1)⌉.  (17)

From Eq. (16) we have s ≥ 2^h + 1 and from Eq. (17) we have 2^h ≥ 2s - 1, or 2^{h-1} ≥ s. Thus we have s ≥ 2^h + 1 > 2^{h-1} ≥ s, which is a contradiction. Thus h ≥ ⌈log s⌉.

From the assumption, p_{t-j}(x) can be computed in h - 1 steps and p_{s-1}(x) can be computed in h steps. Hence p_{t-j}(x)·x^s takes h steps and p_{s+t-j}(x) can be computed in h + 1 steps.

(2) Next we show that p_{s+t}(x) can be computed in h + 2 steps.

[Figure 2.8. A Tree for p_{s+t}(x)]

We write

p_{s+t}(x) = (a_{s+t} x^{s-1} + ... + a_{t+1}) x^{t+1} + a_t x^t + ... + a_0
           = p_{s-1}(x)·x^{t+1} + p_t(x).

Then from the previous proof we know that x^{t+1} can be computed in less than h + 1 steps. Since p_{s-1}(x) and p_t(x) can each be computed in h steps, we can compute p_{s-1}(x)·x^{t+1} in at most h + 1 steps and p_{s+t}(x) in at most h + 2 steps.

(Q.E.D.)

It is easy to check that p_2(x) takes 3 steps, p_3(x) and p_4(x) can be computed in 4 steps, and p_5(x) can be computed in 5 steps. By induction we obtain the following table.



Minimum Steps     Degree of Polynomial
3                 2
4                 3 - 4
5                 5 - 7
6                 8 - 12
7                 13 - 20
8                 21 - 33
9                 34 - 54
10                55 - 88
11                89 - 143

Table 2.3. Computation of p_n(x) by Folding Method

For example, a polynomial whose degree lies between 34 and 54 takes 9 steps to be computed. Note that the first numbers in the right column form a Fibonacci sequence.
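The ranges in Table 2.3 follow from the construction in the proof: if d(h) denotes the largest degree computable in h steps, then taking s = d(h-1) + 1 and t = d(h-2) + 1 gives d(h) = d(h-1) + d(h-2) + 1, with d(3) = 2 and d(4) = 4. A sketch (mine) reproducing the table:

```python
def folding_table(max_steps):
    """Largest polynomial degree computable in h steps by the folding method."""
    d = {3: 2, 4: 4}                       # base cases checked in the text
    for h in range(5, max_steps + 1):
        d[h] = d[h - 1] + d[h - 2] + 1     # s + t - 1 with s = d(h-1)+1, t = d(h-2)+1
    return d

d = folding_table(11)
for h in range(4, 12):
    print(h, f"{d[h - 1] + 1}-{d[h]}")     # e.g. 9 -> 34-54, 11 -> 89-143
```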



2.4.1.5 Comparison of Four Methods

It has been proved that at least 2n operations are required to compute p_n(x). Proofs have appeared in several papers. We owe Ostrowski [34] and Motzkin their original works. Pan [35] summarized the results. An excellent review of the problem appears in [23]. Also Winograd [43] generalized the results of Ostrowski and Motzkin.

Now assume that h steps are required to compute p_n(x) in parallel. Then Theorem 5 gives h ≤ ⌈log n⌉ + ⌈log(n+1)⌉ + 1 ≤ 2⌈log(n+1)⌉ + 1. Also, since 2^k - 1 operations can be performed in a parallel computation tree of height k, we have 2^h - 1 ≥ 2n, or h ≥ ⌈log(2n+1)⌉. Thus

2⌈log(n+1)⌉ + 1 ≥ h ≥ ⌈log(2n+1)⌉.

In Figure 2.9, these upper and lower limits are plotted together with the results from the previous sections. It is clear that the folding method is the best in terms of computational speed. It is yet an open question whether there is a better method.

2.4.2 Polynomial Computation by the k-th Order Horner's Rule

Now let us study computation of a polynomial by the k-th order Horner's rule.

A polynomial is computed by the k-th order Horner's rule as shown by the following procedure. We use k PE's. First we compute all x^i (1 ≤ i ≤ k) simultaneously. Then we compute k polynomials p^{(i)}(x) on k PE's simultaneously, where

p^{(i)}(x) = a_i + x^k(a_{i+k} + x^k(a_{i+2k} + ...))  (0 ≤ i ≤ k - 1).

Then we get k partial results which are added to get p_n(x). Figure 2.10 illustrates this.

This scheduling may not be the best, yet it is easy to implement and is adaptable to any value of k.
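A sketch of the procedure (mine; serial Python standing in for the k parallel PE's): PE i applies ordinary Horner's rule in x^k to the coefficients a_i, a_{i+k}, a_{i+2k}, ..., multiplies by x^i, and the k partial results are summed.

```python
def horner_k(coeffs, x, k):
    """Evaluate sum(coeffs[i] * x**i) by the k-th order Horner's rule.

    Each of the k strides would run on its own PE; only the final
    summation of the k partial results couples them.
    """
    xk = x ** k                               # the powers x**i, i <= k, come first
    partial = []
    for i in range(k):                        # the work of PE i
        s = 0
        for a in reversed(coeffs[i::k]):      # a_i, a_{i+k}, ... as a poly in x**k
            s = s * xk + a
        partial.append(s * x ** i)
    return sum(partial)                       # combined by the log sum method

coeffs = [3, 1, 4, 1, 5, 9, 2, 6]
assert horner_k(coeffs, 2, 3) == sum(a * 2**i for i, a in enumerate(coeffs))
```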

Theorem 5:

The minimum number of steps, h_P(m,n), required to compute a degree n polynomial p_n(x) on P(m) by the m-th order Horner's rule is

h_P(m,n) = (1) 2n                            (m = 1)
           (2) 2⌈log m⌉ + 2⌊n/m⌋ + 1         (n + 1 > m ≥ 2)
           (3) ⌈log n⌉ + ⌈log(n+1)⌉ + 1      (m ≥ n + 1)


[Figure 2.9. Comparison of the Four Parallel Polynomial Computation Schemes: number of steps vs. degree of polynomial]



Proof:

A proof for (1) can be found in [43]. (2) and (3) are self-evident from the above discussion.

(Q.E.D.)

PE 0:    a_0 + x^k(a_k + x^k(a_2k + ...))
PE 1:    a_1 + x^k(a_{k+1} + ...)
PE 2:    a_2 + x^k(a_{k+2} + ...)
...
PE k-1:  a_{k-1} + x^k(a_{2k-1} + ...)

(the "Horner's part"; the k partial results are then summed)

Figure 2.10. k-th Order Horner's Rule

Lemma 6:

The minimum number of steps h_P,min to compute a polynomial p_n(x) of degree n by the k-th order Horner's rule (1 ≤ k ≤ n + 1) is

h_P,min = ⌈log n⌉ + ⌈log(n+1)⌉ + 1.

Proof:

It is enough to show that

2⌈log m⌉ + 2⌊n/m⌋ + 1 ≥ 2⌈log(n+1)⌉ + 1,

or

⌈log m⌉ + ⌊n/m⌋ ≥ ⌈log(n+1)⌉

for m ≤ n, because 2⌈log(n+1)⌉ ≥ ⌈log n⌉ + ⌈log(n+1)⌉ and if m ≥ n + 1, then by Theorem 5 h_P(n+1, n) = h_P,min.

Assume that n = 2^g + t (1 ≤ t ≤ 2^g) and m = 2^k + s (1 ≤ s ≤ 2^k). Then we have two cases, i.e. (1) g = k or (2) g > k. If g = k, then t > s since n > m.

(1) g = k and t > s.

Then ⌈log m⌉ = k + 1 = g + 1, and hence

⌈log m⌉ + ⌊n/m⌋ ≥ g + 2 ≥ ⌈log(n+1)⌉.

(2) g > k.

We have two further subcases, i.e. (i) 1 ≤ t < 2^g or (ii) t = 2^g.

(i) 1 ≤ t < 2^g.

Then ⌈log(n+1)⌉ = g + 1, ⌈log m⌉ = k + 1, and

⌊n/m⌋ = ⌊(2^g + t)/(2^k + s)⌋ ≥ 2^{g-(k+1)}.

Hence

⌈log m⌉ + ⌊n/m⌋ ≥ k + 1 + 2^{g-(k+1)}.

Now we show that for all k < g,

f(k) = k + 1 + 2^{g-(k+1)} ≥ g + 1.

Since for all k < g

df(k)/dk = 1 - (log_e 2)·2^{g-(k+1)} < 0,

min f(k) = g + 1, where 0 ≤ k < g. Hence f(k) ≥ g + 1, or

⌈log m⌉ + ⌊n/m⌋ ≥ g + 1 = ⌈log(n+1)⌉.

(ii) t = 2^g.

Similarly to the above, we can show that ⌈log m⌉ + ⌊n/m⌋ ≥ ⌈log(n+1)⌉. The details are omitted.

(Q.E.D.)

Unlike h_a(m,n) and h_w(m,n), h_P(m,n) is not necessarily a non-increasing function of m. (A few curves in Figure 2.11 illustrate this.) Therefore it becomes important to choose an appropriate m for a given n so as to compute in an optimum way.

Theorem 6:

Given an n-th degree polynomial, let

M = μ_m(h_P(m,n) = ⌈log n⌉ + ⌈log(n+1)⌉ + 1),

where

h_P(m,n) = 2⌈log m⌉ + 2⌊n/m⌋ + 1.

Then

M = (1) n + 1           (n = 2^g)
    (2) ⌈(n+1)/3⌉       (2^g < n < 2^g + 2^{g-1})
    (3) ⌈(n+1)/2⌉       (2^g + 2^{g-1} ≤ n < 2^{g+1})

where g = ⌊log n⌋.

[Figure 2.11. The Number of Steps, h_P(m,n), to Compute p_n(x) on P(m) by the m-th Order Horner's Rule]



Proof:

A proof is given for each case independently.

(1) n = 2^g.

The proof is divided into two parts. First it is clear that if we have n + 1 PE's, then p_n(x) can be computed in h_P,min steps (see Theorem 5). Next we show that if the number of PE's, m, is less than or equal to M - 1 (m ≤ n), then h_P,min < h_P(m,n), where

h_P,min = ⌈log n⌉ + ⌈log(n+1)⌉ + 1 = 2g + 2.

Let m = 2^k + p (k < g, 1 ≤ p ≤ 2^k). Then

h_P(m,n) = 2k + 3 + 2⌊2^g/(2^k + p)⌋.  (18)

Let

P = 2^{g-k}·p/(2^k + p).

Then

dP/dp = 2^{g-k}·2^k/(2^k + p)² > 0,

and

max P = 2^{g-k-1} for 1 ≤ p ≤ 2^k.

From this and Eq. (18), we get

h_P(m,n) ≥ 2k + 3 + 2(2^{g-k} - 2^{g-k-1}).  (19)

Since g > k, Eq. (19) becomes

h_P(m,n) ≥ 2k + 3 + 2^{g-k}.

Now let

f(k) = (2k + 3 + 2^{g-k}) - (2g + 2) = 2(k - g) + 1 + 2^{g-k}.

Since 2^a + 1 - 2a > 0 for a ≥ 1 (note that d(2^a + 1 - 2a)/da = (log_e 2)·2^a - 2 > 0 for a > 1), we have f(k) > 0 for all k < g. Since

h_P(m,n) - (2g + 2) ≥ f(k) > 0,

we get h_P(m,n) > 2g + 2 = h_P,min. This proves (1).

Now, since ⌈log n⌉ = ⌈log(n+1)⌉ if n ≠ 2^g for any g, we use

h_P,min = 2⌈log(n+1)⌉ + 1

in place of

h_P,min = ⌈log(n+1)⌉ + ⌈log n⌉ + 1

to prove (2) and (3). Then it is enough to show that

M = μ_m(⌈log(n+1)⌉ = ⌈log m⌉ + ⌊n/m⌋)

instead of

M = μ_m(2⌈log(n+1)⌉ + 1 = 2⌈log m⌉ + 2⌊n/m⌋ + 1).

Since 2^g < n < 2^{g+1} for (2) and (3), we have ⌈log(n+1)⌉ = g + 1. By direct computation, we can show that the theorem holds for n ≤ 10 (see Table 2.4).

(2) 2^g < n < 2^g + 2^{g-1}.

Now we show that

h_P,min = 2⌈log M⌉ + 2⌊n/M⌋ + 1.

To show this we first show that

⌈log M⌉ + ⌊n/M⌋ = g + 1.

Since 2^g < n < 2^g + 2^{g-1}, we have

2^{g-2} < (n+1)/3 ≤ 2^{g-1}  (20)

and

⌈log M⌉ = ⌈log ⌈(n+1)/3⌉⌉ = g - 1.  (21)

Now let ⌈(n+1)/3⌉ = k (k ≥ 3, as we may assume). Then

n + 1 = 3k - p  (p ≤ 2)

or

n = 3k - p - 1.  (22)

Using this and the relation 0 ≤ (p+1)/k < 1, we have

⌊n/M⌋ = 2.  (23)

Thus from Eq. (21) and (23),

⌈log M⌉ + ⌊n/M⌋ = g + 1 = ⌈log(n+1)⌉,

or h_P(M,n) = h_P,min.

Next we show that if m' < ⌈(n+1)/3⌉, then ⌈log m'⌉ + ⌊n/m'⌋ > g + 1 = ⌈log(n+1)⌉ (or, equivalently, h_P(m',n) > h_P,min).

        m
 n      1   2   3   4   5   6   7   8   9  10  11     g   M   Case
 2      4   5   4   4   4   4   4   4   4   4   4     1   3   (1)
 3      6   5   7   5   5   5   5   5   5   5   5     1   2   (3)
 4      8   7   7   7   6   6   6   6   6   6   6     2   5   (1)
 5     10   7   7   7   9   7   7   7   7   7   7     2   2   (2)
 6     12   9   9   7   9   9   7   7   7   7   7     2   4   (3)
 7     14   9   9   7   9   9   9   7   7   7   7     2   4   (3)
 8     16  11   9   9   9   9   9   9   8   8   8     3   9   (1)
 9     18  11  11   9   9   9   9   9  11   9   9     3   4   (2)
10     20  13  11   9  11   9   9   9  11  11   9     3   4   (2)

(1): n = 2^g
(2): 2^g < n < 2^g + 2^{g-1}
(3): 2^g + 2^{g-1} ≤ n < 2^{g+1}
where g = ⌊log n⌋.
n: the degree of a polynomial
m: the number of PE's
M: the minimum number of PE's

Table 2.4. The Number of Steps Required to Compute p_n(x), h_P(m,n), for n ≤ 10.



+9 



We have two cases, i.e.

    (i) 2^i ≤ m' < 2^(i+1) where i + 1 ≤ g − 2,

and (ii) 2^(g-2) < m' < ⌈(n + 1)/3⌉.

(i) 2^i ≤ m' < 2^(i+1).

Since i + 1 ≤ g − 2, we write

    g − 2 = i + 1 + j    (j ≥ 0).

Then

    ⌈log m'⌉ = i + 1 = g − 2 − j,

and since 2^g < n < 2^g + 2^(g-1),

    ⌊n/m'⌋ ≥ 2^(g-i-1) = 2^(2+j).

Thus

    ⌈log m'⌉ + ⌊n/m'⌋ ≥ (g − 2 − j) + 2^(2+j) > g + 1 = ⌈log(n + 1)⌉

because 2^a ≥ a + 1 if a ≥ 1.

(ii) 2^(g-2) < m' < ⌈(n + 1)/3⌉.

Let us write

    m' = ⌈(n + 1)/3⌉ − q = k − q    (q ≥ 1).

Then by Eq. (20) and (22), we get

    ⌈log m'⌉ + ⌊n/m'⌋ = g − 1 + ⌊(3k − p − 1)/(k − q)⌋ > g + 1 = ⌈log(n + 1)⌉,

because q ≥ 1, p ≤ 2 and 3q − p − 1 ≥ 0.

This ends the proof for (2).




(3) (2^g + 2^(g-1) ≤ n < 2^(g+1)) can be proved in a similar manner, and the details are omitted.

(Q.E.D.) 

It should be noted that the function h^p(m,n) is not a non-increasing function of m even if m ≤ M for some n. However, if n ≤ 50, then for more than 70% of the cases h^p(m,n) turns out to be a non-increasing function. (The only cases where h^p(m,n) is not a non-increasing function are the cases n = 15, 27, 28, 29, 30, 31, 36, 37, 38, 39, 45, 46, and 47. In any case, h^p(m,n) increases by at most one.)
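The theorem can also be checked numerically. The following is a minimal Python sketch (not from the thesis); it assumes the count h^p(m,n) = 2⌈log₂ m⌉ + 2⌊n/m⌋ + 1 for m ≤ n that the proof of cases (2) and (3) works with, together with h^p(m,n) = h^p_min for m ≥ n + 1, and it recovers M by brute force:

    from math import ceil, log2

    def h_min(n):
        # minimum step count with enough PE's: 2g + 2 when n = 2**g
        return ceil(log2(n)) + ceil(log2(n + 1)) + 1

    def h_p(m, n):
        # assumed step count on m PE's; m > n already reaches the minimum
        return 2 * ceil(log2(m)) + 2 * (n // m) + 1 if m <= n else h_min(n)

    def min_pes(n):
        # M: the smallest m for which h_p(m, n) equals h_min(n)
        return min(m for m in range(1, n + 2) if h_p(m, n) == h_min(n))

    for n in (8, 11, 13):
        print(n, min_pes(n), h_min(n))   # M = 9, 4, 7: cases (1), (2), (3)

For n = 8 no m ≤ n attains the minimum, so M = n + 1 = 9; for n = 11 and n = 13 the brute force reproduces ⌈(n+1)/3⌉ = 4 and ⌈(n+1)/2⌉ = 7.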



[Figure: the minimum number of PE's — vertical axis labeled "Number of PE's (M)"; the remainder of the plot is not recoverable from the source.]



3. TREE HEIGHT REDUCTION ALGORITHM 

3.1 Introduction 

In this chapter, recognition of parallelism within an arithmetic 
statement or a block of statements is discussed. There are several existing 
algorithms which produce a syntactic tree to achieve this end. The tree is 
such that operations on the same level can be done in parallel. Among them, 
the algorithm by Baer and Bovet [6] is claimed to give the best result. For example, a statement

    a + b + c + d × e × f + g + h

can be computed in four steps by their algorithm (Figure 3.1).



[Figure 3.1. An Arithmetic Expression Tree (1): the four-level tree for the statement above, with levels 1 through 4 marked.]



The algorithm reorders some terms in a statement to decrease 
tree height. However, this algorithm does not always take advantage of 
distributions of multiplication over addition. An arithmetic expression 






a(bcd+e) takes four steps as it is, whereas the equivalent distributed 
expression (abcd+ae) requires only three steps. A further example is 
Horner's rule. To compute a polynomial 

    p_n(x) = a_0 + a_1 x + a_2 x² + ... + a_n x^n,              (1)

Horner's rule

    p_n(x) = a_0 + x(a_1 + x(a_2 + ... + x(a_(n-1) + x a_n) ...))   (2)

gives a good result for serial machines. However, if a parallel machine is to be used, (1) gives a better result than (2). Namely, if we apply Baer and Bovet's algorithm [6] on (2), we get 2n steps, whereas (1) requires only 2⌈log₂(n+1)⌉ steps (see Chapter 2). Thus it is desirable for the compiler to be able to obtain (1) from (2) by distributing multiplications over additions properly. An algorithm to distribute multiplications properly over additions to obtain more parallelism (henceforth called the distribution algorithm or the tree height reduction algorithm) is discussed now.
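The gap can be made concrete with a small Python sketch (not from the thesis), comparing the two step counts just quoted:

    from math import ceil, log2

    for n in (3, 7, 15, 31):
        horner = 2 * n                   # form (2): n multiply-add pairs in series
        tree = 2 * ceil(log2(n + 1))     # form (1): evaluated as a balanced tree
        print(n, horner, tree)           # n = 31: 62 steps versus 10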

3.2 Tree Height and Distribution

Definition 1 : 

An arithmetic expression A consists of additions, multiplications, and 
possibly parentheses. We assume that addition and multiplication require the same amount of time (see Chapter 1). Subtractions and divisions will be introduced later. Small letters (a, b, c, ...), possibly with subscripts, denote single variables. Upper case letters and t, possibly with subscripts, denote arbitrary arithmetic expressions, including single variables. t is used to single out particular subexpressions, i.e. terms.



Then A can always be written as either (1) A = Σ_(i=1)^n t_i or (2) A = Π_(i=1)^n (t_i), e.g. A = abc + d(ef + g) = t_1 + t_2 where t_1 = abc and t_2 = d(ef + g), or A = (a + b)c(de + f) = (t_1)t_2(t_3) where t_1 = a + b, t_2 = c and t_3 = de + f. Note that when we write A = Π(t_i), we implicitly assume that for each i, t_i = Σ_(j=1)^(n(i)) t_j'. h[A] denotes the height of a tree T for A, which is of the minimum height among all possible trees for A in its presented form.

A minimum height tree (henceforth by a tree we mean a minimum height tree) for A, T[A], is built as follows [6].

Let us assume that A = Σ_(i=1)^n t_i or A = Π_(i=1)^n (t_i) and that for each i a minimum height tree T[t_i] has been built. Then first we choose two trees, say T[t_p] and T[t_q], each of whose height is smaller than the height of any other tree. We combine these two trees and replace them by the new tree whose height is one higher than max(h[t_p], h[t_q]):*

[Combined tree: T[t_p] and T[t_q] joined under a new root at height max(h[t_p], h[t_q]) + 1.]

This procedure is repeated until all trees are combined into one tree, which is T[A]. The procedure is formalized as follows:

* Figures are drawn to scale as much as possible.






(1) Let ST = {(1), (2), (3), ..., (n)} and let h'[i] = h[t_i] for all i ∈ ST.

(2) Choose two elements of ST, say p and q, such that h'[p], h'[q] ≤ M, where M = min {h'[u]} for all u ∈ ST − {p, q}.†

(3) Now let ST = (ST − {p, q}) ∪ {(p, q)} and h'[(p, q)] = max {h'[p], h'[q]} + 1.

(4) If |ST| = 1, then stop, else go to Step 2.

After we apply the above procedure on A, we get e.g. ST = {((((1)(2))(3))((4)(5)))}, where a pair (ab) indicates that the trees corresponding to a and b are to be combined. Thus in this case we get:

[The tree for ST = {((((1)(2))(3))((4)(5)))}: (1) and (2) combined, then (3); (4) and (5) combined on the other side.]

as a minimum height tree of A. In general the procedure is applied from the lowest parenthesis level (see Definition 3) to the higher parenthesis levels.

† If there are many choices, then choose those subtrees with smaller #(Sh[t]) values first (see Definition 8).
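The combining procedure pairs the two currently lowest subtrees, much like Huffman's construction. The following minimal Python sketch (not from the thesis) takes only the term heights and ignores the tie-breaking refinement of the footnote, so its pairing may differ from the one shown while the resulting height agrees:

    import heapq

    def build(heights):
        # heights: h[t_i] for the terms of A; returns (h[A], the nested pairing)
        st = [(h, str(i + 1)) for i, h in enumerate(heights)]
        heapq.heapify(st)
        while len(st) > 1:
            hp, p = heapq.heappop(st)    # two subtrees of smallest height
            hq, q = heapq.heappop(st)
            heapq.heappush(st, (max(hp, hq) + 1, "(%s%s)" % (p, q)))
        return st[0]

    print(build([0, 0, 0, 0, 0]))        # five single variables: height 3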

Example 1:

Let A = (a + bc)(d + efg) + hi = t_1 + t_2,

where t_1 = (t_3)(t_6), t_2 = hi,

    t_3 = a + bc = t_4 + t_5,

and t_6 = d + efg = t_7 + t_8.

[Figure 3.2. An Arithmetic Expression Tree (2): T[A] built from T[t_3], T[t_6] and T[t_2].]



The effective length e of an arithmetic expression is defined as

    e[A] = 2^h[A].

The number of single variables in an arithmetic expression is the number of single variable occurrences in it.

The height of a minimum height tree can be obtained without actually 
building a tree. 

Theorem 1: 



(1.1) If A = Π_(i=1)^p a_i or A = Σ_(i=1)^p a_i, then h[A] = ⌈log₂ p⌉.

(1.2) If A = Σ_(i=1)^p t_i, then

    h[A] = ⌈log₂ ( Σ_(i=1)^p e[t_i] )⌉.

(1.3) If A = Π_(i=1)^p a_i × Π_(j=1)^l (t_j), then

    h[A] = ⌈log₂ ( p + Σ_(j=1)^l e[t_j] )⌉.

Given an arithmetic expression A, to obtain the height of a tree for A, Theorem 1 is applied from the innermost parts of A to the outer parts, recursively.

Proof: 



(1) It is obvious that if A = Σ_(i=1)^p a_i or A = Π_(i=1)^p a_i, then h[A] = ⌈log₂ p⌉.

(2) Now let A = Σ_(i=1)^p t_i. Then we can replace each t_i by a product of e[t_i] single variables without affecting the total tree height h[A]. (Note that each t_i must be computed before the summation over the t_i is taken.) Thus A becomes A' = Σ_(i=1)^p Π_(j=1)^(e[t_i]) a_j and h[A] = h[A']. Let us call a tree for Π_(j=1)^(e[t_i]) a_j a subtree. Then a tree for A' is built using subtrees in the increasing order of their heights. Since a binary tree of height h cannot accommodate more than 2^h leaves, we have

    2^h[A'] ≥ Σ_(i=1)^p e[t_i] > 2^(h[A']−1),

or h[A'] = h[A] = ⌈log₂ Σ_(i=1)^p e[t_i]⌉.

(3) can be proved in a similar manner.



(Q.E.D.) 
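Theorem 1 translates directly into a recursion on the parse of A. The following minimal Python sketch (not from the thesis) encodes an expression as nested ('+', [...]) and ('x', [...]) tuples with strings for single variables — an encoding assumed here for illustration:

    from math import ceil, log2

    def h(a):
        # a single variable contributes effective length 2**0 = 1, so all
        # three cases of Theorem 1 reduce to one ceiling of a logarithm
        if isinstance(a, str):
            return 0
        return ceil(log2(sum(2 ** h(t) for t in a[1])))

    before = ('x', ['a', ('+', [('x', ['b', 'c', 'd']), 'e'])])      # a(bcd + e)
    after = ('+', [('x', ['a', 'b', 'c', 'd']), ('x', ['a', 'e'])])  # abcd + ae
    print(h(before), h(after))   # 4 3, the pair of heights quoted in Section 3.2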



58 



Definition 2 : 

The additive length a and the multiplicative length m of an arithmetic expression A are defined as follows:

(2.1) If A = Π_(i=1)^p a_i, then (i) a[A] = e[A] and (ii) m[A] = p.

(2.2) If A = Σ_(i=1)^p t_i, then (i) a[A] = Σ_(i=1)^p e[t_i] and (ii) m[A] = e[A].

(2.3) If A = Π_(i=1)^p a_i × Π_(j=1)^l (t_j), then (i) a[A] = e[A] and (ii) m[A] = p + Σ_(j=1)^l e[t_j].

It is to be noted that

(1) e[A] = ⌈a[A]⌉₂ = ⌈m[A]⌉₂, where ⌈x⌉₂ denotes x rounded up to the nearest power of 2, and

(2) h[A] = log₂ ⌈a[A]⌉₂ = log₂ ⌈m[A]⌉₂;

compare this definition with Theorem 1.

Definition 3 : 

The level ℓ of a parenthesis pair in an arithmetic expression is defined as follows:

First we start numbering parentheses at the left of the formula, proceeding from left to right, counting each left parenthesis, (, as +1 and each right parenthesis, ), as −1, and adding as we go. We call the maximum number m the depth of parentheses. Now the level ℓ of each parenthesis pair is obtained as ℓ = p, where p is the count for that parenthesis. The arithmetic expressions enclosed by the level ℓ parenthesis pairs are called the level ℓ arithmetic expressions, A^ℓ. Also for convenience we assume that there is an outermost parenthesis pair which encloses A.



Example 2:

    A = (ab((cd + e)(f + g) + k))

The outermost pair has level 1, the pair enclosing (cd + e)(f + g) + k has level 2, and the pairs around (cd + e) and (f + g) have level 3; the depth of parentheses here is m = 3.

Now several lemmas are in order.



Lemma 1:

Let A = Σ_(i=1)^n t_i or A = Π_(i=1)^n (t_i). Also let A' = t_1 + t_2 + ... + t_i' + ... + t_n or A' = (t_1) × (t_2) × ... × (t_i') × ... × (t_n), and A'' = A + t_(n+1) or A'' = A × (t_(n+1)). Then

    (i) h[A'] ≥ h[A] if h[t_i'] ≥ h[t_i],
    (ii) h[A''] ≥ h[A].



Proof: 



Obvious from Theorem 1. 

What Lemma 1 implies is that the height of the tree for an arithmetic 
expression is a non-decreasing function of term heights, and the number of terms 
involved. 

In an arithmetic expression, there are four possible ways of parenthesis occurrence:

    P_1) ... + (A) + ...
    P_2) ... θ(t_1 × t_2 × ... × t_n) × (t_1' × t_2' × ... × t_m') θ ...
    P_3) ... a_1 × a_2 × ... × a_n × (A) θ ...
    P_4) ... θ(t_1 + t_2 + ... + t_n) × (t_1' + t_2' + ... + t_m') θ ...

where θ represents +, ×, or no operation.

Lemma 2:

Let D = B + (A) + C and D_1 = B + (t_1 × ... × t_n) × (t_1' × ... × t_m') + C. Also let D^d = B + A + C and D_1^d = B + t_1 × ... × t_n × t_1' × ... × t_m' + C. Then

    h[D] ≥ h[D^d] and h[D_1] ≥ h[D_1^d].

Proof:

Obvious from Theorem 1.

As an example, let D = (a + b + c) + d and D_1 = (abc)(defgh). Then D^d = a + b + c + d and D_1^d = abcdefgh, with h[D] = 3 > h[D^d] = 2 and h[D_1] = 4 > h[D_1^d] = 3.



Lemma 3:

Let D = (Σ_(i=1)^n t_i)(Σ_(j=1)^m t_j') and let D^d = t_1 t_1' + t_2 t_1' + ... + t_n t_m' be its full distribution. Then h[D^d] ≥ h[D].

Proof:

Let D = (A)(B) where A = Σ_(i=1)^n t_i and B = Σ_(j=1)^m t_j'. Also, without losing generality, assume that h[A] ≥ h[B]. Then h[D] = h[A] + 1. For each j, let d_j = t_j' t_1 + t_j' t_2 + ... + t_j' t_n. It is clear that h[t_j' t_i] ≥ h[t_i] for all i and j. Thus from Lemma 1, we have h[d_j] ≥ h[A] for all j. Since D^d = Σ_(j=1)^m d_j,

    h[D^d] ≥ min over j of (h[d_j]) + log₂ ⌈m⌉₂ ≥ h[A] + log₂ ⌈m⌉₂,

or, since h[D] = h[A] + 1, h[D^d] ≥ h[D].

(Q.E.D.)



Note that the above lemma does not necessarily imply that if D = (Σ_(i=1)^n t_i)(B) and D^+ = t_1(B) + t_2(B) + ... + t_n(B), then h[D^+] ≥ h[D]. Actually, it can be shown that there is a case when h[D] > h[D^+]. What Lemma 3 says is that D = (A)(B) should not be fully distributed, but partial distribution, as in D^+, may be done in some cases.






Lemmas 2 and 3 together indicate that distribution in case (P_3) and partial distribution in case (P_4) are the only cases which should be considered for lowering tree height. In cases (P_1) and (P_2), removal of parentheses leads to a better result or at least gives the same tree height. Full distribution in case (P_4) always increases tree height and should not be done. Also it should be clear that in any case the tree height of an arithmetic expression cannot be lower than that of a component term even after distributions are done. For example, let D = t(A) = t × (Σ_(i=1)^n t_i) and D^d = Σ_(i=1)^n (t × t_i). Then from Lemma 1, we have h[t t_i] ≥ h[t_i] for each i. Thus h[D^d] ≥ h[A].

The same argument holds for all four cases. This assures that evaluation of distributions can be done locally. That is, if some distribution increases the tree height for a term then that distribution should not be performed, because once tree height is increased, it can never be remedied by further distributions.

Actually, there are two cases where distribution pays. For example, if A = a(bcd + e), then h[A] = 4. However, if we distribute a, then we get A^d = abcd + ae and h[A^d] = 3. The idea is to balance a tree by filling the "holes", because a balanced tree can accommodate the largest number of variables among equal height trees. The situation is, however, not totally trivial, because by distribution the number of variables in an expression is also increased. Next let A = a(bc + d) + e = t + e and A^d = abc + ad + e = t^d + e. In this case h[A] = 4 but h[A^d] = h[t^d] = 3. What happened here is that t is "opened" by distributing a over (bc + d), and the "space" to put e in is created.






At each level of parenthesis pair, cases (P_3) and (P_4), i.e., instances of "holes" and "spaces", are checked and proper distribution is performed. Next we give definitions of holes and spaces, and formalize these ideas.

3.3 Holes and Spaces

3.3.1 Introduction

Before we proceed further, let us study trees for arithmetic expressions 

more carefully. 

Let A = Σ_(i=1)^p t_i. By Definition 1 we first build minimum height trees T[t_i] for all i, and T[A] is built by combining these T[t_i]. Once T[t_i] is built, the details of t_i do not matter; the only thing that matters is its height h[t_i]. Suppose T[t_i] and T[t_j] are combined to build T[A]. Assume also that h[t_i] = h[t_j] + s. Then we will get s nodes to which no trees are attached other than T[t_j]. We call these free nodes; their heights are h[t_j] + 1, h[t_j] + 2, ..., h[t_i].




[Figure 3.3. Free Nodes: T[t_i] and T[t_j] combined, with the s free nodes at heights h[t_j] + 1, ..., h[t_i] marked.]



Similarly we can enumerate all free nodes in T[A] with their heights.

[Figure 3.4. Free Nodes in a Tree: ○ free node, △ occupied node; the roots of T[t_i] and of T[A] are marked.]

Free nodes in a tree T[Π_(i=1)^p (t_i)] are defined similarly.



Let us emphasize that once we get T[t_i] we treat it as a whole and do not care about its details when we build T[A]. That is, when we consider free nodes in T[A] we mean free nodes "in" T[A] but "outside" of the T[t_i]. For example let A = (a + b)(cde + f) = (t_1)(t_2). Then

[T[A] for (a + b)(cde + f): free nodes α and β lie outside the term trees, while γ and δ lie inside T[t_2].]

α and β are free nodes in T[A], while γ and δ are free nodes in T[t_2] and not in T[A].

Now suppose there are m free nodes in T[A]. We number them arbitrarily from 1 to m. Also let us denote the height of a free node α by h[α]. Given a free node α whose height is h[α] in T[A = Σ t_i] (or T[A = Π(t_i)]), by definition we can attach a tree T[t] whose height is h[α] − 1 (or whose effective length is 2^(h[α]−1)) to α without affecting the height of A.

Definition 4:

For A = Π(t_i) or A = Σ t_i, we build a tree. Then

(4.1) define F_A[A] to be the set of all free nodes in T[A], and

(4.2) for each i define F_R[A, t_i] to be the set of all free nodes which exist between the roots of T[A] and T[t_i], i.e. the free nodes which we encounter when we traverse T[A] down to the root of T[t_i].

For example let us consider the following tree (see Figure 3.5).

[Figure 3.5. An Example of F_A and F_R.]

Then F_A[A] = {α, β, γ, δ, ε} and F_R[A, t_1] = {α, β, γ}, F_R[A, t_2] = {α, β}, etc.

Lemma 4:

Suppose h[α] = h[β] for some free nodes α and β in T[A]. Then without changing the tree height h[A], we can replace the two free nodes α and β by one free node β' whose height is h[α] + 1.



Proof:

[Figure 3.6. Elimination of a Free Node: the original T[A] and the modified T[A].]

We can combine subtrees 1 and 3, and hence eliminate the free nodes α and β and create a new free node β' (see Figure 3.6).

(Q.E.D.)

Given F_A[A], two free nodes α and β of equal height can be replaced by one free node β' whose height is h[α] + 1. Repeating this procedure, we finally get a new F_A'[A] in which no two free nodes have the same height. Let γ and δ be free nodes in F_A[A] and F_A'[A] respectively. Assume that for all free nodes α in F_A[A] (or F_A'[A]), h[γ] ≥ h[α] (or h[δ] ≥ h[α]). Then obviously h[γ] ≤ h[δ]. However, it is clear that h[δ] < h[A], i.e. if h[δ] = h[A] then h[A] is not the height of a minimum tree (see Figure 3.7). Hence we have the following corollary.



Corollary 1:

In a minimum height tree T[A],

    Σ over α ∈ F_A[A] of 2^(h[α]−1) < ½ e[A].



Proof:

Assume that Σ over α ∈ F_A[A] of 2^(h[α]−1) = ½ e[A]. Then by Lemma 4, all α in F_A[A] can be replaced by one free node β whose height is h[A] (note 2^(h[A]−1) = e[A]/2). This, however, implies that we can build a tree T'[A] whose height is h[A] − 1, which contradicts our assumption that T[A] is a minimum height tree for A.

[Figure 3.7. A Minimum Height Tree.]

(Q.E.D.)



The following definitions are also used in the next section.

Definition 5:

An integer set is a set of integers with possibly duplicated elements, e.g. {2,2,2,4,8,8}. If an integer x occurs in an integer set Y at least once, then we write x in Y, e.g. 2 in {2,2,2,4,8,8}. Let Y and Z be two integer sets. Then by Y uni Z we get an integer set where, if an integer x occurs i times in Y and j times in Z, then x occurs i + j times in Y uni Z, e.g. {2,4,4} uni {2,4,8} = {2,2,4,4,4,8}. Also if Y is an integer set, then #(Y) = (the sum of the values of all elements of Y), e.g. #({2,4,4}) = 10, and le Y = (the value of the largest element in Y), e.g. le {2,4,4} = 4. Furthermore, if x in Y, then by Y del {x} we mean the integer set which is obtained by deleting one occurrence of x from Y, e.g. {2,4,4,8,16,16} del {4} = {2,4,8,16,16}.

Let Y_1, Y_2, ..., Y_p be integer sets. Then by M_u(d) over Y_1, ..., Y_p we mean the set Y constructed as follows.

(1) Let Y = ∅ (empty) and s = d.

(2) If s = 0, then stop, else go to (3).

(3) Let u = min over i of (le Y_i), and let k be an index such that u = le Y_k. If there are more than one k which satisfy this, then pick an arbitrary one. Now let Y = Y uni {u} and s = s − u. Also let Y_k = Y_k del {u} and Y_i = (Y_i del {le Y_i}) uni {le Y_i − u} for all i (i ≠ k). Go to (2).

For example let Y_1 = {1,2,2,8}, Y_2 = {4,4,8} and Y_3 = {16}. Then Y = M_u(13) over Y_1, Y_2, Y_3 is constructed as follows.

(1) Y = ∅ and s = 13.

(2) le Y_1 = 8, le Y_2 = 8, le Y_3 = 16. Hence u = 8 and k = 1. Then Y = {8} and s = 13 − 8 = 5. Also Y_1 = {1,2,2}, Y_2 = {4,4} and Y_3 = {8}.

(3) le Y_1 = 2, le Y_2 = 4, le Y_3 = 8. Hence u = 2 and k = 1. Then Y = {2,8} and s = 5 − 2 = 3. Also Y_1 = {1,2}, Y_2 = {4,2}, Y_3 = {6}.

(4) Repeating the procedure we finally get Y = {1,2,2,8}.

Definition 6:

Let m be an integer. Now write m as a sum of powers of 2 in which each power appears at most once. Then by Σb(m) we mean the integer set of the powers of 2 which appear in the above sum. For example, since 13 = 2³ + 2² + 2⁰ = 8 + 4 + 1, Σb(13) = {1,4,8}; but Σb(13) ≠ {1,4,4,4}, since 4 appears more than once.
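Both Σb and M_u are mechanical, and a minimal Python sketch (not from the thesis) may make Definition 5 easier to follow; integer sets are kept as sorted lists, sb and m_u stand for Σb and M_u, and degenerate inputs are not guarded:

    def sb(m):
        # the powers of 2 in the binary expansion of m, e.g. sb(13) = [1, 4, 8]
        return [1 << i for i in range(m.bit_length()) if m >> i & 1]

    def m_u(d, sets):
        ys = [sorted(y) for y in sets]
        out, s = [], d
        while s > 0:
            u = min(y[-1] for y in ys)            # u = min over i of le Y_i
            k = next(i for i, y in enumerate(ys) if y[-1] == u)
            out.append(u)
            s -= u
            ys[k].pop()                           # Y_k del {u}
            for i, y in enumerate(ys):
                if i != k:
                    y[-1] -= u                    # le Y_i becomes le Y_i - u
                    y.sort()
        return sorted(out)

    print(m_u(13, [[1, 2, 2, 8], [4, 4, 8], [16]]))   # [1, 2, 2, 8], as above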

3.3.2 Holes 

Now we discuss holes in an arithmetic expression. Intuitively, if an arithmetic expression A has a hole of size u, then an arithmetic expression t' with e[t'] ≤ u may be distributed over A without increasing tree height.

Definition 7:

For each A = Π(t_i) we define a total hole function H_T, and for each t_i in A = Σ t_i we define a relative hole function H_R, as follows:

(7.1) For each A = Π(t_i), build a tree T[A]. Then define H_T[A] = Σ 2^(h[α]−1) for all α ∈ F_A[A].

(7.2) For each A = Σ t_i, build a tree T[A]. Then for each t_i define H_R[A, t_i] = Σ 2^(h[α]−1) for all α ∈ F_R[A, t_i]. If F_R[A, t_i] = ∅, then let H_R[A, t_i] = 0.

As stated before, if α is a free node in T[A], then 2^(h[α]−1) is the effective length of a term t' whose tree T[t'] can be attached to α without changing the tree height h[A]. Also let u in Σb(H_T[A]). Then this implies that there is a free node α in F_A[A] such that 2^(h[α]−1) = u (see Lemma 4). Similarly, if u in Σb(H_R[A, t_i]), then there is a free node α in F_R[A, t_i] such that 2^(h[α]−1) = u. Thus in general

(1) if A = Π(t_i), then h[(t') × (A)] = h[A] if m[t'] ≤ u, and

(2) if A = Σ t_i, then h[t' + A] = h[A] if a[t'] ≤ u,

where u = Σ over α ∈ F_A[A] of 2^(h[α]−1).

Definition 8:

The set of holes Sh[A] for an arithmetic expression A is an integer set defined as follows:

(8.1) If A = Π_(i=1)^p a_i, then we let Sh[A] = Σb(e[A] − p), e.g. Sh[abcde] = Σb(3) = {1,2}.

(8.2) If A = Π_(i=1)^p (t_i), then we let Sh[A] = uni over i of Sh[t_i] uni Σb(H_T[A]), e.g. Sh[(abc+d)(efg+h)(i+j)] = Sh[abc+d] uni Sh[efg+h] uni Sh[i+j] uni Σb(H_T[A]) = {1} uni {1} uni Σb(14) = {1,1,2,4,8}.

(8.3) If A = Σ_(i=1)^p t_i, then we first let Sh'[t_i] = Sh[t_i] uni Σb(H_R[A, t_i]) and d = min over i of (#(Sh'[t_i])). Then we let Sh[A] = M_u(d) over Sh'[t_1], ..., Sh'[t_p].

Example 3:

Let A = (a+b)(c+def) + ghi = (t_1)(t_2) + t_3 = t_4 + t_3. Then Sh[t_2] = {1} and Sh[(t_1)(t_2)] = {1} uni Σb(6) = {1,2,4}. Also Sh[t_3] = {1}. Hence Sh'[t_4] = {1,2,4} and Sh'[t_3] = {1} uni Σb(12) = {1,4,8}. Now d = min(#(Sh'[t_4]), #(Sh'[t_3])) = 7. Hence Sh[A] = M_u(7) over Sh'[t_4], Sh'[t_3] = {1,2,4}.



In (8.3) above, note that for every i

(1) if u in Sh[A], then there is u_i in Sh'[t_i] such that u ≤ u_i;

(2) if u in Sh[A], then u ≤ #(Sh'[t_i]). Also there is at least one k such that le Sh[A] = le Sh'[t_k].

Given an arithmetic expression A and t' = Σ_(i=1)^q t_i', if e[t'] ≤ le Sh[A], then t' may be distributed over A without increasing tree height. Informally we say that A has a hole which can accommodate t'.

Now we show that the above assertion is indeed valid. First we observe that if t' = Π_(i=1)^q (t_i'), then each t_i' may be distributed independently over A. Thus in the following we can assume that t' = Σ_(i=1)^q t_i'. Note that if t can be distributed over A without increasing tree height, then any t' (e[t'] ≤ e[t]) can be distributed over A as well.



Lemma 5:

Let A = Π_(i=1)^p a_i. Then h[A] = h[(A) × (t')] if e[t'] ≤ le Sh[A].

Proof: Obvious by Theorem 1.



Theorem 2:

(1) Let A = Π_(i=1)^p (t_i). Then h[A] = h[(t'A)^d] if e[t'] ≤ le Sh[A].†

(2) Let A = Σ_(i=1)^p t_i. Then h[A] = h[(t'A)^d] if e[t'] ≤ le Sh[A].

† We write (t't)^d for the expression obtained after distributing t' over t in such a way that the tree height is reduced, e.g. (a(b + cd))^d = ab + acd.



Proof:

We use mathematical induction to prove the above theorem. Lemma 5 serves as a basis.

Let A = Π_(i=1)^p (t_i) or A = Σ_(i=1)^p t_i. Assume that the theorem holds for the t_i.

(1) A = Π_(i=1)^p (t_i).

We show that if e[t'] ≤ le Sh[A], then A can be multiplied by t' without increasing tree height. There are two cases:

(i) There is k such that e[t'] ≤ le Sh[t_k]. Then we distribute t' over t_k without increasing tree height.

(ii) For all i, e[t'] > le Sh[t_i]. In this case e[t'] ≤ le Σb(H_T[A]). Then there is a free node α in F_A[A] such that u = 2^(h[α]−1), and e[t'] ≤ u means that T[t'] can be attached to α without increasing tree height.

Hence A can be multiplied by t' without increasing tree height.

(2) A = Σ_(i=1)^p t_i.

We show that if e[t'] ≤ le Sh[A], then t' can be distributed over A without increasing tree height.



First note that for all i, if α ∈ F_R[A, t_i] then 2^(h[α]−1) ≥ le Sh[t_i]. Now assume that e[t'] ≤ le Sh[A]. For fixed k, u = le Sh[A] implies either

    (i) u ≤ le Sh[t_k],

or  (ii) there is α in F_R[A, t_k] such that u ≤ 2^(h[α]−1) (or equivalently u ≤ le Σb(H_R[A, t_k])).

In the first case we have h[t_k] = h[(t' t_k)^d] by assumption, and

    h[A] = h[Σ over i ≠ k of t_i + (t' t_k)^d].

In the second case we attach T[t'] to α (Figure 3.8(a)) without increasing the tree height h[A], i.e.

    h[A] = h[Σ over i ≠ k of t_i + (t') × (t_k)].

In general let T'' be a subtree in T[A] whose root is α (Figure 3.8(b)). Then

[Figure 3.8. Attachment of T[t'] to a Free Node: (a) T[t'] attached directly at α; (b) the subtree T'' rooted at α.]

there may be other term trees besides T[t_k] in T'', e.g. T[t_j] and T[t_h] in Figure 3.8(b). Hence we get

    h[A] = h[Σ over i ≠ k, j, h of t_i + (t') × (t_k + t_j + t_h)]

in this case. Note that α is also in F_R[A, t_j] and F_R[A, t_h].

Repeating this procedure for all k, we can get an arithmetic expression equivalent to Σ_(i=1)^p (t') × (t_i) or (t')(A) without increasing tree height. This proves (2).

It is obvious that in both (1) and (2), if e[t'] > le Sh[A], then t' cannot be distributed (multiplied) over A without increasing tree height.

(Q.E.D.)
(Q.E.D.) 

Lemma 6:

Let A = Σ t_i and e[t'] ≤ le Sh[A]. Then after t' is distributed over A, we have

    Sh[(t'A)^d] = (Sh[A] del {u}) uni Σb(u − e[t']) uni Sh[t'],

where u is the smallest element in Sh[A] not smaller than e[t'].

Now let us summarize what we have so far. Let A and t' be arithmetic expressions where A = Σ t_i and t' = Σ t_i'. If e[t'] ≤ le Sh[A], then t' can be distributed over A without increasing tree height, i.e. h[(t') × (A)] ≥ h[(t'A)^d] = h[A]. A set of holes in (t'A)^d is given by the above lemma. Since it is obvious that le Sh[A] < e[A], we have the following lemma.



Lemma 7:

Let t' = Σ t_i'. Then h[(t')(A)] > h[(t'A)^d] implies that h[t'] < h[A].

In general let t' = Π_(i=1)^q (t_i'). For convenience let us assume that e[t_1'] ≤ e[t_2'] ≤ ... ≤ e[t_q']. Then if the following procedure can be accomplished successfully, we say that A has holes to accommodate all the t_i' (i = 1, 2, ..., q).†

Procedure:

(1) Let V_i = (t_i'(t_(i−1)' ... (t_1'A)^d)^d ...)^d and V_0 = A. Let i = 1.

(2) Check if e[t_i'] ≤ le Sh[V_(i−1)].

(3) If so, then distribute t_i' over V_(i−1), and we have V_i.

(4) Evaluate Sh[V_i].

(5) If i = q, then stop; else let i = i + 1 and go to (2).

The procedure may be accomplished successfully if m[t'] ≤ #(Sh[A]).

† We assume that each t_i' does not have any holes, i.e. Sh[t_i'] = ∅ for all i. Hence, for example, Sh[(t'A)^d] = (Sh[A] del {u}) uni Σb(u − e[t']).
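The procedure can be simulated on hole sets alone, carrying Sh forward with Lemma 6 at each distribution. In the minimal Python sketch below (not from the thesis), each factor t_i' is given only by its effective length and, as in the footnote above, is assumed to have no holes of its own:

    def sb(m):
        return [1 << i for i in range(m.bit_length()) if m >> i & 1]

    def accommodate(sh_a, e_factors):
        # steps (1)-(5): push each t_i', in increasing e[t_i'], into a hole of A
        sh = sorted(sh_a)
        for e_t in sorted(e_factors):
            fits = [u for u in sh if u >= e_t]    # step (2): e[t_i'] <= le Sh
            if not fits:
                return None                       # the procedure fails
            u = min(fits)                         # fill the smallest adequate hole
            sh.remove(u)
            sh = sorted(sh + sb(u - e_t))         # Lemma 6, with Sh[t_i'] empty
        return sh

    print(accommodate([1, 2, 4], [2, 4]))         # remaining holes: [1]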

3.3.3 Space

Now the second possible distribution case, i.e. space, is studied.

The idea of the second distribution case is that, given an arithmetic expression D of the following form,

    D = ... θ (t') × (A) + t_s θ ...,

we distribute t' over A so that t_s can be hidden under the combined tree, as shown in Figure 3.9.

[Figure 3.9. An Example of Space (1): (a) the tree for (t')(A) + t_s; (b) the tree for (t'A)^d + t_s, in which t_s is absorbed.]



In other words, in the case of D the addition +_s cannot be done before (t')(A) (we write t'(A)) is computed, while in the case of D^d it may be done earlier.

Note that if h[t'(A)] > h[(t'A)^d], then A has enough holes to accommodate t' and the distribution of t' over A is done anyway. Henceforth, throughout the rest of this section, we assume that A does not have any holes to accommodate t'. Thus we now deal with the case when h[t'(A)] ≤ h[(t'A)^d]. However, if h[t'(A)] < h[(t'A)^d] holds, then clearly there is no way to get h[D^d] < h[D], by Lemma 1. Thus we have:



Lemma 8:

(1) h[t'(A)] = h[(t'A)^d] must hold to get h[D^d] < h[D].

(2) Let A = Σ_(i=1)^n t_i. Then h[t'(A)] ≥ h[(t'A)^d] if and only if e[t'] ≤ e[A] (i.e. h[t'] ≤ h[A]). This implies that h[t'(A)] = h[A] + 1.

Proof:

By inspection.

Intuitively, the space Sp in an arithmetic expression A = Σ_(i=1)^p t_i with respect to t' is defined as

    Sp[A, t'] = e[(A) × (t')] − Σ_(i=1)^p e[(t') × (t_i)].

For example let A = ab + c and t' = d. Then Sp[A, t'] = 2.



[Figure 3.10. An Example of Space (2): (a) the tree for t'(A); (b) the tree for (t'A)^d, with its free node marked.]

Let D = t'(A) + t = (t'A)^d + t. Note that a free node in T[t'(A)] cannot be used to attach T[t], while a free node in T[(t'A)^d] may be used to attach T[t]. Now the formal definition of space follows.



Definition 9:

Given arithmetic expressions A = Σ_(i=1)^p t_i and t' = Σ_(i=1)^q t_i', the space function Sp of A w.r.t. t' is defined as follows. First we build trees T[A] and T[t'], and in T[A] let F be the set of free nodes f higher than T[t'] (h[f] > h[t']). Also we define a set I as follows: we let i ∈ I if h[t_i] ≥ h[t'] and e[t'] ≤ le Sh[t_i]. Now the space function is obtained as:

    Sp[A, t'] = Σ over f ∈ F of 2^h[f] + Σ over i ∈ I of 2^h[t_i].
fe F i€ I 

To show how Definition 9 works, we first describe how to build T[(t'A)^d] by attaching T[t'] to T[A] properly (i.e. by distributing t' over A properly). Since h[(t'A)^d] = h[A] + 1 (Lemma 8), we first study how to build T'[A], which is obtained by replacing each T[t_i] in T[A] by a T'[t_i] whose height is h[t_i] + 1. Then the height of T'[A] is h[A] + 1. Building T[(t'A)^d] from T[A] may be explained in an analogous way.

As stated before, the only case to be considered is when h[t'(A)] = h[(t'A)^d] = h[A] + 1 holds (Lemma 8). Suppose that all T[t_i] in T[A] are replaced by T'[t_i] whose height is h[t_i] + 1. Then the new tree T'[A], whose height is h[A] + 1, is obtained. Note that a free node α in T[A] now becomes a free node α' in T'[A] with height h[α] + 1. In T'[A], if T'[t_i] is replaced by T[t_i] again, then a new free node β', whose height is h[t_i] + 1, is created. Having these facts in mind, we now describe the way t' is distributed over A to create space.

The tree T[(t'A)^d] is built from T[A] as follows (note h[(t'A)^d] = h[A] + 1). Depending on the height of T[t_j], we have two cases.

(1) h[t_j] ≥ h[t'].

If T[t_j] has a hole to put T[t'] in, then we fill it by T[t']. In this case h[t'(t_j)] = h[t_j], and a new free node β' whose height is h[t_j] + 1 is created in T[(t'A)^d]. Otherwise h[t'(t_j)] = h[t_j] + 1.

(2) h[t_j] < h[t'].

Find the tree T[Σ t_s] whose height is h[t'] and which includes T[t_j]. We multiply Σ t_s by t' and get h[t'(Σ t_s)] = h[Σ t_s] + 1.

Note that t_j and Σ t_s are treated as terms of A. In the resultant tree T[(t'A)^d], those free nodes in T[A] whose heights are less than or equal to h[t'] (i.e. the free nodes in T[Σ t_s]) do not appear.




(a) (b) 

Figure 3-H- Distribution of t' over A 



(c) 



.m*- 



A free node a in T[A] (h[a] > h[t']) appears in T[(t*A) ] as a free 
node a' where h[a' ] = h[a] + 1. Thus T[(t'A) ] has those a' and 3' described in (l) 
as free nodes. 



81 



If 6' is a free node in T[ (t f A) d ], then a tree T[t] (h[t] <h[6'] - 1) 
can De attached to T[(t'A) ] without increasing tree height (i.e. h[(t'A) ] = 
h[(t'A) + t]). Since h[6'] is either h[t.] + 1 or h[ a ] + 1, we have e[t] = 

J 

2 J or 2 h ^ ] . 

In general if a[t] < Sp[A,t f ] then we have h[t'(A)] = h[(t'A) d ] = 
h[(t'A) d + t]. 

Definition 9 may be generalized to include the case where t' = Π_(i=1)^q (t_i'). In this case we first obtain h[t'] and e[t_i'] (i = 1, 2, ..., q), and build T[A]. In T[A] let F be the set of free nodes f which are higher than h[t'] (h[f] > h[t']). Also let I be a set such that i is in I if h[t_i] ≥ h[t'] and T[t_i] has enough holes to put all the T[t_i'] (i.e. a[t'] ≤ #(Sh[t_i])). Then Sp[A, t'] = Σ over f ∈ F of 2^h[f] + Σ over i ∈ I of 2^h[t_i].

Informally we say that space to put t is created by distributing t' over A if h[t'(A) + t] > h[(t'A)^d + t]. The procedure is then called space filling. Now we study how much we can reduce tree height by space filling. Let B = Σ_i (t_i')(A_i) + Σ_j t_j and assume that by distributing t_i' over A_i space can be created for all i.



Lemma 9:

Let B = Σ_i (t_i')(A_i) + Σ_j t_j and B^d = Σ_i (t_i'A_i)^d + Σ_j t_j. Then h[B^d] = h[B] − 1 if Σ_i Sp[A_i, t_i'] ≥ a[B] − e[B]/2; otherwise h[B^d] = h[B].



Proof:

First note that to lower the height of a tree for B, some terms must be removed from B so that the effective length of the resultant expression becomes e[B]/2 (see Figure 3.12). Hence ΣSp[A_i, t_i'] must be greater than or equal to a[B] − e[B]/2.

[Figure 3.12. Tree Height Reduction by Hole Creation: the additive length a[B] must be brought down to e[B]/2.]

Next we show that h[B^d] cannot be smaller than h[B] − 1. As before, we assume that h[t_i'(A_i)] = h[(t_i'A_i)^d] for all i (see Lemma 8(1)). This is equivalent to the assumption that Sp[A_i, t_i'] ≤ e[t_i'(A_i)]/2 = e[(t_i'A_i)^d]/2. First we get B' from B^d as follows. We replace every (t_i'A_i)^d in B^d by a product P_i of e[(t_i'A_i)^d]/2 single variables. This amounts to assuming that Sp[A_i, t_i'] = e[(t_i'A_i)^d]/2. Further, we get B'' from B' by replacing every t_j in B' by a product Q_j of e[t_j]/2 single variables. Then it is clear that h[B] ≥ h[B^d] ≥ h[B'] ≥ h[B''] = h[B] − 1. If ΣSp[A_i, t_i'] ≥ a[B] − e[B]/2, then h[B] > h[B^d]. Hence h[B^d] = h[B] − 1.

(Q.E.D.)
What the lemma implies is that by space filling we can lower tree height at most by one. In other words, to see if the space creation by distribution is effective, it is enough to see if the total tree height can be lowered by one; and we know that once the tree height is lowered by one, it is not necessary (i.e. useless) to try to lower tree height further by creating more space by further distribution.

Unlike a set of holes (Theorem 2), the space function for A = Σt_i does not carry any information about space in the components t_i of A. For example let

    B = a(b(c + defg) + π16) + π16 + π4 = a(A) + π16 + π4,

where πi is a product of i single variables. Then h[B] = 7. Now Sh[A] = ∅ and space creation is tried. Note that Sp[A, a] = 16 < a[B] − e[B]/2 = 20, but Sp[c + defg, ab] = 4. That is, a as well as b should be distributed over c + defg. Thus we get B' = abc + abdefg + a(π16) + π16 + π4, where h[B'] = 6.
Now this situation is studied in detail. In general we have a form:

    F = ... + t'( ... + t''(C) + ... + D + ... ) + ... + E + ...,
              |______________  A  ______________|

Even if Sp[A, t'] is not enough to reduce the tree height h[F], we have to further check components of A, e.g. Sp[C, t't'']. As we will show later (see Substep 2 of Step 3 of the Algorithm given in Section 3.4.1), an arithmetic expression is examined from the innermost pairs of parentheses to the outermost pair. In the above form, the distribution of t'' over C is first checked to see if it reduces the tree height h[A], and then the distribution of t' over A is examined. If the distribution of t'' over C creates space and reduces the tree height h[A], then there is no problem. However, if that distribution does not lower the tree height h[A], then t'' will not be distributed over C (see Algorithm). As we showed in the above example, when we check the possibility of reducing the tree height h[F] by creating space by the distribution of t' over A, it may be necessary to check Sp[C, t't''] as well.



Let A' = ... + (t''C)^d + ... + D + ..., G' = (t'A)^d and G'' = (t'A')^d. Now we show that if Sp[A, t'] = 0, then it is not necessary to examine the components of A, e.g. Sp[C, t't''], further. This helps to reduce the number of checks required.

Lemma 10:

(1) If Sp[A, t'] = 0, then h[... + G'' + ... + E + ...] < h[... + G' + ... + E + ...] never holds.

(2) If Sp[A, t'] = 0, then h[G''] < h[G'] never holds.

Proof:

(1) We prove this by showing that if Sp[A, t'] = 0, then Sp[A', t'] = 0. Note that Sp[A, t'] = 0 implies that either

    (i) h[t'] > h[t''(C)]
or  (ii) h[t''] = h[C]    (Definition 9).

By Lemma 9, in either case we get h[(t'(t''C)^d)^d] > h[t't''(C)]. Note that the only difference between G' and G'' is that the term t't''(C) in G' is replaced by (t'(t''C)^d)^d in G''. Since h[(t'(t''C)^d)^d] > h[t't''(C)], G'' cannot have more free nodes than G'. Hence Sp[A', t'] = 0.

(2) This may be proved in a similar way and the details are omitted.

(Q.E.D.)
Thus in F = ... + t'(... + t''(C) + ... + D + ...) + ... + E + ..., the distribution of t'' over C should be done if it reduces the tree height of T[A]; otherwise it should be left untouched. In the later step, when the distribution of t' over A is examined, the possibility of distributing t'' over C as well shall be checked if and only if Sp[A, t'] ≠ 0. Otherwise we shall leave t'(A) as it is, and we need not check the inside of A again.

3.4 Algorithm

Having these results, an algorithm to reduce the tree height of an arithmetic expression is now described. Given an arithmetic expression, the algorithm works from the innermost pairs of parentheses to the outermost pair. We assume that cases (P_1) and (P_2) (see Lemma 2) are already taken care of. At each level of a parenthesis pair, first, upon finding a form t'(A), the holes of A are tried to be filled by t' (Theorem 2). After all holes are filled, a form t'(A) + t'' is checked, i.e. if the distribution of t' over A creates enough space to accommodate t'', then the distribution is made. Note that it is not necessary to fully distribute (A)(B) = (Σt_i)(Σt_j') (see Lemma 3).

It is not necessarily true that reducing the tree height of a term t of an arithmetic expression A reduces the tree height of A. However, we show that the reduction of tree height should be made in any case to help later steps of the distribution algorithm.

Let A = Σ_(i=1)^(n−1) t_i + t_n (or Π_(i=1)^(n−1) (t_i) × (t_n)). Assume that the distribution somehow reduced the height of T[t_n], i.e. the distribution algorithm reduced A to A' = Σ_(i=1)^(n−1) t_i + t_n' (or Π_(i=1)^(n−1) (t_i) × (t_n')), where h[t_n] > h[t_n']. Also assume that h[A] = h[A']. Yet it is obvious that #(Sh[A]) ≤ #(Sh[A']) (i.e. le Sh[A] ≤ le Sh[A']), and also, for any t'', Sp[A, t''] ≤ Sp[A', t'']. That is, even if a distribution only reduces the tree height h[t_n] and does not reduce the tree height h[A], that distribution does not cause any bad effect on the later steps, when A appears as a term of a bigger expression, with respect to holes and spaces.

The arithmetic expression thus obtained may give the lowest tree height, i.e. the fewest number of computation time steps.

3.4.1 Distribution Algorithm

In the steps below we refer to the notation

    A = ... + t^k_(s,j−1) × Π over h of (t^k_(s,h)) + ...,

where t^k_(s,j−1) = Π_(l=1)^p a_l or empty; the superscript k is the parenthesis level of a term and the subscript s distinguishes the expressions at that level, so that a level-(k−1) expression A^(k−1)_s contains level-k terms t^k_(s,j), t^k_(s,j+1), ....

Step 1:

Go to an arithmetic expression enclosed in an innermost parenthesis pair which is not yet checked. Let this level be k − 1. In the above notation we are now working on, say, A_s^(k−1).

Step 2:

Obtain a set of holes for all t^k_(s,j) which are enclosed in the k-th parenthesis pairs and are components of A_s^(k−1), as well as their heights h and effective lengths e.

Step 3:

In this step, the (k−1)th parenthesis pair level A_s^(k−1) is examined.
87 



Substep 1: Hole filling (see Theorem 2)

Let

    A_s^(k−1) = ... + t^k_(s,j−1) × Π_(h=j)^(j+n) (t^k_(s,h)) + ...,

where t^k_(s,j−1) = Π_(l=1)^p a_l or empty. Also, without loss of generality, we assume that e[t^k_(s,l)] ≤ e[t^k_(s,l+1)] for l = j, j+1, ..., j+n−1.

(1) Find an occurrence of a form Π(t_j) or Πa_i × Π(t_j) in A_s^(k−1). If there is no such occurrence, then go to Substep 2. If an occurrence of a form Π(t_j) is found, then skip (2) and (3). Otherwise go to (2).

(2) Suppose we find t_s^k in A_s^(k−1) as an instance of Πa_i × Π(t_j). Fill holes in Sh[t^k_(s,h)] (h = j, j+1, ..., j+n) using the a_i's in t^k_(s,j−1). If there are many holes to be filled in, fill the smallest ones first, i.e. in order of increasing size. Reevaluate Sh by Lemma 6 for those t^k_(s,h) whose holes are filled.

(3) If the t^k_(s,h) (h = j, j+1, ..., j+n) do not have enough holes to accommodate all the a_i's, then go back to (1) to find another occurrence of a Π(t_j) or Πa_i × Π(t_j) form. Otherwise we work on Π_(h=j)^(j+n) (t'^k_(s,h)), which we get from Π_(l=1)^p a_l × Π_(h=j)^(j+n) (t^k_(s,h)) after (2).

(4) We start from h = j. Check if t'^k_(s,h) can fill in one of the holes in Sh[t'^k_(s,l)] (l = h+1, ..., j+n). If there are many holes which can accommodate t'^k_(s,h), fill them in order of increasing size. Continue the procedure until all t'^k_(s,h) (h = j, j+1, ..., j+n−1) are put in some holes or there is no hole to accommodate t'^k_(s,h). Go back to (1) to find another occurrence of a Π(t_j) or Πa_i × Π(t_j) form.

Substep 2: Space filling

After Substep 1, we again check A_s^(k−1), where all holes in the t^k_(s,j) have been filled in as much as possible by Substep 1.

(1) Let Ex = a[A_s^(k−1)] − e[A_s^(k−1)]/2 (see Lemma 9).

(2) Let

    A_s^(k−1) = ... + t^k_(s,j−1) × Π_(h=j)^(j+n−1) (t^k_(s,h)) × (t^k_(s,j+n)) + ...,

where t^k_(s,j−1) = Π_(l=1)^p a_l or empty. We also assume that e[t^k_(s,j+n)] is the largest among all e[t^k_(s,h)] (h = j, ..., j+n). Let t' = t^k_(s,j−1) × Π_(h=j)^(j+n−1) (t^k_(s,h)). If h[t'] ≤ h[t^k_(s,j+n)], then evaluate Sp[t^k_(s,j+n), t']. Otherwise leave it as it is.

(3) Repeat (2) for every occurrence of a form Π(t_j) (or Πa_i × Π(t_j)) in A_s^(k−1). Assume that there are m such occurrences. Arrange all Sp[t, t'] in order of decreasing size. For convenience we write Sp_1, Sp_2, ..., Sp_m (Sp_i ≥ Sp_(i+1)).

(4) If Σ_(i=1)^m Sp_i ≥ Ex, then let d be such that Σ_(i=1)^(d−1) Sp_i < Ex and Σ_(i=1)^d Sp_i ≥ Ex.

(5) Let t^k_(s,j−1) × Π_(h=j)^(j+n−1) (t^k_(s,h)) × (t^k_(s,j+n)) be the form which corresponds to Sp_i (i ≤ d). Then distribute t' over t^k_(s,j+n) and create the space Sp_i. Repeat the same procedure for all i = 1, 2, ..., d.

(6) In the case where enough space to accommodate Ex is not found (i.e. Σ_(i=1)^m Sp_i < Ex), a check is made against the component terms of the t^k (see Lemma 10). For example let

    t^k_(s,j−1) = a_1 a_2 ... a_p,   n = 1,
and t^k_(s,j) = b_1 b_2 ... b_q (t^(k+1)_(sf)) + Σ over i ≠ f of t^(k+1)_(si).

Then

    A_s^(k−1) = ... + t^k_(s,j−1) × (b_1 b_2 ... b_q (t^(k+1)_(sf)) + Σ over i ≠ f of t^(k+1)_(si)) + ...

The distribution is done if the sum of Sp_i and Sp[t^(k+1)_(sf), t^k_(s,j−1) × b_1 ... b_q] is greater than or equal to Ex. Here the distribution of a_1 ... a_p b_1 ... b_q over t^(k+1)_(sf) is to be made, as well as the distribution of a_1 ... a_p over Σ over i ≠ f of t^(k+1)_(si). This checking is to be made until enough space to accommodate Ex is found, or else until the innermost level of parenthesis pair is reached.



Step 4:

Mark A_s^(k−1) as checked.

Step 5:

If all levels are checked, then stop; otherwise go back to Step 1.



For example let us consider the following:

    A = ... + a_1 a_2 a_3 (t_1)(t_2)(t_3) + ...

Further assume that

    Sh[t_1] = {4},      e[t_1] = 16,
    Sh[t_2] = {16, 2},  e[t_2] = 64,
and Sh[t_3] = ∅,        e[t_3] = 64.

Then a_1 a_2 a_3 can be distributed over t_1, and in turn this whole thing can be distributed over t_2:

    ... + ((a_1 a_2 a_3 (t_1))^d t_2)^d × (t_3) + ...,

and we get h[t'] = 7, whereas h[t] = 8.



3.4.2 Implementation

A few words about the implementation of the algorithm described above are given, as well as the total number of checks required to process an arithmetic expression. Suppose we are given the arithmetic expression

    A = ... + (π24 + π1)(π10 + (π4 + π1)(π17 + π2) + π3) + ...
      = ... + (d_1 + d_2)(e_11 + (e_21 + e_22)(e_31 + e_32) + e_41) + ...
      = ... + (D)(E_1 + (E_2)(E_3) + E_4) + ...
      = ... + (D)(E) + ...,

where πi represents a product of i single variables. Then we build nested stacks as shown in Figure 3.13(a). Note that a new stack is created for each form Π(t_i) or Σ t_i.





[Figure 3.13. Stacks for an Arithmetic Expression: (a) the nested stacks, one per Π(t_i) or Σ t_i form, each entry tagged (×) or (+) and assigned a level number; (b) the values h, Sh, F_A (and F_R for sum stacks) recorded in each stack; (c) the revised stacks after E_2 is distributed over E_3.]



Each stack is assigned a level number (cf. Definition 3), where the first stack, which corresponds to A, receives the lowest level number (Figure 3.13(a)).

We start working from a stack with the largest level number, say m. For each stack t, where t = Σt_i or t = Π(t_i), h[t] is evaluated. Also, if a stack represents a form t = Σt_i, then Sh[t], F_A[t] and F_R[t, t_i] are evaluated. If a stack represents a form t = Π(t_i), then Sh[t] and F_A[t] are evaluated. These values are obtained by Definitions 1, 4 and 8. Note that this information is sufficient to evaluate Sp. Figure 3.13(b) gives an illustration.

Upon finding a form Π(t_i) (or Πa_i × Π(t_i)) (e.g. the stack N_1), we apply the distribution algorithm and decide if a distribution is to be made. If a stack represents a form t = Π(t_i), then Substep 1 of the distribution algorithm, i.e. hole filling, is tried. Otherwise a stack represents a form t = Σt_i, and Substep 2 of the distribution algorithm, i.e. space filling, is applied. In our example E_2 is distributed over E_3 (e[E_2] ≤ le Sh[E_3]). Then the stacks are revised as shown in Figure 3.13(c). Note that the stack N_1 is replaced by two new stacks N_2' and N_2'', and the stack E_2 disappears.

If all stacks with the level number k have been checked, then the stacks with the level number k−1 will be checked. In our example, the stacks E (or E', since it has been revised) and D are now checked.

The total number of checks required to process a whole arithmetic expression thus depends on the number of parenthesis occurrences in it. Assume that there are p parenthesis pairs in an arithmetic expression A. For each pair, space creation should be examined. Hence in total p space creation checks are required. Now for each Π(t_i) form, hole filling should be tried. The number of occurrences of a form Π(t_i) in A is obviously less than p. Hence the total number of checks required is less than 2p (i.e. of the order of p).

3.5 Discussion 

3.5.1 The Height of a Tree 

Given a tree for an arithmetic expression, the distribution 
algorithm tries to lower tree height by distribution if possible. However, 
in general it may not give the minimum tree height. For example let A = ac + ad + bc + bd, whose tree height is 3; since no further distribution is possible, the distribution algorithm yields the same value. There is, however, an equivalent expression A' = (a+b)(c+d), whose tree height is lower than 3, i.e. 2. That is, even though factorization lowers tree height sometimes, the distribution algorithm does not take care of it.

The question we ask now is how much the distribution algorithm lowers 
tree height. Before giving an answer to this question let us study tree height 
in more detail. Given an arithmetic expression, Theorem 1 gives the exact height 
of a tree obtained by Bovet and Baer's algorithm. It is also desirable if we can 
get an approximate tree height without actually building a tree for an arithmetic 
expression. Since the number of single variable occurrences (the number one less 
than this gives the number of operators in an arithmetic expression) and the 
depth of parenthesis nesting may well represent the complexity of arithmetic 
expressions, let us try to approximate tree height in terms of them. 

Let A be an arithmetic expression with n single variable occurrences 
and depth d of parenthesis nesting. Now build a tree for A by Bovet and Baer's 
algorithm. Then it can be proved that: 



Lemma 11:

    log₂ ⌈n⌉₂ ≤ h[A] ≤ n − 1.

Moreover, we can prove the following theorem.

Theorem 3:

    h[A] ≤ 1 + 2d + log₂ ⌈n⌉₂.

The following lemma is helpful in proving Theorem 3.



Lemma 12:

(1) 2a ≥ ⌈a⌉₂ for a ≥ 1,

(2) ⌈2a⌉₂ = 2⌈a⌉₂,

(3) log₂ ⌈Σ_(i=1)^p ⌈m_i⌉₂⌉₂ ≤ log₂ (2 ⌈Σ_(i=1)^p m_i⌉₂).

Proof:

(1) and (2) are obvious, and (3) can be proved by (1) and (2).

(Q.E.D.)



Proof of Theorem 3:

The proof is given by induction on d. First let us prove the theorem for d = 0. Then A has the following pattern:

    A = Σ_(i=1)^p Π_(j=1)^(q_i) a_j.

Then by Theorem 1,

    h[A] = log₂ ⌈Σ_(i=1)^p ⌈q_i⌉₂⌉₂
         ≤ log₂ (2 ⌈Σ_(i=1)^p q_i⌉₂)        (by Lemma 12(3))
         = 1 + log₂ ⌈n⌉₂.

Now assume that the theorem holds for d ≤ f.

Let t_j^i be an arithmetic expression with depth d_j^i (≤ f) of parenthesis nesting and n_j^i single variable occurrences. Then by assumption h[t_j^i] ≤ 1 + 2d_j^i + log₂ ⌈n_j^i⌉₂. Now an arithmetic expression with f + 1 levels of parenthesis nesting can be built from the t_j^i as follows:

    A = Σ over i of ( Π_(j=1)^(q_i) a_j × Π_(k=1)^(m_i) (t_k^i) ),

where the a_j are single variables and at least one of the t_k^i has f nested parentheses. Now each t_j^i can be replaced with a product of e[t_j^i] single variables without affecting the total tree height. Instead of using the value e[t_j^i], let us use the value 2 · 2^(2 d_j^i) · ⌈n_j^i⌉₂. (Note that h[t_j^i] ≤ 1 + 2d_j^i + log₂ ⌈n_j^i⌉₂ = log₂ (2 · 2^(2 d_j^i) · ⌈n_j^i⌉₂), and e[t_j^i] = 2^h[t_j^i] ≤ 2 · 2^(2 d_j^i) · ⌈n_j^i⌉₂.) Since d_j^i ≤ f, we have e[t_j^i] ≤ 2 · 2^(2f) ⌈n_j^i⌉₂. Now from Theorem 1,

    h[A] ≤ log₂ ⌈Σ over i of ( ⌈q_i⌉₂ + Σ over j of 2 · 2^(2f) ⌈n_j^i⌉₂ )⌉₂
         ≤ log₂ (2 ⌈Σ over i of ( q_i + Σ over j of 2 · 2^(2f) n_j^i )⌉₂)
         ≤ log₂ (2 · 2 · 2 · 2^(2f) ⌈Σ over i of ( q_i + Σ over j of n_j^i )⌉₂)
         = 1 + 2(f + 1) + log₂ ⌈n⌉₂.

Thus the theorem holds for d = f + 1, and this proves the theorem.

(Q.E.D.)

Now let us examine the original question, i.e. how effective is the algorithm presented in this chapter. Let A and A^d be arithmetic expressions where A^d is the resultant expression obtained from A after the application of the distribution algorithm. Now build trees for A and A^d by Bovet and Baer's algorithm. Then it should be clear that h[A] ≥ h[A^d]. Moreover, experience suggests that:

Conjecture:

    h[A^d] ≈ 2 log₂ ⌈n⌉₂,

where n is the number of single variable occurrences in the original arithmetic expression A.

Note that the distribution algorithm speeds up a Horner's rule polynomial in a logarithmic way. Also note that the distribution algorithm does no distributions in the case of certain expressions which take 2⌈log n⌉ steps as presented but which would take (n+1)⌈log n⌉ steps if fully distributed. Thus the algorithm can save a factor of n/2 steps over a scheme which would distribute indiscriminately, and in some cases it achieves a logarithmic speedup.



3.5.2 Introduction of Other Operators

3.5.2.1 Subtraction and Division

Subtractions can be introduced into an arithmetic expression without causing any effect on the distribution algorithm. It may be necessary to change operators to build a minimum height tree. For example let A = a + b − c + d. This will be computed as A = a + b − (c − d).

Divisions may require special treatment, since the distributive law does not hold in certain cases, e.g. (a+b+c)/d = a/d + b/d + c/d but a/(b+c+d) ≠ a/b + a/c + a/d. Hence, in general, minimization of the height of the trees for a numerator and a denominator is tried independently, and then distribution of a denominator over a numerator is tried if appropriate. Also let A = t/t'. Then T[A] is built from T[t] and T[t'] by joining them under a division node.

If h[t] ≠ h[t'], then we get nodes to which only one tree is attached, e.g. α and β if h[t] < h[t'], and α' and β' if h[t] > h[t']. Then α and β are treated as free nodes in T[A], while α' and β' are not treated as free nodes in T[A], because later, when another expression, say t'', is multiplied to A, t must be multiplied by t'', not t', i.e. t''(A) = t''(t/t') = (t''t)/t' ≠ t/(t't'').


3.5.2.2 Relational Operators

If an arithmetic expression A contains relational operators, e.g. B RO C where RO ∈ {>, <, =, ≥, ≤, ...}, trees can be built for B and C independently and joined under the relational operator. If h[B] ≠ h[C], then terms may be moved from one side to the other to balance the two trees. For example let A = a + b + c > d. Then we modify this as A' = a + b > d − c and get h[A'] = 2 while h[A] = 3.



4. COMPLETE PROGRAM HANDLING

Chapter 3 presented the algorithm which reduced tree height for a 
single arithmetic expression by distributing multiplications over additions 
properly. In this chapter we will discuss some ideas about how to handle 
complete programs, i.e. given one program, how can it be executed in the 
shortest time by building a tree as well as executing a statement in a for 
statement simultaneously for all index values. Ideas include back substitution. 
We do not have the solution to the problem, but this chapter presents some 
details of the problem and some ways to attack them. 

We conclude this chapter by comparing serial and parallel computations 
in terms of a generated error. It is shown that in general we could expect less 
error from parallel computation than serial computation. It is also shown that 
distribution would not increase the size of an error significantly. 

4.1 Back Substitution - A Block of Assignment Statements and an Iteration 

While the distribution algorithm in the previous chapter discusses 
tree height reduction for a single arithmetic expression, it can be used for 
any jump free block of assignment statements. If we define those variables which 
appear only on the right hand sides of assignment statements or in read statements 
in a block as inputs to the block, and those variables which appear only on the 
left hand sides of assignment statements or in write statements as outputs from 
the block, then we can rewrite the block with one assignment statement per 
output by substitution of assignment statements into one another. For example 






a := b + c; 

d := e x f; 

g := a + c; 

h := a + g x d 
can be rewritten as 

g := (b+c) + c; 

h := (b+c) + ((b+c) + c) x (e x f).
After such a reduction, only input variables appear on the right hand sides of assignment statements. At this point, the distribution algorithm could be applied to each remaining assignment statement and, if sufficient computer resources were available, all of the reduced assignment statements could be executed at once. In the above example, if each statement is computed in parallel (by building a tree) independently, then 5 steps are required, while if the back substitution is done, then the computation requires only 4 steps. Suppose we have assignment statements A_1, A_2, ..., A_n. Also suppose that by back substitution we can rewrite this block as A. We build minimum height trees for A_1, A_2, ..., A_n and A. Now we apply the distribution algorithm on those trees. Let the resultant tree heights be h[A_1], ..., h[A_n] and h[A]. Then obviously h[A_1] + ... + h[A_n] ≥ h[A], i.e. back substitution never increases the computation time (in the sense of tree height) (Figure 4.1).

Our main interest here is the case where strict inequality in the above relation holds, because h[A_1] + ... + h[A_n] > h[A] is equivalent to a speedup of the computation by back substitution. Note that back substitution amounts to symbol manipulation (i.e. replacement) and should not be confused with arithmetic simplification. For example, from



[Figure 4.1. A Back Substituted Tree: the separate trees of heights h[A_1], ..., h[A_n] and the single tree of height h[A].]



    a := x + y
    b := a + y

we get

    b := (x+y) + y
or  b := x + y + y,

but we do not get

    b := x + 2y.
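Back substitution in this sense is purely textual, as a minimal Python sketch (not from the thesis) makes plain; a block is a list of (left side, token list) pairs, and every use of an earlier output is replaced by that output's parenthesized right-hand side, with no arithmetic simplification:

    def back_substitute(block):
        defs = {}
        for lhs, rhs in block:
            out = []
            for tok in rhs:
                if tok in defs:
                    out += ['('] + defs[tok] + [')']   # replace, never simplify
                else:
                    out.append(tok)
            defs[lhs] = out
        return {v: ' '.join(toks) for v, toks in defs.items()}

    prog = [('a', ['b', '+', 'c']),
            ('d', ['e', 'x', 'f']),
            ('g', ['a', '+', 'c']),
            ('h', ['a', '+', 'g', 'x', 'd'])]
    print(back_substitute(prog)['h'])
    # ( b + c ) + ( ( b + c ) + c ) x ( e x f )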
Now we shall study this kind of speed up. 

We shall discuss a limited class of assignment statements, i.e. an 
iteration. This may serve to give some insight to the problem of speed up by 
back substitution in the general assignment statement case. 

By an iteration we mean a statement

    y_i = f(y_(i−1)).

Usually such a statement is executed repeatedly for i = 1, 2, ..., n. An example is:

    for I := 1 step 1 until 10 do
        A[I] := A[I-1] + A[I];


Also a block of assignment statements such as:

    S_1: a := h + i + j;
    S_2: b := a + k + m;
    S_3: c := b + n + p;
    S_4: d := c + q + r;

falls into this category (note that all statements have the form

    output of S_i := output of S_(i−1) + x + y,

where x and y are pure inputs in the sense that they do not appear as outputs).
Assume that we are only interested in the value of y_n (the other results, i.e. y_(n−1), y_(n−2), ..., y_1, may be obtained similarly to y_n but in less time). Then instead of n statements, i.e. y_1 = f(..), y_2 = f(..), ..., y_n = f(..), we may obtain one statement for y_n by back substitution. For example, let y_i = a_(i−1) y_(i−1). Then

    y_n = a_(n−1) y_(n−1) = a_(n−1)(a_(n−2) y_(n−2)) = a_(n−1)(a_(n−2)(a_(n−3) y_(n−3))) = ... = y_0 Π_(k=0)^(n−1) a_k.

We use the superscript "b" to distinguish the back substituted form from the iteration form, e.g. y_n = a_(n−1) y_(n−1) and y_n^b = y_0 Π_(k=0)^(n−1) a_k. Then instead of computing each y_i repeatedly for i = 1, 2, ..., n, y_n^b may be computed directly. In the above example y_i can be computed in one step, and to get y_n, n steps are required, while y_n^b can be computed in ⌈log₂(n+1)⌉ steps in parallel (i.e. by building a tree for y_n^b). The following table summarizes the results for some primitive yet typical iteration formulas.



    y_i                      y_n^b                            T_s   nT_s   T_p

    a y_{i-1}                a^n y_0                           1     n     ⌈log_2(n+1)⌉
    y_{i-1} + b              y_0 + b + ... + b                 1     n     ⌈log_2(n+1)⌉
    a_{i-1} y_{i-1}          y_0 ∏_{k=0}^{n-1} a_k             1     n     ⌈log_2(n+1)⌉
    y_{i-1} + a_{i-1}        Σ_{k=0}^{n-1} a_k + y_0           1     n     ⌈log_2 n⌉
    a y_{i-1} + b            a^n y_0 + p'_n(a) *               2     2n    ≈ 2⌈log_2(n+1)⌉
    a y_{i-1} + x_{i-1}      p''_n(a) **                       2     2n    ≈ 2⌈log_2(n+1)⌉
    y_{i-1} + b x_{i-1}      Σ_{k=0}^{n-1} b x_k + y_0         2     2n    ≈ ⌈log_2 n⌉ + 1
    a y_{i-1} + b x_{i-1}    p'''_n(a) ***                     3     3n    ≈ 2⌈log_2(n+1)⌉

      *   p'_n(a)   = b a^{n-1} + b a^{n-2} + ... + b
      **  p''_n(a)  = a^n y_0 + a^{n-1} x_0 + a^{n-2} x_1 + ... + a x_{n-2} + x_{n-1}
      *** p'''_n(a) = a^n y_0 + b a^{n-1} x_0 + b a^{n-2} x_1 + ... + b a x_{n-2} + b x_{n-1}

    T_s: the time required to compute y_i in parallel, i.e. h[y_i].
    T_p: the time required to compute y_n^b in parallel, i.e. h[y_n^b].

    Table 4.1. Comparison of Back Substituted Computation, y_n^b, and
    Non-Back Substituted Computation, y_i: Iteration Formulas
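As a concrete illustration of the first-order case y_i = a y_{i-1} + b, the sketch
below (illustrative Python; the thesis works in ALGOL 60 notation, and the step
counts printed simply mirror Table 4.1) compares the n-step iteration with the
back substituted form a^n y_0 + p'_n(a), whose n+1 terms can be summed by a tree
of logarithmic height:

    from math import ceil, log2

    def iterate(a, b, y0, n):
        # n sequential steps of y_i = a*y_{i-1} + b  (time n*T_s, T_s = 2)
        y = y0
        for _ in range(n):
            y = a * y + b
        return y

    def back_substituted(a, b, y0, n):
        # y_n^b = a^n y_0 + (b a^{n-1} + ... + b); a minimum height tree
        # evaluates the n+1 terms in about 2*ceil(log2(n+1)) steps
        terms = [y0 * a**n] + [b * a**k for k in range(n)]
        return sum(terms)

    a, b, y0, n = 1.5, 2.0, 1.0, 10
    assert iterate(a, b, y0, n) == back_substituted(a, b, y0, n)
    print(2 * n, "serial steps vs about", 2 * ceil(log2(n + 1)), "parallel steps")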






From Table 4.1, the following lemma is obtained by exhaustion.

Lemma 1:

Let y_i = f(y_{i-1}) be linear in y_{i-1}, where we assume that in the
presented form additions are reduced to multiplications as much as possible,
e.g. y_i = 2y_{i-1} instead of y_i = y_{i-1} + y_{i-1}. Then n × h[y_i] ≥ h[y_n^b].

Thus if we have enough PE's, then instead of computing each y_i
repeatedly for i = 1, 2, ..., n, we should obtain y_n^b by back substitution and
compute it by building a minimum height tree.

If an iteration y_i = f(y_{i-1}) is not linear in y_{i-1}, e.g. y_i = a y_{i-1}^2
+ b y_{i-1} + c, or if it is linear in y_{i-1} but there are some additions not
reduced to multiplications, e.g. y_i = y_{i-1} + a y_{i-1}, then it is not clear whether
back substitution speeds it up. For example, back substitution does not speed
up the computation of y_i = y_{i-1} + y_{i-1} + y_{i-1} + y_{i-1}. Also let y_i = f(y_{i-1})
be a polynomial in y_{i-1} where in the presented form additions are again reduced
to multiplications as much as possible. Then it is not likely that we can
speed up the computation by back substitution. Let

    f(y_{i-1}) = a_m y_{i-1}^m + ...

where y_{i-1}^m is the highest power of y_{i-1} among those which appear in f(y_{i-1}).
Note that f(y_{i-1}) is not necessarily a dense polynomial (a polynomial in which
all powers of y_{i-1}, i.e. y_{i-1}, y_{i-1}^2, ..., y_{i-1}^m, appear). While the exact height
of T[f(y_{i-1})] depends on f(y_{i-1}), we may content ourselves with (see Chapter 2)


    h[f(y_{i-1})] ≈ 2⌈log_2 m⌉.

Hence 2n⌈log_2 m⌉ steps are required to compute y_n.

Now let us consider y_n^b. Then

    y_n^b = a_m (y_{n-1}^b)^m + ...
          = a_m (a_m (y_{n-2}^b)^m + ...)^m + ...
          = a_m (a_m (... (a_m y_0^m) ...)^m)^m + ...
          = a_m^{1+m+...+m^{n-1}} y_0^{m^n} + ... .

That is, y_n^b becomes a polynomial in y_0 of degree m^n. Leaving the computation
of the coefficients out of consideration, we have (see Chapter 2)

    h[y_n^b] ≈ 2⌈log_2 m^n⌉ ≈ 2n⌈log_2 m⌉.

Hence back substitution does not help to speed up the computation significantly
in this case.

To gain a better understanding of more general cases, let us study
the situation from a different point of view. Given an iteration y_i = f(y_{i-1}),
let us consider the number of single variable occurrences in y_n^b as a measure of
the complexity. We study two cases separately, i.e. (i) y_{i-1} appears only once
in y_i and (ii) y_{i-1} appears k times in y_i. In both cases we assume that there
are m single variable occurrences (including the occurrences of y_{i-1}) in y_i.
For convenience we write N(y) for the number of single variable occurrences in
y, e.g. N(a+b+cd+ae) = 6.

(1) y_{i-1} appears only once in y_i. In this case we have

    N(y_1^b) = m
    N(y_2^b) = N(y_1^b) + m - 1 = 2m - 1
    N(y_3^b) = N(y_2^b) + m - 1 = 3m - 2
    ...
    N(y_n^b) = N(y_{n-1}^b) + m - 1 = mn - n + 1 ≈ mn.

(2) y_{i-1} appears k times in y_i. In this case we have

    N(y_1^b) = m
    N(y_2^b) = k·N(y_1^b) + m - k = m + k(m-1)
    N(y_3^b) = k·N(y_2^b) + m - k = m + (k + k^2)(m-1)
    ...
    N(y_n^b) = k·N(y_{n-1}^b) + m - k
             = m + (k + ... + k^{n-1})(m-1)
             = 1 + ((k^n - 1)/(k - 1))(m-1).

If k^n >> 1 and m >> 1, then N(y_n^b) ≈ k^{n-1} m.

Now if we use 2⌈log_2 N(y)⌉ as a measure of the height of a tree, then
we have (see Section 3.5.1 of Chapter 3):



         h[y_i]         n × h[y_i]       h[y_n^b]

    (1)  2⌈log_2 m⌉     2n⌈log_2 m⌉      2⌈log_2(mn)⌉ ≈ 2(⌈log_2 m⌉ + ⌈log_2 n⌉)
    (2)  2⌈log_2 m⌉     2n⌈log_2 m⌉      2⌈log_2(k^{n-1} m)⌉ ≈ 2((n-1)⌈log_2 k⌉ + ⌈log_2 m⌉)

    Table 4.2. Comparison of Back Substituted Computation, y_n^b, and
    Non-Back Substituted Computation, y_i: General Cases

For example let m = 5, k = 2 and n = 20 in (2). Then we have

    n · h[y_i] = 40⌈log_2 5⌉ = 120

and

    h[y_n^b] = 2(19⌈log_2 2⌉ + ⌈log_2 5⌉) = 44.

Also if we let m = 5 and n = 20 in (1), then we get

    n · h[y_i] = 40⌈log_2 5⌉ = 120

and

    h[y_n^b] = 2(⌈log_2 5⌉ + ⌈log_2 20⌉) = 16.

Now a few comments about implementation are in order. As for back
substitution of a block of assignment statements, the step by step substitution
is the only possible scheme. In the case of an iteration formula, we may use the
z-transformation technique to obtain y_n^b [8]. For example let y_i = y_{i-1} + x_i.
Then by applying the z-transformation to it, we get Y(z) = zY(z) + X(z), or

    Y(z) = X(z)/(1 - z).

Hence

    Y(z) = X(z)(1 + z + z^2 + z^3 + ...)
         = (x_0 + x_1 z + x_2 z^2 + ...)(1 + z + z^2 + ...)
         = x_0 + (x_1 + x_0) z + (x_2 + x_1 + x_0) z^2 + ...

or

    y_i = Σ_{k=0}^{i} x_k.
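The closed form y_i = Σ_{k=0}^{i} x_k is exactly a prefix sum, and all of the y_i
can be obtained together in ⌈log_2 n⌉ doubling passes. A sketch (illustrative
Python; the recursive-doubling scheme shown is a standard technique, not taken
from the thesis):

    def prefix_sums(x):
        # Recursive doubling: after the pass with offset d, y[i] holds the
        # sum of the (up to 2d) trailing elements x[i-2d+1..i].  On a machine
        # with n PEs each pass is one parallel addition, so all y_i are ready
        # after ceil(log2(n)) steps.
        y = list(x)
        d = 1
        while d < len(y):
            y = [y[i] + y[i - d] if i >= d else y[i] for i in range(len(y))]
            d *= 2
        return y

    print(prefix_sums([1, 2, 3, 4, 5]))    # [1, 3, 6, 10, 15]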

Two other related problems become evident in the example presented
above. First is algebraic simplification. For example, a := b + 4c could be
executed more quickly than a := b + c + c + c + c. We shall not discuss this subject
further here. A second problem is the discovery of common subexpressions. In
our example, (b+c) appears twice in the right hand side of h.

If we had an algorithm, e.g. [11], which discovered common sub-
expressions in one (or more) tree which could be simultaneously evaluated, the
number of PE's required could be reduced by evaluating the common subexpressions
once for all occurrences. On the other hand, by removing common subexpressions
the execution time (the height of a tree) may be increased in some cases. For
example, if we have x := a(b+c+de) and y := f(g+c+de), then we might try to
replace c + de in x and y by z as follows to save the number of PE's required:

    z := c + de;
    x := a(b + z);
    y := f(g + z).

However, note that x = a(b+z) or y = f(g+z) takes 4 steps while the original x
and y require only 3 steps, i.e. h[a(b+c+de)] = 3 and h[a(b+z)] = 4. Thus an
overall strategy must be developed for the use of a common subexpression
discovery algorithm in conjunction with overall tree height reduction.
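The height comparison can be checked mechanically. A small sketch (illustrative
Python; the tuple encoding of expression trees is ours) reproduces the counts
h[a(b+c+de)] = 3 and h[a(b+z)] = 4 once the cost of computing z is charged:

    # Tree height of an expression given as nested binary operations:
    # a leaf is a string, an interior node is (op, left, right).
    def height(e):
        if isinstance(e, str):
            return 0
        _, l, r = e
        return 1 + max(height(l), height(r))

    # x := a*(b+c+d*e) balanced as a*((b+c)+(d*e))  -> height 3
    x = ('*', 'a', ('+', ('+', 'b', 'c'), ('*', 'd', 'e')))
    # with z := c+d*e computed first, x := a*(b+z) costs height(z)+2 -> 4
    z = ('+', 'c', ('*', 'd', 'e'))
    print(height(x), height(z) + height(('*', 'a', ('+', 'b', 'z'))))  # 3 4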






4.2 Loops

This section is included here to complete this chapter, and discusses
the subject only superficially. Details will be presented in the following chapters.

Consider the following example.

    E1: for I := 1 step 1 until 10 do
            for J := 1 step 1 until 10 do
        S3:     A[I,J] := A[I,J-1] + B[J];

In this case ten statements, A[1,J] := A[1,J-1] + B[J], A[2,J] := A[2,J-1] +
B[J], ..., A[10,J] := A[10,J-1] + B[J], can be computed simultaneously while J
takes the values 1,2,...,10 sequentially. We say that S3 can be computed in parallel
with respect to I. Note that originally the computation of E1 takes 100 steps.
(One step corresponds to the computation of S3, i.e. an addition. For the sake of
brevity we only take arithmetic operations into account and shall not concern
ourselves with, e.g., the operations involved in indexing.) By computing S3
simultaneously for all values of I (I = 1,2,...,10) the computation time can be
reduced to 10 steps. Finally, by building a minimum height tree for

    A^b[I,10] (:= A[I,0] + Σ_{J=1}^{10} B[J])

for each I (I = 1,2,...,10), we can compute all ten trees simultaneously in
4 steps.
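The three regimes (100, 10, and 4 steps) are easy to see in a small simulation.
The sketch below (illustrative Python; the "parallel with respect to I" version
is modeled by whole-column vector operations, and the step counts follow the
conventions stated above) computes E1 both ways and checks agreement:

    import numpy as np

    B = np.arange(1, 11, dtype=float)
    A = np.zeros((11, 11))           # column 0 holds the initial values A[I,0]

    # Serial: 10 x 10 = 100 additions, one at a time.
    A1 = A.copy()
    for I in range(1, 11):
        for J in range(1, 11):
            A1[I, J] = A1[I, J-1] + B[J-1]

    # Parallel w.r.t. I: one vector addition per J -> 10 steps.
    A2 = A.copy()
    for J in range(1, 11):
        A2[1:, J] = A2[1:, J-1] + B[J-1]

    assert np.allclose(A1, A2)
    # Back substituted: A[I,10] = A[I,0] + sum(B); a minimum height tree
    # over the 11 operands has height ceil(log2(11)) = 4 steps.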

To help understanding, let us further consider

    L: for I := 1 step 1 until N_1 do
           for J := 1 step 1 until N_2 do S;

Then Figure 4.2(a) shows the execution of L as it is presented. The total
computation time required (T_0) is N_1 × N_2 × m, where we assume that m arithmetic
operations are in S. Now suppose S can be computed in parallel with respect to
I (Figure 4.2(b)). In Figure 4.2(b), each box has the form shown in Figure
4.2(c). Here S is computed sequentially, i.e. T_1 = m N_2. Now let us
compute S in parallel, i.e. by building a tree (Figure 4.2(d)). Then we have
T_2 = N_2 h[S]. Note that m ≥ h[S]. Further, if we back substitute S for J =
1,2,...,N_2 and get S^b (Figure 4.2(e)), we have T_3 = h[S^b]. As stated before (Section 4.1),
N_2 h[S] ≥ h[S^b], so T_0 ≥ T_1 ≥ T_2 ≥ T_3.

In general we have

    L: for I_1 := 1 step 1 until N_1 do
         for I_2 := 1 step 1 until N_2 do
           ...
             for I_n := 1 step 1 until N_n do S;

where S is an assignment statement. Then the computation of L takes

    T = (∏_{j=1}^{n} N_j) m

steps as it is presented, where we assume that m arithmetic operations
are involved in the computation of S. If S can be computed for all values of
I_k (= 1,2,...,N_k) simultaneously, then the computation time can be reduced to

    T_k = (∏_{j≠k} N_j) m

steps, i.e. N_k statements can be computed simultaneously
while I_1, ..., I_{k-1}, I_{k+1}, ..., I_n change sequentially. In general there are n
possibilities, i.e. we examine whether S can be computed in parallel with respect
to I_k for k = 1,2,...,n. Let P = {k | S can be computed in parallel with respect
to I_k}. Then we would compute S in parallel with respect to I_q where T_q =
min_{k∈P} T_k. Clearly each statement of the resultant N_q statements can be computed
by building a tree. Further, if it is appropriate, we perform back substitution
and obtain a big tree as the above example (E1) illustrates.

    Figure 4.2. Loop Analysis

If a loop is a limit loop which terminates when e < δ for some pre-
determined δ and computed e, it may be approximated by a counting loop (e.g.
for I := 1 step 1 until N do) which is executed a fixed number of times before
the test is made, and then repeated if the test fails.

4.3 Jumps

Consider a program containing n two-way forward jump statements (or
if statements). Let the tests for the jumps be Boolean expressions B_1, B_2, ..., B_n.
Assume that there are m output variables from the program given as expressions
A_1, A_2, ..., A_m, where parts of A_i may depend on the B_j's. In a program when B_j is
encountered, one of two choices is taken depending on the value of B_j. It is
possible to start computing all of these possible alternatives at the earliest
time, and choose the proper results as soon as the values of the B_j's become available.
For example

    a := g + c;
    B := a > 0;
    if B then d := e + f + s else d := a + g + t;
    h := d + f + i × j × k × p × q;

yield

    h := B × (e+f+s) + (not B) × ((g+c)+g+t) + f + i × j × k × p × q
or 



114 



h := ((g+c) > 0) x (e+f+s) + ((g+c) < 0) x ((g+c)+g+t) + f + i x j 
x k x p x q, 
where we let B = 1 for true , B = otherwise. Then we may build a tree for h 
as follows . 




gcefsg eg t f ijkp q 
Figure k.J. A Tree with a Boolean Expression 

The box \B • produces or 1 depending on the value of (g+c>0). 

In general, Boolean expressions can be embedded in arithmetic 
expressions as shown in the above example, and a minimum height tree can be 
built for it. 
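In modern terms this is a branch-free select: both arms are evaluated
speculatively and the Boolean, coerced to 0 or 1, picks the result. A sketch
(illustrative Python; the variable names follow the example above):

    def h_branchfree(g, c, e, f, s, t, i, j, k, p, q):
        # evaluate both alternatives for d, then select with the 0/1 test
        B = int((g + c) > 0)             # 1 for true, 0 otherwise
        d = B * (e + f + s) + (1 - B) * ((g + c) + g + t)
        return d + f + i * j * k * p * q

    def h_branching(g, c, e, f, s, t, i, j, k, p, q):
        a = g + c
        d = (e + f + s) if a > 0 else (a + g + t)
        return d + f + i * j * k * p * q

    args = (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 1.0, 2.0, 1.0, 2.0, 1.0)
    assert h_branchfree(*args) == h_branching(*args)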

4.4 Error Analysis

In this section parallel and serial computation are compared in terms
of error. We are only concerned with a generated error, i.e. an error which is
introduced as a result of arithmetic operations. It is shown that in general
parallel computation would produce less error than serial computation. It
is also shown that distribution would not increase the size of an error
significantly. Let ω represent any arithmetic operation. In general, we do
not perform the operation ω exactly but rather a pseudo-operation (ω). Hence
instead of obtaining the result x ω y, we obtain a result x (ω) y. We may write

    x (ω) y = (x ω y)(1 + ε_ω)                                    (1)

where ε_ω represents an error introduced by performing a pseudo-operation. For
example, we have

    x (+) y = (x + y)(1 + ε_a)

and

    x (×) y = (xy)(1 + ε_m).

Let us write A* for an approximation to an arithmetic expression A with an
error, obtained by computing A using pseudo-operations, e.g. (×) or (+). Then (1)
can also be written as

    (x ω y)* = (x ω y)(1 + ε_ω).

Now let us consider the computation

    A = Σ_{i=1}^{N} a_i.

First we compute A serially, i.e.

    A = (...(((a_1 + a_2) + a_3) + a_4) + ... + a_N).

Then we have*

    A_2* = a_1 (+) a_2 = (a_1 + a_2)(1 + ε_a)
         = a_1 + a_2 + ε_a(a_1 + a_2),

    A_3* = A_2* (+) a_3 = (a_1 + a_2 + ε_a(a_1 + a_2) + a_3)(1 + ε_a)
         = a_1 + a_2 + a_3 + ε_a(2a_1 + 2a_2 + a_3).

* Here higher order terms of ε_a are neglected.



    A_4* = A_3* (+) a_4 = a_1 + a_2 + a_3 + a_4 + ε_a(3a_1 + 3a_2 + 2a_3 + a_4),
    ...
    A_N* = A_{N-1}* (+) a_N = Σ_{i=1}^{N} a_i + ε_a((N-1)a_1 + (N-1)a_2 + (N-2)a_3 + ... + a_N).

We let

    E_a^s = ε_a((N-1)a_1 + (N-1)a_2 + (N-2)a_3 + ... + a_N).

Next let us compute A in parallel,
i.e. by building a tree:



Without loss of generality we assume that N is a power of 2. Then

    A_{1-2}* = a_1 (+) a_2 = a_1 + a_2 + ε_a(a_1 + a_2)
    A_{1-4}* = A_{1-2}* (+) A_{3-4}* = a_1 + a_2 + a_3 + a_4 + 2ε_a(a_1 + a_2 + a_3 + a_4)
    A_{1-8}* = A_{1-4}* (+) A_{5-8}* = a_1 + a_2 + ... + a_8 + 3ε_a(a_1 + a_2 + ... + a_8)
    ...
    A_{1-N}* = A_{1-N/2}* (+) A_{N/2+1-N}* = Σ_{i=1}^{N} a_i + ⌈log_2 N⌉ ε_a Σ_{i=1}^{N} a_i.

We let

    E_a^p = ⌈log_2 N⌉ ε_a Σ_{i=1}^{N} a_i.

To compare E_a^s with E_a^p, let a_1 = a_2 = ... = a_N = a.
Then we get

    E_a^s ≈ (N^2/2) a ε_a

and

    E_a^p = ⌈log_2 N⌉ N a ε_a,

or

    E_a^s > E_a^p.

An error for B = ∏_{i=1}^{N} b_i can be analyzed in a similar manner. In this case we
have

    E_m^s = E_m^p = (N-1) ε_m ∏_{i=1}^{N} b_i.

Hence, in general, we could expect that parallel computation produces less
error than serial computation.
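The effect is easy to observe numerically. Below is a hedged illustration
(Python; single precision arithmetic stands in for the pseudo-operations, with
an exact double precision reference): serial left-to-right summation typically
accumulates far more error than pairwise (tree) summation, in line with the
N versus log_2 N growth above:

    import numpy as np

    def serial_sum(x):
        s = np.float32(0.0)
        for v in x:
            s = np.float32(s + v)          # one pseudo-addition per element
        return s

    def tree_sum(x):
        x = list(x)
        while len(x) > 1:                  # one level of the tree per pass
            if len(x) % 2:
                x.append(np.float32(0.0))
            x = [np.float32(x[i] + x[i+1]) for i in range(0, len(x), 2)]
        return x[0]

    rng = np.random.default_rng(0)
    a = rng.random(1 << 16).astype(np.float32)
    exact = a.astype(np.float64).sum()
    print(abs(serial_sum(a) - exact), abs(tree_sum(a) - exact))
    # the tree sum error is typically much smaller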

* Note that if higher order terms of ε_a and ε_m are neglected, then A* can be
written as

    A + ε_a E_a(A) + ε_m E_m(A)

where E_a(A) and E_m(A) are arithmetic expressions consisting of variables in A.

For example, if we compute A = a(bc+d) serially (i.e. A = a((bc)+d)) then we
get

    A* = (a × ((b × c)(1 + ε_m) + d)(1 + ε_a))(1 + ε_m)
       ≈ a(bc+d) + ε_a a(bc+d) + ε_m(2abc + ad),

and E_a(A) = a(bc+d)

and E_m(A) = 2abc + ad.

Usually E_a(A) and E_m(A) depend on how A is computed, as we have shown for
A = Σ a_i.

Now let us compare the parallel computation of two arithmetic expressions
A and A^d, where A^d is the resultant expression obtained by applying the
distribution algorithm on A, in terms of a generated error. Note that we can
write

    A* = A + ε_a × E_a(A) + ε_m × E_m(A)

and

    A^d* = A^d + ε_a × E_a(A^d) + ε_m × E_m(A^d)
         = A + ε_a × E_a(A^d) + ε_m × E_m(A^d).

As an example let us study A = a(bc+d) + e and A^d = abc + ad + e.

    Figure 4.4. Trees for (a) A = a(bc+d) + e and (b) A^d = abc + ad + e


Then we have

    A* = (a(bc(1+ε_m) + d)(1+ε_a)(1+ε_m) + e)(1+ε_a)
       = (a(bc+d) + e) + ε_a(2abc + 2ad + e) + ε_m(2abc + ad)

and

    A^d* = (abc(1+ε_m)^2 + (ad(1+ε_m) + e)(1+ε_a))(1+ε_a)
         = (abc + ad + e) + ε_a(abc + 2ad + 2e) + ε_m(2abc + ad).

Note that E_m(A) = E_m(A^d) in the above example, which is not mere chance. We
can show that this holds for all cases.



Lemma 2:

    E_m(A) = E_m(A^d).

Proof:

First let us consider

    t* = t_1* (+) t_2* (+) ... (+) t_n*

where

    t_i* = t_i + ε_a E_a(t_i) + ε_m E_m(t_i).

Then clearly

    E_m(t) = Σ_{i=1}^{n} E_m(t_i)

regardless of the order of the additions, whereas E_a(t) depends on the order of
the additions. Hence we may write

    t* = Σ_{i=1}^{n} t_i + ε_a E_a^Σ(t) + ε_m Σ_{i=1}^{n} E_m(t_i)

where the superscript Σ indicates that E_a(t) depends on the order of the additions.

Now let us consider

    A* = t* (×) (t_1* (+) t_2* (+) ... (+) t_n*)

and

    A^d* = t* (×) t_1* (+) t* (×) t_2* (+) ... (+) t* (×) t_n*

where

    t* = t + ε_a E_a(t) + ε_m E_m(t)

and

    t_i* = t_i + ε_a E_a(t_i) + ε_m E_m(t_i).

Then we have

    A* = (t + ε_a E_a(t) + ε_m E_m(t)) × (Σ_{i=1}^{n} t_i + ε_a E_a^Σ(t) + ε_m Σ_{i=1}^{n} E_m(t_i))(1 + ε_m)

       = t Σ_{i=1}^{n} t_i + ε_a(Σ_{i=1}^{n} t_i E_a(t) + t E_a^Σ(t)) + ε_m(Σ_{i=1}^{n} (t E_m(t_i) + t_i E_m(t) + t t_i))

and

    A^d* = ((t + ε_a E_a(t) + ε_m E_m(t))(t_1 + ε_a E_a(t_1) + ε_m E_m(t_1))(1 + ε_m)) (+) ...

         = (t t_1 + ε_a(t E_a(t_1) + t_1 E_a(t)) + ε_m(t E_m(t_1) + t_1 E_m(t) + t t_1)) (+) ...

         = t Σ_{i=1}^{n} t_i + ε_a E_a^Σ(A^d) + ε_m(Σ_{i=1}^{n} (t E_m(t_i) + t_i E_m(t) + t t_i)).

Hence

    E_m(A) = E_m(A^d).                                            (Q.E.D.)



As for E_a(A) and E_a(A^d), they depend on the order of the additions and cannot be
compared simply. However, they may not differ significantly. As a simplified
case, let us study the following:

    A = t(Σ_{i=1}^{N} a_i)

and

    A^d = Σ_{i=1}^{N} (t a_i).

Again we assume that N is a power of two. Then to compute A, we first compute
Σ_{i=1}^{N} a_i in parallel. As we showed before,

    (Σ_{i=1}^{N} a_i)* = Σ_{i=1}^{N} a_i + ⌈log_2 N⌉ ε_a Σ_{i=1}^{N} a_i.

Hence

    A* = t(Σ_{i=1}^{N} a_i + ⌈log_2 N⌉ ε_a Σ_{i=1}^{N} a_i)(1 + ε_m)
       = t Σ_{i=1}^{N} a_i + ε_a ⌈log_2 N⌉ t Σ_{i=1}^{N} a_i + ε_m t Σ_{i=1}^{N} a_i.

On the other hand, we have

    A_i* = (t a_i)* = t a_i + ε_m t a_i

and A^d* is obtained by summing the A_i* in parallel, i.e.

    A^d* = (...(((A_1* (+) A_2*) (+) (A_3* (+) A_4*)) (+) ((A_5* (+) A_6*) (+) (A_7* (+) A_8*)))...)

         = t Σ_{i=1}^{N} a_i + ε_a ⌈log_2 N⌉ t Σ_{i=1}^{N} a_i + ε_m t Σ_{i=1}^{N} a_i.

Hence in this case E_a(A) = E_a(A^d) as well as E_m(A) = E_m(A^d).






5. PARALLELISM BETWEEN STATEMENTS

This chapter should be read as an introduction to the following
chapter which discusses loops in a program. In this chapter we study
parallelism between statements, i.e. inter-statement parallelism. Given a
loop and jump free sequence of statements (we call this a program), it is
expected that the statements are executed according to the given (i.e. presented) order.
However if two statements do not depend on each other, they may be executed
simultaneously in hopes of reducing the total computation time. In general,
statements in a program may be executed in an order other than the given
order as long as they produce the same results as execution in accordance
with the given sequence. In this chapter we give
an algorithm which checks whether the execution of the statements of a program by some
sequence gives the same results as execution by the given
sequence. Also a technique which exploits more parallelism between
statements by introducing temporary locations is introduced.

5.1 Program

A program P with a memory M is a sequence of assignment statements
S(i), i.e. P = (S(1); S(2); ...; S(i); ...; S(r)) where i is a statement
number and r is the length of the program P (we write r = lg(P)). The memory
M is the set of all variables (or identifiers) which appear in P.

Associated with each S(i) is a set of input variables, IN(S(i)), and
an output variable, OUT(S(i)). Then M = ∪_{i=1}^{r} (IN(S(i)) ∪ OUT(S(i))). Further






we define two regions in the memory: a primary input region M_I and a final
output region M_O, as

    M_I = {m | m ∈ IN(S(i)) and ∀k < i, m ∉ OUT(S(k))}

and

    M_O = {m | m ∈ OUT(S(i))}.

A program uses the values of the variables in M_I as primary input data and
puts final results into M_O.

C(m) refers to the content (value) of a variable m. C(M) refers to
the contents of the variables in the memory M as a whole and is called a config-
uration of M. Also C_I(m) refers to the value which m has before a computation
(i.e. an initial value of m). Thus C_I(M_I) refers to the primary input data given
to a program. We call it an initial configuration.

The following relations are established among the statements in P.
A triple (id, i, j), where id ∈ M (id for an identifier) and i, j ∈
{0,1,...,r,r+1} (r = lg(P)), is in the dependence relation DR(P) if and only if:

    (1) (i)   i < j and
        (ii)  id ∈ OUT(S(i)) and id ∈ IN(S(j)) and
        (iii) ∀k, i < k < j, id ∉ OUT(S(k)),

or  (2) (i)   i = 0 and
        (ii)  id ∈ IN(S(j)) and
        (iii) ∀k, 0 < k < j, id ∉ OUT(S(k))
        (S(j) is the first statement to use id),

or  (3) (i)   j = r + 1 and
        (ii)  id ∈ OUT(S(i)) and
        (iii) ∀k, i < k < r + 1, id ∉ OUT(S(k))
        (S(i) is the last statement to update id).

Similarly a triple (id, i, j) is in the locking relation LR(P) if and
only if:

        (i)   i < j and
        (ii)  id ∈ IN(S(i)) and id ∈ OUT(S(j)) and
        (iii) ∀k, i < k < j, id ∉ OUT(S(k)).

Example 1 (the notation follows ALGOL 60 [3]):

Let P be

    S(1): a := b + c;
    S(2): d := a + e;
    S(3): f := g + d;
    S(4): g := h + i.

Then

    DR(P) = {(b,0,1), (c,0,1), (e,0,2), (g,0,3), (h,0,4), (i,0,4),
             (a,1,2), (d,2,3), (a,1,5), (d,2,5), (f,3,5), (g,4,5)}

and

    LR(P) = {(g,3,4)}.
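These relations are mechanical to compute from the definitions. A sketch
(illustrative Python; statements are given as (output, set-of-inputs) pairs,
1-indexed as in the text, and the i = 0 case keeps only the first use of each
identifier, per the annotation in case (2)):

    def relations(stmts):
        # stmts[i-1] = (out, inputs) for S(i); returns (DR, LR) per the text
        r = len(stmts)
        DR, LR = set(), set()
        out = lambda i: stmts[i - 1][0]
        ins = lambda i: stmts[i - 1][1]
        first = {}
        for j in range(1, r + 1):
            for idf in ins(j):
                # cases (1)/(2): the last writer of idf before j, or 0 if none
                i = max((k for k in range(1, j) if out(k) == idf), default=0)
                if i == 0:
                    first.setdefault(idf, j)   # (id,0,j): first use only
                    if first[idf] != j:
                        continue
                DR.add((idf, i, j))
        for i in range(1, r + 1):              # case (3): last update of id
            if all(out(k) != out(i) for k in range(i + 1, r + 1)):
                DR.add((out(i), i, r + 1))
        for i in range(1, r + 1):              # locking relation
            for idf in ins(i):
                for j in range(i + 1, r + 1):
                    if out(j) == idf:
                        LR.add((idf, i, j))
                        break
        return DR, LR

    P = [('a', {'b', 'c'}), ('d', {'a', 'e'}),
         ('f', {'g', 'd'}), ('g', {'h', 'i'})]
    DR, LR = relations(P)
    assert ('a', 1, 2) in DR and ('d', 2, 3) in DR and LR == {('g', 3, 4)}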

Since we are only interested in meaningful programs, we assume that
there is no superfluous statement, i.e. there is no id ∈ M such that

    (i)  id ∈ OUT(S(i)) and id ∈ OUT(S(j)) where i < j, and
    (ii) ∀k, i < k < j, id ∉ IN(S(k)).

Also we assume that there is no statement that has no inputs other than constant
numbers, e.g. "a := 5".

Now we define an execution order E of a program P as:

    E(P) = {(i,j) | i ∈ {1,2,...,lg(P)}, j ∈ {1,2,...}}.



We also write E_P(i) = j if (i,j) ∈ E(P).† To execute a program by E(P) means
that at step j, all statements with statement numbers in E_P^{-1}(j) are computed
simultaneously using data available before the j-th computation as inputs.
A pair (P,E) is used to denote this execution. Also by E_0(P), we understand
the execution order given by a program, i.e.

    E_0(P) = {(i,i) | ∀i ∈ {1,2,...,lg(P)}}.

E_0 is called a primitive execution order.

We assume that at each time step at least one statement of P must
be executed. That is, for any E there is a k such that ∀j > k, E_P^{-1}(j) = ∅
and ∀j ≤ k, E_P^{-1}(j) ≠ ∅. We call k the length of an execution and write
lg(E).

As stated before, C(OUT(S(i))) refers to the contents of the variable
OUT(S(i)). This value, as we expect, varies from time to time throughout an
execution. Thus it is essential to specify the time when a variable is referred
to.

S(i)(P,E) refers to the computation of S(i) of P in the execution (P,E).
C(m) after S(i)(P,E) refers to the value of a variable m right after S(i)(P,E).
C(m) after (P,E) refers to the value of a variable m after the execution of the
whole program.

† For convenience we define that ∀i, E_P(0) < E_P(i) and E_P(i) < E_P(lg(P)+1).

5.2 Equivalent Relations Between Executions

Now we define two equivalence relations between executions.






Definition 1:

Given a program P and two execution orders E_1 and E_2, (P,E_1) and
(P,E_2) (or simply E_1 and E_2) are said to be output equivalent if and only if:
for all initial memory configurations C_I(M_I),

    ∀i, C(OUT(S(i))) after S(i)(P,E_1) = C(OUT(S(i))) after S(i)(P,E_2).

We write (P,E_1) =^O (P,E_2) if (P,E_1) is output equivalent to (P,E_2).

Definition 2:

Given two programs P_1 and P_2, let their execution orders be E_1 and
E_2 respectively. Also let their memories be M_1 and M_2. Then two executions
(P_1,E_1) and (P_2,E_2) are said to be memory equivalent if and only if:

(1) there is a one-to-one function

    f: (M_{1I} ∪ M_{1O}) → (M_{2I} ∪ M_{2O})

such that

    f(M_{1I}) = M_{2I}

    and f(M_{1O}) = M_{2O},

and (2) for all initial memory configuration pairs C_I(M_{1I}) and C_I(M_{2I})
such that

    ∀m ∈ M_{1I}, C_I(m) = C_I(f(m)),

we have

    ∀n ∈ M_{1O}, C(n) after (P_1,E_1) = C(f(n)) after (P_2,E_2).

We write (P_1,E_1) =^M (P_2,E_2) if (P_1,E_1) is memory equivalent to (P_2,E_2).

In principle, a program is written assuming that it will be executed
sequentially, i.e. by E_0. It, however, need not necessarily be executed by
E_0 as long as it produces the same results as (P,E_0) when it terminates, i.e.
it may be executed by any E as long as (P,E) =^M (P,E_0) holds.

Now the following theorems can be proved directly from the above
definitions.


Theorem 1:

(P,E) =^O (P,E_0) if and only if:

    (1) (id,i,j) ∈ DR implies that E_P(i) < E_P(j),
and (2) for any two triples (id,i,j) and (id,i',j') in DR with the
        same identifier id, either E_P(j') ≤ E_P(i) or E_P(j) ≤ E_P(i')
        holds.

What condition (1) implies is that variables must be properly up-
dated before they are used, and condition (2) prevents variables from being updated
before they are used by all pertinent statements.



    Figure 5.1. Conditions for the Output Equivalence
               ((a) condition (1); (b) condition (2))

Proof of Theorem 1:

(1) if part:

    Assume that a statement S(i) receives data from statements
    S(i_1), S(i_2), ..., S(i_k), i.e. for each pair i and i_s (s = 1,2,...,k)
    there is an identifier id_s such that (id_s, i_s, i) ∈ DR. Now let E
    be an execution order which guarantees that (1) before S(i) is
    computed, all S(i_s) are computed, and (2) between the computation
    of S(i_s) and S(i), no statement updates id_s. Then it is clear
    that C(OUT(S(i))) after S(i)(P,E) = C(OUT(S(i))) after S(i)(P,E_0),
    providing that all OUT(S(i_s)) have appropriate values. Note that
    the above two requirements are equivalent to conditions (1) and
    (2) of the theorem. Then by induction, we can show that if
    conditions (1) and (2) hold for all statements, then (P,E) =^O (P,E_0).

(2) only if part:

    We give an example to show that if an execution order
    violates condition (1) or (2) then we cannot get an output
    equivalent execution. Now let P be

        S(1): a := b;
        S(2): c := a;
        S(3): b := e.

    Then DR = {(b,0,1), (e,0,3), (a,1,2), (a,1,4), (c,2,4), (b,3,4)}, and (P,E_0)
    gives

        C(OUT(S(1))) after S(1)(P,E_0) = C_I(b),
        C(OUT(S(2))) after S(2)(P,E_0) = C_I(b),
    and C(OUT(S(3))) after S(3)(P,E_0) = C_I(e).

    Now let E_1(P) = {(1,2),(2,1),(3,3)}, which violates the first
    condition of the theorem, and E_2(P) = {(1,2),(2,3),(3,1)}, which
    violates the second condition. Then

        C(OUT(S(2))) after S(2)(P,E_1) = C_I(a)
    and C(OUT(S(1))) after S(1)(P,E_2) = C_I(e),

    which do not agree with the corresponding values produced by (P,E_0).

                                                              (Q.E.D.)



Theorem 1 gives more meaningful executions compared to the previous
results [5], [10]. For example let P be:

    S(1): a := f_1(x)
    S(2): b := f_2(a)
    S(3): c := f_3(b)
    S(4): b := f_4(x)
    S(5): d := f_5(b,c).

Fisher [10], for example, would give the following execution (P,E)
as an "equivalent" execution to (P,E_0):

    E = {(1,1), (2,2), (3,3), (4,2), (5,4)}.

This, however, does not give correct results unless P is properly
modified. Note that the variable b would have to carry two different values between
steps 2 and 3, which is physically impossible. Theorem 1 does not recognize
such an execution as "equivalent" to (P,E_0).
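Theorem 1 turns directly into a checking procedure. A sketch (illustrative
Python, reusing the DR set and the map E_P from Section 5.1; the boundary
convention for E_P(0) and E_P(lg(P)+1) is modeled with 0 and infinity):

    def output_equivalent_to_E0(DR, E):
        # E maps statement number -> step; statement 0 acts as the earliest
        # step and r+1 (absent from E) as the latest, per the convention.
        INF = float('inf')
        step = lambda i: 0 if i == 0 else E.get(i, INF)
        # condition (1): each writer executes strictly before its reader
        if any(step(i) >= step(j) for (_, i, j) in DR):
            return False
        # condition (2): two uses of one identifier must not interleave
        for (idf, i, j) in DR:
            for (idf2, i2, j2) in DR:
                if idf == idf2 and (i, j) != (i2, j2):
                    if not (step(j2) <= step(i) or step(j) <= step(i2)):
                        return False
        return True

    # the counter examples above are rejected, E_0 itself is accepted
    DR = {('b',0,1), ('e',0,3), ('a',1,2), ('a',1,4), ('c',2,4), ('b',3,4)}
    assert output_equivalent_to_E0(DR, {1: 1, 2: 2, 3: 3})
    assert not output_equivalent_to_E0(DR, {1: 2, 2: 1, 3: 3})
    assert not output_equivalent_to_E0(DR, {1: 2, 2: 3, 3: 1})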

Theorem 2:

(P,E) =^M (P,E_0) if and only if:

    (1) (id,i,j) ∈ DR implies that E_P(i) < E_P(j),
and (2) (id,i,j) ∈ LR implies that E_P(i) ≤ E_P(j).

Example 2:

    P: S(1): a := b + c;
       S(2): d := a + e;
       S(3): a := q + r;
       S(4): h := a + s.

Let E(P) = {(3,1),(4,2),(1,3),(2,4)}. Then (P,E) ≠^M (P,E_0);
E violates the second condition of Theorem 2, i.e. (a,2,3) ∈ LR but
E_P(2) > E_P(3).

The following lemma is helpful to prove the theorem.



Lemma 1:

If the two conditions of Theorem 2 hold for an execution order E, then
(P,E) =^O (P,E_0).

Proof:

We show that if conditions (1) and (2) of Theorem 2 (we write C(2-1)
and C(2-2) for them) hold, then conditions (1) and (2) of Theorem 1 (C(1-1)
and C(1-2)) follow.

First note that C(2-1) is identical to C(1-1). Next we show that
C(2-2) together with C(2-1) satisfies C(1-2). Assume that (a,i_1,j_1), (a,i_2,j_2)
∈ DR where j_1 < i_2. Then there exist statements S(h_1), S(k_1), S(h_2), S(k_2), ...,
S(h_m), S(k_m) such that (a,j_1,h_1), (a,k_1,h_2), (a,k_2,h_3), ..., (a,k_s,h_{s+1}), ..., (a,k_m,i_2)
∈ LR and (a,h_1,k_1), (a,h_2,k_2), ..., (a,h_s,k_s), ..., (a,h_m,k_m) ∈ DR. Then if C(2-1) and
C(2-2) hold, E_P(j_1) ≤ E_P(i_2). Thus C(1-2) follows.

                                                              (Q.E.D.)

Proof of Theorem 2:

(1) if part:

    Let E be an execution order which satisfies C(2-1) and
    C(2-2). Then by Lemma 1, (P,E) =^O (P,E_0). Now let i be a statement
    number such that (id,i,r+1) ∈ DR where r = lg(P), i.e. S(i) is
    the last statement in (P,E_0) which updates id. Then

        C(id) after S(i)(P,E_0) = C(id) after (P,E_0).            (1)

    Also, by a similar argument to the one used to prove Lemma 1, we can show
    that for all j ≠ i such that id ∈ OUT(S(j)), E_P(j) < E_P(i) holds.
    Thus S(i) is the last statement to update id in (P,E), too. Thus

        C(id) after S(i)(P,E) = C(id) after (P,E).                (2)

    Also since (P,E_0) =^O (P,E), we have C(OUT(S(i))) after S(i)(P,E_0) =
    C(OUT(S(i))) after S(i)(P,E), or

        C(id) after S(i)(P,E_0) = C(id) after S(i)(P,E).          (3)

    Thus from (1), (2) and (3), we have C(id) after (P,E) = C(id)
    after (P,E_0). Using the same argument for all i such that
    (id,i,r+1) ∈ DR, we can show that for all m ∈ M_O, C(m) after
    (P,E) = C(m) after (P,E_0).

(2) only if part:

    This part is again proved by giving a counter example. It
    is easy to show that the program

        S(1): a := e;
        S(2): c := b;
        S(3): b := a.

    together with the execution orders E_1 = {(1,3),(2,1),(3,2)} and
    E_2 = {(1,1),(2,3),(3,2)} serves as a counter example. The
    details are omitted.

                                                              (Q.E.D.)

Note that Lemma 1 can be modified as:

Corollary 1:

    If (P,E) =^O (P,E_0), then (P,E) =^M (P,E_0).

Now let us study the memory equivalence relation between two
different programs and execution orders, (P_1,E_1) and (P_2,E_2), in detail. As
a subcase of this let us consider (P_1,E_0) and (P_2,E_0). In general it is
impossible to show whether (P_1,E_0) =^M (P_2,E_0). That is, this problem can be
reduced to the Turing machine halting problem which is known to be recursively
unsolvable [28]. In our discussion we have put the restriction on P so that
a program is a loop-free block of assignment statements. Even with this
restriction it may still be practically impossible to show memory equivalence
between (P_1,E_0) and (P_2,E_0). For example if P_1 and P_2 are different polynomial
approximations for the same function, then (P_1,E_0) and (P_2,E_0) are likely
to produce different results due to e.g. a truncation error. We do not pursue
this problem further.

Finally let us consider the following example:


Example 3:

    P: S(1): k := a;
       S(2): b := k;
       S(3): k := c;
       S(4): d := k;
       S(5): k := e;
       S(6): f := k.

Let E(P) = {(1,1),(2,2),(3,2),(4,3),(5,3),(6,4)}.

Then (P,E_0) =^M (P,E) and lg(E) = 4.

However, the following program P'

    P': S(1): k' := a;
        S(2): b := k';
        S(3): k'' := c;
        S(4): d := k'';
        S(5): k := e;
        S(6): f := k.

together with an execution order E(P') = {(1,1),(2,2),(3,1),(4,2),(5,1),(6,2)}
gives a memory equivalent execution to (P,E_0), and lg(E(P')) = 2.

This suggests the introduction of the following transformation which,
when applied to a program P, produces a new program P' such that (P,E_0) =^M (P',E_0).

Transformation T_1:

Let S_1 = {S(i), S(i+1), ..., S(j)} and S_2 = {S(k), S(k+1), ..., S(m)} where
j < k. Assume that there is an identifier id such that (id,j,k) ∈ LR, and id
∈ OUT(S(i)), and ∀u, i < u < j, id ∉ OUT(S(u)). Also assume that for any v and
w, i ≤ v ≤ j and k ≤ w ≤ m, there is no id' such that (id',v,w) ∈ DR. Then
replace every occurrence of id in S_1 by id'' where id'' ∉ M.

Gold [17] presented a similar transformation to describe his model
for linear programming optimization of programs.

After the transformation is applied, S_1 and S_2 can be processed in
parallel, and still (P,E_0) =^M (P',E_0) holds where P' is the resultant program of
the application of T_1 on P.

This shows that the second condition of Theorem 2 is not essential,
i.e. it can be removed by introducing extra locations if necessary.






6. PARALLELISM IN PROGRAM LOOPS 

6.1 Introduction 

6.1.1 Replacement of a for Statement with Many Statements

Using the results from the previous chapter, let us now study loops
in a program, e.g. ALGOL for statements or FORTRAN DO loops, to extract potential
parallelism among statements. Given a loop P, we seek an execution order E with
the minimum length among all possible ones. Sometimes it may be appropriate to
get a loop P' from P by the previously introduced transformation, for which there
is an execution order E' such that (P',E') =^M (P,E_0) and lg(E') is the minimum
(for the definition of =^M, see Chapter 5).

As stated before, in this chapter our main concern is the parallelism
among statements (inter-statement parallelism). For example, we are interested
in finding out that all 10 statements (A[I] := A[I+1] + FUNC(B[I]); (I = 1,2,...,
10)) in F1 can be computed in parallel, whereas the statements in F2 cannot be
(the notation follows ALGOL 60 [3]):

    F1: for I := 1 step 1 until 10 do
            A[I] := A[I + 1] + FUNC(B[I]);

    F2: for I := 1 step 1 until 10 do
            A[I] := A[I - 1] + FUNC(B[I]).

First several notations are presented. According to the ALGOL 60
report [3], a for statement has the following syntax:

    <for statement> ::= <for clause>+ <statement>†
    <for clause>    ::= for <variable> := <for list> do.

An instance of this is:

    for I_1 := ... do
        for I_2 := ... do
            ...
                for I_n := ... do
                    begin S_1; S_2; ...; S_m end.

For the sake of brevity, we shall write (I_1 ← L_1, I_2 ← L_2, ..., I_n ← L_n)(S_1; S_2;
...; S_m) or (I_1, I_2, ..., I_n)(S_1; S_2; ...; S_m) for the above for statement
instance, where I_k is called a loop index, L_k is an ordered set called a loop
list set, and S_p is called a loop body statement with a statement identification
number p (which is different from a statement number (see Chapter 5)).

As its name suggests, a loop list set represents a <for list>, e.g.
L_1 = (1,2,3,4,5,6) represents "I_1 := 1 step 1 until 6." In general we write
L_k(i) for the i-th element of L_k; thus L_k(|L_k|) is the last element of L_k.

Now to facilitate later discussions we introduce the following notation.
Let B = (b_1, b_2, ..., b_n) (b_i > 0 for all i) and (i_1, i_2, ..., i_n) be n-tuples
of integers. Then we define the value of (i_1, i_2, ..., i_n) w.r.t. B as follows.‡

† <A>+ = <A><A>*, where * is the Kleene star.

‡ For convenience we write i(s..t) for the (t-s+1) integers i_s, i_{s+1}, ...,
i_t, e.g. (i(1..s), i(s+2..n)) means (i_1, ..., i_s, i_{s+2}, ..., i_n).
Also (|L(s..t)|) means (|L_s|, |L_{s+1}|, ..., |L_t|). Finally (1(n)) means
n 1's, e.g. (1(3)) = (1,1,1).



    V((i(1..n))|B) = Σ_{j=1}^{n} i_j B_j - Σ_{j=1}^{n} B_j + 1

where B_j = ∏_{k=j}^{n} b_{k+1} and B_n = b_{n+1} = 1.

This notation is introduced so that the relations

    V((1,1,...,1,1)|B) = 1,
    V((1,1,...,1,2)|B) = 2,
    ...
    V((1,1,...,1,b_n)|B) = b_n,
    V((1,1,...,2,1)|B) = b_n + 1,
    V((1,1,...,2,2)|B) = b_n + 2,
    ...
and V((b_1,b_2,...,b_n)|B) = b_1 b_2 ... b_n

hold.

For example V((2,3,1)|(3,4,5)) = 31. An n-tuple B is called a base.

The inverse function of V is also defined as V^{-1}(t|B) = (i(1..n)) if V((i(1..n))|B)
= t. Note that V^{-1} is not one-one, e.g. V^{-1}(15|(3,4,5)) = (2,0,0) or V^{-1}(15|(3,4,5))
= (1,4,0). An n-tuple (i(1..n)) is said to be normalized if b_j ≥ i_j > 0 for all
j. Let (i(1..n)) be normalized. Then 1 ≤ V((i(1..n))|B) ≤ Σ_{j=1}^{n} b_j B_j - Σ_{j=1}^{n} B_j + 1.
If 1 ≤ t ≤ Σ_{j=1}^{n} b_j B_j - Σ_{j=1}^{n} B_j + 1, then V^{-1}(t|B) has a unique normalized (i(1..n)) as
its value.
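V is the mixed-radix linearization of a 1-based index tuple, and restricted to
normalized tuples it is invertible. A sketch (illustrative Python; the function
names V and V_inv are ours, not the thesis's):

    from math import prod

    def V(i, b):
        # value of tuple i w.r.t. base b: 1-based mixed-radix linearization
        n = len(b)
        B = [prod(b[j+1:]) for j in range(n)]     # B_j, with B_n = 1
        return sum(i[j] * B[j] for j in range(n)) - sum(B) + 1

    def V_inv(t, b):
        # the unique *normalized* tuple (components in 1..b_j) with V = t
        n, i = len(b), []
        t -= 1
        for j in range(n):
            B = prod(b[j+1:])
            i.append(t // B + 1)
            t %= B
        return tuple(i)

    assert V((2, 3, 1), (3, 4, 5)) == 31
    assert V_inv(31, (3, 4, 5)) == (2, 3, 1)
    assert V((1, 1, 1), (3, 4, 5)) == 1 and V((3, 4, 5), (3, 4, 5)) == 60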



We say that a normalized (i(1..n)) ranges over B = (b(1..n)) in
increasing order if V((i(1..n))|B) takes all values between 1 and Σ_{j=1}^{n} b_j B_j -
Σ_{j=1}^{n} B_j + 1 in increasing order as (i(1..n)) changes. Notationally we write
(1(n)) ≤ (i(1..n)) ≤ (b(1..n)).

Finally we let

    (i(1..n)) > (j(1..n)) if V((i(1..n))|B) > V((j(1..n))|B)

and

    (i(1..n)) = (j(1..n)) if V((i(1..n))|B) = V((j(1..n))|B).

The following lemma is an immediate consequence of the above
definition.

Lemma 1:

Let B = (b(1..n)) where ∀j, b_j ≥ 2. Then

    (1) V((a(1..n))|B) < V((a'(1..n))|B) implies that
        V((a_1 - a_1', ..., a_n - a_n')|B) ≤ -Σ_{j=1}^{n} B_j, i.e.
        V((a_1 - a_1', ..., a_n - a_n')|B) < V((0(n))|B).

    (2) V((a_1'+c_1', ..., a_n'+c_n')|B) = V((a_1+c_1, ..., a_n+c_n)|B) and
        V((a'(1..n))|B) > V((a(1..n))|B) imply that
        V((c_1 - c_1', ..., c_n - c_n')|B) ≤ -Σ_{j=1}^{n} B_j, i.e.
        V((c_1 - c_1', ..., c_n - c_n')|B) < V((0(n))|B).

    (3) Let 0 ≤ |a_j| < b_j for all j. Then V((a(1..n))|B) < V((0(n))|B) if
        and only if there is an h such that ∀k (1 ≤ k < h), a_k = 0 and a_h < 0.






A loop must be replaced with a sequence of statements so that we can
use the results of the previous chapter. For example we replace

    for I := 1 step 1 until 10 do
        S1: A[I] := A[I] + B[I];

with the sequence of ten statements

    A[1] := A[1] + B[1];
    A[2] := A[2] + B[2];
    ...
    A[10] := A[10] + B[10].

In general we will get (∏_{j=1}^{n} |L_j|)·m statements after the replacement
of a loop P: (I_1, I_2, ..., I_n)(S_1; S_2; ...; S_m). Any statement in the set of replaced
statements can be identified by an n-tuple (i(1..n)) which corresponds to the values
of I_1, I_2, ..., I_n (i.e. L_1(i_1), L_2(i_2), ..., L_n(i_n)), and p which represents a
statement identification number. Thus an (n+1)-tuple (i(1..n),p) serves as a
statement number, and we write S((i(1..n),p)) to denote a particular statement
in the set of replaced statements, e.g. in the above example S((3,1)) = A[3] :=
A[3] + B[3]. The actual statement which corresponds to this is the statement
S_p with L_1(i_1), ..., L_n(i_n) substituted into every occurrence of I_1, ..., I_n in S_p,
and we also write S_p[L_1(i_1), ..., L_n(i_n)] for this.

These (∏_{j=1}^{n} |L_j|)·m statements are to be executed according to the
presented order (i.e. the order specified by the for loop lists). In other words,
the statement S((1,1,...,1,1)) is executed first, S((1,1,...,1,2)) second, ...,
the statement S((i(1..n),p)) is executed V((i(1..n),p)|(|L(1..n)|,m))-th, ...,
and the statement S((|L_1|, ..., |L_n|, m)) is executed last. Formally, as the
essential execution order we have:

    E_0(P) = {((i(1..n),p), V((i(1..n),p)|B)) | (1(n),1) ≤ (i(1..n),p) ≤ (|L(1..n)|,m)}

where B = (|L(1..n)|, m).



Example 1:

    for I_1 := 1 step 1 until 10 do
        for I_2 := 1 step 1 until 10 do
            begin
        S1:     A^1[I_1,I_2]   := A^2[I_1-1,I_2] + B^1[I_1,I_2];
        S2:     B^2[I_1,I_2-1] := A^3[I_1+1,I_2] + B^3[I_1,I_2+1];
            end

is executed as

    S((1,1,1)): A^1[1,1] := A^2[0,1] + B^1[1,1];
    S((1,1,2)): B^2[1,0] := A^3[2,1] + B^3[1,2];
    S((1,2,1)): ...
    S((1,2,2)): ...
    ...
    S((10,10,2)): B^2[10,9] := A^3[11,10] + B^3[10,11].

The superscript is used to distinguish different occurrences of A and B.

A[(i(1..n))] represents the form in which L_1(i_1), ..., L_n(i_n) are
substituted into I_1, ..., I_n in the index expressions, e.g. in the above example

    A^2[(i_1,i_2)] = A^2[i_1-1, i_2]

and

    A^2[(3,2)] = A^2[2,2].

Finally, the set of inputs to a statement S((i(1..n),p)) is denoted by
IN(S((i(1..n),p))). Similarly OUT(S((i(1..n),p))) represents the set of outputs
from S((i(1..n),p)). From the above example we have, e.g.

    IN(S((1,1,2))) = {A^3[2,1], B^3[1,2]}

and

    OUT(S((1,1,2))) = {B^2[1,0]}.

6.1.2 A Restricted Loop

In what follows, we mainly deal with a restricted class of for
statements. Two restrictions are introduced. Let a loop with m body statements
be

    P^m = (I_1, I_2, ..., I_n)(S_1; S_2; ...; S_m).

Restriction 1:

A for list set L_i must be an arithmetic sequence, i.e.

    L_i = (1,2,3,...,t)                                         (1)

for all i.

Restriction 2:

Let {A_1, A_2, ..., A_s} be the set of all array identifiers in P^m, where the
h-th occurrence of A_k in P^m has the following form (the superscript h is
used only if it is important to distinguish different occurrences of A_k):

    A_k^h[F(k,h,1), F(k,h,2), ..., F(k,h,n)].                   (2)

For fixed k and j, F(k,h,j) has an identical form for all h, i.e. either

    F(k,h,j) = I_j + w(k,h,j)

or F(k,h,j) = 0 (i.e. vacant).
w(k,h,j) is a constant number. Also we assume that each A_k appears on the left
hand side of statements at most once.

An example of a restricted loop is:

    for I_1 := 1 step 1 until 20 do
        for I_2 := 1 step 1 until 30 do
            for I_3 := 1 step 1 until 40 do
                begin
            S1:     A_3[I_1-1, I_2+3, 0] := A_3[I_1, I_2-3, 0] + A_1[0, 0, I_3];
            S2:     A_2[0, I_2, I_3-1]   := A_3[I_1-1, I_2, 0] + A_1[0, 0, I_3+1];
            S3:     A_1[0, 0, I_3+1]     := A_2[0, I_2-1, I_3];
                end;

Note that, for example, A_3 always appears as
A_3[I_1+w(3,h,1), I_2+w(3,h,2), 0], thus the first occurrence of A_3 is

    A_3[F(3,1,1), F(3,1,2), 0] = A_3[I_1+w(3,1,1), I_2+w(3,1,2), 0] = A_3[I_1-1, I_2+3, 0].

If there is no ambiguity, we write e.g. A_3[I_1-1, I_2+3] for A_3[I_1-1, I_2+3, 0]
(which is the conventional form).

We also write F(k,h,j)(i) for the resultant expression obtained by
substituting i into I_j in F(k,h,j), e.g. A_3[F(3,1,1)(2), F(3,1,2)(3), 0] =
A_3[1,6,0] (= A_3[1,6] conventionally).

A single variable may be introduced as a special case of array
identifiers, e.g. we write

    for I := 1 step 1 until 1 do
        A[I] := ...

for

    t := ... .

6.2 A Loop With a Single Body Statement

6.2.1 Introduction

First we shall deal with the case where a loop has only one body
statement (i.e. m = 1). Let a loop with a single body statement be P^1 =
(I_1, I_2, ..., I_n)S. Since there is only one statement we may drop the statement
identification number. Then a statement number for a replaced statement becomes
(i(1..n)) and as the essential execution order we have:

    E_0(P^1) = {((i(1..n)), V((i(1..n))|(|L(1..n)|))) | (1(n)) ≤ (i(1..n)) ≤ (|L(1..n)|)}.

Also in this case we only have to consider the array identifier which
appears on the left hand side of S. Hence instead of s array identifiers we
only have one array identifier (see Restriction 2 of Section 6.1.2). Hence we
drop k and write

    A^h[F(h,1), F(h,2), ..., F(h,n)]

and

    F(h,j) = I_j + w(h,j)

for the h-th occurrence of A in Eq. (2) of Section 6.1.2 (the superscript is
used if it is necessary to distinguish the different occurrences of A).
Furthermore we assume that F(h,j) ≠ 0 for any h and j.

Now let us study the following two examples.

    G1: for I := 1 step 1 until 10 do
            A[I] := A[I] + 5;

    G2: for I := 1 step 1 until 2 do
            for J := 1 step 1 until 10 do
                A[I,J] := A[I-1,J+1] + 5;

Assume that an arbitrary number of PE's are available. Then:

    G1: All ten statements (A[I] := A[I] + 5) can be computed simultaneously
        by 10 PE's.
    G2: A[1,J] and A[2,J-2] can be computed simultaneously by two PE's at the
        J-th step (J = 1,2,...,10).

In what follows, the above two types of inter-statement parallelism
are studied.

Before we go into the details, a few comments are in order with regard
to real programs. A for statement with a single body statement, (I_1,...,I_n)S,
can be classified from several different points of view. First of all let us
take a for list set L_j. As a simplified case we have L_j = (s_j, s_j+1, ..., t_j)
(t_j = (|L_j|-1) + s_j), which is equivalent to an ALGOL statement "for I_j := s_j
step 1 until t_j do". Knuth stated [9] that examination of published algorithms
showed that in well over 99 percent of the uses of the ALGOL for statement, the
value of the step was '+1', and in the majority of the exceptions the step was
a constant. This statement was confirmed by checking all Algorithms published
in the Communications of the ACM in 1969. There were 23 programs and 263 for
statements used. Only six uses were exceptions (≈ 2.3 percent).

Next let us examine a body statement S. Then either (1) the left
hand side variable of S (i.e. OUT(S)) is a single variable, or (2) OUT(S) is
an array identifier. In case (2), S is of the form

    A^0[F(0,1), ..., F(0,n)] := f(A^1[F(1,1), ..., F(1,n)], ..., A^p[F(p,1), ...]).

Now S has one of the following five forms.†

(1) OUT(S) is a single variable t:
    (i)   t := a function which does not depend on t,
          e.g. t := a + 5,
    (ii)  t := f(t), e.g. t := t + a.

(2) OUT(S) is an array variable A:
    (i)   A[F_1, ..., F_n] := a function which does not depend on A,
          e.g. A[I,J] := b + 5,
    (ii)  for all h, F(0,j) - F(h,j) is a constant for each j,
          e.g. A[I,J] := A[I-5,J+3] + A[I+1,J-3] + 5,
    (iii) other cases, e.g. A[I,J] := A[2I,J-5] + a.

Note that if S is of Form (1-i), then

    P^1 = S[L_1(|L_1|), L_2(|L_2|), ..., L_n(|L_n|)].

For example let P^1 be

    for I := 1 step 1 until 5 do t := A[I] - 1.

Then after the execution of P^1, t = A[5] - 1.

† We use a lower case letter for a single variable and an upper case
letter for an array variable.

Again all Algorithms published in the CACM were checked (this time the
check was made against Algorithms published in 1968 and 1969). There were 52
programs altogether and 117 for statements with a single body statement. The
details were:



              No. of Examples    Percentage

    (1)(i)           0                0
       (ii)         42             35.8
    (2)(i)          18             15.4
       (ii)         33             28.2
       (iii)        24             20.6
                   ---            -----
                   117            100.0

In what follows we deal with Forms (2-i), (2-ii) and (2-iii). Form
(1-ii) has been discussed in Chapter 4.

6.2.2 Type 1 Parallelism

6.2.2.1 General Case

As stated in Chapter 5, a block of statements P need not be executed
according to the essential execution order E_0 and may be executed by any
execution order E as long as (P,E_0) =^M (P,E) holds. In this section we study a
special class of execution orders called type 1 execution orders. This execution
order is defined for each loop index I_u (u = 1,2,...,n) and hence there are n of
these.



Definition 1:

A type 1 parallel execution order with respect to I_u (we write 1-p
w.r.t. I_u) is given by

    E(P) = {((i(1..n)), V((i(1..u-1), i(u+1..n)) | (|L(1..u-1)|, |L(u+1..n)|)))
            | (1(n)) ≤ (i(1..n)) ≤ (|L(1..n)|)},

and is represented by E[I_u].

Figures 6.1 and 6.2 illustrate the execution orders E_0 and E[I_u].

Note that E[I_u]((i(1..n))) = E[I_u]((i'(1..n))) if i_k = i_k' for all
k = 1,2,...,u-1,u+1,...,n. Furthermore note that if

    V((i(1..u-1))|(|L(1..u-1)|)) > V((i'(1..u-1))|(|L(1..u-1)|)),

then E[I_u]((i(1..n))) > E[I_u]((i'(1..n))).

By introducing extra |L_u| PE's, the computation time becomes one
|L_u|-th of the original, i.e. ∏_{j≠u} |L_j| steps instead of ∏_{j=1}^{n} |L_j| steps, where one
step corresponds to the computation of a body statement.

We now introduce TRANQUIL notation [2] to illustrate Definition 1. In
TRANQUIL

    for (I) seq (L) do S

stands for

    for I := <for list> do S.

Also in TRANQUIL, for (I) sim (L) do S indicates that the statements
S(L(i)) are executed simultaneously for all L(i) in L. Then Definition 1
amounts to obtaining

    for (I_1) seq (L_1) do
      ...
        for (I_{u-1}) seq (L_{u-1}) do
          for (I_u) sim (L_u) do
            for (I_{u+1}) seq (L_{u+1}) do
              ...
                for (I_n) seq (L_n) do S

from

    for (I_1) seq (L_1) do
      ...
        for (I_u) seq (L_u) do
          ...
            for (I_n) seq (L_n) do S.

First we study a type 1 parallel execution order for a general loop
in detail. Let the two-dimensional plane I_u - (I_{u+1} × ... × I_n) be an |L_u| by
∏_{j=u+1}^{n} |L_j| grid (see Figure 6.3).

The grid is labeled by 1,2,...,i_u,... etc. rather than by L_u(1), L_u(2),
..., L_u(i_u), ... etc. for convenience.

    Figure 6.3. Conditions of Parallel Computation in a Loop

Note that each square of the plane represents the computation
S((i(1..u-1), i(u..n))) for some (i(1..u-1)). If P^1 is executed by E_0, then the
computation proceeds from the leftmost column to the rightmost column, while in a
column the computation proceeds from the top to the bottom sequentially. On the
other hand, if P^1 is executed by E[I_u] then we proceed to compute from the top
row to the bottom row while we perform the computation in each row simultaneously.
Each computation S((i(1..u-1), i(u..n))) uses the inputs IN(S((i(1..n)))) and updates
the output OUT(S((i(1..n)))). Then, as we studied in Chapter 5, we have to make
sure that the computation S((i(1..n))) (marked x in Figure 6.3) does not receive
any data which are to be updated by the computation in the region R, i.e. there
must be no id such that

    (id, (i(1..u-1), i'(u..n)), (i(1..n))) ∈ DR

holds where (i'(u+1..n)) > (i(u+1..n)) and i_u' < i_u.
Similarly the computation S((i(1..n))) must not use any data which are to be
updated by the computation in the region Q, i.e. there must be no id' such that

    (id', (i(1..n)), (i(1..u-1), i''(u..n))) ∈ LR

holds where (i''(u+1..n)) < (i(u+1..n)) and i_u'' > i_u.

The above observation gives the following theorem.

Theorem 1:

Let E[I_u] be a type 1 parallel execution order w.r.t. I_u. Then
(P^1, E[I_u]) =^M (P^1, E_0) if and only if there are no id, id', (i(1..n)), (i(1..u-1),
i'(u..n)) and (i(1..u-1), i''(u..n)) for which either

    (1) (i)   i_u' < i_u and
        (ii)  (i'(u+1..n)) > (i(u+1..n))† and
        (iii) (id, (i(1..u-1), i'(u..n)), (i(1..n))) ∈ DR, where id ∈
              OUT(S((i(1..u-1), i'(u..n)))) and id ∈ IN(S((i(1..n)))),

or  (2) (i)   i_u'' > i_u and
        (ii)  (i''(u+1..n)) < (i(u+1..n)) and
        (iii) (id', (i(1..n)), (i(1..u-1), i''(u..n))) ∈ LR, where id' ∈
              OUT(S((i(1..u-1), i''(u..n)))) and id' ∈ IN(S((i(1..n))))

hold.

† (i'(u+1..n)) > (i(u+1..n)), for example, means V((i'(u+1..n))|B) >
V((i(u+1..n))|B) where B = (|L(u+1..n)|). Unless specified, the
base (|L(s..n)|) (= (|L_s|, ..., |L_n|)) is to be understood for (i(s..n)).


Let S be of the form A^0 := f(A^1, ..., A^p) where A is an array identifier
and the superscript is used to distinguish different occurrences of A. Then
id in the first condition of Theorem 1 corresponds to those A^h[(i(1..n))] for
which A^0[(i(1..u-1), i'(u..n))] = A^h[(i(1..n))] holds, together with the three
conditions (i), (ii) and (iii) of (1) between (i'(u..n)) and (i(u..n)).
Similarly id' corresponds to those A^h[(i(1..n))] for which A^0[(i(1..u-1),
i''(u..n))] = A^h[(i(1..n))] holds. Thus the A^h (1 ≤ h ≤ p) can be classified into
three groups:

    C1 = {h | A^h satisfies the first condition}
    C2 = {h | A^h satisfies the second condition}
    C3 = {1,2,...,p} - C1 - C2.

Note that C1 ∩ C2 = ∅.



Example 2:

Let P^1 be

    (I_1 ← (1,2,3), I_2 ← (1,2,3))(A^0[I_1,I_2] := f(A^1[I_1-1, I_2+1])).

Then for i_1' = 1 < i_1 = 2 and (i_2') = (3) > (i_2) = (2), we have A^0[(i_1',i_2')] =
A^1[(i_1,i_2)] = A[1,3], or (A[1,3], (1,3), (2,2)) ∈ DR. Thus P^1 cannot be computed
in 1-p w.r.t. I_1, and C1 = {1}.

From this argument it should be clear that if a body statement is of
Form (2-i) (see Section 6.2.1), then the loop can be computed in 1-p w.r.t.
any I_u (u = 1,2,...,n).






6.2.2.2 A Restricted Loop

If a loop is a restricted loop, then Theorem 1 may be simplified.
First we define a vector R(h) for each h = 1,2,...,p:

    R(h) = (R_1(h), ..., R_n(h)),

where

    R_j(h) = F(0,j) - F(h,j)
           = I_j + w(0,j) - (I_j + w(h,j))
           = w(0,j) - w(h,j).

For example we get R(1) = (-1,8) from the statement

    A^0[I_1-1, I_2+3] := f(A^1[I_1, I_2-5]).

Then we use these vectors to check parallel computability as follows.
Also for convenience we write

    R'(u,h) = (R_1(h), ..., R_{u-1}(h))

and

    R''(u,h) = (R_{u+1}(h), ..., R_n(h)).

Theorem 2:

If one of the following two conditions holds for any of the R(h) (h = 1,2,...,p),
then P^1 cannot be computed in 1-p w.r.t. I_u.

    (1) (i)   R'(u,h) = (0,...,0) and
        (ii)  R_u(h) > 0 and
        (iii) R''(u,h) < (0,...,0) and ∀j (u+1 ≤ j ≤ n), |R_j(h)| ≤ |L_j| - 1.

    (2) (i)   R'(u,h) = (0,...,0) and
        (ii)  R_u(h) < 0 and
        (iii) R''(u,h) > (0,...,0) and ∀j (u+1 ≤ j ≤ n), |R_j(h)| ≤ |L_j| - 1.

That the theorem is valid is a direct consequence of Theorem 1, i.e.
the first check of the theorem corresponds to the first condition of Theorem 1
and the second check corresponds to the second condition. For example the
first condition of Theorem 1 says that if

    (id, (i(1..u-1), i'(u..n)), (i(1..n))) ∈ DR

holds for i_u' < i_u and (i'(u+1..n)) > (i(u+1..n)), then P^1 cannot be computed
in 1-p w.r.t. I_u, where

    id ∈ OUT(S((i(1..u-1), i'(u..n))))
and id ∈ IN(S((i(1..n)))).

Then id represents the element of A for which

    A^h[(i(1..n))] = A^0[(i(1..u-1), i'(u..n))]

holds. Now this implies that

    F(h,j)(L_j(i_j)) = F(0,j)(L_j(i_j))

for j < u and

    F(h,j)(L_j(i_j)) = F(0,j)(L_j(i_j'))

for j ≥ u. Hence

    L_j(i_j) + w(h,j) = L_j(i_j) + w(0,j), or R_j(h) = 0, for j < u,

and

    L_j(i_j) + w(h,j) = L_j(i_j') + w(0,j), or i_j' = i_j - R_j(h), for j ≥ u.

Also

    (i'(u+1..n)) > (i(u+1..n))

with

    B = (|L(u+1..n)|)

becomes

    V((i_{u+1} - R_{u+1}(h), ..., i_n - R_n(h))|B) > V((i(u+1..n))|B).

Then by Lemma 1,

    V((R_{u+1}(h), ..., R_n(h))|B) < V((0,...,0)|B).

Thus the first check of Theorem 2 is verified. The second check can be
verified similarly.

Now let us consider the number of checks required. For each A^h (h = 1,2,
...,p) which appears on the right hand side of S, we first obtain a vector R(h).
Then for each loop index I_u, we perform the two checks given by Theorem 2 for
all R(h) (h = 1,2,...,p). Since there are n loop indices, in total we perform
2np checks.
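For a restricted loop these checks are a few lines of code. A sketch
(illustrative Python; w0 and ws hold the constant offsets w(0,j) and w(h,j),
0-indexed for brevity, and the |R_j(h)| ≤ |L_j| - 1 magnitude clause is omitted;
the R''(u,h) sign test uses Lemma 1(3), i.e. the sign of the first nonzero
component):

    def can_run_1p(w0, ws, u):
        # Theorem 2 checks for 1-p w.r.t. loop index u (0-indexed here).
        # Returns False if some occurrence h forbids type 1 parallelism.
        for wh in ws:                               # one R(h) per occurrence
            R = [a - b for a, b in zip(w0, wh)]     # R_j(h) = w(0,j) - w(h,j)
            head, tail = R[:u], R[u+1:]
            if all(r == 0 for r in head):           # R'(u,h) = (0,...,0)
                nz = next((r for r in tail if r != 0), 0)
                if R[u] > 0 and nz < 0:             # condition (1)
                    return False
                if R[u] < 0 and nz > 0:             # condition (2)
                    return False
        return True

    # Example 2 above: A[I1,I2] := f(A[I1-1,I2+1]), so R(1) = (1,-1)
    assert can_run_1p((0, 0), [(-1, 1)], u=0) is False  # not 1-p w.r.t. I_1
    assert can_run_1p((0, 0), [(-1, 1)], u=1) is True   # 1-p w.r.t. I_2 is fine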

The procedure described in this section can be extended to cover
nonrestricted loops, too. Let S be of the form

    A^0[F(0,1), ..., F(0,n)] := f(A^1[F(1,1), ..., F(1,n)], ..., A^p[F(p,1), ...])

and we define a vector R(h) for each h = 1,2,...,p as we did before, i.e.

    R(h) = (R_1(h), ..., R_n(h))

and

    R_j(h) = F(0,j) - F(h,j).

Since the loop is not restricted, F(0,j) and F(h,j) may take any form and hence R_j(h)
may not be a constant number but rather a function of the loop indices, e.g. R_j(h) =
I_1 + 2I_3 - 5. Hence, in the most general case, it is necessary to check the
two conditions of Theorem 2 for all values of (i(1..n)) (i.e. (1(n)) ≤ (i(1..n))
≤ (|L_1|, |L_2|, ..., |L_n|)) to examine type 1 parallel computability, i.e.
2(∏_{j=1}^{n} |L_j|) checks are required for each R(h) (h = 1,2,...,p). In many cases, we
can expect that the number of checks required is far smaller than that. For
example if

    R(1) = (2I_1+2I_4, 2I_1, I_1+2I_4, 2I_4),

then only 2(|L_1| × |L_4|) checks are required, i.e. it is not necessary to check
for those loop indices, e.g. I_2, which do not appear in R_j(1) (j = 1,2,3,4).

6.2.2.3 Temporary Locations

In this section we mean a restricted loop by a loop. The second
condition of Theorem 1 (or 2) may be dropped by introducing extra temporary
locations, by applying Transformation T_1 of Chapter 5 on P^1, i.e. if C1 = ∅ and
C2 ≠ ∅, then temporary locations may be set up so that P^1 can be computed
in parallel (for C1 and C2, see Section 6.2.2.1). Let h ∈ C2. This implies that
there are (i(1..n)), (i(1..u-1), i'(u..n)) and id (see Figure 6.4) for which

    (id, (i(1..n)), (i(1..u-1), i'(u..n))) ∈ LR

holds and

    id = A^h[(i(1..n))] ∈ IN(S((i(1..n))))

and

    id = A^0[(i(1..u-1), i'(u..n))] ∈ OUT(S((i(1..u-1), i'(u..n)))).

If a loop is computed in 1-p w.r.t. I_u, then we have

    E[I_u]((i(1..u-1), i'(u..n))) < E[I_u]((i(1..n)))

while

    E_0((i(1..u-1), i'(u..n))) > E_0((i(1..n))).

Hence A^h[(i(1..n))] will be updated by S((i(1..u-1), i'(u..n))) before being
used by S((i(1..n))). That is, if we compute P^1 in 1-p w.r.t. I_u we must keep
the old value of A^h[(i(1..n))], which otherwise will be updated by S((i(1..u-1),
i'(u..n))) at the E[I_u]((i(1..u-1), i'(u..n)))-th step, separately until it is
used by S((i(1..n))) at the E[I_u]((i(1..n)))-th step. The period of time, t_h,
through which the old value of A^h[(i(1..n))] must be kept for the computation
S((i(1..n))) is given by

    t_h = E[I_u]((i(1..n))) - E[I_u]((i(1..u-1), i'(u..n)))
        = V((i(u+1..n))|B) - V((i'(u+1..n))|B),

where B = (|L(u+1..n)|). Then as we showed in Section 6.2.2.2, in the case of a
restricted loop, we can show that

    t_h = V(R''(u,h)|B) + Σ_{s=u+1}^{n} B_s - 1

where B_s = ∏_{t=s}^{n} |L_{t+1}|. The details are omitted.



158 



T u+r • * * 


vl 

' n\ 




i 
u 




i ' 
u 






•n)) 
•n)) 












(i'(u+l. 








/> 










y 






(i(u+l. 




o" 





















A. 



Figure G.k. An Illustration of t 



Now max t gives the maximum period of time through which A [(i(l..n))] must be 
h€C2 

kept. Since we have | L | of them (i.e. |L | statements are computed 



simultaneously), the total amount of temporary locations required will he |L | x 



u' 



max t, . Additional |L | locations are required for buffering (see Example 5)« 



u 



Hence we have the following theorem. 



Theorem 3: 



The maximum number of temporary storage locations required is 



L | x (max [V((R (h),...,R (h))|B) + Z B ] ) 
heC2 U+1 n s=u+l S 



where B = ( |L(u+l. .n) | ) and B = w |L , , | and B = 1. 

t=s 



159 



Example 3 : 

Let P 1 be 

for (I ) se£ (1,2, ...,U0) do 

for (I ) se£ (1,2, ...,1j-0) do 

A[I X ,I 2 ] := A[I 1+ 2,I 2 -3] + 2; 

P as it is cannot "be computed in 1-p w.r.t. I because it violates 

the second condition of Theorem 2. Now we modify P as follows by introducing 
temporary arrays Tl(UOxl) and T2 (1*0x3). 
for (I ) se£ (1,2,. ..,^0) do 

for (I ) se£ (1,2, ,..,kQ) do 

begin SI: T1[I 1 ] := Afl^Ig] ; 

S2: k[l v l 2 ] := T2[I ] _,I 2 mod 3] +2;^ 

S3: T2[I 1 ,I 2 mod 3] := Tl[ Ij ; 

end . 

Then all three statements can be computed in 1-p w.r.t. I , i.e. we 

can replace seq in the first for statement by sim . The original P , if 

executed sequentially, takes 1600 steps whereas the modified P takes only 120 
steps if executed in parallel with respect to I . 



JL 

"a mod b = a 



Also we assume that T2 is properly initialized before the computation 
of the loop, i.e. store A[l,*], A[2,*] and A[3,*] in T2[l,*], T[2,*] 
and T[0, *]. 



i6o 



6.2.3 Type 2 Parallelism 

In this section we mean a restricted loop by a loop. Since the 
conflict between two statements S(i) and S(j) due to the existence of an 
identifier id such that (id, i, j) eLR may be resolved by introducing temporary 
locations (see Chapter 5 and the previous section), such conflict will not be 
taken into account to check parallel computability throughout the rest of this 
chapter. 

This section describes the second type of parallelism, i.e. type 2 
parallelism, in a for statement with a single body statement. Type 2 
parallelism is introduced to resolve the conflict due to the first condition 
of Theorem 1. The following example illustrates it. 

Example k : 

P: for I := 1 step 1 until kO do 

for I := 1 step 1 until kO do 

A°[I 1 ,I 2 ] := a\ 1^1,12+1] + A 2 [I 1 ,I 2 -1]; 

Since R 1 (l) a 1 > and (Rp(l)) = (-1) < (0) hold, P cannot be 
computed in 1-p w.r.t. I . 

Now let us consider the I -I plane (Figure 6.5). Suppose that all 

S((i, ,i )) in the shaded area have been computed. Then at the next step those 

S((i * , i ' ) ) marked as HJ can be computed simultaneously, and at the following 

step all (2) can be computed simultaneously, and so forth. We can see that a 
heavy zigzag line travels from left to right like a "wave front" indicating 
that all statements on that front can be computed simultaneously. 



161 




Figure 6.5- Wave Front 

Note that computation of P by this scheme takes approximately 120 
steps, while if P is computed sequentially it takes k-0 x h-0 = 1600 steps. 

Given a loop P , if P is computed in 1-p w.r.t. I , then a "wave 

front" is in parallel with the I axis of the I - I , x ••• X I plane, and 
^ u u u+1 n * ' 

it travels in the increasing order of (i(u+l..n)). 

If P cannot be computed in 1-p w.r.t. I then it may be possible to 
find a "wave front" which is diagonal rather than horizontal as in Example k on 



the I - I - x ... x I plane, 
u u+1 n ^ 



162 




The direction of wave 
front travel 



tan a = slope of a wave front 



Figure 6.6. Wave Front Travel 



This wave front is such that all computations S((i(l..n))) which corresponds to 
points (i ,i ,,..., i ) which lie right next to a wave front can be computed 

simultaneously. In other words all necessary data to compute S( (i(l. .u-l), 
i(u..n))) have been already computed in the shaded area. The direction of a 
wave front's travel is perpendicular to the wave front. 

Now let us obtain the slope of a possible wave front for a restricted 
loop. 

Let P be a restricted loop. Assume that P cannot be computed in 
1-p w.r.t. I . Then according to Theorem 2, this means that there are R(h) for 

which R (h) > and (R (h), . . ,,R (h)) < (0, ...,0) hold (i.e. CI / 0). 



163 



Theorem k : 

The slope of a possible wave front in the I - I , x ••• x I plane 



is given by 



max 
he CI 



^ ((-V((R u+1 (h),...,R n (h))| B ) - r B s +2)1 
u s=u+l -1 



n 

where B = ( lL(u+l. .n) | ) and B = 7r L, ,. and B = 1. 

s t+1 n 
t=s 



In Example k, the slope of the wave front is ■r(-("l"l + l)-l + 2) * 2. 

Proof of Theorem k : 

Let us consider S( (i(l. .u-l), l(u. .n) ) ) on the I - I x ... x I 

u u+l n 

plane. Assume that there is a variable id such that (Figure 6.7) 

(id, (i(l..u-l),i'(u..n)),(i(l..u-l),i(u..n))) eDR 

holds together with 

i ' < i and (i"(u+l..n)) > (i(u+l..n)), 
u u — 

i.e. 

id e IN(S((i(l..u-l),i(u..n)))) 
and 

id e OUT(S((i(l..u-l),i'(u..n)))). 
This implies that there is he CI such that 

A h [(i(l..u-l),i(u..n))] = A°[(i(l..u-l),i'(u..n))] 
holds. 



l£k 



(i(u+l..n)) 



(i'(u+l..n)) 



1 1 T 






< "T" 

t 




/ 


h 




// 




/ 


7\ a 


, \L. 


_x 






k R u (h) — ■* 





Figure 6.7. An Illustration for Theorem k 



In case of a restricted loop, we have 



i ' = 1 



lUh) 



for j > u. Now let 

t, = V((i'(u+l..n))|B) - V((i(u+l..n))|B) 



then we get 



n 



t. = -V((R (h),...,R (h))|B) - Z B + 1 where B e 
n u+1 n . s s 

s=u+l 



n 



7T |L, I and 



t=s 



t+1 



B - 1. Now if we let the slope of a wave front be equal to 



165 



t + l t, + 1 

h h 

i - i ' " R (h) 

u u u 

then A [ (i(l. .n))] and A [ (i(l. .U-l), i' (u. .n))] will be separated by it 
(Figure 6.7). 

The actual wave front is a zigzag line, rather than a straight line 
as shown in Figure 6.7- 

If there are more than one h in CI, then we choose a to be large 

enough so that all inputs to S( (i(l. .u-1), i(u. .n)) ) be inside of a wave front, 

i.e. 

t + 1 
tan 3 = max ^-^p. 
h£Cl u 

(Q.E.D.) 

Now suppose we compute F in parallel w.r.t. I using a diagonal 
wave front whose slope is D = tan q> Then how many steps (one step corresponds 
to the computation of a body statement S) does it take to compute P ? 

Theorem c / \ 

The total number of steps required to compute P in parallel w.r.t. 
I using a diagonal wave front whose slope is D is given by 





/u-1 \ 


i n 




|L |D^ 

1 u ' 

i 


T - 
P 


' TT |L.| 


'.j=u+l J 


+ 



166 



Proof: 



Let us consider the I - I ,, x ... x I plane. 

u u+i n 




end 



Figure 6.8. An Execution by a Wave Front 



Wave front W must travel from the start position to the end position on the 

n 
plane. How long does it take? It takes L + |L |d steps where L = it |L. 

U j=u+l 3 

u-1 

Since we have to process 7rlL.ll - I ,-■ x ••• X I planes, in total it 

^ ._, ' j ' u u+1 n * ' 

u-1 
becomes T = tt |L . I (L + |L |D). 



(Q.E.D.) 



Note that if a wave front is horizontal (i.e. if P can be computed 



in 1-p w.r.t. I ), then D = and T = tt IL.I. 



167 



6.2.4 Conclusion 

Assume that there are an arbitrary number of PE's available. Given 
a restricted single body statement loop, 

P 1 = (l r ...,I n )(A rF°,...,^] := f(A 1 [F^,...,Fj],...,A P [FP,... > ^])) 

we can check if F can be computed in 1-p w.r.t. I (u=l, ...,n) by Theorem 2. 

If it cannot be, then we can check for type 2 parallel computability w.r.t. 

I , i.e. find a possible wave front. In either case we obtain the number of 

u' 

n u-1 n 

computational steps required, i.e. T = ir |L.| or T = ( tt |L.|)( ir |L.| + 

U j=l 3 u j=l 3 j=u+l J 

lL I'D), where one step corresponds to the computation of the body statement S. 

Then among all possible choices, we would choose to compute in parallel w.r.t. 
I where T = min T . - 

6.3 A Loop With Many Body Statements 

6.3-1 Introduction 

In what follows we mean a restricted loop by a loop. Again a check 
against all published Algorithms in 1968 and 1969 CACM issues has been done, 
and it has been revealed that well over 50 percent of the cases of for statement 
usage (with more than two-body statements) are instances of restricted loops. 

Also as stated in Section 6.2.2.3 and Chapter 5, the LR relation may 
be disregarded by introducing temporary locations. Hence it will not be taken 
into account throughout the rest of this chapter. 



168 



Given a loop with m body statements, P , there are three different 
approaches to compute it in parallel. First it is possible to extend the 
procedure described in Section 6.2 by treating m body statements as if they 
were one statement. That is, we consider body statements as a function 

OUT(S(i(l..n))) = f(lN(S((i(l..n)))). 
For example 

SI: A[I,J] := f(A[I,J-l],B[I-l,J-l]); 
S2: B[I,J] := g(A[I-l,J-l],B[I,J-l]) 
yield 

S: {A[I,J],B[I,J]} := f(A[I,J-l],A[I-l,J-l],B[I-l,J-l],B[I,J-l]). 
Then we can apply Theorem 1 directly to check if e.g. S can be computed in 1-p 
w.r.t. I. 

The second and the third approaches can be illustrated by the follow- 
ing two examples. 

El: for I := 1 step 1 until kO do 
begin 

SI: A[I] := f(A[I],B[I]); 
S2: B[I] := g(A[ I],B[ 1-1] ); 
end ; 
E2: for I := 1 step 1 until i+0 do 
begin 

SI: A[I] := f(A[I-l],B[I-2]); 
S2: B[I] := g(A[I]); 
end. 



169 



In El, note that SI and S2 cannot be computed in parallel for all values of I 
because S2 has an iteration form 

B[I] := g'(B[I-l]). 
However, El may be replaced with two for statements : 

for I := 1 step 1 until kO do 
SI: A[I] := f(A[I],B[I]); 

for I := 1 step 1 until kO do 

S2: B[I] := g(A[ I], B[ 1-1] ) ; 
Now the first loop can be computed in parallel for all values of I while the 
second for statement is still an iteration. In general by replacing a single 
or statement with two or more for statements the parallel part may be exposed. 

In the second example, SI and S2 can not be computed in parallel 
for all values of I, nor can they be separated into two independent loops 
because SI uses values which are updated by S2 (i.e. B[I-2]), and S2 uses 
values being updated by SI (i.e. A[I]). However SI and S2 could be computed 

simultaneously while I varies sequentially if the index expression in S2 is 

A 

"skewed" as follows. 

E2' for I := 1 step 1 until kO do 
begin 

SI: Ari] := f(A[I-l],B[I-2]); 
S2»: B[I-1] := g(A[I-l])j 
end. 



JL 

Strictly speaking S2' should not be executed when 1=1 and an extra 

statement S2" : Bl^O] := g( k[kO]) is required after this loop. For 
the sake of brevity those minor boundary effects are ignored through- 
out this section. 



170 



Figure 6.9 illustrates the computation of the modified loop as well as the 
original loop. 



I 


SI 


S2 


• 
* 






i-2 




B[l-2] 


1-1 


A[ 1-1] 

1 / 


/ 


i 


1 / 
A[i]- 


-»B[i] 



i-2 



1-1 



SI 



A[i-1] 
^ 



A[i] 



S2' 



^ 



B[i-2] 



B[i-1] 



Figure 6.9. Simultaneous Execution of Body Statements 

In general, the above three approaches could be tried in any 
combination. For example, we may first try the first approach, i.e. we try to 
execute body statements simultaneously for all values of some loop index. If 
this Tails, then we may use the second approach, i.e. we separate a loop or 
we replace a loop with as many for statements as possible. On a resultant 
for statement we again try the first approach (if it has only one body statement, 
then the results of the previous section can be used). If we fail again, then 
the third approach can be taken. 

We now describe each approach separately. 

Before we go further, we define the following notations. Without 

loss of generality we assume that the p-th occurrence of A, appears in S and 

k p 



also assume that S and S have forms 
P q 



171 



and 



S * (AP[F(k,p,l),...,F(k,p,n)] := f _.(...)) 



S = (-.. := f(...,A*[F(k,q,l),...,F(k,q,n)], ...))• 



q q 

Then we define a vector R(k,p,q) as follows 

R(k,p,q) = (R 1 (k,p,q),...,R n (k,p,q)) 

where 

R,(k,p,q) - F(k,p,i) - F(k,q,j). 

J 

- w(k,p,j) - w(k,q,j). 
If F(k,p, c i) = F(k, q, j) = 0, then we let R.(k,p, q) = 0. Finally we write 

J 

R'(u,k,p,q) = (R x (k,p,q), ...,R u _ 1 (k,p,q)) 

and 

R"(u,k,p,q) = (R u (k,p,q),...,R n (k,p,q)). 

6.3*2 Parallel Computation with Respect to a Loop Index 

We first study the first approach described in the previous section, 
i.e. we treat body statements as if they were one statement and try to execute 
them in parallel with respect to some loop index. 

Let us consider P = (I. , I_, .... I ) (S., ; . . . :S ) . Then we treat m body 

12 ' n 1 m J 

statements as one statement S where 



m 



0UT(S((i 1 ,i 2 ,...,i n ))) = U OUT(S (d r i 2 , .••,!))) 

P=l 



and 



172 



m 
IN(S((i r i 2 ,...,i n ))) = IN(S 1 ((i 1 ,i 2 ,...,i n ))) U U [IN(S ((i^ig, 

p-1 
...,i n ))) - U OUTCS^dpig,...,^)))]. 

Having these two sets, we can use results of Section 6.3 directly. For example 
let us consider Theorem 2. Then we may modify Theorem 2 as follows. First 

suppose an array A£ appears in OUT(S( (i(l. .n) ) ) ) and A^ appears in 

IN(3((i(l..n)))). Then obtain R(k,p,q). 

Theorem 6 ; (cf. Theorem 2) 

For each A^ in 0UT(S( (i(l. .n) ) ) ), we obtain R(k,p, q) for all q such 

that A^ is in IN(S( (i(l. .n) ) ) ). Then if there is any R(k,p, q) which satisfies 

all three conditions described below, then S cannot be computed in type 1 
parallel w.r.t. I . Conditions: 

(1) R.(k,p,q) = or for all j = 1,2, ...,u-l, 

J 

(2) R u (k,p,q) > 0, and 

(3) there is £(u+l<£<n) such that V .(u+l<j<£-l), R .(k,p, q) = and 

J J 

R/k,p,q) = or R £ (k,p,q) < 0. 

The above theorem can be proved similarly to Theorem 2, and the 
details will not be given. 

Since we have to apply the above check for all R(k,p,q) vectors, the 
number of checks required is proportioned to the total number of R(k,p, q) vectors, 
#R(k,p, q). Also since we can try to compute S in type 1 parallel in n ways, i.e. 
with respect to I (u=l,2, . . .,n), the total number of checks we would perform is 
given by n x (#R(k,p, q)). 



173 



6. J. 3 Separation of a Loop 
6.3-3-1 Introduction 

In this section we study replacement (or separation) of a single for 
statement with two or more for statements. Let 

7® = (l 1 ,I 2 ,...,I n )(S 1 ;S 2 ;...;S m ) 

= (l 1 ,...,I u _ 1 ){(l u ,...,I n )(S 1 ;...;S m )] 

= (I,,..., I JF*. 

1' ' u-1 u 

For fixed values of I,,..., I , , let us consider P^. 

1' ' u-1' u 




Figure 6.10. Execution of F^ 

u 



Yjk 



If we write down the primitive execution order E (p), it can be represented 

by the straight line as shown in Figure 6.10. Now let us consider a statement 
S((i(l. .n), q) ) (See Figure 6.10). If it does not receive any computed results 
from part A, i.e. if there is no id and a statement S( (i(l. .u-l),i'(u. .n),p)) 
for which 

(id,(i(l..u-l),i'(u..n),p),(i(l..n),q))eDR 
and (i'(u..n),p) < (i(u..n),q) 
hold, then it may be computed independently of part A. If this holds for all 
p and all (i(u..n)) for some fixed q, then we can compute S before any S , i.e. 

M 

P 111 ~ (I ,...,1 )(S );(I ,...,1 )(S.;...;S -S .;...;S ). 
u u n q u' n 1' ' q-1' q+1' ' m 7 

Similarly if S( (i(l. .n), q) ) does not give any output to part B, then we have 

M 

P 111 " (I ,...,1 )(S.;...;S ,;S _;...;S );(l ,...,1 )(S ). 
u v u' ' n 1' q-1' q+1' ' m ' u* ' n' q_' 



6. 3« 3*2 The Ordering Relation (6 ) and Separation of a Loop 

Now we study how P may be replaced with several for statements. We 

first define the relation 6 between body statements. The relation is such 

u 

that if B (p, q) holds, then for any given (i(u..n)) there are (i'(u..n)) and id 

such that 

(id, (i(l..u-l),i'(u..n),p),(i(l..n),q))eDR 
and (i*(u..n)) < (i(u..n)) 
hold. That is, for some fixed q, if there is no p for which B (p, q) holds, then 



u 



S can be computed before any S and 



175 



where 



P^-(l)(S q );(l)(S 1 ;...;S q _ 1 ;S q+1 ;...;S m ) 



(I) = (I ,...,1 ). 



Definition 2: 



Between two body statements, S and S we first obtain R(k,p, q). Then 
9 (p, q) holds if and only if 

(1) R.(k,p, q) = or for all J = 1,2, ...,u-l. 

J 

and (2) the first nonzero element of R"(u,k, p, q) is either or a positive 
number. Also if all elements of R"(u,k,p, q) are then p < q 
holds. 
We also write 9 = { (p, q) | 9 (p, q) holds}. 

If A, appears more than twice in S , then we modify Definition 2 as 

follows. Suppose A, appears twice in S as q, -th and q^-th occurrences. Then 

we construct two vectors R(k,p, q, ) and R(k,p,q_). For each vector we check the 

above two conditions. If at least one of two vectors satisfies the two 
conditions, then we let 9 (p, q) hold. 



u 



Example 6 ; 



SI: A^-1,1,,+3] :=A 2 [I 2 ,I 3 ] + A^[ 1^1] ; 

S2: A 2 [I 2 +1,I 3 -1] := A 1 [I 1 ,I 2 -3] +^[1^1^; 

S3: ^[1^5,1^-1] :- A^Lylj] + Aj; I^^+l] ; 

S^: A^LJ : = A^I^l] + A^I^I^]; 



176 



give 

R(l,l,2) = (-1,6,0), R(l,.l,3) = (-1,2,0), . 

R(2,2,l) = (0,1,-1), R(2,2,*0 = (0,1,-1), 

R(3,3,2) = (-5,0,-1) andR(^Al) = (1,0,0). 
Then we have = { (2,1), (2,k), (k-,1)) . 

Now let us study Definition 2 in detail. First we note that id in 
Figure 6.10 corresponds to those A, for which 

F(k,p,j)(i ) = F(k,q,j)(i ) 

J J 

holds for all j = 1,2, . ..,u-l and 

F(k,p,j)(i •) = F(k,q,j)(i ) 

holds for all j = u,u+l, ...,n. The former implies that 

F(k,p,j) = F(k,q,j) 
or 

R,(k,p,q) = (or 0) 

J 

for j - 1, 2, ...,u-l which is equivalent to the first condition of Definition 2. 
Next note that (i'(u..n),p) < (i(u..n),q) implies that either 
(i) (i'(u..n)) < (i(u..n)) 
or (ii) (i'(u..n)) = (i(u..n)) and p < q. 
In the first case 

(F(k,h P ,u),...,F(k,h P ,n)) > (F(k,h q ,u),...,F(k,h q ,n)) 
must hold- In the second case 

(F(k,h P ,u),...,F(k,h P ,n)) = (F(k,h q ,u),...,F(k,h q ,n)) 
must hold. These two make up the second condition. 






177 



From we can construct a dependence graph D with m nodes each of 
which represents a body statement, e.g. from Example 6 we get: 




In D u we call a series of © u , ^(p^Pg), © u (p 2 ,P 5 ), . . .^(p^p^), . . ., 

© (p k TtV^) a chain and write ch(p., p, ) for it. If p, = p.. then it is called a 

mesh M. We say that anode p. is in the chain ch(p ,p ) (or in the mesh M), or 

the chain ch(p, ,p, ) (or the mesh M) includes p.. Note that for nodes p and q 

there may be more than one chain which connects p to q. 
Now let 
Z = (p | there is no q such that 6 (q, p) holds) 

and 

Z = (p | there is no q such that 6 (p, q) holds}. 

Furthermore let 

PD(p) = {q | ch(q,p) exists} U (p} 
and 

SC(p) - (q | ch(p,q) exists} U {p} • 

(PD for predecessors and SC for successors). Then we classify nodes in D as 
_ u 

follows : 

Z 3_ = (P | For all r e PD(p), there is no mesh in D which includes r} , 

Z = {p I For all r e SC(p), there is no mesh in D which includes r) , 
j u 

and 



178 



Z 2 = N - Z 1 - Z 3 - 

Let Z 1 (or Z ) = (p^Pg, . ..,p } • Then we can order this set as p ' p ',..., p ' 

M 

in such a way that 6 (p.', p.') does not hold if i > j. Let us write 

9 9 

Z.(or Z,) for a resultant ordered set. Also we order Z. = {q.,q_, . . .,a } as 

Q 

L 2 = ^i''^'* '••> C V ) in such a way that ^i' < q i' if i < J- 
Now given a loop t where 

F = (I^,Ig, . . ., I n ) (S^jSgj . . . ;S ; 

■ (i r ---,i u . 1 )UV.-.,i n )(s 1 ;S 2 ;...;s m )) 
= (i n ,...,i J?*, 

1 u-l u 

we build the dependence graph D and obtain sets Z , Z and Z,, say Z. = 

{P 1 *P 2 >'"*P U ^ z 2 = ( <!■]_» 12' •'•' ^ andZ 3 ■ t r i> r 2> "^ r w^ " ( m=u+v+w )- 

6 6 6 

From Z and Z, we obtain ordered sets Z and Z,, say Z = (p ',p p ', . . . ,p ') 

6 6 

and Z^ = (r 1 l ,P 2 , ,...,P w ')- A1 so we have Z g = (q^ , q^*, . . ., q^' ) . Then 

M 
p^~ (i)(s ,);(i)(s ,);...;I(S ,);(i)(s ,;...;S ,); 

^1 y 2 *u 4 1 ^v 

(l)(S r ,);(I)(S ,);...j(l)(S ,) 

12 w 

where (i) = (I ,1 .,...,1 ). 
u' u+1 n 



Note that Z (or Z,) together with 9 makes a graph which does not 
contain any mesh. To order Z (or Z,) the technique discussed in 
Chapter 7 niay be used. 



179 



Thus we have replaced a loop P with as many for statements as possible. We 
say that p is separable from !F if peZ (or peZ_) with respect to u, and that 
F is separable with respect to I . Also we say that p is separated with 
respect to I if P°J is replaced by many for statements as we showed above. 

6.3-3.3 Temporary Storage 

Now let us study the following : 

2 
P : for I := 1 step 1 until 1 do 

for I : = 1 step 1 until kO do 

begin 

SI: A^y := Ag[I 2 ] + A^Ig]; 

S2: A k [I 2 ] := A.J Ij +A^[Ig]; 

end 
Then we have R(l,l,2) = (0,0) or © (1,2) holds and: 

V 

© <D 

Hence we get Z, = (1,2) and 

P 1 * 2 : (I 1 ,I 2 )(S1: A^IJ := A 2 [I 2 ] + A 3 [I 2 ]); 
(I 1 ,I 2 )(S2: A 4 [I 2 ] := A^IJ +A^[I 2 ]). 



180 



This, however, does not give the same results as produced by the original P^. 
Note that after the execution of the first loop (I,, I_)(Sl), the only outcome 

is A[I X ] (i.e. A[l]) = A 2 [>0] + AJlfO] . However the second loop, (I ,1 p)(S2), 

requires forty different inputs, i.e. Ap[l] + A,[l], . ,,A,JkO] + A,[^0]. Hence 

2 
it becomes necessary to modify P ' as follows: 

P l 2; (\^ 2 ^ S1: ^l'V := A 2 [I 2 ] + A 3 [I 2 ] ) ; 

(I 1 ,I 2 )(S2: AJI 2 ] := AJI^y ♦ A^]). 






>A[1] 

)A[1] 



Figure 6.11. An Introduction of Temporary Locations 



In general we apply the following transformation rule on a loop when 
it is separated. Assume that S and S are body statements in a loop F, and 

6 (p, q) holds. Further assume that p is separated from TT with respect to 



181 



I (i.e. peZ.,qeZ ). Now let us consider the vector R"(u, k,p, q) . Let the 

value of the first element which is neither nor be e and its position he 
i. Then we let 

R(u,k,p,q) = (J | u < j < i and R,(k,p, q) = 0} . 

We order elements of R(u,k,p, q) by their positions in R"(u,k,p, q) and write 
R(u,k,p,q) = (r(l),r(2),...,r(t)). 

Then we apply the following on the loop F . 

Transformation T p : 

Transformation T is defined for the cases e < and e > 
separately. 

(1) e > 0. 

Change F(k,p,r(j)) and F(k,q,r(j)) to I y /j\ for j = 1,2, . ..,t. 

(2) e < 0. 

(i) Change F(k,p,r(j)) to 1,/j) for j = 1, ...,t. 

(ii) Change F(k, q,r(j)) to the following ALGOL program for 



j - l,...,t. -If (I r( . +1) = 1) and (l r( . +2) = 1) and ... 
(I r(t) = 1) then (if i r( . } = 1 then |L r(j) | else I r(j) - 1) 
else I r(i)'" Also chan g e F(k,h q , r(t) ) to the following ALGOL 

program: "if (I , , = l) then |L ,, J else I ,. % - 1." 
— r(t; ' r(t) 1 r(t) 



Example 7 • 



Let R"(5,k,P,q) = (R 5 (k,p,q),...,R 9 (k,p,q)) = (0,0,0,0,-1). 
Then we get e = -1 and l = 9 and R(5,k,p,q) = (6,8). Also assume that \lA = 



182 



|Lq| = 3« Originally S and S may look like 

y WyW :=f p ( -- ); 

S q : .. := f (...^[I^I^^yi],...); 
Now after Transformation T^ is applied, S and S become: 

y : w^'VW :=f ( -- ); 

S q' : " := t q^'" fA T/S I l' I 5 > B 6>I 7 >Bg,I 9 ],...), 
where 

Bg = if Ig = 1 then (if Ig = 1 then 3 else Ig-l) else I, 

and Bq = if In = 1 then 3 else I« - 1. 

Note that by applying Transformation T ? , temporary locations are 
eventually introduced. For example in Example 7, A, is changed to a seven- 
dimensional array from a four dimensional array by Transformation T . 

6.3.k Parallelism Between Body Statements 

6-3- i +-l Introduction 

Now we describe the parallelism between body statements. As stated 
before it becomes necessary to modify index expressions. In this section we 
give an algorithm which modifies index expressions properly. 

We first describe the algorithm in terms of a restricted loop with 

only one loop index, i.e. F^ = (i ) (S ;S ; . . . ;S ). Accordingly every array 
identifier in P™ is of a form A.[F(k,h, 1)] (= A. [I,+w(k,h, l)] ) where this is 

the h-th occurrence of A in F^. For convenience we drop the subscript of 






183 



loop index. The primitive execution order for p becomes 

E Q (P rn ) = {((i,p),V((i,p)|(|L|,m))|(l,l) < (i,p) < (|L|,m)}. 

For a given loop p , we consider the I-S plane which is an L by 
m grid. For example we have the following 1+0x3 grid for: 
for I := 1 step 1 until kO do 
begin 
81: A^I-1] := A 2 [I] + A^I-l]; 

S2: A^I] := A^I+1] - A^I]; 

S3: A 3 [I] := A^I] + Ag[I]; 



end. 



On 



this grid, we only show the relation DR, e.g. (A,[i-1], (i-l,3)> (i>l)) e DR, 





SI 


S2 


S3 


# 








• 








i-1 








i 


V 


.St. 


-+ 


i+1 




• 





























The direction of 
wave front travel 



Figure 6.12. Wave Front for Simultaneous Execution of Body Statements 



184 



Then the objective of this section is to discover a wave front W (cf. Section 
6.2.4) which separates all inputs from the computation, e.g. in Figure 6.12 
inputs (shown by 0) to S( (i,l)), S( (i+1,2)) and S((i,3)) (shown by t) lie above 
the wave front indicated by a dotted line. Hence S((i, l)),S( (i+1,2)) and 
S((i,3)) can be computed simultaneously while I takes values 1,2, ...,40 
sequentially. In general to discover a wave front is equivalent to discovering 
a constant C(p) for each body statement S so that all statements S((i-C(l),l) ), 

.. .,S((i-C (p),p )),..., S((i-C(m),m)) can be computed simultaneously. 

6.3.4.2 The Statement Dependence Graph and the Algorithm 

Let us consider the I-S plane again and consider the computation 

S((i,p))- Assume that there is id such that 
(id,(j,q),(i,p))eDR 

where either (i) j = i and q < p, or (ii) j < i and p / q, then clearly S((i,p)) 

and S((j,q)) cannot be computed simultaneously. 

Definition 3 ' 

The statement dependence graph (cf . the dependence graph in Section 

6.3-3), D(p ), is defined by a set N of nodes 1, 2, . ..,m each of which 

corresponds to a body statement of P and the arrow relation a. From node p 
to q there is an arrow a(p, q) if and only if either one of the following two 
conditions hold. 

(l) For fixed i, there is k such that 

AJJ[F(k,h,l)(i)] € OUT(S((i,p))), 
A*[F(k,g,l)(i)] e IN(S((i,q))), 



185 



F(k,h,l)(i) = F(k,g,l)(i) 
and p < q. 

(2) For fixed i, there exist k and i' such that 
A£[F(k,h,l)(i')l € OUT(S((i',p))), 

A*[F(k,g,l)(i)] € IN(S((i,q))), 

F(k,h,l)(i«) = F(k,g,l)(i) 
and i" < i. 

In the first case we label the arrow and write f(p, q) = 0. In the second 
case the arrow is labeled 1 and we write f(p, q) = 1. 

The statement dependence graph for the previous example is: 



© © — °-*6) 



A chain of arrows, a(p r p 2 ), a(p 2 ,p^), . . ., a(p k _ 1 ,p k ),a(p k ,p 1 ) in 

D(P ) is called a mesh M and we say e.g. a(p.,p. ) is in M. If i(p.,p. ) = 

for some arrow in M, then M is called a part zero mesh . The following lemma 
is obtained immediately. 

Lemma 2 : 

If D(P ) contains a part zero mesh, then there is no wave front for 

P*. 

Henceforth we assume that D(F ) has no part zero mesh. Given D(FJ, 
we define a subset Z of N as follows: 

Z = (p there is no q such that i(p, q) = or f(q,p) = 0} . 

Z together with arrows gives a subgraph D g of D(F m ). Further we let 



186 



Zy. = (P peZ and there is no q such that i(q, p) = 0} . 
Now we give an algorithm to find a wave front for D(P m ). 

Algorithm 1 ; 

(1) Let C(p) = + oo for all p e N. 

(2) (i) Take any p from ZL. If Z. = 0, then go to Step (5). 

(ii) Let C(p) = 0. 

(3) (i) If there are nodes s and t such that a(s,t) exists, £(s,t) = 1 and 

C's) > C(t), then we let the value of C(s) be equal to C(t). 
(ii) If there are nodes s and t such that a(s,t) exists, f(s,t) = and 
C(s) > C(t), then let the value of C(s) he equal to C(t) - 1. 

Repeat (i) and (ii) until there are no s and t which satisfy (i) or (ii) in 

D(F m ). 

(h) (i) If for all p in Z C(p) ^ + », then go to Step (5). 

'Otherwise take any p from Z for which there is q in Z such that 

a(p,q) exists and C(q) / + »• Let M = max (C(s)] where s 6 Z and 

C(s) ^ + oo. Then let C(p) = max (c(q) + 1,M} . Go to Step (3). 

For all p in Z with C(p) = + oo, let C(p) = M where M = max (C(s)} and 

seZ 

c(s) ^ + co. If Z ■ $, then let C(p) = for all p in Z. 



Example ( 
(1) D: 




187 



(2) Z = fl,2,3,U,6,7,8l, 




KD- J ^© ©-^<T>^(D 



(3) Let C(l) = and apply Step (3) of Algorithm 1. Then we get C(l) = 

since there is no q such that a(q, l) exists. 
(k) Let C(2) = 1. 

(5) Let C(3) = 1. 

(6) Let C(*0 = 2. Then we apply Step (3) of Algorithm 1 on a(S,l+), a(8, 5), 
a(5,2),a(2,l),a(7,8),a(6,7),a(9,6),a(3,9),a(3,l) and a(9,l)« And we get 
C(5) = C(8) = 2. C(2) = 1. C(7) = 1, C(6) = 0, C(9) = 0, C(3) = 0, and 
C(l) = -1. 

(7) There is no p in Z with C(p) = + ». Hence Algorithm 1 terminates and 
we get 

C(l) = -1, C(2) = 1, C(3) = 0, COO = 2, C(5) = 2, C(6) = 0, C(7) = 1, 
C(8) = 2 and c(9) = 0. 



I^ 


1 


2 


5 


h 


5 


6 


7 


8 


9 


























f\„. 






/> 












y- 




•N 


i-2 






o- 


.• 


V 


>^ 


o- 


>• 




i-1 


9- 


*• 


% 






O- 


»• 








o\ 




V> 


S^i 










/ 


l 


(J 


s 


*w 




• 














^ 


/ 


i+1 























































Figure 6.13- A Wave Front for Example 10 



188 



Now we show that Algorithm 1 gives a valid wave front. To prove this first 
we show that Algorithm 1 is effective, i.e. every step of Algorithm 1 is always 
applicable and terminates. 

Lemma ~> : 

Algorithm 1 is effective. 



Proof : 

That Step (l),(2),(k) and (5) are effective is clear. Now we show that 
Step (3) is effective. First we define U(p) to be a set of nodes such that 
U(p) = (q I there is a chain of arrows a(p, ,p ),a(p ,p_), . . ., 

a(p n _ x ,P n ) exist where j> ± = q and p n = p) U {p}, e.g. 

3 6 

© •Ch KpyA^ »k: 



D(P 8 ): 




and U(5) = (1>2, h, 5>7>9) . U(p) = U(q) implies that there is a mesh which 
includes p and q. By assumption this mesh is not a part zero mesh and c(q) 
will be assigned the same value as C(p) in a finite number of steps after c(p) 
has been assigned a value. 

If U(p) 3 U(q), then c(q) will be assigned a value less than or equal to 
C(p) in a finite number of steps after C(p) has been assigned a value. 

Thus after a finite number of applications Step (3) eventually terminates. 
Hence Algorithm 1 is effective. 

(Q.E.D.) 



189 






Theorem 6 : 

Algorithm 1 gives a valid wave front . ■ 

Proof : 

To prove this, it is enough to show that' (i) if f(p, q) = 0, then 
C(p) < C(q) and (ii) if i(p,q) = 1, then C(p) < C(q). However, from Steps (3) 
and (k) of Algorithm 1, clearly the above conditions hold. Also if p is 
assigned a value C(p) by Step (5) it implies that either (i) there is r e U(q) 
where q e Z, and f(r, q) = 1 or (ii) there is no such r. In the second case 

C(p) may take any value (S may be computed at any time), and in the first case 

C(q) > C(r) must hold. Hence we let C(q) = max{C(s)} . 

(Q.E.D.) 
To handle a restricted loop with more than one loop indicies, we 
modify Definition 3 as follows. For each S and S in P we first obtain a 

vector R(k,p, q). 

Definition 3' '> 

The statement dependence graph of t f D(f ), is defined by a set N of 
nodes 1, 2, ...,m each of which corresponds to a body statement of P^ and the 
arrow relation a. There is an arrow a(p, q) if and only if either one of the 
following s holds : 

(1) R.(k,p, q) -- or for all j = 1, 2, ..,n and p < q. We let i(p, q) = 0. 

(2) V((R 1 (k,p,q),...,R n (k,p,q)|B) > V( (0, . . . , 0) | B) where B = ( |Lj, . . ., |Lj ). 
We let i(p, q) = 1. 



190 



From Definition 3' clearly 

(1) !(p,q) = if S ((i-^ig, ...,1^)) US6S the out P ut of S p(( i 1 ; i 2>"-.» i )) and 
p < q. 

(2) f(p,q) = 1 if S ((i^ig, •••»i n )) uses the output of S ((^^ig',...,! •)) 

where (i^ig', . . .,1^ ) < (i^ig', . . ,,i n ). 

Algorithm 1 is then applied on D(Fj. For example let P^ be 
SI: A^Ig^g] := A 2 [I 2 ] + A 3 [I 2 -1,I 3+ 1]; 

S2: Ag[I 2 +l] := A^I^Ig] + k^I^I^i 

S3: A 3 [I 2 ,I 3 ] := Ag[Ig-l]; 

Then we have 




jn 



6.3*5 Discussion 

Given a loop P = (I , . . ., I ) (S, ; . . . ;£L ), we first try to execute body 
statements in parallel with respect to some loop index. If this fails for any 
loop index, or if this does not give a satisfactory result, then we try to 
replace the loop with many for statements. Then we can attempt to execute a 
body statement (or body statements) of a resultant for statement in parallel 
with respect to some loop index. If this fails, then we may try the third 



191 



approach, i.e. we try to execute all body statements simultaneously while 
loop indices vary sequentially. Often the number of loop indiciea, n, is 
very small (typically n = 2), and it will be easy to try all variations. 



192 



7- EQUALLY WEIGHTED— TWO PROCESSOR SCHEDULING PROBLEM 

T'l Introduction 

This chapter gives a solution to the so-called equally weighted- 
two processor scheduling problem. Informally the problem may be stated as 
follows. Given a set of tasks along with a set of operational precedence 
relationships that exist between certain of these tasks, and given two identical 
processors (PE),P(2), how does one schedule these tasks on the two processors 
so that they execute in the minimum time? It is assumed that either one of 
two processors is capable of processing any task in the same amount of time, 
say 1 unit of time. Informally a set of tasks together with procedence relations 
forms a graph. 

Clearly the problem of scheduling any given equally weighted task 
graph on k identical processors, P(k), in an optimal way is effectively 
solvable by exhaustion. But this is far from possible in practice. The only 
practical solution so far obtained is a result for scheduling a rooted tree 
(a restricted class of graphs) with equally weighted tasks on k identical 
processors, P(k) [21]. 

Now let us study how the equally weighted— two processor scheduling 
problem is related to the computation of arithmetic expressions on a parallel 
machine. 

In Chapter 3, the parallel computation of an arithmetic expression by 
building a syntactic tree was studied. There we were only concerned with the 
height of a tree and reducing it by distribution, and we did not introduce any 






193 



physical restrictions of a machine. For example, in reality, the size of a 
machine, i.e. the number of PE's is limited rather than arbitrarily big. One 
problem which will arise immediately is whether the distribution algorithm 
should be applied or not to reduce tree height since distribution introduces 
additional operations. For example assume that we have a two PE machine, P(2). 
Now let us consider two arithmetic expressions, A = a(bc+d) + e and B = 

abc(defgh+i). Then we have h[A] = k, h[B] = 5, h[A ] = h[abc+ad+e] = 3 and 

h[B ] = h[abcdefgh+abci] = k. Thus distribution reduces the height of T[A] and 

T[3]. On the other hand Figure 7-1 shows that if A, B,A and B are computed 

on P(2), A is still computed in less time than A while B now takes more time 

than B. 

Assume that we get A from A by the distribution algorithm. If the 
size of a machine is limited, then it may not necessarily be true that A can 

be computed in less time than A even if h[A ] < h[A] holds. Actually it is a 
nontrivial problem to decide whether distribution is to be made or not to 
reduce computation time (which is different from tree height) if the size of 
a machine is limited. It depends on the form of an arithmetic expression as 
well as the machine organization. We will not go into this problem any 
further. 

Now let us look at the situation from a different point of view. 
Given an arithmetic expression A and its minimum height tree, it is possible 
to take advantage of common expressions to reduce the number of operations to 
be performed in hopes of reducing computation time. For example let us 
consider the computation of A = (a+b+c+d)ef + (a+b)g on P(2). If we evaluate 



19^ 



(a+b) only once then A can be computed in k steps on P(2) while if (a+b) is 
evaluated twice, then it takes 5 steps to compute A (see Figure 7*2). 



e b c 




cab a d e 




(a) A (k steps) 



(b) A (3 steps) 



i h d e f 



b c 





c 1 



(c) B (5 steps) 



(d) B (6 steps) 



Figure 7-1. Computation of Nondistributed and 
Distributed Arithmetic Expressions 
on P(2) 



Our main concern in Chapter 2 was to reduce tree height assuming 
that the size of a machine is unlimited. Hence we were not interested in 
reducing the number of operations. As mentioned there, it was an open 
problem to find out common expressions while keeping the height of a tree 
minimum. However, if we could take advantage of common expressions while 



195 




1* 

3 
2 

1 

level 



(a) A Minimum Height Tree for A 




5 
k 

3 
2 

1 

Step 




(b) (a+b) computed twice 



(c) (a+b) computed once 



Figure T«2. Common Expression 



keeping the height of a tree minimum, then we would obtain a graph of operations 
rather than a tree for an arithmetic expression (see Figure 7-2. (c)). 

While we do not know how to compute an arithmetic expression A on 
P(2) in the minimum time (e.g. should distribution be done?), the scheduling 
algorithm presented in this chapter schedules a given graph of operations for 
an arithmetic expression on P(2) so that the given graph is processed in the 



196 



minimum time, assuming that each FE of P(2) may perform addition or multipli- 
cation independently but in the same amount of time, say 1 unit of time. Note 
that we may be able to construct many graphs for A. Hence while the scheduling 
algorithm schedules a given graph for A on P(2) in an optimal way, the algorithm 
does not necessarily compute A itself in the minimum amount of time. 

7-2 Job Graph 

Let G be an acyclic graph with nodes N. (i=l,2, . . .,n) and a set of 

directed arrows connecting pairs of nodes. For nodes N and N' we write N -* N' 
if there is an arrow from N to N' . We say that N is an immediate predecessor 
of N' and N' is an immediate successor of N. Also we let 

SR(N) = {N 1 |N -* N'l (a set of successors of N) 
and PR(N) = {N' |N* - Nl (a set of predecessors of N). 

Nodes which have no incoming arrows are called initial nodes , and 
nodes which have no outgoing arrows are called terminal nodes . For the sake 
of simplicity we assume that a graph has one initial node and one terminal 
node. If there are more than two, then we can add a dummy initial/ terminal 
node. We write N and N for them, respectively. We also write N =5> N' if 

there is a chain N, ,N_, . . . ,N such that N -»1 ■♦ ... -»M -> N' , or N -* N' . 
12m 1 m 

Furthermore we write N / N' or N ^> N' to show that the relation N -» N' or 

N» I' does not hold. 

Definition 1 : 

The forward distance (or level ) from the initial node to a node N, 
d (N), is the length of the longest path from the initial node to N, thus 



197 



d_(N T ) = 0. Similarly the backward distance from the terminal node to N, d (n)j 



is defined, thus •!. (lO = 0. 

Thus a node N cannot be initiated before time cL(n) but may be 
initiated at cL.(N) or at any time after that. 

Definition 2 ; 

The height of a graph G, h(G), is defined as 
h(G) = d^V- 

Then we say that a graph G is tight if for all nodes N, 
cL^N) + d T (N) = h(G). 

Otherwise we say that a graph G is loose. 
Example 1 ; 





Figure T>3- A Loose Graph and a Tight Graph 



Th 



e graph G, is a loose graph because d (N ) + d_(N p ) = 2 ^ h(G. ) 



whereas the graph G p is a tight graph. 



198 



First we shall study an optimum scheduling for a tight graph. A 
scheduling for a loose graph will be discussed in a latter section. In what 
follows, we use words "process" and "schedule" interchangably. 

Definition 3 : 

Let A(i) be the set of all nodes of forward distance i, i.e. 
A(i) = (Nld^N) = il. 

Tli is is called a node set . 

All nodes in A(i) can be scheduled independently of each other since 
there can be no precendence relations between nodes in A(i). In other words, 
if N => N 1 then N and N' cannot be processed simultaneously. 

Now we have the following lemma which characterizes a tight graph. 

Lemma 1 : 

If a graph G is tight, then for every node N, there exists N'e SR(N) 
and N"e PR(n) such that d (N 1 ) = d-^N) + 1 and d^N") = ci^(N) - 1. (For the 

terminal node SR(N ) = and for the initial node PR(N ) = 0. Those are 

exceptions. ) 

Proof : 

Obvious by Definition 2. 

(Q.E.D.) 

Corollary 1 ; 

Let G be a tight graph. Let N be a node of G. Let Ne A(t). Then 
for every i, < i < t - 1, there is at least one node N'e A(i) such that N' => 
N. Also for every j, t + 1 < j < h(G), there is at least one node N"e A(j) 
such that N => N" . 



199 



Definition k : 

To p-schedule a set Q of nodes is to partition Q, into subsets of size 
2 in an arbitrary way (if |q| is odd, there will be one subset of size l) and to 
order those subsets in an arbitrary way. 

A node N is said to be available if all predecessors of N have been 
processed. 

7.3 Scheduling of a Tight Graph 

Having these definitions, now we discuss a scheduling of a tight graph 
G on two processors. The idea of this scheduling scheme is rather simple. We 
start checking |A(i)| from i = 1 to h(G). For each i, if |A(i)| is even, then 
we p-schedule A(i), and no processor time will be wasted. If | A ( ± ) | is odd, 
and if we simply p-schedule A(i), then there will be one node, N, left which 
cannot be processed in parallel with another node in A(i). Thus we will waste 
processor time. Therefore a node which can be processed in parallel with the 
above left out node N must be found. Where can that node be found? It will be 
shown that we have to look as far as the smallest i' larger than i with |A(i')| 
= odd to find it. Thus the amount of work to look ahead is always bounded. 

Before we go further, a few more definitions are in order. 

t n 
For some t and n, let us consider a set A = U A(t+i). Now let us 

n i=0 
take a node from each of A(t+,j) and A(t+i) (j < i). Let them be N J and N 1 . If 

N £> IT", then N and N may be processed simultaneously, providing that all 

predecessors of N and N" have been processed. 

Now we establish this relati on formally on A . 



200 



Definition 5 : 



,t . 



The p-line r elati on ( ) between two nodes N and N' in A is defined 
n 



as follows. 

E 

WW if and only if 

(1) N ^> N' and N 1 £> N 
and (2) d (n) ^ d(W ). 

We also write (N, W ) for WW. A pair (N,W) is called a p-line pair . 
Note that in general (N,N') and (W,N" ) do not necessarily imply (N,N"). 

Further we define A^p) - {(N,N') | N€A(t+i),N'eA(t+j),0 < i,j < n] , 

i.e. A (p) is a set of all pairs of nodes in A between which the p-line relation 

n f * n * 

holds . 

-L. J- 

Since (N,N') e A (p) implies that (W>N)e A (p), we shall in general 
put only one of them in A (p) and drop the other. 

An algorithm to find A (p) is given in Section 7-5- 



Definition 6 : 

A p-line set L on A is an ordered set of p-line pairs 

*■ n n 

L n = ((N ,N 1 '),(N r N 2 '),...,(N k ,N , k+1 )) 



where 



(1) N Q e A(t) and N' k+1 e A(t+n), 



and (2) for all g(l < g < k),ci (N ') = cL (N ) but N ' + N . 



201 



t t t 

We say that A is p-connectable if there is a p-line set L on A . 
n * n n 

t ' t 

Also we write L (N~, N , ,) when the first and the last nodes in L 
n u K+l n 

are of special interest. 

Further, we write L^N-.N' . ) = L^(N n ,N ') U L t+: J-(N ,N' ) if 

n 0' k+l 1 0' g n-i g' k+l 

(N ,,N* ) and (N ,N') are two adjacent elements of L (N n ,N' ) and d_(N' ) = 
g-i g g g+i n u K+i l g 

dj(N ) = t + i. 

An algorithm to build a p-line set for A is given in Section 7»5« 



Example 2 : 




A(l) = (b,c,d) 
A(2) = {e,f} 
A(3) = (g,h,i) 



Figure f.k. A Graph G 



1 d 1 

v = U A(l+i) = {b,c,d,e,f,g,h,i}. A Q (p) = ((b,f),(d,e),(f,h),(e,i)). A 



1=0 



202 



typical L = ((b,f), (e,i)). Hence A is p-connectable. 

Further we define a special p-line set called a p-line (l) set. 

Definition 7 * 

Given a set A . We call a p-line set 
n 

a p-line (l) set (write L (l)) if 

cL(N. ) + 1 = (L(N^ +1 ) for all i(0 < i < k). 

Note that in this case k = n - 1. 

Also we write L (l)(N~,N,' ,,) when the first and last nodes are of 
n 0' k+l y 

particular interest. 

Now a few lemmas are in order. 

Lemma 2 : 

Suppose N e A(t) and N' e A(t+n) for some t and n in graph G. Assume 
that (N, N') holds. Then there is a p-line (l) set 

I>(1)(N,N') = ((N ,N 1 '),(N 1 ,N 2 '),...,(N n _ 1 ,N n ')) 

where N^ = N and N ' = N 1 ■ 
n 

Proof : 

A proof is given by an induction on n. 

First note that |A(t+i)| > 2 for all i, 1 < i < n - 1. Otherwise 
N $* N' holds and (N,N') does not. 



203 



(1) Now let n = 2. Then there must be two distinct nodes 

N ,N p e A(t+1) such that N - N, and N -* N' . Otherwise the graph 

G is not tight. Hence (N^N 1 ) and (N,N 2 ). Thus 1^(1) = ((N,Ng), 

(N r N')). 

(2) Now assume that the lemma holds for n < i. 

Let n = i + 1. Let N e A(t), N'e A(t+i+l), and (N,N')« Then 
there must be two distinct nodes N ,N e A(t+i) such that (N,N..) 

and N $> N p . This follows from Lemma 1 and Corollary 1. Then 

(N p ,N*)> since otherwise N => N' holds and this contradicts the 

assumption. By the induction hypothesis, there is a 

L^_ 1 (1)(N,N 1 ) = ((N,ir L '),(N 1 ,^ , ),...,(M n " 2 ,N 1 )). 
Thus there is a p-line (l) set 

L*(1)(F,H') = L^_ 1 (1)(N,N 1 ) U (N 2 ,N«) = ( (N, N 1 ' ), (n\ N 2 ' ), 

...,(M n ' 2 ,N 1 ),(N 2 ,N')). 

(Q.E.D.) 



Lemma 3: 



Suppose that N e A(t),N 1 ,N 2 £ A(t+i),N^,N e A(t+j), and N 1 e A(t+n) 
where i < j < n. Also assume that (N,N ),(N ,N*), and (n ,N') hold. 

Then there is a p-line (l) set L^(l)(N,N') = ((N,^ 1 ),..., (N ^N')). 



Proof: 



20U 



'© W 



© 



: © : © 



\ / 



/ N 



© 



i w) « 






*f(K 



© ©: 



© 







© 



<3 ® 



Figure 7-5* An Illustration for Lemma 3 

Since (N, N ) by the previous lemma, there is a p-line (1) set 
L^(1)(N,N 3 ) = Lj(l)(N,N) U L^(1)(N',N 3 ), 

u -*- u 

and since (rr,N'), there is a p-line (l) set 

L^aKw^N') = L^CDCN 2 ^) U L t+ ^(1)(NSN'). 
n-i ' j-i v ' n-j v ' 



By definition, N ^ N' , N / N' , W £ N and N 1 ^ N 2 . 



205 



Now we have two cases. 



(1) N 5 = N' 

Then N ^ N 1 . Now let 



l*(i)(n,n') = l*(i)(n,n) u lJ^CDCh 1 ,/) U L^ (l ) ft. , r ) . 

(2) N 5 ^ N'. 
Then let 

L*(i)(H,r) = lJ(i)(n,n 3 ) U L^(l)(N',N'). 

Thus in either case there is a p-line (l) set on A . 

n 

(Q.E.D. ) 

From Lemmas 2 and 3> we can prove the following lemma. 

Lemma k : 

If A is p-connectable, then there is a p-line (l) set L (l)(N, N') 

where N e A(t) and N 1 e A(t+n). 

What Lemma h implies is the following. 

Let L*(l)(N,N') = ((N,N 1 '),(N 1 ,N 2 '),(N 2 ,N 3 '),...,(N n _ 1 ,N')), i.e. 

d I (N.') = cL^N.) = t + i and cL^N.) + 1 = ^(N^). 

Now assume that a set A is p-connectable. Then we have L (l)(N,N') = 

n n ' 

((N,^'), (N r N 2 '), ...,(N n _ 1 ,N')) where N e A(t) and N' e A(t+n). Since (N.,N'), 
we can process both at the same time. To do this we first process A(t+i) - {N.} 



206 



(notice that djC^) = t+i). Then process (N.,N' }. Finally process A(t+i+l) - 
(N! ,}. This leads, us to the follovring scheduling- 



Definition 8: 



Assume A is p-connectable. Then by to p-line schedule A bv I (l). 
n * n n ' 

we mean the following scheduling. 

Let L*(l) = ((N ,N 1 '),(N 1 ,N 2 , ),...,(N n-1 ,N n ')). 

(1) p-schedule A(t) - {N Q } . 

(2) g = 1 

(3) p-schedule {N _ ,N '} 

o o 

(h) p-schedule A(t+g) - {N ',N ]. 

D g 

(5) g = g + 1. If g < n, then go to (3). 

(6) p-schedule (N ,,N '} . 
v ' * n-1' n 

(7) p-schedule A(t+k) - (N '}• 

Now an algorithm to schedule a tight graph for two processors is 
described. 

Scheduling is done according to node sets A(i), for i = 1,2, . . .,h(G). 

All nodes in A(i) can be processed independently of each other, i.e., 
in any order. 

If |A(i)| is even, then two processors can be kept busy all the time 
to process A(i) and A(i) can be processed in time |A(i)|/2. 

If |A(i)| is odd, then there is a possibility that a machine becomes 
idle, i.e., one node will be left out from A(i) which does not have any partner 



207 



to be processed with. Let it be N. Then a partner must be found from some 
A(j), j > i. First we may try to find N' € A(i+l) which can be a partner of N. 
However if |A(i+l)| is even, then |A(i+l) - (N'} | becomes odd and we have the 
same problem again, and may try to find a node from A(i+2) to fill an idle 
machine, and so on. If this cycle is ever to stop, it must stop when A(i+k) 
is hit where |A(i+k)| is odd. Otherwise there is no way to remedy the cycle, 
and machine time must be wasted. 

Tight Graph Scheduling Algorithm : 
Step 1: t = 

Step 2: If t = h(G) then p-schedule A(t) and stop, else go to Step 3* 
Step 3: If |A(t)| is even, then 
(3-1) p-schedule A(t) and 
(3-2) go to Step 7- 
Step k: |A(t) | is odd. 

Find A(t+1). 
Step 5: If VN e A(t) |SR(N)| = |A(t+l)|, then 
(5-1) p-schedule A(t) and 
(5-2) go to Step 7. 
Step 6: There is N' e A(t) such that |SR(N')| < |A(t+l)|. 
(6-1) If |A(t+l)| is odd, then 

(6-1-1) p-schedule A(t) - (N']- 

(6-1-2) p-schedule (N*} U {N"l where N" e A(t+l) - SR(N'). 

(6-1-3) p-schedule A(t+1) - (N") . 

(6-1-U) go to Step 7- 



208 



(6-2) |A(t+l)| is even. 

(6-2-1) Find out the smallest k greater than 1 such that |A(t+k)| 

is odd. 
(6-2-2) If there is no such k (i.e., we have checked up to A(h(G))) 

then p-schedule A(i) t < i < h(G) individually, and stop. 
(6-2-3) Else we have a set (2= (A(t),A(t+l), . . .,A(t+k)} where |A(t)| 

and |A(t+k)| are odd and other |A(t+i)| are all even. Check 

p-connectivity of A, . 

(1) A, is not p-connectable. 

(l-i) p-schedule A(t),A(t+l), .. .,A(t+k-l) individually. 
(1-ii) Let t = t + k - 1. 
(l-iii) go to Step "J. 

(2) A, is p-connectable. 

(2-i) Find a p-line (l) set l£(l) = ( (N Q , N^ ), (N^N^ ), . . ., 
(N k+1 ,N k ')). 

(2-ii) p-line schedule A, by L. (l). 

(2-iii) t = t + k. 
(2-iv) go to Step 7- 
Step 7: t = t + 1. 

Go to Step 2. 

Example 3 : 

(l) |A(l)| and |A(3)| are odds, and |A(2) | is even. Thus we have &= 
{A(1),A(2),A(3)1 - (by Step 6 of Algorithm) 



209 




A(l) = {a,b,c,d,e} 
A(2) = (f,g) 
A(3) = (h,i,J). 



Figure 7-6. An Example of a Tight Graph Scheduling 



(2) For A^, we have Lg(l) = ( (d, f), (g,h)). 

(3) According to Step (6-2-3) (2), we schedule as follows. 

(i) p-schedule A(l) - {d} = (a,b, c, e). 
(ii) p-schedule { d, f } . 
(iii) p-schedule A(2) - (f,g) = 0. 
(iv) p-schedule (g,h) . 
(v) p-schedule A(3) - (h) = {i,j}. 
Thus we have an optimum schedule : 



210 



Step 12 3^5 

machine A 



B 



a c ' d 



g i 



b e f 



h J 



Proof for the algorithm : 

Lemma 5 ' 

Step 3 is optimum and "whatever p-schedule is made, it does not affect 
the later stages. 

Lemma 6 : 

Step 5 is optimum and whatever p-schedule is made, it does not affect 
the later stages. 

Proof : 

First note that after A(t) has been processed, nodes in A(t+l) only- 
can become available. Since VN e A(t), |S(N)| = |A(t+l)|, all nodes in A(t) 
must have been processed before any node in A(t+l) can be processed. 

(Q.E.D.) 



Lemma 7 • 

Step 6-1 is optimum and whatever p-schedule is made, it does not 
effect the later stages. 

Proof : 

The algorithm actually processes A(t) and A(t+l) (where |A(t) | and 
|A(t+l)| are odd) in time ( |A(t) | + |A(t+l) | )/2, which is optimum. 

(Q.E.D.) 



211 



Lemma 8 : 

Step 6-2-2 is optimum. 

Proof : 

Lettf= (A(t),A(t+l), ...,A(h(G))). Since |A(t)| is odd and all other 



|A(i)| is even (t < i < h(G)), it takes at least time 
I" lA(t)K!A(t + l)l + ...^|A(h(G))l -| 



to process cl. 

Step 6-2-2 achieves this. 

(Q.E.D.) 

Lemma 9 : 

Assume |A(t)| and |A(t+k) | are odd and |A(t+i)| are all even (l < i < 
k). If A, is p-connectable, then Step 6-2-3 (2) is optimum. 

Proof : 

Oc= {A(t),A(t+l), . . .,A(t+k)} can be processed in time 

|A(t)|+|A(t+l)|+...+|A(t+k)| 
2 

Step 6-2-3 (2) achieves this. 

(Q.E.D.) 

Lemma 10 : 

Assume |A(t)| and |A(t+k) | are odd and |A(t+i)| are all even (l < i < 

k). If A, is not p-connectable, then Step 6-2-3 (l) is optimum. 



212 



Proof : 

Let£^= (A(t),A(t+l), . . . ,A(t+k)} . Since A^ is not p-connectable, 

there is no p-line set 1^(1)- Thus there will be N in some A(t+i) (0 < i < k) 

which does not have any partner to be processed with it, thus making a machine 
idle. Now if this situation could be remedied, then it would be only because 
there is N' e A(t+n+j) (j > 0) which can be done in parallel with N. 




A(t+n) 



A(t+n+j) 



odd 



even 



odd 



Figure T«T- An Illustration for Lemma 11 



Suppose that the above could be done. The parallel processing of N 



t , 



and N 1 can, however, be advantageous only if A. is p-eonnected and there is 
a p-line (l) set L^(l)(N,N f ) where N e A(t),N' e A(t+i) and N' 4 N - Otherwise 



213 



processors cannot be kept busy for A(t),A(t+l), . . . , A(t+i)-{N ) and we do not 
gain at all by delaying a processing of N. Now from the assumption, N £> N" 

for all N" € A(t+n) because otherwise A becomes p-connectable. Also by 

n " 

Corollary 1, there is a node N in A(t+n) such that N > N' . This implies that 
N 5> N'. Thus N cannot be processed in parallel with N. This proves the 

lemma. 

(Q.E.D.) 

Essentially what the above algorithm does is upon finding A(t) where 
|A(t)| is odd to try to delay the processing of a node N in A(t) till the next 
node set A(t+l). If |A(t+l)| is even, then again the algorithm tries to delay 
the processing of a node N' in A(t+l) till the next level A(t+2), etc. until it 
finds some A(t+k) where |A(t+k) | is odd, or |A(t+k) U (N"} | is even where N" 
is a node whose processing is being delayed from the previous stage. In other 
words, the algorithm tries to establish p-connectability between two adjacent 
node sets both of which have an odd number of elements. Note that it is not 
necessary to try to establish p-connectability among more than two odd node 

sets, say A(t),A(t+k) and A(t+m)(m > k), because A cannot be p-connectable if 

A, is not (see Lemma 10). 

Now the above argument together with Lemmas 5-10 prove the following 
theorem. 

Theorem 1 : 

The algorithm gives an optimum schedule of a tight graph. 



20A 



1 .k Scheduling of a Loose Graph 

Now let us extend the above algorithm so that it can handle a loose 
graph. To facilitate presentation, a few definitions are in order. 

(From now on by "a graph", a "loose graph" is to be understood. ) 

Definition 9:

A node N in a graph G for which d_f(N) + d_T(N) ≠ h(G) is called a loose node; otherwise a node is tight. Let N be a loose node. Then we define the far distance d_ℓ(N) as d_ℓ(N) = h(G) - d_T(N). A set B(i) is a node set (cf. Definition 4) such that

    B(i) = {N | (N is a tight node) and (d_f(N) = i)} ∪ {N | (N is a loose node) and (d_ℓ(N) = i)}.

We write Bt(i) and Bℓ(i) for the above two component sets respectively, i.e.,

    B(i) = Bt(i) ∪ Bℓ(i).

Note that a loose node N can be processed anywhere between d_f(N) and d_ℓ(N) without affecting the rest of a graph. Definition 9 says that loose nodes are pushed as far down as possible and are classified in terms of the far distance rather than the forward distance.

Figure 7.8. A Loose Node

Note that a loose node N thus receives two attributes, the far distance d_ℓ(N) and the forward distance d_f(N), and is classified in terms of d_ℓ(N); e.g., we say that a loose node N with the forward distance d_f(N) is in B(i), where i = d_ℓ(N).

We also define a set C(i).

Definition 10:

A set C(i) of loose nodes is defined as follows:

    C(i) = {N | N is a loose node and d_f(N) <= i < d_ℓ(N)}.

C(i) is the set of those loose nodes which may be processed in parallel with a node in B(i). If d_ℓ(N) = i, then N is put in B(i) rather than in C(i). C(i) is thus a pool of loose nodes which will be used to fill up otherwise wasted processor time if necessary.
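The classification is mechanical, and a small sketch may make it concrete. The representation below is ours, not the thesis's: the graph is a dict mapping each node to its successor list, d_f is the forward distance (1 for an input node), d_T the distance to the terminal level (0 for an output node), and the far distance is h(G) - d_T(N).

    # Sketch: compute the distances of Definition 9 and build B(i) and C(i).
    def classify(succ):
        nodes = set(succ) | {v for vs in succ.values() for v in vs}
        pred = {n: [] for n in nodes}
        for n in nodes:
            for v in succ.get(n, []):
                pred[v].append(n)
        df, dT = {}, {}
        def d_f(n):                       # forward distance, memoized
            if n not in df:
                df[n] = 1 + max((d_f(p) for p in pred[n]), default=0)
            return df[n]
        def d_T(n):                       # distance to the terminal level
            if n not in dT:
                dT[n] = 1 + max((d_T(s) for s in succ.get(n, [])), default=-1)
            return dT[n]
        h = max(d_f(n) for n in nodes)            # h(G)
        B = {i: [] for i in range(1, h + 1)}      # B(i) = Bt(i) ∪ Bℓ(i)
        C = {i: [] for i in range(1, h + 1)}      # C(i): loose fill-in pool
        for n in nodes:
            far = h - d_T(n)                      # far distance of n
            B[far].append(n)                      # tight nodes have far == d_f(n)
            for i in range(d_f(n), far):          # d_f(n) <= i < far
                C[i].append(n)
        return B, C, h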

Scheduling a loose graph is similar to scheduling a tight graph. A loose graph is scheduled in accordance with B(i) for i = 1, 2, ..., h(G). A loose node N, even though it is in B(d_ℓ(N)), may be scheduled with any B(k) where d_f(N) <= k <= d_ℓ(N). All tight nodes are scheduled first, and loose nodes are scheduled as late as possible. If using only tight nodes would inevitably waste processor time, loose nodes are used to fill the otherwise wasted time slots.

In what follows, B(i) plays a role similar to that of A(i) in the previous discussion. All definitions for A(i) are also applicable to B(i) with a few modifications.

Now the p-line relation (==) between two nodes N and N' in B_n^t is redefined as follows.






Definition 5':

Let N and N' be two nodes in B_n^t. N == N', i.e., (N, N') ∈ B_n^t(p), holds if either

(1) (i) N' is not a loose node (N may be a loose node),
    and (ii) d_f(N) (or d_ℓ(N) if N is a loose node) + 1 = d_f(N')
    (in other words, N ∈ B(t+k) and N' ∈ Bt(t+k+1), 0 <= k <= n-1),
    and (iii) N and N' are independent;

or (2) (i) N' is a loose node (N may be a tight node),
    and (ii) d_f(N') <= d_f(N) (or d_ℓ(N) if N is a loose node) < d_ℓ(N')
    (in other words, N' ∈ Bℓ(t+k) and N ∈ B(t+j), where d_f(N') <= t+j < t+k <= t+n). If d_f(N') < t, then the above inequality becomes t <= d_f(N) (or d_ℓ(N)) < d_ℓ(N').
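The case analysis above is mechanical enough to state as a predicate. The sketch below is ours: df and dfar are assumed maps from a node to its forward and far distance, and independent is the assumed test that neither node precedes the other.

    # Sketch of the test of Definition 5' for nodes N, N' in B_n^t.
    def p_line(N, Np, df, dfar, t, n, independent):
        loose_N, loose_Np = dfar[N] != df[N], dfar[Np] != df[Np]
        lev = dfar[N] if loose_N else df[N]   # level at which N is classified
        if not loose_Np:                      # case (1): N' is tight
            return lev + 1 == df[Np] and N != Np and independent(N, Np)
        lo = max(df[Np], t)                   # case (2): N' is loose; note the
        return lo <= lev < dfar[Np]           # printed (2) adds no independence clause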

Then L_n^t on B_n^t is defined similarly to Definition 6.

Example 4 (see Figure 7.9), where (i, j) abbreviates (N_i, N_j):

(2,5), (3,4), (4,8), (5,7) ∈ B_n^t(p) by (1). Since d_f(N_6) = t + 1 and d_ℓ(N_6) = t + 3, (2,6), (3,6), (4,6), (5,6) ∈ B_n^t(p) by (2).

Figure 7.9. The p-line Relation in B_n^t:
    B(t) = {N_1}
    B(t+1) = {N_2, N_3}
    B(t+2) = {N_4, N_5}
    B(t+3) = {N_6, N_7, N_8}, with Bt(t+3) = {N_7, N_8} and Bℓ(t+3) = {N_6}
    B(t+4) = {N_9}



(1) of the above definition resembles Definition 5 and takes care of tight nodes. As Lemmas 2, 3 and 4 showed, Definition 5 is more general than necessary: the p-line relation need only be defined between adjacent levels, i.e., A(i) and A(i+1). Definition 5' follows this simplification.

(2), on the other hand, takes care of loose nodes. If a loose node N is in B(i), then this means that d_ℓ(N) = i and N may be processed in parallel with any node N' ∈ B(j), where d_f(N) <= j <= i. Because of this addition, Lemmas 2, 3 and 4 no longer hold: in a graph containing a loose node, a pair may belong to B_k^t(p) by (2) of Definition 5' even though no pair of adjacent-level nodes does, so that there is no p-line (1) set on B_k^t(p). This, however, does not bother us. Assume that B_n^t is p-connectable, and let L_n^t = ((N_0, N_1'), (N_1, N_2'), ..., (N_i, N_{i+1}'), ..., (N_{n-1}, N_n')), where N_0 ∈ B(t) and N_n' ∈ B(t+n). Further assume that N_i ∈ B(t+a_i) and N_{i+1}' ∈ B(t+a_{i+1}). If a_i + 1 ≠ a_{i+1}, then by Definition 5', N_{i+1}' is a loose node and d_f(N_{i+1}') <= t + a_i. Thus N_{i+1}' can be processed in parallel with N_i without affecting the rest of the graph, i.e., we first process B(t+a_i) - {N_i} and then process {N_i, N_{i+1}'}. If a_i + 1 = a_{i+1}, then the old strategy can be applied, i.e., we first process B(t+a_i) - {N_i}, then {N_i, N_{i+1}'}, then B(t+a_{i+1}) - {N_{i+1}'}.

This leads us to modify Definition 8 as follows.






Definition 8':

Given B_k^t and L_k^t, where L_k^t = ((N_0, N_1'), (N_1, N_2'), ..., (N_i, N_{i+1}'), ..., (N_{k-1}, N_k')). Then by "p-line scheduling B_k^t by L_k^t" we mean the following scheduling.

(1) p-schedule B(t) - {N_0}.
(2) g = 1.
(3) p-schedule {N_{g-1}, N_g'}.
(4) Let N_{g-1} ∈ B(t+a_{g-1}) and N_g' ∈ B(t+a_g).
    (i) If a_{g-1} + 1 = a_g, then p-schedule B(t+a_g) - {N_g', N_g}.
    (ii) If a_{g-1} + 1 ≠ a_g, then p-schedule B(t+b) for b = a_{g-1}+1, a_{g-1}+2, ..., a_g - 1. Then p-schedule B(t+a_g) - {N_g', N_g}.
(5) g = g + 1. If g < k, then go to (3).
(6) p-schedule {N_{k-1}, N_k'}.
(7) p-schedule B(t+k) - {N_k'}.
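In program form Definition 8' is a short loop. The representation below is ours: B maps an absolute level to its node list, L is the p-line set as the list of pairs [(N_0, N_1'), ..., (N_{k-1}, N_k')], level[x] gives the level holding node x, and "p-scheduling" a set is modeled by emitting it.

    # Sketch of the p-line scheduling procedure of Definition 8'.
    def p_line_schedule(B, L, t, k, level):
        out = []
        out.append([x for x in B[t] if x != L[0][0]])          # (1): B(t) - {N_0}
        for g in range(1, k):                                  # (3)-(5)
            n_prev, n_pair = L[g - 1]                          # (N_{g-1}, N_g')
            n_g = L[g][0]                                      # N_g, from the next pair
            out.append([n_prev, n_pair])                       # (3)
            for b in range(level[n_prev] + 1, level[n_pair]):  # (4)(ii): in-between sets
                out.append(list(B[b]))
            out.append([x for x in B[level[n_pair]]
                        if x not in (n_pair, n_g)])            # (4): B(t+a_g) - {N_g', N_g}
        out.append(list(L[k - 1]))                             # (6): {N_{k-1}, N_k'}
        out.append([x for x in B[t + k] if x != L[k - 1][1]])  # (7): B(t+k) - {N_k'}
        return out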

Finally, in connection with Definition 8', we define the following.

Suppose that B_n^t is not p-connectable, i.e., there is no p-line set L_n^t on B_n^t. It is, however, possible that there is a p-line set L_s^t on B_s^t for some s, 0 < s < n.

Definition 11:

Given B_n^t which is not p-connectable, check whether B_s^t is p-connectable for s = 1, 2, ..., n-1. Let m be the smallest s such that B_s^t is p-connectable but B_{s+1}^t is not. We call m the maximum p-connectable distance, and L_m^t a maximum p-line set.

The following example illustrates the above definition.



Example 5:

Figure 7.10. An Example of the Maximum p-connectable Distance

Let us consider the p-connectability of the B_n^t of Figure 7.10. Its relation B_n^t(p) contains three pairs, yet B_n^t is not p-connectable. Since B_1^t is p-connectable but B_2^t is not, the maximum p-connectable distance in the above B_n^t is 1, and L_1^t, consisting of a single pair, is a maximum p-line set.






Using a technique similar to the one described in Section 7.5, we can check the p-connectability of B_s^t.

An algorithm to schedule a loose graph for two processors is now given. The algorithm resembles the one for a tight graph; the major difference lies in the treatment of loose nodes. Loose nodes are scheduled as late as possible, and are used only when it would otherwise become inevitable to waste processor time.

In what follows we modify the definitions of B(i) and C(i) so that they do not include those loose nodes which have already been scheduled.

Loose Graph Scheduling Algorithm

Step 1: t = 0.

Step 2: If t = h(G), then p-schedule all unscheduled nodes and stop; else go to Step 3.

Step 3: If |B(t)| is even, then
    (3-1) p-schedule B(t);
    (3-2) go to Step 7.

Step 4: |B(t)| is odd. Find B(t+1).

Step 5: If |SR(N)| = |B(t+1)| for every N ∈ B(t), then
    (5-1) Check C(t). If C(t) = ∅, then p-schedule B(t) and go to Step 7.
    (5-2) Otherwise pick N with the minimum d_ℓ(N) in C(t). (If there is more than one such N, pick any one.) Now p-schedule B(t) ∪ {N} and go to Step 7.

Step 6: There is N' ∈ B(t) such that |SR(N')| < |B(t+1)|.
    (6-1) If |B(t+1)| is odd, then
        (6-1-1) p-schedule B(t) - {N'}.
        (6-1-2) p-schedule {N'} ∪ {N''}, where N'' ∈ B(t+1) - SR(N').
        (6-1-3) p-schedule B(t+1) - {N''}.
        (6-1-4) Go to Step 7.
    (6-2) |B(t+1)| is even.
        (6-2-1) Find the smallest k greater than 1 such that |B(t+k)| is odd.
        (6-2-2) If there is no such k (i.e., we have checked up to B(h(G))), then p-schedule each B(i) (t <= i <= h(G)) individually and stop.
        (6-2-3) Else we have a set {B(t), B(t+1), ..., B(t+k)} where |B(t)| and |B(t+k)| are odd and the other |B(t+i)| are all even. Check whether B_k^t is p-connectable.
            (1) B_k^t is not p-connectable.
                (1-i) Find the maximum p-connectable distance in B_k^t. Let it be m.
                (1-ii) Check C(t+m). If C(t+m) = ∅, then p-schedule B(t), B(t+1), ..., B(t+k-1) individually. Let t = t+k-1. Go to Step 7.
                (1-iii) Otherwise let B'(t+m) = B(t+m) ∪ {N}, where N ∈ C(t+m) has the minimum d_ℓ(N). Let B'_m^t = B(t) ∪ B(t+1) ∪ ... ∪ B(t+m-1) ∪ B'(t+m). Then p-line schedule B'_m^t by a maximum p-line set L_m^t. Let t = t+m. Go to Step 7.
            (2) B_k^t is p-connectable.
                (2-i) Find a p-line set L_k^t.
                (2-ii) p-line schedule B_k^t by L_k^t.
                (2-iii) t = t + k.
                (2-iv) Go to Step 7.

Step 7: t = t + 1. Go to Step 2.
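A condensed sketch of the control skeleton follows. All of it is our own scaffolding, not the thesis's code: B and C map a level to its node list (as in the earlier sketches), h is h(G), sr(N) is the set SR(N) of B(t+1)-nodes related to N, dfar(N) gives the far distance, and the even-|B(t+1)| case of Step 6-2 is delegated to an assumed helper handle_even_run that applies (6-2-1) through (6-2-3) and returns the level at which to resume.

    # Sketch of Steps 1-5, 6-1 and 7 of the loose graph scheduling algorithm.
    def schedule_loose(B, C, h, sr, dfar, handle_even_run):
        out, done, t = [], set(), 0
        live = lambda ns: [x for x in ns if x not in done]
        def emit(ns):                                        # model of p-scheduling
            if ns:
                done.update(ns)
                out.append(list(ns))
        while t != h:                                        # Step 2
            Bt, B1 = live(B.get(t, [])), live(B.get(t + 1, []))
            if len(Bt) % 2 == 0:                             # Step 3
                emit(Bt)
            else:                                            # Step 4: |B(t)| odd
                short = [x for x in Bt if len(sr(x)) < len(B1)]
                if not short:                                # Step 5
                    pool = live(C.get(t, []))                # loose fill-in pool
                    emit(Bt + ([min(pool, key=dfar)] if pool else []))
                elif len(B1) % 2 == 1:                       # Step 6-1
                    n1 = short[0]
                    n2 = next(x for x in B1 if x not in sr(n1))
                    emit([x for x in Bt if x != n1])         # (6-1-1)
                    emit([n1, n2])                           # (6-1-2)
                    emit([x for x in B1 if x != n2])         # (6-1-3)
                else:                                        # Step 6-2
                    t = handle_even_run(t, emit, live)
            t += 1                                           # Step 7
        emit(live(B.get(h, [])))                             # Step 2: final sets
        return out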

Proof for the Algorithm:

That the above algorithm is optimum can be proved by an argument similar to the one used for the previous algorithm. For example, we can show easily from the previous lemmas that Steps 3, 5-1, 6-1, 6-2-2 and 6-2-3 (2) are optimum. It remains to show that Steps 5-2 and 6-2-3 (1) are optimum.

Lemma 11:

Step 5-2 is optimum.

Proof:

Suppose that we do not use N in C(t) to fill up otherwise wasted processor time, where N is the node with the minimum d_ℓ(N) in C(t). Saving N for later use, however, does not improve the situation, because a node cannot be used more effectively than to fill up wasted processor time anyway.

Also, the choice of N from C(t) is optimum. Suppose, for example, that N' with d_ℓ(N') > d_ℓ(N) is also in C(t), and consider those two nodes (see Figure 7.11). Whether we use N or N' to schedule with B(t) will make no difference to scheduling B(t+1), B(t+2), ..., B(d_ℓ(N)), because one of the two nodes remains available if necessary. But suppose we used N' with B(t). Then it is not possible to fill a later request which may arise when some B(u) (d_ℓ(N) + 1 <= u <= d_ℓ(N')) is scheduled, whereas if we used N with B(t), then we can fill that request. This proves the lemma.

Figure 7.11. An Illustration for Lemma 11

(Q.E.D.)



That Step 6-2-3 (1) is optimum is proved similarly. Now we have the following theorem.



Theorem 2:

The algorithm gives an optimum schedule of a loose graph.






7.5 Supplement

(1) An algorithm to establish A_n^t(p) on A_n^t is now discussed.

Let m = Σ_{i=0}^{n} |A(t+i)|, and let B be an m x m connection matrix where the first |A(t)| columns and rows are labeled by the nodes in A(t), the second |A(t+1)| columns and rows by the nodes in A(t+1), and so forth. An element b_ij of B is 1 if and only if N_i → N_j, where N_i and N_j are the labels of the i-th row and the j-th column.

Now define the multiplication of connection matrices as follows: let A = B x C, where A, B and C are m x m connection matrices; then a_ij = ∨_{k=1}^{m} (b_ik ∧ c_kj). Now compute B̄^m = ∨_{k=1}^{m} B^k. In B̄^m, an element b̄_ij = 1 implies that N_i ⇒ N_j, and b̄_ij = 0 implies that N_i and N_j may be done in parallel, i.e., (N_i, N_j) ∈ A_n^t(p). For example, for the graph in Figure 7.12 we have

A_2^t(p) = {(a,f), (a,g), (b,e), (b,h), (b,g), (c,d), (d,h), (d,j), (f,h), (g,h), (g,i)}.
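Computing the closure is a few lines in any language. The sketch below is ours, not part of the thesis: it builds the connection matrix from an edge list, ORs in the successive Boolean powers, and reads the independent cross-level pairs off the result. (Warshall's algorithm would of course produce the same closure in O(m^3) steps.)

    # Sketch: reachability closure by Boolean matrix powers, then A_n^t(p).
    def independent_pairs(names, levels, edges):
        m = len(names)
        idx = {n: i for i, n in enumerate(names)}
        B = [[0] * m for _ in range(m)]
        for u, v in edges:                        # b_ij = 1 iff N_i -> N_j
            B[idx[u]][idx[v]] = 1
        closure = [row[:] for row in B]           # accumulates B OR B^2 OR ...
        power = [row[:] for row in B]
        for _ in range(m - 1):                    # form B^2, B^3, ..., B^m
            power = [[int(any(power[i][k] and B[k][j] for k in range(m)))
                      for j in range(m)] for i in range(m)]
            closure = [[closure[i][j] | power[i][j] for j in range(m)]
                       for i in range(m)]
        return {(names[i], names[j])
                for i in range(m) for j in range(i + 1, m)
                if levels[names[i]] != levels[names[j]]
                and closure[i][j] == 0 and closure[j][i] == 0}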

(2) Given A_n^t(p) for A_n^t, an algorithm to find p-connectivity and L_n^t(1) is described.

According to Lemma 4, if A_n^t is p-connectable, then there is a p-line (1) set L_n^t(1). Thus, to check p-connectability it is enough to examine whether there is a p-line (1) set L_n^t(1).







(a) 





a 


b 


c 


d 


e 


f 


g 


h 


i 


J 


a 








1 


1 












b 








1 




1 










c 










1 


1 


l 








d 


















1 




e 
















1 


1 


1 


f 


















1 


1 


g 




















1 


h 






















i 






















•i 

























a 


b 


c 


d 


e 


f 


e 






,1 


a 








1 


1 








i 


i 


b 








1 




1 








i 


c 










1 


1 


1 


1 




i 


d 


















i 




e 




















i 


f 


















• 


i 


g 




















i 


h 
















\ 


K — 




i 


















\ 


. 


J 




















\ 



(b) B 



(c) B c 



t/t, 



Figure 7.12. An Example for A ( ) 






First let

    A_n^t(p)(1) = A_n^t(p) - {(N, N') | |d_f(N) - d_f(N')| > 1}.

Now we construct a graph Ā_n^t as follows.

(1) Ā_n^t has the following nodes:
    (i) an initial node N_s,
    (ii) a terminal node N_T,
    (iii) all nodes in A_n^t,
    (iv) for each node N (N ∈ A_n^t, N ∉ A(t), N ∉ A(t+n)) a new duplicate node N'.

(2) Ā_n^t has the following edges:
    (i) an edge from N_s to every node N which was in A(t),
    (ii) an edge from every node N which was in A(t+n) to N_T,
    (iii) if (N_1, N_2) ∈ A_n^t(p)(1), an edge to N_2 from the duplicate N_1' (from N_1 itself when N_1 ∈ A(t) and hence has no duplicate),
    (iv) for every A(t+k), 1 <= k <= n-1, and for every N ∈ A(t+k), an edge from N to every N' which is a duplicate of some N'' ∈ A(t+k) not identical to N.

To illustrate the above definition, let us consider the following example.



Example:

Let A_2^t = {a, b, c, d, e, f, g}, where a, b ∈ A(t), c, d, e ∈ A(t+1) and f, g ∈ A(t+2). Further let

    A_2^t(p)(1) = {(b,c), (d,f), (d,g), (c,f)}.

Figure 7.13. An Example for p-connectivity Discovery

Then Ā_2^t has the nodes {N_s, N_T, a, b, c, d, e, f, g (the nodes of A_2^t), c', d', e' (duplicates of the nodes in A(t+1))} and the edges {(N_s,a), (N_s,b), (f,N_T), (g,N_T), (b,c), (d',f), (d',g), (c',f) (the edges arising from A_2^t(p)(1)), (c,d'), (c,e'), (d,c'), (d,e'), (e,c'), (e,d') (the edges from a node N to a duplicate N' of an N'' not identical to N)}.

Now it is clear that A_n^t has a p-line (1) set L_n^t(1) if and only if there is a path from N_s to N_T in Ā_n^t. There is a well-known algorithm for path finding, e.g., [19].
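Concretely, the whole test is one graph search. The following sketch is ours and follows the construction of Ā_n^t above; its representation (level numbers 0 through n, pairs listed with the earlier-level node first, primed names for duplicates) is an assumption of the sketch.

    # Sketch: build A-bar and breadth-first-search for a path from N_s to N_T.
    from collections import deque

    def p_connectable(nodes, level, pairs, n):
        adj = {x: [] for x in ['S', 'T'] + nodes
               + [x + "'" for x in nodes if 0 < level[x] < n]}
        for x in nodes:
            if level[x] == 0:
                adj['S'].append(x)                      # (2)(i)
            if level[x] == n:
                adj[x].append('T')                      # (2)(ii)
            if 0 < level[x] < n:                        # (2)(iv): switch edges
                for y in nodes:
                    if y != x and level[y] == level[x]:
                        adj[x].append(y + "'")
        for u, v in pairs:                              # (2)(iii): pair edges
            src = u + "'" if 0 < level[u] < n else u
            adj[src].append(v)
        seen, q = {'S'}, deque(['S'])
        while q:
            x = q.popleft()
            if x == 'T':
                return True
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    q.append(y)
        return False

On the example above, p_connectable(list('abcdefg'), {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 1, 'f': 2, 'g': 2}, [('b', 'c'), ('d', 'f'), ('d', 'g'), ('c', 'f')], 2) finds the path N_s, b, c, d', f, N_T and returns True; the path spells out the p-line (1) set ((b, c), (d, f)).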






8. CONCLUSION 

This thesis introduced new techniques to expose hidden parallelism in a program. These techniques made extensive use of one of the fundamental arithmetic laws, the distributive law. Furthermore, it was suggested that with the help of these techniques the computation of a program might be speeded up logarithmically, in the sense that computation time became a logarithmic function of the number of single variable occurrences in a program rather than a linear one. Even though the discussions were based on an ILLIAC IV type machine, they are, as mentioned before, readily applicable to pipeline machines such as the CDC STAR.

Chapter 2 of the thesis studied the parallel computation of summations, powers and polynomials. The minimum time to evaluate summations or powers, as well as the minimum number of PE's required to attain it, was given. A scheme which computed a polynomial in parallel in less time than any previously known scheme was also introduced. Because of its simplicity of scheduling, the k-th order Horner's rule for parallel polynomial computation was studied in detail. It was shown that for this algorithm the availability of more PE's sometimes increased the computation time, since the algorithm forced all PE's to participate in the computation.

Chapter 3 presented an algorithm which reduced the tree height of an arithmetic expression by distribution. The algorithm worked from the innermost parenthesis pair to the outermost one and scanned an arithmetic expression only once. A measure of the height of the minimum height tree for an arithmetic expression was given as a function of the depth of parenthesis pair nesting and the number of single variable occurrences in it.

Chapter 4 extended the above idea to cover a sequence of arithmetic expressions. It was shown that by replacing a sequence of arithmetic expressions with a single arithmetic expression by back substitution, the computation could be speeded up logarithmically for a certain class of iteration formulas, e.g., x_{i+1} := a x x_i + b. The chapter also showed that parallel computation is in general more favorable than sequential computation in terms of round-off error, and that distribution does not introduce a significant amount of round-off error.

Chapter 5 studied inter-statement parallelism as an introduction to the following chapter. An algorithm was given which checked whether the execution of the statements of a program in some sequence gave the same results as their execution in the given sequence. The algorithm was new in the sense that it prevented variables from being updated before they were used; this had not been taken into account by previous work. Also, a technique which exploited more parallelism between statements by introducing temporary locations was introduced.

Chapter 6 presented an algorithm which checked whether a statement in a loop could be executed simultaneously for all values of the loop index. The algorithm examined only the index expressions and the way the values of the indices varied, and did not require the loop to be replaced with a sequence of statements. In case a statement in a loop could not be executed in parallel with respect to the loop index as it stood, the algorithm "skewed" the computation of the statement with respect to the loop index so that it could be executed in parallel for all values of the index. Also, to expose hidden parallelism in a loop, the replacement of a loop with several loops was discussed.

A solution for the equally-weighted, two-processor scheduling problem was given in Chapter 7. The only practical result so far obtained had been one for scheduling a rooted tree of equally weighted tasks on k identical processors; the solution given in Chapter 7 schedules a graph of equally weighted tasks on two identical processors. If we consider common expressions in an arithmetic expression, then we obtain a graph of operations rather than a tree for the expression, and the scheduling algorithm is readily applicable for scheduling that graph on P(2).

Suggestions for further research have been given in several places throughout the thesis and need not be repeated here. We conclude by giving two possible extensions that deserve brief mention.

(1) The design of a better machine.

Even though we assumed that a PE can communicate with any other PE instantaneously, this may not be the case in reality, because it is costly and impractical to provide data paths between every pair of PE's. Hence it is necessary to design a PE interconnection which is economical yet powerful enough to simulate the idealized interconnection assumed above [25], [26].

(2) Generalization of the ideas given in this thesis.

The three laws of arithmetic were utilized in this thesis for parallelism exploitation. We should, however, pay more attention to these laws even in terms of serial computation. For example, suppose an arithmetic expression involving matrices and row and column vectors is given. Then by appropriate application of the associative law, the number of multiplications required may be reduced drastically.
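For instance, if A and B are n x n matrices and v is a column vector, evaluating (A x B) x v requires n^3 + n^2 multiplications, whereas the associatively regrouped A x (B x v) requires only 2n^2.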






LIST OF REFERENCES

[1] Abrahams, P. W., "A Formal Solution to the Dangling else of ALGOL 60 and Related Languages", Comm. ACM, 9 (September, 1966), pp. 679-682.

[2] Abel, N. E., et al., "TRANQUIL: A Language for an Array Processing Computer", Proc. of the Spring Joint Computer Conference (1969), pp. 57-73.

[3] Naur, P., et al., "Revised Report on the Algorithmic Language ALGOL 60", Comm. ACM, 6 (January, 1963), pp. 1-17.

[4] Allard, R. W., Wolf, K. A. and Zemlin, R. A., "Some Effects of the 6600 Computer on Language Structures", Comm. ACM, 7 (February, 1964), pp. 112-119.

[5] Baer, J. L., "Graph Models of Computations in Computer Systems", Ph.D. Dissertation, University of California, Los Angeles, Report No. 68-46 (October, 1968).

[6] Baer, J. L. and Bovet, D. P., "Compilation of Arithmetic Expressions for Parallel Computations", Proc. of IFIP Congress (1968), pp. 340-346.

[7] Barnes, G. H., et al., "The Illiac-IV Computer", IEEE Trans. of Computers, C-17 (August, 1968), pp. 746-757.

[8] Beightler, C. S., et al., "A Short Table of z-Transforms and Generating Functions", Operations Research, 9 (July-August, 1961), pp. 574-578.

[9] Bingham, H. W., Reigel, W. E. and Fisher, D. A., "Control Mechanisms for Parallelism in Programs", Burroughs Corporation, ECOM-02463-7 (October, 1968).

[10] Bingham, H. W. and Reigel, W. E., "Parallelism Exposure and Exploitation in Digital Computing Systems", Burroughs Corporation, ECOM-02463-F (June, 1969).

[11] Breuer, M. A., "Generation of Optimal Code for Expressions via Factorization", Comm. ACM, 12 (June, 1969), pp. 333-340.

[12] "Newsdata", Computer Decision (March, 1970), p. 2.

[13] Conway, M. E., "A Multiprocessor System Design", Proc. of the Fall Joint Computer Conference (1963), pp. 139-146.

[14] Conway, R. W., Maxwell, W. L. and Miller, L. W., Theory of Scheduling, Addison-Wesley Publishing Company, Inc., New York (1967).

[15] Dorn, W. S., "Generalizations of Horner's Rule for Polynomial Evaluation", IBM Journal of Research and Development, 6 (April, 1962), pp. 239-245.

[16] Estrin, G., "Organization of Computer Systems--the Fixed plus Variable Structure Computer", Proc. of the Western Joint Computer Conference (May, 1960), pp. 33-40.

[17] Gold, D. E., "A Model for Linear Programming Optimization of I/O-Bound Programs", M.S. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 340 (June, 1969).

[18] Graham, W. R., "The Parallel and the Pipeline Computers", Datamation (April, 1970), pp. 68-71.

[19] Harary, F., Norman, R. Z. and Cartwright, D., Structural Models: An Introduction to the Theory of Directed Graphs, John Wiley and Sons, Inc., New York (1965).

[20] Hellerman, H., "Parallel Processing of Algebraic Expressions", IEEE Trans. of Electronic Computers, EC-15 (January, 1966), pp. 82-91.

[21] Hu, T. C., "Parallel Sequencing and Assembly Line Problems", Operations Research, 9 (November-December, 1961), pp. 841-848.

[22] Knowls, M., et al., "Matrix Operations on Illiac-IV", Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 222 (March, 1967).

[23] Knuth, D. E., The Art of Computer Programming, Vol. 2, Addison-Wesley Publishing Company, Inc., New York (1969).

[24] Kuck, D. J., "Illiac-IV Software and Application Programming", IEEE Trans. of Computers, C-17 (August, 1968), pp. 758-769.

[25] Kuck, D. J. and Muraoka, Y., "A Machine Organization for Arithmetic Expression Evaluation and an Algorithm for Tree Height Reduction", unpublished (September, 1969).

[26] Kuck, D. J., "A Preprocessing High Speed Memory System", to be published.

[27] Logan, J. R., "A Design Technique for Digital Squaring Networks", Computer Design (February, 1970), pp. 84-88.

[28] Minsky, M. L., Computation: Finite and Infinite Machines, Prentice-Hall, Inc., New Jersey (1967).

[29] Motzkin, T. S., "Evaluation of Polynomials and Evaluation of Rational Functions", Bull. A.M.S., 61 (1955), p. 163.

[30] Murtha, J. C., "Highly Parallel Information Processing Systems", in Advances in Computers, Academic Press, Inc., New York, 7 (1966), pp. 2-116.

[31] Muntz, R. R. and Coffman, E. G., "Optimal Preemptive Scheduling on Two-Processor Systems", IEEE Trans. of Computers, C-18 (November, 1969), pp. 1014-1020.

[32] Nievergelt, J., "Parallel Methods for Integrating Ordinary Differential Equations", Comm. ACM, 7 (December, 1964), pp. 731-733.

[33] Noyce, R. N., "Making Integrated Electronics Technology Work", IEEE Spectrum, 5 (May, 1968), pp. 63-66.

[34] Ostrowski, A. M., "On Two Problems in Abstract Algebra Connected with Horner's Rule", in Studies in Mathematics and Mechanics Presented to R. von Mises, Academic Press, New York (1954), pp. 40-48.

[35] Pan, V. Ya., "Methods of Computing Values of Polynomials", Russian Mathematical Surveys, 21 (January-February, 1966), pp. 105-136.

[36] Ramamoorthy, C. V. and Gonzalez, M. J., "A Survey of Techniques for Recognizing Parallel Processable Streams in Computer Programs", Proc. of the Fall Joint Computer Conference (1969), pp. 1-15.

[37] Russel, E. C., "Automatic Program Analysis", University of California, Los Angeles, Report No. 69-72 (March, 1969).

[38] Shedler, G. S. and Lehman, M. M., "Evaluation of Redundancy in a Parallel Algorithm", IBM Systems Journal, 6, 3 (1967), pp. 142-149.

[39] Squire, J. S., "A Translation Algorithm for a Multiple Processor Computer", Proc. of the 18th ACM National Conference (1963).

[40] Stone, H. S., "One-Pass Compilation of Arithmetic Expressions for a Parallel Processor", Comm. ACM, 10 (April, 1967), pp. 220-223.

[41] Thompson, R. N. and Wilkinson, J. A., "The D825 Automatic Operating and Scheduling Program", in Programming Systems and Languages, McGraw-Hill, New York (1967), pp. 647-660.

[42] Winograd, S., "On the Time Required to Perform Addition", JACM, 12 (April, 1965), pp. 277-285.

[43] Winograd, S., "The Number of Multiplications Involved in Computing Certain Functions", Proc. of IFIP Congress (1968), pp. 276-279.






VITA

Yoichi Muraoka was born in Sendai, Japan, on July 20, 1942. He graduated from Waseda University, Tokyo, Japan, in Electrical Engineering in March, 1965, and started his graduate study at the graduate college of Waseda University.

Since September 1966 he has been a research assistant with the Illiac IV computer project in the Department of Computer Science of the University of Illinois at Urbana-Champaign. In 1969 he received the degree of Master of Science in Computer Science.

He is a member of the Association for Computing Machinery and the Institute of Electrical and Electronics Engineers.


