Report No. 424
PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS
by
Yoichi Muraoka
February, 1971
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS
Digitized by the Internet Archive
in 2013
http://archive.org/details/parallelismexpos424mura
PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS
BY
YOICHI MURAOKA
B.Eng., Waseda University, 1965
M.S., University of Illinois, 1969
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1971
Urbana, Illinois
ACKNOWLEDGEMENT
The author would like to express his deepest gratitude to
Professor David J. Kuck of the Department of Computer Science of the
University of Illinois, whose encouragement and good advice have led
this work to its successful completion. Paul Kraska also read the
thesis and provided valuable comments.
Special thanks should go to Mrs. Linda Bridges, without whose
excellent typing the final form would never have come out.
Thanks are also extended to Mrs. Diana Mercer, who helped in getting
the thesis finished on time.
TABLE OF CONTENTS
1. INTRODUCTION
2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS
2.1 Introduction
2.2 Summation of n Numbers
2.3 Computation of Powers
2.4 Computation of Polynomials
2.4.1 Computation of Polynomial on an Arbitrary Size Machine
2.4.1.1 kth Order Horner's Rule
2.4.1.2 Estrin's Method
2.4.1.3 Tree Method
2.4.1.4 Folding Method
2.4.1.5 Comparison of Four Methods
2.4.2 Polynomial Computation by the kth Order Horner's Rule
3. TREE HEIGHT REDUCTION ALGORITHM
3.1 Introduction
3.2 Tree Height and Distribution
3.3 Holes and Spaces
3.3.1 Introduction
3.3.2 Holes
3.3.3 Space
3.4 Algorithm
3.4.1 Distribution Algorithm
3.4.2 Implementation
3.5 Discussion
3.5.1 The Height of a Tree
3.5.2 Introduction of Other Operators
3.5.2.1 Subtraction and Division
3.5.2.2 Relational Operators
4. COMPLETE PROGRAM HANDLING
4.1 Back Substitution - A Block of Assignment Statements and an Iteration
4.2 Loops
4.3 Jumps
4.4 Error Analysis
5. PARALLELISM BETWEEN STATEMENTS
5.1 Program
5.2 Equivalent Relations Between Executions
6. PARALLELISM IN PROGRAM LOOPS
6.1 Introduction
6.1.1 Replacement of a for Statement with Many Statements
6.1.2 A Restricted Loop
6.2 A Loop With a Single Body Statement
6.2.1 Introduction
6.2.2 Type 1 Parallelism
6.2.2.1 General Case
6.2.2.2 A Restricted Loop
6.2.2.3 Temporary Locations
6.2.3 Type 2 Parallelism
6.2.4 Conclusion
6.3 A Loop With Many Body Statements
6.3.1 Introduction
6.3.2 Parallel Computation with Respect to a Loop Index
6.3.3 Separation of a Loop
6.3.3.1 Introduction
6.3.3.2 The Ordering Relation (e ) and Separation of Loop
6.3.3.3 Temporary Storage
6.3.4 Parallelism Between Body Statements
6.3.4.1 Introduction
6.3.4.2 The Statement Dependence Graph and the Algorithm
6.3.5 Discussion
7. EQUALLY WEIGHTED-TWO PROCESSOR SCHEDULING PROBLEM
7.1 Introduction
7.2 Job Graph
7.3 Scheduling of a Tight Graph
7.4 Scheduling of a Loose Graph
7.5 Supplement
8. CONCLUSION
LIST OF REFERENCES
VITA
LIST OF TABLES
Table
2.1. The Parallel Computation Time for Summation, Power and Polynomial
2.2. The Number of Steps Required to Compute $\sum_{i=1}^{n} a_i$ on P(m), $h^a(m,n)$, for $n \le 10$
2.3. Computation of $p_n(x)$ by Folding Method
2.4. The Number of Steps Required to Compute $p_n(x)$, $h^p(m,n)$, for $n \le 10$
4.1. Comparison of Back Substituted, y , and Non-Back Substituted Computation, y . Iteration Formulas
4.2. Comparison of Back Substituted, y , and Non-Back Substituted Computation, y . General Cases
LIST OF FIGURES
Figure
1.1. Statement Dependence Relation
1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d)
1.3. Trees for a + b × c + d and b × c + a + d
1.4. Trees for a(bcd+e) and abcd + ae
2.1. The Minimum Number, M, of PE's Required to Add Numbers in the Minimum Time
2.2. Computation of x (1)
2.3. Computation of x (2)
2.4. Computation of x (3)
2.5. Computation of x (4)
2.6. Computation of $a_i x^i$
2.7. A Tree for p . . (x)
2.8. A Tree for p + .(x)
2.9. Comparison of the Four Parallel Polynomial Computation Schemes
2.10. kth Order Horner's Rule
2.11. The Number of Steps, $h^p(m,n)$, to Compute $p_n(x)$ on P(m) by the mth Order Horner's Rule
2.12. The Minimum Number, M, of PE's Required to Compute $p_n(x)$ in the Minimum Time
3.1. An Arithmetic Expression Tree (1)
3.2. An Arithmetic Expression Tree (2)
3.3. Free Nodes
3.4. Free Nodes in a Tree
3.5. An Example of F. and F„
3.6. Elimination of a Free Node
3.7. A Minimum Height Tree
3.8. Attachment of T[t'] to a Free Node
3.9. An Example of Space (1)
3.10. An Example of Space (2)
3.11. Distribution of t' over A
3.12. Tree Height Reduction by Hole Creation
3.13. Stacks for an Arithmetic Expression
4.1. A Back Substituted Tree
4.2. Loop Analysis
4.3. A Tree with a Boolean Expression
4.4. Trees for a(bc+d) + e and abc + ad + e
5.1. Conditions for the Output Equivalence
6.1. E Q
6.2. E[ Ij
6.3. Conditions of Parallel Computation in a Loop
6.4. An Illustration of t
6.5. Wave Front
6.6. Wave Front Travel
6.7. An Illustration for Theorem 4
6.8. An Execution by a Wave Front
6.9. Simultaneous Execution of Body Statements
6.10. Execution of P B
6.11. An Introduction of Temporary Locations
6.12. Wave Front for Simultaneous Execution of Body Statements
6.13. A Wave Front for Example 10
7.1. Computation of Nondistributed and Distributed Arithmetic Expressions on P(2)
7.2. Common Expression
7.3. A Loose Graph and a Tight Graph
7.4. A Graph G
7.5. An Illustration for Lemma 3
7.6. An Example of a Tight Graph Scheduling
7.7. An Illustration for Lemma 11
7.8. A Loose Node
7.9. The p-line Relation in B
7.10. An Example of the Maximum p-connectable Distance
7.11. An Illustration for Lemma 13
7.12. An Example for A ()
7.13. An Example for p-connectivity Discovery
1. INTRODUCTION
1.1 Introduction
The purpose of this research is to study compiling techniques for
parallel processing machines.
Due to remarkable innovations of technology today such as the introduction
of LSI, it has become feasible to introduce more hardware into computer
systems to attain otherwise impossible high speeds. For example Winograd [42]
showed that the minimum amount of time required to add two t bit numbers is
$\lceil \log_2 t \rceil d$ ($\lceil x \rceil$ denotes the smallest integer not smaller than x), where we assume
that an adder consists of two input binary logic elements, e.g. AND or OR gates,
and d is a delay time per gate. An adder which realizes this speed requires a
huge number of gates, e.g. approximately 1300 gates for t = 6 [12], and it has
been out of the question to build such an adder. However, the introduction of
LSI has reduced the cost of a gate significantly, e.g. it has been anticipated
that by 1974 the cost of LSI would be reduced to 0.7 cent per gate [33]. Another
example is a class of parallel processing machines. The Illiac IV [7], the CDC
6600 [4] and the D825 [41] are included in this class. A machine in this class
has e.g. many arithmetic units to allow simultaneous execution of arithmetic
operations. As an extreme case it has been suggested to include special
arithmetic units, e.g. a log taking unit (ln x) and an exponent unit (x ) [16]. (Such
being the case, this decade may be marked as a "computer architecture" race,
reminiscent of the cycle-time and multiprogramming races of the 60's [12].) We
shall not go into the details of machines further. An extensive survey of
parallel processing machines is found in [30].
Having a parallel processing machine which is capable of processing many
operations simultaneously, we are faced with the problem of exploiting parallelism
from a program so that computational resources are kept as busy as possible to
process the program in the shortest time. We now discuss the problem in detail.
In this thesis, by a parallel (processing) machine P we understand a set
of arbitrarily many identical elements called processing elements (PE). A PE is
assumed to be capable of performing any binary arithmetic operation, e.g.
addition and multiplication, in the same amount of time. Furthermore we assume
that data can be transferred between any PE's instantaneously. Also we write P(m)
if P has only m PE's. A machine of this nature may be considered as a
generalization of the Illiac IV.
To date two types of parallelism exploitation techniques are known to
compile a program written in a conventional programming language (e.g. ALGOL) for
the parallel processing machine [36]. They may be termed intrastatement
parallelism and interstatement parallelism exploitation techniques. The first
technique is to analyze the parallelism which exists within a statement, e.g. an
arithmetic expression, and this has been explored by Stone [40], Squire [39],
Hellerman [20], and Baer and Bovet [6]. For example consider the arithmetic
expression:
a + b + c + d + e + f + g + h
and a syntactic tree for it:
[Figure: a syntactic tree for the expression, with levels 1 to 3 marked]
The tree is such that operations on the same level may be done in parallel. The
height of a tree is the maximum level of the tree and indicates the number of
steps required to evaluate an arithmetic expression in parallel. Note that there
may be many different syntactic trees for an arithmetic expression, and among
them the tree with the minimum height should be chosen to attain the minimum
parallel computation time. Baer and Bovet's algorithm is claimed to achieve this
end, i.e. build the minimum height syntactic tree for an arithmetic expression [6].
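The relation between tree shape and step count can be illustrated with a minimal Python sketch. It assumes an expression made of a single associative operator over n operands; the function names are hypothetical and are not taken from the thesis or from [6]. It compares a balanced tree, where adjacent pairs are combined on each level, with a left-to-right tree.

```python
def balanced_tree_height(n):
    """Levels needed when adjacent operands are paired on every level."""
    steps = 0
    while n > 1:
        n = (n + 1) // 2  # each level roughly halves the number of partial results
        steps += 1
    return steps


def left_to_right_height(n):
    """Levels needed for the tree (((a + b) + c) + ...): one operation per level."""
    return n - 1


# Eight operands, as in a + b + c + d + e + f + g + h.
balanced = balanced_tree_height(8)
sequential = left_to_right_height(8)
```

For eight operands the balanced tree has three levels, matching the three levels marked in the figure, while the left-to-right tree has seven.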
Exploitation of interstatement parallelism has also been studied [10],
[37]. An outcome of these works is an algorithm (the dependence relation
detection algorithm [10]) which detects the dependence relation between statements in
a loop- and jump-free sequence of statements. The dependence relation between
S and S' holds if S precedes S' in a sequence and S' uses the output of S as an
input. For example the algorithm detects that the statement S1 in Figure
1.1 must be computed before the statement S2, but it may be computed
simultaneously with S3.
S1: x := f1(y);
S2: u := f2(x);
S3: v := f3(w);
S4: z := f4(v,u);
(a) program (b) dependence relation
Figure 1.1. Statement Dependence Relation
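As a rough illustration of the dependence relation just defined, the following Python sketch records, for the four statements of Figure 1.1, which earlier statement produces an input of a later one. It is not the detection algorithm of [10]; it assumes every variable is assigned at most once, so only the "uses the output of" relation matters, and the data layout and function names are hypothetical.

```python
# Each statement: (label, output variable, input variables), in program order.
program = [
    ("S1", "x", ["y"]),
    ("S2", "u", ["x"]),
    ("S3", "v", ["w"]),
    ("S4", "z", ["v", "u"]),
]


def dependence_edges(stmts):
    """Pairs (S, S') where S precedes S' and S' reads the output of S."""
    edges = []
    for i, (later, _out, inputs) in enumerate(stmts):
        for earlier, produced, _ins in stmts[:i]:
            if produced in inputs:
                edges.append((earlier, later))
    return edges


def earliest_steps(stmts):
    """Earliest parallel step of each statement, given the dependence edges."""
    edges = dependence_edges(stmts)
    step = {}
    for label, _out, _ins in stmts:
        preds = [a for a, b in edges if b == label]
        step[label] = 1 + max((step[p] for p in preds), default=0)
    return step


edges = dependence_edges(program)
steps = earliest_steps(program)
```

Under these assumptions S1 and S3 share the first step, S2 follows S1, and S4 waits for both S2 and S3.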
Since in a real program the major part of the execution time is spent within loops
if it is executed sequentially, the major effort should be directed toward
detecting interstatement parallelism in loops. For example we would like to find
out that all fifty statements, A[1] := f(B[1]), ..., A[50] := f(B[50]), in a loop
E: for I := 1 step 1 until 50 do
A[I] := f(B[I])
may be executed simultaneously to reduce the computation time to one fiftieth of
the original. A technique available now which detects interstatement parallelism
inside a loop requires a loop to be first replaced with (expanded to) a sequence
of statements, e.g. E in the above example must be replaced with the sequence of
fifty statements, A[1] := f(B[1]), ..., A[50] := f(B[50]), so that the dependence
relation detection algorithm can be applied [10]. Obviously this approach
obscures an advantage of the introduction of loops into a program because
essentially all loops are required to be removed from a program and replaced with
straight-line programs so that the dependence relation detection algorithm can be
applied to them.
The techniques described above find parallelism inside and between
statements as they are presented. If the size of a machine (i.e. the number of
PE's) is unlimited, however, then it becomes necessary to exploit more parallelism
from a program than the above approaches provide. One obvious strategy is to
write a completely new program using e.g. parallel numerical methods [52], [38].
The other approach, which we will pursue here, is to transform a given program to
"squeeze" more parallelism from it. While the first approach requires programmers
(or users) to reanalyze problems and reprogram, the second approach tries to
accept existing sequential programs written in e.g. ALGOL and execute them in
parallel. First we study parallel computation of an arithmetic expression more
carefully along this line.
For the sake of argument let us assume that an arithmetic expression
consists of additions, multiplications and possibly parentheses. Then the
associative, the commutative and the distributive laws hold. The first and second
laws have been already used to exploit more parallelism from an arithmetic
expression. For example the associative law allows one to compute the arithmetic
expression a + b + c + d as ((a + b) + (c + d)) in two steps rather than as
(((a + b) + c) + d) which requires three steps.
Figure 1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d)
Also it has been recognized that the commutative law together with the associative
law gives a lower height tree. For example ((a + b × c) + d) requires three steps
while (b × c + (a + d)) requires two [39].
Figure 1.3. Trees for a + b × c + d and b × c + a + d
Now we turn our interest to the third law, i.e. the distributive law, and see if
it can help speed up computation. As we can readily see there are cases when
distribution helps. For example a(bcd + e) requires four steps while its
equivalent abcd + ae, which is obtained by distributing a over bcd + e, can be
computed in three steps.
Figure 1.4. Trees for a(bcd+e) and abcd + ae
However, distribution does not necessarily always speed up computation. For
example the undistributed form ab(c+d) can be computed in fewer steps than the
distributed form abc + abd. Hence indiscriminate distribution is not the
solution to the problem. Chapter 3 of this thesis studies this situation and
gives an algorithm which we call the distribution algorithm. Given an arithmetic
expression A the distribution algorithm derives the arithmetic expression $A^d$ by
distributing multiplications over additions properly so that the height of $A^d$
(we write $h[A^d]$ for this) is minimized. The algorithm works from the innermost
parenthesis level to the outermost parenthesis level of an arithmetic expression
and requires only one scan through the entire arithmetic expression. Chapter 3
concludes by giving a measure of the height of the minimum height tree for A as
well as $A^d$ as a function of fundamental values such as the number of single
variable occurrences in A.
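The step counts quoted for a(bcd + e), abcd + ae, ab(c+d) and abc + abd can be reproduced with a short Python sketch. It is not the distribution algorithm of Chapter 3; it only computes the height of a tree once the form of the expression is fixed, assuming that operands of one associative and commutative operator may be combined in any order, and it uses the greedy rule of always joining the two lowest subtrees. The helper name `combine` and the constant `VAR` are hypothetical.

```python
import heapq


def combine(heights):
    """Height of the lowest tree joining subtrees of the given heights with one
    associative, commutative binary operator (greedy: join the two lowest)."""
    heap = list(heights)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + 1)
    return heap[0]


VAR = 0  # a single variable is a tree of height 0

# a(bcd + e): product bcd, then the sum with e, then the product with a.
undistributed = combine([VAR, combine([combine([VAR] * 3), VAR])])
# abcd + ae: two products, then one sum.
distributed = combine([combine([VAR] * 4), combine([VAR] * 2)])

# ab(c + d) versus abc + abd.
factored = combine([VAR, VAR, combine([VAR, VAR])])
expanded = combine([combine([VAR] * 3), combine([VAR] * 3)])
```

With these definitions distribution lowers the first expression from four steps to three, while it raises the second from two steps to three, which is why the choice has to be made selectively.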
The idea is extended to handle a sequence of assignment statements in
Chapter 4. The distribution algorithm is applied on the arithmetic expression
which is obtained by back-substituting one statement into another.
Suppose we have a sequence of n assignment statements $A_1, A_2, \ldots, A_n$ and we get
the assignment statement A from this sequence by back substitution. If the
sequence is computed sequentially, i.e. one statement after another, but each
statement is computed in parallel, then it will take $h[A_1] + h[A_2] + \cdots + h[A_n]$
steps to compute the sequence (where $h[A_i]$ is the height of the minimum height
tree for $A_i$). Instead we may compute the back substituted statement A in
parallel, which requires h[A] steps. Obviously $h[A_1] + \cdots + h[A_n] \ge h[A]$ holds.
Chapter 4 discusses cases when the strict inequality in the above relation holds.
The cases include iteration formulas such as $x_{i+1} := a \times x_i + b$.
Next we study interstatement parallelism in terms of program loops.
Chapter 6 first establishes a new algorithm which detects interstatement
parallelism in a loop. The algorithm is such that it only examines index
expressions and the way index values vary in a loop to detect parallel
computability. For example the algorithm checks index expressions I and I + 1 as well as
the clause "I := 1 step 1 until 20" in the loop
for I := 1 step 1 until 20 do
A[I] := A[I+1] + B
and detects that all twenty statements, A[1] := A[2] + B, ..., A[20] := A[21] + B,
may be computed simultaneously. Thus it is not necessary to expand a loop into a
sequence of statements as was required before to check interstatement parallelism.
In general, the amount of work (i.e. the time) required by the algorithm is
proportional to the number of index expression occurrences in statements in a loop.
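The claim about the loop A[I] := A[I+1] + B can be checked by brute force on sample data. The Python sketch below is not the index-expression algorithm of Chapter 6; it simply simulates a loop of the assumed form A[I] := A[I+c] + B both sequentially and with all iterations reading their operands before any of them writes, and compares the results. The function names, the sample array and the value of B are hypothetical.

```python
def run_sequentially(a, b, lo, hi, offset):
    """Execute A[I] := A[I+offset] + B for I = lo, ..., hi, one I after another."""
    a = list(a)
    for i in range(lo, hi + 1):
        a[i] = a[i + offset] + b
    return a


def run_simultaneously(a, b, lo, hi, offset):
    """Execute all iterations at once: every read sees the original array."""
    old = list(a)
    new = list(a)
    for i in range(lo, hi + 1):
        new[i] = old[i + offset] + b
    return new


def same_result(offset, lo=1, hi=20, b=7):
    """True if simultaneous execution reproduces the sequential result."""
    a = [i * i for i in range(hi + 3)]  # sample data, indices 0 .. hi+2
    return run_sequentially(a, b, lo, hi, offset) == run_simultaneously(a, b, lo, hi, offset)


forward_reference = same_result(+1)   # A[I] := A[I+1] + B
backward_reference = same_result(-1)  # A[I] := A[I-1] + B
```

On this data the loop with A[I+1] gives identical results either way, because each iteration reads a location that is only overwritten by a later iteration, whereas the loop with A[I-1] does not. A single data set can only refute, not prove, parallel computability; the thesis algorithm decides the question from the index expressions alone.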
Having established the algorithm, Chapter 6 further introduces two
techniques which help to exploit more interstatement parallelism in loops. These
are the introduction of temporary locations and the distribution of a loop. The
second technique resembles the idea introduced in Chapter 3, i.e. reduction of
tree height for an arithmetic expression by distribution. Let us write
I, J, K(S1, S2, S3)
for an ALGOL-like program
for I := $i_1, i_2, \ldots, i_m$ do
for J := $j_1, j_2, \ldots, j_n$ do
for K := $k_1, k_2, \ldots, k_p$ do
begin S1; S2; S3 end.
Furthermore by e.g. [I, J], K(S1, S2, S3) we understand a loop*
for (I, J) := $(i_1, j_1), (i_1, j_2), \ldots, (i_1, j_n), (i_2, j_1), \ldots, (i_m, j_n)$ do
for K := $k_1, k_2, \ldots, k_p$ do
begin S1; S2; S3 end.
Then as in the case of arithmetic expressions we may establish the following:
(a) Association: Introduction of brackets, e.g. I, [J,K](S1, S2, S3).
(b) Commutation: Change of the order of I, J, K, e.g. I, K, J(S1, S2, S3).
(c) Distribution: Distribution of I, J, K over S1, S2, S3, e.g.
I,J,K(S1), I,J,K(S2,S3).
Then while the associative law always holds, e.g. I,J,K(S) = [I,J],K(S), the
commutative and the distributive laws do not necessarily hold for all loops, e.g.
I,J(S) $\ne$ J,I(S) if I,J(S) represents a loop
for I := 1, 2, 3 do
for J := 1, 2, 3 do
A[I,J] := A[I+1,J-1].
* This is equivalent to a TRANQUIL expression [2]
for (I, J) seq $((i_1, j_1), (i_1, j_2), \ldots, (i_m, j_n))$ do
In short, Chapter 6 shows that commutation indicates the possibility of computing
a loop in parallel as it is and distribution indicates the possibility of
introducing more parallelism into a program. For example if I,J,K(S) = K,I,J(S), then
S can be computed simultaneously for all values of K while I and J vary
sequentially. Next suppose a loop I(S1, S2) cannot be computed in parallel for all
values of I. Then in a certain case it is possible to distribute and obtain two
loops I(S1), I(S2) which are equivalent to the original loop, I(S1, S2), and
execute each of the two loops in parallel for all values of I separately. Chapter 6
gives an algorithm to distribute to attain this end.
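That commutation can fail is easy to confirm by simulation for the loop body A[I,J] := A[I+1,J-1] with I and J each running over 1, 2, 3. The Python sketch below executes the body once with I as the outer index and once with J as the outer index; the initial array contents and the function name are hypothetical, chosen only so that every element starts with a distinct value.

```python
def run(outer):
    """Execute A[I,J] := A[I+1,J-1] for I, J in 1..3 with the given outer index."""
    # Rows 0..4 and columns 0..3, so that A[I+1, J-1] is always defined.
    a = [[10 * i + j for j in range(4)] for i in range(5)]
    idx = [1, 2, 3]
    if outer == "I":
        pairs = [(i, j) for i in idx for j in idx]
    else:
        pairs = [(i, j) for j in idx for i in idx]
    for i, j in pairs:
        a[i][j] = a[i + 1][j - 1]
    return a


i_outer = run("I")
j_outer = run("J")
```

With I outermost, A[1,2] receives the original A[2,1]; with J outermost, A[2,1] has already been overwritten by the time A[1,2] is computed, so the two orders give different arrays.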
The thesis, thus, introduces new techniques which transform a given
program to expose hidden parallelism. All results in this thesis are also readily
applicable to another type of machine, i.e. machines with a pipeline arithmetic
unit such as the CDC STAR [18] (we regard this type of machine as a special type of
parallel machine and call them serial array machines). Each stage of a pipeline
unit may be regarded as an independent PE in the sense that an operation being
processed in one stage of a pipeline unit must not depend on an operation being
processed in a different stage. Hence exploiting parallelism results in busying
many stages at once.
Two more chapters are included in this thesis to make it complete.
Chapter 2 studies parallel computation of special cases of arithmetic expressions,
e.g. powers and polynomials, in detail to give a measure of the power of a
parallel processing machine.
As was mentioned before, unless specially mentioned, it will be assumed
that there are a sufficient number of PE's available to perform the desired task.
In reality, however, that may not be the case and nontrivial scheduling problems
may arise. To give some insight into this problem Chapter 7 discusses a solution
to the two-processor equally weighted job scheduling problem.
We conclude this chapter by defining the following symbols:
$\lceil x \rceil$ ... the smallest integer not smaller than x,
$\lfloor x \rfloor$ ... the largest integer not larger than x, and
\ x~ ... the smallest power of 2 not smaller than x.
Also unless specified, the base of logarithms is assumed to be 2, e.g. log n is
$\log_2 n$.
2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS
2.1 Introduction
In this chapter, we study the parallel computation of summations,
powers and polynomials. We first assume that m processors (PE) are available.
Then the parallel computation times for the summation ($\sum_{i=1}^{n} a_i$) and the power ($x^n$)
evaluation are given as functions of m and n. The minimum time to evaluate
$\sum_{i=1}^{n} a_i$ or $x^n$ as well as the minimum number of PE's required to attain it is also
derived.
Polynomial computation is first studied assuming the availability of
an arbitrary number of PE's. The lower bound on the computation time for a
polynomial of degree n ($p_n(x)$) is presented. A scheme which computes $p_n(x)$ in
less time than any known scheme is obtained. Because of its simplicity in
scheduling, the kth order Horner's Rule is studied further in detail. It is
shown that for this algorithm the availability of more PE's sometimes increases
the computation time.
Table 2.1 summarizes a part of the results of this chapter.
Before we go further a few comments are in order. The base of
logarithms in this chapter is 2, e.g. log n is actually $\log_2 n$. The following
lemma will be frequently referred to in the text.
[Table 2.1. The Parallel Computation Time for Summation, Power and Polynomial (table body not recoverable)]
Lemma 1:
(1) $\lceil \log a \rceil - \lceil \log b \rceil \le \lceil \log a - \log b \rceil$
where a and b are nonzero positive integers.
(2) $\lceil a + b \rceil = \lceil a \rceil + b$,
$\lfloor a + b \rfloor = \lfloor a \rfloor + b$,
$\lceil b - a \rceil = b - \lceil a \rceil + 1$ and
$\lfloor b - a \rfloor = b - \lfloor a \rfloor - 1$
where b is a positive integer and a is a positive real number
(not an integer);
(3) $a + 1 > \lceil a \rceil \ge a$ and
$b - 1 < \lfloor b \rfloor \le b$
where a and b are nonzero positive real numbers.
Proof:
(1) Let $a = 2^h + k$ and $b = 2^f + g$ where $0 \le k \le 2^h - 1$ and $0 \le g \le 2^f - 1$.
Now the proof is divided into four cases, i.e. (i) k, g > 0,
(ii) k = g = 0, (iii) k = 0, g > 0, and (iv) k > 0, g = 0.
(i): k, g > 0.
Then $\lceil \log a \rceil - \lceil \log b \rceil = h - f$. Also let $\log a = h + x$
and $\log b = f + y$ where 0 < x, y < 1. Then
$\lceil \log a - \log b \rceil \ge h - f$. Thus
$\lceil \log a \rceil - \lceil \log b \rceil \le \lceil \log a - \log b \rceil$.
The other three cases may be proved similarly and the details are
omitted.
(2), (3) Proofs for (2) and (3) follow from the definition.
(Q.E.D.)
2.2 Summation of n Numbers
Theorem 1:
The minimum number of steps, $h^a(m,n)$, to add n numbers on P(m) is
(1): $h^a(m,n) = n - 1$  (m = 1)
(2): $h^a(m,n) = \lfloor n/m \rfloor - 1 + \lceil \log(m + n - \lfloor n/m \rfloor m) \rceil$  ($\lfloor n/2 \rfloor \ge m \ge 2$)
(3): $h^a(m,n) = \lceil \log n \rceil$  ($m \ge \lfloor n/2 \rfloor$)
Proof:
(1) is self evident. (3) uses the so-called log sum method [22] or
the tree method (see Theorem 1 of Chapter 3). It is clear that $\lceil \log n \rceil$ steps
are required and also that n numbers cannot be added in fewer steps.
Now we prove (2). First each PE adds $\lfloor n/m \rfloor$ numbers independently.
This takes $\lfloor n/m \rfloor - 1$ steps and produces m partial sums. Then there will be
$m + (n - \lfloor n/m \rfloor \cdot m)$ numbers left. Clearly $m + n - \lfloor n/m \rfloor \cdot m < 2m$. Then those
numbers are added by the log sum method, which takes $\lceil \log(m + n - \lfloor n/m \rfloor m) \rceil$
steps.
(Q.E.D.)
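The step count of Theorem 1 and the scheme used in its proof can be put side by side in a short Python sketch. The function names are hypothetical, and the simulation is only one reading of the proof: each PE first sums $\lfloor n/m \rfloor$ numbers locally, and the remaining numbers are then halved level by level with at most m additions per step.

```python
def ceil_log2(x):
    """Smallest integer not smaller than log2(x), for x >= 1."""
    return (x - 1).bit_length()


def h_a(m, n):
    """Number of steps to add n numbers on P(m), following Theorem 1."""
    if m == 1:
        return n - 1
    if m >= n // 2:
        return ceil_log2(n)
    q = n // m
    return q - 1 + ceil_log2(m + n - q * m)


def simulate(m, n):
    """Count steps of the scheme in the proof: local sums, then log sum."""
    q = n // m
    steps, count = 0, n
    if q >= 2:
        steps = q - 1              # every PE adds q numbers on its own
        count = m + (n - q * m)    # m partial sums plus the leftover numbers
    while count > 1:
        count -= min(m, count // 2)  # at most m pairwise additions per step
        steps += 1
    return steps
```

For instance `h_a(8, 17)` and `h_a(6, 17)` both evaluate to 5, and for small n the simulated schedule takes exactly as many steps as the formula predicts.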
Now we show that for a fixed n, $\lfloor n/2 \rfloor \ge m > m'$ implies that $h^a(m,n) \le
h^a(m',n)$. To prove this it is enough to show that $h^a(m+1,n) \le h^a(m,n)$ where
$m + 1 \le \lfloor n/2 \rfloor$. There are two cases:
(1) $\lfloor n/m \rfloor = \lfloor n/(m+1) \rfloor = k > 1$.
Let n = km + p (p < m). Then
$m + n - \lfloor n/m \rfloor m = m + p$
and
$(m+1) + n - \lfloor n/(m+1) \rfloor (m+1) = m + p + 1 - k$.
Hence we have
$\lceil \log(m + n - \lfloor n/m \rfloor m) \rceil \ge \lceil \log((m+1) + n - \lfloor n/(m+1) \rfloor (m+1)) \rceil$,
or $h^a(m,n) \ge h^a(m+1,n)$.
(2) $\lfloor n/m \rfloor > \lfloor n/(m+1) \rfloor$.
Let $\lfloor n/m \rfloor = k$ and $\lfloor n/(m+1) \rfloor = k - g$, where $k, g \ge 1$. Then
n = km + p (p < m) (1)
and n = (k - g)(m + 1) + p' (p' < m + 1). (2)
Suppose $h^a(m,n) < h^a(m+1,n)$, i.e.
$\lfloor n/m \rfloor - 1 + \lceil \log(m + n - \lfloor n/m \rfloor m) \rceil < \lfloor n/(m+1) \rfloor - 1 + \lceil \log((m+1) + n - \lfloor n/(m+1) \rfloor (m+1)) \rceil$. (3)
By substituting Eq. (1) and (2) into Eq. (3) and by rearranging
we get
$g < \lceil \log(m + 1 + p') \rceil - \lceil \log(m + p) \rceil$. (4)
By Lemma 1(1), Eq. (4) can hold only if
$g < \lceil \log((m + 1 + p')/(m + p)) \rceil$ (5)
holds. Eq. (5) holds if
$2^g < (m + 1 + p')/(m + p)$ (6)
holds. Since
$2 + 1/m > (m + 1 + p')/(m + p)$, (7)
Eq. (6) can hold only if g = 1 (remember that $g \ge 1$). By
letting g = 1 in Eq. (2), we get
p' = km + p - (k - 1)(m + 1).
Then by substituting this and g = 1 into Eq. (6) and by proper
rearrangement, we get
p < 2 - k.
This only holds if k = 1 and p = 0, which implies that n = m
(see Eq. (1)) and contradicts our assumption that $m + 1 \le \lfloor n/2 \rfloor$.
Hence $h^a(m,n) < h^a(m+1,n)$ never holds.
The above two cases (1) and (2) prove that $h^a(m+1,n) \le h^a(m,n)$.
Thus we have the following lemma.
Lemma 2:
$h^a(m',n) \ge h^a(m,n)$ if $m' < m \le \lfloor n/2 \rfloor$.
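Lemma 2 can also be spot-checked numerically. The Python sketch below restates the step count of Theorem 1 under a hypothetical function name and tests, for a range of small n, that adding one more PE never increases the number of steps as long as $m + 1 \le \lfloor n/2 \rfloor$. It is a sanity check on sample values, not a substitute for the proof.

```python
def ceil_log2(x):
    """Smallest integer not smaller than log2(x), for x >= 1."""
    return (x - 1).bit_length()


def h_a(m, n):
    """Number of steps to add n numbers on P(m), following Theorem 1."""
    if m == 1:
        return n - 1
    if m >= n // 2:
        return ceil_log2(n)
    q = n // m
    return q - 1 + ceil_log2(m + n - q * m)


def nonincreasing_up_to_half(n):
    """True if h_a(m+1, n) <= h_a(m, n) for every m with m + 1 <= floor(n/2)."""
    return all(h_a(m + 1, n) <= h_a(m, n) for m in range(1, n // 2))


steps_for_17 = [h_a(m, 17) for m in range(1, 9)]
```

For n = 17 the sequence of step counts for m = 1, ..., 8 is 16, 9, 7, 6, 5, 5, 5, 5, which decreases and then stays flat.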
The above lemma may seem insignificant. In Section 2.4.2, however,
it will be shown that for a certain algorithm to compute an nth degree
polynomial on P(m), the number of computation time steps, $h^p(m,n)$, is not a nonincreasing
function of m, i.e., m > m' does not necessarily imply that $h^p(m,n) \le h^p(m',n)$.
As will be described later, the algorithm is such that all PE's are forced
to participate in the computation. It is true that if we are allowed to "turn
off" some PE's, then we always get $h^p(m,n) \le h^p(m',n)$ if m > m'. Then a
question is how many PE's are to be turned off. These problems will be studied
in Section 2.4.2.
It should be noted that the minimum number of PE's required to achieve
the minimum computation time is not necessarily $\lfloor n/2 \rfloor$. For example let n = 17.
Then $\lfloor 17/2 \rfloor = 8$ and $h^a_{\min}(17) = h^a(8,17) = 5$. But also $h^a(6,17) = 5$.
As we know, the minimum computation time to add n numbers is $\lceil \log n \rceil$.
Now we present the minimum number of PE's, M, which achieves this bound.
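Theorem 2:
For a fixed n, let
$M = \mu_m(\lfloor n/m \rfloor - 1 + \lceil \log(m + n - \lfloor n/m \rfloor m) \rceil = \lceil \log n \rceil)$.†
Then
(1) $M = 1 + \lfloor (n-1)/2 \rfloor - 2^{k-2}$  ($2^k < n \le 2^k + 2^{k-1}$)
(2) $M = n - 2^k$  ($2^k + 2^{k-1} < n \le 2^{k+1}$)
where $k = \lfloor \log(n-1) \rfloor$.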
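The closed form of Theorem 2 can be compared with a direct search for the smallest m that reaches $\lceil \log n \rceil$ steps. The Python sketch below uses hypothetical function names, restates the step count of Theorem 1, and restricts the closed form to $n \ge 5$ (i.e. $k \ge 2$) so that $2^{k-2}$ is an integer; smaller n are the cases the proof handles by direct examination.

```python
def ceil_log2(x):
    """Smallest integer not smaller than log2(x), for x >= 1."""
    return (x - 1).bit_length()


def h_a(m, n):
    """Number of steps to add n numbers on P(m), following Theorem 1."""
    if m == 1:
        return n - 1
    if m >= n // 2:
        return ceil_log2(n)
    q = n // m
    return q - 1 + ceil_log2(m + n - q * m)


def min_pes_by_search(n):
    """Smallest m for which n numbers are added in ceil(log2 n) steps."""
    target = ceil_log2(n)
    return next(m for m in range(1, n + 1) if h_a(m, n) == target)


def min_pes_by_formula(n):
    """M as given by Theorem 2; used here only for n >= 5."""
    k = (n - 1).bit_length() - 1          # floor(log2(n - 1))
    if n <= 2 ** k + 2 ** (k - 1):
        return 1 + (n - 1) // 2 - 2 ** (k - 2)
    return n - 2 ** k
```

For n = 17 both functions return 5, which is consistent with the remark that fewer than $\lfloor 17/2 \rfloor = 8$ PE's already reach the minimum time.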
Proof:
For $k \le 3$, direct examination shows that the theorem holds (see
Table 2.2). Therefore we assume that k > 3.
The proof is divided into two parts, i.e. (1) $2^k < n \le 2^k + 2^{k-1}$ and
(2) $2^k + 2^{k-1} < n \le 2^{k+1}$. In either case we first prove that $h^a(M,n) = \lceil \log n \rceil$.
Then for m < M we show that $h^a(m,n) > \lceil \log n \rceil$.
It should be clear that in both cases $2^k < n \le 2^{k+1}$ and $\lceil \log n \rceil =
k + 1$.
(1) $2^k < n \le 2^k + 2^{k-1}$.
Let $n = 2^k + p$ ($1 \le p \le 2^{k-1}$). (8)
Now we first show that $h^a(M,n) = \lceil \log n \rceil$ where
$M = 1 + \lfloor (n-1)/2 \rfloor - 2^{k-2}$. (9)
† $\mu_m$(condition) denotes the minimum value of m which satisfies the
condition.
Table 2.2. The Number of Steps Required to Compute $\sum_{i=1}^{n} a_i$ on P(m), $h^a(m,n)$,
for $n \le 10$.

  n | m=1  m=2  m=3  m=4  m=5 |  k    M   Case
  2 |  1                      |  0    1    2
  3 |  2    2                 |  1    1    1
  4 |  3    2                 |  1    2    2
  5 |  4    3    3    3       |  2    2    1
  6 |  5    3    3    3       |  2    2    1
  7 |  6    4    3    3       |  2    3    2
  8 |  7    4    4    3       |  2    4    2
  9 |  8    5    4    4    4  |  3    3    1
 10 |  9    5    4    4    4  |  3    3    1

Case (1): $2^k < n \le 2^k + 2^{k-1}$
Case (2): $2^k + 2^{k-1} < n \le 2^{k+1}$
where $k = \lfloor \log(n-1) \rfloor$.
n: The degree of a polynomial
m: The number of PE's
M: The minimum number of PE's
Now
$h^a(M,n) = \lfloor n/M \rfloor - 1 + \lceil \log(M + n - \lfloor n/M \rfloor M) \rceil$. (10)
We then show that $\lfloor n/M \rfloor = 3$ for all n. Then we get
$M + n - \lfloor n/M \rfloor M = n - 2M$. The value of n - 2M is evaluated in three
ways, i.e. (i) p = 2g, (ii) p = 2g + 1 ($g \ge 1$) or (iii) p = 1.
In any case $n - 2M \le 2^{k-1}$ holds. Thus we can prove that
$\lceil \log(M + n - \lfloor n/M \rfloor M) \rceil \le k - 1$
and
$h^a(M,n) \le k + 1 = \lceil \log n \rceil$.
Then we prove that for all m < M, $h^a(m,n) > \lceil \log n \rceil$. This is
proved by showing that $h^a(M-1,n) > \lceil \log n \rceil$. Then by Lemma
2, $h^a(m,n) > \lceil \log n \rceil$ for all m < M.
Now let us show the details. First we prove that $h^a(M,n) =
\lceil \log n \rceil$. From Eq. (8) and (9), we get
$M = 2^{k-2} + 1 + \lfloor (p-1)/2 \rfloor$, (11)
and by Lemma 1(2), we get
$\lfloor n/M \rfloor = \lfloor 4 - P \rfloor = 3 - \lfloor P \rfloor$ (12)
where
$P = \dfrac{4\lfloor (p-1)/2 \rfloor + 4 - p}{2^{k-2} + 1 + \lfloor (p-1)/2 \rfloor}$.
By Lemma 1(3), we get
$P < P' = \dfrac{2p - 4}{2^{k-1} + p - 1}$.
Now we show that for all p ($1 \le p \le 2^{k-1}$), P' < 1. Since
$\dfrac{dP'}{dp} = \dfrac{2^k + 2}{(2^{k-1} + p - 1)^2} > 0$
we have
$\max P' = \dfrac{2^k - 4}{2^k - 1} < 1$ for $1 \le p \le 2^{k-1}$ ($k \ge 4$).
Thus P < 1 and by Eq. (12) we have $\lfloor n/M \rfloor = 3$. Substituting this
into Eq. (10), we get $h^a(M,n) = 2 + \lceil \log(n - 2M) \rceil$. Now
subtracting two times Eq. (11) from Eq. (8) we have
$n - 2M = 2^{k-1} + p - 2 - 2\lfloor (p-1)/2 \rfloor$. (13)
Eq. (13) is evaluated in three different ways according to the
value of p, i.e. (i) p = 2g, (ii) p = 2g + 1 ($g \ge 1$) or
(iii) p = 1 (in every case g is an integer).
(i) p = 2g (From Eq. (8), $g \ge 1$).
$n - 2M = 2^{k-1}$.
(ii) p = 2g + 1 ($g \ge 1$).
$n - 2M = 2^{k-1} + 2g + 1 - 2 - 2g < 2^{k-1}$.
(iii) p = 1.
$n - 2M = 2^{k-1} + 1 - 2 < 2^{k-1}$.
Hence in any case $n - 2M \le 2^{k-1}$ or $\lceil \log(n - 2M) \rceil \le k - 1$. Thus
$h^a(M,n) \le k + 1 = \lceil \log n \rceil$.
This proves the first part of (1). Next we prove the latter
part, i.e. for m < M, $h^a(m,n) > h^a(M,n)$.
First we show that $h^a(M-1,n) > h^a(M,n)$. From Eq. (8)
and (9), and using Lemma 1(2) we get
$\lfloor n/(M-1) \rfloor = \lfloor 4 - Q \rfloor = 3 - \lfloor Q \rfloor$, (14)
where
$Q = \dfrac{4\lfloor (p-1)/2 \rfloor - p}{2^{k-2} + \lfloor (p-1)/2 \rfloor}$.
As we showed for P, we can also prove that for all p ($1 \le p \le
2^{k-1}$), Q < 1. From Eq. (14), we have $\lfloor n/(M-1) \rfloor = 3$. Then
$h^a(M-1,n) = 2 + \lceil \log(n - 2(M-1)) \rceil$. From Eq. (8) and (11),
we get
$n - 2(M-1) = 2^{k-1} + p - 2\lfloor (p-1)/2 \rfloor \ge 2^{k-1} + 1$
or $\lceil \log(n - 2(M-1)) \rceil \ge k$. Hence
$\lceil \log n \rceil = k + 1 < k + 2 \le h^a(M-1,n)$.
Thus for all m < M, $h^a(m,n) > \lceil \log n \rceil$ by Lemma 2, and this proves
(1).
(2) $2^k + 2^{k-1} < n \le 2^{k+1}$.
Let $n = 2^k + 2^{k-1} + p$ ($1 \le p \le 2^{k-1}$). (15)
We first sketch the proof. We first show that $h^a(M,n) =
\lceil \log n \rceil$. To show this we prove that $\lfloor n/M \rfloor = 2$ for all n
($2^k + 2^{k-1} < n \le 2^{k+1}$). Then using this we get $M + n - \lfloor n/M \rfloor M =
n - M$. We further show that $n - M = 2^k$. Thus we get $h^a(M,n) =
k + 1 = \lceil \log n \rceil$. Then we prove that $h^a(M-1,n) > h^a(M,n)$
which together with Lemma 2 completes the proof of (2).
[Figure 2.1. The Minimum Number, M, of PE's Required to Add Numbers in the Minimum Time. Axis label: Number of PE's (M).]
The details are similar to (1) and will not be given here.
(Q.E.D.)
Theorems 1, 2 and Lemma 2 also apply to the case of multiplication of
n numbers, and to avoid duplication, the corresponding lemmas for multiplication
shall not be presented.
Next let us consider the power computation, e.g. $x^n$ ($n \ge 2$).
2.3 Computation of Powers
Lemma 3 '■ [23]
Let N be the number of ones in the binary representation for n.
Then the near minimum number of steps to compute x on P(l), h (l, n),
is
h e (l,n)  L lo S n J + N  1.
Up to now, there is no result about the minimum computation time to evaluate $x^n$ [23]. Thus we shall settle for an approximation. For example let n = 15. Then $h^e(1,15) = \lfloor \log 15 \rfloor + 4 - 1 = 6$. On the other hand $x^{15}$ can be evaluated in fewer steps, e.g.
$(x)^2 = x^2 \to (x^2)(x) = x^3 \to (x^3)(x^2) = x^5 \to (x^5)^2 = x^{10} \to (x^5)(x^{10}) = x^{15}$.
This takes only 5 steps. For n < 70, this lemma gives the correct values for more than 70% of the cases.
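The gap between the binary method of Lemma 3 and the true minimum can be explored with a small search. The following is a minimal Python sketch, not part of the thesis: `binary_method_steps` evaluates the formula of Lemma 3, and `shortest_chain_steps` (a hypothetical helper name) finds the fewest multiplications by exhaustive search over increasing addition chains, which is only practical for small n.

```python
from itertools import count


def binary_method_steps(n: int) -> int:
    """Lemma 3: floor(log2 n) + (number of ones in n) - 1 multiplications."""
    return (n.bit_length() - 1) + bin(n).count("1") - 1


def shortest_chain_steps(n: int) -> int:
    """Fewest multiplications needed to reach x**n, by iterative deepening.

    Each new exponent is the sum of two exponents already in the chain.
    Only increasing chains are tried, which loses no generality.
    """
    if n == 1:
        return 0

    def extend(chain, budget):
        top = chain[-1]
        if top == n:
            return True
        # Even doubling at every remaining step cannot reach n: prune.
        if budget == 0 or (top << budget) < n:
            return False
        tried = set()
        for i in range(len(chain) - 1, -1, -1):
            for j in range(i, -1, -1):
                nxt = chain[i] + chain[j]
                if top < nxt <= n and nxt not in tried:
                    tried.add(nxt)
                    if extend(chain + [nxt], budget - 1):
                        return True
        return False

    for steps in count(1):
        if extend([1], steps):
            return steps


if __name__ == "__main__":
    for n in (7, 15, 16):
        print(n, binary_method_steps(n), shortest_chain_steps(n))
```

For n = 15 the two functions return 6 and 5 respectively, matching the example in the text.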
While we cannot give the definite answer for the sequential case, we can prove the following.
Theorem 3:
The minimum number of steps to compute $x^n$ on P(m) ($m > 1$), $h^e(m,n)$, is
$h^e(m, n) = \lceil \log n \rceil$.
Proofs of Lemma 3 and Theorem 3:
Let $\alpha = \lfloor \log n \rfloor$ and $\beta = \lceil \log (n + 1) \rceil$ for convenience. Let $I_j$ be the jth most significant bit in the binary representation for n. If m = 1, then $x^n$ is computed as follows. First let us write
$X_j = (x^{2^j})^{I_{\beta - j}}$.
We first compute all $x^i$ ($i = 2, 2^2, \ldots, 2^{\alpha}$) in $\alpha$ steps. Then
$x^n = (X_0) \times (X_1) \times \ldots \times (X_{\alpha})$
and this computation takes $N - 1$ steps (note that if $I_{\beta - j} = 0$, then $X_j = 1$). Thus in total $\lfloor \log n \rfloor + N - 1$ steps are needed.
If two PE's are used, then $x^n$ is computed as follows. Again let us write
$x^n = (X_0) \times (X_1) \times \ldots \times (X_{\alpha})$.
Now this can be computed by the following two recursive equations.
t (l)  x
k k1 k1
k k1 k1'
and
x n  t (2)
Two PE's are required for the simultaneous computation of $t_k^{(1)}$ and $t_k^{(2)}$.
That the above process for P(2) is optimum is clear, because $x^{2^{\lfloor \log n \rfloor}}$ cannot be computed in less than $\lfloor \log n \rfloor$ steps and at least $\lfloor \log n \rfloor + 1$ ($= \lceil \log n \rceil$) steps are required to compute $x^n$.
(Q.E.D.)
From the above discussion, we have the following corollary.
Corollary:
$h^e(m,n) = h^e(2,n)$ for all $m > 2$.
Now let us study simultaneous computation of all $x^i$ ($i = 1, 2, \ldots, n$).
Theorem 4:
The minimum number of steps, $h^w(m,n)$, required for simultaneous evaluation of all $x^i$ ($i = 1, 2, \ldots, n$) on P(m) is
$h^w(m,n) =$
  (1) $n - 1$   ($m = 1$)
  (2) $\lfloor \log m \rfloor + 1 + \lceil (n - 2^{\lfloor \log m \rfloor + 1})/m \rceil$   ($\max(n - 2^{\lceil \log n \rceil - 1}, 2^{\lceil \log n \rceil - 2}) > m \ge 2$)
  (3) $\lceil \log n \rceil$   ($m \ge \max(n - 2^{\lceil \log n \rceil - 1}, 2^{\lceil \log n \rceil - 2})$)
Proof:
(1) is obvious. (3) is illustrated in Figure 2.2. At the kth step, the $x^i$ ($i = 2^{k-1} + 1, 2^{k-1} + 2, \ldots, 2^k$) are computed using the results of earlier steps, e.g. $x^{2^k} = x^{2^{k-1}} \times x^{2^{k-1}}$. The number of PE's required at this step is then $2^k - (2^{k-1} + 1) + 1 = 2^{k-1}$.
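  Step                        P_1          P_2          P_3    ...              No. of PE's required
  1                           $x^2$                                              1
  2                           $x^3$        $x^4$                                 2
  3                           $x^5$        $x^6$        $x^7$  $x^8$             4
  ...
  k                           $x^{a+1}$    $x^{a+2}$    ...    $x^{2a}$          $2^{k-1}$
  ...
  $\lceil \log n \rceil - 1$  $x^{b+1}$    ...                 $x^{2b}$          b
  $\lceil \log n \rceil$      $x^{c+1}$    ...                 $x^n$             $n - c$
  where $a = 2^{k-1}$, $b = 2^{\lceil \log n \rceil - 2}$, $c = 2b$.
Figure 2.2. Computation of $x^i$ (1)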
The maximum number of PE's required is the larger one of $n - 2^{\lceil \log n \rceil - 1}$ (the number of PE's required at the last step) or $2^{\lceil \log n \rceil - 2}$ (the number of PE's required at the ($\lceil \log n \rceil - 1$)th step). This proves (3). Clearly this procedure is optimum in the sense that it gives the minimum computation time.
Next suppose that the number of PE's available, m, is less than $\max(n - 2^{\lceil \log n \rceil - 1}, 2^{\lceil \log n \rceil - 2})$. Then first all $x^i$ ($1 \le i \le 2^{\lfloor \log m \rfloor + 1}$) are computed in $\lfloor \log m \rfloor + 1$ steps in the same manner as the above procedure.
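[Figure 2.3. Computation of $x^i$ (2): schedule on m PE's. Steps 1 through $\lfloor \log m \rfloor + 1$ compute $x^2, \ldots, x^b$, the last of them computing $x^a, \ldots, x^b$; the following $\lceil (n - 2^{\lfloor \log m \rfloor + 1})/m \rceil$ steps compute $x^c, \ldots, x^n$, where $a = 2^{\lfloor \log m \rfloor} + 1$, $b = 2^{\lfloor \log m \rfloor + 1}$ and $c = b + 1 = 2^{\lfloor \log m \rfloor + 1} + 1$.]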
Now there are $n - 2^{\lfloor \log m \rfloor + 1}$ powers $x^i$ left ($2^{\lfloor \log m \rfloor + 1} < i \le n$) to be computed. This takes $\lceil (n - 2^{\lfloor \log m \rfloor + 1})/m \rceil$ steps on P(m). Clearly at each step, all necessary data to perform operations are available. To show this, let us take two successive steps. Assume that the first step computes $x^a \sim x^b$ where $b - a + 1 = m$. Then the second step computes $x^{b+1} \sim x^{b+m}$. Since $b + m = 2b + 1 - a \le 2b$ ($a \ge 1$), all inputs required at the second step are available from the first step. Thus in total $\lfloor \log m \rfloor + 1 + \lceil (n - 2^{\lfloor \log m \rfloor + 1})/m \rceil$ steps are required. This proves (2).
(Q.E.D.)
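The three cases of Theorem 4 can be compared against a direct simulation of the schedule used in the proof. The sketch below is illustrative only: `h_w` transcribes the case formula as reconstructed here, and `greedy_steps` (a hypothetical helper) simulates a schedule in which the known powers always form a prefix $x^1, \ldots, x^a$ and each step adds as many of the next powers as the PE count and the available inputs allow.

```python
def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n (n >= 1)."""
    return (n - 1).bit_length()


def floor_log2(n: int) -> int:
    """Largest k with 2**k <= n (n >= 1)."""
    return n.bit_length() - 1


def h_w(m: int, n: int) -> int:
    """Step count of Theorem 4 for x, x^2, ..., x^n on m PE's (n >= 2)."""
    q = ceil_log2(n)
    threshold = max(n - (1 << (q - 1)), (1 << (q - 2)) if q >= 2 else 0)
    if m == 1:
        return n - 1                      # case (1)
    if m >= threshold:
        return q                          # case (3)
    k = floor_log2(m) + 1
    return k + -(-(n - (1 << k)) // m)    # case (2), ceiling division


def greedy_steps(m: int, n: int) -> int:
    """Simulate the schedule: with x^1..x^known available, one step can
    produce at most min(m, known) further consecutive powers."""
    known, steps = 1, 0
    while known < n:
        known += min(m, known, n - known)
        steps += 1
    return steps


if __name__ == "__main__":
    print(h_w(5, 18), greedy_steps(5, 18))
```

Under these assumptions the formula and the simulation agree for small n, including the value 5 for n = 18 on five PE's.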
Clearly for fixed n, $m > m'$ implies that $h^w(m,n) \le h^w(m',n)$. Thus we have:
Lemma 4:
For fixed n, $h^w(m, n)$ is a nonincreasing function of m.
We again call the reader's attention to the fact that the number of PE's required to compute all $x^i$ in the minimum number of steps, i.e. $\lceil \log n \rceil$, is not necessarily $\max(n - 2^{\lceil \log n \rceil - 1}, 2^{\lceil \log n \rceil - 2})$. For example, let n = 18. Then $\max(18 - 2^4, 2^3) = 8$, and $\lceil \log 18 \rceil = 5$. Yet P(5) achieves the same result, i.e. $h^w(5,18) = 5$.
Lemma 5:
The minimum number of PE's, M, necessary to compute all $x^i$ ($1 \le i \le n$) simultaneously in the shortest time is
M =
  (1) $n - 2^{\lceil \log n \rceil - 1}$   ($n - 2^{\lceil \log n \rceil - 1} > 2^{\lceil \log n \rceil - 2}$)
  (2) $\lceil (n - 2^{\lceil \log n \rceil - 2})/2 \rceil$   (otherwise)
Proof:
Let $\sigma = \lceil \log n \rceil$.
(1) $n - 2^{\sigma - 1} > 2^{\sigma - 2}$.
Suppose that there are only m PE's where $m < n - 2^{\sigma - 1}$.
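[Figure 2.4. Computation of $x^i$ (3): at step $\lceil \log n \rceil$ the PE's $P_1, \ldots, P_m$ compute the powers starting at $x^a$, and the powers from $x^{a+m}$ to $x^n$ are left out, where $a = 2^{\lceil \log n \rceil - 1} + 1$.]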
Then $x^i$ ($2^{\sigma - 1} + 1 + m \le i \le n$) cannot be computed at the $\sigma$th step. Also none of them can be computed at an earlier step because their inputs are not ready. This proves (1).
(2) $n - 2^{\sigma - 1} \le 2^{\sigma - 2}$.
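[Figure 2.5. Computation of $x^i$ (4): schedule of steps $\lceil \log n \rceil - 1$ and $\lceil \log n \rceil$; the labels mark the powers $x^{2^{\lceil \log n \rceil - 2} + 1}$ through $x^{2^{\lceil \log n \rceil - 2} + m}$ computed at step $\lceil \log n \rceil - 1$.]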
Then we delay the computation of all $x^i$ ($2^{\sigma - 2} + m + 1 \le i \le 2^{\sigma - 1}$) which are originally scheduled to be computed at the ($\sigma - 1$)th step till the next step, i.e. the $\sigma$th step (i.e. the last step). At the $\sigma$th step we compute all remaining $x^i$ ($2^{\sigma - 2} + m + 1 \le i \le n$).
The value of m has been chosen in such a way that
,a2
cr2
(2 U  1)  (2 urc +m + l)+l>n (2 urc + m + l) + 1
i.e.
n
,cr2
m = "(n  (2°^ + 1) l)/2».
Also since
$2(2^{\sigma - 2} + m) \ge 2(2^{\sigma - 2} + n/2 - 2^{\sigma - 3}) = n + 2^{\sigma - 2} > n$,
all inputs required at the $\sigma$th step are ready at the ($\sigma - 1$)th step or earlier steps.
(Q.E.D.)
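Lemma 5 can be checked numerically for small n. The following Python sketch is an illustration under stated assumptions: `lemma5_min_pes` transcribes the two cases as reconstructed above, and `greedy_steps` is a hypothetical simulation in which the known powers form a prefix $x^1, \ldots, x^a$ and each step adds at most min(m, a) further powers. The brute-force search returns the smallest m whose simulated step count equals $\lceil \log n \rceil$.

```python
def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n (n >= 1)."""
    return (n - 1).bit_length()


def greedy_steps(m: int, n: int) -> int:
    """Steps needed to obtain x^1..x^n on m PE's with a prefix schedule."""
    known, steps = 1, 0
    while known < n:
        known += min(m, known, n - known)
        steps += 1
    return steps


def lemma5_min_pes(n: int) -> int:
    """Closed form of Lemma 5 (n >= 2), kept in integer arithmetic."""
    sigma = ceil_log2(n)
    half = 1 << (sigma - 1)              # 2^(sigma-1)
    if 2 * (n - half) > half:            # n - 2^(sigma-1) > 2^(sigma-2)
        return n - half
    return -(-(2 * n - half) // 4)       # ceil((n - 2^(sigma-2)) / 2)


def brute_force_min_pes(n: int) -> int:
    """Smallest m that reaches the minimum step count ceil(log2 n)."""
    target = ceil_log2(n)
    return next(m for m in range(1, n + 1) if greedy_steps(m, n) == target)


if __name__ == "__main__":
    for n in (7, 12, 18, 25):
        print(n, lemma5_min_pes(n), brute_force_min_pes(n))
```

For n = 18 both functions give 5, the value quoted before the lemma.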
2.4 Computation of Polynomials
In this section, we study polynomial computation. First we assume
that there are arbitrarily many PE's. Then four schemes are studied and
compared. Two of them are known as Estrin's method [16] and the kth order
Horner's rule [15]. Two new methods are also introduced. They are called a
tree method (see Chapter 3) and a folding method. It is shown that if there
are arbitrarily many PE's, then the folding method gives a faster computation
time than any known method.
Then we study the case where only a limited number of PE's are
available. Because of the simplicity of scheduling, the kth order Horner's
rule is studied in detail. It is shown that on P(m) the mth order
Horner's rule does not necessarily guarantee the fastest computation, i.e.,
there is a case where the m'th order Horner's rule (m' < m) gives a better
result. Thus availability of more computational resources does not necessarily
"speed up" the computation for a certain class of feasible parallel computation
algorithms.
2.4.1 Computation of a Polynomial on an Arbitrary Size Machine
Definition
We write $p_n(x)$ for a polynomial of degree n:
$p_n(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_0$.
2.4.1.1 kth Order Horner's Rule [15]
The details will be presented in Section 2.4.2. Theorem 5 shows that the minimum time required to compute $p_n(x)$ by this method is
$h^p_{min} = \lceil \log n \rceil + \lceil \log (n+1) \rceil + 1$.
2.4.1.2 Estrin's Method [15][16]
We first compute
$C_i^0 = a_i + x a_{i+1}$,   $i = 0, 2, \ldots, 2\lfloor n/2 \rfloor$.
Then successively compute
$C_i^1 = C_i^0 + x^2 C_{i+2}^0$,   $i = 0, 4, \ldots, 4\lfloor n/4 \rfloor$
$C_i^2 = C_i^1 + x^4 C_{i+4}^1$,   $i = 0, 8, \ldots, 8\lfloor n/8 \rfloor$
...
$C_i^m = C_i^{m-1} + x^{2^m} C_{i+2^m}^{m-1}$,   $i = 0, 2^{m+1}, \ldots, 2^{m+1}\lfloor n/2^{m+1} \rfloor$
where $m = \lfloor \log n \rfloor$ and moreover
$p_n(x) = C_0^m$.
The procedure may be illustrated by
$p_n(x) = a_0 + a_1 x + x^2(a_2 + a_3 x)$
  $+ x^4(a_4 + a_5 x + x^2(a_6 + a_7 x))$
  $+ x^8(a_8 + a_9 x + x^2(a_{10} + a_{11} x) + x^4(a_{12} + a_{13} x + x^2(a_{14} + a_{15} x)))$.
Now notice that for each j all $C_i^j$ may be computed simultaneously.
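The level-by-level pairing above can be sketched in Python. This is a sequential illustration, not the thesis's implementation: the names `estrin` and `horner` are chosen here, coefficients are passed lowest degree first, and the list comprehension stands in for the operations that a parallel machine would perform simultaneously at each level.

```python
def horner(coeffs, x):
    """Sequential reference evaluation; coeffs[i] is a_i."""
    acc = 0
    for a in reversed(coeffs):
        acc = acc * x + a
    return acc


def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) by Estrin's pairing scheme.

    Assumes a non-empty coefficient list.
    """
    level = list(coeffs)      # the raw coefficients a_0 .. a_n
    power = x                 # x^(2^j) for the level being formed
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)   # pad so every entry has a partner
        # All entries of the next level are independent of one another.
        level = [level[i] + power * level[i + 1]
                 for i in range(0, len(level), 2)]
        power = power * power
    return level[0]


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7]
    print(estrin(a, 3), horner(a, 3))
```

Each pass of the loop corresponds to one superscript level j of the $C_i^j$, so the number of passes grows with $\log n$ rather than with n.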
Assuming the availability of an arbitrary number of PE's, it is easy to show that the minimum number of steps to compute $p_n(x)$ by this method is
$h^E_{min} = 2\lceil \log (n + 1) \rceil$.
2.4.1.3 Tree Method
In this method a tree similar to the one for arithmetic expressions is built for a polynomial $p_n(x)$. For example p,(x) is computed as:
Computation by this method consists of two stages, computation of $b_i = a_i x^i$ ($0 \le i \le n$) and computation of $\sum_{i=0}^{n} b_i$.
(1) Computation of $b_i = a_i x^i$ ($0 \le i \le n$).
This requires $\lceil \log(i+1) \rceil$ steps by Theorem 3 (see Figure 2.6).
(2) Computation of $\sum_{i=0}^{n} b_i$.
As soon as the $b_i$'s become available they are added in the log sum way. Suppose $b_i$ becomes available at the kth step. Having a variable at the kth step is equivalent to having $2^{k-1}$ variables originally at the first step (cf. the effective length in Chapter 3).
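  Step                              Terms Which Are Computed                              No. of Terms Computed
  1                                 $a_0$                                                 1
  2                                 $a_1 x$                                               1
  3                                 $a_2 x^2, a_3 x^3$                                    2 ($= 2^1$)
  4                                 $a_4 x^4, a_5 x^5, a_6 x^6, a_7 x^7$                  4 ($= 2^2$)
  ...
  $\lceil \log(n+1) \rceil$         ...                                                   $2^{\lceil \log(n+1) \rceil - 2}$
  $\lceil \log(n+1) \rceil + 1$     $a_{2^{\lceil \log(n+1) \rceil - 1}} x^{2^{\lceil \log(n+1) \rceil - 1}}, \ldots, a_n x^n$     $n - 2^{\lceil \log(n+1) \rceil - 1} + 1$
Figure 2.6. Computation of $a_i x^i$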
Thus, for example, two variables at the third step are reduced to 8 variables at the first step. Repeating this procedure we get
$n' = 1 + \sum_{i=1}^{\alpha - 1} (2^{i-1} \cdot 2^{i}) + 2^{\alpha}(n - 2^{\alpha - 1} + 1) = 1 + \sum_{i=1}^{\alpha - 1} 2^{2i-1} + 2^{\alpha}(n - 2^{\alpha - 1} + 1)$
variables on the first level where
$\alpha = \lceil \log(n + 1) \rceil$.
To add these n' numbers by the log sum method, it takes
$h^T_{min} = \lceil \log n' \rceil$ steps.
2.4.1.4 Folding Method
This is a method which computes $p_n(x)$ in shorter time than any known method.
Assume that $p_{t-1}(x)$ can be computed in h - 1 steps, $p_i(x)$ ($t \le i \le s - 1$) are computed in h steps and $p_s(x)$ can be computed in h + 1 steps. Then we show that all $p_j(x)$ ($s \le j \le s + t - 1$) can be computed in h + 1 steps and further $p_{s+t}(x)$ can be computed in h + 2 steps.
  Steps    Degree
  h - 1    ~ t - 1
  h        t ~ s - 1
  h + 1    s ~ (s + t - 1)
  h + 2    s + t ~
Proof:
(1) First we show that $p_{s+t-j}(x)$ ($1 \le j \le t$) can be computed in h + 1 steps.
Figure 2.7. A Tree for $p_{s+t-j}(x)$
We write $p_{s+t-j}(x)$ as
$p_{s+t-j}(x) = (a_{s+t-j} x^{t-j} + \ldots + a_s) x^s + a_{s-1} x^{s-1} + \ldots + a_0 = p_{t-j}(x) x^s + p_{s-1}(x)$.
Now we show that $x^s$ can be computed in less than h steps. From Theorem 3 we know that $x^n$ can be computed in $\lceil \log n \rceil$ steps. Suppose that the computation of $x^s$ takes h steps or longer, i.e.
$h \le \lceil \log s \rceil$. (16)
Now from the assumption, $p_{s-1}(x)$ takes h steps to compute. Also (see Section 2.4.1.5)
$h \ge \lceil \log(2(s-1) + 1) \rceil = \lceil \log(2s-1) \rceil$. (17)
From Eq. (16) we have $s \ge 2^{h-1} + 1$ and from Eq. (17) we have $2^h \ge 2s - 1$ or $2^{h-1} \ge s$. Thus we have $s \ge 2^{h-1} + 1 > 2^{h-1} \ge s$ which is a contradiction. Thus $h > \lceil \log s \rceil$.
From the assumption $p_{t-j}(x)$ can be computed in h - 1 steps and $p_{s-1}(x)$ can be computed in h steps. Hence $p_{t-j}(x) \cdot x^s$ takes h steps and $p_{s+t-j}(x)$ can be computed in h + 1 steps.
(2) Next we show that $p_{s+t}(x)$ can be computed in h + 2 steps.
Figure 2.8. A Tree for $p_{s+t}(x)$
We write
$p_{s+t}(x) = (a_{s+t} x^{s-1} + \ldots + a_{t+1}) x^{t+1} + a_t x^t + \ldots + a_0 = p_{s-1}(x) \cdot x^{t+1} + p_t(x)$.
Then from the previous proof we know that $x^{t+1}$ can be computed in less than h + 1 steps. Since $p_{s-1}(x)$ and $p_t(x)$ take at most h and h + 1 steps respectively, we can compute $p_{s-1}(x) \cdot x^{t+1}$ in at most h + 1 steps and $p_{s+t}(x)$ in at most h + 2 steps.
(Q.E.D.)
It is easy to check that $p_2(x)$ takes 3 steps, $p_3(x)$ and $p_4(x)$ can be computed in 4 steps and $p_5(x)$ can be computed in 5 steps. By induction we obtain the following table for h = 2, 3, ..., 10.
  Minimum Steps    Degree of Polynomial
  3                2
  4                3 - 4
  5                5 - 7
  6                8 - 12
  7                13 - 20
  8                21 - 33
  9                34 - 54
  10               55 - 88
  11               89 - 143
Table 2.3. Computation of $p_n(x)$ by Folding Method
For example, p (x) takes 9 steps to be computed. Note that the
first numbers in the right column form a Fibonacci sequence.
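The induction behind Table 2.3 can be written as a short recurrence. The sketch below is an assumption-laden illustration: it seeds the table with the two base rows (3 steps for degree 2, 4 steps for degrees 3 and 4) and then applies the rule from the proof, namely that the row for h + 1 steps runs from s to s + t - 1, where s is its first degree and t is the first degree of the row for h steps. The function name `folding_table` is chosen here.

```python
def folding_table(max_steps: int) -> dict:
    """Map a step count to the (lowest, highest) degree reachable by folding."""
    ranges = {3: (2, 2), 4: (3, 4)}          # base rows
    for h in range(5, max_steps + 1):
        first = ranges[h - 1][1] + 1          # s: next degree not yet covered
        t = ranges[h - 1][0]                  # first degree of the previous row
        ranges[h] = (first, first + t - 1)    # s .. s + t - 1
    return ranges


def folding_steps(n: int) -> int:
    """Steps the folding method needs for a degree-n polynomial (n >= 2)."""
    h = 3
    table = folding_table(h)
    while table[h][1] < n:
        h += 1
        table = folding_table(h)
    return h


if __name__ == "__main__":
    for steps, (lo, hi) in folding_table(11).items():
        print(steps, lo, hi)
```

The first degrees of successive rows (2, 3, 5, 8, 13, ...) each equal the sum of the two preceding ones, which is the Fibonacci pattern noted in the text.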
2.4.1.5 Comparison of Four Methods
It has been proved that at least 2n operations are required to compute $p_n(x)$. Proofs appeared in several papers. We owe Ostrowski [3M and Motzkin 'I for their original works. Pan [35] summarized the results. An excellent review of the problem appears in [23]. Also Winograd [43] generalized results of Ostrowski and Motzkin.
Now assume that to compute $p_n(x)$ in parallel h steps are required. Then Theorem 5 gives $h \le \lceil \log n \rceil + \lceil \log(n+1) \rceil + 1 \le 2\lceil \log(n+1) \rceil + 1$. Also since $2^k - 1$ operations can be performed in a parallel computation tree of height k, we have $2^h - 1 \ge 2n$ or $h \ge \lceil \log(2n+1) \rceil$. Thus
$2\lceil \log(n+1) \rceil + 1 \ge h \ge \lceil \log(2n+1) \rceil$.
In Figure 2.9, these upper and lower limits are plotted together with the results from the previous section. It is clear that the folding method is the best in terms of the computational speed. It is yet an open question if there is a better method.
2.4.2 Polynomial Computation by the kth Order Horner's Rule
Now let us study computation of a polynomial by the kth order Horner's rule.
A polynomial is computed by the kth order Horner's rule as shown by the following procedure. We use k PE's. First we compute all $x^i$ ($1 \le i \le k$) simultaneously. Then we compute k polynomials $p_k^i(x)$ on k PE's simultaneously, where
$p_k^i(x) = a_i + x^k(a_{i+k} + x^k(a_{i+2k} + \ldots))$   ($0 \le i \le k - 1$).
Then we get k partial results which are added to get $p_n(x)$. Figure 2.10 illustrates this.
This scheduling may not be the best, yet it is easy to implement and also adaptable to any number of k.
Theorem 5:
The minimum number of steps, $h^p(m, n)$, required to compute a degree n polynomial $p_n(x)$ on P(m) by the mth order Horner's rule is
$h^p(m,n) =$
  (1) $2n$   ($m = 1$)
  (2) $2\lceil \log m \rceil + 2\lfloor n/m \rfloor + 1$   ($n + 1 > m \ge 2$)
  (3) $\lceil \log n \rceil + \lceil \log(n+1) \rceil + 1$   ($m \ge n + 1$)
[Figure 2.9: plot with a vertical axis labeled "Number of Steps"; the remaining axis labels and the curves are not recoverable from the scan.]
Proof:
A proof for (1) can be found in [43]. (2) and (3) are self-evident from the above discussion.
(Q.E.D.)
  $(a_0 + x^k(a_k + x^k(\ldots)))$
  $(a_1 + x^k(a_{k+1} + \ldots))$
  $(a_2 + x^k(a_{k+2} + \ldots))$
  ...
  $(a_{k-1} + x^k(a_{2k-1} + \ldots))$
  (Horner's part)
Figure 2.10. kth Order Horner's Rule
Lemma 6:
The minimum number of steps $h^p_{min}$ to compute a polynomial $p_n(x)$ of degree n by the kth order Horner's rule ($1 \le k \le n + 1$) is
$h^p_{min} = \lceil \log n \rceil + \lceil \log(n+1) \rceil + 1$.
Proof:
It is enough to show that
$2\lceil \log m \rceil + 2\lfloor n/m \rfloor + 1 \ge 2\lceil \log(n+1) \rceil + 1$
or
$\lceil \log m \rceil + \lfloor n/m \rfloor \ge \lceil \log(n+1) \rceil$
for $m \le n$, because $2\lceil \log(n+1) \rceil \ge \lceil \log n \rceil + \lceil \log(n+1) \rceil$ and if $m \ge n + 1$, then by Theorem 5 $h^p(n+1, n) = h^p_{min}$.
Assume that $n = 2^g + t$ ($1 \le t \le 2^g$) and $m = 2^k + s$ ($1 \le s \le 2^k$). Then we have two cases, i.e. (1) g = k or (2) g > k. If g = k, then $t \ge s$ since $n \ge m$.
(1) g = k and $t \ge s$.
Then $\lceil \log m \rceil = k + 1 = g + 1$, and $\lfloor n/m \rfloor = 1$. Hence
$\lceil \log m \rceil + \lfloor n/m \rfloor = g + 2 \ge \lceil \log(n+1) \rceil$.
(2) g > k.
We have further two subcases, i.e. (i) $1 \le t < 2^g$ or (ii) $t = 2^g$.
(i) $1 \le t < 2^g$.
Then $\lceil \log(n+1) \rceil = g + 1$, $\lceil \log m \rceil = k + 1$ and
$\lfloor n/m \rfloor = \lfloor (2^g + t)/(2^k + s) \rfloor \ge 2^{g-k-1}$.
Hence
$\lceil \log m \rceil + \lfloor n/m \rfloor \ge k + 1 + 2^{g-(k+1)}$.
Now we show that for all k < g
$f(k) = k + 1 + 2^{g-(k+1)} \ge g + 1$.
Since for all k < g,
$\frac{d}{dk} f(k) = 1 - (\log_e 2)\, 2^{g-(k+1)} < 0$,
$\min f(k) = g + 1$ where $0 \le k < g$.
Hence $f(k) \ge g + 1$ or
$\lceil \log m \rceil + \lfloor n/m \rfloor \ge g + 1 = \lceil \log(n+1) \rceil$.
(ii) $t = 2^g$.
Similarly to the above, we can show that
$\lceil \log m \rceil + \lfloor n/m \rfloor \ge \lceil \log(n+1) \rceil$.
The details are omitted.
(Q.E.D.)
Unlike the case of $h^a(m,n)$ and $h^w(m,n)$, $h^p(m,n)$ is not necessarily a nonincreasing function of m. (A few curves in Figure 2.11 illustrate this.)
Therefore it becomes important to choose an appropriate m for a given n to compute in an optimum way.
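The non-monotonic behaviour is easy to see by tabulating the step count of Theorem 5. The following Python sketch transcribes the three cases of the theorem as reconstructed here (the function name `h_p` is chosen for this illustration) and lists the values for a degree 7 polynomial.

```python
def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n (n >= 1)."""
    return (n - 1).bit_length()


def h_p(m: int, n: int) -> int:
    """Steps of the mth order Horner's rule on m PE's (Theorem 5)."""
    if m == 1:
        return 2 * n                                       # case (1)
    if m >= n + 1:
        return ceil_log2(n) + ceil_log2(n + 1) + 1         # case (3)
    return 2 * ceil_log2(m) + 2 * (n // m) + 1             # case (2)


if __name__ == "__main__":
    n = 7
    print([h_p(m, n) for m in range(1, n + 2)])
```

For n = 7 the list is 14, 9, 9, 7, 9, 9, 9, 7: four PE's already reach the minimum of 7 steps, while five, six or seven PE's need 9 steps.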
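Theorem 6:
Given an nth degree polynomial, let
$M = \mu_m (h^p(m,n) = \lceil \log n \rceil + \lceil \log(n+1) \rceil + 1)$,
i.e. the smallest m for which the equality holds, where
$h^p(m, n) = 2\lceil \log m \rceil + 2\lfloor n/m \rfloor + 1$.
Then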
[Figure 2.11. The Number of Steps, $h^p(m, n)$, to Compute $p_n(x)$ on P(m) by the mth Order Horner's Rule: curves (one labeled n = 5) plotted against the Number of PE's (m), with axis ticks at 10 and 15.]
M =
  (1) $n + 1$   ($n = 2^g$)
  (2) $\lceil (n+1)/3 \rceil$   ($2^g < n < 2^g + 2^{g-1}$)
  (3) $\lceil (n+1)/2 \rceil$   ($2^g + 2^{g-1} \le n < 2^{g+1}$)
where $g = \lfloor \log n \rfloor$.
Proof:
Proof is given for each case independently.
(1) $n = 2^g$.
The proof is divided into two parts. First it is clear that if we have n + 1 PE's, then $p_n(x)$ can be computed in $h^p_{min}$ steps (see Theorem 5). Next we show that if the number of PE's, m, is less than or equal to M - 1 ($m \le n$), then $h^p_{min} < h^p(m, n)$, where
$h^p_{min} = \lceil \log n \rceil + \lceil \log(n+1) \rceil + 1 = 2g + 2$.
Let $m = 2^k + p$ ($k < g$, $1 \le p \le 2^k$). Then
$h^p(m,n) = 2k + 3 + 2\lfloor 2^g/(2^k + p) \rfloor$. (18)
Let
$P = 2^{g-k} p/(2^k + p)$,
so that $2^g/(2^k + p) = 2^{g-k} - P$. Then
$\frac{dP}{dp} = \frac{2^g}{(2^k + p)^2} > 0$,
and
$\max P = 2^{g-k-1}$ for $1 \le p \le 2^k$.
From this and Eq. (18), we get
$h^p(m,n) \ge 2k + 3 + 2\lfloor 2^{g-k-1} \rfloor$. (19)
Since g > k, Eq. (19) becomes
$h^p(m,n) \ge 2k + 3 + 2^{g-k}$.
Now let
$f(k) = (2k + 3 + 2^{g-k}) - (2g + 2) = 2(k - g) + 1 + 2^{g-k}$.
Since $2^a + 1 - 2a > 0$ (note that $\frac{d}{da}(2^a + 1 - 2a) = \log_e 2 \cdot 2^a - 2 > 0$ for a > 1), we have $f(k) > 0$ for all k < g. Since
$h^p(m,n) - (2g + 2) \ge f(k) > 0$,
$h^p(m,n) > 2g + 2 = h^p_{min}$.
This proves (1).
Now since $\lceil \log n \rceil = \lceil \log(n + 1) \rceil$ if $n \ne 2^g$ for some g, we use
$h^p_{min} = 2\lceil \log(n + 1) \rceil + 1$
in place of
$h^p_{min} = \lceil \log(n + 1) \rceil + \lceil \log n \rceil + 1$
to prove (2) and (3). Then it is enough to show that
$M = \mu_m (\lceil \log(n + 1) \rceil = \lceil \log m \rceil + \lfloor n/m \rfloor)$
instead of
$M = \mu_m (2\lceil \log(n + 1) \rceil + 1 = 2\lceil \log m \rceil + 2\lfloor n/m \rfloor + 1)$.
Since $2^g < n < 2^{g+1}$ for (2) and (3), we have $\lceil \log(n + 1) \rceil = g + 1$. By direct computation, we can show that the theorem holds for $n \le 10$ (see Table 2.4).
(2) $2^g < n < 2^g + 2^{g-1}$.
Now we show that
$h^p_{min} = 2\lceil \log M \rceil + 2\lfloor n/M \rfloor + 1$.
To show this we first show
$\lceil \log M \rceil + \lfloor n/M \rfloor = g + 1$.
Since $2^g < n < 2^g + 2^{g-1}$, we have
$2^{g-2} < (n + 1)/3 \le 2^{g-1}$ (20)
and
$\lceil \log M \rceil = \lceil \log \lceil (n + 1)/3 \rceil \rceil = g - 1$. (21)
Now let $\lceil (n + 1)/3 \rceil = k$ ($k \ge 3$ as we assumed). Then
$n + 1 = 3k - p$ ($p \le 2$)
or
$n = 3k - p - 1$. (22)
Using this and the relation $0 < (p + 1)/k \le 1$, we have
$\lfloor n/M \rfloor = 2$. (23)
Thus from Eq. (21) and (23)
$\lceil \log M \rceil + \lfloor n/M \rfloor = g + 1 = \lceil \log(n + 1) \rceil$,
or
$h^p(M,n) = h^p_{min}$.
Next we show that if $m' < \lceil (n + 1)/3 \rceil$, then $\lceil \log m' \rceil + \lfloor n/m' \rfloor > g + 1 = \lceil \log(n + 1) \rceil$ (or equivalently $h^p(m', n) > h^p_{min}$).
m
1
2
3
k
5
6
7
8
9
10 11
g
M
Case
2
it
1+
k
h
1
1
3
6
5
6
5
5
2
2
k
8
7
7
7
7
2
1
5
10
7
7
9
7
7
2
1
6
12
9
9
7
9
9
7
7
1+
2
7
11+
9
9
7
9
9
9
7
7
h
2
8
16
11
9
9
9
9
9
9
9
9
3
1
9
18
11
11
9
9
9
9
9
11
9
9
U
1
10
20
13
11
9
11
9
9
9
11
11 9
9
1+
1
(1): $2^g < n < 2^g + 2^{g-1}$
(2): $2^g + 2^{g-1} \le n < 2^{g+1}$
where $g = \lfloor \log n \rfloor$.
n: The degree of a polynomial
m: The number of PE's
M: The minimum number of PE's
Table 2.4. The Number of Steps Required to Compute $p_n(x)$, $h^p(m,n)$, for $n \le 10$
We have two cases, i.e.
(i) $2^i < m' \le 2^{i+1}$ where $i + 1 \le g - 2$
and (ii) $2^{g-2} < m' < \lceil (n + 1)/3 \rceil$.
(i) $2^i < m' \le 2^{i+1}$.
Since $i + 1 \le g - 2$, we write
$g - 2 = i + 1 + j$ ($j \ge 0$).
Then
$\lceil \log m' \rceil = i + 1 = g - 2 - j$,
and since $2^g < n < 2^g + 2^{g-1}$,
$\lfloor n/m' \rfloor \ge 2^{g-i-1} = 2^{2+j}$.
Thus
$\lceil \log m' \rceil + \lfloor n/m' \rfloor \ge (g - 2 - j) + 2^{2+j} > g + 1 = \lceil \log(n + 1) \rceil$
because $2^a > a + 1$ if a > 1.
(ii) $2^{g-2} < m' < \lceil (n + 1)/3 \rceil$.
Let us write
$m' = \lceil (n + 1)/3 \rceil - q = k - q$ ($q \ge 1$).
Then by Eq. (20) and (22), we get
$\lceil \log m' \rceil + \lfloor n/m' \rfloor = g - 1 + \lfloor (3k - p - 1)/(k - q) \rfloor > g + 1 = \lceil \log(n + 1) \rceil$,
because $q \ge 1$, $p \le 2$ and $3q - p - 1 \ge 0$.
This ends a proof for (2).
(3) ($2^g + 2^{g-1} \le n < 2^{g+1}$) can be proved in a similar manner and the details are omitted.
(Q.E.D.)
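Theorem 6 can be compared with a brute-force search over m for small degrees. The sketch below is illustrative and rests on the formulas as reconstructed here: `h_p` is the step count of Theorem 5, `theorem6_M` is the closed form, and `brute_M` is the smallest m that attains $\lceil \log n \rceil + \lceil \log(n+1) \rceil + 1$. The comparison starts at n = 3, because for n = 2 a single PE already attains the minimum under case (1) of Theorem 5.

```python
def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n (n >= 1)."""
    return (n - 1).bit_length()


def h_p(m: int, n: int) -> int:
    """Steps of the mth order Horner's rule on m PE's (Theorem 5)."""
    if m == 1:
        return 2 * n
    if m >= n + 1:
        return ceil_log2(n) + ceil_log2(n + 1) + 1
    return 2 * ceil_log2(m) + 2 * (n // m) + 1


def theorem6_M(n: int) -> int:
    """Closed form for the minimum number of PE's (n >= 3)."""
    g = n.bit_length() - 1                    # floor(log2 n)
    if n == 1 << g:
        return n + 1                          # case (1)
    if n < (1 << g) + (1 << (g - 1)):
        return -(-(n + 1) // 3)               # case (2), ceiling division
    return -(-(n + 1) // 2)                   # case (3)


def brute_M(n: int) -> int:
    """Smallest m whose step count equals the minimum over all m."""
    target = ceil_log2(n) + ceil_log2(n + 1) + 1
    return next(m for m in range(1, n + 2) if h_p(m, n) == target)


if __name__ == "__main__":
    for n in range(3, 17):
        print(n, theorem6_M(n), brute_M(n))
```

The boundary degree $n = 2^g + 2^{g-1}$ (for example n = 6 or n = 12) falls under case (3) in this check.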
It should be noted that the function $h^p(m,n)$ is not a nonincreasing function of m even if m < M for some n. However if $n \le 50$, then for more than 70% of the cases, $h^p(m, n)$ turn out to be nonincreasing functions. (The only cases where $h^p(m, n)$ is not a nonincreasing function are the cases n = 15, 27, 28, 29, 30, 31, 36, 37, 38, 39, 45, 46, and 47. In any case, $h^p(m,n)$ increases by at most one.)
[Figure: plot with an axis labeled "Number of PE's (M)"; the remaining labels and the plotted values are not recoverable from the scan.]
3. TREE HEIGHT REDUCTION ALGORITHM
3.1 Introduction
In this chapter, recognition of parallelism within an arithmetic
statement or a block of statements is discussed. There are several existing
algorithms which produce a syntactic tree to achieve this end. The tree is
such that operations on the same level can be done in parallel. Among them,
the algorithm by Baer and Bovet [6] is claimed to give the best result. For
example, a statement
a + b + c + d × e × f + g + h
can be computed in four steps by their algorithm (Figure 3.1).
[Tree diagram with levels 1 to 4.]
Figure 3.1. An Arithmetic Expression Tree (1)
The algorithm reorders some terms in a statement to decrease
tree height. However, this algorithm does not always take advantage of
distributions of multiplication over addition. An arithmetic expression
a(bcd+e) takes four steps as it is, whereas the equivalent distributed expression (abcd+ae) requires only three steps. A further example is Horner's rule. To compute a polynomial
$p_n(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_n x^n$, (1)
Horner's rule
$p_n(x) = a_0 + x(a_1 + x(a_2 + \ldots x(a_{n-1} + x a_n)) \ldots)$ (2)
gives a good result for serial machines. However, if a parallel machine is to be used, (1) gives a better result than (2). Namely, if we apply Baer and Bovet's algorithm [6] on (2), we get 2n steps whereas (1) requires only $2\lceil \log_2(n+1) \rceil$ steps (see Chapter 2). Thus it is desirable for the compiler to be able to obtain (1) from (2) by distributing multiplications over additions properly. An algorithm to distribute multiplications properly over additions to obtain more parallelism (henceforth called the distribution algorithm or the tree height reduction algorithm) is discussed now.
3.2 Tree Height and Distribution
Definition 1 :
An arithmetic expression A consists of additions, multiplications, and
possibly parentheses. We assume that addition and multiplication require the same
amount of time (see Chapter 1). Subtractions and divisions will be introduced later. Small letters (a, b, c, ...), possibly with subscripts, denote single variables. Upper case letters and t, possibly with subscripts, denote arbitrary arithmetic expressions, including single variables. t is used to single out particular subexpressions, i.e. terms.
Then A can always be written as either (1) $A = \sum_{i=1}^{n} t_i$ or (2) $A = \prod_{i=1}^{n} (t_i)$, e.g. $A = abc + d(ef+g) = t_1 + t_2$ where $t_1 = abc$ and $t_2 = d(ef+g)$, or $A = (a+b)c(de+f) = (t_1)t_2(t_3)$ where $t_1 = a + b$, $t_2 = c$ and $t_3 = de + f$. Note that when we write $A = \prod_{i=1}^{n} (t_i)$, we implicitly assume that for each i, $t_i = \sum_{j=1}^{n(i)} t_j'$. h[A] denotes the height of a tree T for A, which is of the minimum height among all possible trees for A in its presented form.
A minimum height tree (henceforth by a tree we mean a minimum height tree) for A, T[A], is built as follows [6].
Let us assume that $A = \sum_{i=1}^{n} t_i$ or $A = \prod_{i=1}^{n} (t_i)$ and that for each i, a minimum height tree $T[t_i]$ has been built. Then first we choose two trees, say $T[t_p]$ and $T[t_q]$, each of whose height is smaller than height of any other tree. We combine these two trees and replace them by the new tree whose height is one higher than $\max(h[t_p], h[t_q])$:
[Diagram: the combined tree, with $T[t_p]$ and $T[t_q]$ joined under a new root. Footnote: Figures are in scale as much as possible.]
This procedure is repeated until all trees are combined into one tree, which is T[A]. The procedure is formalized as follows:
(1) Let $ST = \{(1), (2), (3), \ldots, (n)\}$ and let $h'[i] = h[t_i]$ for all $i \in ST$.
(2) Choose two elements of ST, say p and q such that $h'[p], h'[q] \le M$ where $M = \min\{h'[u]\}$ for all $u \in ST - \{p, q\}$.
(3) Now let $ST = (ST - \{p, q\}) \cup \{(p, q)\}$ and $h'[(p, q)] = \max\{h'[p], h'[q]\} + 1$.
(4) If $|ST| = 1$, then stop else go to Step 2.
After we apply the above procedure on A, we get e.g. $ST = \{((((1)(2))(3))((4)(5)))\}$ where a pair (ab) indicates that trees corresponding to a and b are to be combined. Thus in this case we get:
[Tree diagram: root ((((1)(2))(3))((4)(5))) with subtrees (((1)(2))(3)) and ((4)(5)); the former contains ((1)(2)).]
as a minimum height tree of A. In general the procedure is applied from the lowest parenthesis level (see Definition 3) to the higher parenthesis levels.
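Steps (1) to (4) amount to repeatedly combining the two lowest trees, which a priority queue expresses directly. The following Python sketch is an illustration, not the thesis's code: it tracks only the heights $h'[\cdot]$ of the terms of one sum or product, and the name `combined_height` is chosen here.

```python
import heapq


def combined_height(term_heights):
    """Height of the tree obtained by always combining the two lowest trees.

    term_heights holds h[t_i] for the terms of one sum or one product;
    a single variable has height 0.
    """
    heap = list(term_heights)
    heapq.heapify(heap)
    while len(heap) > 1:
        p = heapq.heappop(heap)               # lowest remaining tree
        q = heapq.heappop(heap)               # second lowest
        heapq.heappush(heap, max(p, q) + 1)   # step (3)
    return heap[0]


if __name__ == "__main__":
    # a + b + c + d*e*f + g + h: five single variables and one term of height 2
    print(combined_height([0, 0, 0, 2, 0, 0]))
```

For the statement a + b + c + d × e × f + g + h the product d × e × f has height 2, and the call above returns 4, the four steps quoted in the introduction of this chapter.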
Example 1:
Let A = (a+bc)(d+efg) + hi = $t_1 + t_2$
where $t_1 = (t_3)(t_6)$, $t_3 = a + bc = t_4 + t_5$ and $t_6 = d + efg = t_7 + t_8$.
Footnote to step (2) of the procedure: If there are many choices, then choose those subtrees with smaller 4(Sh[t ]) values first (see Definition 8).
Figure 3.2. An Arithmetic Expression Tree (2)
The effective length e of an arithmetic expression is defined as
$e[A] = 2^{h[A]}$.
The number of single variables in an arithmetic expression is the number of single variable occurrences in it.
The height of a minimum height tree can be obtained without actually
building a tree.
Theorem 1:
(1.1) If $A = \prod_{i=1}^{p} a_i$ or $A = \sum_{i=1}^{p} a_i$ then $h[A] = \log_2 \lceil p \rceil_2$.
(1.2) If $A = \sum_{i=1}^{p} t_i$ then $h[A] = \log_2 \lceil \sum_{i=1}^{p} e[t_i] \rceil_2$.
(1.3) If $A = \prod_{i=1}^{p} a_i \times \prod_{j=1}^{\ell} (t_j)$ then $h[A] = \log_2 \lceil p + \sum_{j=1}^{\ell} e[t_j] \rceil_2$.
Given an arithmetic expression A, to obtain the height of a tree for A, Theorem 1 is applied from the innermost parts of A to the outer parts, recursively.
Proof:
(1) It is obvious that if $A = \sum_{i=1}^{p} a_i$ or $A = \prod_{i=1}^{p} a_i$ then $h[A] = \log_2 \lceil p \rceil_2$.
(2) Now let $A = \sum_{i=1}^{p} t_i$. Then we can replace each $t_i$ by a product of $e[t_i]$ single variables without affecting the total tree height h[A]. (Note that each $t_i$ must be computed before the summation over $t_i$ is taken.) Thus A becomes $A' = \sum_{i=1}^{p} \prod_{j=1}^{e[t_i]} a_j$ and h[A] = h[A']. Let us call a tree for $\prod_{j=1}^{e[t_i]} a_j$ a subtree. Then a tree for A' is built using subtrees in the increasing order of their heights. Since a binary tree of height h cannot accommodate more than $2^h$ leaves, we have
$2^{h[A']} \ge \sum_{i=1}^{p} e[t_i] > 2^{h[A'] - 1}$
or $h[A'] = h[A] = \log_2 \lceil \sum_{i=1}^{p} e[t_i] \rceil_2$.
(3) can be proved in a similar manner.
(Q.E.D.)
Definition 2:
The additive length a and the multiplicative length m of an arithmetic expression A are defined as follows:
(2.1) If $A = \prod_{i=1}^{p} a_i$, then
  (i) a[A] = e[A] and
  (ii) m[A] = p.
(2.2) If $A = \sum_{i=1}^{p} t_i$, then
  (i) $a[A] = \sum_{i=1}^{p} e[t_i]$ and
  (ii) m[A] = e[A].
(2.3) If $A = \prod_{i=1}^{p} a_i \times \prod_{j=1}^{\ell} (t_j)$, then
  (i) a[A] = e[A] and
  (ii) $m[A] = p + \sum_{j=1}^{\ell} e[t_j]$.
It is to be noted that
(1) $e[A] = \lceil a[A] \rceil_2 = \lceil m[A] \rceil_2$,
(2) $h[A] = \log_2 \lceil a[A] \rceil_2 = \log_2 \lceil m[A] \rceil_2$;
compare this definition with Theorem 1.
Definition 3:
The level $\ell$ of a parenthesis pair in an arithmetic expression is defined as follows:
First we start numbering parentheses at the left of the formula, proceeding from left to right, counting each left parenthesis, (, as +1 and each right parenthesis, ), as -1 and adding as we go. We call the maximum number m the depth of parentheses. Now the level $\ell$ of each parenthesis pair is obtained as $\ell = p$, where p is the count for each parenthesis. The arithmetic expressions enclosed by the level $\ell$ parenthesis pair are called the level $\ell$ arithmetic expressions, $A^{\ell}$. Also for convenience we assume that there is an outermost parenthesis pair which encloses A.
Example 2:
    1  23      33     3  21
A = (ab((cd + e)(f + g)+k))
Here (cd + e) and (f + g) are marked as level 3 and ((cd + e)(f + g)+k) as level 2.
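The counting rule of Definition 3 can be sketched as a single left-to-right scan. The Python fragment below is an illustration under one stated assumption: following the numbering shown in Example 2, a right parenthesis is labeled with the count in force before it is closed, so that both members of a pair carry the same number. The function name `parenthesis_counts` is chosen here.

```python
def parenthesis_counts(expr: str):
    """Return the count attached to each parenthesis, left to right."""
    counts, depth = [], 0
    for ch in expr:
        if ch == "(":
            depth += 1            # a left parenthesis counts +1
            counts.append(depth)
        elif ch == ")":
            counts.append(depth)  # same number as its matching "("
            depth -= 1            # a right parenthesis counts -1
    return counts


if __name__ == "__main__":
    counts = parenthesis_counts("(ab((cd + e)(f + g)+k))")
    print(counts, "depth =", max(counts))
```

For the expression of Example 2 the scan yields the counts 1, 2, 3, 3, 3, 3, 2, 1 and a depth of 3.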
Now several lemmas are in order.
Lemma 1:
Let $A = \sum_{i=1}^{n} t_i$ or $A = \prod_{i=1}^{n} (t_i)$. Also let $A' = t_1 + t_2 + \ldots + t_i' + \ldots + t_n$ or $A' = (t_1) \times (t_2) \times \ldots \times (t_i') \times \ldots \times (t_n)$, and $A'' = A + t_{n+1}$ or $A'' = A \times (t_{n+1})$. Then
(i) $h[A'] \ge h[A]$ if $h[t_i'] \ge h[t_i]$
(ii) $h[A''] \ge h[A]$.
Proof:
Obvious from Theorem 1.
What Lemma 1 implies is that the height of the tree for an arithmetic expression is a nondecreasing function of term heights, and the number of terms involved.
In an arithmetic expression, there are four possible ways of parenthesis occurrence:
$P_1$) $\ldots + (A) + \ldots$
$P_2$) $\ldots \theta (t_1 \times t_2 \ldots \times t_n) \times (t_1' \times t_2' \ldots \times t_m') \theta \ldots$
$P_3$) $\ldots \theta\, a_1 \times a_2 \times \ldots a_n \times (A) \theta \ldots$
$P_4$) $\ldots \theta (t_1 + t_2 + \ldots + t_n) \times (t_1' + t_2' + \ldots + t_m') \theta \ldots$
where $\theta$ represents +, ×, or no operation.
Lemma 2:
Let $D = B + (A) + C$ and $D_1 = B + (t_1 \times \ldots \times t_n) \times (t_1' \times \ldots \times t_m') + C$. Also let $D^d = B + A + C$ and $D_1^d = B + t_1 \times \ldots \times t_n \times t_1' \times \ldots \times t_m' + C$. Then $h[D] \ge h[D^d]$ and $h[D_1] \ge h[D_1^d]$.
Proof:
Obvious from Theorem 1.
As an example, let D = (a+b+c)+d and $D_1$ = (abc)(defgh). Then $D^d$ = a + b + c + d and $D_1^d$ = abcdefgh, $h[D] = 3 > h[D^d] = 2$ and $h[D_1] = 4 > h[D_1^d] = 3$.
Lemma 3:
Let $D = (\sum_{i=1}^{n} t_i)(\sum_{j=1}^{m} t_j')$ and $D^d = t_1 t_1' + t_1 t_2' + \ldots + t_n t_m'$. Then $h[D^d] \ge h[D]$.
Proof:
Let D = (A)(B) where $A = \sum_{i=1}^{n} t_i$ and $B = \sum_{i=1}^{m} t_i'$. Also without losing generality, assume that $h[A] \ge h[B]$. Then h[D] = h[A] + 1. For each j, let $d_j = t_j' t_1 + t_j' t_2 + \ldots + t_j' t_n$. It is clear that $h[t_j' t_i] > h[t_i]$ for all i and j. Thus from Lemma 1, we have $h[d_j] \ge h[A]$ for all j. Since $D^d = \sum_{j=1}^{m} d_j$, $h[D^d] \ge \min(h[d_j]) + \log_2 \lceil m \rceil_2 \ge h[A] + \log_2 \lceil m \rceil_2$, or since h[D] = h[A] + 1, $h[D^d] \ge h[D]$.
(Q.E.D.)
Note that the above lemma does not imply necessarily that if $D = (\sum_{i=1}^{n} t_i)(B)$, and $D^p = t_1(B) + t_2(B) + \ldots + t_n(B)$, then $h[D^p] \ge h[D]$. Actually, it can be shown that there is a case when $h[D] > h[D^p]$. What Lemma 3 says is that D = (A)(B) should not be fully distributed, but partial distribution, as in $D^p$, may be done in some cases.
Lemmas 2 and 3 together indicate that distribution in case ($P_3$) and partial distribution for case ($P_4$) are the only cases which should be considered for lowering tree height. In cases ($P_1$) and ($P_2$), removal of parentheses leads to a better result or at least gives the same tree height. Full distribution in case ($P_4$) always increases tree height and should not be done. Also it should be clear that in any case tree height of an arithmetic expression can not be lower than that of a component term even after distributions are done. For example, let $D = t(A) = t \times (\sum_{i=1}^{n} t_i)$ and $D^d = \sum_{i=1}^{n} (t \times t_i)$. Then from Lemma 1, we have $h[t\, t_i] \ge h[t_i]$ for each i. Thus
$h[D^d] \ge h[A]$.
The same argument holds for all four cases. This assures that evaluation of distributions can be done locally. That is, if some distribution increases tree height for a term then that distribution should not be performed because once tree height is increased, it can never be remedied by further distributions.
Actually, there are two cases where distribution pays. For example, if A = a(bcd + e), then h[A] = 4. However, if we distribute a, then we get $A^d$ = abcd + ae and $h[A^d] = 3$. The idea is to balance a tree by filling the "holes" because a balanced tree can accommodate the largest number of variables among equal height trees. The situation is, however, not totally trivial, because by distribution, the number of variables in an expression is also increased. Next let A = a(bc + d) + e = t + e and $A^d$ = abc + ad + e = $t^d$ + e. In this case h[A] = 4 but $h[A^d] = h[t^d] = 3$. What happened here is that t is "opened" by distributing a over (bc + d) and the "space" to put e in is created.
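The heights quoted in these two examples can be reproduced with the effective length rule of Theorem 1. The sketch below is a minimal illustration with an assumed representation: an expression is either a variable name (a string) or a tuple whose first entry is the operator and whose remaining entries are the operands of one sum or one product as written. A term of height h is counted as $2^h$ single variables.

```python
def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n (n >= 1)."""
    return (n - 1).bit_length()


def height(expr) -> int:
    """Minimum tree height of an expression in its presented form."""
    if isinstance(expr, str):
        return 0                               # a single variable
    _operator, *terms = expr
    # Each term contributes its effective length e[t] = 2**h[t].
    return ceil_log2(sum(2 ** height(t) for t in terms))


# "Hole" example: a(bcd + e) versus abcd + ae
hole = ("*", "a", ("+", ("*", "b", "c", "d"), "e"))
hole_distributed = ("+", ("*", "a", "b", "c", "d"), ("*", "a", "e"))

# "Space" example: a(bc + d) + e versus abc + ad + e
space = ("+", ("*", "a", ("+", ("*", "b", "c"), "d")), "e")
space_distributed = ("+", ("*", "a", "b", "c"), ("*", "a", "d"), "e")

if __name__ == "__main__":
    print(height(hole), height(hole_distributed))
    print(height(space), height(space_distributed))
```

Both pairs evaluate to heights 4 and 3, before and after distribution respectively.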
At each level of parenthesis pair, cases ($P_3$) and ($P_4$), i.e., instances of "holes" and "spaces", are checked and proper distribution is performed. Next we give definitions of holes and spaces, and formalize these ideas.
3.3 Holes and Spaces
3.3.1 Introduction
Before we proceed further, let us study trees for arithmetic expressions
more carefully.
Let $A = \sum_{i=1}^{p} t_i$. By Definition 1 we first build minimum height trees $T[t_i]$ for all i, and T[A] is built by combining these $T[t_i]$. Once $T[t_i]$ is built the details of $t_i$ do not matter, and the only thing that matters is its height $h[t_i]$.
Suppose $T[t_i]$ and $T[t_j]$ are combined to build T[A]. Assume also that $h[t_i] = h[t_j] + s$. Then we will get s nodes to which no trees are attached other than $T[t_j]$. We call these free nodes whose heights are $h[t_j] + 1$, $h[t_j] + 2$, ..., $h[t_i]$.
Figure 3.3. Free Nodes
Similarly we can enumerate all free nodes in T[A] with their heights.
[Tree diagram; legend: free node, occupied node, root of $T[t_i]$, root of T[A].]
Figure 3.4. Free Nodes in a Tree
Free nodes in a tree $T[\prod_{i=1}^{p} (t_i)]$ are defined similarly.
Let us emphasize that once we get $T[t_i]$ we treat it as a whole and do not care about its details when we build T[A]. That is, when we consider free nodes in T[A] we mean free nodes "in" T[A] but "outside" of $T[t_i]$. For example let A = (a+b)(cde+f) = $(t_1)(t_2)$. Then
[Tree diagram of T[A] for this expression.]
$\alpha$ and $\beta$ are free nodes in T[A] while $\gamma$ and $\delta$ are free nodes in $T[t_2]$ and not in T[A].
Now suppose there are m free nodes in T[A]. We number them arbitrarily from 1 to m. Also let us denote the height of a free node $\alpha$ by $h[\alpha]$.
Given a free node $\alpha$ whose height is $h[\alpha]$ in $T[A = \sum t_i]$ (or $T[A = \prod (t_i)]$), by definition we can attach a tree T[t] whose height is $h[\alpha] - 1$ (or whose effective length is $2^{h[\alpha] - 1}$) to $\alpha$ without affecting the height of A.
Definition 4:
For $A = \prod (t_i)$ or $A = \sum t_i$, we build a tree. Then
(4.1) define $F_A[A]$ to be a set of all free nodes in T[A], and
(4.2) for each i define $F_R[A, t_i]$ to be a set of all free nodes which exist between the roots of T[A] and $T[t_i]$, i.e. the free nodes which we encounter when we traverse T[A] down to the root of $T[t_i]$.
For example let us consider the following tree (see Figure 3.5).
Figure 3.5. An Example of $F_A$ and $F_R$
Then $F_A[A] = \{\alpha, \beta, \gamma, \delta, \epsilon\}$ and $F_R[A, t_1] = \{\alpha, \beta, \gamma\}$, $F_R[A, t_g] = \{\alpha, \beta\}$ etc.
Lemma 4:
Suppose $h[\alpha] = h[\beta]$ for some free nodes $\alpha$ and $\beta$ in $T[A]$. Then without
changing the tree height $h[A]$, we can replace the two free nodes $\alpha$ and $\beta$ by one free
node $\beta'$ whose height is $h[\alpha]+1$.
Proof:
Figure 3.6. Elimination of a Free Node (original $T[A]$ and modified $T[A]$)
We can combine subtrees 1 and 3, and hence eliminate free nodes $\alpha$
and $\beta$ and create a new free node $\beta'$ (see Figure 3.6).
(Q.E.D.)
Given $F_A[A]$, two free nodes $\alpha$ and $\beta$ of equal height can be replaced by
one free node $\beta'$ whose height is $h[\alpha] + 1$. Repeating this procedure we finally
get a new $F_A'[A]$ in which no two free nodes have the same height. Let $\gamma$ and $\delta$
be free nodes in $F_A[A]$ and $F_A'[A]$ respectively. Assume that for all free nodes $\alpha$
in $F_A[A]$ (or $F_A'[A]$), $h[\gamma] \ge h[\alpha]$ (or $h[\delta] \ge h[\alpha]$). Then obviously $h[\gamma] \le h[\delta]$.
However, it is clear that $h[\delta] < h[A]$, i.e. if $h[\delta] = h[A]$ then $h[A]$ is not
the height of a minimum tree (see Figure 3.7). Hence we have the following
corollary.
Corollary 1:
In a minimum height tree $T[A]$,
$\sum_{\alpha \in F_A[A]} 2^{h[\alpha]-1} < e[A]/2$.
Proof:
Assume that $\sum_{\alpha \in F_A[A]} 2^{h[\alpha]-1} = e[A]/2$. Then by Lemma 4, all $\alpha$ in $F_A[A]$
can be replaced by one free node $\beta$ whose height is $h[A]$ (note $2^{h[\beta]-1} = e[A]/2$).
This, however, implies that we can build a tree $T'[A]$ whose height is $h[A]-1$,
which contradicts our assumption that $T[A]$ is a minimum height tree for A.
Figure 3.7. A Minimum Height Tree
(Q.E.D.)
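The merging step of Lemma 4 behaves like carrying in binary addition. The following is a minimal sketch, assuming free nodes are given simply as a list of heights; the function names `merge_free_nodes` and `capacity` are illustrative and do not appear in the thesis.

```python
from collections import Counter

def merge_free_nodes(heights):
    """Repeatedly replace two free nodes of equal height h by one free node
    of height h + 1 (Lemma 4) until all remaining heights are distinct."""
    counts = Counter(heights)
    result = []
    while counts:
        h = min(counts)          # lowest height still pending
        c = counts.pop(h)
        if c % 2:                # one node of this height is left unpaired
            result.append(h)
        if c // 2:               # each pair becomes one node one level higher
            counts[h + 1] += c // 2
    return result

def capacity(heights):
    """Sum of 2^(h-1) over the free nodes, the quantity bounded in Corollary 1."""
    return sum(2 ** (h - 1) for h in heights)

print(merge_free_nodes([1, 1, 2]))
```

Merging leaves the capacity unchanged, since two nodes of height h contribute the same total as one node of height h + 1.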
The following definitions are also used in the next section.
Definition 5:
An integer set is a set of integers with possibly duplicated elements,
e.g. {2,2,2,4,8,8}. If an integer x occurs in an integer set Y at least once then
we write x in Y, e.g. 2 in {2,2,2,4,8,8}. Let Y and Z be two integer sets. Then
by Y uni Z we get an integer set where if an integer x occurs i times in Y and j
times in Z, then x occurs i + j times in Y uni Z, e.g. {2,4,4} uni {2,4,8} =
{2,2,4,4,4,8}. Also if Y is an integer set, then #(Y) = (the sum of values of
all elements of Y), e.g. #({2,4,4}) = 10, and le Y = (the value of the largest
element in Y), e.g. le {2,4,4} = 4. Furthermore if x in Y, then by Y del {x} we
mean the integer set which is obtained by deleting one occurrence of x from Y, e.g.
{2,4,4,8,16,16} del {4} = {2,4,8,16,16}.
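The integer-set operations of Definition 5 are ordinary multiset operations. A minimal sketch, assuming an integer set is stored as a Python list; the names `uni`, `delete`, `total` and `le` are illustrative stand-ins for uni, del, # and le.

```python
def uni(y, z):
    """Y uni Z: multiset union in which multiplicities add."""
    return sorted(list(y) + list(z))

def delete(y, x):
    """Y del {x}: remove exactly one occurrence of x (x must occur in Y)."""
    out = list(y)
    out.remove(x)  # raises ValueError if x is not in Y
    return out

def total(y):
    """#(Y): sum of the values of all elements."""
    return sum(y)

def le(y):
    """le Y: value of the largest element."""
    return max(y)

print(uni([2, 4, 4], [2, 4, 8]), total([2, 4, 4]), le([2, 4, 4]))
```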
Let $Y_1, Y_2, ..., Y_p$ be integer sets. Then by $M_u(d)_{i=1}^{p} Y_i$ we mean the set
Y constructed as follows.
(1) Let Y = {} (empty) and s = d.
(2) If s = 0, then stop else go to (3).
(3) Let $u = \min_i(\text{le } Y_i)$ and k be an index such that $u = \text{le } Y_k$.
If there is more than one k which satisfies this, then pick an arbitrary
one. Now let Y = Y uni {u} and s = s - u. Also let $Y_k = Y_k$ del {u} and $Y_i$ =
($Y_i$ del {le $Y_i$}) uni {le $Y_i$ - u} for all i ($i \ne k$). Go to (2).
For example let $Y_1$ = {1,2,2,8}, $Y_2$ = {4,4,8} and $Y_3$ = {16}. Then Y =
$M_u(13)_{i=1}^{3} Y_i$ is constructed as follows.
(1) Y = {} and s = 13.
(2) le $Y_1$ = 8, le $Y_2$ = 8, le $Y_3$ = 16. Hence u = 8 and k = 1. Then Y =
{8} and s = 13 - 8 = 5. Also $Y_1$ = {1,2,2}, $Y_2$ = {4,4} and $Y_3$ = {8}.
(3) le $Y_1$ = 2, le $Y_2$ = 4, le $Y_3$ = 8. Hence u = 2 and k = 1. Then Y =
{2,8} and s = 5 - 2 = 3. Also $Y_1$ = {1,2}, $Y_2$ = {4,2}, $Y_3$ = {6}.
(4) Repeating the procedure we finally get Y = {1,2,2,8}.
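The construction of $M_u(d)$ can be sketched as a short loop. This is an illustrative sketch under two assumptions that the text does not state explicitly: a remainder of 0 is dropped rather than stored (as the worked example does for $Y_2$), and d does not exceed the smallest #($Y_i$), so that no set runs empty before s reaches 0. The name `m_u` is hypothetical.

```python
def m_u(d, sets):
    """Build Y = M_u(d) over the given integer sets (lists of positive ints)."""
    work = [list(y) for y in sets]        # private copies of Y_1 .. Y_p
    result, s = [], d
    while s > 0:
        largest = [max(y) for y in work]  # le Y_i for every i
        u = min(largest)
        k = largest.index(u)              # an arbitrary (here: first) minimiser
        result.append(u)
        s -= u
        for i, y in enumerate(work):
            top = largest[i]
            y.remove(top)                 # Y_i del {le Y_i}
            if i != k and top - u > 0:    # zero remainders are dropped (assumption)
                y.append(top - u)
    return sorted(result)

print(m_u(13, [[1, 2, 2, 8], [4, 4, 8], [16]]))
```

The loop guard `s > 0` replaces the test s = 0 of step (2) so that the sketch terminates even if the assumption on d is violated by a caller.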
Definition 6:
Let m be an integer. Now write m as a sum of powers of 2 in which each
power appears at most once. Then by $\Sigma b(m)$ we mean the integer set of powers of
2 which appear in the above sum.
For example since $13 = 2^3 + 2^2 + 2^0 = 8+4+1$, $\Sigma b(13) = \{1,4,8\}$ but
$\Sigma b(13) \ne \{1,4,4,4\}$ since 4 appears more than once.
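Definition 6 amounts to reading off the binary representation of m. A minimal sketch, with the hypothetical name `sigma_b`:

```python
def sigma_b(m):
    """Return the powers of 2 in the binary expansion of m, in increasing order."""
    return [1 << i for i in range(m.bit_length()) if (m >> i) & 1]

print(sigma_b(13))
```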
3.3.2 Holes
Now we discuss holes in an arithmetic expression. Intuitively, if an
arithmetic expression A has a hole of size u then an arithmetic expression t' with
$e[t'] \le u$ may be distributed over A without increasing tree height.
Definition 7:
For each $A = \pi(t_i)$ we define a total hole function $H_A$ and for each $t_i$ in
$A = \sum t_i$ we define a relative hole function $H_R$ as follows:
(7.1) For each $A = \pi(t_i)$, build a tree $T[A]$. Then define $H_A[A] =
\sum 2^{h[\alpha]-1}$ for all $\alpha \in F_A[A]$.
(7.2) For each $A = \sum t_i$, build a tree $T[A]$. Then for each $t_i$ define
$H_R[A,t_i] = \sum 2^{h[\alpha]-1}$ for all $\alpha \in F_R[A,t_i]$. If $F_R[A,t_i] = \emptyset$,
then let $H_R[A,t_i] = 0$.
As stated before if $\alpha$ is a free node in $T[A]$, then $2^{h[\alpha]-1}$ is the
effective length of a term t' whose tree $T[t']$ can be attached to $\alpha$ without
changing the tree height $h[A]$. Also let u in $\Sigma b(H_A[A])$. Then this implies that there
is a free node $\alpha$ in $F_A[A]$ such that $2^{h[\alpha]-1} = u$ (see Lemma 4). Similarly if u in
$\Sigma b(H_R[A,t_i])$ then there is a free node $\alpha$ in $F_R[A,t_i]$ such that $2^{h[\alpha]-1} = u$. Thus
in general
(1) if $A = \pi(t_i)$, then $h[(t') \times (A)] = h[A]$ if $m[t'] \le u$, and
(2) if $A = \sum t_i$, then $h[t'+A] = h[A]$ if $a[t'] \le u$, where $u =
\sum_{\alpha \in F_A[A]} 2^{h[\alpha]-1}$.
Definition 8:
The set of holes Sh[A] for an arithmetic expression A is an integer set
defined as follows:
(8.1) If $A = \pi_{i=1}^{p} a_i$, then we let $Sh[A] = \Sigma b(e[A] - p)$, e.g. Sh[abcde] =
$\Sigma b(3)$ = {1,2}.
(8.2) If $A = \pi_{i=1}^{p}(t_i)$, then we let $Sh[A] = \text{uni}_{i=1}^{p}(Sh[t_i])$ uni $\Sigma b(H_A[A])$,
e.g. Sh[(abc+d)(efg+h)(i+j)] = Sh[abc+d] uni Sh[efg+h] uni Sh[i+j]
uni $\Sigma b(H_A[A])$ = {1} uni {1} uni $\Sigma b(14)$ = {1,1,2,4,8}.
(8.3) If $A = \sum_{i=1}^{p} t_i$, then we first let $Sh'[t_i] = Sh[t_i]$ uni $\Sigma b(H_R[A,t_i])$
and $d = \min_i(\#(Sh'[t_i]))$. Then we let $Sh[A] = M_u(d)_{i=1}^{p} Sh'[t_i]$.
Example 2:
Let $A = (a+b)(c+def) + ghi = (t_1)(t_2) + t_3 = t_4 + t_3$. Then $Sh[t_2]$ = {1}
and $Sh[(t_1)(t_2)]$ = {1} uni $\Sigma b(6)$ = {1,2,4}. Also $Sh[t_3]$ = {1}. Hence $Sh'[t_4]$ =
{1,2,4} and $Sh'[t_3]$ = {1} uni $\Sigma b(12)$ = {1,4,8}. Now $d = \min(\#(Sh'[t_4]),
\#(Sh'[t_3])) = 7$.
Hence $Sh[A] = M_u(7)_{i=3,4} Sh'[t_i]$ = {1,2,4}.
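Case (8.1) of Definition 8 can be evaluated directly from the number of variables. The sketch below assumes, consistently with the example Sh[abcde] = $\Sigma b(3)$, that the effective length of a product of p single variables is the smallest power of 2 not less than p; the helper names are illustrative.

```python
def ceil_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def sigma_b(m):
    """Powers of 2 in the binary expansion of m (Definition 6)."""
    return [1 << i for i in range(m.bit_length()) if (m >> i) & 1]

def holes_of_product(p):
    """Sh[A] for A = a_1 a_2 ... a_p, i.e. Sigma-b(e[A] - p) as in (8.1)."""
    return sigma_b(ceil_pow2(p) - p)

print(holes_of_product(5))   # abcde
print(holes_of_product(3))   # ghi, as used in Example 2
```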
In (8.3) above note that for every i
(1) if u in Sh[A], then there is $u_i$ in $Sh'[t_i]$ such that $u \le u_i$,
(2) if u in Sh[A], then $u \le \#(Sh'[t_i])$. Also there is at least one
k such that le Sh[A] = le $Sh'[t_k]$.
Given an arithmetic expression A, and $t' = \sum_{i=1}^{q} t_i'$, if $e[t'] \le \text{le } Sh[A]$,
then t' may be distributed over A without increasing tree height. Informally
we say that A has a hole which can accommodate t'.
Now we show that the above assertion is indeed valid. First we observe
that if $t' = \pi_{i=1}^{q}(t_i')$, then each $t_i'$ may be distributed independently over A.
Thus in the following we can assume that $t' = \sum_{i=1}^{q} t_i'$. Note that if t can be
distributed over A without increasing tree height, then any t' ($e[t'] \le e[t]$)
can be distributed over A as well.
Lemma 5:
Let $A = \pi_{i=1}^{p} a_i$. Then $h[A] = h[(A) \times (t')]$ if $e[t'] \le \text{le } Sh[A]$.
Proof: Obvious by Theorem 1.
Theorem 2:
(1) Let $A = \pi_{i=1}^{p}(t_i)$. Then $h[A] = h[(t'A)^d]$ # if $e[t'] \le \text{le } Sh[A]$.
(2) Let $A = \sum_{i=1}^{p} t_i$. Then $h[A] = h[(t'A)^d]$ if $e[t'] \le \text{le } Sh[A]$.
Proof:
We use mathematical induction to prove the above theorem. Lemma 5
serves as a basis.
Let $A = \pi_{i=1}^{p}(t_i)$ or $A = \sum_{i=1}^{p} t_i$. Assume that the theorem holds for $t_i$.
(1) $A = \pi_{i=1}^{p}(t_i)$.
We show that if $e[t'] \le \text{le } Sh[A]$ then A can be multiplied by t'
without increasing tree height. There are two cases:
(i) There is k such that $e[t'] \le \text{le } Sh[t_k]$. Then we distribute
t' over $t_k$ without increasing tree height.
(ii) For all k, $e[t'] > \text{le } Sh[t_k]$. In this case $e[t'] \le \text{le } \Sigma b(H_A[A])$.
Then there is a free node $\alpha$ in $F_A[A]$ such that $u = 2^{h[\alpha]-1}$
and $e[t'] \le u$ means that $T[t']$ can be attached to $\alpha$ without
increasing tree height.
Hence A can be multiplied by t' without increasing tree height.
(2) $A = \sum_{i=1}^{p} t_i$.
We show that if $e[t'] \le \text{le } Sh[A]$ then t' can be distributed over A
without increasing tree height.
# We write $(t't)^d$ for the expression obtained after distributing t' over
t such that the tree height is reduced, e.g. $(a, b+cd)^d = ab + acd$.
First note that for all i, if $\alpha \in F_R[A,t_i]$ then $2^{h[\alpha]-1} > \text{le } Sh[t_i]$.
Now assume that $e[t'] \le \text{le } Sh[A]$. For fixed k, that u = le Sh[A]
implies either
(i) $u \le \text{le } Sh[t_k]$
or (ii) there is $\alpha$ in $F_R[A, t_k]$ such that $u \le 2^{h[\alpha]-1}$ (or
equivalently $u \le \text{le } \Sigma b(H_R[A,t_k])$).
In the first case we have $h[t_k] = h[(t't_k)^d]$ by assumption and
$h[A] = h[\sum_{i \ne k} t_i + (t't_k)^d]$.
In the second case we attach $T[t']$ to $\alpha$ (Figure 3.8(a)) without
increasing the tree height $h[A]$, i.e.
$h[A] = h[\sum_{i \ne k} t_i + (t') \times (t_k)]$.
In general let $T^*$ be a subtree in $T[A]$ whose root is $\alpha$ (Figure
3.8(b)). Then
Figure 3.8. Attachment of $T[t']$ to a Free Node ((a) and (b))
there may be other term trees besides $T[t_k]$ in $T^*$, e.g. $T[t_j]$ and
$T[t_h]$ in Figure 3.8(b). Hence we get,
$h[A] = h[\sum_{i \ne k,j,h} t_i + (t') \times (t_k + t_j + t_h)]$
in this case. Note that $\alpha$ is also in $F_R[A, t_j]$ and $F_R[A, t_h]$.
Repeating this procedure for all k we can get an arithmetic expression
equivalent to $\sum_{i=1}^{p} (t') \times (t_i)$ or (t')(A) without increasing
tree height. This proves (2).
It is obvious that in both (1) and (2) if $e[t'] > \text{le } Sh[A]$, then t' cannot
be distributed (multiplied) over A without increasing tree height.
(Q.E.D.)
Lemma 6:
Let $A = \sum t_i$ and $e[t'] \le \text{le } Sh[A]$. Then after t' is distributed over A,
we have
$Sh[(t'A)^d] = (Sh[A] \text{ del } \{u\}) \text{ uni } \Sigma b(u - e[t']) \text{ uni } Sh[t']$
where u is the smallest element in Sh[A] bigger than e[t'].
Now let us summarize what we have so far. Let A and t' be arithmetic
expressions where $A = \sum t_i$ and $t' = \sum t_i'$. If $e[t'] \le \text{le } Sh[A]$, then t' can be
distributed over A without increasing tree height, i.e. $h[(t') \times (A)] >
h[(t'A)^d] = h[A]$. The set of holes in $(t'A)^d$ is given by the above lemma. Since it
is obvious that le Sh[A] < e[A], we have the following lemma.
Lemma 7:
Let $t' = \sum t_i'$. Then $h[(t')(A)] > h[(t'A)^d]$ implies that $h[t'] < h[A]$.
In general let $t' = \pi_{i=1}^{q}(t_i')$. For convenience let us assume that $e[t_1']
\le e[t_2'] \le e[t_3'] \le ... \le e[t_q']$. Then if the following procedure can be
accomplished successfully, we say that A has holes to accommodate all $t_i'$ (i = 1,
2, ..., q). #
Procedure:
(1) Let $V_i = (t_i'(t_{i-1}', ..., (t_1', A)^d)^d ...)^d$ and $V_0 = A$. Let i = 1.
(2) Check if $e[t_i'] \le \text{le } Sh[V_{i-1}]$.
(3) If so, then distribute $t_i'$ over $V_{i-1}$, and we have $V_i$.
(4) Evaluate $Sh[V_i]$.
(5) If i = q then stop, else let i = i + 1 and go to (2).
The procedure may be accomplished successfully if $m[t'] \le \#(Sh[A])$.
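The procedure above, combined with the update rule of Lemma 6, can be sketched as follows. The sketch works on hole sizes only and makes two assumptions that should be read as such: every $t_i'$ is hole-free (so the Sh[t'] contribution of Lemma 6 is empty), and the hole u chosen for a term is the smallest hole whose size is at least e[t'] (the text says "bigger than"; equality is admitted here because a hole of size u accommodates a term with e[t'] = u). The name `fill_holes` is hypothetical.

```python
def sigma_b(m):
    """Powers of 2 in the binary expansion of m (Definition 6)."""
    return [1 << i for i in range(m.bit_length()) if (m >> i) & 1]

def fill_holes(sh, lengths):
    """Try to place terms with the given effective lengths into the holes sh.

    Returns the remaining holes after all terms are placed, or None when some
    term finds no hole large enough (the procedure fails).
    """
    holes = sorted(sh)
    for e in sorted(lengths):                   # increasing e[t_i'], as assumed in the text
        fits = [u for u in holes if u >= e]     # non-empty iff e <= le Sh[V_{i-1}]
        if not fits:
            return None
        u = min(fits)
        holes.remove(u)                         # Sh del {u}
        holes = sorted(holes + sigma_b(u - e))  # uni Sigma-b(u - e)
    return holes

print(fill_holes([1, 2, 4], [1, 2]))
print(fill_holes([4], [1, 1]))
```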
3.3.3 Space
Now the second possible distribution case, i.e. space, is studied.
The idea of the second distribution case is that given an arithmetic
expression D of the following form
$D = ... \theta (t') \times (A) +_1 t_s \theta ...$
we distribute t' over A so that $t_s$ can be hidden under the combined tree as shown
in Figure 3.9.
# We assume that each $t_i'$ does not have any holes, i.e. $Sh[t_i'] = \emptyset$
for all i. Hence $Sh[(t'A)^d] = (Sh[A] \text{ del } \{u\}) \text{ uni } \Sigma b(u - e[t'])$ for
example.
Figure 3.9. An Example of Space (1) ((a) and (b))
In other words, in case of D, the addition $+_1$ cannot be done before (t')(A)
(we write t'(A)) is computed while in case of $D^d$ it may be done earlier.
Note that if $h[t'(A)] > h[(t'A)^d]$ then A has enough holes to
accommodate t' and the distribution of t' over A is done anyway. Henceforth
throughout the rest of this section we assume that A does not have any holes
to accommodate t'. Thus we now deal with the case when $h[t'(A)] \le h[(t'A)^d]$.
However, if $h[t'(A)] < h[(t'A)^d]$ holds, then clearly there is no way to
get $h[D^d] < h[D]$ by Lemma 1. Thus we have:
Lemma 8:
(1) $h[t'(A)] = h[(t'A)^d]$ must hold to get $h[D^d] < h[D]$.
(2) Let $A = \sum_{i=1}^{n} t_i$. Then $h[t'(A)] \ge h[(t'A)^d]$ if and only if $e[t'] <
e[A]$ (i.e. $h[t'] < h[A]$). This implies that $h[t'(A)] = h[A] + 1$.
Proof:
By inspection.
Intuitively the space Sp in an arithmetic expression $A = \sum_{i=1}^{p} t_i$ with
respect to t' is defined as
$Sp[A,t'] = e[(A) \times (t')] - \sum_{i=1}^{p} e[(t') \times (t_i)]$.
For example let A = ab + c and t' = d. Then Sp[A, t'] = 2.
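The intuitive definition can be checked numerically for the simplest case. The sketch below assumes that A is a sum of products of single variables, that t' is a product of k single variables, and that effective lengths follow the rounding rule used elsewhere in this chapter (a product of single variables and parenthesized terms has effective length equal to the total of its parts rounded up to a power of 2). The names `ceil_pow2` and `intuitive_space` are illustrative.

```python
def ceil_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def intuitive_space(term_sizes, k):
    """Sp[A, t'] = e[(A) x (t')] - sum_i e[(t') x (t_i)].

    term_sizes[i] is the number of single variables in the product t_i of A;
    k is the number of single variables in the product t'.
    """
    e_A = ceil_pow2(sum(ceil_pow2(n) for n in term_sizes))
    undistributed = ceil_pow2(k + e_A)                       # e[(A) x (t')]
    distributed = sum(ceil_pow2(k + n) for n in term_sizes)  # sum of e[(t') x (t_i)]
    return undistributed - distributed

print(intuitive_space([2, 1], 1))   # A = ab + c, t' = d
```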
Figure 3.10. An Example of Space (2) ((a) and (b))
Let D = t'(A) + t and $D^d = (t'A)^d + t$. Note that a free node in $T[t'(A)]$ cannot be used
to attach $T[t]$ while a free node in $T[(t'A)^d]$ may be used to attach $T[t]$.
Now the formal definition of space follows.
Definition 9:
Given arithmetic expressions $A = \sum_{i=1}^{p} t_i$ and $t' = \sum_{i=1}^{q} t_i'$, the space function
Sp of A w.r.t. t' is defined as follows. First we build trees $T[A]$ and $T[t']$, and
in $T[A]$ let F be the set of free nodes f higher than $T[t']$ ($h[f] > h[t']$). Also we
define a set I as follows. We let $i \in I$ if $h[t_i] \ge h[t']$ and $e[t'] \le \text{le } Sh[t_i]$.
Now the space function is obtained as:
$Sp[A,t'] = \sum_{f \in F} 2^{h[f]} + \sum_{i \in I} 2^{h[t_i]}$
To show how Definition 9 works, we first describe how to build $T[(t'A)^d]$
by attaching $T[t']$ to $T[A]$ properly (i.e. by distributing t' over A properly).
Since $h[(t'A)^d] = h[A] + 1$ (Lemma 8(1)), we first study how to build $T'[A]$, which is
obtained by replacing $T[t_i]$ in $T[A]$ by $T'[t_i]$ whose height is $h[t_i] + 1$. Then the
height of $T'[A]$ is $h[A] + 1$. Building $T[(t'A)^d]$ from $T[A]$ may be explained in an
analogous way.
As stated before the only case to be considered is when $h[t'(A)] =
h[(t'A)^d] = h[A] + 1$ holds (Lemma 8(1)). Suppose that all $T[t_i]$ in $T[A]$ are
replaced by $T'[t_i]$ whose height is $h[t_i] + 1$. Then the new tree $T'[A]$, whose height
is $h[A] + 1$, is obtained. Note that a free node $\alpha$ in $T[A]$ now becomes a free node
$\alpha'$ in $T'[A]$ with height $h[\alpha] + 1$. In $T'[A]$ if $T'[t_i]$ is replaced by $T[t_i]$ again,
then a new free node $\beta'$, whose height is $h[t_i] + 1$, is created. Having these
facts in mind, now we describe the way t' is distributed over A to create space.
The tree $T[(t'A)^d]$ is built from $T[A]$ as follows (note $h[(t'A)^d] =
h[A] + 1$). Depending on the height of $T[t_j]$, we have two cases.
(1) $h[t_j] \ge h[t']$.
If $T[t_j]$ has a hole to put $T[t']$, then we fill it by $T[t']$. In
this case $h[t'(t_j)] = h[t_j]$ and a new free node $\beta'$ whose height is
$h[t_j] + 1$ is created in $T[(t'A)^d]$. Otherwise $h[t'(t_j)] = h[t_j] + 1$.
(2) $h[t_j] < h[t']$.
Find the tree $T[\sum t_s]$ whose height is $h[t']$ and which includes
$T[t_j]$. We multiply $\sum t_s$ by t' and get $h[t'(\sum t_s)] = h[\sum t_s] + 1$.
Note that $t_j$ and $\sum t_s$ are treated as terms of A. In the resultant tree $T[(t'A)^d]$,
those free nodes in $T[A]$ whose heights are less than or equal to $h[t']$ (i.e.
free nodes in $T[\sum t_s]$) do not appear.
Figure 3.11. Distribution of t' over A ((a), (b) and (c))
A free node $\alpha$ in $T[A]$ ($h[\alpha] > h[t']$) appears in $T[(t'A)^d]$ as a free
node $\alpha'$ where $h[\alpha'] = h[\alpha] + 1$. Thus $T[(t'A)^d]$ has those $\alpha'$ and $\beta'$ described in (1)
as free nodes.
If $\delta'$ is a free node in $T[(t'A)^d]$, then a tree $T[t]$ ($h[t] \le h[\delta'] - 1$)
can be attached to $T[(t'A)^d]$ without increasing tree height (i.e. $h[(t'A)^d] =
h[(t'A)^d + t]$). Since $h[\delta']$ is either $h[t_j] + 1$ or $h[\alpha] + 1$, we have $e[t] =
2^{h[t_j]}$ or $2^{h[\alpha]}$.
In general if $a[t] \le Sp[A,t']$ then we have $h[t'(A)] = h[(t'A)^d] =
h[(t'A)^d + t]$.
Definition 9 may be generalized to include the case where $t' = \pi_{i=1}^{q}(t_i')$.
In this case we first obtain $h[t']$ and $e[t_i']$ (i = 1, 2, ..., q), and build $T[A]$.
In $T[A]$ let F be the set of free nodes f which are higher than $h[t']$ ($h[f] >
h[t']$). Also let I be a set such that i is in I if $h[t_i] \ge h[t']$ and $T[t_i]$ has
enough holes to put all $T[t_i']$ (i.e. $a[t'] \le \#(Sh[t_i])$). Then $Sp[A,t'] =
\sum_{f \in F} 2^{h[f]} + \sum_{i \in I} 2^{h[t_i]}$.
Informally we say that space to put t is created by distributing t'
over A if $h[t'(A) + t] > h[(t'A)^d + t]$. Then the procedure is called space
filling. Now we study how much we can reduce tree height by space filling.
Let $B = \sum_i (t_i')(A_i) + \sum_j t_j$ and assume that by distributing $t_i'$ over $A_i$ space can
be created for all i.
Lemma 9:
Let $B = \sum_i (t_i')(A_i) + \sum_j t_j$ and $B^d = \sum_i (t_i'A_i)^d + \sum_j t_j$. Then $h[B^d] = h[B] - 1$
if $\sum_i Sp[A_i,t_i'] \ge a[B] - e[B]/2$, otherwise $h[B^d] = h[B]$.
Proof:
First note that to lower the height of a tree for B, some terms must
be removed from B so that the effective length of the resultant expression becomes
e[B]/2. Hence $\sum_i Sp[A_i,t_i']$ must be greater than or equal to $a[B] - e[B]/2$.
Figure 3.12. Tree Height Reduction by Hole Creation
Next we show that $h[B^d]$ cannot be smaller than $h[B] - 1$.
As before we assume that $h[t_i'(A_i)] = h[(t_i'A_i)^d]$ for all i
(see Lemma 8(1)). It is equivalent to
the assumption that $Sp[A_i,t_i'] \le e[t_i'(A_i)]/2 = e[(t_i'A_i)^d]/2$. First we get B'
from $B^d$ as follows. We replace every $(t_i'A_i)^d$ in $B^d$ by a product $P_i$ of
$e[(t_i'A_i)^d]/2$ single variables. This amounts to assuming that $Sp[A_i,t_i'] =
e[(t_i'A_i)^d]/2$. Further we get B'' from B' by replacing every $t_j$ in B' by a
product $Q_j$ of $e[t_j]/2$ single variables. Then it is clear that $h[B] \ge h[B^d] \ge
h[B'] \ge h[B''] = h[B] - 1$. If $\sum_i Sp[A_i,t_i'] \ge a[B] - e[B]/2$, then $h[B] > h[B^d]$.
Hence $h[B^d] = h[B] - 1$.
(Q.E.D.)
What the lemma implies is that by space filling we can lower tree
height at most by one. In other words to see if the space creation by
distribution is effective it is enough to see if total tree height can be lowered by
one, and we know if tree height is once lowered by one it is not necessary
(i.e. useless) to try to lower tree height further by creating more space by
further distribution.
Unlike a set of holes (Theorem 2), the space function for $A = \sum t_i$
does not carry any information about space in the components $t_i$ of A. For example let
$B = a(b(c+defg) + \pi 16) + \pi 16 + \pi 4 = a(A) + \pi 16 + \pi 4$
where $\pi i$ is a product of i single variables. Then h[B] = 7. Now $Sh[A] = \emptyset$ and
space creation is tried. Note that Sp[A,a] = 16 < a[B] - e[B]/2 = 20, but
Sp[c+defg, ab] = 4. That is, a as well as b should be distributed over c+defg.
Thus we get $B' = abc + abdefg + a(\pi 16) + \pi 16 + \pi 4$ where h[B'] = 6.
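The numbers in this example can be reproduced with a small effective-length calculator. This is a sketch under stated assumptions, not the thesis's tree-building algorithm: an expression is a nested `(op, children)` tuple with single variables as strings, every operator node has effective length equal to the sum of its children's effective lengths rounded up to a power of 2, nested nodes with the same operator are written flattened, and a[B] is taken to be the sum of the effective lengths of the terms of B. All function names are illustrative.

```python
def ceil_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def e(expr):
    """Effective length of an expression under the rounding model above."""
    if isinstance(expr, str):            # a single variable
        return 1
    _op, children = expr
    return ceil_pow2(sum(e(c) for c in children))

def h(expr):
    """Tree height: log2 of the effective length."""
    return e(expr).bit_length() - 1

def required_space(expr):
    """Ex = a[B] - e[B]/2 for a top-level sum B."""
    _op, terms = expr
    return sum(e(t) for t in terms) - e(expr) // 2

def prod(n):
    """pi-n: a product of n single variables."""
    return ("*", ["x"] * n)

A = ("+", [("*", ["b", ("+", ["c", ("*", list("defg"))])]), prod(16)])
B = ("+", [("*", ["a", A]), prod(16), prod(4)])
B_dist = ("+", [("*", list("abc")), ("*", list("abdefg")),
                ("*", ["a"] + ["x"] * 16), prod(16), prod(4)])

print(h(B), required_space(B), h(B_dist))
```

With Sp[A,a] = 16 and Sp[c+defg, ab] = 4 as quoted in the text, the total space 20 meets the required 20, which is why the height drops by one.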
Now this situation is studied in detail. In general we have a form:
$F = ... + t'(... + t''(C) + ... + D + ...) + ... + E + ...$
where A denotes the expression enclosed in the parentheses following t', i.e.
$A = ... + t''(C) + ... + D + ...$.
Then if Sp[A,t'] is not enough to reduce the tree height h[F], we have to further
check components of A, e.g. Sp[C,t't'']. As we will show later (see Substep 2 of
Step 3 of the Algorithm given in Section 3.4.1) an arithmetic expression is examined
from the innermost pairs of parentheses to the outermost pair. In the above
diagram, the distribution of t'' over C is first checked to see if it reduces the
tree height h[A] and then the distribution of t' over A is examined. If the
distribution of t'' over C creates space and reduces the tree height h[A], then
there is no problem. However if that distribution does not lower the tree height
h[A], then t'' will not be distributed over C (see Algorithm). As we showed in the
above example when we check the possibility of reducing the tree height h[F] by
creating space by the distribution of t' over A, it may be necessary to check
Sp[C, t't''] as well.
Let $A' = ... + (t''C)^d + ... + D + ...$, $G' = (t'A)^d$ and $G'' = (t'A')^d$.
Now we show that if Sp[A, t'] = 0, then it is not necessary to examine components
of A, e.g. Sp[C,t't''] further. This helps to reduce the number of checks
required.
Lemma 10:
(1) If Sp[A, t'] = 0, then h[...+G''+...+E+...] < h[...+G'+...+E+...]
never holds.
(2) If Sp[A,t'] = 0, then h[G''] < h[G'] never holds.
Proof:
(1) We prove this by showing that if Sp[A,t'] = 0 then Sp[A',t'] = 0.
Note that Sp[A, t'] = 0 implies that either
(i) h[t'] > h[t''(C)]
or (ii) h[t''] = h[C] (Definition 9).
By Lemma 9, in either case we get $h[(t'(t''C)^d)^d] \ge h[t't''(C)]$.
Note that the only difference between G' and G'' is that a term
t't''(C) in G' is being replaced by $(t'(t''C)^d)^d$ in G''. Since
$h[(t'(t''C)^d)^d] \ge h[t't''(C)]$, G'' cannot have more free nodes
than G'. Hence Sp[A',t'] = 0.
(2) This may be proved in a similar way and the details are omitted.
(Q.E.D.)
Thus in F = ... + t'(...+t''(C)+...+D+...) + ... + E + ... the
distribution of t'' over C should be done if it reduces the tree height of T[A],
otherwise it should be left untouched. In the later step when the distribution
of t' over A is examined, the possibility of distributing t'' over C as well
shall be checked if and only if $Sp[A, t'] \ne 0$. Otherwise we shall leave t'(A)
as it is and we need not check inside of A again.
3.4 Algorithm
Having these results, an algorithm to reduce tree height of an
arithmetic expression is now described. Given an arithmetic expression, the
algorithm works from the innermost pairs of parentheses to the outermost pair.
We assume that cases (P1) and (P2) (see Lemma 2) are already taken care of. At
each level of a parenthesis pair, first upon finding a form t'(A), we try to fill
a hole of A with t' (Theorem 2). After all holes are filled, a form
t'(A) + t'' is checked, i.e. if the distribution of t' over A creates enough space
to accommodate t'', then the distribution is made. Note that it is not necessary
to fully distribute $(A)(B) = (\sum t_i)(\sum t_i')$ (see Lemma 3).
It is not necessarily true that reducing tree height of a term t
of an arithmetic expression A reduces tree height of A. However, we show
that reduction of tree height should be made in any case to help later steps of
the distribution algorithm.
Let $A = \sum_{i=1}^{n-1} t_i + t_n$ (or $\pi_{i=1}^{n-1}(t_i) \times (t_n)$). Assume that the distribution
somehow reduced the height of $T[t_n]$, i.e. the distribution algorithm
reduced A to $A' = \sum_{i=1}^{n-1} t_i + t_n'$ (or $\pi_{i=1}^{n-1}(t_i) \times (t_n')$) where $h[t_n] > h[t_n']$. Also
assume that h[A] = h[A']. Yet it is obvious that $\#(Sh[A]) \le \#(Sh[A'])$ (i.e. le
Sh[A] $\le$ le Sh[A']) and also for any t'', $Sp[A,t''] \le Sp[A',t'']$. That is, even if
distribution only reduces the tree height $h[t_n]$ and does not reduce the tree
height h[A], that distribution does not cause any bad effect on the later steps
when A appears as a term of a bigger expression with respect to holes and spaces.
The arithmetic expression thus obtained may give the lowest tree
height, i.e. the fewest computation time steps.
3.4.1 Distribution Algorithm
In the steps below we refer to the following notation. Parenthesis pairs are
numbered by nesting level; $A_s^{k-1}$ denotes the s-th expression enclosed in a (k-1)th
level parenthesis pair, and $t_{s,j}^{k}$ denotes its j-th component enclosed in a kth level
parenthesis pair:
$A_s^{k-1} = ... + t_{s,j-1}^{k} \times \pi(t_{s,h}^{k}) + ...$
where $t_{s,j-1}^{k} = \pi_{l=1}^{p} a_l$ or empty.
Step 1:
Go to an arithmetic expression enclosed in an innermost parenthesis
pair which is not checked. Let this level be k-1. In the above diagram we are
now working on, say, $A_s^{k-1}$.
Step 2:
Obtain a set of holes for all $t_{s,j}^{k}$ which are enclosed in the kth
parenthesis pair and are components of $A_s^{k-1}$, as well as their heights h and
effective lengths e.
Step 3:
In this step, the (k-1)th parenthesis pair level $A_s^{k-1}$ is examined.
Substep 1: Hole filling (see Theorem 2)
Let
$t^k = t_{s,j-1}^{k} \times \pi_{h=j}^{j+n}(t_{s,h}^{k})$
where $t_{s,j-1}^{k} = \pi_{l=1}^{p} a_l$ or empty. Also without loss of generality we assume
that $e[t_{s,l}^{k}] \le e[t_{s,l+1}^{k}]$ for l = j, j+1, ..., j+n-1.
(1) Find an occurrence of a form $\pi(t_j)$ or $\pi a_i \times \pi(t_j)$ in $A_s^{k-1}$. If there
is no such occurrence, then go to Substep 2. If an occurrence of a
form $\pi(t_j)$ is found, then skip (2) and (3). Otherwise go to (2).
(2) Suppose we find $t^k$ in $A_s^{k-1}$ as an instance of $\pi a_i \times \pi(t_j)$.
Fill holes in $Sh[t_{s,h}^{k}]$ (h = j, j+1, ..., j+n) using the $a_i$'s in $t_{s,j-1}^{k}$. If there
are many holes to be filled in, fill the smallest ones first, i.e.
in order of increasing size. Reevaluate Sh by Lemma 6 for those
$t_{s,h}^{k}$ whose holes are filled.
(3) If the $t_{s,h}^{k}$ (h = j, j+1, ..., j+n) do not have enough holes to accommodate all
$a_i$'s then go back to (1) to find another occurrence of the $\pi(t_j)$ or
$\pi a_i \times \pi(t_j)$ form. Otherwise we work on $\pi_{h=j}^{j+n}(t_{s,h}'^{k})$ which we get from
$\pi_{l=1}^{p} a_l \times \pi_{h=j}^{j+n}(t_{s,h}^{k})$ after (2).
(4) We start from h = j. Check if $t_{s,h}'^{k}$ can fill in one of the holes in
$Sh[t_{s,l}'^{k}]$ (l = h+1, ..., j+n). If there are many holes which can
accommodate $t_{s,h}'^{k}$, fill them in order of increasing size. Continue
the procedure until all $t_{s,h}'^{k}$ (h = j+1, ..., j+n-1) are put in some holes
or there is no hole to accommodate $t_{s,h}'^{k}$. Go back to (1) to find
another occurrence of the $\pi(t_j)$ or $\pi a_i \times \pi(t_j)$ form.
Substep 2: Space filling
After Substep 1, we again check $A_s^{k-1}$, where all holes in $t_{s,j}^{k}$ have
been filled in as much as possible by Substep 1.
(1) Let $Ex = a[A_s^{k-1}] - e[A_s^{k-1}]/2$ (see Lemma 9).
(2) Let
$A_s^{k-1} = ... + t_{s,j-1}^{k} \times \pi_{h=j}^{j+n-1}(t_{s,h}^{k}) \times (t_{s,j+n}^{k}) + ...$
where $t_{s,j-1}^{k} = \pi_{l=1}^{p} a_l$ or empty. We also assume that $e[t_{s,j+n}^{k}]$ is
the largest among all $e[t_{s,h}^{k}]$ (h = j, ..., j+n). Let $t' = t_{s,j-1}^{k} \times
\pi_{h=j}^{j+n-1}(t_{s,h}^{k})$ and $t = t_{s,j+n}^{k}$. If $h[t'] < h[t_{s,j+n}^{k}]$, then evaluate $Sp[t_{s,j+n}^{k}, t']$.
Otherwise leave it as it is.
(3) Repeat (2) for every occurrence of a form $\pi(t_j)$ (or $\pi a_i \times \pi(t_j)$)
in $A_s^{k-1}$. Assume that there are m such occurrences. Arrange all
Sp[t,t'] in order of decreasing size. For convenience we write
$Sp_1, Sp_2, ..., Sp_m$ ($Sp_i \ge Sp_{i+1}$).
(4) If $\sum_{i=1}^{m} Sp_i \ge Ex$ then let d be such that $\sum_{i=1}^{d-1} Sp_i < Ex$ and $\sum_{i=1}^{d} Sp_i \ge Ex$.
(5) Let $t_{s,j-1}^{k} \times \pi_{h=j}^{j+n-1}(t_{s,h}^{k}) \times (t_{s,j+n}^{k})$, with t' and t as in (2), be a form which corresponds to
$Sp_i$ ($i \le d$). Then distribute t' over t, and create space $Sp_i$.
Repeat the same procedure for all i = 1, 2, ..., d.
(6) In the case where enough space to accommodate Ex is not found (i.e. $\sum_{i=1}^{m} Sp_i < Ex$),
a check is made against the component terms of $t_{s,j+n}^{k}$
(see Lemma 10).
For example let
$t_{s,j-1}^{k} = a_1 a_2 ... a_p$
and $t_{s,j}^{k} = b_1 b_2 ... b_q (t_{sf}^{k+1}) + \sum_{i \ne f} t_{si}^{k+1}$.
Then
$A_s^{k-1} = ... + t_{s,j-1}^{k} \times (b_1 b_2 ... b_q (t_{sf}^{k+1}) + \sum_{i \ne f} t_{si}^{k+1}) + ...$
Then the distribution is done if the sum of $Sp_i$ and $Sp[t_{sf}^{k+1},
t_{s,j-1}^{k} \times b_1 ... b_q]$ is greater than or equal to Ex. Here the
distribution of $a_1 ... a_p b_1 ... b_q$ over $t_{sf}^{k+1}$ is to be made as well
as the distribution of $a_1 ... a_p$ over $\sum_{i \ne f} t_{si}^{k+1}$. This checking is to
be made until enough space to accommodate Ex is found or else
until the innermost level of parenthesis pair is reached.
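Items (3) to (5) of Substep 2 select the smallest prefix of the space values, taken in decreasing order, whose sum reaches Ex. A minimal sketch of that selection, assuming the space values have already been computed and Ex is positive; the name `choose_distributions` is hypothetical.

```python
def choose_distributions(spaces, ex):
    """Return the space values Sp_1..Sp_d (largest first) whose running sum
    first reaches ex, or None when even the total falls short of ex."""
    ordered = sorted(spaces, reverse=True)   # Sp_1 >= Sp_2 >= ... >= Sp_m
    running = 0
    for d, sp in enumerate(ordered, start=1):
        running += sp
        if running >= ex:                    # first d with sum_{i<=d} Sp_i >= Ex
            return ordered[:d]
    return None                              # case (6): look inside component terms

print(choose_distributions([4, 16, 2], 20))
```

Returning None corresponds to item (6), where the check continues with the component terms.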
Step 4:
Mark $A_s^{k-1}$ as checked.
Step 5:
If all levels are checked, then stop, otherwise go back to Step 1.
For example let us consider the following
$A = ... + a_1 a_2 a_3 (t_1)(t_2)(t_3) + ...$
where t denotes the term $a_1 a_2 a_3 (t_1)(t_2)(t_3)$.
Further assume that
$Sh[t_1] = \{4\}$, $e[t_1] = 16$,
$Sh[t_2] = \{16, 2\}$, $e[t_2] = 64$
and
$Sh[t_3] = \emptyset$, $e[t_3] = 64$.
Then $a_1 a_2 a_3$ can be distributed over $t_1$, and in turn this whole thing can be
distributed over $t_2$:
$... + ((a_1 a_2 a_3 t_1)^d t_2)^d (t_3) + ...$
and we get h[t'] = 7 whereas h[t] = 8.
3.4.2 Implementation
A few words about implementation of the algorithm described above are
given as well as the total number of checks required to process an arithmetic
expression. Suppose we are given the arithmetic expression
$A = ... + (\pi 2 + \pi 1)(\pi 10 + (\pi 4 + \pi 1)(\pi 17 + \pi 2) + \pi 3) + ...$
$= ... + (d_1 + d_2)(e_{11} + (e_{21} + e_{22})(e_{31} + e_{32}) + e_{41}) + ...$
$= ... + (D)(E_1 + (E_2)(E_3) + E_4) + ...$
$= ... + (D)(E) + ...$
where $\pi i$ represents a product of i single variables. Then we build nested stacks
as shown in Figure 3.13(a). Note that a new stack is created for each form $\pi(t_i)$
or $\sum t_i$.
Figure 3.13. Stacks for an Arithmetic Expression: (a) nested stacks with their level numbers, (b) stacks annotated with h, Sh, $F_A$ and $F_R$, (c) stacks revised after distribution
Each stack is assigned a level number (cf. Definition 3) where the first stack
which corresponds to A receives the lowest level number (Figure 3.13(a)).
We start working from a stack with the largest level number, say m. For
each stack t, where $t = \sum t_i$ or $t = \pi(t_i)$, h[t] is evaluated. Also if a stack
represents a form $t = \sum t_i$, then Sh[t], $F_A[t]$ and $F_R[t, t_i]$ are evaluated. If a
stack represents a form $t = \pi(t_i)$, then Sh[t] and $F_A[t]$ are evaluated. These
values are obtained by Definitions 1, 4 and 8. Note that this information is
sufficient to evaluate Sp. Figure 3.13(b) gives an illustration.
Upon finding a form $\pi(t_i)$ (or $\pi a_i \times \pi(t_i)$) (e.g. the stack $N_2$), we apply
the distribution algorithm and decide if distribution is to be made. If a stack
represents a form $t = \pi(t_i)$, then Substep 1 of the distribution algorithm, i.e.
hole filling, is tried. Otherwise a stack represents a form $t = \sum t_i$ and Substep
2 of the distribution algorithm, i.e. space filling, is applied. In our example $E_2$
is distributed over $E_3$ ($e[E_2] \le \text{le } Sh[E_3]$). Then stacks are revised as shown in
Figure 3.13(c). Note that the stack $N_2$ is replaced by two new stacks $N_2'$ and $N_2''$,
and the stack $E_2$ disappears.
If all stacks with the level number k have been checked, then stacks with
the level number k-1 will be checked. In our example, stacks E (or E' since it has
been revised) and D are now checked.
The total number of checks required to process a whole arithmetic
expression thus depends on the number of parenthesis occurrences in it. Assume
that there are p parenthesis pairs in an arithmetic expression A. For each pair,
space creation should be examined. Hence in total p space creation checks are
required. Now for each $\pi(t_i)$ form hole filling should be tried. The number of
occurrences of a form $\pi(t_i)$ in A is obviously less than p. Hence the total number
of checks required is less than 2p (i.e. the order of p).
3.5 Discussion
3.5.1 The Height of a Tree
Given a tree for an arithmetic expression, the distribution
algorithm tries to lower tree height by distribution if possible. However,
in general it may not give the minimum tree height. For example let A = ac + ad +
bc + bd whose tree height is 3, and since no further distribution is possible,
the distribution algorithm yields the same value. There is, however, an equivalent
expression A' = (a+b)(c+d), whose tree height is lower than 3, i.e. 2. That is,
even though factorization sometimes lowers tree height, the distribution algorithm
does not take care of it.
The question we ask now is how much the distribution algorithm lowers
tree height. Before giving an answer to this question let us study tree height
in more detail. Given an arithmetic expression, Theorem 1 gives the exact height
of a tree obtained by Bovet and Baer's algorithm. It is also desirable if we can
get an approximate tree height without actually building a tree for an arithmetic
expression. Since the number of single variable occurrences (the number one less
than this gives the number of operators in an arithmetic expression) and the
depth of parenthesis nesting may well represent the complexity of arithmetic
expressions, let us try to approximate tree height in terms of them.
Let A be an arithmetic expression with n single variable occurrences
and depth d of parenthesis nesting. Now build a tree for A by Bovet and Baer's
algorithm. Then it can be proved that:
Lemma 11:
$\log_2 \lceil n \rceil_2 \le h[A] \le n - 1$.
Moreover we can prove the following theorem.
Theorem 3:
$h[A] \le 1 + 2d + \log \lceil n \rceil_2$.
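The two bounds can be evaluated from n and d alone. The sketch below assumes that $\lceil n \rceil_2$ denotes the smallest power of 2 not less than n and that logarithms are to base 2; the name `height_bounds` is illustrative. For the expression B = a(b(c+defg) + π16) + π16 + π4 discussed in Section 3.3.3, counting variables gives n = 43 and d = 2, and the stated height 7 falls inside the computed range.

```python
def log2_ceil_pow2(n):
    """log2 of the smallest power of two that is >= n, for n >= 1."""
    return (n - 1).bit_length()

def height_bounds(n, d):
    """Lower bound from Lemma 11 and the tighter of the two upper bounds
    (n - 1 from Lemma 11, 1 + 2d + log from Theorem 3)."""
    lower = log2_ceil_pow2(n)
    upper = min(n - 1, 1 + 2 * d + log2_ceil_pow2(n))
    return lower, upper

print(height_bounds(43, 2))
```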
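The following lemma is helpful to prove Theorem 3.
Lemma 12:
(1) $2a > \lceil a \rceil_2$,
(2) $\lceil 2a \rceil_2 = 2 \lceil a \rceil_2$,
(3) $\log \lceil \sum_{i=1}^{p} \lceil m_i \rceil_2 \rceil_2 \le \log (2 \cdot \lceil \sum_{i=1}^{p} m_i \rceil_2)$.
Proof:
(1) and (2) are obvious and (3) can be proved by (1) and (2).
(Q.E.D.)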
Proof of Theorem 3:
Proof is given by induction on d. First let us prove the theorem for
d = 0. Then A has the following pattern:
$A = \sum_{i=1}^{p} \pi_{j=1}^{q_i} a_{ij}$.
Then by Theorem 1,
$h[A] = \log \lceil \sum_{i=1}^{p} \lceil q_i \rceil_2 \rceil_2$
$\le \log (2 \cdot \lceil \sum_{i=1}^{p} q_i \rceil_2)$ by Lemma 12(3)
$= 1 + \log \lceil n \rceil_2$.
Now assume that the theorem holds for $d \le f$.
Let $t_j^i$ be an arithmetic expression with depth $d_j^i$ ($\le f$) parenthesis
nesting and $n_j^i$ single variable occurrences. Then by assumption $h[t_j^i] \le 1 + 2d_j^i
+ \log \lceil n_j^i \rceil_2$. Now an arithmetic expression with f + 1 parenthesis nesting can be
built from $t_j^i$ as follows:
$A = \sum_{i=1}^{p} ( \pi_{j=1}^{q_i} a_j^i \times \pi_{k=1}^{m_i} (t_k^i) )$
where $a_j^i$ are single variables and at least one of $t_j^i$ has f nested parentheses.
Now each $t_j^i$ can be replaced with a product of $e[t_j^i]$ single variables without
affecting the total tree height. Instead of using the value $e[t_j^i]$, let us use
the value $2 \cdot 2^{2d_j^i} \cdot \lceil n_j^i \rceil_2$. (Note that $h[t_j^i] \le 1 + 2d_j^i + \log \lceil n_j^i \rceil_2 = \log (2 \cdot
2^{2d_j^i} \cdot \lceil n_j^i \rceil_2)$ and $e[t_j^i] = 2^{h[t_j^i]} \le 2 \cdot 2^{2d_j^i} \cdot \lceil n_j^i \rceil_2$.) Since $d_j^i \le f$, $e[t_j^i] \le
2 \cdot 2^{2f} \lceil n_j^i \rceil_2$. Now from Theorem 1, we have
$h[A] \le \log \lceil \sum_{i=1}^{p} \lceil q_i + \sum_{j=1}^{m_i} 2 \cdot 2^{2f} \lceil n_j^i \rceil_2 \rceil_2 \rceil_2$
$\le \log ( 2 \cdot \lceil \sum_{i=1}^{p} ( q_i + \sum_{j=1}^{m_i} 2 \cdot 2^{2f} \lceil n_j^i \rceil_2 ) \rceil_2 )$
$\le \log ( 2 \cdot 2 \cdot 2 \cdot 2^{2f} \lceil \sum_{i=1}^{p} ( q_i + \sum_{j=1}^{m_i} n_j^i ) \rceil_2 )$
$= 1 + 2(f+1) + \log \lceil n \rceil_2$.
Thus the theorem holds for d = f + 1 and this proves the theorem.
(Q.E.D.)
Now let us examine the original question, i.e. how effective is the
algorithm presented in this chapter. Let $A$ and $A^d$ be arithmetic expressions
where $A^d$ is the resultant expression obtained from $A$ after the application of
the distribution algorithm. Now build trees for $A$ and $A^d$ by Bovet and Baer's
algorithm. Then it should be clear that $h[A] \ge h[A^d]$. Moreover experience
suggests that:
Conjecture:
\[ h[A^d] \le 2 \log_2 \lceil n \rceil_2 \]
where n is the number of single variable occurrences in the original arithmetic
expression A.
Note that the distribution algorithm speeds up a Horner's rule
polynomial in a logarithmic way. Also note that the distribution algorithm
does no distributions in the case of
which takes 2 log⌈n⌉ steps as it is presented but which would take (n+1) log⌈n⌉
steps if fully distributed. Thus the algorithm can save a factor of n/2 steps
over a scheme which would distribute indiscriminately and in some cases achieves
a logarithmic speed up.
3.5.2 Introduction of Other Operators
3.5.2.1 Subtraction and Division
Subtractions can be introduced into an arithmetic expression without
causing any effect on the distribution algorithm. It may be necessary to
change operators to build a minimum height tree. For example let A = a + b −
c + d. This will be computed as A = a + b − (c − d).
Divisions may require special treatment since the distributive law does not hold
in certain cases, e.g. (a+b+c)/d = a/d + b/d + c/d but a/(b+c+d) ≠ a/b + a/c +
a/d. Hence in general minimization of the height of trees for a numerator and
a denominator is tried independently, and then distribution of a denominator
over a numerator is tried if appropriate. Also let A = t/t'. Then T[A] is
built from T[t] and T[t'] as follows:
[Figure: T[A] built from T[t] and T[t'], shown for the two cases of unequal heights]
If h[t] ≠ h[t'], then we get nodes to which only one tree is attached, e.g.
α and β if h[t] < h[t'], and α' and β' if h[t] > h[t']. Then α and β are
treated as free nodes in T[A] while α' and β' are not treated as free nodes in
T[A], because later when another expression, say t", is multiplied to A, t must
be multiplied by t", not t', i.e. t"(A) = t"(t/t') = (t"t)/t' ≠ t/(t't").
[Figure: tree for t"(t/t') built from T[t"], T[t] and T[t']]
3.5.2.2 Relational Operators
If an arithmetic expression A contains relational operators, e.g.
A = B RO C where RO ∈ {>, <, =, ≥, ≤, ...}, trees can be built for B and C
independently.
[Figure: T[A] with T[B] and T[C] as the subtrees of the relational operator]
If h[B] ≠ h[C], then operators may be moved from one side to the other to
balance two trees. For example let A = a + b + c > d. Then we modify this as
A' = a + b > d − c and get h[A'] = 2 while h[A] = 3.
4. COMPLETE PROGRAM HANDLING
Chapter 3 presented the algorithm which reduced tree height for a
single arithmetic expression by distributing multiplications over additions
properly. In this chapter we will discuss some ideas about how to handle
complete programs, i.e. given one program, how can it be executed in the
shortest time by building a tree as well as executing a statement in a for
statement simultaneously for all index values. Ideas include back substitution.
We do not have the solution to the problem, but this chapter presents some
details of the problem and some ways to attack them.
We conclude this chapter by comparing serial and parallel computations
in terms of a generated error. It is shown that in general we could expect less
error from parallel computation than serial computation. It is also shown that
distribution would not increase the size of an error significantly.
4.1 Back Substitution - A Block of Assignment Statements and an Iteration
While the distribution algorithm in the previous chapter discusses
tree height reduction for a single arithmetic expression, it can be used for
any jump free block of assignment statements. If we define those variables which
appear only on the right hand sides of assignment statements or in read statements
in a block as inputs to the block, and those variables which appear only on the
left hand sides of assignment statements or in write statements as outputs from
the block, then we can rewrite the block with one assignment statement per
output by substitution of assignment statements into one another. For example
a := b + c;
d := e x f;
g := a + c;
h := a + g x d
can be rewritten as
g := (b+c) + c;
h := (b+c) + ((b+c) + c) x (e x f).
After such a reduction only input variables appear on the right hand sides of
assignment statements. At this point, the distribution algorithm could be
applied to each remaining assignment statement and if sufficient computer
resources were available, all of the reduced assignment statements could be
executed at once. In the above example if each statement is computed in
parallel (by building a tree) independently then 5 steps are required, while
if the back substitution is done then the computation requires only 4 steps.
Suppose we have assignment statements $A_1, A_2, \ldots, A_n$. Also suppose that by
back substitution we can rewrite this block as $A$. We build minimum height
trees for $A_1, A_2, \ldots, A_n$ and $A$. Now we apply the distribution algorithm on those
trees. Let the resultant tree heights be $h[A_1], \ldots, h[A_n]$ and $h[A]$. Then
obviously $h[A_1] + \ldots + h[A_n] \ge h[A]$, i.e. back substitution never increases
the computation time (in the sense of tree height) (Figure 4.1).
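The step counts quoted for the four-statement example can be reproduced with a small sketch. Expressions are nested tuples `(op, left, right)`, and `height` measures the tree exactly as the expression is written, without the distribution algorithm or any rebalancing; the function names are hypothetical and chosen only for this illustration.

```python
def substitute(expr, env):
    """Replace every variable in expr that has a definition in env by that definition."""
    if isinstance(expr, str):
        return env.get(expr, expr)
    op, left, right = expr
    return (op, substitute(left, env), substitute(right, env))

def height(expr):
    """Height of the binary tree exactly as written (no restructuring)."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + max(height(left), height(right))

def back_substitute(block):
    """Rewrite each right-hand side in terms of block inputs only."""
    env = {}
    for target, rhs in block:
        env[target] = substitute(rhs, env)
    return env

block = [
    ("a", ("+", "b", "c")),
    ("d", ("*", "e", "f")),
    ("g", ("+", "a", "c")),
    ("h", ("+", "a", ("*", "g", "d"))),
]

statement_by_statement = sum(height(rhs) for _, rhs in block)  # trees run one after another
substituted = back_substitute(block)
single_tree = height(substituted["h"])                         # one tree for h
```

Under these assumptions the statement-by-statement total is 5 and the back substituted tree for h has height 4.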
[Figure 4.1. A Back Substituted Tree]
Our main interest here is the case where strict inequality in the
above relation holds, because $h[A_1] + \ldots + h[A_n] > h[A]$ is
equivalent to a speed up of the computation by back substitution. Note that
back substitution amounts to symbol manipulation (i.e. replacement) and should
not be confused with arithmetic simplification. For example from
a := x + y
b := a + y
we get
b := (x+y) + y
or b := x + y + y
but we do not get
b := x + 2y.
Now we shall study this kind of speed up.
We shall discuss a limited class of assignment statements, i.e. an
iteration. This may serve to give some insight to the problem of speed up by
back substitution in the general assignment statement case.
By an iteration we mean a statement
\[ y_i = f(y_{i-1}). \]
Usually a statement is executed repeatedly for i = 1, 2, ..., n. An example is:
for I := 1 step 1 until 10 do
A[I] := A[I-1] + A[I];
Also a block of assignment statements such as:
S_1: a := h + i + j;
S_2: b := a + k + m;
S_3: c := b + n + p;
S_4: d := c + q + r;
falls into this category (note that all statements have a form
output of S_i := output of S_{i-1} + x + y
where x and y are pure inputs in the sense that they do not appear as outputs).
Assume that we are only interested in the value of $y_n$ (the other
results, i.e. $y_{n-1}, y_{n-2}, \ldots, y_1$ may be obtained similarly to $y_n$ but in less
time). Then instead of n statements, i.e. $y_1 = f(..), y_2 = f(..), \ldots, y_n = f(..)$,
we may obtain one statement for $y_n$ by back substitution. For example,
let $y_i = a_{i-1} y_{i-1}$. Then
\[ y_n = a_{n-1} y_{n-1} = a_{n-1}(a_{n-2} y_{n-2}) = a_{n-1}(a_{n-2}(a_{n-3} y_{n-3})) = \cdots = y_0 \prod_{k=0}^{n-1} a_k . \]
We use the superscript "b" to distinguish the back
substituted form from an iteration form, e.g. $y_n = a_{n-1} y_{n-1}$ and
$y_n^b = y_0 \prod_{k=0}^{n-1} a_k$.
Then instead of computing each $y_i$ repeatedly for i = 1, 2, ..., n, $y_n^b$ may be
computed directly. In the above example $y_i$ can be computed in one step and
to get $y_n$ n steps are required while $y_n^b$ can be computed in $\lceil \log n \rceil$ steps in
parallel (i.e. by building a tree for $y_n^b$). The following table summarizes the
results for some primitive yet typical iteration formulas.
$y_i$                          | $y_n^b$                             | $T_s$ | $nT_s$ | $T_p$
$a y_{i-1}$                    | $a^n y_0$                           | 1 | n  | $\lceil \log_2 (n+1) \rceil$
$y_{i-1} + b$                  | $y_0 + b + \cdots + b$              | 1 | n  | $\lceil \log_2 (n+1) \rceil$
$a_{i-1} y_{i-1}$              | $y_0 \prod_{k=0}^{n-1} a_k$         | 1 | n  | $\lceil \log_2 n \rceil$
$y_{i-1} + a_{i-1}$            | $\sum_{k=0}^{n-1} a_k + y_0$        | 1 | n  | $\lceil \log_2 n \rceil$
$a y_{i-1} + b$                | $a^n y_0 + p_{n-1}(a)$ *            | 2 | 2n | $\approx 2 \lceil \log_2 (n+1) \rceil$
$a y_{i-1} + x_{i-1}$          | $p_n(a)$ **                         | 2 | 2n | $\approx 2 \lceil \log_2 (n+1) \rceil$
$y_{i-1} + b x_{i-1}$          | $\sum_{k=0}^{n-1} b x_k + y_0$      | 2 | 2n | $\approx \lceil \log_2 n \rceil + 1$
$a y_{i-1} + b x_{i-1}$        | $p_n(a)$ ***                        | 3 | 3n | $2 \lceil \log_2 (n+1) \rceil$

*   $p_{n-1}(a) = b a^{n-1} + b a^{n-2} + \cdots + b$
**  $p_n(a) = a^n y_0 + a^{n-1} x_0 + a^{n-2} x_1 + \cdots + a x_{n-2} + x_{n-1}$
*** $p_n(a) = a^n y_0 + b a^{n-1} x_0 + b a^{n-2} x_1 + \cdots + b a x_{n-2} + b x_{n-1}$
$T_s$: The time required to compute $y_i$ in parallel, i.e. $h[y_i]$.
$T_p$: The time required to compute $y_n^b$ in parallel, i.e. $h[y_n^b]$.
Table 4.1. Comparison of Back Substituted, $y_n^b$, and Non-Back Substituted
Computation, $y_i$: Iteration Formulas
From Table 4.1, the following lemma is obtained by exhaustion.
Lemma 1:
Let $y_i = f(y_{i-1})$ be linear in $y_{i-1}$ where we assume that in the
presented form additions are reduced to multiplications as much as possible,
e.g. $y_i = 2y_{i-1}$ instead of $y_i = y_{i-1} + y_{i-1}$. Then $n \times h[y_i] \ge h[y_n^b]$.
Thus if we have enough PE's, then instead of computing each $y_i$
repeatedly for i = 1, 2, ..., n, we should obtain $y_n^b$ by back substitution and
compute it by building a minimum height tree.
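As an illustration of the gap between the two schemes, the sketch below evaluates an iteration of the multiplicative kind both ways and counts steps. It assumes one step per operation in the serial loop and one step per level in the pairwise tree; `tree_reduce` and `iterate` are hypothetical helper names. With k operands the pairwise tree needs ⌈log₂ k⌉ levels.

```python
from operator import mul

def tree_reduce(op, operands):
    """Combine operands pairwise, level by level; return (result, number of parallel steps)."""
    level = list(operands)
    steps = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired operand is carried to the next level
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps

def iterate(y0, a):
    """Serial evaluation: one multiplication per index value."""
    y, steps = y0, 0
    for ak in a:
        y = ak * y
        steps += 1
    return y, steps

a = [2, 3, 5, 7, 11, 13, 17, 19]
serial_value, serial_steps = iterate(1, a)
tree_value, tree_steps = tree_reduce(mul, [1] + a)   # y_0 together with the eight factors
```

Here the serial loop takes 8 steps and the tree over the 9 operands takes 4 levels, with the same result.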
If an iteration $y_i = f(y_{i-1})$ is not linear in $y_{i-1}$, e.g. $y_i = a y_{i-1}^2
+ b y_{i-1} + c$, or if it is linear in $y_{i-1}$ but there are some additions not being
reduced to multiplications, e.g. $y_i = y_{i-1} + a y_{i-1}$, then it is not clear if
back substitution speeds it up. For example, back substitution does not speed
up the computation of $y_i = y_{i-1} + y_{i-1} + y_{i-1} + y_{i-1}$. Also let $y_i = f(y_{i-1})$
be a polynomial of $y_{i-1}$ where in the presented form additions are again reduced
to multiplications as much as possible. Then it is not likely that we can
speed up the computation by back substitution. Let
\[ f(y_{i-1}) = a_m y_{i-1}^m + \ldots \]
where $y_{i-1}^m$ is the highest power of $y_{i-1}$ among those which appear in $f(y_{i-1})$.
Note that $f(y_{i-1})$ is not necessarily a dense polynomial (a polynomial in which
all powers of $y_{i-1}$, i.e. $y_{i-1}, y_{i-1}^2, \ldots, y_{i-1}^m$, appear). While the exact height
of $T[f(y_{i-1})]$ depends on $f(y_{i-1})$, we may content ourselves with (see Chapter 2)
\[ h[f(y_{i-1})] \approx 2 \lceil \log_2 m \rceil . \]
Hence $2n \lceil \log_2 m \rceil$ steps are required to compute $y_n$.
Now let us consider $y_n^b$. Then
\[ y_n^b = a_m (y_{n-1}^b)^m + \ldots \]
\[ = a_m (a_m (y_{n-2}^b)^m + \ldots)^m + \ldots \]
\[ = a_m (a_m ( \ldots a_m y_0^m \ldots )^m \ldots )^m + \ldots \]
\[ = a_m^{1 + m + \cdots + m^{n-1}} \, y_0^{m^n} + \ldots . \]
That is, $y_n^b$ becomes a polynomial of $y_0$ of degree $m^n$. Leaving the computation
of the power of $a_m$ out of consideration, we have (see Chapter 2)
\[ h[y_n^b] \approx 2 \lceil \log_2 m^n \rceil \approx 2n \lceil \log_2 m \rceil . \]
Hence back substitution does not help to speed up computation significantly
in this case.
To gain a better understanding of more general cases, let us study
the situation from a different point of view. Given an iteration $y_i = f(y_{i-1})$,
let us consider the number of single variable occurrences in $y_n^b$ as a measure of
the complexity. We study two cases separately, i.e. (1) $y_{i-1}$ appears only once
in $y_i$ and (2) $y_{i-1}$ appears k times in $y_i$. In both cases we assume that there
are m single variable occurrences (including the occurrences of $y_{i-1}$) in $y_i$.
For convenience we write N(y) for the number of single variable occurrences in
y, e.g. N(a+b+cd+ae) = 6.
(1) $y_{i-1}$ appears only once in $y_i$. In this case we have
\[ N(y_1^b) = m \]
\[ N(y_2^b) = N(y_1^b) + m - 1 = 2m - 1 \]
\[ N(y_3^b) = N(y_2^b) + m - 1 = 3m - 2 \]
\[ \cdots \]
\[ N(y_n^b) = N(y_{n-1}^b) + m - 1 = mn - n + 1 \approx mn . \]
(2) $y_{i-1}$ appears k times in $y_i$. In this case we have
\[ N(y_1^b) = m \]
\[ N(y_2^b) = k \cdot N(y_1^b) + m - k = m + k(m-1) \]
\[ N(y_3^b) = k \cdot N(y_2^b) + m - k = m + (k + k^2)(m-1) \]
\[ \cdots \]
\[ N(y_n^b) = k \cdot N(y_{n-1}^b) + m - k = m + (k + \cdots + k^{n-1})(m-1) = 1 + \frac{k^n - 1}{k - 1}(m-1) . \]
If $k^n \gg 1$ and $m \gg 1$, then $N(y_n^b) \approx k^{n-1} m$.
Now if we use $2 \lceil \log_2 N(y) \rceil$ as a measure of the height of a tree, then
we have (see Section 3.5.1 of Chapter 3):

Case | $h[y_i]$                     | $n \times h[y_i]$             | $h[y_n^b]$
(1)  | $2 \lceil \log_2 m \rceil$   | $2n \lceil \log_2 m \rceil$   | $2 \lceil \log_2 (mn) \rceil \approx 2(\lceil \log_2 m \rceil + \lceil \log_2 n \rceil)$
(2)  | $2 \lceil \log_2 m \rceil$   | $2n \lceil \log_2 m \rceil$   | $2 \lceil \log_2 (k^{n-1} m) \rceil \approx 2((n-1) \lceil \log_2 k \rceil + \lceil \log_2 m \rceil)$

Table 4.2. Comparison of Back Substituted, $y_n^b$, and Non-Back Substituted
Computation, $y_i$: General Cases
For example let m = 5, k = 2 and n = 20 in (2). Then we have
$n \cdot h[y_i] = 40 \cdot \lceil \log_2 5 \rceil = 120$
and $h[y_n^b] = 2(19 \lceil \log_2 2 \rceil + \lceil \log_2 5 \rceil) = 44$.
Also if we let m = 5 and n = 20 in (1), then we get
$n \cdot h[y_i] = 40 \lceil \log_2 5 \rceil = 120$
and $h[y_n^b] = 2(\lceil \log_2 5 \rceil + \lceil \log_2 20 \rceil) = 16$.
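The arithmetic of the two examples, and the closed forms for the occurrence counts, can be checked with a few lines. The sketch below uses the approximate height expressions from the right-hand column of Table 4.2; the function names are hypothetical, and k = 1 is used to select case (1).

```python
def clog2(x):
    """Ceiling of log2 for a positive integer, computed exactly with integer arithmetic."""
    return (x - 1).bit_length()

def occurrences(m, k, n):
    """N(y_n^b) from the recurrence N_1 = m, N_j = k * N_(j-1) + m - k."""
    count = m
    for _ in range(n - 1):
        count = k * count + m - k
    return count

def heights(m, k, n):
    """(n * h[y_i], approximate h[y_n^b]) following the table; k == 1 selects case (1)."""
    serial = n * 2 * clog2(m)
    if k == 1:
        back = 2 * (clog2(m) + clog2(n))
    else:
        back = 2 * ((n - 1) * clog2(k) + clog2(m))
    return serial, back

case_2 = heights(5, 2, 20)   # m = 5, k = 2, n = 20
case_1 = heights(5, 1, 20)   # m = 5, n = 20
```

With these inputs `case_2` is (120, 44) and `case_1` is (120, 16), matching the figures in the text.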
Now a few comments about implementation are in order. As for back
substitution of a block of assignment statements, the step by step substitution
is the only possible scheme. In case of an iteration formula, we may use the
z-transformation technique to obtain $y_n^b$ [8]. For example let $y_i = y_{i-1} + x_i$.
Then by applying z-transformation on it, we get $Y(z) = zY(z) + X(z)$ or
$Y(z) = \frac{X(z)}{1 - z}$. Hence
\[ Y(z) = X(z)(1 + z + z^2 + z^3 + \ldots) \]
\[ = (x_0 + x_1 z + x_2 z^2 + \ldots)(1 + z + z^2 + \ldots) \]
\[ = x_0 + (x_1 + x_0) z + (x_2 + x_1 + x_0) z^2 + \ldots \]
or
\[ y_i = \sum_{k=0}^{i} x_k . \]
Two other related problems become evident in the example presented
above. First is algebraic simplification. For example, a := b + 4c could be
executed more quickly than a := b+c+c+c+c. We shall not discuss this subject
further here. A second problem is the discovery of common subexpressions. In
our example, (b+c) appears twice in the right hand side of the statement for h.
If we had an algorithm, e.g. [11], which discovered common subexpressions
in one (or more) tree which could be simultaneously evaluated, the
number of PE's required could be reduced by evaluating the common subexpressions
once for all occurrences. On the other hand, by removing common subexpressions
the execution time (the height of a tree) may be increased in some cases. For
example, if we have x := a(b+c+de) and y := f(g+c+de), then we might try to
replace c + de in x and y by z as follows to save the number of PE's required:
[Figure: trees for x and y sharing the common subexpression z = c + de]
However, note that x = a(b+z) or y = f(g+z) takes 4 steps while the original x
and y require only 3 steps, i.e. h[a(b+c+de)] = 3 and h[a(b+z)] = 4. Thus an
overall strategy must be developed for the use of a common subexpression
discovery algorithm in conjunction with overall tree height reduction.
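The two heights in this example can be recomputed with a small helper. The sketch assumes the operator is associative and commutative, so that the operands of one sum or product may be paired in any order, and it uses a greedy rule (always combine the two shallowest subtrees); `min_height` is a hypothetical name, not a routine from the text.

```python
import heapq

def min_height(operand_heights):
    """Minimum tree height for combining operands with one associative,
    commutative operator: repeatedly pair the two shallowest subtrees."""
    heap = list(operand_heights)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + 1)
    return heap[0]

# x := a(b + c + de), with c + de left inline
de = min_height([0, 0])                  # d * e
inline_sum = min_height([0, 0, de])      # b + c + de, free to pair b with c first
x_inline = min_height([0, inline_sum])   # a * (...)

# x := a(b + z), with z := c + de evaluated once as a shared subtree
z = min_height([0, de])                  # c + de
x_shared = min_height([0, min_height([0, z])])
```

Forcing z to be a finished subtree removes the freedom to add b and c while d × e is being formed, which is where the extra step comes from.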
4.2 Loops
This section is included here to complete this chapter, and discusses
the subject superficially. Details will be presented in the following chapters.
Consider the following example.
E1: for I := 1 step 1 until 10 do
for J := 1 step 1 until 10 do
S3: A[I,J] := A[I,J-1] + B[J];
In this case ten statements, A[1,J] := A[1,J-1] + B[J], A[2,J] := A[2,J-1] +
B[J], ..., A[10,J] := A[10,J-1] + B[J] can be computed simultaneously while J
takes values 1,2,...,10 sequentially. We say that S3 can be computed in parallel
with respect to I. Note that originally the computation of E1 takes 100 steps.
(One step corresponds to the computation of S3, i.e. addition. For the sake of
brevity we only take arithmetic operations into account and shall not concern
ourselves with e.g. operations involved in indexing.) By computing S3 simultaneously for
all values of I (I = 1,2,...,10) the computation time can be reduced to 10 steps.
Finally by building a minimum height tree for $A^b[I,10]$ ($:= A[I,0] + \sum_{J=1}^{10} B[J]$)
for each I (I = 1,2,...,10), we can compute all ten trees simultaneously in
4 steps.
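The three step counts for E1 (100, 10 and 4) can be reproduced by simulating the three schedules. This is a sketch under the text's convention that only additions count as steps; the data values and function names are made up for the illustration.

```python
def run_as_written(A0, B):
    """E1 exactly as presented: one addition per (I, J) pair."""
    finals, steps = [], 0
    for a0 in A0:
        value = a0
        for b in B:
            value = value + b
            steps += 1
        finals.append(value)
    return finals, steps

def run_parallel_in_I(A0, B):
    """All rows advance together; J is still sequential."""
    current, steps = list(A0), 0
    for b in B:
        current = [a + b for a in current]   # one parallel step over all I
        steps += 1
    return current, steps

def run_back_substituted(A0, B):
    """Each row sums A[I,0], B[1], ..., B[10] with a pairwise tree."""
    finals, steps = [], 0
    for a0 in A0:
        level, depth = [a0] + list(B), 0
        while len(level) > 1:
            level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
            depth += 1
        finals.append(level[0])
        steps = max(steps, depth)            # the ten trees run simultaneously
    return finals, steps

A0 = list(range(1, 11))          # hypothetical values of A[I,0]
B = list(range(10, 110, 10))     # hypothetical values of B[1..10]
written = run_as_written(A0, B)
by_row = run_parallel_in_I(A0, B)
by_tree = run_back_substituted(A0, B)
```

All three schedules produce the same final column A[I,10]; only the number of steps differs.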
To help understanding, let us further consider
L: for I := 1 step 1 until N_1 do
for J := 1 step 1 until N_2 do S;
Then Figure 4.2(a) shows the execution of L as it is presented. The total
computation time required ($T_0$) is $N_1 \times N_2 \times m$, where we assume that m arithmetic
operators are in S. Now suppose S can be computed in parallel with respect to
I (Figure 4.2(b)). In Figure 4.2(b), each box has the form shown in Figure
4.2(c). Here S is computed sequentially, i.e. $T_1 = m N_2$. Now let us
compute S in parallel, i.e. by building a tree (Figure 4.2(d)). Then we have
$T_2 = N_2 h[S]$. Note that $m \ge h[S]$. Further if we back substitute S for J =
1,2,...,$N_2$ and get $S^b$, we have $T_3 = h[S^b]$. As stated before (Section 4.1),
$N_2 h[S] \ge h[S^b]$, or $T_0 \ge T_1 \ge T_2 \ge T_3$.
[Figure 4.2. Loop Analysis, panels (a) to (e)]
In general we have
L: for I_1 := 1 step 1 until N_1 do
for I_2 := 1 step 1 until N_2 do
...
for I_n := 1 step 1 until N_n do S;
where S is an assignment statement. Then the computation of L takes
$T_0 = \bigl( \prod_{j=1}^{n} N_j \bigr) m$ steps as it is presented, where we assume that m arithmetic operations
are involved in the computation of S. If S can be computed for all values of
$I_k$ ($I_k = 1,2,\ldots,N_k$) simultaneously, then the computation time can be reduced to
$T_k = \bigl( \prod_{j \ne k} N_j \bigr) m$ steps, i.e. $N_k$ statements can be computed simultaneously
while $I_1, I_2, \ldots, I_{k-1}, I_{k+1}, \ldots, I_n$ change sequentially. In general there are n
possibilities, i.e. we examine if S can be computed in parallel with respect
to $I_k$ for k = 1,2,...,n. Let P = {k | S can be computed in parallel with respect
to $I_k$}. Then we would compute S in parallel with respect to $I_g$ where
$T_g = \min_{k \in P} T_k$. Clearly each statement of the resultant $N_g$ statements can be computed
by building a tree. Further if it is appropriate we perform back substitution
and obtain a big tree as the above example (E1) illustrates.
If a loop is a limit loop which terminates when ε < δ for some
predetermined δ and computed ε, it may be approximated by a counting loop (e.g.
for I := 1 step 1 until N do) which is executed a fixed number of times before
the test is made, and then repeated if the test fails.
Consider a program containing n two way forward jump statements (or
if statements). Let the tests for jumps be Boolean expressions $B_1, B_2, \ldots, B_n$.
Assume that there are m output variables from the program given as expressions
$A_1, A_2, \ldots, A_m$, where parts of $A_i$ may depend on $B_j$'s. In a program when $B_j$ is
encountered, one of two choices is taken depending on the value of $B_j$. It is
possible to start computing all of these possible alternatives at the earliest
time, and choose proper results as soon as values of $B_j$'s become available.
For example
a := g + c;
B := a > 0;
if B then d := e + f + s else d := a + g + t;
h := d + f + i x j x k x p x q;
yield
h := B x (e+f+s) + (not B) x ((g+c)+g+t) + f + i x j x k x p x q
or
h := ((g+c) > 0) x (e+f+s) + ((g+c) ≤ 0) x ((g+c)+g+t) + f + i x j
x k x p x q,
where we let B = 1 for true, B = 0 otherwise. Then we may build a tree for h
as follows.
[Figure 4.3. A Tree with a Boolean Expression]
The box B produces 0 or 1 depending on the value of (g+c > 0).
In general, Boolean expressions can be embedded in arithmetic
expressions as shown in the above example, and a minimum height tree can be
built for it.
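The embedding of the test into the arithmetic expression can be checked against the branching form on sample data. The sketch below reads the last statement of the example as h := d + f + i × j × k × p × q, which is an assumption about the original listing; the two function names are hypothetical.

```python
from itertools import product

def with_branch(g, c, e, f, s, t, i, j, k, p, q):
    """The program as written, with an explicit two way jump."""
    a = g + c
    if a > 0:
        d = e + f + s
    else:
        d = a + g + t
    return d + f + i * j * k * p * q

def branch_free(g, c, e, f, s, t, i, j, k, p, q):
    """Both alternatives are computed; the 0/1 value of B selects one of them."""
    B = 1 if (g + c) > 0 else 0
    return B * (e + f + s) + (1 - B) * ((g + c) + g + t) + f + i * j * k * p * q

# Compare the two forms on every combination of a few small values.
samples = list(product([-2, 0, 3], repeat=4))
agree = all(
    with_branch(g, c, e, f, 1, 2, 3, 4, 5, 6, 7) == branch_free(g, c, e, f, 1, 2, 3, 4, 5, 6, 7)
    for g, c, e, f in samples
)
```

The branch-free form does more arithmetic in total, but both alternatives and the test itself can start at the earliest time, which is the point of the construction.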
4.4 Error Analysis
In this section parallel and serial computation are compared in terms
of error. We are only concerned with a generated error, i.e. an error which is
introduced as a result of arithmetic operations. It is shown that in general
parallel computation would produce less error than serial computation. It
is also shown that distribution would not increase the size of an error
significantly. Let ω represent any arithmetic operation. In general, we do
not perform the operation ω exactly but rather a pseudo-operation (ω). Hence
instead of obtaining the result x ω y, we obtain a result x (ω) y. We may write
\[ x \,(\omega)\, y = (x \,\omega\, y)(1 + \epsilon_\omega) \tag{1} \]
where $\epsilon_\omega$ represents an error introduced by performing a pseudo-operation. For
example, we have
\[ x \oplus y = (x + y)(1 + \epsilon_a) \]
and
\[ x \otimes y = (xy)(1 + \epsilon_m) . \]
Let us write $A^*$ for an approximation to an arithmetic expression A with an
error obtained by computing A using pseudo-operations, e.g. $\otimes$ or $\oplus$. Then
(1) can also be written as
\[ (x \,\omega\, y)^* = (x \,\omega\, y)(1 + \epsilon_\omega) . \]
Now let us consider the computation
\[ A = \sum_{i=1}^{N} a_i . \]
First we compute A serially, i.e.
\[ A = (\ldots(((a_1 + a_2) + a_3) + a_4) + \ldots + a_N) . \]
Then we have (higher terms of $\epsilon_a$ are neglected)
\[ A_2^* = a_1 \oplus a_2 = (a_1 + a_2)(1 + \epsilon_a) = a_1 + a_2 + \epsilon_a (a_1 + a_2), \]
\[ A_3^* = A_2^* \oplus a_3 = (a_1 + a_2 + \epsilon_a (a_1 + a_2) + a_3)(1 + \epsilon_a) = a_1 + a_2 + a_3 + \epsilon_a (2a_1 + 2a_2 + a_3), \]
\[ A_4^* = A_3^* \oplus a_4 = a_1 + a_2 + a_3 + a_4 + \epsilon_a (3a_1 + 3a_2 + 2a_3 + a_4), \]
\[ \cdots \]
\[ A_N^* = A_{N-1}^* \oplus a_N = \sum_{i=1}^{N} a_i + \epsilon_a ((N-1)a_1 + (N-1)a_2 + (N-2)a_3 + \ldots + a_N) \]
\[ \approx \sum_{i=1}^{N} a_i + \epsilon_a (N a_1 + (N-1) a_2 + (N-2) a_3 + \ldots + a_N) . \]
We let $E_a^s = \epsilon_a (N a_1 + (N-1) a_2 + (N-2) a_3 + \ldots + a_N)$. Next let us compute A in parallel,
i.e. by building a tree.
Without loss of generality we assume that N is a power of 2. Then
\[ A_{1,2}^* = a_1 \oplus a_2 = a_1 + a_2 + \epsilon_a (a_1 + a_2) \]
\[ A_{1,4}^* = A_{1,2}^* \oplus A_{3,4}^* = a_1 + a_2 + a_3 + a_4 + 2\epsilon_a (a_1 + a_2 + a_3 + a_4) \]
\[ A_{1,8}^* = A_{1,4}^* \oplus A_{5,8}^* = a_1 + a_2 + \ldots + a_8 + 3\epsilon_a (a_1 + a_2 + \ldots + a_8) \]
\[ \cdots \]
\[ A_{1,N}^* = A_{1,N/2}^* \oplus A_{N/2+1,N}^* = \sum_{i=1}^{N} a_i + \lceil \log_2 N \rceil \, \epsilon_a \sum_{i=1}^{N} a_i . \]
We let $E_a^p = \lceil \log_2 N \rceil \, \epsilon_a \sum_{i=1}^{N} a_i$. To compare $E_a^s$ with $E_a^p$, let $a = a_1 = a_2 = \ldots = a_N$.
Then we get
\[ E_a^s = \frac{N(N+1)}{2} \, a \, \epsilon_a \]
and
\[ E_a^p = N \lceil \log_2 N \rceil \, a \, \epsilon_a , \]
or
\[ E_a^s \ge E_a^p . \]
An error for $B = \prod_{i=1}^{N} b_i$ can be analyzed in a similar manner. In this case we have
\[ E_m^s = E_m^p = (N-1) \, \epsilon_m \prod_{i=1}^{N} b_i . \]
Hence, in general, we could expect that parallel computation produces less
error than serial computation.
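The first-order error coefficients of the two summation orders can be tabulated directly by counting how many rounded additions each operand passes through. This is a sketch of that count only (higher-order terms are ignored, as in the text); the function names are hypothetical.

```python
def serial_coefficients(n):
    """coeffs[i] = number of rounded additions that operand a_(i+1) passes through
    when the sum is formed left to right."""
    coeffs = [0] * n
    for last in range(1, n):          # the addition that brings in operand number last+1
        for i in range(last + 1):     # every operand already in the partial sum is rounded again
            coeffs[i] += 1
    return coeffs

def tree_coefficients(n):
    """Same count for a balanced pairwise tree; n is assumed to be a power of two."""
    coeffs = [0] * n
    width = 2
    while width <= n:                 # one tree level per doubling of the block width
        coeffs = [c + 1 for c in coeffs]
        width *= 2
    return coeffs

serial_4 = serial_coefficients(4)     # coefficients of a_1..a_4 in the serial sum
tree_8 = tree_coefficients(8)         # coefficients of a_1..a_8 in the tree sum
```

For four operands the serial coefficients are 3, 3, 2, 1, and for eight operands every tree coefficient is 3, in line with the expansions above. With equal operands the serial total is never smaller than the tree total.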
Note that if higher terms of $\epsilon_a$ and $\epsilon_m$ are neglected, then $A^*$ can be
written as
\[ A + \epsilon_a E_a(A) + \epsilon_m E_m(A) \]
where $E_a(A)$ and $E_m(A)$ are arithmetic expressions consisting of variables in A.
For example, if we compute A = a(bc+d) serially (i.e. A = a((bc)+d)) then we
get
\[ A^* = (a \times ((b \times c)(1 + \epsilon_m) + d)(1 + \epsilon_a))(1 + \epsilon_m) \]
\[ \approx a(bc+d) + \epsilon_a \, a(bc+d) + \epsilon_m (2abc + ad), \]
and $E_a(A) = a(bc+d)$
and $E_m(A) = 2abc + ad$.
Usually $E_a(A)$ and $E_m(A)$ depend on how A is computed as we have shown for
$A = \sum a_i$.
Now let us compare parallel computation of two arithmetic expressions
$A$ and $A^d$, where $A^d$ is the resultant expression obtained by applying the
distribution algorithm on A, in terms of a generated error. Note that we can
write
\[ A^* = A + \epsilon_a \times E_a(A) + \epsilon_m \times E_m(A) \]
and
\[ A^{d*} = A^d + \epsilon_a \times E_a(A^d) + \epsilon_m \times E_m(A^d) = A + \epsilon_a \times E_a(A^d) + \epsilon_m \times E_m(A^d) . \]
As an example let us study A = a(bc+d) + e and $A^d$ = abc + ad + e.
[Figure 4.4. Trees for (a) A = a(bc+d) + e and (b) $A^d$ = abc + ad + e]
Then we have
\[ A^* = (a(bc(1 + \epsilon_m) + d)(1 + \epsilon_a)(1 + \epsilon_m) + e)(1 + \epsilon_a) \]
\[ = (a(bc+d) + e) + \epsilon_a (2abc + 2ad + e) + \epsilon_m (2abc + ad) \]
and
\[ A^{d*} = (abc(1 + \epsilon_m)^2 + (ad(1 + \epsilon_m) + e)(1 + \epsilon_a))(1 + \epsilon_a) \]
\[ = (abc + ad + e) + \epsilon_a (abc + 2ad + 2e) + \epsilon_m (2abc + ad) . \]
Note that $E_m(A) = E_m(A^d)$ in the above example, which is not mere chance. We
can show that this holds for all cases.
Lemma 2:
\[ E_m(A) = E_m(A^d) \]
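Proof:
First let us consider
\[ t^* = t_1^* \oplus t_2^* \oplus \cdots \oplus t_n^* \]
where
\[ t_i^* = t_i + \epsilon_a E_a(t_i) + \epsilon_m E_m(t_i) . \]
Then clearly
\[ \epsilon_m E_m(t) = \epsilon_m \sum_{i=1}^{n} E_m(t_i) \]
regardless of the order of additions whereas $E_a(t)$ depends on the order of
additions. Hence we may write
\[ t^* = \sum_{i=1}^{n} t_i + \epsilon_a E_a\Bigl( {\sum_{i=1}^{n}}' t_i \Bigr) + \epsilon_m \sum_{i=1}^{n} E_m(t_i) \]
where $\sum'$ indicates that $E_a(t)$ depends on the order of additions.
Now let us consider
\[ A^* = t^* \otimes (t_1^* \oplus t_2^* \oplus \cdots \oplus t_n^*) \]
and
\[ A^{d*} = t^* \otimes t_1^* \oplus t^* \otimes t_2^* \oplus \cdots \oplus t^* \otimes t_n^* \]
where
\[ t^* = t + \epsilon_a E_a(t) + \epsilon_m E_m(t) \]
and
\[ t_i^* = t_i + \epsilon_a E_a(t_i) + \epsilon_m E_m(t_i) . \]
Then we have
\[ A^* = (t + \epsilon_a E_a(t) + \epsilon_m E_m(t)) \Bigl( \sum_{i=1}^{n} t_i + \epsilon_a E_a\Bigl( {\sum_{i=1}^{n}}' t_i \Bigr) + \epsilon_m \sum_{i=1}^{n} E_m(t_i) \Bigr)(1 + \epsilon_m) \]
\[ = t \sum_{i=1}^{n} t_i + \epsilon_a \Bigl( \sum_{i=1}^{n} (t_i E_a(t)) + t E_a\Bigl( {\sum_{i=1}^{n}}' t_i \Bigr) \Bigr) + \epsilon_m \Bigl( \sum_{i=1}^{n} (t E_m(t_i) + t_i E_m(t) + t t_i) \Bigr) \]
and
\[ A^{d*} = (t + \epsilon_a E_a(t) + \epsilon_m E_m(t))(t_1 + \epsilon_a E_a(t_1) + \epsilon_m E_m(t_1))(1 + \epsilon_m) \oplus \cdots \]
\[ = (t t_1 + \epsilon_a (t E_a(t_1) + t_1 E_a(t)) + \epsilon_m (t E_m(t_1) + t_1 E_m(t) + t t_1)) \oplus \cdots \]
\[ = t \sum_{i=1}^{n} t_i + \epsilon_a E_a(A^d) + \epsilon_m \Bigl( \sum_{i=1}^{n} (t E_m(t_i) + t_i E_m(t) + t t_i) \Bigr) . \]
Hence
\[ E_m(A) = E_m(A^d) . \]
(Q.E.D.)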
As for $E_a(A)$ and $E_a(A^d)$, they depend on the order of additions and cannot be
compared simply. However, they may not differ significantly. As a simplified
case, let us study the following:
\[ A = t \Bigl( \sum_{i=1}^{N} a_i \Bigr) \]
and
\[ A^d = \sum_{i=1}^{N} (t a_i) . \]
Again we assume that N is a power of two. Then to compute A, we first compute
$\sum_{i=1}^{N} a_i$ in parallel. As we showed before,
$\bigl( \sum_{i=1}^{N} a_i \bigr)^* = \sum_{i=1}^{N} a_i + \lceil \log_2 N \rceil \epsilon_a \sum_{i=1}^{N} a_i$.
Hence
\[ A^* = t \Bigl( \sum_{i=1}^{N} a_i + \lceil \log_2 N \rceil \epsilon_a \sum_{i=1}^{N} a_i \Bigr)(1 + \epsilon_m) \]
\[ = t \sum_{i=1}^{N} a_i + \epsilon_a \lceil \log_2 N \rceil \, t \sum_{i=1}^{N} a_i + \epsilon_m \, t \sum_{i=1}^{N} a_i . \]
On the other hand, we have
\[ A_i^* = (t a_i)^* = t a_i + \epsilon_m t a_i \]
and $A^{d*}$ is obtained by summing $A_i^*$ in parallel, i.e.
\[ A^{d*} = (\ldots(((A_1^* \oplus A_2^*) \oplus (A_3^* \oplus A_4^*)) \oplus ((A_5^* \oplus A_6^*) \oplus (A_7^* \oplus A_8^*)))\ldots) \]
\[ = t \sum_{i=1}^{N} a_i + \epsilon_a \lceil \log_2 N \rceil \, t \sum_{i=1}^{N} a_i + \epsilon_m \, t \sum_{i=1}^{N} a_i . \]
Hence in this case $E_a(A) = E_a(A^d)$ as well as $E_m(A) = E_m(A^d)$.
5. PARALLELISM BETWEEN STATEMENTS
This chapter should be read as an introduction to the following
chapter which discusses loops in a program. In this chapter we study
parallelism between statements, i.e. inter statement parallelism. Given a
loop and jump free sequence of statements (we call this a program), it is
expected that they are executed according to the given (i.e. presented) order.
However if two statements do not depend on each other, they may be executed
simultaneously in hopes of reducing the total computation time. In general,
statements in a program may be executed in any order other than the given
order as long as they produce the same results as they will produce when they
are executed in accordance with the given sequence. In this chapter we give
an algorithm which checks if the execution of statements in a program by some
sequence gives the same results as the execution of statements by the given
sequence does. Also a technique which exploits more parallelism between
statements by introducing temporary locations is introduced.
5.1 Program
A program P with a memory M is a sequence of assignment statements
S(i), i.e. P = (S(1); S(2); ...; S(i); ...; S(r)) where i is a statement
number and r is the length of a program P (we write r = lg(P)). The memory
M is a set of all variables (or identifiers) which appear in P.
Associated with each S(i) is a set of input variables, IN(S(i)), and
an output variable, OUT(S(i)). Then $M = \bigcup_{i=1}^{r} (IN(S(i)) \cup OUT(S(i)))$. Further
we define two regions in a memory; a primary input region $M_I$ and a final
output region $M_O$ as
\[ M_I = \{ m \mid m \in IN(S(i)) \text{ and } \forall k < i,\ m \notin OUT(S(k)) \} \]
and
\[ M_O = \{ m \mid m \in OUT(S(i)) \} . \]
A program uses the values of those variables in $M_I$ as primary input data and
puts final results into $M_O$.
C(m) refers to a content (value) of a variable m. C(M) refers to
the contents of variables in the memory M as a whole and is called a
configuration of M. Also $C_I(m)$ refers to a value which m has before a computation
(i.e. an initial value of m). Thus $C_I(M_I)$ refers to primary input data given
to a program. We call it an initial configuration.
The following relations are established among statements in P.
A triple (id, i, j) where id ∈ M (id for an identifier) and i, j ∈
{0,1,...,r,r+1} (r = lg(P)) is in the dependence relation DR(P) if and only if:
(1) (i) i < j and
(ii) id ∈ OUT(S(i)) and id ∈ IN(S(j)) and
(iii) ∀k, i < k < j, id ∉ OUT(S(k)),
or (2) (i) i = 0 and
(ii) id ∈ IN(S(j)) and
(iii) ∀k, 0 < k < j, id ∉ OUT(S(k)),
(S(j) is the first statement to use id),
or (3) (i) j = r + 1 and
(ii) id ∈ OUT(S(i)) and
(iii) ∀k, i < k < r + 1, id ∉ OUT(S(k))
(S(i) is the last statement to update id).
Similarly a triple (id, i, j) is in the locking relation LR(P) if and
only if:
(i) i < j and
(ii) id ∈ IN(S(i)) and id ∈ OUT(S(j)) and
(iii) ∀k, i < k < j, id ∉ OUT(S(k)).
Example 1 (The notations follow ALGOL 60 [3]):
Let P be
S(1): a := b + c;
S(2): d := a + e;
S(3): f := g + d;
S(4): g := h + i.
Then
DR(P) = {(b,0,1),(c,0,1),(e,0,2),(g,0,3),(h,0,4),(i,0,4),(a,1,2),
(d,2,3),(f,3,5),(g,4,5)} and
LR(P) = {(g,3,4)}.
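The two relations can be computed mechanically from the IN and OUT sets. The sketch below follows the three DR cases and the LR conditions literally as they are worded above; the function name `relations` and the `(output, inputs)` encoding of statements are assumptions made for the illustration. Read this way, case (3) also produces (a,1,5) and (d,2,5) for Example 1, so the triples listed in the example are treated here as a subset of the computed DR rather than as the whole set.

```python
def relations(program):
    """program: list of (output, inputs) pairs for S(1)..S(r).
    Returns (DR, LR) following the stated conditions literally."""
    r = len(program)
    outs = {i + 1: out for i, (out, _) in enumerate(program)}
    ins = {i + 1: set(inp) for i, (_, inp) in enumerate(program)}

    def untouched(name, lo, hi):
        """True if no S(k) with lo < k < hi writes name."""
        return all(outs[k] != name for k in range(lo + 1, hi))

    DR, LR = set(), set()
    for j in range(1, r + 1):
        for name in ins[j]:
            if untouched(name, 0, j):
                DR.add((name, 0, j))                      # case (2): primary input
            for i in range(1, j):
                if outs[i] == name and untouched(name, i, j):
                    DR.add((name, i, j))                  # case (1): S(i) feeds S(j)
    for i in range(1, r + 1):
        if untouched(outs[i], i, r + 1):
            DR.add((outs[i], i, r + 1))                   # case (3): last update of the name
        for j in range(i + 1, r + 1):
            for name in ins[i]:
                if outs[j] == name and untouched(name, i, j):
                    LR.add((name, i, j))                  # S(i) reads before S(j) overwrites
    return DR, LR

# Example 1: single-letter names, inputs given as strings of letters.
example_1 = [("a", "bc"), ("d", "ae"), ("f", "gd"), ("g", "hi")]
DR, LR = relations(example_1)
```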
Since we are only interested in meaningful programs, we assume that
there is no superfluous statement, i.e. there is no id ∈ M such that
(i) id ∈ OUT(S(i)) and id ∈ OUT(S(j)) where i < j, and
(ii) ∀k, i < k ≤ j, id ∉ IN(S(k)).
Also we assume that there is no statement that has no inputs other than constant
numbers, e.g. "a := 5".
Now we define an execution order E of a program P as:
\[ E(P) = \{ (i,j) \mid i \in \{1,2,\ldots,lg(P)\},\ j \in \{1,2,\ldots\} \} . \]
We also write $E_P(i) = j$ if $(i,j) \in E(P)$.† To execute a program by E(P) means
that at step j, all statements with statement number $E_P^{-1}(j)$ are computed
simultaneously using data available before the j-th computation as inputs.
A pair (P, E) is used to denote this execution. Also by $E_0(P)$, we understand
the execution order given by a program, i.e.
\[ E_0(P) = \{ (i,i) \mid \forall i \in \{1,2,\ldots,lg(P)\} \} . \]
$E_0$ is called a primitive execution order.
We assume that at each time step at least one statement of P must
be executed. That is, for any E there is k such that $\forall j > k$, $E_P^{-1}(j)$ = empty
and $\forall j \le k$, $E_P^{-1}(j) \ne$ empty. We call k the length of an execution and write
lg(E).
As stated before, C(OUT(S(i))) refers to the contents of a variable
OUT(S(i)). This value, as we expect, varies from time to time throughout an
execution. Thus it is essential to specify the time when a variable is referred
to.
S(i)(P,E) refers to a computation of S(i) of P in an execution (P,E).
C(m) after S(i)(P,E) refers to the value of a variable m right after S(i)(P,E).
C(m) after (P,E) refers to the value of a variable m after an execution of a
whole program.
† For convenience we define that $\forall i$, $E_P(0) < E_P(i)$ and $E_P(i) < E_P(lg(P)+1)$.
5.2 Equivalent Relations Between Executions
Now we define two equivalent relations between executions.
Definition 1:
Given a program P and two execution orders $E_1$ and $E_2$, $(P,E_1)$ and
$(P,E_2)$ (or simply $E_1$ and $E_2$) are said to be output equivalent if and only if:
for all initial memory configurations $C_I(M_I)$,
\[ \forall i \quad C(OUT(S(i))) \text{ after } S(i)(P,E_1) = C(OUT(S(i))) \text{ after } S(i)(P,E_2) . \]
We write $(P,E_1) \overset{O}{\equiv} (P,E_2)$ if $(P,E_1)$ is output equivalent to $(P,E_2)$.
Definition 2:
Given two programs $P_1$ and $P_2$, let their execution orders be $E_1$ and
$E_2$ respectively. Also let their memories be $M_1$ and $M_2$. Then two executions
$(P_1,E_1)$ and $(P_2,E_2)$ are said to be memory equivalent if and only if:
(1) there is a one-to-one function
\[ f: (M_{1I} \cup M_{1O}) \to (M_{2I} \cup M_{2O}) \]
such that
\[ f(M_{1I}) = M_{2I} \]
and $f(M_{1O}) = M_{2O}$,
and (2) for all initial memory configuration pairs $C_I(M_{1I})$ and $C_I(M_{2I})$
such that
\[ \forall m \in M_{1I},\ C_I(m) = C_I(f(m)), \]
\[ \forall n \in M_{1O},\ C(n) \text{ after } (P_1,E_1) = C(f(n)) \text{ after } (P_2,E_2) . \]
We write $(P_1,E_1) \overset{M}{\equiv} (P_2,E_2)$ if $(P_1,E_1)$ is memory equivalent to $(P_2,E_2)$.
In principle, a program is written assuming that it will be executed
sequentially, i.e. by $E_0$. It, however, need not necessarily be executed by
$E_0$ as long as it produces the same results as $(P,E_0)$ when it terminates, i.e.
it may be executed by any E as long as $(P,E) \overset{M}{\equiv} (P,E_0)$ holds.
Now the following theorems can be proved directly from the above
definitions.
Theorem 1:
$(P,E) \overset{O}{\equiv} (P,E_0)$ if and only if:
(1) $\forall i$, (id, i, j) ∈ DR implies that $E_P(i) < E_P(j)$,
and (2) for any two triples (id, i, j) and (id, i', j') in DR with the
same identifier id, either $E_P(j') < E_P(i)$ or $E_P(j) < E_P(i')$
holds.
What condition (1) implies is that variables must be properly updated
before used, and condition (2) prevents variables from being updated
before they are used by all pertinent statements.
[Figure 5.1. Conditions for the Output Equivalence: (a) Condition (1), (b) Condition (2)]
Proof of Theorem 1:
(1) if part:
Assume that a statement S(i) receives data from statements
$S(i_1), S(i_2), \ldots, S(i_k)$, i.e. for each pair $i_s$ and i (s = 1,2,...,k)
there is an identifier $id_s$ such that $(id_s, i_s, i) \in DR$. Now let E
be an execution order which guarantees that (1) before S(i) is
computed, all $S(i_s)$ are computed, and (2) between the computation
of $S(i_s)$ and S(i), no statement updates $id_s$. Then it is clear
that C(OUT(S(i))) after S(i)(P,E) = C(OUT(S(i))) after $S(i)(P,E_0)$
providing that all $OUT(S(i_s))$ have appropriate values. Note that
the above two requirements are equivalent to conditions (1) and
(2) of the theorem. Then by induction, we can show that if
conditions (1) and (2) hold for all statements, then $(P,E) \overset{O}{\equiv} (P,E_0)$.
(2) only if part:
We give an example to show that if an execution order
violates condition (1) or (2) then we cannot get an output
equivalent execution. Now let P be
S(1): a := b;
S(2): c := a;
S(3): b := e.
Then DR = {(b,0,1),(e,0,3),(a,1,2),(c,2,4),(b,3,4)}, and $(P,E_0)$
gives
C(OUT(S(1))) after $S(1)(P,E_0)$ = $C_I(b)$,
C(OUT(S(2))) after $S(2)(P,E_0)$ = $C_I(b)$,
and C(OUT(S(3))) after $S(3)(P,E_0)$ = $C_I(e)$.
Now let $E_1(P)$ = {(1,2),(2,1),(3,3)} which violates the first
condition of the theorem, and $E_2(P)$ = {(1,2),(2,3),(3,1)} which
violates the second condition. Then
C(OUT(S(2))) after $S(2)(P,E_1)$ = $C_I(a)$
and C(OUT(S(1))) after $S(1)(P,E_2)$ = $C_I(e)$
which do not agree with corresponding values produced by $(P,E_0)$.
(Q.E.D.)
Theorem 1 gives more meaningful executions compared to the previous
results [5], [10]. For example let P be:
S(1): a := f_1(x)
S(2): b := f_2(a)
S(3): c := f_3(b)
S(4): b := f_4(x)
S(5): d := f_5(b,c).
Fisher [10], for example, would give the following execution (P,E)
as an "equivalent" execution to $(P,E_0)$:
E = {(1,1), (2,2), (3,3), (4,?), (5,4)}
This, however, does not give correct results unless P is properly
modified. Note that the variable b carries two different values between
steps 2 and 3, which is physically impossible. Theorem 1 does not recognize
such an execution as "equivalent" to $(P,E_0)$.
Theorem 2
$(P,E) \stackrel{M}{\equiv} (P,E_0)$ if and only if
(1) for all $i$, $(id,i,j) \in DR$ implies that $E_P(i) < E_P(j)$,
and (2) for all $j$, $(id,i,j) \in LR$ implies that $E_P(i) \le E_P(j)$.
Example 2:
P: S(1): a := b+c;
   S(2): d := a+e;
   S(3): a := q+r;
   S(4): h := a+s.
Let E(P) = {(3,1), (4,2), (1,3), (2,4)}. Then $(P,E) \stackrel{O}{\equiv} (P,E_0)$.
E, however, violates the second condition of Theorem 2, i.e. $(a,2,3) \in LR$ but
$E_P(2) > E_P(3)$.
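The distinction drawn in Example 2 can be checked by direct simulation. The following is a minimal sketch, not part of the original text: the helper `run`, the dictionary encoding of P, and the initial values are all illustrative assumptions. It executes one statement per step and compares both the value each statement produces (output equivalence) and the final memory (memory equivalence).

```python
def run(program, order, memory):
    """Execute one statement per step in the given order.
    Returns (value produced by each statement, final memory)."""
    mem = dict(memory)
    outputs = {}
    for stmt in sorted(order, key=order.get):
        target, (x, y) = program[stmt]
        mem[target] = mem[x] + mem[y]
        outputs[stmt] = mem[target]
    return outputs, mem

# Example 2: statement number -> (target, operands of "+")
P = {1: ("a", ("b", "c")), 2: ("d", ("a", "e")),
     3: ("a", ("q", "r")), 4: ("h", ("a", "s"))}
# Arbitrary initial contents, chosen only so that different sums differ.
init = {"a": 0, "b": 1, "c": 2, "d": 0, "e": 4,
        "h": 0, "q": 8, "r": 16, "s": 32}
E0 = {1: 1, 2: 2, 3: 3, 4: 4}          # essential (sequential) order
E = {3: 1, 4: 2, 1: 3, 2: 4}           # the order of Example 2

out0, mem0 = run(P, E0, init)
out1, mem1 = run(P, E, init)
output_equivalent = out0 == out1        # every S(i) produced the same value
memory_equivalent = mem0 == mem1        # final contents agree everywhere
print(output_equivalent, memory_equivalent)
```

Under these assumed values every statement produces the same result in both orders, but the final content of a differs, because S(1) rather than S(3) is the last statement to update it.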
The following lemma is helpful to prove the theorem.
Lemma 1:
If the two conditions of Theorem 2 hold for an execution order E, then
$(P,E) \stackrel{O}{\equiv} (P,E_0)$.
Proof:
We show that if conditions (1) and (2) of Theorem 2 (we write C(21)
and C(22) for them) hold, then conditions (1) and (2) of Theorem 1 (C(11)
and C(12)) follow.
First note that C(21) is identical to C(11). Next we show that
C(22) together with C(21) satisfy C(12). Assume that $(a,i_1,j_1), (a,i_2,j_2) \in DR$
where $j_1 < i_2$. Then there exist statements $S(h_1), S(k_1), S(h_2), S(k_2), \ldots,
S(h_m), S(k_m)$ such that $(a,j_1,h_1), (a,k_1,h_2), (a,k_2,h_3), \ldots, (a,k_g,h_{g+1}), \ldots, (a,k_m,i_2) \in LR$
and $(a,h_1,k_1), (a,h_2,k_2), \ldots, (a,h_g,k_g), \ldots, (a,h_m,k_m) \in DR$. Then if C(21) and
C(22) hold, $E_P(j_1) \le E_P(i_2)$. Thus C(12) follows.
(Q.E.D.)
Proof of Theorem 2:
(1) if part:
Let E be an execution order which satisfies C(21) and
C(22). Then by Lemma 1, $(P,E) \stackrel{O}{\equiv} (P,E_0)$. Now let i be a statement
number such that $(id,i,r+1) \in DR$ where r = lg(P), i.e. S(i) is
the last statement in $(P,E_0)$ which updates id. Then
C(id) after S(i)$(P,E_0)$ = C(id) after $(P,E_0)$. (1)
Also by an argument similar to the one used to prove Lemma 1, we can show
that for all j such that $id \in OUT(S(j))$, $E_P(j) \le E_P(i)$ holds.
Thus S(i) is the last statement to update id in (P,E), too. Thus
C(id) after S(i)(P,E) = C(id) after (P,E). (2)
Also since $(P,E_0) \stackrel{O}{\equiv} (P,E)$, we have C(OUT(S(i))) after S(i)$(P,E_0)$ =
C(OUT(S(i))) after S(i)(P,E) or
C(id) after S(i)$(P,E_0)$ = C(id) after S(i)(P,E). (3)
Thus from (1), (2) and (3), we have C(id) after (P,E) = C(id)
after $(P,E_0)$. Using the same argument for all i such that
$(id,i,r+1) \in DR$, we can show that for all $m \in M$, C(m) after
(P,E) = C(m) after $(P,E_0)$.
(2) only if part:
This part is again proved by giving a counter example. It
is easy to show that a program
S(1): a := e;
S(2): c := b;
S(3): b := a.
together with execution orders $E_1$ = {(1,3), (2,1), (3,2)} and
$E_2$ = {(1,1), (2,3), (3,2)} serves as a counter example. The
details are omitted.
(Q.E.D.)
Note that Lemma 1 can be modified as:
Corollary 1:
If $(P,E) \stackrel{M}{\equiv} (P,E_0)$, then $(P,E) \stackrel{O}{\equiv} (P,E_0)$.
Now let us study the memory equivalence relation between two
different programs and execution orders, $(P_1,E_1)$ and $(P_2,E_2)$, in detail. As
a subcase of this let us consider $(P_1,E_0)$ and $(P_2,E_0)$. In general it is
impossible to decide whether $(P_1,E_0) \stackrel{M}{\equiv} (P_2,E_0)$. That is, this problem can be
reduced to the Turing machine halting problem, which is known to be recursively
unsolvable [28]. In our discussion we have put the restriction on P that
a program is a loop-free block of assignment statements. Even with this
restriction it may still be practically impossible to show memory equivalence
between $(P_1,E_0)$ and $(P_2,E_0)$. For example if $P_1$ and $P_2$ are different polynomial
approximations for the same function, then $(P_1,E_0)$ and $(P_2,E_0)$ are likely
to produce different results due to e.g. a truncation error. We do not pursue
this problem further.
Finally let us consider the following example:
Example 3:
P: S(1): k := a;
   S(2): b := k;
   S(3): k := c;
   S(4): d := k;
   S(5): k := e;
   S(6): f := k.
Let E(P) = {(1,1), (2,2), (3,2), (4,3), (5,3), (6,4)}.
Then $(P,E_0) \stackrel{M}{\equiv} (P,E)$ and lg(E) = 4.
However, the following program P'
P': S(1): k' := a;
    S(2): b := k';
    S(3): k'' := c;
    S(4): d := k'';
    S(5): k := e;
    S(6): f := k.
together with an execution order E(P') = {(1,1), (2,2), (3,1), (4,2), (5,1), (6,2)}
gives a memory equivalent execution to $(P,E_0)$, and lg(E(P')) = 2.
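The effect of renaming in Example 3 can be reproduced with a small step-wise simulation. This is a sketch under stated assumptions: the helper `run_parallel`, the names `k1` and `k2` (standing for k' and k''), and the initial values are illustrative, and statements sharing a step are assumed to read all operands before any of them writes.

```python
def run_parallel(program, order, memory):
    """Run statements step by step; statements sharing a step read the
    old memory first and then write, as simultaneous PEs would."""
    mem = dict(memory)
    for step in sorted(set(order.values())):
        ready = [s for s in program if order[s] == step]
        writes = {program[s][0]: mem[program[s][1]] for s in ready}
        mem.update(writes)
    return mem

# statement number -> (target, source) for "target := source"
P = {1: ("k", "a"), 2: ("b", "k"), 3: ("k", "c"),
     4: ("d", "k"), 5: ("k", "e"), 6: ("f", "k")}
P_prime = {1: ("k1", "a"), 2: ("b", "k1"), 3: ("k2", "c"),
           4: ("d", "k2"), 5: ("k", "e"), 6: ("f", "k")}
init = {"a": 1, "c": 2, "e": 3, "b": 0, "d": 0, "f": 0, "k": 0}

E0 = {s: s for s in P}                            # sequential order
E = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 4}          # lg(E) = 4
E_prime = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2}    # lg(E(P')) = 2

names = set(init)                                 # the original memory M
ref = run_parallel(P, E0, init)
same_P = run_parallel(P, E, init) == ref
res_prime = run_parallel(P_prime, E_prime, init)
same_P_prime = {n: res_prime[n] for n in names} == ref
print(same_P, same_P_prime, max(E.values()), max(E_prime.values()))
```

The comparison for P' is restricted to the locations of the original program, since the renamed program necessarily uses extra locations.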
This suggests the introduction of the following transformation, which
when applied on a program P, produces a new program P' such that $(P,E_0) \stackrel{M}{\equiv} (P',E_0)$.
Transformation T
Let $S_1 = \{S(i), S(i+1), \ldots, S(j)\}$ and $S_2 = \{S(k), S(k+1), \ldots, S(m)\}$ where
$j < k$. Assume that there is an identifier id such that $(id,j,k) \in LR$, and
$id \in OUT(S(i))$, and for all u, $i < u \le j$, $id \notin OUT(S(u))$. Also assume that for any v and
w, $i \le v \le j$ and $k \le w \le m$, there is no id' such that $(id',v,w) \in DR$. Then
replace every occurrence of id in $S_1$ by id' where $id' \notin M$.
Gold [17] presented a similar transformation to describe his model
for linear programming optimization of programs.
After the transformation is applied, $S_1$ and $S_2$ can be processed in
parallel, and still $(P,E_0) \stackrel{M}{\equiv} (P',E)$ holds where P' is the resultant program after
the application of T on P.
This shows that the second condition of Theorem 2 is not essential,
i.e. it can be removed by introducing extra locations if necessary.
6. PARALLELISM IN PROGRAM LOOPS
6.1 Introduction
6.1.1 Replacement of a for Statement with Many Statements
Using the results from the previous chapter, now let us study loops
in a program, e.g. ALGOL for statements or FORTRAN DO loops, to extract potential
parallelism among statements. Given a loop P, we seek an execution order E with
the minimum length among all possible ones. Sometimes it may be appropriate to
get a loop P' from P by the previously introduced transformation for which there
is an execution order E' such that $(P',E') \stackrel{M}{\equiv} (P,E_0)$ and lg(E') is the minimum.
(For the definition of $\stackrel{M}{\equiv}$, see Chapter 5.)
As stated before, in this chapter our main concern is the parallelism
among statements (inter-statement parallelism). For example, we are interested
in finding out that all 10 statements (A[I] := A[I+1] + FUNC(B[I]); (I = 1, 2, ...,
10)) in F1 can be computed in parallel, whereas statements in F2 cannot be
(the notation follows ALGOL 60 [3]):
F1: for I := 1 step 1 until 10 do
A[I] := A[I + 1] + FUNC(B[I]);
F2: for I := 1 step 1 until 10 do
A[I] := A[I - 1] + FUNC(B[I]).
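The difference between F1 and F2 can be observed by comparing a sequential run with a run in which all ten statements read their operands before any of them writes. The sketch below is illustrative only: `func` is a stand-in for FUNC, and the array contents are arbitrary assumed values.

```python
def func(x):
    """Stand-in for FUNC; any pure function would do."""
    return 2 * x

def sequential(offset, a, b):
    """for I := 1 step 1 until 10 do A[I] := A[I+offset] + FUNC(B[I])"""
    a = list(a)
    for i in range(1, 11):
        a[i] = a[i + offset] + func(b[i])
    return a

def simultaneous(offset, a, b):
    """All ten statements read the old array, then all of them write."""
    old = list(a)
    new = list(a)
    for i in range(1, 11):
        new[i] = old[i + offset] + func(b[i])
    return new

a = list(range(100, 112))      # indices 0..11, so A[I-1] and A[I+1] exist
b = list(range(12))
f1_ok = sequential(+1, a, b) == simultaneous(+1, a, b)   # F1
f2_ok = sequential(-1, a, b) == simultaneous(-1, a, b)   # F2
print(f1_ok, f2_ok)
```

In F1 each statement reads an element that has not yet been overwritten in the sequential order, so reading everything first changes nothing. In F2 each statement reads the element just written by its predecessor, so the simultaneous run sees stale values.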
First several notations are presented. According to the ALGOL 60
report [3], a for statement has the following syntax:
< for statement > ::= < for clause >+ < statement >
< for clause > ::= for < variable > := < for list > do.
An instance of this is:
for I_1 := ... do
  ...
  for I_n := ... do
    begin S_1; S_2; ...; S_m end.
For the sake of brevity, we shall write $(I_1 \leftarrow L_1, I_2 \leftarrow L_2, \ldots, I_n \leftarrow L_n)(S_1, S_2,
\ldots, S_m)$ or $(I_1, I_2, \ldots, I_n)(S_1, S_2, \ldots, S_m)$ for the above for statement
instance, where $I_j$ is called a loop index, $L_j$ is an ordered set and called a loop
list set, and $S_p$ is called a loop body statement with a statement identification
number p (which is different from a statement number (see Chapter 5)).
As its name suggests, a loop list set represents a < for list >, e.g.
$L_j = (1,2,3,4,5,6)$ represents "$I_j$ := 1 step 1 until 6." In general we write
$L_j(i)$ for the ith element of $L_j$; thus $L_j(|L_j|)$ is the last element of $L_j$.
Now to facilitate later discussions we introduce the following notation.
Let $B = (b_1, b_2, \ldots, b_n)$ ($b_i > 0$ for all i) and $(i_1, i_2, \ldots, i_n)$ be n-tuples
of integers. Then we define the value of $(i_1, i_2, \ldots, i_n)$ w.r.t. B as follows.††
$$V((i(1..n)) \mid B) = \sum_{j=1}^{n} i_j B_j - \sum_{j=1}^{n} B_j + 1$$
where $B_j = \prod_{k=j}^{n} b_{k+1}$ and $B_n = b_{n+1} = 1$.
This notation is introduced so that the relations
$V((1,1,\ldots,1,1) \mid B) = 1$,
$V((1,1,\ldots,1,2) \mid B) = 2$,
...
$V((1,1,\ldots,1,b_n) \mid B) = b_n$,
$V((1,1,\ldots,1,2,1) \mid B) = b_n + 1$,
$V((1,1,\ldots,1,2,2) \mid B) = b_n + 2$,
...
and $V((b_1,b_2,\ldots,b_n) \mid B) = b_1 \times b_2 \times \cdots \times b_n$
hold.
For example $V((2,3,1) \mid (3,4,5)) = 31$. An n-tuple B is called a base.
The inverse function of V is also defined as $V^{-1}(t \mid B) = (i(1..n))$ if $V((i(1..n)) \mid B) = t$.
Note that V is not one-one, so $V^{-1}$ is not single-valued, e.g. $V^{-1}(15 \mid (3,4,5)) = (2,0,0)$
or $V^{-1}(15 \mid (3,4,5)) = (1,4,0)$. An n-tuple (i(1..n)) is said to be normalized if
$b_j \ge i_j > 0$ for all j. Let (i(1..n)) be normalized. Then
$1 \le V((i(1..n)) \mid B) \le \sum_{j=1}^{n} b_j B_j - \sum_{j=1}^{n} B_j + 1$.
If $1 \le t \le \sum_{j=1}^{n} b_j B_j - \sum_{j=1}^{n} B_j + 1$, then $V^{-1}(t \mid B)$ has a unique
normalized (i(1..n)) as its value.
† < A >+ = < A >< A >* where * is the Kleene star.
†† For convenience we write i(s..t) for the (t-s+1) integers $i_s, i_{s+1}, \ldots, i_t$,
e.g. (i(1..s), i(s+2..n)) means $(i_1, \ldots, i_s, i_{s+2}, \ldots, i_n)$.
Also (|L(s..t)|) means $(|L_s|, |L_{s+1}|, \ldots, |L_t|)$. Finally (i(n)) means
n i's, e.g. (1(3)) = (1,1,1).
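The value function V and its normalized inverse are a mixed-radix numbering shifted so that counting starts at 1. The following sketch implements the definition above directly; the function names `weights`, `V` and `V_inverse` are illustrative choices, not notation from the text.

```python
from itertools import product
from math import prod

def weights(base):
    """B_j = b_{j+1} * ... * b_n, with B_n = 1."""
    return [prod(base[j + 1:]) for j in range(len(base))]

def V(index, base):
    """Value of the tuple `index` w.r.t. `base`, as defined above."""
    w = weights(base)
    return sum(i * wj for i, wj in zip(index, w)) - sum(w) + 1

def V_inverse(t, base):
    """The unique normalized tuple (1 <= i_j <= b_j) whose value is t."""
    index, rest = [], t - 1
    for wj in weights(base):
        index.append(rest // wj + 1)
        rest %= wj
    return tuple(index)

B = (3, 4, 5)
print(V((2, 3, 1), B))                 # the example from the text
print(V((2, 0, 0), B), V((1, 4, 0), B))  # two unnormalized tuples, same value
print(V_inverse(15, B))                # the normalized tuple with that value
```

Enumerating the normalized tuples in lexicographic order gives the values 1, 2, ..., 60 for this base, which is the property the loop numbering below relies on.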
We say that normalized (i(1..n)) ranges over B = (b(1..n)) in
increasing order if $V((i(1..n)) \mid B)$ takes all values between 1 and
$\sum_{j=1}^{n} b_j B_j - \sum_{j=1}^{n} B_j + 1$ in increasing order as (i(1..n)) changes.
Notationally we write
$(1(n)) \le (i(1..n)) \le (b(1..n))$.
Finally we let
(i(1..n)) > (j(1..n)) if $V((i(1..n)) \mid B) > V((j(1..n)) \mid B)$
and
(i(1..n)) = (j(1..n)) if $V((i(1..n)) \mid B) = V((j(1..n)) \mid B)$.
The following lemma is an immediate consequence of the above
definition.
Lemma 1:
Let B = (b(1..n)) where $b_i \ge 2$ for all i. Then
(1) $V((a(1..n)) \mid B) < V((a'(1..n)) \mid B)$ implies that
$V((a_1 - a_1', \ldots, a_n - a_n') \mid B) \le -\sum_{j=1}^{n} B_j$, or equivalently
$V((a_1 - a_1', \ldots, a_n - a_n') \mid B) < V((0(n)) \mid B)$.
(2) $V((a_1' + c_1', \ldots, a_n' + c_n') \mid B) = V((a_1 + c_1, \ldots, a_n + c_n) \mid B)$ and
$V((a'(1..n)) \mid B) > V((a(1..n)) \mid B)$ imply that
$V((c_1' - c_1, \ldots, c_n' - c_n) \mid B) \le -\sum_{j=1}^{n} B_j$, or equivalently
$V((c_1' - c_1, \ldots, c_n' - c_n) \mid B) < V((0(n)) \mid B)$.
(3) Let $0 \le |a_j| < b_j$ for all j. Then $V((a(1..n)) \mid B) < V((0(n)) \mid B)$ if
and only if there is h such that for all k ($1 \le k < h$), $a_k = 0$ and $a_h < 0$.
A loop must be replaced with a sequence of statements so that we can
use the results of the previous chapter. For example we replace
for I := 1 step 1 until 10 do
S1: A[I] := A[I] + B[I];
with the sequence of ten statements
A[1] := A[1] + B[1];
A[2] := A[2] + B[2];
...
A[10] := A[10] + B[10].
In general we will get $\left(\prod_{j=1}^{n} |L_j|\right) m$ statements after the replacement
of a loop P: $(I_1, I_2, \ldots, I_n)(S_1; S_2; \ldots; S_m)$. Any statement in the set of replaced
statements can be identified by an n-tuple (i(1..n)) which corresponds to values
of $I_1, I_2, \ldots, I_n$ (i.e. $L_1(i_1), L_2(i_2), \ldots, L_n(i_n)$), and p which represents a
statement identification number. Thus an (n+1)-tuple (i(1..n),p) serves as a
statement number, and we write S((i(1..n),p)) to denote a particular statement
in the set of replaced statements, e.g. in the above example S((3,1)) = A[3] :=
A[3] + B[3]. The actual statement which corresponds to this is the statement
$S_p$ with $L_1(i_1), \ldots, L_n(i_n)$ substituted into every occurrence of $I_1, \ldots, I_n$ in $S_p$,
and we also write $S_p[L_1(i_1), \ldots, L_n(i_n)]$ for this.
These $\left(\prod_{j=1}^{n} |L_j|\right) m$ statements are to be executed according to the
presented order (i.e. the order specified by the for loop lists). In other words,
the statement S((1,1,...,1,1)) is executed first, S((1,1,...,1,2)) second, ...,
the statement S((i(1..n),p)) is executed $V((i(1..n),p) \mid (|L(1..n)|,m))$th, ...,
and the statement $S((|L_1|, \ldots, |L_n|, m))$ is executed last. Formally, as the
essential execution order we have:
$$E_0(P) = \{((i(1..n),p),\, V((i(1..n),p) \mid B)) \mid (1(n),1) \le (i(1..n),p) \le (|L(1..n)|,m)\}$$
where $B = (|L(1..n)|, m)$.
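As an illustration of the essential execution order, the sketch below numbers the replaced statements of a two-index loop with two body statements. The function names and the chosen sizes (two loop lists of ten elements, m = 2) are assumptions made for the example.

```python
from itertools import product
from math import prod

def V(index, base):
    """Value of `index` w.r.t. `base` (weights B_j = b_{j+1}...b_n, B_n = 1)."""
    w = [prod(base[j + 1:]) for j in range(len(base))]
    return sum(i * wj for i, wj in zip(index, w)) - sum(w) + 1

def essential_order(list_sizes, m):
    """E_0(P): map each statement number (i_1, ..., i_n, p) to its step."""
    base = tuple(list_sizes) + (m,)
    ranges = [range(1, b + 1) for b in base]
    return {idx: V(idx, base) for idx in product(*ranges)}

E0 = essential_order((10, 10), 2)
print(E0[(1, 1, 1)], E0[(1, 1, 2)], E0[(1, 2, 1)], E0[(10, 10, 2)])
```

The body statements of one iteration are consecutive, the innermost index varies next, and the last of the 200 statements receives step 200.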
Example 1:
for I_1 := 1 step 1 until 10 do
  for I_2 := 1 step 1 until 10 do
    begin
    S1: A^1[I_1,I_2] := A^2[I_1-1,I_2] + B^1[I_1,I_2];
    S2: B^2[I_1,I_2-1] := A^3[I_1+1,I_2] + B^3[I_1,I_2+1];
    end
is executed as
S((1,1,1)): A^1[1,1] := A^2[0,1] + B^1[1,1];
S((1,1,2)): B^2[1,0] := A^3[2,1] + B^3[1,2];
S((1,2,1)): A^1[1,2] := A^2[0,2] + B^1[1,2];
S((1,2,2)): B^2[1,1] := A^3[2,2] + B^3[1,3];
...
S((10,10,2)): B^2[10,9] := A^3[11,10] + B^3[10,11];
The superscript is used to distinguish different occurrences of A and B.
A[(i(1..n))] represents a form in which $L_1(i_1), \ldots, L_n(i_n)$ are
substituted into $I_1, \ldots, I_n$ in index expressions, e.g. in the above example
A^2[(i_1,i_2)] = A^2[i_1-1,i_2]
and A^2[(3,2)] = A^2[2,2].
Finally the set of inputs to a statement S((i(1..n),p)) is denoted by
IN(S((i(1..n),p))). Similarly OUT(S((i(1..n),p))) represents the set of outputs
from S((i(1..n),p)). From the above example we have, e.g.
IN(S((1,1,2))) = {A^3[2,1], B^3[1,2]}
and OUT(S((1,1,2))) = {B^2[1,0]}.
6.1.2 A Restricted Loop
In what follows, we mainly deal with a restricted class of for
statements. Two restrictions are introduced. Let a loop with m body statements
be
$P^m = (I_1, I_2, \ldots, I_n)(S_1; S_2; \ldots; S_m)$.
Restriction 1:
A for list set $L_i$ must be an arithmetic sequence, i.e.
$L_i = (1, 2, 3, \ldots, t_i)$ (1)
for all i.
Restriction 2:
Let $\{A_1, A_2, \ldots, A_s\}$ be the set of all array identifiers in $P^m$ where the
hth occurrence of $A_k$ in $P^m$ has the following form (where the superscript h is
used only if it is important to distinguish different occurrences of $A_k$):
$A_k^h[F(k,h,1), F(k,h,2), \ldots, F(k,h,n)]$. (2)
For fixed k and j, F(k,h,j) has an identical form for all h, i.e. either
$F(k,h,j) = I_j + w(k,h,j)$
or $F(k,h,j) = 0$ (i.e. vacant).
w(k,h,j) is a constant number. Also we assume that each $A_k$ appears on the left
hand side of statements at most once.
An example of a restricted loop is:
for I_1 := 1 step 1 until 20 do
  for I_2 := 1 step 1 until 30 do
    for I_3 := 1 step 1 until 40 do
      begin
      S1: A_3[I_1-1,I_2+3,0] := A_3[I_1,I_2,0] + A_1[0,0,I_3];
      S2: A_2[0,I_2,I_3-1] := A_3[I_1-1,I_2,0] + A_1[0,0,I_3-1];
      S3: A_1[0,0,I_3+1] := A_2[0,I_2-1,I_3];
      end;
Note that, for example, $A_3$ always appears as
$A_3[I_1 + w(3,h,1), I_2 + w(3,h,2), 0]$, thus the first occurrence of $A_3$ is
$A_3^1[F(3,1,1), F(3,1,2), 0] = A_3[I_1 + w(3,1,1), I_2 + w(3,1,2), 0] = A_3[I_1 - 1, I_2 + 3, 0]$.
If there is no ambiguity, we write e.g. $A_3[I_1 - 1, I_2 + 3]$ for $A_3[I_1 - 1, I_2 + 3, 0]$
(which is the conventional form).
We also write F(k,h,j)(i) for the resultant expression obtained by
substituting i into $I_j$ in F(k,h,j), e.g. $A_3[F(3,1,1)(2), F(3,1,2)(3), 0] =
A_3[1,6,0]$ (= $A_3[1,6]$ conventionally).
A single variable may be introduced as a special case of array
identifiers, e.g. we write
for I := 1 step 1 until 1 do
A[I] := ...
for
T := ....
6.2 A Loop With a Single Body Statement
6.2.1 Introduction
First we shall deal with the case where a loop has only one body
statement (i.e. m = 1). Let a loop with a single body statement be $P^1 =
(I_1, I_2, \ldots, I_n)S$. Since there is only one statement we may drop the statement
identification number. Then a statement number for a replaced statement becomes
(i(1..n)) and as the essential execution order we have:
$$E_0(P^1) = \{((i(1..n)),\, V((i(1..n)) \mid (|L(1..n)|))) \mid (1(n)) \le (i(1..n)) \le (|L(1..n)|)\}.$$
Also in this case we only have to consider the array identifier which
appears on the left hand side of S. Hence instead of s array identifiers we
only have one array identifier (see Restriction 2 of Section 6.1.2). Hence we
drop k and write
$A^h[F(h,1), F(h,2), \ldots, F(h,n)]$
and
$F(h,j) = I_j + w(h,j)$
for the hth occurrence of A for Eq. (2) of Section 6.1.2 (the superscript is
used if it is necessary to distinguish the different occurrences of A).
Furthermore we assume that $F(h,j) \ne 0$ for any h and j.
Now let us study the following two examples.
G1: for I := 1 step 1 until 10 do
A[I] := A[I] + 5;
G2: for I := 1 step 1 until 2 do
for J := 1 step 1 until 10 do
A[I,J] := A[I-1,J+1] + 5;
Assume that an arbitrary number of PE's are available. Then:
G1: All ten statements (A[I] := A[I] + 5) can be computed simultaneously
by 10 PE's.
G2: A[1,J] and A[2,J-2] can be computed simultaneously by two PE's at the
Jth step (J = 1, 2, ..., 10).
In what follows, the above two types of inter-statement parallelism
are studied.
Before we go into the details, a few comments are in order with regard
to real programs. A for statement with a single body statement, $(I_1, \ldots, I_n)S$,
can be classified from several different points of view. First of all let us
take a for list set $L_j$. As a simplified case we have $L_j = (s_j, s_j+1, \ldots, t_j)$
($t_j = (|L_j| - 1) + s_j$) which is equivalent to an ALGOL statement "for $I_j$ := $s_j$
step 1 until $t_j$ do". Knuth stated [9] that examination of published algorithms
showed that in well over 99 percent of the uses of the ALGOL for statement, the
value of the step was '+1', and in the majority of the exceptions the step was
a constant. This statement was confirmed by checking all Algorithms published
in the Communications of the ACM in 1969. There were 23 programs and 263 for
statements used. Only six uses were exceptions (2.3 percent).
Next let us examine a body statement S. Then either (1) the left
hand side variable of S (i.e. OUT(S)) is a single variable, or (2) OUT(S) is
an array identifier. In case of (2) S is of a form
A^0[F(0,1),...,F(0,n)] := f(A^1[F(1,1),...,F(1,n)], ..., A^p[F(p,1),...]).
Now S has one of the following five forms.†
(1) OUT(S) is a single variable t:
(i) t := a function which does not depend on t,
e.g. t := a + 5,
(ii) t := f(t), e.g. t := t + a.
(2) OUT(S) is an array variable A:
(i) A[F_1,...,F_n] := a function which does not depend on A,
e.g. A[I,J] := b + 5,
(ii) for all h, F(0,j) - F(h,j) is a constant for each j,
e.g. A[I,J] := A[I-5,J+3] + A[I+1,J-3] + 5,
(iii) other cases, e.g. A[I,J] := A[2I,J-5] + a.
Note that if S is of Form (1i), then
$P^1 = S[L_1(|L_1|), L_2(|L_2|), \ldots, L_n(|L_n|)]$. For example let $P^1$ be
for I := 1 step 1 until 5 do t := A[I] - 1.
Then after the execution of $P^1$, t = A[5] - 1.
† We use a lower case letter for a single variable and an upper case
letter for an array variable.
Again all Algorithms published in the CACM were checked (this time the
check was made against Algorithms published in 1968 and 1969). There were 52
programs altogether and 117 for statements with a single body statement. The
details were:
Form               No. of Examples   Percentage
(1) (i), (ii)            42             35.8
(2) (i)                  18             15.4
(2) (ii)                 33             28.2
(2) (iii)                24             20.6
Total                   117            100.0
In what follows we deal with Forms (2i), (2ii) and (2iii). Form
(1ii) has been discussed in Chapter 4.
6.2.2 Type 1 Parallelism
6.2.2.1 General Case
As stated in Chapter 5, a block of statements P need not be executed
according to the essential execution order $E_0$ and may be executed by any
execution order E as long as $(P,E_0) \stackrel{M}{\equiv} (P,E)$ holds. In this section we study a
special class of execution orders called type 1 execution orders. This execution
order is defined for each loop index $I_u$ (u = 1, 2, ..., n) and hence there are n of
these.
Definition 1:
A type 1 parallel execution order with respect to $I_u$ (we write 1-p
w.r.t. $I_u$) is given by
$$E(P^1) = \{((i(1..n)),\, V((i(1..u-1), i(u+1..n)) \mid (|L(1..u-1)|, |L(u+1..n)|))) \mid (1(n)) \le (i(1..n)) \le (|L(1..n)|)\},$$
and is represented by $E[I_u]$.
Figures 6.1 and 6.2 illustrate execution orders $E_0$ and $E[I_u]$.
Note that $E[I_u]((i(1..n))) = E[I_u]((i'(1..n)))$ if $i_k = i_k'$ for all
k = 1, 2, ..., u-1, u+1, ..., n. Furthermore note that if
$V((i(1..u-1)) \mid (|L(1..u-1)|)) > V((i'(1..u-1)) \mid (|L(1..u-1)|))$,
then $E[I_u]((i(1..n))) > E[I_u]((i'(1..n)))$.
By introducing extra $|L_u|$ PE's, the computation time becomes one
$|L_u|$th of the original, i.e. $\prod_{j=1, j \ne u}^{n} |L_j|$ steps instead of $\prod_{j=1}^{n} |L_j|$ steps, where one
step corresponds to the computation of a body statement.
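Definition 1 can be made concrete by computing the step assigned to every replaced statement. The sketch below is illustrative: the function names and the loop list sizes (3, 4, 5) are assumptions for the example, and u is counted from 1 as in the text.

```python
from itertools import product
from math import prod

def V(index, base):
    """Value of `index` w.r.t. `base` (weights B_j = b_{j+1}...b_n, B_n = 1)."""
    w = [prod(base[j + 1:]) for j in range(len(base))]
    return sum(i * wj for i, wj in zip(index, w)) - sum(w) + 1

def type1_order(sizes, u):
    """E[I_u]: the step of S((i(1..n))) is V of the index with the
    u-th component (1-based) removed, so i_u does not affect the step."""
    rest_base = sizes[:u - 1] + sizes[u:]
    order = {}
    for idx in product(*(range(1, s + 1) for s in sizes)):
        order[idx] = V(idx[:u - 1] + idx[u:], rest_base)
    return order

sizes = (3, 4, 5)                 # hypothetical |L_1|, |L_2|, |L_3|
E = type1_order(sizes, u=2)
steps = max(E.values())
print(steps, prod(sizes) // sizes[1])
```

Every step contains exactly $|L_u|$ statements, one per value of $i_u$, which is where the reduction from 60 steps to 15 in this instance comes from.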
[Figures 6.1 and 6.2: execution orders $E_0$ and $E[I_u]$]
We now introduce TRANQUIL notation [2] to illustrate Definition 1. In
TRANQUIL
for (I) seq (L) do S
stands for
for I := (for list set) do S.
Also in TRANQUIL for (I) sim (L) do S indicates that statements
S(L(i)) are executed simultaneously for all L(i) in L. Then Definition 1
amounts to obtaining
for (I_1) seq (L_1) do
  ...
  for (I_{u-1}) seq (L_{u-1}) do
    for (I_u) sim (L_u) do
      for (I_{u+1}) seq (L_{u+1}) do
        ...
        for (I_n) seq (L_n) do S
from
for (I_1) seq (L_1) do
  ...
  for (I_u) seq (L_u) do
    ...
    for (I_n) seq (L_n) do S.
First we study a type 1 parallel execution order for a general loop
in detail. Let the two-dimensional plane $I_u - I_{u+1} \times \cdots \times I_n$ be an $|L_u|$ by
$\prod_{j=u+1}^{n} |L_j|$ grid (see Figure 6.3).
The grid is labeled by $1, 2, \ldots, i_u, \ldots$ etc. rather than by $L_u(1), L_u(2),
\ldots, L_u(i_u), \ldots$ etc. for convenience.
Figure 6.3. Conditions of Parallel Computation in a Loop
Note that each square of the plane represents the computation
S((i(1..u-1),i(u..n))) for some (i(1..u-1)). If $P^1$ is executed by $E_0$, then the
computation proceeds from the leftmost column to the rightmost column while in a
column the computation proceeds from the top to the bottom sequentially. On the
other hand if $P^1$ is executed by $E[I_u]$ then we proceed to compute from the top
row to the bottom row while we perform the computation in each row simultaneously.
Each computation S((i(1..u-1),i(u..n))) uses inputs IN(S((i(1..n)))) and updates
the output OUT(S((i(1..n)))). Then as we studied in Chapter 5, we have to make
sure that the computation S((i(1..n))) (marked x in Figure 6.3) does not receive
any data which are to be updated by the computation in the region R, i.e. there
must be no id such that
$(id, (i(1..u-1),i'(u..n)), (i(1..n))) \in DR$
holds where $(i'(u+1..n)) > (i(u+1..n))$ and $i_u' < i_u$.
Similarly the computation S((i(1..n))) must not use any data which are to be
updated by the computation in the region Q, i.e. there must be no id' such that
$(id', (i(1..n)), (i(1..u-1),i''(u..n))) \in LR$
holds where $(i''(u+1..n)) < (i(u+1..n))$ and $i_u'' > i_u$.
The above observation gives the following theorem.
Theorem 1:
Let $E[I_u]$ be a type 1 parallel execution order w.r.t. $I_u$. Then
$(P^1,E[I_u]) \stackrel{M}{\equiv} (P^1,E_0)$ if and only if there are no id, id', (i(1..n)), (i(1..u-1),
i'(u..n)) and (i(1..u-1),i''(u..n)) for which either
(1) (i) $i_u' < i_u$ and
(ii) $(i'(u+1..n)) > (i(u+1..n))$† and
(iii) $(id, (i(1..u-1),i'(u..n)), (i(1..n))) \in DR$, where $id \in
OUT(S((i(1..u-1),i'(u..n))))$ and $id \in IN(S((i(1..n))))$
or (2) (i) $i_u'' > i_u$ and
(ii) $(i''(u+1..n)) < (i(u+1..n))$ and
(iii) $(id', (i(1..n)), (i(1..u-1),i''(u..n))) \in LR$, where $id' \in
OUT(S((i(1..u-1),i''(u..n))))$ and $id' \in IN(S((i(1..n))))$
hold.
† $(i'(u+1..n)) > (i(u+1..n))$, for example, means $V((i'(u+1..n)) \mid B) >
V((i(u+1..n)) \mid B)$ where $B = (|L(u+1..n)|)$. Unless specified, the
base $(|L(s..n)|)$ (= $(|L_s|, \ldots, |L_n|)$) is to be understood for (i(s..n)).
Let S be of a form $A^0 := f(A^1, \ldots, A^p)$ where A is an array identifier
and the superscript is used to distinguish different occurrences of A. Then
id in the first condition of Theorem 1 corresponds to those $A^h[(i(1..n))]$ for
which $A^0[(i(1..u-1),i'(u..n))] = A^h[(i(1..n))]$ holds together with the three
conditions (i), (ii), and (iii) of (1) between (i'(u..n)) and (i(u..n)).
Similarly id' corresponds to those $A^h[(i(1..n))]$ for which $A^0[(i(1..u-1),
i''(u..n))] = A^h[(i(1..n))]$ holds. Thus $A^h$ ($1 \le h \le p$) can be classified into
three groups:
C1 = {h | $A^h$ satisfies the first condition}
C2 = {h | $A^h$ satisfies the second condition}
C3 = {1, 2, ..., p} - C1 - C2.
Note that $C1 \cap C2 = \emptyset$.
Example 2:
Let $P^1$ be
$(I_1 \leftarrow (1,2,3), I_2 \leftarrow (1,2,3))(A^0[I_1,I_2] := f(A^1[I_1-1,I_2+1]))$.
Then for $i_1' = 1 < i_1 = 2$ and $(i_2') = (3) > (i_2) = (2)$, we have $A^0[(i_1',i_2')] =
A^1[(i_1,i_2)] = A[1,3]$, or $(A[1,3], (1,3), (2,2)) \in DR$. Thus $P^1$ cannot be computed
in 1-p w.r.t. $I_1$, and C1 = {1}.
From this argument it should be clear that if a body statement is of
Form (2i) (see Section 6.2.1), then the loop can be computed in 1-p w.r.t.
any $I_u$ (u = 1, 2, ..., n).
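The conclusion of Example 2 can be cross-checked by brute force: run the replaced statements in the essential order and in each type 1 order, and compare the resulting memories. The sketch below is an assumption-laden illustration: f is taken to be "add 1", the initial array contents are arbitrary distinct values, and statements of one step read before any of them writes. A differing result proves non-equivalence; an equal result for one initial memory only suggests equivalence.

```python
def run(sizes, u, a):
    """Execute A[i1,i2] := A[i1-1,i2+1] + 1 over the loop lists.
    u = None gives E_0; u = 1 or 2 runs all values of I_u simultaneously."""
    a = dict(a)
    n1, n2 = sizes
    r1, r2 = range(1, n1 + 1), range(1, n2 + 1)
    if u is None:
        steps = [[(i1, i2)] for i1 in r1 for i2 in r2]
    elif u == 1:
        steps = [[(i1, i2) for i1 in r1] for i2 in r2]
    else:
        steps = [[(i1, i2) for i2 in r2] for i1 in r1]
    for step in steps:
        # read every operand of the step first, then perform the writes
        writes = {(i1, i2): a[(i1 - 1, i2 + 1)] + 1 for i1, i2 in step}
        a.update(writes)
    return a

# Distinct initial values, including the boundary elements the loop reads.
init = {(i1, i2): 10 * i1 + i2 for i1 in range(0, 4) for i2 in range(1, 5)}
ref = run((3, 3), None, init)
can_I1 = run((3, 3), 1, init) == ref
can_I2 = run((3, 3), 2, init) == ref
print(can_I1, can_I2)
```

For these values the run that is parallel w.r.t. $I_1$ differs from the sequential one (A[2,1] is computed from the old A[1,2]), in agreement with Example 2, while the run that is parallel w.r.t. $I_2$ agrees.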
6.2.2.2 A Restricted Loop
If a loop is a restricted loop, then Theorem 1 may be simplified.
First we define a vector R(h) for each h = 1, 2, ..., p:
$R(h) = (R_1(h), \ldots, R_n(h))$,
where
$R_j(h) = F(0,j) - F(h,j)$
$= I_j + w(0,j) - (I_j + w(h,j))$
$= w(0,j) - w(h,j)$.
For example we get R(1) = (-1,8) from a statement
A^0[I_1-1,I_2+3] := f(A^1[I_1,I_2-5]).
Then we use these vectors to check parallel computability as follows.
Also for convenience we write
$R'(u,h) = (R_1(h), \ldots, R_{u-1}(h))$
and
$R''(u,h) = (R_{u+1}(h), \ldots, R_n(h))$.
Theorem 2:
If one of the following two holds for any of R(h) (h = 1, 2, ..., p),
then $P^1$ cannot be computed in 1-p w.r.t. $I_u$.
(1) (i) $R'(u,h) = (0, \ldots, 0)$ and
(ii) $R_u(h) > 0$ and
(iii) $R''(u,h) < (0, \ldots, 0)$ and for all j ($u+1 \le j \le n$), $|R_j(h)| \le |L_j| - 1$.
(2) (i) $R'(u,h) = (0, \ldots, 0)$ and
(ii) $R_u(h) < 0$ and
(iii) $R''(u,h) > (0, \ldots, 0)$ and for all j ($u+1 \le j \le n$), $|R_j(h)| \le |L_j| - 1$.
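The two checks of Theorem 2 need only the constant vectors R(h), so they are easy to mechanize. The sketch below is a hedged illustration, not the author's procedure: the function names are invented, vectors are compared lexicographically (the sign of the first nonzero component, as in part (3) of the lemma of Section 6.1.1), and because the scan does not preserve whether the comparison in (1)(iii) is strict, the sketch assumes that $R''(u,h) = (0,\ldots,0)$ also blocks, which is what loop F2 of Section 6.1.1 requires.

```python
def lex_sign(vec):
    """Sign of the first nonzero component (0 if all components are zero)."""
    for x in vec:
        if x:
            return 1 if x > 0 else -1
    return 0

def distance(w0, wh):
    """R(h) = (w(0,j) - w(h,j)) for j = 1..n."""
    return tuple(a - b for a, b in zip(w0, wh))

def blocks_type1(R, u, sizes):
    """Return 1 or 2 if check (1) or (2) fires for the vector R w.r.t.
    I_u (u is 1-based), and 0 if neither check fires."""
    head, r_u, tail = R[:u - 1], R[u - 1], R[u:]
    in_range = all(abs(r) <= size - 1 for r, size in zip(tail, sizes[u:]))
    if any(head) or not in_range:
        return 0
    if r_u > 0 and lex_sign(tail) <= 0:   # assumption: equality also blocks
        return 1
    if r_u < 0 and lex_sign(tail) > 0:
        return 2
    return 0

print(distance((-1, 3), (0, -5)))               # A0[I1-1,I2+3], A1[I1,I2-5]
print(blocks_type1((1, -1), 1, (3, 3)))         # Example 2 of Section 6.2.2.1
print(blocks_type1((1, -1), 2, (3, 3)))
print(blocks_type1((-2, 3), 1, (40, 40)))       # A[I1,I2] := A[I1+2,I2-3] + 2
```

With n loop indices and p right hand side occurrences the function is called np times, each call covering both checks.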
That the theorem is valid is a direct consequence of Theorem 1, i.e.
the first check of the theorem corresponds to the first condition of Theorem 1
and the second check corresponds to the second condition. For example the
first condition of Theorem 1 says that if
$(id, (i(1..u-1),i'(u..n)), (i(1..n))) \in DR$ holds for $i_u' < i_u$ and
$(i'(u+1..n)) > (i(u+1..n))$, then $P^1$ cannot be computed in 1-p w.r.t. $I_u$, where
$id \in OUT(S((i(1..u-1),i'(u..n))))$
and $id \in IN(S((i(1..n))))$.
Then id represents the element of A for which
$A^h[(i(1..n))] = A^0[(i(1..u-1),i'(u..n))]$
holds. Now this implies that
$F(h,j)(L_j(i_j)) = F(0,j)(L_j(i_j))$
for $j < u$ and
$F(h,j)(L_j(i_j)) = F(0,j)(L_j(i_j'))$
for $j \ge u$. Hence
$L_j(i_j) + w(h,j) = L_j(i_j) + w(0,j)$
or $R_j(h) = 0$ for $j < u$, and
$L_j(i_j) + w(h,j) = L_j(i_j') + w(0,j)$
or
$i_j' = i_j - R_j(h)$
for $j \ge u$. Also
$(i'(u+1..n)) > (i(u+1..n))$
with
$B = (|L(u+1..n)|)$
becomes
$V((i_{u+1} - R_{u+1}(h), \ldots, i_n - R_n(h)) \mid B) > V((i(u+1..n)) \mid B)$.
Then by Lemma 1,
$V((R_{u+1}(h), \ldots, R_n(h)) \mid B) < V((0, \ldots, 0) \mid B)$.
Thus the first check of Theorem 2 is verified. The second check can be
verified similarly.
Now let us consider the number of checks required. For each $A^h$ (h = 1, 2,
..., p) which appears on the right hand side of S, we first obtain a vector R(h).
Then for each loop index $I_u$, we perform the two checks given by Theorem 2 for
all R(h) (h = 1, 2, ..., p). Since there are n loop indices, in total we perform
2np checks.
The procedure described in this section can be extended to cover
nonrestricted loops, too. Let S be of a form
A^0[F(0,1),...,F(0,n)] := f(A^1[F(1,1),...,F(1,n)], ..., A^p[F(p,1),...])
and we define a vector R(h) for each h = 1, 2, ..., p as we did before, i.e.
$R(h) = (R_1(h), \ldots, R_n(h))$
and
$R_j(h) = F(0,j) - F(h,j)$.
Since a loop is not restricted, F(0,j) and F(h,j) may take any form and hence $R_j(h)$
may not be a constant number but rather a function of loop indicies, e.g. R(h) =
I, + 21,  5. Hence, in the most general case, it is necessary to check the
two conditions of Theorem 2 for all values of (i(l..n)) (i.e. (l(n)) < (i(l..n))
< ( lL1, L , • .., L  )) to examine type 1 parallel comput ability, i.e.
n
2( it L.) checks are required for each R(h) (h=l, 2, . . .,p). In many cases, we
d=i J
can expect that the number of checks required is far smaller than that. For
example if
R(D = (21^21^,1^21^),
then only 2(  L, xL,  ) checks are required, i.e. it is not necessary to check
for those loop indicies, e.g. 1^, which do not appear in R.(l) (d"l, 2,3**0.
6.2.2.3 Temporary Locations
In this section we mean a restricted loop by a loop. The second
condition of Theorem 1 (or 2) may be dropped by introducing extra temporary
locations by applying Transformation T of Chapter 5 on $P^1$, i.e. if $C1 = \emptyset$
and $C2 \ne \emptyset$, then temporary locations may be set up so that $P^1$ can be computed
in parallel (for C1 and C2, see Section 6.2.2.1). Let $h \in C2$. This implies that
there are (i(1..n)), (i(1..u-1),i'(u..n)) and id (see Figure 6.4) for which
$(id, (i(1..n)), (i(1..u-1),i'(u..n))) \in LR$
holds and
$id = A^h[(i(1..n))] \in IN(S((i(1..n))))$
and
$id = A^0[(i(1..u-1),i'(u..n))] \in OUT(S((i(1..u-1),i'(u..n))))$.
If a loop is computed in 1-p w.r.t. $I_u$, then we have
$E[I_u]((i(1..u-1),i'(u..n))) < E[I_u]((i(1..n)))$
while
$E_0((i(1..u-1),i'(u..n))) > E_0((i(1..n)))$.
Hence $A^h[(i(1..n))]$ will be updated by S((i(1..u-1),i'(u..n))) before being
used by S((i(1..n))). That is, if we compute $P^1$ in 1-p w.r.t. $I_u$ we must keep
the old value of $A^h[(i(1..n))]$, which otherwise will be updated by S((i(1..u-1),
i'(u..n))) at the $E[I_u]((i(1..u-1),i'(u..n)))$th step, separately until it is
used by S((i(1..n))) at the $E[I_u]((i(1..n)))$th step. The period of time, $t_h$,
through which the old value of $A^h[(i(1..n))]$ must be kept for the computation
S((i(1..n))) is given by
$t_h = E[I_u]((i(1..n))) - E[I_u]((i(1..u-1),i'(u..n)))$
$= V((i(u+1..n)) \mid B) - V((i'(u+1..n)) \mid B)$,
where $B = (|L(u+1..n)|)$. Then as we showed in Section 6.2.2.2, in the case of a
restricted loop, we can show that
$$t_h = V(R''(u,h) \mid B) + \sum_{s=u+1}^{n} B_s - 1$$
where $B_s = \prod_{t=s}^{n} |L_{t+1}|$ and $B_n = 1$. The details are omitted.
Figure 6.4. An Illustration of $t_h$
Now $\max_{h \in C2} t_h$ gives the maximum period of time through which $A^h[(i(1..n))]$ must be
kept. Since we have $|L_u|$ of them (i.e. $|L_u|$ statements are computed
simultaneously), the total amount of temporary locations required will be
$|L_u| \times \max_{h \in C2} t_h$. Additional $|L_u|$ locations are required for buffering (see Example 3).
Hence we have the following theorem.
Theorem 3:
The maximum number of temporary storage locations required is
$$|L_u| \times \left( \max_{h \in C2} \left[ V((R_{u+1}(h), \ldots, R_n(h)) \mid B) + \sum_{s=u+1}^{n} B_s \right] \right)$$
where $B = (|L(u+1..n)|)$, $B_s = \prod_{t=s}^{n} |L_{t+1}|$ and $B_n = 1$.
Example 3:
Let $P^1$ be
for (I_1) seq (1,2,...,40) do
  for (I_2) seq (1,2,...,40) do
    A[I_1,I_2] := A[I_1+2,I_2-3] + 2;
$P^1$ as it is cannot be computed in 1-p w.r.t. $I_1$ because it violates
the second condition of Theorem 2. Now we modify $P^1$ as follows by introducing
temporary arrays T1(40x1) and T2(40x3).
for (I_1) seq (1,2,...,40) do
  for (I_2) seq (1,2,...,40) do
    begin S1: T1[I_1] := A[I_1,I_2];
          S2: A[I_1,I_2] := T2[I_1+2,I_2 mod 3] + 2;†
          S3: T2[I_1,I_2 mod 3] := T1[I_1];
    end.
Then all three statements can be computed in 1-p w.r.t. $I_1$, i.e. we
can replace seq in the first for statement by sim. The original $P^1$, if
executed sequentially, takes 1600 steps whereas the modified $P^1$ takes only 120
steps if executed in parallel with respect to $I_1$.
† a mod b = a - b⌊a/b⌋.
Also we assume that T2 is properly initialized before the computation
of the loop, i.e. store A[1,*], A[2,*] and A[3,*] in T2[1,*], T2[2,*]
and T2[0,*].
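The storage bound of Theorem 3 can be evaluated for Example 3 with a few lines of code. This is a sketch under the reconstruction of the formula given in this section; the function name and argument layout are illustrative assumptions, with u counted from 1 and each vector in `c2_vectors` being a full R(h) for some h in C2.

```python
from math import prod

def temp_locations(sizes, u, c2_vectors):
    """|L_u| * max over h in C2 of [ V(R''(u,h) | B) + sum of B_s ],
    with B = (|L_{u+1}|, ..., |L_n|)."""
    base = sizes[u:]
    w = [prod(base[j + 1:]) for j in range(len(base))]   # B_s, with B_n = 1

    def V(index):
        return sum(i * wj for i, wj in zip(index, w)) - sum(w) + 1

    return sizes[u - 1] * max(V(R[u:]) + sum(w) for R in c2_vectors)

# Example 3: A[I1,I2] := A[I1+2,I2-3] + 2, so R(1) = (0-2, 0-(-3)) = (-2, 3).
total = temp_locations((40, 40), 1, [(-2, 3)])
print(total)
```

For Example 3 the old value must be kept for $t_1 = 3$ steps, giving 40 x 3 locations for T2 plus 40 buffering locations for T1.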
6.2.3 Type 2 Parallelism
In this section we mean a restricted loop by a loop. Since the
conflict between two statements S(i) and S(j) due to the existence of an
identifier id such that $(id,i,j) \in LR$ may be resolved by introducing temporary
locations (see Chapter 5 and the previous section), such conflict will not be
taken into account to check parallel computability throughout the rest of this
chapter.
This section describes the second type of parallelism, i.e. type 2
parallelism, in a for statement with a single body statement. Type 2
parallelism is introduced to resolve the conflict due to the first condition
of Theorem 1. The following example illustrates it.
Example 4:
P^1: for I_1 := 1 step 1 until 40 do
       for I_2 := 1 step 1 until 40 do
         A^0[I_1,I_2] := A^1[I_1-1,I_2+1] + A^2[I_1,I_2-1];
Since $R_1(1) = 1 > 0$ and $(R_2(1)) = (-1) < (0)$ hold, $P^1$ cannot be
computed in 1-p w.r.t. $I_1$.
Now let us consider the $I_1 - I_2$ plane (Figure 6.5). Suppose that all
$S((i_1,i_2))$ in the shaded area have been computed. Then at the next step those
$S((i_1',i_2'))$ marked as (1) can be computed simultaneously, and at the following
step all those marked (2) can be computed simultaneously, and so forth. We can see that a
heavy zigzag line travels from left to right like a "wave front" indicating
that all statements on that front can be computed simultaneously.
161
Figure 6.5 Wave Front
Note that computation of P by this scheme takes approximately 120
steps, while if P is computed sequentially it takes 40 x 40 = 1600 steps.
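The step count quoted for Example 4 can be checked with a small simulation. The following Python sketch is not part of the original text. It assumes the wave number w = 2*i1 + i2 (a wave front of slope 2, an assumption taken from the value given for Example 4) and verifies that each statement lies on a later front than both of its inputs. The names `wave`, `schedule` and `dependences_respected` are illustrative only.

```python
from collections import defaultdict

N = 40  # both loops of Example 4 run from 1 to 40


def wave(i1, i2):
    # Assumed wave number: all points with the same value of
    # 2*i1 + i2 are taken to lie on one wave front (slope 2).
    return 2 * i1 + i2


def schedule(n=N):
    """Group the n*n statement instances by wave front."""
    fronts = defaultdict(list)
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            fronts[wave(i1, i2)].append((i1, i2))
    return fronts


def dependences_respected(n=N):
    """Check that the inputs A[i1-1, i2+1] and A[i1, i2-1] of S((i1, i2))
    are produced on a strictly earlier wave front."""
    for i1 in range(1, n + 1):
        for i2 in range(1, n + 1):
            for j1, j2 in ((i1 - 1, i2 + 1), (i1, i2 - 1)):
                inside = 1 <= j1 <= n and 1 <= j2 <= n
                if inside and wave(j1, j2) >= wave(i1, i2):
                    return False
    return True


fronts = schedule()
print(len(fronts), "parallel steps instead of", N * N)
```

Under these assumptions the number of fronts comes out slightly below 120, which agrees with the "approximately 120 steps" stated in the text.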
Given a loop P^1, if P^1 is computed in 1p w.r.t. I_u, then a "wave
front" is parallel to the I_u axis of the I_u - I_{u+1} x ... x I_n plane, and
it travels in the increasing order of (i(u+1..n)).
If P^1 cannot be computed in 1p w.r.t. I_u then it may be possible to
find a "wave front" which is diagonal rather than horizontal, as in Example 4,
on the I_u - I_{u+1} x ... x I_n plane.
Figure 6.6. Wave Front Travel (labels: the direction of wave front travel; tan α = slope of a wave front)
This wave front is such that all computations S((i(1..n))) which correspond to
points (i_u,i_{u+1},...,i_n) which lie right next to a wave front can be computed
simultaneously. In other words all necessary data to compute S((i(1..u-1),
i(u..n))) have been already computed in the shaded area. The direction of a
wave front's travel is perpendicular to the wave front.
Now let us obtain the slope of a possible wave front for a restricted
loop.
Let P^1 be a restricted loop. Assume that P^1 cannot be computed in
1p w.r.t. I_u. Then according to Theorem 2, this means that there are R(h) for
which R_u(h) > 0 and (R_{u+1}(h),...,R_n(h)) < (0,...,0) hold (i.e. C1 ≠ ∅).
Theorem 4:
The slope of a possible wave front in the I_u - I_{u+1} x ... x I_n plane
is given by
$$\max_{h \in C1} \frac{1}{R_u(h)} \left( \left| V((R_{u+1}(h),\ldots,R_n(h)) \mid B) \right| - \sum_{s=u+1}^{n} B_s + 2 \right)$$
where B = (|L(u+1..n)|), $B_s = \prod_{t=s}^{n-1} |L_{t+1}|$ and B_n = 1.
In Example 4, the slope of the wave front is (1/1)(|-1-1+1| - 1 + 2) = 2.
Proof of Theorem 4:
Let us consider S((i(1..u-1),i(u..n))) on the I_u - I_{u+1} x ... x I_n
plane. Assume that there is a variable id such that (Figure 6.7)
(id, (i(1..u-1),i'(u..n)), (i(1..u-1),i(u..n))) ∈ DR
holds together with
i_u' < i_u and (i'(u+1..n)) > (i(u+1..n)),
i.e.
id ∈ IN(S((i(1..u-1),i(u..n))))
and
id ∈ OUT(S((i(1..u-1),i'(u..n)))).
This implies that there is h ∈ C1 such that
A^h[(i(1..u-1),i(u..n))] = A^0[(i(1..u-1),i'(u..n))]
holds.
Figure 6.7. An Illustration for Theorem 4
In case of a restricted loop, we have
i_j' = i_j - R_j(h)
for j ≥ u. Now let
t_h = V((i'(u+1..n))|B) - V((i(u+1..n))|B)
then we get
$$t_h = \left| V((R_{u+1}(h),\ldots,R_n(h)) \mid B) \right| - \sum_{s=u+1}^{n} B_s + 1$$
where $B_s = \prod_{t=s}^{n-1} |L_{t+1}|$ and B_n = 1. Now if we let the slope
of a wave front be equal to
$$\frac{t_h + 1}{i_u - i_u'} = \frac{t_h + 1}{R_u(h)}$$
then A^h[(i(1..n))] and A^0[(i(1..u-1),i'(u..n))] will be separated by it
(Figure 6.7).
The actual wave front is a zigzag line, rather than a straight line
as shown in Figure 6.7.
If there are more than one h in C1, then we choose α to be large
enough so that all inputs to S((i(1..u-1),i(u..n))) be inside of a wave front,
i.e.
$$\tan \alpha = \max_{h \in C1} \frac{t_h + 1}{R_u(h)}.$$
(Q.E.D.)
Now suppose we compute P^1 in parallel w.r.t. I_u using a diagonal
wave front whose slope is D = tan α. Then how many steps (one step corresponds
to the computation of a body statement S) does it take to compute P^1?
Theorem 5:
The total number of steps required to compute P^1 in parallel w.r.t.
I_u using a diagonal wave front whose slope is D is given by
$$T_p = \left( \prod_{j=1}^{u-1} |L_j| \right) \left( \prod_{j=u+1}^{n} |L_j| + |L_u| D \right).$$
Proof:
Let us consider the I_u - I_{u+1} x ... x I_n plane.
Figure 6.8. An Execution by a Wave Front
Wave front W must travel from the start position to the end position on the
plane. How long does it take? It takes L + |L_u| D steps where
$L = \prod_{j=u+1}^{n} |L_j|$. Since we have to process $\prod_{j=1}^{u-1} |L_j|$
such I_u - I_{u+1} x ... x I_n planes, in total it becomes
$$T_p = \prod_{j=1}^{u-1} |L_j| \, (L + |L_u| D).$$
(Q.E.D.)
Note that if a wave front is horizontal (i.e. if P^1 can be computed
in 1p w.r.t. I_u), then D = 0 and $T_p = \prod_{j \ne u} |L_j|$.
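The step count of Theorem 5 is easy to evaluate mechanically. The Python sketch below is an illustration that is not taken from the original text. It assumes the formula T = (product of |L_j| for j < u) x (product of |L_j| for j > u, plus |L_u| D); the function name `wavefront_steps` and its argument layout are hypothetical.

```python
from math import prod


def wavefront_steps(lengths, u, slope):
    """Steps needed to run a loop nest with a wave front of the given slope.

    lengths -- loop lengths |L_1|, ..., |L_n| (lengths[0] is |L_1|)
    u       -- 1-based position of the index I_u used for the wave front
    slope   -- the slope D of the wave front (0 for a horizontal front)
    """
    outer = prod(lengths[:u - 1])   # planes processed one after another
    inner = prod(lengths[u:])       # extent of the I_{u+1} x ... x I_n axis
    return outer * (inner + lengths[u - 1] * slope)


# Example 4: two loops of length 40, wave front of slope 2 w.r.t. I_1.
print(wavefront_steps([40, 40], 1, 2))
# A horizontal wave front (D = 0) needs only the steps of the other loops.
print(wavefront_steps([40, 40], 1, 0))
```

For Example 4 the first call gives 120, matching the figure quoted in the text, and the second gives 40.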
6.2.4 Conclusion
Assume that there are an arbitrary number of PE's available. Given
a restricted single body statement loop,
P^1 = (I_1,...,I_n)(A^0[F_1^0,...,F_n^0] := f(A^1[F_1^1,...,F_n^1],...,A^p[F_1^p,...,F_n^p]))
we can check if P^1 can be computed in 1p w.r.t. I_u (u=1,...,n) by Theorem 2.
If it cannot be, then we can check for type 2 parallel computability w.r.t.
I_u, i.e. find a possible wave front. In either case we obtain the number of
computational steps required, i.e.
$$T_u = \prod_{j=1,\, j \ne u}^{n} |L_j| \quad \text{or} \quad T_u = \left( \prod_{j=1}^{u-1} |L_j| \right) \left( \prod_{j=u+1}^{n} |L_j| + |L_u| D \right),$$
where one step corresponds to the computation of the body statement S.
Then among all possible choices, we would choose to compute in parallel w.r.t.
I_u where T_u = min_j T_j.
6.3 A Loop With Many Body Statements
6.3.1 Introduction
In what follows we mean a restricted loop by a loop. Again a check
against all published Algorithms in 1968 and 1969 CACM issues has been done,
and it has been revealed that well over 50 percent of the cases of for statement
usage (with more than two body statements) are instances of restricted loops.
Also as stated in Section 6.2.2.3 and Chapter 5, the LR relation may
be disregarded by introducing temporary locations. Hence it will not be taken
into account throughout the rest of this chapter.
Given a loop with m body statements, P^m, there are three different
approaches to compute it in parallel. First it is possible to extend the
procedure described in Section 6.2 by treating m body statements as if they
were one statement. That is, we consider body statements as a function
OUT(S((i(1..n)))) = f(IN(S((i(1..n))))).
For example
S1: A[I,J] := f(A[I,J-1],B[I-1,J-1]);
S2: B[I,J] := g(A[I-1,J-1],B[I,J-1])
yield
S: {A[I,J],B[I,J]} := f(A[I,J-1],A[I-1,J-1],B[I-1,J-1],B[I,J-1]).
Then we can apply Theorem 1 directly to check if e.g. S can be computed in 1p
w.r.t. I.
The second and the third approaches can be illustrated by the following
two examples.
E1: for I := 1 step 1 until 40 do
begin
S1: A[I] := f(A[I],B[I]);
S2: B[I] := g(A[I],B[I-1]);
end;
E2: for I := 1 step 1 until 40 do
begin
S1: A[I] := f(A[I-1],B[I-2]);
S2: B[I] := g(A[I]);
end.
In E1, note that S1 and S2 cannot be computed in parallel for all values of I
because S2 has an iteration form
B[I] := g'(B[I-1]).
However, E1 may be replaced with two for statements:
for I := 1 step 1 until 40 do
S1: A[I] := f(A[I],B[I]);
for I := 1 step 1 until 40 do
S2: B[I] := g(A[I],B[I-1]);
Now the first loop can be computed in parallel for all values of I while the
second for statement is still an iteration. In general by replacing a single
for statement with two or more for statements the parallel part may be exposed.
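The separation of E1 can be tried out directly. The Python sketch below is an illustration, not part of the original text. It assumes S1 reads A[I] and B[I] and S2 reads A[I] and B[I-1], as in E1, and uses arbitrary stand-in functions `f` and `g`, since the text leaves them unspecified.

```python
def f(a, b):
    # Stand-in for the unspecified function f of E1.
    return a + 2 * b


def g(a, b):
    # Stand-in for the unspecified function g of E1.
    return a - b


def original(n, a0, b0):
    """E1 as written: both statements inside one loop."""
    a, b = list(a0), list(b0)
    for i in range(1, n + 1):
        a[i] = f(a[i], b[i])        # S1
        b[i] = g(a[i], b[i - 1])    # S2
    return a, b


def separated(n, a0, b0):
    """E1 replaced with two for statements."""
    a, b = list(a0), list(b0)
    # First loop: the iterations are mutually independent,
    # so they could all be computed at once.
    for i in range(1, n + 1):
        a[i] = f(a[i], b[i])
    # Second loop: still an iteration, because B[i] needs B[i-1].
    for i in range(1, n + 1):
        b[i] = g(a[i], b[i - 1])
    return a, b


n = 40
a0 = [3 * i + 1 for i in range(n + 1)]
b0 = [7 - i for i in range(n + 1)]
print(original(n, a0, b0) == separated(n, a0, b0))
```

The two versions agree because S1 only reads values of B that S2 has not yet overwritten, so running all of S1 first changes nothing.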
In the second example, S1 and S2 cannot be computed in parallel
for all values of I, nor can they be separated into two independent loops
because S1 uses values which are updated by S2 (i.e. B[I-2]), and S2 uses
values being updated by S1 (i.e. A[I]). However S1 and S2 could be computed
simultaneously while I varies sequentially if the index expression in S2 is
"skewed" as follows.*
E2': for I := 1 step 1 until 40 do
begin
S1: A[I] := f(A[I-1],B[I-2]);
S2': B[I-1] := g(A[I-1]);
end.
* Strictly speaking S2' should not be executed when I = 1 and an extra
statement S2'': B[40] := g(A[40]) is required after this loop. For
the sake of brevity those minor boundary effects are ignored throughout
this section.
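The effect of skewing can be checked by simulating a step in which S1 and S2' both read the state left by the previous step. The Python sketch below is illustrative only. The functions `f` and `g` are arbitrary stand-ins, the arrays are held in dictionaries so that the indices 0 and -1 exist, and the boundary handling follows the footnote above.

```python
def f(a, b):
    # Stand-in for the unspecified function f of E2.
    return a + 2 * b + 1


def g(a):
    # Stand-in for the unspecified function g of E2.
    return 3 * a - 1


def initial_state(n):
    a = {i: i for i in range(0, n + 1)}
    b = {i: 10 + i for i in range(-1, n + 1)}
    return a, b


def original(n):
    """E2 executed sequentially, statement by statement."""
    a, b = initial_state(n)
    for i in range(1, n + 1):
        a[i] = f(a[i - 1], b[i - 2])   # S1
        b[i] = g(a[i])                 # S2
    return a, b


def skewed(n):
    """E2' with S1 and S2' evaluated together at each value of I."""
    a, b = initial_state(n)
    for i in range(1, n + 1):
        # Both right-hand sides are evaluated before either store,
        # which models simultaneous execution of S1 and S2'.
        new_a = f(a[i - 1], b[i - 2])
        new_b = g(a[i - 1]) if i > 1 else None   # S2' is skipped at I = 1
        a[i] = new_a
        if i > 1:
            b[i - 1] = new_b
    b[n] = g(a[n])                                # extra statement S2''
    return a, b


print(original(40) == skewed(40))
```

The skewed loop gives the same arrays because B[I-2] is produced one step earlier by S2', and A[I-1] one step earlier by S1.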
Figure 6.9 illustrates the computation of the modified loop as well as the
original loop.
Figure 6.9. Simultaneous Execution of Body Statements
In general, the above three approaches could be tried in any
combination. For example, we may first try the first approach, i.e. we try to
execute body statements simultaneously for all values of some loop index. If
this fails, then we may use the second approach, i.e. we separate a loop or
we replace a loop with as many for statements as possible. On a resultant
for statement we again try the first approach (if it has only one body statement,
then the results of the previous section can be used). If we fail again, then
the third approach can be taken.
We now describe each approach separately.
Before we go further, we define the following notations. Without
loss of generality we assume that the p-th occurrence of A_k appears in S_p and
also assume that S_p and S_q have forms
S_p = (A_k^p[F(k,p,1),...,F(k,p,n)] := f_p(...))
and
S_q = (.. := f_q(...,A_k^q[F(k,q,1),...,F(k,q,n)],...)).
Then we define a vector R(k,p,q) as follows
R(k,p,q) = (R_1(k,p,q),...,R_n(k,p,q))
where
R_j(k,p,q) = F(k,p,j) - F(k,q,j)
= w(k,p,j) - w(k,q,j).
If F(k,p,j) = F(k,q,j) = ∅, then we let R_j(k,p,q) = ∅. Finally we write
R'(u,k,p,q) = (R_1(k,p,q),...,R_{u-1}(k,p,q))
and
R''(u,k,p,q) = (R_u(k,p,q),...,R_n(k,p,q)).
6.3.2 Parallel Computation with Respect to a Loop Index
We first study the first approach described in the previous section,
i.e. we treat body statements as if they were one statement and try to execute
them in parallel with respect to some loop index.
Let us consider P^m = (I_1,I_2,...,I_n)(S_1;...;S_m). Then we treat m body
statements as one statement S where
$$\mathrm{OUT}(S((i_1,\ldots,i_n))) = \bigcup_{p=1}^{m} \mathrm{OUT}(S_p((i_1,\ldots,i_n)))$$
and
$$\mathrm{IN}(S((i_1,\ldots,i_n))) = \mathrm{IN}(S_1((i_1,\ldots,i_n))) \cup \bigcup_{p=2}^{m} \Bigl[ \mathrm{IN}(S_p((i_1,\ldots,i_n))) - \bigcup_{q=1}^{p-1} \mathrm{OUT}(S_q((i_1,\ldots,i_n))) \Bigr].$$
Having these two sets, we can use results of Section 6.2 directly. For example
let us consider Theorem 2. Then we may modify Theorem 2 as follows. First
suppose an array A_k^p appears in OUT(S((i(1..n)))) and A_k^q appears in
IN(S((i(1..n)))). Then obtain R(k,p,q).
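The two set definitions can be illustrated with a few lines of Python. This is a sketch under assumptions rather than the author's procedure: array elements are encoded as tuples such as ("A", i, j - 1), and the helper name `combine` is hypothetical. The statements are those of the two-statement example in Section 6.3.1.

```python
def combine(statements):
    """Merge body statements into one statement S.

    statements -- list of (inputs, outputs) pairs in textual order, each a
                  set of array elements for one fixed iteration.
    Returns (IN(S), OUT(S)): an input of a later statement is dropped
    when an earlier statement of the same iteration produces it.
    """
    combined_in, combined_out = set(), set()
    for inputs, outputs in statements:
        combined_in |= inputs - combined_out
        combined_out |= outputs
    return combined_in, combined_out


def example_body(i, j):
    # S1: A[I,J] := f(A[I,J-1], B[I-1,J-1])
    s1 = ({("A", i, j - 1), ("B", i - 1, j - 1)}, {("A", i, j)})
    # S2: B[I,J] := g(A[I-1,J-1], B[I,J-1])
    s2 = ({("A", i - 1, j - 1), ("B", i, j - 1)}, {("B", i, j)})
    return [s1, s2]


ins, outs = combine(example_body(5, 7))
print(sorted(ins))
print(sorted(outs))
```

For this body no input of S2 is produced by S1 in the same iteration, so IN(S) is simply the union of the four inputs, as in the combined statement S of Section 6.3.1.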
Theorem 6: (cf. Theorem 2)
For each A_k^p in OUT(S((i(1..n)))), we obtain R(k,p,q) for all q such
that A_k^q is in IN(S((i(1..n)))). Then if there is any R(k,p,q) which satisfies
all three conditions described below, then S cannot be computed in type 1
parallel w.r.t. I_u. Conditions:
(1) R_j(k,p,q) = 0 or ∅ for all j = 1,2,...,u-1,
(2) R_u(k,p,q) > 0, and
(3) there is ℓ (u+1 ≤ ℓ ≤ n) such that for all j (u+1 ≤ j ≤ ℓ-1), R_j(k,p,q) = 0, and
R_ℓ(k,p,q) = ∅ or R_ℓ(k,p,q) < 0.
The above theorem can be proved similarly to Theorem 2, and the
details will not be given.
Since we have to apply the above check for all R(k,p,q) vectors, the
number of checks required is proportional to the total number of R(k,p,q) vectors,
#R(k,p,q). Also since we can try to compute S in type 1 parallel in n ways, i.e.
with respect to I_u (u=1,2,...,n), the total number of checks we would perform is
given by n x (#R(k,p,q)).
6.3.3 Separation of a Loop
6.3.3.1 Introduction
In this section we study replacement (or separation) of a single for
statement with two or more for statements. Let
P^m = (I_1,I_2,...,I_n)(S_1;S_2;...;S_m)
= (I_1,...,I_{u-1}){(I_u,...,I_n)(S_1;...;S_m)}
= (I_1,...,I_{u-1}) P_u^m.
For fixed values of I_1,...,I_{u-1}, let us consider P_u^m.
Figure 6.10. Execution of P_u^m
If we write down the primitive execution order E_0(P_u^m), it can be represented
by the straight line as shown in Figure 6.10. Now let us consider a statement
S((i(1..n),q)) (see Figure 6.10). If it does not receive any computed results
from part A, i.e. if there is no id and a statement S((i(1..u-1),i'(u..n),p))
for which
(id, (i(1..u-1),i'(u..n),p), (i(1..n),q)) ∈ DR
and (i'(u..n),p) < (i(u..n),q)
hold, then it may be computed independently of part A. If this holds for all
p and all (i(u..n)) for some fixed q, then we can compute S_q before any S_p, i.e.
P_u^m ~ (I_u,...,I_n)(S_q); (I_u,...,I_n)(S_1;...;S_{q-1};S_{q+1};...;S_m).
Similarly if S((i(1..n),q)) does not give any output to part B, then we have
P_u^m ~ (I_u,...,I_n)(S_1;...;S_{q-1};S_{q+1};...;S_m); (I_u,...,I_n)(S_q).
6.3.3.2 The Ordering Relation (θ_u) and Separation of a Loop
Now we study how P_u^m may be replaced with several for statements. We
first define the relation θ_u between body statements. The relation is such
that if θ_u(p,q) holds, then for any given (i(u..n)) there are (i'(u..n)) and id
such that
(id, (i(1..u-1),i'(u..n),p), (i(1..n),q)) ∈ DR
and (i'(u..n)) ≤ (i(u..n))
hold. That is, for some fixed q, if there is no p for which θ_u(p,q) holds, then
S_q can be computed before any S_p and
P_u^m ~ (I)(S_q); (I)(S_1;...;S_{q-1};S_{q+1};...;S_m)
where
(I) = (I_u,...,I_n).
Definition 2:
Between two body statements, S_p and S_q, we first obtain R(k,p,q). Then
θ_u(p,q) holds if and only if
(1) R_j(k,p,q) = 0 or ∅ for all j = 1,2,...,u-1,
and (2) the first nonzero element of R''(u,k,p,q) is either ∅ or a positive
number. Also if all elements of R''(u,k,p,q) are 0 then p < q
holds.
We also write θ_u = {(p,q) | θ_u(p,q) holds}.
If A_k appears more than once in S_q, then we modify Definition 2 as
follows. Suppose A_k appears twice in S_q as the q_1-th and q_2-th occurrences. Then
we construct two vectors R(k,p,q_1) and R(k,p,q_2). For each vector we check the
above two conditions. If at least one of two vectors satisfies the two
conditions, then we let θ_u(p,q) hold.
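The test of Definition 2 can be written as a short predicate. The Python sketch below is one reading of the definition, not the author's code. It assumes that an absent index (written ∅ above) is encoded as `None`, that ∅ counts as a nonzero element in condition (2), and that R is passed as a tuple (R_1, ..., R_n). The name `theta_holds` is hypothetical.

```python
EMPTY = None  # assumed encoding of an "empty" entry of an R vector


def theta_holds(r, u, p, q):
    """Return True when the vector r = R(k,p,q) satisfies Definition 2
    with respect to the 1-based loop index u."""
    head, tail = r[:u - 1], r[u - 1:]
    # Condition (1): every entry before position u is 0 or empty.
    if any(x is not EMPTY and x != 0 for x in head):
        return False
    # Condition (2): inspect the first entry of R'' that is not 0.
    for x in tail:
        if x is EMPTY:
            return True
        if x != 0:
            return x > 0
    # All entries of R'' are 0: same element, same iteration,
    # so the textual order of the statements decides.
    return p < q


print(theta_holds((0, 1, 1), 1, 2, 1))
print(theta_holds((-1, 2, 0), 1, 1, 2))
```

When an array occurs several times in S_q, the predicate would be applied once per occurrence and the results combined with a logical or, as the modified definition describes.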
Example 6:
SI: A^1,1,,+3] :=A 2 [I 2 ,I 3 ] + A^[ 1^1] ;
S2: A 2 [I 2 +1,I 3 1] := A 1 [I 1 ,I 2 3] +^[1^1^;
S3: ^[1^5,1^1] : A^Lylj] + Aj; I^^+l] ;
S^: A^LJ : = A^I^l] + A^I^I^];
give
R(1,1,2) = (1,6,0), R(1,1,3) = (1,2,0),
R(2,2,1) = (0,1,1), R(2,2,4) = (0,1,1),
R(3,3,2) = (5,0,1) and R(4,4,1) = (1,0,0).
Then we have θ_u = {(2,1), (2,4), (4,1)}.
Now let us study Definition 2 in detail. First we note that id in
Figure 6.10 corresponds to those A_k for which
F(k,p,j)(i_j) = F(k,q,j)(i_j)
holds for all j = 1,2,...,u-1 and
F(k,p,j)(i_j') = F(k,q,j)(i_j)
holds for all j = u,u+1,...,n. The former implies that
F(k,p,j) = F(k,q,j)
or
R_j(k,p,q) = 0 (or ∅)
for j = 1,2,...,u-1 which is equivalent to the first condition of Definition 2.
Next note that (i'(u..n),p) < (i(u..n),q) implies that either
(i) (i'(u..n)) < (i(u..n))
or (ii) (i'(u..n)) = (i(u..n)) and p < q.
In the first case
(F(k,h_p,u),...,F(k,h_p,n)) > (F(k,h_q,u),...,F(k,h_q,n))
must hold. In the second case
(F(k,h_p,u),...,F(k,h_p,n)) = (F(k,h_q,u),...,F(k,h_q,n))
must hold. These two make up the second condition.
From θ_u we can construct a dependence graph D_u with m nodes each of
which represents a body statement, e.g. from Example 6 we get the graph whose
arrows are θ_u(2,1), θ_u(2,4) and θ_u(4,1).
In D_u we call a series of θ_u, θ_u(p_1,p_2), θ_u(p_2,p_3), ..., θ_u(p_j,p_{j+1}), ...,
θ_u(p_{k-1},p_k) a chain and write ch(p_1,p_k) for it. If p_k = p_1 then it is called a
mesh M. We say that a node p_j is in the chain ch(p_1,p_k) (or in the mesh M), or
the chain ch(p_1,p_k) (or the mesh M) includes p_j. Note that for nodes p and q
there may be more than one chain which connects p to q.
Now let
Z = (p  there is no q such that 6 (q, p) holds)
and
Z = (p  there is no q such that 6 (p, q) holds}.
Furthermore let
PD(p) = {q | ch(q,p) exists} ∪ {p}
and
SC(p) = {q | ch(p,q) exists} ∪ {p}
(PD for predecessors and SC for successors). Then we classify nodes in D_u as
follows:
Z_1 = {p | for all r ∈ PD(p), there is no mesh in D_u which includes r},
Z_3 = {p | for all r ∈ SC(p), there is no mesh in D_u which includes r},
and
Z_2 = N - Z_1 - Z_3.
Let Z_1 (or Z_3) = {p_1,p_2,...,p_t}. Then we can order this set as p_1', p_2', ..., p_t'
in such a way that θ_u(p_i',p_j') does not hold if i > j.* Let us write
Z_1^θ (or Z_3^θ) for a resultant ordered set. Also we order Z_2 = {q_1,q_2,...,q_v} as
Z_2^θ = (q_1',q_2',...,q_v') in such a way that q_i' < q_j' if i < j.
Now given a loop P^m where
P^m = (I_1,I_2,...,I_n)(S_1;S_2;...;S_m)
= (I_1,...,I_{u-1})((I_u,...,I_n)(S_1;S_2;...;S_m))
= (I_1,...,I_{u-1}) P_u^m,
we build the dependence graph D_u and obtain sets Z_1, Z_2 and Z_3, say Z_1 =
{p_1,p_2,...,p_u}, Z_2 = {q_1,q_2,...,q_v} and Z_3 = {r_1,r_2,...,r_w} (m = u+v+w).
From Z_1 and Z_3 we obtain ordered sets Z_1^θ and Z_3^θ, say Z_1^θ = (p_1',p_2',...,p_u')
and Z_3^θ = (r_1',r_2',...,r_w'). Also we have Z_2^θ = (q_1',q_2',...,q_v'). Then
P_u^m ~ (I)(S_{p_1'}); (I)(S_{p_2'}); ...; (I)(S_{p_u'}); (I)(S_{q_1'};...;S_{q_v'});
(I)(S_{r_1'}); (I)(S_{r_2'}); ...; (I)(S_{r_w'})
where (I) = (I_u,I_{u+1},...,I_n).
* Note that Z_1 (or Z_3) together with θ_u makes a graph which does not
contain any mesh. To order Z_1 (or Z_3) the technique discussed in
Chapter 7 may be used.
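The classification into Z_1, Z_2 and Z_3 amounts to a reachability computation on the dependence graph. The Python sketch below illustrates one way to do it and is not taken from the original text. It assumes θ_u is given as a set of pairs (p, q), read as an arrow from p to q, and it places a node that satisfies both definitions in Z_1 only (an assumption made here so that the three sets are disjoint). The helper names are hypothetical.

```python
def descendants(succ, start):
    """Nodes reachable from start by a chain of one or more arrows."""
    seen, stack = set(), list(succ.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return seen


def classify(nodes, theta):
    """Split the body statements into (Z_1, Z_2, Z_3)."""
    succ, pred = {}, {}
    for p, q in theta:
        succ.setdefault(p, set()).add(q)
        pred.setdefault(q, set()).add(p)
    # A node lies in some mesh exactly when it can reach itself.
    in_mesh = {p for p in nodes if p in descendants(succ, p)}
    # Z_1: no predecessor (PD includes the node itself) lies in a mesh.
    z1 = {p for p in nodes
          if not ((descendants(pred, p) | {p}) & in_mesh)}
    # Z_3: no successor lies in a mesh; nodes already in Z_1 are left out.
    z3 = {p for p in nodes
          if not ((descendants(succ, p) | {p}) & in_mesh)} - z1
    z2 = set(nodes) - z1 - z3
    return z1, z2, z3


# A hypothetical five-statement body in which statements 2 and 3 form a mesh.
print(classify(range(1, 6), {(1, 2), (2, 3), (3, 2), (3, 4), (5, 4)}))
# The relation of Example 6 contains no mesh, so every node is separable.
print(classify(range(1, 5), {(2, 1), (2, 4), (4, 1)}))
```

Statements in Z_1 can be moved in front of the remaining loop, those in Z_3 behind it, and the statements of Z_2 stay together in one for statement.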
Thus we have replaced a loop P^m with as many for statements as possible. We
say that p is separable from P^m if p ∈ Z_1 (or p ∈ Z_3) with respect to u, and that
P^m is separable with respect to I_u. Also we say that P^m is separated with
respect to I_u if P_u^m is replaced by many for statements as we showed above.
6.3.3.3 Temporary Storage
Now let us study the following:
P^2: for I_1 := 1 step 1 until 1 do
for I_2 := 1 step 1 until 40 do
begin
S1: A_1[I_1] := A_2[I_2] + A_3[I_2];
S2: A_4[I_2] := A_1[I_1] + A_5[I_2];
end
Then we have R(1,1,2) = (0,∅) or θ_1(1,2) holds, and D_1 consists of a single
arrow from node 1 to node 2.
Hence we get Z_1 = (1,2) and
P_1^2 ~ (I_1,I_2)(S1: A_1[I_1] := A_2[I_2] + A_3[I_2]);
(I_1,I_2)(S2: A_4[I_2] := A_1[I_1] + A_5[I_2]).
This, however, does not give the same results as produced by the original P^2.
Note that after the execution of the first loop (I_1,I_2)(S1), the only outcome
is A_1[I_1] (i.e. A_1[1]) = A_2[40] + A_3[40]. However the second loop, (I_1,I_2)(S2),
requires forty different inputs, i.e. A_2[1] + A_3[1], ..., A_2[40] + A_3[40]. Hence
it becomes necessary to modify P^2 as follows:
P^2': (I_1,I_2)(S1: A_1[I_1,I_2] := A_2[I_2] + A_3[I_2]);
(I_1,I_2)(S2: A_4[I_2] := A_1[I_1,I_2] + A_5[I_2]).
Figure 6.11. An Introduction of Temporary Locations
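The need for the extra subscript can be demonstrated numerically. The Python sketch below is an illustration only. The second operand of S2 is hard to read in the source, so the name `a5` used for it here is an assumption, as are the function names and the small loop lengths.

```python
def original(n1, n2, a2, a3, a5):
    """The loop P^2 as written, with both statements in one body."""
    a1 = [0] * (n1 + 1)
    a4 = [0] * (n2 + 1)
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            a1[i1] = a2[i2] + a3[i2]    # S1
            a4[i2] = a1[i1] + a5[i2]    # S2
    return a4


def separated_naive(n1, n2, a2, a3, a5):
    """Separation without temporary locations: A_1[I_1] is overwritten."""
    a1 = [0] * (n1 + 1)
    a4 = [0] * (n2 + 1)
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            a1[i1] = a2[i2] + a3[i2]
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            a4[i2] = a1[i1] + a5[i2]
    return a4


def separated_expanded(n1, n2, a2, a3, a5):
    """Separation after A_1 has been given the extra subscript I_2."""
    a1 = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    a4 = [0] * (n2 + 1)
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            a1[i1][i2] = a2[i2] + a3[i2]
    for i1 in range(1, n1 + 1):
        for i2 in range(1, n2 + 1):
            a4[i2] = a1[i1][i2] + a5[i2]
    return a4


a2 = [0, 1, 2, 3, 4]
a3 = [0, 10, 20, 30, 40]
a5 = [0, 100, 200, 300, 400]
print(original(1, 4, a2, a3, a5))
print(separated_naive(1, 4, a2, a3, a5))
print(separated_expanded(1, 4, a2, a3, a5))
```

Only the expanded version reproduces the original results, since the naive separation keeps just the last value written to A_1[1].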
In general we apply the following transformation rule on a loop when
it is separated. Assume that S_p and S_q are body statements in a loop P^m, and
θ_u(p,q) holds. Further assume that p is separated from P^m with respect to
I_u (i.e. p ∈ Z_1). Now let us consider the vector R''(u,k,p,q). Let the
value of the first element which is neither 0 nor ∅ be e and its position be
ℓ. Then we let
R̄(u,k,p,q) = {j | u ≤ j < ℓ and R_j(k,p,q) = ∅}.
We order elements of R̄(u,k,p,q) by their positions in R''(u,k,p,q) and write
R̄(u,k,p,q) = (r(1),r(2),...,r(t)).
Then we apply the following on the loop P^m.
Transformation T:
Transformation T is defined for the cases e > 0 and e < 0
separately.
(1) e > 0.
Change F(k,p,r(j)) and F(k,q,r(j)) to I_{r(j)} for j = 1,2,...,t.
(2) e < 0.
(i) Change F(k,p,r(j)) to I_{r(j)} for j = 1,...,t.
(ii) Change F(k,q,r(j)) to the following ALGOL program for
j = 1,...,t-1: "if (I_{r(j+1)} = 1) and (I_{r(j+2)} = 1) and ... and
(I_{r(t)} = 1) then (if I_{r(j)} = 1 then L_{r(j)} else I_{r(j)} - 1)
else I_{r(j)}." Also change F(k,q,r(t)) to the following ALGOL
program: "if (I_{r(t)} = 1) then L_{r(t)} else I_{r(t)} - 1."
Example 7:
Let R''(5,k,p,q) = (R_5(k,p,q),...,R_9(k,p,q)) = (0,∅,0,∅,-1).
Then we get e = -1 and ℓ = 9 and R̄(5,k,p,q) = (6,8). Also assume that |L_6| =
|L_8| = 3. Originally S_p and S_q may look like
y WyW :=f p (  );
S q : .. := f (...^[I^I^^yi],...);
Now after Transformation T is applied, S_p and S_q become:
y : w^'VW :=f (  );
S q' : " := t q^'" fA T/S I l' I 5 > B 6>I 7 >Bg,I 9 ],...),
where
B_6 = if I_8 = 1 then (if I_6 = 1 then 3 else I_6 - 1) else I_6
and B_8 = if I_8 = 1 then 3 else I_8 - 1.
Note that by applying Transformation T, temporary locations are
eventually introduced. For example in Example 7, A_k is changed to a seven
dimensional array from a four dimensional array by Transformation T.
6.3.4 Parallelism Between Body Statements
6.3.4.1 Introduction
Now we describe the parallelism between body statements. As stated
before it becomes necessary to modify index expressions. In this section we
give an algorithm which modifies index expressions properly.
We first describe the algorithm in terms of a restricted loop with
only one loop index, i.e. P^m = (I_1)(S_1;S_2;...;S_m). Accordingly every array
identifier in P^m is of a form A_k^h[F(k,h,1)] (= A_k^h[I_1+w(k,h,1)]) where this is
the h-th occurrence of A_k in P^m. For convenience we drop the subscript of the
loop index. The primitive execution order for P^m becomes
E_0(P^m) = {((i,p), V((i,p)|(L,m))) | (1,1) ≤ (i,p) ≤ (L,m)}.
For a given loop P^m, we consider the I-S plane which is an |L| by
m grid. For example we have the following 40x3 grid for:
for I := 1 step 1 until 40 do
begin
81: A^I1] := A 2 [I] + A^Il];
S2: A^I] := A^I+1]  A^I];
S3: A 3 [I] := A^I] + Ag[I];
end.
On this grid, we only show the relation DR, e.g. (A_3[i-1], (i-1,3), (i,1)) ∈ DR.
Figure 6.12. Wave Front for Simultaneous Execution of Body Statements
Then the objective of this section is to discover a wave front W (cf. Section
6.2.4) which separates all inputs from the computation, e.g. in Figure 6.12
inputs (shown by ○) to S((i,1)), S((i+1,2)) and S((i,3)) (shown by ●) lie above
the wave front indicated by a dotted line. Hence S((i,1)), S((i+1,2)) and
S((i,3)) can be computed simultaneously while I takes values 1,2,...,40
sequentially. In general to discover a wave front is equivalent to discovering
a constant C(p) for each body statement S_p so that all statements S((i-C(1),1)),
..., S((i-C(p),p)), ..., S((i-C(m),m)) can be computed simultaneously.
6.3.4.2 The Statement Dependence Graph and the Algorithm
Let us consider the I-S plane again and consider the computation
S((i,p)). Assume that there is id such that
(id, (j,q), (i,p)) ∈ DR
where either (i) j = i and q < p, or (ii) j < i and p ≠ q, then clearly S((i,p))
and S((j,q)) cannot be computed simultaneously.
Definition 3:
The statement dependence graph (cf. the dependence graph in Section
6.3.3), D(P^m), is defined by a set N of nodes 1,2,...,m each of which
corresponds to a body statement of P^m and the arrow relation α. From node p
to q there is an arrow α(p,q) if and only if either one of the following two
conditions holds.
(1) For fixed i, there is k such that
A_k^h[F(k,h,1)(i)] ∈ OUT(S((i,p))),
A_k^g[F(k,g,1)(i)] ∈ IN(S((i,q))),
F(k,h,1)(i) = F(k,g,1)(i)
and p < q.
(2) For fixed i, there exist k and i' such that
A_k^h[F(k,h,1)(i')] ∈ OUT(S((i',p))),
A_k^g[F(k,g,1)(i)] ∈ IN(S((i,q))),
F(k,h,1)(i') = F(k,g,1)(i)
and i' < i.
In the first case we label the arrow 0 and write ℓ(p,q) = 0. In the second
case the arrow is labeled 1 and we write ℓ(p,q) = 1.
The statement dependence graph for the previous example is: (graph not reproduced)
A chain of arrows, α(p_1,p_2), α(p_2,p_3), ..., α(p_{k-1},p_k), α(p_k,p_1) in
D(P^m) is called a mesh M and we say e.g. α(p_i,p_{i+1}) is in M. If ℓ(p_i,p_{i+1}) = 0
for some arrow in M, then M is called a part zero mesh. The following lemma
is obtained immediately.
Lemma 2:
If D(P^m) contains a part zero mesh, then there is no wave front for
P^m.
Henceforth we assume that D(P^m) has no part zero mesh. Given D(P^m),
we define a subset Z of N as follows:
Z = {p | there is no q such that ℓ(p,q) = 0 or ℓ(q,p) = 0}.
Z together with arrows gives a subgraph D_Z of D(P^m). Further we let
Z_1 = {p | p ∈ Z and there is no q such that ℓ(q,p) = 0}.
Now we give an algorithm to find a wave front for D(P^m).
Algorithm 1:
(1) Let C(p) = +∞ for all p ∈ N.
(2) (i) Take any p from Z_1. If Z_1 = ∅, then go to Step (5).
(ii) Let C(p) = 0.
(3) (i) If there are nodes s and t such that α(s,t) exists, ℓ(s,t) = 1 and
C(s) > C(t), then we let the value of C(s) be equal to C(t).
(ii) If there are nodes s and t such that α(s,t) exists, ℓ(s,t) = 0 and
C(s) ≥ C(t), then let the value of C(s) be equal to C(t) - 1.
Repeat (i) and (ii) until there are no s and t which satisfy (i) or (ii) in
D(P^m).
(4) (i) If for all p in Z, C(p) ≠ +∞, then go to Step (5).
(ii) Otherwise take any p from Z for which there is q in Z such that
α(p,q) exists and C(q) ≠ +∞. Let M = max {C(s)} where s ∈ Z and
C(s) ≠ +∞. Then let C(p) = max {C(q) + 1, M}. Go to Step (3).
(5) For all p in Z with C(p) = +∞, let C(p) = M where M = max {C(s)} over
s ∈ Z with C(s) ≠ +∞. If Z = ∅, then let C(p) = 0 for all p in Z.
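The constants C(p) must satisfy C(p) < C(q) for every arrow labeled 0 and C(p) ≤ C(q) for every arrow labeled 1. The Python sketch below finds such constants by repeated relaxation in the style of the Bellman-Ford method. It is a swapped-in technique for illustration, not a transcription of Algorithm 1, and the function name and the example arrow lists are hypothetical.

```python
def wave_constants(m, arrows):
    """Find constants C(1..m) for a statement dependence graph.

    arrows -- iterable of (p, q, label): an arrow from p to q labeled
              0 (same iteration) or 1 (earlier iteration).
    Returns a dict {p: C(p)}, or None when the constraints cannot be met,
    which happens exactly when some mesh contains an arrow labeled 0.
    """
    arrows = list(arrows)
    c = {p: 0 for p in range(1, m + 1)}
    for _ in range(m):
        changed = False
        for p, q, label in arrows:
            # Label 0 demands C(p) <= C(q) - 1, label 1 demands C(p) <= C(q).
            bound = c[q] - 1 if label == 0 else c[q]
            if c[p] > bound:
                c[p] = bound
                changed = True
        if not changed:
            return c
    return None  # still changing after m passes: a part zero mesh exists


# Statements 1 and 2 form a mesh of arrows labeled 1; statement 3 uses
# the output of statement 2 within the same iteration.
print(wave_constants(3, [(1, 2, 1), (2, 1, 1), (2, 3, 0)]))
# A mesh that contains an arrow labeled 0 admits no wave front.
print(wave_constants(2, [(1, 2, 0), (2, 1, 1)]))
```

With the resulting constants, the statements S((i-C(p),p)) for p = 1, ..., m form one wave front, and a result of `None` corresponds to the situation of Lemma 2.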
Example:
(1) D: (graph not reproduced)
(2) Z = {1,2,3,4,6,7,8}.
(3) Let C(1) = 0 and apply Step (3) of Algorithm 1. Then we get C(1) = 0
since there is no q such that α(q,1) exists.
(4) Let C(2) = 1.
(5) Let C(3) = 1.
(6) Let C(4) = 2. Then we apply Step (3) of Algorithm 1 on α(8,4), α(8,5),
α(5,2), α(2,1), α(7,8), α(6,7), α(9,6), α(3,9), α(3,1) and α(9,1). And we get
C(5) = C(8) = 2, C(2) = 1, C(7) = 1, C(6) = 0, C(9) = 0, C(3) = 0, and
C(1) = 1.
(7) There is no p in Z with C(p) = +∞. Hence Algorithm 1 terminates and
we get
C(1) = 1, C(2) = 1, C(3) = 0, C(4) = 2, C(5) = 2, C(6) = 0, C(7) = 1,
C(8) = 2 and C(9) = 0.
Figure 6.13 A Wave Front for Example 10
Now we show that Algorithm 1 gives a valid wave front. To prove this first
we show that Algorithm 1 is effective, i.e. every step of Algorithm 1 is always
applicable and terminates.
Lemma 3:
Algorithm 1 is effective.
Proof:
That Steps (1), (2), (4) and (5) are effective is clear. Now we show that
Step (3) is effective. First we define U(p) to be a set of nodes such that
U(p) = {q | there is a chain of arrows α(p_1,p_2), α(p_2,p_3), ...,
α(p_{n-1},p_n) where p_1 = q and p_n = p} ∪ {p},
e.g. for D(P^8) (graph not reproduced) we have U(5) = {1,2,4,5,7,9}.
U(p) = U(q) implies that there is a mesh which
includes p and q. By assumption this mesh is not a part zero mesh and C(q)
will be assigned the same value as C(p) in a finite number of steps after C(p)
has been assigned a value.
If U(p) ⊃ U(q), then C(q) will be assigned a value less than or equal to
C(p) in a finite number of steps after C(p) has been assigned a value.
Thus after a finite number of applications Step (3) eventually terminates.
Hence Algorithm 1 is effective.
(Q.E.D.)
Theorem 6:
Algorithm 1 gives a valid wave front.
Proof:
To prove this, it is enough to show that (i) if ℓ(p,q) = 0, then
C(p) < C(q) and (ii) if ℓ(p,q) = 1, then C(p) ≤ C(q). However, from Steps (3)
and (4) of Algorithm 1, clearly the above conditions hold. Also if p is
assigned a value C(p) by Step (5) it implies that either (i) there is r ∈ U(q)
where q ∈ Z, and ℓ(r,q) = 1 or (ii) there is no such r. In the second case
C(p) may take any value (S_p may be computed at any time), and in the first case
C(q) ≥ C(r) must hold. Hence we let C(q) = max {C(s)}.
(Q.E.D.)
To handle a restricted loop with more than one loop index, we
modify Definition 3 as follows. For each S_p and S_q in P^m we first obtain a
vector R(k,p,q).
Definition 3':
The statement dependence graph of P^m, D(P^m), is defined by a set N of
nodes 1,2,...,m each of which corresponds to a body statement of P^m and the
arrow relation α. There is an arrow α(p,q) if and only if either one of the
following holds:
(1) R_j(k,p,q) = 0 or ∅ for all j = 1,2,...,n and p < q. We let ℓ(p,q) = 0.
(2) V((R_1(k,p,q),...,R_n(k,p,q))|B) > V((0,...,0)|B) where B = (|L_1|,...,|L_n|).
We let ℓ(p,q) = 1.
From Definition 3' clearly
(1) ℓ(p,q) = 0 if S_q((i_1,i_2,...,i_n)) uses the output of S_p((i_1,i_2,...,i_n)) and
p < q.
(2) ℓ(p,q) = 1 if S_q((i_1,i_2,...,i_n)) uses the output of S_p((i_1',i_2',...,i_n'))
where (i_1',i_2',...,i_n') < (i_1,i_2,...,i_n).
Algorithm 1 is then applied on D(P^m). For example let P^m be
SI: A^Ig^g] := A 2 [I 2 ] + A 3 [I 2 1,I 3+ 1];
S2: Ag[I 2 +l] := A^I^Ig] + k^I^I^i
S3: A 3 [I 2 ,I 3 ] := Ag[Igl];
Then we have the following statement dependence graph: (graph not reproduced)
6.3.5 Discussion
Given a loop P^m = (I_1,...,I_n)(S_1;...;S_m), we first try to execute body
statements in parallel with respect to some loop index. If this fails for any
loop index, or if this does not give a satisfactory result, then we try to
replace the loop with many for statements. Then we can attempt to execute a
body statement (or body statements) of a resultant for statement in parallel
with respect to some loop index. If this fails, then we may try the third
approach, i.e. we try to execute all body statements simultaneously while
loop indices vary sequentially. Often the number of loop indices, n, is
very small (typically n = 2), and it will be easy to try all variations.
7. EQUALLY WEIGHTED TWO PROCESSOR SCHEDULING PROBLEM
7.1 Introduction
This chapter gives a solution to the so-called equally weighted
two processor scheduling problem. Informally the problem may be stated as
follows. Given a set of tasks along with a set of operational precedence
relationships that exist between certain of these tasks, and given two identical
processors (PE's), P(2), how does one schedule these tasks on the two processors
so that they execute in the minimum time? It is assumed that either one of
two processors is capable of processing any task in the same amount of time,
say 1 unit of time. Informally a set of tasks together with precedence relations
forms a graph.
Clearly the problem of scheduling any given equally weighted task
graph on k identical processors, P(k), in an optimal way is effectively
solvable by exhaustion. But this is far from possible in practice. The only
practical solution so far obtained is a result for scheduling a rooted tree
(a restricted class of graphs) with equally weighted tasks on k identical
processors, P(k) [21].
Now let us study how the equally weighted two processor scheduling
problem is related to the computation of arithmetic expressions on a parallel
machine.
In Chapter 3, the parallel computation of an arithmetic expression by
building a syntactic tree was studied. There we were only concerned with the
height of a tree and reducing it by distribution, and we did not introduce any
physical restrictions of a machine. For example, in reality, the size of a
machine, i.e. the number of PE's is limited rather than arbitrarily big. One
problem which will arise immediately is whether the distribution algorithm
should be applied or not to reduce tree height since distribution introduces
additional operations. For example assume that we have a two PE machine, P(2).
Now let us consider two arithmetic expressions, A = a(bc+d) + e and B =
abc(defgh+i). Then we have h[A] = 4, h[B] = 5, h[A^d] = h[abc+ad+e] = 3 and
h[B^d] = h[abcdefgh+abci] = 4. Thus distribution reduces the height of T[A] and
T[B]. On the other hand Figure 7.1 shows that if A, B, A^d and B^d are computed
on P(2), A^d is still computed in less time than A while B^d now takes more time
than B.
Assume that we get A^d from A by the distribution algorithm. If the
size of a machine is limited, then it may not necessarily be true that A^d can
be computed in less time than A even if h[A^d] < h[A] holds. Actually it is a
nontrivial problem to decide whether distribution is to be made or not to
reduce computation time (which is different from tree height) if the size of
a machine is limited. It depends on the form of an arithmetic expression as
well as the machine organization. We will not go into this problem any
further.
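The trade-off between tree height and operation count can be made concrete with a simple lower bound. On two PE's a computation needs at least as many steps as the tree is high, and at least half as many steps as there are operations. The Python sketch below is an illustration under assumptions: the operation counts are obtained here by counting the additions and multiplications in each expression, and the bound is only a lower bound, not a schedule.

```python
from math import ceil


def two_pe_lower_bound(height, operations):
    """Lower bound on the number of steps on P(2): neither the tree height
    nor half of the operation count can be undercut."""
    return max(height, ceil(operations / 2))


# (tree height, number of operations), counted by hand for this sketch.
cases = {
    "A   = a(bc+d)+e":         (4, 4),
    "A^d = abc+ad+e":          (3, 5),
    "B   = abc(defgh+i)":      (5, 8),
    "B^d = abcdefgh+abci":     (4, 11),
}

for name, (height, operations) in cases.items():
    print(name, "needs at least", two_pe_lower_bound(height, operations), "steps")
```

Under these counts the bound evaluates to 4, 3, 5 and 6 steps, the same values that the text reports for Figure 7.1: distribution helps A but hurts B once only two PE's are available.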
Now let us look at the situation from a different point of view.
Given an arithmetic expression A and its minimum height tree, it is possible
to take advantage of common expressions to reduce the number of operations to
be performed in hopes of reducing computation time. For example let us
consider the computation of A = (a+b+c+d)ef + (a+b)g on P(2). If we evaluate
(a+b) only once then A can be computed in 4 steps on P(2) while if (a+b) is
evaluated twice, then it takes 5 steps to compute A (see Figure 7.2).
(a) A (4 steps)
(b) A^d (3 steps)
(c) B (5 steps)
(d) B^d (6 steps)
Figure 7.1. Computation of Nondistributed and Distributed Arithmetic Expressions on P(2)
Our main concern in Chapter 2 was to reduce tree height assuming
that the size of a machine is unlimited. Hence we were not interested in
reducing the number of operations. As mentioned there, it was an open
problem to find out common expressions while keeping the height of a tree
minimum. However, if we could take advantage of common expressions while
keeping the height of a tree minimum, then we would obtain a graph of operations
rather than a tree for an arithmetic expression (see Figure 7.2 (c)).
Figure 7.2. Common Expression: (a) a minimum height tree for A; (b) (a+b) computed twice; (c) (a+b) computed once
While we do not know how to compute an arithmetic expression A on
P(2) in the minimum time (e.g. should distribution be done?), the scheduling
algorithm presented in this chapter schedules a given graph of operations for
an arithmetic expression on P(2) so that the given graph is processed in the
minimum time, assuming that each PE of P(2) may perform addition or multiplication
independently but in the same amount of time, say 1 unit of time. Note
that we may be able to construct many graphs for A. Hence while the scheduling
algorithm schedules a given graph for A on P(2) in an optimal way, the algorithm
does not necessarily compute A itself in the minimum amount of time.
7.2 Job Graph
Let G be an acyclic graph with nodes $N_i$ (i = 1, 2, ..., n) and a set of
directed arrows connecting pairs of nodes. For nodes N and N' we write $N \to N'$
if there is an arrow from N to N'. We say that N is an immediate predecessor
of N' and N' is an immediate successor of N. Also we let
$SR(N) = \{N' \mid N \to N'\}$ (a set of successors of N)
and $PR(N) = \{N' \mid N' \to N\}$ (a set of predecessors of N).
Nodes which have no incoming arrows are called initial nodes, and
nodes which have no outgoing arrows are called terminal nodes. For the sake
of simplicity we assume that a graph has one initial node and one terminal
node. If there is more than one, then we can add a dummy initial/terminal
node. We write $N_I$ and $N_T$ for them, respectively. We also write $N \Rightarrow N'$ if
there is a chain $N_1, N_2, \ldots, N_m$ such that $N \to N_1 \to \cdots \to N_m \to N'$, or $N \to N'$.
Furthermore we write $N \not\to N'$ or $N \not\Rightarrow N'$ to show that the relation $N \to N'$ or
$N \Rightarrow N'$ does not hold.
Definition 1:
The forward distance (or level) from the initial node to a node N,
$d_I(N)$, is the length of the longest path from the initial node to N, thus
$d_I(N_I) = 0$. Similarly the backward distance from the terminal node to N, $d_T(N)$,
is defined, thus $d_T(N_T) = 0$.
Thus a node N cannot be initiated before time $d_I(N)$ but may be
initiated at $d_I(N)$ or at any time after that.
Definition 2:
The height of a graph G, h(G), is defined as
$h(G) = d_I(N_T)$.
Then we say that a graph G is tight if for all nodes N,
$d_I(N) + d_T(N) = h(G)$.
Otherwise we say that a graph G is loose.
Example 1:
Figure 7.3. A Loose Graph and a Tight Graph
The graph $G_1$ is a loose graph because $d_I(N_2) + d_T(N_2) = 2 \neq h(G_1)$,
whereas the graph $G_2$ is a tight graph.
First we shall study an optimum scheduling for a tight graph. A
scheduling for a loose graph will be discussed in a later section. In what
follows, we use the words "process" and "schedule" interchangeably.
Definition 3:
Let A(i) be the set of all nodes of forward distance i, i.e.
$A(i) = \{N \mid d_I(N) = i\}$.
This is called a node set.
All nodes in A(i) can be scheduled independently of each other since
there can be no precedence relations between nodes in A(i). In other words,
if $N \Rightarrow N'$ then N and N' cannot be processed simultaneously.
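Definitions 1 to 3 can be made concrete with a short Python sketch. The function name, the arrow-list input format and the two small graphs are illustrative assumptions, not taken from the thesis; the graph is assumed to have a single initial and a single terminal node, as in the text.

```python
from functools import lru_cache

def analyse(arrows):
    """Return (h(G), is_tight, node sets A(i)) for a DAG given as (N, N') arrows."""
    succ, pred = {}, {}
    for a, b in arrows:
        succ.setdefault(a, set()).add(b)
        succ.setdefault(b, set())
        pred.setdefault(b, set()).add(a)
        pred.setdefault(a, set())

    @lru_cache(maxsize=None)
    def d_i(n):  # forward distance: longest path from the initial node
        return 0 if not pred[n] else 1 + max(d_i(p) for p in pred[n])

    @lru_cache(maxsize=None)
    def d_t(n):  # backward distance: longest path to the terminal node
        return 0 if not succ[n] else 1 + max(d_t(s) for s in succ[n])

    height = max(d_i(n) for n in succ)
    tight = all(d_i(n) + d_t(n) == height for n in succ)
    node_sets = {}
    for n in succ:
        node_sets.setdefault(d_i(n), set()).add(n)
    return height, tight, node_sets

# Hypothetical graphs: in the first, node c lies on a short side path.
loose = [("I", "a"), ("a", "b"), ("b", "T"), ("I", "c"), ("c", "T")]
tight = [("I", "a"), ("I", "b"), ("a", "T"), ("b", "T")]

print(analyse(loose)[:2])  # (3, False)
print(analyse(tight)[:2])  # (2, True)
```

In the first graph node c has forward distance 1 and backward distance 1, which sum to 2 rather than h(G) = 3, so the graph is loose.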
Now we have the following lemma which characterizes a tight graph.
Lemma 1:
If a graph G is tight, then for every node N, there exist $N' \in SR(N)$
and $N'' \in PR(N)$ such that $d_I(N') = d_I(N) + 1$ and $d_I(N'') = d_I(N) - 1$. (For the
terminal node $SR(N_T) = \emptyset$ and for the initial node $PR(N_I) = \emptyset$. Those are
exceptions.)
Proof:
Obvious by Definition 2.
(Q.E.D.)
Corollary 1:
Let G be a tight graph. Let N be a node of G. Let $N \in A(t)$. Then
for every i, $0 \le i \le t - 1$, there is at least one node $N' \in A(i)$ such that $N' \Rightarrow N$.
Also for every j, $t + 1 \le j \le h(G)$, there is at least one node $N'' \in A(j)$
such that $N \Rightarrow N''$.
Definition 4:
To p-schedule a set Q of nodes is to partition Q into subsets of size
2 in an arbitrary way (if |Q| is odd, there will be one subset of size 1) and to
order those subsets in an arbitrary way.
A node N is said to be available if all predecessors of N have been
processed.
7.3 Scheduling of a Tight Graph
Having these definitions, now we discuss a scheduling of a tight graph
G on two processors. The idea of this scheduling scheme is rather simple. We
start checking A(i) from i = 1 to h(G). For each i, if |A(i)| is even, then
we p-schedule A(i), and no processor time will be wasted. If |A(i)| is odd,
and if we simply p-schedule A(i), then there will be one node, N, left which
cannot be processed in parallel with another node in A(i). Thus we will waste
processor time. Therefore a node which can be processed in parallel with the
above left out node N must be found. Where can that node be found? It will be
shown that we have to look as far as the smallest i' larger than i with |A(i')|
odd to find it. Thus the amount of work to look ahead is always bounded.
Before we go further, a few more definitions are in order.
For some t and n, let us consider a set $A_n^t = \bigcup_{i=0}^{n} A(t+i)$. Now let us
take a node from each of A(t+j) and A(t+i) (j < i). Let them be $N^j$ and $N^i$. If
$N^j \not\Rightarrow N^i$, then $N^j$ and $N^i$ may be processed simultaneously, providing that all
predecessors of $N^j$ and $N^i$ have been processed.
Now we establish this relation formally on $A_n^t$.
Definition 5:
The p-line relation ($\equiv$) between two nodes N and N' in $A_n^t$ is defined
as follows.
$N \equiv N'$ if and only if
(1) $N \not\Rightarrow N'$ and $N' \not\Rightarrow N$
and (2) $d_I(N) \neq d_I(N')$.
We also write (N, N') for $N \equiv N'$. A pair (N, N') is called a p-line pair.
Note that in general (N, N') and (N', N'') do not necessarily imply (N, N'').
Further we define $A_n^t(p) = \{(N, N') \mid N \in A(t+i),\ N' \in A(t+j),\ 0 \le i, j \le n,\ N \equiv N'\}$,
i.e. $A_n^t(p)$ is the set of all pairs of nodes in $A_n^t$ between which the p-line relation
holds.
Since $(N, N') \in A_n^t(p)$ implies that $(N', N) \in A_n^t(p)$, we shall in general
put only one of them in $A_n^t(p)$ and drop the other.
An algorithm to find $A_n^t(p)$ is given in Section 7.5.
Definition 6:
A p-line set $L_n^t$ on $A_n^t$ is an ordered set of p-line pairs
$L_n^t = ((N_0, N_1'), (N_1, N_2'), \ldots, (N_k, N_{k+1}'))$
where
(1) $N_0 \in A(t)$ and $N_{k+1}' \in A(t+n)$,
and (2) for all g ($1 \le g \le k$), $d_I(N_g') = d_I(N_g)$ but $N_g' \neq N_g$.
We say that $A_n^t$ is p-connectable if there is a p-line set $L_n^t$ on $A_n^t$.
Also we write $L_n^t(N_0, N_{k+1}')$ when the first and the last nodes in $L_n^t$
are of special interest.
Further, we write $L_n^t(N_0, N_{k+1}') = L_i^t(N_0, N_g') \cup L_{n-i}^{t+i}(N_g, N_{k+1}')$ if
$(N_{g-1}, N_g')$ and $(N_g, N_{g+1}')$ are two adjacent elements of $L_n^t(N_0, N_{k+1}')$ and
$d_I(N_g') = d_I(N_g) = t + i$.
An algorithm to build a p-line set for $A_n^t$ is given in Section 7.5.
Example 2:
A(1) = {b, c, d}
A(2) = {e, f}
A(3) = {g, h, i}
Figure 7.4. A Graph G
$A_2^1 = \bigcup_{i=0}^{2} A(1+i) = \{b, c, d, e, f, g, h, i\}$. $A_2^1(p) = \{(b,f), (d,e), (f,h), (e,i)\}$. A
typical $L_2^1 = ((b,f), (e,i))$. Hence $A_2^1$ is p-connectable.
Further we define a special p-line set called a p-line (1) set.
Definition 7:
Given a set $A_n^t$. We call a p-line set $L_n^t$
a p-line (1) set (write $L_n^t(1)$) if
$d_I(N_i) + 1 = d_I(N_{i+1}')$ for all i ($0 \le i \le k$).
Note that in this case k = n - 1.
Also we write $L_n^t(1)(N_0, N_{k+1}')$ when the first and last nodes are of
particular interest.
Now a few lemmas are in order.
Lemma 2:
Suppose $N \in A(t)$ and $N' \in A(t+n)$ for some t and n in graph G. Assume
that (N, N') holds. Then there is a p-line (1) set
$L_n^t(1)(N, N') = ((N_0, N_1'), (N_1, N_2'), \ldots, (N_{n-1}, N_n'))$
where $N_0 = N$ and $N_n' = N'$.
Proof:
A proof is given by an induction on n.
First note that $|A(t+i)| \ge 2$ for all i, $1 \le i \le n - 1$. Otherwise
$N \Rightarrow N'$ holds and (N, N') does not.
(1) Now let n = 2. Then there must be two distinct nodes
$N_1, N_2 \in A(t+1)$ such that $N \to N_1$ and $N_2 \to N'$. Otherwise the graph
G is not tight. Hence $(N_1, N')$ and $(N, N_2)$. Thus $L_2^t(1) = ((N, N_2),
(N_1, N'))$.
(2) Now assume that the lemma holds for $n \le i$.
Let n = i + 1. Let $N \in A(t)$, $N' \in A(t+i+1)$, and (N, N'). Then
there must be two distinct nodes $N_1, N_2 \in A(t+i)$ such that $(N, N_1)$
and $N \Rightarrow N_2$. This follows from Lemma 1 and Corollary 1. Then
$(N_2, N')$, since otherwise $N \Rightarrow N'$ holds and this contradicts the
assumption. By the induction hypothesis, there is a
$L_{n-1}^t(1)(N, N_1) = ((N, N^{1\prime}), (N^1, N^{2\prime}), \ldots, (N^{n-2}, N_1))$.
Thus there is a p-line (1) set
$L_n^t(1)(N, N') = L_{n-1}^t(1)(N, N_1) \cup (N_2, N') = ((N, N^{1\prime}), (N^1, N^{2\prime}),
\ldots, (N^{n-2}, N_1), (N_2, N'))$.
(Q.E.D.)
Lemma 3:
Suppose that N e A(t),N 1 ,N 2 £ A(t+i),N^,N e A(t+j), and N 1 e A(t+n)
where i < j < n. Also assume that (N,N ),(N ,N*), and (n ,N') hold.
Then there is a pline (l) set L^(l)(N,N') = ((N,^ 1 ),..., (N ^N')).
Proof:
Figure 7.5. An Illustration for Lemma 3
Since (N, N ) by the previous lemma, there is a pline (1) set
L^(1)(N,N 3 ) = Lj(l)(N,N) U L^(1)(N',N 3 ),
u * u
and since (rr,N'), there is a pline (l) set
L^aKw^N') = L^CDCN 2 ^) U L t+ ^(1)(NSN').
ni ' ji v ' nj v '
By definition, N ^ N' , N / N' , W £ N and N 1 ^ N 2 .
Now we have two cases.
(1) N 5 = N'
Then N ^ N 1 . Now let
l*(i)(n,n') = l*(i)(n,n) u lJ^CDCh 1 ,/) U L^ (l ) ft. , r ) .
(2) N 5 ^ N'.
Then let
L*(i)(H,r) = lJ(i)(n,n 3 ) U L^(l)(N',N').
Thus in either case there is a pline (l) set on A .
n
(Q.E.D. )
From Lemmas 2 and 3, we can prove the following lemma.
Lemma 4:
If $A_n^t$ is p-connectable, then there is a p-line (1) set $L_n^t(1)(N, N')$
where $N \in A(t)$ and $N' \in A(t+n)$.
What Lemma 4 implies is the following.
Let $L_n^t(1)(N, N') = ((N, N_1'), (N_1, N_2'), (N_2, N_3'), \ldots, (N_{n-1}, N'))$, i.e.
$d_I(N_i') = d_I(N_i) = t + i$ and $d_I(N_i) + 1 = d_I(N_{i+1}')$.
Now assume that a set $A_n^t$ is p-connectable. Then we have $L_n^t(1)(N, N') =
((N, N_1'), (N_1, N_2'), \ldots, (N_{n-1}, N'))$ where $N \in A(t)$ and $N' \in A(t+n)$. Since $(N_i, N_{i+1}')$,
we can process both at the same time. To do this we first process $A(t+i) - \{N_i\}$
(notice that $d_I(N_i) = t + i$). Then process $\{N_i, N_{i+1}'\}$. Finally process $A(t+i+1) -
\{N_{i+1}'\}$. This leads us to the following scheduling.
Definition 8:
Assume $A_n^t$ is p-connectable. Then by "to p-line schedule $A_n^t$ by $L_n^t(1)$"
we mean the following scheduling.
Let $L_n^t(1) = ((N_0, N_1'), (N_1, N_2'), \ldots, (N_{n-1}, N_n'))$.
(1) p-schedule $A(t) - \{N_0\}$.
(2) g = 1.
(3) p-schedule $\{N_{g-1}, N_g'\}$.
(4) p-schedule $A(t+g) - \{N_g', N_g\}$.
(5) g = g + 1. If g < n, then go to (3).
(6) p-schedule $\{N_{n-1}, N_n'\}$.
(7) p-schedule $A(t+n) - \{N_n'\}$.
Now an algorithm to schedule a tight graph for two processors is
described.
Scheduling is done according to node sets A(i), for i = 1,2, . . .,h(G).
All nodes in A(i) can be processed independently of each other, i.e.,
in any order.
If |A(i)| is even, then two processors can be kept busy all the time
to process A(i) and A(i) can be processed in time |A(i)|/2.
If |A(i)| is odd, then there is a possibility that a machine becomes
idle, i.e., one node will be left out from A(i) which does not have any partner
to be processed with. Let it be N. Then a partner must be found from some
A(j), j > i. First we may try to find $N' \in A(i+1)$ which can be a partner of N.
However if |A(i+1)| is even, then $|A(i+1) - \{N'\}|$ becomes odd and we have the
same problem again, and may try to find a node from A(i+2) to fill an idle
machine, and so on. If this cycle is ever to stop, it must stop when A(i+k)
is hit where |A(i+k)| is odd. Otherwise there is no way to remedy the cycle,
and machine time must be wasted.
Tight Graph Scheduling Algorithm:
Step 1: t = 0.
Step 2: If t = h(G) then p-schedule A(t) and stop, else go to Step 3.
Step 3: If |A(t)| is even, then
  (3-1) p-schedule A(t) and
  (3-2) go to Step 7.
Step 4: |A(t)| is odd.
  Find A(t+1).
Step 5: If $\forall N \in A(t)$, SR(N) = A(t+1), then
  (5-1) p-schedule A(t) and
  (5-2) go to Step 7.
Step 6: There is $N' \in A(t)$ such that $SR(N') \subset A(t+1)$.
  (6-1) If |A(t+1)| is odd, then
    (6-1-1) p-schedule $A(t) - \{N'\}$.
    (6-1-2) p-schedule $\{N'\} \cup \{N''\}$ where $N'' \in A(t+1) - SR(N')$.
    (6-1-3) p-schedule $A(t+1) - \{N''\}$.
    (6-1-4) go to Step 7.
  (6-2) |A(t+1)| is even.
    (6-2-1) Find the smallest k greater than 1 such that |A(t+k)| is odd.
    (6-2-2) If there is no such k (i.e., we have checked up to A(h(G))),
            then p-schedule A(i), $t \le i \le h(G)$, individually, and stop.
    (6-2-3) Else we have a set $\mathcal{A}$ = {A(t), A(t+1), ..., A(t+k)} where |A(t)|
            and |A(t+k)| are odd and the other |A(t+i)| are all even. Check
            p-connectivity of $A_k^t$.
      (1) $A_k^t$ is not p-connectable.
        (1-i) p-schedule A(t), A(t+1), ..., A(t+k-1) individually.
        (1-ii) Let t = t + k - 1.
        (1-iii) go to Step 7.
      (2) $A_k^t$ is p-connectable.
        (2-i) Find a p-line (1) set $L_k^t(1) = ((N_0, N_1'), (N_1, N_2'), \ldots, (N_{k-1}, N_k'))$.
        (2-ii) p-line schedule $A_k^t$ by $L_k^t(1)$.
        (2-iii) t = t + k.
        (2-iv) go to Step 7.
Step 7: t = t + 1.
  Go to Step 2.
Example 3:
(1) |A(1)| and |A(3)| are odd, and |A(2)| is even. Thus we have $\mathcal{A}$ =
{A(1), A(2), A(3)} (by Step 6 of the algorithm).
A(1) = {a, b, c, d, e}
A(2) = {f, g}
A(3) = {h, i, j}
Figure 7.6. An Example of a Tight Graph Scheduling
(2) For $A_2^1$, we have $L_2^1(1) = ((d, f), (g, h))$.
(3) According to Step (6-2-3)(2), we schedule as follows.
(i) p-schedule A(1) - {d} = {a, b, c, e}.
(ii) p-schedule {d, f}.
(iii) p-schedule A(2) - {f, g} = $\emptyset$.
(iv) p-schedule {g, h}.
(v) p-schedule A(3) - {h} = {i, j}.
Thus we have an optimum schedule:
Step        1  2  3  4  5
machine A   a  c  d  g  i
machine B   b  e  f  h  j
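Definition 8 and Example 3 can be replayed with a short Python sketch. The arrows used below are hypothetical: Figure 7.6 is not reproduced here, so they were chosen only to be consistent with the node sets and with the p-line (1) set ((d,f),(g,h)) of the example. The helper names are likewise illustrative.

```python
def p_schedule(nodes):
    """Partition nodes into steps of two (one step of size one if the count is odd)."""
    nodes = sorted(nodes)
    return [tuple(nodes[i:i + 2]) for i in range(0, len(nodes), 2)]

def p_line_schedule(levels, line):
    """Definition 8 with levels = [A(t), ..., A(t+n)] and
    line = [(N_0, N_1'), ..., (N_{n-1}, N_n')]."""
    n = len(line)
    steps = p_schedule(levels[0] - {line[0][0]})                        # (1)
    for g in range(1, n):                                               # (2), (5)
        steps.append(line[g - 1])                                       # (3)
        steps += p_schedule(levels[g] - {line[g - 1][1], line[g][0]})   # (4)
    steps.append(line[n - 1])                                           # (6)
    steps += p_schedule(levels[n] - {line[n - 1][1]})                   # (7)
    return steps

def respects_precedence(steps, pred):
    """True if every node runs strictly after all of its predecessors."""
    done = set()
    for step in steps:
        if any(p not in done for node in step for p in pred.get(node, ())):
            return False
        done.update(step)
    return True

levels = [set("abcde"), set("fg"), set("hij")]
line = [("d", "f"), ("g", "h")]
# Hypothetical immediate predecessors (the initial node is omitted).
pred = {"f": {"a", "b", "c", "e"}, "g": {"d"}, "h": {"f"},
        "i": {"f", "g"}, "j": {"g"}}

steps = p_line_schedule(levels, line)
print(steps)
# [('a', 'b'), ('c', 'e'), ('d', 'f'), ('g', 'h'), ('i', 'j')]
print(respects_precedence(steps, pred))  # True
```

Ten nodes are processed in five steps with no idle processor, which matches the schedule table of the example.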
Proof for the algorithm:
Lemma 5:
Step 3 is optimum and whatever p-schedule is made, it does not affect
the later stages.
Lemma 6:
Step 5 is optimum and whatever p-schedule is made, it does not affect
the later stages.
Proof:
First note that after A(t) has been processed, only nodes in A(t+1)
can become available. Since $\forall N \in A(t)$, SR(N) = A(t+1), all nodes in A(t)
must have been processed before any node in A(t+1) can be processed.
(Q.E.D.)
Lemma 7:
Step 6-1 is optimum and whatever p-schedule is made, it does not
affect the later stages.
Proof:
The algorithm actually processes A(t) and A(t+1) (where |A(t)| and
|A(t+1)| are odd) in time (|A(t)| + |A(t+1)|)/2, which is optimum.
(Q.E.D.)
Lemma 8:
Step 6-2-2 is optimum.
Proof:
Let $\mathcal{A}$ = {A(t), A(t+1), ..., A(h(G))}. Since |A(t)| is odd and all other
|A(i)| are even ($t < i \le h(G)$), it takes at least time
$\lceil (|A(t)| + |A(t+1)| + \cdots + |A(h(G))|) / 2 \rceil$
to process $\mathcal{A}$.
Step 6-2-2 achieves this.
(Q.E.D.)
Lemma 9:
Assume |A(t)| and |A(t+k)| are odd and |A(t+i)| are all even ($1 \le i <
k$). If $A_k^t$ is p-connectable, then Step 6-2-3 (2) is optimum.
Proof:
$\mathcal{A}$ = {A(t), A(t+1), ..., A(t+k)} can be processed in time
$(|A(t)| + |A(t+1)| + \cdots + |A(t+k)|) / 2$.
Step 6-2-3 (2) achieves this.
(Q.E.D.)
Lemma 10:
Assume |A(t)| and |A(t+k)| are odd and |A(t+i)| are all even ($1 \le i <
k$). If $A_k^t$ is not p-connectable, then Step 6-2-3 (1) is optimum.
Proof:
Let $\mathcal{A}$ = {A(t), A(t+1), ..., A(t+k)}. Since $A_k^t$ is not p-connectable,
there is no p-line set $L_k^t(1)$. Thus there will be N in some A(t+i) ($0 \le i \le k$)
which does not have any partner to be processed with it, thus making a machine
idle. Now if this situation could be remedied, then it would be only because
there is $N' \in A(t+n+j)$ (j > 0) which can be done in parallel with N.
Figure 7.7. An Illustration for Lemma 11
Suppose that the above could be done. The parallel processing of N
and N' can, however, be advantageous only if $A_i^t$ is p-connectable and there is
a p-line (1) set $L_i^t(1)(N, N')$ where $N \in A(t)$, $N' \in A(t+i)$ and $N' \neq N$. Otherwise
processors cannot be kept busy for A(t), A(t+1), ..., $A(t+i) - \{N'\}$ and we do not
gain at all by delaying a processing of N. Now from the assumption, $N \Rightarrow N''$
for all $N'' \in A(t+n)$ because otherwise $A_n^t$ becomes p-connectable. Also by
Corollary 1, there is a node N'' in A(t+n) such that $N'' \Rightarrow N'$. This implies that
$N \Rightarrow N'$. Thus N' cannot be processed in parallel with N. This proves the
lemma.
(Q.E.D.)
Essentially what the above algorithm does is, upon finding A(t) where
|A(t)| is odd, to try to delay the processing of a node N in A(t) till the next
node set A(t+1). If |A(t+1)| is even, then again the algorithm tries to delay
the processing of a node N' in A(t+1) till the next level A(t+2), etc. until it
finds some A(t+k) where |A(t+k)| is odd, or $|A(t+k) \cup \{N''\}|$ is even where N''
is a node whose processing is being delayed from the previous stage. In other
words, the algorithm tries to establish p-connectability between two adjacent
node sets both of which have an odd number of elements. Note that it is not
necessary to try to establish p-connectability among more than two odd node
sets, say A(t), A(t+k) and A(t+m) (m > k), because $A_m^t$ cannot be p-connectable if
$A_k^t$ is not (see Lemma 10).
Now the above argument together with Lemmas 5 to 10 proves the following
theorem.
Theorem 1 :
The algorithm gives an optimum schedule of a tight graph.
7.4 Scheduling of a Loose Graph
Now let us extend the above algorithm so that it can handle a loose
graph. To facilitate presentation, a few definitions are in order.
(From now on by "a graph", a "loose graph" is to be understood. )
Definition 9:
A node N in a graph G for which $d_I(N) + d_T(N) \neq h(G)$ is called a loose
node. Otherwise a node is tight. Let N be a loose node. Then we define the
far distance $d_F(N)$ as $d_F(N) = h(G) - d_T(N)$.
A set B(i) is a node set (cf. Definition 4) such that
$B(i) = \{N \mid N \text{ is a tight node and } d_I(N) = i\} \cup \{N \mid N \text{ is a loose node and } d_F(N) = i\}$.
We write $B_t(i)$ and $B_l(i)$ for the above two component sets respectively, i.e.
$B(i) = B_t(i) \cup B_l(i)$.
Figure 7.8. A Loose Node
Note that a loose node N can be processed anywhere between $d_I(N)$ and
$d_F(N)$ without affecting the rest of a graph. Definition 9 says that loose
nodes are pushed far down and classified in terms of the far distance rather
than the forward distance.
Note that a loose node N receives two attributes, the far distance
$d_F(N)$ and the forward distance $d_I(N)$, and is classified in terms of $d_F(N)$,
e.g. we say that a loose node N with the forward distance $d_I(N)$ is in B(i)
where $i = d_F(N)$.
We also define a set C(i).
Definition 10:
A set C(i) of loose nodes is defined as follows.
$C(i) = \{N \mid N \text{ is a loose node and } d_I(N) \le i < d_F(N)\}$.
C(i) is a set of those loose nodes which may be processed in parallel
with a node in B(i). If $d_F(N) = i$, then N is put in B(i) rather than in C(i).
C(i) is a set of loose nodes which will be used to fill up waste processor time
if necessary.
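The sets B(i) and C(i) can be computed directly from the two distances. The Python sketch below is illustrative only: the function name, the arrow-list input and the small graph are assumptions, and a single initial and terminal node is assumed as before.

```python
def loose_sets(arrows):
    """Return (B, C) as dicts mapping a level i to the sets B(i) and C(i)."""
    succ, pred = {}, {}
    for a, b in arrows:
        succ.setdefault(a, set()).add(b)
        succ.setdefault(b, set())
        pred.setdefault(b, set()).add(a)
        pred.setdefault(a, set())

    d_i, d_t = {}, {}

    def fwd(n):  # forward distance d_I(N)
        if n not in d_i:
            d_i[n] = 1 + max((fwd(p) for p in pred[n]), default=-1)
        return d_i[n]

    def bwd(n):  # backward distance d_T(N)
        if n not in d_t:
            d_t[n] = 1 + max((bwd(s) for s in succ[n]), default=-1)
        return d_t[n]

    h = max(fwd(n) for n in succ)
    B, C = {}, {}
    for n in succ:
        if fwd(n) + bwd(n) == h:           # tight node: classified by d_I
            B.setdefault(fwd(n), set()).add(n)
        else:                              # loose node: classified by d_F
            far = h - bwd(n)
            B.setdefault(far, set()).add(n)
            for i in range(fwd(n), far):   # d_I(N) <= i < d_F(N)
                C.setdefault(i, set()).add(n)
    return B, C

# Hypothetical graph: c is loose, with d_I(c) = 1 and d_F(c) = 2.
arrows = [("I", "a"), ("a", "b"), ("b", "T"), ("I", "c"), ("c", "T")]
B, C = loose_sets(arrows)
print(B[1], B[2], C)
```

Here the loose node c is pushed down into B(2) next to the tight node b, and it appears in C(1) as a candidate for filling an otherwise idle processor while B(1) is processed.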
Scheduling a loose graph is similar to that for a tight graph.
A loose graph is scheduled in accordance with B(i) for i = 1, 2, ..., h(G). A loose
node N, even though it is in $B(d_F(N))$, may be scheduled with any B(k) where
$d_I(N) \le k \le d_F(N)$. All tight nodes are scheduled first and loose nodes are
scheduled as late as possible. If it becomes inevitable to waste processor
time if only tight nodes are used, then loose nodes are used to fill those
otherwise wasted times.
In what follows B(i) plays a similar role to that of A(i) in the
previous discussion. All definitions for A(i) are also applicable to B(i) with
a few modifications.
Now the p-line relation ($\equiv$) between two nodes N and N' in $B_n^t$ is
redefined as follows.
Definition 5':
Let N and N' be two nodes in $B_n^t$. $N \equiv N'$ ((N, N')) holds if either one of
the following holds:
(1) (i) N' is not a loose node (N may be a loose node),
and (ii) $d_I(N)$ (or $d_F(N)$ if N is a loose node) $+ 1 = d_I(N')$.
(In other words, $N \in B(t+k)$ and $N' \in B_t(t+k+1)$, $0 \le k < n$.)
and (iii) $N \not\to N'$.
or (2) (i) N' is a loose node (N may be a tight node),
and (ii) $d_I(N') \le d_I(N)$ (or $d_F(N)$ if N is a loose node) $< d_F(N')$.
(In other words, $N' \in B_l(t+k)$ and $N \in B(t+j)$ where $d_I(N') \le
t+j < t+k \le t+n$.) If $d_I(N') < t$, then the above inequality
becomes
$t \le d_I(N)$ (or $d_F(N)$) $< d_F(N')$.
Then $L_n^t$ on $B_n^t$ is defined similarly to Definition 6.
Example 4 (see Figure 7.9):
(2,5), (3,4), (4,8), (5,7) $\in B_n^t(p)$ by (1). Since $d_I(N_6) = t + 1$ and $d_F(N_6) =
t + 3$, (2,6), (3,6), (4,6), (5,6) $\in B_n^t(p)$ by (2).
B(t) = {$N_1$}
B(t+1) = {$N_2$, $N_3$}
B(t+2) = {$N_4$, $N_5$}
B(t+3) = {$N_6$, $N_7$, $N_8$}
$B_t(t+3)$ = {$N_7$, $N_8$}, $B_l(t+3)$ = {$N_6$}
B(t+4) = {$N_9$}
Figure 7.9. The p-line Relation in $B_n^t$
(1) of the above definition resembles Definition 5 and takes care of
tight nodes. As Lemmas 2, 3 and 4 showed, Definition 5 is more general than
necessary. That is, the p-line relation need only be defined between
adjacent levels, i.e. A(i) and A(i+1). Definition 5' follows this
simplification.
(2), on the other hand, takes care of loose nodes. If a loose node
N is in B(i) then this means that $d_F(N) = i$ and N may be processed in parallel
with any node $N' \in B(j)$ where $d_I(N) \le j < i$. Because of this addition, Lemmas
2, 3 and 4 do not hold anymore. For example let us consider the graph G:
© B(t+1)
It is easy to see that (N ,N ) e B,(p) by (2) of Definition 5'. Since (N ,N )
/ B,(p), there is no pline (l) set on B,(,p). This, however, does not bother
us. Assume that $B_n^t$ is p-connectable, and let $L_n^t = ((N_0, N_1'), (N_1, N_2'), \ldots,
(N_i, N_{i+1}'), \ldots, (N_k, N_{k+1}'))$, where $N_0 \in B(t)$ and $N_{k+1}' \in B(t+n)$. Further assume
that $N_i \in B(t+a_i)$ and $N_{i+1}' \in B(t+a_{i+1})$. If $a_i + 1 \neq a_{i+1}$, then by Definition
5' $N_{i+1}'$ is a loose node and $d_I(N_{i+1}') \le t + a_i$. Thus $N_{i+1}'$ can be processed
in parallel with $N_i$ without affecting the rest of a graph, i.e. we first process
$B(t+a_i) - \{N_i\}$ and then process $\{N_i, N_{i+1}'\}$. If $a_i + 1 = a_{i+1}$, then the old
strategy can be applied, i.e. we first process $B(t+a_i) - \{N_i\}$, then $\{N_i, N_{i+1}'\}$,
then $B(t+a_{i+1}) - \{N_{i+1}'\}$.
This leads us to modify Definition 8 as follows.
Definition 8':
Given $B_n^t$ and $L_n^t$, where $L_n^t = ((N_0, N_1'), (N_1, N_2'), \ldots, (N_g, N_{g+1}'), \ldots,
(N_{k-1}, N_k'))$. Then by "to p-line schedule $B_n^t$ by $L_n^t$", we mean the following
scheduling.
(1) p-schedule $B(t) - \{N_0\}$.
(2) g = 1.
(3) p-schedule $\{N_{g-1}, N_g'\}$.
(4) Let $N_{g-1} \in B(t+a_{g-1})$ and $N_g' \in B(t+a_g)$.
(i) If $a_{g-1} + 1 = a_g$, then p-schedule $B(t+a_g) - \{N_g', N_g\}$.
(ii) If $a_{g-1} + 1 \neq a_g$, then p-schedule B(t+b) where b = $a_{g-1} + 1$,
$a_{g-1} + 2$, ..., $a_g - 1$. Then p-schedule $B(t+a_g) - \{N_g', N_g\}$.
(5) g = g + 1. If g < k, then go to (3).
(6) p-schedule $\{N_{k-1}, N_k'\}$.
(7) p-schedule $B(t+n) - \{N_k'\}$.
Finally in connection with Definition 8', we define the following.
Suppose that $B_n^t$ is not p-connectable, i.e. there is no p-line set $L_n^t$ on $B_n^t$.
It is, however, possible to find a p-line set $L_s^t$ on $B_s^t$ for some s, 0 < s < n.
Definition 11:
Given $B_n^t$ which is not p-connectable. Now we check if $B_s^t$ is p-connectable
for s = 1, 2, ..., n-1. Let m be the smallest s such that $B_s^t$ is p-connectable
but $B_{s+1}^t$ is not. We call m the maximum p-connectable distance,
and $L_m^t$ a maximum p-line set.
The following example illustrates the above definition.
Example 5:
Figure 7.10. An Example of the Maximum p-connectable Distance
Let us consider pconnectability of B^. It is clear that B^,(p) =
{ (N,,N, ), (N, ,N/), (N , No)} , and B, is not pconnectable. Since B, is p
connectable (N , N. ) but B is not, the maximum pconnectable distance in the
above B, is 1 and L = ((N ,N, )).
Using a similar technique to the one described in Section 7.5, we can
check p-connectability of $B_n^t$.
An algorithm to schedule a loose graph for two processors is now
given. The algorithm resembles the one for a tight graph. The major difference
lies in the treatment of loose nodes. Loose nodes are scheduled as late as
possible. They will be used when it becomes inevitable to waste processor
time.
In what follows we modify the definitions for B(i) and C(i) so that
they do not include those loose nodes which have been already scheduled.
Loose Graph Scheduling Algorithm:
Step 1: t = 0.
Step 2: If t = h(G) then p-schedule all unscheduled nodes and stop, else go
  to Step 3.
Step 3: If |B(t)| is even, then
  (3-1) p-schedule B(t)
  (3-2) go to Step 7.
Step 4: |B(t)| is odd.
  Find B(t+1).
Step 5: If $\forall N \in B(t)$, SR(N) = B(t+1), then
  (5-1) Check C(t). If C(t) = $\emptyset$, then p-schedule B(t) and go to Step 7.
  (5-2) Otherwise pick N with the minimum $d_F(N)$ in C(t). (If there is
        more than one such N, pick any N.) Now p-schedule $B(t) \cup \{N\}$ and
        go to Step 7.
Step 6: There is $N' \in B(t)$ such that $SR(N') \subset B(t+1)$.
  (6-1) If |B(t+1)| is odd, then
    (6-1-1) p-schedule $B(t) - \{N'\}$.
    (6-1-2) p-schedule $\{N'\} \cup \{N''\}$ where $N'' \in B(t+1) - SR(N')$.
    (6-1-3) p-schedule $B(t+1) - \{N''\}$.
    (6-1-4) go to Step 7.
  (6-2) |B(t+1)| is even.
    (6-2-1) Find the smallest k greater than 1 such that |B(t+k)| is
            odd.
    (6-2-2) If there is no such k (i.e. we have checked up to B(h(G))),
            then p-schedule B(i) ($t \le i \le h(G)$) individually and stop.
    (6-2-3) Else we have a set $\mathcal{B}$ = {B(t), B(t+1), ..., B(t+k)} where |B(t)|
            and |B(t+k)| are odd and the other |B(t+i)| are all even. Check
            if $B_k^t$ is p-connectable.
      (1) $B_k^t$ is not p-connectable.
        (1-i) Find the maximum p-connectable distance in $B_k^t$. Let
              it be m.
        (1-ii) Check C(t+m). If C(t+m) = $\emptyset$, then p-schedule B(t),
               B(t+1), ..., B(t+k-1) individually. Let t = t + k - 1.
               Go to Step 7.
        (1-iii) Otherwise let $B'(t+m) = B(t+m) \cup \{N\}$ where $N \in C(t+m)$
                and has the minimum $d_F(N)$. Let $B'^t_m = \bigcup_{i=0}^{m-1} B(t+i) \cup
                B'(t+m)$. Then p-line schedule $B'^t_m$ by a maximum
                p-line set $L_m^t$. Let t = t + m. Go to Step 7.
      (2) $B_k^t$ is p-connectable.
        (2-i) Find a p-line set $L_k^t$.
        (2-ii) p-line schedule $B_k^t$ by $L_k^t$.
        (2-iii) t = t + k.
        (2-iv) Go to Step 7.
Step 7: t = t + 1.
  Go to Step 2.
Proof for the Algorithm:
That the above algorithm is optimum can be proved by an argument similar
to the one used for the previous algorithm. For example, we can show that
Steps 3, 5-1, 6-1, 6-2-2 and 6-2-3 (2) are optimum easily by previous lemmas. Now
we have to show that Steps 5-2 and 6-2-3 (1) are optimum.
Lemma 11:
Step 5-2 is optimum.
Proof:
Suppose that we do not use N in C(t) to fill up otherwise wasted
processor time, where N is the node with the minimum $d_F(N)$ in C(t). Saving N
for later use, however, does not improve the situation because a node N cannot
be used more effectively than to fill up wasted processor time anyway.
Also the choice of N from C(t) is optimum. Suppose for example that
N' with $d_F(N') > d_F(N)$ is also in C(t). Now let us consider those two nodes
(see Figure 7.12). Whether we use N or N' to schedule with B(t) will not make
any difference to schedule B(t+1), B(t+2), ..., $B(d_F(N))$ because a node N or N'
is available if necessary. Suppose we used N' with B(t). Then it is not
possible to fill a later request which may arise when B(u) ($d_F(N) + 1 \le u \le
d_F(N')$) is scheduled, whereas if we used N with B(t) then we can fill the
request. Thus this proves the lemma.
Figure 7.11. An Illustration for Lemma 13
(Q.E.D.)
That Step 6-2-3 (1) is optimum is proved similarly.
Now we have the following theorem.
Theorem 2:
The algorithm gives an optimum schedule of a loose graph.
7.5 Supplement
(1) An algorithm to establish $A_n^t(p)$ on $A_n^t$ is now discussed.
Let $m = \sum_{i=0}^{n} |A(t+i)|$, and B be an m x m connection matrix where the
first |A(t)| columns and rows are labeled by nodes in A(t), the second |A(t+1)|
columns and rows by nodes in A(t+1), and so forth. An element $b_{ij}$ of B is 1 if
and only if $N_i \to N_j$ where $N_i$ and $N_j$ are labels for the i-th row and the j-th
column.
Now define the multiplication of the connection matrices as follows.
Let A = B x C where A, B and C are m x m connection matrices. Then
$a_{ij} = \bigvee_{k=1}^{m} (b_{ik} \wedge c_{kj})$. Now compute $\bar{B}^m = \bigvee_{k=1}^{m} B^k$. In $\bar{B}^m$, $\bar{b}_{ij} = 1$ implies that $N_i \Rightarrow N_j$
and $\bar{b}_{ij} = 0$ implies that $(N_i, N_j)$. For example, let us consider the graph in
Figure 7.13. Then we have
$A_n^t(p)$ = {(a,f), (a,g), (b,e), (b,h), (b,g), (c,d), (d,h), (d,j), (f,h), (g,h),
(g,i)}.
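The matrix procedure above can be sketched in Python as follows. The function names and the four-node graph are illustrative assumptions (the graph of the figure is not reproduced here). The sketch also applies both conditions of Definition 5 explicitly: neither node may reach the other, and the two nodes must lie on different levels; each pair is listed once, lower level first.

```python
def closure(b):
    """B or B^2 or ... or B^m for an m x m boolean connection matrix b."""
    m = len(b)
    reach = [row[:] for row in b]
    power = [row[:] for row in b]
    for _ in range(m - 1):
        # Boolean matrix product: power := power x b
        power = [[any(power[i][k] and b[k][j] for k in range(m))
                  for j in range(m)] for i in range(m)]
        reach = [[reach[i][j] or power[i][j] for j in range(m)]
                 for i in range(m)]
    return reach

def p_line_pairs(names, level, b):
    """All p-line pairs (N, N') with N on the lower level."""
    r = closure(b)
    m = len(names)
    return [(names[i], names[j])
            for i in range(m) for j in range(m)
            if level[i] < level[j] and not r[i][j] and not r[j][i]]

# Hypothetical graph: a, b on one level; c, d on the next;
# arrows a -> c, b -> c, b -> d.
names = ["a", "b", "c", "d"]
level = [0, 0, 1, 1]
B = [[False, False, True,  False],
     [False, False, True,  True],
     [False, False, False, False],
     [False, False, False, False]]

print(p_line_pairs(names, level, B))  # [('a', 'd')]
```

Only a and d are unrelated across levels, so (a, d) is the single p-line pair of this small graph.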
(2) Given $A_n^t(p)$ for $A_n^t$, an algorithm to find p-connectivity and $L_n^t(1)$ is
described.
According to Lemma 4, if $A_n^t$ is p-connectable, then there is a
p-line (1) set $L_n^t(1)$. Thus to check p-connectability it is enough to examine
if there is a p-line (1) set $L_n^t(1)$.
Figure 7.12. An Example for $A_n^t(p)$: (a) the graph on nodes a to j; (b) the connection matrix B; (c) the matrix obtained from B by the multiplication defined above
First let
$A_n^t(p)(1) = A_n^t(p) - \{(N, N') \mid |d_I(N) - d_I(N')| > 1\}$.
Now we construct a graph $\bar{A}_n^t$ as follows.
(1) $\bar{A}_n^t$ has the following nodes:
(i) an initial node $N_I$,
(ii) a terminal node $N_T$,
(iii) all nodes in $A_n^t$,
(iv) for each node N ($N \in A_n^t$, $N \notin A(t)$, $N \notin A(t+n)$) a new
duplicate node N'.
(2) $\bar{A}_n^t$ has the following arrows:
(i) an arrow from $N_I$ to every node N which was in A(t),
(ii) an arrow from every node N which was in A(t+n) to $N_T$,
(iii) if $(N_1, N_2) \in A_n^t(p)(1)$, then an arrow from $N_1$ to $N_2$,
(iv) for every A(t+k), $1 \le k \le n-1$, for every $N \in A(t+k)$, an arrow
from N to every N' which is a duplicate of $N'' \in A(t+k)$ but
is not identical to N.
To illustrate the above definition, let us consider the following
example.
Example:
Let $A_2^t$ = {a, b, c, d, e, f, g}
where a, b $\in$ A(t),
c, d, e $\in$ A(t+1)
and f, g $\in$ A(t+2).
Further let
$A_2^t(p)(1)$ = {(b,c), (d,f), (d,g), (c,f)}.
Figure 7.13. An Example for p-connectivity Discovery
Then $\bar{A}_2^t$ has nodes = {$N_I$, $N_T$, a, b, c, d, e, f, g (which are nodes in $A_2^t$), c', d', e'
(which are duplicates of nodes in A(t+1))} and arrows = {($N_I$,a), ($N_I$,b), (f,$N_T$),
(g,$N_T$), (b,c), (d,f), (d,g), (c,f) (which are the original pairs in $A_2^t(p)(1)$), (c,d'), (c,e'),
(d,c'), (d,e'), (e,c'), (e,d') (which are arrows from N to a duplicate N' of N''
which is not identical to N)}.
Now it is clear that $A_n^t$ has a p-line (1) set $L_n^t(1)$ if and only if
there is a path from $N_I$ to $N_T$ in $\bar{A}_n^t$. There is a well-known algorithm for path
finding, e.g. [19].
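The search for a p-line (1) set can also be sketched directly in Python. This is an equivalent depth-first search over the node sets rather than a literal construction of the auxiliary graph: moving from the second node of one pair to a different node of the same level plays the role of the duplicate nodes. The function name and input format are assumptions; the data are those of the example above.

```python
from functools import lru_cache

def find_p_line_1(levels, pairs):
    """Return a p-line (1) set as a list of pairs, or None if there is none.

    levels = [A(t), ..., A(t+n)]; pairs holds the p-line pairs between
    adjacent levels, each written with the lower-level node first.
    """
    n = len(levels) - 1
    pairs = set(pairs)

    @lru_cache(maxsize=None)
    def extend(g, start):
        """Chain of pairs from node start on level g-1 up to level n."""
        for partner in sorted(levels[g]):
            if (start, partner) not in pairs:
                continue
            if g == n:
                return ((start, partner),)
            # The next pair must start from a different node of level g.
            for other in sorted(levels[g] - {partner}):
                rest = extend(g + 1, other)
                if rest is not None:
                    return ((start, partner),) + rest
        return None

    for first in sorted(levels[0]):
        found = extend(1, first)
        if found is not None:
            return list(found)
    return None

levels = [{"a", "b"}, {"c", "d", "e"}, {"f", "g"}]
pairs = {("b", "c"), ("d", "f"), ("d", "g"), ("c", "f")}
print(find_p_line_1(levels, pairs))  # [('b', 'c'), ('d', 'f')]
```

For the example data the search returns ((b,c),(d,f)), so the set is p-connectable; with the pairs (d,f) and (d,g) removed it returns None, because c cannot serve as both the second node of one pair and the first node of the next.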
8. CONCLUSION
This thesis introduced new techniques to expose hidden parallelism
in a program. Techniques included the use of one of the fundamental arithmetic
laws, i.e., the distributive law, extensively. Furthermore it was suggested
that with the help of these techniques computation of a program might be
sped up logarithmically in the sense that computation time became a logarithmic
function of the number of single variable occurrences in a program
rather than a linear function of it. Even though discussions were based on an
ILLIAC IV type machine, as mentioned before, they are readily applicable to
pipeline machines such as CDC STAR.
Chapter 2 of the thesis studied parallel computation of summations,
powers and polynomials. The minimum time to evaluate summations or powers as
well as the minimum number of PE's required to attain it was given. A scheme
which computed a polynomial in parallel in lesser time than any known scheme
was also introduced. Because of its simplicity in scheduling, the kth order
Horner's rule for parallel polynomial computation was studied in detail. It
was shown that for this algorithm the availability of more PE's sometimes
increased the computation time. The algorithm was such that all PE's were
forced to participate in computation.
Chapter 3 presented an algorithm which reduced tree height for an
arithmetic expression by distribution. The algorithm worked from the innermost
parenthesis pair to the outermost one and scanned an arithmetic expression
only once. A measure for the height of the minimum height tree for an arithmetic
expression was given as a function of the depth of parenthesis pair
nesting and the number of single variable occurrences in it.
Chapter 4 extended the above idea to cover a sequence of arithmetic
expressions. It was shown that by replacing a sequence of arithmetic expressions
with an arithmetic expression by back substitution, the computation
time could be sped up in a logarithmic way for a certain class of iteration
formulas, e.g., $x_{i+1} := a \times x_i + b$. The chapter also showed that parallel
computation was in general more favorable than sequential computation in terms
of the round-off error. Furthermore it was shown that distribution would not
introduce a significant amount of round-off error.
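As a small worked case of back substitution, assume constant coefficients a and b in the iteration formula $x_{i+1} := a \times x_i + b$ (an illustrative simplification; the general class of formulas treated in Chapter 4 is not restated here). Substituting each statement into the next gives:

```latex
\begin{aligned}
x_1 &= a x_0 + b,\\
x_2 &= a x_1 + b = a^2 x_0 + a b + b,\\
x_n &= a^n x_0 + b\,(a^{n-1} + a^{n-2} + \cdots + a + 1).
\end{aligned}
```

The last line is a single arithmetic expression in $x_0$, a and b, so $x_n$ no longer has to wait for $x_{n-1}$; the power and the summation in it are of the kind studied in Chapter 2.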
Chapter 5 studied inter statement parallelism as an introduction
to the following chapter. An algorithm was given which checked whether executing
the statements of a program in some other sequence gave the same results as executing
them in the given sequence. The algorithm was new in
the sense that it prevented variables from being updated before they were used.
This had not been taken into account by the previous works. Also a technique
which exploited more parallelism between statements by introducing temporary
locations was introduced.
Chapter 6 presented an algorithm which checked if a statement in a
loop could be executed simultaneously for all values of a loop index. The
algorithm examined only the index expressions and the way the values of the
indices varied, and did not require a loop to be replaced with a sequence of
statements. In case a statement in a loop could not be executed in parallel
with respect to a loop index as it was, the algorithm "skewed" the computation
of the statement with respect to the loop index so that the statement could be
executed in parallel for all values of the loop index. Also, to expose hidden
parallelism in a loop, replacement of a loop with several loops was discussed.
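A familiar instance of skewing is sketched below, under the assumption that
it resembles what Chapter 6 means by the term; the example loop and the
function names are not taken from the thesis. The statement
A[i][j] := A[i-1][j] + A[i][j-1] cannot be executed for all i at once as
written, but all points with the same value of i + j depend only on points
with a smaller i + j, so each such "front" could be computed simultaneously.

```python
def fill_sequential(n, m):
    """Reference version: the doubly nested loop in its given order."""
    A = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A


def fill_skewed(n, m):
    """Skewed version: iterate over fronts k = i + j.

    Every point on one front reads only values from the previous front, so
    the inner computation could be performed for all of them in parallel.
    """
    A = [[1] * (m + 1) for _ in range(n + 1)]
    for k in range(2, n + m + 1):
        front = [(i, k - i) for i in range(max(1, k - m), min(n, k - 1) + 1)]
        # Compute the whole front from old values first, then store it.
        new_values = {(i, j): A[i - 1][j] + A[i][j - 1] for (i, j) in front}
        for (i, j), value in new_values.items():
            A[i][j] = value
    return A


print(fill_skewed(3, 3) == fill_sequential(3, 3))   # True
```

The nested loop performs n × m dependent steps, whereas the skewed form has
n + m - 1 fronts.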
A solution for the equally weighted two-processor scheduling problem
was given in Chapter 7. The only practical result previously obtained was one
for scheduling a rooted tree with equally weighted tasks on k identical
processors. The solution given in Chapter 7 scheduled a graph with equally
weighted tasks on two identical processors. If we considered common
expressions in an arithmetic expression, then we would obtain a graph of
operations rather than a tree for the arithmetic expression, and the
scheduling algorithm was readily applicable for scheduling that graph on P(2).
Suggestions for further research have been given in several places
throughout the thesis and need not be repeated here. We conclude by giving
two possible extensions that deserve brief mention.
(1) The design of a better machine.
Even though we assumed that a PE can communicate with any
other PE instantaneously, this may not be the case in reality
because it is costly and impractical to provide data paths
between every PE pair. Hence it is necessary to design a PE
interconnection which is economical yet powerful enough to
simulate the above idealized interconnection [25], [26].
(2) Generalization of the idea given in this thesis.
The three laws of arithmetic (associativity, commutativity and
distributivity) were utilized in this thesis in terms of parallelism
exploitation. We should, however, pay more attention to these laws
even in terms of serial computation. For example, suppose an
arithmetic expression which involves matrices and row and column
vectors is given. Then by the appropriate application of the
associative law, the number of multiplications required may be
reduced drastically.
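The effect of the associative law on serial cost can be made concrete with a
short sketch. It assumes the conventional product of a p × q matrix with a
q × r matrix, which takes p × q × r scalar multiplications; the function name
and the chosen sizes are hypothetical.

```python
def product_cost(shapes, left_to_right=True):
    """Count scalar multiplications for a chain of matrix products.

    shapes is a list of (rows, cols) pairs. The chain is evaluated either
    strictly from the left, ((A B) C) ..., or strictly from the right,
    ... (A (B C)).
    """
    shapes = list(shapes)
    total = 0
    while len(shapes) > 1:
        if left_to_right:
            (p, q), (q2, r) = shapes[0], shapes[1]
            before, after = [], shapes[2:]
        else:
            (p, q), (q2, r) = shapes[-2], shapes[-1]
            before, after = shapes[:-2], []
        if q != q2:
            raise ValueError("inner dimensions do not match")
        total += p * q * r
        shapes = before + [(p, r)] + after
    return total


n = 100
chain = [(n, n), (n, n), (n, 1)]          # two matrices and a column vector
print(product_cost(chain, left_to_right=True))    # 1010000, i.e. (A B) v
print(product_cost(chain, left_to_right=False))   # 20000,   i.e. A (B v)
```

Both groupings give the same result by associativity, but the second avoids
forming the n × n product.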
LIST OF REFERENCES
[1] Abrahams, P. W., "A Formal Solution to the Dangling else of ALGOL 60 and
Related Languages", Comm. ACM, 9 (September, 1966), pp. 679-682.
[2] Abel, N. E., et al., "TRANQUIL: A Language for an Array Processing
Computer", Proc. of the Spring Joint Computer Conference (1969),
pp. 57-73.
[3] Naur, P., et al., "Revised Report on the Algorithmic Language ALGOL 60",
Comm. ACM, 6 (January, 1963), pp. 1-17.
[4] Allard, R. W., Wolf, K. A. and Zemlin, R. A., "Some Effects of the 6600
Computer on Language Structures", Comm. ACM, 7 (February, 1964),
pp. 112-119.
[5] Baer, J. E., "Graph Models of Computations in Computer Systems", Ph.D.
Dissertation, University of California, Los Angeles, Report No.
6846 (October, 1968).
[6] Baer, J. E. and Bovet, D. P., "Compilation of Arithmetic Expressions for
Parallel Computations", Proc. of IFIP Congress (1968), pp. 340-346.
[7] Barnes, G. H., et al., "The Illiac IV Computer", IEEE Trans. of Computers,
C-17 (August, 1968), pp. 746-757.
[8] Beightler, C. S., et al., "A Short Table of z-Transforms and Generating
Functions", Operations Research, 9 (July-August, 1961), pp. 574-578.
[9] Bingham, H. W., Reigel, W. E. and Fisher, D. A., "Control Mechanisms for
Parallelism in Programs", Burroughs Corporation, ECOM024637
(October, 1968).
[10] Bingham, H. W. and Reigel, W. E., "Parallelism Exposure and Exploitation in
Digital Computing Systems", Burroughs Corporation, ECOM02463F
(June, 1969).
[11] Brewer, M. A., "Generation of Optimal Code for Expressions via
Factorization", Comm. ACM, 12 (June, 1969), pp. 333-340.
[12] "Newsdata", Computer Decision (March, 1970), p. 2.
[13] Conway, M. E., "A Multiprocessor System Design", Proc. of the Fall Joint
Computer Conference (1963), pp. 139-146.
[14] Conway, R. W., Maxwell, L. W. and Miller, L. W., Theory of Scheduling,
Addison-Wesley Publishing Company, Inc., New York (1967).
[15] Dorn, W. S., "Generalizations of Horner's Rule for Polynomial Evaluation",
IBM Journal of Research and Development, 6 (April, 1962), pp. 239-245.
[16] Estrin, G., "Organization of Computer Systems - the Fixed plus Variable
Structure Computer", Proc. of Western Joint Computer Conference
(May, 1960), pp. 33-40.
[17] Gold, D. E., "A Model for Linear Programming Optimization of I/O-Bound
Programs", M.S. Thesis, Department of Computer Science, University
of Illinois at Urbana-Champaign, Report No. 340 (June, 1969).
[18] Graham, W. R., "The Parallel and the Pipeline Computers", Datamation (April,
1970), pp. 68-71.
[19] Harary, F., Norman, R. Z. and Cartwright, D., Structural Models: An
Introduction to the Theory of Directed Graphs, John Wiley and Sons,
Inc., New York (1965).
[20] Hellerman, H., "Parallel Processing of Algebraic Expressions", IEEE Trans.
of Electronic Computers, EC-15 (January, 1966), pp. 82-91.
[21] Hu, T. C., "Parallel Sequencing and Assembly Line Problems", Operations
Research, 9 (November-December, 1961), pp. 841-848.
[22] Knowls, M., et al., "Matrix Operations on Illiac IV", Department of Computer
Science, University of Illinois at Urbana-Champaign, Report No. 222
(March, 1967).
[23] Knuth, D. E., The Art of Computer Programming, Vol. 2, Addison-Wesley
Publishing Company, Inc., New York (1969).
[24] Kuck, D. J., "Illiac IV Software and Application Programming", IEEE Trans.
of Computers, C-17 (August, 1968), pp. 758-769.
[25] Kuck, D. J. and Muraoka, Y., "A Machine Organization for Arithmetic
Expression Evaluation and an Algorithm for Tree Height Reduction",
unpublished (September, 1969).
[26] Kuck, D. J., "A Preprocessing High Speed Memory System", to be published.
[27] Logan, J. R., "A Design Technique for Digital Squaring Networks", Computer
Design (February, 1970), pp. 84-88.
[28] Minsky, L. M., Computation: Finite and Infinite Machines, Prentice-Hall,
Inc., New Jersey (1967).
[29] Motzkin, T. S., "Evaluation of Polynomials and Evaluation of Rational
Functions", Bull. A.M.S., 61 (1965), p. 163.
[30] Murtha, J. C., "Highly Parallel Information Processing System", in Advances
in Computers, Academic Press, Inc., New York, 7 (1966), pp. 2-116.
[31] Muntz, R. R. and Coffman, E. G., "Optimal Preemptive Scheduling on Two
Processor Systems", IEEE Trans. of Computers, C-18 (November, 1969),
pp. 1014-1020.
[32] Nievergelt, J., "Parallel Methods for Integrating Ordinary Differential
Equations", Comm. ACM, 7 (December, 1964), pp. 731-733.
[33] Noyce, R. N., "Making Integrated Electronics Technology Work", IEEE
Spectrum, 5 (May, 1968), pp. 63-66.
[34] Ostrowski, A. M., "On Two Problems in Abstract Algebra Connected with
Horner's Rule", in Studies in Mathematics and Mechanics Presented
to R. von Mises, Academic Press, New York (1954), pp. 40-48.
[35] Pan, V. Ya., "Methods of Computing Values of Polynomials", Russian
Mathematical Surveys, 21 (January-February, 1966), pp. 105-136.
[36] Ramamoorthy, C. V. and Gonzalez, M. J., "A Survey of Techniques for
Recognizing Parallel Processable Streams in Computer Programs",
Proc. of the Fall Joint Computer Conference (1969), pp. 1-15.
[37] Russel, E. C., "Automatic Program Analysis", University of California,
Los Angeles, Report No. 6972 (March, 1969).
[38] Shedler, G. S. and Lehman, M. M., "Evaluation of Redundancy in a Parallel
Algorithm", IBM System Journal, 6, 3 (1967), pp. 142-149.
[39] Squire, J. S., "A Translation Algorithm for a Multiple Processor Computer",
Proc. of the 18th ACM National Conference (1963).
[40] Stone, H. S., "One-pass Compilation of Arithmetic Expressions for a
Parallel Processor", Comm. ACM, 10 (April, 1967), pp. 220-223.
[41] Thompson, R. N. and Wilkinson, J. A., "The D825 Automatic Operating and
Scheduling Problem", in Programming Systems and Languages,
McGraw-Hill, New York (1967), pp. 647-660.
[42] Winograd, S., "On the Time Required to Perform Addition", JACM, 12
(April, 1965), pp. 277-285.
[43] Winograd, S., "The Number of Multiplications Involved in Computing
Certain Functions", Proc. of IFIP Conference (1968), pp. 276-279.
VITA
Yoichi Muraoka was born in Sendai, Japan, on July 20, 1942. He
graduated from Waseda University, Tokyo, Japan, in Electrical Engineering
in March, 1965 and started his graduate study at the graduate college,
Waseda University.
Since September 1966, he has been a research assistant with the
Illiac IV computer project in the Department of Computer Science of
the University of Illinois at Urbana-Champaign. In 1969 he received his
degree of Master of Science in Computer Science.
He is a member of the Association for Computing Machinery and
the Institute of Electrical and Electronics Engineers.