m A

r: M

Computer Science Deparlmen

TECHNICAL REPORT

The Use of Data Flow Information for

the Selection and Evaluation of Software

Test Data

P. Frank!

Technical Report 394 August 1988

3 <u

o J=

a\ (1-1 I w <o O

(X -H 4J *4-( 'O

H ■-• <0

u x: u-i Ti c

C/5 CU O JJ o

E - <U 0)

a

o

u ^ c ID (a 0)

Z Cu H

c

C ITJ O

e 4J

1-1 o O 0)

0) (A

NEW YORK UNIVERSITY

r.>r»j J

Department of Computer Science Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK. N.Y. 10012

^^•-•.•••.•••-;<

-•-•_#•«•,

*^i^%-«:

K

The Use of Data Flow Information for

the Selection and Evaluation of Software

Test Data

P. FrankI

Technical Report 394 August 1988

A disseration in the Department o? Computer Science submitted to the faculty of the Graduate School of Arts and Science m partial fulfillment of the requirements for the degree of Doctor of Philosophy at New York University This work was supported m part by the National Scence Foundation under Grant NSF-DC'^ ^

Copyright © 1987 by Phyllis G. Frankl All rights reserved.

To the meniorx' of m\' aunt, Paula F. Silver

- IV -

The Use of Data Flow Information for the Selection and Evaluation of Software Test Data

Phyllis G. Frankl

Elaine J. Wevuker, Research Advisor

Tvso families of software test data adequacy criteria, each based on data flow analysis, are defined for programs written in Pascal. Their formal properties are investi- gated and interactive softsvare testing tools based on them are described.

The first of these families, the data flow testing criteria, was previously defined for programs written in a simple language. We extend the definitions to apply to programs written in Pascal. The data flow testing criteria are based purely on the syntax of the pro- gram being tested. They require that the test data execute certain paths from program points at which variables are defined to program points at which those definitions are used. We describe the design and implementation of a software testing tool, ASSET, based on the data flow testing criteria.

A serious weakness of the data flow testing criteria is that for some programs there exists no set of test data which is adequate for testing the program according to these cri- teria. This problem arises due to unexecutable paths in the program. The second family of criteria, the feasible data flow testing criteria, circumvent this problem by eliminating from consideration those definition-use associations which can never be exercised. Wc show that certain fonnal properties of the feasible data flow testing criteria differ significantly from those of the data flow testing criteria.

- V

Since it is undecidable whether a given set of test data satisfies a given feasible data flow testing criterion, feasible data flow testing cannot be full\' automated. However, it can be partially automated. We describe a heuristic method, the path expression method, which attempts to determine whether a given definition-use association can be exercised. The path expression method is based on a combination of data flow analysis and sym- bolic evaluation. We introduce a new symbolic evaluation technique which is more gen- eral, but essentially no more expensive, than symbolic execution. The path expression method, along with ASSET, constitute a tool which partially automates feasible data flow tesfin<z.

VI -

Acknow Icdyements

I would like to thank my advisor. Elaine Weyuker, for her invaluable technical gui- dance and moral support. Stewan Weiss, Deberah Rennels. Ernie Campbell, Valerie Ban- and Eugene Rodolphe participated in many useful discussions of this work. I would also like to thank the many friends who put up with me while I worked on this thesis. Finally, I would like to thank my parents, Daniel and Estelle Frankl. who provided unwavering support and encouragement throughout mv education.

- Vll -

Table of Contents

0. TABLE OF CONTENTS vii

1. INTRODUCTION 1

1.1. Background 1

1.2. Terminology 3

1.3 Limitations of Testing 5

1 .3. Related Work 6

1 .4. Thesis Over\'iew 9

2. THE DATA FLOW TESTLNG CRITERIA 12

2.1. Definitions of the Data Flow Testing Criteria 12

2.2. Axiomatic Evaluation of the Data Flow Testing Criteria 29

3. DESIGN AND IMPLEMENTATION OF ASSET 46

3. 1 . Introduction 46

3.2. The Design of ASSET 47

3.3. The Implementation of ASSET 57

4. THE FEASIBLE DATA FLOW TESTING CRITERIA 60

4.1 . Definitions of the feasible data flow testing criteria 60

4.2. Axiomatic evaluation of the feasible data flow testing criteria 72

4.3. A generalized notion of executability 78

5. THE PATH EXPRESSION HEURISTIC 82

5. 1 . Introduction 82

5.2. Intuition Motivating the Path Expression Method 84

5.3. Definition of path expressions 91

5.4. Partial Symbolic Evaluation of Path Expressions 97

5.5. Precise Description of the Path Expression Method 109

5.6. Enhancements and Limitations 133

6. CONCLUSION 143

6. 1 . Summary 143

6.2. Future work 1 47

BIBLIOGRAPHY 149

- Vlll -

APPENDICES

I. ASSET User Manual 153

II. Algorithm for Derivation of Initial Path Expression 174

III. Script of Path Expression Method Session 193

CHAPTER 1: Introduction

1. Background

The development of techniques for testing programs is among software engineering's most perplexing problems. Ideally, these techniques should be amenable to automation and should yield test sets which are small enough to be useful in practice yet which are likely to expose errors in the program being tested. Unfonunately, there are many theoretical and practical obstacles to the development of such techniques.

It is often stated that the goal of testing is the deieciion of errors. That is. the goal of testing is to find an input which causes the program's output to differ from the specified output. This distinguishes testing from program verification in which one attempts to mathematically prove the correctness of the program. Of course, if the pro- gram has no errors, then the testing process can never achieve this goal. A more realistic, though less precise summar>' o^ the aim of testing is, "the goal of testing is to detect errors or, failing that, to increase confidence that ihe program is correct."

To test a program P, one chooses a set T of possible inputs to P. runs P using the elements of T as inputs, and compares actual outputs to the expected outputs. If an error is found, then the testing phase icmiinates and the debugging phase begins. If no error is found, one can either choose additional test data and continue the testing phase, or can release the software to the user. If T is deemed to be a "good" test set, then the fact that running the program on the elements of T has failed to expose errors in the program should increase contidence that the program is correct.

Two key questions in software testing research are. "how should software test data

- 2 -

be selected?" and "how can one tell whether a program has been tested 'enough'?". A test data selection method is a procedure for choosing test cases for a program. A test data adequacy criterion is a predicate used to evaluate the test data, that is, to determine whether a program has been tested "enough". Given a test data selection method M and an adequacy criterion A, we can summarize the testing process with the following algo- rithm:

begin

initialize test set T to the empt>' set;

error_detected := false;

adequately_tested := false;

while (not error_detected) and (not adequately_tested) do

begin

if T is adequate for testing program P whose specification is S

according to adequacy criterion A then adequately_tested := true else begin

choose set T' of additional test cases

according to test data selection method M: run P on all inputs from 7', comparing outputs

to the specified outputs; if error is detected then error_detected := true; T:=Tur' end end

if error_detected then report error

else release program P, cenifying that it has been adequately tested according to criterion A; end:

It is possible to define test data selection methods and test data adequacy criteria in completely ad hoc ways. It seems reasonable, however, to expect that test data selection methods and adequacy criteria which exploit available information will yield test sets which are more likely to expose those errors which exist in P.

There are several sources of information available to aid in the selection and evalua-

-3-

tion of software test data. These include the program text, the program's specification, and information about common types of errors. Those techniques which are based pri- marily on the program text are called program based, those based on the specification are called specification based, and those based on information about common errors are called error based. Because program specifications are often somewhat vague docu- ments written in natural language, and because of the scarcity of information on "com- mon errors", program based methods are currently more amenable to automation than are specification based or error based methods. The software testing strategies examined in this thesis are program based.

Our emphasis in this thesis will be on adequacy criteria. Note however that there is a correspondence between test data selection methods and test data adequacy criteria. Corresponding to an adequacy criterion C, there is a test data selection method A/c, which, in effect, says, "select a test set which satisfies criterion C". For example, if the adequacy critenon says that the test data must cause the execution of everj- statement in the program, then the corresponding selection method would involve examining the pro- gram to find inputs which cause the execution of particular statements. Similarly, corresponding to a selection method M. there is an adequacy criterion method O/, which says. "A test set is adequate if it has been produced by method M". Thus, while this thesis will mainly discuss adequacy criteria, the reader should bear in mind that these cri- teria also implicitly correspond to test data selection methods.

2. rcrnn'ndlojjv

Let U be the set of numbers and characters rcprescniable on some panicular com- puter system The mpui space, 1. and otirput space, O. are both equal to the set of finite

sequences of elements of U. A specification is a relation ScIxO. The set D^= (iel B oeO such that (i,o)€S} is called the input domain of specification S. A program P defines a panial function (which by abuse of notation we will also call P) P : I->0. The input domain of program P is set D^={iell program P halts on input i} ={iell P(i) is defined). Those inputs which cause program P to crash or to go into an infinite loop are not in D^.

Program P is correct with respect to specification S if and only if

1) D^eZ)'', and

2) VieD'''((i,P(i))eS.

That is, P is correct with respect to S if and only if for each element i of the input domain of S, P halts on input i, returning a value which is in accordance with the specification.

A test set T for a program/specification pair, (P,S) is a subset of the specification's input domain. Throughout this thesis, when we address the question of whether a test set T satisfies an adequacy criterion, we will also assume that T is a subset of the program's input domain. This is a reasonable assumption because one normally applies an ade- quacy criterion after the program has run successfully on all elements of the test set. When no ambiguity results, we will omit reference to the specification S and assume that

A program unit's structure can be represented abstractly b\ its_/?ou' graph, a single- entr\', single exit directed graph in which each node represents a sequence of statements which are always executed as a unit and each edge represents a conditional or uncondi- tional transfer of control. A path is a sequence i , . . . ,a!,„) of nodes such that («,,«,+i) is an edge for l<i<m. A complete path is a path («],... ,«,„) in which «i is the entrj'

-5-

node and n^ is the exit node of the flow graph. If G is the flow graph of a program unit P, we will sometimes refer to a path through G, as a path through P. In chapter 2 we will give a more precise definition of the flow graphs for the class of program units we are

considering.

3. Limitations of Testing

We now take note of some o^ the inherent limitations of program testing as a valida- tion strategy.

We first note thai it is usually impractical to prove a program correct by testing it. Let T be a proper subset of the input domain, D^ of a specification S. Let /oeD'^-T be a point such that for some y, (fo,y)^S. Then there is a program P such that (t.P(t))€S for all teT but (/o.PUo)) ^S. P is incorrect with respect to S, but this fact is not exposed by testing P on the test set T. Thus the only way to prove a program correct by testine it is by testing it on every element of the D^ (e.xcept perhaps those elements i for which (ylS(i,y))=U).

Another limitation of testing stems from the fact that in order for testing a program to be meaningful, there must be some way of determining for each test case t whether (t.P(t))6 S. This may be a difficult problem, for example, if the amount of data produced by the program is ver\' large, or if the program is written to "find the answer" to some question for which the answer is not known [WEY82].

L'ndecidability results place limitations on the automatability of program based test- ing. A node, branch, or path is executable if and only if there is some element of the input space which causes it to be executed llic following undecidabiliiy results place fundamental limitations on the feasibility of automating the program validation process

-6-

[WEY79,DAV83]: There is no algorithm to determine whether

( 1 ) a given program P halts on a given input t,

(2) a panicular statement, branch, or path is executable,

(3) a particular path from the entr)' to the exit of a program is executable.

Problem (1) is the famous halting problem; the halting problem can easily be reduced to the problems in (2); the unsolvability of (3) follows from the unsolvability of Hilbert's Tenth Problem [DAV73].

4. Related Work

Several software test data adequacy criteria are based on the idea that one cannot consider a program to be adequately tested if certain sequences of statements have never been executed by any test data. These methods, known as structured testing, associate a lest set T, with the set Fl of paths through program P"s flow graph which are executed when P is run with inputs from T. The test set T. or equivalently the set of paths n, is C-adequate for P if and only if each of the sequences required by C is a subpath of one of the paths in H.

The strongest of the structured testing criteria is path testing, which requires that every path through the program's flow graph be executed by the test data. Path testing is impractical for all but the most trivial programs, since any program having a loop has infinitely many paths. Nevertheless, it is useful to study path testing because some of the limitations of path testing apply a fortiori to the other structured testing techniques.

One limitation of path testing is that it cannot guarantee the detection of so-called missing path errors [G0075]. If the program fails treat some special case, there will be no path in the program which corresponds to that special case, and hence the test data

-7-

uill not be required to exercise that special case. In fact, a similar problem plagues all program based testing strategies: if the program lacks some feature which it is supposed to have, analysis of the program text will fail to suggest the inclusion of test data aimed at exercising that feature.

A second limitation of path testing arises due to coincidental correctness. Suppose that test cases t and t' both cause path tt to to be executed, and that (t,P(t))eS, while (r',P(r'))^S. Then a test set T which includes t but not ;' will execute path tc without exposing the error. For example, suppose that path n is supposed to read a number, square it, and print the result; hut instead it reads a number, doubles it, and prints the result. Then the input t=2 v. ill execute n but fail to expose the error.

Statement testing and branch testing are well-known structured testing methods which are weaker than path testing. Statement testing requires that even,' statement (equivalently. ever>- node in the program's flow graph) be executed, while branch testing requires that ever>' edge in the program's flow graph be executed. These criteria are based on the intuition that one cannot feel ver\' confident that the program is coirect if some statement or bran.h has never been executed by any of the test data. Unfor- tunately, statement and branch testing can fail to expose many common errors (nO\V7S,m A7.S|. Several criteria which are based on analysis of the program's control flou and which are stronger than branch testing but weaker than path testing have been proposed |HCJ\\7K.M1L74.\VOOSO].

More recently, a number of le.st data adequacy criteria \Ahich are based on data flow- analysis, some of which "bridge the gap " between branch testing and path testing, have been proposed and studied |I1I:R76.R.\Ps:,R Ars^l.AS83.NTAS4.CLA85J. Tools based on some of them have been implemented |FRA85a.FRA85b.GIR85.KOR85J.

-8-

These criteria are based on the intuition that one should not feel confident that a variable has been assigned the correct value at some point in the program if no test data causes the execution of a path from the assignment to a point where the variable's value is subse- quently used.

Rapps and Weyuker [RAP82,RAP85] defined a family of criteria based on data flow analysis for programs written in a simple language having only simple variables, input/output statements, assignment statements, and conditional and unconditional branching. The basic idea of these criteria is to require the test data to exercise cenain paths from points in the program at which a variable is defined to points at which that definition is used. Rapps and Weyuker compared the various criteria in this family to one another and to statement, branch and path testing. The family of criteria defined in chapter two of this thesis is obtained by extending the definitions of the Rapps and Weyuker criteria to apply to programs written in a large subset of Pascal.

The criterion defined by Herman is similar to one of the Rapps and Weyuker cri- teria, the "all-uses" criterion. Roughly speaking, The criteria defined by Ntafos [NTA84] requires the test data to exercise sequences of length k of definition-use pairs, where a variable .vi is defined in node / 1 and used in node i 2. and variable X2 is defined in node j2 and used in node ;3, etc. The criteria defined by Laski and Korel [LAS83] are based on the intuition that the value assigned to a variable at a particular node n may depend on the values of several variables, each of which could possibly have several definitions which reach node n. The criteria require the test data to exercise paths along which vari- ous combinations of these definitions occur. Clarke et. al. [CLA86] compared the criteria defined by Rapps and Weyuker. by Ntafos and by Laski and Korel to one another.

-9-

An adequacy criterion C satisfies the applicabiliry property if and only if for everv' program P there exists some test set which is C-adequate for P [WEY86]. One would expect a "good" adequacy criterion C to satisfy the applicability propeny. However, none of the structured testing criteria satisfy the applicability propeny. The problem arises due to the fact that some path which is required by the criterion may be unexecut- able. Furthermore, for each of the structured testing criteria, it is undecidable whether a test set exists which adequately tests a given program. In chapter 4 we will define a fam- ily of adequacy criteria uhich is derived from the Rapps and Weyuker data flow testing criteria, and which circumvents this problem. A similar approach could be taken to deriving applicable criteria from the other structured testing criteria.

5. Thesis Over\ icw

In this thesis we examine iv\o families of test data adequacy criteria. For each of these families, we define the criteria for programs written in a large subset of Pascal, we investigate formal properties of the criteria, and we discuss software testing tools based on the criteria.

The first of these families, the dam flow testing criteria, is derived from the family of criteria defined by Rapps and Weyuker, These criteria, which we define in section 1 of chapter 2, are based purely on the syntax of the program being tested. They require the test data to execute certain paths from points in the program at which a variables are defined to points where those definitions are used. The criteria are defined in such a way that the relationship among them is the same as the relationship among Rapps and Wcyuker's family of criteria.

Weyuker has defined eight axioms which, she argues, should be satisfied b> a

- 10-

"good" adequacy criterion [\VEY861. In Secrion 2.2, we explore some of the properties of the data flow testing criteria by examining which of the criteria satisfy which of Weyuker's axioms. We show that the criteria have some serious shortcomings, including failure to satisfy the applicability property.

In chapter three, we describe the design and implementation of a tool, ASSET, which is based on the data flow testing criteria. ASSET takes as input a program P, a test set T, and one of the data flow testing criteria, C, and determines whether T is adequate according to C for testing P.

The second family of criteria, the feasible data flow testing criteria, is defined in chapter 4, section 1. TTiese criteria, which are derived from the data flow testing criteria, take into account some semantic information about the program being tested. The cri- teria are defined in such a way as to insure that they satisfy the applicability propeny. We show that that the relationship among the family of feasible data flow testing criteria differs significantly from the relationship among the family of data flow testing criteria. In section 4.2 v.e show that most of the feasible data flow testing criteria satisfy all of Weyuker's axioms.

Unfortunately, for any feasible data flow testing criterion C there is no algorithm to determine whether a given set of test data is C-adequate for testing a given program. The problem arises because in order to determine whether T is C-adequate, it may be neces- sary' to determine whether some definition-use association can ever be exercised. That is, it may be necessary to determine whether there is an executable path, along which some variable v is not redefined, from a node d to a node u.

In chapter 5 we present a heuristic which attempts to detemiine whether any test data exists which exercises a civen definition-use association. The heuristic is based on a

- 11 -

combination of symix)lic evaluation and data flow information. This heuristic, along with ASSET form the basis for an interactive feasible data flow testing tool.

- 12-

CHAPTER 2:

Tht Data Flow Testing Criteria

1. Definitions of the Data Flow Testing Criteria

A family of test data adequacy criteria, based on analyzing the data flow charac- teristics of the program being tested, was defined in [RAP82,RAP85]. These criteria, which we call dataflow testing criteria, or DP testing for short, were originally defined for a very simple universal programming language consisting of assignment statements, conditional and unconditional transfer statements, and I/O statements. They require that the test data exercise certain paths from a point in a program where a variable is defined to points where the variable is subsequently used. A tool. ASSET, which performs DP testing on programs written in such a language is described in [FRA85a].

In order to make data flow testing more practical, we have extended it to apply to a large subset of Pascal, and have enhanced ASSET accordingly. The basic ideas behind data flow testing apply to testing programs written in other imperative languages, but for precision it is necessary to specify a panicular syntax. We now present the extended theory of data flow testing.

We apply DP testing to an individual subprogram, i.e., a main program, a pro- cedure, or a function. To execute a procedure or function P. we must call it from a driver program. Thus, to test a procedure or function P we need a test-set'drivcr-program pair (1 .D), uherc I) is a program which might call P and T is a subset of the input domain of the specification for D. Obviously, the path (or paths) through P which is executed when a panicular test case is input to D will depend on D, as well as on the test case. \S c will often omit reference to the dnver program, when ii is obvious which dnvcr program is

- 13-

calling the subprogram. Similarly, we may omit reference to the driver program if it sim- ply reads in the arguments to the subprogram in order, then calls the subprogram once.

As a technical convenience we assume that the subprogram being tested has no goto statements, no with statements, no variant records, no functions having var parameters, no procedural or functional parameters, and no conformant arrays. It would not be difficult to relax these assumptions. We also assume that in every conditional statement, the Boolean expression which determines the flow of control has at least one occurrence of a variable or a call to the function eof or to the function eoln.

A subprogram can be uniquely decomposed into a set of disjoint blocks of state- ments. A block is a maximal sequence of simple statements having the properties that it can only be entered through the first statement and that whenever the first statement is executed, the remaining statements are executed in the given order. The subprogram to be tested is represented by a flow graph in which the nodes correspond to the blocks of the subprogram and edges indicate possible flow of control between blocks. As a techni- cal convenience, some nodes uhich correspond to empty sequences of statements may also be added to the flow graph. Figure 2.1.1 shows the subgraphs corresponding to statements in the language. The subprogram's flow graph is obtained by merging the exit node of each statement with tlie entn.' node of the following statement. An entr>' node preceding the first statement of the procedure and an exit node succeeding the last state- ment are added.

We will call a flow graph formed according to the rules in Figure 2.1.1 a structured flow graph. All of the flow graphs considered in this thesis will be structured flow graphs. The "extra" nodes, corresponding to empty sequences of statements, ha\e been added as a technical convenience, in order to simplify some of the algorithms in Chapter

FIGURE 2.1.1

Control flow and da:a flow for statements in the language

,S.i deno:e> the subgraph corresponding to statement Si.

SI.MPLE STATEMENTS .Assignment statement: v:= expr;

Node 1 has cuscs of each variable in -^

exor followed bv a definition of v. ^

Input. Output statements:

read(vl vnl:

readln( vl vni:

read(f.vl vnl:

readln(f,vl vnl;

Node 1 has definitions of vl \n.

If the '-le variable f is present then node i

also has a c-ase followed bv a definition of f\ ■»■

_ : ^

w-ritefel enV V

wruclniel en):

writctf.el en);

w ritclnff.cl en);

Node 1 has c-uses of each variable occurring in cl, en.

If the file variable f is present then node i also has a definition followed b\ a c-use of f*.

Procedure call: Picl .en); f

Node J ha> c-uscs of each variable occuring in

the expressions cl en.

T.-c*c arc fo.l wed b> defin:tions of each actual ^

parameter w.T..n corresponds to a >Br formal parameter ^

Nodes ; and ic are indujed to assure that ?^

the procedure call has its own node. v

tcc>nt:Dued on neit pagci

REPETITIVE STATEMENTS

while statement:

while B do S.

Let h be the entry node to subgraph S. Edges (i.h) and (i.j) have p-uses of each variable in the boolean expression B.

®,

for statement:

for v:= el to e2 do S;

for v:^ el downto e2 do S;

Let tmp be a new variable. Let f and g be the entry and exit nodes, respectively, of S. Node h has c-uses of each variable ;n el, followed by a definition of v and c-uses of each variable m e2 followed by a definition of tmp. Edges (i.f) and fi.j) have p-uses of V and tmp. Node g has a c-use followed bv a def of v.

(J.

repeat statement:

repeat SI;... ;Sn until B:

Let j be the entry node of SI. and let k be the exit node of Sn.

Edges (k.j) and (k,i) have p-uses of each variable in the boolean expression B.

(^continued on next page)

CONDITIONAL STATEMENTS

if-then-else statement

if B then SI:

■f B then 51 else S2;

Let k and j be the entry nodes of SI and 52. respectively. Edges (i.j) and (i.k) have p-uses of each variable in the boolean expression B. :f there ;> no else" part then subgraph 52 has a single node correspondmu to an empty block.

case el of

label-iistl : 51;

Y ^

labei-listn Sn end:

Let jl jn be the entry nodes of

51....,Sn. respectively.

Edges (i.jl) (i.jn)

have p-uses of each variable m the expression el.

<^. ^

i^>i

- 17-

5 and Appendix 11.

Data flow analysis was originally used for compiler optimization [HEC77,SCH73]. It generally classifies each variable occurrence as being either a definition, in which a value is stored in a memory location, a use, in which a value is fetched from a memor>' location, or an undefinition, in which the value and the location become unbound. For our purposes, we will also distinguish between two different types of uses. The first type directly affects the computation being performed or outputs the result of some earlier definition. We call such a use a computation use or a c-use. Of course, a c-use may indirectly affect the flow of control through the subprogram. In contrast, the second type of use directly affects the flow of control through the subprogram, and thereby may indirectly affect the computations performed. We call such a use ?i predicate use or p- use.

We will associate a sequence of definitions and c-uses with each node in the flow graph and will associate a set of p-uses with each edge in the flow graph. Figure 2.1.1 shows the classification of variable occurrences in the language's statements. In addi- tion, the entry node is considered to have a definition of each parameter, each non-local variable which occurs in the subprogram and of the input buffer input T, which may implicitly occur in calls to the standard procedures/functions read, readln, eoln and eof. The exit node has an undefinition of each local variable, a c-use of each variable parame- ter, a c-use of each non-local variable, and a c-use of the input buffer, input T.

We now discuss hov. data flow analysis is handled for structured variables. Since it is not in general possible to determine the particular array element which is being defined or used in an occurrence of an array variable. v.e will treat each entire array as a single entitv. If a is an arrav variable, anv definition of the variable a \e ] w ill consist of a c-use

- 18-

of each variable occurring in the expression e, followed by a definition of a. Any use of a [e ] will consist of uses of all of the variables occurring in e followed by a use of a.

Similarly, we will treat pointers purely syntactically, making no attempt to perform data flow analysis on dereferenced pointers. If p is a pointer variable, a definition of p T consists of a c-use of p followed by a definition of /^T, and a use of pT consists of a use of p followed by a use of /? T. Since it is not possible to statically determine the memor>' location to which a pointer points, we will ignore the definitions and uses of p T.

Each field of a record is treated as an individual variable. Any unqualified occurrence of a record is treated as an occurrence of each field of the record. Occurrences of file variables in I/O statements are handled by considering the effect of the statement on the file buffer.

Note that our model of data flow may not reflect the actual data flow in the subpro- gram being tested completely accurately. For example, we have made no attempt to per- form any interprocedural data flow analysis, have ignored dereferenced pointers, have made no attempt to disambiguate array references, and have ignored potential aliasing and side effects. In an optimizing compiler it is imperative that conser\'ative assump- tions be made about the tlou of data, lest a code transformation which changes the semantics of the program be pcrt'ormed. In the context of data flow testing, however, such caution is not strictly necessar>'. On the other hand, it seems reasonable to expect that more accurate data flow analysis will force the selection of better test data. In (rR.A85bl ^c compare the test data needed to adequately lest programs when each array is treated as a single entity to the test data required to adequately lest transformed pro- grams in which array references are di.sambiguaied and each element of the array is treated as an individual entity. More exploration of the trade off between the difficulty of

- 19- performing accurate data flow analysis and the quality of resulting test data is needed.

We are interested in tracing the flow of data between nodes, and thus define a c-use of a variable x in node i to be a global c-use if the value of x has been assigned in some block other than block i. Let x be a variable occurring in a subprogram. A path

(j,«i «m<y)< m>0, containing no definitions or undefinitions of x in nodes «],...,«„ is

called a definition clear path with respect to x (def-clear path wrt x) from node i to node j and from node i to edge (n,^J). A node i has a global definition of a variable x if it has a definition of x and there is a def-clear path wrt x from node i to some node containing a global c-use or edge containing a p-use of x. Since ever>- p-use is associated with a potential transfer of control from one node to another, there is no need to distinguish between p-uses and global p-uses.

We restrict the class of subprograms to which data flow testing applies to those sub- programs P satisfying the following two properties:

1. No-Syntactic-Undefined-P-use Property (NSUP):

For ever\' p-use of a variable x on an edge (i,j) in P, there is some path from the start node to edge (i,j) which contains a global definition of x.

2. Non-Straight-Line Propeny (NSL):

P has at least one conditional or repetitive statement.

Note that the NSL property guarantees that at least one node in P"s flow graph has more than one successor, and that at least one variable has a p-use in P.

The subprogram's dcf-use graph is obtained from the flow graph by associating with each node i. the sets c-u.veO\;={ variables which have global c-uses in block i) and rfe/0'j={ variables which ha\c global definitions in block i) and associating with each

-20-

edge (i,j) the set /7-w5eC/,y')={ variables which have p-uses on edge (i,j^). We also define sets of nodes dcu(x.i )={nodc^ j such that xGC-use(j) and there is a def-clear path with respect to x from i to j) and dpu(x,i)={cdgQS (j,k) such that xGp-use(j,k) and there is a def-clear path with respect to x from i to (j,k)). These definitions are summarized in Fig- ure 2.1.2.

V = the set of variables

N = the set of nodes

E = the set of edges

def(i) = {xe VI x has a global definition in block i}

c-use(i) = (xe VI x has a global c-use in block i)

p-use(i,j) = (x€ VI X has a p-use in edge (i,j) )

dcu(x,i) = {j N I X e c-use{j) and there is a def-clear path v\Tt x from i to j)

dpu(x,i) = {(j,k) e E I X p-use(j,k) and there is a def-clear path wn x from i to (j,k) )

Figure 2.1.2

Thus, if xe def(i) and je dcu(x,i) then x has a global definition in node i and c-use in node j and there is a definition clear path with respect to x from node i to node j. Therefore it may be possible for control to reach node j with the variable x having the value which was assigned to it in node i.

A dcfinition-c-use association is a triple (i,j.x) where i is a node containing a global definition of x and je dcu(x.i). A dcfinition-p-use association is a triple (i.fj.kt.xt where i is a node containing a global definition of x and (j,k) e dpu(x.i). A simple path is one in which all nodes, except possibly the first and last, arc distinct. A loop-free path is one in

which all nodes are distinct. A path {n\ Hj.rii,) is a dupath with respect to a variable

X if /J 1 has a global definition of x and cither

i) «i has a global c-use of \ and {n\ n^./Ji) is a def-clcar simple path with respect to

\. or

-21 -

ii) («y,«i) has a p-use of x and {n ],..., «^) is a def-clear loop-free path w.ith respect to x.

An association is a definition-c-use association, a definition-p-use association, or a du- path.

Recall that a complete path is a path from the entn,' node to the exit node of the flow graph. A complete path n covers a definition-c-use association (i,j,x) [respectively a definition-p-use association (i,(j.k),x)] if it has a definition clear subpath with respect to x from i to j [respectively, from i to (j,k)l. n covers a du-path n' if t:' is a subpath of K. A set n of complete paths covers an association if some element of the set does. A test- set/driver-program pair (D.T) covers an association if when input to D the elements of T cause the execution of the set of paths H, and H covers the association.

Roughly speaking, the family of data flow testing criteria is based on requiring that the test data execute definition clear paths from each node containing a global definition of a variable to specified nodes containing global c-uses and edges containing p-uses of

definition clear

that variable. For each variable definition we can demand that ^

isome_

all some

paths with respect to that variable from the node containing the definition to

reachable by some such path be executed. The criteria are defined pre cisely in Figure 2.1.3.

of

the

uses

c -uses

_p -uses.

If variable x is has a global definition in node i, the all-defs criterion requires the test data to exerci.se some path which goes from node i to some node or edge at which the value assigned to x in node i is used. The all-uses criterion requires the test data to exer- cise at least one path to each such node or edge. The all-du-paths criterion requires that all of the du-paths from i to each such node or edge be exerci.sed. The criteria all-p-uses,

THE DATA FLOW TESTING CRITERIA

A test-set/driver-program pair (T,D; satisfies criterion C for subprogram P if and only if for each node i in P's flow graph and each x defii) the set n of paths executed by T covers the following associations:

CRITERION

ASSOCIATIONS REQCIRED

All-defs

Some (i,j,xj s.t. jedcu(x,i) or some (i,(j,k),x) s.t. (i.k)edpu(x,i).

All-c-uses

All (i.j.x) s.t. jedcu(x,i).

A!l-p-uses All (i.(j.kl.\) s.t. ( j,k )6 dpu(\.i i.

All-p-uses/some-c-uses

All (i,(J,k).x) s.t. (j,k)€dpu(x.i). In addition, if dpu(x,i)=0 then some (i,j.x) s.t. jedcuCx.i). Note that since i has a global definition of x, dpu(x,i)=0 => dcu'x.i)*0.

All-c- use s/some-p- uses

All (i,j.x) s.t. jedcu(x,i). In addition, if dcu(x,i)=0 then some (i,(j,k),x) s.t. (j.k)€dcu(x,i). Note that since i has a global definition of x. dcu(x.i)=0 => dpu{x.i)*:0.

All-uses

All (i,j,x) s.t. j e dcu(x,i) and all (i,(i.k).x) s.t. (i.k)edpu(x.i).

AU-du-palhs

All du-paths from i to j with respect to x for each jedcu(x,i) and all du-paths from i to (j,k) with respect to x for each (j.k)€dpu(\,i).

For comparison we also define the criteria all-nodes (all-edges, all-paths, respectively) which require that fl cover every node (every edge. ever\ path, respectively) in the flow graph.

Huurc 2.1..^

-23-

all-c-uses, all-p-uses/some-c-uses, and all-c-uses/some-p-uses place emphasis on either c-uses or p-uses. Note that any subprogram has only finitely many definition-use associa- tions, so none of the DF criteria requires an infinite amount of test data. Upper bounds on the amount of test data required by the DF criteria are established in [WEY84].

Criterion C \ includes criterion C2 if and only if for even,' subprogram, any test- set/driver-program pair which satisfies C\ also satisfies C^- Criterion C\ strictly includes criterion C2, denoted C \ => C2, if and only if C \ includes C 2 and for some subprogram P there is a test-set/driver- program pair which satisfies C2 but does not satisfy C \ . The notion of subsumption in [CLA85] is similar to the notion of inclusion.

Rapps and Weyuker proved that for the simple language for which DF testing was originally defined, the relationship among the criteria is as shown in Figure 2.1.4 [RAP85]. Clarke et. al. [CLA86] have shown the relationship of the criteria defined by Laski and Korel [LAS 83 J and Ntafos [NTA84] to the DF criteria.

In extending the theor\- of DF testing to apply to programs written in Pascal we have preserved the inclusion relations among the DF criteria. Doing so required the inclusion of definitions of all non-local variables in the entn.' node of the procedure and careful treatment of implicit uses of the variable inpuA . For symmetry, we have also added the all-c-uses criterion, which was not defined in [RAP85].

THEOREM 2.1.1:

The relationship between the data flow testing criteria is as shown in Figure 2.1.4. That is, DF criterion CI includes criterion C2 if and only if the inclusion is explicitly shown in the figure, or if it follows from the transiti\ity of the relation.

PROOF:

ALL-PATHS

ALL-DL-PATHS

ALL-LSES

ALL-C-LSES.SOME-P-LSLS ALL-P-USES SOME-C-USES

ALL-C-USES ALL-DEFS ALL-P-LSES

ALL-EDGES

ALL-NODES

FIGLRF 2.1.4 Thf relationship among the duia Hoh tesiinf; criteria

-25-

Rapps and Weyuker proved a theorem analogous to this one for the data flow testing criteria as originally defined in [RAP85]. Some minor modification of their proof is needed to account for the fact that our flow graphs differ from those used by Rapps and Weyuker. For symmetry, we have also added the all-c-uses criterion, which was omitted from the original family of criteria.

The proofs of all of the inclusions except (all-p-uses includes all-edges) are obvious and will be omitted. The proofs of strictness of inclusions and of incomparabiliiy are essentially the same as the corresponding proof in [RAP85]. The same programs and test sets used in that proof can be used here. Although there are some minor differences in the form of the flow graphs corresponding to these programs, the proofs go through.

It remains to show that all-p-uses => all-edges, that all-c-uses/some-p-uses => all-c- uses, and that all-c-uses is incomparable w ith the other criteria.

Proof that all-p-uses => all-edges:

Let T be a test set which satisfies all-p-uses for P and let Y\ be the set of paths through P"s flow graph which are executed by T. Let (i.j) be an edge in the flow graph.

case 1: (i,j) is labelled.

Then since P satisfies the NSUP property, there is a definition-use association (k,(i,j),x) for some variable x and node k. Since T satisfies all-p-uses, H covers (k,(i,j),x), so, n covers edge (i,j).

case 2: (i.j) is not labelled.

case 2a: There only one path from the entr}' node to edge (i,j).

Then every path in n covers edge (i.j). Since P satisfies NSUP and NSL. FI is not empty, so there is at least one such path. Thus P covers edge (i,j).

-26- case 2b: There is more than one path from the entr\' node to edge (i,j). Let n be the unique node such that

1) n has more than one successor, and

2) there exists exactly one acyclic path from node n to edge (ij).

Let m be the unique successor of n such that there is an acyclic path from n to m to (i,j). Then ever>' path uhich covers edge (n.m) covers edge (i,j). Since n has more than one successor, edge (n,m) is labelled, so (by NSUP), there is a def-p-use association {k,(n,m),x). n covers (k,(n,m),x), hence covers (n,m), hence covers (i,j).

FYoof that all-c-uses/some-p-uses => all-c-uses.

It is obvious that all-c-uses/some-p-uses includes all-c-uses. To see that the inclu- sion is strict, consider the program whose flov. graph is shoun in Figure 2.1.5. The only definition-c-use associations in this program are (4,8,x) and (l,10,inputT). The set T={(0,0)) satisfies all-c-uses for this program, but it does not cover either of the definition-p-use associations (3,(6.7).z) or (3,(6,8),z), hence does not satisfy all-c- uses/some-p-uses.

Proof that all-c-uses is incomparable with the other criteria:

The set T=((0,0;) satisfies all-c-uses for the program shown in Figure 2.1.5. but does not satisfy all-p-uses/some-c-uses. all-p-uses, all-edges, all-nodes, or all-defs, so all-c-uses does not include any of these criteria.

To see that none of the criteria all-p-uscs/some-c-uses. all-p-uses. all-edges, all- nodes, or all-dcfs include ihc all-c-uscs criterion, consider the program shown in Figure 2.1.6 The test set |(-1V(5)) satisfies the criteria all-p-uses/some-c-uses. all-p-uses. all- edges, all-nodes, and all-defs. However, it does not cover the def-c-usc association

■^ read(v,z;;

v<0

y

\y> = o

\

z:=0 3.

-I , x:=0

/

\

z<0/ ^^•.^z>:^0

8 write(x;;

10

FIGIRR 2.1.5

y<0

©

x:= 1; ^ read(y);

y> = 0

4, x: = 0:

\

^t^

writeln(x);

\

^ wnteln(x)

(3

FICIRK :.1.6

-29- (4,7, x), and hence does not satisfy the all-c-uses criterion.

2. Axiomatic Evaluation of Data Flou Testing Criteria

Weyuker has defined eight axioms which, she argues, should be satisfied by a "good" adequacy criterion [WEY86]. In this section we explore which of these axioms are satisfied by the data flow testing criteria.

Weyuker's model of computation assumes that the objects being tested are single- entry, single-exit programs which take all of their inputs at the beginning and which have a single output statement at the end. We v.ill make certain minor modifications in the axioms in order to make them apply to testing Pascal subprograms.

Axiom 1: Applicability

For every subprogram, there exists an adequate test set.

None of the data flow testing criteria satisfy the applicability axiom. Consider the following program, whose flov. graph is shown in Figure 2.2.1.

program Applicability_CounterExample:

var x,y,z: integer;

begin

read(x.y,z); if x>Othen begin

v:=0; z:=0; end;

if X <0 then begin

WTiteln(y):

if z>() then writeln('hello') end, end.

There are no executable paths which cover the association f3.8.y). which is required by the all-defs and all-c-uses criteria, nor the associations (3,(8,9).z), or (3,(9, 10), z),

®

x> = 0.

^2^^ read(x.y,2),

r:=0 "^

^^

,rf

x<0.

writelnfj): ^7

wntelnrhello'): ^9

HuL Kt :.:.!

-31 -

which are required by the all p-uses criterion. Therefore no test set is C-adequate for testing this program, for any data flow criterion C. Note that this program does not have any unexecutable code.

While the above program was contrived to demonstrate that none of the criteria satisfy the applicability axiom, there are many "realistic" subprograms for which no set of test data is C-adequate when C is all-p-uses, all-c-uses, all-c-uses/some-p-uses, all-p- uses/some-c-uses. all-uses, or all-du-paths. For example, any subprogram having a for loop in which the upper bound is always greater than or equal to the lower bound has a def-p-use association which is not covered by any executable path. In Figure 2.2.2, the only executable path from the definition of i in node 1 to the p-use of i on edge (2,4) is the path which goes around the loop 5 times. Since this path has a definition of i, we can conclude that no executable path covers the definition-p-use association (1,(2,4), i).

Axiom 2: Non-Exhaustive Applicability

There is a subprogram P and test set T such that P is adequately tested by T and T is not an exhaustive test set.

All of the data flow testing criteria satisfy this axiom. Consider the following pro- gram whose flow graph is shown in Figure 2.2.3.

program Non-exhaustive_Applicability( input, output);

var X : integer;

begin

read(x);

if x>0 then wnteln('hello')

else writeln('goodbye') end

The test set T={(0),(1)) is C-adequate for testing this program, where C is any data flow testing criterion. On the other hand, the exhaustive te.st set for this program is the set of

^'i:=l

S:

i:=i-l:

Subgraph corresponding to the statement for i:= 1 :o 5 do S;"

FIGLRE 2.2.2

G

X

2 , read(x);

writeln('hello'); ''J

^ writeln('goodbye');

6)

FIGURE 2.2.3

-34- iniegers.

Axiom 3: Monotonicity

If T is adequate for P and TqT' then T' is adequate for P.

It is obvious that all of the data flov. criteria satisfy the monotonicity property since all of the associations covered by T are also covered by 7"'.

Axiom 4: Inadequate Empty Set

The empty set is not adequate for any subprogram.

Since every subprogram in the class we are considering satisfies the No-Syntacric- Undefined-P-uses Propeny and the Non-straight-line Property, each subprogram has at least one definition-p-use association. It follows that the empty set is not adequate for any of the criteria all-p-uses. all-p-uses/some-c-uses. all-c-uses/some-p-uses, all-uses, or all-du-paths.

Since even,' subprogram has a definition of inpuA in the entry node and a c-use of inpui^ in the exit node, the all-c-uses criterion satisfies the inadequate empty set axiom. Note, however, that the fact that all-c-uses satisfies this axiom can be construed as being an anifact of the definitions.

.Axiom 5: Aiili-tAliiisioii.ilitx

There arc subprograms P and Q such that P is equivalent to Q, T is adequate for P. but T is not adequate for Q.

All of the data flow testing criteria satisfy the Anti-extensionality propeny. Con- sider the program*; P and Q uhosc flow graphs are shown in Figure 2.2.4. Both of ihcsc programs lake as input a single integer, x. and output the absolute value of x The test set

read(xl

x<0

-X .3

x> = 0

4 v:=x;

« r::ein(v);

FLOWGRAPH FOR PROGRAM P

readlx ';

x< = 0.

s : - - X J*

N,

\

.^>0

X

4 v:=x;

wntelnlv);

FLOW GRAPH FOR PROGRAM Q

FIGLRF 2.2.4

-36-

T={(-1),(0)) is C-adequate for P for any data flow testing criterion C. However, since T does not cover node 3 in Q, it does not satisfy the all-nodes criterion, hence does not satisfy all-p-uses, all-p-uses/some-c-uses, all-uses, or all-du-paths for Q. T does not cover the def-c-use association (4,5, y) in Q, which is required by the all-defs criterion, and hence does not satisfy the criteria all-defs, all-c-uses, all-c-uses/some-p-uses for Q. So while T satisfies all of the criteria for P, it satisfies none of them for the equivalent program Q.

Axiom 6: General Multiple Change

There are subprograms P and Q which are the "same shape", and a test set T such that T is adequate for P, but T is not adequate for Q. (Roughly speaking. P and Q are the same shape if P can be transformed into Q by performing one or more simple transformations such as replacing a relational operator with another relational operator, replacing a con- stant with another constant, etc.)

All of the data flow testing criteria satisfy this axiom. Consider again the programs in Figure 2.2.4. These programs are the same shape because Q is obtained from P by changing the relational operator in an if-then-else statement. As shown above. The test set T=((-1).(0)) satisfies all of the data flow testing criteria for P, but satisfies none of the criteria for Q, Note that when considering the anti-cxtcnsionality axiom, the relevant feature of programs P and Q was their semantic closeness, but when considering the gen- eral multiple change axiom, the relevant feature is their syntactic closeness.

Avioni 7: Anliditoniposilion

There exists a subprogram P and a "component" Q such thai T is adequate for P. T' is the set of vectors of values that that variables can ussutnc u:i ciiujikc iv <,> Iit some t of T,

-37- and T' is not adequate for Q.

A component of P is a contiguous sequence of statements of P. Since our definitions of the data flow testing criteria apply to testing a subprogram, not a com- ponent, we will modify the axiom slightly. The modified axiom captures the intuition behind Weyuker's axiom, but fits our model of computation.

Antidecomposition :

There is a subprogram P which calls a subprogram Q and a test T such that T is ade- quate for the program P' obtained from P by expanding in-line one call to Q, but the test-set/driver-program pair (T,P) is not adequate for Q.

None of the data flow testing criteria except all-c-uses and all-c-uses/some-p-uses satisfy the antidecomposition axiom. The fact that all-c-uses and all-c-uses/some-p-uses do satisfy the axiom is an artifact of the definitions, which we will discuss below.

With certain exceptions which we will examine below, each definition-use associa- tion (3 in Q corresponds to a definition-use association a' in the ponion of P' which corresponds to Q. The test case which covers a' when input to P' will also cover a when P is run.

For example, consider the following programs, whose flow graphs are shown in Fig- ure 2.2.5.

program P(input.output):

var X : integer;

procedure Q(x: integer);

begin

if x>0 then writelnChcllo') else writeln( 'goodbye";

end: (Q)

begin IP)

read(x):

Q(x);

-38 write(x)

end.

program /"(input, output); var x,y,tl : integer; begin (P)

read(x);

(in-line expansion of call to Q begins}

tl := x;

if tl>Othen wTiteln('hello')

else writelnC 'goodbye')

{in-line expansion of call to Q ends)

write(x) end.

The association (6,(7,8),x) in Q, which arises from the fact that the parameter x has a definition in node 1 and a p-use on edge(7,8), corresponds to the association (13,(13, 14),tl) in P'. Both are covered by any test case consisting of an integer which is greater than or equal to 0.

The exceptions arise due to the fact that we have given the exit node of a procedure a c-use of each var parameter and each global variable. If some global variable (or var parameter) has a definition in Q which reaches the exit node of Q and if that variable (or the actual parameter corresponding to the var formal parameter) is never used at any point in P which is reachable from the call to Q. then Q will have a definition-c-use asso- ciation which does not correspond to any definiiion-c-use association in /".

For example, consider the programs whose flow graphs are shown in Figure 2.2.6. Let y be a global variable Procedure Q has a definition of y in node 9 which reaches the exit node of Q and which is never used in the calling program, P. The test set T={(0)) is adequate for P' according to the all-defs. all-c-uses. and all-c-uses/somcp-uses criterion. However. 1 is not adequate for Q according to any of these criteria. Because it is non-

0

^ read(x);

A

x> = 0.

,x<0

^' Q(x); writelnChello')r(8^

\^ wriieln('goodbye');

^ wrue(x)

FLOWGRAPH FOR P

FLOWGRAPH FOR Q

U

Q read(x); tl:= x;

tl> -0^

writeln('hello'); -U^

KO -L> writeln("goodbye

■16 write(x);

X

FLOWGRAPH FOR P'

FIGURE 2.2.5

D

®

2^ read(x);

i Qix):

X

4/ write(x);

d

FLOW GRAPH FOR P

FLOWGRAPH FOR Q

Q

ijo write(x);

6

FLOWGRAPH FOR P"

FIGURE 2.2.6

-41 -

local to Q, y has a c-use in node 1 1, the exit node of Q. To satisfy all-defs, all-c-uses, or all-c-uses/some-p-uses for Q a test set must cover the definition c-use association (9,1 l,y), thus must include an element (Y), where Y is strictly less than zero.

Note that the fact that these criteria satisfy the antidecomposition axiom is an artifact of the definitions. If the model of data flow were based on accurate interpro- cedural data flow analysis, or if the class of subprograms were restricted to exclude those having data flow anomalies like the one in the program in Figure 2.2.6 then the criteria would fail to satisfy the axiom.

Axiom 8: Anticomposition

There exist subprograms P and Q and test set T such that T is adequate for P and P(T) is adequate for Q, but T is not adequate for P;Q.

Weyuker's model of computation assumes that all input statements appear at the begin- ning of a program, and that all output statements appear at the end of a program. "P;Q" denotes the program obtained from two programs P and Q. which use the same set of identifiers, by replacing P's unique output and exit statements by Q' where Q' is obtained from Q by deleting the input statements. We will modify the Anticomposition axiom slightly obtain an axiom which captures the intuition motivating Weyuker's definition, but which fits our model of computation.

Anticomposition:

There is a subprogram R whose statement part consists of a call to procedure P fol- lowed immediately by a call to procedure Q. and a test set T for R such that the test-sei/driver-program pair (T.R) is adequate for P and for Q. but T is not adequate for the subprogram R' obtained from R by expanding P and Q in-line.

-42

All of the data flow testing criteria satisfy the Anticomposition axiom. To see that the criteria all-defs, all-c-uses, all-c-uses/some-p-uses, all-uses, and all-du-paths satisfy the axiom, consider the following programs, whose flow graphs are shown in Figure 2.2.7

program R(input,output); var x,y,z: integer; procedure P; begin (P) y:=0; read(x):

ifx>=Othen v:= 1; end; (P) procedure Q; begin (Q)

read(z);

If z < 0 then writeln(v); end; {P) begin (R) P;

0 end

program /?'(input,ourput); var x,y,z: integer; begin!/?')

y:=0;

read(x);

if X >=0 then y.= 1;

read(z);

if z < 0 then wriieln(y;; end

Let T=|(-1.-1),(0.0)). Then the test set/driver program pair (T,R) satisfies the all of the data flow testing criteria for P and for Q. However. T docs not satisfy the cnteria all- dcfs. all-c-uses. all-c-uses/some-p-uses, all-uses, or all-du-paths for R' To do so, T would have to cover the dcfiniiion-c-usc association ( 16.19.y). That is, T would have to include an clcmcni (a.h) where a>() and b<0. A similar example shows that the criieria all-p-uscs and all-p-uses/some-c-uscs also satisfy the anticomposition axiom.

r?'

x<0

(^

.■^ v:^0: readi X):

v^

x> =0

-7^ y:=l;

wriielniy ) ; ,9

8^ read(z);

■t-.

&

PROCEDIRE ?

PROCEDURE Q

13

^■^ y:= 0; read I x);

x<0

/

/

x> =0

■^ y

>i>

18, rcadiz);

wrr.einl'v): 19

PROGRAM R'

FIGLRE : : :

-44-

Summary

The above results are summarized in the table shown in Figure 2.2.8.

Axioms

Criteria

1

2

3

4

5

6

7

8

all-defs

no

ves

ves

ves

ves

ves

no

ves

all-c-uses

no

ves

ves

ves

ves

ves

ves

ves

all-p-uses

no

ves

ves

ves

ves

ves

no

ves

all-c-uses/ some-p-uses

no

yes

yes

yes

yes

yes

yes

yes

all-p-uses/ some-c-uses

no

yes

yes

yes

yes

yes

no

yes

all-uses

no

ves

ves

ves

ves

ves

no

ves

all-du-paths

no

ves

ves

ves

yes

ves

no

ves

FIGURE 2.2.8 Summary of .Axiomatic Evaluation of the Data Flow Testing Criteria

The fact that none of the data flow testing criteria satisfy the applicability axiom is disap- pointing, though hardly surprising. Statement testing, branch testing, and path testing also fail to satisfy the applicability axiom, for essentially the same reason. All of these criteria base the choice of paths which the test data must execute solely on the syntax of the subprogram being tested. If some required path is unexecutable then the subprogram cannot be adequately tested.

One could argue that the subprograms which cannot be adequately tested according to the statement testing or branch testing criteria are in some sense pathological because they have unexecutable statements or unexecutable branches. (.On the other hand, one could argi'" that there are good reasons, arising from "defensive programming" for pro- grams to have unexecutable code.) Be that as it may, the situation is somewhat different for the data flow testing criteria. There arc programs which have no unexecutable state- ments or branches yet which cannot be adequately tested by the data flow testing criteria.

-45-

Practitioners generally dodge the issue of inapplicable criteria by eliminating from consideration those statements, branches, or def-use associations which cannot be exe- cuted. Since it is undecidable whether a given statement, branch or def-use association can be executed, this process cannot be fully automated.

In chapter four we will define a new family of adequacy criteria which are derived from the data flow testing criteria by eliminating unexecutable associations from con- sideration. These new criteria do satisfy the applicability axiom. We will see that there are some significant differences between the formal properties of this family and those of the data flow testing criteria.

-46-

CHAPTER 3: The Design and Implementation of ASSET

1. Introduction

We have built a software testing tool called ASSET (A System to Select and Evalu- ate Tests) which is based on the data flow testing criteria. In this chapter we will discuss asset's design and implementation.

ASSET takes as input a Pascal program P, the name of a subprogram Q of P, a set of test data, and one of the data flow testing criteria. It runs the program P on the test data and produces a list of those definition-use associations in Q which are required by the given criterion but not covered by the test data.

The user then examines the output to determine whether the program has behaved according to its specification. If the program computed the wrong output, the user repons that the program has a error. If the program has behaved correctly on all of the test cases, and if the criterion has not been satisfied, the user selects additional test data and continues the testing process. If the program has behaved correctly and the criterion has been satisfied the user can either release the program, certifying that the subprogram Q has been adequately tested according to the criterion, or can select another criterion and continue.

In section 2 we describe ASSET'S design and in section 3 we describe the imple- mentation of the current version of ASSET, which is an experimental prototv'pe. A script of an ASSET session can be found in section 6 of the ASSET User manual, which is included as Appendix I.

-47- 2. The Design of ASSET

ASSET is an interactive system consisting (primarily) of five modules which com- municate with each other via files. Execution of the modules is invoked by commands issued by the user, via a user interface. The modules are (A) Front End, (B) Association-Finder, (C) Compiler, (D) Tester, and (E) Checker. The front end is the only module which is concerned with the syntax of the language in which the program being tested is written. Thus, extending ASSET to handle programs written in other languages should be fairly straight forward.

In this section we describe each of these modules and their interaction with one another. Figure 3.2.1 is a schematic diagram of ASSET'S design. We conclude this sec- tion with a discussion of some of the limitations of this design.

Module A: The Front End

This module is similar to the front end of a compiler. It takes as input a syntacti- cally correct program (written in the subset of Pascal described in Chapter 2) P, which we will call the subject program, and a subprogram Q of P, which we will call the subject procedure. It produces a symbol table with information on the identifiers visible to the statement part of Q, the flow graph of Q, a table called the def-use table, and a program called the modified subject program. The def-use table lists the definitions and c-uses in each node and p-uses on each edge. The modified subject program is a program which, for each test case t, simulates the behavior of P on input t, and which also produces a trace of the path through Q's flow graph which is executed when P is run with input t.

To do this, the front end must perform lexical analysis, parse all declarations which are visible to the statement part of Q, and (at least partially) parse the statement part of Q.

■4M'

--!>-

FIGURE 3.2.1 The Design of ASSET

-49-

Q is divided into basic blocks and the flow graph is built according to the rules in Figure 2.1.1. Each node in the flow graph, or equivalently each basic block (including certain empty basic blocks), is given a unique number.

The statements in each basic block b of Q are examined. The sequence of definitions and c-uses of variables occurring in the b are recorded and the p-uses of vari- ables on edges whose source is b are recorded according to the rules in Figure 2.1.1. It is not necessary to completely parse the expressions occurring in Q, only to identify the variables occurring in each expression.

The modified subject program is obtained from P by adding a probe at the begin- ning of each basic block in Q, which when executed, will write the block's number to a file called the program trace. A probe is which appends a zero to the program trace each time a call to Q is about to terminate is also added to the modified subject program. Extra begins and ends are added as needed. In order to place a probe in every block, repetitive statements are translated into a form which involves gotos and labels. For example, the statement

while X > 0 do writeln(x):

in the subject procedure would be translated into a fragment of the form

Lx: probe(«i);

if not (x > 0) then goto L2;

probe(/i2); writeln(x); gotoLi; L2: probe(n3): probe(n4);

where Li and L2 are statement labels, the n, are node numbers, and the probes write the node numbers to the program trace file. The modified subject program must also include declarations of the program trace file, of labels, of temporary variables introduced while

-50- translating for loops, and of any other identifiers which may appear in the probes.

Module B: Association-Finder

This module takes as input the def-use table and flow graph produced by the front end, and one of the data flow testing criteria and produces a list of all of the definition- use associations (of the appropriate kind) in subprogram Q. If the criterion is all-defs, all-p-uses, all-c-uses, all-p-uses/some-c-uses, all-c-uses/some-p-uses, or all-uses then the relevant associations are definition-c-use and/or definition-p-use associations. If the cri- terion is all-du-paths, the relevant associations are du-paths. Since the number of du- paths may be exponential in the number of nodes in the flow graph, ASSET does not prepare a list of them unless necessary.

There are several possible ways to build the list of definition-c-use and definition- p-use associations. The problem can be viewed as finding all of the definitions which reach each use of a variable (use-definition chaining) or as finding all of the uses which can be reached by each definition of a variable (definition-use chaining). There are several well-known polynomial-time algorithms for these problems [AH086,HEC77).

Finding all of the du-paths in Q has an exponential lower bound. This is because the number of du-paths is potentially exponential, and each of the du-paths must be traversed explicitly.

Module C: Compiler

This module is an ordinary Pascal compiler. In ASSET, it takes the modified sub- ject program as input and produces an object file which we will call the modified object code.

-51 -

Module D: Tester

This module takes as input the modified object code, and a set T of test data. For each test case t in T, it executes the modified program with t as an input. This produces the output P(t) which would be produced by running P on t, and also appends the path (or paths) executed through Q to the program trace file.

Module E: Checker

This module checks which of the associations required by the criterion have been covered by the test set T. It then determines whether the criterion has been satisfied and, if not, outputs a list of associations which required by the criterion but which have not been covered.

We associate with each definition-c-use association and each definition-p-use asso- ciation a regular expression which describes the set of paths covering the association. Let A be the alphabet { 1,...,N} where Q's flow graph has N nodes. For each variable v, let D'^^A be the set of node numbers corresponding to nodes which have a global definition of v. Let (d,u,v) be a def-c-use association. The regular expression r'^"'^=A*d(A-D^)*uA* describes the set of paths which cover the association in the sense that if 7t is a path from the entry to the exit of Q then k covers (d,u,v) if and only if 71 belongs to the language generated by r'^"-'^. Similarly, the regular expression /•'^•^•'•^=A*d(A-D^)*stA* describes the set of paths which cover the def-p-use associa- tion (d,(s,t),v).

If the criterion is all-defs, all-c-uses, all-p-uses, all-c-uses/some-p-uses, all-p- uses/some-c-uses, or all-uses, ASSET simulates the operation of deterministic finite auto- mata which correspond to these regular expressions. By submitting each of the paths

-52-

traversed by the test data as input to each of the automata, ASSET determines which of the def-c-use and which of the def-p-use associations have been covered by the test set. The algorithm in Figure 3.2.2 simulates the parallel operation of the automata. Note that only one pass over the program trace file is needed to determine which of the def-c-use and def-p-use associations are covered by the test set.

If the criterion is all-c-uses, all-p-uses, or all-uses, ASSET then simply checks whether all of the automata corresponding to the appropriate type of use have accepted at least one input. Those which have not accepted correspond to associations which have not been covered. If the criterion is all-defs, all-c-uses/some-p-uses, or all-p-uses/some- c-uses, then a little more bookkeeping is necessary. For example, for the all-defs cri- terion, it is necessary to check whether any of the associations arising from a panicular definition of a variable have been covered.

If the criterion is all-du-paths, ASSET uses pattern-matching to determine which du-paths have been covered by the test set. For each du-path 6 and each path k in the program trace, ASSET treats 7t as text, and 5 as a pattern, and searches the text to deter- mine whether the pattern occurs. The pattern 6 occurs in 7t if and only if the path n cov- ers the du-path 5. By using a pattern matching algorithm which never has to back-up in the text (such as the Knuth-Morris-Pratt Algorithm [KNU77]). and by checking the pat- terns represented by all of the du-paths in parallel (in a manner similar to the above simu- lation of the parallel execution of several automata) ASSET can check all of the du-paths in a single pass over the program trace file.

Limitations

One limitation of this design is that the program traces do not distinguish between

-53

const

type

var

begin

V= number of variables;

N = number of nodes in flow graph;

CU = number ofdef-c-use associations;

PU = number ofdef-p-use associations;

VariableRange: 1..V;

NodeRange: 1..N;

C_UseRange: 1..CU;

P_UseRange: 1..PU;

StaieType: (ql,q2,q3);

automaton 1 = record

v:

VariableRange;

d.u:

NodeRange;

state:

StateType;

accepted:

boolean;

end;

automaton! =

record

v:

VariableRange;

d.s.L-

NodeRange;

state:

StateType;

accepted:

boolean;

end;

def_c_use:

array [C_UseRange] of

automaton 1 ;

def_p_use:

array rP_UseRange] of automaton!;

i:

integer;

fori:=l toCUdo

initialize defcuseli] so that the fields d.u. and v represent the node with the definition, node with the c-use. and variable, respectively in the /'* def-c-use association and initialize state to ql and accepted to false; fori:=l toPUdo

initialize def_p_use[ij so that the fields d.s,i. and v

represent the node with the definition, source and target of the edge with the p-use, and variable, respectively in the /** def-p-use association and initialize state to ql and accepted to false; read(program_trace,node); while node o END_OF_PATH_MARKER do begin (process next node in the path)

for i:=l to CU do (update state of/'* "def-c-use" automaton) with def_c_use[i] do

if not accepted then case state of

ql: if node = d then state :=q2; q2: if node = u then accepted := true else if ((nodeeD*) and (node o d) then state := ql; end; (case) end; (for)

for i:=l to PU do (update state of /'* "def-p-use" automaton) with def_p_usc[i] do

if not accepted then

-54-

if (d <> s) then

case state of

q 1 : if node = d then state := q2;

q2; if node = s then state := q3

else if ((nodee D") and (node o d)) then state := ql; q3: if node = l then accepted := true

else if node = s then state := q3

else if ((nodeeD') and (node o d)) then state := ql

else state := q2; end (case) else {d = s) case state of

ql : if node = d then state := q3; q3: if node = t then accepted := true

else if (node o d) then state := ql end; {case]

end; (for) read(progTam_trace ,node) ; end; (while) end;

Figure 3.2.2

-55-

different invocations of the same procedure. This presents certain problems when the procedure being tested calls itself recursively. For example, consider the following pro- gram:

program driver(input,output); var a : integer, procedure Q(x : integer); var y : integer; begin

read(y);

if X >=0 then

begin

Q(x-l); write(y) end end; {Q) begin {driver) read(a); Q(a) end.

The flow graph for procedure Q is shown in Figure 3.2.3. The test set T={(0)) executes the path (1,2,3,4,6,7) through its first invocation of procedure Q, and the path (1,2,5,6,7) through its second invocation of procedure Q. The first of these has a definition-clear subpath with respect to y from node 2 to node 4, hence covers the def-c-use association (2,4,y). However, the design of ASSET described here would say that the paths executed were (1,2,3,1,2,5,6,7) and (4,6,7). Therefore it would not see that the association (2,4,y) had been covered. It is left to the user to determine by hand that the association has been covered.

A more sophisticated design could circumvent this problem by dynamically tagging the node numbers in the program trace with a procedure-invocation-ID, then separating the ponions of the program trace which arise from different invocations. In the above example, the initial program trace would be

(1.1,1.2,1.3,2.1,2.2,2.5,2.6,2.7,2.0,1.4,1.6,1.7,1.0), where each element of the sequence

©

X >= 0

(j\ read(y)

x<0

Q(x-l) (3)

writc(y) '4

®

Flow graph for procedure Q

FIGURE 3.2.3

-57-

has the form invocation-ID. node-number, and where node number 0 indicates termina- tion of the corresponding invocation. This trace could then easily be divided into the paths (1,2,3,4,6,7) and (1,2,5,6,7), as desired. It is not clear whether the benefits of such a design justify the considerable overhead involved.

3. The Implementation of ASSET

In this section we describe the implementation of the current version of ASSET, which is an experimental prototype.

ASSET runs BSD 4.2 UNIX systems and has been implemented on DEC VAX's and on Sun Workstations*. The front end, the association-finder, and the checker are written in Berkeley Pascal. Each of these is an individual Pascal program. The user interface is written in C and makes system calls to invoke the modules, to manipulate files, and to invoke various additional functions which were not described in the above design description. Module C, the compiler, is the Unix system's Pascal compiler, pc. Module D, the checker is implemented as part of the user interface.

The user interface is user-friendly and is driven by commands issued by the user. In a typical ASSET session, the user will initially select a criterion and direct ASSET to invoke the front end, the association-finder, and the compiler. At this point the checker can be invoked to produce a list of all of the definition-use associations. The user then invokes module D, the tester, to run the modified subject program on one or more test cases, then the checker to determine which (if any) associations required by the criterion have not been executed by the test data. At this point, the user can add additional test

UNIX is a trademark of AT&T Bell Laboratories. DEC and VAX are trademarks of Digital Equipment Corporation. Sun Workstation is a trademark of Sun Microsystems, Inc.

-58-

cases to the test set and run the program again or can select a new criterion and check the current test set against the new criterion. A more complete description of ASSET from the user's viewpoint can be found in the ASSET User Manual, included in this thesis as Appendix I.

Module A, the front end, is implemented as a one-pass recursive-descent parser which builds the symbol table, builds the flow graph for Q; classifies variable occurrences as being definitions, c-uses, or p-uses, storing this information in a table called the def-use table; and creates a preliminary version of the modified subject pro- gram in which the probes have been inserted. Declarations are added in one pass over the modified subject program. The def-use table, an adjacency list representation of the flow graph, and relevant parts of the symbol table are written to files. The front end does not completely parse expressions, nor does it explicitly build the parse tree.

In Module B, the association-finder, a series of depth-first-searches is performed (unless the criterion is all-du-paths.) For each definition of a variable v having a definition in node i, a depth first search is performed beginning at node i. Each time a node j is visited, ASSET checks the def-use table to see whether variable v occurs in node j. If node j has a c-use of v then the def-c-use association (ij,v) is recorded. If node j is the source of edges (j,)t i ) (],k„) having p-uses of v then the def-p-use associa- tions (i.(j^ 1 ).v) (i,(j.^m).v) are recorded. If node j has a definition of v then the search

backtracks.

At most NV depth first searches are performed where N is the number of nodes in the flow graph and V is the number of variables. Since the graphs involved are sparse, each depth first search takes 0(N) time. Thus the complexity of the association-finder is OCN^V).

-59-

This implementation of the association-finder is a hold-over from an earlier version of ASSET (which took as input programs written in a simple language having only sim- ple variables) in which some data flow anomaly detection was performed. More efficiency could be achieved in practice by using one of the algorithms mentioned in sec- tion 2.

If the criterion is all-defs, all-c-uses, all-p-uses, all-c-uses/some-p-uses, all-p- uses/some-c-uses, or all-uses, then module E, the checker, uses the algorithm in Figure 3.2.2. If the criterion is all-du-paths then the checker uses the Knuth-Morris-Pratt pattern matching algorithm. The du-paths are stored internally as an array of arrays of characters and one pass is made over the program trace file.

We conclude this section by mentioning some of the limitations of the current implementation of ASSET. The fact that ASSET stores various tables internally as arrays imposes certain restrictions on the number of nodes in the flow graph, the number of def-use associations, etc. ASSET also places certain minor restrictions on identifiers which can be used in the subject program, in order to avoid interference with the names of files created by ASSET and with identifiers added to the modified subject program. Details of these restrictions can be found in the ASSET User Manual. In addition, the User Manual documents certain bugs in the current implementation which will be corrected in the next version.

-60-

CHAPTER4:

The Feasible Data Flow Testing Criteria

1. Definitions of the feasible data flow testing criteria

Given a subprogram P and a DF criterion C it may be the case that no test data for P exists which satisfies C. This occurs when none of the paths which cover a particular association required by C are executable. In such a case, P cannot be adequately tested according to C. In this section we introduce a new family of criteria which are derived from the DF criteria and which circumvent this problem and investigate some of its pro- penies. We assume that all side effects and aliasing are known.

Recall that a complete path is a path from the entry node to the exit node of a subprogram's flow graph. We will say that a complete path is executable or feasible if there exists some assignment of values to input variables, non-local variables, and param- eters which causes the path to be executed. We will say that a path is executable if it is a subpath of an executable complete path. According to this definition, the question of whether or not a given path through a subprogram is executable is independent of the context in which that subprogram is called. However, it may depend on the effects of any procedures or functions which are called along the path. In section 5 we will dis- cuss the consequences of modifying this notion of cxecutability to take into account information about the context in which the subprogram is called. Note that whether or not a panicular path is executable depends on the actual subprogram, not just on its def- use graph.

We will say that an association is executable if there is some executable complete path which covers it; otherwise it is unexecutable We define subsets fdcu(x,i) cdcu(x.i)

-61-

and fdpu(x,i) c dpu(x,i) whose elements correspond to those associations which are exe- cutable as follows: dcu(x,i)=(no(ies j such that xec-use(j) and there is an executable definition clear path with respect to x from i to j}. dpu(x,i)={tdgts (j,k) such that xep- use(j,k) and there is an executable definition clear path with respect to x from i to (jM)]- Equivalently, fdcu(x,i)={jedcu(x,i)lthe association (ij,x) is executable} and fdpu(x,i)={(jjc)edpu(x,i)lthe association (i,(jJc),x) is executable}. For each DF criterion C we define a new criterion C* by selecting the required associations from fdcu(x,i) and fdpu(x,i) instead of from dcu(x,i) and dpu(x,i). Precise definitions of these criteria are given in Figure 4.1.1. We will refer to the criteria {(all-du-paths)*, (all-uses)*, (all-p- uses/some-c-uses)*, (all-c-uses/some-p-uses)*, (all-p-uses)*) &s feasible dataflow testing criteria, or PDF criteria, for short.

The FDF criteria satisfy the applicability property: For any subprogram P and any FDF criterion C*, there is some test set T which satisfies C*. However, the question of whether a particular T satisfies C* for subprogram P is undecidable. In going from the family DF to the family FDF, we have traded the undecidability of the existence ques- tion, "is there any test which is C-adequate for P?" for the undecidability of the recogni- tion problem "is a given test C*-adequate for P?"

Observe that for any DF criterion C, C => C*. We now investigate the inclusion relations among the FDF criteria.

THEOREM 4.1.1: The family of FDF criteria is partially ordered by strict inclusion as shown in Figure 4.1.2. Funhermore, FDF criterion C,* includes FDF criterion Cj* if and only if it is explicitly shown to do so in the figure or it follows from the transitivity of the relations.

THE FEASIBLE DATA FLOW TESTING CRITERIA

fdcu(x,i) = {J6dcu(x,i)l the association (i,j,x) is executable) fdpu(x,i) = {(j,k)6dpu(x,i)l the association (i,(j,k),x) is executable)

A test-set/driver-program pair (T,D) satisfies criterion C for subprogram P if and only if for each node i in P's flow graph and each x e def(i) the set n of paths executed by T covers the following associations:

CRITERION

REQUIRED ASSOCIATIONS

(all-defsj*

if fdcu(x,i) u fdpu(x,i) ^ 0 then some (i,j,x) s.t JGfdcu(x,i) or some (i,(j,k),x) s.t. (i,k)efdpu(x,i).

(all-c-uses)*

all (ij,x) s.t. jefdcufx.i).

(all-p-uses)*

all (i.(j,k),x) s.t. (j,k)G fdpu(x,i).

(all-p-uses/some-c-uses)*

all (i,(j,k),x) s.t. (j,k)G fdpu(x,i). In addition, if fdpu(x,i) = 0 and fdcu(x,i) ?t <) then some (ij,x) s.t. je fdcu(x,i).

(aJl-c-uses/some-p-uses)*

all (i,j,x) s.t. JGfdcu(x,i). In addition, if fdcu(x,i) = 0 and fdpu{x,i) ^ 0 then some (i,(i,k),x) s.t. (j,k)efdpufx,i).

(aJl-uses)*

all (ij.x) s.t. j e fdcu(x,i) and all (i,(j,k),x) s.t. (j.k) e fdpu(x,i).

(all-du-paths)*

all executable du-paths with respect to x from i to j s.t. jedcu(x,i) and all executable du-paths with respect to x from i to (j.k) for each (j.k) e dpu(x,i).

For comparison we also define the criteria (all-nodes)* [(all-edges)*, (all-paths)*, respec- tively 1 which require that Fl cover each executable node [each executable edge, each exe- cutable path, respectively.)

Figure 4. LI

-b3-

(ALL-PATHS)*

(ALL-DU-PATHS)*

\

/

(ALL-EDGES)"

K/

/^

(ALL-USESj* (ALL-NODES)*

-.x

(ALL-C-USES,SOME-F-USES)*

(ALL-P-USES/SOME-C-USES)"

(ALL-C-USES)'

(ALL-DEFS)*

(ALL-P-USES)*

FIGURE 4.1.2 The relationship among the feasible data flow testing criteria

-64-

PROOF:

A. Strictness of the inclusions

We first observe that if subprogram P has no unexecutable paths, then a test is C- adequate for P if and only if it is C*-adequate for P. This observation, along uath the proofs of strictness of the inclusions in Theorem 1 of [RAP85], none of which involve subprograms with unexecutable paths, shows that all of the inclusions in Figure 4.1.2 are strict. It thus suffices to show that the inclusions in Figure 4.1.2 hold.

B.l. (all-paths)* => (all-uses)* :

Suppose not. Then there is a subprogram P and a set T of test data which is (all- paths)*-adequate for P but not (all-uses)*-adequate. Let n be the set of paths through P which T executes. There exist a node i in P with a global definition of some variable x, a node j with a global c-use of x or edge (jM) with p-use of x, and an executable definition clear path with respect to x from i to j (respectively from i to (j,k)] which is not covered by n. This contradicts the fact that n covers every executable path.

The proofs that (all-paths)* =* (all-du-paths)* and (all-paths)* => (all-edges)* are similar. B.2. (all-paths)* => (all-du-paths)* :

Suppose not. Then there is a subprogram P and a set T of test data which is (all- paths)*-adequate for P but not (all-du-paths)*-adequatc. Let n be the set of paths through P which T executes. There exists an executable du-path which is not covered by n This contradicts the fact that fl covers every executable path.

B.3. (all-paths)* => (all-edges)* :

-65-

Suppose not. Then there is a subprogram P and a set T of test data which is (all- paths)*-adequate for P but not (all-edges)*-adequate. Let n be the set of paths through P which T executes. There exists an executable edge (i,j) which is not covered by n. This contradicts the fact that n covers every executable path.

B.4. (all-edges)* => (all-nodes)* :

Let T be a test which satisfies (all-edges)* for subprogram P, and let n be the set of paths executed by T. Let n be any executable node in P. If n is the entry node, then n has a unique successor, m, and (n,m) is executable. So n covers (n,m) and hence covers n. If n is not the entry node, then since n is executable, some branch (i,n) is executable. So n covers (i,n) and hence covers n.

B.5. (all-uses)* => (all-p-uses/some-c-uses)*, (all-p-uses/some-c-uses)* =* (all-p-uses)*, (all-p-uses/some-c-uses)* => (all-defs)*, (all-uses)* => (all-c-uses/some-p-uses)*, (all-c- uses/some-p-uses)* => (all-defs)* :

These inclusions follow immediately from the definitions of the criteria given in Figure 4.1.1. For example, any set FT of paths which covers all of the associations required by (all-uses)* will a fortiori cover all of the associations required by (all-p- uses/some-c-uses)*.

We next show that those relations not in the transitive closiu-e of the diagram in Fig- ure 4.1.2 do not hold.

C.l. (all-du-paths)* (all-p-uses)*; (all-du-paths)* (all-p-uses/some-c-uses)*; (all-du- paths)* (all-uses)*; (all-du-paths)* (all-c-uses/some-p-uses)*; (all-du-paths)* (all- defs)*; (all-du-paths)* (all-edges)*; (all-du-paths)* (all-nodes)* :

It suffices to show that (all-du-paths)* (all-p-uses)*, (all-du-paths)* (all-defs)*, and

-66-

(all-du-paths)* (all-nodes)*. The rest follows from the transitivity of =>. Consider the subprogram shown in Figure 4.1.3a. Its du-paths are shown in Figure 4.1.3b. Of these, only (1,2), (2,3,4), (4,3,4), and (4,3,5) are executable. Let T = {(X,Y)) where X is any integer and Y < 0. Since T executes n = {(1,2,3,4,3,4,3,5,6,7,9,10)), T satisfies (all-du- paths)*. However, n does not cover the associations (2,(6,8),y), (2,8,x), or node 8, all of which are covered by the executable path (1,2,3,4,3,4,3,5,6,8,9,10), so T does not satisfy (all-p-uses)*, (all-defs)*, or (all-nodes)*.

Intuitively, (all-du-paths)* fails to include these criteria because it is possible for a subprogram to have cenain definition-use associations which can be executed only by paths which traverse some loop one or more times.

C.2. (all-p-uses)* (all-edges)*; (all-p-uses/some-c-uses)* (all-edges)*; (all-uses)* (all- edges)*; (all-p-uses)* (all-nodes)*; (all-p-uses/some-c-uses)* (all-nodes)*; (all-uses)* (all-nodes)* :

Consider the subprogram in Figure 4.1.4, where y is a local variable (and hence does not have a definition in the entry node). Notice that since node 3 is unexecutable, y is always uninitialized at node 5. In the absence of any information about which (if either) edge leaving node 5 will be executed when the program is run on actual test data, we make the worst-case assumption that edges (5,6) and (5,7) are both executable. This would be the case, for example, in an environment in which uninitialized variables receive arbitrary values. Since node 3 is unexecutable. the only executable definition-use associations are ( 1 ,2.input). (2, (2.4), x) and (2.9,inputT). Let T be a test which executes n=( (1.2,4.5.6.8.9)) or n={(1.2,4.5.7.8.9)). Then T satisfies (all-p-uscs)*. (all-p- uscs/some-c-uses)*. and (all-uses)*, but docs not satisfy (all-edges)* or (all-nodes)*.

0

^

read(x,y);

^ i:=l;

writeln('hello') i:=i-l;

y<o X \ y> = o

©,

^ writeln(x);

0

FIGURE 4.1.3a

t?

Path

With respect to

executable

(1.2)

inputT

yes

(2,3.4)

i

yes

(2,3.5)

i

no

(4.3.4)

i

yes

(4,3,5)

i

yes

(2,3,5.6,7,9.10)

inputT

no

(2,3,5,6,8,9,10)

inputT

no

Figure 4.1.3b

(i)

x<0

read(y); (^

V read(x); ^L'^ x: = sqr(x); x> = 0

vOO

©,

v = 0

®

®

^ ^

FIGURE 4.1.4

-70-

Notice that the subprogram in Figure 4.1.4 has a data flow anomaly [OST76]. We shall see below that this is not a mere coincidence, but that rather, it is this particular kind of anomaly which prevents the inclusions from holding.

The rest of the non-inclusions follow from the incomparability and strictness proofs in [RAP85].

It seems discouraging that (all-p-uses)* fails to include (all-edges)*. Data flow test- ing was developed in order to "bridge the gap" between branch testing and path testing. Since many "real-life" subprograms cannot be adequately tested using the unstarred ver- sions of the data flow criteria, one would hope that the FDF criteria would "bridge the gap" between (all-edges)* and (all-paths)*. We have seen that this is not the case. We next show that for a cenain class of "well-behaved" subprograms, any test which satisfies (all-p-uses)* satisfies (all-edges)*.

DEFINITION: We will say that a subprogram P satisfies the No-Feasible-Undefined-P- uses property (NFUP) if and only if for every executable edge (ij) in P having a p-use of a variable x there is some executable path from the start node to edge (iJ) which contains a global definition of x.

We note that it is quite reasonable to expect subprograms to have propeny NFUP. If (i.j) is an edge which causes NTUP to fail, then any input which causes (iJ) to be exe- cuted will involve referencing an uninitialized variable.

THEORFM 4.1.2 For the class of subprograms which satisfy NTUP, (all-p-uses)* =* (all-edges)*.

PROOF: Let P be a subprogram satisfying NTUP, let T be a test which satisfies (all-p- uses)* for P. let II be the set of paths executed by T. and let (ij) be an executable edge in P. Suppose (i j) has a p-use of a variable x. By h\-pothesis there is an executable path n

-71-

from the start node to (i,j) which includes a global definition of x. Let n be the last node in 7t having a global definition of x. Then (n, (i j), x) is an executable defininon-p-use association so it is covered by n. Hence (ij) is covered by Fl.

If (iJ) has no p-uses, then the result follows by a straight-forward modification of the corresponding part of the proof of (all-p-uses) =* (all-edges) in Chapter 2, section 1 .■

In [RAP85] the class of subprograms to which data-flow testing applies is restricted to those subprograms satisfying the No-Syntactic-Undefined-P-use Property, defined in section 2 above. This restriction was necessary in order to insure that all-p-uses => all- edges. NFUP is a strengthening of NSUP. It takes into account the fact that in subpro- grams satisfying NSUP, the only paths 7C from the entry node to some p-use of x such that x has a global definition in some node in 7i might be unexecutable.

It is tempting to restrict the class of programs to which the FDF criteria apply to those satisfying NFUP. It is our feeling however that while one can live with the undeci- dability of the adequate test recognition problem and perhaps (albeit very uncomfortably) with the undecidability of the adequate test existence problem, one should at least be able to decide algorithmically whether a given testing strategy applies to a given subprogram. Since it is undecidable whether a given subprogram satisfies NFUP, we refirain from requiring this property of subprograms to be tested.

Another possible way to force (all-p-uses)* to include (all-edges)* would be to require subprograms to satisfy the No-Anomalies Propeny (NA):

Every path from the start node to a use of a variable x must contain a definition of x.

Osterweil and Fosdick [OST76] consider any subprogram not satisfying this property to have data-flow anomaly indicative of possible subprogram error. Since NA is a purely

-72-

syntactic property and NA implies NFUP we could restrict FDF testing to subprograms satisfying this property. We feel that this is overly restrictive, since many perfectly good subprograms fail to satisfy NA.

One way to force NA to be satisfied is to give the entry node a definition of each variable. This would potentially increase the number of def-use associations and thus make the criteria more demanding. However, it would make the model of the subprogram's data flow reflect the actual data flow less accurately.

Another approach is to perform FDF testing in conjunction with a check for data- flow anomalies. For any subprogram which satisfies NA and any test set T which satisfies (all-p-uses)* the tester will be assured that T satisfies (all-edges)* . In case NA does not hold, the tester should explicitly check whether (all-edges)* is satisfied and if necessary add more test data or inspect the subprogram for references of uninitialized variables.

2. Axiomatic Evaluation of the Feasible Data Flow Testing Criteria

In this section we evaluate which of the axioms for a good adequacy criterion arc satisfied by the feasible data flow testing criteria.

Several of the axioms are of the form "There exists a program P and a test set T such that some property holds". If a data flow testing criterion C satisfies one of these axioms, then the corresponding feasible data flow testing criterion C* automatically satisfies the axiom. In these cases, the examples used in Chapter 2.2 to prove that C satisfies an axiom also show thai C* satisfies the axiom.

-73-

Axiom 1: Applicability

For every program, there exists an adequate test set.

The feasible data flow testing criteria have all been defined in such a way as to guarantee that the applicability criterion is satisfied. For each of these criteria, the definition-use associations which must be covered by the test set are all executable. By definition, this guarantees the existence of adequate test data.

Axiom 2: Non-Exhaustive Applicability

There is a program P and test set T such that P is adequately tested by T and T is not an exhaustive test set.

All of the feasible data flow testing criteria satisfy this axiom. This follows firom the fact that all of the data flow testing criteria satisfy the axiom.

Axiom 3: Monotonicity

If T is adequate for P and TcT"' then T' is adequate for P.

Since all of the executable associations covered by T are also covered by T', all of the data flow criteria satisfy the monotonicity property.

Axiom 4: Inadequate Empty Set

The empty set is not adequate for any program.

Every subprogram has at least one executable def-p-use association. To see this, let n be the unique node such that n has at least two successors and no node on any path from the entry node to node n has more than one successor. At least one of the edges (n,m) whose source is node n must be executable. Since node n has more than one sue-

-74-

cessor, edge (n,m) has a p-use of some variable, v. Since the subprogram is assumed to satisfy the No-Syntactic-Undefined-p-uses property, there is a global definition of v on the path from the entry node to (n,m). Therefore, there is an executable definition-p-use association. Note that the path from the entry node to edge (n,m) is a du-path.

It follows that the criteria (all-defs)*, (all-p-uses)*, (all-p-uses/some-c-uses)*, (all- c-uses/some-p-uses)*, (all-uses)*, and (all-du-paths)* satisfy the inadequate empty set propeny.

The criterion (all-c-uses)* fails to satisfy the inadequate empty set axiom. This is

because it is possible for a program to have no executable definition-c-use associations.

Consider, for example, the program

program example(input,output);

var X : integer;

begin

read(x);

while x=x do writeln('hello') end.

The only definition-c-use association in this program arises from the fact that there is a definition of input} in the entry node and a c-use of input'l in the exit node. But this association is unexecutable because control can never reach the exit node.

Axiom 5: Anti-extensionality

There arc programs P and Q such that P is equivalent to Q, T is adequate for P, but T is not adequate for Q.

All of the feasible data flow testing criteria satisfy the Anti-extensionality axiom.. This follows from the fact that all of the DF criteria satisfy the axiom.

-75-

Axiom 6: General Multiple Change

There are programs P and Q which are the "same shape", and a test set T such that T is adequate for P, but T is not adequate for Q.

All of the feasible data flow testing criteria satisfy the general multiple change axiom. This follows from the fact that all of the DF criteria satisfy the axiom.

Axiom 7: Antidecomposition

There is a subprogram P which calls a subprogram Q and a test T such that T is adequate for the program P' obtained from P by expanding in-line one call to Q, but the test- set/driver-program pair (T,P) is not adequate for Q.

All of the feasible data flow testing criteria satisfy the antidecomposition axiom. Recall that most of the DF criteria failed to satisfy this axiom because (with certain exceptions, which are artifacts our model of data flow) each definition-use association a in Q corresponds to a definition-use association a' in the portion of/" which corresponds to Q. However, it is possible for a to be executable, while a' is unexecutable. An extreme case of this occurs when every call to Q from P is unexecutable.

Consider the following programs, whose flow graphs are shown in Figure 4.2.1.

program P(input,ourput); van X : integer; procedure Q;

var y: integer, begin {Q}

read(y);

if y>=0 then writeln('hello') else writeln('goodbye') writeln(y); end; (Q) begin

read(x);

if x>=Othen

-76- ifx<Othen Q

end.

program /"(inpui.output); var x,y : integer, begin

read(x);

if X >=0 then

if X <0 then

begin (procedure Q expanded in-line} read(y);

if y>=Othen writeln('hello') else vvTiteln('gCKxibye') writeln(y); end; (end of in-line expansion of Q} end.

Note that the call to Q in P is unexecutable. The only executable paths from the entry to the exit of P' are (16,17.18,26,27) and (16,17,19,20,25). Since the test set T={(-1),(0)) executes both of these paths, it covers all of the executable definition-use associations in P', and hence satisfies all of the FDF criteria. When the elements of T are input to program P, however, they do not cover any of the associations (ll,14,y), (11,(1 1,12), y), or (1 1,(1 1,1 3),y) in Q. Note that these associations are executable, since ihey can be covered, for example by the driver-program/test-set pair (/"'.T) where P" just calls Q. So T does not satisfy all-defs or all-p-uses for Q, and hence does not satisfy any of the criteria. This shows that all of the FDF criteria satisfy the Antidecomposition axiom.

Recall that the DF criteria all-defs, all-c-uscs, and all-c-uses/some-p-uses satisfied this axiom as an only artifact of the particular model of data flow employed. In the case of the FDF criteria, on the other hand, the criteria satisfy the axioms because of the fun- damental nature of the criteria.

1-1

FIGURE 4.2.1

X read(y);

write!n('goodbye'); (T^^

writelnChello'

PROCEDURE Q

(!)

PROGRA.M P

@

PROGRAM P'

V^ read(x);

read(y);

> = 0

'<?\ writeln('hello');

^

-78-

Axiom 8: Anticomposition

There is a program R whose statement part consists of a call to procedure P followed immediately by a call to procedure Q, and a test set T for R such that is adequate for P and for Q (as procedure's called by R) but T is not adequate for the program R' obtained from R by expanding P and Q in-line.

All of the feasible data flow testing criteria satisfy this axiom. This follows from the fact that the DF criteria satisfy the axiom.

Combining the results in this section gives us

THEOREM 4.2.1:

The criteria (all-defs)*, (all-p-uses)*, (all-p-uses/some-c-uses)*, (all-c-uses/some-p- uses)*, (all-uses)*, and (all-du-paths)* all satisfy the applicability, non-exhaustive appli- cability, monotonicity, inadequate empty set, anti-extensionality, general multiple change, anti-decomposition, and anti-composition axioms. The (all-c-uses)* criterion satisfies all of the axioms except for the inadequate empty set axiom.

3. A generalized notion of executability

The definition of executability given in Section 3 fails to take into account any information about the context in which a subprogram is called. It may be the case that there is no input data to the program as a whole which cause the execution of a particular executable path through a subprogram. In order to adequately test such a subprogram with respect to a given FDF criterion it may be necessary to write a driver program which assigns particular values to global variables and parameters, and then calls the subpro- gram. Whether this extra effort is "wonhwhilc" depends on whether it is likely that the subprogram will ever bo called in a context other than the one in uhich it currently

-79-

appears in the program. In this section we define a more general notion of executability which takes into account information about the context in which a subprogram is called and explore the effects of this generalization on the FDF criteria.

Consider the program:

program main(input,output); type CharString = arraj'[1..10] of char, var string 1 : CharString; length : integer.

procedure WriteString(str: CharString; n : integer); {Writes the first n characters of str to standard output.} var i : integer; begin

for i := 1 to n do write(str[i]) end;

begin {statement part of main program}

if length > 0 then WriteString(stringl, length) else ...

end. {main }

Suppose that at every point in the program at which WriteString is called, the value of n is guaranteed to be strictly greater than zero. Then no input to the program can cause the execution of the path through the procedure which traverses the loop zero times.

In order to adequately test WriteString with respect to the criterion (all-uses)* it is necessary to include test data which causes the for loop to be traversed zero times. To do this, one must write a driver program which calls WriteString with the second parameter having a value less than or equal to zero. If we think that we might actually want to use the procedure WriteString in a less restricted context (for example due to modifications

-80-

of the calling program or to re-using the procedure in a different program) then this is a reasonable thing to do. On the other hand, if we are fairly certain that the procedure will never be called in a context where n is less than or equal to zero, then writing a driver program could be construed as being a wasted effort. What is needed is a notion of test data adequacy which takes into account information about the context in which the sub- program being tested can be called.

We can achieve this by relativizing the definition of executability as follows. We associate with the subprogram to be tested, a predicate IC{V i,...,Vi^), called the input constraint where V\,...,Vi, represent the subprogram's parameters and non-local vari- ables. A path through the subprogram is then executable relative to IC if there exists some assignment of values to input variables, parameters, and non-local variables which satisfies IC and which causes the path to be executed. A path is executable as defined in Section 4. 1 if and only if it is executable relative to the input constraint IC^TRUE.

The notion of executability of an association and the definitions of the PDF criteria can be relativized in a straight forward manner.

The relationship among the relativized FDF criteria is essentially the same as that among the non-relatived criteria. The definitions must be modified to reflect the fact that the objects being tested now consist of pairs (P,IC) where Pisa subprogram and IC is an input constraint. We say thai relativized criterion Cj includes relativized criterion Ci if for every subprogram/input-constraint pair (P,IC) every test which satisfies Cj for that pair also satisfies €2- C] strictly includes C2 if Cj includes C2 and for some pair (P.IC) there is a test which satisfies C2 but docs not satisfy C\. It is easy to show that the relationship among the relativized criteria is as shown in Figure 4.1.2.

-81 -

One reasonable choice for the input constraint is the predicate ]C^"^ obtained by taking the constraints on the input to the program as a whole (drawn from the program's specification), conjoining them, and "pushing them through" the program to all points at which the subprogram being tested is called. In practice, one might want to use a weaker predicate than /C*''^ which can be built up during the testing process as follows. Sup- pose that at some point in the testing process the tester notices that a particular execut- able association has still not been exercised. Upon examining the program to see what values of input data, non-local variables and parameters would cause the execution of that association, the tester sees that the needed values of non-local variables and parame- ters cannot arise in the context of the program as a whole. One can then formulate a con- straint which reflects this fact and conjoin it to the previous constraint.

If the calling program is modified some time after the subprogram has been certified adequately tested, the predicate IC will provide useful documentation which will help in selecting additional test data for the subprogram. If IC is still satisfied whenever the sub- program is called then no further testing of the subprogram will be needed. If IC no longer hold at the points of call, however, it will be necessary to update IC, determine which def-use associations become executable relative to the new set of constraints, and add test data to exercise those associations.

-82-

CHAPTER5:

The Path Expression Heuristic

1. Introduction

In order to determine whether a test set T satisfies a feasible data flow testing cri- terion C, it is necessary to examine all of the definition-use associations required by C but not covered by T to see whether any of them are executable. This may be a tedious and error prone task.

In this chapter we present a heuristic called the path expression method which attempts to determine whether a given definition-use association (d,u,v) is executable. There are two "hard" aspects to this problem. First, there may be infinitely many paths which cover the association. Second, there is no algorithm to determine whether a partic- ular path is executable. Our approach involves using regular expressions called path expressions to describe a set n of paths which could potentially be executable paths cov- ering the association, and using a combination of symbolic evaluation and data flow information to eliminate from consideration subsets of U consisting of paths which can be shown to be unexecutable or to have definitions of v.

The approach is heuristic in the sense that it will either answer "association is unex- ecutable", or "association may be executable". If it answers "association is unexecut- able" then the association is guaranteed to in fact be unexecutable. If it answers "associa- tion may be executable" then the association may be executable or unexecutable. Hope- fully, for "realistic" programs, it will only infrequently answer "association may be exe- cutable" when the association is actually unexecutable. When it answers "association may ho executable" il will also supply some information which may help the user in

-83-

deiermining by hand that the association is unexecutable, or in selecting a test case which covers the association.

Let P be the subprogram being tested. Throughout this chapter, unless otherwise noted, we will make the following assumptions, in addition to the assumptions in Chapter 2, section 1 :

1 . No aliasing. That is, we assume that when P is called, all of the actual parameters corresponding to var formal parameters are distinct from each other and distinct from all global variables which occur in P. We also assume that no pointer vari- ables occur in P. We will relax this assumption later.

2. No hidden side effects. That is, the only variables visible to P which are modified by any procedure or function P' called by P are those variables which are passed to /" as var parameters.

3. Functions called by P have no var parameters.

4. P has no repeat-until statements.

Assumptions 1 and 2 insure that the data flow information is sufficiently accurate. Assumption 3 simplifies the symbolic evaluation which will be performed. Assumption 4 is a technical convenience which insures that the exit from each loop in the flow graph of P is at the top of the loop. In section 5.6 we discuss the effects of relaxing these assumptions.

Loops in the flow graph will play an important role in the path expression method. A node h in a flow graph G is a loop header, if, when a depth first search is performed on G, starting at the entry node, h is the target of a back edge. In the recursive definition of structured flow graphs given in Figure 2.1.1, the loop headers are the nodes labelled with

-84-

the lener "i", in the subgraphs corresponding to for statements and while statements (and the entry node to 5 1 in the subgraph corresponding to the repeat-until statement.) If (e,h) is the back edge whose target is h*, then the we define the loop corresponding to hxoht the subgraph of G consisting of h along with those nodes n such that there is a path from n to e which does not pass through h. A path through the loop headed by node /z is a path 7t= Oi,...,/*), where k>l, /i=/;t=h, and ij is a node in the loop headed by node h, for l<j<k. Note that according to this definition, the path K=(h) is a path through the loop headed by node h.

We begin this chapter with an overview in Section 5.2 of how the path expression method works. In section 5.3, we define the syntax of path expressions and show how they can be used to represent a set of paths through a flow graph. A new symbolic evaluation technique called partial symbolic evaluation of path expressions is described in Section 5.4. A complete description of the method is given in Section 5.5. Some limi- tations and enhancements are discussed in Section 5.6. An algorithm for the derivation of a path expression representing the set of paths from a given node to a given node or edge is presented in Appendix 11. A brief description of our (partial) implementation of the path expression heuristic and a script of a "typical" session are given in Appendix III.

2. Intuition Motivating the Path Expression Method

In this section we introduce the path expression method by presenting two exam- ples.

Note that for siruclurcd graph.s. there is only one such edge.

-85-

Example 5.2.1:

We begin with a simple example based on a common program fragment:

begin

done := false; while not done do begin

SI;

if B then done := true

else S2;

S3; end; end;

where S1,S2, and S3 are arbitrary statements and B is a boolean expression. The flow

graph for this program fragment is shown in Figure 5.2.1.

Consider the definition-p-use association (l,(2,8),done). Let n be the set of paths from node 1 to edge (2,8). We can represent this set by a path expression

lp2-8. The symbols 1 and 8 are node symbols which represent nodes 1 and 8, respectively. The symbol P2 is a loop symbol which represents the set of paths through the loop headed by node 2. We think of the path expression as meaning "any path in n consists of node 1 followed by some arbitrary path through the loop whose header is node 2, followed by node 8." This representation of n is "high-level" in the sense that it omits details about what happens inside of the loop.

For a path n in n to be executable, the values of the variables must satisfy the expression E=(Jo/ie=true) when n exits from the loop.

Since the variable done is set to false in node 1 , the expression E cannot be satisfied by the values of the variables at the bottom of node 1. Therefore, along any executable path from 1 to 8, the expression E must change value within the loop.

-fC-

G

done:= false;

Ta^ done:= true;

FIGURE 5.2.1

-87-

In order for E to change value, one of the variables in E must have a definition. In this case, the only variable in E is done so we can eliminate from consideration any path along which done has no definition. But we are looking for paths which are definition clear with respect to done, so we can also eliminate any paths which do have a definition of done. We have eliminated all of the possibilities, so we can conclude that the associa- tion is unexecutable.

Note that in this case we did not have to pay attention to any details about what hap- pens within the loop. In other cases we will have to rewrite the path expression to make it express more details about what happens inside of loops. In either case, path expres- sions provide a convenient formalism which helps us to look at the problem at "the right level of detail".

A similar situation arises from a program fragment consisting of a for loop in which the upper bound is always greater than or equal to the lower bound. If the counter vari- able X is set equal to the lower bound in node i and if edge (j,k) is the exit from the loop, then the def-use association (i,(j,k),x) is unexecutable. An analysis analogous to the one above can be used to prove this. These two common "paradigms" of unexecutable asso- ciations can by shown to be unexecutable by a straight-forward application of the path expression method.

Example 5.2.2:

Our next example is somewhat more complicated. Consider the association (2,(6,14),switch) in the "bubble-son" program shown in Figure 5.2.2. The set of paths from node 2 to edge (6,14) can be represented by the path expression

a=2-p3-5p6l4.

^<i-

^ ^

switch:= true;

read(a[ij)

n:-a- 1

FK.l RK 52.2

-89-

We think of a as saying that any path from node 2 to edge (6,14) stans at node 2, then takes an arbitrary path through the loop headed by node 3, then goes to node 5, then takes an arbitrary path through the loop headed by node 6, then ends at node 14. In order to execute branch (6,14), the expression

E= ((« < 2) or {switch = false)) must be satisfied by the values of the variables at the bottom of node 6. Since the values of the variables never satisfy E when control reaches the bottom of node 5, we know that on any executable path at least one of the variables in the set [n,switch} must change value in the loop headed by node 6.

At this point we need more information about the paths through the loop headed by node 6. We replace the symbol p^ by a sub-path-expression which includes details about the body of this loop. We obtain

a'=2p3-5(6-7-p8l3)*614. We think of a' as adding the information that any path through the loop headed by node 6 consists of zero or more subpaths which start at node 6, go to node 7, take an arbitrary path through the loop headed by node 8, and end at node 13; followed by a final visit to node 6.

Since we are looking for def-clear paths with respect to switch we can eliminate any path along which switch is always redefined. Since switch has a definition in node 7, the only definition clear paths with respect to switch are those represented by the path expression

a"=2p3-5-6 14 a" says that the only path through the loop headed by node 6 which we need to consider is the path consisting of the single node 6. But none of the variables in {n,switch] are

-90- .

defined in node 6, E cannot change value in the loop, so the definition-use association is unexecutable.

Overview of the path expression method

We now give a rough overview of the path expression method. The ideas presented here will be made more precise in the following sections.

Given an association (d,u,v) we begin by deriving a path expression a which represents the set of paths from d to u, and which omits details about paths through loops. The path expression has node symbols which represent single nodes and loop symbols which represent sets of paths through loops. As the algorithm progresses we use a com- bination of symbolic evaluation and data flow analysis to eliminate from consideration paths which can be shown to be unexecutable or which can be shown to have definitions of v.

For each loop symbol occurring in a we use symbolic evaluation to attempt to find a collection of sets of variables, {5i,...,5i} such that for each h, at least one of the vari- ables in the set 5/, must be redefined within the loop on any executable path covering (d.u.v). We then replace the loop symbol by a sub-expression a' which includes more details about the body of the loop, a' represents the set of definition-clear paths with respect to v through the loop. If, for some h, none of the variables in S/, have definitions on any of the paths represented by a' then there are no executable paths in the set of paths represented by a, so the association is unexecutable.

The algorithm terminates when cither 1. ihc current path expression represents the empty set of paths (in this case the associ- ation is unexecutable).

-91-

or

2. the current path expression contains no more loop symbols (in this case the associa- tion may be executable).

3. Definition of Path Expressions

We now define the syntax and semantics of path expressions*. Let P be a procedure with flow graph G.

Let Z = {n, I n, Nodes(G)} u {p, I «, is a loop header node in G). We will refer to the symbols n, as node symbols and the symbols p, as loop symbols. We will say that the symbols p, and n, correspond to the node «,-. Let Q = { (, ), *, •, [^, X, 0 ). A path expression is a regular expression e (ZuQ)* defined in the usual way [DAV83]. Specifically,

1. X is a path expression.

2. 0 is a path expression.

3. Each element of Z is a path expression.

4. If a and P are path expressions then so are i. (ap)

ii. (auP) iii. a*

Tarjan [TAR81A,TAR81B] and Wegman [WEG83], and others, have also used regular ex- pressions to represent sets of paths through flow graphs. Our formalism differs somewhat from theirs, in that we exploit the structure of the flow graphs, placing special emphasis on loop headers. By so doing, we loose some generality, but gain the ability to ignore details about the internal structure of loops until they arc needed.

-92-

5. Only those strings formed by a finite number of applications of rules 1 through 4 are path expressions.

We associate a set PATHS (a) of sequences of nodes with each path expression a as follows:

1. PATHS (X) = (X}, the set whose only element is the null path.

2. PATHS(0) = 0, the empty set.

3. PATHS(n,) = {az, ), the set whose only element is the path of length one whose only node is Oj.

4. PATHS(p,) = {a I a is a path which begins and ends with node «, and which con- tains no nodes which are not in the loop headed by n, }

5. PATHS(a-p) = PATHS(a)PATHS(p)= (opIaePATHSCa) and pePATHSCp))

6. PATHS(ai^p) = PATHS(a) ^ PATHS(p)

7. PATHS(a*) = (PATHS(a))*

If S is a set of paths and a is a path expression, we will say that a represents S if and only if PATHS (a)=S.

A path expression a is valid if every element of PATHS(a) is actually a path in G. That is, a is valid if and only if

X\- ... Xk G PATHS(a) => (x,.j:,>]) is an edge in G for 1 < i < k - 1.

We will say that two valid path expressions, a and P are equivalent, denoted a = P if and only if PATHS(a) = PATHS(P).

As a notational convenience we eliminate extraneous parentheses, adopting the con- vention that * has precedence over "•" which has precedence over \j ^'^ c will

-93- sometimes use the numeral "i" written in bold-face in place of n,.

We adopt the notational convention of using lower case Greek letters near the beginning of the alphabet to represent path expressions, and lower case Greek letters near the end of the alphabet to represent paths.

If a is a substring of P and a is a path expression we will say that a is a sub-path- expression of |3.

It will be necessary to distinguish between sub-path-expressions and occurrences of sub-path-expressions. For the sake of clarity, we will represent path expressions as binary trees in the obvious way. We will refer to the nodes of these binary trees as ver- tices in order to avoid confusing them with nodes of the procedure's flow graph. Each vertex of the binary tree will be labeled with a symbol from ZuQ. Each leaf vertex will be labelled with a node symbol, a loop symbol, "k or 0. Each intemal vertex will be labelled with *, •, or ^^. Vertices with the symbols and ^^ will have two children, while vertices with the symbol * will only have a left child.

The path-expression corresponding to a tree T can easily be recovered by perform- ing an in-order traversal of T:

function PathExpr(T: tree): path-expression; begin

if T is the null tree then return null path-expression

else

begin

sym := label(root(T)); ifsyme {"•","*", "<^") then

return(CONCAT("(",PathExpr(left-subtree),sym,PaihExpr(right-subtree),")")) else return(sym) end end;

Let T be a tree representing some path expression a. Then each subtree of T corresponds to an occurrence of a sub-path-expression of a. We will sometimes denote a

-94-

subtree of T which corresponds to a sub-path-expression P by T^ . The notation (T°T^) will denote a tree whose root is labelled with the symbol •, whose left subtree is and whose right subtree is T^. Similarly, (T°^\^T^) will denote a tree whose root is labelled with the symbol \^, whose left subtree is 7" and whose right subtree is T^; the notation (7°)* will denote a tree whose root is labelled with the symbol *, whose left subtree is 7" and whose right subtree is null. We will sometimes omit parentheses when it is unecessary to distinguish between details of the structure of the trees. For example, 7a.7P.7Y ^,iii denote either ((7°T^)T^) or (7"-(7Pt^)). The symbol T'^+T^ will denote the set of trees {7"-7P, 7"^^7^ 7P<^7"}

Let be a subtree of 7^. We will say that tc is a T^-subpath of p if and only if ne PATHS(a), pe PATHS(Y), Ji is a subpaih of p and either

1 . 7^ 7°-h7^ or 7^€ 7^+7°, or 7^6 (7°)* or

2. 71 is a 7°-subpath of a and o is a 7'^-subpath of p.

Intuitively, if 7i is a T^-suhpaih of p, 7t is the portion of p which comes from the occurrence of the sub-path-expression a of y corresponding to the subtree 7".

Let 7] and Ti be trees, where T2 may be a subtree of 7;. We define the 72- reduction of 7] to be the path expression computed by the following function:

-95-

function Reduce(7"i,r2): path-expression; begin

if Ti is the null tree then return (null path-expression)

else

begin

sym := label(root(7i));

if syme {Node-symbols, Loop-symbols} then retiim(sym)

else

begin

L := left subtree of Tj ; R := right subtree of 7" i ; ifsym= " 'then

return(CONCAT(Reduce(L,r2)."-".Reduce(R,7"2))); if sym = "[^" then

if T2 is a subtree of L then retum(C0NCAT("-",Reduce(L,r2))) else if 72 is a subtree of R then retum(C0NCAT("-",Reduce(R,r2)))

elsereturn(CONCAT(Reduce(L,r2)/'u"'^^"'^^<^'^2))) if sym = "* " then

if 7^2 is a subtree of L then return(C0NCAT("-",Reduce(L,r2)))

elsereturn(CONCAT("(",Reduce(L,7-2),")"/'*")); end; end; end;

Let Y be the T^-reduction of 7" where T^ is a subtree of 7°. Then y is a path expression which represents a subset of PATHS(a) consisting entirely of paths having 7P-subpaths. Furthermore, 7^5p-e where 5 and e are path expressions. Note that if there is no subtree 7^ of 7" such that 7^ is a subtree of 7^ and the root of 7^ is labelled with \j" or "*", then the 7'^-reduction of 7" is a, itself.

We associate with each valid path expression a three sets of variables, ALWAYS(a), SOMETIMES(a) and NEVER(a) which indicate the variables which have definitions along all, some, or none of the paths in PATHS (a). We define these sets recursively as follows:

Let VAR_SET be the set of all variables visible to the procedure P. ALWAYS(A.) = 0 SOMETIMES(X) = 0 NEVER(X.) = VAR_SET

-96-

ALWAYS(0) = VAR_SET SOMETIMES(0) = 0 NEVER(0) = VAR_SET

ALWAYS(n,) = {v g VAR_SET I v has a global definition in node i)

SOMETIMES(n,) = ALWAYS(n,)

NEVER(n,) = {v e VAR_SET I v does not have a global definition in node i}

ALWAYS(p,) = (v G VAR_SET I v has a global definition on every path

inPATHS(p,)} SOMETIMES(p,) = {v 6 VAR_SET I v has a global definition on some path

in PATHSfp,)} NEVER(p,) = {v G VAR_SET I v has a global definition on no path

inPATHS(p,))

ALWAYS((ap)) = ALWAYS(a) u ALWAYS(P) SOMETIMES((ap)) = SOMETIMES(a) u SOMETIMES(p) NEVER((a-p)) = NEVER(a) n NEVER(p)

ALWAYS((ai^3)) = ALWAYS(a) n ALWAYS(p) SOMETIMES((a,^P)) = SOMETIMES(a) u SOMETIMES(p) NEVER((a^^P)) = NEVER(a) n NEVER(p)

ALWAYS{a*) = 0 SOMETIMES(a*) = SOMETIMES(a) NEVER(a*) = NEVER(a)

It is straight forward to show that the following properties hold for all path expres- sions a which are not equivalent to 0:

1 . ALWAYSCa) c SOMETIMES(a)

2. NEVER(a) = COMPLEMENT(SOMETIMES(a))

V G

ALWAYS (a)

SOMETIMES {ai if and only if v has a global definition on

all

some

no

NEVER (a) ne PATHS(a).

4. If a = p then ALWAYS(a)=ALWAYS(p). SOMETIMES(a)=SOMETIMES(P). and NEVER(a)=NEVER(p).

-97-

4. Partial Symbolic Evaluation of Path Expressions

Symbolic evaluation techniques aim to derive algebraic expressions summarizing the effect of executing a path or set of paths through a program. Symbolic execution derives an expression representing a single path through a program. Global symbolic evaluation attempts to derive an expression representing an entire program.

Most symbolic execution methods attempt to derive both a path condition describ- ing the set of input values which cause the path to be executed, and a path computation, describing how the values of variables are transformed when the path is executed. Several symbolic execution systems have been described in the literature and have been implemented [KIN76,HOW78b,CHE79,CLA81,KEM85]. Global symbolic evaluation is a much more difficult technique, the goal of which is the derivation of an expression which summarizes the entire effect of a procedure. Its development has been impeded by serious theoretical and practical obstacles.

In this section, we describe a new symbolic evaluation technique, partial symbolic evaluation of path expressions. This method is more general than symbolic execution, yet essentially no more expensive. It can provide some but not all of the information which global symbolic evaluation would provide.

We begin with a formal description of symbolic execution. We then present an example of symbolic execution. This is followed by a formal description of partial sym- bolic evaluation of path expressions followed by an example. Finally, we give a brief description of global symbolic evaluation and compare the three techniques. Initially we will assume that there are no procedure or function calls along any of the paths being symbolically evaluated. We will relax this assumption later.

-98-

The systems described in the literature symbolically execute paths from the program's entry to its exit. Our heuristic method for attempting to determine whether a definition-use association is executable involves deriving path conditions and path com- putations for arbitrary paths. We generalize symbolic execution in a straight-forward way to allow for this.

Let P be a procedure and let V ■[,... ,V„ be the program variables which are visible to P (i.e., local variables, formal parameters and non-local variables). Let T^,...,T„ be the types of V'],...,V'„, respectively; let S, be the set of values which can be assumed by vari- ables of type 7",; and let U = Sj u ... u 5„. For 1 < i < n, let 5/ = S, u (T) where we think of "T" as denoting the value "undefined". Let 5'„ + i = U*, the set of finite sequences of elements of U.

We define the state space ofP to be the set

S = Si X ... X S „+i.

A state V = (v'l ,...,v„^i) is an element of S. If v is a state, we think of V] v„ as

representing the values of the variables V ^ V^ and of v„+i = {X\,X2,...) as represent- ing the sequence in the input file. For notational convenience we are assuming that P gets its input from a single file input. If P took input from k > 1 files, we could let •S'„*2 S'„.,i=U* represent these files, and let S =5/ x ... x5'„+4.

To each path n in P there corresponds a subset D" of S called the path domain ofn defined by

D" = |v e S I if at the top of path n the values of variables V'j V,, are V] v„.

respectively, and the sequence of values in ilic input file is \-„,\ then path re will be executed)

-99- and a function f- =(f\,...J^^\ ) : D" ^ S which transforms an initial state to di final state.

The goal of symbolic execution is the derivation of a predicate PC^{\), called the path condition, such that

PC"(v) holds if and only if v D" and a predicate C(9MP"(v,w)=C(9M/'7(v,w)& ... &COMP;j(v,w), called the path com- putation, such that

COM/'f (v,w) holds if and only if w, =f-{y)

Thus, the path condition of it describes the path domain of 7t and the path computation describes the way the values of variables are transformed when 7t is executed.

Note that if /'C"=FALSE then D" is empty, so 7t is unexecutable. However the converse does not hold. Recall that a path is executable if it is a subpath of a complete executable path. Thus, if the only complete paths having 7: as a subpath are unexecut- able, then 71 is unexecutable, even if FC" is satisfiable. If 7t is a complete path, then n is executable if and only if fC" is satisfiable if and only if Z)" is non-empty.

We now define the predicates COMP"^ and PC". We will first define COMP^ and PC^ when 7: consists of at most one node. We will define COMP^^ in terms of COMP^ and COMPP and define PC"P in terms of PC, PC P and COMP"" .

Let Ti = X, the null path. Then PC^Cv) = TRUE and COMP"" (v,w) = (v = w).

Let 7C = (ni) be a path consisting of a single node. Then PC" = TRUE and COA/f "(v,w): is the predicate computed by the following algorithm:

type Symbolic^ Value : character string;

SymboIic_Sequence : sequence of characters; Symbolic_State : record

- 100- .

v: Symbolic_Value;

in: Symbolic_Sequence; end; {To avoid confusion we will refer to variables used in this algorithm as "variables" and to variables occurring in the procedure which is being symbolically executed as "pro- gram variables". Notation: we will denote variables of type Symbolic_State by bold-face letters and their fields by subscripted letters. For example if x is a variable of type Symbolic_State then X],...,Xn stand for x.v[l],...,x.v[n] which represent the symbolic values of the program variables Vi,...,V„; x^+i stands for x.in which represents the sym- bolic value of the input file. We assume that the elements of the sequence Xn+i are dis- tinct from one another. }

function Symbolically_Execute_Node (n: Node;v,w:Symbolic_State): Predicate; (Remms an expression representing the predicate COMP^{\,w)} var X : Symbolic_State; begin

for i := 1 to n-t-1 do a:, := "v,";

for sc := 1 to (number of statements in block) do

begin

stmt := sc'^ statement; Symbolically_Execute_Statement(stmt,x);

end;

for i := 1 to n do

COMPf := CONCAT("h' ", "='\x,y,

COMP'' := COMPOS:. ..SiCOMPl

returnCCOA/P"): end; {Symbolically_Execute_Node)

procedure Symbolically_Execute_Statement (stmt : StatementType; var x : Symbolic_State); begin

if stmt is an input statement,

"read(V, J ,...,\',^;" then begin

j:,, :=Head(x„+i);

A^.i :=Tail(A:„+i);

x,^ :=Head(x„ + ]);

Xr,^x :=Tail(.x„.i); end; if stmt is an assignment statement,

"\\ := cxpr" then .t, := Symbolically-Evaluate-Expr(cxpr.x); end. ( S ymKilically-Evaluatc-Staiemcnt }

fiiiuiion S\nilK)licall\-l \<iluale-I'\pr(expr,x): character_string;

( Returns the sinng tonncd hy replacing each occurrence of program variable \\ in cxpr

- 101 -

by the value of variable x, . For example, if expr h "V^+V2" and the values of Xi andj:2 are "V3*V4" and "5", respectively, then Symbolically-Evaluate-Expr returns the character string "v3*V4+5". This could be defined formally in terms of functions Symbolically_Execute_Term and Symbolically_Execute_Factor and a parse of the expression. )

Let o = np be the concatenation of the paths n and p. The function correspond- ing to o is the composition of the functions/" and/*' corresponding to the paths n and p. Thus,

COMP'^iwWf) = COMP''{\y) & COMP^{\\\v).

LetBC'^(Vi,...,V''„) be the branch condition on the edge between the last node of ;t and the first node of p. Then

PC^Cv) = PC^Cv) & COMP^Cv.v') & PC^iy') &. BC'Piy').

It is clear that the predicates COMP^ and PC^ satisfy the desired propenies.

We can summarize all of the information obtained from symbolically executing a path 7t in the predicate 5£"( v,w ) given by

5£"=/'C"&C0M/'".

Thus, (v,w) satisfies S£"(v,w) if and only if v e D" and /" transforms v into w.

These recursive definitions of COMP^ and PC^ lead in an obvious way to a non- deterministic recursive algorithm for computing the predicates:

iflength(Tt) = 0 then CO\fP'':=COMP^ else if length(7r) = 1 then

compute COMP^ by above algorithm else begin

let 7t=op where length(o),length(p)>l;

compute COMP^ from COMP^ and COMP^ according to the definition end; Specifying a particular way to decompose n into a and p gives a deterministic algorithm.

- 102-

If on each recursive call n is decomposed in such a way that length(a)=l the algorithm which results is the forward expansion algorithm described in the literature. If n is decomposed in such a way that length(p)=l the algorithm which results is similar to the the backward expansion algorithm described in the literature.

We now present an example of symbolic execution. Consider the bubbleson pro- gram shown in Figure 5.2.2. We will denote the initial values of the variables i, a, n. imp, switch, and inpui by /°, a°, n^, tmp^, switch^, and input^ respectively, and the final values by /^, a^, n^, tmp^ , switch^, and input^ respectively. Thus we will be trying to derive the expression

5£"( i^ ,a^ ,n^ ,rmp'^ ,switch^ ,inpui^ , i^,a^,n^,tmp^,switch^,input^)

Let 7t be the path (6,7,8,9,10,12,8,13,6,14). We initialize the path condition to TRUE. Figure 5.4.1 shows the symbolic values of the variables and the path condition after symbolic execution of each subpath. We have only shown those variables which are changed by/" or which are involved in PC". The entry in the row for subpath pn, in the column labelled BC is the predicate BC'"''(w) & CO\fpP(\^,w).

To obtain COMP^ we set the final values of variables equal to the symbolic values. To obtain the path condition we conjoin the predicates BC. Conjoining COMP^' and PC" gives

SE^{ i^ ,a^ ,n^ ,rmp^ jwitch^ ,inpui^ , i^,a^,nf,tmp^,swiich^,input^) = (a°|Il>a"|21) & (n"=2) & (switch^) & (c2^[\]^i^\2]) & {a^[2]=<2^[\]) & (fl^[3)=a"(3I) & (a^|4]=u"l4]) & (a^[5]=a^\5]) & (i^=2) & (n^=l) & (s^^iich^) & {inpui^=inpui^)

Thus, symbolic execution of n shows that in order for n to be executed, the value of

- 103-

n must be equal to 2, the value of swiich must be TRUE, and the value of a[l] must be greater than the value of a[2] at the beginning of the path. After executing such a path, the value of / will be 2, the value of a[l] and a[2] will have been swapped, and the value of n will be 1.

subpath

a[l]

a[2]

i

n

switch

BC

initial

a°[\]

a^m

switch"

TRUE

(6)

(6,7)

1

FALSE

n'*>2 & switch'^

(6,7,8)

(6,7,8,9)

l>n°-l

(6,7,8,9,10)

a°[2]

a'il]

TRUE

fl°[l]>a°[2]

(6,7,8,9,10,12)

2

(6,7,8,9,10,12,8)

(6,7,8,9,10,12,8,13)

«°-l

2>n°-\

(6,7,8,9,10,12,8,13,6)

(6,7,8,9,10,12,8,13,6,14)

n°-l<2ornotTRUE

Figure 5.4.1

We now describe partial evaluation of path expressions. Let a be a valid path expression. The goal of partial symbolic evaluation of a is the derivation of the predi- cates PCOMP'^i\,w), the partial path compulation, and PPC^(y), the partial path con- dition. Intuitively, PPC will partially describe the common features of the path domains of paths in PATHS(a) and PCOMP will panially describe the common features of path computations corresponding to paths in PATHS(a).

Let (3 be a sub-path-expression of a consisting only of node symbols, concatenation symbols, lambdas and parentheses. Then PATHS(p) consists of a single path, n. In this case, partial symbolic evaluation of P will be essentially the same as symbolic execution

- 104-

of n. On the other hand, if p contains loop symbols, union symbols, or stars, then PATHS(P) will have more than one element. Unlike global symbolic evaluation, panial symbolic evaluation will not attempt to consider all of the details of each of the (possibly infinitely many) elements of PATHS(P). Rather, it will assign a new symbolic value to any variable which has a global definition on at least one path in PATHS(p). Intuitively, if there is more than one way in which the value of a variable can be changed by the functions corresponding to PATHS(P), then the value of that variable is unknown after partially symbolically evaluating p. A precise definition of partial symbolic evaluation follows.

We define PCOMP^{\\w) recursively as follows:

1. If a = n, or a = X then PCOMP°-(y,yi) = COMP^Cv.w) where 7i is the unique ele- ment of PATHS(a).

2. Ifa = 0thenPCOM/'"(v,w) = FALSE.

3. Forae {p., (PuyX P*)

Let SOMETIMES(a) = (V,, V,J. Then

f u', =UNIQUE NEW SYMBOL if / e fi i ij

I w\ = V, otherwise

4. If a = (3y) then PCOMP°(v,w) = PCOMP^{\y') & PCOMP^(\',vf).

The definition of PPC^i\) for a path expression a is similar to the definition of PC^iWw) for a path n.

1 . If a is a single symbol n, .p, . X. or a = (P^^/) or a = p* then P/'r°{v) = TRUE.

2. If a IS 0 then PPC°(\) s FALSE.

- 105-

3. If a is (Py) then

PPCiy) = PPC^iv) & PCOMP^{\y) & PPCH\') & BC^Hxy,

We summarize all of the information obtained from partially symbolically evaluat- ing a path expression a with the predicate /'S£'°(v,w) given by

P5£° = PPC^ & PCOMP"^.

We now present an example of partial symbolic evaluation. Consider again the bubbleson program shown in Figure 5.2.2. Let a be the path expression

6-7p8l3 6 14. We initialize the symbolic values of the variables /, a, n, and switch to / , a°, n , and switch^, respectively, and we initialize the path condition to TRUE. Figure 5.4.2 shows the symbolic values of the variables and the path condition after executing each subpath. Note that in symbolically executing the loop symbol pg we have given new symbolic values to the variables a, switch, and i because each of these variables is potentially redefined in the loop headed by node 8.

Setting the final values of variables equal to the symbolic values, conjoining the entries in the column labelled BC, and simplifying, we get

PSE'^ii^ ,a^ ,n^ , switch^ Mpui^ , i^^ ,n^, switch^, input^ =

{af=a') & (i^=i') & («^=rt°-l) & {,swUchf=switch') & n°>2 &. switch^ & r>n°-\

& {n^-\<2v noi switch')

= n^>2& switch^ & i^ > n^-\ & (n^=3 v not switch^)

Thus, partial symbolic evaluation of a tells us that in order for a path in PATHS(a) to be executed, the value of n must be greater than or equal to 2 and the value of switch must be TRUE at the beginning of the path; after executing such a path, the value of / will be strictly greater than one less than the initial value of n; furthermore, if the initial

- 106-

value of n was not equal to 3, then the final value of switch must be FALSE. This is some, but not all of the information which, in theory, global symbolic evaluation of the set of paths might provide.

subpath

a

i

n

switch

BC

iniiial

/"

n'

switch^

TRUE

6

6-7

1

FALSE

n°>2 & switch°

67d.

a'

/'

switch'

67p8l3

n°-l

i'>n°-\

6-7ps 13 6

6 7p, 13-614

(n°-l <2) vnoi.rK7fc/i'

Figure 5.4.2

For comparison, we will now briefly describe global symbolic evaluation. The descriptions of global symbolic evaluation in the literature deal with global symbolic evaluation of an entire routine. The goal is to derive a single expression which summar- izes the effect of executing any path through the routine. It is straight forward to general- ize this to global symbolic evaluation of a path expression. Given a path expression a the goal of global symbolic evaluation is the derivation of a predicate G5£°(v,w) given by

GSE°^ = V S£".

ne PATHS (a)

(v.w) satisfies G5E°(v.w) if and only if for some TtePATHSfa). v e D'^ and f" transforms v into w. Thus GSE^ gives a complete description of all of the paths in PATlLS(a).

Unfonunately, and not surprisingly, there are serious obstacles to automating global

symbolic evaluation If P.\THS(a) = (Tt] n„) is finite then it is fat least in principle)

possible to automatically derive GSE^. In this case GSE° will be of the form

- 107-

(/>C"' ^ COMP""^ ) &...& (PC"" -> COMP''")

The length of this expression may be exponential in the length of a.

The situation is even worse if a contains loop symbols or stars. In this case, PATHS(a) is infinite and it may be impossible to automatically derive GSE^. Global symbolic evaluation methods described in the literature attempt to derive and solve recurrence relations describing the effects of loops. However, it is not always possible to do so. Nested loops and selection statements within loops are particularly problematic.

Consider the path expression a=6-7-pg-13-6-14 to which we applied partial sym- bolic evaluation above. Manually analyzing the set of paths represented by a shows that

GS£°(v,w)=

in^>2) & switch'^ &

((n^ > 2) ^ i(a°{n^-]]<a^[n% & (af[n^]=a^{n%))

& (Vj)((l<j<«°)^(a/[«0]>a-^[j]))

This expression gives more detail about the effect of the paths through the loop headed by node 6. Unfortunately, however, the global symbolic evaluation methods described in the literature are not powerful enough to derive GSE°^ for this path expression.

The following facts about the relationship between the symbolic execution, partial symbolic evaluation, and global symbolic evaluation are easy to verify:

1. For each K€ PATHS (a). Sf'' =>P5£"

2. CS£"=>/'5£"

3. If PATHS(a)= { n } then 5E" = PSE" = GSE""

- 108-

When PATHS(a) consists of a single path, the three methods are equivalent. Global symbolic evaluation is more powerful than partial symbolic evaluation, but partial sym- bolic evaluation is much more practical, being essentially no more difficult than symbolic execution.

We conclude this section with some comments on procedure calls. Paths containing procedure calls are problematic for symbolic execution systems. (Note that since we are assuming that functions cannot take var parameters or have side effects, they do not present the same problems as procedure calls do.) Let 7i be a path consisting of a single node whose only statement is a procedure call. One would like 5£"(v,w) to hold if and only if when the procedure is called with initial state v the procedure terminates and the state upon returning from the call is w. However, since there is no way to determine a priori the path which will be followed through the procedure, in order to derive S£"(v,w) one would have to employ global symbolic evaluation.

One approach which has been suggested for symbolically executing paths with pro- cedure calls is the replacement of the node corresponding to the procedure call by a par- ticular path through the called procedure. This is equivalent to disallowing procedure calls, and seems overly restrictive. Another approach is to replace the procedure call by an executable specification for the procedure. This seems more reasonable, but it assumes the availability of an executable specification for each procedure and assumes that the procedure correctly implements that specification.

For partial symbolic evaluation, on the other hand, there is an easy way to treat nodes having procedure calls. The approach is consistent with the treatment of other symbols corresponding to more than one path. If a is a node whose only statement is a procedure call then wc simply define

- 109-

, H' ^UNIQUE NEW SYMBOL if / €{i^ , ...Jh) PCOMPfi\,w)=] - - , .

', = V, otherwise

where V;j ,...,V,^ are the variables which can potentially be defined by the procedure call (actual parameters corresponding to var formal parameter and global variables which are modified in the called procedure).

5. Precise Description of the Path Expression Method

In this section we give a precise description of the path expression method. Let (d,u,v) be definition-use association. The path expression method will attempt to deter- mine that (d,u,v) is unexecutable, that is, that there is no executable path from the procedure's entry to its exit which has as a subpath a definition-clear path with respect to V from d to u.

We can assume, without loss of generality, that u is a node, rather than an edge. This is because we have defined structured flow graphs in such a way as to insure that if (i,j) is an edge having a p-use of some variable, then there is no edge (i,k) for k?tj. Thus, if (d,(i,j),v) is a definition-p-use association, the paths from d to (ij) are precisely the paths from d to j.

A candidate is a path from the procedure's entry to its exit which has a subpath going from d to u. A candidate is viable if it is executable and if it has a definition clear subpath with respect to v from d to u. Thus, the definition-use association is executable if and only if the set of viable candidates is non-empty.

We will use a path expression ^ to describe the set of candidates. As the path expression method progresses, we will attempt to eliminate from consideration subsets which consist only of candidates which are not viable. When such a subset is eliminated.

- 110-

we will rewrite ^ accordingly. If we can eliminate all of the candidates from considera- tion then the association is unexecutable. In this case, ^ will be equivalent to the path expression 0.

Let a be a path expression which begins with n^ or p^ and let (i,j) be an edge in G. We will sometimes prepend a with a symbol n,"''' or the symbol p,"^. This will indicate that we are including the branch condition on edge (i,j) in the partial symbolic evaluation of a. Similarly, when a ends with n^ or p^ and (j,i) is an edge in G we will sometimes append a with n,*^' or p,'"". This will indicate that we are including the branch condi- tion on edge (j,i) in the partial symbolic evaluation of a.

Let T be the tree corresponding to a path expression representing the set of paths from the entry node to the bottom of node d. Let T be the tree corresponding to a path expression representing the set of paths which go from the bottom of node d to the top of node u without passing through d or u. Let T be the tree corresponding to a path expression representing the set of paths from the top of node u to the exit node. Let jsg 7a .y-a .ya 7hen every candidate is an element of PATHS(^) of the form Ttpo where ne PATHS(a''^), p€ PATHS(a'^"), and oe PATHSia""). The viable candi- dates are executable paths of this form where p is a definition-clear path with respect to V. Throughout this section, T^ will stand for the tree representing the current set of can- didates and T will stand for the current subtree corresponding to the definition-clear subpaihs with respect to v from d to u.

W'c begin this section by defining a set of rules which will be used to re\>.Tite path expressions. W c will then define some notions which will be used for showing that cer- tain sets of paths contain no executable paths. W'c then present two path-cxprcssion- mcihod algorithms. The first, PEMl, is a very general non-dciemiinisiic procedure

which is not guaranteed to terminate. The second, PEM2, is an efficient but less general algorithm, in which most of the non-determinism has been removed. We conclude with a discussion of the implementation of PEM2 and two examples.

We now define a set of rewrite rules which will be used in the path expression method:

RWl:Replace a subtree 7"" of 7^ by T^ . RW2:Replace a subtree 7" of 7^ by 7"' where oc=a'.

Let be a subtree of 7^. If 7^ is the tree obtained by applying RW2 to 7", i.e. by replacing the subtree 7" by 7"', then PATHS(^0=PATHS(^). So application of RW2 results in a different representation of the same set of candidate paths. On the other hand, if 7^' is the tree obtained by applying RWl to 7", then PATHS (^OcPATHS(^). Before we apply rule RWl, we will have to insure that none of the paths in PATHS(^)-PATHS(^') is viable.

Let 7" be a subtree of 7^. We will say that is RWl -safe if and only if no viable candidate in PATHS(^) has a 7"-subpath.

Let 7" be an RWl -safe subtree of 7^. Let 7'' be the tree obtained from 7^ by applying RWl to 7°. Then all of the paths in PATHS(^)-PATHS(^') have 7°-subpaths, hence are not viable. So we can safely replace ^ by t,\ thereby eliminating from con- sideration some of the non-viable candidates.

We now discuss methods for trying to determine whether or not a subtree is RWl- safe. The simplest method of showing a subtree to be RW-safe uses only syntactic infor- mation. If 7" is a subtree of 7 and ve ALWAYS(a) then no path having a 7"- subpath is viable, so 7" is RWl -safe.

- 112-

Other methcxls of showing a subtree to be RWl-safe take into account the semantics of the procedure. Let E be a predicate whose free variables are some but not necessarily all of the program variables and let a be a path expression. We will say that E is a- satisfiable if and only if there exist states v and w such that GS£"(v,w) & E(w). Thus, E is a-satisfiable if and only if for some path rce PATHS(a) and some state v, veD" and /" transforms v into w and w satisfies E. In particular, if PATHS(a) is the set of paths from procedure entry to point p then E is a-satisfiable if and only if when control reaches point p, the program state satisfies predicate E. We will say that E is a-unsaiisfiable if and only if E is not a-satisfiable.

In the path expression method we will be interested in showing that certain predi- cates are a-unsatisfiable for cenain path expressions a. Since it is not in general possible to derive GSE"^ we will approximate GSE^ with a weaker predicate. Let C(v,w) be any predicate such that C5£°=>C. Clearly, if (v,\v) satisfies G5£"(v,w) & E(w) then (v,w) satisfies C(v,w) & E(w). Thus if C(v,w) & E(w) is unsatisfiable then E is a-unsatisfiable. In particular, if /'S£"(v,w) & E(w) is unsatisfiable then E is a-unsatisfiable.

We will say that E is (x-necessary if and only if ( V PC'')=>E. That is, if v

ne PATHS (a)

is in the path domain of any element of PATHS(a) then v satisfies E. For example, the predicate PPC(a) is a-necessary. It follows from the definitions that if a predicate E=FALSE is a-necessary then all of the paths in PATHS(a) are unexecutable, hence none of them are subpaths of viable paths, so 7" is R W 1 -safe.

Note that the question of whether or not E is a-satisfiablc pertains to program states which can occur after executing a path from PATHS(a) while the question of nn heiher or not E is a-necessary pertains to program slates which must occur before executing a path from PATHS(a).

- 113-

Let S be a set of program variables and let 7"" be a subtree of T^. We will say that S has the T°- -Must-Redefine property if and only if at least one element VeS has a definition on each r°-subpath of a viable path in PATHS(^). Thus, if S satisfies the 7"- Must-Redefine propeny and S£NEVER(a) then no viable element of PATHS(^) has an r"-subpath. In other words, if S satisfies the 7" -Must-Redefine propeny and ScNEVER(a) then 7" is RWl-safe.

Let a=P-Y-5 be a sub-path-expression of ^, where p, y, and 5 are sub-path- expressions of a. Let E(V],...,V„) be a predicate which is both 5-necessary and P- unsatisfiable. Let VARIABLES(E) denote the set of program variables which actually occur in E. Let ti be an executable element of PATHS(a). Then 7i=pox where pePATHS(p), aePATHS(Y), and xePATHS(5) are all executable paths. Let veD". f transforms v into a state v' which does not satisfy E (since E is P-unsatisfiable). f^ transforms v' into a state w which does satisfy E (since wsD'^ and E is 5-necessary). So at least one of the variables in VARIABLES(E) must change value while path a is being executed. Thus, at least one of the variables in VARIABLES (E) must have a definition on path o.

Combining the above definitions and observations gives LEMMA 5.5.1:

Let 7" be a subtree of 7^ where 7"g7P-7^-7^ Let E(Vi V„) be a predicate

which is both 6-necessar\' and P-unsatisfiable. Then the set VARIABLES(E)

satisfies the 7^-Must-Redefine property. If VARIABLES(E)cNEVER(Y) then 7^ is

RWl-safe.H

We will associate with each subtree 7" of 7^ a collection MUST-REDEFINE(7°) of sets of variables, such that each Se MUST-REDEFINE(7") satisfies the 7°-Must-

- 114-

Redefine Propeny. Initially, each of the collections MUST-REDEFINEfT") will be empty. As the path expression method progresses, we will use Lemma 5.5.1 to derive elements of the collections.

The following observations justify associating the collection MUST- REDEHNECT"") with the root of 7". Let T be a subtree of where is a subtree of 7^ and let S be a set of program variables which satisfies the 7"-MUST-REDEFlNE propeny. Let 7 and T^ be the trees obtained by applying RW2 to T. Then since PATHS(^)=PATHS(^') and PATHS(a)=PATHS(aO, S satisfies the 7"'-Must-Redefine propeny (with respect to ^')- Similarly, if 7*^ is RWl-safe and if 7" and 7^ are the trees obtained by applying RWl to T then S satisfies the 7 -Must-Redefine propeny (with respect to T^ ). Thus if S belongs to the collection MUST-REDEFINE(7") associ- ated with the root of 7°, and if RW2 is applied to a subtree of 7" or RWl is applied to a RWl-safe subtree of 7" then S still satisfies the 7"-Must-Redefine propeny for the transformed tree.

We now present a non-deterministic procedure which attempts to determine whether an association is unexecutable. It returns "UNEXECUTABLE" only if the asso- ciation actually is unexecutable. We will then discuss a strategy for eliminating the non- determinism and increasing the efficiency of the algorithm. An algorithm for computing a path expression representing the set of paths from a node s to a node or edge t is given in Appendix I.

- 115-

Algorithm 5.5.1:

function PEMl(d,UjV'^: (POSSIBLY-EXECUTABLE.UNEXECUTABLE);

initialize T^ and 7"° '

for each subtree 7" of 7"^ do

begin

MUST-REDEFINE(7P):=0;

RWl-safe(rP):= false end;

done:=false; while not done do begin

if ^=0 then return(UNEXECUTABLE);

else select

Dtrue => Apply RWl to an RWl-safe subtree 7" of 7^

ntrue=> Apply RW2 to a subtree 7" of 7^

D true => UPDATE MUST-REDEFINE: choose a subtree T^ of 7^; let a=P-Y-5 be the 7'>^-reduction of 7^; find a 5-necessary predicate E; if E is P-unsatisfiable then

MUST-REDEFINE(7"0:=MUST-REDEFINE(7^)u{VARIABLES(E)};

Dtrue => UPDATE RWl-safe:

Choose a subtree 7" of 7"^'" such that ve ALWAYS(a)

or choose a subtree 7" of 7^ such that MUST-REDEnNE(7°)cNEVER(a)

or choose a subtree of 7^ such that FALSE is a-necessary;

RWl-safe(7"):=true;

ntrue=> done:=true; {give up)

end select; end; return(POSSIBLY-EXECUTABLE);

As indicated above, application of the rewrite rule RWl to RWl-safe states does not eliminate any viable candidates. Thus every time the top of the while loop is reached all of the viable candidates belong to PATHS(^). If ^=0 then PATHS(£) is empty, so there are no viable candidates. In this case the procedure terminates, returning "UNEXE- CUTABLE". However, the algorithm is not guaranteed to terminate. If at any point the

- 116-

"done := true" selection is executed the algorithm will terminate returning "POSSIBLY- EXECUTABLE". In this case PATHS(^) is a superset of the set of viable paths.

It would be possible to eliminate the non-determinism in PEMl, by selecting a par- ticular order in which to perform the operations and to restrict the application of RW2 so as to avoid the generation of trees which had already been generated at a previous step. However, even with these restrictions it would be possible for ^ to become arbitrarily long, for example by using RW2 to repeatedly replace a sub-path-expression a* by

In the function PEM2, we will sacrifice some of the generality of PEMl, for the sake of efficiency. We will do this by only allowing a restricted form of the rewrite rule RW2:

RW2':

Replace a subtree 7" of 7'' by where a=p,, a'=(p*n,), no node symbol or loop symbol occurs more than once in P, and a'=a. (Thus a' represents the set of paths through the loop headed by node i.)

RW2' transforms ^ into an equivalent path expression which includes more details about the body of some loop.

Function PE.\12 uses rewrite rules RWl and RW2', and thus concentrates attention on loops in the attempt to eliminate non-viable candidates. PEM2 will also focus attcn- tion on the subtree 7 , applying the rewrite rules only to subtrees of this tree. PEM2 also eliminates (most oO the non-determinism of PEMl by performing the operations in a pre-determined order.

- 117-

Algorithm 5.5.2

function PEM2(d,u,y): (POSSIBLY-EXECUTABLE,UNEXECUTABLE);

lunction PtMZca,UA'): Initialize T^ and 7 ;

for each subtree 7" of T^ do MUST-REDEFINE(r"):=0; while T has a subtree of the form T^' do begin

{ 1 } if ^=0 then retum(UNEXECUTABLE) else

begin ^

{2} choose a subtree 7P' of T"'^'" ; ^

{3) let a be the 7P' -reduction of 7^;

then a=p'-p,Y' where p' and / are sub-path-expressions; let (3 denote P'-p,'"^ and let ydenote p,'"^ Y; {4} derive y-necessary predicates £ i ,...,£4;

{5) if£]&...&£;t=FALSEthen (FALSE is y-necessary) apply RWl to 7"

else begin

for i := 1 to k do begin {6} if £, is P-unsatisfiable

{7) then MUST-REDEFINE(7^)

:=MUST-REDEFINE(7Y)u{VARIABLES(E)); end; (for) end; (else) {8) Apply RW2' to 7P';

{let 7 denote the subtree which replaces T^ ] [9] Apply RWl to any subtree 7^ of 7* such that ve ALWAYS(e);

{10} Update sets ALWAYS(0, SOMETIMES(0, NEVER(C) for all

sub-path-expressions C, of ^. {11} Apply RWl to any subtree 7^ of 7^ such that

(3SeMUST-REDEFINE(7^) s.t. ScNEVER(e)). end; {else] end; (while)

if PPC(^)=FALSE then (FALSE is ^-necessary) return(UNEXECUTABLE) elsereturn(POSSIBLY-EXECUTABLE);

Each time the top of the while loop in PEM2 is reached, the set of viable candidates is contained in PATHS(^). Thus, PEM2 returns "UNEXECUTABLE" only if the definition-use association (d,u,v) is unexecutable. The algorithm terminates either when b,=0 or when there are no more loop symbols to which to apply RW2'. If PEM2 returns "POSSIBLY-EXECUTABLE" then the final value of S, represents a super-set of the set of viable candidates.

- 118-

We will next address a number of implementation issues. We assume the availabil- ity of a partial symbolic evaluator and a theorem prover.

In line 1 the algorithm has to determine whether ^ is equivalent to 0. One way to make it easier to do this is by performing the following simplifications whenever T'' is modified:

1. Replace subtrees of the form T^T^ or T^-T^ by T^.

2. Replace subtrees of the form T^[^T^ orT^y^T^ by T^.

3. Replace subtrees of the form r^T^ or r^T^ by 7^.

4. Replace subtrees of the form T^* by T^.

5. Replace subtrees of the form 7 * by 7 .

These simplifications can be performed in a single post-order traversal of the tree 7''. After performing these simplifications, ^=0 if and only if ^=0.

Each time execution reaches line 2 the algorithm chooses a subtree, T^\ on which to concentrate. This non-determinism can be eliminated by specifying a particular order in which loop symbols are to be processed. For example it could be stipulated that loop symbols are to be processed in increasing order of depth of nesting and from left to right within a given depth of nesting. More work is needed to determine how the order in which loop symbols are processed effects the efficiency and power of the algorithm.

In line 4, the algorithm derives a set [E\ Ei ] of Y-necessar>' predicates. In order

to derive all of the y necessary predicates, we would have to employ global symbolic evaluation of y However, we can derive some of them by using partial symbolic evalua- tion. Recall that the the predicate PPC^ is y-ncccssary' So we can derive this family by writing PPC^ as a conjunct of clauses, PPC^Cv^s £,(v)&...&£a(v).

- 119-

Similarly, in line 6, the algorithm must determine whether £, is P-unsatisfiable, i.e. whether G5£^(v,w)&£,(w) is unsatisfiable. To get a precise answer to this question we would need to perform global symbolic evaluation to derive the predicate and to employ a sufficiently powerful theorem prover to check its satisfiability. It is not in general pos- sible either to perform global symbolic evaluation or to determine whether an arbitrary predicate is satisfiable. However, we can perform the following approximation. As pointed out earlier, if PSE^{\,w)&Ei{w) is unsatisfiable then £, is (i-unsatisfiable. If the theorem prover can show that /'S£^(v,w)&£,(w) is unsatisfiable then we know that £,■ is p-unsatisfiable. If the theorem prover determines that /'S£^(v,w)&£,(w) is satisfiable or if it cannot reach a conclusion (within some predetermined amount of time) then PEM2 can make the conservative assumption that £, is not P-unsatisfiable (although in fact it might be.)

The sets ALWAYS, SOMETIMES, and NEVER can easily be updated (line 10) whenever 7'' is modified by performing a single post-order traversal of the tree, applying the definitions of the sets. At the same time, RWl can be applied to subtrees which have become RWl -safe by virtue of the updates (line 11).

Note that given a partial symbolic evaluator, a theorem prover, and a strategy for choosing the subtree 7^' on which to concentrate, PEM2 becomes a fully automatic pro- cedure. We now show that the time-complexity of PEM2 is polynomial in the length of the subject program.

The algorithm for initializing 7" produces a path expression in which each loop

d,u

symbol appears at most twice. Let p, be a loop symbol appearing in T" ' . The loop symbols which are generated by repeatedly applying RW2' to 7"^ are all distinct from one another. So the total number of passes through the while loop is less than or equal to

-120- IL " where L is the number of loops in the procedure.

We can also guarantee that at any point in the algorithm, the length of ^ is O(N^) where N is the number of nodes in the flow graph (see Appendix II). Let M be the number of characters in the subject program. Clearly N=0(M) and L=OfM).

In line 1, PEM2 checks whether ^=0. This can be done in O(N^) time by simplify- ing T'' during one traversal of the tree. There are many possible strategies for choosing a subtree T^' on which to concentrate in line 2. Some of these strategies take only 0(1) time. Finding the 7 "-reduction of 7S and the corresponding sub-path-expressions (3 and ^ in step 3 can be done with one traversal of the tree, hence in O(N^) time. Partial sym- bolic evaluation of a path expression yean be done in time proponional to the length of y which is O(N^) so step 4 can be performed in 0(A'^) time. Since the number of clauses in each branch is 0(M), the number, k. of predicates £, derived in step 4 is OiN^M)=0(M-).

Since, as discussed above, the theorem prover can be forced to return after some predetermined constant amount of time, each execution of step 5 and of 6 takes 0(1) time. The for loop is executed O(M^) times, and each execution takes 00) time. Steps 8, 9, and 10 can each be performed in 0(/V') time. In step 1 1 there are O(N') subtrees to examine, each of which can potentially have OCA/-') elements in its MUST-REDEFINE SCI. so step 1 1 takes 0{N^M^)=0(M^) time.

Thus PEM2 runs in time bounded by 0(L^M^)=0{M^). The above estimate was very crude, and was strongly affected by the fact that k=0(M'). This is not a tight upper bound on k. Furthermore, in practice, k will generally be a small number, and the algo- rithm will run more efficiently than this worst-case upper bound would indicate.

- 121 -

It is also possible to implement PEM2 in such a way that only the "relevant" loops need be examined. This improves the run time in practice. To do this, it is necessary to make conservative estimates of the sets ALWAYS(p,), SOMETIMES(p,), NEVER(p,) as follows. Let the definitions oi ALWAYS', SOMETIMES', and NEVER', be analogous to those of ALWAYS, SOMETIMES, and NEVER except that ALWAYS'ip,)=ALVJAYS{n,), 5C>A/£T/M£S'(p,)=VAR_SET; and NEVER'ip,)=0. Then for any path expression a, ALW AY S'{a)Q ALWAYS (a), SC>M£T/MES'(a)2S0METIMES(a), and NEVER'ia)^NEVER(a). Therefore, any sub- tree which is marked as being RWl-safe, using the sets ALWAYS' and NEVER' will in fact be RWl-safe. Also, when performing panial symbolic evaluation, any variable which should be given a unique new symbolic value will be given one.

We next present two examples of the operation of PEM2. In the first, we re- examine the bubble-son procedure presented in section 5.2. In the second we examine a program in which the bodies of deeply nested loops are relevant to determining that an association is unexecutable. When no ambiguity results, we will refer to applying rev.Tiie rules to sub-path-expressions, rather than to the corresponding trees. Figures 5.5.1 and 5.5.3 show how the trees 7"'' and 7" ' evolve as the algorithm proceeds in Examples 5.5.1 and 5.5.2, respectively. In these figures, the root of a subtree T is anno- tated with a list of the elements of MUST-REDEFINE(T).

Example 5.5.1

Consider again the definition-use association (2,(6,14),.m7rc/0 from the bubble-sort program in Figure 5.2.2. Initializing ^ and a"^'" gives a'''"=2-p3-5-p6-14 and e=l 2 p3 5 P6 14 15.

- 122-

We first direct attention toward the loop symbol P3. The 7''' -reduction of T^ is is ^ itself. p=l-2-p3"'^ and •y^p3'""-5-p6-14-15. Panial symbolic evaluation of y gives /»PC^=£i=0>5). Panial symbolic evaluation of p gives PCOMP^=^ii=\) so fj is p- unsatisfiable. We place the element VAR1ABLES(£,)={(} in the set MUST- REDEnNE(7"P' )• (See Figure 5.5.1.a).

We next apply RW2' to 7^', replacing P3 by 5=(3-4)*-3. (See Figure 5.5.1.b). There are no opportunities to apply RWl to 5. Control goes to the top of the while loop with a'^"=2(3-4)*-3-5p(,-14 and 4=l-2(3-4)*-3-5p6l415.

We next concentrate on loop symbol P6. The 7^' -reduction of T"" is is ^ itself. (3=l-2(3-4)*-3-5pf,"'^ and •j^p3'""1415. Partial symbolic evaluation gives PPC'^=ExM{n<2) or (not switch)) and PCOMP^=>{switch8i{n=5)), so £1 is P- unsatisfiable. The set VARIABLES(£i )={"-^^'''<^'') is placed in MUST- REDEFINE(rP'). We apply RW7'. replacing p6 by 5=(6-7psl3)*-6. (See Figure 5.5.1.CJ. Since 5W7';c/i6 ALWAYS((6-7p)il3)j we can apply RWl to this sub-path- expression. (See Figure 5.5.1.d). After simphfying, we have a'''"=2- (3-4)*-3-5-614 where {«.5w//c/i)e MUST-REDEFINE(r^). Since {n^H/;c/i)cNEVER(6) we can apply RWlto6. (See Figure 5. 5. l.e).

Simplifying, gives ^=0. (See Figure 5.5.1.0- Control returns to the top of the while loop, and, since there are no remaining loop symbols, exits from the loop. Since PPC^=PPC^ s.?\LSE, PEM2 returns "UNEXECUTABLE".

Similarly, if wc had concentrated attention on Pf, first, PEMZ would have returned "UNEXECUTABLE". In fact, in this case it would not have had to apply RW2' to 7^'.

-Ii3 -

FIGURE 5.5.1

(a)

/ -)

Apply RW2" to p3

(b)

/

_A

.T4-

Apply RW2' to p6

1 (

V

(c)

Apply RWl

V

(d)

lil-

(e)

j! Apply RWl

\y

Simplify

V

(f)

^

- 127-

Example 5.5.2

We next examine a somewhat more complicated, though admittedly contrived, example of the path expression method. Consider the following procedure, where "def(x)" and "useW are abbreviations for some statements which have a definition or use, respectively, of the variable x. The flow graph for this procedure is shown in Figure 5.5.2.

procedure example5_5_2; var x,y,z : integer; function f(w:integer): boolean; begin (f)

end; {f} begin

def(x);

for y := 1 to 5 do

for z := 1 to 5 do if f(z) then begin

def(x); def(y) end: use(x) end; Consider the definition-use association (2,1 l,x). The path expression method initializes ^

to 1-2-P3-1M2 and a'^'" to 2-p3-ll. (See Fig 5.5.3a.)

We first concentrate on the loop symbol P3. The 7^' -reduction of T^ is ^. (i= llpi'^P and -pp3^'"-lM2. E^=PPC'^=(y>5) and PCOMP^^(y=0). £1 is p- unsatisfiable so the element (y) is added to MUST-REDEFINE(7"P').

Next, JP' is replaced by T^ where 5=(3-4p5-10)*-3 is equivalent to P3. (See Fig 5.5.3b.) Since {y) is not contained in NEVER(5) and there is no subexpression of 5 in which x is always redefined, it is not possible to apply RWl at this point. Control returns to the top of the while loop in PEM2 with £=l-2-(3-4-p5-10)*-3-lM2 and a"^-" to

■ub-

viy^

A def(x); ^v:=l;

FIGIRE 5.5.2

-\13-

FIGURE 5.5.3

fa)

\

d,u

{>■}

11

Apply RW2' to p3

V

(b)

'\

{>■}

12

\,

11

^

.^

/"

A.

p3 {z}

m

(c)

-1^0-

Apply RW2' to p5

K/

12

■*i

11

_^

{>■} V

_/"V

m

/

y

' \

. /

\

'"^\.

-ni-

Apply RWl to 8 and simplify

(d)

-Ill-

Apply RWl

\^

(e)

0 11

[', Simplify

i 1

V

- 133-

2(3 4p5 10)*-311.

We next concentrate on the loop symbol P5. The 7 P' -reduction of T^ is l-2-3-4p5-10-31112. (i=l-2-3-4p5'^^ and ■yfcp5'""-10-31M2. £,=/'/'C^=(z>5) and PCOMP^=>iz=0). £1 is p-unsatisfiable so the element {y} is added to MUST- REDEHNECrP^ ). Next, T^' is replaced by T^ where 5=(6(7|^8)-9)*-5 is equivalent to P5. (See Fig 5.5.3c.) Since x is always defined in node 8, RWl can be applied. (See Fig 5.5.3d.) Now ^=l-2(3-4(6-7-8-9)*-510)*-31112. Since {yleMUST- REDEnNT((3-4(6-7-8-9)*-510)*-3) and {y}£NTVER((3-4(6-7-8-9)*-510)*-3), this subexpression is RWl -safe. Applying RWl gives ^=r2-0-iri2 (See Fig 5.5.3e.) and simplifying gives ^=0. So PEM2 exits from the while loop and returns "UNEXE- CUTABLE". Note that when (y) was added to MUST-REDEnNE(rP'), {y} was not contained in NEVER(p3) so it was not possible to apply RWl at the root R of this sub- tree. However, after applying RW2' to T^^ then to T^^, and applying RWl to T^ more details were known about the viable paths. At this point the subtree rooted at R became RWl -safe.

6. Enhancements and Limitations

In this section we will discuss limitations of the path expression method and pro- pose some enhancements. These enhancements include the relaxation of some of the assumptions about the procedure being tested, as well as the use of more powerful path expression rewriting rules and symbolic evaluation techniques. We will also briefly dis- cuss other uses of the path expression method. Throughout this section, when we refer to the "path expression method" we v.ill mean the implementation of function PEM2 dis- cussed in section 5.5, unless otherwise noted.

- 134-

Since it is undecidable whether or not a panicular definition-use association is exe- cutable, it is inevitable that the path expression method (even in its more general form) cannot identify all unexecutable associations. Rather than trying to formally classify those programs for which the path expression is guaranteed to be successful, we will examine a situation in which the path expression fails to identify unexecutable associa- tions as such.

Example 5.6.1:

Consider the procedure GeiWord in the following program:

program driver(input, output); const MAX = 5;

type String = array[ 1 ..MAX] of char; var a : String;

i, length : integer; procedure GetWord(var word : String; var n: integer); ( Reads a word and changes lower case letters into caps} var i : O..MAX;

offset : integer, begin

wTiteln( 'Enter a word of length <= ',MAX:1 );

n := 0;

while ((not eoln) and (n < MAX)) do

begin

n := n + 1; read(word[n])

end;

offset := ord( 'a') - ord('A');

for 1 := 1 to n do

if ('a' <= word(il) and (word[i] <='z') then word[i| := chr( ord(word[i]) - offset) end, IGctW'ord)

begin (driver)

GctWordl a, length);

for 1 :- 1 to length do wriie(a[ij);

writcin; end

The flow graph for this procedure is shown in Figure 5.6.1. Wc will now apply the path

(p

©n-O;

read(a[n]);

^ offset:=ord('a')-ord('A');

\1^ i:= 1;

a'>word[i] or 'z'<word[i]

Figure 5.6. 1

- 136-

expression method to the definition-use association (1,8, word). This association arises from the fact that word is defined in node 1 (since it is a var parameter) and used in node 8, and there is a definition clear path with respect to word from 2 to 8.

Initially,

a'^."=l*'«'.2-p3-5p6-7-8'^^

and

^=l-2p3-5p6-7 810p6lll2.

We will first concentrate attention on the loop symbol P3. The JP' -reduction of T"^ is is ^itself. (i=l-2-p3'^^ and">tp3'"^'-5p6-7-p8-910p6-lM2.

To find a set of y-necessary predicates we derive the partial path condition of y:

E i^PC''=ieoln OT n>5). To check whether £) is (i-unsatisfiable, we derive the partial path computation of [3:

FCOMP^={n=Ql). Since the predicate PCOMP^&Ei={n=0)&ieoln or n>5)={eoln=TR\JE) is satisfiable, we cannot show £, to be (J-unsatisfiable, and therefore cannot add any elements to the collection MUST-REDEFINE(rP» ).

The path expression P3 is equivalent to 5=(3-4)*-3. Since word is defined in node 4, "we apply RW'l to the subexpression 4 of 5. Simplifying gives 5'=3, and applying RW2'givesa''"=l''"' 235 p. 7 8'"^and^=I 235 p,-7-810p^ 11 12.

We next concentrate on the loop symbol p^. Lei T^" denote the subtree correspond- ing to the occurrence of P(, in a''". The T^' -reduction of T^ is is ^ iiself. [J=i:35pt,"^ and Y=P6'"''7-810 p^, 11 i:

PPC^Mi^)&C^<word[i])S:[word\i]^'z). Let £,=(/^/i). rCOMP^=>{i = \)&{n=€)) so£i is p-unsatisfiablc. So the clement li.n) is placed in MrST-REDEFINE(7''«).

- 137-

RW2' is applied, replacing p^ by the equivalent path expression (6-7(8^9)10)*-6 Since word is defined in node 8 (or since node 8 is the target node) RWl is applied to node 8. At this point, a'''"= l'""-2-3-5(6-7-910)*-6-7-8'°'', and ^=l-2-3-5-(6-7-910)*-6-7-810p6-lM2. Since {i,n) is not a subset of NEVER((6-7-9-10)*-6), no funher re-writing can be done at this point.

Since a''*" has no remaining loop symbols, we exit from the while loop and com- pute PPC^. Since PPC^={{n=Q)&.eoln) is not equivalent to FALSE, the algorithm returns "POSSIBLY-EXECUTABLE".

However, the association is in fact unexecutable. This can be seen by observing that the value of / increases in the loop headed by node 6. Thus, since the predicate (/ = l)&(rt=0) holds before entering the loop and the value of n does not change in the loop, the exit condition, {i>n), can never hold. So none of the candidates are viable.

Intuitively, the reason that the path expression method fails to conclude that this association is unexecutable is because it takes into account the fact that / is redefined in the loop, but fails to take into account the way in which it is redefined.

There are several ways to enhance the path expression method so that unexecutable associations like this one would be detected. One method would be to augment the pro- cedure with user-supplied (or programmer-supplied) assertions and allow the partial sym- bolic evaluator to use the assertions in the obvious way. If the assertion "/>!" were added on branch (6,7) then partial symbolic evaluation of

^=1 2-3 5(6 7 9 10)*-6 7-810p6 11 12 would yield PPC^=^{n=0)& eoln &ii'>\)&{i'<n)=FALSE where /' is the symbolic value of i after executing the sub-path expression l-2-3-5(6-7-910)*. Thus the algo- rithm would return "UNEXECUTABLE".

- 138-

Another way to enhance the path expression method would be to use a more power- ful symbolic evaluation technique than partial symbolic evaluation. While global sym- bolic evaluation in its full generality is not feasible, some limited form in which recurrence relations are derived and solved for some but not all variables could be practi- cal. For example, the recurrence relation for variable / in the loop headed by node 6 is simple:

/°=1

,'^=/*-i + l fork>0

where /* denotes the value of / after k iterations of the loop. Solving the recurrence rela- tion yields / =k-t-l for k>0 which implies that the value of i is greater than or equal to 1 when branch (6,7) is executed. An enhanced version of the path expression method which employed symbolic evaluation techniques powerful enough to infer this fact would return "UNEXECUTABLE" for this association.

Another approach to enhancing the path expression method is to include more powerful revmte rules. Restricting the rewrite rule RW2 to R\V2' was a severe restric- tion, aimed at ensuring the efficiency of the algorithm. If somewhat less stringent restric- tions were placed on R\V2, a more powerful algorithm would result. Consider for exam- ple the rewrite rule

RW2":

Replace a subtree 7" of 7"^ by 7"'^ where a=(Y-5)*Yand [i=Y (5y)*.

If the path expression method were enhanced to allow application of R\V2" as well as RWl, and R\V2' then ^=1 2 35 (6-7-9- 10)* 6 7 S 10 p„ II 12 could be replaced by V^l 2 .^ >U^ (•> 10 U -)♦ S 10 p, 11 12 Since P/T^'hFALSF, the path expression method would return "UNEXECUTABLE".

- 139-

More research is needed to determine which rewrite rules and which more powerful symbolic evaluation techniques could be added to the path expression method without unduly compromising efficiency.

We next discuss the prospects for relaxing some of the assumptions, made in section 5.1, about the procedure P being tested. We will discuss the way various strategies for relaxing assumptions affect the power of the path expression heuristic.

The first assumption about P was that no aliasing occurs in P. There are several ways in which aliasing can arise in a procedure: Ifx and y are var formal parameters in P and P is called with the same actual parameter passed to x and >' then x and y are aliases. Similarly, if a: is a global variable which is also passed to the var parameter y then x and >" are aliases. If p and q are pointer variables with the same value then p' and q' are aliases. If a is an array and i and j have the same value then a [i ] and a [j] are aliases.

In order for the path expression method to function correctly it is necessary that all aliasing be known. U x and y are aliased then any assignment to x is also an assignment to y. This must be taken into account both in the symbolic evaluation and in the checks for whether a variable is in ALWAYS(a), SOMETIMES (a), or KEVER(a) for a path expression a. Two standard approaches to insuring that all aliasing is known are to res- trict the kind of aliasing which can occur, and to allow any kind of aliasing but make conservative assumptions about the values of variables.

The later approach entails considering each definition of a variable x to also be a definition of variable >• if .x and y can potentially be aliased. This is the approach we have taken to handling arrays. One could apply this approach to other kinds of aliasing as fol- lows: Replace NEVER(a) by NEVER' (a) and SOMETlMES(a) by SOMETIMES'{a) where blE\'ER'{a)={x\ neither x nor any variable which is potentially aliased to x is

- 140-

defined on any element of PATHS(a)), and where SOMEJlMES'{a)= complement of NEVER'ia.). In partial symbolic evaluation of a statement in which a variable x is defined, it would be necessary to also note that any variable which is potentially aliased to X could also change value. For example, if >• is a potential alias of j:, then after sym- bolically executing the statement "jc := z", the new symbolic value, Y of y is given by Y=Z or Y-Y', where Y' is the old symbolic value of y and Z is the symbolic value of z. If there are a lot of variables which are potential aliases, this can become quite unwieldy.

A simpler approach to partial symbolic evaluation in the presence of aliasing is to assign a unique new symbol to each potential alias of x whenever an assignment to x is made. This is computationally simple, but could result in too much loss of information. Most likely, the best method of relaxing the "no aliasing" assumption would involve a combination of restricting the type of aliasing which could occur and making conserva- tive assumptions about the ways in which values of variables can change. Perhaps heuristic approaches to determining whether variables which are potential aliases are actually aliased would be fruitful.

The second assumption about P (no hidden side-effects and no functions with var parameters) was also motivated by concern for ensuring the accuracy of the data flow information collected and the panial symbolic evaluation. Approaches to relaxing this assumption are similar to those to relaxing the first assumption. By performing interpro- ccdural data flow analysis [MUC81,BUR861, it is possible to "uncover" hidden side effects. In those cases where it is not possible to tell for certain whether a called pro- cedure or function has side effects it is necessary to make conservative a.ssumptions.

The third assumption is that functions have no \ar parameters. This assumption is a convenience which simplifies the symbolic evaluation of function calls. It can ea.sily be

- 141 -

eliminated.

The founh assumption, that P has no repeat-until statements, was included as a technical convenience in order to insure that all loops in P's flow graph have exits from the top. One way to relax this assumption would be to have different kinds of loop sym- bols for repeat-until loops and for while loops and to expand the set of rewrite rules accordingly. It also seems likely that the path expression method could be generalized to handle arbitrary reducible flow graphs. The concept of loop symbols representing the set of paths through a loop would be replaced by "interval symbols" representing the paths through an interval. For arbitrary flow graphs, some variant of PEl could probably be developed. Such an algorithm would use regular expressions without loop symbols, and would initialize the path expression ^ using the algorithm in [TARS IB]. More investiga- tion of these ideas is needed.

Up to this point we have focussed on the use of the path expression method for attempting to determine that a given definition-use association is unexecutable. This has primarily been motivated by the fact that if we show that all of the definition-use associa- tions required by a data flow testing criterion C but not covered by a test set T are unexe- cutable then we have shown that T is C*-adequate. We will now briefly discuss two other applications of the path expression method.

The path expression method can also be used as an aid for test data generation. Suppose (d,u,v) is an executable definition use association. If we apply the path expres- sion method to (d,u,v) it will return "Possibly Executable". The final value of ^ will be a path expression representing the final set of candidates. This may be a proper subset of the initial set of candidates. By panially symbolically executing the path expression ^ we obtain the ^-necessary predicate PPC^. Since any test element which covers (d,u,v)

- 142-

must satisfy PPC^, we need only consider such elements as potential test cases. Note however, that while PPC^ is a necessary condition for a test case to cover (d,u,v) it is not sufficient.

The path expression method may also be useful in compiler optimization. Many of the code transformations performed during optimization involve finding the definitions of a variable which reach a particular point p, that is, definitions d of variable v such that there is a definition clear path with respect to v from d to p. Compilers generally make the conservative assumption that all definitions which reach a point are relevant. How- ever, if the only definition clear paths with respect to v from d to p are unexecutable, then the value w hich is assigned to v at d cannot effect the value of v at p. In our notation, if the definition-use association (d,p,v) is unexecutable, then the optimizing compiler can safely remove d from the set of definitions of v which reach p.

Consider, for example, constant propagation. Suppose that definitions d\ di^ of

variable v all reach point p. Suppose that at dj ^a, the value assigned to v is zero, but

ax d] it is non-zero. If the definition-use association {d\,p,\') can be shown by the path expression method to bo unexecutable, then whenever control reaches point p, the value of v is guaranteed to be zero. Thus any use of v in the statement at point p can be replaced by the constant zero. Similarly, the path expression method could be used to increase the accuracy of copy propagation, live-variable analysis, detection of induction variables, detection of parallelism, etc.

More work is needed to determine when, whether, and to what extent the cost of using the path expression method to eliminate "unexecutable" reaching definitions is justified h\ the improvement in the compiled code.

- 143-

CHAPTER6: Conclusion

1. Summary

In this thesis we have investigated two families of test data adequacy criteria, the data flow testing criteria and the feasible data flow testing criteria. We have used the inclusion relation to compare these criteria to one another and to statement, branch and path testing; have explored the extent to which each criterion satisfies Weyuker's axioms for a good adequacy criterion; and have examined prospects for building testing tools based on the criteria. We have described the design and implementation of a tool, ASSET, which is based on data flow testing, and have described a heuristic, the path expression method, which, in conjunction with ASSET partially automates feasible data flow testing.

The data flow testing criteria are based purely on the syntax of the program being tested. A definition use association is either

1) a triple (d,u,v), where d is a node having a definition of variable v, u is a node hav- ing a use of V, and there is a definition-clear path with respect to v from d to u. or

2) a du-path with respect to v.

The criteria require the test data to exercise cenain subsets of the set of definition-use associations. These criteria were originally defined by Rapps and Weyukcr to apply to programs written in a simple language. We have extended the definitions to apply to subprograms written in a large subset of Pascal. This has been done in such a way as to preserve the inclusion relationships among the criteria.

- 144-

The data flow testing criteria fail to satisfy several of Weyuker's axioms for a "good" adequacy criterion. In particular, they fail to satisfy the applicability axiom, which requires that an adequate test set exist for each subprogram. Furthermore, it is undecidable whether an adequate test set exists for a given subprogram. This is not surprising, since the data flow testing criteria are structured testing criteria, and all of the structured testing criteria have this weakness. For the data flow testing criteria, even quite "ordinary" subprograms may fail to be adequately testable.

The feasible data flow testing criteria are derived from the data flow testing criteria by eliminating from consideration those definition-use associations which can never be exercised. All but one of the feasible data flow testing criteria satisfy all of Weyuker's axioms, including applicability property. However, given test set T, an FDF testing cri- terion C, and a subprogram P, it is undecidable whether T is C-adequate for P.

We have shown that the inclusion relations among the feasible data flow testing cri- teria differ from those among the data flow testing criteria. While all-du-paths => all- uses, (all-du-paths)* is incomparable to all of the other feasible data flow testing criteria. Also, while all-p-uses => all-edges, (all-p-uses)* and (all-edges)* are incomparable. This means that the feasible data flow testing criteria do not "bridge the gap" between (the feasible analogs oO branch testing and path testing. However, if for class of subpro- grams satisfying the quite reasonable NFUP property, defined in chapter 4, (all-p-uses)* => (all-edges)*.

The fact that the relationship among the feasible data flow testing criteria differs from the relationship among the data flow testing criteria has implications for studies of the formal properties of other testing criteria. Several criteria which fail to satisfy the applicability property (including all of the structured testing criteria) have been defined

- 145-

and some of their formal propenies have been investigated. The criteria which are actu- ally used in practice, however, are in the spirit of the feasible data flow testing criteria. It is these criteria, which take account of semantic information which should be studied, not their purely-syntactic counterparts.

For example, consider a practitioner who is trying to branch test a subprogram P with an unexecutable branch b. At some point, observing that none of the test cases have caused b to be executed, she will examine P, determine that b is unexecutable, and elim- inate it from consideration. Thus, the criterion being used is actually (all-edges)*, not all-edges.

The theorems appearing in the literature which compare various structtired testing criteria should be re-examined in this light. That is, for each of these criteria C, a cri- terion C* should be defined, which agrees with C for programs having no unexecutable paths, but which satisfies the applicability property. The relationship among the starred criteria should be then examined. If criterion C\ does not include criterion C2 then C\* will not include critenon Cj*- However, if C\=^C2 it will not necessarily be the case thatri*=>C2*.

In order to determine uhethcr a test set T satisfies a feasible data flow testing cri- terion C, it is necessary to examine all of the definition-use associations re<]uired by C but not covered by T to sec whether any of them are executable. This may be a tedious and error prone task. With that in mind, we have developed a heuristic method, the path expression method, which attempts to determine whether a given definition-c-usc associ- ation or dcfinition-p-use association is executable.

Given a definition-use association (d.u.v), a viable candidate is an executable path from subprogram entry to subprogram exit, which has a definition clear subpaih with

- 146-

respect to v from d to u. The path expression method uses a regular expression I, to represent a superset of the set of viable candidates. A combination of symbolic evalua- tion and data flow information is used to weed out paths which can be shown not to be viable candidates, and ^ is rewritten accordingly. If, when the heuristic terminates, t, represents the empty set of paths, then the association (d,u,v) is guaranteed to be unexe- cutable.

The path expression method uses a new symbolic evaluation technique called partial symbolic evaluation of path expressions. Roughly speaking, partial symbolic evaluation of a path expression a which represents a single path n is equivalent to symbolic execu- tion of K, while partial symbolic evaluation of a path expression a which represents more than one path assigns a new symbolic value to those variables which are potentially redefined on any one of the paths. The result of partially symbolically evaluating path expression a is an expression which describes in part the effect of executing any one of the paths represented by a.

The path expression method and partial symbolic evaluation of path expressions were developed in order to help eliminate from consideration unexecutable definition-use associations. They can also be used as an aid to test data generation. If the path expres- sion method returns the answer, "association may be executable", then ^ represents a non-empty set of paths, which includes all of the viable candidates. Symbolic evaluation of ^ gives a predicate which must be satisfied by any test case which covers the given definition-use association.

We believe that the path expression method will be able to identify a significant number of the unexecutable definition-use associations in "typical" subprograms. In con- junction with ASSET this will form a usable tool for feasible data flow testing.

- 147-

Typically, the user would invoke the path expression heuristic when, after a "reasonable" amount of testing, some definition-use associations remained unexecuted. If the heuristic determined that the association was unexecutable, the user could safely eliminate it from consideration. Otherwise the user would be left to determine by hand whether that asso- ciation was in fact unexecutable, or to find test data to cover the association. The final path expression ^ produced by the heuristic would provide some helpful information to the user. Preliminary experiments indicate that this is a reasonable scenario, but more experimentation is needed.

2. Future Work

There are many avenues for future research. Several questions relating to the enhancement of the path expression method and of symbolic evaluation of path expres- sions are suggested in Chapter 6, section 5. Some other topics for future work include

Empirical and theoretical studies of the effectiveness of feasible data flow testing:

Such studies would be aimed at measuring the error detecting ability of the criteria. In order to perform empirical studies, a large sample of "real" programs, preferably with known errors, would be needed.

Theoretical studies would have as their goal the classification of those errors which are guaranteed to be detected by the various criteria. Theoretical and empirical studies of the amount of test data required by the criteria would also be useful. A related issue is the classification of program features which make feasible data flow testing more or less effective.

Empirical and theoretical studies of effectiveness of heuristic:

- 148-

Empirical studies would be aimed at determining the percentage of unexecutable associations which are identified as such by the heuristic. We are currently conducting a pilot study in which we are examining the subprograms in Kemighan and Plauger's Software Tools in Pascal (KER 81].

Theoretical studies would attempt to classify those unexecutable associations which are guaranteed to be identified. A related issue is the classification of program features which make identifying unexecutable associations harder or easier.

Investigate the effect of using more sophisticated data flow analysis:

The data flow analysis used to define the DF and FDF criteria was quite unsophisti- cated. It seems reasonable to expect that using more sophisticated data flow analysis techniques would lead to criteria which demanded "better" test data. For example, using interprocedural data flow analysis, one could define criteria which apply to larger pro- gram units than procedures and functions. It might also be possible to define criteria which treat array elements as individual variables, using heuristics to attempt to disambi- guate array references. One could then explore the trade-off between difficulty of apply- ing the criteria to a program and test set, versus error-detecting ability of the test sets demanded by the criteria.

Extend data flow testing and feasible data flow testing to apply to programs written in other languages:

Depending on the language in question, this could involve handling minor issues, such as static variables and definitions within expressions, and/or major issues such as those arising in languages which explicitly suppon parallelism.

- 149- BffiLIOGRAPHY

[AH086] A.V. Aho, R. Sethi, J.D. UUman, Compilers, Principles, Techniques, and Tools, Addison- Wesley Publishing Company, Reading, Mass., 1986.

[BUR86] M. Burke, R. Cytron, "Interprocedural Dependence Analysis and Paralleliza- tion," SIGPLAN '86 Symposium on Compiler Construction, ACM.

[CHEJ T.E. Cheatham, Jr., G.H. HoUoway, J.A. Townley, "Symbolic Evaluation

and the Analysis of Programs", IEEE Transactions on Software Engineering, Vol. SE-5, No. 4,July 1979.

[CLA81J L.A. Clarke, D.J. Richardson, "Symbolic Evaluation Methods - Implications and Applications," in Computer Program Testing, ed. by Chandrasekaran and Radicchi, Nonh-Holland Publishing Co., 1981.

[CLA85] L.A. Clarke, A. Podgurski, D.J. Richardson and S.J. Zeil, "A Comparison of Data Flow Path Selection Criteria", 8ih IEEE International Conference on Software Engineering, August, 1985, London, pp. 244-251.

[DAV73] M.D. Davis, "Hilbert's Tenth Problem is Unsolvable," American Mathemati- cal Monthly, 80, March 1973.

[DAV83] M.D. Davis, E.J. Weyuker, Computability, Complexity, and Languages, Academic Press, New York, 1983.

[FOS76] L.D. Fosdick.L.J Osterweil, "Data How Analysis in Software Reliability." ACM Computing Surveys, 8,3,1976.

[FRA85a] P.O. Frank!, S.N. Weiss and E.J. Weyuker. "ASSET: A System to Select and Evaluate Tests ', Proceedings of ilie IEEE Conference on Sofr^'are Tools, New York, April 1985.

(FRA85bl P.O. Frankl and E.J. Weyuker, "A Data Flow Testing Tool." Proceedings of IEEE Softfair //, San Francisco, Dec. 1985.

(FRA87al P.O. Frankl, "ASSET User Manual." Computer Science Depanment Techni- cal Report #318. Courant Institute of Mathematical Sciences. New "^'ork University. New York. NY. Sept 1987.

[GIR8.'>| M R Girgis and MR. Woodward. "An Integrated System for Program Tcst- mg Using Weak Mutation and Data Flow Analysis." Sih IEEE Inicrnaiional Conference on Software Engineering, August. 1985. London, pp. 313-319.

- 150-

[G0075] J.B. Goodenough and S.L. Gerhan, "Toward a Theory of Test Data Selec- tion," IEEE Transactions on Software Engineering, SE-1,2, June 1975.

[GOU83] J.S. Gourlay, "A Mathematical Framework for the Investigation of Testing", IEEE Transactions on Software Engineering, SE-9, 6, Nov. 1983.

[HAR77] W.H. Harrison, "Compiler Analysis of the Value Ranges for Variables", IEEE Trans. Software Eng., Vol. SE-3, No.3, 1977.

[HEC77] M. S. Hecht, Flow Analysis of Computer Programs, North Holland, 1977.

[HER76] P.M. Herman, "A Data Flow Analysis Approach to Program Testing", The Australian Computer Journal, Vol. 8, No. 3, November 1976.

[HOW76] W.H. Howden, "Reliability of the Path Analysis Testing Strategy", IEEE Transactions on Software Engineering, Vol. SE-2, No. 3, 1976, pp. 208-215.

[HOW78] W.E. Howden, "A Survey of Dynamic Analysis Methods" in Tutorial: Software Testing and Validation Techniques, eds. E. Miller and W.E. Howden, IEEE Computer Society, 1978.

[HOW78b] W.E. Howden, "An Evaluation of the Effectiveness of Symbolic Testing," Software Practice and Experience, 8, 1978.

[HUA75] J.C. Huang, " An Approach to Program Testing," ACM Computing Surveys,

7(3), Sept. 1975, pp. 113-128.

[JEN85] K. Jensen, N. Wirth, Pascal User Manual and Report, Revised for the ISO Pascal Standard, Springer, New York, 1985.

[KEM85] R.A. Kemmerer, S.T. Eckmann, "UNISEX: a UNlx-based Symbolic EXecu- tor for Pascal," Software Practice and Experience, Vol. 15(5), May 1985.

[KER81] Kemighan, Plauger, Software Tools in Pascal, Addison Wesley, Reading, Mass., 1981.

[KNU77] D.E. Knuth, J.H. Morris, V.R. Pratt, "Fast Pattern Matching in Strings," SIAM Journal on Computing, 6, 2, June, 1977.

[KIN76] J.C. King, "Symbolic Execution and Program Testing," Communications of the ACM, \9,1, My 1976.

[KOR85] B. Korel and J. Laski, "A tool for Data Flow Oriented Program Testing", Proceedings of IEEE Softfair II, San Francisco, Dec. 1985.

151

[LAS83] J. W. Laski and B. Korel, "A Data Row Oriented Program Testing Strategy," IEEE Trans. Software Eng., Vol. SE-9, No.3, May 1983, pp. 347-354.

[MIL74] E. F. Miller, Jr. M. R. Paige, J. P. Benson, and W. R. Wisehart, "Structural Techniques of Program Validation," Dig. Compcon, Spring 1974, pp. 161- 164.

[MUC81] Muchnick, Jones, Program Flow Analysis, Theory and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1981.

[NTA84] S. Ntafos, "On Required Element Testing," IEEE Trans. Software Eng., Vol. SE-10. No.6, Nov. 1984, pp. 795-803.

[OST76] L.J. Osterweil and L. D. Fosdick, "DAVE - A Validation Error Detection and Documentation System for Fortran Programs," SoftM'are Practice and Experience, Oct-Dec 1976, pp. 473-486.

[0ST81] L.J. Osterweil, "Using Data Flow Tools in Software Engineering," in Pro- gram Flow Analysis: Theory and Applications, Muchnick, ed., Prentice Hall 1981.

[RAP82] S. Rapps and E.J. Weyuker, "Data Flow Analysis Techniques for Program Test Data Selection," Proceedings Sixth International Conference on Software Engineering, Tokyo, Japan, Sept. 1982, pp. 272-278.

[RAP85J S. Rapps and E.J. Weyuker, "Selecting Software Test Data Using Data Flow Information", IEEE Trans. Software Eng., Vol. SE-11. No.4, April 1985. pp.

367-375.

[SCH73] M. Schaeffer, A Mathematical Theory of Global Program Optimization, Prentice-Hall, Englewood Cliffs, N. J., 1973.

(TAR81 Al R.E. Tarjan, "A Unified Approach to Path Problems." Journal of the Associ- ation for Comoputing Machinery, Vol. 28, No. 3, July 1981.

[TAR81B] R.E. Tarjan, "Fast Algorithms for Solving Path Problems." Journal of the Association for Comoputing Machinery, Vol. 28. No. 3, July I'JSl.

[WEG83J M. Wegman, "Summarizing Graphs bv Regular Expressions," Proceedings of the Tenth SIGACT POPL, Austin. Tex.. 1983.

[\STY7')1 F J. Weyuker. "The Applicability of Program Schema Results to Programs." Inicrnatiiinal Journal if Computer and Information Sciences, 8,5. Ovt. 1979.

- 152-

[WEY82] E.J. Weyuker, "On Testing Non-testable Programs," The Computer Journal, 25,4, 1982

[WEY84] E.J. Weyuker, "The Complexity of Data Flow Criteria for Test Data Selec- tion," Information Processing Letters, Vol. 19, No. 2, August 1984, pp. 103- 109.

[WEY86] E.J. Weyuker, "Axiomatizing Software Test Data Adequacy," IEEE Trans. Software Eng., Vol. SE-12, No. 12, Dec. 1986.

[WOO80] M. R. Woodward, D. Hedley, and M. A. Hennell, "Experience with Path Analysis and Testing of Programs," IEEE Trans. Softn^are Eng., Vol. SE-6, May 1980, pp. 278-286.

- 153

APPENDIX I:

ASSET USER MANUAL

Preliminary Version

June 29, 1987

Phyllis Frank!

Departmeni of Computer Science

Courant Institute of Mathematical Sciences

New York University

251 Mercer Street

New York. New York 10012

1. INTRODUCTION

ASSET (A System to Select and Evaluate Tests) is an interactive software testing tool based on data flow testing (1,2,3,4]. It lakes as input a Pascal program, a set of test data, and one of the data flow testing criteria. It runs the program on the test data and produces a list of those definition-use associations required by the given cnterion but not covered by the test data.

The user then examines the output to determine whether the program has behaved according to its specification. If the program computed the w-Tong output, the user rcpons that the program has a bug. If the program has behaved correctly on all of the test cases, and if the criterion has not been satisfied, the user selects additional test data and continues the testing process. If the program has behaved correctly and the criterion has been satisfied the user can either release the program, certifying that it has been ade- quately tested according to the criterion, or can select another criterion and continue.

ASSET is used for testing an individual subprogram (procedure, function, or main program) which we'll call the "subject procedure". We will refer to the program of which the subject procedure is pan as the "subject program". The data flow analysis which it performs is inira-procedural analysis of the subject procedure.

This manual assumes that the reader has some familiarity with the theor>' of data flow testing and with the UNIX operating system. To acquire some familiarity with data flow testing, sec the references in section 7. Section 2 of this manual defines the rcstnc- tions on programs accepted by ASSET. Section 3 describes the most imponant ASSET commands. Section 4 describes the rest of the ASSET commands. Section 5 descnbcs how to use ASSET on a separately compiled program. Section 6 gives an example of an ASSET session. Section 7 gives references to relevant papcrb.

AcknowU'dm'mcnts:

I he graphics facility was s^ritten by Stewart Weiss, who also helped implement an earlier version of ASSET. The user- interface was translated from csh to C and enhanced by Ernie Campbell.

- 154-

Disclaimer:

ASSET is still a "work in progress". We do not vouch for its reliability nor are we responsible for maintaining it. However, if you find any bugs, or have other suggestions as to how the tool could be improved we would appreciate hearing from you. Please send e-mail to frankl@csd2.nyu.edu and to weyuker@csd2.nyu.edu.

2. THE LANGUAGE ACCEPTED BY ASSET

ASSET accepts syntactically correct programs written in the subset of (level 0) ISO Pascal consisting of programs which do not include any of the following constructs:

1. goto statements and labels.

2. functions with var parameters.

3. variant records

4. with statements

5. procedural or functional parameters

6. forward declarations

7. conformant arrays

The following identifiers are used in the modified version of the subject program and

therefore should not be used in the subject program: FW traversed

The following identifiers are used as file names by ASSET and therefore should not be

used as file names in the subject program:

IncFiles, StringTabFile, SymTabFile, anomalies, a.out, comp.msgs, copy, copy. exec, copy.p, copy2, copy2.p, defuse, display, globdecs, graph, junkfile, listing, mainfile, mainfile2, pairs, paths, picfile, record, results, results. old, subject, traversed

It is recommended that the user compile the program before submitting it to ASSET in

order to insure that it is syntactically correct.

In addition, ASSET accepts certain Berkeley Pascal constructs which are not part of standard Pascal, including separate compilation (see section 5).

ASSET places certain arbitrary restrictions on the size of the program. Specifically, number of variables visible to subject procedure <= 3(X) number of procedures and functions visible to subject procedure <= 300 number of types visible to subject procedure <= 300 number of constants visible to subject procedure <= 3(K) number of identifiers visible to subject procedure <= 500 length of each identifier <= 80

total number of characters in all identifiers visible to subject procedure <= 3(X)0 number of basic blocks in subject procedure <= ICX) For instructions on how to increase these limits, see Appendix I.

BUGS:

1. ASSET requires that the last statement in a case statement be followed by a semi- colon. For example, the (syntactically correct) program fragment case b of

true: writeln('hello');

false: wnteln('goodbye')

- 155-

end will be rejected by ASSET, but case b of

true: writelnChello'); false: writeln('goodbye'); end will be accepted.

The user should check that all case statements in the subject procedure are of this form before submitting the program to ASSET.

2. ASSET translates for loops incorrectly: If a statement of the form

"for i := el toe2 do S" appears in the subject procedure, in writing the modified subject procedure ASSET will translate this to a code fragment functionally equivalent to

i :=el; <label 1 >: if i > e2 then goto <label2> S;

i := succ(i); goto <labell>; <label2>: next statement.

It should translate it to a fragment functionally equivalent to i :=el; tmp :=e2; <labell>: if i > tmp then goto <label2> S;

i := succ(i); goto <label 1 >; <label2>: next statement;

Thus, if any variable occurring in expression is modified within the statement S (which is legal, though poor programming practice), the modified subject program will not be equivalent to the subject program.

These bugs will be corrected in the next version of ASSET.

3. THE MOST IMPORTANT ASSET COMMANDS

In this section we describe the basic ASSET commands. The commands arc listed in an order in which they can be executed. The section ends with some comments on the

relationship between the commands. Additional commands are described in the next sec- tion,

Invoking ASSET

We will refer to the directory from which .ASSET is invoked as the "current work- ing directory". Before invoking ASSET you should set up a "subject director." which contains the source of the program to be tested. It is usually convenient to have the sub- ject director) be a subdirectory of current working directory, but this is not necessary. The subject directory should mn be the same as the current working director. . To enter ASSET type "asset" in response to the I NIX prompt ASSET will rcsp<">nd \Mth Welcome to ASSET. For help type "help."

- 156-

Enter relative pathname of initial default directory. »:

At this point you should enter the pathname of the subject directory, relative to the current working directory. ASSET will then print the ASSET "command level prompt", ">»: ", signifying that it is ready to accept a command.

We now describe each of the ASSET commands. These commands can be invoked when ASSET types the command level prompt. Some of these commands will query the user for additional information. When this occurs, ASSET will prompt the user with its "inner prompt", "»: ". Sometimes ASSET will ask the user a question which has a yes or no answer. The user should respond with "Y" or "y" for yes and "N" or "n" for no. ASSET will indicate the default (if there is one) in brackets. If the default is "Y" then any non-null response which does not begin with "N" or "n" will be considered to be "Y". Similarly, if the default is "N" then any non-null response which does not begin with "Y" or "y" will be considered to be "N". If the user just hits carriage return, the inner prompt will continue to appear until a non-null response is given. The logical dependence of ASSET commands is given in Figure 1. The commands must be typed in lower case letters. Any command can be abbreviated by its first three letters.

begin

The "begin" command analyzes the subject procedure, breaking it into basic blocks, constructing its flow graph, and determining the blocks in which each variable has definitions and c-uses and the edges on which each variable has p-uses. An adjacency list for the graph is written to the file "graph", and a representation of the data-flow infor- mation is written to the file "defuse". The "String Table" in which ASSET has recorded the identifiers appearing in the subject program is written to the file "StringTabPile".

In addition, ASSET constructs a modified version of the subject program in which probes have been insened. That is, at the beginning of basic block i, the statement "writeln(traversed,i :FW);" is inserted. Declarations of the text file "traversed", the con- stant "FW", and any labels which appear in the modified program are also inserted. We will refer to the resulting program as the "modified subject program". ASSET writes a pretty-printed version of the modified subject program to the file "copy.p".

ASSET also creates files "copy", "copy2" (preliminary versions of the modified sub- ject program) and "SymTabFile" (a representation of the Symbol table for the subject program) which will not be used by subsequent commands, but may be useful for debug- ging-

When you type "begin", ASSET will respond with Enter name of subject procedure file. »: If there is no file of this name in the subject directory ASSET will print an error message and return to the command level. Otherwise ASSET will copy the file to the file "sub- ject", then say

Separate Compilation? (Y/N)[N] »: Answer "y" if the subject program is separately compiled, "n" otherwise. We will assume throughout the rest of this section that the subject program is NOT separately compiled. See section 5 for instructions on using ASSET on separately compiled pro- grams.

- 157-

Next, ASSET will ask you for the name of the procedure to be instrumented for test- ing. You may either supply the name of the procedure, or direct ASSET to prompt you with the names of the procedures in the program.

select

The "select" command causes ASSET to print a menu and query the user to select one of the data flow testing criteria.

find-associations

This command causes ASSET to produce a list of all definition-use associations (of the appropriate type) in the subject procedure. If the selected criterion is All-du-paths, find-associations writes a list of all of the du-paths in the subject procedure to the file "paths". Otherwise, find-associations writes a list of all of the definition-c-use associa- tions and definition-p-use associations in the subject procedure to the file "pairs". Note that in the worst case, finding all of the du-paths may take time exponential in the number of basic blocks.

compile

The "compile" command invokes the Pascal compiler to compile the modified sub- ject program. The executable code is written to the file "copy. exec". Warning and Error messages generated by the compiler are written to the file "comp.msgs".

run

The "run" command executes the modified subject program, on one or more test cases and adds the program trace to the file "traversed". When you type "run", ASSET may respond with

File 'traversed' already exists. Type 'Y' if you want to append to it, 'N'

otherwise. »: Type "Y" if you want to add the program traces generated by previous test cases to the current one, "N" otherwise. If you answer no, ASSET will ask you whether you want to save the old file "traversed" (i.e. the program traces of previous test cases).

Next, ASSET will ask you whether the program takes input from command line arguments. Answer yes if the subject program reads command line arguments via the Berkeley Pascal argv mechanism or if you would like to redirect the standard input or output using the UNIX redirection mechanism. If you answer yes, ASSET will prompt you for the command line arguments. For example supplying

<infilc >outfile as the command line argument will cause the modified subject program to read its stan- dard input from the file "mfile" and v^xitc its standard output to the file "outfilc".

ASSET will then say

Executing modified subject program ... then will execute the modified subject program with the command line arguments you've supplied. When execution terminates, ASSET will ask you whether you would like to run the program on additional test data. If you answer yes, ASSET will again ask you for command line arguments and execute the program. The new program trace will be added to the traces which have previously been uniten to the file "traversed". If you

- 158-

answer no, ASSET will return to the command level.

Note that one trace is produced for each run of the subject program, not one for each invocation of the subject procedure. Thus a single program trace may contain several paths from the entry to the exit of the subject procedure. If the subject procedure is recursive these paths will be "nested" within one another.

check

Tlie "check" command checks whether the selected criterion is satisfied by the test data whose program traces are in the file "traversed". ASSET writes to the file "results" a list of those def-use associations which are required by the criterion but have not been covered by the test data. (Before doing this, ASSET moves the old version of the file "results" to the file "results. old". This allows the user to easily see which def-use associa- tions were eliminated by the most recent additions to the test set by using the UNIX com- mand "diff results results. old"). If the file "traversed" does not exist, ASSET initializes it to represent an empty set of program traces. In this case, "results" will contain a list of all of the def-use associations required by the criterion.

save

The "save" command moves the files "subject", "copy.p", "copy. exec", "traversed", "results", "pairs", "paths", "defuse", etc. from the current working directory to the sub- ject directory.

exit

Moves important files to the subject directory and removes the rest, then exits.

Note on the relationship between the commands

Before checking whether a set of test data satisfies any of the criteria All-defs, All- p-uses, All-c-uses, All-c-uses/some-p-uses, All-p-uses/some-c-uses or All-uses it is necessary for ASSET to create a list of definition-use associations in the file "pairs". Before checking whether a set of test data satisfies the criterion All-du-paths it is neces- sary for ASSET to create a list of du-paths in the file "paths". When the most recenUy selected criterion is All-defs, All-p-uses, All-c-uses, All-c-uses/some-p-uses, All-p- uses/some-c-uses or All-uses the "find-associations" command creates the file "pairs". When the most recently selected criterion is All-du-paths the "find-associations" com- mand creates the file "paths". Thus it is necessary to use the "find-associations" com- mand once before the first time that "check" is invoked with one of the criteria involving "pairs" and once before the first time that "check" is invoked with the criterion du-paths.

Suppose the user has been attempting to satisfy criterion C then decides to switch to another criterion, C. Let T be the set of test data on which the program has just been run (i.e. assume that set of program traces in the file "traversed" corresponds to test set T). Then to see how close T comes to satisfying C the user need only invoke the "select- criterion" command (to select the new criterion, C) and the "check" command. It is not necessary to run the program on test set T again.

- 159- 4. OTHER ASSET COMMANDS

csh

Spawns a new shell from which you can invoke your favorite UNIX commands. To exit from this shell and return to the ASSET command level type control-D.

directory

Displays the name of the subject directory and prompts the user for a new subject directory.

graphics

The "graphics" command writes a PIC program describing a graphical representa- tion of the subject procedure's flow graph. The PIC program is written to the file "picfile". To obtain a picture of the flow graph, run the troff preprocessor "pic" (from a UNIX shell, not directly from ASSET) with picfile as input then pipe it through troff.

[At NYU, pipe the pic output through itroff (for output on a Canon Laser Printer), psroff (for output on an Apple Laser Writer), or preroff (for output on the console if you are in the Suntools environment.) For example, typing

% pic picfile I psroff -Pap3 (where "%" is the UNIX prompt) will send printed output to the LaserWriter in room 505.]

help

On line help facility.

purge

Removes (i.e. destroys) all of the files "subject", "copy.p", "copy. exec", "traversed", "results", "pairs", "paths", "defuse", etc. from the current working directory. Handle with care.

sepcomp

Compiles and links separately compiled subject program. See section 5.

restore

Copies the files "subject", "copy.p". "copy. exec", "traversed", "results", "pairs", "paths", "defuse", etc. from the subject directory to the current working directory, thus restoring a previous ASSET session. If the files in the subject directory were created by running ASSET on a separately compiled subject program some additional action is taken (sec section 5.)

view

The view' command directs ASSET to display the contents of a (text) file on the screen, a screenful at a time. If you type "view <filcname>" ASSET will display that file. Otherwise. ASSET will prompt you for the file name The view command works by invoking the UNIX command more", hence responds to the same inputs as docs "more".

- 160-

Most importantly, type carriage return to display the next line of text, space to display the next screenful of text and q to return to the ASSET command level. For other options, see the UNIX manual page for "more".

5. USING ASSET ON SEPARATELY COMPILED PROGRAMS

This section describes how to use ASSET to test separately compiled programs. This section assumes some familiarity with Berkeley Pascal's separate compilation mechanism. (See the relevent sections of the Berkeley Pascal manual.)

To instrument a procedure for testing, ASSET must create modified versions of the file containing the source for that procedure, the file containing the source for the main program, and a file containing global variable declarations. To create an executable code for the modified subject program, ASSET must recompile and relink the modified files and any other files dependent on them.

In order to do this, ASSET assumes that subject directory contains a Makefile describing the dependency information (see documentation for the UNIX command "make"), that there is a file called "globals.h" which has global variable declarations, and such that the object files for the main program and for the procedure to be instrumented are dependent on globals.h.

If the user answers "yes" to the question "Separate compilation?" during execution of ASSET'S "begin" command, ASSET prompts the user for the names of the file con- taining the main program and the file containing the subject procedure. We'll refer to these as <main_file> and <subject_proc_file>, respectively. ASSET moves <main_file>, <subject_proc_file>, and "globals.h" to <main_file>.bak, <subject_proc_file>.bak, and "global s.h.bak", respectively. ASSET writes the modified versions of the files to <main_file>, <subject_proc_file>, and "globals.h". Thus when the UNIX "make" com- mand is invoked, the modified subject program will be compiled and linked.

To compile the modified subject program use the ASSET command "sepcomp", rather than the ASSET "compile" command. Sepcomp invokes the UNIX make com- mand, and moves the executable code (which is assumed to be in "<subject- directory>/a.out") to copy. exec.

When the "save" command is invoked, ASSET looks for files with the ".bak" exten- sion in the subject directory. For each such file, ASSET moves <fname> to <fname>.mod and moves <fname>.bak to <fname>. Thus the original files have the same contents as they had before the ASSET session, and the modified versions have the extension ".mod".

The "restore" command looks for files with the ".mod" extension and queries the user as to whether they should be restored. If the user answers yes to the question "restore <fname>.mod (Y/N)?" then ASSET moves <fname>to <fname>.bak and moves <fname>.mod to <fname>. Thus to restore a session in which the instrumented procedure was in file "Proc3.p", the main program was in file "main.p", answer yes to the questions "restore ProcS.p.mod?", "restore main.p. mod?", and "restore globals.h. mod?"

Restrictions on separately compiled programs :

The separate compilation option makes the following assumptions:

- 161-

1. All relevant files are in the subject directory.

2. There is a file called "globals.h" containing global definitions in the subject directory.

3. "globals.h" is mentioned in #include directives in the subject procedure file and in the main program file.

4. There is a Makefile containing the dependency information, etc. needed to re-link and load the program contained in the subject directory.

5. The Makefile directs the executable code to file "a.out".

6. The procedure being instrumented is at the outer level of nesting.

7. TTie last included .h file cannot have any extra garbage at the end. (This is a bug).

NB It is currently NOT POSSIBLE to instrument the main program of a separately compiled module.

6. EXAMPLE OF AN ASSET SESSION

In this section we present an annotated example of an ASSET session. To distin- guish between text written by the system and that written by the user, we display text entered by the user in boldface type. Comments are written in italics.

Example 1:

This example shows an ASSET session in which a brute-force string-matching pro- cedure is analyzed. The program reads a string and a pattern. It is supposed to print the position in the stnng at which the partem first appears, and print 0 if the pattern never appears in the string. The current working directory has a subdirectory called "StrMtch". The file "StrMtch/subject.p" contains the following program: program TestStringMatch (input, output); const MAX = 80;

LENGTH = 10; type Source = array[ 1 ..MAX] of char;

String = arrayl 1 ..LENGTH] of char; var Pat : String; /

Txt : Source; i.result.TxtLen : integer;

function StringMatch(Pattem:String; SorTcxt: Source;

Patlxn, SorLen : integer): integer. {Brute force pattern-matcher. Returns 0 for no match) var PatPos. SorPos : integer; begin

PatPos := 1; SorPos := I; repeat

if Pattcml PatPos] = SorTcxij SorPos] then begin

SorPos := SorPos + 1;

- 162-

PatPos := PatPos + 1 end {then) else begin

SorPos := (SorPos - PatPos) + 2; PatPos := 1 end; {else} until (PatPos > PatLen) or (SorPos > SorLen); if PatPos > PatLen then StringMatch := SorPos - PatLen else StringMatch := 0 end; '{StringMatch)

begin {main program) {Read Input)

writelnCENTER THE TEXT'); i:=l;

while (not(eoln) and (i <= MAX)) do begin

read(Txt[i]);

i := i + 1 end;

TxtLen := i - 1; readln;

writelnCENTER THE PATTERN'); i:=l;

while (not(eoln) and (i <= LENGTH)) do begin

read(Pat[i]);

i := i + 1 end; readln;

result := StringMatch(Pat,Txt,i - 1, TxtLen); writeln('The pattern first appears at position ',result:3, ' in the text.'); end.

Script staned on Sun May 31 13:09:59 1987 csd27> asset

Welcome to ASSET. For help type "help."

Enter relative pathname of initial default directory. »: StrMtch

»>: begin

- 163-

Enter name of subject procedure file. »: subject. p

Separate Compilation? (Y/N) [N]

»: n

Enter the name of the procedure to be instrumented.

If you would like to be prompted with the names of

the procedures in the subject program, just hit carriage return.

»:

=> Should TestStringMatch

be instrumented for testing? (Y/N)

»: n

==> Should StringMatch

be instrumented for testing? (Y/N)

»:y

»>: select

SELECT A CRITERION

A. All-defs

B. All-c-uses

C. AU-p-uses

D. AU-c-uses/some-p-uses

E. All-p-uses/some-c-uses

F. All-uses

G. AU-du-paths

Enter letter representing the selected criterion »: a

Criterion is All-defs.

>»: find

We next check whether the criterion has been satisfied with no test data. This is not necessary, but by doing this, we get a list of all of the def-c-use and def-p-use associa- tions in the program. »>: check ALL-DEFS:

Still must exercise at least one of the following def-clear paths: with respect to from to

Pattern 1 ( 3. 5)

Pattern 1 ( 3. 4)

AND

Still must exercise at least one of the following def-clear paths: with respect to from to

8

( 6, 3)

( 6, 7)

( 7, 9)

( 7, 8)

164-

SorText 1 ( 3, 5)

SorText 1 ( 3, 4)

AND

Still must exercise at least one of the following def-clear paths:

with respect to from to

PatLen

PatLen

PatLen

PatLen

PatLen

AND

Still must exercise at least one of the following def-clear paths: with respect to from to

SorLen 1 ( 6, 3)

SorLen 1 ( 6, 7)

AND

Still must exercise at least one of the following def-clear paths:

with respect to from to

PatPos 2 4

PatPos 2 5

PatPos 2 ( 3, 5)

PatPos 2 ( 3, 4)

AND

Still must exercise at least one of the following def-clear paths:

with respect to from to

SorPos 2 4

SorPos 2 5

SorPos 2 ( 3, 5)

SorPos 2 ( 3, 4)

ANT)

Still must exercise at least one of the following def-clear paths:

with respect to from to

PatPos

PatPos

PatPos

PatPos

PatPos

PatPos

PatPos

PatPos

AND

Still must exercise at least one of the following def-clear paths:

4

4

4

5

4

( 3, 5)

4

( 3, 4)

4

( 6, 3)

4

( 6, 7)

4

( 7, 9)

4

( 7, 8)

165-

with respect to

from

to

SorPos

4

4

SorPos

4

5

SorPos

4

8

SorPos

4

( 3,

5)

SorPos

4

( 3,

4)

SorPos

4

( 6,

3)

SorPos

4

( 6,

7)

AND

Still must exercise at least one of the following def-clear paths;

with respect to

from tc

PatPos

5

4

PatPos

5

5

PatPos

5

( 3, 5)

PatPos

5

( 3, 4)

PatPos

5

( 6, 3)

PatPos

5

( 6, 7)

PatPos

5

( 7, 9)

PatPos

5

( 7, 8)

AND

Still must exercise at least one of the following def-clear paths:

with respect to

from

to

SorPos

5

4

SorPos

5

5

SorPos

5

8

SorPos

5

( 3,

5)

SorPos

5

( 3,

4)

SorPos

5

( 6.

3)

SorPos

5

( 6,

7)

To look at these again use the command 'view results'.

Next we will compile the program and start running it on some test data. As the initial test data set, we select one element in which the pattern appears in the string and one element in which the pattern does not appear in the string. »>: compile Compilation begins ... Done, and successful.

»>: run

File 'traversed' already exists.

Do you want to append to it? (Y/N) f Y]

»: n

Do you want to save old 'traversed'? (Y/N) [N]

»: n

Command line arguments? (Y/N) [Y]

166

»: n

Executing mcxiified subject program ...

ENTER THE TEXT The quick brown fox

ENTER THE PATTERN quick

The pattern first appears at position 5 in the text.

Do you want to run the subject program

on some additional test data? (Y/N) [N]

»:y

Command line arguments? (Y/N) [Y]

»: n

Executing modified subject program ...

ENTER THE TEXT The quick brown fox

ENTER THE PATTERN quack

The pattern first appears at position 0 in the text. Do you want to run the subject program on some additional test data? (Y/N) [N] »: n

>»: check

ALL-DEFS: CRITERION SATISFIED

To look at these again use the command 'view results'.

The test set satisfies the all-defs criterion. We next check whether the same test set satisfies a stronger criterion, all-uses. »>: select

SELECT A CRITERION

A. AU-defs

B. All-c-uses

C. All-p-uses

D. AU-c-uses/some-p-uses

E. All-p-uses/some-c-uses

F. All-uses

G. AU-du-paths

Enter letter representing the selected criterion »:f

167

Criterion is All-uses.

>»; check

ALL-USES

Still need to exercise all of the following of def-clear paths:

with respect to

from

to

PatPos

2

4

SorPos

2

4

SorPos

5

8

PatPos

2

( 3,

4)

SorPos

2

( 3,

4)

PatPos

4

( 7,

9)

PatPos

5

( 7,

8)

To look at these again use the command 'view results'.

To aid in the selection of test data which cover the remaining def-use associations, the user can draw the flow graph (see Figure 2) and use "copy.p" to aid in labeling each node with the corresponding code. Notice that for each i, block i begins with the state- ment "write( traversed, i:FW);"

>»: view copy.p

program TesiStringMatch(traversed, input, output);

var

traversed: text; const

MAX = 80;

LENGTH = 10; type

Source = array [1..MAX] of char;

String = array (1.. LENGTH] of char var

Pat: String;

Txt: Source;

i, resuh, TxtLen: integer;

function StringMatch(Pattern: String; SorText: Source; PatLen, SorLen; integer): integer;

label

10; const

RV = 4; var

PatPos, SorPos: integer;

begin

168

write(traversed, 1: FW); write(traversed, 2: FW); PatPos:= 1; SorPos := 1; 10:

write(traversed, 3: FW); if Pattem[PatPos] = SorText[SorPos] then begin

write(traversed, 4: FW);

begin

SorPos := SorPos + 1; PatPos := PatPos + 1

end end else begin

write(traversed, 5: FW);

begin

SorPos := SorPos - PatPos + 2; PatPos := 1

end end;

write(traversed, 6: FW); if not ((PatPos > PatLen) or (SorPos > SorLen)) then

goto 10; write(traversed, 7: FW); if PatPos > PatLen then begin

write(traversed, 8: FW);

StringMatch := SorPos - PatLen end else begin write(traversed, 9: FW);

StringMatch := 0 end;

write(traversed, 10: FW); write(traversed, 11: FW) end; { StringMatch )

begin

rewrite (traversed);

writelnCENTER THE TEXT);

i:=l;

while not eoln and (i <= MAX) do begin

read(Txt[i]); i := i + 1

end;

TxtLen := i - 1;

readln;

writelnCENTER THE PATTERN'); i:=l;

- 169

while not eoln and (i <= LENGTH) do begin

read(Pat[i]); i:=i + 1

end;

readln;

result := StringMatch(Pat, Txt, i - 1, TxtLen);

writelnCThe pattern first appears at position ',

end. { TestStringMatch )

result: 3, ' in the text.')

Examining the annotated flow graph, we see that in order to execute a path from 2 to 4 which is definition clear with respect to PatPos, a test case in which the first character of the string matches the first character of the pattern is needed. We run the program on such a test case, adding its trace to those produced by the test cases run previously.

>»: run

File 'traversed' already exists.

Do you want to append to it? (Y/N) [Y]

»: y

Command line arguments? (Y/N) [Y] »: n

Executing modified subject program ...

ENTER THE TEXT The quick brown fox

ENTER THE PATTERN The

The pattern first appears at position 1 in the text. Do you want to run the subject program on some addilionai test data? (Y/Nj [NJ »: n

»>: check

Still need to exercise all of the following of def-clear paths:

with respect to from to

Sorl\^s 5 8

PatPos 4 ( 7. 9)

PatPos 3 ( 7. 8)

To look at these again use the command 'view results'.

Examining the annotated flow graph, we see that in order to execute a path from 5 to 8 which IS definition clear with respect to SorPos, a test case in which the pattern is the

- 170-

null string is needed. We run the program on such a test case, adding its trace to those produced by the test cases run previously.

>»: run

File 'traversed' already exists.

Do you want to append to it? (Y/N) [Y]

»:y

Command line arguments? (Y/N) [Y]

»: n

Executing modified subject program ...

ENTER THE TEXT The quick brown fox ENTER THE PATTERN

The pattern first appears at position 2 in the text.

Examining the program's output, we see that the program has reported that the null

string first appears in position 2 of the string. This is an error! The reader should note

that this bug is guaranteed to be detected by any test set which is adequate according to

the all-uses criterion. Having discovered a bug, we save the ASSET session and prepare

to report the error.

Do you want to run the subject program

on some additional test data? (Y/N) [N]

»: n

>»: save

Note that to cover the one remaining association, (4,(7,9)J'atPos), we would have to include a test case in which the first k characters of the pattern matched the last k characters of the text, for some k s.t. 0 < k < the length of the pattern.

r

i-» I

begin

select additional

test data select different

criterion report errror release software

o

A

user

\

results

Figure 1

Schematic diagram of ASSET

P is program C is criterion. I is Test Set.

Ill-

_^_^.^

_

»

c ft £

^

c

.ii.

c

ti

(L

«

ft

C

^

r^

u

&

z

.-

e.

c

>

C

IT

u

= li

C

e

^

t"-

e - o-

w

ft

^

C ft.

a. E

c

X

c

s r

w^ a. >■

X

0. X

* 4

>

c

w >

a. >

c

L.

-.-

c

c c

^

c

c

-^ ft'

C

3

> tt T

Sm. Cl

j^

L

> -

> 0; K 1. r

_

ft. "

k

>-

IT '

"" **

II

C ft

(^ ft.

V

e

re - 6.

- Li - X

X X

ft

^^

J" t; ~

-s

r- Q.

ft

IX

f^ C ft II ft M

_

X •-

X

c-

T -

u e a

ft C

c ft

r a. c

e

c e, 1

k. ft

£

*■ i i'

a. c.

& >

w^ t-

a

V V

■C IT

ft e

X ft £ ft £ 6 ft > i_ C > -

^

a. >

V

c

y (t «

c e

■— ft

c c

IT. ^

•c c.

C

<- X

3

C .

? =

. ^

a. ft.

L. »

3. 0. > >

""

w •■

0. -

e ft

> ft.

il-z^kzi

ft.

c c

I) M

ft i- a

1- a. -

C

t- a.

a c a, - -

XT U- - O-

^

Ci,o a>— - ^

if

c e

h. <

c - - 1

^

^

3. a

c o

a ft »

£

0. a s £

e.

ft =

c ft ft 1 .-- ft 3 1.-. Q B

b

> c

c

I ft T 1

z z

X

X -

3 - ft '1

-

c

t~ 1

3

c

X

U X

9 a

&. L-

3 -

t

ft a -

'

»

3

T = £

D

r -

B

S f

« >%

>.

t

_ > .'

P c^

0,

0.

- 1

r\ ^

I & <^

.£_^

X

X

£ X

~ ~ ^ -

Q

^^^

^_^^,=,=—

i^

- 173-

7. REFERENCES

1. S. Rapps and E.J. Weyuker, "Selecting Software Test Data Using Data Flow Information", IEEE Trans. Software Eng., Vol. SE-11, No. 4, April 1985, pp. 367-375.

2. P.G. Frankl, S.N. Weiss and E.J. Weyuker, "ASSET: A System to Select and Evaluate Tests", Proceedings of the IEEE Conference on Software Tools, New York, April 1985.

3. P.G. Frankl and E.J. Weyuker, "A Data Flow Testing Tool," Proceedings of IEEE Softfair II, San Francisco, Dec. 1985.

4. P.G. Frankl and E.J. Weyuker, "Data Flow Testing in the Presence of Unexe- cutable Paths", Proceedings of the IEEE Workshop on Software Testing, Banff, Canada, July 1986.

- 174-

APPENDIX n Algorithm for derivation of initial path expression

In this appendix we present an algorithm which takes as input a flow graph G, and a definition-use association (d,u,v)* and which produces a path expression ^ such that all viable candidates are contained in PATHS (^). The length of the path expression ^ is 0(N) where N is the number of nodes in the flow graph. Tlie algorithm runs in time 0(N). Throughout this section we assume that the flow graph G is formed according to the rules in Figure 2.1.1, from a subprogram having no repeat-until statements.

The algorithm is based on a generalization of the notion of dominators [HEC77]. Recall that node i dominates node j if and only if every path from the entry node to node j passes through i; i properly dominates j if and only if i?tj and i dominates j; i immedi- ately dominates j iff

a) i properly dominates j, and

b) if k properly dominates j and k^^i then k (properly) dominates i. We will say that node i s-dominates node j if and only if

a) i and j are reachable from s, and

b) every path from from s to j passes through i.

We will say that i properly s-dominates j if and only if i^j and i s-dominates j; i immedi- ately s-dominates j iff

a) i properly s-dominates j. and

Aciuall) . ihc algonlhm makes no use of ihc vanablc v. However, it docs assume that v has a dcfiniuon in node d. As in chapter 5. we assume, without loss of generality, that u is a node, rath- er than an edge.

- 175-

b) if k properly s-dominates j and k?ti then k (properly) s-dominates i.

We use s-DOM(j) to denote {i I i s-dominates j}.

Consider the graph G^ obtained from G by deleting all nodes and edges which are not reachable from s, deleting all edges of the form (i,s) and making s the entry node. It follows immediately from the definitions that i s-dominates j in G if and only if i dom- inates j in G^. With this observation, the following properties of the s-dominance rela- tion follow from the analogous propenies of the dominance relation.

LEMMA AD. 1 : s-DOM(s) = { s ) .

LEMMA AH. 2: The s-dominance relation of a flow graph is a partial ordering.

LEMMA AD. 3: Node s s-dominates all nodes which are reachable from s.

LEMMA AII.4: The s-dominators of a node form a chain (i.e. a linear ordering).

LEMMA AD. 5: Every node except s which is reachable from s has a unique immediate dominator.

LEMMA AD. 6: The graph of the reflexive and transitive reduction of the s-dominance relation of a flow graph is a forest. There is one tree T in this forest which contains all and only those nodes which are reachable from s. There is an arc (i j) in T if and only if i immediately dominates j, and there is a path in T fi-om i to j if and only if i dominates j.

We note that for a given s, all of the sets s-DOM(j) can be computed in time propor- tional to the number of nodes in G as follows.

1) Perform a depth first search on G starting at s to determine which nodes are in G^ and to derive a reverse postordering of the nodes in G^.

2) Apply the dominator algorithm for reducible flow graphs [?iEC77] to compute the sets DOM(j) in G^. By the above observation, these are the sets s-DOM(j) in G.

- 176-

Each of these steps takes 0(e) bit vector operations, where e is the number of edges in G. Since G is a structured flow graph, e=0(IGI). To justify step 2, we note that if G'^ were irreducible it would have a "forbidden subgraph" [HEC77]. TTien G would also have a "forbidden subgraph". But since G is a structured flow graph, it is reducible, hence has no "forbidden subgraph."

Recall that given a definition-use association (d,u,v), algorithm PEM2 (Algorithm 5.5.2) needs to build up the path expression 4, representing the initial set of candidates, from three components, a''*^ represents the set of paths from the entry node to node d; a"^" represents the set of paths from node d to node u; a"'" represents the set of paths from the entry node to node d;

We will present an algorithm which, given a "source" node s and a "target" node t, will produce a path expression which represents all of the paths (n\,...,ni,) where ni=s, /i;t=t. and ny{s,t) for l<j<k.* The intuition motivating the algorithm is that every path from s to t must pass through all of the s-dominators of t. In addition, such paths may pass through other nodes.

Recall that if node h is a loop header, then a path through the loop headed by /i is a path of length greater than or equal to one. which begins and ends at node h, and which does not contain any nodes which are not in the loop headed by h. If n is not a loop header, wc v. ill say that a path n is n-free if and only if node n does not occur in n, except possibly as the first ncxle in the path. If n is a loop header, wc will say that n is n-free if and only if 7i=po, where p is either the null path or a path through the loop headed by n, and n does not occur in o. For our purposes, given a souac s and a target t. it will suffice

The path expression produced may also rcprcseni some additional paths from s lo i.

- 177-

to produce a path expression a such that all of the s-free paths from s to t belong to PATHS(a).

We will present two algorithms, BTl and BT2. BTl builds a path expression tree representing the s-free paths from s to t in the case that s dominates t. We first present the algorithm for BTl and prove its correctness. We then show why this algorithm fails when s does not dominate t, and present a somewhat more complicate algorithm BT2, which handles the general case. Finally, we argue that BT2 is correct, analyze its com- plexity, and show that the length of the path expression produced is 0(N)-

We distinguish between three types of nodes which may occur in the flow graph G: loop header nodes, selection exit nodes, and regular nodes. A loop header node, n, has exactiy two predecessors, p\ and p2, where p\ is the immediate dominator of n, and n dominates p 2- A selection exit, has at least two predecessors; if n is a selection exit then none of n's predecessors dominate n. A regular node has at most one predecessor; if p is the predecessor of a regular node n, then p is the immediate dominator of n. It follows from the structure of the subgraphs given in Figure 2.1.1 that the sets {loop header nodes), {selection exit nodes), and {regular nodes) form a panition of the set of nodes. The nodes in G can easily be classified by performing a depth first search of G. For each back edge, (ij), j is a loop header, for each cross edge, (i j), j is a selection exit; all nodes which are neither the target of a back edge, nor the target of a cross edge are regular nodes.

The algorithms BTl and BT2 make use of the following three primitive functions. CONCAT takes as input two trees, 7"] and T2, and returns the tree whose root is labelled by the symbol "•", whose left subtree is 7], and whose right subtree is T2- UNION takes as input two trees, 7"i and T2, and returns the tree whose root is labelled by the symbol

- 178-

"\^", whose left subtree is Tj , and whose right subtree is T2. MAKEVERTEX takes as input a node symbol n, or a loop symbol p,, and returns the tree whose root is labelled by that symbol, and whose left and right subtrees are null.

Function BTl is given in Figure AII.l.

function BTl(s,t: NodeRange): Tree;

(returns path expression tree representing set of paths from s to t, where s dominates t)

begin

if s does not dominate t then Error, if s=l then BTl := MAKEVERTEX(X) else begin

imm_s_dom := immediate s-dominator oft; if t is regular node then

BTl := CONCAT(BTl(s, imm_s_dom),MAKEVERTEX(n,)) else if t is a loop header then

BTl := CONCAT(BTl(s, imm_s_dom),MAKEVERTEX(p,)) else (t is a selection exit) begin

let / 1 ,...,/;„ be the predecessors of t; { m is greater than or equal to 2 } Tl:= UNION(BTUimm_s_dom,i 1),

BTl (imm_s_dom,j 2 )); for j := 3 to m do begin

T2 :=T1;

Tl := UNION(BTl(imm_s_dom,/p,T2); end; {*) T2:=C0NCAT(Tl,MAKEVERTEX(n,));

BTl := CONCAT(BTl(s. imm_s_dom).T2); end end end.

Figure AII.l

Roughly speaking. BTl involves climbing upwards through the tree of immediate-s- dominators from t to s Whenever a selection-exit n is encountered, BTl calls itself recursively to build subtfccs representing the sets of paths from the entrance of the sclcc-

- 179- tion subgraph to the predecessors of n, then "unions" these subtrees together.

THEOREM

Let s dominate t. Let a be the path expression represented by the tree returned by BTl(s,t). Then if node s is not a loop header, the path expression n^-a represents the set of s-free paths from s to t, and if node s is a loop header, the path expression p^-a represents the set of s-free paths firom s to t.

PROOF:

Let C' be the subgraph of C^ consisting of those nodes from which t is reachable. The proof is by course-of-values induction on the size of the G^'.

Base Case: IG*''I=1.

If \G^''\=\, then s=t, so BTl(s,t) consists of a single vertex labelled by X, and a=X. PATHS(n^-a)=PATHS(ni)={(«^)), which is the set of paths from s to t.

Inductive case:

Suppose the theorem holds for all 5' and r' such that s' dominates r' and \G ' \< k. We show that the theorem holds for s and t such that C^''= k. We will assume that s is not a loop header. The proof when s is a loop header is identical except that occurrences of the symbol n^ are replaced by the symbol p^.

case 1 : t is a regular node.

Let d be the immediate s-dominator of t. Every s-free path from s to t is of the form Kn, where 7t is an s-free path from s to d. The function BTW(s,t) returns a tree represent- ing the path expression a=P-n,, where P is the path expression represented by BTl(s,d). So ni-a=n^-p-n,. By the inductive hypothesis, n^-P represents the set of s-free paths from s to d. Therefore,

-180-

PATHS(ni-a)={7i«,l t: is an s-free path from s to d}={s-free paths from s to t).

case 2: t is a loop header.

Let d be the immediate s-dominator of t. Every s-free path from s to t is of the form Ttp where 7t is an s-free path from s to d and p is a path through the loop headed by node t. The function BTW(s,t) returns a tree representing the path expression a=p-p,, where [i=BTl(s,d). So n^-a=n^Pp,. By the inductive hypothesis, n^-P represents the set of s- free paths from s to d. By definition, p, represents the set of through the loop headed by t. Therefore, PATHS(n^-a)={ s-free paths from s to t}.

case 3: t is a selection exit.

Let d be the immediate dominator of t. Note that d is the entrance to the selection subgraph corresponding to t. Since s dominates t, s also dominates d, so d is the immedi- ate s-dominator of t. Let / 1 ,.../^ be the predecessors of t. Note that d dominates each of the ij. Every s-free path from s to t is of the form pdat, where pd is a s-free path from s to d, and do is a d-free path from d to one of the ij.

By the inductive h>'pothesis, for each j, l<j<m, the path expression a^ represented by the tree BTKd,/^), has the property that nj-a^ represents the set of d-free paths from d to ij. So after executing the statement marked by [*j, T2 represents a path expression (3, where n^-^ represents the set of d-free paths from d to t. By the inductive hypothesis, BTKs.d) represents a path expression y where n,-Y represents the set of s-frce paths from s to d. So BTl(s,t) represents the path expression yP where n,-y-p represents the set of s-free paths from s to t.B

When s does not dominate i, the situation becomes more complicated. Intuiuvely, this is because the graph G* is not necessarily a structured flow graph. In this situation

- 181 -

the immediate s-dominators of a ncxie is not necessarily the same as its immediate domi- nator.

Consider the flow graph shown in Figure All. 2. a. Let s be node 6. The tree of immediate 6-dominators is shown in Figure An.2.b. Those nodes which are loop headers (in G) have been marked with an asterisk, and those which are selection exits (in G) are marked with a plus sign.

Let t be node 10. Climbing upward through the tree of immediate-6-dominators from 10 to 6 yields the path expression a=6-8p4-910. This path expression represents some, but not all, of the paths firom node 6 to node 10. It does not represent those paths which, after visiting node 9, "bypass" node 10, going instead to nodes 11 and 12, then around the loop headed by node 2 zero or more times, then back to nodes 3, 4, and 9. We can view the problem as arising from the fact that node 4 can be preceded either by node 8, which immediately-s-dominates node 4, or by node 3, which does not s-dominate node 4. Note that the immediate 6-dominator of node 4 and the immediate dominator of node 4 differ. If we replace the symbol P4 in a by the sub-path-expression

(p4-9(10ull)12p2-3,^X)p4, which represents the set of paths from node 4 to node 4, we obtain the expression

6-8- (p4-9(10,^ll)12p2-3,^>.)p4 ^-lO, which represents the set of paths from node 6 to node 10, as desired.

A similar problem arises in the flow graph shown in Figure AII.3.a. Let s be node 4 and let t be node 8. The tree of immediate 4-dominators is shown in Figure AII.3.b. Climbing up the tree from node 8 to node 4 gives the path expression 4-6-7-8, which represents some, but not all of the paths from 4 to 8. If we replace the symbol 6 by the sub-paih-expression

(b)

-I!^l-

FIGURE All. 2

(aj

8*

4*

5

I 7

/

\ 9

/\

10 U 12-

2*

/\

3 13

I*<2,-

FIGURE All. 3

(a)

®

(b)

6-

8 10-

1'

11

- 184-

6(7(8u9)10 p2-3(4,^5)-6»^X),

which represents the set of paths starting and ending at ncxie 6, we obtain the path expres- sion 4■6■(7•(8^^9)•10•p2•3•(4^^5)•6^^X)•7•8, which represents the set of paths from node 4 to node 8, as desired.

We capture these idea algorithmically, as follows. As before, BT2 involves climb- ing upwards through the tree of immediate s-dominators from t to s. If the node being processed is a regular node, a loop header whose immediate s-dominator is the same as its immediate dominator, or a selection exit whose immediate s-dominator is the same as its immediate dominator, then BT2 will behave like BTl. However if the node n being processed is a loop header or selection exit whose immediate s-dominator and immediate dominator differ, it may be necessary to take special action, in order to include a sub- path-expression representing the set of n-free paths from n to n. As we shall see below, this special action is not always necessary, because the sub-path-expression produced may be redundant. We will present the algorithm for function BT2, after some prelim- inary definitions.

Let s be a node and let n be a loop header node in G. Let e be the exit node of the loop headed by n and let d be the immediate dominator of n. If d is reachable from s and e s-dominates n, we will call n an s-inverted loop header. Note that if n is s-inverted then s is in the loop headed by n and s^'n. If n is s-invened then the immediate dominator of n and the immediate s-dominator of n are different nodes. Thus, the loop in G^ headed by n is different than the loop in G headed by n.

LEMMA All 7:

Let n be an s-inverted loop header and let d be its immediate dominator. Then there is exactly one loop header h such that

- 185-

1) n is in the body of the loop headed by h, and

2) n properly s-dominates h and h s-dominates d.

Note that h may actually be d itself. We will call h the header corresponding node n.

PROOF:

Since n is s-inverted, s is in the body of the loop headed by n. Otherwise, every path from s to e would pass through n. Since d is reachable from s there must be some loop which contains both d and s (see Figure All. 4.) Let h be the header of the innermost such loop. It is immediately evident from the figure that h satisfies 1) and 2).

Let h' ^hbt another header which satisfies 1). Then the loop headed by h is nested within the loop headed by h'. There is a path from s to d which does not pass through h', so h' does not satisfy 2)M

We will now present the second algorithm, BT2, and argue that it is correct.

\1fi-

FIGURE All. 4

,^

® ...

/

\

/

187-

procedure BT2(s

l,t: NodeRange; var HeaderHit: boolean; var T: Tree);

var imm.

_dom, imm_s_dom :NodeRange;

begin

{1} ifs

1 dominates t then T := BTl(sl,t)

{2} else begin

{3}

imm.

.dom := immediate dominator of t;

{4}

imm_

.s_dom := immediate s-dominator of t;

{5)

if t is a regular node then

begir

1

{6}

BT2(sl,imm s dom,HeaderHit,Tl);

{7}

end

T := CONCAT(Tl,MAKEVERTEX(n,))

{8}

else i begir

f t is a loop header then

1

{9}

mark that t has been visited;

{10}

if (t is inverted loop header) then begin

{11}

HeaderHit := true;

{12}

if the header associated with t has not yet been visited then begin

{13}

BT2(t,imm_dom,HeaderHit,Tl);

{14}

T2 := UNION

(CONCAT(MAKE VER 1 hX(p, ),T1 ) MAKEVERTEX(X));

{15}

T3 := C0NCAT(T2,MAKEVERTEX(p,));

{16}

BT2(sl,imm s dom,HeaderHit,T4);

{17}

T := CONCAT(T4,T3); end

{18}

else begin

{19}

BT2(sl,imm s dom,HeaderHit,Tl);

{20}

T := CONCAT(Tl,MAKEVERTEX(p,)); end end

{21}

else begin

{22}

BT2(sl,imm s dom,HeaderHit,Tl);

{23}

end

T := CONCAT(Tl,MAKEVERTEX(p,)); end

{24}

else { t is a selection exit }

{25}

if imm_dom s-dominates t then

{26}

BT2(sl,imm s dom,HeaderHit,Tl);

{27}

T := CONCAT(Tl,BTl(imm_dom,t))

{28}

else if t s-dominates imm_dom and not HeaderHit then begin

(29) {30}

Tl := BTl(imm_dom,t); BT2(t,imm_dom,HeaderHit,T2);

- 188-

{31}

T3 :=C0NCAT(T2,T1);

{32}

T4 := UNI0N(MAKEVERTEX(X),T3);

{33}

T5 := CONCAT(MAKEVER lhX(n,),T4);

{34}

BT2(sl,imm s dom,HeaderHit,T6);

{35}

end

T := CONCAT(T6,T5)

{36}

else

begin

{37}

BT2(sl,imm s dom,HeaderHit,Tl);

{38}

end

T := CONCAT(Tl,MAKEVERTEX(n,))

end

end.

function BuildTree(G,s,t : NodeRange) : Tree;

begin

{initialize global variables}

build tree of immediate dominators;

build tree of immediate s-dominators;

classify each node as being a regular node, a loop header, or a selection exit;

mark each s-inverted loop header as such and find the associated loop header;

mark each loop header as unvisited;

HeaderHit := false;

(call procedure BT2 to build tree T=7",

where n^-a represents the set of candidates corresponding

to the definition-use pair (s,t)} BT2(s,t,HeaderHit,T);

BuildTree := CONCAT(MAKEVERTEX(n,),T); end;

We now argue, somewhat informally, that BuildTree(G,s,t) returns a tree represent- ing a set of paths which contains all of the s-free paths from s to t, as desired.

When BT2 is called from BuildTree, the effect is to "climb" upward (toward the root) through the tree of immediate s-dominators from t to s, backing down the tree under certain special circumstances.

We first discuss the cases where the mimediate dominator and immediate s- dominaior of the current node are the same. If the current node t is a regular node, then

- 189-

the algorithm "emits" a vertex labelled by n, (line 7) and continues to climb up through the tree (line 6). Similarly, if the t is a non-inverted loop header node, then the algorithm "emits" a vertex labelled by p, (line 23) and continues to climb up through the tree Gine 22). If the t is a selection exit whose immediate dominator, d, and immediate-s- dominator are the same node, then the algorithm "emits" a subtree representing the d-free paths from d to t Qine 27) then continues to climb up through the tree (line 26). In each of these cases, an argument similar to the ones made in the proof of correctness of BTl could be made to show that the resulting path expression includes all of the s-free paths from si to t.

We next consider what happens when the current node, t, is an inverted loop header. Let d be the immediate dominator of t, let e be the other predecessor of t (the exit node of the body of the loop headed by t, hence the immediate-s-dominator of t) let h be the header associated with t, and let n be the successor of t which is not in the loop (see Fig- ure All. 5.)

If node h, the header corresponding to t, has not yet been visited, then the algorithm returns a tree corresponding to the path expression T=P•(p,•a^^X,)•p, where P comes from the recursive call to BT2 in line 16 and a comes from the recursive call to BT2 in line 13.

The s-free paths from s to t are all of the form Tip, where 7t is an s-free path from s to e and p is either of the form eox^ or to, where a and ^ are s-free paths through the loop headed by t and i is an s-free path from n to d. Any s-free path from s to t can be decomposed in such a way that kg PATHS(p) and pe PATHS((p,-a^^X)p,).

If node h, the header corresponding to t, has already been visited, then a subtree labelled by a loop symbol representing the outermost loop header which is reachable

-liD-

FIGLRE AII.S

^~N,

®

h*

- 191-

from s and which reaches the original target has already been emitted. All of the paths from t to d are subpaths of paths represented by this loop header. So it would be redun- dant to emit the path expression (P;a\^X)p,). Thus, in lines 22 and 23 the algorithm emits a path expression representing the s-free paths which go from s to e, then through the loop headed by t.

If the current node t is a selection exit, there are four cases to consider. Let id represent the immediate dominator of t and let isd represent the immediate s-dominator of t. If id=isd, then s dominates id, and the situation is similar the situation treated in algorithm BTl. BT2 emits a path expression a- (3, where a (the path expression corresponding to the tree returned by the recursive call in line 22) represents s-free paths from s to id and (3 (the path expression corresponding to the tree returned by the call to BTl in line 23) represents paths through the selection subgraph.

If id=isd then node s is in the selection subgraph whose exit node is t. This sub- graph has a different "shape" in G'^ than in G. If t s-dominates id {line 28}, then there is a path from s to id. Each s-free path from s to t is of the form 7tp, where K is n s-free path from s to t and p is a path from t to t. Any path from t to t is either the null path or a path oC where a goes from t to id and I, goes from id to t. The path expression corresponding to the tree built in lines 29 to 35 represents these paths. Intuinvely, in lines 29 and 30 the algorithm "backs down" the tree of immediate s-dominators.

If the variable HeaderHii has already been set to true when control reaches line 28, then the outermost loop header containing s and the original target node has already been emitted, so the path expression representing paths from t to t is redundant, this case the algorithm emits a path expression (in lines 37 and 38) representing the s-free paths which go "directly" from s to t. without "backing down" the tree of immediate s-dominators.

- 192-

Similarly, if id does not s-dominate t and t does not s-dominate id, then id is not reachable from s, so the algorithm emits a path expression (in lines 37 and 38) represent- ing the s-free paths which go "directly" from s to t, without "backing down" the tree of immediate s-dominators.

In the execution of BuildTree(G,s,t) no vertex in the tree of immediate s-dominators is visited more that twice. This is because the only way a vertex which has already been visited can be visited again is if the algorithm "backs down" the tree of immediate-s- dominators in line 13 or line 30. When the recursive call in line 30 is performed, an inverted loop header node which is s-dominated by t and which s-dominates id is guaranteed to be visited, at which point HeaderHit will be set to true. So line 30 is exe- cuted at most once in the execution of BuildTree. When line 13 is executed, the headers corresponding to inverted loop header t and to all of the other inverted loop headers in whose loop t lies are visited. Thus in the execution of BuildTree, each edge (id,t) is backed down at most once. It follows that the length of the path expression produced by BuildTree is 0(N) and that the running time of BuildTree is 0(N) where N is the number of nodes in G.

- 193- APPENDIX III:

In this appendix we present a script of a session with an implemented prototype of the path expression heuristic. This implementation is interactive. It allows the user to select the loop symbol on which to concentrate. It also requires that the user perform the partial symbolic evaluation and theorem proving by hand. The implemented algorithm differs in certain minor ways from the algorithm described in chapter 5.

In this session we examine the association (2,(6, 14), switch) from the bubblesort program in Figure 5.2.2. This is the same association which was examined in Examples 5.2.2 and 5.5.1.

Text entered by the user is printed in bold face. Comments are in italics. The sym- bol F stands for 0 and the symbol L stands for X.

Script started on Tue Sep 22 21:10:05 1987

csd27> heuristic

Would you like to check another def-use association?(Y/N)

==>y

Enter the source and target node

2 14

Enter the variable

switch

The initial path expression is

N2.P3.N5.P6.N14

Select one of the following actions: A: Expand a loop symbol

- 194-

B: Toggle parentheses

C: Print current path expression

D: Print Always/Never info for subexpression

E: Give up

===>a

Enter position of a loop symbol

3

User directs system to first concentrate on the symbol P3,

which is the third symbol in the current path expression.

Expanding loop symbol P3.

Use partial symbolic evaluation to derive the panial path condition E1& ... &Ek for the path expression

botP3.N5.P6.N14top which follows the loop symbol.

Derive a predicate which is always true when control reaches the top of this occurrence of P3 by panially symbolically evaluating the path expression botNl.N2.P3top

For each i.

If (Ei & C) is unsatisfiable. let MRDi be the set of variables occurring in Ei. Otherwise MRDi is empty.

- 195- Enter the non-empty sets MRDi

Enter variables in the next MRDi set, one per line. Type "." to end the list of variables in this set. Enter "*" to end the list of sets.

Enter variables in the next MRDi set, one per line. Type "." to end the list of variables in this set.

Enter "*" to end the list of sets.

*

User enters the set (i) because the predicate i>5

is necessary on exit from the loop and is unsatisfiable before

entry into the loop. Thus i must change value on any executable

path through the loop.

Expanding loop symbol

The simplified subexpression representing the loop is ((N3.N4)-i- U L).N3

The new path expression is N2.((N3.N4)+ U L).N3.N5.P6.N14

Select one of the following actions: A: Expand a loop symbol B: Toggle parentheses

- 196-

C: Print current path expression

D: Print Always/Never info for subexpression

E: Give up

===>a

Enter position of a loop symbol

14

The loop symbol P6 is the 14th symbol in the path expression

(not counting parentheses.)

Expanding loop symbol P6.

Use panial symbolic evaluation to derive the partial path condition E1& ... &Ek for the path expression

botP6.N14top which follows the loop symbol.

Derive a predicate which is always true when control reaches the top of this occurrence of P6 by partially symbolically evaluating the path expression botNl.N2.{(N3.N4)+ U L).N3.N5.P6iop

For each i,

If (Ei & C) is unsatisfiablc. let MRDi be the

set of variables occurring in Ei.

Otherwise MRDi is empty.

Enter the non-empty sets MRDi

- 197-

Enter variables in the next MRDi set, one per line.

Type "." to end the list of variables in this set.

Enter "*" to end the list of sets.

n

switch

Enter variables in the next MRDi set, one per line. Type "." to end the list of variables in this set.

Enter "*" to end the list of sets.

*

n or switch must be redefined on any executable path

through the loop

Expanding loop symbol

Pruning the expression. Variable

switch is always defined in the sub-expression

(N6.N7.P8.N13H

The simplified subexpression representing the loop is

N6

Each path through the body of the loop has a definition of switch

None of the variables in the set

{ switch n )

are defined in the subexpression

N6

- 198-

The new path expression is F

No more candidates for unexecutable def-clear path.

The association is unexecutable

Would you like to check another def-use association?(Y/N)

==> n

csd27> *D

script done on Tue Sep 22 21:15:41 1987

NYU COMPSCI TR-394 c.l

Frankl, Phyllis

The use of data flow

information for the

selection and. . .

NYU COMPSCI TR-394

Frankl, Phyllis

The use of data flow information for the selection and...

c.l

This book may be kept SEP ^ 0 1988

FOURTEEN DAYS

A fine will be charged for each day the book is kept overtime.

I

OAVkORO 1«d

r: «

^ * * _.-^_«^.

-f :< ,<»

5

m ir:*

.&* .. *•■

»: '^..'^

® m

if>' IfH'

>:. »'

-*;*.

«. SP

S^J^

■*'"'^- %!«!♦-.*,

;***«.:*-*

«%%