
CAC Document Number 203 
CCTC-WAD Document Number 6506 



Research in 
Network Data Management and 
Resource Sharing 



Network File Allocation 



by 

Geneva G. Belford 
John D. Day 
Enrique Grapa 
Paul M. Schwartz 



Prepared for the 

Command and Control Technical Center 

WWMCCS ADP Directorate 

of the 

Defense Communications Agency 

Washington, D.C. 



under contract 
DCA100-75-C-0021 



Center for Advanced Computation 
University of Illinois at Urbana-Champaign 
Urbana, Illinois 61801 



August 2, 1976 



Approved for release: 




Peter A. Alsberg, Principal Investigator



Table of Contents 

Page 

Executive Summary 1 

Background 1 

Overview 1 

Models 2 

Finding the Optimal Allocation 6 

Applications to Special Situations 7 

Conclusions 8 

Introduction 10 

History of the Problem 10 

The Present Work 13 

Models 15 

Introduction 15 

The Casey Model 16 

The Chu Model 18 

The Levin Model 19 

Model for a Primary-Copy Synchronization Scheme 22 

Constraints 24 

Finding the Optimal Allocation 29 

Zero-One Programming 29 

Searching the Cost Graph 29 

Simplifying the Problem 38 

Applications to Special Situations 47 

Single Update Source 48 

Use of a Local Partial Copy 50 

Conclusions 54 



Table of Contents (continued) 

Page 

References 56 

Appendix 1 57 

Appendix 2 59 

Appendix 3 63 



Executive Summary 

Background 

This document presents the results to date of a research study 
of file allocation in a network. The study is part of a larger, compre- 
hensive investigation of problems in network data management and resource 
sharing. The goal is to develop techniques applicable to the World-Wide 
Military Command and Control System (WWMCCS) Intercomputer Network 
(WIN). The work is supported by the WWMCCS ADP Directorate, Command and 
Control Technical Center, of the Defense Communications Agency. 
Overview 

Computer networks can provide rapid access to geographically 
distant computer systems. This creates an opportunity to set up an 
automated distributed data base. In a distributed data base, different 
files can be located at different sites in the network. Also, copies of 
files can be replicated at one or more backup sites. The WWMCCS data 
base is an example of such a multi-file, multi-copy, distributed data 
base. 

The engineering of a distributed data base leads naturally to 
the problem of network file allocation . This is the problem of placing 
the files at the various sites in some optimal (or near optimal) manner. 

In our study of the allocation problem we first develop a 
model for how the files are used. The model carefully defines the file 
usage process that we are trying to make efficient through an optimal 
choice of allocation strategy. The next step is to define what is meant 
by optimal. In most studies in the literature optimal is defined as 
least cost. "Cost" is a broadly defined term that can include far more 



than dollar considerations. For example, dollar costs can be multiplied 
by weight factors to reflect the scarcity of certain resources or the 
priority of special functions. There are alternatives to using least 
cost. For example, it might be worthwhile to define the optimum as the 
allocation which results in the minimum response time. 

Once a model has been developed and it has been decided that 
(say) cost is to be minimized, the next step is to develop a cost 
formula . The cost formula will reflect the model and will include all 
essential parameters. Additional formulas may have to be developed for 
constraints in the case where the cost formula alone is not sufficient 
to define the optimum. Examples of constraints that would be imposed in 
a WWMCCS environment are the a priori assignment of a file to a site or 
the elimination of a site from consideration because of security or 
command requirements. 

The last step is to solve the optimization problem defined by 
the model, cost formula and constraints. In general, this is a straight- 
forward, but very large, computational problem. 

In this document we examine several possible models and discuss 
associated cost formulas and constraints. We then study the question of 
solving the optimization problem and present several new techniques 
which will in many cases reduce the problem to a size that is computa- 
tionally tractable. Finally, we look at two file allocation questions 
for which the model is sufficiently simple that the problem may be 
solved with essentially no computational work. 
Models 

We have identified three aspects of file handling that must be 
included in any file allocation study. Basically, the file must be 



1. stored, 

2. updated, and 

3. queried. 

The precise impact on cost of these basic processes will depend on how 
they are carried out. In the models which one usually sees discussed in 
the file allocation literature, the cost formula is of the following 
form: 

Cost = Cost of Storage 

+ Cost of Sending Queries to Nearest Copy 

+ Cost of Sending Updates to All Copies 
A formula of this type was introduced by Casey [4]. It varies in some 
respects from that used by Chu [6] in his pioneering work on file 
allocation. 

Notice that as the number of copies increases, the cost of 
updates and storage will also increase. But the cost of querying 
decreases as closer sites become available to receive the queries. We 
therefore have a classic tradeoff, which makes the problem of finding 
the cheapest allocation a nontrivial one. In case the reader feels that 
not all relevant costs appear to be included in this formula, we remark 
that only a slight modification is needed for this basic scheme to 
reflect such additional features as the costs of processing queries and 
updates. Furthermore, parameters can be interpreted broadly. The 
"nearest" copy is not necessarily nearest in the geographical sense, but 
the one that can respond to the query most cheaply. 

We have, however, identified a more serious difficulty with 
this scheme. The assumption is made that all sites generating updates 
send them independently to all the file copies. In our parallel work on 



synchronization [1] and resiliency [2], we have identified many diffi- 
culties with such a "broadcast" model for updating. We find that a 
primary-copy scheme, in which all updates are first sent to a "primary" 
has many advantages. The primary plays a role in ensuring both resil- 
iency and proper synchronization of the updates. For a primary-copy 
scheme in which the primary subsequently broadcasts the updates, the 
cost formula will be something like the following: 
Cost = Cost of Storage 

+ Cost of Sending Queries to Nearest Copy 

+ Cost of Sending Updates to Primary 

+ Cost for Primary to Send Updates to Copies 
Surprisingly, although this formula looks like it would probably lead to 
higher cost than the broadcast formula, this is not necessarily the 
case. That is, in many situations lower-cost file allocations can be 
obtained for a primary-copy scheme than for a broadcast scheme. 

There are a number of constraints which one might want to 
impose on the problem. The four most useful ones that we find discussed 
in the literature are: 

1. Forbidden or prescribed sites . For historical or policy 
reasons, a copy of a data base might be required at a given 
site. Conversely, security considerations could forbid 
certain locations. 

2. Prescribed number of copies . Primarily for reasons of in- 
creased availability [3], it may be desirable to specify that 
there be, say, two copies of the data base, or of certain 
files. One might also specify a minimal number of copies. 

3. Limited storage . It may be necessary to take into account the 
limited storage capacity of certain sites. 



4. Upper bound on response time . If the access process is 

modeled in considerable detail, and if enough information on 
system parameters is known, the expected time to access each 
file may be computed. The constraint can be imposed that this 
access time be less than some prescribed maximum value. 

Of these four constraints, the first two appear to be the most 
relevant to WWMCCS needs. Fortunately, they also turn out to be very 
easy to impose. Straightforward conditions on number of copies or on 
file locations reduce the options and hence tend to reduce the computa- 
tion necessary to find an optimum. 

On the other hand, the third and fourth constraints actually 
tend to increase the difficulty of the problem. This is because these 
constraints couple together the impact of all the files in the system. 
Storage limitations obviously must take into account all the files. In 
response-time constraints, the file locations interact in more subtle 
ways, such as in increased network traffic and resulting delays. 
Without such coupling, the best location for each file (or group of 
files) can be determined independently. It turns out to be much easier 
to solve a number of small allocation problems than one gigantic one. 

One further point should be made about storage constraints. 
It is not only difficult to impose them, but it may be unrealistic to do 
so. If it is otherwise worthwhile to store a file at a given site, 
there seems to be no good reason why additional storage capacity could 
not be added to accommodate that file. That is, "storage cost" in the 
cost formula can just as well reflect the cost of new storage capacity. 

If one takes this point of view, then the only classical 
constraint which is troublesome to impose is a bound on response time. 



Further study may be useful to determine whether response-time con- 
straints can be handled in some simple fashion - and also whether they 
are really worthwhile in most environments. It may be that some more 
straightforward constraint - for example, on network traffic between 
various sites - would be easier to apply and at the same time have the 
same effect on system performance as the response-time constraint. 
Finding the Optimal Allocation 

If there are n sites in a network, and for each site there are 
two possibilities - either it has a copy or it doesn't - then there are 
2^n different file allocations to be considered. (This number becomes 
2^n - 1 if we omit the null allocation, i.e., no copy.) Computationally, 
there are no foolproof shortcuts to finding the optimal allocation. As 
the network increases in size, the work to solve the allocation problem 
will in general grow exponentially, doubling with every additional site. 
Consider the following example. Casey [4] notes that a program run on 
the 360/91 took less than 10 seconds to solve six different single-file 
allocation problems for a network of 19 sites. (This time included the 
Fortran compilation.) Thus, if we assume that it takes about 2 seconds 
to optimally allocate a single file in a 19-site network, it will take 
over an hour in a 30-site network, about 48 days in a 40-site network, 
and about 136 years in a 50-site network. These figures must then be 
multiplied by the number of files to be allocated. 
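
The arithmetic behind these projections is just the doubling rule applied to Casey's 19-site baseline. A small sketch (ours, not from the report; the function name and the 2-second baseline are taken from the discussion above):

    def projected_seconds(n_sites, base_sites=19, base_seconds=2.0):
        # Work doubles with every site added beyond the 19-site baseline,
        # so an n-site network takes about base_seconds * 2**(n - 19) seconds.
        return base_seconds * 2 ** (n_sites - base_sites)

    # projected_seconds(30) -> 4096 s (over an hour)
    # projected_seconds(40) -> about 4.2e6 s (roughly 48 days)
    # projected_seconds(50) -> about 4.3e9 s (roughly 136 years)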

It therefore becomes imperative to reduce the computational 
work in any way possible. Casey [4] and Urano et al. [13] have given 
some theorems which help in this. Taking a different approach, we have 
developed three theorems which determine a priori that certain sites 
should, or should not, be included in any optimal allocation. Each such 



a priori determination reduces by a factor of two the size of the 
remaining allocation problem. We therefore feel that our theorems can 
be enormously useful in reducing the file allocation problem for large 
networks down to manageable size. 
Applications to Special Situations 

Following another line of research, we have investigated 
certain simple models and related allocation questions which may be 
particularly useful in the WWMCCS environment. First we assumed that 
network transmission cost is the same for all site pairs. However, we 
added the costs of processing queries and updates to the model, and we 
assumed that these may differ from site to site. Here are the questions 
we investigated, with summaries of our results. 

Question 1 . Suppose that all updates are generated at a 
single site, but queries may originate anywhere in the network. What is 
the best allocation? 

In a homogeneous network, the optimal allocation often turns 
out to be a single copy, located at the site generating the 
updates. Specifically, this happens when it costs more to 
maintain (store and update) a copy elsewhere than it does to 
answer remote queries from the updating site. In a heteroge- 
neous network, more tradeoffs enter into the picture. We 
present two simple strategy rules for this case. 

Question 2 . Suppose that there are only two sites under 
consideration - a remote site holding a copy of the data base and 
responsible for updating it, and a local site generating queries for the 
data base. Is it worthwhile to have a local data cache? 

A simple inequality suffices to answer this question. For 
example, if we save $0.10 per query by having the local cache, 



and the cost of local storage is $100/month, then the local 
cache is cost effective for all query rates over (100/0.10)/mo., i.e., 1000 queries per month, 
or about 1.4 per hour. If the savings per query is only 
$0.01; i.e., decreases by a factor of 10, then the crossover 
point increases by the same factor of 10, to 14 queries per 
hour. 
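
Stated as an inequality (our restatement of the rule applied above, with s the savings per query and σ_local the monthly cost of local storage): the cache is worthwhile whenever the query rate λ satisfies

    λ·s ≥ σ_local,  i.e.,  λ ≥ ($100/mo)/($0.10/query) = 1000 queries/mo ≈ 1.4 queries/hour

for the figures used in the example.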
Conclusions 

It is not very hard to formulate a file allocation problem in 
any particular environment. It is only necessary to understand the file 
handling process well enough to model it. For small networks - say 
less than 20 sites - it is a simple matter to find the optimal alloca- 
tion, by brute force computation of the "costs" of all possible alloca- 
tions if necessary. Considerable headway has been made on reducing the 
computational difficulties for larger networks. In addition, in parti- 
cular situations the problem may be simplified to such an extent that 
general rules of thumb are available to yield good - if not optimal - 
allocations. 

In short, the theoretical problems of optimal file allocation 
are well in hand. The data base administrator who wishes to distribute 
files in a network must, however, address several practical problems. 
What model best reflects his environment? What are the values of the 
parameters? (For example, how should the various "costs" be computed?) 
What are the usage patterns? (That is, what query and update volumes 
are generated at the various sites?) The question of parameter values 
is a particularly difficult one. In some respects, it is a problem of 
measurement and statistical analysis. Only measurement can yield usage 
patterns for a real environment. Statistical analysis can help to say 



whether those patterns seem static, oscillatory, or have some long-term 
trend. "Costs" can be even more difficult to determine, since they may 
reflect not only "real" costs, but some subjective judgment as to the 
value or scarcity of resources. The data base administrator must play 
an active role in determining the "best" network file allocation. The 
researcher can only supply the basic tools and instructions on their 
use. 



Introduction 

Computer networks can provide rapid access to geographically 
distant computer systems. This opens up a number of attractive possi- 
bilities. One possibility is to set up a distributed data base, in 
which copies of different files (or segments of files) are located at 
different sites in the network. An obvious question then arises: 
where should the files be placed? This is the file allocation problem. 
It is not surprising that this problem, being basic as well as obvious, 
has a longer history than any other in distributed data management. 
History of the Problem 

Optimal allocation by zero-one programming . The earliest work 
on the network file allocation problem was done by Chu [6]. Chu states 
the problem as follows: "Given a number of computers that process 
common information files, how can we allocate files so that the allo- 
cation yields minimum overall operating costs subject to the following 
constraints: (1) The expected time to access each file is less than a 
given bound, and (2) the amount of storage needed at each computer does 
not exceed the available storage capacity." Variables X_ij to describe 
the allocation are introduced; X_ij = 1 if the jth file is stored in the 
ith computer and X_ij = 0 otherwise. In order to apply constraint (1), Chu 
develops a reasonably comprehensive formula for access time, including 
queueing delays and the effect of intercomputer traffic congestion. The 
overall cost expression to be minimized includes costs for storage as 
well as for transmission of queries and updates. Since the variables to 
be determined can take on only the values zero or one, the optimal 
allocation may be found as the solution to a so-called nonlinear zero- 
one programming problem. (In fact Chu notes that the problem may be 






reduced to a linear one, which may be solved by straightforward 
techniques. ) 

In a later paper, Chu [7] discusses how an availability 
constraint can be added to the model. The main idea is to determine in 
advance (from simple assumptions on failure probabilities) how many 
redundant copies of a file are required to achieve a desired level of 
availability. This number is then inserted into the model. The scheme 
otherwise remains unchanged, since Chu assumed in his earlier work that 
the number of copies was given a priori . The difficulty with using 
zero-one programming to solve the file allocation problem is that it is 
so time-consuming as to seem impractical for a large file system. 

Optimal and suboptimal allocation by search procedures . 
Believing that zero-one programming is too costly an approach, Casey [4] 
developed an efficient search procedure for finding a minimal-cost 
solution, as well as heuristic methods for finding acceptably good 
solutions. Casey's model differs in some respects from Chu's. One 
important difference is that Casey lets the number of copies of a file, 
as well as its locations, be variables. Notice that as the number of 
copies of a file increases, the expense of querying the file decreases, 
but storage and updating costs increase. Thus Casey's approach to 
optimization strikes a balance between these two opposing trends. A 
disadvantage, of course, is that the minimization may not yield enough 
copies for reliability. 

Casey applied his optimization algorithm to real data for the 
ARPA network (when it had 19 sites) and thus showed the process feasible 
for networks of moderate size. His experiments indicated that when 
update traffic equals query traffic, it is most efficient to store the 






file at a central node. As query traffic increases relative to updates, 
storage at multiple nodes is indicated. These results are intuitively 
reasonable. Although one always expects several local minima in a 
complex, multivariable minimization problem, it is noteworthy that 
Casey's experiments reveal extremely large numbers of them (over 100 in 
some cases) . It is clear that any optimal allocation procedure must 
take care to avoid being trapped in such a local minimum. It is possi- 
ble, of course, that a local minimum may provide an acceptable suboptimal 
solution to the problem. 

Morgan and Levin have criticized both Chu's and Casey's models 
on the basis that they do not allow for dependencies between files and 
programs [10,11]. That is, a program (which is itself a file) may need 
to make use of one or more data files. The fact that these files must 
interact with one another is not taken into account in the older models 
which assume file independence. Morgan and Levin also point out that in 
a heterogeneous network it may not always be possible to store a particular 
file at an arbitrary site. Their model takes into account this type of 
constraint, which also includes the possibility that the allocation may 
be restricted by security considerations. The algorithm used to solve 
the optimization problem is a systematic search procedure, along the 
lines suggested by Casey. Dynamic features were also introduced; that 
is, costs to change the file allocation were considered and balanced 
against savings expected from the improved allocation. 

Simplifying the problem . Model development has therefore been 
extensive and has produced models of considerable sophistication. More 
recent research in file allocation has emphasized ways to simplify the 
problem. This is an important consideration, since in general the work 






to solve the file allocation problem increases as 2^n, where n is the 
number of sites in the network. 

Urano et al. [13] have developed theorems which allow a net- 
work to be partitioned into what one might call "proximity groups" of 
sites. At most one site from each proximity group can appear in any 
optimal allocation. Roughly speaking, the costs of querying the various 
members of a proximity group (from any other site in the network) are 
all about the same. If querying costs are proportional to the distance 
the query must be shipped across the network, then members of a proximity 
group are physically close. For example, in a nation-wide network, a 
group of sites in, say, the Los Angeles area might form such a proximity 
group. 

Another step towards making the problem more tractable is the 
very recent work of Chandy and Hewes [5]. They first reformulate the 
problem so that it is linear, at the cost of adding additional variables 
and constraints. They then ignore the requirement that the variables 
X_ij be zero or one and use a standard linear programming approach to 
find a minimum. If the minimizing X_ij's turn out to have (as very often 
happens) zero-one values, then the problem is solved. If not, at least 
a possibly useful lower bound to the true minimum cost is obtained, and 
a linear integer-programming approach (with all the troubles involved 
for large numbers of constraints) can be tried. More detailed work 
needs to be done here, but the approach is very promising. 
The Present Work 

We here report on the work that we have carried out on file 
allocation. In the next section we discuss the different models in 
varying degrees of detail. In particular, Casey's model, and a varia- 
tion of it which we developed for a primary-copy environment, are 






discussed fairly completely. We felt that Casey's model has the sim- 
plicity and flexibility needed to form a basis for our own work. We 
also list various constraints which might be imposed, and assess their 
importance and the relative ease with which they might be applied. 

In the following section we take up the very difficult problem 
of finding the optimal allocation and present some contributions that we 
have made. 

In the two final sections of this report we briefly discuss 
applications to several special cases for which simply analyzed models 
are possible, and we summarize some conclusions we have drawn from our 
work to date. 






Models 



Introduction 

In order to make a rational decision, based on quantitative 
data, on where to put files in a network, it is first necessary to 
construct a model. The model must include those aspects of file han- 
dling which affect cost, response time, and any other features of the 
distributed data system which the file allocator may wish to take into 
account. Once the process of file handling is defined by a model, it is 
possible to identify the parameters which enter into cost, response 
time, etc., and to generate the necessary formulas. The reader should 
be aware of the fact that a model inevitably is a simplified representa- 
tion of the real world. A good model should be simple. It should 
include the key features of the process to be modeled. But at the same 
time, the level of detail should not be such as to preclude its applica- 
tion to slightly varying situations. For example, one should be able to 
modify parameter definitions. One should also be able to include more 
detail by developing submodels to compute the parameters in the basic 
model. 

With these thoughts in mind, we identify three aspects of file 
handling which must be included in any file allocation study. The file 
must be 

1. stored, 

2. updated, and 

3. queried. 

Effects of these basic processes must be included in any cost formula. 
Their precise impact on cost will depend on how the processes are 






carried out. Response time is, of course, a function of the querying 
process. It will also depend on processor and network characteristics 
and loads. The possibility of modeling at an ever increasing level of 
detail immediately arises. In this report we will restrict ourselves to 
flexible, high level models. Where appropriate, we will point out how 
lower level detail may enter into the parameters. The four models that 
we will be considering in this section are 

1. the Casey model, which is perhaps the one most commonly 
discussed in the literature, 

2. the Chu model, which was the first network file allocation 
model to appear in the literature, 

3. the Levin model, an ambitious effort to include interaction 
between programs and files, and 

4. a new model (really a variant of Casey's) that we have 
developed for use in environments that apply primary copy 
synchronization schemes. 

After discussing parameters and cost formulas for the models, we will 
look briefly at formulas for various constraints which may be imposed. 
The Casey Model 

This model, as presented in [4], may be briefly described as 
follows. For simplicity, only a single file is considered. Any site 
may query or update the file. Updates must be sent to all file copies. 
Queries are sent to the "closest" one, where "closest" can be broadly 
defined in terms of lowest cost. 

Cost function . The cost computed is essentially total oper- 
ating cost over some chosen time period (day, week, month). The formula 
contains three terms: one for storage, one for updating, and one for 
querying. The parameters in the model are as follows: 






I = index set of sites with a copy of the file 

n = number of sites in the network 

ψ_j = update load originating at site j 

λ_j = query load originating at site j 

d_jk = cost of communication of one query unit from site j to site k 

d'_jk = cost of communication of one update unit from site j to site k 

σ_k = storage cost of the file at site k 

Storage costs and query loads must be those for the chosen time period. 

An allocation is specified by I, the index set of sites with a copy. 

That is, if sites numbered 1, 4, and 5 have a copy, I = {1,4,5}.* The 

cost C(I) of a given allocation I is then 

C(I) = Σ_{j=1}^{n} [ Σ_{k∈I} ψ_j d'_jk + λ_j min_{k∈I} d_jk ] + Σ_{k∈I} σ_k        (1) 

Since this cost formula and the notation defined above will be used 

frequently throughout this report, this page is printed on light green 

stock so that the reader may locate it easily. 
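
To make formula (1) concrete, the following sketch (ours, not part of the report; the argument names are hypothetical) evaluates C(I) directly from the parameters defined above:

    def casey_cost(I, psi, lam, d_q, d_u, sigma):
        # Cost C(I) of formula (1).  I is the set of sites holding a copy;
        # psi[j] and lam[j] are the update and query loads originating at site j;
        # d_q[j][k] and d_u[j][k] are the per-unit query and update transmission
        # costs from j to k; sigma[k] is the storage cost at site k.
        n = len(psi)
        cost = sum(sigma[k] for k in I)                    # storage
        for j in range(n):
            cost += sum(psi[j] * d_u[j][k] for k in I)     # updates go to every copy
            cost += lam[j] * min(d_q[j][k] for k in I)     # queries go to the cheapest copy
        return cost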

Notice that as the number of copies increases, the cost of 

storage and of updating also increases. But the cost of querying 

decreases as closer sites become available to send the queries to. We 

therefore have here a classic tradeoff, which makes the problem of 

finding the minimal-cost allocation a nontrivial one. A complicating 

factor, from the computational point of view, is the appearance in the 



* Although brackets are most commonly used to denote sets in mathematical 
discussions, we will sometimes use parentheses to indicate an allocation 
index set. Thus, we might alternatively write I = (1,4,5). We will also 
refer to I (or to (1,4,5)) simply as an "allocation". This seems easier 
than to continue to talk about "index sets". 






formula of min_{k∈I} d_jk. This makes the problem seriously nonlinear. 

problem of actually determining the minimal cost allocation will be 
discussed in some detail in a later section. 

Extension to multiple files . The cost formula (1) is for a 
single file. If more than one file is to be allocated, the overall cost 
is readily obtained by adding the costs of the individual files. This 
is possible since the model assumes that there is no interaction between 
usage of the various files. Thus, to minimize the overall cost, one 
need only find the optimal (minimal cost) allocation for each individual 
file. We shall see that when constraints are added, or when file usage 
is not independent, this separation of the multi-file problem into a set 
of single-file problems may no longer be valid. 
The Chu Model 

Like Casey, Chu [6,7] assumes that updates must be sent to all 
copies, while queries go to just one. However, Chu assumes that, if 
there are r copies of a file, then a fraction 1/r of the queries for 
that file are sent to each copy. Alternatively, one may think of Chu's 
approach as averaging over the cost of sending queries randomly to any 
site which has a copy. Furthermore, Chu assumes that the number r of 
copies is fixed in advance, generally by considerations of availability. 
(Relaxing this assumption would, of course, simply require one to con- 
sider in turn the cases r=l,2,...,n (where n is the number of sites in 
the network) . ) 

Chu has some details in his model which appear to make it less 
flexible than Casey's. For example, he envisions a "transaction" to a 
file as resulting in a flow of u units of data from the remote site to 
the user site. Such a "transaction" corresponds roughly to a query in 






Casey's model. But then updates are handled by assuming that a certain 
fraction of these transactions result in the shipment of the same u data 
units back to the remote site. That is, Chu models the updating process 
as one of sending for the file (or relevant portion thereof) , modifying 
it, and then shipping it back. This makes it easy for Chu to apply his 
model to, for example, a multiprocessor with memory hierarchy. However, 
it is unlikely that updates would be performed in this way in a network 
environment with a distributed data management system. 

There is another limiting feature of Chu's model. In Casey's 
model, it is easy to include processing costs by adding them into the 
factors d_jk or d'_jk. Thus, if we wish to include in Casey's model the 
cost of querying a local copy, we need only let d_jj be that cost. Chu's 
cost formulation, however, has a built-in limitation that makes local 
query cost zero. (We will not write down Chu's cost formula here; the 
interested reader may consult his paper.) 

Chu's formulation is of the multiple-file problem. However, 
his cost function readily separates into a sum of terms, each of them 
being the cost associated with a single file. 
The Levin Model 

In his thesis [10], Levin argues that "in order to process a 
given transaction both the relevant program and the relevant file must 
be accessed." Thus, a user request originating at one site may have to 
be sent to a second site (the one containing the application program) , 
which in turn accesses the needed files. Since the optimal location of 
a file then depends on the locations of the programs which access it, 
Levin feels that "an optimizing model for file assignments should con- 
sider the location of the relevant programs and optimize their assign- 
ments simultaneously." This is an attractive idea. However, it does 






lead to a model of considerable complexity. In particular, the assump- 
tion of file-usage independence is no longer valid when some of those 
"files" are actually programs which access other files. 

Levin's cost formula contains the usual three parts - costs of 
storage, updates, and queries. However, because of the inclusion of 
programs and because updates and queries are transmitted through an 
intermediate program, the cost function ends up with six terms: 

Cost = Cost of transmitting queries from initiating sites to the 
programs 

+ Cost of transmitting updates from initiating sites to the 
programs 

+ Cost of transmitting queries from programs to files 

+ Cost of transmitting updates from programs to files 

+ Storage cost of programs 

+ Storage cost of files. 

In any model there seems to be a problem with deciding where 
to transmit the queries. Recall that Casey sends them to the copy that 
is closest in some sense, while Chu sends them randomly to any site with 
a copy. Levin's approach is to introduce a new set of variables to 
describe the "routing discipline"; that is, to which program copy, and 
from there to which file copy, transactions from a given site are 
shipped. The values of these routing variables are then to be deter- 
mined, along with the locations of files and programs, in the optimi- 
zation process. This approach increases enormously the number of 
unknown variables in the problem. Furthermore, since one program may 
access several files - and one file may be accessed by several programs - 
locations of all files and programs are interdependent, and it is no 






longer possible to simplify the problem by decomposing it into a set of 
single-file allocations. 

In order to carry out such a decomposition, and hence make the 
optimization computationally tractable, Levin makes several simplifying 
assumptions. In particular, he assumes that storage costs of programs 
are zero. In a very real sense this eliminates the program location 
problem, since it then becomes possible to freely place programs wher- 
ever it seems useful (and is possible) to do so. Once programs are 
located, one may readily optimize the routing discipline for any given 
file allocation. The file allocation itself is carried out by using 
Casey's suggested search procedure. (This procedure will be discussed 
later.) It is not surprising that Casey's solution method is ultimately 
applicable, since, by adding assumptions to make the problem computa- 
tionally tractable, Levin has simplified his model to the point where it 
looks very much like Casey's. In fact, Levin's simplified model can be 
obtained immediately from Casey's model if some precomputing is done to 
reflect the routing. Each transmission from user to file will have to 
touch one of the preestablished program sites. Thus, if neither the 
user site nor the file site contains a copy of the needed program, then 

Casey's d_ij should be changed to d_ik + d_kj, where k is the program site 

which minimizes that sum. (If copies of necessary programs are at all 
sites, then Casey's model needs no modification. In a heterogeneous 
network, however, it may not be reasonable to assume an unrestricted 
allocation of programs.) 
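
A minimal sketch of that precomputation (our illustration; the cost matrix and program-site set are hypothetical): each query cost d_ij is replaced by the cheapest two-hop cost through some program site.

    def route_through_programs(d, program_sites):
        # Replace d[i][j] by the minimum over program sites k of d[i][k] + d[k][j].
        # If i (or j) is itself a program site and d[x][x] = 0, the original
        # d[i][j] is recovered, as the text indicates.
        n = len(d)
        return [[min(d[i][k] + d[k][j] for k in program_sites)
                 for j in range(n)]
                for i in range(n)]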

As we have noted, Levin's basic model, before simplifications 
are introduced, is too complicated to be computationally tractable. 
However, Levin goes on to map out elaborate extensions of his model. He 
first notes that the basic model is deterministic and static. That is, 






access patterns are assumed to be known, and costs incurred through 
dynamically changing the allocation are not taken into account. He 
therefore indicates (1) how the deterministic assumption may be relaxed 
by treating access patterns as random variables and using forecasting 
theory, and (2) how the static assumption may be relaxed by including 
the costs of changing allocations in a multi-period cost function. 
These are interesting ideas which may be of some importance in future 
work on dynamic, self-organizing, distributed data systems. In this 
report, however, we will restrict ourselves to examining the file 
allocation problem in a static, deterministic environment. 
Model for a Primary-Copy Synchronization Scheme 

In examining the literature on network file allocation, we 
find Casey's model the most attractive one. It has a simplicity and 
flexibility which makes it more readily applicable to a wide variety 
of situations. In addition, its simplicity provides clear insights 
into the key features of the problem, which are easily obscured in the 
more complicated models. We have therefore decided to build on Casey's 
model in our file allocation work. 

When we examine Casey's model in the light of our work on 
update synchronization [1], we find one serious problem. That model - 
like all others in the current literature - assumes that the site where 
the update originates broadcasts the update to all sites holding a copy 
of the file. Thus the update message flow corresponds to that of 
Johnson's scheme for synchronization [9]. In that scheme, each site 
adds a timestamp (its own clock time) to the update and sends it to the 
copies. We have identified certain disadvantages with such a scheme 
which led us to study primary-copy schemes. In primary-copy schemes all 






updates are sent to a distinguished, or primary, site, which subsequently 
broadcasts them. We may summarize the difference which this makes in 
the cost formula as follows. 

Casey (broadcast) model : 
Cost = Storage 

+ Transmission of Queries to Nearest Copy 
+ Transmission of Updates to All Copies. 
Primary-copy model : 
Cost = Storage 

+ Transmission of Queries to Nearest Copy 
+ Transmission of Updates to Primary 
+ Broadcast of Updates by Primary. 
The first two terms are the same; it is only the handling of updates 
that changes. Assuming that the primary is at site 1, we may write 
(using Casey's notation) the single-file cost function for the primary- 
copy model as 

C(I) = Σ_{k∈I} σ_k + Σ_{j=1}^{n} λ_j min_{k∈I} d_jk + Σ_{j=1}^{n} ψ_j d'_j1 + Ψ Σ_{k∈I} d'_1k        (2) 

In this formula, Ψ = Σ_{j=1}^{n} ψ_j; i.e., Ψ is the total update load which 
the primary must broadcast. 

It may appear to the reader that this formula, since it looks 
longer, yields higher costs than Casey's formula. In looking at examples, 
we find that this is not true. That is, a primary copy scheme may be 
cheaper than a timestamp scheme for synchronization. Intuitively, this 
occurs because a site generating updates need send them only to the 
primary; and it may be cheaper for the primary to transmit them to the 






other copies than it would have been for the updating site to send them 
directly. (For example, when the parameters d'_jk are proportional to 
distance, the primary site may be closer to the other sites in the 
allocation than is the updating site.) This interesting result will be 
discussed in more detail in a later report. 
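
Using the same hypothetical data structures as the earlier sketch of formula (1), formula (2) with the primary at site p can be written as follows (our sketch, not code from the report):

    def primary_copy_cost(I, p, psi, lam, d_q, d_u, sigma):
        # Cost of allocation I under the primary-copy model (formula (2)).
        # p must be a member of I; d_u[p][p] is assumed to be zero.
        n = len(psi)
        total_updates = sum(psi)                                 # Psi, the total update load
        cost = sum(sigma[k] for k in I)                          # storage
        cost += sum(lam[j] * min(d_q[j][k] for k in I)           # queries to the nearest copy
                    for j in range(n))
        cost += sum(psi[j] * d_u[j][p] for j in range(n))        # updates sent to the primary
        cost += total_updates * sum(d_u[p][k] for k in I)        # primary broadcasts to the copies
        return cost

Comparing casey_cost and primary_copy_cost over the same parameters is one way to see the effect described above.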

As in Casey's model, multiple files are handled by summing the 
cost functions for the individual files. 
Constraints 

It is not always reasonable to minimize the cost function over 
all possible allocations of copies to sites. In this section we look at 
various kinds of constraints which one might want to impose. 

Forbidden or prescribed sites . The simplest type of constraint 
is to require that certain sites have a copy of the file. In the WWMCCS 
environment, certain data bases are currently held at associated sites 
as a matter of policy and/or convenience. It would be unreasonable to 
throw such sites into a mindless file-allocation algorithm. In such 
situations, the algorithm would only be applied to determine the best 
locations for extra copies. The cost formula would, of course, include 
the information that a copy is at some prescribed site, since this 
affects the query cost. Similarly, it may not be feasible to consider 
putting certain data at some sites. This could be for reasons of 
security, local system overload, etc. We can then simply omit such 
sites when we apply the file-allocation procedure. However, if a 
forbidden site may query the data, this information must be included 
in the cost formula. 

Prescribed number of copies . As was previously noted, Chu's 
cost function includes the number of copies as a parameter. Chu [7] 
suggests choosing this number on the basis of availability. A simple 






model can be used to determine the number of copies that are needed to 
yield a given level of availability. In many cases, two copies are 
adequate. If one site is prescribed, then one need only determine which 
of the (n-1) sites remaining is the best location for the extra copy. 
Given the cost function, it is a simple matter to carry out this 
determination. 
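
One such simple model (an assumption on our part; the report does not spell one out) treats the copies as independently available with probability p, so that r copies give availability 1 - (1-p)^r:

    import math

    def copies_needed(target_availability, p):
        # Smallest r with 1 - (1-p)**r >= target_availability under the
        # independence assumption; e.g. p = 0.9 and a 0.99 target give r = 2.
        return math.ceil(math.log(1.0 - target_availability) / math.log(1.0 - p))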

One could also assume that availability considerations set a 
lower bound on the number of copies needed. This will increase the 
number of possible allocations to be investigated, but may lead to a 
lower-cost allocation. (Adding sites can decrease total cost when query 
transmission costs are large relative to storage and update costs.) 

There is one problem with letting a simplified availability 
model determine the number of copies. A correct, complete availability 
analysis for a real network would involve such things as the detailed 
network topology, a combinatorial analysis of relevant "cutsets" and 
their probability of failure, etc. The number of copies r that are 
needed to satisfy an availability requirement would then in general vary 
with copy location. And availability would not, of course, be indepen- 
dent of the user's site. Some sort of average or minimal availability 
would have to be used in the constraint. One could envisage an itera- 
tive process. A rough estimate of availability could be used to deter- 
mine a number of copies. An allocation could then be determined and 
expected availability computed for that specific allocation. If this 
availability is too low, the number of copies could be increased. This 
is an attractive approach which, as far as we know, has not been previously 
proposed. 

It may also be that one wants to set a maximum number of 
copies. However, the most obvious basis for doing this would be that of 






cost, and the allocation optimization process automatically selects the 
most economical solution. Thus, it is not necessary (though it may save 
computation time) to announce a priori that all allocations of more than 
four copies (say) are "too expensive". Indeed, without extensive 
computations of multi-copy allocation costs it may be impossible to 
verify such a pronouncement. 

Storage constraint . It may be necessary to take into account 
limited storage capacity at the sites. That is, it may not be physically 
possible to store all the files that one might want to at a given site. 
In his model, Chu introduces storage constraints of the form 

Σ_{j∈F_i} L_j ≤ S_i        (i = 1, 2, ..., n), 

where F_i is the set of files to be stored at site i, L_j is the length of 
the jth file, and S_i is the storage capacity at site i. 
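
A direct check of this constraint over a joint allocation might look as follows (our sketch; the data structures are hypothetical):

    def storage_feasible(alloc, L, S):
        # alloc[j] is the set of sites holding a copy of file j; L[j] is the
        # length of file j; S[i] is the storage capacity of site i.
        used = [0] * len(S)
        for j, sites in enumerate(alloc):
            for i in sites:
                used[i] += L[j]
        return all(u <= s for u, s in zip(used, S))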

Adding such a constraint has an unfortunate computational 
effect. It makes it necessary to consider the allocations of all of the 
files simultaneously. It is no longer possible to allocate each file in 
turn, since the ultimate total file distribution is likely to violate 
one or more storage constraints. 

Including storage constraints may seem to be a necessity. 
However, this is not necessarily the case. It may be more reasonable to 
assume that additional storage space can be purchased as needed. The 
cost of storing a file at a given site can reflect "new storage" cost as 
well as "old storage" cost. Indeed, unless some sites have a lot of 
unused storage capacity which management elects to think of as "free", 
there seems to be no good reason to consider newly purchased storage 
capacity to have a cost basis different from the old. 






Upper bound on response-time . Chu [6,7] also includes in his 
model the condition that "the expected time to access each file is less 
than a given bound." To compute this expected time, Chu makes a number 
of simplifying assumptions. The access process is modeled as follows. 

1. The user request, after waiting in a queue for a network 
channel, is transmitted to a remote site that holds a file 
copy. 

2. The remote site accesses the file. 

3. The response is put in the queue for a network channel, and is 
ultimately sent back to the user. 

Chu assumes that request messages are relatively short, so 
that they can be given a high priority and their queueing delay can be 
neglected. With this and other simplifications, Chu can treat the 
entire access process as a straightforward single-server queueing 
system. Response time becomes a function of line capacity, line 
traffic, and request rates to the various sites. Request rates to sites 
and line traffic are in turn dependent upon the file allocation and the 
request rates to the files. 

The important thing to notice about a response-time constraint 
is that it inherently includes all of the files in the system and their 
locations. This is because line traffic and response delay due to line 
congestion must take into account transactions involving all of the 
files. Not being able to decompose the multi-file problem into a set of 
single-file problems increases the computational intractability of the 
optimization. 

Other constraints . The four types of constraints discussed 
above are those which one sees in the literature. There are other 






possibilities which come to mind. For example, limited channel band- 
width between two sites may make it necessary to set an upper bound on 
the traffic between the two sites. Although limited bandwidth is taken 
into account in Chu's response-time model, a more direct imposition of a 
constraint on traffic might be simpler and more appropriate in some 
situations. 

As another example, one might sometimes want to take into 
account the limited processing capacity at certain sites. Including 
terms for processing costs in the cost function can help to eliminate 
the need for putting a constraint on processing time. For example, the 
"cost" of sending queries or updates to an already overloaded site could 
be set at such a large figure that no additional constraint is needed to 
effectively eliminate that site from the allocation. To take into 
account more subtle effects of processing capacity limitations, one 
would have to develop a realistic model for the processing needed for 
the query and update operations. 






Finding the Optimal Allocation 

Zero-One Programming 

In Chu's formulation [6,7] of the file allocation problem, the 

variables defining an allocation are 

X_ij, where X_ij = 1 if file j is at site i, and X_ij = 0 otherwise. 

The cost function is then readily written as a quadratic function of 

these variables. By increasing the number of variables in the problem, 

Chu is able to reformulate his cost function as a linear combination of 

variables that take on the values zero or one. Minimizing the cost then 

amounts to solving a zero-one linear programming problem. In general, 

solving such problems is computationally very expensive. On many 

occasions heuristics are used, sacrificing optimality for computational 

efficiency. For example, Levin [10] points out that Chu's approach 

would produce about 9000 zero-one variables with 18000 constraints for 

an ARPA-like network with 30 sites and at least 10 files. Although 

there are computational difficulties for large systems, it should be 

emphasized that this approach is mathematically straightforward. In 

addition, it is relatively easy to formulate constraints in terms of the 

zero-one variables. The constraints are also handled automatically by 

algorithms to solve the zero-one programming problem. This is how Chu 

may readily add response-time and storage-capacity constraints to his 

model. 

Searching the Cost Graph 

Casey's cost graph . Finding the optimal allocation can be 

simply thought of as a process of computing and comparing the costs of 

all allowable allocations. The set of possible allocations for a 







Figure 1 

Graph of all possible allocations among four sites. 



single-file, unconstrained problem can be pictured in graphical form. 
Figure 1 shows such a graph for four-site allocations. For symmetry, 
the top node denotes the null allocation (0) , in which no sites have a 
copy. At level i in the graph we have all the allocations of i copies 
(each node representing an allocation). A branch is drawn from a node 
at level i to one at level i+1 if and only if the latter allocation can 
be obtained by adding a single copy to the former. If we associate the 
cost of each allocation with each node, we have what Casey [4] calls the 
cost graph . In the general n-site graph, the ith level contains (n choose i) 
nodes. The total number of nodes in the graph (including the null 
allocation) is 2^n. Each site added to the network therefore doubles the 
number of costs that must be computed, in general. 
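
For small n the exhaustive evaluation is easy to write down; a sketch (ours), usable with any single-file cost function such as the earlier casey_cost with its parameters bound:

    from itertools import combinations

    def brute_force_optimum(n, cost):
        # Evaluate every non-empty allocation (every node of the cost graph
        # except the null one) and return the cheapest cost and allocation.
        best_cost, best_I = None, None
        for r in range(1, n + 1):                  # level r holds the r-copy allocations
            for I in combinations(range(n), r):
                c = cost(set(I))
                if best_cost is None or c < best_cost:
                    best_cost, best_I = c, I
        return best_cost, best_I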

Search procedures . Casey [4] suggests that the minimum cost 
node be located by carrying out a systematic search through the cost 
graph. In Casey's words, "A computer algorithm can be implemented in 
several different ways to select file nodes [sites] one at a time up to 
the optimum configuration. One approach is to follow all paths in 
parallel through the cost graph, stepping one level per iteration. This 
method is computationally efficient, but may require a great deal of 
storage in a large problem. Alternatively, a program can be written to 
trace systematically one path at a time. Such a technique uses less 
storage, but may require redundant calculations since many different 
paths intersect each vertex". In general we agree with this statement, 
but we dispute the last sentence. We believe that no node must be 
visited more than once in a "path at a time" approach. We will shortly 
defend this belief. 

First, let us make more specific the memory requirements of 
the level-by-level approach. At least (n choose n/2) memory locations (for n 
even) are needed to save the cost values of the previous level in the 
cost graph. (This maximum occurs at the middle level (i = n/2) in the 
cost graph.) Some additional working space is needed for the evaluation 
of the current level. Note that (n choose n/2) is 184,756 for n = 20 and 
155,117,520 for n = 30. As networks increase in size, the memory requirement 
of a level-by-level search quickly becomes prohibitive. This makes it 
important to devise a path-at-a-time search in which effort is not 
wasted. 

Theorems to truncate the search . In the worst possible case - 

when the optimal allocation is to put a copy everywhere - all nodes of 

the cost graph must be evaluated in any search procedure. Usually, 

however, one can truncate the search. Casey presents the following two 

theorems for doing so. The reader should refer back to the description 

of Casey's model for definitions of the notation. 

Theorem C1. Let d_jk = d'_jk for all j, k. If for some integer r 
(≤ n), ψ_j > λ_j/(r-1) for all j, then any r-site file assignment is more 
costly than the optimal one-site assignment. 

Theorem C2. Suppose assignment I is optimal. Then along any 
path in the cost graph from the null node to the node corresponding to 
I, cost is monotonically non-increasing from one node to the next. 

Actually Casey's Theorem C2 is somewhat more general and is 
not stated in terms of the notions of "paths" and "nodes". These 
notions belong to the cost graph visualization of the optimal allocation 
problem. (See figure 1.) 

Theorem C1 gives us an upper bound (r) on the level beyond 
which there is no need to search; i.e., a gross stopping criterion. To 
see how this theorem may reduce our search, notice that ψ_j/λ_j is the 
ratio of updates to queries generated at site j. If for all sites the 
volume of updates is, say, more than half that of queries, then the 
inequality in theorem C1 holds with r=3, and we know that we need not 
search below level 2 in the cost graph. 
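
A small helper (our sketch) locates the lowest level that Theorem C1 rules out for a given workload:

    def c1_stopping_level(psi, lam):
        # Smallest r (2 <= r <= n) with psi[j] > lam[j]/(r-1) at every site j;
        # by Theorem C1, allocations of r or more copies need not be examined.
        # Returns None if the theorem gives no bound for this workload.
        n = len(psi)
        for r in range(2, n + 1):
            if all(psi[j] > lam[j] / (r - 1) for j in range(n)):
                return r
        return None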

Theorem C2 gives us a more precise stopping criterion. If, 
while following a path down the cost graph, we find that the cost 
increases, then no further search along this path will be of any use. 

An efficient path search . We have devised a scheme for 
searching the cost graph one path at a time, but without revisiting 
nodes. The reader can probably understand the algorithm best from an 
example. The search order for four sites goes: 
(1), (1,2), (1,2,3), (1,2,3,4), (1,2,4), 
(1,3), (1,3,4), (1,4), (2), (2,3), ..., (4). 
More precisely, the following algorithm will print out the search order 
for n sites. Let V(i) be an array with index i. Let the instruction 
"print" be such that it will print the values of V(l) to V(k) 
contiguously. 

1. V(1)=1; k=1; print. 

2. If V(k) ≠ n, then k=k+1; V(k)=V(k-1)+1; print; go to 2. 

3. Else k=k-1; 

If k=0, then halt. 
Else V(k)=V(k)+1; print; go to 2. 
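
In a present-day language the same traversal can be expressed as a generator; this is our rendering of the algorithm above, not code from the report:

    def preorder_allocations(n):
        # Yield every non-empty allocation, as a tuple of site numbers, in the
        # same order as the algorithm above (a preorder walk of the tree of
        # figure 2).  Worst-case memory is one entry per level, i.e. O(n).
        v = [1]
        yield tuple(v)
        while v:
            if v[-1] != n:            # step 2: extend the path with the next site
                v.append(v[-1] + 1)
            else:                     # step 3: back up one level, then advance
                v.pop()
                if not v:
                    return
                v[-1] += 1
            yield tuple(v)

    # list(preorder_allocations(4)) begins (1), (1,2), (1,2,3), (1,2,3,4), (1,2,4), ...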
What we have actually done is to convert the cost graph 
(figure 1) to a tree (figure 2). Our algorithm follows a preorder 
search of this tree. 

As the reader will observe, there are many paths that lead to 
the node (1,2,3,4) in figure 1, but only one in figure 2. There remains 










[Figure 2: the cost graph of figure 1 converted to a tree] 



an important question. Is it possible that by following only one path 
to a given node we could fail to recognize the optimum? The answer to 
this question is "No". In order for a node to have minimum cost, it 
must, by definition of minimum, have a cost no greater than that of any 
other node, in particular than that of any node in the previous level. 
This fact, in conjunction with Casey's theorem C2, produces the condi- 
tion that cost is nonincreasing along all paths leading to the optimum 
node. Thus, our simplified tree of figure 2 contains one such path that 
will lead to the optimum. We therefore conclude that our restricted 
search is sufficient to detect the optimum. 

As the reader will realize, our algorithm requires in the 
worst case (i.e. along the longest path) only one memory location per 
level, or n memory locations. In implementing the algorithm, we find 

that some efficiency can be gained by increasing the memory requirement 

to O(n^2). However, it is clear that we have achieved our goal. Memory 

requirements are much more reasonable than that of the level-by-level 

search. But at the same time we have preserved computational efficiency 

by ensuring that each node cost is computed only once. 

Searching with constraints . The search algorithms described 
above have been for the unconstrained cost minimization problem. If 
constraints are imposed on the single-file problem, the effect is to cut 
down on the number of nodes of the cost tree which must be included in 
the search. This may or may not be helpful. Casey's theorem C2, for 
example, applies only to the overall minimum of the cost graph. More 
extensive computing, therefore, may be necessary to identify the minimum 
cost over a restricted set. 

If constraints (e.g., response time or storage) are imposed 
which couple the allocations of several files, the cost graph expands to 






include all possible allocations of all files. This expanded graph has 
2^{nf} nodes, where f is the number of different files in the system. One 
arrives at this number of nodes by noting that (in Chu's notation) the 
problem has nf variables X_ij, each of which may take on the value zero 
or one. Casey's theorems for truncating the search are clearly no 
longer applicable. It is hard to see how one could avoid the tedious 
process of checking all allocations to see which ones satisfy the con- 
straints, and evaluating the costs of all that do. The alternative is 
to use Chu's formulation and a zero-one programming algorithm. It 
appears to us that Chu's approach, being straightforward, is the best 
one to take if interactions between the allocations of different files 
must be taken into account. 

The best allocation for a primary copy scheme . The search 
procedure needed to find the optimum allocation for the primary copy 
model is much like that for a constrained problem in which a copy is 
prescribed at one site. That is, given a location for the primary, we 
search that portion of the cost graph covering all allocations that 
include the primary site. This determines the best allocation with that 
particular site for the primary. By trying each site in turn as the 
primary, we can find the lowest cost of all and hence the best site 
for the primary. 

Local minima . In many cases, it is not feasible to search a 
large part of the cost graph for a true global minimum. One must con- 
tent oneself with a local minimum - i.e., an allocation which is less 
(or no more) costly than any "neighboring" allocation. 

As one such procedure for finding a local minimum we have 
tried the following. Let the sites be numbered 1,2,3,... in any arbi- 
trary order. (In the next subsection, we will discuss a good strategy 






for choosing this order.) Begin by computing the cost of allocation 
(1); i.e. one copy at site 1. Then proceed cyclically through the list 
of sites, testing the effect of a change in allocation state (i.e., 
whether or not the site has a copy) of each site in turn. If the 
cost is decreased, make the change; otherwise proceed to the next site. 
Continue in this way until a local minimum is reached. A local minimum 
is found if we are able to make a complete cycle of the list of sites
without any cost improvement. That is, if we have a node which represents
a local minimum, then no adjacent node has a lower cost.

An example might be helpful. After computing the cost C(1) of
allocation (1), we next compute C(1,2). If this is larger than C(1), we
next try C(1,3), etc. Suppose that C(1,4) is less than C(1). We then
use (1,4) as a new prospective local minimum, and test the neighboring
allocations (1,4,5), (1,4,6), ..., (1,4,n), (4), (1,2,4), (1,3,4). If
none of these has a lower cost than (1,4), then (1,4) is a local mini-
mum. It may help to visualize the process to note that neighboring
nodes are defined in this algorithm as those joined to the given node by
a branch of the cost graph. (See figure 1.)

The reader might feel that, for example, allocation (1,3) 
should be considered as neighboring (1,4). But notice that these differ 
by two changes in allocation state - deletion of the copy at 3 and 
addition of the copy at 4. Thus these allocations are said to be a 
distance 2 from each other. It is quite possible to define a local 
minimum as being of lower cost than all allocations within distance 2 - 
or possibly within a farther distance. This simply increases the search 
required to locate and verify a local minimum. The usual sort of trade- 
off is involved here. By increasing the cost of the search, one can 
generally obtain a lower cost allocation. Indeed, we find that for many 






examples a local optimum that has lower cost than all allocations within 
distance 2 is actually a global optimum. Chandy and Hewes [5] also 
report that such a "2-distance heuristic ... is a near optimal 
heuristic." 
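
A compact rendering of this cyclic, distance-1 search is given below. It
is a sketch under our own naming (local_minimum, cost): cost may be any
allocation cost function, for example Casey's formula, and order is the
site numbering whose choice is discussed in the next subsection.

    def local_minimum(n, cost, order=None):
        # Hill-climb over allocations by changing the allocation state of one
        # site at a time, cycling through the sites until a full pass produces
        # no improvement.  cost maps a non-empty frozenset of sites (1..n) to
        # its cost; order is the sequence in which the sites are tried.
        sites = list(order) if order is not None else list(range(1, n + 1))
        current = frozenset({sites[0]})     # start with one copy at the first site
        best = cost(current)
        improved = True
        while improved:
            improved = False
            for s in sites:
                trial = current ^ {s}       # add s if absent, drop s if present
                if not trial:               # the empty allocation is not allowed
                    continue
                c = cost(trial)
                if c < best:
                    current, best = trial, c
                    improved = True
        return current, best

Searching within distance 2 instead of distance 1 amounts to trying pairs of
flips rather than single flips, at the price of roughly n times as many cost
evaluations per pass.
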
Simplifying the Problem 

No matter how efficient a search procedure is devised, it
remains true in general that the problem of finding the optimum grows
like 2^n, where n is the number of sites. Indeed, a rigorous proof to
essentially this effect has recently been presented by Eswaran [8].
(Technically, Eswaran proved the problem "polynomial complete".) Let us
see what this really means in terms of computation time. Casey [4]
notes that a program run on the 360/91 took less than 10 seconds to
solve six different optimal allocation problems for a network of 19
sites. (This time included the Fortran compilation.) Thus, if we 
assume that it takes about 2 seconds to optimally allocate a file in a 
19-site network, and every additional site doubles the time, then it 
will take over an hour in a 30-site network, about 48 days in a 40-site 
network, and about 136 years in a 50-site network. It is therefore 
worthwhile to take other approaches to trying to cut down on the amount 
of work. 
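
The times quoted follow directly from the doubling assumption; a two-line
check in Python (ours, for illustration only):

    # Assume 2 seconds for 19 sites and a doubling of the work per added site.
    for sites in (30, 40, 50):
        seconds = 2.0 * 2 ** (sites - 19)
        print(sites, "sites:", seconds / 3600.0, "hours,",
              seconds / 86400.0, "days,", seconds / (365.0 * 86400.0), "years")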

In one such approach, Urano et al. [13] have shown that the 
sites may be partitioned into sets of "neighboring" sites in such a way 
that at most one member of each set should be included in an optimal 
allocation. (The geographic interpretation of their result is based on 
visualizing the d_{ij}'s as distances and was discussed in more detail in
the Introduction.) More recently, Chandy and Hewes [5] have reformulated 
what is essentially Casey's model as a simple linear integer programming 






problem. They find that solving the problem by a standard linear pro- 
gramming algorithm ( without the integrality constraints) leads in many 
cases to correct integer solutions. (The same phenomenon has been noted 
in similar optimization problems [12], and hence appears to be more than 
just an artifact of the particular examples tried.) 

New results on a priori exclusion or inclusion . We have 
attempted to decrease the size of the problem by a new approach; namely, 
by deriving rules for determining a priori that certain sites will (or 
will not) be included in an optimal allocation. Three theorems providing 
rules of this sort are presented in this section. Formal proofs of the 
theorems may be found in appendix 2. Casey's model and cost formula are 
assumed. (See equation (1), p. 17.) For simplicity, we shall also
assume that d'_{ii} = d_{ii} = 0 for all i. Thus we neglect the costs of queries
and updates applied locally. Relaxing this assumption, however, as well
as making other minor alterations in the formulation, requires only
small compensating alterations in our theorems.

Our first theorem essentially states that if the cost of 
having a local file copy is smaller than the smallest possible cost of 
sending the locally generated queries elsewhere, then a local copy 
should unquestionably be included in the optimal allocation. The cost 
of storing and updating a copy at site i is an important quantity; we 
shall denote it by Z_i. That is, we let

    Z_i = \sigma_i + \sum_{j=1}^{n} \psi_j d'_{ji} .

Theorem 1. All optimal allocations will include site i if

    \lambda_i \min_{j \neq i} d_{ij} > Z_i .                              (3)

Our second theorem gives the other side of the coin. If it 
costs more to maintain a local copy than we can possibly save by having 
one, then we do not want one. 

Theorem 2. No optimal allocation including more than one site
will include site i if

    Z_i > \sum_{j=1}^{n} \lambda_j ( \max_k d_{jk} - d_{ji} ) .            (4)

Theorems 1 and 2 can be summarized as follows. Let

    m_i = \lambda_i \min_{j \neq i} d_{ij} , and

    M_i = \sum_{j=1}^{n} \lambda_j ( \max_k d_{jk} - d_{ji} ) .

Then for each i the real line is partitioned by m_i and M_i into three
regions, as shown in figure 3. If Z_i falls in region 1, then site i should
unquestionably be included in any optimal allocation. If it falls in
region 3, it will generally be excluded. An exceptional case is when
all sites fall in region 3; i.e., satisfy inequality (4). In this case,
all optimal allocations will be single-copy ones, and we need only
choose the best of these. Sites whose costs Z_i fall in region 2 (in-
cluding the boundary points) must be considered further.



    Figure 3. The real line divided by m_i and M_i into regions 1, 2, and 3.



One other comment should be made. If network transmission
costs are independent of the sites involved, so that d_{ij} = d for all i \neq j,
then m_i = M_i = \lambda_i d, and region 2 collapses to a single boundary point.
Unless the cost Z_i of some site lies on the boundary, every site is a
priori either included or excluded. Immediately we see that then the
set of sites in region 1 forms the optimal allocation, unless the
exceptional case noted above obtains. Computation is therefore reduced
to a minimum (essentially n cost evaluations) for many real network
environments. (For example, most commercial packet-switched networks
charge per packet, irrespective of the distance the packet is to be
sent.) There are, of course, real cost differentials incurred in the
longer lines between sites, but from the user's point of view it is the
network charging policy that matters.

Our third theorem is somewhat in the spirit of the approach to 
simplification taken by Urano et al. [13]. However, Urano's results 
depend on a more complex model, involving "relay costs" at network nodes 
and restrictive hypotheses on the d_{ij}. Also, instead of determining
that no more than one of some group of geographically close sites can be 
included in an optimal allocation, we show that certain sites may a 
priori be precluded from being in an optimal allocation by the existence 
of a "better" site nearby. Thus, again we have a theorem which allows 
us to eliminate initially certain sites from any further consideration 
in the search for the optimum. 

Theorem 3. A site i cannot be included in any optimal alloca-
tion if there exists another site k in the network such that

    Z_i - Z_k > \sum_{j=1}^{n} \lambda_j ( d_{jk} - d_{ji} )_+ ,           (5)

where (f)_+ = f if f \geq 0, and (f)_+ = 0 if f < 0.




Examples and further discussion. In order to assess their
value, we have applied our theorems to several of the examples in
Casey's paper [4]. Casey assumes throughout that it costs the same to
transmit a query as an update (d'_{jk} = d_{jk}), that transmission costs are
symmetric (d_{jk} = d_{kj}), and that storage costs are zero. In Casey's 5-site
example, the data are as follows. The matrix of transmission costs d_{ij} is

         0    6   12    9    6
         6    0    6   12    9
        12    6    0    6   12
         9   12    6    0    6
         6    9   12    6    0

All \lambda_i's are 24, and the vector of \psi_i's is (2,3,4,6,8). In order to
apply theorems 1 and 2, we tabulate Z_i, m_i, and M_i for all sites i.



    Site      1     2     3     4     5

    Z_i     168   180   174   126   123
    m_i     144   144   144   144   144
    M_i     648   648   576   648   648

Theorem 1 then states that both sites 4 and 5 should be included in any
optimal allocation. Theorem 2 does not eliminate any sites, so the
effect of adding sites 1, 2, 3 would have to be checked. However, the
problem has been reduced from an optimization among 32 (= 2^5) choices to
one among 8 (= 2^3). (The optimal allocation I is (1,4,5).) It is vir-
tually obvious by inspection that costs (Z_i) are sufficiently similar
and query loads and transmission costs sufficiently large that theorem 3
will not be useful. More will be said about this later.
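
The tabulation is entirely mechanical, and it may help to see it spelled
out. The Python sketch below is ours; it assumes the transmission cost
matrix reconstructed above, with d' = d, zero storage costs, query loads
\lambda_i = 24, and update loads \psi = (2,3,4,6,8), and it reproduces the
table and the theorem 1 conclusion.

    # Casey's 5-site example: d' = d, storage costs zero.
    d = [[ 0,  6, 12,  9,  6],
         [ 6,  0,  6, 12,  9],
         [12,  6,  0,  6, 12],
         [ 9, 12,  6,  0,  6],
         [ 6,  9, 12,  6,  0]]
    lam = [24, 24, 24, 24, 24]      # query loads lambda_i
    psi = [2, 3, 4, 6, 8]           # update loads psi_j
    n = len(d)

    Z = [sum(psi[j] * d[j][i] for j in range(n)) for i in range(n)]
    m = [lam[i] * min(d[i][j] for j in range(n) if j != i) for i in range(n)]
    M = [sum(lam[j] * (max(d[j]) - d[j][i]) for j in range(n)) for i in range(n)]

    print("Z =", Z)    # [168, 180, 174, 126, 123]
    print("m =", m)    # [144, 144, 144, 144, 144]
    print("M =", M)    # [648, 648, 576, 648, 648]
    print("theorem 1 includes:", [i + 1 for i in range(n) if m[i] > Z[i]])   # [4, 5]
    print("theorem 2 excludes:", [i + 1 for i in range(n) if Z[i] > M[i]])   # []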






As another example of the usefulness of theorem 1, consider
Casey's 5-site example reduced to four sites by the elimination of site
5. The values of Z_i and m_i are then

    Site      1     2     3     4

    Z_i     120   108    78    78
    m_i     144   144   144   144

Since for all i Z_i < m_i, theorem 1 says that the optimal allocation
consists of all four sites, and no further computation is necessary.

In another of Casey's examples - the 19-site ARPA network 
example with d_{ij} equal to distance - theorem 3 does turn out to be
useful. This is what is to be expected in a geographically widespread 
network in which there are groups of sites that are very close together 
(e.g., in the Boston and Los Angeles areas). We will not give the 
details (the interested reader may consult the data and notation in 
Casey's paper), but we find that theorem 3 eliminates six sites. Speci- 
fically, sites 3, 4 and 5 are excluded by site 2, site 15 by site 14, 
and sites 17 and 19 by site 18. This reduces by a factor of 64 the 
maximum size of the search required to find the optimum. 

The reader might question whether it is worthwhile, in general,
to carry out the computations needed to check the conditions for our
theorems. In this connection, we point out that the costs Z_i are needed
in any cost optimization algorithm and it only increases efficiency to
pre-compute them. Computation of m_i for i = 1,...,n is easy, requiring
O(n^2) operations altogether (n to find the minimum d_{ij} for each i).
Application of theorem 1 is therefore highly advisable, since for each
site successfully identified as satisfying inequality (3) the search is
reduced by 50 percent. Furthermore, it may happen that the set of sites
satisfying (3) forms an acceptable sub-optimal allocation, so that no
further search is necessary.

Computation of M_i is more time-consuming, requiring O(n^2)
operations for each i. It may be that one would only want to try
theorem 2 selectively, say when Z_i (or Z_i/\lambda_i) is "big". Testing for
theorem 3 is even more time-consuming because of the dependence on k.
However, the quantities involved in the test (the right sides of in-
equality (5)) are sufficiently similar to M_i that it is not difficult to
work out an efficient scheme for carrying out the computations needed
for theorems 2 and 3 together. Again, to save time one can readily
eliminate from the process many cases where checking inequality (5) is
(or is likely to be) useless. For example, when Z_i < Z_k + \lambda_i d_{ik}, (5)
clearly cannot be satisfied. This simple condition suffices to show
conclusively that theorem 3 is of no help in the 5-site example dis-
cussed above.

Our intuitive idea of theorem 3 - the elimination of more
expensive, neighboring sites by the existence of a given site - can be
used to identify situations when theorem 3 is likely to be useful.
Thus, if a number of network sites are clumped together (in the sense of
small d_{ij}), then the "cheapest" of these (small Z_i/\lambda_i) may serve to
eliminate others of the clump. Notice also that the minimization
carried out in the test for theorem 1 identifies the "closest" site to
site i in the sense of smallest d_{ij}. This information can be saved and
used to identify promising pairs for the theorem 3 test.
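
One way to organize such a scheme is sketched below (the organization and
names are ours). It applies the pre-screen Z_i < Z_k + \lambda_i d_{ik}
mentioned above before evaluating the full right side of inequality (5),
and it assumes, as in this section, that d_{jj} = 0.

    def theorem3_exclusions(d, lam, Z):
        # Return the set of sites (1-based) that theorem 3 removes from further
        # consideration.  d[j][k] is the query transmission cost from j to k,
        # lam[j] is the query load at site j, and Z[i] is the cost of storing
        # and updating a copy at site i (all lists indexed from 0).
        n = len(d)
        excluded = set()
        for i in range(n):
            for k in range(n):
                if k == i or Z[i] < Z[k] + lam[i] * d[i][k]:
                    continue                 # cheap pre-screen: (5) cannot hold
                rhs = sum(lam[j] * max(d[j][k] - d[j][i], 0) for j in range(n))
                if Z[i] - Z[k] > rhs:
                    excluded.add(i + 1)      # site k eliminates site i
                    break
        return excluded

The minimizing index found while computing m_i can be used to try the most
promising candidate k first for each i.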






Finally we note that all of the results and comments of this
section will hold for the primary-copy model (equation (2)), provided
that we redefine Z_i as

    Z_i = \sigma_i + \Psi d'_{1i} ,

where site 1 holds the primary copy and \Psi is the total update volume.
This is intuitively reasonable. The cost of shipping all updates to the
primary is the same for all allocations and hence has no effect on the
cost comparisons. Therefore the appropriate cost of maintaining a copy
is just the sum of storage cost plus the cost of shipping all updates to
that copy from the primary.

The quantity Z_i/\lambda_i appeared above as a measure of the cheap-
ness of site i. Since an optimal allocation is likely to include the
cheaper sites, this quantity is also a rough measure of the probability
that site i will be included in an optimal allocation. We have success-
fully used Z_i/\lambda_i as a heuristic "search factor" to expedite search
procedures for optimal or near-optimal allocations. For example, in the
search for local minima which we described earlier, we have obtained
considerable improvement by numbering the sites 1,2,... in order of
increasing Z_i/\lambda_i. That is, the use of this search factor generally
seems to lead one to a better (i.e. lower cost) allocation with less
effort than if the sites are numbered arbitrarily.
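
In terms of the local search sketched earlier, the search factor simply
fixes the cycling order; for example (using the names from that sketch,
with Z and lam as 0-based lists and cost defined as before):

    # Number the sites 1,2,... in order of increasing Z_i / lambda_i and hand
    # that order to the cyclic local search.
    order = sorted(range(1, n + 1), key=lambda i: Z[i - 1] / lam[i - 1])
    allocation, c = local_minimum(n, cost, order=order)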

If constraints are imposed on the problem, then our theorems 
may not be immediately applicable. If there are constraints excluding 
or including sites, or setting a minimum number of copies, our theorems 
are still useful as long as they are not applied mindlessly. Thus, one 
must make adjustments if theorem 1 would include some of the wrong 
sites - or too many sites - or if theorem 2 excludes too many sites. 
In some cases, one might want to consider the proofs of the theorems 






(see appendix 2) in determining how they might be applied, since the 
proofs provide the basic cost comparisons which follow from the theorem 
hypotheses. 

For example, suppose we wish to find the best 2-site alloca-
tion, and one copy is required to be at site 1. Then it may be possible
to exclude a number of sites from consideration as the second site by
applying lemma 1 (appendix 2) with k=1. If all sites are "excluded" by
this lemma, one is reduced to brute-force comparisons of the possibilities.
On the other hand, theorem 1 may show that some site - say site 2 -
should appear in any globally optimal allocation. Then I=(1,2) is
certainly a "good" allocation, and application of lemma 2 with k=2, i=3,
..., n, may in fact serve to prove that (1,2) is the best 2-site alloca-
tion containing site 1.

The example above is, of course, very simplistic. Application 
of the theorems in this case is likely to save little over the brute- 
force cost computations and comparisons. The example is only designed 
to show the reader the kinds of variations on our results which are 
obtainable for application in some non-standard situations. 

Constraints may take on many forms, and for some forms it may 
not be clear that anything like our theorems can be derived. We believe, 
however, that for any file allocation problem it is worth investigating 
the possibility of deriving as much a priori information on the optimum
as possible. The savings may well be enough to turn an impractically 
large computing problem into one that is of acceptable size. 






Applications to Special Situations 

In addition to altering Casey's model to handle a primary-copy 
synchronization scheme, we have also, following another line of research, 
developed models for certain special situations which may be particularly 
relevant to the WWMCCS environment. In this section we report on two 
such studies. 

First, we consider the optimal allocation of a file that is 
updated by a single site (the site that generates the updates by its 
data collection activities), but is queried by other sites for its 
information. We will initially consider this situation under the 
assumptions of a uniform query load (i.e., each site generates the same 
number of queries) and a homogeneous environment. We then relax the 
homogeneity condition. An interesting feature of this study is that a 
partial (and sometimes complete) solution to the file allocation problem 
may be obtained by inspection of simple inequalities. The lengthy 
computation and comparison of multitudes of allocation costs are just 
not needed in many simple situations. 

Second, we consider the tradeoffs found in the local caching 
of data (as might occur in an intelligent terminal application.) Here 
we will consider a local system that maintains a partial copy of a file 
or data base. We assume that the system can satisfy some of its queries 
locally and the rest by using the remote copy. We then derive the 
conditions that must be satisfied for this situation to be cost-effective.

We will slightly modify Casey's model by assuming that the 
costs d_{jk} and d'_{jk} are the sums of two components - system (processing)
cost and a constant cost of transmission. Assuming a constant cost of 






transmission seems reasonable, since most packet-network charges are 
independent of distance. Including processing costs will allow us to 
investigate effects of system load on the problem. Thus we will assume
that

    d'_{jk} = u_k          for j = k
    d'_{jk} = u_k + N_u    for j \neq k

and

    d_{jk} = q_k           for j = k
    d_{jk} = q_k + N_q     for j \neq k ,

where u_k is the cost of processing an update on the kth system,
      N_u is the total network cost to transmit an update,
      q_k is the cost of processing a query on the kth system, and
      N_q is the total network cost to transmit a query.
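
In code, the assumed cost structure might be set up as follows (a sketch;
the function and variable names are ours):

    def cost_matrices(u, q, N_u, N_q):
        # Build the update and query cost matrices d'[j][k] and d[j][k] from
        # per-site processing costs u[k], q[k] and flat network charges.
        n = len(u)
        d_update = [[u[k] + (0 if j == k else N_u) for k in range(n)]
                    for j in range(n)]
        d_query = [[q[k] + (0 if j == k else N_q) for k in range(n)]
                   for j in range(n)]
        return d_update, d_query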

Single Update Source 

In this section we wish to determine the optimal allocation of
a file which is updated by only one node but queried by all. To repre-
sent this situation, we will assume that all updates originate from the
node j=1. (In other words, \psi_1 \neq 0 and \psi_j = 0 for 1 < j \leq n.) Initially we
will assume that q_k = q, u_k = u, \sigma_k = \sigma, and \lambda_k = \lambda for all k, and
that N_u = N_q = N. This last condition is not really necessary, but will
make the results somewhat clearer. Thus our homogeneity assumptions include
equal storage costs, query loads, and query and update processing costs.
Under these assumptions we obtain the following.

Strategy Rule 1. [homogeneous network]

    If \psi_1 (u + N) + \sigma > \lambda N,

    then the cheapest allocation is to put a single copy at site 1.




Intuitively, this result says that if it costs more to store 
and update the copy at site k than it does to send k's queries to site 
1, then it does not pay to put a copy at site k. A rigorous proof of 
this result is in appendix 3. 

Now suppose that we introduce some heterogeneity into the
system. Assume that for some site k in the network, u_k \neq u_1, \sigma_k \neq \sigma_1,
and q_k < q_1 < q_k + N. (This last inequality ensures that locally generated
queries will be processed locally.) When will C(k) be less than C(1)?
Straightforward computation yields

    C(k) - C(1) = \psi_1 (N + u_k - u_1) + n\lambda (q_k - q_1) + \sigma_k - \sigma_1 .

From this we immediately conclude:

Strategy Rule 2. [heterogeneous network]

    If some site k is so much cheaper than site 1 that

        \psi_1 u_1 + n\lambda q_1 + \sigma_1 > \psi_1 (N + u_k) + n\lambda q_k + \sigma_k ,

    then it is cheaper to put a copy at site k than at site 1,
    even though site 1 is generating all the updates.

Finally, we might ask when, in a heterogeneous network with best single-
site allocation (1), it is worthwhile to add a second copy. We will
first assume q_k < q_1 < q_k + N and compute

    C(1) - C(1,k) = \lambda N + (n-1)\lambda (q_1 - q_k) - \sigma_k - \psi_1 (u_k + N) .

From this we obtain

Strategy Rule 3. [heterogeneous network]

    If site k is sufficiently cheap that

        \sigma_k + \psi_1 (u_k + N) < \lambda N + (n-1)\lambda (q_1 - q_k) ,

    then the two-site allocation (1,k) is cheaper than allocation (1).






If q_k is still smaller, so that q_k + N < q_1 (which makes it economical for
even site 1 to send its queries to site k), then the condition in
strategy rule 3 becomes

    n\lambda (q_1 - q_k) > \psi_1 (u_k + N) + \sigma_k .

Notice that these last two rules do not give us the complete 
answer to the file allocation problem. For example, rule 3 says that we 
can save by putting a second copy at site k, but it does not select 
among several alternative sites which may satisfy the inequality, nor 
does it preclude greater savings from adding a third copy. 
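
The three rules translate directly into numerical tests. The sketch below
is ours; the parameter names follow the definitions above (psi1 for \psi_1,
lam for \lambda, and per-site u, q, sigma values in the heterogeneous case).

    def rule1_single_copy_at_site_1(psi1, u, N, sigma, lam):
        # Homogeneous network: is a single copy at site 1 the cheapest allocation?
        return psi1 * (u + N) + sigma > lam * N

    def rule2_site_k_beats_site_1(psi1, n, lam, N, u1, q1, sigma1, uk, qk, sigmak):
        # Heterogeneous network: is a single copy cheaper at site k than at site 1?
        return psi1 * u1 + n * lam * q1 + sigma1 > psi1 * (N + uk) + n * lam * qk + sigmak

    def rule3_add_copy_at_k(psi1, n, lam, N, q1, uk, qk, sigmak):
        # Heterogeneous network, assuming q_k < q_1 < q_k + N: does the
        # two-site allocation (1,k) beat the single copy at site 1?
        return sigmak + psi1 * (uk + N) < lam * N + (n - 1) * lam * (q1 - qk)
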
Use of a Local Partial Copy 

In this section, we investigate the situation that would exist 
when an intelligent terminal acts as a front-end to a distributed data 
management system. Specifically, we consider the case where some queries 
can be satisfied by a local partial copy of a data base, while the rest 
must be answered by querying a remote copy. We will compare the cost of 
this two-copy allocation (one complete copy and one partial copy) with 
that of a single remote copy. We assume that all updates originate at 
the remote site (site 1), and that all queries originate at the local 
site (site 2) . 

Let us also assume that the storage cost of the partial copy
is p\sigma_1, where p is less than 1. Further, let

    \psi_1 = the volume of updates relevant to the remote copy but
             not to the local (partial) copy, and
    \psi_2 = the volume of updates relevant to the local copy.

Then \Psi, the total update volume, is \psi_1 + \psi_2.

Let \lambda_1 = the query volume that must be sent to the remote copy, and
    \lambda_2 = the query volume that can be satisfied locally (i.e. by the
             partial copy).






With these assumptions, the cost of having only the remote copy is

    C(1) = \Psi u_1 + (\lambda_1 + \lambda_2)(q_1 + N) + \sigma_1 .

Adding the partial copy gives

    C(1,2) = \Psi u_1 + \psi_2 (u_2 + N) + \lambda_1 (q_1 + N) + \lambda_2 q_2 + (1 + p)\sigma_1 .

Therefore

    C(1) - C(1,2) = \lambda_2 (q_1 - q_2 + N) - \psi_2 (u_2 + N) - p\sigma_1 ,

and the local partial copy becomes cost-effective when

    \lambda_2 (q_1 - q_2 + N) > \psi_2 (u_2 + N) + p\sigma_1 .

There is one immediate point to be noted here. If q_2 > q_1 + N, so that the
local cost of processing queries is greater than that of sending them
across the network and processing them at the primary site, then it
never pays to have the local copy. This observation increases the
importance of designing intelligent terminals to be as inexpensive as
possible.

Another interesting observation may be made. Let the query
cost differential q_1 - q_2 + N be denoted by Q, and suppose that the locally
cached data is relatively static, so that \psi_2 = 0. Then the condition for
cost-effectiveness becomes

    \lambda_2 Q > p\sigma_1 .

To visualize this condition, in figure 4 we have plotted crossover
points. Each curve is for a fixed value of p\sigma_1, and gives values of
(\lambda_2, Q) such that \lambda_2 Q = p\sigma_1. On the graph the units for Q are dollars
per query, of \lambda_2 are queries per hour, and of p\sigma_1 are dollars per month.
Because of the conversion factors among these units, the actual quantity
graphed is \lambda_2 = p\sigma_1/(720 Q).



    Figure 4. Minimum query rates which make a local cache cost-effective.
    The horizontal axis is \lambda_2 (queries per hour) and the vertical axis is
    Q (dollars per query). Q is the savings per query caused by having the
    local cache. Curves are labeled with the storage cost of the local
    cache (in dollars per month). See text for detailed assumptions.



For example, if the differential querying cost Q is $0.10, and the cost
p\sigma_1 of the local storage cache is $10 a month, the crossover point is
at \lambda_2 = 0.14 queries per hour. For any query rate over this, the local
cache becomes increasingly cost-effective. If the differential querying
cost Q is only $0.01, i.e., decreases by a factor of 10, the crossover
point increases by the same factor of 10, to 1.4 queries per hour.
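
The crossover arithmetic is a one-line computation; a small sketch (ours),
with Q in dollars per query, the cache storage cost in dollars per month,
and 720 hours to the month:

    def crossover_query_rate(cache_cost_per_month, Q):
        # Minimum lambda_2 (queries per hour) at which a static local cache
        # (psi_2 = 0) pays for itself.
        return cache_cost_per_month / (720.0 * Q)

    print(crossover_query_rate(10.0, 0.10))   # about 0.14 queries per hour
    print(crossover_query_rate(10.0, 0.01))   # about 1.4 queries per hour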

There are many cost factors which enter into the quantity Q. 
Careful analysis and/or measurements would have to be carried out to get 
a good estimate of Q in any given situation. This very simple applica- 
tion of file allocation points up a basic problem. Good data on costs 
and usage patterns must be available before allocation optimization can 
be at all meaningful. 






Conclusions 

Theoretically, the file allocation problem seems to be well 
understood. However, the models and associated cost formulas to be 
found in the literature do not necessarily reflect the realities of 
distributed data management. Witness the total lack of concern with the 
synchronization problems which the broadcasting of updates to multiple 
copies causes. Nevertheless, we find that it is not very difficult to 
develop new models to handle, say, a primary-copy scheme. Obtaining 
models and associated cost formulas for any specific environment is 
largely a matter of having the data access process carefully defined. 
In a way, the model is such a careful definition.

If the network is not very large - say 20 sites or less - the 
best allocation can be found by brute force, if necessary. That is, the 
costs of all possible allocations can be computed and compared. As 
networks grow in size, this computational problem rapidly becomes too 
large to be feasible. Considerable progress has been made in making the 
problem more tractable. Doubtless there is room for more work in this 
area. We believe that one useful approach is the development of very 
simple models and resulting decision rules that answer specific ques- 
tions. Examples of this kind of approach were provided in the section 
just preceding. 

A serious problem - and one which research alone can not 
solve - is the determination of the parameters to be put into the model. 
We have indicated that "costs" need not (and perhaps should not) be 
based on real dollars, but instead might reflect the data base manager's 
judgment as to the "value" of the various resources, or the relative 
importance of the various terms in the overall cost formula. In short, 






cost factors can be thought of as general weight factors, which may 
reflect subjective judgments as well as objective cost accounting. The 
information on data base usage is also specific to the system. Measure- 
ments must be taken of query and update volumes before the model can 
yield valid results. The researcher can aid in providing statistical 
analyses of the measurements. Such analyses might yield information on 
whether usage patterns appear to be static or oscillatory, or show long- 
term trends. But the burden is again on the data base manager to supply 
the raw data needed to put real numbers in the formulas. 

Finally, we note that it might be reasonable to attack the 
file allocation problem from an entirely different point of view than 
that of minimizing "cost". One might instead want to minimize, say, 
response time. Minimization of response time does have some attractions, 
especially in an environment where rapid response may be critical. The 
problem must be constrained, of course. Otherwise the fastest response 
is obtained by putting a copy at each site. Reasonable constraints 
might be to put a bound on storage and update costs, or to set a maximum 
number of copies. 

In our discussion of response time as a constraint, however, 
we have already implicitly identified two difficulties with this approach.
First, response models are complex. More data and parameters are needed 
to compute response. No readily usable response model seems to currently 
exist in the literature; additional research is needed on this problem. 
Second, response models necessarily couple together the effects of all 
the files in the system. Unless some clever way can be found to get 
around this problem, computation of a minimal response allocation will 
be intractable even for small networks. In spite of these difficulties, 
we believe that research along these lines would be definitely worthwhile.






References 



1. Alsberg, P.A. et al. Synchronization and Deadlock. CAC Document
No. 185 (CCTC-WAD Document No. 6503), Center for Advanced Computation,
University of Illinois at Urbana-Champaign, March 1976.

2. Alsberg, P.A., Belford, G.G., Day, J.D., and Grapa, E. Multi-Copy
Resiliency Techniques. CAC Document No. 202 (CCTC-WAD Document No.
6505), Center for Advanced Computation, University of Illinois at
Urbana-Champaign, May 1976.

3. Belford, G.G., Schwartz, P.M., and Sluizer, S. The Effect of
Backup Strategy on Data Base Availability. CAC Document No. 181 (CCTC-
WAD Document No. 6501), Center for Advanced Computation, University of
Illinois at Urbana-Champaign, February 1976.

4. Casey, R.G. Allocation of copies of a file in an information
network. AFIPS Conference Proceedings 40, AFIPS Press, Montvale, N.J.,
1972, pp. 617-625.

5. Chandy, K.M. and Hewes, J.E. File allocation in distributed
systems. Proc. International Symp. on Comp. Performance Modeling,
Measurement and Evaluation (March 1976), pp. 10-13.

6. Chu, W.W. Optimal file allocation in a multi-computer information
system. IEEE Trans. on Computers C-18 (1969), pp. 885-889.

7. Chu, W.W. Optimal file allocation in a computer network. In
Computer-Communications Networks, N. Abramson and F. Kuo (Eds.), Prentice-
Hall, Englewood Cliffs, N.J., 1973.

8. Eswaran, K.P. Placement of records in a file and file allocation
in a computer network. Information Processing 74, North-Holland Publishing
Co., Amsterdam, 1974, pp. 304-307.

9. Johnson, P.R. and Thomas, R.H. The Maintenance of Duplicate
Databases. RFC #677, NIC #31507, January 1975. (Available from ARPA
Network Information Center, Stanford Research Institute-Augmentation
Research Center, Menlo Park, CA.)

10. Levin, K.D. Organizing distributed data bases in computer networks.
Ph.D. dissertation, University of Pennsylvania (1974).

11. Morgan, H.L. and Levin, K.D. Optimal Program and Data Locations
in Computer Networks. Report 74-10-01, Dept. of Decision Sciences, The
Wharton School, University of Pennsylvania, 1974.

12. Spinetto, R.D. A facility location problem. SIAM Review 18 (1976),
pp. 294-295.

13. Urano, Y., Ono, K., and Inoue, S. Optimal design of distributed
networks. Second International Conf. on Comp. Comm. (1974), pp. 413-420.









Appendix 1 

In this appendix, we prove the analog of Casey's key lemma [4, 
p. 620] for the primary copy model (equation (2)). This lemma deals 
with a diamond-shaped subset of the cost graph, as shown below. 



                      (1,4)

            (1,2,4)           (1,3,4)

                  I = (1,2,3,4)

It says that if the cost of I is less than (or no more than) that of its 
two predecessors in this subgraph, then the cost of the top node can not 
be smaller than that of those predecessors. The lemma holds in general 
for any four nodes in this configuration. The specific labels on the 
allocations are designed to clarify their relationship. Just as in 
Casey's model, this lemma leads straightforwardly to the fact that costs 
are monotonic non-increasing along paths to the minimum. 
Lemma. Without loss of generality, let I = {1,2,...,r}.

    If C(I) \leq C(I~{k}), k = 2,3,

    then

    C(I~{k}) \leq C(I~{2,3}), k = 2,3.

(The notation I~{.} means the set I with the set of elements in the
brackets deleted from it.)






Proof.

    Let X_k = C(I~{k}) - C(I), and

        Y_k = C(I~{2,3}) - C(I~{k}).

We wish to show that if X_2 \geq 0 and X_3 \geq 0, then Y_2 \geq 0 and Y_3 \geq 0.
This will be true if Y_3 - X_2 \geq 0 and Y_2 - X_3 \geq 0.

Now

    X_2 = - \sigma_2 - \Psi d'_{12} + \sum_j \lambda_j ( \min_{k \in I~{2}} d_{jk} - \min_{k \in I} d_{jk} )

and

    Y_3 = - \sigma_2 - \Psi d'_{12} + \sum_j \lambda_j ( \min_{k \in I~{2,3}} d_{jk} - \min_{k \in I~{3}} d_{jk} ).

So

    Y_3 - X_2 = \sum_j \lambda_j ( \min_{k \in I~{2,3}} d_{jk} - \min_{k \in I~{3}} d_{jk} - \min_{k \in I~{2}} d_{jk} + \min_{k \in I} d_{jk} ).

The remainder of the proof follows exactly Casey's proof; we will
not copy it here. Essentially, one sees easily that the quantity
in parentheses is nonnegative for all j, and the inequality Y_2 - X_3 \geq 0
follows by permuting indices.






Appendix 2 

In this appendix we give formal proofs of the theorems on
exclusion and inclusion which were presented and discussed earlier. We
are assuming Casey's cost formula (equation 1) and are also assuming
that d_{jj} = d'_{jj} = 0 for all j. The cost of maintaining a copy at site i is
given by

    Z_i = \sigma_i + \sum_j \psi_j d'_{ji} .

Theorem 1. All optimal allocations will include site i if

    \lambda_i \min_{j \neq i} d_{ij} > Z_i .                              (A1)

Proof. Assume that site i satisfies (A1) and that there is an
optimal allocation I which does not include i. Let I' = I \cup {i}. By
straightforward computation,

    C(I') = C(I) + Z_i + S,

where

    S = \sum_{j=1}^{n} \lambda_j ( \min_{k \in I'} d_{jk} - \min_{k \in I} d_{jk} ).

Clearly each term in S is nonpositive, since taking a minimum over a
larger set cannot increase that minimum. Furthermore, since for j=i the
contribution to S is -\lambda_i \min_{k \in I} d_{ik}, it follows that

    S \leq - \lambda_i \min_{k \in I} d_{ik} \leq - \lambda_i \min_{j \neq i} d_{ij} .

Therefore (from (A1)), S + Z_i < 0, and hence C(I') < C(I). This contradicts
the assumed optimality of I and proves the theorem.

Theorem 2. No optimal allocation including more than one site
will include site i if

    Z_i > \sum_{j=1}^{n} \lambda_j ( \max_k d_{jk} - d_{ji} ) .            (A2)



Proof. Given an allocation I \neq \emptyset (the null allocation), let
I' = I \cup {i}. Just as in the proof of theorem 1,

    C(I') = C(I) + Z_i + S.

Making a term by term comparison, we see that the right side of inequality
(A2) is greater than or equal to (-S) provided that

    \max_k d_{jk} - d_{ji} + \min_{k \in I'} d_{jk} - \min_{k \in I} d_{jk} \geq 0        (A3)

for all j. Examining (A3), we see that if

    \min_{k \in I'} d_{jk} = d_{ji} ,

then the second and third terms cancel and (A3) is clearly satisfied.
Otherwise, the last two terms cancel and (A3) is again satisfied. It
then follows from (A2) that

    Z_i + S > 0.

Hence C(I') > C(I), and the theorem is proved.

Theorem 3. A site i cannot be included in any optimal allocation
if there exists another site k in the network such that

    Z_i - Z_k > \sum_{j=1}^{n} \lambda_j ( d_{jk} - d_{ji} )_+ ,           (A4)

where (f)_+ = f if f \geq 0, and (f)_+ = 0 if f < 0.

This theorem is most readily proved in two parts. We will
give these parts as two preliminary lemmas. As in the previous theorems,
let I be an arbitrary allocation (but not including i or k). Let
I' = I \cup {k} and I'' = I' \cup {i}. The first lemma states that if site i is
sufficiently costly, then adding site i to an allocation which already
includes k increases the total cost.






Lemma 1. If site i satisfies

    Z_i > \sum_{j=1}^{n} \lambda_j ( d_{jk} - d_{ji} )_+

for some site k in the network, then C(I'') > C(I').

Proof. By straightforward computation,

    C(I'') - C(I') = Z_i + \sum_{j=1}^{n} \lambda_j ( \min_{m \in I''} d_{jm} - \min_{m \in I'} d_{jm} ).

Similarly to the proof of theorem 2, the lemma is proved if, for all j,

    ( d_{jk} - d_{ji} )_+ + \min_{m \in I''} d_{jm} - \min_{m \in I'} d_{jm} \geq 0 .        (A5)

The inequality (A5) follows easily from consideration of two cases.

(i)  If \min_{m \in I''} d_{jm} \neq d_{ji}, then the last two terms cancel and clearly
     the left side is nonnegative.

(ii) If \min_{m \in I''} d_{jm} = d_{ji}, then (A5) reads

         d_{jk} - d_{ji} + d_{ji} - \min_{m \in I'} d_{jm} \geq 0 ,

     which again is clearly satisfied.

Our second lemma says that, if (A4) is satisfied, replacing
site k by site i in an allocation also increases the cost.

Lemma 2. Let I''' = I \cup {i}. If sites i and k satisfy inequality
(A4), then C(I''') > C(I').

Proof. By straightforward computation,

    C(I''') - C(I') = Z_i - Z_k + \sum_{j=1}^{n} \lambda_j ( \min_{m \in I'''} d_{jm} - \min_{m \in I'} d_{jm} ).

Similarly to the preceding lemma, the proof follows from showing that

    \sum_j \lambda_j ( d_{jk} - d_{ji} )_+ \geq \sum_j \lambda_j ( \min_{m \in I'} d_{jm} - \min_{m \in I'''} d_{jm} ).






And this inequality holds if, for all j, we have

    ( d_{jk} - d_{ji} )_+ - \min_{m \in I'} d_{jm} + \min_{m \in I'''} d_{jm} \geq 0 .       (A6)

Inequality (A6) can be verified by checking cases.

(i)   If \min_{m \in I''} d_{jm} \neq d_{ji} or d_{jk}, then the last two terms cancel and
      (A6) obviously holds. (Recall that I'' = I \cup {i,k}.)

(ii)  If \min_{m \in I''} d_{jm} = d_{ji}, then (A6) becomes

          d_{jk} - d_{ji} - \min_{m \in I'} d_{jm} + d_{ji} \geq 0 ,

      which holds since k \in I'.

(iii) If \min_{m \in I''} d_{jm} = d_{jk}, then the left side of (A6) becomes
      - d_{jk} + \min_{m \in I'''} d_{jm}, which must be nonnegative by the assumption
      defining this case.

Proof of Theorem 3. If inequality (A4) holds, then lemmas 1
and 2 both apply. Since lemma 2 says that an allocation containing i is
always improved by replacing i by k, and lemma 1 eliminates any allocation
containing both i and k from being optimal, nothing more is needed.






Appendix 3 

Proof of Strategy Rule 1

Under the assumptions given in the discussion of this rule in
the text, the cost of allocation (1) is

    C(1) = \psi_1 u + (n-1)\lambda N + n\lambda q + \sigma .

The cost of allocation (1,k), where k is any other site, is

    C(1,k) = \psi_1 (u + N) + \psi_1 u + (n-2)\lambda N + n\lambda q + 2\sigma .

Therefore

    C(1,k) - C(1) = \psi_1 (u + N) - \lambda N + \sigma .

Hence C(1,k) > C(1) if and only if

    \psi_1 (u + N) + \sigma > \lambda N ,                                  (A7)

which is the condition in strategy rule 1. Casey's theorem C2 can be
applied to show that allocation (1) is the only allocation containing
site 1 which could possibly be optimal.

To complete the proof, we must show

(i)  that allocation (1) is cheaper than any other single-site
     allocation, and

(ii) that adding a site to any other single-site allocation increases
     cost (so that Casey's theorem C2 may again be applied to show that the
     optimum can not lie below level 1 in the cost graph).

Now C(k) for k \neq 1 is \psi_1 (u + N) + \sigma + (n-1)\lambda N + n\lambda q. To verify
(i) we simply note that

    C(k) - C(1) = \psi_1 N .

To verify (ii), we compute C(k,j) for the case where neither k nor j is 1:

    C(k,j) = 2\psi_1 (u + N) + 2\sigma + (n-2)\lambda N + n\lambda q .

Therefore

    C(k,j) - C(k) = \psi_1 (u + N) + \sigma - \lambda N .

The condition for C(k) < C(k,j) is precisely that inequality (A7) be
satisfied, which is our hypothesis. Thus the result is proved.






UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

1.  Report Number: CAC Document Number 203
4.  Title (and Subtitle): Research in Network Data Management and
    Resource Sharing - Network File Allocation
5.  Type of Report & Period Covered: Research
6.  Performing Org. Report Number: CAC #203
7.  Authors: G.G. Belford, J.D. Day, E. Grapa, and P.M. Schwartz
8.  Contract or Grant Number: DCA100-75-C-0021
9.  Performing Organization Name and Address: Center for Advanced Computation,
    University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
11. Controlling Office Name and Address: Command and Control Technical Center,
    WWMCCS ADP Directorate, 11440 Isaac Newton Sq., N., Reston, VA 22090
12. Report Date: August 2, 1976
15. Security Class (of this report): UNCLASSIFIED
16. Distribution Statement (of this Report): Copies may be obtained from the
    National Technical Information Service, Springfield, Virginia 22151
17. Distribution Statement (of the abstract): No restriction on distribution
18. Supplementary Notes: None
19. Key Words: File allocation; Distributed data bases; Network resource sharing
20. Abstract: This report contains results to date of a study of file allocation
    in a network. Models and algorithms contained in the literature are surveyed.
    Some new models (for special situations and for update distribution through
    a primary site) are developed. Some new theorems for simplifying the
    computational problem are presented.

DD Form 1473, 1 Jan 73 (edition of 1 Nov 65 is obsolete)          UNCLASSIFIED


