c The person ch , r • ? It
s Pon s ibi e fo • ar ff'ng thi^ ^ '
lt » e *' D*7el mthdra *no* e ( hb /ary from
I* 16 ""'Vr.,,/ OC,,on ond moy » of «><>o fes Qre
*° ren,. re SOlf :_ .. re OSO„s
rene w co/l Te , " d '"nis so ( fro
APR 1 7 ROTD
*<l6i_
OJ096
Digitized by the Internet Archive
in 2012 with funding from
University of Illinois UrbanaChampaign
http://archive.org/details/researchinnetworOObelf
UNIVERSITY OF ILLINOIS AT URBANACHAMPAIGN
URBANA. ILLINOIS 61801
CAC Document Number 203
CCTCWAD Document Number 6506
Research in
Network Data Management and
Resource Sharing
Network File Allocation
August 2, 1976
31» IJbrwv
University or Illinois
irt UrbartaCfiam
CAC Document Number 203
CCTCWAD Document Number 6506
Research in
Network Data Management and
Resource Sharing
Network File Allocation
by
Geneva G. Belford
John D. Day
Enrique Grapa
Paul M. Schwartz
Prepared for the
Command and Control Technical Center
WWMCCS ADP Directorate
of the
Defense Communications Agency
Washington, D.C.
under contract
DCA10075C0021
Center for Advanced Computation
University of Illinois at Urbana Champaign
Urbana, Illinois 61801
August 2, 1976
Approved for release:
Peter A. Alsberg, Pr%uCl  jW*" ; E rtvgs ti
gator
Table of Contents
Page
Executive Summary 1
Background 1
Overview 1
Models 2
Finding the Optimal Allocation 6
Applications to Special Situations 7
Conclusions 8
Introduction 10
History of the Problem 10
The Present Work 13
Models 15
Introduction 15
The Casey Model 16
The Chu Model 18
The Levin Model 19
Model for a PrimaryCopy Synchronization Scheme 22
Constraints 24
Finding the Optimal Allocation 29
ZeroOne Programming 29
Searching the Cost Graph 29
Simplifying the Problem 38
Applications to Special Situations 47
Single Update Source 48
Use of a Local Partial Copy 50
Conclusions 54
Table of Contents (continued)
Page
References 56
Appendix 1 57
Appendix 2 59
Appendix 3 63
Executive Summary
Background
This document presents the results to date of a research study
of file allocation in a network. The study is part of a larger, compre
hensive investigation of problems in network data management and resource
sharing. The goal is to develop techniques applicable to the WorldWide
Military Command and Control System (WWMCCS) Intercomputer Network
(WIN). The work is supported by the WWMCCS ADP Directorate, Command and
Control Technical Center, of the Defense Communications Agency.
Overview
Computer networks can provide rapid access. to geographically
distant computer systems. This creates an opportunity to set up an
automated distributed data base. In a distributed data base, different
files can be located at different sites in the network. Also, copies of
files can be replicated at one or more backup sites. The WWMCCS data
base is an example of such a multifile, multicopy, distributed data
base.
The engineering of a distributed data base leads naturally to
the problem of network file allocation . This is the problem of placing
the files at the various sites in some optimal (or near optimal) manner.
In our study of the allocation problem we first develop a
model for how the files are used. The model carefully defines the file
usage process that we are trying to make efficient through an optimal
choice of allocation strategy. The next step is to define what is meant
by optimal. In most studies in the literature optimal is defined as
least cost. "Cost" is a broadly defined term that can include far more
than dollar considerations. For example, dollar costs can be multiplied
by weight factors to reflect the scarcity of certain resources or the
priority of special functions. There are alternatives to using least
cost. For example, it might be worthwhile to define the optimum as the
allocation which results in the minimum response time.
Once a model has been developed and it has been decided that
(say) cost is to be minimized, the next step is to develop a cost
formula . The cost formula will reflect the model and will include all
essential parameters. Additional formulas may have to be developed for
constraints in the case where the cost formula alone is not sufficient
to define the optimum. Examples of constraints that would be imposed in
a WWMCCS environment are the a priori assignment of a file to a site or
the elimination of a site from consideration because of security or
command requirements.
The last step is to solve the optimization problem defined by
the model, cost formula and constraints. In general, this is a straight
forward, but very large, computational problem.
In this document we examine several possible models and discuss
associated cost formulas and constraints. We then study the question of
solving the optimization problem and present several new techniques
which will in many cases reduce the problem to a size that is computa
tionally tractable. Finally, we look at two file allocation questions
for which the model is sufficiently simple that the problem may be
solved with essentially no computational work.
Models
We have identified three aspects of file handling that must be
included in any file allocation study. Basically, the file must be
1. stored,
2. updated, and
3. queried.
The precise impact on cost of these basic processes will depend on how
they are carried out. In the models which one usually sees discussed in
the file allocation literature, the cost formula is of the following
form:
Cost = Cost of Storage
+ Cost of Sending Queries to Nearest Copy
+ Cost of Sending Updates to All Copies
A formula of this type was introduced by Casey [4], It varies in some
respects from that used by Chu [6] in his pioneering work on file
allocation.
Notice that as the number of copies increases, the cost of
updates and storage will also increase. But the cost of querying
decreases as closer sites become available to receive the queries. We
therefore have a classic tradeoff, which makes the problem of finding
the cheapest allocation a nontrivial one. In case the reader feels that
not all relevant costs appear to be included in this formula, we remark
that only a slight modification is needed for this basic scheme to
reflect such additional features as the costs of processing queries and
updates. Furthermore, parameters can be interpreted broadly. The
"nearest" copy is not necessarily nearest in the geographical sense, but
the one that can respond to the query most cheaply.
We have, however, identified a more serious difficulty with
this scheme. The assumption is made that all sites generating updates
send them independently to all the file copies. In our parallel work on
synchronization [1] and resiliency [2], we have identified many diffi
culties with such a "broadcast" model for updating. We find that a
primarycopy scheme, in which all updates are first sent to a "primary"
has many advantages. The primary plays a role in ensuring both resil
iency and proper synchronization of the updates. For a primarycopy
scheme in which the primary subsequently broadcasts the updates, the
cost formula will be something like the following:
Cost = Cost of Storage
+ Cost of Sending Queries to Nearest Copy
+ Cost of Sending Updates to Primary
+ Cost for Primary to Send Updates to Copies
Surprisingly, although this formula looks like it would probably lead to
higher cost than the broadcast formula, this is not necessarily the
case. That is, in many situations lowercost file allocations can be
obtained for a primarycopy scheme than for a broadcast scheme.
There are a number of constraints which one might want to
impose on the problem. The four most useful ones that we find discussed
in the literature are:
1. Forbidden or prescribed sites . For historical or policy
reasons, a copy of a data base might be required at a given
site. Conversely, security considerations could forbid
certain locations.
2. Prescribed number of copies . Primarily for reasons of in
creased availability [3], it may be desirable to specify that
there be, say, two copies of the data base, or of certain
files. One might also specify a minimal number of copies.
3. Limited storage . It may be necessary to take into account the
limited storage capacity of certain sites.
A. Upper bound on response time . If the access process is
modeled in considerable detail, and if enough information on
system parameters is known, the expected time to access each
file may be computed. The constraint can be imposed that this
access time be less than some prescribed maximum value.
Of these four constraints, the first two appear to be the most
relevant to WWMCCS needs. Fortunately, they also turn out to be very
easy to impose. Straightforward conditions on number of copies or on
file locations reduce the options and hence tend to reduce the computa
tion necessary to find an optimum.
On the other hand, the third and fourth constraints actually
tend to increase the difficulty of the problem. This is because these
constraints couple together the impact of all the files in the system.
Storage limitations obviously must take into account all the files. In
responsetime constraints, the file locations interact in more subtle
ways, such as in increased network traffic and resulting delays.
Without such coupling, the best location for each file (or group of
files) can be determined independently. It turns out to be much easier
to solve a number of small allocation problems than one gigantic one.
One further point should be made about storage constraints.
It is not only difficult to impose them, but it may be unrealistic to do
so. If it is otherwise worthwhile to store a file at a given site,
there seems to be no good reason why additional storage capacity could
not be added to accommodate that file. That is, "storage cost" in the
cost formula can just as well reflect the cost of new storage capacity.
If one takes this point of view, then the only classical
constraint which is troublesome to impose is a bound on response time.
Further study may be useful to determine whether responsetime con
straints can be handled in some simple fashion  and also whether they
are really worthwhile in most environments. It may be that some more
straightforward constraint  for example, on network traffic between
various sites  would be easier to apply and at the same time have the
same effect on system performance as the responsetime constraint.
Finding the Optimal Allocation
If there are n sites in a network, and for each site there are
two possibilities  either it has a copy or it doesn't  then there are
2 different file allocations to be considered. (This number becomes
2 1 if we omit the null allocation, i.e. no copy.) Computationally,
there are no foolproof shortcuts to finding the optimal allocation. As
the network increases in size, the work to solve the allocation problem
will in general grow exponentially, doubling with every additional site.
Consider the following example. Casey [4] notes that a program run on
the 360/91 took less than 10 seconds to solve six different singlefile
allocation problems for a network of 19 sites. (This time included the
Fortran compilation.) Thus, if we assume that it takes about 2 seconds
to optimally allocate a single file in a 19site network, it will take
over an hour in a 30site network, about 48 days in a 40site network,
and about 136 years in a 50site network. These figures must then be
multiplied by the number of files to be allocated.
It therefore becomes imperative to reduce the computational
work in any way possible. Casey [4] and Urano et al. [13] have given
some theorems which help in this. Taking a different approach, we have
developed three theorems which determine a. priori that certain sites
should, or should not, be included in any optimal allocation. Each such
a_ priori determination reduces by a factor of two the size of the
remaining allocation problem. We therefore feel that our theorems can
be enormously useful in reducing the file allocation problem for large
networks down to manageable size.
Applications to Special Situations
Following another line of research, we have investigated
certain simple models and related allocation questions which may be
particularly useful in the WWMCCS environment. First we assumed that
network transmission cost is the same for all site pairs. However, we
added the costs of processing queries and updates to the model, and we
assumed that these may differ from site to site. Here are the questions
we investigated, with summaries of our results.
Question 1 . Suppose that all updates are generated at a
single site, but queries may originate anywhere in the network. What is
the best allocation?
In a homogeneous network, the optimal allocation often turns
out to be a single copy, located at the site generating the
updates. Specifically, this happens when it costs more to
maintain (store and update) a copy elsewhere than it does to
answer remote queries from the updating site. In a heteroge
neous network, more tradeoffs enter into the picture. We
present two simple strategy rules for this case.
Question 2 . Suppose that there are only two sites under
consideration  a remote site holding a copy of the data base and
responsible for updating it, and a local site generating queries for the
data base. Is it worthwhile to have a local data cache?
A simple inequality suffices to answer this question. For
example, if we save $0.10 per query by having the local cache,
and the cost of local storage is $100/tnonth, then the local
cache is cost effective for all query rates over (100/0. 1) /mo.
or about 1.4 per hour. If the savings per query is only
$0.01; i.e., decreases by a factor of 10, then the crossover
point increases by the same factor of 10, to 14 queries per
hour.
Conclusions
It is not very hard to formulate a file allocation problem in
any particular environment. It is only necessary to understand the file
handling process well enough to model it. For small networks  say
less than 20 sites  it is a simple matter to find the optimal alloca
tion, by brute force computation of the "costs" of all possible alloca
tions if necessary. Considerable headway has been made on reducing the
computational difficulties for larger networks. In addition, in parti
cular situations the problem may be simplified to such an extent that
general rules of thumb are available to yield good  if not optimal 
allocations.
In short, the theoretical problems of optimal file allocation
are well in hand. The data base administrator who wishes to distribute
files in a network must, however, address several practical problems.
What model best reflects his environment? What are the values of the
parameters? (For example, how should the various "costs" be computed?)
What are the usage patterns? (That is, what query and update volumes
are generated at the various sites?) The question of parameter values
is a particularly difficult one. In some respects, it is a problem of
measurement and statistical analysis. Only measurement can yield usage
patterns for a real environment. Statistical analysis can help to say
whether those patterns seem static, oscillatory, or have some longterm
trend. "Costs" can be even more difficult to determine, since they may
reflect not only "real" costs, but some subjective judgment as to the
value or scarcity of resources. The data base administrator must play
an active role in determining the "best" network file allocation. The
researcher can only supply the basic tools and instructions on their
use.
Introduction
Computer networks can provide rapid access to geographically
distant computer systems. This opens up a number of attractive possi
bilities. One possibility is to set up a distributed data base, in
which copies of different files (or segments of files) are located at
different sites in the network. An obvious question then arises:
where should the files be placed? This is the file allocation problem.
It is not surprising that this problem, being basic as well as obvious,
has a longer history than any other in distributed data management.
History of the Problem
Optimal allocation by zeroone programming . The earliest work
on the network file allocation problem was done by Chu [6], Chu states
the problem as follows: "Given a number of computers that process
common information files, how can we allocate files so that the allo
cation yields minimum overall operating costs subject to the following
constraints: (1) The expected time to access each file is less than a
given bound, and (2) the amount of storage needed at each computer does
not exceed the available storage capacity." Variables X.. to describe
the allocation are introduced; X. .=1 if the ith file is stored in the
ith computer and X. .=0 otherwise. In order to apply constraint (1), Chu
develops a reasonably comprehensive formula for access time, including
queueing delays and the effect of intercomputer traffic congestion. The
overall cost expression to be minimized includes costs for storage as
well as for transmission of queries and updates. Since the variables to
be determined can take on only the values zero or one, the optimal
allocation may be found as the solution to a socalled nonlinear zero
one programming problem. (In fact Chu notes that the problem may be
10
reduced to a linear one, which may be solved by straightforward
techniques. )
In a later paper, Chu [7] discusses how an availability
constraint can be added to the model. The main idea is to determine in
advance (from simple assumptions on failure probabilities) how many
redundant copies of a file are required to achieve a desired level of
availability. This number is then inserted into the model. The scheme
otherwise remains unchanged, since Chu assumed in his earlier work that
the number of copies was given a^ priori . The difficulty with using
zeroone programming to solve the file allocation problem is that it is
so timeconsuming as to seem impractical for a large file system.
Optimal and suboptimal allocation by search procedures .
Believing that zeroone programming is too costly an approach, Casey [4]
developed an efficient search procedure for finding a minimalcost
solution, as well as heuristic methods for finding acceptably good
solutions. Casey's model differs in some respects from Chu's. One
important difference is that Casey lets the number of copies of a file,
as well as its locations, be variables. Notice that as the number of
copies of a file increases, the expense of querying the file decreases,
but storage and updating costs increase. Thus Casey's approach to
optimization strikes a balance between these two opposing trends. A
disadvantage, of course, is that the minimization may not yield enough
copies for reliability.
Casey applied his optimization algorithm to real data for the
ARPA network (when it had 19 sites) and thus showed the process feasible
for networks of moderate size. His experiments indicated that when
update traffic equals query traffic, it is most efficient to store the
11
file at a central node. As query traffic increases relative to updates,
storage at multiple nodes is indicated. These results are intuitively
reasonable. Although one always expects several local minima in a
complex, multivariable minimization problem, it is noteworthy that
Casey's experiments reveal extemely large numbers of them (over 100 in
some cases) . It is clear that any optimal allocation procedure must
take care to avoid being trapped in such a local minimum. It is possi
ble, of course, that a local minimum may provide an acceptable suboptimal
solution to the problem.
Morgan and Levin have criticized both Chu's and Casey's models
on the basis that they do not allow for dependencies between files and
programs [10,11]. That is, a program (which is itself a file) may need
to make use of one or more data files. The fact that these files must
interact with one another is not taken into account in the older models
which assume file independence. Morgan and Levin also point out that in
a heterogeneous network it may not always be possible to store a particular
file at an arbitrary site. Their model takes into account this type of
constraint, which also includes the possibility that the allocation may
be restricted by security considerations. The algorithm used to solve
the optimization problem is a systematic search procedure, along the
lines suggested by Casey. Dynamic features were also introduced; that
is, costs to change the file allocation were considered and balanced
against savings expected from the improved allocation.
Simplifying the problem . Model development has therefore been
extensive and has produced models of considerable sophistication. More
recent research in file allocation has emphasized ways to simplify the
problem. This is an important consideration, since in general the work
12
to solve the file allocation problem increases as 2 , where n is the
number of sites in the network.
Urano et al. [13] have developed theorems which allow a net
work to be partitioned into what one might call "proximity groups" of
sites. At most one site from each proximity group can appear in any
optimal allocation. Roughly speaking, the costs of querying the various
members of a proximity group (from any other site in the network) are
all about the same. If querying costs are proportional to the distance
the query must be shipped across the network, then members of a proximity
group are physically close. For example, in a nationwide network, a
group of sites in, say, the Los Angeles area might form such a proximity
group.
Another step towards making the problem more tractable is the
very recent work of Chandy and Hewes [5]. They first reformulate the
problem so that it is linear, at the cost of adding additional variables
and constraints. They then ignore the requirement that the variables
X. . be zero or one and use a standard linear programming approach to
find a minimum. If the minimizing X..'s turn out to have (as very often
happens) zeroone values, then the problem is solved. If not, at least
a possibly useful lower bound to the true minimum cost is obtained, and
a linear integerprogramming approach (with all the troubles involved
for large numbers of constraints) can be tried. More detailed work
needs to be done here, but the approach is very promising.
The Present Work
We here report on the work that we have carried out on file
allocation. In the next section we discuss the different models in
varying degrees of detail. In particular, Casey's model, and a varia
tion of it which we developed for a primarycopy environment, are
13
discussed fairly completely. We felt that Casey's model has the sim
plicity and flexibility needed to form a basis for our own work. We
also list various constraints which might be imposed, and assess their
importance and the relative ease with which they might be applied.
In the following section we take up the very difficult problem
of finding the optimal allocation and present some contributions that we
have made.
In the two final sections of this report we briefly discuss
applications to several special cases for which simply analyzed models
are possible, and we summarize some conclusions we have drawn from our
work to date.
14
Models
Introduction
In order to make a rational decision, based on quantitative
data, on where to put files in a network, it is first necessary to
construct a model. The model must include those aspects of file han
dling which affect cost, response time, and any other features of the
distributed data system which the file allocator may wish to take into
account. Once the process of file handling is defined by a model, it is
possible to identify the parameters which enter into cost, response
time, etc., and to generate the necessary formulas. The reader should
be aware of the fact that a model inevitably is a simplified representa
tion of the real world. A good model should be simple. It should
include the key features of the process to be modeled. But at the same
time, the level of detail should not be such as to preclude its applica
tion to slightly varying situations. For example, one should be able to
modify parameter definitions. One should also be able to include more
detail by developing submodels to compute the parameters in the basic
model.
With these thoughts in mind, we identify three aspects of file
handling which must be included in any file allocation study. The file
must be
1. stored,
2. updated, and
3. queried.
Effects of these basic processes must be included in any cost formula.
Their precise impact on cost will depend on how the processes are
15
carried out. Response time is, of course, a function of the querying
process. It will also depend on processor and network characteristics
and loads. The possibility of modeling at an ever increasing level of
detail immediately arises. In this report we will restrict ourselves to
flexible, high level models. Where appropriate, we will point out how
lower level detail may enter into the parameters. The four models that
we will be considering in this section are
1. the Casey model, which is perhaps the one most commonly
discussed in the literature,
2. the Chu model, which was the first network file allocation
model to appear in the literature,
3. the Levin model, an ambitious effort to include interaction
between programs and files, and
4. a new model (really a variant of Casey's) that we have
developed for use in environments that apply primary copy
synchronization schemes.
After discussing parameters and cost formulas for the models, we will
look briefly at formulas for various constraints which may be imposed.
The Casey Model
This model, as presented in [4], may be briefly described as
follows. For simplicity, only a single file is considered. Any site
may query or update the file. Updates must be sent to all file copies.
Queries are sent to the "closest" one, where "closest" can be broadly
defined in terms of lowest cost.
Cost function . The cost computed is essentially total oper
ating cost over some chosen time period (day, week, month). The formula
contains three terms: one for storage, one for updating, and one for
querying. The parameters in the model are as follows:
16
I = index set of sites with a copy of the file
n = number of sites in the network
V . = update load originating at site j
A . = query load originating at site j
d., = cost of communication of one query unit from site j to site
d' = cost of communication of one update unit from site j to
jk
site k
o = storage cost of file at site k
Storage costs and query loads must be those for the chosen time period.
An allocation is specified by I, the index set of sites with a copy.
That is, if sites numbered 1, 4, and 5 have a copy, I = {1,4,5}.* The
cost C(I) of a given allocation I is then
n
C(I) = E [ Z V.d' + X. min d..] + Z a (1)
j=l kel J JlC 3 kel JR kel
Since this cost formula and the notation defined above will be used
frequently throughout this report, this page is printed on light green
stock so that the reader may locate it easily.
Notice that as the number of copies increases, the cost of
storage and of updating also increases. But the cost of querying
decreases as closer sites become available to send the queries to. We
therefore have here a classic tradeoff, which makes the problem of
finding the minimalcost allocation a nontrivial one. A complicating
factor, from the computational point of view, is the appearance in the
* Although brackets are most commonly used to denote sets in mathematical
discussions, we will sometimes use parentheses to indicate an allocation
index set. Thus, we might alternatively write I=(l,4,5). We will also
refer to I (or to (1,4,5)) simply as an "allocation". This seems easier
than to continue to talk about "index sets".
17
formula of min d.. . This makes the problem seriously nonlinear. The
jk
problem of actually determining the minimal cost allocation will be
discussed in some detail in a later section.
Extension to multiple files . The cost formula (1) is for a
single file. If more than one file is to be allocated, the overall cost
is readily obtained by adding the costs of the individual files. This
is possible since the model assumes that there is no interaction between
usage of the various files. Thus, to minimize the overall cost, one
need only find the optimal (minimal cost) allocation for each individual
file. We shall see that when constraints are added, or when file usage
is not independent, this separation of the multifile problem into a set
of singlefile problems may no longer be valid.
The Chu Model
Like Casey, Chu [6,7] assumes that updates must be sent to all
copies, while queries go to just one. However, Chu assumes that, if
there are r copies of a file, then a fraction 1/r of the queries for
that file are sent to each copy. Alternatively, one may think of Chu's
approach as averaging over the cost of sending queries randomly to any
site which has a copy. Furthermore, Chu assumes that the number r of
copies is fixed in advance, generally by considerations of availability.
(Relaxing this assumption would, of course, simply require one to con
sider in turn the cases r=l,2,...,n (where n is the number of sites in
the network) . )
Chu has some details in his model which appear to make it less
flexible than Casey's. For example, he envisions a "transaction" to a
file as resulting in a flow of u units of data from the remote site to
the user site. Such a "transaction" corresponds roughly to a query in
18
Casey's model. But then updates are handled by assuming that a certain
fraction of these transactions result in the shipment of the same u data
units back to the remote site. That is, Chu models the updating process
as one of sending for the file (or relevant portion thereof) , modifying
it, and then shipping it back. This makes it easy for Chu to apply his
model to, for example, a multiprocessor with memory hierarchy. However,
it is unlikely that updates would be performed in this way in a network
environment with a distributed data management system.
There is another limiting feature of Chu's model. In Casey's
model, it is easy to include processing costs by adding them into the
factors d.. or d'... Thus, if we wish to include in Casey's model the
cost of querying a local copy, we need only let d.. be that cost. Chu's
cost formulation, however, has a builtin limitation that makes local
query cost zero. (We will not write down Chu's cost formula here; the
interested reader may consult his paper.)
Chu's formulation is of the multiplefile problem. However,
his cost function readily separates into a sum of terms, each of them
being the cost associated with a single file.
The Levin Model
In his thesis [10], Levin argues that "in order to process a
given transaction both the relevant program and the relevant file must
be accessed." Thus, a user request originating at one site may have to
be sent to a second site (the one containing the application program) ,
which in turn accesses the needed files. Since the optimal location of
a file then depends on the locations of the programs which access it,
Levin feels that "an optimizing model for file assignments should con
sider the location of the relevant programs and optimize their assign
ments simultaneously." This is an attractive idea. However, it does
19
lead to a model of considerable complexity. In particular, the assump
tion of fileusage independence is no longer valid when some of those
"files" are actually programs which access other files.
Levin's cost formula contains the usual three parts  costs of
storage, updates, and queries. However, because of the inclusion of
programs and because updates and queries are transmitted through an
intermediate program, the cost function ends up with six terms:
Cost = Cost of transmitting queries from initiating sites to the
programs
+ Cost of transmitting updates from initiating sites to the
programs
+ Cost of transmitting queries from programs to files
+ Cost of transmitting updates from programs to files
+ Storage cost of programs
+ Storage cost of files.
In any model there seems to be a problem with deciding where
to transmit the queries. Recall that Casey sends them to the copy that
is closest in some sense, while Chu sends them randomly to any site with
a copy. Levin's approach is to introduce a new set of variables to
describe the "routing discipline"; that is, to which program copy, and
from there to which file copy, transactions from a given site are
shipped. The values of these routing variables are then to be deter
mined, along with the locations of files and programs, in the optimi
zation process. This approach increases enormously the number of
unknown variables in the problem. Furthermore, since one program may
access several files  and one file may be accessed by several programs 
locations of all files and programs are interdependent, and it is no
20
longer possible to simplify the problem by decomposing it into a set of
singlefile allocations.
In order to carry out such a decomposition, and hence make the
optimization computationally tractable, Levin makes several simplifying
assumptions. In particular, he assumes that storage costs of programs
are zero. In a very real sense this eliminates the program location
problem, since it then becomes possible to freely place programs wher
ever it seems useful (and is possible) to do so. Once programs are
located, one may readily optimize the routing discipline for any given
file allocation. The file allocation itself is carried out by using
Casey's suggested search procedure. (This procedure will be discussed
later.) It is not surprising that Casey's solution method is ultimately
applicable, since, by adding assumptions to make the problem computa
tionally tractable, Levin has simplified his model to the point where it
looks very much like Casey's. In fact, Levin's simplified model can be
obtained immediately from Casey's model if some precomputing is done to
reflect the routing. Each transmission from user to file will have to
touch one of the preestablished program sites. Thus, if neither the
user site nor the file site contains a copy of the needed program, then
Casey's d.. should be changed to d„ + d, . , where k is the program site
ij lk kj
which minimizes that sum. (If copies of necessary programs are at all
sites, then Casey's model needs no modification. In a heterogeneous
network, however, it may not be reasonable to assume an unrestricted
allocation of programs.)
As we have noted, Levin's basic model, before simplifications
are introduced, is too complicated to be computationally tractable.
However, Levin goes on to map out elaborate extensions of his model. He
first notes that the basic model is deterministic and static. That is,
21
access patterns are assumed to be known, and costs incurred through
dynamically changing the allocation are not taken into account. He
therefore indicates (1) how the deterministic assumption may be relaxed
by treating access patterns as random variables and using forecasting
theory, and (2) how the static assumption may be relaxed by including
the costs of changing allocations in a multiperiod cost function.
These are interesting ideas which may be of some importance in future
work on dynamic, selforganizing, distributed data systems. In this
report, however, we will restrict ourselves to examining the file
allocation problem in a static, deterministic environment.
Model for a PrimaryCopy Synchronization Scheme
In examining the literature on network file allocation, we
find Casey's model the most attractive one. It has a simplicity and
flexibility which makes it more readily applicable to a wide variety
of situations. In addition, its simplicity provides clear insights
into the key features of the problem, which are easily obscured in the
more complicated models. We have therefore decided to build on Casey's
model in our file allocation work.
When we examine Casey's model in the light of our work on
update synchronization [1], we find one serious problem. That model 
like all others in the current literature  assumes that the site where
the update originates broadcasts the update to all sites holding a copy
of the file. Thus the update message flow corresponds to that of
Johnson's scheme for synchronization [9], In that scheme, each site
adds a timestamp (its own clock time) to the update and sends it to the
copies. We have identified certain disadvantages with such a scheme
which led us to study primarycopy schemes. In primarycopy schemes all
22
updates are sent to a distinguished, or primary, site, which subsequently
broadcasts them. We may summarize the difference which this makes in
the cost formula as follows.
Casey (broadcast) model :
Cost = Storage
+ Transmission of Queries to Nearest Copy
+ Transmission of Updates to All Copies.
Primarycopy model :
Cost = Storage
+ Transmission of Queries to Nearest Copy
+ Transmission of Updates to Primary
+ Broadcast of Updates by Primary.
The first two terms are the same; it is only the handling of updates
that changes. Assuming that the primary is at site 1, we may write
(using Casey's notation) the singlefile cost function for the primary
copy model as
n n
C(I) 2 a + E X. min d + I V d' + f I d' (2)
kel j=l J kel Jk j=l J Ji kel ik
n
In this formula, ¥ ■ £ ¥. ; i.e., ¥ is the total update load which
j=l 2
the primary must broadcast.
It may appear to the reader that this formula, since it looks
longer, yields higher costs than Casey's formula. In looking at examples,
we find that this is not true. That is, a primary copy scheme may be
cheaper than a timestamp scheme for synchronization. Intuitively, this
occurs because a site generating updates need send them only to the
primary; and it may be cheaper for the primary to transmit them to the
23
other copies than it would have been for the updating site to send them
directly. (For example, when the parameters d'., are proportional to
distance, the primary site may be closer to the other sites in the
allocation than is the updating site.) This interesting result will be
discussed in more detail in a later report.
As in Casey's model, multiple files are handled by summing the
cost functions for the individual files.
Constraints
It is not always reasonable to minimize the cost function over
all possible allocations of copies to sites. In this section we look at
various kinds of constraints which one might want to impose.
Forbidden or prescribed sites . The simplest type of constraint
is to require that certain sites have a copy of the file. In the WWMCCS
environment, certain data bases are currently held at associated sites
as a matter of policy and/or convenience. It would be unreasonable to
throw such sites into a mindless fileallocation algorithm. In such
situations, the algorithm would only be applied to determine the best
locations for extra copies. The cost formula would, of course, include
the information that a copy is at some prescribed site, since this
affects the query cost. Similarly, it may not be feasible to consider
putting certain data at some sites. This could be for reasons of
security, local system overload, etc. We can then simply omit such
sites when we apply the fileallocation procedure. However, if a
forbidden site may query the data, this information must be included
in the cost formula.
Prescribed number of copies . As was previously noted, Chu's
cost function includes the number of copies as a parameter. Chu [7]
suggests choosing this number on the basis of availability. A simple
24
model can be used to determine the number of copies that are needed to
yield a given level of availability. In many cases, two copies are
adequate. If one site is prescribed, then one need only determine which
of the (n1) sites remaining is the best location for the extra copy.
Given the cost function, it is a simple matter to carry out this
determination.
One could also assume that availability considerations set a
lower bound on the number of copies needed. This will increase the
number of possible allocations to be investigated, but may lead to a
lowercost allocation. (Adding sites can decrease total cost when query
transmission costs are large relative to storage and update costs.)
There is one problem with letting a simplified availability
model determine the number of copies. A correct, complete availability
analysis for a real network would involve such things as the detailed
network topology, a combinatorial analysis of relevant "cutsets" and
their probability of failure, etc. The number of copies r that are
needed to satisfy an availability requirement would then in general vary
with copy location. And availability would not, of course, be indepen
dent of the user's site. Some sort of average or minimal availability
would have to be used in the constraint. One could envisage an itera
tive process. A rough estimate of availability could be used to deter
mine a number of copies. An allocation could then be determined and
expected availability computed for that specific allocation. If this
availability is too low, the number of copies could be increased. This
is an attractive approach which, as far as we know, has not been previously
proposed.
It may also be that one wants to set a maximum number of
copies. However, the most obvious basis for doing this would be that of
25
cost, and the allocation optimization process automatically selects the
most economical solution. Thus, it is not necessary (though it may save
computation time) to announce a priori that all allocations of more than
four copies (say) are "too expensive". Indeed, without extensive
computations of multicopy allocation costs it may be impossible to
verify such a pronouncement.
Storage constraint . It may be necessary to take into account
limited storage capacity at the sites. That is, it may not be physically
possible to store all the files that one might want to at a given site.
In his model, Chu introduces storage constraints of the form
T. L < S (i = 1,2, ..., n),
jeF 1 J
where F. is the set of files to be stored at site i, L. is the length of
the jth file, and S. is the storage capacity at site i.
Adding such a constraint has an unfortunate computational
effect. It makes it necessary to consider the allocations of all of the
files simultaneously. It is no longer possible to allocate each file in
turn, since the ultimate total file distribution is likely to violate
one or more storage constraints.
Including storage constraints may seem to be a necessity.
However, this is not necessarily the case. It may be more reasonable to
assume that additional storage space can be purchased as needed. The
cost of storing a file at a given site can reflect "new storage" cost as
well as "old storage" cost. Indeed, unless some sites have a lot of
unused storage capacity which management elects to think of as "free",
there seems to be no good reason to consider newly purchased storage
capacity to have a cost basis different from the old.
26
Upper bound on responsetime . Chu [6,7] also Includes in his
model the condition that "the expected time to access each file is less
than a given bound." To compute this expected time, Chu makes a number
of simplifying assumptions. The access process is modeled as follows.
1. The user request, after waiting in a queue for a network
channel, is transmitted to a remote site that holds a file
copy.
2. The remote site accesses the file.
3. The response is put in the queue for a network channel, and is
ultimately sent back to the user.
Chu assumes that request messages are relatively short, so
that they can be given a high priority and their queueing delay can be
neglected. With this and other simplifications, Chu can treat the
entire access process as a straightforward singleserver queueing
system. Response time becomes a function of line capacity, line
traffic, and request rates to the various sites. Request rates to sites
and line traffic are in turn dependent upon the file allocation and the
request rates to the files.
The important thing to notice about a responsetime constraint
is that it inherently includes all of the files in the system and their
locations. This is because line traffic and response delay due to line
congestion must take into account transactions involving all of the
files. Not being able to decompose the multifile problem into a set of
singlefile problems increases the computational intractability of the
optimization.
Other constraints . The four types of constraints discussed
above are those which one sees in the literature. There are other
27
possibilities which come to mind. For example, limited channel band
width between two sites may make it necessary to set an upper bound on
the traffic between the two sites. Although limited bandwidth is taken
into account in Chu's responsetime model, a more direct imposition of a
constraint on traffic might be simpler and more appropriate in some
situations.
As another example, one might sometimes want to take into
account the limited processing capacity at certain sites. Including
terms for processing costs in the cost function can help to eliminate
the need for putting a constraint on processing time. For example, the
"cost" of sending queries or updates to an already overloaded site could
be set at such a large figure that no additional constraint is needed to
effectively eliminate that site from the allocation. To take into
account more subtle effects of processing capacity limitations, one
would have to develop a realistic model for the processing needed for
the query and update operations.
28
Finding the Optimal Allocation
ZeroOne Programming
In Chu's formulation [6,7] of the file allocation problem, the
variables defining an allocation are
_ ( 1 if file j is at site i,
ij 0 otherwise.
The cost function is then readily written as a quadratic function of
these variables. By increasing the number of variables in the problem,
Chu is able to reformulate his cost function as a linear combination of
variables that take on the values zero or one. Minimizing the cost then
amounts to solving a zeroone linear programming problem. In general,
solving such problems is computationally very expensive. On many
occasions heuristics are used, sacrificing optimality for computational
efficiency. For example, Levin [10] points out that Chu's approach
would produce about 9000 zeroone variables with 18000 constraints for
an ARPAlike network with 30 sites and at least 10 files. Although
there are computational difficulties for large systems, it should be
emphasized that this approach is mathematically straightforward. In
addition, it is relatively easy to formulate constraints in terms of the
zeroone variables. The constraints are also handled automatically by
algorithms to solve the zeroone programming problem. This is how Chu
may readily add responsetime and storagecapacity constraints to his
model.
Searching the Cost Graph
Casey's cost graph . Finding the optimal allocation can be
simply thought of as a process of computing and comparing the costs of
all allowable allocations. The set of possible allocations for a
29
(3,4)
(2,3,4)
(1,2,3,4)
Figure 1
Graph of all possible allocations
among four sites.
30
singlefile, unconstrained problem can be pictured in graphical form.
Figure 1 shows such a graph for foursite allocations. For symmetry,
the top node denotes the null allocation (0) , in which no sites have a
copy. At level i in the graph we have all the allocations of i copies
(each node representing an allocation). A branch is drawn from a node
at level i to one at level i+1 if and only if the latter allocation can
be obtained by adding a single copy to the former. If we associate the
cost of each allocation with each node, we have what Casey [4] calls the
cost graph . In the general nsite graph, the ith level contains (.)
nodes. The total number of nodes in the graph (including the null
allocation) is 2 . Each site added to the network therefore doubles the
number of costs that must be computed, in general.
Search procedures . Casey [4] suggests that the minimum cost
node be located by carrying out a systematic search through the cost
graph. In Casey's words, "A computer algorithm can be implemented in
several different ways to select file nodes [sites] one at a time up to
the optimum configuration. One approach is to follow all paths in
parallel through the cost graph, stepping one level per iteration. This
method is computationally efficient, but may require a great deal of
storage in a large problem. Alternatively, a program can be written to
trace systematically one path at a time. Such a technique uses less
storage, but may require redundant calculations since many different
paths intersect each vertex". In general we agree with this statement,
but we dispute the last sentence. We believe that no node must be
visited more than once in a "path at a time" approach. We will shortly
defend this belief.
First, let us make more specific the memory requirements of
the levelbylevel approach. At least ( n . ) memory locations (for n
31
even) are needed to save the cost values of the previous level in the
cost graph. (This maximum occurs at the middle level (i=n/2) in the
cost graph.) Some additional working space is needed for the evaluation
of the current level. Note that ( *)_) is 184,756 for n=20 and 155,117,520
n/2
for n=30. As networks increase in size, the memory requirement of a
levelbylevel search quickly becomes prohibitive. This makes it
important to devise a pathatatime search in which effort is not
wasted.
Theorems to truncate the search . In the worst possible case 
when the optimal allocation is to put a copy everywhere  all nodes of
the cost graph must be evaluated in any search procedure. Usually,
however, one can truncate the search. Casey presents the following two
theorems for doing so. The reader should refer back to the description
of Casey's model for definitions of the notation.
Theorem CI. Let d. =d' for all j,k. If for some integer r
jk jk J ' 6
(<n) , V .>\ .1 (r1) for all j, then any rsite file assignment is more
costly than the optimal onesite assignment.
Theorem C2. Suppose assignment I is optimal. Then along any
path in the cost graph from the null node to the node corresponding to
I, cost is monotonically non increasing from one node to the next.
Actually Casey's Theorem C2 is somewhat more general and is
not stated in terms of the notions of "paths" and "nodes". These
notions belong to the cost graph visualization of the optimal allocation
problem. (See figure 1.)
Theorem CI gives us an upper bound (r) on the level beyond
which there is no need to search; i.e. a gross stopping criterion. To
see how this theorem may reduce our search, notice that V./A. is the
J J
32
ratio of updates to queries generated at site j . If for all sites the
volume of updates is, say, more than half that of queries, then the
inequality in theorem CI holds with r=3, and we know that we need not
search below level 2 in the cost graph.
Theorem C2 gives us a more precise stopping criterion. If,
while following a path down the cost graph, we find that the cost
increases, then no further search along this path will be of any use.
An efficient path search . We have devised a scheme for
searching the cost graph one path at a time, but without revisiting
nodes. The reader can probably understand the algorithm best from an
example. The search order for four sites goes:
(1), (1,2), (1,2,3), (1,2,3,4), (1,2,4),
(1,3), (1,3,4), (1,4), (2), (2,3), ..., (4).
More precisely, the following algorithm will print out the search order
for n sites. Let V(i) be an array with index i. Let the instruction
"print" be such that it will print the values of V(l) to V(k)
contiguously.
1. V(l)=l; k=l; print.
2. If V(k)#n, then k=k+l; V(k)=V(kl)+l; print; go to 2.
3. Else k=kl;
If k=0, then halt.
Else V(k)=V(k)+l; print; go to 2.
What we have actually done is to convert the cost graph
(figure 1) to a tree (figure 2). Our algorithm follows a preorder
search of this tree.
As the reader will observe, there are many paths that lead to
the node (1,2,3,4) in figure 1, but only one in figure 2. There remains
33
U
0)
ll •
3 x:
oo u
•H W
U4 CO
CO
•H T3
4J U
o
cu a)
•h a
CO H
I <4t
3
o
O
0)
41 CU
C M
CU 4J
e
cu cO
60
c o
03
U
u
CO
CU
an important question. Is it possible that by following only one path
to a given node we could fail to recognize the optimum? The answer to
this question is "No". In order for a node to have minimum cost, it
must, by definition of minimum, have a cost no greater than that of any
other node, in particular than that of any node in the previous level.
This fact, in conjunction with Casey's theorem C2, produces the condi
tion that cost is nonincreasing along all paths leading to the optimum
node. Thus, our simplified tree of figure 2 contains one such path that
will lead to the optimum. We therefore conclude that our restricted
search is sufficient to detect the optimum.
As the reader will realize, our algorithm requires in the
worst case (i.e. along the longest path) only one memory location per
level, or n memory locations. In implementing the algorithm, we find
that some efficiency can be gained by increasing the memory requirement
2
to 0(n ). However, it is clear that we have achieved our goal. Memory
requirements are much more reasonable than that of the levelbylevel
search. But at the same time we have preserved computational efficiency
by ensuring that each node cost is computed only once.
Searching with constraints . The search algorithms described
above have been for the unconstrained cost minimization problem. If
constraints are imposed on the singlefile problem, the effect is to cut
down on the number of nodes of the cost tree which must be included in
the search. This may or may not be helpful. Casey's theorem C2, for
example, applies only to the overall minimum of the cost graph. More
extensive computing, therefore, may be necessary to identify the minimum
cost over a restricted set.
If constraints (e.g., response time or storage) are imposed
which couple the allocations of several files, the cost graph expands to
35
include all possible allocations of all files. This expanded graph has
2 nodes, where f is the number of different files in the system. One
arrives at this number of nodes by noting that (in Chu's notation) the
problem has nf variables X . , each of which may take on the value zero
or one. Casey's theorems for truncating the search are clearly no
longer applicable. It is hard to see how one could avoid the tedious
process of checking all allocations to see which ones satisfy the con
straints, and evaluating the costs of all that do. The alternative is
to use Chu's formulation and a zeroone programming algorithm. It
appears to us that Chu's approach, being straightforward, is the best
one to take if interactions between the allocations of different files
must be taken into account.
The best allocation for a primary copy scheme . The search
procedure needed to find the optimum allocation for the primary copy
model is much like that for a constrained problem in which a copy is
prescribed at one site. That is, given a location for the primary, we
search that portion of the cost graph covering all allocations that
include the primary site. This determines the best allocation with that
particular site for the primary. By trying each site in turn as the
primary, we can find the lowest cost of all and hence the best site
for the primary.
Local minima . In many cases, it is not feasible to search a
large part of the cost graph for a true global minimum. One must con
tent oneself with a local minimum  i.e., an allocation which is less
(or no more) costly than any "neighboring" allocation.
As one such procedure for finding a local minimum we have
tried the following. Let the sites be numbered 1,2,3,... in any arbi
trary order. (In the next subsection, we will discuss a good strategy
36
for choosing this order.) Begin by computing the cost of allocation
(1); i.e. one copy at site 1. Then proceed cyclically through the list
of sites, testing the effect of a change in allocation state (i.e.,
whether or not the site has a copy) of each site in turn. If the
cost is decreased, make the change; otherwise proceed to the next site.
Continue in this way until a local minimum is reached. A local minimum
is found if we are able to make a complete cycle of the list of sites
without any cost improvement. That is, if we have a node which repre
sents a local minimum, then all adjacent nodes will have a higher cost.
An example might be helpful. After computing the cost C(l) of
allocation (1), we next compute C(l,2). If this is larger than C(l), we
next try C(l,3), etc. Suppose that C(l,4) is less than C(l) . We then
use (1,4) as a new prospective local minimum, and test the neighboring
allocations (1,4,5), (1,4,6) ... (l,4,n), (4), (1,2,4), (1,3,4). If
none of these has a lower cost than (1,4), then (1,4) is a local mini
mum. It may help to visualize the process to note that neighboring
nodes are defined in this algorithm as those joined to the given node by
a branch of the cost graph. (See figure 1.)
The reader might feel that, for example, allocation (1,3)
should be considered as neighboring (1,4). But notice that these differ
by two changes in allocation state  deletion of the copy at 3 and
addition of the copy at 4. Thus these allocations are said to be a
distance 2 from each other. It is quite possible to define a local
minimum as being of lower cost than all allocations within distance 2 
or possibly within a farther distance. This simply increases the search
required to locate and verify a local minimum. The usual sort of trade
off is involved here. By increasing the cost of the search, one can
generally obtain a lower cost allocation. Indeed, we find that for many
37
examples a local optimum that has lower cost than all allocations within
distance 2 is actually a global optimum. Chandy and Hewes [5] also
report that such a "2distance heuristic ... is a near optimal
heuristic."
Simplifying the Problem
No matter how efficient a search procedure is devised, it
remains true in general that the problem of finding the optimum grows
like 2 , where n is the number of sites. Indeed, a rigorous proof to
essentially this effect has recently been presented by Eswaran [8].
(Technically, Eswaran proved the problem "polynomial complete".) Let us
see what this really means in terms of computation time. Casey [4]
notes that a program run in the 360/91 took less than 10 seconds to
solve six different optimal allocation problems for a network of 19
sites. (This time included the Fortran compilation.) Thus, if we
assume that it takes about 2 seconds to optimally allocate a file in a
19site network, and every additional site doubles the time, then it
will take over an hour in a 30site network, about 48 days in a 40site
network, and about 136 years in a 50site network. It is therefore
worthwhile to take other approaches to trying to cut down on the amount
of work.
In one such approach, Urano et al. [13] have shown that the
sites may be partitioned into sets of "neighboring" sites in such a way
that at most one member of each set should be included in an optimal
allocation. (The geographic interpretation of their result is based on
visualizing the d . . ' s as distances and was discussed in more detail in
the Introduction.) More recently, Chandy and Hewes [5] have reformulated
what is essentially Casey's model as a simple linear integer programming
38
problem. They find that solving the problem by a standard linear pro
gramming algorithm ( without the integrality constraints) leads in many
cases to correct integer solutions. (The same phenomenon has been noted
in similar optimization problems [12], and hence appears to be more than
just an artifact of the particular examples tried.)
New results on a priori exclusion or inclusion . We have
attempted to decrease the size of the problem by a new approach; namely,
by deriving rules for determining a priori that certain sites will (or
will not) be included in an optimal allocation. Three theorems providing
rules of this sort are presented in this section. Formal proofs of the
theorems may be found in appendix 2. Casey's model and cost formula are
assumed. (See equation (1), p. 17.) For simplicity, we shall also
assume that d'..=d..=0 for all i. Thus we neglect the costs of queries
33 33
and updates applied locally. Relaxing this assumption, however, as well
as making other minor alterations in the formulation, requires only
small compensating alterations in our theorems.
Our first theorem essentially states that if the cost of
having a local file copy is smaller than the smallest possible cost of
sending the locally generated queries elsewhere, then a local copy
should unquestionably be included in the optimal allocation. The cost
of storing and updating a copy at site i is an important quantity; we
shall denote it by Z.. That is, we let
n
Z = a + £ <F . d ! . .
i i iml J jx
Theorem 1 . All optimal allocations will include site i if
X. min d^ . > Z J . (3)
1 in lj *
39
Our second theorem gives the other side of the coin. If it
costs more to maintain a local copy than we can possibly save by having
one, then we do not want one.
Theorem 2 . No optimal allocation including more than one site
will include site i if
n
Z > I X (max d  d ). (4)
1 j=l J k JR Jl
Theorems 1 and 2 can be summarized as follows. Let
m. = A . min d. . , and
n
M . = £ X . (max d . .  d . . ) .
j=l J k J J
Then for each i the real line is partitioned by m. and M. into three
J i i
regions, as shown in figure 3. If Z. falls in region 1, then it should
unquestionably be included in any optimal allocation. If it falls in
region 3, it will generally be excluded. An exceptional case is when
all sites fall in region 3; i.e., satisfy inequality (4). In this case,
all optimal allocations will be singlecopy ones, and we need only
choose the best of these. Sites whose costs Z. fall in region 2 (in
cluding the boundary points) must be considered further.
Region Region Region
1,2,3
Figure 3
40
One other comment should be made. If network transmission
costs are independent of the sites involved, so that d =d for all i^j ,
then m.=M.=A d, and region 2 collapses to a single boundary point.
Unless the cost Z of some site lies on the boundary, every site is a
priori either included or excluded. Immediately we see that then the
set of sites in region 1 forms the optimal allocation, unless the
exceptional case noted above obtains. Computation is therefore reduced
to a minimum (essentially n cost evaluations) for many real network
environments. (For example, most commercial packetswitched networks
charge per packet, irrespective of the distance the packet is to be
sent.) There are, of course, real cost differentials incurred in the
longer lines between sites, but from the user's point of view it is the
network charging policy that matters.
Our third theorem is somewhat in the spirit of the approach to
simplification taken by Urano et al. [13]. However, Urano's results
depend on a more complex model, involving "relay costs" at network nodes
and restrictive hypotheses on the d... Also, instead of determining
that no more than one of some group of geographically close sites can be
included in an optimal allocation, we show that certain sites may a
priori be precluded from being in an optimal allocation by the existence
of a "better" site nearby. Thus, again we have a theorem which allows
us to eliminate initially certain sites from any further consideration
in the search for the optimum.
Theorem 3 . A site i cannot be included in any optimal alloca
tion if there exists another site k in the network such that
n
Z.  Z. > I X.(d..  d. ,). (5)
where (f) = f flff i B '
wnere (z )+ < if f < 0.
41
Examples and further discussion . In order to assess their
value, we have applied our theorems to several of the examples in
Casey's paper [4], Casey assumes throughout that it costs the same to
transmit a query as an update (d' =d ), that transmission costs are
symmetric (d =d ) , and that storage costs are zero. In Casey's 5site
example, the data are as follows. The matrix of transmission costs d
is
ij
6
. 12
9
6
6
6
12
9
12
6
6
12
9
12
6
6
6
9
12
6
All A.'s are 24, and the vector of *f.'s is (2,3,4,6,8). In order to
apply theorems 1 and 2, we tabulate Z . , m., and M. for all sites i.
Site
1
2
3
4
5
z.
i
168
180
174
126
123
l
144
144
144
144
144
M.
l
648
648
576
648
648
Theorem 1 then states that both sites 4 and 5 should be included in any
optimal allocation. Theorem 2 does not eliminate any sites, so the
effect of adding sites 1,2,3 would have to be checked. However, the
problem has been reduced from an optimization among 32 (=2 ) choices to
3
one among 8 (=2 ). (The optimal allocation I is (1,4,5).) It is vir
tually obvious by inspection that costs (Z.) are sufficiently similar
and query loads and transmission costs sufficiently large that theorem 3
will not be useful. More will be said about this later.
42
As another example of the usefulness of theorem 1, consider
Casey's 5site example reduced to four sites by the elimination of site
5. The table of Z., m., and M. is then
1 l i
Site
1
2
3
4
Z.
l
120
108
78
78
m.
l
144
144
144
144
M i
Since for all i Z. < m., theorem 1 says that the optimal allocation
consists of all four sites, and no further computation is necessary.
In another of Casey's examples  the 19site ARPA network
example with d.. equal to distance  theorem 3 does turn out to be
useful. This is what is to be expected in a geographically widespread
network in which there are groups of sites that are very close together
(e.g., in the Boston and Los Angeles areas). We will not give the
details (the interested reader may consult the data and notation in
Casey's paper), but we find that theorem 3 eliminates six sites. Speci
fically, sites 3, 4 and 5 are excluded by site 2, site 15 by site 14,
and sites 17 and 19 by site 18. This reduces by a factor of 64 the
maximum size of the search required to find the optimum.
The reader might question whether it is worthwhile, in general,
to carry out the computations needed to check the conditions for our
theorems. In this connection, we point out that the costs Z. are needed
in any cost optimization algorithm and it only increases efficiency to
precompute them. Computation of m. for i=l,...,n is easy, requiring
2
0(n ) operations altogether (n to find the minimum d.. for each i) .
Application of theorem 1 is therefore highly advisable, since for each
43
site successfully identified as satisfying inequality (1) the search is
reduced by 50 percent. Furthermore, it may happen that the set of sites
satisfying (1) forms an acceptable suboptimal allocation, so that no
further search is necessary.
2
Computation of M is more timeconsuming, requiring 0(n )
operations for each i. It may be that one would only want to try
theorem 2 selectively, say when Z (or Z./X ) is "big". Testing for
theorem 3 is even more timeconsuming because of the dependence on k.
However, the quantities involved in the test (the right sides of in
equality (5)) are sufficiently similar to M that it is not difficult to
work out an efficient scheme for carrying out the computations needed
for theorems 2 and 3 together. Again, to save time one can readily
eliminate from the process many cases where checking inequality (5) is
(or is likely to be) useless. For example, when Z.<Z,+X.d., , (5)
clearly cannot be satisfied. This simple condition suffices to show
conclusively that theorem 3 is of no help in the 5site example dis
cussed above.
Our intuitive idea of theorem 3  the elimination of more
expensive, neighboring sites by the existence of a given site  can be
used to identify situations when theorem 3 is likely to be useful.
Thus, if a number of network sites are clumped together (in the sense of
small d..), then the "cheapest" of these (small Z./X.) may serve to
ij 11
eliminate others of the clump. Notice also that the minimization
carried out in the test for theorem 1 identifies the "closest" site to
site i in the sense of smallest d... This information can be saved and
ij
used to identify promising pairs for the theorem 3 test.
44
Finally we note that all of the results and comments of this
section will hold for the primarycopy model (equation (2)), provided
that we redefine Z as
Z.  a, + Yd ' . .
i i li
This is intuitively reasonable. The cost of shipping all updates to the
primary is the same for all allocations and hence has no effect on the
cost comparisons. Therefore the appropriate cost of maintaining a copy
is just the sum of storage cost plus the cost of shipping all updates to
that copy from the primary.
The quantity Z./A. appeared above as a measure of the cheap
ness of site i. Since an optimal allocation is likely to include the
cheaper sites, this quantity is also a rough measure of the probability
that site i will be included in an optimal allocation. We have success
fully used Z./A. as a heuristic "search factor" to expedite search
procedures for optimal or nearoptimal allocations. For example, in the
search for local minima which we described earlier, we have obtained
considerable improvement by numbering the sites 1,2,... in order of
increasing Z./A.. That is, the use of this search factor generally
seems to lead one to a better (i.e. lower cost) allocation with less
effort than if the sites are numbered arbitrarily.
If constraints are imposed on the problem, then our theorems
may not be immediately applicable. If there are constraints excluding
or including sites, or setting a minimum number of copies, our theorems
are still useful as long as they are not applied mindlessly. Thus, one
must make adjustments if theorem 1 would include some of the wrong
sites  or too many sites  or if theorem 2 excludes too many sites.
In some cases, one might want to consider the proofs of the theorems
45
(see appendix 2) in determining how they might be applied, since the
proofs provide the basic cost comparisons which follow from the theorem
hypotheses.
For example, suppose we wish to find the best 2site alloca
tion, and one copy is required to be at site 1. Then it may be possible
to exclude a number of sites from consideration as the second site by
applying lemma 1 (appendix 2) with k=l. If all sites are "excluded" by
this lemma, one is reduced to bruteforce comparisons of the possibilities,
On the other hand, theorem 1 may show that some site  say site 2 
should appear in any globally optimal allocation. Then I=(l,2) is
certainly a "good" allocation, and application of lemma 2 with k=2, i=3,
..., n, may in fact serve to prove that (1,2) is the best 2site alloca
tion containing site 1.
The example above is, of course, very simplistic. Application
of the theorems in this case is likely to save little over the brute
force cost computations and comparisons. The example is only designed
to show the reader the kinds of variations on our results which are
obtainable for application in some nonstandard situations.
Constraints may take on many forms, and for some forms it may
not be clear that anything like our theorems can be derived. We believe,
however, that for any file allocation problem it is worth investigating
the possibility of deriving as much a_ priori information on the optimum
as possible. The savings may well be enough to turn an impractically
large computing problem into one that is of acceptable size.
46
Applications to Special Situations
In addition to altering Casey's model to handle a primarycopy
synchronization scheme, we have also, following another line of research,
developed models for certain special situations which may be particularly
relevant to the WWMCCS environment. In this section we report on two
such studies.
First, we consider the optimal allocation of a file that is
updated by a single site (the site that generates the updates by its
data collection activities), but is queried by other sites for its
information. We will initially consider this situation under the
assumptions of a uniform query load (i.e., each site generates the same
number of queries) and a homogeneous environment. We then relax the
homogeneity condition. An interesting feature of this study is that a
partial (and sometimes complete) solution to the file allocation problem
may be obtained by inspection of simple inequalities. The lengthy
computation and comparison of multitudes of allocation costs are just
not needed in many simple situations.
Second, we consider the tradeoffs found in the local caching
of data (as might occur in an intelligent terminal application.) Here
we will consider a local system that maintains a partial copy of a file
or data base. We assume that the system can satisfy some of its queries
locally and the rest by using the remote copy. We then derive the
conditions that must be satisfied for this situation to be costeffective,
We will slightly modify Casey's model by assuming that the
costs d., and d' ., are the sums of two components  system (processing)
cost and a constant cost of transmission. Assuming a constant cost of
47
transmission seems reasonable, since most packetnetwork charges are
independent of distance. Including processing costs will allow us to
investigate effects of system load on the problem. Thus we will assume
that
ju, for j=k
d\
lu.+N for j^k
1 K. U
and
for j=k
W k +N q
where u, is the cost of processing an update on the kth system,
N is the total network cost to transmit an update,
u
q, is the cost of processing a query on the kth system, and
N is the total network cost to transmit a query.
Single Update Source
In this section we wish to determine the optimal allocation of
a file which is updated by only one node but queried by all. To repre
sent this situation, we will assume that all updates originate from the
node j=l. (In other words, *^0 and ^.=0 for l < j_^n.) Initially we will
assume that q,=q, u 1 . =u j °\T a anc * ^v = ^ ^ or a ^ ^> anc * t ^ iat ^ =N =N. This
last condition is not really necessary, but will make the results some
what clearer. Thus our homogeneity assumptions include equal storage
costs, query loads, and query and update processing costs. Under these
assumptions we obtain the following.
Strategy Rule 1 . [homogeneous network]
If ^(u+N) + a > AN,
then the cheapest allocation is to put a single copy at site 1,
48
Intuitively, this result says that if it costs more to store
and update the copy at site k than it does to send k's queries to site
1, then it does not pay to put a copy at site k. A rigorous proof of
this result is in appendix 3.
Now suppose that we introduce some heterogeneity into the
system. Assume that for some site k in the network, u ^ll , o.^o , and
q <q <q,+N. (This last inequality ensures that locally generated queries
will be processed locally.) When will C£k) be less than C(l)? Straight
forward computation yields
C(k)C(l) = H' 1 (N+u k u 1 )+nX(q k q 1 )+a k o 1 .
From this we immediately conclude:
Strategy Rule 2 [heterogeneous network] .
If some site k is so much cheaper than site 1 that
^ 1 u 1 +nAq 1 +a 1 > ^(N+u^+nAqj+c^,
then it is cheaper to put a copy at site k than at site 1,
even though site 1 is generating all the updates.
Finally, we might ask when, in a heterogeneous network with best single
site allocation (1), it is worthwhile to add a second copy. We will
first assume q, <q <qi+N and compute
C(l)C(l,k) = XN+diDACq^q^O^^^+N).
From this we obtain
Strategy Rule 3 [heterogeneous network]
If site k is sufficiently cheap that
k + V U k +N) < M+CnDXCq^q^,
then the twosite allocation (l,k) is cheaper than allocation (1)
49
If q is still smaller, so that q +N<q (which makes it economical for
even site 1 to send its queries to site k) , then the condition in
strategy rule 3 becomes
nA(q r q k ) > V U k +N)+ V
Notice that these last two rules do not give us the complete
answer to the file allocation problem. For example, rule 3 says that we
can save by putting a second copy at site k, but it does not select
among several alternative sites which may satisfy the inequality, nor
does it preclude greater savings from adding a third copy.
Use of a Local Partial Copy
In this section, we investigate the situation that would exist
when an intelligent terminal acts as a frontend to a distributed data
management system. Specifically, we consider the case where some queries
can be satisfied by a local partial copy of a data base, while the rest
must be answered by querying a remote copy. We will compare the cost of
this twocopy allocation (one complete copy and one partial copy) with
that of a single remote copy. We assume that all updates originate at
the remote site (site 1), and that all queries originate at the local
site (site 2) .
Let us also assume that the storage cost of the partial copy
is pa, where p is less than 1. Further, let
y, = the volume of updates relevant to the remote copy but
not to the local (partial) copy, and
y~ = the volume of updates relevant to the local copy.
Then Y, the total update volume, is 4',+H'^.
Let A, = the query volume that must be sent to the remote copy, and
\~ = the query volume that can be satisfied locally (i.e. by the
partial copy) .
50
With these assumptions, the cost of having only the remote copy is
C(l) = *u 1 +(X 1 +X 2 )(q 1 +N)hj 1 .
Adding the partial copy gives
C(l,2) = >Fu 1 +4' 2 (u 2 +N)+X 1 (q 1 +N)+A 2 q 2 +(l+p)a 1
Therefore
C(l)C(l,2) = A 2 (q 1 q 2 +N)4' 2 (u 2 +N)p0 1 ,
and the local partial copy becomes costeffective when
x 2 (q 1 q 2 +N) > Y 2 (u 2 +N)+p ai .
There is one immediate point to be noted here. If q„>q +N, so that the
local cost of processing queries is greater than that of sending them
across the network and processing them at the primary site, then it
never pays to have the local copy. This observation increases the
importance of designing intelligent terminals to be as inexpensive as
possible.
Another interesting observation may be made. Let the query
cost differential q^q^+N be denoted by Q, and suppose that the locally
cached data is relatively static, so that ^=0. Then the condition for
costeffectiveness becomes
* 2 Q > pa r
To visualize this condition, in figure 4 we have plotted crossover
points. Each curve is for a fixed value of pa.. , and gives values of
(A ,Q) such that A Q=pa . On the graph the units for Q are dollars per
query, of A are queries per hour, and of pa, are dollars per month.
Because of the conversion factors among these units, the actual quantity
graphed is A =(pa /720Q). For example, if the differential querying
cost Q is $0.10, and the cost pa. of the local storage cache is $10 a
51
1.0
0.5 
0.3 
0.2 
o
i
a>
ex
o
o
a
0.1 
005 
0.03 
0.02 
0.01
1
i
1
i
i
i
\fc

Nft

\&
 \^
\fe


N?

i
i
»
\ i \
i
\ i
0.1
0.2 0.3 0.5 1.0
X 2 (Queries per Hour)
2.0
3.0
5.0
Figure A
Minimum query rates which make a local cache cost effective,
Q is the savings per query caused by having the local cache,
Curves are labeled with storage cost of the local cache (in
dollars per month). See text for detailed assumptions.
52
month, the crossover point is at A =0.14 queries per hour. For any
query rate over this, the local cache becomes increasingly cost
effective. If the differential querying cost Q is only $0.01; i.e.,
decreases by a factor of 10, the crossover point increases by the same
factor of 10, to 1.4 queries per hour.
There are many cost factors which enter into the quantity Q.
Careful analysis and/or measurements would have to be carried out to get
a good estimate of Q in any given situation. This very simple applica
tion of file allocation points up a basic problem. Good data on costs
and usage patterns must be available before allocation optimization can
be at all meaningful.
53
Conclusions
Theoretically, the file allocation problem seems to be well
understood. However, the models and associated cost formulas to be
found in the literature do not necessarily reflect the realities of
distributed data management. Witness the total lack of concern with the
synchronization problems which the broadcasting of updates to multiple
copies causes. Nevertheless, we find that it is not very difficult to
develop new models to handle, say, a primarycopy scheme. Obtaining
models and associated cost formulas for any specific environment is
largely a matter of having the data access process carefully defined.
In a way, the model ±s^ such a careful definition.
If the network is not very large  say 20 sites or less  the
best allocation can be found by brute force, if necessary. That is, the
costs of all possible allocations can be computed and compared. As
networks grow in size, this computational problem rapidly becomes too
large to be feasible. Considerable progress has been made in making the
problem more tractable. Doubtless there is room for more work in this
area. We believe that one useful approach is the development of very
simple models and resulting decision rules that answer specific ques
tions. Examples of this kind of approach were provided in the section
just preceding.
A serious problem  and one which research alone can not
solve  is the determination of the parameters to be put into the model.
We have indicated that "costs" need not (and perhaps should not) be
based on real dollars, but instead might reflect the data base manager's
judgment as to the "value" of the various resources, or the relative
importance of the various terms in the overall cost formula. In short,
54
cost factors can be thought of as general weight factors, which may
reflect subjective judgments as well as objective cost accounting. The
information on data base usage is also specific to the system. Measure
ments must be taken of query and update volumes before the model can
yield valid results. The researcher can aid in providing statistical
analyses of the measurements. Such analyses might yield information on
whether usage patterns appear to be static or oscillatory, or show long
term trends. But the burden is again on the data base manager to supply
the raw data needed to put real numbers in the formulas.
Finally, we note that it might be reasonable to attack the
file allocation problem from an entirely different point of view than
that of minimizing "cost". One might instead want to minimize, say,
response time. Minimization of response time does have some attractions,
especially in an environment where rapid response may be critical. The
problem must be constrained, of course. Otherwise the fastest response
is obtained by putting a copy at each site. Reasonable constraints
might be to put a bound on storage and update costs, or to set a maximum
number of copies.
In our discussion of response time as a constraint, however,
we have already implicity identified two difficulties with this approach.
First, response models are complex. More data and parameters are needed
to compute response. No readily usable response model seems to currently
exist in the literature; additional research is needed on this problem.
Second, response models necessarily couple together the effects of all
the files in the system. Unless some clever way can be found to get
around this problem, computation of a minimal response allocation will
be intractable even for small networks. In spite of these difficulties,
we believe that research along these lines would be definitely worthwhile,
55
References
1. Alsberg, P. A. et al. Synchronization and Deadlock. CAC Document
No. 185 (CCTCWAD Document No. 6503), Center for Advanced Computation,
University of Illinois at Urb ana Champaign, March 1976.
2. Alsberg, P. A., Belford, G.G., Day, J.D., and Grapa, E. MultiCopy
Resiliency Techniques. CAC Document No. 202 (CCTCWAD Document No.
6505), Center for Advanced Computation, University of Illinois at
UrbanaChampaign, May 1976.
3. Belford, G.G., Schwartz, P.M., and Sluizer, S. The Effect of
Backup Strategy on Data Base Availability. CAC Document No. 181 (CCTC
WAD Document No. 6501), Center for Advanced Computation, University of
Illinois at UrbanaChampaign, February 1976.
4. Casey, R.G. Allocation of copies of a file in an information
network. AFIPS Conference Proceedings 40, AFIPS Press, Montvale, N.J.,
1972, pp. 617625.
5. Chandy, K.M. and Hewes, J.E. File allocation in distributed
systems. Proc. International Symp. on Comp. Performance Modeling,
Measurement and Evaluation (March 1976), pp. 1013.
6. Chu, W.W. Optimal file allocation in a multicomputer information
system. IEEE Trans, on Computers C18 (1969), pp. 885889.
7. Chu, W.W. Optimal file allocation in a computer network. In
ComputerCommunications Networks , N. Abramson and F. Kuo (Eds.), Prentice
Hall, Englewood Cliffs, N.J., 1973.
8. Eswaran, K.P. Placement of records in a file and file allocation
in a computer network. Information Processing 74, NorthHolland Publishing
Co., Amsterdam, 1974, pp. 304307.
9. Johnson, P.R. and Thomas, R.H. The Maintenance of Duplicate
Databases." RFC #677, NIC #31507, January 1975. (Available from ARPA
Network Information Center, Stanford Research InstituteAugmentation
Research Center, Menlo Park, CA. )
10. Levin, K.D. Organizing distributed data bases in computer networks.
Ph.D Dissertation, University of Pennsylvania (1974).
11. Morgan, H.L. and Levin, K.D. Optimal Program and Data Locations
in Computer Networks, Report 741001, Dept. of Decision Sciences, The
Wharton School, University of Pennsylvania, 1974.
12. Spinetto, R.D. A facility location problem. SIAM Review 18 (1976),
pp. 2945.
13. Urano, Y. , Ono, K. , and Inoue, S. Optimal design of distributed
networks. Second International Conf. on Comp. Comm. (1974), pp. 413
420.
56
Appendix 1
In this appendix, we prove the analog of Casey's key lemma [4,
p. 620] for the primary copy model (equation (2)). This lemma deals
with a diamondshaped subset of the cost graph, as shown below.
(1,4)
(1,2,4) »^ ^* (1,3,4)
I  (1,2,3,4)
It says that if the cost of I is less than (or no more than) that of its
two predecessors in this subgraph, then the cost of the top node can not
be smaller than that of those predecessors. The lemma holds in general
for any four nodes in this configuration. The specific labels on the
allocations are designed to clarify their relationship. Just as in
Casey's model, this lemma leads straightforwardly to the fact that costs
are monotonic nonincreasing along paths to the minimum.
Lemma . Without loss of generality, let I = {l,2,...,r}
If C(I) < C(I~{k}), k = 2,3;
then
C(I~{k}) < C(I~{2,3}), k = 2,3.
(The notation I~{.} means the set I with the set of elements in the
brackets deleted from it.)
57
Proof .
Let X^ = C(I~{k})  C(I), and
Y k = C(I~{2,3})  C(I~{k})
We wish to show that if X„ > and X _> 0, then Y _> and Y > 0.
This will be true if Y 3 ~X 2 > and Y 2 X„ > 0.
Now X„ "=  o„  ¥d' + EX. (min d..  min d.. )
2 2 I 2 , J i r roi jk , T jk
j keI{2} J kel J
Y =  o  ¥d' + ZA 4 (min d..  min d..)
3 2 12 j j kel~{2,3} Jk kel~{3} jk
So
Y„  X 9 = EX. (min d ,  min d.,  min d,, + min d )
J j J kel~{2,3> Jk kel~{3} J * kel~{2} Jk kel Jk
The remainder of the proof follows exactly Casey's proof; we will
not copy it here. Essentially, one sees easily that the quantity
in parentheses is positive for all j ; and the inequality Y X~ _>
follows by permuting indices.
58
Appendix 2
In this appendix we give formal proofs of the theorems on
exclusion and inclusion which were presented and discussed earlier. We
are assuming Casey's cost formula (equation 1) and are also assuming
that d..=d!.=0 for all j. The cost of maintaining a copy at site i is
given by
Z,  a. + EY.d ,.
i i . ,J ji
Theorem 1 . All optimal allocations will include site i if
X J min d. . > Z. . (Al)
i .1. ij i
Proof . Assume that site i satisfies (Al) and that there is an
optimal allocation I which does not include i. Let I'=IU{i}. By
straightforward computation,
C(I') = C(I) + Z. + S,
n
where S = EX. (min d.,  min d.,).
j=l J kel' J kel Jk
Clearly each term in S is nonpositive, since taking a minimum over a
larger set cannot increase that minimum. Furthermore, since for j=i the
contribution to S is X. min d,, , it follows that
1 kei lk
S < X. min d.. < X min d . . .
— 1 . . ik  i , ,, ii
kel j^i J
Therefore (from (Al)), S+Z.<0, and hence C(I')<C(I). This contradicts
the assumed optimality of I and proves the theorem.
Theorem 2 . No optimal allocation including more than one site
will include site i if
Z. > I X.(max d..  d. .). (A2)
i i j i jk ji
j=l J k J J
59
Proof . Given an allocation 1^0 (the null allocation), let
I'=IU{i}. Just as in the proof of theorem 1,
C(I') = C(I) + Z ± + S.
Making a term by term comparison, we see that the right side of inequality
(A2) is greater than or equal to (S) provided that
max d.,  d. . + min d..  min d., > (A3)
i jk j l , _ , jk . r jk —
k J J kel ,j kel J
for all j . Examining (A3) , we see that if
min d ., = d. . ,
kel' Jk J1
then the second and third terms cancel and (A3) is clearly satisfied.
Otherwise, the last two terms cancel and (A3) is again satisfied. It
then follows from (A2) that
Z. + S > 0.
l
Hence C(I')>C(I), and the theorem is proved.
Theorem 3 . A site i cannot be included in any optimal allocation
if there exists another site k in the network such that
n
Z.  Z, > Z X.(d..  d. ,). (A4)
i k , 1 jk j l +
where (f) = f f if f > °»
w K '+ ^0 if f < 0.
This theorem is most readily proved in two parts. We will
give these parts as two preliminary lemmas. As in the previous theorems,
let I be an arbitrary allocation (but not inlcuding i or k) . Let
I'=IU{k} and I"=I'U{i}. The first lemma states that if site i is
sufficiently costly, then adding site i to an allocation which already
includes k increases the total cost.
60
Lemma 1 . If site i satisfies
n
Z. > E X.(d.,  d..),
i j =1 J Jk ji +
for some site k in the network, then C(I")>C(I').
Proof . By straightforward computation,
n
C(I")  C(I') = Z. + E X.(min d.  min d. ).
1 • 1 3 Til J 111 Tl J 111
j=l J mel J mel
Similarly to the proof of theorem 2, the lemma is proved if, for all j,
(d . .  d . . ) , + min d .  min d . > . (A5 )
me 1 me 1
The inequality (A5) follows easily from consideration of two cases.
(i) If min d. ^ d.., then the last two terms cancel and clearly
tii J m J 1
mel J
the left side is positive.
(ii) If min d. = d.., then (A5) reads
mel" Jm J1
d.,  d . . + d . .  min d . > ,
J J J mel
which again is clearly satisfied.
Our second lemma says that, if (A4) is satisfied, replacing
site k by site i in an allocation also increases the cost.
Lemma 2 . Let I"'=lU{i}. If sites i and k satisfy inequality
(A4), then C(I"' )>C(I' ) .
Proof . By straightforward computation,
C(I'")  C(I') = Z.  Z, + E A. (min d.  min d. ).
i k . i 3 Tiit jm Tf jm
j=l mel ' J mel J
Similarly to the preceding lemma, the proof follows from showing that
IX. (d..  d..), > EX. (min d.  min d. ).
j jk jr+  .j _. jm Tll , im
3 3 mel' J mel ' J
61
And this inequality holds if, for all j, we have
(d..  d.,K  min d. + min d, > 0. (A6)
Jk Ji+ mel' Jm mel"' jm 
Inequality (A6) can be verified by checking cases.
(i) If min d. ^ d. , or d., , then the last two terras cancel and
mel" Jm Ji Jk
(A6) obviously holds. (Recall that I"=IlHi,k}.)
(ii) If min d. = d.., then (A6) becomes
Til J m J 1
mel" J J
d . ,  d . .  min d . + d . . > .
J J mel' J J
(iii) If min d. = d.. , then the left side of (A6) becomes
mel
 d., + min d. , which must be nonnegative by the assumption
J mel'" J
defining this case.
Proof of Theorem 3 . If inequality (A4) holds, then lemmas 1
and 2 both apply. Since lemma 2 says that an allocation containing i is
always improved by replacing i by k, and lemma 1 eliminates any allocation
containing both i and k from being optimal, nothing more is needed.
62
Appendix 3
Proof of Strategy Rule 1
Under the assumptions given in the discussion of this rule in
the text, the cost of allocation (1) is
C(l) = VjU + (nl)AN + nXq + a.
The cost of allocation (l,k), where k is any other site, is
C(l,k) = Y (u+N) + Y u + (n2)XN + nAq + 2a.
Therefore
C(l,k)  C(l) = V (u+N)  AN + a.
Hence
C(l,k) > C(l) if and only if
^(u+N) + a > AN, (A7)
which is the condition in strategy rule 1. Casey's theorem C2 can be
applied to show that allocation (1) is the only allocation containing
site 1 which could possibly be optimal.
To complete the proof, we must show
(i) that allocation (1) is cheaper than any other singlesite
allocation, and
(ii) that adding a site to any other singlesite allocation increases
cost (so that Casey's theorem C2 may again be applied to show that the
optimum can not lie below level 1 in the cost graph).
Now C(k) for k^l is y (u+N) + a + (nl)AN + nAq. To verify
(i) we simply note that
C(k)  C(l) = VjN.
63
To verify (ii) , we compute C(k,j) for the case where neither k nor j is 1,
C(k,j) = 2Y (u+N) + 2o + (n2)XN + nAq.
Therefore
C(k,j)  C(k) = 4> (u+N) + o  AN.
The condition for C(k)<C(k,j) is precisely that inequality (A7) be
satisfied, which is our hypothesis. Thus the result is proved.
64
UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (Whan Data Entered)
REPORT DOCUMENTATION PAGE
READ INSTRUCTIONS
BEFORE COMPLETING FORM
1 REPORT NUMBER
CAC Document Number 203
2. GOVT ACCESSION NO.
3. RECIPIENT'S CATALOG NUMBER
4. TITLE (and Subtitle)
Research in Network Data Management and
Resource Sharing  Network File Allocation
5. TYPE OF REPORT ft PERIOD COVERED
Research
6. PERFORMING ORG. REPORT NUMBER
CAC #203
7. AUTHORf*.)
G.G. Belford, J.D. Day, E. Grapa, and
P.M. Schwartz
8. CONTRACT OR GRANT NUMBERf*)
DCA10075C0021
9. PERFORMING ORGANIZATION NAME AND ADDRESS
Center for Advanced Computation
University of Illinois at UrbanaChampaign
Urbana, Illinois 61801
10. PROGRAM ELEMENT, PROJECT, TASK
AREA ft WORK UNIT NUMBERS
11. CONTROLLING OFFICE NAME AND ADDRESS
Command and Control Technical Center
WWMCCS ADP Directorate
11440 Isaac Newton Sq., N. , Reston, VA 22090
12. REPORT DATE
August 2, 1976
13. NUMBER OF PAGES
14. MONITORING AGENCY NAME ft ADDRESSf// different from Controlling Office)
15. SECURITY CLASS, (of this report)
UNCLASSIFIED
15«. DECLASSIFI CATION/ DOWN GRADING
SCHEDULE
16. DISTRIBUTION ST ATEMEN T (of this Report)
Copies may be obtained from the
National Technical Information Service
Springfield, Virginia 22151
17. DISTRIBUTION STATEMENT (of the abstract entered In Block 20, If different from Report)
No restriction on distribution
18. SUPPLEMENTARY NOTES
None
19. KEY WORDS (Continue on reverse side it necessary and Identify by block number)
File allocation
Distributed data bases
Network resource sharing
20. ABSTRACT (Continue on reverse side if necessary and identity by block number)
This report contains results to date of a study of file allocation in
a network. Models and algorithms contained in the literature are surveyed.
Some new models (for special situations and for update distribution through
a primary site) are developed. Some new theorems for simplifying the
computational problem are presented.
DD 1 jan 73 1473 EDITION OF 1 NOV 65 IS OBSOLETE
UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)
BIBLIOGRAPHIC DATA
SHEET
1. Report No.
UIUCCACDN76203
3. Recipient's Accession Ni
4. 1 u U iiul suht it le
Research in Network Data Management and Resource Sharing 
Network File Allocation
5 Report Date
August 2, 1976
7. A ut hop's 1
G.G. Belford, J.D. Day, E. Grapa, and P.M. Schwartz
8. Performing Organization Kept.
N ° CAC #203
9. Performing Organization Name and Address
Center for Advanced Computation
University of Illinois at UrbanaChampaign
Urbana, Illinois 61801
10. Proect/Task/Work Unit N.
11. Contract /Grant No.
DCA10075C0021
12. Sponsoring Organization Name and Address
Command and Control Technical Center
WWMCCS ADP Directorate
11440 Isaac Newton Square, N.
Reston, Virginia 22090 
13. Type of Report & Period
Covered
Research
14.
15. Supplementary Notes
None
16. Abstracts
This report contains results to date of a study of file allocation in a network.
Models and algorithms contained in the literature are surveyed. Some new models (for
special situations and for update distribution through a primary site) are developed.
Some new theorems for simplifying the computational problem are presented.
17. Key Words and Document Analysis. 17a. Descriptors
File allocation
Distributed data bases
Network resource sharing
17b. Identifiers /OpenEnded Terms
17c. COSATI Field/Group
18. Availability Statement
No restriction on distribution
Available from the National Technical Information
Service, Springfield, Virginia 22151
19. Security Class (This
Report)
UNCLASSIFIED
20. Security Class (This
Page
UNCLASSIFIED
21. No. of Pages
68
22. Price
FORM NTIS35 (REV. 372)
USCOMMDC 14952P"