Version 4 8/6/90 Draft

Wide Area Information Server
Concepts

TMC-202

Brewster Kahle

11/3/89

Thinking Machines

Wide Area Information Servers answer questions over a network, feeding
information into personal workstations or other servers. As personal
workstations become sophisticated computers, much of the work of finding,
selecting, and presenting information can be done locally and tailored to the
user's interests and preferences. This paper describes how current technology
can be used to open a market of information services that will allow a user's
workstation to act as librarian and information collection agent for a large
number of sources. These ideas form the foundation of a joint project between
Apple Computer, Thinking Machines, and Dow Jones. This document is intended for
those who are interested in the theoretical concepts and implications of a
broad-based information system.

The paper is divided into three parts corresponding to the three
components of the system: the user workstation, the servers, and the protocol that
connects them. Although a workstation can act as a server, and a server can
request information from other servers, it is useful to divide the functionality
into client and server roles. A final section in the appendix outlines related
systems.

Ideas for this have come from Charlie Bedard, Franklin Davis, Tom
Erlickson, Carl Feynman, Danny Hillis, the Seeker group, Jim Salem, Gitta
Salomon, Dave Smith, Steve Smith, Craig Stanfill, and others. I am acting as
scribe. Comments are welcome (brewster@think.com).

Working Copy. Please see Brewster@think.com for a newer version.

Table of Contents

I. Introduction

II. The Workstation's Role in WAIS
A. Accessing Documents with Content Navigation
B. Dynamic Folders Find Information for the User
C. Using Information Servers
D. Other User Interface Possibilities
E. Advantages of Remote and Local Filtering
F. Local Caching of Documents
G. Local Scoring of Competing Servers
H. Budgeting the User's Time and Money

III. The Server's Role in WAIS
A. Probing Information Servers
B. Examples of Information Servers
C. Navigating through the "Directory of Services"
D. Servers that Rate other Servers
E. The Role of Editors
F. Markets and Hierarchies: Using Silicon Valley
G. How Server Companies Can Make Money

IV. The Protocol's Role in WAIS
A. Open Protocols Promotes Wider Acceptance
B. Hardware Independence
C. Protecting the User's Privacy

V. Conclusion: Why WAIS will Change the World

VI. Related Documents

VII. Appendix: Comparisons to Existing Systems
A. CompuServe
B. Minitel
C. NetLib
D. Switzerland system
E. Lotus and NeXT text system
F. Information Brokers
G. Hypertext

I. Introduction

Distributing knowledge was first done with human memory and oral
tradition, later by manuscript, and then by paper books. While paper distribution
is still an efficient distribution mechanism for some information, electronic
transmission makes sense for other information. This project attempts to install an electronic
"backbone" for distribution of information. Some information is already
distributed electronically whether it is printed before it is consumed or not. This
project attempts to make electronic networks the distribution technique for more
types of information by exploiting new technology and standardizing on an
information interchange protocol.

The problems that are being addressed in the design of this system include
human interface issues, merging of information from many sources, finding
applicable sources of information, and setting up a framework for the rapid
proliferation of information servers. Private, group, and public
information are accessed with one user model implemented on personal
workstations, so that users can reach many sources without learning specialized
commands. Finding information in the sea of possible sources
without asking every question of every source can be accomplished by searching
descriptions of sources and selecting the sources by hand.

An open protocol for connecting user interfaces on workstations and server
computers is critical to the expansion of the available information servers. The
success of this system lies in a "critical mass" of users and servers. This protocol,
then, could be used on any electronic network from digital networks to phone
lines.

For information owners to make their data available over a server, servers
must be easy to start, inexpensive to operate, and profitable. One possible
approach would be to provide software at a low price that will help those with
information holdings to put their data on an electronic network. The power of
current personal workstations is enough to enable sophisticated information
servicing capabilities. Charging for services can be done in a number of ways
that do not entail setting up large billing operations. In this way, it can be easy to set
up, operate, and charge for information services.

The key ideas of the WAIS system are that information services should be
easily and freely distributed, that the power of the current workstations can
provide sophisticated tools as servers and consumers, and that electronic
networks should be exploited to distribute information.

II. The Workstation's Role in WAIS

The personal workstation has grown to be a sophisticated computer that
can store hundreds of books worth of information, multiprocess, and
communicate over a variety of networks. The advanced capabilities of the
workstation are used to find appropriate information for the user by contacting,
probing, and negotiating with information servers. The explosion of available
information may change the way we use computers since the usual approaches to
information on workstations may not grow to make the new information
environment understandable. The proposed mechanism involves finding
information with one mechanism called "Content Navigation" whether the data
is local or remote, available immediately or over time. This section details what a
workstation might do to collect and present information from a variety of sources.

A. Accessing Documents with Content Navigation

Currently, the common way to find a document (or file) is the "Finder" on
the Macintosh or its equivalent on most other machines. This tree structure
requires the user to remember where s/he has put each file. This approach works
when a user is familiar with the file organization. It is also computationally
efficient. To aid those that have forgotten the exact location, many systems have
some way to locate files anywhere in the structure based on the filename ("Find
File" on the Mac, and "find" on Unix machines). The number of potential files
increases as disk space becomes less expensive and networks let users access
remote files. At some point, when the number of files becomes large, this
organization can become unwieldy because of the amount the user has to remember.

Another technique that is currently popular is to augment documents with
static HyperText links 1,2. These links help users move through 500 MegaByte
CD-ROMs of data without being overwhelmed. HyperText systems allow the
author to provide "paths" through the document. The HyperCard system, from
Apple, also has a simple content searching mechanism that helps navigate
without those links. HyperText links give the author another tool to guide the
user and augment the capabilities of the file system.

1 Nelson, Ted. Literary Machines.

2 HyperCard by Apple (ref?)

A different technique that would allow access to a large collection of
documents based on document content and similarity can be called "Content
Navigation." With this tool, documents are retrieved by starting with a question
in English. A single line, or headline, would describe possible documents that
are appropriate. These documents can be viewed, or used to further direct the
search by asking for "more documents like that one". Each document on the disk
(or some other source) is then scored on how well it answers the question and the
top scoring documents are listed for the user. Since full natural language
processing is currently impossible, each document type, be it a newspaper
article or a spread sheet, must have some simple measure to determine how
relevant it is to the question asked. For text documents a useful and powerful
measure is to count the number of words in common between the question and the
text. This well known technique of Information Retrieval1 can be augmented with
different weighting schemes for different words or constructions. Other types of
information might be retrieved with specific question formats.

Thus, documents can be found by asking the "navigator" for documents that
contain a set of words. Those documents that share the most words with the
question will come back at the top of the list (have the best "score"). In this system
the "answer" to a question is not a single document, rather it is an ordered list of
candidate documents.
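
The word-overlap measure described above can be sketched in a few lines of Python. This is only a minimal illustration under simple assumptions (words are lowercase alphanumeric runs, every word has equal weight); it is not the scoring used by any system named in this paper. The function names `tokenize`, `overlap_score`, and `rank_documents` and the sample headlines are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question, document):
    """Count the distinct words shared by the question and the document."""
    return len(tokenize(question) & tokenize(document))

def rank_documents(question, documents):
    """Return (score, headline) pairs, best score first.

    `documents` maps a headline to the document text. Documents that
    share no words with the question are left out of the answer.
    """
    scored = [(overlap_score(question, text), headline)
              for headline, text in documents.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    # Sort by descending score; ties are broken alphabetically by headline.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical sample data, for illustration only.
documents = {
    "Concert schedule": "Beethoven choral works performed this week",
    "Stock report": "Markets closed higher on Friday",
    "Composer biography": "The life and works of Beethoven",
}
ranking = rank_documents("Beethoven choral works", documents)
```

The result is an ordered list of candidates rather than a single answer, which matches the notion of an "answer" used here. A weighting scheme could replace the plain count inside `overlap_score` without changing the rest.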

Content navigation is not new: NeXT and Lotus have implemented systems
for personal computers,2 many text database systems run on mini-computers, and the
DowQuest system uses a super-computer. In general, there is no
standardization yet on how these systems should be queried and used.

B. Dynamic Folders Find Information for the User

Content navigation takes a question and returns an ordered list of possibly
relevant documents. The question can be further refined by giving feedback as to
how relevant the documents were. The results of a question can be seen as a cousin
to the file folder in that they contain a list of documents. In reality, the answers to a
question might not be "copies" of documents, but "references" or pointers to
documents. These question and answer sessions can be saved just like a file folder
can be saved. Saving a session also frees the machine to find answers when the
user is not looking. This capability becomes important when some of the
questions take time to answer because the data might be far away or the question
difficult to answer. This section discusses one way to think of a saved question: a
Dynamic Folder.

"Dynamic Folders" are a cross between a database query and a Macintosh
folder that can give us great power in defining questions and probing databases.
Text database queries respond with a list of pointers to "hit articles", in the form
of titles or headlines, that might interest the user. At that point, the entire article
can then be retrieved, if desired. A Dynamic Folder, similarly, has a question
that is used to retrieve headlines. Further, a Dynamic Folder can be saved and
viewed later. Since a folder is also a structure that holds documents so that they
can be viewed later, a Dynamic Folder is a folder that has a question associated
with it. In that way a Dynamic Folder acts like a database query in collecting
pointers to interesting documents and like a folder in that it can be closed and
opened at different times.

A Dynamic Folder's question or "charter" acts as instructions to an active
agent as to what should be put in the folder. This charter gives the folder a
mission to keep itself full of appropriate pointers to files or documents. This
charter might be as simple as "all files on my personal disk that have a .c suffix",
or "all mail received in the last day".

In some circumstances, it is important for a Dynamic Folder to contain
pointers to a part of a file rather than to an entire file. Treating parts of files as
first class documents is important in systems that group many independent
documents in one file, as is often done with e-mail or news articles. In this way,
"documents" and "files" are slightly different.

1 Salton, Gerald. Introduction to Modern Information Retrieval, McGraw Hill. 1989.

2 NeXT calls theirs the Digital Librarian, and Lotus calls theirs Magellan.

A Dynamic Folder's contents will change when the charter has changed, at
fixed intervals, or when external events happen. The user interface should
indicate how current the folder is if it does not always appear up to date. Ideally,
when a user changes the charter of a Dynamic Folder, the contents would reflect
this instantly. This is possible for local searches and some remote searches.
Sometimes, however, changes in the available documents can not be reflected
immediately. This is the case when indexing the contents of new files can take a
while and is done in the background. Some folders should be updated periodically
to reflect new documents in remote databases. For example, a folder that uses the
New York Times should be rechecked every day for new articles. Other updates to
folders could be triggered by events, such as a new document being
stored on the local disk. Such an event could cause each folder to check whether
that file belongs in its contents.
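
One way to picture a Dynamic Folder is as a saved question plus a list of pointers that is recomputed when the charter changes or when a new document arrives. The following is a rough sketch under that assumption; the class name `DynamicFolder`, its methods, and the in-memory `store` dictionary standing in for the local disk are all hypothetical and are not part of any WAIS software.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DynamicFolder:
    """A folder whose contents are chosen by a charter (a saved question)."""
    name: str
    charter: Callable[[str, str], bool]   # (document name, text) -> belongs?
    pointers: List[str] = field(default_factory=list)  # references, not copies

    def refresh(self, store: Dict[str, str]) -> None:
        """Re-evaluate the charter against every document in the store."""
        self.pointers = sorted(name for name, text in store.items()
                               if self.charter(name, text))

    def set_charter(self, charter, store) -> None:
        """Changing the charter updates the contents right away."""
        self.charter = charter
        self.refresh(store)

    def on_new_document(self, name: str, text: str) -> None:
        """Event-driven update: check one newly stored document."""
        if self.charter(name, text) and name not in self.pointers:
            self.pointers.append(name)

# Hypothetical stand-in for the local disk.
store = {"main.c": "int main() {}", "notes.txt": "shopping list"}

# Charter: "all files on my personal disk that have a .c suffix".
c_files = DynamicFolder("C sources", lambda name, text: name.endswith(".c"))
c_files.refresh(store)
c_files.on_new_document("util.c", "void helper() {}")
c_files.on_new_document("todo.txt", "call the bank")
```

Periodic rechecking of a remote source would simply call `refresh` on a timer; that part is omitted here.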

C. Using Information Servers

Information servers sit on a network and answer questions. A server,
whether local or remote, has some database that can be queried and retrieved
from. These servers can be easily accessed by a workstation over a network with a
standard protocol (see the Protocol section) using the Content Navigation tool to
state queries and the Dynamic Folders to hold and coordinate the responses. In
this way, a user's sources of information can be seamlessly expanded past the
contents of the workstation without an extra conceptual burden on the user. Part
of the "charter" of a Dynamic Folder, then, is the servers that it should use. This
combination of tools extends the reach of the user while maintaining a consistent
view of information. The capabilities of the servers will be discussed more in the
server section, but it is important to see at this point that the workstation can be
negotiating with a large number of local and remote servers.

D. Other User Interface Possibilities

The "Dynamic Folder" is just one way to portray the results of a question.
Other visual and aural possibilities have been suggested, including ones drawn from
newspapers, books, library shelves, and sound recordings. This section touches
on these possibilities.

Presenting information in newspaper format has been tried at the MIT
Media Lab (NewsPeek). This approach shows not only a one-line headline, but
also the writer, date, place, and first few paragraphs of the article. This format
expresses importance by the size of the headline typeface, the organization of the
articles on the page, and the amount of text included on the first page.
Advertisements also have a place in such a presentation.

Borrowing from e-mail programs, listing the possibilities in order of
importance has been the technique used by Thinking Machines and NeXT for
displaying candidates. Selecting an article brought the text to another window.
This interface style allows the user to mark "good" documents to further refine the
question. This approach is closely related to the Babyl, Rmail, and Zmail mail
handler programs (ref?).

Showing the source of documents geographically was suggested by Tom
Erikson of Apple. In this approach, a world map can be used to show areas of
interest. This might be a good way to initiate browsing if geographical relevance
is an important factor to the user. The number of articles concerning or
originating from an area can be displayed conveniently.

Presenting documents like books on a shelf is a familiar metaphor to
librarians. Information about the age of the book, how frequently it has been
used, its size, if it is a picture book or monograph or pamphlet, when it was
published (by the age of the font) are easily gathered with this presentation.
Grabbing a book and looking at it, or looking on the shelves close by are natural
reactions in this metaphor. I do not know of any attempts to display information
in this way.

Generating a recording of a person reading the top articles can be useful for
commuters. With simple skip forward and back capabilities, this might be an
effective way to deliver a custom newspaper to someone driving a car. This ideally
would be done with a CD player, but a cassette could be used.

The Dynamic Folder is just one possible presentation idea. This area will
be an interesting area for research and prototypes.

E. Advantages of Remote and Local Filtering

When a user subscribes to a remote server, the user can get a complete copy
of the database unfiltered, or can instruct the server to filter the documents
remotely. Printed newspapers are delivered whole whether all of it is relevant or
not. With electronic distribution, one can imagine a user asking for all sports
articles but not the business articles. A query is a form of filter that works at the
server. A broad query will retrieve a large number of documents that can be
further filtered on the personal workstation. The system and protocols can
handle filtering at either or both ends.

Local filtering can be done by the content navigation on the local disk after the
documents have been retrieved. The quality of this filtering will depend on the
quality of the content navigator on the local workstation. The filtering might be
able to use knowledge about the user that is impractical to deliver to a server.
Local filtering gives the user the most flexibility, but it could entail too much
communication or too much disk space. How much filtering will be done on the
local workstation has tradeoffs that must be made on a server-by-server basis. If
the filtering is done locally, then the workstation might have a subscription to a
server that periodically retrieves the newest articles.

Remote filtering can reduce the communications bandwidth as well as
possibly offer better filtering. A server can have better filtering capabilities
because it can be database specific as opposed to the workstation's navigator that
must be quite general. Remote filtering, just like an interactive query, is initiated
by using a question.

As communications, storage, and local computation costs change relative
to each other, different filtering structures might make sense.

F. Local Caching of Documents

Documents that have been retrieved from a server are stored locally on the
personal workstation in a cache. A cache is a computer architecture term
meaning fast, short term storage that helps speed up access by remembering
commonly used entries. In this context, a cache would store documents that the
user has seen or might want to see so that access to those documents would be
faster and easier. A fundamental property of computer caches is that the use of
the cache only makes access faster rather than changing any functionality. In
certain circumstances, it might be useful to relax this constraint, but this will be
seen below. Most interactive queries will only use the cache and local files
because the cache will be up-to-date on its information subscriptions. The cache
is very important to make queries interactive even though data may have come
from remote servers.

The document cache would be stored locally but is shared between all
Dynamic Folders. In this way, an article retrieved for one reason could be used in
another folder without requiring two copies. A central repository would have to be
managed carefully to keep the most relevant articles but not to overload the
storage. A quota might be allocated to the cache, and a cache manager would
make decisions about what should stay and what should go. Sometimes the user
should be consulted, and other times it can be done automatically. The cache
manager should keep header information on each document in the cache,
such as:

(1) what server the document came from,

(2) how big it is,

(3) if it was looked at by the user,

(4) when it was retrieved,

(5) what folders point to it,

(6) if the user asked to keep it permanently,

(7) what the user thought about it,

(8) how hard is it to retrieve it again,

(9) how to retrieve it again, if at all.

If a document has been deleted from the cache, but it is still being
referenced by a Dynamic Folder, the header information should be preserved
enough to be able to retrieve the document again. In this way, deleting a
document is not a catastrophe.
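
The header list above maps naturally onto a record type, and the rule that headers outlive deleted documents can be shown with a small cache manager. This is a sketch only: the names `CacheHeader` and `DocumentCache`, the field types, and the oldest-first eviction policy are assumptions made for illustration; the paper does not prescribe a policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class CacheHeader:
    """Header kept for each cached document (fields follow the list above)."""
    server: str                          # (1) server the document came from
    size: int                            # (2) how big it is (assumed: bytes)
    viewed: bool                         # (3) whether the user looked at it
    retrieved_at: float                  # (4) when retrieved (assumed: timestamp)
    folders: List[str] = field(default_factory=list)  # (5) folders pointing to it
    keep: bool = False                   # (6) user asked to keep it permanently
    rating: Optional[int] = None         # (7) what the user thought about it
    retrieval_cost: float = 0.0          # (8) how hard it is to retrieve again
    retrieval_key: Optional[str] = None  # (9) how to retrieve it again, if at all

class DocumentCache:
    """Holds document bodies under a quota; headers are never discarded."""

    def __init__(self, quota: int):
        self.quota = quota
        self.headers: Dict[str, CacheHeader] = {}
        self.bodies: Dict[str, str] = {}

    def used(self) -> int:
        """Total size of the documents whose bodies are still cached."""
        return sum(self.headers[doc_id].size for doc_id in self.bodies)

    def add(self, doc_id: str, header: CacheHeader, body: str) -> None:
        self.headers[doc_id] = header
        self.bodies[doc_id] = body
        self.evict()

    def evict(self) -> None:
        """Drop bodies, oldest first, until the cache fits its quota.

        Assumed policy: never drop a document the user asked to keep or
        one that cannot be retrieved again.
        """
        candidates = sorted(
            (d for d in self.bodies
             if not self.headers[d].keep
             and self.headers[d].retrieval_key is not None),
            key=lambda d: self.headers[d].retrieved_at)
        for doc_id in candidates:
            if self.used() <= self.quota:
                break
            del self.bodies[doc_id]   # the header stays behind

cache = DocumentCache(quota=100)
cache.add("a", CacheHeader("news", 60, True, 1.0, retrieval_key="news:a"), "text a")
cache.add("b", CacheHeader("news", 60, False, 2.0, retrieval_key="news:b"), "text b")
```

After the second `add`, the body of document "a" is gone but its header, including the key needed to fetch it again, is still available to any Dynamic Folder that points to it.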

Since a cache can hold many of the articles seen by a user, the cache is
useful in retrieving documents based on "I read an article once
about..." (In a study of library users of scientific journals, about 60% of the
articles read were found by browsing, and about 30% were found by remembering
having seen them before and wanting to know more.) Supporting this type of
question is important for a WAIS interface. The cache can help here by storing
all the documents that the user has read. If the cache cannot store all of them
then it can be instructed as to what type of documents it should keep on hand.

G. Local Scoring of Competing Servers

Since a Dynamic Folder can get its data from many servers, it must merge
this data and present it in a meaningful way to the user. While servers that rate
other servers can help determine which server's answers should be valued (see
the "Servers that Rate other Servers" section), these servers only rate the server
as a whole and not the individual documents. Furthermore, an article could be very
good, just not appropriate to the question. One way to order the responses
presented to the user could be based on a "score" that is assigned to each response
by the server. Each server might, for instance, judge the appropriateness of its
response to the question on a scale of 1-10. These lists from multiple sources could
be merged in that order (weighted by the ratings of the servers) and presented to
the user. Unfortunately, since a server would want its data to be used, it has every
incentive to rate all articles at 10. Thus, determining how much to trust the
server's scores will improve the selection of documents presented to the user.

One possible solution to this problem is to have local scores for servers to
augment what the server says. Therefore, if a server always says "this answer is
worth 10" and the user never finds it useful, then the personal workstation can
lower the trustworthiness of that server's estimation of itself. Saying 10 all the
time is the equivalent of crying wolf; if a server does it too often, then users will stop
listening. In such a scenario, then, all responses from that server could be
degraded by 30% before they are merged with the other databases' responses.
On the other hand, other databases may underrate themselves and should be
boosted.
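
A small sketch may make the merging step concrete. It assumes each server returns headlines with its own 1-10 score and that the workstation keeps one multiplier per server (0.7 for a server degraded by 30%). The function names `merge_responses` and `update_trust`, the threshold of 8 for a "high" claim, and the step size are hypothetical choices for illustration, not part of the design described here.

```python
def merge_responses(responses, trust):
    """Merge scored headlines from several servers into one ordered list.

    `responses` maps a server name to a list of (headline, score) pairs,
    where score is the server's own 1-10 estimate. `trust` maps a server
    name to a local multiplier; servers not listed get 1.0.
    """
    merged = []
    for server, hits in responses.items():
        factor = trust.get(server, 1.0)
        for headline, score in hits:
            merged.append((score * factor, server, headline))
    # Highest adjusted score first.
    merged.sort(key=lambda item: -item[0])
    return merged

def update_trust(trust, server, claimed_score, user_found_useful, step=0.1):
    """Nudge a server's multiplier after the user judges one response."""
    factor = trust.get(server, 1.0)
    claimed_high = claimed_score >= 8      # assumed threshold
    if claimed_high and not user_found_useful:
        factor -= step                     # the server cried wolf
    elif not claimed_high and user_found_useful:
        factor += step                     # the server underrated itself
    trust[server] = round(max(0.0, factor), 2)
    return trust[server]

# Hypothetical servers and headlines.
trust = {"boastful": 0.7}                  # degraded by 30%
responses = {
    "boastful": [("Headline A", 10)],
    "modest": [("Headline B", 8)],
}
merged = merge_responses(responses, trust)
order = [headline for _, _, headline in merged]
```

With these numbers the modest server's article is listed first even though the boastful server claimed a higher score.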

This local scoring can be used to indicate a user's satisfaction with a
database and could be used by others to help in rating it. Further, this local score
could be used to determine if the server is worth subscribing to or keeping its
articles in the cache.

H. Budgeting the User's Time and Money

Since the user's workstation will be spending the user's money to contact
some servers, a system of accounting and budgeting must be installed so that
users get the most value for their money. The trade-offs of time and money can be
tricky to try to represent, so a simple system should be attempted first.

The underlying premise is that the computer knows how much it costs to
use different services. This can be easy if a service charges for connect time. If a
service is reached with a long distance phone call, however, this rate could be
difficult to determine. (Maybe a server should be set up that knows how much the
phone companies charge for different calls.) Further, if a server charges based on
the question, there must be a way in the protocol to limit the amount spent.

Some queries must be answered quickly or they are of no use. Working this
into the interface can be tricky.

Ideas towards automatic budgeting are still quite primitive. They involve
global limits per month, or limits per Dynamic Folder, etc. Should the
workstation enforce the limits? Who can override the limits? We need ideas on
this one.
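
As a starting point for discussion, the two kinds of limit mentioned above (a global monthly limit and a limit per Dynamic Folder) could be tracked as follows. This sketch assumes one possible answer to the open questions, namely that the workstation enforces the limits and the user may explicitly override them; the class name `Budget` and its methods are hypothetical.

```python
class Budget:
    """Tracks spending against a global monthly limit and per-folder limits."""

    def __init__(self, monthly_limit, folder_limits=None):
        self.monthly_limit = monthly_limit
        self.folder_limits = dict(folder_limits or {})
        self.spent_total = 0.0
        self.spent_by_folder = {}

    def can_spend(self, folder, cost):
        """True if the charge fits under both the folder and global limits."""
        folder_spent = self.spent_by_folder.get(folder, 0.0)
        # A folder without its own limit is bounded only by the global one.
        folder_limit = self.folder_limits.get(folder, self.monthly_limit)
        return (self.spent_total + cost <= self.monthly_limit
                and folder_spent + cost <= folder_limit)

    def charge(self, folder, cost, override=False):
        """Record a charge; refuse it when over budget unless overridden."""
        if not override and not self.can_spend(folder, cost):
            return False
        self.spent_total += cost
        self.spent_by_folder[folder] = self.spent_by_folder.get(folder, 0.0) + cost
        return True

budget = Budget(monthly_limit=20.0, folder_limits={"News": 5.0})
first = budget.charge("News", 4.0)                   # fits under both limits
second = budget.charge("News", 2.0)                  # would exceed the folder limit
third = budget.charge("News", 2.0, override=True)    # user override
```

Trading money against response time is not modeled here.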

III. The Server's Role in WAIS

Servers sit on networks and answer questions. Successful servers will have
some expertise or service that others find useful whether it is primary
information, information about other servers, or a service. A file server, a
printer, and a human travel agent can all be viewed as forms of servers. This
section describes how servers might be used in a Wide Area Information Servers
system.

A. Probing Information Servers

Finding documents (or more generally, information) on one's personal disk
is important, but finding relevant information on remote systems would extend
the usefulness of personal computers. Currently, most remote database accesses
are not integrated with the workstation model; they use a "glass terminal" interface
which does not use the power of the workstation. Some servers look like
extensions of the file system and do integrate naturally (such as Sun NFS and
AppleShare) but do not provide ways to find documents based on content. One of the
major goals of the WAIS project is to integrate wide area requests in a natural
way with local area requests. This section will describe how different information
servers could be integrated into this model.

Using the Dynamic Folder, the user creates lasting questions that can
collect answers over time from a variety of sources. The charter of a Dynamic
Folder includes what sources should be used, which might include the local disk,
local special purpose information servers (such as dictionaries etc), AppleShare
file servers, and remote databases or WAIS (see the Examples of Information
Servers section).

A wide area information server is a computer which provides information
on a particular theme to other computers. Servers sit on a network, such as the
phone system, the Internet, or X.25, and accept connections from other servers or
users in order to answer questions in a standard format.

Each information server can be queried at the time the charter is updated,
or it can be periodically polled for new information. Newspaper servers, for
instance, should be polled to find new articles, while dictionary servers should
only be queried once because repeatedly asking the same question is pointless.
Thus, the user's workstation keeps information about each server.
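
The difference between servers that are polled and servers that are asked once can be captured in the per-server information the workstation keeps. The sketch below assumes a record with an optional polling interval; the names `ServerRecord` and `needs_contact` and the use of seconds are hypothetical. A real workstation would presumably track "asked once" per question rather than per server.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ServerRecord:
    """What the workstation remembers about one server (illustrative only)."""
    name: str
    poll_interval: Optional[float]        # seconds between polls; None = ask once
    last_contact: Optional[float] = None  # time of last contact, if any

def needs_contact(record: ServerRecord, now: float) -> bool:
    """Decide whether the server should be queried at time `now`."""
    if record.last_contact is None:
        return True                       # never asked yet
    if record.poll_interval is None:
        return False                      # ask-once server, already asked
    return now - record.last_contact >= record.poll_interval

# A newspaper server polled daily and a dictionary server asked once.
newspaper = ServerRecord("newspaper", poll_interval=86400.0, last_contact=0.0)
dictionary = ServerRecord("dictionary", poll_interval=None, last_contact=0.0)
```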

While a map, a spread sheet, an airline ticket, or music might be the
appropriate reply to a specific query, the initial question is stated in English. A
charter (or question) about "Beethoven's choral works" might result in an article
from the encyclopedia server, a schedule of concerts from the newspaper server,
and recordings from a music server. Depending on the networks used, some
responses might be impractical to retrieve, but the architecture allows for any
type of information exchange.

A Dynamic Folder can also be used as an information server to other
workstations. This simple form of server can enable others to share information
easily. This capability should be put into the user interface to encourage people to
exchange information. A Dynamic Folder could be "exported" or made available
to those that know about it, or "advertised" by adding it to a directory of services. If
it is entered into a directory (which is just another information server) then an
English description of the folder should be included.

An information server is probed by putting it in the sources section of the
folder's charter. These servers can be varied in size, content, and location. Using
content navigation and Dynamic Folders we have a metaphor for accessing
many types of information servers.

B. Examples of Information Servers

Information servers, in the broadest sense, answer questions on a
particular subject on some network. Electronic networks have been used for years
to distribute information in this way. Some of the servers that are available on
local area networks have been:

File serving

Printers

Compute servers (such as supercomputers)

FAX

Mail services and archives

Bboard services

Modem pools

Shared databases

Text searching and automatic indexing

CD-ROM servers

Conferencing

Dictionary lookup

User's locations (finger)

Scanners/OCR

35mm Slide output

Wide area networks open up possibilities for other services. Some
services will be offered because they are expensive to offer on a local basis, because
they require some special expertise or machinery, or because they are used infrequently
on a local basis. Examples of wide area services that could be offered:

Current newspapers and periodicals

Movie and TV schedules with reviews

Bulletin boards and chat lines

Archive searching through public databases

Hobby specific information (e.g. sports scores or newsletters)

Mail order shopping services

Banking services

Talk services, bboard, and party line styles

Directory information (both online sources and Yellow Pages)

Scientific papers

Government databases, such as patents, congressional record, and laws.

Library catalogs (e.g. OCLC)

Weather predictions and maps

Usenet and Arpanet articles

Maps with driving directions included

Software distribution

Remote conferencing

Voice mail

Music and video archives

Pizza ordering

What services will be popular or commercially successful can only be
guessed.

C. Navigating through the "Directory of Services"

The Directory of Services is an information server that maintains a database of
available servers and how they are contacted. Like the white pages of the phone
system, the directory should be easy and cheap to use and include everyone.
Equally important, this directory should be easy to add to. Thus, people with something
interesting to offer are encouraged to add their service to the directory.

A directory entry, however, should give enough information to understand
what the service is and how to connect to it. This entry is similar to a yellow-
pages entry in the phone book since the goal is to advertise the service. A directory
entry includes:

(1) Description of server in English,

(2) the parent server if it is a subsidiary of a larger server,

(3) related servers,

(4) public encryption key,

(5) contact information including networks and contact points, and

(6) cost information.

A local workstation would keep extra information such as:

(1) locally determined "score" reflecting usefulness,

(2) subscription information (if any),

(3) user comments, and

(4) time of last contact.

This information would be used to help determine when and if the server should
be contacted, and how the responses should be handled.

Navigating in the sea of servers to find new servers can be done using the
content navigation technique. In this way a question on classical music would
retrieve documents as well as directory entries. This could be done by storing the
directory entries on the local disk (in the cache) and accessing them just like local
documents based on the appropriateness of the description. Thus retrieving the
document would show all the directory information. In that way, a user who is
unaware of a certain server would be presented with a description of that server
along with a listing of its hits for the current question, so that s/he could
effectively evaluate the potential value of the server. If the server is added to the
list of servers for that viewer, then it would be queried in the future.
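
The idea of treating cached directory entries like any other local document can be sketched in a few lines. The scoring below is a plain count of shared words, used only as a stand-in for content navigation; the function names and the tuple layout are assumptions for this example.

```python
import re
from typing import List, Tuple


def words(text: str) -> set:
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank(question: str, items: List[Tuple[str, str, str]]) -> List[Tuple[int, str, str]]:
    """Rank (kind, title, text) items by how many question words they share.

    kind is "document" or "directory-entry"; both are scored identically,
    so a relevant server description can surface next to ordinary documents.
    Items sharing no words with the question are dropped.
    """
    q = words(question)
    scored = [(len(q & words(text)), kind, title) for kind, title, text in items]
    scored = [s for s in scored if s[0] > 0]
    # Highest overlap first; ties broken by title for a stable order.
    return sorted(scored, key=lambda s: (-s[0], s[2]))
```

With a cache holding two documents and one directory entry describing a classical music server, a question on classical music would return the directory entry alongside the matching document.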

Maintaining an up-to-date list of services in the cache naturally falls out of
the content navigation and Dynamic Folders model because a directory of services
viewer would have the charter to keep itself up-to-date on directory changes, and
can be probed using content navigation. The directory of services viewer would
list the remote directory server or servers in the sources slot. That way, the
directory is kept locally and is fast to access.

Cost and availability information can help guide the workstation to alert its
user to new choices of databases. If a new server appears in the directory that is
cheaper than the current server, then it could be suggested as an alternative
server. This can be complicated to do well, but the benefits of not having the user
cull through new directory listings can warrant work in this direction. As
Stewart Brand said, "One of the problems with a market based system is that you
are always shopping!" Hopefully, the workstation can do some of the mindless
part of comparing servers.

Directories are classically owned and serviced by the communications
companies. In this role, the communications company is an unbiased party that
profits from the use of the system as a whole. Further, communications
companies generally take on a teaching role to get users familiar with the system
and aid those with problems. This has been true with AT&T with the telephone,
the different phone companies with the 900 numbers, and the Network
Information Center for the Arpanet. Whether the communications companies
take over this role or not, the directory must be supported by some organization or
organizations that profit from the use of the system.

D. Servers that Rate other Servers

With a large number of servers, it would be nice to know which ones are
sponsored by crooks, and which ones are gems. The directory of information
servers necessarily accepts all applications for inclusion, just as the white pages
do. Unlike the white pages, however, the directory includes a description (or
advertisement) of the server, which can be misleading, with the result that users
are charged for contacting fraudulent servers. Some protection can be offered by
independent servers that rate or grade other servers. These servers can serve somewhat the
same roles as Consumer Reports, Better Business Bureau, and movie reviewers.
This section describes what rating services might do within the WAIS system.

Just as people use movie reviewers to help them select what movies to see,
rating services can help in the selection of quality servers. Servers that provide
"grades" or reviews of other servers will become useful as the number of servers
grows. These ratings can come in many forms, such as a numeric grade,
formatted reviews that can be used with filters, or a free-form discussion.
Thresholds can be used by different users to ensure that a server is proven before
it is used. This threshold might best be used in conjunction with the cost so that
even worthless but free databases might be tried.
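
One way to read "threshold in conjunction with cost" is sketched below. The numeric scale (grades from 0 to 1), the default threshold, and the function name are assumptions; a real workstation might scale the required grade with the price instead of using a single cutoff.

```python
from typing import Optional


def worth_trying(rating: Optional[float], cost: float,
                 min_rating: float = 0.6, free_cost: float = 0.0) -> bool:
    """Decide whether to query a server.

    rating: grade in [0, 1] from a rating service, or None if unrated.
    cost:   price per query in dollars.
    Free servers are always tried; paid servers must be proven first.
    """
    if cost <= free_cost:
        return True
    return rating is not None and rating >= min_rating
```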

These rating services can come from professional servers or from friends.
A user does not have to subscribe to just one rating service, since a combination
might be more useful. Combining information from multiple ratings is an
interesting topic for exploration.

Creating the ratings server with personal ratings could also be automated
somewhat, since each user's workstation keeps track of how frequently a server
has been found useful. This information, or any other, can be exported so that
other people can select servers that are commonly used.

Numeric ratings of servers can be merged into the user interface by helping
order the documents suggested to the user. Therefore, for some user, articles
from the Wall Street Journal might get better scores than a similar article in the
People's Enquirer. This information could also be displayed by the color of the
headline, for instance, so that unrated services would not be overly penalized.
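
A minimal sketch of folding numeric server ratings into document ordering follows. Multiplying relevance by a per-server weight, and giving unrated servers a neutral weight of 1.0, are assumptions chosen for illustration; the weights in the usage note are invented.

```python
from typing import Dict, List, Tuple


def order_documents(hits: List[Tuple[str, str, float]],
                    ratings: Dict[str, float],
                    neutral: float = 1.0) -> List[str]:
    """Order (headline, server, relevance) hits by relevance * server rating.

    Servers missing from `ratings` get the neutral weight, so they are
    neither boosted nor penalized.
    """
    def adjusted(hit: Tuple[str, str, float]) -> float:
        headline, server, relevance = hit
        return relevance * ratings.get(server, neutral)

    return [h[0] for h in sorted(hits, key=adjusted, reverse=True)]
```

For a user who weights the Wall Street Journal at 1.3 and the People's Enquirer at 0.6, a slightly less relevant Journal article would be listed ahead of the Enquirer article, while an article from an unrated server keeps its raw relevance.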

Just as moviegoers start to trust a reviewer who has agreed with them on
past movies, users will trust rating services that they agree with. Selecting a
rating service based on this criterion can have some interesting effects. The rating
services that a user has agreed with the most will single themselves out
automatically. Users with similar tastes would then find each other. With such
an arrangement, one could be led to find other servers just because other users
have liked them, whether they are logically related to the common servers or not. This is
an automated form of the "if you like this book, then you will like this other book"
system. Further, if two users like many of the same things, then they might want
to meet.
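
Agreement between a user and a rating service could be measured in many ways. The sketch below assumes grades between 0 and 1 and uses the mean closeness over the servers both parties have graded; the measure and the function names are assumptions, not a proposal from the WAIS design.

```python
from typing import Dict


def agreement(user: Dict[str, float], service: Dict[str, float]) -> float:
    """Mean closeness (1 - absolute difference) over servers both have graded.

    Grades are assumed to lie in [0, 1]. Returns 0.0 when there is no overlap.
    """
    common = user.keys() & service.keys()
    if not common:
        return 0.0
    return sum(1.0 - abs(user[s] - service[s]) for s in common) / len(common)


def best_service(user: Dict[str, float],
                 services: Dict[str, Dict[str, float]]) -> str:
    """Name of the rating service that has agreed most with the user so far."""
    return max(services, key=lambda name: agreement(user, services[name]))
```

The same `agreement` function applies unchanged to two users' exported grades, which is one way users with similar tastes could find each other.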

A generation of server speculators can also arise. Since servers are paid
based on people using them, a ratings server will want people to use it often.
If agreeing with users' past evaluations is a criterion for using a ratings service,
then predicting what people will like will be a lucrative business. If a server
turns out to be right, then it will be used more. This type of speculation is closely
related to the stock market advisers that have become notable of late. A difference
would be that this form of speculation is trying to predict what will be interesting
to people.

E. The Role of Editors

One of the conclusions from the NewsPeek personal newspaper project at
MIT (I hear) was that editors still had a place in the electronic age by reviewing
and selecting certain articles as important. Unlike the rating services, an editor
grades specific articles on whether they are important. These grades are similar
in many ways to the rating services and might be able to be merged.

A Dynamic Folder might have a charter like: "any article from the front
page of the New York Times" which is a command to use what the editor suggests
the top articles are. Like the rating services, this can be independent of the
sources of the articles and combine the information from multiple sources.

A form of editor server would arise if users kept track of their favorite articles,
put them in a Dynamic Folder, and exported it for others. This way, many
favorite servers might emerge and articles could be selected based on friends'
suggestions.

Automatically figuring out what the user thought of a document is tricky.
Clues as to what the user thought of it are:

(1) how many folders point to it,

(2) if the user read it, how much of it, and for how long,

(3) has the user ever taken any information from it to be used in other
documents,

(4) has the user ever referenced it.

This type of information could greatly improve users' ability to deal with the flood
of available information. Furthermore, throwing away all the thoughts a user
has about a document denies others the benefit of that mental effort.
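
The four clues listed above could be combined into a single interest score. The sketch below is only one guess at how: the weights, the five-minute cap on reading time, and all names are arbitrary assumptions that a real workstation would have to tune.

```python
from dataclasses import dataclass


@dataclass
class Clues:
    folders_pointing: int      # (1) how many folders point to the document
    fraction_read: float       # (2) share of the document read, 0.0 to 1.0
    seconds_reading: float     # (2) time spent reading
    times_excerpted: int       # (3) times material was taken for other documents
    times_referenced: int      # (4) times the user referenced it


def interest(c: Clues) -> float:
    """Combine the clues into one number; the weights are arbitrary guesses."""
    # Reading counts for more when more was read, capped at five minutes.
    reading = c.fraction_read * min(c.seconds_reading / 60.0, 5.0)
    return (1.0 * c.folders_pointing
            + 1.0 * reading
            + 2.0 * c.times_excerpted
            + 2.0 * c.times_referenced)
```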

F. Markets and Hierarchies: Using Silicon Valley

Currently there are several online information providers and many online
information "brokers". Brokers (such as PC-link and CompuServe) provide the
connections between the workstations and the information providers. Sometimes
these brokers have services of their own such as electronic mail and bulletin
board services. These brokers try to provide a complete information environment by
providing access to servers. This structure forces a new information server to be
connected to many brokers to have its product used, since many users only use a
few brokers. The airline reservation program Eaasy Sabre, for example, is
available on 20 of these broker networks. The approach of WAIS is to have an
open system of interconnection between users and servers where a broker can
act as a server, but is not an all-encompassing information environment. With an
open system we have a "market" of information servers rather than a controlled
environment or a "hierarchy" (see footnote 1). Such a structure could open up the
field to many more servers and more sophisticated front-ends.

A market based approach would only standardize on the interchange
formats leaving different companies free to store and service queries in any way
deemed efficient. The user interfaces, similarly, are free to evolve to fit users'
needs. Since the protocol is not "terminal oriented" (as most systems are today), it
frees the computers on either side to be sophisticated in serving the user.

Rapid evolution of a technology can happen in a market system if the
structure is designed well. As long as the protocols are flexible enough to start
with, and a procedure for changing the protocol is established, then the
components will evolve independently by companies seeking to gain a competitive
edge.

1 Malone, Thomas. Electronic Markets and Electronic Hierarchies, CACM June 1987 ***Check this.

Silicon Valley is an example of a market based system that led to rapid
evolution of hardware in the 1970's and software in the 1980's. As the needs of the
customers became understood and defined, larger companies that had good
marketing and service reputations could make the profitable components without
the help of the plethora of small companies. Information serving is an innately
niche-based market given the diverse information needs of the population.
Furthermore, the industry is more like a service industry than a manufacturing
one because of the continual need for updates and new information. For these
reasons, the Silicon Valley structure can help in the rapid evolution of this market.

The key is to have enough users to make the servers profitable. Since
small companies cannot wait long before investment turns to profit, achieving
early income is important to get the system started. A "critical mass" of users
might form if the first interfaces were inexpensive or free, and a few useful
servers were available.

G. How Server Companies Can Make Money

If the WAIS system is to take off, then server companies must be able to
make money. Companies that offer servers can make money by billing users
directly, using credit cards, or by using 900 numbers to have the phone system bill
the users. Direct billing is difficult to set up and can be expensive to operate, but
large providers might want to do this. Credit card billing has been a popular choice
for information providers. This enables any network to connect the user to the
server and then the user is charged for use of the server. Typically, the first
transaction with a server is a negotiation of how payment will occur and the
allocation of a password for future transactions. This could be automated in the
WAIS system so that the workstation could know how much the costs will be and
keep a total of everything spent. A risk with the credit card system is that a credit
card number in the hands of a crook can enable him to make fraudulent charges.
With the potentially large number of WAIS systems, this might prove dangerous.
Ratings services might be able to help weed out the fraudulent information
providers (if any).

Another approach is to use a phone company service over 900 numbers.
When a company is assigned one of these numbers, callers are charged per
minute of phone conversation and these charges appear on the phone bill every
month. Typically the phone company gets 50% of the revenue from this and the
charges range from $.10 to $2 per minute (PacBell gets $.25 for the first minute
and $.20 thereafter). This approach eliminates the need to have a negotiation of
credit card information and limits some of the risks of disclosing a credit card
number. On the other hand, the charge for billing is high. Another limitation is
that one must use the phone system to connect with the server.
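
To make the cost of 900-number billing concrete, the arithmetic can be written out. The sketch below works in whole cents and covers two cases from the text: the carrier keeping a fixed share (50% is the typical figure quoted) and a first-minute/later-minute schedule. Reading the PacBell figures as the carrier's cut per minute is an assumption, as are the function names.

```python
def provider_revenue_cents(minutes: int, rate_cents: int,
                           carrier_share: float = 0.5) -> int:
    """Provider's revenue when the carrier keeps a fixed share of the charge."""
    total = minutes * rate_cents
    return total - round(total * carrier_share)


def carrier_fee_cents(minutes: int, first_cents: int = 25,
                      later_cents: int = 20) -> int:
    """Carrier fee under a first-minute / later-minutes schedule."""
    if minutes <= 0:
        return 0
    return first_cents + (minutes - 1) * later_cents
```

For a five-minute session billed at $1.00 per minute, the caller pays $5.00. Under a 50% share the provider keeps $2.50; under the per-minute schedule the carrier takes $0.25 + 4 x $0.20 = $1.05 and the provider keeps $3.95.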

In any case, there is very low overhead in starting a server and earning
money. All one needs is a phone, a computer, and some desirable information.
This is crucial to the success of the system.

All methods of billing are likely to be used and should be supported by the
WAIS interfaces.

IV. The Protocol's Role in WAIS

"... they have all one language; and this is only the beginning of what
they will do; and nothing that they propose to do will now be
impossible for them"

Genesis 11:6

To connect a workstation to a server requires a communication network
and a language to talk. The communications network can be anything that
allows computers to communicate such as modems, Internet, or digital phone
networks. A protocol is the language used to relate questions and receive answers
between the workstations and servers. This section describes some of the issues
involved in this protocol.

A. Open Protocols Promote Wider Acceptance

It is important to the success of this system to have an open protocol that
allows users to connect with servers. Several models for how to create an open
standard have been tried, such as: have a company own it and license it (Adobe,
for instance), have a university develop it (X Windows, for instance), have a
standards organization bless it (Common Lisp, for instance), and simply make
the specification available and declare it open (IBM PC, for instance). Each
approach has advantages and disadvantages. The key point is that certain
attributes be adhered to.

1. The companies that are developing the protocol must be open to using
existing standards, and must not feel that new protocols should be protected.

2. A system for enhancements to the standard should be set up. Standards
committees are often used for this.

3. The standard should be able to transmit data in a variety of formats.
There are many emerging multi-media standards. A good standard will be able
to transmit these information standards.

4. The query part of the protocol should be able to accept different formats of
queries. Queries might, eventually, have multimedia expressions. These should
be free to evolve with periodic standardization.

5. The query must have some method to transmit cost restrictions and
time-outs. It should also be able to handle query forwarding while avoiding
circularities.
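
Point 5 can be sketched as a query record that carries its own limits. In the Python below, the cost restriction is a remaining budget, the time-out is approximated by a cap on forwarding depth, and circularities are avoided by carrying the set of servers that have already seen the query. All names and the in-memory "network" are assumptions for illustration; nothing here reflects an actual wire format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Query:
    text: str
    max_cost: float                  # total amount the user is willing to spend
    hops_left: int = 3               # stand-in for a time-out: forwarding depth limit
    visited: Set[str] = field(default_factory=set)  # servers that already saw it


@dataclass
class Server:
    name: str
    cost: float                      # fee charged for answering one query
    neighbors: List[str] = field(default_factory=list)


def ask(q: Query, name: str, net: Dict[str, Server]) -> Tuple[List[str], float]:
    """Return (servers that answered, money spent), forwarding to neighbors.

    A server is skipped if it has already seen the query (avoiding
    circularities), if the hop budget is exhausted, or if its fee
    would exceed what remains of the cost limit.
    """
    server = net[name]
    if name in q.visited or q.hops_left < 0 or server.cost > q.max_cost:
        return [], 0.0
    q.visited.add(name)
    answered, spent = [name], server.cost
    for nxt in server.neighbors:
        # The forwarded query shares the visited set and gets the reduced budget.
        sub = Query(q.text, q.max_cost - spent, q.hops_left - 1, q.visited)
        more, cost = ask(sub, nxt, net)
        answered += more
        spent += cost
    return answered, spent
```

Carrying the visited set inside the query is the simplest loop guard; a deployed protocol would more likely carry a query identifier that each server remembers for a while.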

An idea for a query language is to use English that is restricted by the
constructs that are understood by the servers. As systems become more
complicated, they can handle more English constructs. In this way, future server
systems can get more information from a query and produce more appropriate
responses, while simpler systems might use the words in the query without parsing
the structure of the query. This approach would allow the servers to change
without changing the human interface and the protocols. The English language
approach has been very successful for untrained users of the Dow Jones
DowQuest system.

The overall success of this system largely depends on how well these
protocols work and how they are made available. There is a standard that could
solve part of the problem: NISO Z39.50-1988. This standard can help with
connecting to servers, delivering queries, and getting responses back. It does not
specify the query language or the format of the retrieved records. Other
standards may be able to aid other communications needs.

B. Hardware Independence

Since this system depends on an open protocol rather than a particular
implementation, the workstation, servers, and communications systems can all
be made up of various hardware technologies that would evolve in time. This
independence fosters an appropriate use of all hardware pieces, and a freedom to
compete to produce the best components.

Each personal workstation platform has attributes that are appropriate to
exploit differently. These can be used to make tailored user interfaces. Further, a
competition for the best caching and selection criteria should emerge which will
hopefully settle into a good general standard. As personal workstations start to
handle audio and video, these can be retrieved with the WAIS system if the
bandwidth is available.

Nintendo, for instance, makes a home computer that connects to the
television and is installed in about 25% of all American homes. They are providing
information services to 150,000 Japanese households using this technology. This
might be an attractive front-end to a WAIS system.

The server computers will range from personal workstations to
supercomputers. Most databases are under 1 gigabyte so they can be stored and
processed with a personal workstation unless there are a very large number of
users. Supercomputers will be used in applications where there is a large
amount of data or there are a very large number of users. Supercomputers can
offer superior query handling by doing extensive work on each query.

The communications systems used should be any that are locally available.
The bandwidth requirements for text can be satisfied with current phone systems
using modems. As advances in bandwidth and connectivity emerge, such as
X.25, ISDN, and the Internet, the range of offerings from the information
providers should go up.

Since no component is centralized, this system is free to be established
anywhere in the world. Other more centralized systems, such as Minitel, have
had difficulty in expanding outside of France. This system should encourage
independent regions to set up a compatible system because of the availability of
software for servers and workstations.

C. Protecting the User's Privacy

"Electrical information devices for universal, tyrannical
womb-to-tomb surveillance are causing a very serious dilemma
between our claim to privacy and the community's need to know"

Marshall McLuhan, The Medium is the Massage

To encourage users to trust their personal machines with their data and
interests, we must be sure to protect people's sense of privacy. As machines start
to learn more about their users and start to contact other machines on their
users' behalf, the dangers to privacy are significant. There are technical as well
as legal issues involved. This section will cover the technical issues in protecting
privacy (any good ref for the legal side?).

There is no easy way to protect a personal workstation if an intruder can get
at the keyboard. Since the workstation acts on behalf of the user the potential
damage that could be done by a crook at the controls would be worse than is
currently possible. Since users will be leaving their computer on all the time so
that it can contact servers and be used by other servers, we lose the security of the
computer being off at night. One way around this might be to be able to turn off input
from the user while leaving the computer on to contact servers over the network.
If a user knows that she is never around at night or on weekends, then this profile
might help lead the system to not trust off-hour use and require a password. The
assumption so far in personal computers is that the machine stays in a secure
physical environment and all protection must be directed to network connections.
This is not a safe long term solution, and should be thought through carefully.

Other risks are involved when dealing with networks. There are problems
with intruders, spies, and forgers. An intruder will try to read, modify, or destroy
data that the user did not intend to leave accessible. Spies will watch the traffic
from a user to determine the servers contacted and the content of the messages.
A forger will copy password information to act like a different user.

Network intruders can be prevented from reading unwanted data by the
user only exporting certain Dynamic Folders to become servers for the outside
world. A question is whether we want "group" access as well as "world" access as
in the Unix file system or some other layered approach. A Dynamic Folder only
contains pointers to information. If the information is on the local disk, should
that be accessible by a remote machine? Should those files be protected from being
read? If the information came from a remote database, should the requester be
required to get it from the source even if a copy is on site? What are the copyright
issues here?

Spies can watch communications networks and collect passwords and
credit card data if this information is sent in clear text (not encrypted) as well as
read the data. A public key system makes sense in this application because the
directory information can include a key. Public key systems are those in which
anyone can lock (encrypt) a message for a recipient, but only the recipient can
read it. Presumably the public key system would be used in establishing a
connection and a special key for the conversation would be established. Current
public key systems are too compute intensive to be used for large volumes of data.
A conversation key could be used with DES or some other encryption system that
is easier to compute (usrEZ software has a product that runs at 30k
characters/second on a MacII). Adoption of such a system early in the WAIS
development would ensure that this type of protection is assumed in modern
information systems.
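
The arrangement described above, a public key used only to establish a conversation key and a cheaper cipher for the bulk data, can be shown with a toy example. Everything in this sketch is for illustration and is NOT secure: the RSA key is built from two small, well-known primes, and a SHA-256 keystream XOR stands in for DES, which the Python standard library does not provide. All names are assumptions.

```python
import hashlib
import secrets

# Toy RSA key pair from two Mersenne primes (2**31 - 1 and 2**61 - 1). NOT secure.
P, Q = 2_147_483_647, 2_305_843_009_213_693_951
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher standing in for DES: XOR with a SHA-256 keystream."""
    out = bytearray()
    for block in range(0, len(data), 32):
        pad = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[block:block + 32], pad))
    return bytes(out)


def client_send(message: bytes, server_public=(N, E)):
    """Pick a conversation key, lock it with the public key, encrypt the message."""
    n, e = server_public
    session_key = secrets.token_bytes(8)
    locked_key = pow(int.from_bytes(session_key, "big"), e, n)
    return locked_key, keystream_xor(session_key, message)


def server_receive(locked_key: int, ciphertext: bytes) -> bytes:
    """Unlock the conversation key with the private exponent, then decrypt."""
    session_key = pow(locked_key, D, N).to_bytes(8, "big")
    return keystream_xor(session_key, ciphertext)
```

Only the 8-byte conversation key passes through the expensive public key operation; the message itself, however long, is handled by the cheap cipher.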

Forgers can be foiled with a system of authentication. Authentication is
important when the charges are high or when the system is used for ordering
goods. One solution is to use a public key signature system that is easy to
implement using the public key system (ref the Public Key papers). A signature is
passed so that only the sender could have created it.
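
A signature of this kind can be sketched with textbook RSA: the sender transforms a digest of the message with the private exponent, and anyone can check it with the public key. The key below is a toy built from two small, well-known primes and is NOT secure; the function names are assumptions for this example.

```python
import hashlib

# Toy RSA key pair from two Mersenne primes (2**31 - 1 and 2**61 - 1). NOT secure.
P, Q = 2_147_483_647, 2_305_843_009_213_693_951
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message down to a number smaller than N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Only the holder of the private exponent D can compute this value."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Anyone holding the public key (N, E) can check the signature."""
    return pow(signature, E, N) == digest(message)
```

A server receiving an order would call `verify` with the public key from the sender's directory entry; an altered order no longer matches the signature.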

V. Conclusion: Why WAIS will Change the World

Historically, when the distribution of information became easier or less
expensive, an explosive growth in learning occurred. Wide area information
servers are a new way to distribute information. Since anyone with a personal
computer, a phone, and some information can be a server, people are free to
create and distribute their work in ways that paper distribution made
impractical. The current electronic databases, in general, do not have a standard
for interchange. Just as the railroads were owned and controlled by relatively few
people, current database brokers control access and hence the production of data.
The highway system was not owned by anyone and the incremental cost to start a
new business was very low. Small businesses flourished partly because of this.
WAIS systems, similarly, have very low initial costs and low distribution costs
which can pave the way to many servers in a short time.

Since the WAIS system is founded on computer-to-computer
communications, new servers that just learn from other servers and produce
useful information or analysis can become profitable. Such a server could be
thought of as "smart", and the better servers will learn from other servers and
from their own mistakes. Thus a distributed "smart" intelligence can be formed.

BBoard systems have not produced any astounding works of literature, I
suggest, because it is difficult to reference older works. If older works were easy
to find and reference, then people would be more inclined to make better entries.
Better entries would get more references and be used more. No BBoard systems,
that I know of, make this easy. Since editors, content searching, and archiving
are all fundamental parts of the WAIS architecture, we stand a better chance of
high quality works being produced.

A large server, or sage, has a role in this distributed system because it can
infer correspondences between many pieces of information. Further, large
servers will have many users that they can learn from. Users will teach a server
what is important just by using the server. Thus a large server will be the place
where great new ideas will be created based on lots of existing information. This
new form of intelligence, that is formed out of many participating people and
machines, is an exciting prospect.

VI. Related Documents

Blip Culture Hypermedia, Harry Chesley, Apple.

Catalyzing a Market of Wide Area Information Servers, Brewster Kahle.

Wide Area Information Server Demonstration, Brewster Kahle and Charlie
Bedard.

Electronic Markets and Electronic Hierarchies, Thomas Malone CACM June
1987.

Introduction to Modern Information Retrieval, Gerald Salton, Cornell. McGraw
Hill.

Parallel Free-text search on the Connection Machine, Stanfill and Kahle CACM
Dec 1986.

VII. Appendix: Comparisons to Existing Systems

There are always precedents to any system, this one included. Some are
academic and some are commercial; some are computer oriented and some are
human services; some are special purpose and some are generally useful.

A. CompuServe (of Columbus, Ohio, 1-800-848-8199) is a phone based service
with about 1000 services with 500,000 PC subscribers. It includes BBoards, hobby
services, home shopping, email, multiuser online games, etc. Interestingly, they
have contracted with the government to accept Export License Application
transactions and other user interface functions. They have "Personal
Newspaper" products and deliver data from many publishers. They own a lot of
the underlying communication system, but are afraid of ATT and Baby Bells.
They are building sophisticated user interfaces for PCs and Macs.

CompuServe is owned by H&R Block and charges by the minute. They
handle their own billing. They have recently bought most of their competitors
(The Source, Access, Software House of Cambridge, and Collier-Jackson of
Tampa, Florida) and are making a fortune. They turned a profit in 4th quarter
fiscal 1985 and by the end of fiscal 1986 recorded a profit of $1.7 million on $100
million revenues and 300,000 users.

CompuServe is the closest model and can be easily accessed with the WAIS
system. On the other hand, WAIS helps you find the database you are interested
in, does not use a terminal interface (you use your PC with all of its speed), and
WAIS offers subscriptions to services where your PC will keep itself informed
automatically. Most importantly, WAIS is not "owned" by anyone and is free to
grow independently from a centralized company.

(For more technical information I have a book of their services, Thinking
Machines has an account, and I have a series of articles describing their business
activities.)

B. Minitel in France is an outgrowth of the phone company. As an
alternative to phone books, users were offered terminals for their homes. Many
people took the terminal. By all reports it has been a very popular system. A 1986
news report said: "The directory for Minitel services is now the size of a phone
directory for a small city, evidence that Minitel is a success." George Nahon,
managing director of Intelmatique: "The need to create a market of users
emerged as a prerequisite for a service." One report speculated that France had
put about $500 million into the system by 1986.

Their interface is a terminal type interface and the servers are both human
and machine. [Europe is the most exciting continent for information services. It
seems that they take this very seriously, while the US government has yet to take
the bold steps of investment and standardization.]

C. NetLib is a free Unix utility for distributing files through electronic mail.
Anyone that has access to the servers via electronic mail can make inquiries and
file requests. This system currently has about 100 (a guess) collections world-wide
and is growing. In 1987, about 10,000 requests per month were serviced. The bulk
of the offerings are software programs rather than raw data. Since no charges
are made for queries or requests this system is used by academics and
researchers. ATT and Argonne labs are supporting this work.

The automatic reply system (remote-machine-to-local-machine rather than
remote-machine-to-local-human interface) in NetLib is similar to the WAIS
system. WAIS, however, is not centered solely around EMail as a transport layer;
it uses the phone system as well for interactive use. Also, WAIS would help find
databases that are relevant and handle the queries and requests through a more
"user friendly" interface. (For more on NetLib see Distribution of Mathematical
Software via Electronic Mail in Communications of the ACM May 1987)

D. Switzerland system

Still assessing this system.

E. Lotus and NeXT text system

Both Lotus and NeXT have text searching systems that are similar to
Thinking Machines' Dow Jones system, but are based on local data (LAN based).
Since disks hold close to 1 gigabyte these days, and the entire CM at Dow Jones
holds 1 gigabyte, we are close in scope but not performance. On the other hand, a
PC will serve its 20 users adequately and the new daily information can be
effectively distributed from Dow Jones and other places. Lotus seems to be getting
into the information distribution business and is writing software to process that
data locally.

These companies see themselves as critically involved in this area. I
believe cooperating with them is in our best interest.

F. Information Brokers

Many companies act as brokers to other information providers. Often these
services will offer electronic mail and bulletin boards. These private systems
rarely communicate with each other. The systems that I know of are listed below.
If anyone has any information on these or other companies, please tell me.

AppleLink(Personal Edition) 1-800-227-6364 getting info

Delphi 1-800-544-4005 getting info

Dialcom, Inc. 1-800-435-7342

GE Information Services 1-800-433-3683 getting info

This company services the fortune 500 companies with network and
processing services using Honeywell and IBM mainframes. They
lease lines from ATT and provide an environment for their
customers including netwoiic services and value added filtering and
massaging of data.

GEnie 1-800-638-9636 getting info

IBM Information Network 1-800-IBM-2468 ext 100

INet 2000/TravelNet 1-800-267-8480 bad number

Inet 1-800-322-INET

NWI 1-800-624-5916

Quantum Computer Services since 1985, privately held,
"multimillion dollars" official Commodore info service. Has been
supported by Commodore.

PC-link 1-800-458-8532 IBM PC product

Q-Link 1-800-392-8200 Commodore product

America online Mac product

Snet 1-800-272-SNET Dept AA

The Source 1-800-336-3366

StarText 1-817-390-7905

Travel+Plus 1-800-544-4005

US videotel 1-713-323-3000

Western Union EasyLink 1-800-779-1111 Dept 31

Minitel Services 1-914-694-6266

Omnet/SCIENCEnet 1-617-265-9230

Other systems that I would like to find out more about:

Holland system, Prodigy, Knight Ridder, Audio Tex, Airline Reservations
system, Hospital Ordering System, Verity, Personal Newspaper (Media lab),
Information Lens (Media Lab), Super-Text.

G. Hypertext

Hypertext and WAIS share many attributes for accessing textual
information. In some sense, WAIS is an attempt at a large-scale hypertext
system by allowing links to be deduced at run-time and across many databases
stored in many places. Since servers provide pointers to documents, a pointer to
a document can be put in a document and retrieved at a later time. Thus
document pointers can be thought of as a crude form of hypertext link.

This form of deducing hypertext links through content navigation might
lead to interesting paths that are tailored to a particular user. Automatic systems
will never replace the value of having users suggesting links. Suggested links
can be added directly to the documents (as in most hypertext systems) or they can
be made available in a distributed manner through the favorites databases. In
this way, users that found certain articles to be similar or usefully viewed
together can put them in a folder and export it as a database. One might ask,
"Does anyone have these documents grouped in a server, and if so, what other
documents are in that server?" These databases could then be used by others as
evidence that they belong together. By combining many people's groupings, one
can navigate through a large number of documents in potentially interesting ways
in a hypertext style.
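
The question "what other documents are grouped with this one?" reduces to counting how often documents share an exported folder. The sketch below assumes each exported folder is available as a set of document pointers; the function name and data layout are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def related_documents(doc: str, folders: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Documents that share exported folders with `doc`, most frequent first.

    folders maps a folder (server) name to the set of document pointers in it.
    """
    counts: Counter = Counter()
    for contents in folders.values():
        if doc in contents:
            counts.update(contents - {doc})
    # Sort by count (descending), then by name so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

The counts serve as the "evidence that they belong together": a document grouped with the current one by many independent users ranks above one grouped with it only once.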
