Arc File Format

Authors: Mike Burner and Brewster Kahle


Date: September 15, 1996, Version 1.0
Internet Archive

Overview

The Archive stores the data it collects in large (currently 100MB) aggregate files for ease of storage in a conventional file system. It is the Archive's experience that it is difficult to manage hundreds of millions of small files in most existing file systems.

This document describes the format of the aggregate files. The file format was designed to meet several requirements:

The file must be self-contained: it must permit the aggregated objects to be identified and unpacked without the use of a companion index file.

The format must be extensible to accommodate files retrieved via a variety of network protocols, including http, ftp, news, gopher, and mail.

The file must be "streamable": it must be possible to concatenate multiple archive files in a data stream.

Once written, a record must be viable: the integrity of the file must not depend on subsequent creation of an in-file index of the contents.

The reader will quickly recognize, however, that an external index of the contents and object-offsets will greatly enhance the retrievability of objects stored in this format. The Archive maintains such indices, but does not seek to standardize their format.

The Archive File Format

The description below uses pseudo-BNF to describe the archive file format. By convention, archive files are named with a ".arc" extension (e.g., "IA-000001.arc").


arc_file == <version_block><rest_of_arc_file>
version_block == See definition below
rest_of_arc_file == <doc>|<doc><rest_of_arc_file>
doc == <nl><URL-record><nl><network_doc>
URL-record == See definition below
network_doc == whatever the protocol returned
nl == Unix-newline-delimiter
sp == ' ' (ascii space; a comma is unsuitable as a delimiter because it can appear in a URL)

The Version Block

The version block identifies the original filename, file version, and URL record fields of the archive file.


version-block == filedesc://<path><sp><version specific data><sp><length><nl>
<version-number><sp><reserved><sp><origin-code><nl>
<URL-record-definition><nl>
<nl>

version-1-block == filedesc://<path><sp><ip_address><sp><date><sp>text/plain<sp><length><nl>
1<sp><reserved><sp><origin-code><nl>
URL<sp>IP-address<sp>Archive-date<sp>Content-type<sp>Archive-length<nl>
<nl>

version-2-block == filedesc://<path><sp><ip_address><sp><date><sp>text/plain<sp>200<sp>-<sp>-<sp>0<sp><filename><sp><length><nl>
2<sp><reserved><sp><origin-code><nl>
URL<sp>IP-address<sp>Archive-date<sp>Content-type<sp>Result-code<sp>Checksum<sp>Location<sp>Offset<sp>Filename<sp>Archive-length<nl>
<nl>


The "filedesc" line is a special-case URL record (see below). The path is the original path name of the archive file. The IP address is the address of the machine that created the archive file. The date is the date the archive file was created. The content type of "text/plain" simply refers to the remainder of the version block. The length specifies the size, in bytes, of the rest of the version block.


version-number == integer in ascii
reserved == string with no white space
origin-code == Name of gathering organization with no white space
URL-record-definition == names of fields in URL records
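
As an illustration of the layout above, the following minimal Python sketch reads a version block from the start of an archive file. The function name read_version_block and the returned dictionary keys are hypothetical, chosen only for this example. It assumes the length on the "filedesc" line counts every byte after that line's newline, and it tolerates a space inside the origin code because the example later in this document contains one.

```python
import io


def read_version_block(stream):
    """Read the version block from a binary stream positioned at byte 0."""
    header = stream.readline().decode("ascii").rstrip("\n")
    fields = header.split(" ")
    if not fields[0].startswith("filedesc://"):
        raise ValueError("missing filedesc:// line")
    # The last field is the size, in bytes, of the rest of the version block.
    length = int(fields[-1])
    body = stream.read(length).decode("ascii")
    lines = body.split("\n")
    # maxsplit=2 keeps an origin code such as "Alexa Internet" in one piece.
    version, reserved, origin = lines[0].split(" ", 2)
    return {
        "path": fields[0][len("filedesc://"):],
        "version": int(version),
        "reserved": reserved,
        "origin": origin,
        "field_names": lines[1].split(" "),
    }


sample = (
    b"filedesc://IA-001102.arc 0 19960923142103 text/plain 76\n"
    b"1 0 Alexa Internet\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
    b"\n"
)
info = read_version_block(io.BytesIO(sample))
print(info["version"], info["origin"], info["field_names"])
```

The list of field names recovered here is what tells a reader how to interpret each URL record that follows.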

The URL Record

The URL record introduces an object in the archive file. It gives the name and size of the object, as well as several pieces of metadata about its retrieval.


URL-record-v1 == <url><sp>
<ip-address><sp>
<archive-date><sp>
<content-type><sp>
<length><nl>

URL-record-v2 == <url><sp>
<ip-address><sp>
<archive-date><sp>
<content-type><sp>
<result-code><sp>
<checksum><sp>
<location><sp>
<offset><sp>
<filename><sp>
<length><nl>

url == ascii URL string (e.g., "http://www.alexa.com:80/")
ip_address == dotted-quad (e.g., 192.216.46.98 or 0.0.0.0)
archive-date == date archived
content-type == "no-type"|MIME type of data (e.g., "text/html")
length == ascii representation of size of network doc in bytes
date == YYYYMMDDhhmmss (Greenwich Mean Time)
result-code == result code or response code (e.g., 200 or 302)
checksum == ascii representation of a checksum of the data; the checksum algorithm is implementation-specific
location == "-"|url of re-direct
offset == offset in bytes from beginning of file to beginning of URL-record
filename == name of arc file

Note that all field values are ascii text. All fields have at least one character. No field value contains a space.
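
Because no field value contains a space, a URL record can be split on single spaces. The Python sketch below is one possible way to do that, not a reference implementation. The names parse_url_record, V1_FIELDS, and V2_FIELDS are hypothetical, and the sketch assumes that archive-date uses the YYYYMMDDhhmmss layout given for "date" and that the record version can be inferred from the number of fields.

```python
from datetime import datetime, timezone

V1_FIELDS = ["url", "ip_address", "archive_date", "content_type", "length"]
V2_FIELDS = ["url", "ip_address", "archive_date", "content_type", "result_code",
             "checksum", "location", "offset", "filename", "length"]


def parse_url_record(line):
    """Split one URL record line into named fields (version 1 or 2)."""
    values = line.rstrip("\n").split(" ")
    if len(values) == len(V1_FIELDS):
        names = V1_FIELDS
    elif len(values) == len(V2_FIELDS):
        names = V2_FIELDS
    else:
        raise ValueError(f"unexpected field count: {len(values)}")
    record = dict(zip(names, values))
    record["length"] = int(record["length"])
    if "offset" in record:
        record["offset"] = int(record["offset"])
    # Assumption: archive-date is YYYYMMDDhhmmss in Greenwich Mean Time.
    record["archive_date"] = datetime.strptime(
        record["archive_date"], "%Y%m%d%H%M%S"
    ).replace(tzinfo=timezone.utc)
    return record


v1 = parse_url_record(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202\n"
)
v2 = parse_url_record(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 "
    "text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202\n"
)
print(v1["length"], v2["result_code"], v2["offset"], v2["archive_date"].isoformat())
```

A more careful reader would take the field order from the URL-record-definition line in the version block rather than from the field count.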

Example of an Archive File

In the following example, please remember that length includes carriage returns and line feeds.

filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>
Hello World!!!
</HTML>

filedesc://IA-001102.arc 0.0.0.0 19960923142103 text/plain 200 - - 0 IA-001102.arc 122
2 0 Alexa Internet
URL IP-address Archive-date Content-type Result-code Checksum Location Offset Filename Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>
Hello World!!!
</HTML>
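
The version-block length and the first record's offset in the version 2 example can be checked by counting bytes. The short Python sketch below does so, assuming every record occupies a single line, fields are separated by exactly one space, and each line ends in one Unix newline. The variable names are illustrative only.

```python
# The "filedesc" line of the version 2 example, as a single line.
filedesc = (
    "filedesc://IA-001102.arc 0.0.0.0 19960923142103 text/plain "
    "200 - - 0 IA-001102.arc 122\n"
)
# The rest of the version block: version line, field names, and a blank line.
rest = (
    "2 0 Alexa Internet\n"
    "URL IP-address Archive-date Content-type Result-code Checksum "
    "Location Offset Filename Archive-length\n"
    "\n"
)

rest_length = len(rest.encode("ascii"))
first_record_offset = len(filedesc.encode("ascii")) + rest_length
print(rest_length, first_record_offset)
```

Under these assumptions the two numbers agree with the length (122) on the "filedesc" line and the offset (209) in the first URL record of the example.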

Reading an Archive File

As noted above, the best way to retrieve a specific object from an archive file is to maintain an external database of object names, the files they are located in, their offsets within the files, and the sizes of the objects. Then, to retrieve the object, one need only open the file, seek to the offset, and do a single read of <size> bytes.
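
The indexed lookup described above amounts to a seek followed by one read. A minimal Python sketch follows; the function name fetch_object is hypothetical, and the offset and size are assumed to come from an external index whose format this document does not define. The demonstration file is a stand-in, not a real archive.

```python
import os
import tempfile


def fetch_object(arc_path, offset, size):
    """Return `size` bytes starting at `offset` in the archive file."""
    with open(arc_path, "rb") as arc:
        arc.seek(offset)
        data = arc.read(size)
    if len(data) != size:
        raise ValueError("archive file is shorter than the index claims")
    return data


# Demonstration: 10 filler bytes followed by a 30-byte object.
with tempfile.NamedTemporaryFile(suffix=".arc", delete=False) as tmp:
    tmp.write(b"0123456789<HTML>\nHello World!!!\n</HTML>\n")
    demo_path = tmp.name
try:
    obj = fetch_object(demo_path, 10, 30)
finally:
    os.remove(demo_path)
print(obj)
```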

Programs that need to read the file without an index (such as to unpack the whole file) should use buffered I/O. The URL record can then be read with an fgets(), and the objects can be read with an fread() of <size> bytes.
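
The same sequential approach can be sketched in Python, where readline() and read() on a buffered binary stream play the roles of fgets() and fread(). The generator name iter_records is hypothetical. The sketch assumes the last field of every record line is the object length, treats the version block as the first record, and skips the blank line that separates one object from the next URL record. The sample object is shortened to the HTML body alone for brevity.

```python
import io


def iter_records(stream):
    """Yield (fields, body) for each record in a binary ARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of file
        if line == b"\n":
            continue                    # separator before a URL record
        fields = line.decode("ascii").rstrip("\n").split(" ")
        size = int(fields[-1])          # length is always the last field
        body = stream.read(size)
        if len(body) != size:
            raise ValueError(f"truncated object for {fields[0]}")
        yield fields, body


version_body = (
    b"1 0 Alexa Internet\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
    b"\n"
)
page = b"<HTML>\nHello World!!!\n</HTML>\n"
archive = (
    b"filedesc://IA-001102.arc 0 19960923142103 text/plain %d\n" % len(version_body)
    + version_body
    + b"http://www.dryswamp.edu:80/index.html 127.10.100.2 "
      b"19961104142103 text/html %d\n" % len(page)
    + page
    + b"\n"
)
records = list(iter_records(io.BytesIO(archive)))
for fields, body in records:
    print(fields[0], len(body))
```

Because each record carries its own length, the loop never has to inspect the object data to find the next record.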

Using the Archive Format for other URL types

Since the Archive format uses the standard URL specification to identify objects, it naturally lends itself to the storage of data retrieved via protocols other than HTTP. For example, a news article might appear as follows:

news:28SEP96.21024750@alligator.dryswamp.edu 127.10.100.3 19960929142103 text/plain 328
Path: news.alexa.com!news1.best.com!news.dryswamp.edu!joebob
From: joebob@alligator.dryswamp.edu
Newsgroups: alt.food
Subject: Re: I am hungry
Date: 28 SEP 96 21:02:47 GMT
Organization: Dry Swamp University
Lines: 1
Message-ID: <28SEP96.21024750@alligator.dryswamp.edu>
NNTP-Posting-Host: alligator.dryswamp.edu
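
Writing a record for a non-HTTP object follows the same pattern as for a web page. The Python sketch below builds a version 1 URL record and appends the object to it. The function name format_doc, the news URL, and the payload are all made up for illustration, and ending each object with a single newline as the separator before the next record is an assumption drawn from the examples above.

```python
from datetime import datetime, timezone


def format_doc(url, ip_address, archived_at, content_type, payload):
    """Return a version 1 URL record followed by the object bytes."""
    for value in (url, ip_address, content_type):
        if not value or " " in value:
            raise ValueError(f"field must be non-empty with no space: {value!r}")
    # Dates are written as YYYYMMDDhhmmss in Greenwich Mean Time.
    date = archived_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    record = f"{url} {ip_address} {date} {content_type} {len(payload)}\n"
    # Assumption: one newline after the object separates it from the next record.
    return record.encode("ascii") + payload + b"\n"


payload = b"Subject: test\n\nA made-up article body.\n"   # hypothetical object
doc = format_doc(
    "news:example@news.example",                          # hypothetical URL
    "127.10.100.3",
    datetime(1996, 9, 29, 14, 21, 3, tzinfo=timezone.utc),
    "text/plain",
    payload,
)
print(doc.decode("ascii"))
```

The length field is computed from the payload itself, so it always counts every byte of the object, including line terminators.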

For more information, please contact info@archive.org.