Poster: mark_k Date: October 06, 2003 11:16:13pm
Forum: web Subject: Broken/truncated .zip files in Wayback Machine
Hi,

Here's a problem that I've noticed when using the Wayback Machine. Basically, files which end with a zero byte are truncated.

Zip archive files normally end with a zero byte. If that last zero byte is not present, many or most archiver programs think the file is corrupted/bad.

I have downloaded several .zip archives from the Wayback Machine, and all are truncated by that one byte. This indicates that the web-crawling software used by the Internet Archive is (or was) defective.

If you have downloaded such a file, you can fix it by appending a zero byte to the end of the .zip file.


Regards,
-- Mark

 
 
Poster: Dolphin Date: December 23, 2003 10:19:39am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine
That's great news. It's hard enough as it is to get a zip file from Wayback Machine because you almost always get the cannot connect error. Now how does one append a zero byte to a zip file?

 
 
Poster: Brad Date: December 24, 2003 07:29:40am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine
Hi Mark,

Definitely possible that various crawlers that have accumulated the web archives over the years have had faults.

Another common cause for this kind of problem, and that may be what's occurring here, is that many of the crawlers stopped downloading files at 1MB, and that's all we have in the collection. Quite a calamity, but that's where we are.

If a zip file ends at exactly 1MB (or within a few bytes of it: for the web nerds, the 1MB sometimes includes the HTTP header), then this may be the problem. Generally, there are internal consistency checks (CRCs) in the zip files themselves, so if your solution of adding a null byte to the end of the file gets you a usable zip file, then that's great! Please let me know if this is the case by responding here, as we may be able to auto-detect and fix this problem as we serve the files in the future.

Thanks for posting your solution.

Brad
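
The two checks described in this post can be sketched in Python (not the Perl used elsewhere in this thread). This is only an illustration: the function names and the `SLACK` allowance for an HTTP header are assumptions, not anything the Archive is known to use. The CRC check relies on the standard `zipfile` module.

```python
import os
import zipfile

ONE_MIB = 1024 * 1024
SLACK = 1024  # assumed allowance for an HTTP header counted inside the 1MB


def looks_cut_at_1mb(path, slack=SLACK):
    """True if the file size is at 1 MiB or within `slack` bytes below it."""
    size = os.path.getsize(path)
    return ONE_MIB - slack <= size <= ONE_MIB


def crc_check(path):
    """Return None if every member passes its CRC, else a short description."""
    try:
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()  # name of the first member with a bad CRC, or None
    except zipfile.BadZipFile as exc:
        return f"not readable as a zip: {exc}"
    return None if bad is None else f"bad CRC in member {bad}"
```

A file for which `looks_cut_at_1mb` is false but `crc_check` still reports a problem is more likely to be hit by the missing-byte issue than by the 1MB limit.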

 
 
Poster: mark_k Date: December 24, 2003 08:39:32am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine
Hi,

The problem is definitely not related to the 1MB file size issue; the files I examined were 200K or so long, from memory.

As I understand it, in the .zip file format there is a list of filenames in the archive right at the end of the file. And presumably the file format requires a trailing zero byte.

Some/most archiver programs report an error like "end of central directory signature not found" (from memory) when you try to work with a file missing its final zero byte.

If you like, I can send some example web.archive.org URLs by private email so you can investigate this issue yourself.

Regards,
-- Mark
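
For reference, a zip file ends with an "end of central directory" record: the signature bytes `PK\x05\x06`, followed by fixed fields that bring the record to 22 bytes, the last two of which give the length of an optional archive comment. With no comment, that length is zero, which is why such files end in zero bytes. A rough Python sketch for measuring how much of that record is missing follows; the function name is made up for illustration, and the search for the last signature is a simplification that could be fooled by the same bytes occurring inside a comment.

```python
EOCD_SIG = b"PK\x05\x06"  # end of central directory signature
EOCD_FIXED_LEN = 22       # fixed part of the record, before any comment


def missing_tail_bytes(data):
    """Return how many bytes the final record appears to be short by.

    0 means the record looks complete; None means no signature was found.
    """
    pos = data.rfind(EOCD_SIG)
    if pos < 0:
        return None
    present = len(data) - pos
    if present < EOCD_FIXED_LEN:
        # The comment-length field itself is cut off; assume no comment.
        return EOCD_FIXED_LEN - present
    comment_len = int.from_bytes(data[pos + 20:pos + 22], "little")
    return max(0, EOCD_FIXED_LEN + comment_len - present)
```

Reading a downloaded file with `open(path, "rb").read()` and passing the bytes in should give 1 for a file hit by the problem described here, and 0 for an intact archive.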

 
 
Poster: Z19 Date: December 31, 2005 02:18:58am
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine
On December 24, 2003 03:29:40pm Brad wrote:
> we may be able to auto-detect and fix this problem as we serve the files, in the future

Two years later, this has been reported numerous times, and you are still serving zip files one byte short. Is this ever going to be fixed?

 
 
Poster: wjgeorge Date: June 28, 2007 08:40:01pm
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine
1) Copy the file you want to fix to a file named "fixme"
2) Run the fixme.pl script:

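use strict; use warnings;
open(my $fh, '>>', 'fixme') or die "Cannot open fixme: $!";   # open for appending
binmode($fh);                                                 # no newline translation
syswrite($fh, chr(0), 1) or die "Write failed: $!";           # append a single zero byte
close($fh);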

3) Test with your zip program
4) Rename or copy the file named "fixme" to your original file name
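
The same four steps can also be sketched as one Python function, for anyone without Perl at hand. The function name and the `.fixme` suffix are invented for this example, and "test with your zip program" is approximated by the CRC test in Python's standard `zipfile` module.

```python
import os
import shutil
import zipfile


def repair_zip(path):
    """Append a zero byte to a copy of `path`; keep it only if it tests clean.

    Returns True if `path` was replaced with the repaired copy.
    """
    candidate = path + ".fixme"  # hypothetical temporary name
    shutil.copyfile(path, candidate)
    with open(candidate, "ab") as fh:  # "ab": append in binary mode
        fh.write(b"\x00")
    try:
        with zipfile.ZipFile(candidate) as zf:
            ok = zf.testzip() is None  # None means every CRC matched
    except zipfile.BadZipFile:
        ok = False
    if ok:
        os.replace(candidate, path)  # step 4: put the fixed file in place
    else:
        os.remove(candidate)         # leave the original untouched
    return ok
```

This is best run only on files that fail to open, since an already intact archive may still test clean with an extra trailing byte and would then be replaced needlessly.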




This post was modified by wjgeorge on 2007-06-29 04:40:01

 
 
Poster: Z19 Date: June 28, 2007 10:25:24pm
Forum: web Subject: Re: Broken/truncated .zip files in Wayback Machine
Well, thanks, but I can fix the files myself.
The problem is, why doesn't the Archive fix their files?

Why doesn't anyone at the Archive even bother to reply?

It's only been an outstanding, known problem with a simple fix for at least 4 years.


