Poster: | mark_k | Date: | Oct 7, 2003 12:16am |
Forum: | web | Subject: | Broken/truncated .zip files in Wayback Machine |
Here's a problem that I've noticed when using the Wayback Machine. Basically, files which end with a zero byte are truncated.
Zip archive files normally end with a zero byte. If that last zero byte is not present, many or most archiver programs think the file is corrupted/bad.
I have downloaded several .zip archives from the Wayback Machine, and all are truncated by that one byte. This indicates that the web-crawling software used by the Internet Archive is (or was) defective.
If you have downloaded such a file, you can fix it by appending a zero byte to the end of the .zip file.
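As a minimal sketch of that fix (not an official tool; the function name `append_zero_byte` is made up for illustration), the following appends a single zero byte to a file. It modifies the file in place, so it is probably wise to run it on a copy of the download:

```python
from pathlib import Path


def append_zero_byte(path):
    """Append one 0x00 byte to the file at `path` and return its new size in bytes."""
    p = Path(path)
    # "ab" opens the file for appending in binary mode, so existing bytes are untouched.
    with p.open("ab") as f:
        f.write(b"\x00")
    return p.stat().st_size
```

Calling `append_zero_byte("download.zip")` (a placeholder filename) should make the file exactly one byte longer.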
Regards,
-- Mark
Reply
Poster: | Brad | Date: | Dec 24, 2003 7:29am |
Forum: | web | Subject: | Re: Broken/truncated .zip files in Wayback Machine |
It's definitely possible that the various crawlers that have accumulated the web archives over the years have had faults.
Another common cause of this kind of problem, which may be what's occurring here, is that many of the crawlers stopped downloading files at 1MB, and that's all we have in the collection. Quite a calamity, but that's where we are.
If a zip file ends at exactly 1MB (or within a few bytes of it; for webnerds, the 1MB sometimes includes the HTTP header), then this may be the problem. Generally, there are internal consistency checks (CRCs) in the zip files themselves, so if your solution of adding a null byte to the end of the file gets you a usable zip file, then that's great! Please let me know if this is the case by responding here, as we may be able to auto-detect and fix this problem as we serve the files, in the future.
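One way such an auto-detect-and-fix step might look, purely as a sketch and not a description of what the Archive actually runs: try to open the data as a zip and let the per-member CRCs confirm it, and only if that fails retry with one zero byte appended. The function names here are hypothetical; the standard-library calls (`zipfile.ZipFile`, `testzip`) are real.

```python
import io
import zipfile


def is_readable_zip(data: bytes) -> bool:
    """Return True if `data` opens as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() returns the name of the first bad member, or None if all are fine.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


def repair_missing_final_byte(data: bytes) -> bytes:
    """Return `data` unchanged if it is already a good zip, otherwise try adding one 0x00."""
    if is_readable_zip(data):
        return data
    candidate = data + b"\x00"
    if is_readable_zip(candidate):
        return candidate
    # Anything else (for example a file cut off at 1MB) is a different problem.
    raise ValueError("not repairable by appending a single zero byte")
```

Because the repaired candidate is accepted only when the CRC checks pass, a file that is broken for some other reason is rejected rather than silently "fixed".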
Thanks for posting your solution.
Brad
Reply
Poster: | Z19 | Date: | Dec 31, 2005 2:18am |
Forum: | web | Subject: | Re: Broken/truncated .zip files in Wayback Machine |
> we may be able to auto-detect and fix this problem as we serve the files, in the future
Two years later, this has been reported numerous times, and you are still serving zip files one byte short. Is this ever going to be fixed?
Reply
Poster: | wjgeorge | Date: | Jun 28, 2007 9:40pm |
Forum: | web | Subject: | Re: Broken/truncated .zip files in Wayback Machine |
This post was modified by wjgeorge on 2007-06-29 04:40:01
Reply
Poster: | Z19 | Date: | Jun 28, 2007 11:25pm |
Forum: | web | Subject: | Re: Broken/truncated .zip files in Wayback Machine |
The problem is, why doesn't the Archive fix their files?
Why doesn't anyone at the Archive even bother to reply?
It's only been an outstanding, known problem with a simple fix for at least 4 years.
Reply
Poster: | mark_k | Date: | Dec 24, 2003 8:39am |
Forum: | web | Subject: | Re: Broken/truncated .zip files in Wayback Machine |
The problem is definitely not related to the 1MB file size issue; the files I examined were 200K or so long, from memory.
As I understand it, in the .zip file format there is a list of filenames in the archive right at the end of the file. And presumably the file format requires a trailing zero byte.
Some/most archiver programs report an error like "end of central directory signature not found" (from memory) when you try to work with a file missing its final zero byte.
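For anyone who wants to check a file by hand, here is a rough sketch based on the published zip format rather than on anything specific to the Archive. The "end of central directory" record starts with the signature bytes `PK\x05\x06`, has a fixed part of 22 bytes, and ends with a 2-byte comment-length field followed by the comment itself. When there is no archive comment that field is zero, which is why such files end in zero bytes. The helper name `eocd_shortfall` is invented for this example, and it naively assumes the signature bytes do not also occur inside the comment.

```python
import struct
from typing import Optional

EOCD_SIGNATURE = b"PK\x05\x06"  # end of central directory signature
EOCD_FIXED_SIZE = 22            # size of the record without the trailing comment


def eocd_shortfall(data: bytes) -> Optional[int]:
    """Return how many bytes the end-of-central-directory record is missing.

    0 means the record is complete, a positive N means at least N bytes are
    missing, and None means no signature was found at all.
    """
    start = data.rfind(EOCD_SIGNATURE)
    if start < 0:
        return None
    available = len(data) - start
    if available < EOCD_FIXED_SIZE:
        # The fixed part itself is cut short, e.g. by one missing zero byte.
        return EOCD_FIXED_SIZE - available
    # The last fixed field (offset 20) is the little-endian comment length.
    (comment_len,) = struct.unpack_from("<H", data, start + 20)
    return max(0, EOCD_FIXED_SIZE + comment_len - available)
```

A result of 1 on a file with no archive comment would be consistent with the one-byte truncation described in this thread.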
If you like, I can send some example web.archive.org URLs by private email so you can investigate this issue yourself.
Regards,
-- Mark
Reply
Poster: | Dolphin | Date: | Dec 23, 2003 10:19am |
Forum: | web | Subject: | Re: Broken/truncated .zip files in Wayback Machine |