Poster: benwbrum Date: Dec 3, 2010 8:06am
Forum: opensource Subject: Uploading JPG files to get an OCRed DJVU

I have a set of JPG files that are the results of scanning a document. I'd like to upload them to Community Texts and get a DJVU file on the other end to leverage the BookReader, OCR, and E-book/PDF features. However, whenever I use the upload tool or FTP, I only seem to see JPG images on the created book.

Are people generally uploading DJVU files that have already been OCRed? Is the BookReader tool only compatible with IA-digitized material, or am I missing something?

Poster: hank_b Date: Dec 10, 2010 10:01am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

Uploading PDFs is easiest, because all the images go into a single file and there are few restrictions on what you can name it, but some quality is lost when we convert the PDF into jp2s for further processing.

Ideal for quality is a stack of jp2s (or JPEGs or TIFFs). But that's also more complicated to arrange: you need to generate zips (or tars) following our obscure naming and directory-structure conventions. At some point, we plan to enable scooping up loose image files, as you've placed in your item, but I can't say how soon that will happen.

Below are instructions for forming jp2-based image stacks. If you have JPEGs in place of jp2s, just replace "jp2", wherever it appears, with "jpg"; for TIFFs, replace "jp2" with "tif" (only one 'f').

The directions are unfortunately not very user-friendly; the make-a-book-from-a-stack-of-images path wasn't really designed for external contributors, but for the automated tools that process our own scanned images, so the expectations are rather inflexible and a little odd.

* Determine an identifier to use for the item as a whole (the {itemID} part of an Archive details page URL: http://www.archive.org/details/{itemID}), and determine a name to use for this document; often an item has only a single document, with the item and the document sharing the same name, but you can choose something else for the document name if you like, and an item can contain multiple documents. In what follows, I'll just indicate the name with {docname}; substitute your chosen name.

* Name the jp2s so that the first one is "{docname}_0000.jp2" and the last one is "{docname}_nnnn.jp2", where nnnn is the number of images minus 1 (because we begin at 0000), with enough leading zeroes to produce a number of exactly four digits.

* Place them all in a directory named "{docname}_jp2".

* Pack that directory into either a zip file named "{docname}_jp2.zip" or a tar file named "{docname}_jp2.tar"; make the zip or tar from one level up, so the filenames inside the zip/tar include the directory as well as the individual filename. (If the total size is more than 2 GB, we need it to be a tar, and not a zip; if the total size is under 2 GB, it can be either.)
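The steps above could be scripted roughly as follows. This is only a minimal sketch, not IA's own tooling: the function name `build_image_stack`, its parameters, and the assumption that sorting the original filenames yields the correct page order are all illustrative. It copies the images into "{docname}_jpg" under zero-based, four-digit names and then makes the tar from one level up, so each member name includes the directory.

```python
import shutil
import tarfile
from pathlib import Path

# Source suffixes accepted for each target extension (an illustrative choice).
SUFFIXES = {"jp2": {".jp2"}, "jpg": {".jpg", ".jpeg"}, "tif": {".tif", ".tiff"}}


def build_image_stack(source_dir, docname, out_dir, ext="jpg"):
    """Copy images to {docname}_{ext}/{docname}_NNNN.{ext} and tar that directory."""
    source_dir, out_dir = Path(source_dir), Path(out_dir)
    # Assumption: sorted filenames are in page order. Plain sorting puts
    # "page10" before "page2", so zero-pad the originals or adapt the key.
    images = sorted(
        p for p in source_dir.iterdir()
        if p.is_file() and p.suffix.lower() in SUFFIXES[ext]
    )
    if not images:
        raise ValueError(f"no .{ext} images found in {source_dir}")
    if len(images) > 10000:
        raise ValueError("more images than a four-digit counter can number")

    stack_dir = out_dir / f"{docname}_{ext}"
    stack_dir.mkdir(parents=True)
    for index, image in enumerate(images):  # numbering starts at 0000
        shutil.copy2(image, stack_dir / f"{docname}_{index:04d}.{ext}")

    tar_path = out_dir / f"{docname}_{ext}.tar"
    with tarfile.open(tar_path, "w") as tar:
        # arcname keeps the directory name inside the archive, as if tar
        # had been run from one level up.
        tar.add(stack_dir, arcname=stack_dir.name)
    return tar_path
```

For example, `build_image_stack("scans", "mydoc", "out")` would write "out/mydoc_jpg.tar" containing "mydoc_jpg/mydoc_0000.jpg", "mydoc_jpg/mydoc_0001.jpg", and so on. A tar is used here because, per the note above, it is acceptable at any size.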

Hank Bromley
software engineer
Internet Archive

Poster: benwbrum Date: Jan 2, 2011 5:37am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

Thanks, Hank!

I put together a little script (attached) to do the renaming you describe, but now it looks like the derive task is failing.

I'm having a bit of trouble interpreting the error -- either it's failing because it can't find a scandata.xml file or because my tarfile doesn't follow the appropriate conventions. Does the derivation task create the scandata file, or should I be uploading a scandata.xml file alongside my tar file? I'm familiar with the format of the file, and should be able to generate one pretty easily if necessary.

Here's the relevant section from the task log:
<--------------- Module PDF (v21957 2010Dec31 19:03) Starting  PST: 2010-12-31 19:03:22 ------------------

[ PST: 2010-12-31 19:03:22 ] Executing: rm -rf /tmp/derive-JWGravesDiariesVol2Book01TEST2-PDF/
[ PST: 2010-12-31 19:03:22 ] Executing: mkdir /tmp/derive-JWGravesDiariesVol2Book01TEST2-PDF/

[Built on PDFBase $Revision: 32145 $ $Date: 2010-12-08 20:33:32 +0000 (Wed, 08 Dec 2010) $]

Found image archive PROCESSED_JPG for JWGravesDiariesVol2Book01TEST2 in /tmp/derive/JWGravesDiariesVol2Book01TEST2
size check of /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar: 31580160
size check of /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar: 31580160
Input images in /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar
[ PST: 2010-12-31 19:03:22 ] Executing: /petabox/sw/bin/build_pdf.sh --ocr /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_djvu.xml --scandata /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_scandata.xml --title='Jeremiah White Graves Diaries Volume 2 Book 1' --keywords='http://www.archive.org/details/JWGravesDiariesVol2Book01TEST2' --tmp_dir='/tmp/derive-JWGravesDiariesVol2Book01TEST2-PDF/itext' --author='Jeremiah White Graves' --2right --output /tmp/derive-JWGravesDiariesVol2Book01TEST2-PDF/JWGravesDiariesVol2Book01TEST2_itext.pdf /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar
INFO: BuildPdf build version: 95
INFO: Latest class change: org.archive.books.BuildPdf $Revision: 1.14 $ Sat Dec 06 00:38:01 UTC 2008
INFO: Processing /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar
org.archive.books.PdfException: Unknown archive structure: /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar
at org.archive.books.data.ArchiveData.createArchiveData(ArchiveData.java:48)
at org.archive.books.data.DataManager.getImageData(DataManager.java:104)
at org.archive.books.pdf.BuildSearchablePdf.buildSearchablePdf(BuildSearchablePdf.java:99)
at org.archive.books.BuildPdf.main(BuildPdf.java:266)
org.archive.books.PdfException: org.archive.books.PdfException: Unknown archive structure: /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar
at org.archive.books.pdf.BuildSearchablePdf.buildSearchablePdf(BuildSearchablePdf.java:319)
at org.archive.books.BuildPdf.main(BuildPdf.java:266)
Caused by: org.archive.books.PdfException: Unknown archive structure: /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar
at org.archive.books.data.ArchiveData.createArchiveData(ArchiveData.java:48)
at org.archive.books.data.DataManager.getImageData(DataManager.java:104)
at org.archive.books.pdf.BuildSearchablePdf.buildSearchablePdf(BuildSearchablePdf.java:99)
... 1 more
ERROR: output /tmp/derive-JWGravesDiariesVol2Book01TEST2-PDF/JWGravesDiariesVol2Book01TEST2_itext.pdf not created. Leaving temp file JWGravesDiariesVol2Book01TEST2_itext.pdf
WARN: Error parsing scandata xml file: /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_scandata.xml. Ignoring it.
WARN: No scandata. No page number adjustments applied.
INFO: Global image dpi: 600
FATAL: Error building PDF file: org.archive.books.PdfException: org.archive.books.PdfException: Unknown archive structure: /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar

Module threw exception:


/petabox/sw/bin/build_pdf.sh --ocr /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_djvu.xml --scandata /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_scandata.xml --title='Jeremiah White Graves Diaries Volume 2 Book 1' --keywords='http://www.archive.org/details/JWGravesDiariesVol2Book01TEST2' --tmp_dir='/tmp/derive-JWGravesDiariesVol2Book01TEST2-PDF/itext' --author='Jeremiah White Graves' --2right --output /tmp/derive-JWGravesDiariesVol2Book01TEST2-PDF/JWGravesDiariesVol2Book01TEST2_itext.pdf /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar died with retval:1. Last line: FATAL: Error building PDF file: org.archive.books.PdfException: org.archive.books.PdfException: Unknown archive structure: /tmp/derive/JWGravesDiariesVol2Book01TEST2/JWGravesDiariesVol2Book01TEST2_jpg.tar

Derivation failed!




Attachment: rename_for_ia

Poster: hank_b Date: Jan 2, 2011 11:42am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

It was failing because of a bug in our code! Although we handled jp2.tar files correctly (i.e., unpacked them before sending the images to the PDF-building tool), we were not doing so with jpg.tar files. The result was the "Unknown archive structure" Java exception you saw in your task log.

I've corrected the code and the new version is being distributed across our cluster now. Once the pushout finishes, I'll rerun your derive, and it should finish cleanly this time.

As for the "Error parsing scandata xml file" warning, that's an unrelated and less critical problem. You don't need to provide a scandata file, unless there are certain special attributes you want to assign to individual pages of your document, as we make one automatically with default settings. But it would be helpful if you set a "ppi" value in your item's metadata when uploading image files rather than a pdf (we can guess the ppi from info in a pdf). When we make the scandata, we store the ppi value in it if we have it, and it was the absence of that info that caused the (misnamed) parsing error. Without ppi info, we don't know what size to make the new pdf; we'll still make one, but the size likely won't match that of your original document.
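To see why the ppi value matters for the size of the resulting pdf, here is the underlying arithmetic as a small sketch. The function name `page_size_inches` and the sample pixel dimensions are made up for illustration; how the derive code actually applies ppi is not described in this thread.

```python
def page_size_inches(width_px, height_px, ppi):
    """Physical page size implied by pixel dimensions at a given ppi."""
    if ppi <= 0:
        raise ValueError("ppi must be positive")
    return width_px / ppi, height_px / ppi


# A hypothetical 5100 x 6600 pixel scan:
print(page_size_inches(5100, 6600, 600))  # (8.5, 11.0)  letter-sized page
print(page_size_inches(5100, 6600, 300))  # (17.0, 22.0) twice as large
```

The same image yields a very different page size depending on the assumed ppi, which is why a missing value makes the output size a guess.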

One other note: with an item like this, you might want to set a language value of "English-handwritten" in the metadata. That way we'll know not to try performing OCR, which is only going to produce garbage characters from cursive handwritten text.

Poster: Nemo_bis Date: Mar 14, 2011 1:55pm
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

The derive fails even earlier for my item VocabolarioDellaLinguaItaliana; see the task log:

<--------------- Module ScandataXML (v30738 2011Mar14 13:15) Starting PDT: 2011-03-14 13:15:28 ------------------

[ PDT: 2011-03-14 13:15:28 ] Executing: rm -rf /tmp/derive-VocabolarioDellaLinguaItaliana-ScandataXML/
[ PDT: 2011-03-14 13:15:28 ] Executing: mkdir /tmp/derive-VocabolarioDellaLinguaItaliana-ScandataXML/

Module threw exception:
filename "VocabolarioDellaLinguaItaliana_JPG.tar" doesn't match known patterns for PROCESSED_JPG archive filenames

Derivation failed!

Cleaning up temporary dir:
[ PDT: 2011-03-14 13:15:29 ] Executing: rm -rf /tmp/derive-VocabolarioDellaLinguaItaliana-ScandataXML/

--------------- Module ScandataXML Finished PDT: 2011-03-14 13:15:29 (Took 1.0 second) ------------------->




FATAL ERROR -- EXITING: module failed! aborting rest of derive and failing!

Poster: Nemo_bis Date: Mar 15, 2011 12:37am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

Could such an error be caused by the fact that the images start at _0001 instead of _0000? I see that it counted 870 images instead of the correct 869.

Poster: hank_b Date: Mar 15, 2011 11:39am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

The problem with VocabolarioDellaLinguaItaliana is that you put JPG into upper case. (Ideally that shouldn't matter to our code, but as it happens, it does, and fixing that is more work than you might think.) The error:

filename "VocabolarioDellaLinguaItaliana_JPG.tar" doesn't match known patterns for PROCESSED_JPG archive filenames

occurred because it expected to find VocabolarioDellaLinguaItaliana_jpg.tar. Just changing the name of the tar file won't help, though, because then we'll still have trouble with all the internal filenames within the tar containing "JPG".
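Before uploading, a tar can be checked locally against the naming convention described earlier in this thread. The sketch below is only an approximation: the exact patterns the derive code accepts are not given here, and the function name `check_stack_tar` is hypothetical. It flags an upper-case extension in the archive name, unexpected member names inside the tar, and numbering that does not run contiguously from 0000.

```python
import re
import tarfile
from pathlib import Path


def check_stack_tar(tar_path):
    """Return a list of naming problems; an empty list means none were found."""
    name = Path(tar_path).name
    # The match is case-sensitive, so "_JPG.tar" is rejected here.
    match = re.fullmatch(r"(?P<doc>.+)_(?P<ext>jp2|jpg|tif)\.tar", name)
    if match is None:
        return [f"archive name {name!r} is not {{docname}}_jp2/jpg/tif.tar"]

    doc, ext = re.escape(match["doc"]), match["ext"]
    # Expected member form: {docname}_{ext}/{docname}_NNNN.{ext}
    member_re = re.compile(rf"{doc}_{ext}/{doc}_(\d{{4}})\.{ext}")

    problems, numbers = [], []
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip the directory entry itself
            found = member_re.fullmatch(member.name)
            if found is None:
                problems.append(f"unexpected member name: {member.name}")
            else:
                numbers.append(int(found.group(1)))

    if sorted(numbers) != list(range(len(numbers))):
        problems.append("page numbers are not a contiguous run starting at 0000")
    return problems
```

Running it on a tar named like "VocabolarioDellaLinguaItaliana_JPG.tar" reports the archive name immediately; after renaming, any internal names still ending in ".JPG" would be listed one by one.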

We *almost* have an easy solution for you, which I was planning to blog about shortly. You can see a preliminary version here:

http://raj.blog.archive.org/2011/02/24/new-upload-format-_images-zip-for-scribe-style-uploads/

The _images.zip format described there would be perfect for you - leave the individual files as-is, and pack them into a new zip named "VocabolarioDellaLinguaItaliana_images.zip", and our system would take it from there...except that we also have trouble with zips of more than 2 GB, which yours would be.

This particular case - I was already looking at it before you posted here - has convinced me I need to implement an _images.tar format, too (we have no size limit on tars). Then you can just change the name of your tar to "VocabolarioDellaLinguaItaliana_images.tar", without even having to upload a replacement. Likewise for the other large tar in your item.
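For items under the size limit, packing loose images into an _images.zip could look like the sketch below. Several things here are assumptions rather than facts from this thread: the function name `pack_images_zip`, storing the files flat at the top level of the zip, and treating "2 GB" as 2 * 1024**3 bytes. The linked blog post is the place to check the actual layout rules. The _images.tar variant is not covered because, at this point in the thread, it was only planned.

```python
import zipfile
from pathlib import Path

TWO_GB = 2 * 1024**3  # assumption: the thread only says "2 GB"


def pack_images_zip(image_dir, item_id, out_dir):
    """Zip the files in image_dir, names unchanged, as {item_id}_images.zip."""
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    files = sorted(p for p in image_dir.iterdir() if p.is_file())
    if not files:
        raise ValueError(f"no files found in {image_dir}")

    total = sum(p.stat().st_size for p in files)
    if total >= TWO_GB:
        raise ValueError("images total 2 GB or more; a zip is not suitable")

    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / f"{item_id}_images.zip"
    # ZIP_STORED: JPEG/JP2 data is already compressed, so skip deflating.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)  # keep original names
    return zip_path
```

The size check happens before any data is written, so an oversized set of images fails fast instead of producing a zip that the derive would have trouble with.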

I hope to get to that within the next day or two, and plan to post here again when I do.

Poster: benwbrum Date: May 28, 2011 7:04am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

I wanted to mention that the new, fault-tolerant naming conventions work great! Thanks very much to Hank and the IA team for their hard work.

Poster: Nemo_bis Date: Mar 15, 2011 1:32pm
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

Thank you very much for your help and great news! This new format is a big improvement.
In the meantime, since you've identified the issue, I don't mind renaming the files and re-uploading this archive with everything in lower case, as long as the "wait for admin" status doesn't mean that everything will be locked or ruined.

Poster: hank_b Date: Mar 15, 2011 2:13pm
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

No need to wait, then. Go right ahead and exchange your files for new versions (via the "Edit Item!" link). I can rerun the stuck derive whenever they're ready.

Poster: Nemo_bis Date: Mar 15, 2011 6:25pm
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

Done. Thank you! P.S.: Feel free to delete any unneeded files except the last one I uploaded (I don't know whether they're useful or harmful for the next derive).
This post was modified by Nemo_bis on 2011-03-16 01:25:23

Poster: Nemo_bis Date: Mar 31, 2011 3:10am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

Thank you. It produced quite a nice DjVu and the other derivatives, but it raised a new fatal error when trying to produce an ePub. I don't care about the ePub, but could you unlock the item so that I can upload other files and change metadata? Thank you!

Poster: Nemo_bis Date: Mar 12, 2011 9:16am
Forum: opensource Subject: Re: Uploading JPG files to get an OCRed DJVU

Thank you, this is very useful! The derive feature of IA is great, but not well known.
Could you add this information to http://www.archive.org/about/faqs.php#195 (or to another/new section of the faq)?
Thank you.