Poster: brewster Date: Dec 9, 2004 12:46am
Forum: toronto Subject: OCR output for indexing, proofreading, and maybe research

The folks that designed and built DjVu (at AT&T Labs and LizardTech) put in a feature that can be very useful to us folks building digital libraries:

They made an OCR output in XML that can be used for further processing. Usually OCR output is just dumped to a text file, which loses the position information. That is great for human readability, but it means you cannot then find where a word was on the page, say to bold it when searching.

The DjVu guys made an interesting XML format that keeps the bounding-box information.
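As a small illustration of why the boxes matter, here is one way to pull out the box for every occurrence of a search term. This is a minimal sketch, assuming the words appear as <WORD coords="..."> elements, as in the XML files linked below:

import xml.etree.ElementTree as ET

def find_word_boxes(djvu_xml_path, term):
    # return the bounding box of every WORD whose text matches `term`
    boxes = []
    for word in ET.parse(djvu_xml_path).iter("WORD"):
        if word.text and word.text.strip().lower() == term.lower():
            # coords holds four comma-separated pixel values for the word's box
            boxes.append(tuple(int(c) for c in word.get("coords").split(",")))
    return boxes

A viewer could then draw highlights at those boxes when showing search results.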

We hope this can be used in a few different ways. Imagine that the Distributed Proofreaders used these as their input and output format-- then we could weave this back into the DjVu, and probably back into PDF for an enhanced version.

Further, this corrected version could be used to make better versions of e-books using the ASCII text.

And a dream-- one that might be doable-- is that we could then use these corrected versions to train an open-source OCR engine to do better recognition. There are many languages that don't have OCR at all, and if we had a big training set of word images paired with their correct Unicode text, the image-processing researchers could get great work done. Sanskrit OCR, better Arabic OCR, Tamil, on and on.
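To make that training-set dream concrete, a rough sketch of the harvesting step; the page-image handling and the use of PIL are my assumptions, not anything we have built:

import xml.etree.ElementTree as ET
from PIL import Image

def extract_word_crops(page_image_path, page_xml_path, out_dir):
    # crop each OCRed word out of the page scan and record its text,
    # yielding image/text pairs suitable for training an OCR engine
    page = Image.open(page_image_path)
    with open(f"{out_dir}/labels.txt", "w", encoding="utf-8") as labels:
        for i, word in enumerate(ET.parse(page_xml_path).iter("WORD")):
            x0, y0, x1, y1 = (int(c) for c in word.get("coords").split(","))
            # normalize to (left, upper, right, lower) as PIL expects
            box = (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
            page.crop(box).save(f"{out_dir}/word_{i:05d}.png")
            labels.write(f"word_{i:05d}.png\t{word.text or ''}\n")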

Anyway, please check out this cool capability.

http://www.archive.org/texts/texts-details-db.php?collection=toronto&collectionid=englishc00caltuoft&from=BA

http://www.archive.org/download/englishbookbindings00davenuoft/englishbookbindings00davenuoft_djvuxml.xml

-brewster

Poster: Branko Collin Date: Dec 27, 2004 11:38am
Forum: toronto Subject: Re: OCR output for indexing, proofreading, and maybe research

"We hope this can be used in a few different ways. Imagine that the Distributed Proofreaders used these as their input and output format-- then we would could weave this back into the DJVU, and probably back into PDF for an enhanced version."

Why, imagine that we would. :-)

Still, it might be best if you either brought this up on the pgdp.net forums or talked to Charles or Juliet about this.

Poster: aronsson Date: Jan 27, 2005 4:58pm
Forum: toronto Subject: Re: OCR output for indexing, proofreading, and maybe research

I think this can become something great. What OCR software are you using now that can output this format? As far as I know, the commonly used ABBYY FineReader doesn't support DjVu, does it? Could you set up a public OCR server where uploaded images could be OCRed?

Right now I'm looking at one book that I picked at random from the Canadian Libraries collection, "libroberthoe01andeuoft", and it seems to have OCR text only for some pages, e.g. pages 275 and 281 but not pages 276-280. What's the reason for that?
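For what it's worth, here is a quick way to list such pages. That each page appears as an OBJECT element in the bundled _djvuxml.xml file is my assumption; adjust the element name if the files are structured differently:

import xml.etree.ElementTree as ET

def pages_without_text(djvu_xml_path):
    # list the 1-based page numbers whose OBJECT element holds no WORD text
    empty = []
    for n, page in enumerate(ET.parse(djvu_xml_path).iter("OBJECT"), start=1):
        if not any(w.text and w.text.strip() for w in page.iter("WORD")):
            empty.append(n)
    return empty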

Poster: brewster Date: Jan 27, 2005 10:30pm
Forum: toronto Subject: Universal OCR

LizardTech's DjVu encoder does not use ABBYY, but ABBYY can be made to output this format-- I believe it is already being done. This would be a very useful open-source module.

As for page 276, I looked at the DjVu XML and you are right: there is no text there, yet the image of the page clearly has text.

So the OCR did not work well in this case. We find it often crashes completely, so you will not see any OCR output for those pages.

On the other hand there are plenty of books with good OCR.

I hope that someone does start working on the universal OCR problem.

-brewster

Poster: aronsson Date: Jan 30, 2005 9:11pm
Forum: toronto Subject: Re: Universal OCR

I looked around the IA text collections to find books pertaining to Scandinavia that I can reuse in Project Runeberg, and immediately found three; they are now available at http://runeberg.org/pictswed/ , http://runeberg.org/ivar/ and http://runeberg.org/utveck/

Of these, the first one (Pictures of Sweden, by Hans Christian Andersen) has already been through proofreading at Distributed Proofreaders and is available from Project Gutenberg in TXT and HTML. However, for Project Runeberg I need to know where the page breaks and line breaks are, and this information is lost in the PG e-text, so now we are publishing the raw OCR and letting our volunteers proofread it anew. This is a waste of effort that I wish I knew how to avoid. Further, both PG/DP and Project Runeberg lose the pixel coordinates of each word, which are available in the DjVu format.

One way out of this would be to improve the proofreading processes of DP and Project Runeberg so that no information is lost. Another way might be to rebuild the information after it is lost. Perhaps something like the GNU wdiff (word difference) utility could be used to see which words have been moved, joined, or changed during proofreading, tying this back to the pixel coordinates of the original DjVu file. Has anybody tried this?

For example, the first line of raw OCR text of page 9 of the DjVu file UF00001842 reads:

<LINE>
<WORD coords="382,2455,466,2381">I</WORD>
<WORD coords="511,2455,568,2380">is</WORD>
<WORD coords="618,2455,660,2408">a</WORD>
<WORD coords="705,2481,1077,2377">delightf-ul</WORD>
<WORD coords="1132,2482,1418,2379">spring,</WORD>
<WORD coords="1485,2456,1606,2382">the</WORD>
<WORD coords="1652,2458,1848,2380">birds</WORD>
<WORD coords="1901,2471,2171,2380">warble,</WORD>
</LINE>

and the proofed plain text at Project Gutenberg reads:

It is a delightful spring: the birds warble,

so the words "is", "birds", and "warble" match unchanged. Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"? Would this be useful?
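A very rough sketch of such a utility, using Python's difflib as a stand-in for wdiff (the input format here is my assumption, not anything the Archive or DP actually produces):

import difflib

def map_coords(ocr_words, proofed_words):
    # ocr_words: list of (text, coords) pairs taken from the DjVu XML
    # proofed_words: list of corrected words from the proofed e-text
    # returns (proofed_word, coords-or-None) pairs
    matcher = difflib.SequenceMatcher(a=[t for t, _ in ocr_words],
                                      b=proofed_words, autojunk=False)
    mapping = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal" or (op == "replace" and a1 - a0 == b1 - b0):
            # unchanged words, plus one-for-one fixes such as
            # "delightf-ul" -> "delightful": keep the original word's box
            for word, (_, coords) in zip(proofed_words[b0:b1], ocr_words[a0:a1]):
                mapping.append((word, coords))
        else:
            # joins, splits, and insertions: no single box carries over
            mapping.extend((word, None) for word in proofed_words[b0:b1])
    return mapping

On the line above, the "equal" blocks would carry the boxes for "is", "birds", and "warble" straight across, and the one-for-one replacements would map "delightf-ul" onto "delightful" and "I" onto "It".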

Poster: brewster Date: Jan 30, 2005 10:54pm
Forum: toronto Subject: Re: Universal OCR

If there were a tool to go from

djvuxml -> distributed proofreaders -> djvuxml

that preserved as many bounding boxes as possible (some of that will be difficult or impossible, so it is not that important that it keep them all), then we would have a set of images-of-words and Unicode words-- or you can think of it as a training set for OCR.
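For the return leg, a minimal sketch, assuming a (corrected_word, coords) mapping like the one sketched earlier in this thread and the same WORD/coords structure:

import xml.etree.ElementTree as ET

def weave_back(djvu_xml_path, mapping, out_path):
    # mapping: (corrected_word, coords) pairs; coords is None where
    # no box could be preserved
    tree = ET.parse(djvu_xml_path)
    corrected = {coords: word for word, coords in mapping if coords}
    for word_el in tree.iter("WORD"):
        coords = tuple(int(c) for c in word_el.get("coords").split(","))
        if coords in corrected:
            word_el.text = corrected[coords]
    tree.write(out_path, encoding="utf-8")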

We have gotten interest from the Machine Learning folks in making a universal OCR engine out of this.

What would be particularly interesting is non-Roman scripts, so we may need to construct the DjVu XML more from scratch.

If anyone is interested in this, please let us know by forum post, email, or by phoning the Archive.

-brewster

Poster: Branko Collin Date: Feb 22, 2005 8:24am
Forum: toronto Subject: Re: Universal OCR

"However, for Project Runeberg I need to know where the page breaks and line breaks are, and this information is lost in the PG e-text"

DP now tries to retain at least page numbers in its HTML versions (though they are unlikely to appear at the exact page boundaries all the time, because we reconnect words that were broken across page boundaries). Also, footnotes, columns, and other items that span pages are unlikely to be in the right position, so to speak.

In other words, when sending a text through DP, it is not unreasonable to ask our volunteers to retain page breaks.

"Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"?"

I don't see why not.

"Would this be useful?"

I think it is.

Poster: Branko Collin Date: Feb 22, 2005 8:34am
Forum: toronto Subject: Re: Universal OCR

BTW, you could use DP just for proofreading. During proofreading rounds, we retain line breaks to make it easier for our volunteers to compare the text with the scan. Line breaks are only removed during the post-processing round.