View Post
Poster: brewster | Date: Dec 9, 2004 12:46am
Forum: toronto | Subject: OCR output for indexing, proofreading, and maybe research
They made OCR output in XML that can be used for further processing. Usually OCR output is just dumped to a text file that loses the position information. This is great for human readability, but it means you cannot then use it to find where a word was on the page, so you cannot bold it when searching.
The DjVu guys made an interesting XML file that keeps the bounding-box information.
We hope this can be used in a few different ways. Imagine that the Distributed Proofreaders used these as their input and output format-- then we could weave the corrections back into the DJVU, and probably back into PDF, for an enhanced version.
Further, this corrected version could be used to make better versions of e-books using the ASCII.
And a dream-- one that might be do-able-- is that we could then use these corrected versions to train an open-source OCR engine to do better recognition. There are many languages that don't have OCR done at all, and if we had a big training set of images of words and their correct Unicode versions, then the image-processing researchers could get great work done. Sanskrit OCR, better Arabic OCR, Tamil, on and on.
Anyway, please check out this cool capability.
http://www.archive.org/texts/texts-details-db.php?collection=toronto&collectionid=englishc00caltuoft&from=BA
http://www.archive.org/download/englishbookbindings00davenuoft/englishbookbindings00davenuoft_djvuxml.xml
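For anyone who wants to experiment with these files, here is a minimal sketch of pulling words and their bounding boxes out of a _djvuxml.xml file. The element and attribute names (`LINE`, `WORD`, `coords`) are assumptions based on the fragment quoted later in this thread, not a full description of the format:

```python
import xml.etree.ElementTree as ET

# A small fragment in the style of the linked _djvuxml.xml files.
SAMPLE = """
<OBJECT>
  <LINE>
    <WORD coords="382,2455,466,2381">I</WORD>
    <WORD coords="511,2455,568,2380">is</WORD>
  </LINE>
</OBJECT>
"""

def extract_words(xml_text):
    """Return (word, (x1, y1, x2, y2)) pairs from a DjVu-style XML fragment."""
    root = ET.fromstring(xml_text)
    out = []
    for word in root.iter("WORD"):
        coords = tuple(int(n) for n in word.get("coords").split(","))
        out.append((word.text, coords))
    return out
```

Run over a whole _djvuxml.xml file, this yields exactly the word/position pairs needed to bold a search hit on the page image.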
-brewster
Reply
Poster: Branko Collin | Date: Dec 27, 2004 11:38am
Forum: toronto | Subject: Re: OCR output for indexing, proofreading, and maybe research
Why, imagine that we would. :-)
Still, it might be best if you either brought this up on the pgdp.net forums, or talked to Charles or Juliet about this.
Reply
Poster: aronsson | Date: Jan 27, 2005 4:58pm
Forum: toronto | Subject: Re: OCR output for indexing, proofreading, and maybe research
Right now I'm looking at one book that I picked at random from the Canadian Libraries collection, "libroberthoe01andeuoft", and it seems to have OCR text only for some pages, e.g. for pages 275 and 281, but not for pages 276-280. What's the reason for that?
Reply
Poster: brewster | Date: Jan 27, 2005 10:30pm
Forum: toronto | Subject: Universal OCR
As for page 276, I looked at the DjVu XML and you are right, there is no text there, yet there is clearly text on the image of the page.
So the OCR did not work well in this case. We find it often crashes completely, in which case you will not see any OCR output at all.
On the other hand there are plenty of books with good OCR.
I hope that someone does start working on the universal OCR problem.
-brewster
Reply
Poster: aronsson | Date: Jan 30, 2005 9:11pm
Forum: toronto | Subject: Re: Universal OCR
Of these, the first one (Pictures of Sweden, by Hans Christian Andersen) has already been through proofreading at Distributed Proofreaders and is available from Project Gutenberg in TXT and HTML. However, for Project Runeberg I need to know where the page breaks and line breaks are, and this information is lost in the PG e-text, so now we are publishing the raw OCR and letting our volunteers proofread it anew. This is a waste of effort that I wish I knew how to avoid. Further, both PG/DP and Project Runeberg lose the pixel coordinates of each word that are available in the DjVu format.
One way out of this would be to improve the proofreading processes of DP and Project Runeberg so that no information is lost. Another way might be to rebuild the information after it is lost. Perhaps something like the GNU wdiff (word difference) utility could be used to see which words have been moved, joined, or changed during proofreading, tying this back to the pixel coordinates of the original DjVu file. Has anybody tried this?
For example, the first line of raw OCR text of page 9 of the DjVu file UF00001842 reads:
I
is
a
delightf-ul
spring,
the
birds
warble,
and the proofed plain text at Project Gutenberg reads:
It is a delightful spring: the birds warble,
so the words "is" and "warble" match unchanged. Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"? Would this be useful?
Reply
Poster: aronsson | Date: Jan 30, 2005 9:35pm
Forum: toronto | Subject: Re: Universal OCR
For example, the first line of raw OCR text of page 9 of the DjVu file UF00001842 reads:
<LINE>
  <WORD coords="382,2455,466,2381">I</WORD>
  <WORD coords="511,2455,568,2380">is</WORD>
  <WORD coords="618,2455,660,2408">a</WORD>
  <WORD coords="705,2481,1077,2377">delightf-ul</WORD>
  <WORD coords="1132,2482,1418,2379">spring,</WORD>
  <WORD coords="1485,2456,1606,2382">the</WORD>
  <WORD coords="1652,2458,1848,2380">birds</WORD>
  <WORD coords="1901,2471,2171,2380">warble,</WORD>
</LINE>
and the proofed plain text at Project Gutenberg reads:
It is a delightful spring: the birds warble,
so the words "is", "birds", and "warble" match unchanged. Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"? Would this be useful?
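The wdiff idea can be sketched with Python's difflib instead. The word and coordinate lists below come from the XML fragment above; the one-for-one handling of "replace" spans is an assumption that works for simple corrections like "delightf-ul" to "delightful" but would need extra care where proofreading joins or splits words:

```python
import difflib

# Raw OCR words and their coords, transcribed from the DjVu XML above.
ocr_words = ["I", "is", "a", "delightf-ul", "spring,", "the", "birds", "warble,"]
ocr_coords = [
    (382, 2455, 466, 2381), (511, 2455, 568, 2380), (618, 2455, 660, 2408),
    (705, 2481, 1077, 2377), (1132, 2482, 1418, 2379), (1485, 2456, 1606, 2382),
    (1652, 2458, 1848, 2380), (1901, 2471, 2171, 2380),
]
proofed = "It is a delightful spring: the birds warble,".split()

def map_coords(ocr_words, ocr_coords, proofed):
    """Carry each OCR word's coordinates over to the proofed word it aligns with."""
    sm = difflib.SequenceMatcher(a=ocr_words, b=proofed, autojunk=False)
    mapping = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
            # Unchanged words, or one-for-one corrections; spans where words
            # were inserted, deleted, joined, or split are skipped here and
            # would need a finer-grained pass.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                mapping.append((proofed[j], ocr_coords[i]))
    return mapping
```

On this line, every proofed word ends up paired with a coordinate box, including "delightful" with the box of "delightf-ul".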
Reply
Poster: brewster | Date: Jan 30, 2005 10:54pm
Forum: toronto | Subject: Re: Universal OCR
djvuxml -> distributed proofreaders -> djvuxml
and preserve as many bounding boxes as possible (some of that will be difficult or impossible, so it is not that important that it keep them all).
Then we have a set of images-of-words and Unicode-words-- or you can think of it as a training set for OCR.
We have gotten interest from the Machine Learning folks in making a universal OCR engine out of this.
What would be particularly interesting is non-Roman scripts, so we may need to construct the DjVu XML more from scratch.
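Building that training set means cropping each word's image out of the page scan using its bounding box. A small sketch of the coordinate conversion: the coords in the thread's example list the larger y first ("382,2455,466,2381"), so this just normalizes each axis into the (left, upper, right, lower) order that imaging libraries such as PIL's Image.crop expect -- whether DjVu's y-axis origin matches the page image's is an assumption to verify against real scans:

```python
def coords_to_box(coords):
    """Convert a DjVu-XML coords string "x1,y1,x2,y2" into a
    (left, upper, right, lower) crop box by sorting each axis."""
    x1, y1, x2, y2 = (int(n) for n in coords.split(","))
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
```

Each crop, paired with its corrected Unicode word, would then be one training example for the OCR engine.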
If anyone is interested in this, please let us know by forum post, email, or phoning the archive.
-brewster
Reply
Poster: Branko Collin | Date: Feb 22, 2005 8:24am
Forum: toronto | Subject: Re: Universal OCR
DP now tries to retain at least page numbers in its HTML versions (though they are unlikely to appear at the exact page boundaries all the time, because we reconnect words that were broken across page boundaries). Also, footnotes, columns and other items that span pages are unlikely to be in the right position, so to speak.
In other words, when sending a text through DP, it is not unreasonable to ask our volunteers to retain page breaks.
"Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"?"
I don't see why not.
"Would this be useful?"
I think it is.
Reply
Poster: Branko Collin | Date: Feb 22, 2005 8:34am
Forum: toronto | Subject: Re: Universal OCR