[texhax] search for text in a pdf file

Karl Berry karl at freefriends.org
Sat Aug 7 03:12:05 CEST 2004


    > so now i'm back where i started, only just a bit smarter.  so what else do
    > y'all use to pull text out of a pdf such as this one?

In general, pdftotext from xpdf can be better than pdf2ps | ps2ascii.
But if the text search in xpdf or Acrobat doesn't find anything,
pdftotext won't help either, and OCR is your only hope.
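
FWIW, here is a rough sketch of that check in Python, in case it saves
someone typing.  It assumes pdftotext is on your PATH; the function
names and the 20-character threshold are made up for illustration, not
anything xpdf defines.

```python
import subprocess
import sys


def extract_text(pdf_path):
    """Run pdftotext (from xpdf) and return whatever text it finds."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Heuristic: does the extracted text contain enough visible characters?

    min_chars is an arbitrary threshold picked for this sketch.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


# Only run when a PDF path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    text = extract_text(sys.argv[1])
    if has_text_layer(text):
        sys.stdout.write(text)
    else:
        sys.stderr.write("no text layer found; OCR is the only option\n")
```

A scanned PDF typically yields nothing but form feeds and blank lines,
which is why the check counts non-whitespace characters rather than
testing for an empty string.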

    Fortunately there is at least one open source project:
    http://jocr.sourceforge.net/

Yep, that's a big one.  There are others.

Another one is OCRAD, which was offered to GNU, and eventually accepted:
http://www.gnu.org/software/ocrad/ocrad.html
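
If you do end up going the OCR route with ocrad, something along these
lines might work.  This is an untested sketch under several
assumptions: that pdftoppm (from xpdf) and ocrad are both installed,
that ocrad writes its recognized text to stdout, and that 300 dpi is a
reasonable rendering resolution.  The helper names are hypothetical.

```python
import glob
import os
import re
import subprocess
import sys
import tempfile


def page_number(path):
    """Pull the page number out of a pdftoppm output name like page-000012.pbm."""
    match = re.search(r"-(\d+)\.pbm$", path)
    return int(match.group(1)) if match else 0


def ocr_pdf(pdf_path, dpi=300):
    """Render each page to a monochrome bitmap, then run ocrad on it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "page")
        # -mono asks pdftoppm for PBM output; -r sets the resolution in dpi.
        subprocess.run(
            ["pdftoppm", "-mono", "-r", str(dpi), pdf_path, root],
            check=True,
        )
        # Sort numerically so the pages come out in order regardless of
        # how the page numbers are zero-padded.
        pages = sorted(glob.glob(root + "-*.pbm"), key=page_number)
        chunks = []
        for image in pages:
            result = subprocess.run(
                ["ocrad", image],
                capture_output=True, text=True, check=True,
            )
            chunks.append(result.stdout)
        return "\n".join(chunks)


# Only run when a PDF path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.stdout.write(ocr_pdf(sys.argv[1]))
```

Expect to experiment with the resolution; OCR quality depends heavily
on how clean the rendered bitmaps are.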

I found these links while evaluating ocrad about a year ago.  I don't
know if they're still valid, but FWIW:

http://www.claraocr.org
http://lem.eui.upm.es/ocre.html
http://www.pattern-lab.de/index_e.html
http://www.math.nwu.edu/~mlerma/locr/
http://http.cs.berkeley.edu/~fateman/kathey/ocrchie.html

I've never tried any of them personally.

Good luck,
k


