Tesseract OCR Scripts on A Billion Billion
A Billion Billion uses Tesseract OCR to extract text from documents. Tesseract is integrated with libtiff to support compressed TIFF images, and we use ImageMagick to add support for color TIFFs. The scripts are written in Python.
Overview of OCR implementation in A Billion Billion:
To implement OCR in A Billion Billion, we did the following:
1) Install Tesseract OCR using these steps.
2) Install ImageMagick using the basic steps from the ImageMagick web site.
3) Pass the document to the Python script below to retrieve the OCR text.
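Before running step 3, it can help to confirm that both command-line tools are reachable. The following is a minimal sketch, assuming the executables are named tesseract and convert (the names the script below invokes) and are expected on PATH. missing_tools is a hypothetical helper, not part of the original scripts.

```python
import shutil

def missing_tools(names=('tesseract', 'convert')):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == '__main__':
    missing = missing_tools()
    if missing:
        print('Not found on PATH: ' + ', '.join(missing))
    else:
        print('tesseract and convert are both available')
```

An empty list means both tools were found; anything else names what still needs installing.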
We've written a how-to for Plone.org on the exact steps needed to use Tesseract OCR with Plone, which you can view here.
The following Python script takes an image, converts it to grayscale using ImageMagick, OCRs it using Tesseract, and returns the recognized text. It could stand to be cleaned up, but it does work. Feel free to send us suggestions or modifications:
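import os
import shutil
import subprocess
import tempfile

tess = 'tesseract'

def run(args):
    # Run a command without a shell and return its (stdout, stderr) as text.
    try:
        p = subprocess.Popen(args, stdin=subprocess.DEVNULL,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                             universal_newlines=True)
    except OSError as e:
        # The program is missing or not executable; report it as error text.
        return '', str(e)
    return p.communicate()

def convertgray(filename, filename1):
    # Convert the image to 8-bit grayscale with ImageMagick's convert.
    return run(['convert', filename, '-colorspace', 'Gray', '-depth', '8', filename1])

def ocrfile(self, f):
    # f holds the raw image bytes; self is not used in the body.
    dir1 = tempfile.mkdtemp()
    try:
        txtfilename = os.path.join(dir1, 'output')
        imagefilename = os.path.join(dir1, 'image.tif')
        with open(imagefilename, 'wb') as imagefile:
            imagefile.write(f)
        (imagefilename1, extension) = os.path.splitext(imagefilename)
        imagefilename1 = imagefilename1 + '1' + extension
        (out, error) = convertgray(imagefilename, imagefilename1)
        # Tesseract appends '.txt' to the output base name it is given.
        (tessout, tesserror) = run([tess, imagefilename1, txtfilename])
        out = out + tessout
        error = error + tesserror
        # Default to an empty string so a failed run does not raise NameError.
        s = ''
        if os.path.exists(txtfilename + '.txt'):
            with open(txtfilename + '.txt', 'r') as textfile:
                s = textfile.read()
    finally:
        # Remove the temporary directory even if a step above failed.
        shutil.rmtree(dir1)
    return s, out, error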
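A call to the script might look like the following sketch. It assumes ocrfile is passed in as the ocr argument and that None can stand in for its first parameter, since the function body never uses self. ocr_from_path and the file name scan.tif are hypothetical, introduced only for illustration.

```python
def ocr_from_path(path, ocr):
    """Read an image file as bytes and pass it to an ocrfile-style callable.

    `ocr` is expected to accept (self, data) and return (text, out, error),
    matching the ocrfile function in the script above.
    """
    with open(path, 'rb') as image:
        data = image.read()
    # ocrfile never uses its first argument, so None is passed for it.
    text, out, error = ocr(None, data)
    if error:
        # Command-line tools may write informational text to stderr, so
        # non-empty stderr is reported here but not treated as a failure.
        print('stderr from tools: ' + error.strip())
    return text

# Usage (assumes ocrfile from the script is importable and scan.tif exists):
# text = ocr_from_path('scan.tif', ocrfile)
```

Taking the OCR function as a parameter keeps the helper independent of how the script is deployed and makes it easy to substitute a stub when testing.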
Future changes
The following are changes we may make to improve the process.
1) Currently every image gets converted to grayscale, including images that are already grayscale. This wastes resources; we should probably check the color depth and run the conversion only when necessary.
2) Use ImageMagick to add support for other file types.
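The first item could be sketched as follows. This assumes ImageMagick's identify command is on PATH and that its %[colorspace] and %z format escapes report the colorspace and bit depth. The function names and the 8-bit threshold are illustrative choices, not part of the current scripts.

```python
import subprocess

def needs_gray_conversion(colorspace, depth):
    """Decide whether an image still needs the grayscale conversion step.

    Assumption: an image already reported as Gray with at most 8 bits
    per channel can be handed to Tesseract unchanged.
    """
    return not (colorspace.strip().lower() == 'gray' and int(depth) <= 8)

def image_info(filename):
    """Ask ImageMagick's identify for (colorspace, depth) of the first frame."""
    result = subprocess.run(
        ['identify', '-format', '%[colorspace] %z\n', filename],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, check=True)
    # Multi-page files print one line per frame; only the first is inspected.
    colorspace, depth = result.stdout.splitlines()[0].split()
    return colorspace, int(depth)

def should_convert(filename):
    """Combine the two helpers; any failure falls back to converting."""
    try:
        return needs_gray_conversion(*image_info(filename))
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return True
```

Falling back to conversion on any error keeps the existing behavior as the safe default, so the check can only skip work, never skip a conversion that was needed.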

