
OCR with Tesseract in Sparkflows

Sparkflows provides an OCR node for performing OCR with Tesseract. Sparkflows runs the node in distributed mode on the Apache Spark cluster, so you can perform OCR on anywhere from tens to millions of files at a time.

Details of the Tesseract project are available at

The language model files are available at

Details of installing the Tesseract language files in Sparkflows are available at


Below is a workflow that uses the OCR node in Sparkflows.

It consists of three nodes:

  • BinaryFiles - It reads binary files from a given file or directory

  • OCR - It performs OCR on the image using Tesseract

  • PrintNRows - It prints the first X rows of the result


The BinaryFiles node reads in the binary file 'data/ocrimage.png' from HDFS. It places the file name into the column 'fileName' and the content of the binary file into the column 'content' of the DataFrame it generates.
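
As a rough illustration of what this step produces, the following plain-Python sketch reads every file under a path into rows with the same two columns. It is not Sparkflows code: the function name `read_binary_files` and the row-as-dict representation are assumptions made for illustration, and it reads from the local filesystem rather than HDFS. On a Spark cluster, `SparkContext.binaryFiles` plays a comparable role by returning (path, bytes) pairs.

```python
from pathlib import Path


def read_binary_files(path):
    """Return one row per file, with 'fileName' and 'content' keys.

    `path` may be a single file or a directory. Hypothetical helper for
    illustration only; it reads from the local filesystem, not HDFS.
    """
    root = Path(path)
    if root.is_file():
        files = [root]
    else:
        # Walk the directory tree and keep regular files in a stable order.
        files = sorted(p for p in root.rglob("*") if p.is_file())
    return [{"fileName": str(p), "content": p.read_bytes()} for p in files]
```

Each returned row carries the raw bytes untouched, which is the form an OCR step would need as input.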


The OCR node performs OCR using Tesseract. It reads the content of the binary file from the column 'content'. After performing OCR, it places the result into the column 'ocrcol'. Its output DataFrame contains the columns 'fileName' and 'ocrcol'.
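
The column handling described here can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Sparkflows implementation: `ocr_rows` and `tesseract_ocr` are hypothetical names, rows are plain dicts, and the Tesseract call goes through the `pytesseract` and Pillow packages, which must be installed separately along with the Tesseract binary.

```python
import io


def tesseract_ocr(content):
    """Run Tesseract on raw image bytes and return the recognized text."""
    # Imported lazily so the rest of the sketch works without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(content)))


def ocr_rows(rows, ocr_fn=tesseract_ocr):
    """Map rows with 'fileName' and 'content' to rows with 'fileName' and 'ocrcol'.

    `ocr_fn` takes the raw bytes of one file and returns its text
    (hypothetical parameter, useful for swapping in a stub).
    """
    return [
        {"fileName": row["fileName"], "ocrcol": ocr_fn(row["content"])}
        for row in rows
    ]
```

Because each row is processed independently of the others, this kind of per-record function is the sort of work that can be spread across a cluster.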


The PrintNRows node prints the first 10 rows of the result.

Input File

The input image file is shown below.

Executing the workflow

When the workflow is executed, it produces the following output for the single file we used in this case.


We see that it's extremely easy to start extracting text from your binary files with Sparkflows. As next steps, you can just as easily perform NLP on this text, or save it to HDFS, Hive, Solr, or Elasticsearch, by adding further nodes to your workflow.

You can get started by downloading Sparkflows from
