- Mar 1, 2017

OCR with Tesseract in Sparkflows

Sparkflows provides an OCR node for performing OCR with Tesseract. Sparkflows runs it in distributed mode on the Apache Spark cluster, so you can perform OCR on anywhere from tens to millions of files at a time.

Details of the Tesseract project are available at https://github.com/tesseract-ocr

The language model files are available at https://github.com/tesseract-ocr/tessdata

Details of installing the Tesseract language files in Sparkflows are available at https://www.sparkflows.io/nodes-ocr

Workflow

Below is a workflow which uses the OCR Node in Sparkflows.

It consists of 3 Nodes:

BinaryFiles - Reads binary files from a given file or directory
OCR - Performs OCR on the image using Tesseract
PrintNRows - Prints the first X rows of the result

BinaryFiles

It reads in the binary file 'data/ocrimage.png' from HDFS. It places the filename into the column 'fileName' and the content of the binary file into the column 'content' of the DataFrame it generates.

OCR

It performs OCR using Tesseract. It reads in the content of the binary file from the column 'content'. After performing OCR, it places the result into the column 'ocrcol'. Its output DataFrame contains 'fileName' and 'ocrcol'.

PrintNRows

It prints the first 10 rows of the result.
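
For readers who want to see the data flow of the three nodes in code, here is a rough sketch of the same steps written directly against Spark. It is not how Sparkflows implements its nodes. It assumes PySpark 3.0 or later (for the binaryFile data source), and that pytesseract, Pillow, the Tesseract binary and its language files are installed on every executor. The function names ocr_image_bytes and run_workflow are made up for this illustration; the column names 'fileName', 'content' and 'ocrcol' follow the node descriptions above.

```python
import io


def ocr_image_bytes(content, image_to_text=None):
    """Return the text recognised in one image given as raw bytes.

    image_to_text is a hypothetical hook so the OCR engine can be swapped
    (or faked in tests); by default pytesseract is used.
    """
    if content is None:
        return None
    if image_to_text is None:
        # Assumption: pytesseract and Pillow are importable on the executor.
        import pytesseract
        from PIL import Image

        def image_to_text(data):
            return pytesseract.image_to_string(Image.open(io.BytesIO(data)))

    return image_to_text(bytes(content))


def run_workflow(spark, path="data/ocrimage.png", n=10):
    """BinaryFiles -> OCR -> PrintNRows, sketched with plain Spark APIs."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # BinaryFiles: the binaryFile source yields 'path' and 'content' columns;
    # 'path' is renamed to 'fileName' to match the node's output.
    files = (
        spark.read.format("binaryFile")
        .load(path)
        .select(col("path").alias("fileName"), col("content"))
    )

    # OCR: read 'content' and place the recognised text into 'ocrcol'.
    ocr_udf = udf(ocr_image_bytes, StringType())
    result = files.select("fileName", ocr_udf(col("content")).alias("ocrcol"))

    # PrintNRows: print the first n rows of the result.
    result.show(n, truncate=False)
    return result
```

Given an existing SparkSession named spark, calling run_workflow(spark) would process the single file used in this post, and pointing path at a directory would let Spark spread the OCR work across the cluster.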

Input File

The input image file is shown below.

Executing the workflow

When the workflow is executed, it produces the following output for the single file we used in this case.

Summary

We see that it's extremely easy to start processing your binary files to extract text with Sparkflows. As a next step, you can just as easily perform NLP on this text, and/or save it into HDFS/Hive/Solr/Elasticsearch by adding additional building blocks/nodes to your workflow.

You can get started by downloading Sparkflows from https://www.sparkflows.io/download