Sparkflows provides an OCR node for performing OCR with Tesseract. Sparkflows runs it in distributed mode on an Apache Spark cluster, so you can perform OCR on anywhere from tens to millions of files at a time.
Details of the Tesseract project are available at https://github.com/tesseract-ocr
The language model files are available at https://github.com/tesseract-ocr/tessdata
Details of installing the Tesseract language files in Sparkflows are available at https://www.sparkflows.io/nodes-ocr
Below is a workflow which uses the OCR Node in Sparkflows.
It consists of 3 Nodes:
BinaryFiles - Reads binary files from a given file or directory
OCR - Performs OCR on the image using Tesseract
PrintNRows - Prints the first N rows of the result
The BinaryFiles node reads in the binary file 'data/ocrimage.png' from HDFS. It places the filename into the column 'fileName' and the content of the binary file into the column 'content' of the DataFrame it generates.
The OCR node performs OCR using Tesseract. It reads the content of the binary file from the column 'content'. After performing OCR, it places the result into the column 'ocrcol'. Its output DataFrame contains 'fileName' and 'ocrcol'.
The PrintNRows node prints the first 10 rows of the result.
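The three steps described above can be sketched in plain Python to make the data flow concrete. This is only an illustration, not how Sparkflows implements its nodes: the function names (read_binary_files, tesseract_ocr, ocr_step, print_n_rows) are hypothetical, rows are modeled as plain dictionaries rather than a Spark DataFrame, and the default OCR function assumes the pytesseract and Pillow packages plus a local Tesseract install. The column names 'fileName', 'content' and 'ocrcol' are taken from the workflow.

```python
import io
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def read_binary_files(paths: Iterable[str]) -> List[Row]:
    """BinaryFiles step: one row per file with 'fileName' and 'content'."""
    rows = []
    for path in paths:
        with open(path, "rb") as handle:
            rows.append({"fileName": path, "content": handle.read()})
    return rows


def tesseract_ocr(content: bytes) -> str:
    """Run Tesseract on raw image bytes.

    Assumption: pytesseract and Pillow are installed and the tesseract
    binary is on the PATH. They are imported lazily so the rest of the
    sketch works without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(content)))


def ocr_step(
    rows: Iterable[Row],
    ocr_fn: Callable[[bytes], str] = tesseract_ocr,
    input_col: str = "content",
    output_col: str = "ocrcol",
) -> List[Row]:
    """OCR step: read bytes from input_col, write text to output_col.

    The output keeps only 'fileName' and the OCR result column.
    """
    return [
        {"fileName": row["fileName"], output_col: ocr_fn(row[input_col])}
        for row in rows
    ]


def print_n_rows(rows: List[Row], n: int = 10) -> None:
    """PrintNRows step: print the first n rows."""
    for row in rows[:n]:
        print(row)
```

With Tesseract installed locally, print_n_rows(ocr_step(read_binary_files(["data/ocrimage.png"]))) would run the same three steps on a single local file. On a Spark cluster, a comparable per-file function could be applied to the (path, bytes) pairs returned by SparkContext.binaryFiles, which is one way the work can be spread across executors.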
The input image file is shown below.
Executing the workflow
When the workflow is executed, it produces the following output for the single file we used in this case.
We see that it's extremely easy to start processing your binary files to extract text with Sparkflows. As next steps, you can just as easily perform NLP on this text and/or save it into HDFS/Hive/Solr/Elasticsearch by adding more nodes to your workflow.
You can get started by downloading Sparkflows from https://www.sparkflows.io/download