Sparkflows provides an OCR node for performing OCR with Tesseract. Sparkflows runs it in distributed mode on an Apache Spark cluster, so you can perform OCR on anywhere from tens to millions of files at a time.
Details of the Tesseract project are available at https://github.com/tesseract-ocr
The language model files are available at https://github.com/tesseract-ocr/tessdata
Details of installing the Tesseract language files in Sparkflows are available at https://www.sparkflows.io/nodes-ocr
Workflow
Below is a workflow that uses the OCR node in Sparkflows.
It consists of 3 nodes:
BinaryFiles - Reads the binary files from a given file or directory
OCR - Performs OCR on each image using Tesseract
PrintNRows - Prints the first X rows of the result
BinaryFiles
It reads in the binary file 'data/ocrimage.png' from HDFS. It places the filename into the column 'fileName' and the content of the binary file into the column 'content' of the DataFrame it generates.
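To make the shape of that DataFrame concrete, here is a minimal plain-Python sketch of the same idea. It is not Sparkflows code: the function name `read_binary_files` is hypothetical, it reads from the local filesystem rather than HDFS, and it models each row as a dictionary with the 'fileName' and 'content' columns described above.

```python
from pathlib import Path
from typing import Dict, List, Union


def read_binary_files(path: Union[str, Path]) -> List[Dict[str, object]]:
    """Return one row per file, with 'fileName' and 'content' keys.

    Hypothetical helper: accepts either a single file or a directory,
    mirroring the "given file or directory" behaviour of the node.
    """
    root = Path(path)
    if root.is_dir():
        # Walk the directory recursively and keep regular files only.
        files = sorted(p for p in root.rglob("*") if p.is_file())
    else:
        files = [root]
    # Each row carries the file name and the raw bytes of the file.
    return [{"fileName": str(p), "content": p.read_bytes()} for p in files]
```

Calling `read_binary_files("data/ocrimage.png")` on a local copy of the image would yield a single row whose 'content' value holds the raw PNG bytes.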
OCR
It performs OCR using Tesseract. It reads in the content of the binary file from the column 'content'. After performing OCR, it places the result into the column 'ocrcol'. Its output DataFrame contains 'fileName' and 'ocrcol'.
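The column transformation can be sketched in Python as follows. This is an illustration under stated assumptions, not the Sparkflows implementation: it assumes the `pytesseract` and `Pillow` packages plus a local Tesseract install, and, for the Spark variant, PySpark 3.0 or later with its `binaryFile` data source. The function names `tesseract_ocr`, `ocr_rows`, and `ocr_dataframe` are hypothetical.

```python
import io
from typing import Callable, Dict, Iterable, List


def tesseract_ocr(content: bytes) -> str:
    """OCR a single image held in memory (assumes pytesseract and Pillow)."""
    # Imported lazily so the rest of the sketch loads without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(content)))


def ocr_rows(
    rows: Iterable[Dict[str, object]],
    ocr: Callable[[bytes], str] = tesseract_ocr,
) -> List[Dict[str, object]]:
    """Read 'content' from each row and emit 'fileName' plus 'ocrcol'."""
    return [{"fileName": r["fileName"], "ocrcol": ocr(r["content"])} for r in rows]


def ocr_dataframe(spark, path: str):
    """Distributed variant: apply the same OCR function as a Spark UDF.

    Assumes PySpark 3.0+; the binaryFile source exposes 'path' and 'content'.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    ocr_udf = F.udf(lambda content: tesseract_ocr(bytes(content)), StringType())
    df = spark.read.format("binaryFile").load(path)
    return df.select(
        F.col("path").alias("fileName"),
        ocr_udf(F.col("content")).alias("ocrcol"),
    )
```

Passing the OCR function as a parameter keeps the row transformation independent of Tesseract, so it can be exercised with a stub; in the Spark variant, each executor would need Tesseract and its language files installed.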
PrintNRows
It prints the first 10 rows of the result.
Input File
The input image file is shown below.
Executing the workflow
When the workflow is executed, it produces the following output for the single file we used in this case.
Summary
As this example shows, it's easy to start extracting text from your binary files with Sparkflows. As next steps, you can just as easily perform NLP on this text, and/or save it into HDFS, Hive, Solr, or Elasticsearch by adding further nodes to your workflow.
You can get started by downloading Sparkflows from https://www.sparkflows.io/download