Sparkflows makes it easy to build Spark workflows using 100+ building blocks. The workflow editor is an essential part of the whole process.
Sparkflows provides the ability to preview the output of any node in the workflow. For this, Sparkflows executes the workflow on a subset of the input data. Sparkflows further optimizes the process by executing only up to the node whose output is desired.
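The "execute only up to the desired node" idea can be sketched as pruning the workflow graph to the target node and its upstream ancestors. This is a minimal illustration, not Sparkflows' actual data model; the node names and edge representation are assumptions.

```python
# Sketch: given a workflow DAG, collect only the nodes that must run
# to preview one node -- the target plus its transitive upstream nodes.
# The edge dict maps each node to the list of its parent (upstream) nodes.
def nodes_to_run(edges, target):
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(edges.get(node, []))  # walk upstream
    return needed
```

For example, previewing a Tokenizer node that reads from a socket source would run only those two nodes, leaving any downstream sinks untouched.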
Streaming & Previewing
Now, in come streaming workflows. Streaming workflows make it harder to preview the data, as Spark Streaming jobs cannot be started and stopped easily within a JVM.
We heard from our customers that testing streaming jobs at design time would be of great value to them. So the way we implemented it is to execute the workflow as a non-streaming job. For the nodes which read data from a streaming source, e.g. a socket, we implemented our own code to read in X rows of data from the streaming source. We are working to extend this to other sources like Kafka, Flume, etc.
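The bounded read described above can be sketched as follows. This is an illustrative version, not Sparkflows' internal code; the function names and the socket-based source are assumptions.

```python
import socket

def read_n_lines(stream, n):
    """Read up to n newline-terminated rows from a line-oriented stream
    (e.g. the file object wrapping a socket), then stop."""
    rows = []
    for line in stream:
        rows.append(line.rstrip("\n"))
        if len(rows) >= n:
            break  # bounded read: take only the first n rows
    return rows

def preview_from_socket(host, port, n):
    # Connect to the streaming source, take n rows, and return them,
    # so the preview can run as an ordinary non-streaming job.
    with socket.create_connection((host, port)) as sock:
        with sock.makefile("r") as stream:
            return read_n_lines(stream, n)
```

The key point is that the preview job terminates on its own after n rows, instead of running indefinitely the way a real streaming job would.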
Below is a streaming workflow which reads in lines from a socket and does some simple transforms.
The workflow reads data from a socket. We feed a text file into netcat with:
python streamfile.py spark | nc -lk 9999
Below is the code for streamfile.py :
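The original listing for streamfile.py is not shown here, so the following is one plausible reconstruction: a script that takes a filename argument and writes its lines to stdout with a small delay, so that piping into `nc -lk 9999` serves them over the socket like a live stream.

```python
# streamfile.py -- hypothetical sketch: emit a text file's lines slowly
# to stdout, simulating a live stream when piped into netcat.
import sys
import time

def stream_file(path, delay=0.1):
    with open(path) as f:
        for line in f:
            sys.stdout.write(line)
            sys.stdout.flush()   # push each line immediately into the pipe
            time.sleep(delay)    # pause between lines to mimic a stream

if __name__ == "__main__":
    stream_file(sys.argv[1])
```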
Below is the result of previewing the output of the Tokenizer node. The workflow is reading in lines from a text file and tokenizing the lines.
This way, during design time, it's much easier to quickly iterate through the process of building powerful workflows.