
The Last Mile problem


When using pre-existing nodes, the question that always comes up is what to do with the logic that does not fit into any of them. This breaks the flow and defeats the value proposition for many of the use cases.

How does Sparkflows solve it?

Sparkflows provides 3 kinds of nodes for this problem:

  • SQL

  • Scala

  • Jython

These nodes allow us to add code to the workflow for our specific problem.

SQL

The SQL node allows us to write any SQL into the workflow. It registers the incoming DataFrame as a temporary table. This allows us to write SQL referring to this temporary table.

Sparkflows provides a separate node called RegisterTempTable. It allows registering any DataFrame as a temporary table in the workflow. These tables can be referred to in any SQL node.

Below is an example of an SQL node. The incoming DataFrame is registered as the temporary table 'fire_temp_table'.
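What follows is a minimal sketch of such a query; the column name 'c1' is an assumption for illustration.

    -- Hypothetical query against the registered temporary table.
    -- 'c1' stands in for an actual column of the incoming DataFrame.
    SELECT c1, COUNT(*) AS cnt
    FROM fire_temp_table
    GROUP BY c1
    ORDER BY cnt DESC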


Scala

The Scala node allows us to write Scala code into the workflow. It provides access to the incoming DataFrame as the variable 'inDF'.

We are free to write any Spark/Scala code. This puts a lot of power into the hands of the user. The Scala node also works in interactive mode, allowing us to view its output immediately.

The Scala node produces an output DataFrame which is passed on to the next node of the workflow. This is done by registering the output DataFrame as the temporary table 'outDF'.

Below is an example of a Scala node. It fits K-Means on the incoming DataFrame to generate a model. It also groups the data on column 'c1', performs a count, and outputs the resulting DataFrame by registering it as the temporary table 'outDF'.
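A minimal sketch of what that node body might look like follows; 'inDF' is supplied by Sparkflows, while the feature columns 'c2' and 'c3' and the value of k are assumptions for illustration.

    // Sketch of a Scala node body. 'inDF' is provided by Sparkflows;
    // the column names and k are illustrative assumptions.
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // Assemble hypothetical numeric columns into a feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("c2", "c3"))
      .setOutputCol("features")
    val featureDF = assembler.transform(inDF)

    // Fit K-Means on the incoming DataFrame to generate a model.
    val model = new KMeans().setK(3).fit(featureDF)

    // Group on column 'c1', count, and register the result as the
    // temporary table 'outDF' so the next node can pick it up.
    // (On older Spark versions this is registerTempTable.)
    val outDF = inDF.groupBy("c1").count()
    outDF.createOrReplaceTempView("outDF")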


Jython

The Jython node allows us to write Jython code into the workflow. It provides access to the incoming DataFrame as the variable 'inDF'. The output DataFrame is placed in 'outDF'.

Below is an example Jython Node.
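Here is a minimal sketch of such a node body; 'inDF' and 'outDF' come from Sparkflows, while the column names are assumptions for illustration.

    # Sketch of a Jython node body. 'inDF' is provided by Sparkflows;
    # the column names are illustrative assumptions.

    # Keep only rows whose hypothetical 'age' column exceeds 21,
    # using the DataFrame's SQL-expression filter.
    filtered = inDF.filter("age > 21")

    # Select a couple of columns; assigning the result to 'outDF'
    # hands it to the next node of the workflow.
    outDF = filtered.select("name", "age")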


Python

This is the last piece of the puzzle. A number of users love to use Python with Spark.

We are working on the Python node at this time and hope to release it next month. We are using Py4J for it.

