Spark ML Pipeline using Sparkflows

Spark ML provides Pipelines, which allow us to run a sequence of algorithms.

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.



We use the message field of our spam dataset.




We create the above workflow in Sparkflows.

It consists of the following steps:


  • Read in the dataset from a tab-separated file

  • Split the data into training and test sets

  • Create a Pipeline consisting of the following stages:

    • Tokenize the message field of the dataset

    • Convert each document’s words into a numerical feature vector

    • Train a Logistic Regression Model

  • Predict on the test dataset

  • Print the predictions


Sparkflows intelligently identifies the parts of the workflow that make up the Pipeline. In the workflow above, it determines that Tokenizer, HashingTF, and LogisticRegression are the Pipeline stages.


Below are the configurations for each of the Nodes. The schema is passed from one Node to the next, and some Nodes update it along the way (for example, the Tokenizer adds a words column).

Executing the Workflow

Next, we execute the workflow. Below is the output of the sample predictions.

The final output schema contains the prediction column.










© 2020 Sparkflows, Inc. All rights reserved. 

