Spark ML Pipeline using Sparkflows

Spark ML provides Pipelines, which allow us to run a sequence of algorithms.

http://spark.apache.org/docs/latest/ml-guide.html#pipeline

A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
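
To make the Transformer/Estimator distinction concrete, here is a minimal, self-contained Scala sketch. It is not tied to Sparkflows; the data, app name, and column names are invented for illustration:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PipelineConcept").getOrCreate()
import spark.implicits._

// A toy DataFrame to run the stages against.
val df = Seq((0L, "spark ml pipelines"), (1L, "hello world")).toDF("id", "text")

// Tokenizer and HashingTF are both Transformers; an Estimator such as
// LogisticRegression would instead be fit during pipeline.fit().
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// fit() runs the stages in order; the result is a PipelineModel,
// which is itself a Transformer.
val model: PipelineModel = new Pipeline()
  .setStages(Array(tokenizer, hashingTF))
  .fit(df)

model.transform(df).show(truncate = false)
```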

Dataset


We use the message field of our spam dataset.


Workflow


We create the above workflow in Sparkflows. It consists of the following steps (a sketch of the equivalent Spark ML code follows the list):

  • Read in the dataset from a tab-separated file

  • Split the data into training and test sets

  • Create a Pipeline consisting of the following stages:

    • Tokenize the message field of the dataset

    • Convert each document’s words into a numerical feature vector

    • Train a Logistic Regression Model

  • Predict on the test dataset

  • Print the predictions
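
Under the hood, the workflow corresponds roughly to the following Spark ML program. This is a sketch, not code generated by Sparkflows: the file path, the column names ("label", "message"), the split ratio, and the seed are all assumptions, and a string-valued label would additionally need a StringIndexer stage.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SpamPipeline").getOrCreate()

// Read in the dataset from a tab-separated file (path and column
// names are assumptions; the label must already be numeric).
val data = spark.read
  .option("sep", "\t")
  .option("inferSchema", "true")
  .csv("spam.tsv")
  .toDF("label", "message")

// Split the data into training and test sets (80/20 is an arbitrary choice).
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// Stage 1: tokenize the message field into words.
val tokenizer = new Tokenizer().setInputCol("message").setOutputCol("words")

// Stage 2: hash each document's words into a numerical feature vector.
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// Stage 3: train a Logistic Regression model on the feature vectors.
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// Assemble the three stages into a Pipeline and fit it on the training set.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Predict on the test dataset and print the predictions.
val predictions = model.transform(test)
predictions.select("message", "prediction").show(10, truncate = false)
```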

Sparkflows intelligently identifies which parts of the workflow make up the Pipeline. In the workflow above, it determines that the Tokenizer, HashingTF, and LogisticRegression nodes form the Pipeline stages.

Below we see the configuration of each Node. The schema is passed on from one Node to the next, and some Nodes update it along the way.
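
The same schema propagation can be seen outside Sparkflows by printing the schema after each stage. This small Scala sketch uses made-up data, and the schemas noted in the comments are approximate:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SchemaPropagation").getOrCreate()
import spark.implicits._

val df = Seq((1.0, "free prize waiting")).toDF("label", "message")
df.printSchema()        // label: double, message: string

// Tokenizer passes the incoming schema through and appends a "words" column.
val tokenized = new Tokenizer()
  .setInputCol("message").setOutputCol("words")
  .transform(df)
tokenized.printSchema() // ... plus words: array<string>

// HashingTF appends a "features" vector column.
val featurized = new HashingTF()
  .setInputCol("words").setOutputCol("features")
  .transform(tokenized)
featurized.printSchema() // ... plus features: vector
```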

Executing the Workflow

Next, we execute the workflow. Below is the output of the sample predictions.

The final output schema contains the prediction column.
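
Continuing the workflow sketch above (where predictions is the DataFrame produced by model.transform(test)), printing its schema shows the columns the fitted model appends; the listing in the comments is approximate:

```scala
// The Logistic Regression stage appends rawPrediction, probability,
// and prediction to the schema it receives.
predictions.printSchema()
// root
//  |-- label: double
//  |-- message: string
//  |-- words: array<string>
//  |-- features: vector
//  |-- rawPrediction: vector
//  |-- probability: vector
//  |-- prediction: double
```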
