Spark ML Pipeline using Sparkflows
Spark ML provides Pipelines, which allow us to run a sequence of algorithms.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
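The Transformer/Estimator contract can be sketched in a few lines of plain Python. This is an illustration of the semantics only: the class and stage names here are invented for the example and mirror the concepts, not Spark's actual API.

```python
# Sketch of Spark ML Pipeline semantics: an Estimator is fit on data and
# produces a Transformer; a Pipeline fits its Estimator stages in order,
# feeding each stage's output to the next. (Illustration only, not Spark code.)

class Transformer:
    """A stage that transforms a dataset directly."""
    def transform(self, data):
        raise NotImplementedError

class Estimator:
    """A stage that must first be fit on data, producing a Transformer."""
    def fit(self, data):
        raise NotImplementedError

class Uppercase(Transformer):
    """A toy Transformer: needs no fitting, just maps rows."""
    def transform(self, data):
        return [row.upper() for row in data]

class PrefixLearner(Estimator):
    """A toy Estimator: 'learns' the most common first character, then tags rows."""
    def fit(self, data):
        first_chars = [row[0] for row in data if row]
        common = max(set(first_chars), key=first_chars.count)

        class Tagger(Transformer):
            def transform(self, data):
                return [f"{common}:{row}" for row in data]
        return Tagger()

class PipelineModel(Transformer):
    """The fitted Pipeline: a chain of Transformers."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

class Pipeline(Estimator):
    """Fitting a Pipeline fits each Estimator stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(data)  # Estimator -> Transformer
            fitted.append(stage)
            data = stage.transform(data)  # pass output to the next stage
        return PipelineModel(fitted)

model = Pipeline([Uppercase(), PrefixLearner()]).fit(["spam", "span", "ham"])
print(model.transform(["spam"]))  # → ['S:SPAM']
```

The key point the sketch shows: `Pipeline.fit` returns a `PipelineModel`, itself a Transformer, so the fitted chain can be applied to new data in one call.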
We use the message field of our spam dataset.
We create the above workflow in Sparkflows.
It consists of the following steps:
Read in the dataset from a tab-separated file
Split the data into training and test sets
Create a Pipeline consisting of the following stages:
Tokenize the message field of the dataset
Convert each document’s words into a numerical feature vector
Train a Logistic Regression Model
Predict on the test dataset
Print the predictions
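The steps inside the Pipeline can be illustrated end-to-end in plain Python. This is a sketch of the underlying ideas only — whitespace tokenization, the hashing trick behind HashingTF (Spark itself uses MurmurHash3; CRC32 stands in here), and logistic regression trained by stochastic gradient descent — with an invented toy dataset, not Sparkflows or Spark code.

```python
import math
import zlib

def tokenize(message):
    # Tokenizer: lowercase the message and split it on whitespace.
    return message.lower().split()

def hashing_tf(words, num_features=64):
    # HashingTF idea: hash each word into one of num_features buckets and
    # count occurrences, giving a fixed-length numerical feature vector.
    vec = [0.0] * num_features
    for w in words:
        vec[zlib.crc32(w.encode()) % num_features] += 1.0
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=300):
    # Plain stochastic gradient descent on the logistic loss.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            b -= lr * g
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0

# Invented toy rows standing in for the spam dataset's message field
# (label 1 = spam, 0 = ham).
train = [
    ("win a free prize now", 1),
    ("claim free money now", 1),
    ("meeting moved to noon", 0),
    ("lunch at noon tomorrow", 0),
]
X = [hashing_tf(tokenize(m)) for m, _ in train]
y = [label for _, label in train]
w, b = train_logistic_regression(X, y)

# Predict on unseen messages and print the predictions.
for message in ("claim your free prize", "see you at noon"):
    label = predict(w, b, hashing_tf(tokenize(message)))
    print(message, "->", "spam" if label else "ham")
```

Note how the tokenize → hashing_tf → predict chain mirrors the Tokenizer → HashingTF → LogisticRegression stages of the workflow above.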
Sparkflows intelligently identifies the parts of the workflow that make up the Pipeline. In the workflow above, it identifies that Tokenizer, HashingTF, and LogisticRegression are part of the Pipeline.
Below we see the configurations for each of the Nodes. We also see that the schema is passed on from one Node to the next; some of the Nodes also update the schema as it passes through.
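How the schema evolves can be sketched as follows. Each node receives its input schema and may append output columns; the column names here follow common Spark ML examples ("words", "features") and Spark's default prediction columns — this is an illustration of the idea, not Sparkflows internals.

```python
# Schema propagation sketch: a schema is a list of (column, type) pairs, and
# each pipeline node appends the columns it produces.

def tokenizer_node(schema):
    # Tokenizer adds the tokenized words column.
    return schema + [("words", "array<string>")]

def hashing_tf_node(schema):
    # HashingTF adds the numerical feature vector column.
    return schema + [("features", "vector")]

def logistic_regression_node(schema):
    # LogisticRegression adds Spark's default prediction columns.
    return schema + [("rawPrediction", "vector"),
                     ("probability", "vector"),
                     ("prediction", "double")]

schema = [("label", "double"), ("message", "string")]
for node in (tokenizer_node, hashing_tf_node, logistic_regression_node):
    schema = node(schema)

print([name for name, _ in schema])
# → ['label', 'message', 'words', 'features',
#    'rawPrediction', 'probability', 'prediction']
```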
Executing the Workflow
Next, we execute the workflow. Below is the output of the sample predictions.