Spark ML Pipeline using Sparkflows
Spark ML provides Pipelines, which allow us to run a sequence of algorithms.
http://spark.apache.org/docs/latest/ml-guide.html#pipeline
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
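As a quick illustration of that distinction, here is a minimal spark-shell style sketch (the messages and column names are made up for the example):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

// A Transformer turns one DataFrame into another via transform().
val df = Seq("free prize inside", "see you at lunch").toDF("message")
val tokenizer = new Tokenizer().setInputCol("message").setOutputCol("words")
tokenizer.transform(df).show(truncate = false)

// An Estimator learns from a DataFrame via fit(), producing a Transformer;
// LogisticRegression, for example, fits a LogisticRegressionModel.
val lr = new LogisticRegression()
```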
Dataset
We use the message field of our spam dataset.
Workflow
We create the workflow shown above in Sparkflows.
It consists of the following steps:
- Read in the dataset from a tab-separated file
- Split the data into training and test sets
- Create a Pipeline consisting of the following stages:
  - Tokenize the message field of the dataset
  - Convert each document's words into a numerical feature vector
  - Train a Logistic Regression model
- Predict on the test dataset
- Print the predictions
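For readers who want to see what these steps assemble under the hood, here is a rough Spark ML sketch of the same workflow in Scala. It is a sketch, not the code Sparkflows generates: the file path (data/spam.tsv), the column layout (a numeric label followed by the message text), the 80/20 split, and the parameter values are all assumptions; a file with string labels such as "spam"/"ham" would additionally need a StringIndexer.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SpamPipeline").master("local[*]").getOrCreate()

// Read the tab-separated file. The path and column layout are assumptions.
val data = spark.read
  .option("sep", "\t")
  .option("inferSchema", "true")
  .csv("data/spam.tsv")
  .toDF("label", "message")

// Split the data into training and test sets (the 80/20 ratio is arbitrary).
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// The three Pipeline stages: two feature Transformers and one Estimator.
val tokenizer = new Tokenizer().setInputCol("message").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() runs the stages in order on the training data and returns
// a PipelineModel (itself a Transformer).
val model = pipeline.fit(training)

// Predict on the test dataset.
val predictions = model.transform(test)
```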
Sparkflows intelligently identifies the parts of the workflow that make up the Pipeline. In the workflow above, it identifies that Tokenizer, HashingTF, and LogisticRegression are part of the Pipeline.
Below we see the configuration of each Node. We also see that the schema is passed on from one Node to the next; some Nodes update the schema along the way.
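Continuing the sketch above (reusing training, tokenizer, and hashingTF), we can watch the schema grow as each stage runs, which mirrors how the schema flows from Node to Node in Sparkflows:

```scala
// Each stage appends columns to the schema it receives.
training.printSchema()                       // label, message
val tokenized = tokenizer.transform(training)
tokenized.printSchema()                      // label, message, words
hashingTF.transform(tokenized).printSchema() // label, message, words, features
```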
Executing the Workflow
Next, we execute the workflow. Below is a sample of the prediction output.
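Continuing the same sketch, printing a few test-set predictions might look like this (the probability and prediction columns are the defaults Spark ML adds; the formatting here is illustrative):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Show a handful of predictions from the fitted PipelineModel.
predictions
  .select("message", "probability", "prediction")
  .take(5)
  .foreach { case Row(msg: String, prob: Vector, pred: Double) =>
    println(s"($msg) --> prob=$prob, prediction=$pred")
  }
```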