Churn Prediction using Sparkflows
In this solution, we create a Random Forest Model to predict churn and evaluate the results.
Dataset
The dataset is artificial churn data, based on claims and similar to real-world data. It is taken from the following location.
Sample Data
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False.
Dataset Fields
State: discrete.
Account length: continuous.
Area code: continuous.
Phone number: discrete.
International plan: discrete.
Voice mail plan: discrete.
Number vmail messages: continuous.
Total day minutes: continuous.
Total day calls: continuous.
Total day charge: continuous.
Total eve minutes: continuous.
Total eve calls: continuous.
Total eve charge: continuous.
Total night minutes: continuous.
Total night calls: continuous.
Total night charge: continuous.
Total intl minutes: continuous.
Total intl calls: continuous.
Total intl charge: continuous.
Number customer service calls: continuous.
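As a minimal sketch of how the field list maps onto a sample row, the snippet below parses the first sample line into typed values using only the Python standard library. The snake_case column names are assumptions derived from the field list, and `churn` is an assumed name for the final True/False value, which the field list does not mention.

```python
import csv
import io

# Column names are assumptions derived from the field list; "churn" is an
# assumed name for the final True/False column, which the field list omits.
COLUMNS = [
    "state", "account_length", "area_code", "phone_number",
    "intl_plan", "voice_mail_plan", "number_vmail_messages",
    "total_day_minutes", "total_day_calls", "total_day_charge",
    "total_eve_minutes", "total_eve_calls", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls", "churn",
]

# Fields kept as strings (the ones marked "discrete" in the field list).
DISCRETE = {"state", "phone_number", "intl_plan", "voice_mail_plan"}


def parse_row(line):
    """Turn one comma-separated sample line into a dict of typed values."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(values)}")
    record = {}
    for name, raw in zip(COLUMNS, values):
        raw = raw.strip()
        if name == "churn":
            # The sample data writes the label with a trailing period ("False.").
            record[name] = raw.rstrip(".").lower() == "true"
        elif name in DISCRETE:
            record[name] = raw
        else:
            record[name] = float(raw)
    return record


sample = ("KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, "
          "16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.")
row = parse_row(sample)
print(row["state"], row["account_length"], row["churn"])
```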
Workflow
Below is the workflow we use for creating the model for Churn Prediction.
The workflow performs the following steps:
- Reads in the dataset from a tab separated file
- Applies StringIndexer on the field intl_plan
- Applies VectorAssembler on the fields we want to model on
- Splits the dataset into (.8, .2)
- Performs Random Forest Classification
- Performs prediction using the model generated on the remaining 20% dataset
- Finally evaluates the prediction results
String Indexer
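The StringIndexer step turns the string values of intl_plan into numeric indices. Spark ML's StringIndexer orders labels by descending frequency by default, so the most common value receives index 0. The plain-Python sketch below mimics that behaviour on the intl_plan values of the five sample rows; it is an illustration, not the Sparkflows node itself, and the alphabetical tie-break is an assumption made for this sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to an index, most frequent label first.

    Ties are broken alphabetically here; that tie-break is an assumption
    made for this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# intl_plan values from the five sample rows.
intl_plan = ["no", "no", "no", "yes", "yes"]
mapping = fit_string_indexer(intl_plan)
indexed = [mapping[v] for v in intl_plan]
print(mapping)   # {'no': 0.0, 'yes': 1.0}
print(indexed)   # [0.0, 0.0, 0.0, 1.0, 1.0]
```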
VectorAssembler
In the VectorAssembler, we select the fields we want to include in the model. Only numeric fields are listed, because VectorAssembler cannot assemble string columns; a string field such as intl_plan has to be indexed first.
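Conceptually, assembling means concatenating the selected numeric fields of each row into one feature vector. The sketch below shows this for a single record in plain Python. The record, the column names, and `intl_plan_index` (standing for the StringIndexer output) are hypothetical; the fields actually selected in the workflow are not listed in the text.

```python
def assemble(record, input_cols):
    """Concatenate the chosen numeric fields of one record into a feature list."""
    features = []
    for col in input_cols:
        value = record[col]
        if not isinstance(value, (int, float)):
            raise TypeError(f"column {col!r} is not numeric: {value!r}")
        features.append(float(value))
    return features


# A hypothetical record holding a few fields of the first sample row;
# intl_plan_index stands for the StringIndexer output.
record = {
    "state": "KS",
    "intl_plan_index": 0.0,
    "total_day_minutes": 265.1,
    "total_day_calls": 110,
    "number_customer_service_calls": 1,
}
input_cols = ["intl_plan_index", "total_day_minutes",
              "total_day_calls", "number_customer_service_calls"]
features = assemble(record, input_cols)
print(features)  # [0.0, 265.1, 110.0, 1.0]
```

Passing a string column such as `state` raises a `TypeError`, which mirrors why only numeric fields are offered for selection.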
Split
Here we split the dataset into training and test datasets, using fractions of (.8, .2): 80% of the rows for training and 20% for testing.
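If the Split step behaves like Spark's `randomSplit` (an assumption; the text does not say how the node is implemented), each row is assigned to a side by an independent random draw, so the resulting sizes are close to 80/20 but not exact. A minimal plain-Python sketch of that idea, with a hypothetical seed for reproducibility:

```python
import random


def random_split(rows, weights=(0.8, 0.2), seed=42):
    """Assign each row to train or test with an independent random draw."""
    train_fraction = weights[0] / sum(weights)
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for 1000 dataset rows
train, test = random_split(rows)
print(len(train), len(test))  # roughly 800 and 200, not exactly
```

Fixing the seed makes the split repeatable, which helps when comparing models trained on the same data.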

RandomForestClassifier
Here we use a RandomForestClassifier with 20 trees to predict churn.
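To illustrate how a forest of 20 trees arrives at one prediction, the sketch below combines the outputs of 20 trees by textbook majority vote. The trees are hand-made, hypothetical one-split "stumps" on a single feature, not trained models, and the actual Spark implementation may combine its trees differently (for example by aggregating class probabilities).

```python
from collections import Counter


def forest_predict(trees, features):
    """Return the class that most trees vote for (ties go to the smaller label)."""
    votes = Counter(tree(features) for tree in trees)
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


def make_stump(index, threshold):
    """Hypothetical one-split tree: predict 1 (churn) above the threshold."""
    return lambda features: 1 if features[index] > threshold else 0


# 20 hand-made stumps stand in for the 20 trained trees; real trees are
# learned from the training data rather than written by hand.
trees = [make_stump(0, 200.0 + 5 * i) for i in range(20)]

print(forest_predict(trees, [265.1]))  # 1 (14 of 20 stumps vote for churn)
print(forest_predict(trees, [161.6]))  # 0 (no stump votes for churn)
```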
Predict
Here we apply the trained model to the test dataset to generate predictions.
BinaryClassificationEvaluator
Here we evaluate the quality of the predictions made on the test dataset.
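The text does not say which metric is reported. Spark ML's BinaryClassificationEvaluator defaults to the area under the ROC curve, so the sketch below assumes that metric and computes it in its pairwise form: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The labels and scores are made up for illustration and are not results from this workflow.

```python
def area_under_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = churn) and predicted churn scores, for illustration only.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
auc = area_under_roc(labels, scores)
print(auc)  # 8 of the 9 positive/negative pairs are ordered correctly
```

A value of 1.0 means every churner is scored above every non-churner, while 0.5 is what random scoring would give on average.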
Executing the Workflow
Next, we execute the workflow. It produces the model below, a forest with 20 trees.
From the evaluator we get the following results:
Workflow JSON
The workflow is defined by the JSON below. It can be run interactively from the Sparkflows UI, or scheduled with spark-submit using any scheduler.