Churn Prediction using Sparkflows

In this solution, we create a Random Forest Model to predict churn and evaluate the results.

Dataset

The dataset is artificial churn data, based on claims similar to those seen in the real world. It is taken from the following location.

Sample Data

KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.

OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.

NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.

OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.

OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False.

 

Dataset Fields


State: discrete.
Account length: continuous.
Area code: continuous.
Phone number: discrete.
International plan: discrete.
Voice mail plan: discrete.
Number vmail messages: continuous.
Total day minutes: continuous.
Total day calls: continuous.
Total day charge: continuous.
Total eve minutes: continuous.
Total eve calls: continuous.
Total eve charge: continuous.
Total night minutes: continuous.
Total night calls: continuous.
Total night charge: continuous.
Total intl minutes: continuous.
Total intl calls: continuous.
Total intl charge: continuous.
Number customer service calls: continuous.
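
To make the field list concrete, here is a minimal Python sketch that parses one of the sample rows into named fields. The snake_case column names are hypothetical, and `churn` is an assumed name for the final True/False value in each sample row, which is not part of the field list above. The sample rows are shown comma separated, so the separator is a parameter that can be changed for a tab-separated file.

```python
# Hypothetical snake_case names for the fields listed above.
# "churn" is an assumed name for the final True/False column in the sample rows.
COLUMNS = [
    "state", "account_length", "area_code", "phone_number",
    "intl_plan", "voice_mail_plan", "number_vmail_messages",
    "total_day_minutes", "total_day_calls", "total_day_charge",
    "total_eve_minutes", "total_eve_calls", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls", "churn",
]

# Fields marked "discrete" stay as strings; the rest are parsed as floats.
DISCRETE = {"state", "phone_number", "intl_plan", "voice_mail_plan"}


def parse_row(line, sep=","):
    """Turn one raw line into a dict keyed by column name."""
    values = [v.strip() for v in line.split(sep)]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    row = {}
    for name, value in zip(COLUMNS, values):
        if name == "churn":
            # The label carries a trailing period in the sample data ("False.").
            row[name] = value.rstrip(".") == "True"
        elif name in DISCRETE:
            row[name] = value
        else:
            row[name] = float(value)
    return row


sample = ("KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, "
          "16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.")
row = parse_row(sample)
print(row["state"], row["total_day_minutes"], row["churn"])
```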

Workflow

 

Below is the workflow we use for creating the model for Churn Prediction.

The workflow performs the following steps:

  • Reads the dataset from a tab-separated file

  • Applies StringIndexer on the field intl_plan

  • Applies VectorAssembler on the fields we want to model on

  • Splits the dataset into training (80%) and test (20%) sets

  • Performs Random Forest Classification

  • Uses the generated model to predict on the remaining 20% of the dataset

  • Finally, evaluates the prediction results
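
For readers who want to see the same steps as code, the list above can be sketched with the Spark ML Python API roughly as follows. This is a sketch under assumptions, not the code that Sparkflows runs: the column names, the file having a header row, the `churn` label already being numeric (0 or 1), the choice of feature columns, and the seed are all hypothetical.

```python
# Hypothetical column names; adjust them to match your file.
LABEL_COL = "churn"  # assumed to be numeric 0/1 already
NUMERIC_COLS = [
    "account_length", "number_vmail_messages",
    "total_day_minutes", "total_day_calls", "total_day_charge",
    "total_eve_minutes", "total_eve_calls", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls",
]
# The indexed version of intl_plan joins the numeric fields as a feature.
FEATURE_COLS = NUMERIC_COLS + ["intl_plan_index"]


def run_churn_workflow(spark, path, seed=42):
    """Train and evaluate a 20-tree random forest; returns (model, metric)."""
    # pyspark is imported here so this module loads even without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Read the tab-separated file (header row and schema inference are assumptions).
    df = spark.read.csv(path, sep="\t", header=True, inferSchema=True)

    # StringIndexer on intl_plan, then VectorAssembler on the model fields.
    indexer = StringIndexer(inputCol="intl_plan", outputCol="intl_plan_index")
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    forest = RandomForestClassifier(
        labelCol=LABEL_COL, featuresCol="features", numTrees=20
    )

    # Split into (.8, .2) and fit on the training part only.
    train, test = df.randomSplit([0.8, 0.2], seed=seed)
    model = Pipeline(stages=[indexer, assembler, forest]).fit(train)

    # Predict on the remaining 20% and evaluate the predictions.
    predictions = model.transform(test)
    evaluator = BinaryClassificationEvaluator(labelCol=LABEL_COL)
    return model, evaluator.evaluate(predictions)
```

Calling `run_churn_workflow(spark, "churn.tsv")` with an existing `SparkSession` and a hypothetical file path would return the fitted pipeline model and the evaluator's metric.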

StringIndexer
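
As a rough illustration of what this step does to the intl_plan field, the sketch below maps each distinct string to a numeric index, with the most frequent value receiving index 0. This mirrors the default frequency-descending ordering of Spark's StringIndexer; the helper name `fit_string_index` is hypothetical, and the input values are the intl_plan entries from the five sample rows.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# intl_plan values from the five sample rows shown earlier.
intl_plan = ["no", "no", "no", "yes", "yes"]
mapping = fit_string_index(intl_plan)
indexed = [mapping[v] for v in intl_plan]
print(mapping)   # {'no': 0.0, 'yes': 1.0}
print(indexed)   # [0.0, 0.0, 0.0, 1.0, 1.0]
```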

VectorAssembler

In the VectorAssembler, we select the fields we want to include in the model. Only numeric fields are offered for selection, because VectorAssembler cannot assemble string fields; a string field such as intl_plan has to be indexed first.
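
Conceptually, the assembler concatenates the selected fields of each row into one feature vector. The plain-Python sketch below illustrates that idea on a few values from the first sample row; the function name `assemble` and the column names are hypothetical, and this is not how Spark implements the step.

```python
from numbers import Number


def assemble(row, input_cols):
    """Concatenate the selected fields of one row into a flat feature vector."""
    features = []
    for col in input_cols:
        value = row[col]
        if not isinstance(value, Number):
            # String fields must be indexed before they can be assembled.
            raise TypeError(f"column {col!r} is not numeric: {value!r}")
        features.append(float(value))
    return features


# A few fields from the first sample row, plus the indexed intl_plan value.
row = {
    "intl_plan": "no",
    "intl_plan_index": 0.0,
    "total_day_minutes": 265.1,
    "total_day_calls": 110,
    "number_customer_service_calls": 1,
}
selected = ["intl_plan_index", "total_day_minutes",
            "total_day_calls", "number_customer_service_calls"]
print(assemble(row, selected))  # [0.0, 265.1, 110.0, 1.0]
```

Passing the raw `intl_plan` column instead of `intl_plan_index` raises a `TypeError`, which is why the indexing step comes first.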

Split

Here we split the dataset into training and test datasets, using 80% of the rows for training and 20% for testing.
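
One detail worth knowing is that a random split of this kind assigns each row independently, so the resulting sizes are close to 80% and 20% but not exact. The sketch below shows the idea in plain Python; the function name, the seed, and the 1000 dummy rows are hypothetical and only serve the illustration.

```python
import random


def random_split(rows, weights=(0.8, 0.2), seed=42):
    """Assign each row to a split independently, so split sizes are approximate."""
    rng = random.Random(seed)
    # Normalize the weights so they do not have to sum to 1.
    train_fraction = weights[0] / sum(weights)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


train, test = random_split(list(range(1000)))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same data.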

RandomForestClassifier

Here we use a RandomForestClassifier for predicting churn. We use 20 trees.

Predict

Here we apply the trained model to the test dataset to generate predictions.

BinaryClassificationEvaluator

Here we evaluate the quality of the predictions made on the test dataset.
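
Spark's BinaryClassificationEvaluator reports the area under the ROC curve by default; whether this workflow keeps that default is an assumption. The sketch below computes that metric in plain Python as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are made-up illustration values, not results from this workflow.

```python
def area_under_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = churned) and predicted churn scores.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]
auc = area_under_roc(labels, scores)
print(round(auc, 4))  # 0.8333
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of churners above non-churners.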

Executing the Workflow

Next, we execute the workflow. It produces the model below, a forest with 20 trees.

From the evaluator we get the following results:

Workflow JSON

The workflow is defined by the JSON below. It can be run interactively from the Sparkflows UI, or scheduled with spark-submit using any scheduler.

© 2019 Sparkflows, Inc. All rights reserved. 