
Multi-Model Workflow using Sparkflows

Updated: Jun 19, 2023


Business Problem:

This use case applies multiple models to sale price prediction. Price may be the single most important figure in any business case, but in most cases our dataset is limited to basic information about the product and our list price. In a simple retail case, the list price and sale price are almost identical, or are tightly controlled by discounts we specifically create.


In B2B licensing and SaaS sales, the final sale price is almost never the suggested price. The final sale price is subject to many unquantifiable factors, such as negotiations and current buyer sentiment. Since sales can take some time to close, it is important to be able to forecast our future sales to make business decisions. This is a perfect case for using data science and machine learning models to fill in the gaps in our understanding.


Solution Proposal:

Often when starting a data science journey, it is not immediately clear which methods will produce the best results. Testing is a crucial part of data science, but it can be tedious and time-consuming. For example, if a problem has been identified that requires a regression-type model, we have only begun to narrow down the options available to us.


These days there are dozens of regression models available in multiple forms in different libraries. Drilling down deeper, we can also find thousands of permutations of hyper-parameters that we might want to explore further.




The obvious solution may be to automate the process using Spark or H2O AutoML, but this is not always the most appropriate choice. In many ways we give up control when we automate model building.


Often regression models built using AutoML can be harder to interpret and control. Instead, we need a way to be able to build and test multiple models, simultaneously, that we can fully understand and control.


To do this we will be using Sparkflows to build a single workflow that can test multiple models at once. By encapsulating the models into a single workflow, we can retain the performance advantages of distributed computing and cut down on the amount of data bandwidth we require.


This will speed up overall runtimes without compromising the performance of each model. Using this method will let us maintain full control of each model independently.


Implementation:

Data Ingestion and Cleaning:

To start, we load our dataset using the Read CSV node, since our data is stored in the machine’s HDFS. Then we do a little data cleaning by dropping all of the rows that contain nulls. Note that this is not always the best solution: if a column has many nulls, we may be dropping a significant amount of our data, and in that case it may be better to drop the column instead to preserve our data points.
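In code terms, the Read CSV and drop-null steps correspond roughly to the following PySpark sketch. The HDFS path and application name are placeholders, not values from the original workflow.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-model-price-prediction").getOrCreate()

# Read the raw sales data from HDFS (path is a placeholder).
df = spark.read.csv("hdfs:///data/sales_history.csv", header=True, inferSchema=True)

# Drop every row that contains a null in any column.
df_clean = df.na.drop()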


Conversely, if we have only a few nulls, we can instead use one of a variety of available imputation methods. The main idea, in any case, is to preserve as much usable data as possible without compromising its quality.
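As a rough sketch of the imputation alternative, a numeric column with only a few nulls could be filled with its median using Spark’s Imputer; the column name below is hypothetical.

from pyspark.ml.feature import Imputer

# Fill missing values in a numeric column with the column median
# ("discount_pct" is a hypothetical column name).
imputer = Imputer(strategy="median",
                  inputCols=["discount_pct"],
                  outputCols=["discount_pct_imputed"])
df_imputed = imputer.fit(df_clean).transform(df_clean)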


Data Preparation:

Next in our data preparation process, we need to make our string categorical columns usable. Most machine learning models cannot properly utilize categorical columns directly. To make these columns usable we use the String Indexer node, which replaces each unique string value with an integer. The final step of almost any data preparation process is to split the data into testing and training (or validation) datasets. This ensures that later we can properly performance-test our models and better validate their results.
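A rough PySpark equivalent of the String Indexer node followed by the train/test split might look like this; the categorical column name and the 80/20 split ratio are assumptions, not taken from the original workflow.

from pyspark.ml.feature import StringIndexer

# Replace each unique string value with an integer index
# ("product_category" is a hypothetical categorical column).
indexer = StringIndexer(inputCol="product_category",
                        outputCol="product_category_idx",
                        handleInvalid="keep")
df_indexed = indexer.fit(df_clean).transform(df_clean)

# Split into training and testing (or validation) sets, e.g. 80/20.
train_df, test_df = df_indexed.randomSplit([0.8, 0.2], seed=42)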


Data Distribution:

Normally, this is where we would start the modeling and performance testing. However, since we want to run multiple models, we need to ensure that each model receives the same input, which makes the models directly comparable to each other. To do this we simply use an intermediary node, the Print N Rows node.


The execution output of this node is not important to us, but its ability to distribute the same output to multiple nodes within the workflow is.
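Outside of Sparkflows, a loose analogue of this fan-out is simply reusing (and caching) the same DataFrame for every downstream model, so each one trains on an identical input; this is a sketch of that idea, not the Print N Rows node itself.

# Materialize the training split once so every model trains on the
# exact same rows rather than on independently recomputed data.
train_df.cache()
train_df.count()  # force evaluation so the cache is populated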


Modeling:

Now we can add the models that we want to test. For ease of use we have elected to use three H2O modeling nodes. Using models from the same library helps simplify the process of building multiple models, since they share the same testing nodes, and even the same saving nodes if we want to save the models for use in another workflow.
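For readers who prefer code, a hedged sketch of the same idea with the H2O Python API is shown below. It assumes the prepared training split is converted to an H2OFrame (here via pandas for simplicity), and the target column name and model parameters are placeholders rather than the settings used in the Sparkflows nodes.

import h2o
from h2o.estimators import (
    H2OGeneralizedLinearEstimator,
    H2OGradientBoostingEstimator,
    H2OXGBoostEstimator,
)

h2o.init()

# Convert the Spark training split to an H2OFrame
# ("sale_price" is a placeholder target column).
train_hf = h2o.H2OFrame(train_df.toPandas())
target = "sale_price"
features = [c for c in train_hf.columns if c != target]

# Three regression models from the same library, configured independently.
glm = H2OGeneralizedLinearEstimator(family="gaussian")
gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5)
xgb = H2OXGBoostEstimator(ntrees=100)

for model in (glm, gbm, xgb):
    model.train(x=features, y=target, training_frame=train_hf)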


We also use another Print N Rows node to connect our testing dataset to the testing nodes. Now we have a workflow that can test multiple models at once, and if we want to, we can tweak the parameters of each model separately.
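Continuing the sketch above, the testing step corresponds roughly to scoring each trained model against the held-out split, again assuming the test split is converted to an H2OFrame.

# Convert the test split and score each model on it.
test_hf = h2o.H2OFrame(test_df.toPandas())

for name, model in [("GLM", glm), ("GBM", gbm), ("XGBoost", xgb)]:
    perf = model.model_performance(test_data=test_hf)
    print(f"{name}: RMSE={perf.rmse():.2f}, R2={perf.r2():.3f}")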


If you want to learn more about how to build this workflow, check out this video here:

Conclusion:

Looking at our output, we can now compare the results from each of our models. While our Generalized Linear Model performed below average and our XGBoost model may not be the right choice for this application, our Gradient Boosting Machine has given us extremely high accuracy.


Now, if we want, we can move forward with our Gradient Boosting Machine as is, or improve its performance using hyper-parameter tuning. Additionally, thanks to the flexibility of the workflow we built, we can go back and either fix issues with our other models or replace them with new models we want to test.
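If we chose to tune the Gradient Boosting Machine in code, one possible approach with the H2O Python API is a grid search over a few hyper-parameters, continuing the earlier sketch; the grid values below are illustrative only.

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators import H2OGradientBoostingEstimator

# Illustrative hyper-parameter grid for the GBM.
hyper_params = {"ntrees": [50, 100, 200],
                "max_depth": [3, 5, 7],
                "learn_rate": [0.05, 0.1]}

grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                     hyper_params=hyper_params)
grid.train(x=features, y=target,
           training_frame=train_hf, validation_frame=test_hf)

# Pick the model with the lowest validation RMSE.
best_gbm = grid.get_grid(sort_by="rmse", decreasing=False).models[0]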


References:

Learn More About H2O Models in Sparkflows: H2O — Sparkflows 3.0 documentation

Learn from the experts: Videos | Sparkflows

Try Sparkflows Yourself: Download | Sparkflows


