Recommendation Workflow using Sparkflows

Sparkflows
Oct 28, 2022

Business Problem:

As marketplaces and social media have evolved, companies have needed to figure out how to efficiently deliver the right content to the right users. Businesses need a way to use input from users to shape what content is shown to others. A system that does this is known as a recommendation system.

Recommendation systems improve the user experience by minimising the time users spend searching: what they are looking for is made readily available. For businesses, this means more traffic and a more efficient use of bandwidth, which reduces costs while increasing revenue. With all of these benefits, recommendation systems are considered vital to any online business.

Solution Proposal:

Recommendation systems these days vary wildly in complexity. As with all data-related problems, there are important tradeoffs between complexity and speed that need to be considered. With a simple algorithm, we can continually provide updated recommendations to every user without incurring speed and cost penalties. Simplifying further, we can move away from computationally expensive machine learning models and explore rule-based approaches.

In the current environment, it may seem counterintuitive not to take advantage of machine learning models, but simpler rule-based approaches have potentially massive advantages. Beyond speed and cost, it may also be important for users to understand how rankings and recommendations are produced.

A prime example of this is the site IMDB. IMDB is a review site for movies and shows, and is one of the most trusted and, by traffic volume, one of the largest. By implementing a rule-based approach, it is able to provide a near-infinite number of rankings, lists, and comparisons based on millions of reviews for thousands of shows and movies. Users also lean on IMDB’s rankings and lists to discover new shows and movies that they might like, so it is important that they understand how rankings are generated.

As an alternative to the simpler rule-based models, machine learning models are a great way to incorporate more data and increase performance. For this we will be using the Alternating Least Squares (ALS) algorithm. ALS factorizes the matrix of user ratings into latent factors for users and for products. These factors capture the commonalities between users, and they are used to predict how a user would rate a product they have not used.

By predicting how the user will rate a product, we can then build a list of products to recommend. An additional advantage of using a more complex model is that we can build additional features using this information. A major advantage of machine learning models is that they customize recommendations for each user. This makes recommendations more accurate for a wider range of users. However, this also means that the model will need to be applied for each user separately rather than generating a master list that can be easily propagated to every user. The performance advantages are often worth it for larger companies.

Implementation:

Rule Based Approach:

To use the IMDB formula we will first need to calculate a few statistics from our data. Specifically, we will need the number of ratings for each product (V), a set minimum number of ratings required for inclusion (M), the average rating for each product (R), and the average rating across all products in the same category (C).

The purpose of the variable M is to filter out products that do not have enough ratings to produce reliable results. For our purposes it will be set to 1.

To start with, we will load in the dataset, in this case a CSV, and do a little bit of simple data cleaning. Depending on the way data collection is conducted, there may be values which we want to remove. In this case a rating of 0 is not a valid rating, so we use the Filter By Number Range node to remove those rows.

Next we need to calculate the aggregate statistics required to apply the formula. All the necessary statistics are listed above. In this step the values for V and R will be calculated. To do this, the Group By node will be set to group by each product's ISBN (the products in this dataset are books). Also in this step we will calculate the value for C using the Summary Statistics node.
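The filter and aggregation steps above can be sketched in plain Python. This is only an illustration of the logic, not the Sparkflows nodes themselves: the sample rows and column order are hypothetical, and C is taken here as the mean of all valid ratings, which is an assumption about how the Summary Statistics value is used.

```python
from collections import defaultdict

# Hypothetical sample of (user_id, isbn, rating) rows; a rating of 0 is invalid.
ratings = [
    (1, "0001", 8),
    (2, "0001", 6),
    (3, "0002", 0),
    (1, "0002", 9),
    (2, "0003", 0),
]

# Filter step: drop rows whose rating is 0.
valid = [row for row in ratings if row[2] != 0]

# Group-by step: collect the remaining ratings per ISBN.
by_isbn = defaultdict(list)
for _user, isbn, rating in valid:
    by_isbn[isbn].append(rating)

# V = number of ratings per book, R = average rating per book.
stats = {
    isbn: {"V": len(values), "R": sum(values) / len(values)}
    for isbn, values in by_isbn.items()
}

# C = average rating across all valid rows (assumed definition for this sketch).
C = sum(rating for _, _, rating in valid) / len(valid)

for isbn, s in sorted(stats.items()):
    print(isbn, s["V"], round(s["R"], 2))
print("C =", round(C, 3))
```

Note that a book whose only ratings were 0 drops out of the result entirely, so it will never be ranked.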

Now that we have all our statistics calculated, the Columns Rename node will be used to make the column names line up with our formula, for organization and readability. Finally, the Math Expression node will be used to apply the formula and provide the predicted ratings. The general form of the formula is (V/(V+M) * R) + (M/(V+M) * C). The formula applied here is (V/(V+1) * R) + (1/(V+1) * 7.601). You will notice that the variables M and C are missing. For simplicity, we have substituted the value 1 for M, and manually input the value for C (7.601) that we calculated in the previous step.
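To make the effect of the formula concrete, here is a minimal Python sketch of the weighted rating. The function name and its defaults are illustrative choices for this example; the defaults M = 1 and C = 7.601 are the values used in this article.

```python
def weighted_rating(v, r, m=1, c=7.601):
    """Weighted rating: (v / (v + m)) * r + (m / (v + m)) * c.

    v: number of ratings for the product
    r: average rating for the product
    m: minimum number of ratings (1 in this article)
    c: average rating across all products (7.601 in this article)
    """
    return (v / (v + m)) * r + (m / (v + m)) * c


# Two books with the same average rating but different numbers of ratings.
few = weighted_rating(v=1, r=10.0)    # pulled halfway towards c, about 8.80
many = weighted_rating(v=99, r=10.0)  # stays close to its own average, about 9.98

print(round(few, 3), round(many, 3))
```

A product with few ratings is pulled towards the overall average C, while a product with many ratings keeps a score close to its own average R. A larger M would strengthen that pull.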

The last step in this workflow is to rank the products for recommendation using the Sort By node and to display the recommendations. The Print N Rows node is a simple way to display the recommendations as a list, and the Graph Values node is used to build a chart of the recommendations for a more visual display.
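The ranking step amounts to a descending sort followed by taking the first N rows. A small Python sketch, using hypothetical ISBNs and scores and a helper name (top_n) made up for this example:

```python
# Hypothetical (isbn, predicted_rating) pairs produced by the formula step.
scored = [("0001", 7.2), ("0002", 8.3), ("0003", 6.9), ("0004", 8.9)]


def top_n(pairs, n):
    """Return the n highest-scoring pairs, best first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


recommendations = top_n(scored, n=3)
for rank, (isbn, score) in enumerate(recommendations, start=1):
    print(f"{rank}. {isbn}  {score:.2f}")
```

Because the same list is shown to every user, it only has to be computed once each time the ratings are refreshed.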

ALS Modeling Approach:

One of the major advantages of using a machine-learning-focused platform such as Sparkflows is that workflows can be greatly simplified compared to writing code in a language such as Python. In many ways, the workflow for building an ALS model is even simpler than the one for the rule-based algorithm.

As with all workflows, the first step is to load in the dataset and prepare it for modeling and other tasks. Since later in the workflow we will be using Spark ML to create the ALS model, we also need to make sure that all features are numeric. In this case the ISBN is a string column, so we use the String Indexer node to convert it to a numeric column. The last data preparation step is to split the dataset into training and testing datasets, so that the model is evaluated on data it was not trained on.
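The two preparation steps can be illustrated in plain Python. This is a rough sketch of the idea rather than the Sparkflows nodes: the helper names are hypothetical, the indexing mimics the frequency-based ordering that Spark ML's StringIndexer uses by default (most frequent label gets index 0), and the 80/20 split ratio and seed are assumptions.

```python
import random
from collections import Counter

# Hypothetical ISBN column, one entry per rating row.
isbns = ["0002", "0001", "0002", "0003", "0002", "0001"]


def fit_string_index(values):
    """Map each distinct string to an integer, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: position for position, label in enumerate(ordered)}


index = fit_string_index(isbns)
indexed = [index[value] for value in isbns]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(zip(isbns, indexed))
train, test = train_test_split(rows)
print(index)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed keeps the split reproducible, which makes it easier to compare model runs.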

The next step is to create the model that will provide recommendations and use it to predict ratings. In Sparkflows this is made simple by the ALS node. ALS model configuration is a little different to that of more common models such as logistic or linear regression. To build an ALS model, the node needs a User, Item, and Rating column to build the recommendation matrix.
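To give a feel for what happens with the User, Item, and Rating columns, here is a toy, pure-Python sketch of alternating least squares with a single latent factor per user and per item. It is a simplified illustration only, not the Spark ML implementation behind the ALS node: the data, function names, regularization value, and iteration count are all assumptions.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples; user 0 has not rated item 2.
ratings = [
    (0, 0, 8.0), (0, 1, 6.0),
    (1, 0, 9.0), (1, 2, 7.0),
    (2, 1, 5.0), (2, 2, 6.0),
]


def objective(ratings, user_f, item_f, reg):
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    error = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(x * x for x in user_f.values())
                     + sum(y * y for y in item_f.values()))
    return error + penalty


def als_rank1(ratings, reg=0.1, iterations=10):
    """Alternate between solving for user factors and item factors."""
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    user_f = {u: 1.0 for u in by_user}
    item_f = {i: 1.0 for i in by_item}
    for _ in range(iterations):
        # Hold item factors fixed; each user factor is a 1-D ridge regression.
        for u, obs in by_user.items():
            user_f[u] = (sum(item_f[i] * r for i, r in obs)
                         / (sum(item_f[i] ** 2 for i, _ in obs) + reg))
        # Hold user factors fixed; solve each item factor the same way.
        for i, obs in by_item.items():
            item_f[i] = (sum(user_f[u] * r for u, r in obs)
                         / (sum(user_f[u] ** 2 for u, _ in obs) + reg))
    return user_f, item_f


user_f, item_f = als_rank1(ratings)
prediction = user_f[0] * item_f[2]  # predicted rating for an unseen pair
print(round(prediction, 2))
```

Each half-step has a closed-form solution, which is why the algorithm alternates between the two sets of factors instead of optimizing both at once. Real implementations use more than one factor per user and item.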

Now that the recommendation model has been created and has produced its recommendations, the final step in the workflow is to display the results and test the performance of the model. Before performance testing can be conducted, we need to clean up the output of the model. This is the purpose of the Row Filter node, which removes all the NaN values produced by the model. The Regression Evaluator and Graph Values nodes are used to assess the results of the model quantitatively and visually, while the Summary Statistics and Print N Rows nodes allow for exploration of the results.
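The clean-up and evaluation steps can be sketched as follows. In Spark ML, ALS by default returns NaN for users or items in the test set that did not appear in the training data, which is why the filter is needed before scoring. The sample pairs are hypothetical, and root mean squared error (RMSE) is used here as one common regression metric; the metric actually configured in the workflow is not stated in this article.

```python
import math

# Hypothetical (actual_rating, predicted_rating) pairs from the test set.
pairs = [(8.0, 7.0), (6.0, float("nan")), (9.0, 9.0), (5.0, 7.0)]

# Row filter step: drop rows where the model could not produce a prediction.
scored = [(actual, predicted) for actual, predicted in pairs
          if not math.isnan(predicted)]


def rmse(rows):
    """Root mean squared error between actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in rows) / len(rows))


error = rmse(scored)
print(len(scored), "rows scored, RMSE =", round(error, 3))
```

Without the filter, a single NaN prediction would turn the whole metric into NaN.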

Conclusion:

Looking at the output from each of our workflows, the results of the two options look very similar. On closer inspection, however, there are important differences in both the methodologies and the output. While the simple rule-based model is easier to run and can be re-computed at any time, it is less accurate and not personalized for each user. The machine-learning-based ALS model is personalized, but will not scale as easily.

With these trade-offs, there is no definitive conclusion, and the nature of your business and products will determine which is right for you. The key takeaway from this exercise is that, whichever option you choose, both are simple to implement and can be built quickly, especially with new low-code tools such as Sparkflows. Hopefully this article gave you more insight into the options available to you today in the essential space of recommendation engines.