Forum

Ignite Discussions : Ask Questions, Find Answers, Share Expertise about Sparkflows

Jun 22, 2023

How to find the optimal number of clusters with K-Means clustering in Sparkflows?

I am new to cluster analysis. In K-means cluster analysis you have to provide a value for k, which is the number of clusters. I don't know how to find the optimal number of clusters for my dataset.

I would appreciate some help.

Comments (1)

Lakshay

Jun 22, 2023

Hey Chris,

Estimating the optimal number of clusters in a dataset is a common challenge in cluster analysis. There are several methods and techniques that can help determine the appropriate number of clusters. Here are a few commonly used approaches:

Elbow Method: The Elbow Method involves plotting the number of clusters against the corresponding within-cluster sum of squares (WCSS) or another measure of cluster compactness. WCSS generally keeps decreasing as the number of clusters increases, so you cannot simply pick the k with the lowest WCSS. Instead, select the number of clusters at the "elbow" of the plot, the point after which adding another cluster yields only a small further reduction in WCSS.
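As a rough illustration of the Elbow Method outside Sparkflows, here is a minimal pure-Python sketch. The helper names (`kmeans_wcss`, `farthest_first_init`, `elbow_k`), the toy data, and the deterministic seeding are all assumptions made for this example, and the "largest bend" rule is just one simple way to locate the elbow. It is not what Sparkflows or H2O runs internally.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from the centroids
    chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_k(wcss_by_k):
    """Pick the k where the WCSS curve bends most sharply, i.e. where the
    drop before k exceeds the drop after k by the largest margin."""
    ks = sorted(wcss_by_k)

    def bend(i):
        drop_before = wcss_by_k[ks[i - 1]] - wcss_by_k[ks[i]]
        drop_after = wcss_by_k[ks[i]] - wcss_by_k[ks[i + 1]]
        return drop_before - drop_after

    return ks[max(range(1, len(ks) - 1), key=bend)]


# Toy data: three tight 3x3 grids of points around three centers.
offsets = (-0.5, 0.0, 0.5)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers
          for dx in offsets for dy in offsets]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 7)}
best_k = elbow_k(wcss)
for k in sorted(wcss):
    print(k, round(wcss[k], 2))
print("elbow at k =", best_k)
```

On this toy data the WCSS falls from 1209 at k=1 to 459 at k=2 and 9 at k=3. Further splits can only shave a little off the remaining 9, so the heuristic picks k=3.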

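Silhouette Score: The Silhouette score measures the quality of clustering by assessing both the compactness of clusters and the separation between different clusters. For each point it compares the average distance to the other points in its own cluster with the average distance to the points in the nearest other cluster. It ranges from -1 to +1, with higher values indicating better clustering. A common choice is the number of clusters that gives the highest average Silhouette score.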
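To make the Silhouette score concrete, the following sketch computes it from scratch for a set of points and their cluster labels. The function name `silhouette_score`, the toy data, and the two hand-written labelings are assumptions for illustration only. In practice you would compute the score for the labels produced by K-Means at each candidate k.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette over all points: (b - a) / max(a, b), where a is the
    mean distance to the point's own cluster and b is the mean distance to
    the nearest other cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # dist(p, p) is 0, so dividing by len(own) - 1 excludes p itself.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: three tight 3x3 grids of points around three centers.
offsets = (-0.5, 0.0, 0.5)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers
          for dx in offsets for dy in offsets]

# Hypothetical labelings: one cluster per group (k=3), and a k=2 labeling
# that merges the first and third groups.
labels_k3 = [i for i in range(3) for _ in range(9)]
labels_k2 = [0 if i in (0, 2) else 1 for i in labels_k3]

sil_k3 = silhouette_score(points, labels_k3)
sil_k2 = silhouette_score(points, labels_k2)
print("k=3:", round(sil_k3, 3), "k=2:", round(sil_k2, 3))
```

The labeling that matches the three groups scores higher than the merged one, which is the signal you would look for when comparing candidate values of k.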

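Davies-Bouldin Index (DBI): The DBI evaluates clustering quality by comparing how spread out each cluster is with how far apart the cluster centroids are. For every cluster it takes the worst ratio of combined within-cluster scatter to centroid distance against any other cluster, then averages these ratios over all clusters. Lower values of the DBI indicate better clustering, so you would pick the number of clusters with the lowest DBI.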
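The DBI can be sketched in a few lines as well. As before, the function name `davies_bouldin`, the toy data, and the hand-written labelings are assumptions for illustration. The sketch also assumes that no two clusters share the same centroid, since that would make the ratio undefined.

```python
from math import dist


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (scatter_i + scatter_j) /
    centroid_distance(i, j) ratio against any other cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the DBI needs at least two clusters")

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        lab: tuple(sum(axis) / len(members) for axis in zip(*members))
        for lab, members in clusters.items()
    }
    # Scatter: mean distance from a cluster's points to its centroid.
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


# Toy data: three tight 3x3 grids of points around three centers.
offsets = (-0.5, 0.0, 0.5)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers
          for dx in offsets for dy in offsets]

# Hypothetical labelings: one cluster per group (k=3), and a k=2 labeling
# that merges the first and third groups.
labels_k3 = [i for i in range(3) for _ in range(9)]
labels_k2 = [0 if i in (0, 2) else 1 for i in labels_k3]

dbi_k3 = davies_bouldin(points, labels_k3)
dbi_k2 = davies_bouldin(points, labels_k2)
print("k=3:", round(dbi_k3, 4), "k=2:", round(dbi_k2, 4))
```

Here the k=3 labeling gets the lower DBI, so it would be preferred over the merged k=2 labeling.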

And there are many more methods.

In Sparkflows, the “H2O K-Means node” has an option to estimate the number of clusters for a given range.

H2O uses proportional reduction in error (PRE) to determine when to stop splitting. The PRE value is calculated from the within-cluster sum of squares (WCSS): PRE = (WCSS before the split - WCSS after the split) / WCSS before the split. Increasing the number of clusters usually reduces the WCSS, but once the relative reduction becomes negligible (the PRE falls below a threshold), H2O stops adding clusters.
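The stopping idea can be sketched as follows. This is only an illustration of a PRE-style rule, not H2O's implementation: the function name `estimate_k_by_pre`, the fixed `threshold` of 0.1, and the WCSS values are all made up for the example, and H2O chooses its own threshold internally.

```python
def estimate_k_by_pre(wcss_by_k, threshold=0.1):
    """Walk through increasing k and stop at the first split whose
    proportional reduction in error (PRE) falls below the threshold.

    PRE = (WCSS before split - WCSS after split) / WCSS before split
    The threshold here is a hypothetical constant chosen for the example.
    """
    ks = sorted(wcss_by_k)
    chosen = ks[0]
    for before, after in zip(ks, ks[1:]):
        wcss_before = wcss_by_k[before]
        if wcss_before == 0:
            break  # nothing left to reduce
        pre = (wcss_before - wcss_by_k[after]) / wcss_before
        if pre < threshold:
            break  # the extra cluster barely helps, so keep the current k
        chosen = after
    return chosen


# Hypothetical WCSS values for k = 1..5 (not measured on any real dataset).
example_wcss = {1: 1000.0, 2: 400.0, 3: 100.0, 4: 95.0, 5: 92.0}
estimated_k = estimate_k_by_pre(example_wcss)
print("estimated k =", estimated_k)
```

With these made-up values the PRE is 0.6 going from 1 to 2 clusters, 0.75 going from 2 to 3, and only 0.05 going from 3 to 4, so the sketch stops at k=3.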

The following image shows how to enable automatic estimation of the number of clusters.

The workflow is shown below:
