Workflow Automation Templates

A library of ready-to-use workflow templates to accelerate your data journey

SparkML Clustering - KMeans

Cluster data using K-Means

Overview

This workflow performs clustering on the California Housing dataset using the Spark ML K-Means algorithm. It groups data points with similar characteristics, helping uncover patterns and segments within the housing data.

Details

The workflow starts by importing the housing dataset, followed by a cleaning step that uses the Drop Rows With Null node to remove incomplete records. The Vector Assembler node then combines all numeric features into a single vector column suitable for machine learning algorithms.
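Under the hood, these nodes correspond to standard Spark ML operations. A minimal PySpark sketch of the loading, cleaning, and assembly steps is shown below; the file name, schema inference, column selection, and the "features" output column name are illustrative assumptions, not template defaults.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("housing-kmeans").getOrCreate()

# Import the housing dataset (hypothetical file name and format).
housing = spark.read.csv("california_housing.csv", header=True, inferSchema=True)

# Rough equivalent of the Drop Rows With Null node: discard incomplete records.
housing_clean = housing.dropna()

# Rough equivalent of the Vector Assembler node: combine numeric columns
# into a single vector column for the clustering algorithm.
numeric_cols = [c for c, t in housing_clean.dtypes if t in ("int", "double")]
assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features")
assembled = assembler.transform(housing_clean)
```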

Next, the K-Means node trains a clustering model that groups data into distinct clusters based on housing attributes. The trained model is applied to the dataset using the Spark Predict node, assigning each record to its corresponding cluster.
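In plain Spark ML terms, this pair of nodes maps to fitting a KMeans estimator and applying the resulting model. The sketch below assumes the assembled DataFrame from the previous step; the number of clusters (k=5) and the "cluster" output column name are illustrative choices.

```python
from pyspark.ml.clustering import KMeans

# Train the K-Means clustering model on the assembled feature vectors.
kmeans = KMeans(featuresCol="features", predictionCol="cluster", k=5, seed=42)
model = kmeans.fit(assembled)

# Rough equivalent of the Spark Predict node: assign each record to a cluster.
clustered = model.transform(assembled)
```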

Finally, the Drop Columns and Print N Rows nodes refine and display the output for analysis. This workflow enables quick and scalable clustering for data exploration, customer segmentation, or pattern detection using Spark ML.
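The final refinement and display steps have similarly direct equivalents in PySpark; the column dropped and the row count shown here are assumptions for illustration.

```python
# Rough equivalent of the Drop Columns node: remove the intermediate vector column.
result = clustered.drop("features")

# Rough equivalent of the Print N Rows node: display the first rows for inspection.
result.show(10)
```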