Clustering via KMeans (Housing Data Set)

In this solution, we want to highlight how easy it is to implement clustering via the Sparkflows visual designer. As Wikipedia highlights, "k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells."
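At its core, the algorithm (Lloyd's algorithm) alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster. A minimal plain-Python sketch, independent of Sparkflows (the function name and the naive "first k points" initialization are our own choices for illustration):

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    # naive init: first k points (real implementations use random or k-means++)
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids

# two obvious groups in 2-D
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers = kmeans(pts, k=2)
```

After a few iterations the two centroids settle on the means of the two groups, which is exactly what the KMeans node in the workflow below computes at scale.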


The dataset in question is as follows:

  • It is a sample of a housing dataset.

  • It can be found at data/housing.csv under the Sparkflows installation home. It is small enough to illustrate the concept; once the DNA of the flow is established, it can be run against a much larger dataset by "spark-submitting" the workflow.

  • The columns include:

    • id

    • price: sale price of the house

    • lotsize: lot size of the property in square feet

    • bedrooms: number of bedrooms

    • bathrms: number of full bathrooms

    • stories: number of stories excluding basement

    • garagepl: number of garage places

    • driveway: dummy, 1 if the house has a driveway

    • recroom: dummy, 1 if the house has a recreational room

    • fullbase: dummy, 1 if the house has a full finished basement

    • gashw: dummy, 1 if the house uses gas for hot water heating

    • airco: dummy, 1 if there is central air conditioning

    • prefarea: dummy, 1 if located in the preferred neighbourhood of the city

Workflow Overview:

Below is an overview of the workflow we use. This workflow was created simply via the drag-and-drop capabilities of the Sparkflows Designer UI. The ability to construct this data processing pipeline (or any DAG - Directed Acyclic Graph - for that matter) in a WYSIWYG, plug-and-play manner is a key innovation in our community's collective march toward on-demand instant analytics.

This workflow consists of the following steps:

  • Read in the dataset from a tab-separated file.

    • Please refer to the earlier section for details on the dataset.

    • Instead of a local flat file, one can just as easily read it from a data lake or persistence store (HDFS/RDBMS/NoSQL).

  • Apply VectorAssembler to concatenate columns into a feature vector.

  • Perform K-Means clustering to arrive at centroids.

  • (Orthogonally, note steps 1, 4 and 5 - PrintRows, SQL and Graph - which visualize the raw data. This is an optional step, but for those visually inclined it helps to grok the data. In this case we visualize the raw data with the following SQL: "select avg(price) as avgPrice, avg(lotsize) as avgLotSize, bedrooms from fire_temp_table group by bedrooms order by bedrooms asc".)
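The SQL in that visualization step is a plain group-by aggregation; as a sanity check, the same computation can be sketched in plain Python (the sample rows below are illustrative, not real data):

```python
from collections import defaultdict

# illustrative rows in the shape of the housing dataset
rows = [
    {"price": 42000.0, "lotsize": 5850.0, "bedrooms": 2},
    {"price": 38500.0, "lotsize": 4000.0, "bedrooms": 2},
    {"price": 49500.0, "lotsize": 3060.0, "bedrooms": 3},
]

# group by bedrooms, then average price and lotsize per group
groups = defaultdict(list)
for row in rows:
    groups[row["bedrooms"]].append(row)

result = [
    {
        "bedrooms": b,
        "avgPrice": sum(r["price"] for r in g) / len(g),
        "avgLotSize": sum(r["lotsize"] for r in g) / len(g),
    }
    for b, g in sorted(groups.items())  # order by bedrooms asc
]
```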

Workflow Details:

Reading in the DataSet
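In the Designer this step is a configured file-reader node; outside of Sparkflows, the same step amounts to parsing a tab-separated file into rows. A minimal plain-Python sketch (the header and sample values are illustrative assumptions, not the actual file contents):

```python
import csv
import io

# a couple of lines in the shape described earlier (values are illustrative)
SAMPLE_TSV = (
    "id\tprice\tlotsize\tbedrooms\tbathrms\tstories\tgaragepl\tdriveway\n"
    "1\t42000\t5850\t3\t1\t2\t1\t1\n"
    "2\t38500\t4000\t2\t1\t1\t0\t1\n"
)

def read_housing(fileobj):
    """Parse a tab-separated housing file into a list of dicts,
    converting every field to a number."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    return [{k: float(v) for k, v in row.items()} for row in reader]

rows = read_housing(io.StringIO(SAMPLE_TSV))
```

In the workflow itself, of course, Spark parallelizes this read across the cluster.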

Vector Assembler

Select the columns we want to cluster.
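Behind this node, Spark ML's VectorAssembler simply concatenates the selected numeric columns into one feature vector per row. A plain-Python sketch of the same operation (the column subset shown is an illustrative assumption):

```python
def assemble(rows, input_cols):
    """Mimic the effect of Spark ML's VectorAssembler: pick the given
    columns from each row and concatenate them into one feature tuple."""
    return [tuple(row[c] for c in input_cols) for row in rows]

rows = [
    {"price": 42000.0, "lotsize": 5850.0, "bedrooms": 3.0},
    {"price": 38500.0, "lotsize": 4000.0, "bedrooms": 2.0},
]
features = assemble(rows, ["price", "lotsize", "bedrooms"])
```

The resulting feature vectors are what the KMeans node consumes.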

KMeans Clustering

Executing the Workflow

Next we execute the workflow. It produces 5 cluster centroids, since k was set to 5.

Workflow JSON

The visual workflow is represented behind the scenes as JSON, which can, for example, be checked into source control. The workflow can be run interactively from the Sparkflows UI (i.e. locally for testing), or it can easily be scheduled to run on a Spark cluster via spark-submit with any scheduler. For example:

spark-submit --class fire.execute.WorkflowExecuteFromFile --master yarn-client --executor-memory 2G --num-executors 10 --executor-cores 4 core/target/fire-core-1.1.0-jar-with-dependencies.jar --workflow-file data/workflows/

Executing this workflow for yourself:

  • Download Sparkflows here

  • Follow installation instructions here

    • Start the Sparkflows server component and login to the Designer

    • cd <install_dir>/sparkflows-fire-1.2.0

    • ./

    • Point your browser to <machine_name>:8080/index.html

    • Log in with admin/admin.

  • Click on the Workflows tab and search for “KMeans”. Click “Execute” next to the HousingKMeans workflow and view the results. Modify visually as appropriate.


As one can glean from the above example, with the Sparkflows Visual Designer we were able to:

  • Execute clustering and arrive at cluster centers visually via plug-and-play “nodes”.

  • Iterate easily by just changing the KMeans node's parameters visually.

  • More features are on the way to allow comparison of execution runs with different parameters, so one can easily get to the model and parameters that yield the best results for a given dataset, which can then be deployed to production.

So download Sparkflows here and dazzle your colleagues by going from concept to production at virtually the speed of thought.









© 2020 Sparkflows, Inc. All rights reserved.