top of page

Data Exploration of Housing Data

Updated: Oct 6, 2021

This workflow shows how to explore the Housing Dataset from Kaggle with Sparkflows.


The below workflow:

  • Reads the Housing dataset

  • Calculates summary statistics for important variables

  • Creates a histogram to show the distribution of the Sale Price variable

  • Creates a graph to show the relationship between Sale Price and Basement Square Footage

  • Creates a matrix to show the correlation between important variables

  • Flags outliers in Ground Living Area and graphs the results

Reading Housing Dataset

DatasetStructured Processor creates a Dataframe of your dataset named Housing Training by reading data from HDFS, HIVE etc. which have been defined earlier in Fire by using the Dataset feature.

Processor Output

Calculate Summary Statistics

Summary Statistics Processor calculates summary statistics for the selected variables.

Processor Configuration

Processor Output

Create Histogram Graph

HistoGram Processor creates a histogram to show distribution by count of Sale Price.

Processor Configuration

Processor Output

Graph Values

Graph Values Processor graphs the relationship between Sale Price and Basement Square Footage.

Processor Configuration

Processor Output

Plot Correlation Matrix

Correlation Processor creates a correlation matrix of selected variables and plots the results.

Processor Configuration

Processor Output

Flag Outliers and Create Graph

Flag Outlier Processor creates a new flag column to mark outliers and Graph Group by Column Processor graphs the count in each category.

Processor Configuration

Processor Output

104 views0 comments


bottom of page