Workflow Automation Templates

A library of ready-to-use workflow templates to accelerate your data journey

Dimensionality Reduction using PCA

Reduce dimensions while preserving variance

Overview

This workflow performs Principal Component Analysis (PCA) to simplify a dataset by transforming correlated features into a smaller set of uncorrelated principal components. The components are ordered by the variance they explain, so keeping the first few retains as much of the data's variance as possible for efficient modeling and visualization.

Details

The process starts with loading the housing dataset and cleaning it by flagging and removing price outliers. Next, categorical variables are converted into numerical form using String Indexer and One Hot Encoder.
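
The cleaning and encoding steps can be sketched in plain Python to show what each node conceptually does. This is an illustrative sketch, not the platform's implementation: the column names (`price`, `area`, `furnishing`), the sample rows, and the 1.5 x IQR outlier rule are all assumptions, since the template does not state which rule it uses to flag outliers. The indexing order (most frequent label first) is likewise an assumed convention.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical housing rows; the column names and values are assumptions.
rows = [
    {"price": 210_000, "area": 1400, "furnishing": "furnished"},
    {"price": 225_000, "area": 1500, "furnishing": "semi"},
    {"price": 240_000, "area": 1650, "furnishing": "furnished"},
    {"price": 255_000, "area": 1700, "furnishing": "unfurnished"},
    {"price": 270_000, "area": 1850, "furnishing": "furnished"},
    {"price": 285_000, "area": 1900, "furnishing": "semi"},
    {"price": 300_000, "area": 2000, "furnishing": "unfurnished"},
    {"price": 2_900_000, "area": 1950, "furnishing": "semi"},  # extreme price
]


def flag_price_outliers(records, k=1.5):
    """Add an is_outlier flag using the (assumed) k * IQR rule on price."""
    prices = [r["price"] for r in records]
    q1, _, q3 = quantiles(prices, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [dict(r, is_outlier=not (low <= r["price"] <= high)) for r in records]


def string_index(values):
    """Map each label to an integer, most frequent label first (ties A-Z)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Turn a label index into a vector with a single 1.0 entry."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Flag first, then remove, mirroring the two cleaning steps described above.
flagged = flag_price_outliers(rows)
clean = [r for r in flagged if not r["is_outlier"]]

# Index the categorical column, then one-hot encode each row's index.
mapping = string_index(r["furnishing"] for r in clean)
encoded = [one_hot(mapping[r["furnishing"]], len(mapping)) for r in clean]

print(len(rows) - len(clean), "outlier row(s) removed")
print(mapping)
```

Keeping the flag and the removal as separate steps makes it possible to inspect the flagged rows before they are dropped.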

The Vector Assembler node combines all features into a single vector, preparing the data for PCA. The PCA node then identifies the top components that explain most of the data variance. The transformed output is displayed using Print N Rows for review.
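
The assemble-then-project idea can be illustrated with a small standard-library sketch. It is a simplified stand-in for the nodes named above, not their actual implementation: the feature names and values are hypothetical, and the leading components are found with power iteration plus deflation, chosen here only to avoid external dependencies. Features are centered but not rescaled, so columns with larger units will dominate the result; standardizing them first is a common choice.

```python
import math

# Hypothetical cleaned records (area in thousands of square feet).
records = [
    {"area": 1.40, "bedrooms": 3, "bathrooms": 1},
    {"area": 1.50, "bedrooms": 3, "bathrooms": 2},
    {"area": 1.65, "bedrooms": 4, "bathrooms": 2},
    {"area": 1.70, "bedrooms": 4, "bathrooms": 2},
    {"area": 1.85, "bedrooms": 4, "bathrooms": 3},
    {"area": 1.90, "bedrooms": 5, "bathrooms": 3},
    {"area": 2.00, "bedrooms": 5, "bathrooms": 3},
]


def assemble(record, columns):
    """Concatenate the chosen columns into one flat feature vector."""
    return [float(record[c]) for c in columns]


def center_and_covariance(X):
    """Return the mean-centered rows and their sample covariance matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov


def top_component(cov, iters=500):
    """Approximate the dominant eigenpair of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cov_v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * cov_v[a] for a in range(d))
    return eigenvalue, v


def pca(X, k):
    """Project X onto its first k principal components."""
    centered, cov = center_and_covariance(X)
    d = len(cov)
    total_variance = sum(cov[j][j] for j in range(d))
    components, ratios = [], []
    for _ in range(k):
        lam, v = top_component(cov)
        components.append(v)
        ratios.append(lam / total_variance)
        # Deflation: remove the found direction so the next pass finds the next one.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, ratios


X = [assemble(r, ["area", "bedrooms", "bathrooms"]) for r in records]
projected, ratios = pca(X, k=2)

print("explained variance ratios:", ratios)
for row in projected[:3]:  # preview the first few transformed rows
    print(row)
```

The explained variance ratios indicate how much of the total variance each retained component carries, which is one way to decide how many components to keep.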

By reducing the number of features, this workflow lowers dataset complexity, can speed up model training, and may help reduce overfitting, while retaining most of the information in the original dataset.
