Workflow Automation Templates
A library of ready-to-use workflow templates to accelerate your data journey
Dimensionality Reduction using PCA
Reduce dimensions while preserving variance

Overview
This workflow performs Principal Component Analysis (PCA) to simplify a dataset by transforming correlated features into a smaller set of uncorrelated principal components, retaining as much of the data's variance as possible for efficient modeling and visualization.
Details
The process starts by loading the housing dataset and cleaning it: price outliers are flagged and then removed. Next, categorical variables are converted into numerical form using the String Indexer and One Hot Encoder nodes.
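The cleaning and encoding steps can be sketched in plain Python as follows. This is a minimal illustration, not the workflow's actual implementation. The column names (`price`, `furnishing`), the sample rows, and the 1.5 × IQR outlier rule are all assumptions, since the template does not state which rule it uses to flag outliers. The helper functions are hypothetical names.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical housing rows; the column names and values are illustrative only.
rows = [
    {"price": 210_000, "furnishing": "furnished"},
    {"price": 250_000, "furnishing": "unfurnished"},
    {"price": 265_000, "furnishing": "semi-furnished"},
    {"price": 240_000, "furnishing": "furnished"},
    {"price": 230_000, "furnishing": "unfurnished"},
    {"price": 5_000_000, "furnishing": "furnished"},  # deliberately extreme
]


def flag_price_outliers(rows, k=1.5):
    """Add an is_outlier flag using the IQR rule (an assumed choice of rule)."""
    q1, _, q3 = quantiles([r["price"] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [dict(r, is_outlier=not (low <= r["price"] <= high)) for r in rows]


def string_index(values):
    """Map each label to an integer; most frequent first, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Step 1: flag the outliers, then drop the flagged rows.
flagged = flag_price_outliers(rows)
clean = [r for r in flagged if not r["is_outlier"]]

# Step 2: index the categorical column, then one-hot encode the index.
index = string_index(r["furnishing"] for r in clean)
encoded = [
    dict(r, furnishing_onehot=one_hot(index[r["furnishing"]], len(index)))
    for r in clean
]
```

Keeping the flag and the filter as separate steps mirrors the "flag, then remove" order described above and makes it easy to inspect which rows were dropped.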
The Vector Assembler node combines all features into a single vector, preparing the data for PCA. The PCA node then identifies the top components that explain most of the data variance. The transformed output is displayed using Print N Rows for review.
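To make the assemble-then-project idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions rather than the PCA node's actual implementation: the column names, sample values, and the `assemble` and `pca` helpers are hypothetical, and PCA is computed here through a singular value decomposition of the mean-centered feature matrix.

```python
import numpy as np

# Hypothetical cleaned and encoded rows; names and values are illustrative only.
rows = [
    {"area": 50.0, "rooms": 2.0, "furnishing_onehot": [1.0, 0.0]},
    {"area": 75.0, "rooms": 3.0, "furnishing_onehot": [0.0, 1.0]},
    {"area": 100.0, "rooms": 4.0, "furnishing_onehot": [1.0, 0.0]},
    {"area": 120.0, "rooms": 4.0, "furnishing_onehot": [0.0, 1.0]},
    {"area": 160.0, "rooms": 6.0, "furnishing_onehot": [1.0, 0.0]},
]


def assemble(rows, columns):
    """Concatenate scalar and list-valued columns into one feature vector per row."""
    return np.array(
        [np.hstack([np.atleast_1d(r[c]) for c in columns]) for r in rows],
        dtype=float,
    )


def pca(X, k):
    """Project X onto its top-k principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                          # rows are principal directions
    explained_ratio = (S**2 / np.sum(S**2))[:k]  # share of variance per component
    return Xc @ components.T, components, explained_ratio


X = assemble(rows, ["area", "rooms", "furnishing_onehot"])
Z, components, explained_ratio = pca(X, k=2)

# Rough equivalent of "Print N Rows": show the first three transformed rows.
print(Z[:3])
print(explained_ratio)
```

In this sketch the features are left unscaled, so a large-valued column such as `area` dominates the first component. Whether to standardize features before PCA is a separate design choice that the template does not address.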
By reducing the number of features, this workflow lowers dataset complexity, which can speed up model training and reduce the risk of overfitting, while retaining most of the information in the original dataset.