
Workflow Automation Templates

A library of ready-to-use workflow templates to accelerate your data journey

StringIndexer and OneHotEncoder

Convert categorical data into numeric form

Overview

This workflow demonstrates how to encode categorical variables for machine learning by using StringIndexer and OneHotEncoder in Spark. It transforms text-based or discrete categorical features into numerical representations suitable for modeling.

Details

The housing dataset is loaded first. The StringIndexer node then maps the values of each categorical column, such as the number of bedrooms or bathrooms, to numeric indices. These indices are passed to the OneHotEncoder node, which turns each index into a binary vector so that models do not read the index values as an ordering or magnitude among categories.
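The two encoding steps described above can be sketched in plain Python. This is an illustration of the idea, not the workflow's implementation and not Spark code: the function names `fit_string_indexer` and `one_hot` and the sample `bedrooms` values are hypothetical. The sketch assumes the default behavior of Spark's `StringIndexer` (most frequent label gets index 0) and `OneHotEncoder` (the last category is dropped and encoded as all zeros).

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically, which is assumed here to mirror the
    default frequency-descending ordering of Spark's StringIndexer.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list with at most one 1, at position `index`.

    With drop_last=True the last category is encoded as all zeros, so the
    vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    return [1 if position == index else 0 for position in range(size)]


# Hypothetical sample column; the actual housing data is not shown here.
bedrooms = ["3", "2", "3", "4", "3", "2"]

mapping = fit_string_indexer(bedrooms)
indexed = [mapping[value] for value in bedrooms]
encoded = [one_hot(i, len(mapping)) for i in indexed]

# Print the raw value, its index, and its one-hot vector for each row.
for raw, idx, vec in zip(bedrooms, indexed, encoded):
    print(raw, idx, vec)
```

In Spark itself the indices are stored as doubles and the one-hot output is a sparse vector column, so the printed rows look different even though the mapping follows the same logic.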

The Print N Rows nodes display the encoded outputs so that the indexed and one-hot encoded data can be compared. This preprocessing step makes the data usable by algorithms that require numerical input and can improve model accuracy.
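One way to see why the two representations behave differently in a model is to compare distances between categories. The sketch below is a plain-Python illustration with hypothetical category labels. It assumes full-width one-hot vectors (no dropped last category), which is a simplification chosen to make the distances easy to read.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical categories, first as plain indices, then as one-hot vectors.
index_form = {"3": [0.0], "2": [1.0], "4": [2.0]}
one_hot_form = {"3": [1, 0, 0], "2": [0, 1, 0], "4": [0, 0, 1]}

pairs = list(combinations(index_form, 2))
index_distances = {pair: euclidean(index_form[pair[0]], index_form[pair[1]]) for pair in pairs}
one_hot_distances = {pair: euclidean(one_hot_form[pair[0]], one_hot_form[pair[1]]) for pair in pairs}

# Indices make some categories look closer than others;
# one-hot vectors place every pair of categories the same distance apart.
for pair in pairs:
    print(pair, index_distances[pair], round(one_hot_distances[pair], 3))
```

With plain indices, the first and third categories appear twice as far apart as neighboring ones, purely because of the order in which indices were assigned. The one-hot form removes that artifact.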
