Workflow Automation Templates
A library of ready-to-use workflow templates to accelerate your data journey
StringIndexer and OneHotEncoder
Convert categorical data into numeric form

Overview
This workflow demonstrates how to encode categorical variables for machine learning by using StringIndexer and OneHotEncoder in Spark. It transforms text-based or discrete categorical features into numerical representations suitable for modeling.
Details
The housing dataset is loaded first. The StringIndexer node then converts categorical columns, such as the number of bedrooms and bathrooms, into numeric indices. These indices are passed to the OneHotEncoder node, which turns each index into a binary vector so that models treat the categories as unordered rather than reading an ordinal relationship into the index values.
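The two encoding steps can be illustrated with a minimal plain-Python sketch. It is not the workflow's actual implementation and does not call Spark. It only mimics the default behavior of Spark's StringIndexer (most frequent label gets index 0) and OneHotEncoder (the last category is dropped and encoded as an all-zero vector). The function names and the sample bedroom values are hypothetical and chosen for illustration only.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to an index, most frequent first.

    Ties are broken alphabetically. Labels are compared as strings.
    """
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Turn a category index into a dense binary vector.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


# Hypothetical sample of a "bedrooms" column (not taken from the housing dataset).
bedrooms = [3, 2, 3, 4, 3, 2]

mapping = fit_string_indexer(bedrooms)
indexed = [mapping[str(b)] for b in bedrooms]
encoded = [one_hot(i, len(mapping)) for i in indexed]

# Print raw value, index, and one-hot vector side by side.
for raw, idx, vec in zip(bedrooms, indexed, encoded):
    print(raw, idx, vec)
```

In this sketch the value 3 occurs most often and therefore receives index 0.0, while the rarest value, 4, receives the highest index and an all-zero vector. Spark's own encoder emits sparse vectors rather than Python lists, so the printed form in the workflow will look different.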
The Print N Rows nodes display the encoded outputs so that the indexed and one-hot encoded data can be compared side by side. This preprocessing step makes the data usable by algorithms that require numerical input and can improve model accuracy.