Workflow Automation Templates

A library of ready-to-use workflow templates to accelerate your data journey

H2O Pyspark GLM

Train & validate models with H2O GLM

Overview

This workflow builds and evaluates a predictive model using the H2O Generalized Linear Model (GLM) algorithm within a PySpark environment. It splits the dataset into training and validation sets, so that model performance metrics such as ROC curves and confusion matrices are computed on data the model did not see during training.

Details

The workflow starts by importing the diabetes dataset, which is then processed and split into training and validation subsets by the Split node. Approximately 80% of the rows are used for model training, and the remaining rows (about 20%) are reserved for validation.
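The split step can be illustrated with a minimal sketch in plain Python. It is not the Split node's implementation: the function name `random_split`, the seed, and the integer stand-in rows are assumptions made for illustration. Each row is assigned to the training set with probability 0.8, which is why the resulting sizes are only approximately 80/20. PySpark's `DataFrame.randomSplit([0.8, 0.2], seed=...)` produces similarly approximate proportions.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or validation independently (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, valid = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            valid.append(row)
    return train, valid


# Stand-in data: 1000 integer "rows" instead of the real diabetes dataset.
rows = list(range(1000))
train_rows, valid_rows = random_split(rows)
print(len(train_rows), len(valid_rows))
```

Because the assignment is random per row, the two subsets never overlap and together always cover the full dataset, but the exact counts vary with the seed.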

The H2O Generalized Linear Models node fits the model on the training data. Because GLM supports a range of response distributions and link functions, the same node can be configured for different kinds of target variable. The H2O Score node then applies the trained model to the validation set and produces performance metrics that measure predictive accuracy.
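To make the train-then-score idea concrete, here is a small self-contained sketch of a binomial GLM (logit link) in plain Python. It does not use H2O and is not the solver H2O uses: the fit is done by simple batch gradient descent, the data is synthetic rather than the diabetes dataset, and every name in it (`fit_binomial_glm`, `predict_proba`, `confusion_matrix`, `roc_auc`) is hypothetical. It only shows how coefficients learned on training rows yield a confusion matrix and an area under the ROC curve on held-out rows.

```python
import math
import random


def sigmoid(z):
    # Inverse of the logit link; clamp z to avoid overflow in exp().
    z = max(-35.0, min(35.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, X):
    # w[0] is the intercept, w[1:] are the feature coefficients.
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


def fit_binomial_glm(X, y, lr=0.5, epochs=500):
    """Fit a logistic GLM by batch gradient descent on the mean log-loss."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi, pi in zip(X, y, predict_proba(w, X)):
            err = pi - yi  # gradient of the log-loss w.r.t. the linear predictor
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def confusion_matrix(y_true, probs, threshold=0.5):
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for yt, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1:
            cm["tp" if yt == 1 else "fp"] += 1
        else:
            cm["fn" if yt == 1 else "tn"] += 1
    return cm


def roc_auc(y_true, probs):
    # AUC as the chance that a random positive outscores a random negative;
    # ties count as half a win.
    pos = [p for yt, p in zip(y_true, probs) if yt == 1]
    neg = [p for yt, p in zip(y_true, probs) if yt == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data (an assumption): one feature, label drawn from a logistic model.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0)] for _ in range(500)]
y = [1 if rng.random() < sigmoid(3.0 * xi[0]) else 0 for xi in X]

# First 400 rows for training, last 100 held out for validation.
X_train, y_train, X_valid, y_valid = X[:400], y[:400], X[400:], y[400:]

w = fit_binomial_glm(X_train, y_train)
valid_probs = predict_proba(w, X_valid)
cm = confusion_matrix(y_valid, valid_probs)
auc = roc_auc(y_valid, valid_probs)
print("coefficients:", w)
print("confusion matrix:", cm)
print("validation AUC:", auc)
```

The metrics are computed only on the validation rows, which mirrors the role of the scoring step: it measures how the fitted model behaves on data that played no part in estimating the coefficients.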

Finally, the Print N Rows node displays the output, allowing users to review predictions and metrics interactively. This workflow is well suited to regression or classification tasks that call for explainable, statistically grounded modeling with integrated validation.
