Workflow Automation Templates
A library of ready-to-use workflow templates to accelerate your data journey
H2O PySpark GLM
Train & validate models with H2O GLM

Overview
This workflow builds and evaluates a predictive model using the H2O Generalized Linear Model (GLM) algorithm within a PySpark environment. It splits the dataset into training and validation subsets, trains the model on the first, and scores it on the second to produce model performance metrics such as ROC curves and confusion matrices.
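As a rough illustration of the two metrics named above, the sketch below computes a confusion matrix and the points of a ROC curve from true labels and predicted probabilities in plain Python. It is not how the H2O nodes compute them. The function names, the 0.5 threshold, and the toy labels and scores are all made up for the example, and both classes are assumed to be present.

```python
def confusion_matrix(y_true, y_score, threshold=0.5):
    """Count true/false positives and negatives at one probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def roc_points(y_true, y_score):
    """Return (false positive rate, true positive rate) pairs from (0, 0) to (1, 1).

    Assumes y_true contains at least one 0 and at least one 1.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(y_score, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Rows with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy validation labels and predicted probabilities (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]

matrix = confusion_matrix(y_true, y_score)
curve = roc_points(y_true, y_score)
auc = area_under_curve(curve)
print(matrix)
print(auc)
```

The confusion matrix depends on the chosen threshold, while the ROC curve summarizes behavior across all thresholds, which is why the two are often reported together.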
Details
The workflow starts by importing the diabetes dataset, which is then processed and split into training and validation subsets by the Split node. Approximately 80% of the rows are used for model training, and the remaining rows (roughly 20%) are reserved for validation.
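The split described above can be sketched in plain Python as follows. Random splitters such as Spark's `DataFrame.randomSplit` assign rows probabilistically, which is one likely reason the proportion is described as approximate. Whether the Split node works this way is an assumption, and `split_rows`, its parameters, and the stand-in rows are hypothetical names used only for illustration.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Randomly assign each row to the training or validation subset.

    Every row is sent to training with probability train_fraction, so the
    realized proportions are close to, but not exactly, 80/20.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, valid = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            valid.append(row)
    return train, valid


rows = list(range(1000))  # stand-in for the dataset's rows
train, valid = split_rows(rows)
print(len(train), len(valid))
```

Fixing the seed keeps the same rows in each subset across runs, so metrics from different runs stay comparable.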
The H2O Generalized Linear Models node fits the model on the training data. A GLM can pair different response distributions with different link functions, for example a binomial distribution with a logit link for a binary outcome. The H2O Score node then applies the trained model to the validation set and produces performance metrics that measure predictive accuracy.
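To make the train-then-score idea concrete, here is a minimal sketch of one GLM, a binomial family with a logit link (logistic regression), fitted by batch gradient ascent on the log-likelihood and then scored on held-out rows. H2O uses its own solvers and options, so this is a stand-in for the concept rather than the node's implementation. The synthetic data, learning rate, epoch count, and function names are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Inverse logit link: map a linear predictor to a probability."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, features):
    """Probability of class 1; weights[0] is the intercept."""
    linear = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(linear)


def fit_binomial_glm(X, y, learning_rate=0.1, epochs=1000):
    """Fit intercept and coefficients by batch gradient ascent."""
    n, p = len(X), len(X[0])
    weights = [0.0] * (p + 1)
    for _ in range(epochs):
        gradient = [0.0] * (p + 1)
        for features, target in zip(X, y):
            error = target - predict_proba(weights, features)
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Synthetic data (illustrative only): class 1 becomes more likely as x grows.
rng = random.Random(0)
X = [[rng.uniform(-3, 3)] for _ in range(400)]
y = [1 if rng.random() < sigmoid(2.0 * x[0] - 0.5) else 0 for x in X]

# Hold out the last 20% of rows for validation.
train_X, valid_X = X[:320], X[320:]
train_y, valid_y = y[:320], y[320:]

weights = fit_binomial_glm(train_X, train_y)
predictions = [1 if predict_proba(weights, row) >= 0.5 else 0 for row in valid_X]
accuracy = sum(p == t for p, t in zip(predictions, valid_y)) / len(valid_y)
print("coefficients:", weights)
print("validation accuracy:", accuracy)
```

Scoring only on rows the model never saw during fitting is what makes the reported accuracy an estimate of performance on new data.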
Finally, the Print N Rows node displays rows of the scored output so that users can review predictions and metrics interactively. This workflow is well suited to regression or classification tasks that call for explainable, statistically grounded modeling with integrated validation.