Workflow Automation Templates

A library of ready-to-use workflow templates to accelerate your data journey

H2O Pyspark DRF

Train & validate models

Overview

This workflow builds a predictive model using the H2O Distributed Random Forest (DRF) algorithm within PySpark. It splits the dataset into training and validation sets and generates key model performance metrics such as ROC curves and confusion matrices.

Details

The workflow starts by loading the diabetes dataset and splitting it into training and testing sets using the Split node, with 80% of the data used for training. The H2O Distributed Random Forest node then trains the model, leveraging an ensemble of decision trees to improve prediction accuracy and robustness.
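The split-and-train steps above can be sketched locally. This is not the H2O DRF node itself; it uses scikit-learn's RandomForestClassifier as a single-machine stand-in, and it assumes the "diabetes dataset" is a binary-outcome table (here sklearn's diabetes regression target is binarized at its median to create one):

```python
# Sketch of the Split + DRF steps, using scikit-learn as a local stand-in
# for the distributed H2O nodes.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: binarize the continuous target at its median to get a
# two-class problem comparable to the workflow's classification task.
X, y_cont = load_diabetes(return_X_y=True)
y = (y_cont > np.median(y_cont)).astype(int)

# Split node: 80% training / 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

# DRF analogue: a random forest (ensemble of decision trees)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```

In the actual workflow these steps run distributed via Sparkling Water, so the split and training scale across the Spark cluster rather than a single machine.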

The trained model is evaluated on the test data using the H2O Score node, which produces validation metrics and diagnostic plots for performance assessment. The Print N Rows node displays a preview of predictions for easy verification.
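The scoring step can be sketched the same way. Again this is a local scikit-learn analogue of the H2O Score and Print N Rows nodes, not the nodes themselves, and the binarized-diabetes setup is an assumption carried over from the training sketch:

```python
# Sketch of the H2O Score + Print N Rows steps with scikit-learn stand-ins.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Setup (assumption): binarized diabetes target, 80/20 split, forest model
X, y_cont = load_diabetes(return_X_y=True)
y = (y_cont > np.median(y_cont)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score analogue: validation metrics on the held-out test set
preds = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, preds)     # rows: true class, cols: predicted
auc = roc_auc_score(y_test, proba)       # area under the ROC curve

print("Confusion matrix:\n", cm)
print(f"ROC AUC: {auc:.3f}")

# Print N Rows analogue: preview the first 5 predictions
print("First 5 predictions:", preds[:5])
```

The confusion matrix and ROC AUC here correspond to the diagnostic outputs the H2O Score node produces for performance assessment.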

This workflow delivers scalable and interpretable classification results, ideal for medical or categorical prediction tasks using distributed machine learning.
