Workflow Automation Templates
A library of ready-to-use workflow templates to accelerate your data journey
H2O PySpark DRF
Train & validate models

Overview
This workflow builds a predictive model using the H2O Distributed Random Forest (DRF) algorithm within PySpark. It splits the dataset into training and test sets, trains the model, and generates key performance metrics such as ROC curves and confusion matrices.
Details
The workflow starts by loading the diabetes dataset and splitting it into training and test sets using the Split node, with 80% of the data used for training and the remaining 20% held out for testing. The H2O Distributed Random Forest node then trains the model, combining multiple decision trees to improve prediction accuracy and robustness.
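The 80/20 split can be pictured in plain Python as follows. This is only a minimal sketch of the idea, not the Split node's implementation: the function name split_rows, the seed value, and the toy row ids are assumptions made for illustration.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the dataset: ten hypothetical row ids.
rows = list(range(10))
train, test = split_rows(rows)
print(len(train), len(test))
```

Shuffling before cutting keeps any ordering in the source file from leaking into the split, and fixing the seed makes the same rows land in the same set on every run.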
The trained model is evaluated on the test data using the H2O Score node, which produces validation metrics and diagnostic plots for performance assessment. The Print N Rows node displays a preview of predictions for easy verification.
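To make the validation metrics concrete, the sketch below computes a confusion matrix and ROC points from predicted probabilities in plain Python. It illustrates what these metrics mean rather than how the H2O Score node calculates them: the function names, the 0.5 threshold, and the toy labels and scores are all assumptions.

```python
def confusion_matrix(y_true, y_score, threshold=0.5):
    """Return (tp, fp, tn, fn) for binary labels at the given threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(y_true, y_score):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Assumes y_true contains at least one positive and one negative label.
    """
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp, fp, tn, fn = confusion_matrix(y_true, y_score, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical test labels (1 = positive) and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]

print(confusion_matrix(y_true, y_score))
print(auc(roc_points(y_true, y_score)))
```

The confusion matrix depends on the chosen threshold, while the ROC curve sweeps every threshold, which is why the two are usually read together.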
The result is a scalable, interpretable classification workflow, suited to medical and other categorical prediction tasks that benefit from distributed machine learning.