Data validation is a core need in data processing, and Sparkflows provides strong support for it. At a high level, it allows us to:
Define complex rules for validating each record in the incoming Dataset.
Split the incoming Dataset into 2 output Datasets, one containing the valid records and the other containing the invalid ones.
Add a new column to the output containing the reason a record failed validation.
Overview
In this blog we discuss the Multiple Validation Processor in Sparkflows. It allows us to define multiple rules on various columns of the dataset.
Each rule can consist of up to 3 expressions connected with logical operators.
The output consists of 2 DataFrames, one containing the records which pass the validation tests and the other containing the records which fail them.
The output DataFrames also contain an extra column giving the reason for the validation failure.
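To make the behaviour described above concrete, here is a minimal sketch in plain Python. It is not the Sparkflows API: the `Rule` class, the `validate` function, the `validation_failure_reason` column name, and the `price` and `bedrooms` columns are all hypothetical names chosen for illustration. We also assume the supported logical operators are AND and OR, and that valid records get an empty reason.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


@dataclass
class Rule:
    """A hypothetical rule: 1 to 3 expressions joined by one logical operator."""
    name: str
    expressions: List[Callable[[Record], bool]]
    operator: str = "AND"  # assumed to be either "AND" or "OR"

    def __post_init__(self) -> None:
        if not 1 <= len(self.expressions) <= 3:
            raise ValueError("a rule takes between 1 and 3 expressions")
        if self.operator not in ("AND", "OR"):
            raise ValueError("operator must be 'AND' or 'OR'")

    def passes(self, record: Record) -> bool:
        results = [expr(record) for expr in self.expressions]
        return all(results) if self.operator == "AND" else any(results)


def validate(
    records: List[Record],
    rules: List[Rule],
    reason_column: str = "validation_failure_reason",
) -> Tuple[List[Record], List[Record]]:
    """Split records into (valid, invalid), adding a reason column to each row."""
    valid: List[Record] = []
    invalid: List[Record] = []
    for record in records:
        failed = [rule.name for rule in rules if not rule.passes(record)]
        row = dict(record)  # copy so the input records are left untouched
        row[reason_column] = "; ".join(failed)  # empty string for valid rows
        (invalid if failed else valid).append(row)
    return valid, invalid


# Illustrative rules on made-up housing columns.
rules = [
    Rule("price must be positive", [lambda r: r["price"] > 0]),
    Rule(
        "bedrooms between 1 and 10",
        [lambda r: r["bedrooms"] >= 1, lambda r: r["bedrooms"] <= 10],
        "AND",
    ),
]
housing = [
    {"price": 250000, "bedrooms": 3},
    {"price": -1, "bedrooms": 2},
    {"price": 180000, "bedrooms": 0},
]
valid, invalid = validate(housing, rules)
```

In this sketch a record that breaks several rules gets all the rule names joined into a single reason string. How the actual processor reports multiple failures is not something this example tries to reproduce.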
Multiple Validation Processor
Below are the settings of the Multiple Validation Processor applied to the Housing dataset.
Workflow
Below is a workflow containing the Validation Multiple Node applied to the Housing Dataset. It consists of 3 Processors.
The first one reads the dataset, the second one applies the validation rules to it, and the final one saves the dataset onto HDFS.
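The same read, validate, save shape can be sketched outside Sparkflows with the Python standard library. This is only an illustration under stated assumptions: the function names are hypothetical, the single rule (a positive numeric `price`) is made up, and an in-memory text buffer stands in for the HDFS target.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

Row = Dict[str, str]


def read_dataset(source: TextIO) -> List[Row]:
    """Stage 1: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(source))


def is_positive_number(text) -> bool:
    """Illustrative rule: the value parses as a number greater than zero."""
    try:
        return float(text) > 0
    except (TypeError, ValueError):
        return False


def apply_validation(rows: List[Row]) -> List[Row]:
    """Stage 2: keep only the rows whose 'price' passes the rule."""
    return [row for row in rows if is_positive_number(row.get("price"))]


def save_dataset(rows: List[Row], sink: TextIO) -> None:
    """Stage 3: write the rows as CSV (the sink stands in for HDFS here)."""
    if not rows:
        return
    writer = csv.DictWriter(sink, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def run_workflow(source: TextIO, sink: TextIO) -> Tuple[int, int]:
    """Chain the three stages; return (rows read, rows saved)."""
    rows = read_dataset(source)
    valid = apply_validation(rows)
    save_dataset(valid, sink)
    return len(rows), len(valid)
```

Calling `run_workflow(io.StringIO(csv_text), io.StringIO())` runs all three stages in order, mirroring the three Processors in the workflow.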
Results from Multiple Validation Processor
Below are the results from applying the Multiple Validation Processor to the Housing Dataset.
We see that only the valid records have passed the validation tests.