Saving Data to HDFS
Workflows allow us to create data pipelines in which data is transformed, models are generated, and so on. At some point we might want to save the resulting data to HDFS.
Sparkflows provides two Nodes for saving data to HDFS:
- SaveParquet: Saves the data in Parquet format
- SaveJSON: Saves the data in JSON format
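Under the hood these correspond to Spark's DataFrameWriter. Below is a minimal Scala sketch of the equivalent write calls; the DataFrame df, its contents, and the output paths are illustrative, not taken from the product:

```scala
import org.apache.spark.sql.SparkSession

object SaveToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SaveToHdfsSketch")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative DataFrame standing in for the workflow's data
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // Roughly what SaveParquet does: write the rows as Parquet files
    df.write.parquet("output")

    // Roughly what SaveJSON does: write the rows as JSON files
    df.write.json("output_json")

    spark.stop()
  }
}
```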
Cluster vs Standalone Mode
- Sparkflows can run in Cluster mode or in Standalone mode. This setting is found under Administration/Configuration; the specific parameter is app.runOnCluster.
- When running in Standalone mode, the results are written to the local file system of the machine on which Sparkflows is running.
- When running in Cluster mode, the results are written to HDFS.
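This difference comes down to which default file system a relative path like 'output' is resolved against. As a sketch of how to inspect that resolution (assuming the Hadoop client libraries are on the classpath; the behavior of your cluster's configuration may differ):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PathResolutionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // In cluster mode fs.defaultFS typically points at HDFS
    // (e.g. hdfs://namenode:8020); in standalone mode it is
    // usually the local file system (file:///)
    val fs = FileSystem.get(conf)

    // A relative path such as 'output' is resolved against the
    // default file system and the user's working directory
    val resolved = fs.makeQualified(new Path("output"))
    println(s"'output' resolves to: $resolved")
  }
}
```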
Workflow containing SaveParquet
Below is a workflow containing SaveParquet. The path specified is the 'output' directory on HDFS.
Results of Executing the Workflow
When the workflow is executed, the data is written into the 'output' directory as Parquet files.
Below are the Parquet files on HDFS in the 'output' directory.
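To confirm that the write succeeded, the Parquet files can be read back with Spark. A minimal sketch using the same 'output' path:

```scala
import org.apache.spark.sql.SparkSession

object ReadBackParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadBackParquet")
      .getOrCreate()

    // Read the Parquet files written to the 'output' directory,
    // then print the inferred schema and a few rows
    val df = spark.read.parquet("output")
    df.printSchema()
    df.show(5)

    spark.stop()
  }
}
```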