Saving Data To HDFS

Workflows allow us to create data pipelines in which data is transformed, models are generated, and so on. At some point we may want to save the resulting data to HDFS.

 

Sparkflows has two nodes for saving data to HDFS (a sketch of the equivalent Spark calls follows the list):

 

  • SaveParquet: saves the data in Parquet format

  • SaveJSON: saves the data in JSON format

 
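Both nodes correspond to the standard Spark DataFrame writers. Below is a minimal, hand-written Spark equivalent of what the two nodes do, a sketch rather than actual Sparkflows code: the sample DataFrame, the app name, and the 'output_json' path are assumptions for illustration.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SaveToHdfsExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SaveToHdfsExample") // app name is a placeholder
          .getOrCreate()

        // A small stand-in DataFrame for the output of a pipeline.
        val df = spark.range(100).toDF("id")

        // SaveParquet: write the DataFrame out as Parquet files.
        df.write.mode(SaveMode.Overwrite).parquet("output")

        // SaveJSON: write the DataFrame out as JSON files
        // ('output_json' is an illustrative path).
        df.write.mode(SaveMode.Overwrite).json("output_json")

        spark.stop()
      }
    }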

Cluster vs Standalone Mode

  • Sparkflows can run in Cluster mode or in Standalone mode. This setting is under Administration/Configuration; the specific parameter is app.runOnCluster.

  • When running in standalone mode, the results are written out onto the local file system of the computer on which Sparkflows is running.

  • When running in cluster mode, the results are written out onto HDFS.

 
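Which filesystem a plain path such as 'output' lands on follows from the Hadoop configuration Spark runs with. The fragment below (continuing with df from the sketch above) illustrates the general Spark behavior; the hostname, port, and explicit paths are placeholders, and the exact resolution Sparkflows performs internally may differ.

    // A path without a scheme resolves against the default filesystem
    // (fs.defaultFS): the local file system in standalone mode, HDFS in
    // cluster mode.
    df.write.parquet("output")

    // An explicit URI pins the destination regardless of mode
    // (hostname, port, and paths below are placeholders).
    df.write.parquet("file:///tmp/output")                  // local file system
    df.write.parquet("hdfs://namenode:8020/user/me/output") // HDFS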

Workflow containing SaveParquet

 

Below is a workflow containing SaveParquet. The path specified is the 'output' directory on HDFS.

Results of Executing the Workflow

When the workflow is executed, the data is written into the 'output' directory as Parquet files.

Below are the Parquet files on HDFS in the 'output' directory.
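A quick way to verify the write is to read the directory back with Spark. This is a minimal sketch continuing the session from the earlier example.

    // Read the Parquet files back to verify the write.
    val saved = spark.read.parquet("output")
    saved.printSchema()
    println(s"rows written: ${saved.count()}")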
