Creating a DataSet from Parquet files
Sparkflows can handle both structured and unstructured data. Structured data is represented in Sparkflows as DataSets, which can then be used in any workflow.
Here we create a DataSet over Parquet files. Parquet files embed their schema, so Sparkflows can extract it automatically.
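To illustrate what "schema embedded in the file" means, here is a minimal Python sketch that locates the footer of a Parquet file. In the Parquet format, an unencrypted file starts and ends with the 4-byte magic "PAR1", and the 4 bytes before the trailing magic hold the little-endian length of the footer metadata, which is where the schema is stored. The function name read_footer_length is hypothetical, and this is not how Sparkflows itself reads the schema; it only shows where a reader would look.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def read_footer_length(path):
    """Return the length in bytes of the footer metadata (which holds the schema).

    Hypothetical helper for illustration only. Raises ValueError if the file
    does not look like an unencrypted Parquet file.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)  # jump to the end to learn the file size
        size = f.tell()
        # Smallest possible layout: 4-byte magic + 4-byte length + 4-byte magic
        if size < 12:
            raise ValueError("file too small to be a Parquet file")
        f.seek(-8, 2)  # last 8 bytes: footer length, then trailing magic
        tail = f.read(8)

    if head != PARQUET_MAGIC or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic bytes")

    # Footer length is a 4-byte little-endian unsigned integer
    (footer_length,) = struct.unpack("<I", tail[:4])
    return footer_length
```

Calling read_footer_length on a Parquet file (for example a local copy of people.parquet, an assumed path) returns the size of the metadata block. Decoding that block into field names and types requires a Thrift-aware Parquet reader, which is the part Sparkflows handles for us.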
Viewing Existing DataSets
The existing DataSets are displayed on the DataSets page of Sparkflows.
Creating a New Parquet DataSet
We now create a DataSet over people.parquet, which is a Parquet file.
On the ‘Create DataSet’ page, we fill in the required fields as below.
Above, we have specified a name for the DataSet we are creating. The ‘Header’ and ‘Delimiter’ fields are ignored for Parquet files.
Once we have specified the above, we click the ‘Update’ button. This brings up the sample data, extracts the schema, and displays it. Below we see that there are two fields: age and name. The age field is of type integer and the name field is of type string.
Clicking the ‘Save’ button creates the new DataSet for us.
Now we are ready to start using our new DataSet in Workflows.