Creating a DataSet for CSV files
Sparkflows can handle both structured and unstructured data. For structured data, DataSets are created in Sparkflows. The DataSets created can then be used in any workflow.
Here we create a DataSet over CSV files. Sparkflows automatically infers the schema, using the spark-csv library from Databricks for this.
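Sparkflows performs the schema inference itself, so no code is needed. As a rough illustration of the general idea behind inference, the following is a minimal pure-Python sketch, not the spark-csv implementation; the sample column names (price, sqft, city) are made up for the example.

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest type (integer -> double -> string) that fits every value."""
    for caster, type_name in ((int, "integer"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"

def infer_schema(csv_text, has_header=True):
    """Return [(column_name, inferred_type)] from a CSV sample."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0] if has_header else [f"C{i}" for i in range(len(rows[0]))]
    data = rows[1:] if has_header else rows
    columns = list(zip(*data))  # transpose rows into per-column tuples
    return [(name, infer_type(col)) for name, col in zip(header, columns)]

# Hypothetical sample resembling a housing CSV with a header row
sample = "price,sqft,city\n350000,1200,Austin\n499000,1650,Dallas\n"
print(infer_schema(sample))
# → [('price', 'integer'), ('sqft', 'integer'), ('city', 'string')]
```

In practice spark-csv samples the data and applies a similar widening strategy across many more types (booleans, timestamps, etc.).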
Viewing Existing DataSets
The existing DataSets are displayed in the DataSets page of Sparkflows.
Creating a New DataSet
We now create a DataSet over the housing.csv file. It is a comma-separated file with a header row specifying the names of the various columns.
In the ‘Create DataSet’ page we fill in the required fields as below.
Above, we have specified a name for the DataSet being created, set ‘Header’ to true to indicate that the file has a header row, set the field delimiter to a comma, and specified the path to the file.
Once we have specified the above, we hit the ‘Update’ button. This loads the sample data, infers the schema, and displays both. We can then change the column names as well as the data types. The Format column is used to specify the format for date/time fields.
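The format entered in the Format column tells the engine how to parse date/time strings in that column. As a hedged sketch of what such a cast does conceptually (the column name and format string below are invented for illustration; Sparkflows/Spark use Java-style patterns such as yyyy-MM-dd rather than Python's %Y-%m-%d):

```python
from datetime import datetime

def cast_dates(values, fmt):
    """Parse each string in a column into a date using the user-supplied format."""
    return [datetime.strptime(v, fmt).date() for v in values]

# Hypothetical 'date_sold' column values and a Python-style format string
sold = ["2016-01-15", "2016-02-03"]
print(cast_dates(sold, "%Y-%m-%d"))
# → [datetime.date(2016, 1, 15), datetime.date(2016, 2, 3)]
```

A value that does not match the supplied format would raise an error here; real CSV readers typically offer a choice between failing and setting such values to null.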