Powerful Data Prep & ETL with Sparkflows

Sparkflows
Jan 24, 2018
2 min read

Sparkflows allows you to quickly do very complex data preparation and ETL on Big Data.

Sparkflows has a number of features for enabling it including:

Data Cleaning
Joins, GroupBy, Cube, SQL etc.
Reading data from multiple sources and writing results to various sinks.

Sparkflows also give you detailed control over your Spark jobs with Repartition, Coalesce, Cache etc.

Overview

In this blog we start with 5 datasets, read them in, understand their schema in the process, perform data cleaning and then certain aggregations and joins.

The final results are written to HDFS as CSV files. They can as well we written as Parquet, Avro, JSON, XML files or to HIVE/Relational tables.

Datasets

There are 5 datasets:

facts.dat : Contains fixed length records of products sold to customers
geo.csv : Contains mapping of geo ids to geo names
product.csv : Contains mapping of product ids to product names
customer.csv : Contains mapping of customer ids to customer names
time.csv : Contains mapping of various time interval ids to corresponding names

Workflow

The workflow achieves the following tasks:

Parses the facts data and performs various cleanup operations on it
Performs groupby with aggregations operations and saves it to a file
Joins the fact data with various dimensions to create a large table and saves it to a CSV file.

The workflow is shown below:

Data Parsing and Cleaning

While the various dimension data is available as csv files, the fact data is in fixed field size format.

Each record has a fixed number of characters. In each record each field consists of fixed number of characters. The steps for data parsing and cleaning are as follows: