Building Data Pipelines in Sparkflows

Sparkflows
Nov 14, 2022
This blog is the first of many to come in the Data and AI Pipeline series.

Data pipelines move data from source systems to data sinks while transforming it along the way. The sources can be silos of data at rest in filesystems, databases, or cloud storage, among others, or they can be streaming data sources.

Once the data is ingested into the pipeline, it goes through a series of data standardization steps such as data profiling, data quality checks, filtering, and other transformations. The cleaned and enriched data is then saved to one or more data sinks. At that point the standardized data is ready to be consumed by downstream processes such as building predictive models, running analytical queries, and reporting.
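To make the source, transformation, and sink stages concrete, here is a minimal plain-Python sketch. It is not Sparkflows code; the function names, the record fields, and the two in-memory lists standing in for sinks are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Step = Callable[[Iterable[Record]], Iterator[Record]]


def drop_incomplete(records: Iterable[Record]) -> Iterator[Record]:
    """Quality check: skip records that contain a missing (None) value."""
    for record in records:
        if all(value is not None for value in record.values()):
            yield record


def normalize_names(records: Iterable[Record]) -> Iterator[Record]:
    """Transformation: trim whitespace and title-case the 'name' field."""
    for record in records:
        yield {**record, "name": record["name"].strip().title()}


def run_pipeline(source: Iterable[Record], steps: List[Step], sinks: List[list]) -> None:
    """Pass records through each step in order, then write them to every sink."""
    stream: Iterable[Record] = source
    for step in steps:
        stream = step(stream)
    for record in stream:
        for sink in sinks:
            sink.append(record)


# Hypothetical source data; a real source would be a file, database, or stream.
source = [
    {"id": 1, "name": "  alice "},
    {"id": 2, "name": None},
    {"id": 3, "name": "BOB"},
]
warehouse: list = []  # stand-in for a database sink
lake: list = []       # stand-in for a file or data lake sink
run_pipeline(source, [drop_incomplete, normalize_names], [warehouse, lake])
print(warehouse)
```

Each step is a generator, so records flow through one at a time, and the same cleaned output lands in both sinks.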

Sparkflows enables a user to build data pipelines 10x faster through its low-code/no-code, drag-and-drop platform. Sparkflows provides 350+ processors for building data pipelines. Each processor performs a specific task, such as filtering rows, selecting columns, joining, pivoting, or running data quality checks, among others. More on the processors can be found in our data sheet here: https://www.sparkflows.io/data-sheets and our capabilities page here: https://www.sparkflows.io/productcapablities

A good representative example of a data pipeline built and deployed in production in a few hours via Sparkflows can be seen below. Such pipelines usually take much longer to develop and deploy if coded by a data engineer.

The above pipeline performs the following operations:

Reads data from multiple sources such as files on disk, databases, and cloud storage
Joins them on a common column
Performs data cleaning operations such as dropping duplicate rows, casting columns to a type, and dropping columns
Performs data standardization operations such as validating that fields lie in a range or take only specific values
Performs data quality checks by computing correlation among columns and summary statistics
Writes the cleaned and standardized data to a database (for analytical queries) and also to the file system (for running predictive models).
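For readers who want to see what these operations amount to in code, the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not what Sparkflows generates or runs. The sample data, the column names, the allowed channel values, and the amount range are all made up, and in-memory objects stand in for the real sources and sinks.

```python
import csv
import io
import math
import statistics

# Hypothetical stand-ins for two sources: a CSV file and a database table.
ORDERS_CSV = """customer_id,amount,channel,note
1,120.0,web,ok
2,80.0,store,ok
2,80.0,store,ok
3,-15.0,web,refund
4,200.0,fax,ok
5,100.0,web,ok
"""
CUSTOMERS = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
    {"customer_id": 3, "age": 29},
    {"customer_id": 4, "age": 62},
    {"customer_id": 5, "age": 40},
]

# Read: parse the file-like source into one dictionary per row.
orders = list(csv.DictReader(io.StringIO(ORDERS_CSV)))

# Join: inner join on the common column customer_id.
age_by_id = {str(c["customer_id"]): c["age"] for c in CUSTOMERS}
joined = [{**row, "age": age_by_id[row["customer_id"]]}
          for row in orders if row["customer_id"] in age_by_id]

# Clean: drop duplicate rows, cast columns to a type, drop the "note" column.
seen, cleaned = set(), []
for row in joined:
    key = tuple(row.items())
    if key in seen:
        continue
    seen.add(key)
    cleaned.append({
        "customer_id": int(row["customer_id"]),
        "amount": float(row["amount"]),
        "channel": row["channel"],
        "age": row["age"],
    })

# Standardize: keep rows whose amount is in range and whose channel is allowed.
ALLOWED_CHANNELS = {"web", "store"}
valid = [r for r in cleaned
         if 0 <= r["amount"] <= 10_000 and r["channel"] in ALLOWED_CHANNELS]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Quality checks: summary statistics and correlation between two columns.
amounts = [r["amount"] for r in valid]
ages = [r["age"] for r in valid]
summary = {"count": len(amounts), "min": min(amounts),
           "max": max(amounts), "mean": statistics.fmean(amounts)}
correlation = pearson(amounts, ages)

# Write: one copy to a "database" (a list here) and one to a "file" (CSV text).
database_table = list(valid)
file_sink = io.StringIO()
writer = csv.DictWriter(file_sink, fieldnames=["customer_id", "amount", "channel", "age"])
writer.writeheader()
writer.writerows(valid)

print(summary, round(correlation, 3))
```

Even on toy data the hand-written version needs a fair amount of glue code for each step, which is the work a drag-and-drop processor is meant to replace.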

An assortment of node configurations and the results page can be seen below:

Once the data pipeline is developed, users can choose any of the following engines to execute the workflow:

Execute the pipeline on the local machine running Sparkflows
Connect to a Spark cluster and execute there
Execute on an EMR cluster
Execute it as an Airflow pipeline.

The same workflow can be submitted to any of the above options for execution. This is possible because of the seamless integrations the team has developed over the years. Execution metadata is streamed back and persisted on the Sparkflows server, giving users deep insight into what is happening under the hood and into the execution state of the pipeline. The metadata streamed back can be seen in the image below.

Some of the common data pipelines developed by our customers:

Complex ETL pipelines to read, transform and write back to data lake
Read data from historical systems, build an ETL pipeline and then ingest the data into a new age data lake
Standardize the data and save it on a data lake so that predictive models can be built
Run data pipelines by creating Airflow DAGs and executing them as EMR steps
And many more…

A rich set of features makes pipeline development and operationalization a true self-serve experience. These include the ability to schedule the pipeline, email notifications on failure and success, Git versioning of the pipeline, automatic triggering of the pipeline when certain events are generated, and the ability to run pipeline steps sequentially or in parallel, among others.
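As a rough illustration of two of these ideas, running steps sequentially or in parallel and notifying on success and failure, here is a small sketch using only the Python standard library. It does not reflect how Sparkflows implements these features. The step names, the `run_steps` helper, and the `record` callback (which stands in for sending an email) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Step = Callable[[], str]
Notifier = Callable[[str, str], None]


def run_steps(steps: Dict[str, Step], parallel: bool, notify: Notifier) -> Dict[str, Optional[str]]:
    """Run every step, one after another or in a thread pool, and report each outcome."""

    def run_one(name: str) -> Optional[str]:
        try:
            result = steps[name]()
        except Exception as exc:  # report the failure instead of stopping the run
            notify(name, f"failure: {exc}")
            return None
        notify(name, "success")
        return result

    names = list(steps)
    if parallel:
        with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
            results = list(pool.map(run_one, names))
    else:
        results = [run_one(name) for name in names]
    return dict(zip(names, results))


def validate() -> str:
    """A step that always fails, to show the failure notification path."""
    raise ValueError("range check failed")


# Hypothetical notification sink; a real one might send an email instead.
notifications: List[Tuple[str, str]] = []


def record(name: str, status: str) -> None:
    notifications.append((name, status))


steps: Dict[str, Step] = {
    "profile": lambda: "profiled",
    "validate": validate,
    "export": lambda: "exported",
}

sequential_results = run_steps(steps, parallel=False, notify=record)
parallel_results = run_steps(steps, parallel=True, notify=record)
print(sequential_results)
```

Both modes return the same results per step; only the order in which notifications arrive can differ when the steps run in parallel.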

Summary:

Data pipelines have become very complex, and building and managing them is equally complex, while maintaining zero downtime is a challenging task. Sparkflows, with its rich out-of-the-box capabilities, supports customers' data journeys seamlessly by enabling 10x faster building of data pipelines, providing a huge ROI.