
Data Preparation


Data cleaning and validation

Sparkflows enables users to build data pipelines with 100+ pre-built processors and custom code processors to validate, transform, and clean data so that it can be consumed to build machine learning models.

Sparkflows has push-down analytics built into its core architecture, so users need not worry about pulling data from different sources. Processing happens where the data resides, so data governance constraints are met.

Data Connectors

Sparkflows supports reading and writing 30+ file formats via pre-built processors, each file type with its own advantages and the scenarios it fits best.

Sparkflows provides connectors to 20+ data sources for reading and writing data, including Amazon Redshift, Amazon S3, Databricks DBFS, Snowflake, SQL databases, and Google BigQuery, among others.

Sparkflows also supports reading and writing data to and from streaming endpoints and message queues such as Apache Kafka, RabbitMQ, and the Twitter firehose, among others.
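
As a rough illustration of the kind of reads and writes these connector processors wrap, the PySpark sketch below reads a CSV file from Amazon S3, writes it back out as Parquet, and subscribes to an Apache Kafka topic. The bucket, path, broker, and topic names are placeholders, and this is generic Spark code rather than the Sparkflows processor API itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("connector-sketch").getOrCreate()

# Read a CSV file from S3 (bucket and path are placeholders).
orders = (spark.read
          .option("header", True)
          .csv("s3a://example-bucket/raw/orders.csv"))

# Write the same data back out as Parquet, one of the 30+ supported formats.
orders.write.mode("overwrite").parquet("s3a://example-bucket/curated/orders/")

# Subscribe to a Kafka topic as a streaming source (broker and topic are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())
```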

Statistical data preparation

Sparkflows has pre-built processors for data cleaning, statistical imputation, and data enrichment, including processors like Join, Union, Filter, and Group, among others. The data can be prepared and cleaned to be fed to downstream data processes.
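
A minimal PySpark sketch of the kinds of operations these processors perform: a join, a mean imputation of a missing numeric column, a filter, and a group-by aggregation. The column and table names are made up for illustration; this is not the Sparkflows processor API itself.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-prep-sketch").getOrCreate()

# Tiny illustrative inputs (column names are invented for the example).
orders = spark.createDataFrame(
    [(1, 100.0, "EU"), (2, None, "US"), (3, -5.0, "EU")],
    ["customer_id", "amount", "region"],
)
customers = spark.createDataFrame(
    [(1, "gold"), (2, "silver")], ["customer_id", "tier"]
)

# Join: enrich orders with customer attributes.
enriched = orders.join(customers, on="customer_id", how="left")

# Statistical imputation: replace missing amounts with the column mean.
mean_amount = enriched.select(F.avg("amount")).first()[0]
imputed = enriched.fillna({"amount": mean_amount})

# Filter out obviously bad rows, then Group and aggregate for downstream use.
by_region = (imputed.filter(F.col("amount") > 0)
             .groupBy("region")
             .agg(F.sum("amount").alias("total_amount")))
by_region.show()
```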

 

A coder can also write custom logic in Python, Scala, or Jython and plug it in as a node in a matter of minutes. These processors make the system extensible and equip Sparkflows to tackle highly complex data preparation pipelines.
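
Such custom logic typically amounts to a function from DataFrame to DataFrame. The sketch below is a generic Python/PySpark example of logic a user might plug in as a custom node, assuming the node receives and returns a DataFrame; the function name and the plug-in contract are assumptions for illustration, not the documented Sparkflows interface.

```python
from pyspark.sql import DataFrame, functions as F

def normalize_amounts(df: DataFrame) -> DataFrame:
    """Hypothetical custom-node logic: standardize a numeric column.

    Assumes the custom processor hands the node a DataFrame and expects
    a DataFrame back; the real Sparkflows plug-in contract may differ.
    """
    stats = df.select(F.avg("amount").alias("mu"),
                      F.stddev("amount").alias("sigma")).first()
    return df.withColumn(
        "amount_normalized",
        (F.col("amount") - F.lit(stats["mu"])) / F.lit(stats["sigma"]),
    )
```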

Push down data preparation at scale

The entire data preparation pipeline runs as a workflow in Sparkflows; workflows are versioned and each execution is tracked. These pipelines run in a push-down manner, so data is not pulled to a centralized location. This enables Sparkflows to prepare and process petabytes of data residing in data lakes.
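
As a generic illustration of the push-down idea (standard Spark behaviour, not a description of Sparkflows internals), the sketch below reads from a SQL database over JDBC; Spark pushes the filter predicate down to the source database, so only matching rows ever leave the system where the data resides. The connection details are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# Read from a SQL database over JDBC (URL, table, and credentials are placeholders).
sales = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://db-host:5432/warehouse")
         .option("dbtable", "sales")
         .option("user", "analyst")
         .option("password", "secret")
         .load())

# Spark pushes this predicate down to the database, so the full table is
# never pulled to a central location; only the matching rows are returned.
recent = sales.filter("sale_date >= DATE '2024-01-01'")
recent.groupBy("region").count().show()
```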

Analytical Apps

Sparkflows enables coders to build the data engineering pipeline, the feature engineering pipeline, and the machine learning models, finalize the one that performs best, and then abstract away the details into a simplified, form-based UI application for non-coders and business users...


Collaboration

Sparkflows enables users to collaborate with other team members via the share feature, which is tied to a project. A business user can create and define a use case in a Sparkflows project, an admin can grant access to the data required for the use case, and a data engineer can then build the data engineering pipeline...


Deployment options and Integrations

Sparkflows can be deployed either on-premises or in any cloud. Sparkflows has deep integrations with Amazon Web Services, Azure, Databricks, Google Cloud, and Snowflake. While deployed in any of the aforementioned environments...
