Bike Sharing Data-Set Analysis using Sparkflows

In this solution, we want to learn to predict bike rental counts per hour from past data, i.e. records grouped by hour that contain information such as the day of the week, the count of rentals, and environmental factors (weather, season, temperature, humidity, holiday, etc.). Good predictions of customer demand allow a business or service to adjust supply as needed and achieve Just-In-Time glory :-)

DataSet

  • The dataset contains bike rental information from 2011 and 2012 in the Capital Bikeshare system, plus additional relevant information. This dataset is from Fanaee-T and Gama (2013) and is hosted by the UCI Machine Learning Repository.

  • The sample being used boils down to:

    • A structured data set in the form of a comma-separated (CSV) file.

    • 10,877 rows (the file can be found in the /data directory of the Sparkflows installation).

    • Each record is the count of rentals for a given hour in the past, along with the environmental factors at that time (season, holiday, temperature, wind speed, etc.).

Workflow Overview

Below is an overview of the workflow we use. This workflow was created via the drag-and-drop capabilities of the Sparkflows Designer UI. The ability to construct this data processing pipeline (or any DAG - Directed Acyclic Graph, for that matter) in a WYSIWYG, plug-and-play manner is a key innovation in our community's collective march toward on-demand, instant analytics. Benefits include:

  • It opens up the power of ETL and ML (such pre-packaged functionality is available as a catalog of "Nodes") to a wider audience of analysts and semi-technical users.

  • The actual execution can either be local (for testing) or be submitted to a Spark cluster.

  • We have seen during adoption that a single workbench improves collaborative iteration across data engineers, data scientists and analysts, which in turn accelerates time-to-market.

  • As one might observe, the visual approach doubles up as workflow documentation and hence contributes to solving the data-lineage problem.

This workflow consists of the following steps (rough code sketches of what each step does in plain Spark appear in the Workflow Steps section below):

  • Read in the dataset from a comma-separated (CSV) file:

    • Please refer to the earlier section for details on the data-set.

    • Instead of a CSV file, one can just as easily read it from a data lake or a persistence store (HDFS/RDBMS/NoSQL).

  • Apply some basic pre-processing to the input data. This takes the form of:

    • DateTimeFieldExtract on the “datetime” column. This parses the timestamp out into year, month, dayofmonth and hour fields.

    • CastColumnType on the “count” column. This converts the column from integer to double, which the GBT Regression node (an upcoming step) requires.

  • Apply VectorAssembler to concatenate the feature columns into a single feature vector, and VectorIndexer to identify and index the categorical features.

  • Split the dataset into (0.8, 0.2):

    • 80% (~8,700 rows) is used for training, and

    • The model is evaluated on how well it predicts the remaining 20% (~2,200 rows).

  • Perform GBT (Gradient Boosted Tree) Regression.

  • Perform predictions on the remaining 20% of the dataset using the generated model.

  • Finally, evaluate the results.

  • (As an aside, note steps 10 and 11 (SQL and Graph), which visualize the raw data. An optional step, but for those visually inclined, it helps to "get/grok" the data - charting our example data by hour makes one wonder who is renting bikes at 3 am ;-) )
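To give a flavour of what that optional SQL node might compute, here is a rough sketch in Scala against plain Spark - average rentals per hour of day. The file path and the "datetime"/"count" column names are assumptions based on the dataset description above, not the workflow's exact configuration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("BikeSharingExplore").getOrCreate()

    // Read the hourly rental records (header row, inferred column types).
    // NOTE: path and column names are assumptions, not the workflow's configuration.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/bike_sharing.csv")

    // Average rentals per hour of day - the kind of summary a bar chart would show.
    raw.createOrReplaceTempView("rentals")
    spark.sql(
      """SELECT hour(datetime) AS hr, AVG(`count`) AS avg_rentals
        |FROM rentals
        |GROUP BY hour(datetime)
        |ORDER BY hr""".stripMargin).show(24)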

Workflow Steps


Data pre-processing (DateTimeFieldExtract and CastColumnType)
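Roughly speaking, this step amounts to the following sketch in Scala against plain Spark (in Sparkflows the nodes are configured visually rather than coded). The file path and the "datetime"/"count" column names are assumptions based on the dataset description:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, year, month, dayofmonth, hour}

    val spark = SparkSession.builder().appName("BikeSharing").getOrCreate()

    // Read the hourly rental records (header row, inferred column types).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/bike_sharing.csv")   // hypothetical path

    // DateTimeFieldExtract: derive year / month / dayofmonth / hour columns
    // from the timestamp column.
    val withDates = raw
      .withColumn("year",       year(col("datetime")))
      .withColumn("month",      month(col("datetime")))
      .withColumn("dayofmonth", dayofmonth(col("datetime")))
      .withColumn("hour",       hour(col("datetime")))

    // CastColumnType: GBT regression expects a double-typed label column.
    val prepared = withDates.withColumn("count", col("count").cast("double"))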

Vector Assembler and Indexer
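Continuing the sketch from the previous step, VectorAssembler and VectorIndexer map onto Spark ML's feature transformers. The exact feature list below is an assumption - use whichever calendar and environmental columns your copy of the dataset contains:

    import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

    // VectorAssembler: concatenate the selected columns into one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("season", "holiday", "workingday", "weather",
                          "temp", "humidity", "windspeed",
                          "year", "month", "dayofmonth", "hour"))
      .setOutputCol("rawFeatures")
    val assembled = assembler.transform(prepared)

    // VectorIndexer: flag low-cardinality columns (season, holiday, ...) as
    // categorical so the tree learner can treat them accordingly.
    val indexer = new VectorIndexer()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .setMaxCategories(12)
    val indexed = indexer.fit(assembled).transform(assembled)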

Data Set Split
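In plain Spark the 80/20 split is a one-liner (continuing the sketch above; the seed is arbitrary and only makes the split reproducible):

    // 80% of the rows for training, 20% held out for evaluating the model.
    val Array(training, test) = indexed.randomSplit(Array(0.8, 0.2), seed = 42)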

GBT Regression
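The GBT Regression node corresponds to Spark ML's GBTRegressor. Continuing the sketch; the hyper-parameters shown are illustrative defaults, not necessarily the values configured in the workflow:

    import org.apache.spark.ml.regression.GBTRegressor

    // Gradient-boosted trees regression on the assembled/indexed features,
    // predicting the double-typed rental count.
    val gbt = new GBTRegressor()
      .setLabelCol("count")
      .setFeaturesCol("features")
      .setMaxIter(20)
      .setMaxDepth(5)
    val model = gbt.fit(training)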

Prediction (on 20% of the data-set)
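Prediction applies the trained model to the held-out 20% (continuing the sketch; the "datetime" column in the select is an assumption):

    // Score the test rows; the "prediction" column holds the estimated rental count.
    val predictions = model.transform(test)
    predictions.select("datetime", "count", "prediction").show(10)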

Model Performance Evaluation
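Evaluation maps onto Spark ML's RegressionEvaluator with the RMSE metric - the square root of the mean squared difference between predicted and actual counts (continuing the sketch):

    import org.apache.spark.ml.evaluation.RegressionEvaluator

    // RMSE = sqrt( mean( (prediction - count)^2 ) ) over the test rows.
    val evaluator = new RegressionEvaluator()
      .setLabelCol("count")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println(s"Root-mean-square error on the test data: $rmse")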

Executing the Workflow

Next, we execute the workflow. It produces demand predictions with an RMSE (root-mean-square error) of 74.82.

Workflow JSON


The visual workflow is represented as JSON behind the scenes, which can, for example, be checked into source control. Sparkflows also maintains the various versions of your workflows. The workflow can be run interactively from the Sparkflows UI, or it can easily be scheduled to run on a Spark cluster via spark-submit with any scheduler.

Executing this workflow

  • Download Sparkflows here

  • Follow installation instructions here

    • Start the Sparkflows server component and log in to the Designer:

    • cd <install_dir>/sparkflows-fire-1.4.0

    • ./run-fire-server.sh

    • Point your browser to <machine_name>:8080/index.html

    • Log in with admin/admin.

  • Click on the Workflows tab and search for “BikeSharing”. Click “Execute” next to the BikeSharing_Analysis workflow and view the results. Modify it visually as appropriate.

Conclusion

As one can glean from the above example, with the Sparkflows Visual Designer, we were able to:

  • Solve the use case visually via plug-and-play “nodes”.

  • Iterate easily by just changing node parameters visually.

  • View and compare the results of past executions.

So download here and bedazzle your colleagues by going from concept to production at virtually the speed of thought.
