At Sparkflows we are obsessed with enabling our users to build amazing data analytics applications in under 30 minutes.
Below we build a Streaming Analytics workflow and dashboard. It:
Reads bike sharing data from Kafka
Parses the incoming data
Finds the number of rentals on an hourly basis
Displays the results visually in a graph
The dataset contains bike rental information from 2011 and 2012 in the Capital Bikeshare system, plus additional relevant information.
This dataset is from Fanaee-T and Gama (2013) and is hosted by the UCI Machine Learning Repository. It consists of 10877 rows and can be found in the /data directory of the Fire installation. Each record holds the count of rentals for a given hour in the past, together with the environmental factors at that time (season, holiday, temperature, wind speed, etc.).
Start Kafka and create Topic 'bike-sharing'
The Kafka quick start guide is at: https://kafka.apache.org/quickstart
The steps for Kafka are:
Start the ZooKeeper and Kafka servers. You can also use an existing ZooKeeper/Kafka instance.
Create the topic 'bike-sharing'
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic bike-sharing
Send the data file 'bike_sharing_noheader.csv' to the Kafka Topic
bike_sharing_noheader.csv is in the data directory of the Fire installation.
cat bike_sharing_noheader.csv | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic bike-sharing
You can also keep sending random samples of the data to Kafka in a loop. We used the bash script below:
echo "Press [CTRL+C] to stop.."
while true; do
  sort -R bike_sharing_noheader.csv | head -n 100 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic bike-sharing
done
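The sampling step in the script above (`sort -R ... | head -n 100`) picks 100 lines at random on each pass. As a rough illustration only, the same idea can be sketched in Python. The function name `sample_lines` and the stand-in rows are hypothetical, and the sketch does not talk to Kafka.

```python
import random

def sample_lines(lines, n=100, seed=None):
    """Return up to n non-empty lines chosen at random, without replacement."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(rows, min(n, len(rows)))

# Stand-in rows (hypothetical); in practice these would be the lines of
# bike_sharing_noheader.csv.
demo = [f"row-{i}\n" for i in range(500)]
batch = sample_lines(demo, n=100, seed=42)
print(len(batch))
```

Passing an open file handle for bike_sharing_noheader.csv as `lines` and printing each returned row would produce text that could be piped into kafka-console-producer.sh, as in the command shown earlier.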
Below is a workflow for Streaming Analytics of the Bike Sharing dataset.
It consists of 6 Nodes:
StreamingKafka - Reads in streaming data from the Kafka topic bike-sharing
FieldSplitter - Splits each line into fields
StringToDate - Converts the datetime column into Timestamp type
DateTimeFieldExtract - Extracts year, month, day, and hour from the datetime column
GraphGroupByColumn - Groups the data on the hour column, sums it up, and displays it in a graph
PrintNRows - Prints the first 10 records in a table
StreamingKafka reads in streaming data from Kafka and creates a DataFrame with one column containing the lines.
FieldSplitter splits each line on the separator (a comma) and outputs a new DataFrame with the columns defined.
StringToDate converts the datetime column into a new column of type 'Timestamp'.
DateTimeFieldExtract extracts the year, month, day of month and hour from the datetime_dt column.
GraphGroupByColumn aggregates the data on the hour column and displays it in a graph.
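To make the transformation steps above concrete, here is a minimal plain-Python sketch of the same logic: split, convert to timestamp, extract date parts, and sum by hour. It is not how Fire or Spark implements these nodes. The column layout in `COLUMNS`, the datetime format, and the sample rows are assumptions made for illustration; only the `datetime` and `datetime_dt` column names come from the workflow description.

```python
from collections import defaultdict
from datetime import datetime

# Assumed column layout (hypothetical, not confirmed by the source).
COLUMNS = ["datetime", "season", "holiday", "workingday", "weather",
           "temp", "atemp", "humidity", "windspeed",
           "casual", "registered", "count"]

def split_fields(line, sep=","):
    """FieldSplitter-like step: split one line into named columns."""
    return dict(zip(COLUMNS, line.strip().split(sep)))

def to_timestamp(row, fmt="%Y-%m-%d %H:%M:%S"):
    """StringToDate-like step: add datetime_dt parsed from the datetime string."""
    row["datetime_dt"] = datetime.strptime(row["datetime"], fmt)
    return row

def extract_parts(row):
    """DateTimeFieldExtract-like step: add year, month, day and hour columns."""
    dt = row["datetime_dt"]
    row.update(year=dt.year, month=dt.month, day=dt.day, hour=dt.hour)
    return row

def rentals_by_hour(rows):
    """GraphGroupByColumn-like step: sum the rental count per hour of day."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["hour"]] += int(row["count"])
    return dict(sorted(totals.items()))

# Made-up sample lines in the assumed layout (not taken from the dataset).
lines = [
    "2011-01-01 08:00:00,1,0,0,1,9.84,14.4,81,0.0,3,13,16",
    "2011-01-02 08:00:00,1,0,0,1,9.02,13.6,80,0.0,8,32,40",
    "2011-01-02 09:00:00,1,0,0,2,9.84,14.4,75,6.0,5,27,32",
]
rows = [extract_parts(to_timestamp(split_fields(line))) for line in lines]
hourly = rentals_by_hour(rows)
print(hourly)  # {8: 56, 9: 32}
```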
Executing the workflow
When the workflow is executed, Fire submits a Spark Streaming job to the Spark cluster. The job keeps running and processing the incoming data from Kafka. Below is some of the output produced by the job.
Since we are still very much under 30 minutes, we also go ahead and create a Dashboard for the workflow. Because the mini-batch duration is set to 30 seconds, the Dashboard updates itself every 30 seconds.
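The link between the mini-batch duration and the refresh rate can be pictured with a simplified model: records are grouped by the 30-second window in which they arrive, and each completed window produces one update. The sketch below only illustrates that bucketing and is not Spark's or Fire's actual scheduling code. The function name `assign_batches` and the arrival times are hypothetical.

```python
from collections import defaultdict

BATCH_SECONDS = 30  # mini-batch duration used in the workflow

def assign_batches(arrivals, batch_seconds=BATCH_SECONDS):
    """Group (arrival_time_in_seconds, record) pairs by batch index."""
    batches = defaultdict(list)
    for t, record in arrivals:
        batches[int(t // batch_seconds)].append(record)
    return dict(sorted(batches.items()))

# Hypothetical arrival times, in seconds since the job started.
arrivals = [(1.5, "a"), (12.0, "b"), (29.9, "c"), (30.0, "d"), (75.2, "e")]
batches = assign_batches(arrivals)
for idx, records in batches.items():
    start, end = idx * BATCH_SECONDS, (idx + 1) * BATCH_SECONDS
    print(f"batch {idx}: window [{start}s, {end}s) -> {records}")
```

In this model a record arriving at exactly 30.0 seconds falls into the second batch, so it would first appear in the update that follows that window.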
Below is the Dashboard editor. Select the nodes whose output you want displayed and drag and drop them onto the canvas.