Reading from a MySQL table
Sparkflows can read data from relational stores. In this example, we read the ‘employees’ table from MySQL into a DataFrame and display the first 10 rows.
Installing MySQL
We first download and install MySQL from http://www.mysql.com/downloads/.
Installing MySQL JDBC driver
- Download the MySQL JDBC driver (Connector/J) from the MySQL website.
- Extract the JDBC driver JAR file from the downloaded archive. For example:

  tar zxvf mysql-connector-java-5.1.42.tar.gz

- Copy the MySQL JDBC driver JAR file to the user-lib directory of fire-1.4.0:

  cd fire-1.4.0 (if you are not already in that directory)
  cp ...../mysql-connector-java-5.1.42/mysql-connector-java.jar user-lib
Load Test Data into MySQL
We load test data from https://github.com/datacharmer/test_db:

git clone https://github.com/datacharmer/test_db.git
cd test_db
/usr/local/mysql/bin/mysql -uroot -p < employees.sql

This creates the ‘employees’ database with a few tables in it.
Check that the employees database has been created:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| employees          |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)
The following tables have been created in the employees database.
mysql> use employees
Database changed
mysql> show tables;
+----------------------+
| Tables_in_employees  |
+----------------------+
| current_dept_emp     |
| departments          |
| dept_emp             |
| dept_emp_latest_date |
| dept_manager         |
| employees            |
| salaries             |
| titles               |
+----------------------+
8 rows in set (0.00 sec)
The employees table has 300,024 records:
mysql> select count(*) from employees;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+
1 row in set (0.04 sec)
Read the employees table into a Sparkflows workflow
We create the workflow below. It reads the data from the ‘employees’ table and displays the first 10 rows.

The first node is the JDBC node. We provide the URL, DB Table and Driver.
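The fields of the JDBC node correspond to the options of a Spark JDBC read. The sketch below shows roughly what the node configures, expressed in Python; the URL, credentials, and driver class here are assumptions for a local MySQL with Connector/J 5.1.x, not the exact values Sparkflows generates.

```python
# Sketch of the options a Spark JDBC read expects, matching the
# URL / DB Table / Driver fields of the JDBC node.
# All concrete values below are placeholders for a local setup.

def jdbc_options(url, dbtable, driver, user, password):
    """Build the option map passed to a Spark JDBC data source read."""
    return {
        "url": url,          # e.g. jdbc:mysql://localhost:3306/employees
        "dbtable": dbtable,  # a table name, or a parenthesized subquery
        "driver": driver,    # JDBC driver class from the Connector/J JAR
        "user": user,
        "password": password,
    }

options = jdbc_options(
    url="jdbc:mysql://localhost:3306/employees",
    dbtable="employees",
    driver="com.mysql.jdbc.Driver",  # Connector/J 5.1.x driver class
    user="root",
    password="secret",
)

# With a running MySQL server and a SparkSession, the read itself would be:
# df = spark.read.format("jdbc").options(**options).load()
# df.show(10)   # the equivalent of the PrintNRows node
```

With a live MySQL server and SparkSession, the commented-out read would return the employees table as a DataFrame.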

On interactively executing the ‘PrintNRows’ node, we see the output below.

Executing the workflow
When we execute the workflow, we get the results below.


Specifying a sub-query
In the configuration of the JDBC node, the ‘dbtable’ field accepts anything that is valid in the FROM clause of a SQL query. For example, instead of a full table name you can use a subquery in parentheses.
More details are available in the Spark programming guide: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#jdbc-to-other-databases

Above, we have specified a subquery which selects only the ‘first_name’ column from the employees table.
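Such a ‘dbtable’ value might look like the following. This is an illustrative sketch: the alias name ‘emp_names’ is arbitrary, but Spark does require a subquery used here to be parenthesized and aliased.

```python
# A subquery used as the 'dbtable' value instead of a plain table name.
# It must be wrapped in parentheses and given an alias ('emp_names' here
# is an arbitrary name chosen for this example).
dbtable = "(SELECT first_name FROM employees) AS emp_names"

# It is passed to the JDBC read exactly like a table name would be:
# df = spark.read.format("jdbc").option("dbtable", dbtable)...
```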
