Reading from a MySQL table

In Sparkflows, we can read data from relational stores.

 

Here we read the ‘employees’ table from MySQL into a DataFrame in Sparkflows and display the first 10 rows.

Installing MySQL

 

We first download and install MySQL from http://www.mysql.com/downloads/

Installing MySQL JDBC driver

  • Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/5.1.html
  • Extract the JDBC driver JAR file from the downloaded archive. For example:
    • tar zxvf mysql-connector-java-5.1.42.tar.gz
  • Copy the MySQL JDBC driver JAR file to the user-lib directory of fire-1.4.0:
    • cd fire-1.4.0    (if you are not already in that directory)
    • cp ...../mysql-connector-java-5.1.42/mysql-connector-java.jar user-lib

Load Test Data into MySQL

 

We load test data from https://github.com/datacharmer/test_db

 

git clone https://github.com/datacharmer/test_db.git
cd test_db
/usr/local/mysql/bin/mysql -uroot -p < employees.sql

 

This creates the ‘employees’ database with a few tables in it.

Check out the employees DB created

 

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| employees          |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)


 

The following tables have been created in the employees database.

 

mysql> use employees
Database changed
mysql> show tables;
+----------------------+
| Tables_in_employees  |
+----------------------+
| current_dept_emp     |
| departments          |
| dept_emp             |
| dept_emp_latest_date |
| dept_manager         |
| employees            |
| salaries             |
| titles               |
+----------------------+
8 rows in set (0.00 sec)


 

The employees table has 300,024 records:

 

mysql> select count(*) from employees;
+----------+
| count(*) |
+----------+
|   300024 |
+----------+
1 row in set (0.04 sec)

 

Reading the employees table into a Sparkflows workflow


We create the workflow below. It reads in the data from the ‘employees’ table and displays the first 10 rows.

The first node is the JDBC node. We provide the URL, DB Table and Driver.
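
As a rough equivalent, the JDBC node configuration maps to a Spark JDBC read like the sketch below. The host, port, user and password values are placeholders for your own MySQL instance; only the database and table names come from the example above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch: read the 'employees' table over JDBC into a DataFrame.
val sc = new SparkContext(new SparkConf().setAppName("ReadEmployees"))
val sqlContext = new SQLContext(sc)

val employeesDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/employees")   // placeholder host/port
  .option("dbtable", "employees")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")                                    // placeholder credentials
  .option("password", "password")
  .load()

// Equivalent of the PrintNRows node: display the first 10 rows.
employeesDF.show(10)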

On interactively executing the node ‘PrintNRows’, we see the below output.

Executing the workflow

 

When we execute the workflow, we get the below results.

Specifying a sub-query

 

In the configuration of the JDBC node, anything that is valid in the FROM clause of a SQL query can be used for ‘dbtable’. For example, instead of a full table you could also use a subquery in parentheses.

 

More details are available in the Spark SQL Programming Guide: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#jdbc-to-other-databases

Above we have specified a subquery which selects only the ‘first_name’ column from the employees table.
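
As a sketch, the same subquery can be passed as the ‘dbtable’ value in Spark code, reusing the sqlContext from the earlier example. The alias name ‘emp’ is arbitrary; a derived table just needs some alias.

// Instead of a table name, pass a parenthesized subquery with an alias as 'dbtable'.
val firstNamesDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/employees")   // placeholder host/port
  .option("dbtable", "(select first_name from employees) as emp")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")                                    // placeholder credentials
  .option("password", "password")
  .load()

firstNamesDF.show(10)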
