In-memory analytics with Tableau, SparkSQL, and MapR

by
April 16, 2015

Last week Tableau released version 9.0 of their data visualisation tool. From a Big Data point of view the nicest new feature was support for querying cached (in-memory) SchemaRDDs (Data Frames as of Spark 1.3).
[big_data_promotion]
In this tutorial I will show you how to connect to Spark 1.2.1 on the MapR 4.1 sandbox with Tableau 9.0.
Pre-requisites:
– Download the MapR 4.1 sandbox
– Install Spark standalone 1.2.1   and configure Hive on the sandbox
– Install Tableau Desktop. A free trial is available from Tableau
– Install the SparkSQL ODBC Driver from Databricks
Adjust Spark default memory

Change

Start Spark Slaves

Have a look at the Spark UI at http://maprdemo:8080/
image1
Start Spark Thrift Server on port 10001

Connect with Beeline to Spark Thriftserver
/opt/mapr/spark/spark-1.2.1/bin/beeline
!connect jdbc:hive2://maprdemo:10001
Use username to log on, e.g. mapr.
Leave Password blank 

 We will now load a JSON file as a temporary table into Spark
Copy file /opt/mapr/spark/spark-1.2.1/examples/src/main/resources/people.json onto the MapR-FS and create a temp table from it.

A temporary table persists for the lifetime of the Thrift Server.
It does not register with the Hive metastore and as a result won’t be listed when issuing show tables.
We can view execution details of create table statement at http://maprdemo:4040/jobs/
spark2
 
Next we cache the table in memory.

You can get confirmation at http://maprdemo:4040/storage/
spark3
If we want to register the table with the Hive metastore and want it to be accessible via Hive we need to create a permanent table.

 
spark4
 
Next we launch the Hive CLI and create a new table

spark5
We go back to beeline and cache tables people_hive and people_sum in memory:
cache table people_hive;
cache table people_sum;
 
spark7

Query in-memory tables in Tableau
We have three tables cached in memory. Now we can query them using Tableau. Select Spark SQL as your connection.
spark8

As expected we only see two of the three tables. Remember that table people is a temp table and has not been registered with Hive metastore. As a result Tableau does not know about it when retrieving metadata through the Spark thrift server. We could query it using the Initial SQL functionality in Tableau.
spark6
In a last step we create a Tableau worksheet and run some analysis.
spark9
What else is there?
You can remove a table from memory by using the uncache table command.
For tuning performance you can set various parameters such as applying in memory columnar compression to cached data.
In beeline or Initial SQL in Tableau use the following command:

[big_data_promotion]