In-memory analytics with Tableau, SparkSQL, and MapR

Uli Bethke Big Data, Hadoop, Hive, MapR, Spark, SparkSQL, Tableau

Last week Tableau released version 9.0 of their data visualisation tool. From a Big Data point of view the nicest new feature was support for querying cached (in-memory) SchemaRDDs (Data Frames as of Spark 1.3).

Teach me Big Data to Advance my Career

In this tutorial I will show you how to connect to Spark 1.2.1 on the MapR 4.1 sandbox with Tableau 9.0.

Pre-requisites:

- Download the MapR 4.1 sandbox

- Install Spark standalone 1.2.1   and configure Hive on the sandbox

- Install Tableau Desktop. A free trial is available from Tableau

- Install the SparkSQL ODBC Driver from Databricks

Adjust Spark default memory

Change

Start Spark Slaves

Have a look at the Spark UI at http://maprdemo:8080/

image1

Start Spark Thrift Server on port 10001

Connect with Beeline to Spark Thriftserver

/opt/mapr/spark/spark-1.2.1/bin/beeline
!connect jdbc:hive2://maprdemo:10001

Use username to log on, e.g. mapr.

Leave Password blank 

 We will now load a JSON file as a temporary table into Spark

Copy file /opt/mapr/spark/spark-1.2.1/examples/src/main/resources/people.json onto the MapR-FS and create a temp table from it.

A temporary table persists for the lifetime of the Thrift Server.

It does not register with the Hive metastore and as a result won’t be listed when issuing show tables.

We can view execution details of create table statement at http://maprdemo:4040/jobs/

spark2

 

Next we cache the table in memory.

You can get confirmation at http://maprdemo:4040/storage/

spark3

If we want to register the table with the Hive metastore and want it to be accessible via Hive we need to create a permanent table.

 

spark4

 

Next we launch the Hive CLI and create a new table

spark5

We go back to beeline and cache tables people_hive and people_sum in memory:

cache table people_hive;
cache table people_sum;

 

spark7

Query in-memory tables in Tableau

We have three tables cached in memory. Now we can query them using Tableau. Select Spark SQL as your connection.

spark8

As expected we only see two of the three tables. Remember that table people is a temp table and has not been registered with Hive metastore. As a result Tableau does not know about it when retrieving metadata through the Spark thrift server. We could query it using the Initial SQL functionality in Tableau.

spark6

In a last step we create a Tableau worksheet and run some analysis.

spark9

What else is there?

You can remove a table from memory by using the uncache table command.

For tuning performance you can set various parameters such as applying in memory columnar compression to cached data.

In beeline or Initial SQL in Tableau use the following command:

Teach me Big Data to Advance my Career