Last week Tableau released version 9.0 of its data visualisation tool. From a Big Data point of view the nicest new feature is support for querying cached (in-memory) SchemaRDDs (DataFrames as of Spark 1.3).
In this tutorial I will show you how to connect to Spark 1.2.1 on the MapR 4.1 sandbox with Tableau 9.0.
Prerequisites:
– Download the MapR 4.1 sandbox
– Install standalone Spark 1.2.1 and configure Hive on the sandbox
– Install Tableau Desktop. A free trial is available from Tableau
– Install the Spark SQL ODBC driver from Databricks
Adjust Spark default memory

cd /opt/mapr/spark/spark-1.2.1/conf
vi spark-env.sh

Change the following settings:

export SPARK_DAEMON_MEMORY=1g
export SPARK_WORKER_MEMORY=2g

Start Spark Slaves

/opt/mapr/spark/spark-1.2.1/sbin/start-slaves.sh

Have a look at the Spark UI at http://maprdemo:8080/
[Screenshot: Spark Master UI]
Start Spark Thrift Server on port 10001

/opt/mapr/spark/spark-1.2.1/sbin/start-thriftserver.sh --master spark://maprdemo:7077 --hiveconf hive.server2.thrift.bind.host=maprdemo --hiveconf hive.server2.thrift.port=10001

Connect to the Spark Thrift Server with Beeline
/opt/mapr/spark/spark-1.2.1/bin/beeline
!connect jdbc:hive2://maprdemo:10001
Log on with a username, e.g. mapr, and leave the password blank.

Connected to: Spark SQL (version 1.2.1)
Driver: null (version null)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://maprdemo:10001>

We will now load a JSON file as a temporary table into Spark.
Copy the file /opt/mapr/spark/spark-1.2.1/examples/src/main/resources/people.json onto MapR-FS and create a temporary table from it.
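The copy can be done with the standard hadoop CLI on the sandbox (or via the NFS mount). A minimal sketch, assuming the /apps/spark target directory does not yet exist:

hadoop fs -mkdir -p /apps/spark
hadoop fs -put /opt/mapr/spark/spark-1.2.1/examples/src/main/resources/people.json /apps/spark/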

create temporary table people
using org.apache.spark.sql.json
options (path 'maprfs://maprdemo/apps/spark/people.json');

A temporary table persists for the lifetime of the Thrift Server.
It does not register with the Hive metastore and as a result won’t be listed when issuing show tables.
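You can verify this from the same Beeline session: show tables only returns tables registered in the metastore, while the temporary table remains directly queryable. A quick check, assuming the people table created above:

show tables;
select name, age from people;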
We can view execution details of the create table statement at http://maprdemo:4040/jobs/
[Screenshot: Spark UI job details]
 
Next we cache the table in memory.

cache table people

You can get confirmation at http://maprdemo:4040/storage/
[Screenshot: Spark UI storage tab showing the cached table]
If we want to register the table with the Hive metastore and make it accessible via Hive, we need to create a permanent table.

create table people_hive as select * from people;

 
[Screenshot: creating the permanent table people_hive]
 
Next we launch the Hive CLI and create a new table

create table people_sum as select sum(age) from people_hive;

[Screenshot: Hive CLI creating table people_sum]
We go back to beeline and cache tables people_hive and people_sum in memory:
cache table people_hive;
cache table people_sum;
 
[Screenshot: cached tables in the Spark UI storage tab]

Query in-memory tables in Tableau
We have three tables cached in memory. Now we can query them using Tableau. Select Spark SQL as your connection.
[Screenshot: Tableau Spark SQL connection dialog]

As expected we only see two of the three tables. Remember that table people is a temporary table and has not been registered with the Hive metastore. As a result Tableau does not know about it when retrieving metadata through the Spark Thrift Server. We could still query it using the Initial SQL functionality in Tableau.
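For example, the statements below could be pasted into Tableau's Initial SQL dialog so that the temporary table is created and cached whenever Tableau opens the connection. This is a sketch, assuming the same file path used earlier:

create temporary table people
using org.apache.spark.sql.json
options (path 'maprfs://maprdemo/apps/spark/people.json');
cache table people;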
[Screenshot: Tableau table list showing people_hive and people_sum]
As a last step we create a Tableau worksheet and run some analysis.
[Screenshot: Tableau worksheet analysis]
What else is there?
You can remove a table from memory by using the uncache table command.
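For example, to drop the cached copy of our temporary table while keeping the table definition:

uncache table people;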
For tuning performance you can set various parameters, such as applying in-memory columnar compression to cached data.
In beeline or Initial SQL in Tableau use the following command:

set spark.sql.inMemoryColumnarStorage.compressed=true;
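Other settings can be adjusted the same way. For example, lowering the number of shuffle partitions (the default is 200) can speed up small aggregations on a single-node sandbox; this is a suggestion, not something the setup above requires:

set spark.sql.shuffle.partitions=10;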
Also Important
Sonra are based in Dublin and provide services and hardware around Big Data and Hadoop.
Make sure to come to our next Hadoop User Group Ireland meetup. We will talk about data lakes, puddles, and swamps.
