War of the Hadoop SQL engines. And the winner is …?

You may have wondered why we have been quiet over the last couple of weeks. Well, we locked ourselves into the basement and did some research, a couple of projects, and PoCs on Hadoop, Big Data, and distributed processing frameworks in general. We were also looking at clickstream data and web analytics solutions. Over the next couple of weeks we will update our website with our new offerings, products, and services. The article below summarises some of our research on using SQL on Hadoop.

I believe that one of the prerequisites for Hadoop to make inroads into the Enterprise Data Warehouse space is to have the following three items in place: (1) sub-second response times for SQL queries (often referred to as interactive or real-time queries), i.e. performance similar to existing MPP RDBMSs such as Teradata; (2) support for a rich SQL feature set; (3) support for Update and Delete DML operations. Currently, I don’t see any of the existing solutions ticking all of these boxes. However, we are getting closer and closer. This post will shed some light on the current status of SQL on Hadoop and offer my own recommendation on which of these solutions you should bet your house on.

Hive

Initially developed by Facebook, Hive is the original SQL framework on Hadoop. The motivation for developing Hive was to provide an abstraction layer on top of Map Reduce (M/R) to make it easier for analysts and data scientists to query data on the Hadoop File System. Rather than writing hundreds of lines of Java code to answer relatively simple questions, the objective was to offer SQL, the natural choice of the data analyst. While this approach works well in a batch-oriented environment, it does not perform well for interactive workloads in near real time. The problem with the original M/R framework is that it works in stages: at each stage the data is written to disk and then read from disk again in the next stage. In addition, the various stages cannot be parallelized. This is highly inefficient and the rationale for the Apache Tez project. Like M/R, Tez is a Hive execution engine; it is developed mainly by Hortonworks, with committers from Facebook, Microsoft, and Yahoo.
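
To get a feel for why this matters, the kind of aggregation that takes a fair amount of Java M/R boilerplate collapses into a few lines of HiveQL; the clickstream table and columns below are made up for illustration:

```sql
-- Hypothetical clickstream table; table and column names are illustrative only.
-- Behind the scenes Hive compiles this query into one or more M/R jobs.
SELECT  page_url,
        COUNT(*)                AS page_views,
        COUNT(DISTINCT user_id) AS unique_visitors
FROM    clickstream
WHERE   event_date = '2014-06-01'
GROUP BY page_url
ORDER BY page_views DESC
LIMIT 20;
```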

Hive on Apache Tez

Tez is part of the Stinger initiative, led by Hortonworks, to make Hive enterprise-ready and suitable for real-time SQL queries. The two main objectives of the initiative were to increase performance and to offer a richer set of SQL features such as analytic functions, query optimization, and standard data types such as timestamp. Tez is the underlying engine that creates more efficient execution plans than Map Reduce. The Tez design is based on research done by Microsoft on parallel and distributed computing. The two main objectives were delivered as part of the recent Hive 0.13 release. The roadmap for release 0.14 includes DML functionality such as Updates and Inserts for lookup tables.
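
As a small, hedged illustration of the richer SQL surface, the windowing/analytic functions delivered under the Stinger initiative let you express rankings directly in HiveQL (the table name is again hypothetical):

```sql
-- Rank pages by daily views with a windowing function (available since Hive 0.11).
SELECT  event_date,
        page_url,
        page_views,
        RANK() OVER (PARTITION BY event_date ORDER BY page_views DESC) AS view_rank
FROM    daily_page_views;
```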

Hive on Spark

Recently, Cloudera, together with MapR, Intel, and Databricks, spearheaded a new initiative to add Spark as a third Hive execution engine. Developers will then be able to choose between Map Reduce, Tez, and Spark as the execution engine for Hive. Based on the design document, the three engines will be fully interchangeable and compatible. Cloudera see Spark as the next-generation distributed processing engine, which has various advantages over the Map Reduce paradigm, e.g. intermediate resultsets can be cached in memory. Going forward, Spark will underpin many of the components in the Cloudera platform. The rationale for Hive on Spark, then, is to make Spark available to the vast number of Hive users and to establish Hive on the Spark framework. It will also allow users to run faster Hive queries without having to install Tez. Contrary to Hortonworks, Cloudera don’t see Hive on Spark (or Hive on Tez) as suitable for real-time SQL queries. Their flagship product for interactive SQL queries is Impala, while Databricks see Spark SQL as the tool of choice for real-time queries.
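
The execution engine is (and, in the case of Spark, is proposed to be) selected per session via the hive.execution.engine property. The sketch below is based on the design document, so treat the spark value as forward-looking rather than something you can run today:

```sql
-- Hive 0.13 ships with mr (the default) and tez; spark is the proposed third value.
SET hive.execution.engine=tez;
-- SET hive.execution.engine=spark;   -- once Hive on Spark is released
SELECT COUNT(*) FROM clickstream WHERE event_date = '2014-06-01';
```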

Cloudera Impala

Impala is a massively parallel SQL query engine. It is inspired by Google Dremel, the technology that underpins Google BigQuery.

Based on their own benchmarks Cloudera conclude that Presto and Hive Tez are not fit for purpose for interactive query loads. Cloudera see Hive as a batch processing engine. Of course, Hortonworks see this differently and they believe that Hive on Tez is also useful for interactive queries. The jury is out on this one.
You can install and test Impala as part of the Cloudera distribution. Cloudera have been accused of using Impala as a vehicle to lock customers into their own distribution. However, you can also download Impala from GitHub.

Presto

Facebook was and is a heavy user of Hive. However, for some of their workloads they required low-latency, interactive response times. This was the rationale behind Presto.

One of the advantages of Presto is that you can also query non-HDFS data sources such as an RDBMS. It seems to be relatively easy to write your own connectors.
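
Connectors are exposed to queries as catalogs, so a single Presto statement can join HDFS-resident data with an RDBMS. The catalog, schema, and table names in this sketch are purely illustrative:

```sql
-- Join an HDFS-resident Hive table with a MySQL table in a single Presto query.
-- hive and mysql are catalog names defined by the connector configuration.
SELECT   c.customer_name,
         SUM(o.order_amount) AS total_amount
FROM     hive.web.orders AS o
JOIN     mysql.crm.customers AS c ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
```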

Spark SQL and Shark

Spark is the new darling on the Big Data scene and widely seen as the replacement for Map Reduce. Originally developed by AMPLab at UC Berkeley, it is now developed by Databricks and also runs on Hadoop YARN. Various components ship with Spark: a micro-batch, near-real-time processing module (Spark Streaming), a machine learning library (MLlib), a graph processing component (alpha release), SparkR (alpha release), and, of interest for the purpose of this article, Spark SQL.

Spark SQL to some extent borrows from Shark, its predecessor, which was based on the Hive codebase but, similar to Tez, came with its own execution engine. The big advantage of Spark SQL over the other engines is that it is easy to mix machine learning with SQL. By the way, an alternative to this is the HiveMall machine learning library (unfortunately, there is very little documentation). This is similar to the in-database analytics offered by vendors such as Oracle and has the advantage that you don’t have to move the data around between different tools and technologies. While you can write resultsets back into Hive, in my opinion Spark SQL is currently not really an option for SQL-based ETL/batch work, as you would have to intermix it with Scala code to perform more complex transformations, which makes things somewhat ugly. So the primary use case is as an interactive query tool mixed with machine learning. It also does not offer a rich SQL feature set right now, e.g. analytic functions are missing. Like the other engines, Spark SQL has its own optimizer, named Catalyst. Based on performance benchmarks by Databricks, Spark SQL seems to trump its predecessor Shark in terms of performance (last slide in the deck).

Apache Drill

Similar to Impala, Apache Drill is another MPP SQL query engine inspired by the Google Dremel paper. Apache Drill is mainly backed by MapR. At the moment it is in alpha release. Together with Spark SQL, it is, at the time of writing, the least mature SQL solution on Hadoop. As outlined by MapR, Apache Drill will be available in Q2 2014.

Similar to Presto, Apache Drill will also support non-Hadoop data sources.

InfiniDB

InfiniDB is rather different from any of the other SQL engines on Hadoop. I want to include it here as it is an interesting product that we will hear more about in the future. InfiniDB is an open source MPP columnar RDBMS. As such it falls more into the category of the likes of Amazon Redshift, Teradata, or Vertica. Unlike its competitors, it allows its data to sit on the Hadoop File System (HDFS). The only, and pretty fundamental, caveat is that you have to load the data into InfiniDB’s proprietary data format. It currently does not support popular data serialization formats such as Parquet or Sequence Files. This may be added in a future release, but it would likely come at a performance cost. I will keep an eye on this product as it is an excellent open source alternative to MPPs such as Teradata, Vertica, Netezza etc. However, the data duplication issue is a problem if you have subscribed to the paradigm of bringing the processing to the data rather than the other way around. On the other hand, InfiniDB (as you would expect from an MPP RDBMS) supports Updates and Deletes.
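
InfiniDB exposes a MySQL front end, so getting data into its columnar format roughly amounts to creating a table with the InfiniDB storage engine and bulk loading it. The following is an assumption-laden sketch with made-up names rather than verified product syntax:

```sql
-- Assumed InfiniDB DDL via its MySQL front end; all names are illustrative.
CREATE TABLE clickstream (
  event_date DATE,
  user_id    BIGINT,
  page_url   VARCHAR(500)
) ENGINE=InfiniDB;

-- Bulk loads then go through InfiniDB's cpimport utility into its own columnar
-- format, rather than pointing the table at existing Parquet or Sequence Files.
```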

Others

There are various other SQL engines on top of Hadoop, including Cascading Lingual (a SQL abstraction layer on top of Cascading), Hadapt, and various other commercial products. Another open source solution is Apache Tajo.

Benchmarks

This benchmark by Cloudera compares Impala to Shark (disk and memory), Hive on Tez (0.13), and Presto. Unsurprisingly, Cloudera Impala scores best here :-).
This benchmark by Cloudera compares Impala to Hive (0.12, not 0.13) and to an unnamed MPP RDBMS. Surprise surprise, Cloudera Impala scores best here :-).
This benchmark by InfiniDB compares InfiniDB to Presto, Impala, and Hive on M/R. For all workloads InfiniDB is the performance winner.
This benchmark by AMPLab at UC Berkeley compares Redshift to Hive on M/R, Hive on Tez, Impala, and Shark. The performance winner for most workloads is Amazon Redshift.
This benchmark by Hortonworks compares performance between Hive on M/R (0.10) and Hive on Tez (0.13). Interestingly, there is no comparison to other Hadoop SQL engines. You can draw your own conclusions as to why this is.
This benchmark by Gruter compares Hive on M/R to Impala and Apache Tajo.

Conclusion and Recommendation

As of this writing the most mature product with the richest feature set is Apache Hive (running on Tez). Crucially, it offers analytic functions, support for the widest set of file formats, and ACID support (full support in release 0.14).

As of the current release, Impala lacks important SQL features. However, this is about to change in Impala 2.0.
Once it has matured Hive on Spark should be a very good alternative to Hive on Tez.
While Hortonworks claims that Hive can be used for interactive queries, Cloudera questions this. The various benchmarks are not conclusive. As always, you should test for yourself whether Hive is suitable for real-time queries for your workload and use case.
All of the different solutions follow a similar approach: they first create a logical query plan as a Directed Acyclic Graph (DAG). This is then translated into a physical execution plan, and the various components and operators of that plan are executed in a distributed fashion (see the EXPLAIN sketch after these points).
There are various benchmarks out there which suggest that Impala is the fastest for various workloads. However, I wouldn’t trust any of these too much and would suggest you perform your own benchmarks for your specific workload.
Spark SQL looks very promising for use cases where you want to use SQL to run machine learning algorithms (similar to in-database analytics, e.g. in Oracle). As an alternative you could look at using HiveMall. It also looks promising for interactive SQL.
Performance benchmarks suggest that none of the Hadoop SQL execution frameworks currently match the performance of an MPP RDBMS such as InfiniDB, Amazon Redshift, or Teradata. One Cloudera benchmark suggests otherwise; however, it is criticised for not implementing the full TPC-DS benchmark, among various other items, and as a result is somewhat questionable. This does not really come as a surprise, as decades of experience have gone into these relational engines.
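
If you want to look at these plans yourself, most of the engines expose them through an EXPLAIN statement. The following is a minimal Hive example against a hypothetical table:

```sql
-- Inspect the plan Hive builds for a query; on Tez the output shows a DAG of
-- vertices rather than a chain of M/R stages.
EXPLAIN
SELECT   page_url,
         COUNT(*) AS page_views
FROM     clickstream
GROUP BY page_url;
```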

So what to do? Right now I would run both batch-style queries (ETL) and interactive queries on Hive on Tez, as Hive offers the richest SQL feature set, especially analytic functions, and supports a wide set of file formats. If you don’t get satisfactory performance for your real-time queries you may want to look at some of the other engines. Impala is a mature solution. However, it lacks support for analytic functions, which are crucially important for data analysis tasks; analytic functions will be added to the next release of Impala, though. Another option is Presto, which offers this feature set. At this stage Spark SQL is only an alpha release and does not yet look very mature, especially in terms of SQL features. However, it is quite promising for in-database style machine learning and predictive analytics (bring the processing to the data rather than the data to the processing). Apache Drill is also only an alpha release and may not be mature enough for your use case.

If I had to bet my house on which of these solutions will prevail, I would put it on a combination of Hive on Spark (for batch ETL) and Spark SQL (for interactive queries and in-database style machine learning and predictive analytics) to cover all use cases and workloads. If Spark SQL matures further in terms of the SQL feature set (analytic functions etc.) and allows for ETL based on the SQL paradigm, I would exclusively put my money on it.

Real Time BI Podcast on Oracle Data Integrator 12c. Part II.

In the second part of the series we cover:

More discussion on ODI vs Informatica
More on migrating from OWB to ODI
Using ODI outside the data warehouse (BI Apps)
ODI in the cloud
ODI and Big Data

Big Data Presentation

The Big Data presentation I gave yesterday is now available for download. In this presentation I define some common features of Big Data use cases, explain what the big deal about Big Data actually is, and explore the impact of Big Data on the traditional data warehouse framework.

Real Time BI Podcast on Oracle Data Integrator 12c. Part I.

I recently did a podcast with Stewart Bryson (Chief Innovation Officer RittmanMead), Kevin McGinley, and Alex Shlepakov (both Oracle Analytics at Accenture).

In the first part of this two part series we cover the following areas:

ODI 12c. What are the advantages? When should you upgrade?
Migration from OWB to ODI 12c. Should you migrate? How and when?
Comparison of ODI to Informatica and other ETL tools.
ETL style vs. ELT style data integration tools.
ODI, ETL, and data integration in the cloud.

What’s the Big Deal about Big Data? Hear me speak at OUG Ireland. 11 March 2014. Convention Centre Dublin.

So what’s the Big Deal about Big Data? Oil fuelled the Industrial Revolution. Data will fuel the Information Revolution.

Not convinced? Did you know that Amazon has recently patented a technology, based on a Big Data algorithm, that will start the shipping process before you have completed your order? That’s right. Amazon knows that you will buy some stuff from their website before you know it yourself. How amazing or scary is that?

In my upcoming presentation on 11 March in the Convention Centre in Dublin I will explore this topic further and talk about:

- What is Big Data (and what it is not)?
- Some interesting use cases to show you what is possible.
- Why the traditional data warehouse framework and technology don’t work for Big Data.
- Big Data architecture for the Information revolution.
- The Oracle Big Data armoury

Registration for the event is now open.

Hope to see you there and talk about your Big Data project.

ODI 11g Repository Book Out Now

The book is for free. All you need to do is contact us. We will then send you a PDF version of the book. All we ask you to do is to permanently link from your blog, website, Facebook etc. to our site.

Alternatively, you can buy the book on Amazon.com. You can get the Kindle edition for $9.99. Send us your Amazon receipt for the PDF version, in which the submodels scale better.

You may be asking yourself why we have written a reference book for the ODI repository. Why not use the excellent ODI SDK to query that information? While the SDK is great at automating certain repetitive tasks, it is not so good at retrieving information (in bulk) from the repository. Most ODI developers are familiar with SQL; not too many of us have a background in Java programming.

Some of the use cases for this book

- Extract information from the Operator log for reporting
- Impact analysis, e.g. find all interfaces a datastore is used in.
- Identify objects for deployment
- Identify objects that have recently changed
- Error message extraction
- Automated code checklist

We have split the book into two sections. The first section covers the most important subject areas of the master repository, the second section covers the most important subject areas of the work repository. Each subject area comes with a submodel, a list of relevant tables and corresponding ODI screens, and most importantly one or more useful pieces of SQL code to query the subject area in question. We have also made the models of the subject areas available online on our website http://sonra.io/odi11rep.
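
To give a flavour of the kind of SQL the book contains, the sketch below pulls recent sessions from the Operator log. I am assuming the familiar SNP_SESSION table and column names and an Oracle-hosted work repository here, so verify them against your own repository release:

```sql
-- List the sessions of the last seven days from the Operator log
-- (ODI 11g work repository on an Oracle database; column names assumed).
SELECT   sess_no,
         sess_name,
         sess_beg,
         sess_end,
         sess_status
FROM     snp_session
WHERE    sess_beg >= SYSDATE - 7
ORDER BY sess_beg DESC;
```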

In the book we cover the following subject areas

Physical & Logical Architecture
Repository
Link Master and Work Repository
Solution
Version
Internal Tables
Data Store
Interface Mapping
Interface Flow
Interface Knowledge Module Options
Interface Model
Interface Clause (Where & Join)
Session (Operator Log)
Package
Model Hierarchy
Project Hierarchy

ODI Training. Learn ODI from the experts.

You may also be interested in our ODI training courses. Value-priced. Customised. Onsite-Offsite. Online. Get all the details from our ODI training site

Oracle Endeca Demo Environments

Just a quick note to let everyone know that the latest version of Endeca is available as a demo environment.

The environment is hosted by Oracle and can be accessed with the credentials below.

URL: http://slc02oku.oracle.com:7101/eid/web
Userid = publicuser@oracle.com
Password = Admin123

Note: As the demo environment runs on port 7101 you may have problems accessing it from inside your corporate environment as your firewall will likely block this port.

Need ODI Training? Learn Oracle Data Integrator from the experts.

I am proud and delighted to announce that we are now offering ODI classes.

Learn from a combined 18+ years of hands-on ODI experience and get all of these benefits.

Top Quality

Choose world class ODI training from THE ODI gurus. Don’t settle for less.

Value-Priced

Unrivaled in quality AND affordable at the same time.

Training world-wide

On-site or via our state of the art virtual classrooms.

FREE consulting

Get free consulting with our aftercare consulting package.

Tailor-made Training

Tell us your requirements and we will put together a custom course for you.

Mix and Match Modules

Mix and match standard modules. Combine them with custom modules.

For more information on course content and how to book your ODI course visit our ODI training page.

The Oracle Data Integrator 12c Masterstroke

Visual data flows and OWB to ODI 12c migration path

In their latest release of Oracle Data Integrator (ODI 12c), Oracle have addressed two main concerns of the analysts and some of their customers and partners. The first one is the unclear migration path from Oracle Warehouse Builder to Oracle Data Integrator. Another frequent concern was that the declarative design approach based on reusable code templates (knowledge modules) was not visual enough and scored badly in this category against other ETL tools.

People were rightly raising concerns that complex queries could not be implemented through the ODI user interface. I had always wondered how Oracle would address those two issues. In what has to be labeled a masterstroke by the ODI product and development teams, they were able to kill two birds with one stone. And two big birds those are.

As a side note, I have always wondered why ODI’s declarative design approach has never really made it onto the analysts’ lists of strengths for ODI. To me, this metadata-driven design approach, which makes extreme reusability possible and significantly reduces development time, is the core innovation and ultimate strength of Oracle Data Integrator.

Declarative Flow-Based User Interface

In ODI 12c, Oracle have introduced a feature that they call Declarative Flow-Based User Interface. Don’t be distracted by the name. What it means is that we can now create data flows (Interfaces were re-named to Mappings) that combine the two approaches: the visual approach of Mappings and Operators that we already know from Oracle Warehouse Builder and the powerful declarative design approach we love from Oracle Data Integrator.

With this new paradigm you essentially get the best of both worlds: the ease of visual data flow development and the power of the declarative design approach. No other vendor can offer anything comparable. When would you combine the two approaches? For simple requirements I would recommend sticking to the traditional ODI approach. However, when you have complex requirements, e.g. you need to pre-aggregate data or filter a recordset based on the application of an analytic function, the new feature comes in handy.
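
To make the second case concrete, the SQL pattern such a mapping has to generate looks roughly like the inline-view query below (Oracle syntax, made-up table names):

```sql
-- Keep only the most recent order per customer: the analytic function has to be
-- evaluated in an inline view before the filter on its result can be applied.
SELECT *
FROM  (SELECT o.*,
              ROW_NUMBER() OVER (PARTITION BY customer_id
                                 ORDER BY order_date DESC) AS rn
       FROM   orders o)
WHERE rn = 1;
```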

At the moment we don’t have all of the Operators available that we know from OWB, e.g. I am missing the Pivot and Unpivot Operators, which means that complex queries requiring these can still only be generated using workarounds such as Views or similar. What would be handy is an Operator SDK that would allow you to create GUI representations of database-specific SQL dialects, e.g. I love subquery factoring and it would be handy to have an Operator for it, or one for recursive queries etc.

ODI 12c Mapping: Visual Design with new Operators

OWB to ODI 12c migration

The introduction of OWB-style Mappings should also facilitate the migration from Warehouse Builder to ODI 12c. The migration tool is still in beta. While it remains to be seen how well the migration tool will work, ODI 12c for the time being ships with the OdiStartOwbJob tool, which allows you to kick off OWB jobs through ODI and store the execution results in the ODI Operator log. If you have an immediate requirement to migrate from OWB to ODI, contact us for advice.

Other new features

Another interesting feature in ODI 12c is the ability to run Mappings in parallel. In practice this means that each of the temporary tables gets its own unique name, avoiding clashes. In the past you had to apply workarounds.

There have also been enhancements to the Knowledge Module editor. For those of you writing a lot of your own Knowledge Modules this is welcome news.

ODI 12c now also has much tighter integration with Oracle GoldenGate. A welcome introduction for those of you running near-real-time data integration projects (another concern of analysts and some customers).

ODI 12c Gaps

While ODI 11g brought us the SDK to automate and script common tasks, ODI 12c now brings us visual Mappings, giving us new options to create data flows. Oracle Data Integrator remains the leading ETL tool on the market, and with ODI 12c it has extended this lead even further. Why Gartner don’t rate it above DataStage or PowerCenter in their Magic Quadrant is a mystery to me.

One or two weaknesses remain. ODI 12c does not yet integrate with source control systems out of the box. There are also no automated deployment options. While this functionality can be scripted, it takes significant effort to do so. I am currently in the process of testing such an in-house developed solution. If you would like our advice on how source control integration and automated deployments can be achieved, get in touch.

While ODI 11g brought us improvements in the way we schedule and orchestrate data flows, I am a strong advocate of dependency-driven data flow execution. It is so much simpler and more efficient than hard-coding data orchestration routines. An enterprise solution with several thousand data flows will not work smoothly without a dependency-driven scheduler. If you would like to find out more about our solution, get in touch. You can also download our presentation ODI Scheduler – Source Control – Performance, which covers both source control and dependency-driven scheduling.

Resources

The ODI 12c Developer Guide: http://docs.oracle.com/middleware/1212/odi/ODIDG/index.html

New Features of ODI 12c: http://docs.oracle.com/middleware/1212/odi/ODIDG/whatsnew.htm#sthref3

Download Oracle Data Integrator 12c: http://www.oracle.com/technetwork/middleware/data-integrator/downloads/index.htm

Install tip ODI Studio: https://blogs.oracle.com/dataintegration/entry/odi_12c_installing_odi_studio

ODI Training. Learn ODI from the experts.

You may also be interested in our ODI training courses. Value-priced. Customised. Onsite-Offsite. Online. Get all the details from our ODI training site

ODI 11g Cookbook – The leading ETL tool now also has the best ETL book

I don’t know of too many books or movies where the sequel is better than the original. The ODI 11g Cookbook is such a rare case. It is stuffed with 60 valuable recipes that every ODI developer should know. Apart from the value of these recipes in their own right, they also showcase the flexibility of ODI and can be transferred to other problems.

My favourite chapters are: Knowledge Module Internals, Advanced Coding Techniques, Advanced Topology, and Using Variables.

A big thank you to the ODI product management team for sharing their insights.