What makes MapR superior to other Hadoop distributions?

These days Hortonworks with their IPO and Cloudera sitting on $1bn of cash grab all the headlines. However, the real visionary in the field is someone else. Someone blasting the previous world record in TeraSort. A Hadoop distribution that runs on both Amazon Web Services and the Google Compute Engine. A company that Google has invested in. While their competitors have been in skirmishes with each other, MapR has been quietly working away and innovating.

MapR-FS: Features and benefits compared to HDFS

The key innovation underpinning many of the other cool features of the MapR distribution, and the subject of today’s blog post, is MapR-FS, the proprietary file system of the MapR distro. It allows for random writes! Yes, you have read correctly. You can update a file sitting on MapR-FS. While HDFS is a rather limited file system that only lets you append data to a file, MapR-FS can write to a file at any offset.
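To make this concrete, here is a minimal sketch of a random write against an existing file once the cluster is NFS-mounted on a client machine (more on the NFS mount below). The mount point /mapr/my.cluster.com and the file path are assumptions for illustration; the seek-and-overwrite step in the middle is exactly what HDFS does not allow.

```python
import os

# Assumed NFS mount point of the MapR cluster on this client machine;
# adjust to wherever your cluster is actually mounted.
path = "/mapr/my.cluster.com/user/etl/events.log"

# Open an existing file for reading and writing without truncating it.
fd = os.open(path, os.O_RDWR)
try:
    # Jump to byte offset 1024 and overwrite data in place.
    # On HDFS this is impossible: you can only append to the end of a file.
    os.lseek(fd, 1024, os.SEEK_SET)
    os.write(fd, b"corrected record\n")
finally:
    os.close(fd)
```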

This innovation allows for all sorts of cool things.

  • File system metadata is distributed (think of it in terms of many mini name nodes). No central name node is needed. This eliminates name node bottlenecks.
  • MapR-FS is written in C. No JVM garbage collection choking.
  • NFS mount. You can mount MapR-FS locally via NFS and read directly from it or write directly to it.
  • MapR-FS implements the POSIX API. There is no need to learn any new commands. Your Linux administrator can apply existing knowledge to navigate the file system, e.g. to view the contents of a file on MapR-FS you can just use tail <file_name>.
  • While MapR-FS is proprietary, it is compatible with the Hadoop API. You don’t have to rewrite your applications if you want to migrate to MapR. hadoop fs -ls /user on MapR-FS works the same as ls /user.
  • You can load data into the file system directly. There is no need to set the data down on the local file system first. Guess what? With NFS mounts there is no distinction between MapR-FS and the local filesystem; MapR-FS in a way is the local filesystem. No additional tools such as Flume are needed to ingest data (see the sketch after this list).
  • True and consistent snapshots. Run point-in-time queries against your snapshots.
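As promised above, here is a minimal sketch of what direct ingestion over the NFS mount looks like in practice. The mount point /mapr/my.cluster.com and the landing directory are assumptions for illustration; the point is that an ordinary file copy is all it takes.

```python
import shutil

# Local extract produced by some upstream system.
local_file = "/tmp/daily_extract.csv"

# Target location on the cluster, visible as an ordinary path through the
# assumed NFS mount point /mapr/my.cluster.com.
cluster_file = "/mapr/my.cluster.com/data/landing/daily_extract.csv"

# A plain file copy is all that is needed -- no Flume agent and no
# intermediate staging on an edge node.
shutil.copyfile(local_file, cluster_file)
```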

 

One other manifestation of the power of MapR-FS is the fact that an RDBMS such as the Vertica MPP engine can run directly against files stored on MapR-FS. Unthinkable for HDFS. Other offerings that claim to run their MPP RDBMS against Hadoop often just have a connector that hooks into HDFS and then copies the data into their own storage layer or creates a copy of the data on HDFS. Not exactly what I mean by running your data warehouse on top of Hadoop. Similarly, Lucidworks, a Solr-based Enterprise Search solution, lets you run its search indexes directly atop MapR-FS.

There are other unique features that set MapR apart from the other distributions: MapR-DB, Volumes & Multi-Tenancy, Enterprise Security, Drill (the latest innovation), and many more.

Conclusion

In summary, MapR-FS brings us one step closer to the vision of bringing the processing to the data rather than the other way around. Running the MapR distribution gives you a true competitive advantage. If your only use case is batch and you are just appending data, you are probably fine with HDFS and the other distros. However, if you have a requirement to run mixed workloads and have different use cases, you need something that is more flexible. While vanilla Hadoop ticks some of the boxes such as NFS mounting and snapshots, these had to be implemented as workarounds due to the limitations of HDFS described above. With MapR-FS and the MapR Hadoop distribution you are ready to enter the next stage of Big Data processing.

The MapR Hadoop distribution really rocks. We are very excited about this product!

Download the Community edition.

Download the MapR sandbox.

Other useful links discussing MapR-FS:

Blog Post: Comparing MapR-FS to HDFS.
Blog Post: Get Real with Hadoop: Read-Write File System.
MapR-FS White Paper.
Video: Comparison of MapR-FS and HDFS.
Video: Comparing MapR-FS and HDFS: NFS and Snapshots.
Video: MC Srivas at Hadoop Summit 2011: Design, Scale and Performance of MapR’s Distribution for Apache Hadoop.

Contact us for a demo of MapR.

Book Review: Predictive Analytics Using Oracle Data Miner

My friend and colleague, ACE Director Brendan Tierney, has recently published the reference book Predictive Analytics Using Oracle Data Miner. It is the first comprehensive book on the subject. The book is primarily aimed at the Oracle Data Scientist/Data Miner. The other target audience is Oracle developers who implement the data mining models created by the data scientists in their applications, e.g. OBIEE. Some of the areas covered are also relevant for Oracle DBAs as they represent typical tasks a DBA would perform.

While I already have a high-level understanding of predictive analytics, I was hoping that the book would give me some new insights. Here are my impressions.

In the first chapter, Brendan provides us with an overview of the various data mining options in Oracle. We learn that one of the key differentiators of Oracle Data Mining from other tools is that it runs inside the Oracle database. This has the advantage that the data resides in one place and does not have to be moved into desktop-style tools. Brendan also stresses the point that data mining has been around for decades and does not necessarily require Big Data-sized volumes to derive insights.

In chapter two, Brendan introduces us to the data mining lifecycle and methodology. He discusses CRISP-DM in detail. We learn that this is the most widely used data mining lifecycle. The key points in this chapter are, on the one hand, that the lifecycle is iterative and as such lends itself to an agile methodology. On the other hand, it is important to have well-defined business questions and objectives. Equally important are data profiling and data preparation.

Chapters three and four then talk about how to install, set up, and use the Oracle Data Miner client tools. It gets more interesting again in chapters five (Exploring your Data) and six (Data Preparation). We learn that both of these stages are extremely important for the success of a data mining project and take up significant time, as data needs to be profiled, cleansed, and put into a suitable format, relevant features (data mining speak for attributes) need to be extracted, and new features need to be derived to feed the algorithms. Some of the nice features that ODM provides are Automatic Feature Selection and Automatic Data Preparation (ADP). Automatic Feature Selection lets ODM algorithmically determine which features of your data set are relevant for your supervised machine learning algorithms.

In chapters seven to eleven Brendan introduces us to the various types of machine learning and algorithms that are supported through the Oracle Data Miner GUI. These include supervised algorithms (where a target variable exists) and unsupervised algorithms, covering Association Rules, Classification, Clustering, Regression, and Anomaly Detection. For each type of analysis Brendan gives us some high-level background information, the use cases, and how to build, evaluate, and deploy the model. While chapters four to eleven show us how to conduct data mining using the Oracle Data Miner GUI, chapters twelve to eighteen guide us on how to achieve the same using SQL or PL/SQL. Similarly, these chapters cover data preparation, Association Rules, Classification, Clustering, Regression, and Anomaly Detection.
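To give a flavour of the SQL route described in those later chapters, here is a minimal sketch that scores new records from Python using Oracle’s built-in PREDICTION and PREDICTION_PROBABILITY SQL functions. The connection string, the table, and the model name CLAS_DECISION_TREE are assumptions for illustration; building such a model is exactly what the book walks you through.

```python
import cx_Oracle

# Assumed connection details -- replace with your own.
conn = cx_Oracle.connect("dmuser/secret@dbhost:1521/orcl")
cur = conn.cursor()

# Score customers with a classification model that is assumed to already
# exist (e.g. built in Oracle Data Miner or via DBMS_DATA_MINING).
cur.execute("""
    SELECT cust_id,
           PREDICTION(CLAS_DECISION_TREE USING *)             AS predicted_class,
           PREDICTION_PROBABILITY(CLAS_DECISION_TREE USING *) AS probability
    FROM   mining_data_apply_v
    WHERE  ROWNUM <= 10
""")

for cust_id, predicted_class, probability in cur:
    print(cust_id, predicted_class, round(probability, 3))

cur.close()
conn.close()
```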

The last two chapters are dedicated to model deployment and how you can make use of the models in tools such as OBIEE dashboards.

Predictive Analytics Using Oracle Data Miner is an excellent introduction to the world of data mining in general and data mining on Oracle in particular. You don’t need any prior knowledge to read this book. Brendan manages to explain all of the relevant points really well. Throughout the book he gives invaluable advice and tips and tricks that will allow you to quickly master data mining on Oracle. After reading the book you will have an in-depth understanding of how to apply machine learning to various use cases using Oracle Data Miner. The book also introduces us to the types of data mining and algorithms that are supported in the tool. If you want to learn more about the various machine learning algorithms mentioned, I would recommend reading it side by side with a data mining reference book such as Data Mining Techniques by Linoff/Berry. Having read Predictive Analytics Using Oracle Data Miner I feel confident that I can successfully implement a first data mining project on my own. I wish everyone happy hunting for signals and insights.

Oracle Data Integrator (ODI) Architecture Review Service.

There are various tell-tale signs that something is not quite right with your Oracle Data Integrator implementation. Does your ODI architecture suffer from the following symptoms?

  • The ETL seems to take forever
  • Some data flows take more than 30 minutes
  • Your developers take ages to implement new data flows or change existing mappings
  • The ETL breaks at least once a week
  • No meaningful and consistent set of naming standards and coding conventions has been implemented
  • The Knowledge Modules that you use have NOT been modified and adapted to your specific requirements
  • You are using more than 7 Knowledge Modules in your project



These are indications that something is going terribly wrong with your ODI implementation.

Get an ODI Architecture Review Quote

Sonra now offer an ODI architecture review service. We health-check your ODI implementation and put it through its paces.

The architecture review typically takes three days. On the first two days we look under the covers of your ODI implementation. We then compile our findings and actionable recommendations in a RAG (red/amber/green) report and present the detailed results back to you on the third day. We offer this service worldwide, on site or remotely. Talk to us about your specific requirements.

The following items are included in the standard audit:

  • ODI Extract Strategy
  • ODI Integration Strategy
  • ODI Load Strategy
  • ODI Data Quality Strategy (Error Hospital)
  • ODI Data Change Audit Strategy (History of records)
  • ODI agent location, type, and sizing
  • ODI Naming Standards
  • ODI Mappings audit
  • ODI Code checklist
  • ODI Coding Standards
  • ODI Deployment Strategy
  • ODI Integration Source Control
  • ODI Scenario Scheduling
  • ODI Parallelism
  • ODI Performance



Contact us if you are interested in our ODI architecture review. We can then schedule a call for an introduction and discuss your specific requirements in more detail.

Get an ODI Architecture Review Quote

Big Data 2.0 and Agile BI at the Irish BI OUG (24 September).

I will give a presentation on 24 September at the Jury’s Inn in Dublin on the next generation of Big Data 2.0 tools and architecture.

Over the last two years there have been significant changes and improvements in the various Big Data frameworks. With the release of YARN (Hadoop 2.0), the most popular of these platforms now allows you to run mixed workloads. Gone are the days when Hadoop was only good for batch processing. We can now ingest and query data in realtime, run distributed versions of various machine learning algorithms, analyse social graphs, and do Enterprise Search. What exactly are these tools? How will they impact Enterprise Information Architecture over the next couple of years? And what does Oracle offer in this space? These are the three questions we will answer in this session.

Lawrence Corr, author of the best-selling book Agile Data Warehouse Design, will give a presentation about Agile BI.

Get the detailed agenda and register at the UKOUG website.

Hope to see you there.

Oracle Data Integrator and Hadoop. Is ODI the only ETL tool for Big Data that works?

Both ODI and the Hadoop ecosystem share a common design philosophy: bring the processing to the data rather than the other way around. Sounds logical, doesn’t it? Why move terabytes of data around your network if you can process it all in one place? Why invest millions in additional servers and hardware just to transform and process your data?

In the ODI world this approach is known as ELT. ELT refers to the fact that data transformations are performed in the processing engine where the data already resides, rather than moving the data around for transformation. It has underpinned the product since its inception.
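A minimal sketch of the idea: rather than pulling rows out to a separate ETL server, transforming them there, and loading them back, an ELT tool generates a single set-based statement and lets the engine that already holds the data do the work. The connection details and table names below are illustrative assumptions, not ODI-generated code.

```python
import cx_Oracle

# Assumed warehouse connection -- replace with your own.
conn = cx_Oracle.connect("dwh_user/secret@dbhost:1521/dwh")
cur = conn.cursor()

# The whole transformation (joins, filters, derivations) runs inside the
# database that already holds the data. No rows travel across the network
# to an ETL server and back.
cur.execute("""
    INSERT INTO fact_sales (order_id, customer_key, sale_date, amount_eur)
    SELECT o.order_id,
           c.customer_key,
           TRUNC(o.order_ts),
           o.amount * fx.eur_rate
    FROM   stg_orders o
    JOIN   dim_customer c  ON c.customer_src_id = o.customer_id
    JOIN   stg_fx_rates fx ON fx.currency = o.currency
    WHERE  o.order_ts >= TRUNC(SYSDATE)
""")

conn.commit()
cur.close()
conn.close()
```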

While other ETL tools such as Informatica now also offer some pushdown functionality (e.g. Hive pushdown), it is not in the DNA of these tools or companies to do so. Traditionally, these tools settled for a completely different approach, and the problems of this are now showing more than ever before. It is hard for these vendors to work around their original design philosophy. Let me compare this to Microsoft and Google. While the latter has the Internet and Big Data in its DNA as a company, the former doesn’t, and Microsoft is throwing huge resources at the problem without being overly successful at closing the gap. Let me put it another way: why settle for the copy if you can get the real thing?

The advantage of ODI over traditional ETL tools doesn’t stop there. ODI has a concept of reusable code templates aka Knowledge Modules. This metadata-driven design approach encapsulates common data integration strategies such as timestamp-based extracts, data merging, auditing of changes, truncate loads, parking defective records in an error hospital etc. and makes them available for reuse. This can result in ETL developer productivity gains of more than 40%.
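The template idea is easy to illustrate outside of ODI. The sketch below is not ODI’s Knowledge Module mechanism, just a hedged analogy in plain Python: one parametrised incremental-merge pattern, written once, that can be reused across many mappings by binding different table and column names.

```python
from string import Template

# A simplified stand-in for a Knowledge Module: the integration strategy
# (here an incremental merge) is written once and parametrised.
INCREMENTAL_MERGE = Template("""
MERGE INTO $target tgt
USING $staging src
   ON (tgt.$key = src.$key)
WHEN MATCHED THEN UPDATE SET $update_set
WHEN NOT MATCHED THEN INSERT ($insert_cols) VALUES ($insert_vals)
""")

# Reusing the strategy for a specific mapping is just parameter binding.
sql = INCREMENTAL_MERGE.substitute(
    target="dim_customer",
    staging="stg_customer",
    key="customer_id",
    update_set="tgt.name = src.name, tgt.city = src.city",
    insert_cols="customer_id, name, city",
    insert_vals="src.customer_id, src.name, src.city",
)
print(sql)
```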

What will the future of data integration on Hadoop look like? At the moment a lot of the ETL is still hand-written as custom Map Reduce jobs. As SQL engines on Hadoop reach a higher level of maturity, they will become the vehicle for 90%+ of data transformation flows on Big Data. Only for very specific use cases where performance is the highest priority will we see custom coding on Spark, Map Reduce etc. Based on the underlying design principles, Oracle Data Integrator is a perfect match for Hadoop.

Coming back to my question in the headline. Yes, I believe that Oracle Data Integrator really is the only ETL and data integration tool that is fit for purpose for Big Data workloads.

If you are planning to run a Big Data project, an ODI implementation, or both then get in touch with us. Why settle for second best if you can get the ODI and Big Data experts?

War of the Hadoop SQL engines. And the winner is …?

You may have wondered why we have been quiet over the last couple of weeks. Well, we locked ourselves into the basement and did some research, a couple of projects, and some PoCs on Hadoop, Big Data, and distributed processing frameworks in general. We were also looking at Clickstream data and Web Analytics solutions. Over the next couple of weeks we will update our website with our new offerings, products, and services. The article below summarises some of our research on using SQL on Hadoop.

I believe that one of the prerequisites for Hadoop to make inroads into the Enterprise Data Warehouse space is to have the following three items in place: (1) sub-second response times for SQL queries (often referred to as interactive or realtime queries), i.e. performance similar to existing MPP RDBMSs such as Teradata; (2) support for a rich SQL feature set; (3) support for Update and Delete DML operations. Currently, I don’t see any of the existing solutions ticking all of these boxes. However, we are getting closer and closer. This post will shed some light on the current status of SQL on Hadoop and give my own recommendations as to which of these solutions you should bet your house on.

Hive

Initially developed by Facebook, Hive is the original SQL framework on Hadoop. The motivation to develop Hive was to provide an abstraction layer on top of Map Reduce (M/R) to make it easier for analysts and data scientists to query data on the Hadoop File System. Rather than writing hundreds of lines of Java code to get answers to relatively simple questions, the objective was to offer SQL, the natural choice of the data analyst. While this approach works well in a batch-oriented environment, it does not perform well for interactive workloads in near real time. The problem with the original M/R framework is that it works in stages: at each stage the data is set down to disk and then read from disk again in the next phase. In addition, the various stages cannot be parallelized. This is highly inefficient and the rationale for the Apache Tez project. Like M/R, Tez is a Hive execution engine; it is developed by Hortonworks together with committers from Facebook, Microsoft, and Yahoo.

Hive on Apache Tez

Tez is part of the Stinger initiative led by Hortonworks to make Hive enterprise-ready and suitable for realtime SQL queries. The two main objectives of the initiative were to increase performance and to offer a rich set of SQL features such as analytic functions, query optimization, and standard data types like timestamp. Tez is the underlying engine that creates more efficient execution plans than Map Reduce. The Tez design is based on research done by Microsoft on parallel and distributed computing. The two main objectives were delivered as part of the recent Hive 0.13 release. The roadmap for release 0.14 includes DML functionality such as Updates and Inserts for lookup tables.
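As a small, hedged illustration of both Stinger goals (the HiveServer2 endpoint, table, and columns are assumptions, and PyHive is just one of several ways to submit the query), the sketch below switches the session to the Tez engine and uses an analytic window function, part of the richer SQL surface delivered by the initiative.

```python
from pyhive import hive

# Assumed HiveServer2 endpoint; the configuration dict asks Hive to use
# the Tez engine for this session instead of classic MapReduce.
conn = hive.connect(
    host="hiveserver2.example.com",
    port=10000,
    username="analyst",
    configuration={"hive.execution.engine": "tez"},
)
cur = conn.cursor()

# A running total per customer -- an analytic (window) function from the
# richer SQL feature set. Table and columns are assumptions.
cur.execute("""
    SELECT customer_id,
           order_ts,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_ts) AS running_total
    FROM   sales
""")

for row in cur.fetchall():
    print(row)
```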

Hive on Spark

Recently, Cloudera together with MapR, Intel, and Databricks spearheaded a new initiative to add a third execution engine to the mix. They propose to add Spark as a third Hive execution engine. Developers will then be able to choose between Map Reduce, Tez, and Spark as the execution engine for Hive. Based on the design document, the three engines will be fully interchangeable and compatible. Cloudera see Spark as the next-generation distributed processing engine, which has various advantages over the Map Reduce paradigm, e.g. intermediate resultsets can be cached in memory. Going forward, Spark will underpin many of the components in the Cloudera platform. The rationale for Hive on Spark, then, is to make Spark available to the vast number of Hive users and to establish Hive on the Spark framework. It will also allow users to run faster Hive queries without having to install Tez. Contrary to Hortonworks, Cloudera don’t see Hive on Spark (or Hive on Tez) as suitable for realtime SQL queries. Their flagship product for interactive SQL queries is Impala, while Databricks see Spark SQL as the tool of choice for realtime queries.

Cloudera Impala

Impala is a massively parallel SQL query engine. It is inspired by Google Dremel, the technology behind Google BigQuery.

Based on their own benchmarks, Cloudera conclude that Presto and Hive on Tez are not fit for purpose for interactive query loads. Cloudera see Hive as a batch processing engine. Of course, Hortonworks see this differently and believe that Hive on Tez is also useful for interactive queries. The jury is out on this one.
You can install and test Impala as part of the Cloudera distribution. Cloudera have been accused of using Impala as a vehicle to lock customers into their own distribution. However, you can also download Impala from GitHub.

Alternatively, you can add the Cloudera repository and download it from there. Impala has various prerequisites in terms of the libraries and respective versions it supports, e.g. it relies (as of this post’s writing) on Hive 0.12 and Hadoop 2.3. So if you want to install it on the latest Hortonworks distro, you are out of luck.
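Once you do have Impala running, firing a query at it from Python is straightforward. A minimal sketch using the impyla client, with the daemon host and the table as assumptions (21050 is the port impalad normally exposes for this protocol):

```python
from impala.dbapi import connect

# Assumed impalad endpoint -- any daemon in the cluster can coordinate the query.
conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()

# A simple aggregate; Impala executes it across the cluster without
# spinning up MapReduce jobs. Table and columns are assumptions.
cur.execute("""
    SELECT product_id, COUNT(*) AS orders
    FROM   sales
    GROUP  BY product_id
    ORDER  BY orders DESC
    LIMIT  10
""")

for product_id, orders in cur.fetchall():
    print(product_id, orders)
```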

Presto

Facebook was and is a heavy user of Hive. However, for some of their workloads they required low-latency response times in an interactive fashion. This was the rationale behind Presto.

One of the advantages of Presto is that you can also query non-HDFS data sources such as an RDBMS. It seems to be relatively easy to write your own connectors.
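The connector model shows up even on the client side: the catalog named in the connection determines whether the query runs against Hive, an RDBMS, or some other source. A minimal sketch using the PyHive Presto client, with coordinator host, catalog, and table as assumptions:

```python
from pyhive import presto

# Assumed Presto coordinator; the catalog selects the connector,
# e.g. "hive" for data on HDFS or another catalog for an RDBMS.
conn = presto.connect(
    host="presto-coordinator.example.com",
    port=8080,
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Table and columns are assumptions for illustration.
cur.execute("SELECT country, COUNT(*) AS events FROM web_events GROUP BY country")
for country, events in cur.fetchall():
    print(country, events)
```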

Spark SQL and Shark

Spark is the new darling on the Big Data scene and widely seen as the replacement for Map Reduce. Originally developed by AMPLab at UC Berkeley, it is now developed by Databricks and also runs on Hadoop YARN. Various components ship with Spark: a micro-batch near-realtime processing module (Spark Streaming), a machine learning component (MLlib), a graph processing component (alpha release), SparkR (alpha release), and, the one we are interested in for the purpose of this article, Spark SQL.

Spark SQL to some extent borrows from Shark, its predecessor, which was based on the Hive codebase but, similar to Tez, came with its own execution engine. The big advantage of Spark SQL over the other engines is that it is easy to mix machine learning with SQL. BTW, an alternative to this is to use the HiveMall machine learning library (unfortunately, there is very little documentation). This is similar to in-database analytics as offered by vendors such as Oracle and has the advantage that you don’t have to move the data around between different tools and technologies. While you can write resultsets back into Hive, in my opinion Spark SQL is currently not really an option for SQL-based ETL/batch, as you would have to intermix it with Scala code to perform more complex transformations, which makes things somewhat ugly. So the primary use case is to use it as an interactive query tool and mix it with machine learning. It also does not offer a rich SQL feature set right now, e.g. analytic functions are missing. Like the other engines, Spark SQL has its own optimizer, named Catalyst. Based on performance benchmarks by Databricks, Spark SQL seems to be able to trump its predecessor Shark in terms of performance (last slide in the deck).
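To show what mixing SQL and machine learning in one engine looks like, here is a hedged sketch in PySpark. The API has moved on considerably since this post was written, so it uses today’s SparkSession/DataFrame interface rather than the 2014-era SQLContext, and the Parquet path, table, and columns are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sql-plus-ml").getOrCreate()

# Register some (assumed) Parquet data so it can be queried with SQL.
spark.read.parquet("/data/customers.parquet").createOrReplaceTempView("customers")

# Step 1: shape the training set with plain SQL.
training = spark.sql("""
    SELECT CAST(churned AS DOUBLE) AS label,
           tenure_months,
           monthly_spend
    FROM   customers
    WHERE  tenure_months IS NOT NULL
""")

# Step 2: feed the SQL result straight into a machine learning algorithm --
# the data never leaves the engine, no export to a separate tool.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend"], outputCol="features"
)
model = LogisticRegression().fit(assembler.transform(training))
print(model.coefficients)
```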

Apache Drill

Similar to Impala, Apache Drill is another MPP SQL query engine inspired by the Google Dremel paper. Apache Drill is mainly supported by MapR. At the moment it is in alpha release and, together with Spark SQL, is the least mature SQL solution on Hadoop as of this writing. As outlined by MapR, Apache Drill will be available in Q2 2014.

Similar to Presto, Apache Drill will also support non-Hadoop data sources.

InfiniDB

InfiniDB is rather different from any of the other SQL engines on Hadoop. I want to include it here as it is an interesting product that we will hear more about in the future. InfiniDB is an open source MPP columnar RDBMS. As such it falls more into the category of the likes of Amazon Redshift, Teradata, or Vertica. Unlike its competitors, it allows its data to sit on the Hadoop File System (HDFS). The only, and pretty fundamental, caveat is that you have to load the data into InfiniDB’s proprietary data format. It currently does not support popular data serialization formats such as Parquet or Sequence Files. Support for these may be added in a future release; however, reading them will likely come with a performance penalty. I will keep an eye on this product as it is an excellent open source alternative to MPPs such as Teradata, Vertica, Netezza etc. However, the data duplication issue is a problem if you have subscribed to the paradigm of bringing the processing to the data rather than the other way around. On the other hand, InfiniDB (as you would expect from an MPP RDBMS) supports Updates and Deletes.

Others

There are various other SQL engines on top of Hadoop, including Cascading Lingual (a SQL abstraction layer built on top of Cascading), Hadapt, and various other commercial products. Another open source solution is Apache Tajo.

Probably the most mature of the SQL-on-Hadoop engines is Big SQL from IBM. I recently had the opportunity to attend a session by one of their chief architects and it looks quite impressive, e.g. it builds on the DB2 optimizer and as a result on decades of experience. As this post mainly deals with open source engines I won’t go into any more detail on Big SQL. However, if you want to find out more have a look at this video.

 

Benchmarks

  • This benchmark by Cloudera compares Impala to Shark (disk and memory), Hive on Tez (0.13), and Presto. Unsurprisingly, Cloudera Impala scores best here :-).
  • This benchmark by Cloudera compares Impala to Hive (0.12, not 0.13) and to an unnamed MPP RDBMS. Surprise surprise, Cloudera Impala scores best here :-).
  • This benchmark by InfiniDB compares InfiniDB to Presto, Impala, and Hive on M/R. For all workloads InfiniDB is the performance winner.
  • This benchmark by AMPLab at UC Berkeley compares Redshift to Hive on M/R, Hive on Tez, Impala, and Shark. The performance winner for most workloads is Amazon Redshift.
  • This benchmark by Hortonworks compares performance between Hive on M/R (0.10) and Hive on Tez (0.13). Interestingly there is no comparison to other Hadoop SQL engines. You can draw your own conclusions as to why this is.
  • This benchmark by Gruter compares Hive on M/R to Impala and Apache Tajo.

Conclusion and Recommendation

As of this writing, the most mature product with the richest feature set is Apache Hive (running on Tez). Crucially, it offers analytic functions, support for the widest set of file formats, and ACID support (full support in release 0.14).

  • As of the current release, Impala lacks important SQL features. However, this is about to change in Impala 2.0.
  • Once it has matured, Hive on Spark should be a very good alternative to Hive on Tez.
  • While Hortonworks claims that Hive can be used for interactive queries, Cloudera questions this. The various benchmarks are not conclusive. As always, you should test yourself whether Hive is suitable for realtime queries for your workload and use case.
  • All of the different solutions follow a similar approach in that they all first create a logical query plan as a Directed Acyclic Graph (DAG). This is then translated into a physical execution plan, and the various components and operators of the plan are then executed in a distributed fashion.
  • There are various benchmarks out there which suggest that Impala is the fastest for various workloads. However, I wouldn’t trust any of these too much and would suggest that you perform your own benchmarks for your specific workload.
  • Spark SQL looks very promising for use cases where you want to use SQL to run machine learning algorithms (similar to in-database analytics, e.g. in Oracle). As an alternative you could look at using HiveMall. It also looks promising for interactive SQL.
  • Performance benchmarks suggest that none of the Hadoop SQL execution frameworks currently match the performance of an MPP RDBMS such as InfiniDB, Amazon Redshift, or Teradata. One Cloudera benchmark suggests otherwise; however, that benchmark is criticised for not implementing the full TPC-DS benchmark, among other things, and as a result is somewhat questionable. This does not really come as a surprise, as decades of experience have gone into these relational engines.

So what to do? Right now I would run both batch-style queries (ETL) and interactive queries on Hive on Tez, as Hive offers the richest SQL feature set, especially analytic functions, and supports a wide set of file formats. If you don’t get satisfactory query performance for your realtime queries you may want to look at some of the other engines. Impala is a mature solution. However, it lacks support for analytic functions, which are crucially important for data analysis tasks. Analytic functions will be added to the next release of Impala though. Another option is Presto, which offers this feature set. At this stage Spark SQL is only in alpha release and does not yet look very mature, especially in terms of SQL features. However, it is quite promising for in-database-style machine learning and predictive analytics (bring the processing to the data rather than the data to the processing). Apache Drill is also only in alpha release and may not be mature enough for your use case. If I had to bet my house on which of the solutions will prevail, I would put it on a combination of Hive on Spark (for batch ETL) and Spark SQL (for interactive queries and in-database-style machine learning and predictive analytics) to cover all use cases and workloads. If Spark SQL matures further in terms of the SQL feature set (analytic functions etc.) and allows for ETL based on the SQL paradigm, I would exclusively put my money on it.

Real Time BI Podcast on Oracle Data Integrator 12c. Part II.

In the second part of the series we cover:

More discussion on ODI vs Informatica
More on migrating from OWB to ODI
Using ODI outside the data warehouse (BI Apps)
ODI in the cloud
ODI and Big Data


Big Data Presentation

The Big Data presentation I gave yesterday is now available for download. In this presentation I define some common features of Big Data use cases, explain what the big deal about Big Data actually is, and explore the impact of Big Data on the traditional data warehouse framework.

Real Time BI Podcast on Oracle Data Integrator 12c. Part I.

I recently did a podcast with Stewart Bryson (Chief Innovation Officer at Rittman Mead), Kevin McGinley, and Alex Shlepakov (both Oracle Analytics at Accenture).

In the first part of this two part series we cover the following areas:

ODI 12c. What are the advantages? When should you upgrade?
Migration from OWB to ODI 12c. Should you migrate? How and when?
Comparison of ODI to Informatica and other ETL tools.
ETL style vs. ELT style data integration tools.
ODI, ETL, and data integration in the cloud.

What’s the Big Deal about Big Data? Hear me speak at OUG Ireland. 11 March 2014. Convention Centre Dublin.

So what’s the Big Deal about Big Data? Oil has fueled the Industrial Revolution. Data will fuel the Information Revolution.

Not convinced? Did you know that Amazon has recently patented a technology, based on a Big Data algorithm, that will start the shipping process before you have completed your order? That’s right: Amazon knows that you will buy some stuff from their website before you know it yourself. How amazing or scary is that?

In my upcoming presentation on 11 March in the Convention Centre in Dublin I will explore this topic further and talk about:

– What is Big Data (and what it is not)?
– Some interesting use cases to show you what is possible.
– Why the traditional data warehouse framework and technology don’t work for Big Data.
– Big Data architecture for the Information revolution.
– The Oracle Big Data armoury

Registration for the event is now open.

Hope to see you there and talk about your Big Data project.