Archives

Data Warehouse

Big Data Presentation

The Big Data presentation I gave yesterday is now available for download. In this presentation I define some common features of Big Data use cases, explain what all the fuss around Big Data is about, and explore the impact of Big Data on the traditional data warehouse framework.

Exadata, Exalytics, SAP HANA and the multi-billion-dollar question

Recently there has been a lot of noise around in-memory databases and how Exadata, Exalytics and SAP HANA compare to each other. Here are my two cents on the debate.

Why you can’t compare Exalytics to SAP HANA

It is the vision of SAP HANA to be used for both OLTP and analytics. As the name suggests, Exalytics caters for analytics only. Exalytics needs to be loaded from the data warehouse or the transactional systems; with HANA this is not needed, as everything already sits in memory. While Exalytics is near-realtime, HANA is realtime. However, there are currently not many OLTP applications running on HANA. The problem is that applications need to be adapted or rewritten to make full use of HANA’s new architecture. This seems about to change: SAP has recently released a service pack for HANA that should allow it to do just that. However, the claim of being able to run OLTP and analytics in the same in-memory database remains somewhat unproven.

Why you can’t compare Exadata to SAP HANA

Exadata V3 now ships with 4 TB of RAM. Contrary to Oracle’s claims, this does not make it an in-memory database, for two reasons:

- Most of the data still sits on disk; mostly SSD, but still disk.
- It lacks the optimized algorithms and design features (in-memory indexes etc.) that are specifically designed for in-memory access.

How does HANA compare to Exadata/Exalytics then?

Well, it doesn’t. Oracle (or anyone else for that matter) still has to come up with a product that can be compared to HANA.

Use cases of SAP HANA

The number one use case for SAP HANA is realtime operational business intelligence. True realtime BI now seems a distinct possibility. In this use case HANA represents the Interactive Sector as defined by Inmon et al. in their DW 2.0 book.

The other use case is similar to what Exalytics is used for: a performance booster for the data warehouse. In this scenario we load or replicate data into the in-memory database, either directly from our OLTP systems or from the data warehouse. This is what we have predominantly seen HANA being used for so far. However, this falls well short of the vision and long-term strategy, namely to change the way we do databases.

The third use case, of course, and the end game for HANA, is to run OLTP, the data warehouse, and analytics all on HANA. This is not very realistic at the moment; DRAM is still too costly.

What happens next?

This is the billion-dollar question. Curt Monash thinks:

“Putting all that together, the analytic case for SAP HANA seems decently substantiated, there are years of experience with the technology and its antecedents, and column stores (including in-memory) are well-established for analytics via multiple vendors. The OLTP case for HANA, however, remains largely unproven. It will be interesting to see how it plays out.”

It will indeed be interesting to see how this plays out. SAP are currently heavily promoting HANA. There is a developer license available. You can rent an instance on AWS. SAP are pushing it hard for start-ups. I am sure that it will replace Oracle in many SAP implementations as the underlying database. It remains to be seen, however, if it can eat into the wider database market (the ultimate SAP objective). The other interesting question is: when will Oracle come up with a product that can compete head on with HANA? I believe there is plenty of time. The game hasn’t really started yet.

In the meantime, Microsoft have also announced an in-memory database named Hekaton, for release in 2014/15. It seems to be for OLTP only and, from what I read, a bit of a joke. Interesting times indeed, though.

The wider picture and Google

At the moment we are seeing tectonic shifts in the technology space of a kind we haven’t seen in a generation. New technologies such as Hadoop, Impala, IMDBs, NoSQL and the Cloud are emerging that can handle ever bigger data at ever faster speeds. These innovations will have knock-on effects on the way we architect information systems and on enterprise architecture in general. I even believe that they will ultimately change the way we do business. Unknown to many, a lot of these innovations were pioneered by Google. Papers on MapReduce and the Google File System kicked off Hadoop. Google BigTable inspired a lot of the NoSQL databases we have seen recently. You may have heard of Impala from Cloudera; again a brainchild of Google, based on Google Spanner and probably more so on their in-house RDBMS F1.

I would expect more innovation to come from that corner. After all, Google is at the epicentre of the Big Data problem. They already have their own offering named BigQuery, which recently left the beta phase. It doesn’t look like much right now, but I expect them to up their game.

Of course you can ignore those trends, but you do so at your own peril.

If you want to find out more about IMDBs and SAP HANA, I recommend reading In-Memory Data Management: Technology and Applications by Hasso Plattner, one of the co-founders of SAP.

Guest Post: Can Database Developers do Data Mining?

I was recently invited by Sandro Saitta, who runs the Data Mining Research blog (http://www.dataminingblog.com/), to write a guest blog post for him. The topic for this guest post was Can Database Developers do Data Mining?

The original post is available at: Can Database Developers do Data Mining?

Here is the main body of the post:

Over the past 20 to 30 years, data mining has been dominated by people with a background in statistics. This is primarily due to the type of techniques employed in the various data mining tools. The purpose of this post is to highlight the possibility that database developers might be a more suitable kind of person to have on a data mining project than someone with a statistics background.

Let’s take a look at the CRISP-DM lifecycle for data mining (Figure 1). Most people involved in data mining will be familiar with this lifecycle.

Figure 1 – CRoss Industry Standard Process for Data Mining.

It is well documented that the first three steps in CRISP-DM can take up to 70% to 80% of the total project time. Why does it take so much time? Well, the data miner has to learn about the business in question, explore the data that exists, work through the business rules, and so on. Only then can they start the data preparation step.

Database developers within the organisation will already have gathered a considerable amount of the required information, because they will have been involved in developing the business applications. A large saving in time can be achieved here, as they will already have most of the business and data understanding. They are well versed in querying the data and can get to the required data quicker. The database developers are also best equipped to perform the data preparation step, as sketched below.
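
To make this concrete, here is a minimal sketch of the kind of data preparation a database developer would typically perform: deriving per-customer input variables for a model from transactional tables. The customers and orders tables and their columns are hypothetical, for illustration only:

-- Derive simple recency/frequency/monetary inputs per customer
SELECT
   c.customer_id,
   COUNT(o.order_id) AS order_count,            -- frequency: how often they buy
   NVL(SUM(o.order_value), 0) AS total_spend,   -- monetary: how much they spend
   MAX(o.order_date) AS last_order_date         -- recency: last purchase
FROM customers c
LEFT JOIN orders o
   ON o.customer_id = c.customer_id
GROUP BY c.customer_id;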

If we skip ahead to the deployment step, again the database developers will be required to implement/deploy the selected data mining model in the production environment.

The two remaining steps, Modelling and Evaluation, are perhaps the two steps that database developers are less suited to. But with a bit of training on data mining techniques and on how to evaluate data mining models, they would be well able to complete the full data mining lifecycle.

If we take the stages of CRISP-DM that a database developer is best suited to (Business Understanding, Data Understanding, Data Preparation and Deployment), this would equate to approximately 80% to 85% of the total project. With a little bit of training and upskilling, database developers are the best kind of people to perform data mining within their organisation.

Brendan Tierney

Comparing Exadata and Netezza TwinFin

A comparison between Exadata and Netezza TwinFin. OK, it comes from Netezza and as such is biased, but it is still an interesting read.

It is worthwhile to remember, though, that Exadata is designed for mixed workloads (OLTP and analytics), which is a key differentiator from the other DW appliance vendors.

Interesting posts by Curt Monash on this:

http://www.dbms2.com/2009/09/29/integration-oltp-data-warehousing-exadata-2/

http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/

Where is the response from Oracle?

BigQuery: Data Warehousing with Google?

Google has added two new products to their Labs. The first one is BigQuery, which according to Google allows users to query trillions of records in an SQL dialect via a RESTful web service. If they get the pricing right on this, I can see Google becoming a top player in the data-warehousing-as-a-service space. This one could be quite interesting because, unlike Hadoop and the other players in the NoSQL space, it supports SQL. The only problem is that there are currently no query tools that support BigQuery.
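
To give a flavour of the dialect, here is a minimal sketch of a BigQuery-style aggregate query; the weblogs.page_hits table and its columns are hypothetical, for illustration only:

-- Top ten countries by page views over a hypothetical web log table
SELECT
   country,
   COUNT(*) AS page_views
FROM weblogs.page_hits
GROUP BY country
ORDER BY page_views DESC
LIMIT 10;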

The other product they have added is a Prediction API, which exposes machine learning capabilities via a RESTful web service.

10 Reasons you really need predictive analytics

SPSS have recently posted an article called “10 Reasons you really need predictive analytics”. I thought it would be interesting to post the main points from this article to illustrate that not all predictive analytics projects involve data mining; they involve a number of different techniques and looking at the business data in a different way. That said, data mining can be a very important element in some of the following.

1. Get a higher return on your data investment
Your organization has a significant investment in data – data that contains critical information about every aspect of your business. Today more than ever, you need to get the best return on the data you have collected, and predictive analytics is the most effective way to do this. Predictive analytics combines information on what has happened in the past, what is happening now, and what’s likely to happen in the future to give you a complete picture of your business.

2. Find hidden meaning in your data
Predictive analytics helps you maximize the understanding gained from your data. It enables you to uncover hidden patterns, trends, and relationships and transform this information into action.

3. Look forward, not backward
Unlike reporting and business intelligence solutions that are only valuable for understanding past and current conditions, predictive analytics helps organizations look forward. By leveraging sophisticated statistical and modeling techniques, you can use the data you already have to help you anticipate future events and be proactive, rather than reactive.

4. Deliver intelligence in real time
Your business is dynamic. With predictive analytics, you can automatically deploy analytical results to both individuals and operational systems as changes occur, helping to guide customer interactions and strategic and tactical decision making.

5. See your assumptions in action
Advanced analytical methods give you the tools to develop hypotheses about your organization’s toughest challenges and test them by creating predictive models. You can then choose the scenario that is likely to result in the best outcome for your organization.

6. Empower data-driven decision making
Better processes help people throughout your organization make better decisions every day. Predictive analytics enables your organization to automate the flow of information to match your business practices and deliver the insights gained through this technology to people who can apply them in their daily work.

7. Build customer intimacy
When you know each of your customers or constituents intimately—including what they think, say, and do—you can build stronger relationships with them. Predictive analytics gives you a complete view of your customers, and enables you to capture and maximize the value of each and every interaction.

8. Mitigate risk and fraud
Predictive analytics helps you evaluate risk using a combination of business rules, predictive models, and information gathered from customer interactions. You can then take the appropriate actions to minimize your organization’s exposure to fraudulent activities or high-risk customers or transactions.

9. Discover unexpected opportunities
Your organization can use predictive analytics to respond with greater speed and certainty to emerging challenges and opportunities, helping you to keep pace in a constantly changing business environment.

10. Guarantee your organization’s competitive advantage
Predictive analytics can drive improved performance in every operational area, including customer relations, supply chain, financial performance and cost management, research and product development, and strategic planning. When your organization runs more efficiently and profitably, you have what it takes to out-think and out-perform your competitors.

So what is Predictive Analytics? Check out the description on Wikipedia.

Let me know your views and comments on the above.

Brendan Tierney

Oracle Data Miner – New Resources

Over the past couple of weeks, a number of new web resources have appeared for Oracle Data Miner.

The first one is that Charlie Berger, the Director of Oracle Data Mining Product Management, has started a blog specifically for Oracle Data Miner. Check it out:
http://blogs.oracle.com/datamining/

If you are already using Oracle Data Miner, or are interested in following its developments, why not join the Oracle Data Miner Facebook group:
http://www.facebook.com/pages/Oracle-Data-Mining/287065104533?ref=mf

Why Has Data Mining Struggled So Much?

Bill Inmon has recently posted an article on “Why has Data Mining struggled so much?”

The article discusses seven different reasons why data mining has struggled, despite having been around for a very long time.

The main points are:
1. We have been waiting a long time for it to become available in a usable way.
2. Data mining is considered an academic pursuit with very few practitioners, but this is becoming less so.
3. Data mining requires a different set of skills. Yes, you need data management skills, but you also need some data mining skills. I will be making a post focusing on the skill sets required for data mining in the coming weeks.
4. Some industries and application areas are more suited to data mining than others. The difficulty is in identifying suitable projects.
5. Data for data mining is unclean. Not if you use a data warehouse. Ideally, an organisation that has a mature-ish BI infrastructure will benefit most from a data mining project.
6. Data is incomplete. Yes, you may need to enrich the data from various sources, but again, if you have a data warehouse you will have most of this already.
7. Approaches to data mining are inadequate. A lot of the approaches to data mining projects are based on its statistical history. New problem areas are evolving all the time, and we can use data mining in lots of different ways.

To view Bill Inmon’s article – click here.

To view our 2 training courses on data mining – click here

Brendan Tierney

Data Warehousing Books: Design and architecture

In another post I have covered data warehousing books in the world of Oracle. We’ve also had a look at data warehousing and business intelligence books for project management and business analysis. Today we will look at data warehousing and business intelligence books that look at the technical design and architecture of a data warehouse solution.

Must Have

DW 2.0: The Architecture for the Next Generation of Data Warehousing: Bill Inmon revisits his data warehouse architecture. Addresses the following issues: Real-time BI, unstructured data, the enterprise data warehouse and change, the data life cycle, time variance of data. Very useful from a conceptual point of view, but not enough detail.

The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling. My first book on data warehousing, and still valuable today. Great for dimensional modelling of data marts or small non-realtime Enterprise Data Warehouses based on Kimball’s conformed dimensions. It also has a good overview of industry-specific data model patterns in a dimensional context. A must have.

The Data Model Resource Books Vol 1-3: These books describe fundamental data modelling patterns that can be applied and reused across the enterprise. If you are assigned the task of modelling an Enterprise Data Warehouse, these books give you great insight into data modelling best practices. Volume 2 offers industry-specific data model patterns and provides invaluable information for better understanding the issues at hand in a particular industry. Personally, I find that you should actually start with volume 3, as it is the most generic of the three books. Also, if you only get one of the books, get volume 3.

If you have a requirement around near-realtime data warehousing and operational business intelligence, I recommend looking into Dan Linstedt’s data vault modelling techniques. The Business of Data Vault Modeling will get you started.

Some more recent additions to the data warehouse architecture league of books include Building and Maintaining a Data Warehouse and Advanced Data Warehouse Design. The first of these walks us through all the technical areas of a data warehouse project: source system analysis, database design, BI reporting, data quality, and metadata. In my opinion, the best chapter is the one on data integration and ETL. There are very few dedicated ETL books out there, and this is one of the few that touches on the subject, albeit at a high level. In Advanced Data Warehouse Design the authors discuss the shortcomings of existing data warehouse implementations, focusing mainly on spatial and temporal data, e.g. the shortcomings of slowly changing dimensions when capturing changes over time. They propose a truly temporal and spatial data warehouse. Examples are given in MS SQL Server Analysis Services (temporal) and Oracle OLAP (temporal and spatial).

To my knowledge, the only book out there dedicated to the physical design of databases is Physical Database Design: The Database Professional’s Guide to Exploiting Indexes, Views, Storage, and More. Most of the material covered here is for advanced users. It covers Oracle, DB2, SQL Server and, for some of the MPP material, Teradata. Personally, I found the chapter on physical design for a shared-nothing architecture and the chapter on hardware (CPU architecture, disks, server sizing etc.) the most useful.


Dr. Ronnie Abrahiem, a Software Engineer at CIBER, has recently published a book on combining SOA and data warehousing in a near-realtime environment. It looks quite interesting, but I haven’t read the book myself. It has the rather long title Data Warehousing with Service-oriented Architecture: Designing and Implementing Prototype Models For an Integration of Near-Real-Time Data Warehousing Architecture with Service-oriented Architecture. I am currently working on a project where we want to integrate a SOA-based MDM solution with the data warehouse, so the book may offer some interesting insights.

Should Have

If you have a lot of aggregate tables in your warehouse, I recommend having a look at Mastering Data Warehouse Aggregates for a formalised methodology and some really useful tips and tricks around building an aggregate navigator.
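
As an aside for readers new to aggregates: in Oracle, for example, an aggregate table is often implemented as a materialized view with query rewrite enabled, which lets the optimizer itself act as a simple aggregate navigator. A minimal sketch, with hypothetical table and column names:

-- Pre-aggregated sales by month and product category
CREATE MATERIALIZED VIEW sales_by_month_mv
ENABLE QUERY REWRITE
AS
SELECT
   t.calendar_month,
   p.product_category,
   SUM(s.sales_amount) AS sales_amount
FROM sales s
JOIN times t ON t.time_id = s.time_id
JOIN products p ON p.product_id = s.product_id
GROUP BY
   t.calendar_month,
   p.product_category;

Queries that aggregate the base sales table by month and category can then be rewritten transparently against the smaller table, without report developers needing to know it exists.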

Another recent addition to the data warehouse design books is Data Warehouse Design: Modern Principles and Methodologies. It has a very useful chapter on ETL and is quite affordable.

Could Have

Data Warehouse Design Solutions. This is useful as a second reference for industry-specific dimensional models. However, it cannot replace Kimball’s original book on the subject.

Clickstream Data Warehousing. If you are implementing a data warehouse for web analytics, you should have a look at this one. However, in light of the explosion in data volumes, and with Hadoop and MapReduce at hand, it is slightly obsolete.

TIME dimension script for Oracle

The following script generates a time-of-day dimension with one row for every second of the day:

-- One row for every second of the day (86,400 rows),
-- keyed on seconds past midnight
SELECT
   n AS time_id,
   TO_CHAR(TO_DATE(n, 'SSSSS'), 'HH24') AS hour,
   TO_CHAR(TO_DATE(n, 'SSSSS'), 'MI') AS minute,
   TO_CHAR(TO_DATE(n, 'SSSSS'), 'SS') AS second
FROM (
   -- Row generator: CONNECT BY LEVEL against DUAL yields n = 0 .. 86399
   SELECT
      LEVEL - 1 AS n
   FROM
      DUAL
   CONNECT BY LEVEL <= 86400
);
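
A minimal sketch of how the output might be used, assuming it has been materialized as a table named dim_time_of_day (the fact_events table and its event_date column are hypothetical):

-- Bucket events by hour of day via the seconds-past-midnight key
SELECT
   d.hour,
   COUNT(*) AS event_count
FROM fact_events f
JOIN dim_time_of_day d
   ON d.time_id = TO_NUMBER(TO_CHAR(f.event_date, 'SSSSS'))
GROUP BY d.hour
ORDER BY d.hour;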