Business Intelligence

Exadata, Exalytics, SAP HANA and the multi-billion-dollar question.

Recently there has been a lot of noise around in-memory databases and how Exadata, Exalytics and SAP HANA compare to each other. Here are my two cents on the debate.

Why you can’t compare Exalytics to SAP HANA

It is the vision of SAP HANA to be used for both OLTP and analytics. As the name already suggests, Exalytics caters only for analytics. Exalytics needs to be loaded from the data warehouse or the transactional systems; not so for HANA, where everything already sits in memory. While Exalytics is near realtime, HANA is realtime. However, currently there are not too many OLTP applications running on HANA. The problem is that applications need to be adapted or rewritten to make full use of HANA’s new architecture. This seems about to change: SAP has recently released a service pack for HANA that will allow it to do just that. However, the claim of being able to run OLTP and analytics in the same in-memory database remains somewhat unproven.

Why you can’t compare Exadata to SAP HANA

Exadata V3 now ships with 4 TB of RAM. Contrary to Oracle’s claims, this does not make it an in-memory database. It lacks two important features:

- Most of the data still sits on disk. Much of that disk is SSD, but it is disk nonetheless.
- It lacks the optimized algorithms and design features (in-memory indexes etc.) that have been designed specifically for in-memory access.

How does HANA compare to Exadata/Exalytics then?

Well, it doesn’t. Oracle (or anyone else for that matter) still has to come up with a product that can be compared to HANA.

Use cases of SAP HANA

The number one use case for SAP HANA is realtime operational business intelligence. True realtime BI now seems a distinct possibility. In this use case HANA represents the Interactive Sector as defined by Inmon et al. in their DW 2.0 book.

The other use case is similar to what Exalytics is used for: a performance booster for the data warehouse. In this scenario we load or replicate data into the in-memory database, either directly from our OLTP systems or from the data warehouse. This is what we have predominantly seen HANA being used for so far. However, it falls well short of the vision and long-term strategy, namely to change the way we do databases.

The third use case, and of course the endgame for HANA, is to run OLTP, the data warehouse, and analytics all on HANA. This is not very realistic at the moment: DRAM is still too costly.

What happens next?

This is the billion-dollar question. Curt Monash thinks:

“Putting all that together, the analytic case for SAP HANA seems decently substantiated, there are years of experience with the technology and its antecedents, and column stores (including in-memory) are well-established for analytics via multiple vendors. The OLTP case for HANA, however, remains largely unproven. It will be interesting to see how it plays out.”

It will indeed be interesting to see how this plays out. SAP are currently heavily promoting HANA. There is a developer license available. You can rent an instance on AWS. SAP are pushing it hard for start-ups. I am sure that it will replace Oracle in many SAP implementations as the underlying database. It remains to be seen, however, if it can eat into the wider database market (the ultimate SAP objective). The other interesting question is: when will Oracle come up with a product that can compete head on with HANA? I believe there is plenty of time. The game hasn’t really started yet.

In the meantime Microsoft have also announced an in-memory database named Hekaton for release in 2014/15. It seems to be for OLTP only and, from what I read, a bit of a joke. Interesting times indeed though.

The wider picture and Google

At the moment we are seeing tectonic shifts in the technology space of a kind we haven’t seen in a generation. New technologies such as Hadoop, Impala, IMDBs, NoSQL and the cloud are emerging that can handle ever-bigger data at ever-faster speeds. These innovations will have knock-on effects on the way we architect information systems and on enterprise architecture in general. I even believe that they will ultimately change the way we do business. Unknown to many, a lot of these innovations were pioneered by Google. Papers on MapReduce and the Google File System kicked off Hadoop. Google BigTable inspired a lot of the NoSQL databases we have seen recently. You may have heard of Impala from Cloudera. It is again a brainchild of Google, based on Google Spanner and probably more so on their in-house RDBMS F1.

I would expect more innovation to come from that corner. After all, Google is at the epicentre of the Big Data problem. They already have their own offering, named BigQuery, which recently left the beta phase. It doesn’t look like much right now, but I expect them to up their game.

Of course you can ignore those trends but you do so at your own peril.

If you want to find out more about IMDBs and SAP HANA, I recommend reading In-Memory Data Management: Technology and Applications by Hasso Plattner, one of the co-founders of SAP.

Oracle Data Miner Comes of Age

I have an article in the June edition of the Oracle Scene magazine, titled Oracle Data Miner Comes of Age. The article focuses on the new Oracle Data Miner 11gR2 tool (ODM). This is the first major upgrade of the tool in a number of years, and it is now integrated into SQL Developer 3.

The new ODM tool (11gR2 in SQL Developer 3) has addressed the shortcomings of its predecessor, and the development team have really created a tool that is now at a level comparable with the likes of SAS Enterprise Miner, which is considered the number one product in the market.

The ODM tool in SQL Developer is free to use, but you need to licence the Data Mining option as part of Oracle Database 11gR2 Enterprise Edition. If you compare the list price of the Data Mining option in Oracle 11g EE with the cost of purchasing SAS Enterprise Miner, and factor in the added benefits of less data movement, in-database data mining and the ease of deployment of the models, the likes of SAS and others are going to come under increasing pressure from Oracle Data Mining.

The article discusses some of the main new features and gives some pointers on how to get started with the tool. The main part of the article gives a sample walkthrough of how you can use ODM to define a data source, create some classification models using this data, evaluate the models, and then apply one of these models to new data.
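The walkthrough itself uses the GUI, but the same steps can be driven from SQL, since ODM builds on the in-database DBMS_DATA_MINING package. Below is a minimal, hypothetical sketch (the table, column and model names are mine, and with no settings table supplied the default classification algorithm is used):

    -- Build a classification model in the database (names are illustrative only).
    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'CHURN_MODEL',
        mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
        data_table_name     => 'CUSTOMER_BUILD_DATA',
        case_id_column_name => 'CUST_ID',
        target_column_name  => 'CHURNED');
    END;
    /

    -- Apply the model to new data with the built-in SQL scoring functions.
    SELECT cust_id,
           PREDICTION(CHURN_MODEL USING *)             AS predicted_churn,
           PREDICTION_PROBABILITY(CHURN_MODEL USING *) AS probability
    FROM   customer_new_data;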

So check out the article in the June edition of the Oracle Scene magazine.

I have also put together a 1 minute video introduction to the article. Check it out at
Oracle Data Miner Comes of Age – An Introduction

Brendan Tierney

Guest Post: Can Database Developers do Data Mining?

I was recently invited by Sandro Saitta, who runs the Data Mining Research blog (http://www.dataminingblog.com/), to write a guest blog post for him. The topic for this guest post was Can Database Developers do Data Mining?

The original post is available at – Can Database Developers do Data Mining

Here is the main body of the post

Over the past 20 to 30 years data mining has been dominated by people with a background in statistics. This is primarily due to the type of techniques employed in the various data mining tools. The purpose of this post is to highlight the possibility that database developers might be a more suitable kind of person to have on a data mining project than someone with a statistics background.

Let’s take a look at the CRISP-DM lifecycle for data mining (Figure 1). Most people involved in data mining will be familiar with this lifecycle.

Figure 1 – CRoss Industry Standard Process for Data Mining.

It is well documented that the first three steps in CRISP-DM can take up 70% to 80% of the total project time. Why does it take so much time? The data miner has to learn about the business in question, explore the data that exists, revisit the business rules to understand them, and so on. Only then can the data preparation step begin.

Database developers within the organisation will already have gathered a considerable amount of the required information, because they would have been involved in developing the business applications. A large saving in time can be achieved here, as they will already have most of the business and data understanding. They are well equipped to query the data and to get to the required data quickly. The database developers are also best equipped to perform the data preparation step.

If we skip ahead to the deployment step, again it is the database developers who will be required to implement and deploy the selected data mining model in the production environment.

The two remaining steps, Modelling and Evaluation, are perhaps the two steps that database developers are less suited to. But with a bit of training on data mining techniques and on how to evaluate data mining models, they would be well able to complete the full data mining lifecycle.

If we take the stages of CRISP-DM that a database developer is best suited to, Business Understanding, Data Understanding, Data Preparation and Deployment, this equates to approximately 80% to 85% of the total project. With a little bit of training and upskilling, database developers are the best kind of people to perform data mining within their organisation.

Brendan Tierney

2010 Rexer Analytics Data Mining Survey

The 4th Annual Rexer Analytics Data Miner Survey for 2010 is now available. 735 data miners participated in the 2010 survey. The main highlights of the survey are:

• FIELDS & GOALS: Data miners work in a diverse set of fields. CRM / Marketing has been the #1 field in each of the past four years. Fittingly, “improving the understanding of customers”, “retaining customers” and other CRM goals are also the goals identified by the most data miners surveyed.

• ALGORITHMS: Decision trees, regression, and cluster analysis continue to form a triad of core algorithms for most data miners. However, a wide variety of algorithms are being used. This year, for the first time, the survey asked about Ensemble Models, and 22% of data miners report using them.
A third of data miners currently use text mining and another third plan to in the future.

• MODELS: About one-third of data miners typically build final models with 10 or fewer variables, while about 28% generally construct models with more than 45 variables.

• TOOLS: After a steady rise across the past few years, the open source data mining software R overtook other tools to become the tool used by more data miners (43%) than any other. STATISTICA, which has also been climbing in the rankings, is selected as the primary data mining tool by the most data miners (18%). Data miners report using an average of 4.6 software tools overall. STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.

• TECHNOLOGY: Data Mining most often occurs on a desktop or laptop computer, and frequently the data is stored locally. Model scoring typically happens using the same software used to develop models. STATISTICA users are more likely than other tool users to deploy models using PMML.

• CHALLENGES: As in previous years, dirty data, explaining data mining to others, and difficult access to data are the top challenges data miners face. This year data miners also shared best practices for overcoming these challenges.

• FUTURE: Data miners are optimistic about continued growth in the number of projects they will be conducting, and growth in data mining adoption is the number one “future trend” identified. There is room to improve: only 13% of data miners rate their company’s analytic capabilities as “excellent” and only 8% rate their data quality as “very strong”.

You can request a copy of the full report by going to their data mining survey webpage

Regards
Brendan Tierney

New Oracle Data Miner tool is now Available

Today the new Oracle Data Miner tool has been made available as part of SQL Developer 3.0 (Early Adopter Release 4).
The new ODM tool has been significantly redeveloped, with a new workflow interface and new graphical outputs. These include graphical representations of the decision trees and clustering.
To download the tool and to read the release documentation go to

http://tinyurl.com/62u3m4y

http://tinyurl.com/6heugsh

If you download and use the new tool, let me know what you think of it.

Exploring the Oracle Data Miner Interface

Once you have successfully installed Oracle Data Miner and established a connection, you will be presented with the main ODM window, with the Navigator pane on the left-hand side. All the tasks that you will need to do in ODM can be accessed from the Navigator pane or from the menu across the top of the window.

    The Navigator

Before commencing any data mining exercise you will want to explore your data. To do this you will need to expand the Data Sources branch of the tree in the Navigator pane.

When expanded, you will get a list of all users in the database who have some data visible to you or to all users in the database. You only need to look for your Oracle Data Miner user; this will be the one you created as part of the installation steps given previously. By clicking on your ODM user you will see the branch divide into Views and Tables. By expanding these branches you will be able to see what views and tables have been defined in your ODM user. The tables and views that you will see are those that were defined as part of the installation steps given previously.

    Loading/Importing Data

There are a number of ways of getting data into your ODM schema. As your ODM schema is like any other Oracle schema, you can use all the techniques you know to copy and load data into it. The ODM tool, on the other hand, has a data Import option, which can be found under the Data menu. ODM uses SQL*Loader as the tool to load data into your ODM schema.

Before you select the Import option you will need to set the location of the SQL*Loader executable. To do this, select Preferences from the Tool menu (Tool -> Preferences). In the Environment tab enter the location or browse for the executable.

When you have specified the location of SQL*Loader you can load CSV-type files into your ODM schema. Select the Import option from the Data menu. This will open the File Import Wizard. After the introduction screen you can enter the name of a text file or search for the file.


In the example given here we will load a file of organic-purchase transactions from a grocery retailer.
On the next screen of the wizard you can specify the delimiter and state whether the first record contains the field headers. The Advanced Settings button allows you to enter some of the SQL*Loader-specific settings, such as the skip count, max errors, etc.
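Behind the wizard, these answers roughly correspond to SQL*Loader directives. As a hypothetical illustration (the file, table and column names are mine), a control file for a comma-delimited file with a header row might look like this:

    -- organics.ctl: hypothetical control file for a comma-delimited file
    OPTIONS (SKIP=1, ERRORS=50)  -- skip the header record; allow up to 50 bad records
    LOAD DATA
    INFILE 'organics.csv'
    INTO TABLE organic_purchases
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    TRAILING NULLCOLS
    (cust_id, gender, age, purchase_date DATE "YYYY-MM-DD", amount)

It would then be run with something like sqlldr dmuser/password control=organics.ctl.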


The next screen of the wizard allows you to view the imported fields that were in the first row of the organics.csv file. You may need to do some tidying up of the column names, and change the data types and sizes for each attribute.

The next screen of the wizard allows you to specify the name of the table that the data should be inserted into. If a new table is needed then you can give the table name. If the data is to be appended to an existing table, then that table can be selected from the drop-down list. In our example you will need to select that the data will go into a new table and give the new table a name, e.g. ORGANIC_PURCHASES. At this point you have entered all the details, and you can proceed to the last screen of the wizard and select the Finish button. It will take a minute or so for the data to be loaded. Once the data is loaded, the table will appear in the Navigator pane under Tables and the table structure will appear in the main part of the ODM window. You can now select the Data tab to look at the data that has been loaded.


    Exploring Data

Before beginning any data mining task we need to perform some data investigation. This allows us to explore the data and to gain a better understanding of the data values. We can discover a lot by doing this, and it can help us to identify areas for improvement in the source applications, to identify data that does not contribute to our business problem (this is called feature reduction), and to identify data that needs reformatting into a number of additional features (feature creation). A simple example of this is a date of birth field: on its own it provides no real value, but by creating a number of additional attributes (features) from it we can use the date of birth to determine what age group a customer fits into.
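As a concrete, hypothetical sketch of that date of birth example (the CUSTOMERS table and its columns are made-up names), the derived features might be computed as:

    -- Feature creation: derive an age and a coarser age decade from date of birth.
    SELECT cust_id,
           TRUNC(MONTHS_BETWEEN(SYSDATE, date_of_birth) / 12) AS age,
           TRUNC(TRUNC(MONTHS_BETWEEN(SYSDATE, date_of_birth) / 12) / 10) * 10 AS age_decade
    FROM   customers;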

To begin exploring the data in ODM, we need to select the table or view from the navigation tree. Hover the mouse over the table/view name and right-click. From the menu that displays, select Show Summary Single-Record. We will come back to the other menu options when we cover the next topic, Data Transformations.


The Data Summarization screen will present some high-level information about your data. Taking our organics data, we can explore the ranges of values for each attribute. There is a default setting of 10, which is the number of divisions (bins) that ODM will divide the data into. During the data exploration exercise you may want to vary this value between 5 and 20 to see if there are any groupings or trends in the data. The value can be changed by pressing the Select All button followed by the Preferences button.

The following couple of examples will illustrate how the data exploration tool in ODM can help you discover information about your data in a simple graphical manner. The alternative is to log into SQL*Plus or SQL Developer and write and execute queries that do something similar.

For the first example we will select the Gender attribute. Although we have our number of bins set to 10, we only get 5 bins displayed. The tool will try to divide the data into 10 equally spaced bins, but if insufficient data exists then it will present the histogram with the existing set of distinct values.


From this example on the Gender attribute we can see that there are five distinct values: F for female, M for male, U for unknown, “” for records that contain a space, and Other for records that contain a null. From this data we can work out that maybe we should only have three possible values (F, M, U), but we have 105 records without a proper value, which equates to just over 10% of the data sample. This is a sizeable number of records. So one of the steps in the Data Transformation phase (or data clean-up) is to work out what to do with these records. The options include removing them from the data set, working out what the correct value should be, or changing the value to U for unknown.
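For anyone who would rather query than click, the same check, and one possible clean-up, can be expressed in SQL. A sketch, assuming the data sits in the hypothetical ORGANIC_PURCHASES table loaded earlier:

    -- Reproduce the histogram: count the records for each distinct gender value.
    SELECT NVL(gender, 'NULL') AS gender_value,
           COUNT(*)            AS num_records
    FROM   organic_purchases
    GROUP  BY NVL(gender, 'NULL')
    ORDER  BY num_records DESC;

    -- One possible clean-up: map blanks and nulls to U for unknown.
    UPDATE organic_purchases
    SET    gender = 'U'
    WHERE  gender IS NULL
       OR  TRIM(gender) IS NULL;  -- TRIM of an all-space value is NULL in Oracle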

For our second example we will look at the MINING_DATA_BUILD_V view (this can be found under the Views branch in the Navigator pane). Again, right-click on this object and select Show Summary Single-Record to bring up the Data Summarization window. One of the first things that you will notice is that we get a lot more detail relating to each attribute.


Some of these extra details include the average, max, min, variance and the number of null values. Again we can go exploring the data, changing the number of bins to varying sizes to see if there is any hidden information in the data. For example, if we select AGE and set the number of bins to 10, we get a nice histogram showing that most of our customers are in the 31 to 46 age range. So maybe we should be concentrating on these.


Now if we change the number of bins to 30 we get a completely different picture of what is going on in the data. Now we can see that there are a number of important age groups that stand out more than others. If we look at the 21 to 34 age range in the first histogram, we can see that there is not much change between each of the age bins. But when we look at the second histogram, with 30 bins, for the same 21 to 34 age range, we get a very different view of the data. In this second histogram we see that the ages of the customers vary a lot. What does this mean? Well, it can mean lots of different things, and it all depends on the business scenario. In our example we are looking at an electronic goods store. What we can deduce from this second histogram is that there are a small number of customers up to about age 21. Then there is a big jump. Is this due to people having obtained their first main job after school and having some disposable income? This peak is followed by a drop-off in customers, followed by another peak, drop-off, peak, drop-off, etc. Maybe we can build a profile of our customers based on their age, just as financial organisations do to determine what products to sell to us based on our age and life stage.
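This bin-count experiment is easy to reproduce in SQL with the WIDTH_BUCKET function. A sketch, assuming an age range of roughly 18 to 78 in the MINING_DATA_BUILD_V view:

    -- Bin ages into 30 equal-width buckets between 18 and 78;
    -- change 30 to 10 to reproduce the coarser histogram.
    SELECT WIDTH_BUCKET(age, 18, 78, 30) AS age_bin,
           MIN(age)                      AS bin_from,
           MAX(age)                      AS bin_to,
           COUNT(*)                      AS num_customers
    FROM   mining_data_build_v
    GROUP  BY WIDTH_BUCKET(age, 18, 78, 30)
    ORDER  BY age_bin;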


From this histogram we can maybe categorise the customers into the following groups (a SQL sketch of this banding follows the list):

• Early 20s – out of education, first job, disposable income
• Late 20s to early 30s – settling down, own home
• Late 30s – maybe kids, so less disposable income
• 40s – maybe people are trading up and need new equipment. Or maybe the kids have turned into teenagers and are encouraging their parents to buy up-to-date equipment.
• Late 50s – these could be empty nesters whose children have left home, perhaps setting up homes of their own, with the parents buying things for those new homes. Or maybe the parents are treating themselves to new equipment as they have more disposable income.
• 60s+ – parents and grandparents buying equipment for their children and grandchildren. Or maybe we have very techie people who have just retired.
• 70+ – we have a drop-off here.
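That banding translates directly into a derived feature. A rough sketch, where the boundaries are only my reading of the list above:

    -- Hypothetical life-stage feature derived from the age bands above.
    SELECT cust_id,
           age,
           CASE
             WHEN age < 25 THEN 'EARLY_20S'
             WHEN age < 35 THEN 'LATE_20S_EARLY_30S'
             WHEN age < 40 THEN 'LATE_30S'
             WHEN age < 50 THEN 'FORTIES'
             WHEN age < 60 THEN 'FIFTIES'
             WHEN age < 70 THEN 'SIXTIES'
             ELSE 'SEVENTY_PLUS'
           END AS life_stage
    FROM   mining_data_build_v;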

As you can see, we can discover a lot in the data by changing the number of bins and examining the results. The important part of this examination is trying to relate what you are seeing in the graphical representation of the data on the screen back to the type of business you are examining. A lot can be discovered, but you will have to spend some time looking for it.

In my next posting, I will cover some of the Data Transformation functions that are available in Oracle Data Miner.

Brendan Tierney

10 Reasons you really need predictive analytics

SPSS have recently posted an article called “10 Reasons you really need predictive analytics“. I thought it would be interesting to post the main points from this article to illustrate that not all predictive analytics projects involve data mining; they involve a number of different techniques and looking at the business data in a different way. That said, data mining can be a very important element in some of the following.

1. Get a higher return on your data investment
Your organization has a significant investment in data – data that contains critical information about every aspect of your business. Today more than ever, you need to get the best return on the data you have collected, and predictive analytics is the most effective way to do this. Predictive analytics combines information on what has happened in the past, what is happening now, and what’s likely to happen in the future to give you a complete picture of your business.

2. Find hidden meaning in your data
Predictive analytics helps you maximize the understanding gained from your data. It enables you to uncover hidden patterns, trends, and relationships and transform this information into action.

3. Look forward, not backward
Unlike reporting and business intelligence solutions that are only valuable for understanding past and current conditions, predictive analytics helps organizations look forward. By leveraging sophisticated statistical and modeling techniques, you can use the data you already have to help you anticipate future events and be proactive, rather than reactive.

4. Deliver intelligence in real time
Your business is dynamic. With predictive analytics, you can automatically deploy analytical results to both individuals and operational systems as changes occur, helping to guide customer interactions and strategic and tactical decision making.

5. See your assumptions in action
Advanced analytical methods give you the tools to develop hypotheses about your organization’s toughest challenges and test them by creating predictive models. You can then choose the scenario that is likely to result in the best outcome for your organization.

6. Empower data-driven decision making
Better processes help people throughout your organization make better decisions every day. Predictive analytics enables your organization to automate the flow of information to match your business practices and deliver the insights gained through this technology to people who can apply them in their daily work.

7. Build customer intimacy
When you know each of your customers or constituents intimately—including what they think, say, and do—you can build stronger relationships with them. Predictive analytics gives you a complete view of your customers, and enables you to capture and maximize the value of each and every interaction.

8. Mitigate risk and fraud
Predictive analytics helps you evaluate risk using a combination of business rules, predictive models, and information gathered from customer interactions. You can then take the appropriate actions to minimize your organization’s exposure to fraudulent activities or high-risk customers or transactions.

9. Discover unexpected opportunities
Your organization can use predictive analytics to respond with greater speed and certainty to emerging challenges and opportunities, helping you to keep pace in a constantly changing business environment.

10. Guarantee your organization’s competitive advantage
Predictive analytics can drive improved performance in every operational area, including customer relations, supply chain, financial performance and cost management, research and product development, and strategic planning. When your organization runs more efficiently and profitably, you have what it takes to out-think and out-perform your competitors.

So what is predictive analytics? Check out the description on Wikipedia.

Let me know your views and comments on the above.

Brendan Tierney

Oracle Data Miner – New Resources

Over the past couple of weeks some new web resources have appeared for Oracle Data Miner.

The first one is that Charlie Berger, the director of Oracle Data Mining Product Management, has started a blog specifically for Oracle Data Miner. Check it out,
http://blogs.oracle.com/datamining/

If you are already using Oracle Data Miner or are interested in following its developments why not join the Oracle Data Miner Facebook group
http://www.facebook.com/pages/Oracle-Data-Mining/287065104533?ref=mf

Why Has Data Mining Struggled So Much?

Bill Inmon has recently posted an article on “Why has Data Mining struggled so much?”

The article discusses 7 different reasons why data mining has struggled, given that it has been around for a very long time.

The main points are:
1. We have been waiting a long time for it to become available in a usable way.
2. Data mining is considered an academic discipline with very few practitioners. But this is becoming less so.
3. Data mining requires a different set of skills. Yes, you need data management skills, but you also need some data mining skills. I will be making a posting focusing on the skill sets required for data mining in the coming weeks.
4. Some industries and application areas are more suited to data mining than others. The difficulty is in identifying suitable projects.
5. Data for data mining is unclean. Not if you use a data warehouse. Ideally, an organisation that has a mature-ish BI infrastructure will benefit most from a data mining project.
6. Data is incomplete. Yes, you may need to enrich the data from various sources. But again, if you have a data warehouse you will have most of these sources already.
7. Approaches to data mining are inadequate. A lot of the approaches to data mining projects are based on its statistical history. New problem areas are evolving all the time and we can use data mining in lots of different ways.

To view Bill Inmon’s article – click here.

To view our 2 training courses on data mining – click here

Brendan Tierney

Good Oracle Data Mining Link & Book

The following link is a good resource giving details of various aspects of Oracle Data Mining. It is by BC Consulting.
http://www.dba-oracle.com/data_mining/

There is also a link to a book on Oracle Data Miner which covers the version of ODM for 10g, though some of the material in the book also applies to the 11g version. The book is by Dr. Hamm and is available from Rampant Books.

http://www.rampant-books.com/book_2006_1_oracle_data_mining.htm


Brendan Tierney