What is Data Mining ?

October 28, 2009

In this weeks topic I will explore what is Data Mining, the different meanings, how the term is used, etc. I will give you my interpretation of what it is and how other descriptions of data mining can be categorised.
Every article you read, every presentation you hear, etc. you get a slightly different description, or should it be that they hint to a description of how they use data mining in the products or their applications. By giving this hint at what data mining is they try to claim that they are using it, as it gives their products, applications and services a higher degree of sophistication compared to others. There is also the idea that it is a one of those trendy terms that is thrown out without them really knowing what it is really about.
Data Mining Definition
One of the most commonly cited definitions of what data mining is, “..it is the non-trivial extraction of previously unknown and potentially useful information from data” by Usama Fayyad et al (Chief Data Officer, Yahoo Inc) in their landmark paper back in 1996.
Based on this definition data mining is does not involve some basic analytics, decision making based on some defined rules, being able to identify events based on current data, etc. But these type of scenarios are typically talked about as being data mining. If we go back to the definition by Fayadd above, by say the “non-trivial” it means that we cannot write some code/queries to pull data out of our data that answers some simple questions. Another important part of the definition is “potentially useful information”, tells use that some times and may in a lot of cases, data mining does not give use anything useful. But it can give us useful information only if we have a good understanding of the data, the business rules of the data, the meta-data, how the rules and the data relate to each other, etc. All of this requires extensive experience of working with the data. Who is best at doing this, but database designers and developers. People with a statistics background (typical what you see in data mining roles) have to go and learn all about the data, the business rules, the meta-data etc. This can be a huge waste of time and resources as the database people are generally ignored.
Some examples
I was at an IT conference last week (I was co-author of a paper on Opinion Data Mining). One of the key note talks was given by a technical lead in IBM (one of two thousand in the company). He gave some good examples of how Business Intelligence (BI) could be used to manage the energy needs of a new city being build out in the middle east. He also gave another example of how BI is being used in and around Galway city and coast line. There were several mentions of data mining during his talk, but I don’t think any of his examples reflected what data mining is. Yes he did give examples of how you can intelligently use your data. For example, if an object is spotted out in Galway bay then you can predict where this object will come to shore. But data mining is not the technique that is used in this case. Instead it is a rules based type system, that takes into account a number of factors, link the size of the object, the current position, currents, wind direction, etc. Using these rules (and not data mining) they can identify the landing position and let all the necessary bodies know this (like the coast guard, Galway county council, environmental control, etc).
Generally data mining can be used when you have a mature BI environment in your organisation that includes not just transactional and business reporting, but also data warehousing, data analytics, prediction systems (based on rules), etc. Data mining allows you to explore for and identify patterns in your data (and you need lots of data really). Going back to the definition of data mining a lot of the results from a data mining project may not be of any value. What you are looking for are the nuggets of gold that exists in the data and you may take some time to fine these, if they exist at all.
One of the aims of this weeks posting was to explore what data mining really is. At this point I haven’t really talked much about what it is, but what I hope you have gotten so far is that the term data mining is overly used in the IT world and can be seen as one of those trendy words that organisation like use (and use incorrectly). Data mining is used as an umbrella term that covers any processing of your data that involves a bit a processing, applying some rules and some analytics.
Over the coming weeks we will explore what Data Mining really is and what are the different stages of a Data Mining project.
The next posting will be about CRISP-DM, which is a industry neutral, product neutral data mining life cycle.