
Endeca text enrichment. Entity extraction, sentiment analysis, and text tagging with Lexalytics customer defined lists. I love this one!

One of the interesting options of Endeca is its integration with the text mining software Lexalytics (licensed separately). Lexalytics offers many text analysis functions such as sentiment analysis, document collection analysis, named entity extraction, theme and context extraction, summarization, document classification, etc. Endeca exposes some of this functionality via its text enrichment component in Clover ETL. It is worth noting that not all of the text analytics functionality is exposed via text enrichment, and of those features that are exposed, only a limited number of methods of the Lexalytics API are available (more on that later). A great way of learning more about Lexalytics is to visit their website, wiki, and blog. I will post more on text analytics, and Lexalytics in particular, in one of my next posts.

Text tagging with Lexalytics

In my last post on Endeca we used the text tagger component to tag a list of 16K Irish IT job records with 30K skills extracted from LinkedIn. If you remember, we saw some dreadful performance and also realised that the text tagger component is not multi-threaded: it maxed out at 25% on a quad-core CPU. The text enrichment component also offers text tagging and, according to the documentation, is multi-threaded. So my expectation is that the tagging process will be a lot quicker. Apart from the skills tagging exercise we will also perform some entity extraction on the jobs data, focusing on People, Companies, and Products.

As a first step, we download Lexalytics from eDelivery.

We then add the license file that comes with the download to the Lexalytics root directory, in my case E:\Program Files (x86)\Lexalytics.

We then add the salience.properties file to our project. This configuration file sets the various properties that we can make use of, e.g. which types of entities we want to extract. As you can see from the screenshot below, we will extract entities of type Person, Company, Product, and List. The interesting one is the List entity: through it we can supply our own custom lists to the Lexalytics Salience engine.

Custom lists are made available to Lexalytics in the E:\Program Files (x86)\Lexalytics\data\user\salience\entities\lists folder as tab separated Customer Defined List (CDL) files. Each line contains a value and a label separated by a tab.

One big shortcoming of the Endeca text enrichment component is that it does not extract the label that can be defined in the customer defined list. For example, the following entry has the tab separated label Business Intelligence. The text enrichment component does not extract this label, even though it is exposed by the Lexalytics API. This also means that you can effectively only use one customer defined list per text enrichment batch, as all of your CDL hits will be dumped into the List field.

OBIEE, Cognos, Micro Strategy	Business Intelligence

Anyway, below is our tab separated custom list
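For illustration only, here is what a small CDL might look like, together with a Python sketch that reads it into a value-to-label lookup. The file name, the sample rows, and the helper function are assumptions for this example (not the actual 30K skills list); the lookup is simply one workaround for the component not returning the CDL labels.

# skills.cdl (illustrative rows only, one value<TAB>label pair per line):
#   OBIEE	Business Intelligence
#   Cognos	Business Intelligence
#   MicroStrategy	Business Intelligence

import csv

def load_cdl(path):
    """Read a tab separated Customer Defined List into a value -> label dict."""
    lookup = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                lookup[row[0].strip()] = row[1].strip()
    return lookup

# The text enrichment component dumps every CDL hit into the List field without
# its label, so we can map the hits back to their labels ourselves afterwards.
cdl = load_cdl("skills.cdl")
hits = ["OBIEE", "Cognos"]                      # example values from the List field
labels = sorted({cdl[h] for h in hits if h in cdl})
print(labels)                                   # ['Business Intelligence']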

Next we add the location of the Lexalytics license file and the data folder as variables to our project workspace parameter file.

As in the previous post, we read the scraped data of Irish IT jobs from a MySQL database. The number of records now stands at 21K. The data flow is a lot simpler than what we had to build for the text tagger.

We are now ready to configure our text enrichment component. It’s all pretty straightforward. We supply:

Configuration file: This is the path to the salience properties file

Input file: This is the field in our recordset that we want to extract entities from and use for tagging.

Salience license file: Path to Lexalytics license file

Salience data path: Path to Lexalytics data folder

Number of threads: Hoooraaaay! This component is multi-threaded. During my test runs it used one core and maxed out at 25% CPU with 1 thread, two cores and 50% CPU with 2 threads, and so on (see the sketch below).
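Purely to illustrate the threads-and-batches pattern described above, here is a generic Python sketch. This is not the Endeca component or the Salience API; the record list, batch size, and naive substring matching are assumptions.

from concurrent.futures import ThreadPoolExecutor

def tag_record(text, skills):
    # Naive tagger: return every skill that appears in the job description.
    lower = text.lower()
    return [skill for skill in skills if skill.lower() in lower]

def tag_batch(batch, skills):
    return [tag_record(text, skills) for text in batch]

def tag_all(records, skills, threads=4, batch_size=500):
    # Split the records into batches and hand the batches to a thread pool,
    # roughly how the text enrichment component spreads its work over CPUs.
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda batch: tag_batch(batch, skills), batches)
    return [tags for batch in results for tags in batch]

# Note: pure Python threads will not scale across cores the way the native
# Salience engine does (the GIL), so treat this as a sketch of the pattern only.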

In a last step we have to define the metadata and add it to the edge of the graph.

We are now ready to run the graph. If you read my last post on text tagging you will remember that tagging with the text tagger component took a whopping 12 hours to tag 16K records against 30K skills, roughly 22 records per minute. With the text enrichment component it took 20 minutes to tag 21K records, over 1,000 records per minute, and on top of that we also extracted Person, Product, and Company entities and did a bit of sentiment analysis as well.

Here are some of the results:

Conclusion

  • If you have Lexalytics licensed as part of your Endeca install, use it for text tagging rather than the Text Tagger component, which is pretty… well, lame.
  • Unlike the Text Tagger component, the Text Enrichment component (thanks to the underlying Lexalytics Salience engine) is a fine piece of software engineering. It is multi-threaded, and increasing the number of threads to 4 increased my CPU usage to 100%. It processes the incoming records in batches, and it was a joy to watch it clear out memory properly after each batch.
  • The text enrichment component only exposes a subset of the Lexalytics functionality. If you want to make use of the full potential of the Salience engine you need to write your own custom code.
  • The text enrichment component offers a lot of room for improvement: (1) it does not expose the full feature set of Lexalytics; (2) the implementation of CDLs should include label extraction. If you want to make use of the full Lexalytics functionality you will need to write a bit of custom code, which looks pretty straightforward to me.

    In the next posts we will load the data into the MDEX engine and start our discovery process. I wonder what we will find out???

     

    Oracle User Group Dublin, 12 March 2013. It’s FREE.

The annual Oracle User Group meeting in Dublin falls on a special day this year: the 12th of March. That's my birthday.

The agenda is live now. Unfortunately, there is only one data warehouse/business intelligence stream this year, so a lot of interesting presentations had to be left out.

    I will present with Maciek Kocon on “Oracle Data Integrator 11g Best Practices. Busting your performance, deployment, and scheduling headaches”.

Other presentations include Mark Rittman on deploying OBIEE in the enterprise, Peak Indicators on the concept and a case study of a BI Competency Centre (BICC), and a session on best practices for migrating from OWB to ODI.

    Oracle themselves also have a stream on the 12c database.

    If you are not interested in any of the above you may be interested in the free lunch.

    Register now. Only a few places remain.

    See you there!

    Exporting & Importing Oracle Data Miner (11gR2) Workflows

As with all development environments, there will be a need to move your code from one schema to another or from one database to another.

With Oracle Data Miner 11gR2 we have the same requirement. In our case it is not just individual procedures or packages: we have a workflow consisting of a number of nodes, and each node may apply a number of steps or functions to the data.

      Exporting an ODM (11gR2) Workflow

    In the Data Miner navigator, right-click the name of the workflow that you want to export.

The Save dialog opens. Specify a location on your computer where the workflow will be saved as an XML file.

    The default name for the file is workflow_name.xml, where workflow_name is the name of the workflow. You can change the name and location of the file.

      Importing an ODM (11gR2) Workflow

Before you import your ODM workflow, you need to make sure that you have access to the same data that is specified in the workflow.

    All tables/views are prefixed with the schema where the table/view resides.

    You may want to import the data into the new schema or ensure that the new schema has the necessary grants.
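If the data stays in the original schema, granting the importing account read access to each data source is usually enough. Here is a minimal sketch with cx_Oracle; the connection details, the importing account name DMUSER2, and the object list are assumptions for this example.

import cx_Oracle

# Placeholders: connect as the schema that owns the data sources.
conn = cx_Oracle.connect("dmuser", "password", "localhost/orcl")
cur = conn.cursor()

# Grant read access on every table/view the workflow uses to the importing account.
for obj in ["MINING_DATA_BUILD_V"]:               # extend with your own objects
    cur.execute("GRANT SELECT ON dmuser.{} TO dmuser2".format(obj))

conn.close()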

    Open the connection in ODM.

Select the project under which you want to import the workflow, or create a new project.

    Right click the Project and select Import Workflow.

    Search for the XML export file of the workflow.

    Preserve the objects during the import.

When you have all the data and the ODM workflow imported, you will need to run the entire workflow to ensure that everything is set up correctly.

    It will also create the models in the new schema.

      Data encoding in Workflow

All of the tables and views used as data sources in the exported workflow must reside in the new account.

The account from which the workflow was exported is encoded in the exported workflow. For example, say the workflow was exported from the account DMUSER and contains a data source node that uses MINING_DATA_BUILD_V. If you import the workflow into a different account (that is, an account that is not DMUSER) and try to run it, the data source node fails because the workflow is looking for DMUSER.MINING_DATA_BUILD_V.

    To solve this problem, right-click the data node (MINING_DATA_BUILD_V in this example) and select Define Data Wizard. A message appears indicating that DMUSER.MINING_DATA_BUILD_V does not exist in the available tables/views. Click OK and then select MINING_DATA_BUILD_V in the current account.
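Before re-running the workflow it can help to confirm that the view is actually visible to the new account. A small cx_Oracle sketch; the connection details are placeholders.

import cx_Oracle

conn = cx_Oracle.connect("dmuser2", "password", "localhost/orcl")  # placeholders
cur = conn.cursor()

# List every schema in which the workflow's data source is visible to this account.
cur.execute(
    "SELECT owner, view_name FROM all_views WHERE view_name = :v",
    v="MINING_DATA_BUILD_V",
)
for owner, view_name in cur:
    print("{}.{} is accessible".format(owner, view_name))

conn.close()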

      Video

I have created a video to accompany this blog post. It illustrates how you can export a workflow and import it into a new schema.

    Exporting and Importing Oracle Data Miner (11gR2) Workflows

    Make sure to check out my other Oracle Data Miner (11gR2) videos.

    http://www.youtube.com/user/btierney70

    Brendan Tierney


    Vote for my Oracle Open World Presentation on Oracle Mix

    I’ve submitted the following 3 presentations on Oracle Mix, to be included in the voting process.

    Voting finishes on 19th June.

Please vote for my presentations. As they say about elections in Ireland: vote early and vote often.


Oracle Data Mining has been available for many years now and has proven to be a powerful tool, but it seems to be overlooked in favour of longer established products that are a lot more expensive. Many companies have put significant work into developing their BI environments, but what can they do now to improve their organisational knowledge? This presentation will look at why a database developer is often better suited to doing data mining than someone with a PhD in statistics. Using the ODM tool and the CRISP-DM life cycle, it will demonstrate how a data mining project can be conducted.

With the release of the new Oracle Data Mining tool as part of SQL Developer, this presentation will look at how these two tools can be used in combination. In particular, it will focus on the Data Understanding stage of the CRISP-DM life cycle. Using the key elements of the Data Understanding stage, the presentation will show how a database developer can use the new features of the Oracle Data Mining tool in conjunction with SQL Developer to explore the data, with the aim of gaining key insights into it.

The new Oracle Data Miner 11g R2 tool makes it easier to develop your data mining models and workflows. A data mining project has two main stages. This presentation will look at how you take the data mining workflow and data mining model that you have developed using the new Oracle Data Miner 11g R2 tool. It will show you how to extract the SQL code from the workflow to perform the data transformations, how to execute the data mining model, how you can link these to your new data, and finally how you can apply the model.

    See my YouTube channel for my videos on the Oracle Data Miner 11g R2 tool
    http://www.youtube.com/user/btierney70/

    New Oracle Data Miner tool is now Available

Today the new Oracle Data Miner tool has been made available as part of SQL Developer 3.0 (Early Adopter Release 4).
    The new ODM tool has been significantly redeveloped, with a new workflow interface and new graphical outputs. These include graphical representations of the decision trees and clustering.
    To download the tool and to read the release documentation go to

    http://tinyurl.com/62u3m4y

    http://tinyurl.com/6heugsh

    If you download and use the new tool, let me know what you think of it.

    Data Mining Videos

Following on from my previous posting on online data mining books, this post contains links to some data mining videos that are available online and on my website.
    Future of Baseball
    CBC News Data Mining Resource page
    The New Data Economy
    Data Mining in Games
    Microsoft: Get Started with Data Mining
    Oracle Data Mining – Who Caught the Flu
    Journalism in the Age of Data
    Computer Science and Medicine
    Data Visualisation

And here are the links to the excellent Hans Rosling videos.
    World Data
    Data Visualisation
    BBC video – Data Visualisation

Yes, there are loads of videos online and on YouTube about data mining, but what I want to give you is a list of presentations that I have found over the past few years to be the most interesting and valuable.

    If you know of any other videos that you would like to share, then send me the details and I will update the post.

    There are a large number of lectures and presentations on data mining available at Video Lectures.

    Updated 11-Jan-11

The Open University has a site titled Joy of Stats that has a number of data analysis videos.

    The beauty of data visualization - TED talk

    Online Data Mining Books

Recently a new online data mining resource/book was made available by Dr. Saed Sayad at the University of Toronto.

He has developed a very good and easy to follow online book on the various parts of a data mining project, and in particular the various techniques. It also follows the CRISP-DM process.

    http://chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm

Some more online books that are a bit more technical:

    Elements of Statistical Learning by Stanford University

    http://www-stat.stanford.edu/~tibs/ElemStatLearn

2nd Edition of the Data Mining Book: Concepts, Techniques and Applications

    http://www.dataminingbook.com/

A popular academic book is Introduction to Data Mining by Tan et al. If you go to the publisher's website, the chapters on the main data mining techniques are available for free.

    Mining Massive Datasets – By Anand Rajaraman and Jeffrey D. Ullman

    For Visual Data Mining techniques the following are excellent resources

    Visual Analytics Book – www.vismaster.eu/book/

IBM Many Eyes Project – http://www-958.ibm.com/software/data/cognos/manyeyes/

    There is also the recent book by Tom Davenport called Analytics At Work.

    Book web site http://www.analyticsatworkbook.com/

    Updated 11-Jan-11

    Reactive Business Intelligence - Online Data Mining Book

    Information is Beautiful

    Exploring the Oracle Data Miner Interface

Once you have successfully installed Oracle Data Miner and established a connection, you will be presented with the main ODM window, with the Navigator pane on the left-hand side. All the tasks that you will need to do in ODM can be accessed from the Navigator pane or from the menu across the top of the window.

      The Navigator

Before commencing any data mining exercise you will want to explore your data. To do this, expand the Data Sources branch of the tree in the Navigator pane.

When expanded, you will get a list of all users in the database who have data visible to you or publicly to all users in the database. You only need to look for your Oracle Data Miner user; this is the one you created as part of the installation steps given previously. By clicking on your ODM user you will see the branch divided into Views and Tables. Expanding these branches shows the views and tables defined in your ODM user. These will be the ones created as part of the installation steps given previously.

      Loading/Importing Data

There are a number of ways of getting data into your ODM schema. As your ODM schema is like any other Oracle schema, you can use all the techniques you know to copy and load data into it. The ODM tool, on the other hand, has a Data Import option, which can be found under the Data menu. ODM uses SQL*Loader as the tool to load data into your ODM schema.

Before you select the Import option you will need to set the location of the SQL*Loader executable. To do this, select Preferences from the Tools menu (Tools -> Preferences). In the Environment tab, enter the location or browse for the executable.

When you have specified the location of SQL*Loader you can load CSV-type files into your ODM schema. Select the Import option from the Data menu. This will open the File Import Wizard. After the introduction screen you can enter the name of a text file or search for the file.


    In the example given here we will load a file of transactions from a grocery retailer for organic purchases.
On the next screen of the wizard you can specify what the delimiter is and whether the first record contains the field headers. The Advanced Settings button allows you to enter some of the SQL*Loader specific settings, like skip count, max errors, etc.


The next screen of the wizard allows you to view the fields imported from the first row of the organics.csv file. You may need to do some tidying up of the column names and change the data types and sizes for each attribute.

The next screen of the wizard allows you to specify the name of the table that the data should be inserted into. If a new table is needed, you can give the table name. If the data is to be appended to an existing table, that table can be selected from the drop-down list. In our example you will need to select that the data will go into a new table and give the new table a name, e.g. ORGANIC_PURCHASES. At this point you have entered all the details and you can proceed to the last screen of the wizard and select the Finish button. It will take a minute or so for the data to be loaded. Once the data is loaded, the table will appear in the Navigator pane under Tables and the table structure will appear in the main part of the ODM window. You can now select the Data tab to look at the data that has been loaded.
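ODM drives SQL*Loader under the covers, but purely as an illustration of what the load amounts to, here is a minimal cx_Oracle sketch that creates the table and inserts the CSV rows itself. The connection details, column names, and types are assumptions; match them to the header of your organics.csv file.

import csv
import cx_Oracle

conn = cx_Oracle.connect("dmuser", "password", "localhost/orcl")   # placeholders
cur = conn.cursor()

# Hypothetical structure for the organics data.
cur.execute("""
    CREATE TABLE organic_purchases (
        customer_id   NUMBER,
        gender        VARCHAR2(1),
        age           NUMBER,
        organics_flag VARCHAR2(1)
    )""")

with open("organics.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                                 # skip the header row
    rows = [tuple(r) for r in reader]

cur.executemany(
    "INSERT INTO organic_purchases VALUES (:1, :2, :3, :4)", rows)
conn.commit()
conn.close()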


      Exploring Data

Before beginning any data mining task we need to perform some data investigation. This allows us to explore the data and gain a better understanding of its values. We can discover a lot by doing this, and it can help us to identify areas for improvement in the source applications, to identify data that does not contribute to our business problem (this is called feature reduction), and to identify data that needs reformatting into a number of additional features (feature creation). A simple example of this is a date of birth field: on its own it provides no real value, but by creating a number of additional attributes (features) from it, we can use the date of birth to determine which age group a customer falls into.
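As a tiny sketch of that kind of feature creation (the age bands below are made up purely for illustration):

from datetime import date

def age_group(dob, today=None):
    # Derive an age band from a date of birth (simple feature creation).
    today = today or date.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if age < 25:
        return "Under 25"
    if age < 35:
        return "25-34"
    if age < 50:
        return "35-49"
    return "50+"

print(age_group(date(1985, 6, 1)))   # prints whichever band applies today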

To begin exploring the data in ODM, select the table or view in the navigation tree. Hover the mouse over the table/view name and right-click. From the menu that appears, select Show Summary Single-Record. We will come back to the other menu options when we cover the next topic, Data Transformations.


The Data Summarization screen presents some high-level information about your data. Taking our Organics data, we can explore the ranges of values for each attribute. There is a default setting of 10, which is the number of divisions (bins) that ODM will divide the data into. During the data exploration exercise you may want to vary this value from 5 to 20 to see if there are any groupings or trends in the data. The value can be changed by pressing the Select All button followed by the Preferences button.

The following couple of examples illustrate how the data exploration tool in ODM can help you discover information about your data in a simple graphical manner. The alternative is to log into SQL*Plus or Oracle Developer and write and execute queries to do something similar.

For the first example we will select the Gender attribute. Although we have our number of bins set to 10, we only get 5 bins displayed. The tool will try to divide the data into 10 equally spaced bins, but if insufficient data exists it will present the histogram with the existing set of distinct values.


From this example on the Gender attribute we can see that there are five distinct values: F for female, M for male, U for unknown, "" (space) for records that contain a space, and Other for records that contain a null. From this we can work out that there should probably only be three possible values (F, M, U), but we have 105 records with no proper value, which equates to just over 10% of the data sample. That is a sizeable number of records. So one of the steps in the Data Transformation phase (or data clean-up) is to work out what to do with these records. The options include removing them from the data set, working out what the correct value should be, or changing the value to U for unknown.
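The same check is easy to script if you prefer, for example with pandas; the file and column names here are assumptions.

import pandas as pd

df = pd.read_csv("organics.csv")

# Distinct GENDER values and their frequencies, nulls included.
print(df["GENDER"].value_counts(dropna=False))

# One possible clean-up: fold spaces and nulls into U (unknown).
df["GENDER"] = df["GENDER"].replace(" ", "U").fillna("U")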

For our second example we will look at the MINING_DATA_BUILD_V view (this can be found under the Views branch in the Navigator pane). Again, right-click the object and select Show Summary Single-Record to bring up the Data Summarisation window. One of the first things you will notice is that we get a lot more detail for each attribute.


Some of these extra details include the average, max, min, variance, and the number of null values. Again we can explore the data, changing the number of bins to varying sizes to see if there is any hidden information. For example, if we select AGE and set the number of bins to 10, we get a nice histogram showing that most of our customers are in the 31 to 46 age range. So maybe we should be concentrating on these.


Now if we change the number of bins to 30 we get a completely different picture of what is going on in the data. We can see that a number of important age groups stand out more than others. If we look at the 21 to 34 age range in the first histogram, there is not much change between the age bins. But when we look at the second histogram, with 30 bins, for the same 21 to 34 age range, we get a very different view of the data. In this second histogram we see that the number of customers varies a lot with age. What does this mean? Well, it can mean lots of different things, and it all depends on the business scenario. In our example we are looking at an electronic goods store. What we can deduce from this second histogram is that there are a small number of customers up to about age 21. Then there is a big jump. Is this due to people having obtained their first job after school and having some disposable income? This peak is followed by a drop-off in customers, followed by another peak, drop-off, peak, drop-off, and so on. Maybe we can build a profile of our customers based on their age, just like financial organisations do to determine what products to sell to us based on our age and life stage.
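For anyone who wants to script the equivalent exploration, a short pandas sketch; the source file and column name are assumptions.

import pandas as pd

df = pd.read_csv("mining_data_build.csv")

# 10 equal-width bins hide the detail; 30 bins show the peaks and troughs.
for bins in (10, 30):
    counts = pd.cut(df["AGE"], bins=bins).value_counts().sort_index()
    print("--- {} bins ---".format(bins))
    print(counts)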


From this histogram we can maybe categorise the customers into the following groups:

• Early 20s – out of education, first job, disposable income
• Late 20s to early 30s – settling down, own home
• Late 30s – maybe kids, so less disposable income
• 40s – maybe people are trading up and need new equipment. Or maybe the kids have turned into teenagers and are encouraging their parents to buy up-to-date equipment.
• Late 50s – these could be empty nesters whose children have left home and are setting up homes of their own, with their parents buying things for those homes. Or maybe the parents are treating themselves to new equipment as they have more disposable income.
• 60s+ – parents and grandparents buying equipment for their children and grandchildren. Or maybe we have very techie people who have just retired.
• 70+ – we see a drop off here.

As you can see, we can discover a lot in the data by changing the number of bins and examining the results. The important part of this examination is trying to relate what you are seeing in the graphical representation of the data on the screen back to the type of business we are examining. A lot can be discovered, but you will have to spend some time looking for it.

In my next posting, I will cover some of the Data Transformation functions that are available in Oracle Data Miner.

    Brendan Tierney

    What is Data Mining ?

In this week's topic I will explore what data mining is: the different meanings, how the term is used, etc. I will give you my interpretation of what it is and how other descriptions of data mining can be categorised.

In every article you read, every presentation you hear, etc., you get a slightly different description, or rather a hint of a description of how they use data mining in their products or applications. By hinting at what data mining is, vendors try to claim that they are using it, as it gives their products, applications, and services a higher degree of sophistication compared to others. There is also the sense that it is one of those trendy terms that gets thrown around without people really knowing what it is about.

    Data Mining Definition

One of the most commonly cited definitions of data mining is that "… it is the non-trivial extraction of previously unknown and potentially useful information from data", from Usama Fayyad et al. (now Chief Data Officer, Yahoo! Inc.) in their landmark paper back in 1996.

Based on this definition, data mining does not cover basic analytics, decision making based on some defined rules, or being able to identify events based on current data. But these types of scenarios are typically talked about as being data mining. If we go back to Fayyad's definition above, "non-trivial" means that we cannot just write some code or queries that pull data out and answer some simple questions. Another important part of the definition, "potentially useful information", tells us that sometimes, and maybe in a lot of cases, data mining does not give us anything useful. It can only give us useful information if we have a good understanding of the data, the business rules of the data, the metadata, how the rules and the data relate to each other, etc. All of this requires extensive experience of working with the data. Who is best placed to do this? Database designers and developers. People with a statistics background (typically what you see in data mining roles) have to go and learn all about the data, the business rules, the metadata, etc. This can be a huge waste of time and resources when the database people are ignored.

    Some examples

I was at an IT conference last week (I was co-author of a paper on opinion data mining). One of the keynote talks was given by a technical lead in IBM (one of two thousand in the company). He gave some good examples of how Business Intelligence (BI) could be used to manage the energy needs of a new city being built out in the Middle East. He also gave an example of how BI is being used in and around Galway city and its coastline. There were several mentions of data mining during his talk, but I don't think any of his examples reflected what data mining is. Yes, he did give examples of how you can use your data intelligently. For example, if an object is spotted out in Galway Bay, you can predict where it will come ashore. But data mining is not the technique used in this case. Instead it is a rules-based system that takes into account a number of factors, like the size of the object, its current position, currents, wind direction, etc. Using these rules (and not data mining) they can identify the landing position and let all the necessary bodies know (the coast guard, Galway county council, environmental control, etc.).

Generally, data mining can be used when you have a mature BI environment in your organisation that includes not just transactional and business reporting, but also data warehousing, data analytics, prediction systems (based on rules), etc. Data mining allows you to explore for and identify patterns in your data (and you really do need lots of data). Going back to the definition, a lot of the results from a data mining project may not be of any value. What you are looking for are the nuggets of gold that exist in the data, and it may take some time to find them, if they exist at all.

One of the aims of this week's posting was to explore what data mining really is. At this point I haven't really talked much about what it is, but what I hope you have gathered so far is that the term data mining is overused in the IT world and can be seen as one of those trendy words that organisations like to use (and use incorrectly). Data mining is used as an umbrella term that covers any processing of your data that involves a bit of processing, applying some rules, and some analytics.

    Over the coming weeks we will explore what Data Mining really is and what are the different stages of a Data Mining project.

The next posting will be about CRISP-DM, which is an industry-neutral, product-neutral data mining life cycle.