Endeca text enrichment. Entity extraction, sentiment analysis, and text tagging with Lexalytics customer defined lists. I love this one!

One of the interesting options of Endeca is its integration with the text mining software Lexalytics (licensed separately). Lexalytics offers many text analysis functions, such as sentiment analysis, document collection analysis, named entity extraction, theme and context extraction, summarization, document classification, etc. Endeca exposes some of this functionality via its text enrichment component in Clover ETL. It is worth noting that not all of the text analytics functionality is exposed via text enrichment, and even for the features that are exposed, only a limited number of methods of the Lexalytics API are available (more on that later). A great way of learning more about Lexalytics is to visit their website, wiki, and blog. I will post some more on text analytics, and Lexalytics in particular, in one of my next posts.

Text tagging with Lexalytics

In my last post on Endeca we used the text tagger component to tag a list of 16K Irish IT job records with 30K skills extracted from LinkedIn. If you remember, we saw some dreadful performance and also realised that the text tagger component is not multi-threaded and maxed out at 25% on a quad-core CPU. The text enrichment component also offers text tagging and, based on the documentation, is multi-threaded. So my expectation is that the tagging process will be a lot quicker. Apart from the skills tagging exercise we will also perform some entity extraction on the jobs data, focusing on People, Companies, and Products.

In a first step we download Lexalytics from edelivery.

We then add the license file that comes with the download to the Lexalytics root directory. In my case E:\Program Files (x86)\Lexalytics

We then add the salience.properties file to our project. This configuration file sets the various properties that we can make use of, e.g. which types of entities we want to extract. As you can see from the screenshot below, we will extract entities of type Person, Company, Product, and List. The interesting one is the List entity: through it we can make our own custom lists available to the Lexalytics Salience engine.

Custom lists are made available to Lexalytics in the E:\Program Files (x86)\Lexalytics\data\user\salience\entities\lists folder as tab separated Customer Defined List (CDL) files, where each entry contains a value and a label separated by a tab.

One big shortcoming of the Endeca text enrichment component is that it does not extract the label defined in the customer defined list. The following entry, for example, has the tab separated label Business Intelligence; the text enrichment component does not extract this label even though it is exposed by the Lexalytics API. This also means that you can effectively only use one customer defined list per text enrichment batch, as all of your CDLs will be dumped into the same List field.

OBIEE, Cognos, Micro Strategy    Business Intelligence

Anyway, below is our tab separated custom list
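
For illustration, the contents of such a file could look roughly like the following, with a tab between the value(s) and the label (the second entry is made up):

OBIEE, Cognos, Micro Strategy    Business Intelligence
ODI, OWB, Informatica    Data Integration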

Next we add the location of the Lexalytics license file and the data folder as variables to our project workspace parameter file.

As in the previous post we read the scraped data of Irish IT jobs from a MySQL database. The number of records now stands at 21K. The data flow is a lot simpler than what we had to build for the text tagger.

We are now ready to configure our text enrichment component. It’s all pretty straightforward. We supply:

Configuration file: This is the path to the salience properties file

Input file: This is the field in our recordset that we want to extract entities from and use for tagging.

Salience license file: Path to Lexalytics license file

Salience data path: Path to Lexalytics data folder

Number of threads: Hoooraaaay! This component is multi-threaded. During my test runs it used 1 core and maxed out at 25% CPU for 1 thread, 2 cores and maxed out at 50% CPU for 2 threads and so on.

In a last step we have to define the metadata and add it to the edge of the graph.

We are now ready to run the graph. If you read my last post on text tagging you will remember that tagging 16K records against 30K skills with the text tagger component took a whopping 12 hours. With the text enrichment component it took 20 minutes to tag 21K records, and on top of that we also extracted Person, Product, and Company entities and did a bit of sentiment analysis.

Here are some of the results:

Conclusion

  • If you have Lexalytics licensed as part of your Endeca install, use it for text tagging rather than the Text Tagger component, which is pretty…, well, lame.
  • Unlike the Text Tagger component, the Text Enrichment component (thanks to the underlying Lexalytics Salience engine) is a fine piece of software engineering. It is multi-threaded, and increasing the number of threads to 4 increased my CPU usage to 100%. It processes the incoming records in batches, and it was a joy to watch how it cleared out the memory properly after each batch.
  • The text enrichment component only exposes a subset of the Lexalytics functionality. If you want to make use of the full potential of the Salience engine you need to write your own custom code.
  • The text enrichment component offers a lot of room for improvement: (1) it does not expose the full feature set of Lexalytics, and (2) the implementation of CDLs should include label extraction. If you want to make use of the full Lexalytics functionality you would need to write a bit of custom code, which looks pretty straightforward to me.

    In the next posts we will load the data into the MDEX engine and start our discovery process. I wonder what we will find out???

     

    Endeca text tagging: Tagging unstructured IT jobs data against a whitelist of LinkedIn skills

    A couple of days ago I started looking at Endeca. So far it looks like a great tool and I believe we will see and hear a lot more of it over the next couple of months and years.

    I am still learning a lot, and over the next couple of weeks I will document my progress in a couple of blog posts. In this first post we will tag some unstructured data and then create a multi-value attribute, which is one of the great features of Endeca.

    In this example we look at some Irish IT job offerings data that has been extracted from the web. One of the fields of this data set is a description field. We will tag this unstructured data with a whitelist of skills extracted from LinkedIn. There are currently 30K skills on LinkedIn, which resulted in some performance issues when using the text tagger in Clover ETL. More on that later.

    Irish IT jobs data set (about 16K records collected over the last 3 months).

    LinkedIn skills data set

    The Irish jobs data sits on a MySQL server. The skills data set is in an Excel sheet. The end result of our ETL will look like this.

    Note: You need to name the input fields SearchTerm and TagValue respectively. Any other name will throw an error when loading via the Text Tagger component.
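
    For illustration, the whitelist sheet then looks roughly like this; the two column headers are the mandatory field names, and the skill rows are made up:

    SearchTerm                TagValue
    OBIEE                     OBIEE
    Oracle Data Integrator    ODI
    Java                      Java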

    We first add a connection to the MySQL database

    Next we create a graph and add the Irish Jobs and LinkedIn metadata to it.

    Note: A data flow/mapping in CloverETL is called a graph.

    And the LinkedIn metadata

    Next we add a DB_INPUT_TABLE and an XLS_DataReader and configure them

    Next we add the Text Tagger – Whitelist component from the Discovery palette and connect the Irishjobs_IE input table and the XLSDataReader for the LinkedIn skills to it. The Text Tagger takes two parameters as input: the unstructured text field that needs to be tagged and the name of the output field. In our case the Source Field Name is Description and the Target Field Name is DescriptionSkills. We still need to set up the latter as metadata in the next step.

    Note: I have set the multi-assign delimiter to “,”. This will separate the tagged skills as a comma separated list. In the Endeca Server we can treat this then as a multi-value attribute.

    Next we need to create the metadata for the output from the Text Tagger and add the DescriptionSkills field to it. We do this by duplicating the metadata from the irishjobs_ie input table and manually adding the field DescriptionSkills to it.

    Next we add a Bulk Add/Replace Records component to the graph, connect the Text Tagger to it, and add the newly created metadata to it. In the properties of the Bulk Add/Replace Records component we specify the Spec Attribute (unique key) and the multi-assign delimiter. In our case that is a “,” (as specified earlier in the Text Tagger). The multi-assign delimiter is used to split the chain of comma separated values. What I find strange here is that you need to define one delimiter for all of your attributes. I would really expect to specify this at the attribute level rather than the dataset level.

    ETL Performance problems

    Once this is done we can run the ETL. When I first ran this with 30K records in the skills whitelist the whole thing was crawling: the 16K records took 12 hours. I noticed that Clover was only using one core of my quad-core CPU at a time (maxed out at 25% CPU usage), an indication to me that multi-threading is not implemented, either in the Text Tagger component itself or in Clover. I had a look at the Clover website and it states that Clover is multi-threaded, so my conclusion for the moment is that either the Text Tagger component or the Desktop version of Clover is not multi-threaded. Can anyone please shed some light on this?

    I then experimented a bit. What I found was that splitting the whitelist file into multiple smaller ones and then sending the jobs data through the Text Tagger sausage machine improved performance significantly. What that means is that Text Tagger performance is not linear, i.e. doubling the number of whitelist records more than doubles the elapsed time for tagging.

    Below you see how the split is implemented in Clover. This should give you an idea of how it can be done. I am sure this can be implemented in a more elegant way using splitters and loops and the like. The takeaway, though, is that in order to get the best performance you need to play around with your whitelist sizes.

    Note: The Reformat component is used to concatenate the resulting skills output fields from the Text Tagger components.

    The result for 200 sample jobs for a subset of the skills (letters A-E) is as follows.

    Oracle User Group Dublin, 12 March 2013. It’s FREE.

    The annual Oracle User Group meeting in Dublin falls on a special day this year: the 12th of March. That’s my birthday.

    The agenda is live now. Unfortunately, there is only one data warehouse/business intelligence stream this year. A lot of interesting presentations had to be left out.

    I will present with Maciek Kocon on “Oracle Data Integrator 11g Best Practices. Busting your performance, deployment, and scheduling headaches”.

    Other presentations include Mark Rittman on deploying OBIEE in the enterprise, Peak Indicators on the concept and a case study of a BI Competency Centre (BICC), and a presentation on best practices in migrating from OWB to ODI.

    Oracle themselves also have a stream on the 12c database.

    If you are not interested in any of the above you may be interested in the free lunch.

    Register now. Only a few places remain.

    See you there!

    Extreme re-usability in ODI 11g (I can’t believe I am giving away this trick).

    With another Christmas and a second baby under my belt, it’s time to get back to blogging.

    There were recently some good posts by David Allan and Gurcan Orhan on the power of ODI user functions. David mentions correctly that ODI user functions are the most underrated feature in ODI. Whenever you think of re-usability in ODI, think of user functions. Functions can be used anywhere in ODI where you need to write something once and apply it many times. This is not limited to using functions inside interfaces to load columns, or to wrapping and parameterizing common functionality. You could just write a snippet of SQL, e.g. a commonly used filter in a WHERE clause, as a function and reuse it many times. The nice thing about functions is that you can use substitution method API calls, Java, or the ODI tools inside a function.

    Use case for re-use of user functions

    One of the things that has puzzled me about ODI and Knowledge Modules is that some of the steps in the KMs are repeated and re-used over and over again without a central placeholder. Why not write the step once and re-use it many times across the Knowledge Modules? This is exactly what I will show you. We will take table stats gathering as an example. One of the steps in Knowledge Modules is to gather table stats once the target table has been loaded. Why not create a user function that gathers table stats on the target table and takes the estimate percent size as a parameter?

    In a first step let’s create the function. As a recommendation, I would suggest prefixing your functions with a naming convention. In a first implementation I used GATHER_TABLE_STATS as the syntax, which wreaked havoc with any KMs that use the Oracle GATHER_TABLE_STATS procedure in package dbms_stats.
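
    With a prefix, the function syntax could then look something like the line below (the UF_ prefix and the name are purely illustrative; sample_size is the parameter used in the implementation that follows):

    UF_GATHER_TABLE_STATS($(sample_size))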

    And the implementation

    BEGIN
      dbms_stats.gather_table_stats (
        ownname          => '<%=odiRef.getInfo( "DEST_SCHEMA" )%>',
        tabname          => '<%=odiRef.getTargetTable("RES_NAME")%>',
        degree           => DBMS_STATS.AUTO_DEGREE,
        estimate_percent => $(sample_size),
        cascade          => TRUE
      );
    END;
    

    Next we will use the function as a step in a (global) Knowledge Module and pass in 10 as the parameter for the estimate_percent part of the stats gathering. In this particular case I have modified the IKM SQL Control Append.
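
    Sticking with the illustrative function name from above, the KM step body then boils down to a single call:

    UF_GATHER_TABLE_STATS(10)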

    Then we create and execute an interface that uses this IKM

    As you can see the function was substituted at runtime and stats were gathered on the target table.

    This is just one example of where you can benefit from user function re-usability. Be creative and think outside the box and you will see user functions written all over the place.

    How can you launch an ODI scenario through a web service call?

    ODI and Jetty

    The ODI 11g standalone agent now ships with its own lightweight application server (Jetty). The main reason it was included is to make it easier to execute scenarios via web service calls. In the past this was quite painful, as you needed a separate application server (OC4J). The other limitation was that it was difficult (read: required a workaround) to implement asynchronous web service calls. Luckily, this has all changed with ODI 11g and Jetty. However, I don’t believe that you can run the ODI console in Jetty, which is unfortunate.

    Asynchronous web service calls

    The WSDL can be found at http://<standaloneagentname>:<port>/oraclediagent/OdiInvoke?wsdl
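
    With the default standalone agent port and a made-up host name this could look like the following (adjust host and port to your own agent):

    http://myodiagent:20910/oraclediagent/OdiInvoke?wsdl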

     

    We can call the web service methods from any SOAP Client, e.g. the OdiInvokeWebService tool built into ODI.

    We need to provide our login, the work repository for execution, the name of the scenario, its version, and the context. You can also specify whether you want the scenario to be executed synchronously or asynchronously. If you set Synchronous to false, control is handed back to the client immediately. If you set it to true, the call executes the scenario and only then returns control. For asynchronous executions you can use the session ID from the response file and the getSessionStatus method to check the status of your scenario execution.

    The other thing you have to be aware of is that when you launch a scenario through a web service call, it forks out a new agent based on your odiparams.bat. You need to plan for the extra memory this uses.

    You can now easily orchestrate your process flows in BPEL or launch them from OBIEE.

    Need ODI Training? Learn ODI from the experts!

    I am proud to announce that Sonra are now offering ODI classes.

    Value-priced. Customised. Onsite-Offsite. Online.

    Five day standard training course.
    Online.
    Starts 24 February 2014.
    Book before 15 January 2014. Pay $949.

    More information on our ODI courses.

    Exadata, Exalytics, SAP HANA and the multi billion dollar question.

    Recently there has been a lot of noise around in memory databases and how Exadata, Exalytics and SAP HANA compare to each other. Here are my two cents on the debate.

    Why you can’t compare Exalytics to SAP HANA

    It is the vision of SAP HANA to be used for both OLTP and analytics. As the name already suggests, Exalytics caters for analytics only. Exalytics needs to be loaded from the data warehouse or the transactional systems; HANA doesn’t need this, as everything already sits in memory. While Exalytics is near realtime, HANA is realtime. However, there are currently not too many OLTP applications running on HANA: applications need to be adapted or rewritten to make full use of HANA’s new architecture. This seems about to change, as SAP has recently released a service pack for HANA that will allow just that. However, the claim of being able to run OLTP and analytics in the same in-memory database remains somewhat unproven.

    Why you can’t compare Exadata to SAP HANA

    Exadata V3 now ships with 4 TB of RAM. Contrary to the claims of Oracle, this does not make it an in-memory database. It falls short on two important counts:

    - Most of the data still sits on disk (mostly SSD, but still disk).
    - It lacks the optimized algorithms and design features (in-memory indexes etc.) that have been specifically designed for in-memory access.

    How does HANA compare to Exadata/Exalytics then?

    Well, it doesn’t. Oracle (or anyone else for that matter) still has to come up with a product that can be compared to HANA.

    Use cases of SAP HANA

    The number one use case for SAP HANA is realtime operational business intelligence. True realtime BI now seems a distinct possibility. In this use case HANA represents the Interactive Sector as defined by Inmon et al. in their DW 2.0 book.

    The other use case is similar to what Exalytics is used for: a performance booster for the data warehouse. In this scenario we load or replicate data into the in-memory database, either directly from our OLTP systems or from the data warehouse. This is what we have predominantly seen HANA being used for so far. However, it falls well short of the vision and long-term strategy, namely to change the way we do databases.

    The third use case, and the end game for HANA, is to run OLTP, the data warehouse, and analytics all on HANA. This is not very realistic at the moment; DRAM is still too costly.

    What happens next?

    This is the billion dollar question. Curt Monash thinks:

    “Putting all that together, the analytic case for SAP HANA seems decently substantiated, there are years of experience with the technology and its antecedents, and column stores (including in-memory) are well-established for analytics via multiple vendors. The OLTP case for HANA, however, remains largely unproven. It will be interesting to see how it plays out.”

    It will indeed be interesting to see how this plays out. SAP are currently heavily promoting HANA. There is a developer license available. You can rent an instance on AWS. SAP are pushing it hard for start-ups. I am sure that it will replace Oracle in many SAP implementations as the underlying database. It remains to be seen, however, if it can eat into the wider database market (the ultimate SAP objective). The other interesting question is: when will Oracle come up with a product that can compete head on with HANA? I believe there is plenty of time. The game hasn’t really started yet.

    In the meantime Microsoft have also announced an in-memory database named Hekaton for release in 2014/15. It seems to be OLTP only and, from what I have read, a bit of a joke. Interesting times indeed, though.

    The wider picture and Google

    At the moment we are seeing some tectonic shifts in the technology space that we haven’t seen in a generation. New technologies like Hadoop, Impala, IMDBs, NoSQL, Cloud etc. are emerging that can handle ever bigger data at ever faster speed. These innovations will have knock-on effects on the way we architect information systems and on enterprise architecture in general. I even believe that they will ultimately change the way we do business. Unknown to many, a lot of these innovations are pioneered by Google. Papers on MapReduce and the Google File System kicked off Hadoop. Google BigTable inspired a lot of the NoSQL databases we have seen recently. You may have heard of Impala from Cloudera. Again a brainchild of Google, based on Google Spanner and probably more so their in-house RDBMS F1.

    I would expect more innovation to come from that corner. After all, Google is at the epicentre of the Big Data problem. They already have their own offering, BigQuery, which recently left the beta phase. It doesn’t look like much right now, but I expect them to up their game.

    Of course you can ignore those trends but you do so at your own peril.

    If you want to find out more about IMDBs and SAP HANA, I recommend reading In-Memory Data Management: Technology and Applications by Hasso Plattner, one of the co-founders of SAP.

    Tricks with SQL: Beware of the Predicate in the Outer Join

    Today we will have a look at what happens when we place a filter condition inside an outer join. We will also look at how this compares to placing the filter condition in the WHERE clause. Finally, we’ll look at where this could be useful.

    Let’s first create some sample data.

    create table customer (
    cust_id number,
    cust_desc varchar2(50)
    );
    
    ALTER TABLE customer
    add CONSTRAINT pk_customer PRIMARY KEY (cust_id);
    
    create table customer_loc (
    customer_loc_id NUMBER,
    customer_id NUMBER,
    customer_loc_desc VARCHAR2(50)
    );
    
    ALTER TABLE customer_loc
    add CONSTRAINT pk_customer_loc PRIMARY KEY (customer_loc_id);
    
    insert into customer values (1,'Gold');
    insert into customer values (2,'Gold');
    insert into customer values (3,'Silver');
    
    insert into customer_loc values (1,1,'Dublin');
    insert into customer_loc values (2,2,'Paris');
    insert into customer_loc values (3,4,'Berlin');
    
    commit;
    
    analyze table customer compute statistics;
    analyze table customer_loc compute statistics;

    Let’s run our first outer join query, where we put the filter condition into the WHERE clause. All is as expected: the query returns just customer_id 1.
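
    A minimal sketch of such a query against the sample tables above (the column used in the filter is an assumption):

    select a.cust_id, a.cust_desc, b.customer_loc_desc
    from customer a
    left outer join customer_loc b on a.cust_id = b.customer_id
    where b.customer_id = 1;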

    Now let’s run the same query, but this time we put the filter condition into the join. The result may be different from what you expected: the query returns all of the rows in the customer table, plus those rows of table customer_loc for which the full join condition is met, i.e. the rows with customer_id = 1.
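
    The same sketch with the filter moved from the WHERE clause into the join condition:

    select a.cust_id, a.cust_desc, b.customer_loc_desc
    from customer a
    left outer join customer_loc b on a.cust_id = b.customer_id and b.customer_id = 1;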

    Let’s verify what we have just seen with another query. This time we put customer_id = 4 into the join condition. Since there is no customer with cust_id = 4, the combined join condition can never be satisfied. As expected, the query returns all rows of table customer but no rows of table customer_loc.
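
    And the verification query, again as a sketch:

    select a.cust_id, a.cust_desc, b.customer_loc_desc
    from customer a
    left outer join customer_loc b on a.cust_id = b.customer_id and b.customer_id = 4;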

    What could this be useful for? One use case would be a limited form of data densification, where you need to return all of the rows in one table but only a subset of the rows in another table. Typically this can only be done using some sort of subselect. An example:

    The query below is run in the SH schema and returns all of the customers, but only those sales transactions that are larger than 1000 in sales_amount.

    select a.cust_id,amount_sold
    from customers a 
    left outer join sales b on a.cust_id = b.cust_id and amount_sold > 1000;
    

    Making use of ODI Flexfields to meet requirements

    What are ODI Flexfields?

    In ODI you can create user-defined fields on certain objects. You can think of these fields as additional attributes for those objects. At design time you populate these attributes with values that are then used at runtime, e.g. by a Knowledge Module. Various Flexfields are defined out of the box for very specific requirements, e.g. there are Flexfields on the Datastore object for SAP and HIVE data integration tasks.

    Where do you create them?

    You create Flexfields in the Security module under the Objects accordion. You can’t create Flexfields for all of the objects: while you can create a Flexfield for an Interface, you can’t create one for an Interface Target Table. Once you have created the Flexfield you can populate it with values in Designer.

    An example

    You have a requirement to log errors to an error table using a CKM. Based on the severity of the error you want to decide whether records are allowed through to your target table: DQ checks that result in minor errors are logged in the error table and still go through to the target, while records with more severe issues only go to the error table and do not end up in the target table.

    In a first step we define a Numeric type Flexfield on the Condition object. We name this field Passthrough.

    When you create your DQ Condition in Designer you can then set the value of the Flexfield. The value 1 will allow a record to go through to the target table even if it violates the DQ check.

    In a next step we need to write some code in our CKM to make use of the Flexfield and implement the above logic.

    We store the value of the Flexfield in a variable passthrough and flag all of the records with the passthrough value.

    <? passthrough="<%=odiRef.getFlexFieldValue("" + odiRef.getCK("ID") + "","2500","PASSTHROUGH_COND")%>".replaceAll("'","''"); ?>

    As you can see from the figure below, 2500 is the internal ID of object Condition in the ODI repository.

    ODI Snippets: What is the Optimization Context in ODI?

    There are a lot of confusing messages out there on what the Optimization Context in ODI is used for.

    This is the setting I am talking about

    Is it a runtime setting??

    First of all, this is not a runtime or execution context setting. It is only used at design time in ODI Studio. If you execute an interface in your Test context with the Optimization Context set to Development, it will still execute in Test.

    It is a design Time Setting!

    So what is the Optimization Context used for? ODI uses the Optimization Context to validate your interface at design time. This means it will validate your filters, functions etc. against the environment set in your Optimization Context.

    You can easily try this out.

    • Set your Optimization Context to Development
    • Go to your source table, right click, and select Data…
    • This will display the Data in the source table in your Development environment
    • Now change the Optimization Context to Test.
    • Go back to your source table and select Data… again
    • This will now display the data in your test environment

    What is this useful for?

    Personally I can’t really think of anything, but let me know if you are using this feature.

    How to query a whole schema or even database?! This tip will save you hours and hours of boring work.

    I really like this tip, especially for its simplicity. It has saved me hours and hours of mind-numbing work in the past.

    The use case

    Imagine the following situation. You are tasked with completing a source to target map. With the help of various source system SMEs, legacy data models, and the data dictionary of the source database you have been able to complete 99% of the mappings. However, there is that one field that is nowhere to be found in the source system (there’s always one). The business users have given you some sample values for this field. Wouldn’t it be nice to run a query such as SELECT column_name FROM schema/database WHERE value = ‘Sample Value’?

    I will show you how you can achieve something similar with ODI.

    The trick is to dump the content of your schema or database into a folder and then use grep, Windows search, or whatever you like to search for your sample value.

    In ODI this is extremely easy. Using the Command on Source/Target functionality it can be completed with two lines of code.

    An example

    Our objective is to find the table where ODI stores Option values for interfaces.

    In a first step we create the procedure that dumps the content of the ODI schema to a folder

    Command on Source

    We query the data dictionary for those tables that are owned by the ODI work repository user. If you want to dump out the whole database just get rid of the WHERE clause.

    SELECT table_name, owner FROM all_tables WHERE owner = '<name of your work schema>'

    Command on Target

    We use the tool OdiSqlUnload to unload those tables from the Command on Source recordset. For each table we create a separate text file.

    OdiSqlUnload "-FILE=C:\unload\#table_name.txt" "-DRIVER=oracle.jdbc.driver.OracleDriver" "-URL=<your connection>" "-USER=#owner" "-PASS=<your password>" "-FILE_FORMAT=VARIABLE" "-FIELD_SEP=;" "-ROW_SEP=\r\n" "-DATE_FORMAT=yyyy/MM/dd HH:mm:ss" "-CHARSET_ENCODING=ISO8859_1" "-XML_CHARSET_ENCODING=ISO-8859-1"

    SELECT * FROM #table_name

     

    In a second step we enter a dummy value for one of the options of our interfaces. This will be the value that we will search for in the dump.

    Our dummy value is XXXXXX.

    Next we run the procedure. This will dump the table contents into files.

    Finally, we use Windows Search, grep or similar to find the file where our sample value is stored.

    Windows search comes back with one result: the last column of table SNP_UE_USED, which is I_TXT_VALUE.

    I am sure you can think of many more use cases for this. Of course, you can make all of this more sophisticated and wrap it up in a package with options to run against different databases such as MS SQL Server etc.

    For very big databases you will need to watch performance. You may want to extract only a subset of tables and search for your sample value there before moving on to the next set, and so on.