Competitive Business Intelligence: web scraping with Oracle.

January 6, 2009

In my opinion, one of the trends for Business Intelligence in 2009 (and the years to come) will be the integration of externally available data (data not found within the organisation itself, e.g. data in magazines, on the web, in libraries etc.) into the data warehouse and into an organisation’s business processes. Using BI to monitor the external environment in which an organisation operates will grow in importance for decision making.
“Decision makers […] need information about what is going on outside the organization as well as inside.[…] Macroenvironmental analysis […] examines the economic, political, social, and technological events that influence an industry”.
From: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales p.4.
However, the wider Business Intelligence community has not yet fully grasped this, as the following quote from an article on BI in one of the local business weeklies here in Dublin shows:
“BI tools are fundamentally about using data which an organisation already has – whether in databases, CRM systems, financial and accounting packages, ERP systems or elsewhere”.
This perspective is too narrow. While it is fundamental to use BI to mine and analyse data that an organisation owns, it is just as important to integrate data from external sources such as the web to improve internal decision making. Organisations that understand this requirement will have the edge over their competitors. To make informed decisions, executives need to be able to look at intra-organisational events as well as the competitive environment.
“Strategic management is the art and science of directing companies in light of events both inside and outside the organization. In addition to understanding their own operations, managers must understand the rest of the industry. For example, should a company try to be a low-cost producer or a best-cost producer? How can a company differentiate its product line? Should the focus be on the entire market or on a niche? Without understanding what others are doing, making decisions about these types of issues leads to unexpected results.”
From: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales.
Web mining, data mining and text mining techniques will be fundamental to implementing this new breed of BI.
In this series we will have a look at all three areas. In today’s article I will show you how we can implement web mining techniques with Oracle. In part two of this series we will then look at how we can use data mining techniques in general, and survival analysis in particular, to analyse macroenvironmental data from the web. Finally, in the third part we will look at how we can use text mining to classify and cluster the extracted data.
So what we will do today is harvest macroenvironmental business intelligence from real estate data. I thought it might be interesting to look at property-related data because of the recent bursting of the property bubble. The site we will extract data from is property.ie.
The information we harvest can be used to (amongst other things):
– Identify areas where houses sell the quickest (have a short survival time).
– Identify features of houses that sell the quickest.
– Find properties that are near other properties.
– Create a taxonomy/classification to browse properties by features.
– Monitor price increases or decreases.
– Use a combination of all of the above.
In the case study that follows I am using Oracle 11.1.0.6.

1. Create a user and assign the relevant permissions

Let’s log on as a DBA user, e.g. SYS, and execute the following statements:

SQL> create user real_estate identified by real_estate;
User created.
SQL> grant connect to real_estate;
Grant succeeded.
SQL> grant resource to real_estate;
Grant succeeded.
SQL> CREATE TABLESPACE reales
  2  DATAFILE 'D:\ORACLE\ORADATA\ORCL\REALES01.dbf' size 512M
  3  extent management local autoallocate;
Tablespace created.

This gives us the user real_estate with the connect and resource roles, plus the tablespace reales to hold our tables.
Next we need to create an Access Control List (ACL) for this user. The ACL allows access to the property.ie website but prevents access to any other website. Network ACLs are new in Oracle 11g. Oracle 10g has no network ACLs, so there you can skip the two ACL steps; the execute grant on utl_http further down is what controls access in that release.

SQL> begin
  2          dbms_network_acl_admin.create_acl (
  3                  acl             => 'utl_http.xml',
  4                  description   => 'Normal Access',
  5                  principal       => 'REAL_ESTATE',
  6                  is_grant       => TRUE,
  7                  privilege       => 'connect',
  8                  start_date    => null,
  9                  end_date      => null
 10          );
 11  end;
 12  /

On line 5 the principal needs to be in capital letters. Otherwise Oracle will return an error.
Next we assign the property.ie site to the ACL:

SQL> begin
  2      dbms_network_acl_admin.assign_acl (
  3      acl => 'utl_http.xml',
  4      host => 'www.property.ie',
  5      lower_port => 1,
  6      upper_port => 10000);
  7  end;
  8  /
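To confirm that the ACL exists and is assigned to the host, the data dictionary can be queried as the DBA user. This is only a sketch of one way to check; the ACL name and principal are the ones used above. As far as I know, the dbms_network_acl_admin calls are transactional in 11g, hence the commit first.

```sql
-- Run as the DBA user that created the ACL.
-- Assumption: in 11g the ACL changes only take effect once committed.
COMMIT;

-- Which hosts and port ranges is each ACL assigned to?
SELECT host, lower_port, upper_port, acl
FROM   dba_network_acls;

-- Which privileges does the ACL grant to our user?
SELECT acl, principal, privilege, is_grant
FROM   dba_network_acl_privileges
WHERE  principal = 'REAL_ESTATE';

-- check_privilege returns 1 (granted), 0 (denied) or NULL (not set).
SELECT DECODE(dbms_network_acl_admin.check_privilege(
                'utl_http.xml', 'REAL_ESTATE', 'connect'),
              1, 'GRANTED', 0, 'DENIED', 'NOT SET') AS connect_privilege
FROM   dual;
```

Since we only ever request plain HTTP pages, the port range in assign_acl could probably be narrowed to lower_port => 80 and upper_port => 80.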

Finally we grant execute permission on utl_http and dbms_lock:

SQL> grant execute on utl_http to real_estate;
Grant succeeded.
SQL>
SQL> GRANT EXECUTE ON dbms_lock TO real_estate;
Grant succeeded.
SQL> spool off
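A quick smoke test shows whether the grants and the ACL work together. The sketch below assumes the password chosen above and that the site still answers on plain HTTP; it simply fetches the front page and reports its length.

```sql
-- Connect as the new user (password as created above).
CONNECT real_estate/real_estate

-- Fetch the front page and return the number of characters received.
SELECT dbms_lob.getlength(
         HTTPURITYPE('http://www.property.ie/').getclob()
       ) AS page_length
FROM   dual;
```

If the ACL is missing or not assigned to the host, Oracle raises ORA-24247 (network access denied by access control list) instead of returning a number.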

2. Create Tables

Next we need to create the tables to store the extracted information.

SQL> CREATE TABLE seed_html (
  2     html CLOB
  3  )
  4  TABLESPACE REALES
  5  PCTFREE 0
  6  /
Table created.
SQL> CREATE TABLE seed (
  2     part_of_link VARCHAR2(30),
  3     num_pages NUMBER,
  4     num_property NUMBER,
  5     area VARCHAR2(30)
  6  )
  7  TABLESPACE REALES
  8  PCTFREE 0
  9  /
Table created.
SQL> CREATE TABLE property_html (
  2     part_of_link VARCHAR2(30),
  3     HTML CLOB,
  4     link VARCHAR2(255)
  5  )
  6  TABLESPACE REALES
  7  PCTFREE 0
  8  /
Table created.
SQL> CREATE TABLE property_description (
  2    property_id    NUMBER,
  3    prop_code      NUMBER,
  4    prop_desc      CLOB,
  5    activity_date  DATE,
  6    latitude       NUMBER,
  7    longitude      NUMBER
  8  )
  9  TABLESPACE REALES
 10  PCTFREE 0
 11  /
Table created.
SQL> CREATE TABLE property
  2  (
  3    PROPERTY_ID      NUMBER,
  4    LINK             VARCHAR2(1000),
  5    PROP_CODE        NUMBER,
  6    PRICE            NUMBER,
  7    ADDRESS          VARCHAR2(500),
  8    ROOMS            VARCHAR2(500),
  9    AREA             VARCHAR2(50),
 10    VALID_FROM_DATE  DATE,
 11    VALID_TO_DATE    DATE,
 12    DATE_REMOVED     DATE,
 13    VALID_IND        NUMBER,
 14    DELETE_IND       NUMBER
 15  )
 16  TABLESPACE REALES
 17  PCTFREE    10
 18  /
Table created.
SQL> CREATE TABLE PROPERTY_HELPER
  2  (
  3    LINK        VARCHAR2(1000),
  4    PROP_CODE   NUMBER,
  5    PRICE       NUMBER,
  6    ADDRESS     VARCHAR2(500),
  7    ROOMS       VARCHAR2(500),
  8    AREA        VARCHAR2(50),
  9    DELETE_IND  NUMBER
 10  )
 11  TABLESPACE REALES
 12  PCTFREE    0;
Table created.
SQL> CREATE TABLE PROPERTY_ATTRIBUTES
  2  (
  3    LINK       VARCHAR2(4000 BYTE),
  4    PROP_CODE  NUMBER,
  5    PRICE      NUMBER,
  6    ADDRESS    VARCHAR2(4000 BYTE),
  7    ROOMS      VARCHAR2(4000 BYTE),
  8    AREA       VARCHAR2(30 BYTE)
  9  )
 10  TABLESPACE REALES
 11  PCTFREE    0;
Table created.
SQL> CREATE SEQUENCE seq_property
  2    START WITH 1
  3    MAXVALUE 999999999999999999999999999
  4    MINVALUE 1
  5    NOCYCLE
  6    CACHE 20
  7    NOORDER
  8  /
Sequence created.

Because we will be dealing with very little data initially, I have not added any indexes to these tables. Once the volume of data grows and we have a better understanding of the query patterns, we should add the relevant indexes.
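As a rough idea of what such indexes might look like later on, here is a minimal sketch. The index names are made up, and the column choices assume that prop_code is the key used to look up and join properties and that queries filter by area and validity; the real query patterns may well call for something different.

```sql
-- Hypothetical indexes; names and column choices are assumptions.
CREATE INDEX idx_property_prop_code
  ON property (prop_code)
  TABLESPACE reales;

CREATE INDEX idx_property_area
  ON property (area, valid_ind)
  TABLESPACE reales;

CREATE INDEX idx_prop_desc_prop_code
  ON property_description (prop_code)
  TABLESPACE reales;
```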

3. Extract the property seed

Before we get stuck into things, I recommend you familiarise yourself with the functionality and navigation of the property.ie website. This will make it easier to understand what we will be dealing with in the next couple of sections. For the purpose of this exercise we will limit the extract process to properties in County Dublin, as we don’t want to put too much pressure on the property.ie web servers. At the same time, we want to gather enough information to perform some proper analysis, so we will include all areas in Dublin in our extract process. If you have a look at the front page of the property.ie website, you will see that each area also lists the number of properties available in it. This information will become relevant in the later stages of our extract exercise.
The procedure below extracts the part of the property.ie front page HTML that contains the areas and the number of properties in each area.

SQL> CREATE OR REPLACE PROCEDURE extract_seed_html
  2
  3  IS
  4
  5  -- exec  extract_seed_html
  6
  7  BEGIN
  8
  9     EXECUTE IMMEDIATE 'TRUNCATE TABLE seed_html';
 10
 11    -- utl_http.set_proxy([http://][user[:password]@]host[:port])
 12
 13     INSERT INTO seed_html
 14     SELECT TO_CLOB(to_clob(DBMS_LOB.SUBSTR (html,4000,5900)) || to_CLOB(DBMS_LOB.SUBSTR (html,4000,9900))) FROM (
 15        SELECT HTTPURITYPE('http://www.property.ie/').getclob() AS html FROM dual
 16     );
 17
 18     COMMIT;
 19
 20  END extract_seed_html;
 21  /
Procedure created.
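Line 14 of the procedure relies on DBMS_LOB.SUBSTR, whose argument order (amount first, then offset) is the reverse of the ordinary SUBSTR function. The small sketch below, using a made-up ten-character string, illustrates the difference.

```sql
SET SERVEROUTPUT ON

DECLARE
  l_doc CLOB := 'ABCDEFGHIJ';  -- made-up sample text
BEGIN
  -- DBMS_LOB.SUBSTR(lob, amount, offset): 3 characters starting at position 5
  dbms_output.put_line(dbms_lob.substr(l_doc, 3, 5));  -- EFG

  -- SUBSTR(string, position, length): the same slice, arguments swapped
  dbms_output.put_line(substr(l_doc, 5, 3));           -- EFG
END;
/
```

So DBMS_LOB.SUBSTR(html,4000,5900) returns the 4000 characters starting at position 5900, and the second call continues at position 9900. Together they cover one contiguous 8000-character window. Two calls are needed because each call returns a VARCHAR2, which is limited to 4000 bytes in SQL. The fixed offsets depend on the page layout at the time of writing and will need adjusting if the front page changes.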

On line 11 I have commented out the use of a proxy server. If you are using a proxy or want to anonymize your requests, remove the comment and fill in your proxy info such as username, password, host, and port.
On line 15 we use the HTTPURITYPE object type to retrieve the HTML code of the property.ie front page; line 14 then cuts out the portion that contains the property area dropdown. Internally, HTTPURITYPE uses the utl_http package.
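For comparison, the same page can be fetched with utl_http directly, which gives more control over the request (headers, proxy, status code). This is only a sketch and not part of the extract process; the proxy host in the comment is a placeholder, and the chunk size of 8000 characters is an arbitrary choice.

```sql
SET SERVEROUTPUT ON

DECLARE
  l_req   utl_http.req;
  l_resp  utl_http.resp;
  l_chunk VARCHAR2(32767);
  l_html  CLOB;
BEGIN
  dbms_lob.createtemporary(l_html, TRUE);

  -- Placeholder proxy; uncomment and adapt if needed.
  -- utl_http.set_proxy('proxy.example.com:8080');

  l_req  := utl_http.begin_request('http://www.property.ie/');
  l_resp := utl_http.get_response(l_req);

  BEGIN
    LOOP
      -- Read the body in chunks and append each chunk to the CLOB.
      utl_http.read_text(l_resp, l_chunk, 8000);
      dbms_lob.writeappend(l_html, LENGTH(l_chunk), l_chunk);
    END LOOP;
  EXCEPTION
    WHEN utl_http.end_of_body THEN
      -- Raised once the whole body has been read; close the response.
      utl_http.end_response(l_resp);
  END;

  dbms_output.put_line('Status: ' || l_resp.status_code);
  dbms_output.put_line('Characters read: ' || dbms_lob.getlength(l_html));
  dbms_lob.freetemporary(l_html);
END;
/
```

For a simple one-off fetch like ours, HTTPURITYPE is the shorter option; the explicit version is mainly useful when a request needs custom headers or error handling.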

HTML                                                                              OCCURENCE
-------------------------------------------------------------------------------- ----------
<select name="s[a_id][]">                                                                 1
<option value="">All areas</option>
</select>
<select name="s[a_id][]">                                                                 2
<option value="">All areas</option>
</select>
<select name="s[a_id][]">                                                                 3
<option value="">All areas</option>
</select>
<select name="s[a_id][]">                                                                 4
<option value="">All areas</option>
</select>
<select id="area" name="s[a_id][]">                                                       5
<option value="">All areas</option>
</select>

HTML OCCURENCE

-------------------------------------------------------------------------------- ----------<select name="s[a_id][]"> 1