In my opinion, one of the trends for Business Intelligence in 2009 (and the years to come) will be the integration of externally available data (data not found within the organisation itself, e.g. data in magazines, on the web, in libraries etc.) into the data warehouse and into an organisation’s business processes. Using BI to monitor the external environment in which an organisation operates will grow in importance for decision making.
“Decision makers […] need information about what is going on outside the organization as well as inside.[…] Macroenvironmental analysis […] examines the economic, political, social, and technological events that influence an industry”.
From: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales p.4.
However, this is not yet fully understood by the wider Business Intelligence community, as the quote below (from an article on BI in one of the local business weeklies here in Dublin) illustrates:
“BI tools are fundamentally about using data which an organisation already has – whether in databases, CRM systems, financial and accounting packages, ERP systems or elsewhere”.
This perspective is too narrow. While it is fundamental to use BI to mine and analyse the data an organisation already owns, it is equally important to integrate data from external sources such as the web to optimise the internal decision-making process. Organisations that understand this requirement will have the edge over their competitors. For executives to make informed decisions they need to be able to look at intra-organisational events as well as at the competitive environment.
“Strategic management is the art and science of directing companies in light of events both inside and outside the organization. In addition to understanding their own operations, managers must understand the rest of the industry. For example, should a company try to be a low-cost producer or a best-cost producer? How can a company differentiate its product line? Should the focus be on the entire market or on a niche? Without understanding what others are doing, making decisions about these types of issues leads to unexpected results.”
From: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales.
Web mining, data mining and text mining techniques will be of fundamental importance to implement this new breed of BI.
In this series we will have a look at all three areas. In today’s article I will show you how we can implement web mining techniques with Oracle. In part two of this series we will look at how we can use data mining techniques in general, and survival analysis in particular, to analyse macro-environmental data from the web. Finally, in the third part we will look at how we can use text mining to classify and cluster the extracted data.
What we will do today is harvest macro-environmental business intelligence from real estate data. I thought it might be interesting to look at property related data because of the recent bursting of the property bubble. The site we will extract data from is property.ie.
The information we harvest can be used to (amongst other things):
– Identify areas where houses sell the quickest (have a short survival time; a sample query follows this list).
– Identify features of houses that sell the quickest.
– Find properties that are near other properties.
– Create a taxonomy/classification to browse properties by features.
– Monitor price increases or decreases.
– Use a combination of all of the above.
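As a taste of the first bullet point: once the extract described below has been running for a while, a query along the following lines would rank areas by how quickly properties disappear from the site. This is only a rough sketch against the property table we create in step 2, and it treats removal from property.ie as a proxy for a sale.

SELECT area,
       ROUND(AVG(days_listed), 1) AS avg_days_listed,
       COUNT(*)                   AS properties_removed
FROM   ( SELECT prop_code,
                MAX(area)                                AS area,
                MAX(date_removed) - MIN(valid_from_date) AS days_listed
         FROM   property
         WHERE  delete_ind = 1
         GROUP  BY prop_code )
GROUP  BY area
ORDER  BY avg_days_listed;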
In the case study that follows I am using Oracle 11.1.0.6.
1. Create a user and assign the relevant permissions
Let’s log on as a DBA user, e.g. SYS, and execute the following:
SQL> create user real_estate identified by real_estate;

User created.

SQL> grant connect to real_estate;

Grant succeeded.

SQL> grant resource to real_estate;

Grant succeeded.

SQL> CREATE TABLESPACE reales
  2  DATAFILE 'D:\ORACLE\ORADATA\ORCL\REALES01.dbf' size 512M
  3  extent management local autoallocate;

Tablespace created.
This will give us user real_estate with connect and resource grants.
Next we need to create an Access Control List (ACL) for this user. The ACL will allow us to access the property.ie website, but prevents access to any other website. Fine-grained network ACLs are new in Oracle 11g; if you are using Oracle 10g this step does not apply, and access is controlled solely through the execute grant on UTL_HTTP below.
SQL> begin
  2    dbms_network_acl_admin.create_acl (
  3      acl         => 'utl_http.xml',
  4      description => 'Normal Access',
  5      principal   => 'REAL_ESTATE',
  6      is_grant    => TRUE,
  7      privilege   => 'connect',
  8      start_date  => null,
  9      end_date    => null
 10    );
 11  end;
 12  /
On line 5 the principal needs to be in capital letters. Otherwise Oracle will return an error.
Next we assign the property.ie site to the ACL:
SQL> begin
  2    dbms_network_acl_admin.assign_acl (
  3      acl        => 'utl_http.xml',
  4      host       => 'www.property.ie',
  5      lower_port => 1,
  6      upper_port => 10000);
  7  end;
  8  /
Finally, we grant execute permission on utl_http and dbms_lock to the real_estate user:
SQL> grant execute on utl_http to real_estate;

Grant succeeded.

SQL> GRANT EXECUTE ON dbms_lock TO real_estate;

Grant succeeded.
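Before moving on it is worth verifying that everything is wired up correctly. Logged on as real_estate, a quick connectivity test along these lines should return the first few characters of the property.ie front page; if the ACL is missing or misconfigured, Oracle raises ORA-24247 instead. This check is my addition, not part of the original script:

SQL> SELECT DBMS_LOB.SUBSTR(HTTPURITYPE('http://www.property.ie/').getclob(), 100, 1) AS snippet
  2  FROM dual;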
2. Create Tables
Next we need to create the tables to store the extracted information.
SQL> CREATE TABLE seed_html (
  2    html CLOB
  3  )
  4  TABLESPACE REALES
  5  PCTFREE 0
  6  /

Table created.

SQL> CREATE TABLE seed (
  2    part_of_link VARCHAR2(30),
  3    num_pages NUMBER,
  4    num_property NUMBER,
  5    area VARCHAR2(30)
  6  )
  7  TABLESPACE REALES
  8  PCTFREE 0
  9  /

Table created.

SQL> CREATE TABLE property_html (
  2    part_of_link VARCHAR2(30),
  3    HTML CLOB,
  4    link VARCHAR2(255)
  5  )
  6  TABLESPACE REALES
  7  PCTFREE 0
  8  /

Table created.

SQL> CREATE TABLE property_description (
  2    property_id NUMBER,
  3    prop_code NUMBER,
  4    prop_desc CLOB,
  5    activity_date DATE,
  6    latitude NUMBER,
  7    longitude NUMBER
  8  )
  9  TABLESPACE REALES
 10  PCTFREE 0
 11  /

Table created.

SQL> CREATE TABLE property
  2  (
  3    PROPERTY_ID NUMBER,
  4    LINK VARCHAR2(1000),
  5    PROP_CODE NUMBER,
  6    PRICE NUMBER,
  7    ADDRESS VARCHAR2(500),
  8    ROOMS VARCHAR2(500),
  9    AREA VARCHAR2(50),
 10    VALID_FROM_DATE DATE,
 11    VALID_TO_DATE DATE,
 12    DATE_REMOVED DATE,
 13    VALID_IND NUMBER,
 14    DELETE_IND NUMBER
 15  )
 16  TABLESPACE REALES
 17  PCTFREE 10
 18  /

Table created.

SQL> CREATE TABLE PROPERTY_HELPER
  2  (
  3    LINK VARCHAR2(1000),
  4    PROP_CODE NUMBER,
  5    PRICE NUMBER,
  6    ADDRESS VARCHAR2(500),
  7    ROOMS VARCHAR2(500),
  8    AREA VARCHAR2(50),
  9    DELETE_IND NUMBER
 10  )
 11  TABLESPACE REALES
 12  PCTFREE 0;

Table created.

SQL> CREATE TABLE PROPERTY_ATTRIBUTES
  2  (
  3    LINK VARCHAR2(4000 BYTE),
  4    PROP_CODE NUMBER,
  5    PRICE NUMBER,
  6    ADDRESS VARCHAR2(4000 BYTE),
  7    ROOMS VARCHAR2(4000 BYTE),
  8    AREA VARCHAR2(30 BYTE)
  9  )
 10  TABLESPACE REALES
 11  PCTFREE 0;

Table created.

SQL> CREATE SEQUENCE seq_property
  2  START WITH 1
  3  MAXVALUE 999999999999999999999999999
  4  MINVALUE 1
  5  NOCYCLE
  6  CACHE 20
  7  NOORDER
  8  /

Sequence created.
Note: Because we will be dealing with very little data initially, I have not added any indexes to these tables. Once the volume of data grows and we have a better understanding of the query patterns, we should add relevant indexes.
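If and when that happens, indexes on the columns used for the NOT EXISTS lookups and the MERGE join in the procedures below would be the first candidates, for example (the index names here are purely illustrative):

CREATE INDEX property_code_idx      ON property (prop_code) TABLESPACE reales;
CREATE INDEX property_desc_code_idx ON property_description (prop_code) TABLESPACE reales;
CREATE INDEX property_html_link_idx ON property_html (part_of_link) TABLESPACE reales;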
3. Extract the property seed
Before we get stuck into things I recommend you get familiar with the functionality and navigation of the property.ie website. This will make it easier to understand what we will be dealing with in the next couple of sections. For the purpose of this exercise we will limit the extract process to properties in county Dublin, as we don’t want to put too much pressure on the property.ie web servers. At the same time, though, we want to gather enough information to perform some proper analysis, so we will include all areas in Dublin in our extract process. If you have a look at the front page of the property.ie website you will see that each area also lists the number of properties available in that area. This information will become relevant in the later stages of our extract exercise.
The procedure below extracts the part of the property.ie front page HTML that contains the areas and the number of properties in each area.
SQL> CREATE OR REPLACE PROCEDURE extract_seed_html
  2
  3  IS
  4
  5  -- exec extract_seed_html
  6
  7  BEGIN
  8
  9    EXECUTE IMMEDIATE 'TRUNCATE TABLE seed_html';
 10
 11    -- utl_http.set_proxy([http://][user[:password]@]host[:port])
 12
 13    INSERT INTO seed_html
 14    SELECT TO_CLOB(DBMS_LOB.SUBSTR(html,4000,5900)) || TO_CLOB(DBMS_LOB.SUBSTR(html,4000,9900)) FROM (
 15      SELECT HTTPURITYPE('http://www.property.ie/').getclob() AS html FROM dual
 16    );
 17
 18    COMMIT;
 19
 20  END extract_seed_html;
 21  /

Procedure created.
On line 11 I have commented out the use of a proxy server. If you are behind a proxy or want to anonymise your requests, remove the comment and fill in your proxy details such as username, password, host, and port.
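For example, the uncommented call might look like this; the host, port, and credentials below are purely illustrative and follow the format documented in the comment:

utl_http.set_proxy('http://scott:tiger@proxy.example.com:8080');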
On line 15 we use HTTPURITYPE to retrieve the HTML of the property.ie front page, and via the DBMS_LOB.SUBSTR calls on line 14 we cut out the section that contains the property area dropdown. HTTPURITYPE uses the UTL_HTTP package under the covers. The extracted content looks like this:
HTML                                                                              OCCURENCE
--------------------------------------------------------------------------------  ---------
<select id="area" name="s[a_id][]">                                                       1
<option value="">All areas</option>
<select id="area" name="s[a_id][]">                                                       2
<option value="">All areas</option>
<select id="area" name="s[a_id][]">                                                       3
<option value="">All areas</option>
<select id="area" name="s[a_id][]">                                                       4
<option value="">All areas</option>
<select id="area" name="s[a_id][]">                                                       5
<option value="">All areas</option>
We will now strip this piece of information of any HTML noise.
SQL> CREATE OR REPLACE PROCEDURE load_seed
  2
  3  IS
  4
  5  -- exec load_seed
  6
  7  BEGIN
  8
  9    EXECUTE IMMEDIATE 'TRUNCATE TABLE seed';
 10
 11    INSERT /*+ APPEND */ INTO seed
 12    SELECT
 13      REPLACE(TRIM(REGEXP_REPLACE(REGEXP_SUBSTR(html, '[A-Z][a-z].*([0-9]{1,3})',1,occurence),'(\(|\)|[0-9]{1,3})')),' ','-') AS prep_for_link,
 14      CEIL(TO_NUMBER(REGEXP_REPLACE(REGEXP_SUBSTR(REGEXP_SUBSTR(html, '[A-Z][a-z].*([0-9]{1,3})',1,occurence),'([0-9]{1,3})'),'(\(|\))'))/10) AS num_pages,
 15      TO_NUMBER(REGEXP_REPLACE(REGEXP_SUBSTR(REGEXP_SUBSTR(html, '[A-Z][a-z].*([0-9]{1,3})',1,occurence),'([0-9]{1,3})'),'(\(|\))')) AS num_property,
 16      REGEXP_SUBSTR(html, '[A-Z][a-z].*([0-9]{1,3})',1,occurence) AS area
 17    FROM
 18      ( SELECT html, occurence FROM seed_html
 19        CROSS JOIN (
 20          SELECT level occurence FROM dual CONNECT BY level <= 190) );
 21
 22    COMMIT;
 23
 24  END load_seed;
 25  /

Procedure created.
On lines 18-20 we cross join our seed_html table with an inline view that returns the numbers 1 to 190. This is done using the CONNECT BY clause. We have chosen 190 as the upper limit because there will never be more than 190 areas in county Dublin.
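In isolation the row generator is nothing more than a CONNECT BY query against dual. Limited to five rows, for example, it returns:

SQL> SELECT level AS occurence FROM dual CONNECT BY level <= 5;

 OCCURENCE
----------
         1
         2
         3
         4
         5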
The cross join itself looks like this:
SQL> SELECT html, occurence FROM seed_html
  2  CROSS JOIN (
  3    SELECT level occurence FROM dual CONNECT BY level <= 190);
We then use regular expressions to parse out each occurrence of an area and the number of properties in that area, one occurrence at a time. At the end of this article there are a couple of links to regular expression tutorials. This is the first time that I have used regular expressions myself, so I am sure the above could have been done in a more elegant and more performant way.
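If you want to see what the innermost patterns do, you can run them against a literal area string in isolation. The first expression below pulls out the property count, the second strips the count and the parentheses to leave the area name; against 'Balbriggan (176)' they return 176 and Balbriggan respectively.

SELECT
  REGEXP_SUBSTR('Balbriggan (176)', '[0-9]{1,3}')                 AS num_property,
  TRIM(REGEXP_REPLACE('Balbriggan (176)', '(\(|\)|[0-9]{1,3})'))  AS area_name
FROM dual;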
In our seed table we should now have the following information:
PART_OF_LINK                    NUM_PAGES NUM_PROPERTY AREA
------------------------------ ---------- ------------ ------------------------------
Adamstown                               1           10 Adamstown (10)
Ard-Na-Greine                           1            5 Ard Na Greine (5)
Artane                                  5           43 Artane (43)
Ashtown                                 3           29 Ashtown (29)
Aylesbury                               1            8 Aylesbury (8)
Ayrfield                                2           14 Ayrfield (14)
Balbriggan                             18          176 Balbriggan (176)
Baldonnell                              1            3 Baldonnell (3)
Baldoyle                                3           23 Baldoyle (23)
Balgriffin                              2           16 Balgriffin (16)
Ballinteer                              4           33 Ballinteer (33)
Ballsbridge                             6           51 Ballsbridge (51)
...
4. Extract HTML for property master pages
Each property area has one or more master property pages, and on each master page no more than 10 properties are listed. Users of the property.ie site can page through these master pages and, by clicking on a property, get to the details page for that property.
The URL template for the master pages is http://www.property.ie/property-for-sale/dublin/<area>/p_<page number>/, e.g. http://www.property.ie/property-for-sale/dublin/balbriggan/p_2/.
With the information from the seed table, our next procedure will iterate over the master property pages and parse out the information we are interested in. First, though, we introduce an error-handling procedure, which we need in case we lose connectivity during the extract.
SQL> CREATE OR REPLACE PROCEDURE raise_err (
  2    p_errcode IN NUMBER := NULL,
  3    p_errmsg  IN VARCHAR2 := NULL
  4  )
  5  IS
  6    l_errcode NUMBER := NVL (p_errcode, SQLCODE);
  7    l_errmsg  VARCHAR2(1000) := NVL (p_errmsg, SQLERRM);
  8  BEGIN
  9
 10
 11    IF l_errcode BETWEEN -20999 AND -20000
 12    THEN
 13      raise_application_error (l_errcode, l_errmsg);
 14    /* Use positive error numbers */
 15    ELSIF l_errcode > 0
 16      AND l_errcode NOT IN (1, 100)
 17    THEN
 18      raise_application_error (-20000, l_errcode || '-' || l_errmsg);
 19    /* Can't EXCEPTION_INIT -1403 */
 20    ELSIF l_errcode IN (100, -1403)
 21    THEN
 22      RAISE NO_DATA_FOUND;
 23    /* Re-raise any other exception. */
 24    ELSIF l_errcode != 0
 25    THEN
 26      EXECUTE IMMEDIATE
 27        'DECLARE myexc EXCEPTION; ' ||
 28        '  PRAGMA EXCEPTION_INIT (myexc, ' ||
 29        TO_CHAR (l_errcode) || ');' ||
 30        'BEGIN RAISE myexc; END;';
 31
 32    END IF;
 33  END;
 34  /

Procedure created.
Procedure raise_err re-raises any error we run into during the extract. Now let’s move on to actually extracting the HTML for the master property pages via our seed table.
SQL> CREATE OR REPLACE PROCEDURE extract_prop_html (p_area IN VARCHAR2)
  2
  3  IS
  4
  5  -- exec extract_prop_html (NULL)
  6
  7    CURSOR c_seed
  8    IS
  9    SELECT
 10      part_of_link
 11    FROM
 12      seed a
 13    WHERE
 14      UPPER(part_of_link) = COALESCE(UPPER(p_area),UPPER(part_of_link)) AND
 15      NOT EXISTS (
 16        SELECT NULL FROM property_html b WHERE a.part_of_link = b.part_of_link );
 17
 18    l_part_of_link VARCHAR2(30);
 19
 20  BEGIN
 21
 22
 23    BEGIN
 24
 25      FOR r_seed IN c_seed
 26      LOOP
 27
 28        l_part_of_link := r_seed.part_of_link;
 29
 30        INSERT INTO property_html
 31        SELECT
 32          part_of_link,
 33          REGEXP_SUBSTR(REPLACE(REPLACE(REPLACE(html,CHR(10),''),CHR(13),''),CHR(9),''),'searchresults_container.*summary_info'),
 34          link
 35        FROM (
 36          SELECT
 37            part_of_link,
 38            HTTPURITYPE('http://www.property.ie/property-for-sale/dublin/' || part_of_link || '/p_' || occurence ||'/').getclob() AS html,
 39            'http://www.property.ie/property-for-sale/dublin/' || part_of_link || '/p_' || occurence ||'/' AS link,
 40            occurence
 41          FROM
 42            seed a CROSS JOIN (
 43              SELECT level occurence FROM dual CONNECT BY level <= ( SELECT MAX(num_pages) FROM seed )
 44            )
 45          WHERE num_pages >= occurence AND part_of_link = r_seed.part_of_link
 46        );
 47
 48        COMMIT;
 49
 50
 51
 52        dbms_lock.sleep(1);
 53
 54      END LOOP;
 55
 56
 57
 58    EXCEPTION WHEN OTHERS THEN
 59
 60      EXECUTE IMMEDIATE 'DELETE FROM property_html WHERE part_of_link = ''' || l_part_of_link || '''';
 61      raise_err(SQLCODE,SUBSTR(SQLERRM,1,1000));
 62
 63    END;
 64
 65  END extract_prop_html;
 66  /
On lines 7-16 we define a cursor that will use the information from the seed table to browse the master property pages. This cursor allows us to either iterate over everything in the seed table (if we pass in NULL as a parameter to the procedure) or just a particular area. This is achieved via the COALESCE function.
On lines 30-46 we do the main work: we extract the HTML of all the master property pages on an area by area basis. Again we use our cross join and CONNECT BY technique from earlier on to retrieve all master property pages for an area in one go. The result of this cross join for just one area would look similar to the following:
PART_OF_LINK  HTML                                                                              LINK
------------  --------------------------------------------------------------------------------  -----------------------------------------------------------------
Ballsbridge   searchresults_container"><div class="search_result"><div class="sresult_address"  http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_1/
Ballsbridge   searchresults_container"><div class="search_result"><div class="sresult_address"  http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_2/
Ballsbridge   searchresults_container"><div class="search_result"><div class="sresult_address"  http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_3/
Ballsbridge   searchresults_container"><div class="search_result"><div class="sresult_address"  http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_4/
Ballsbridge   searchresults_container"><div class="search_result"><div class="sresult_address"  http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_5/
Ballsbridge   searchresults_container"><div class="search_result"><div class="sresult_address"  http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_6/
...
We store the piece of HTML that contains the property attributes in our property_html table. Later on we will parse the property attributes we are interested in out of this piece of HTML.
On line 52 we pause for one second before moving on to the next area, to reduce the load on the property.ie web server.
On lines 58-63 we do some error handling in case we lose connectivity. If we do, we delete any entries for the area that was being extracted at the moment the error occurred. This allows us to pick up from where the extract stopped when we re-execute the procedure.
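Because of this, the procedure can simply be called again after a failure. Passing NULL processes every area in the seed table that has not been extracted yet, while passing an area name restricts the run to that area (the area name below is just an example):

SQL> exec extract_prop_html(NULL)

SQL> exec extract_prop_html('Balbriggan')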
5. Extract and merge property attributes
We now have the relevant HTML from the master property pages to extract and merge the property attributes.
SQL> CREATE OR REPLACE PROCEDURE merge_prop
  2
  3  IS
  4
  5  -- exec merge_prop
  6
  7  BEGIN
  8
  9
 10    INSERT INTO property_attributes
 11    SELECT
 12      TO_CHAR(REGEXP_SUBSTR(SUBSTR(prop_info,1,INSTR(prop_info,'</h2>')),'http.*[0-9]{4}.')) AS link,
 13      to_number(replace((REGEXP_SUBSTR(SUBSTR(prop_info,1,INSTR(prop_info,'</h2>')),'/[0-9]{4,5}./')),'/','')) AS prop_code,
 14      TO_NUMBER(REGEXP_REPLACE(REGEXP_SUBSTR(prop_info,'[0-9],[0-9]{3},[0-9]{3}|[0-9]{3},[0-9]{3}'),'[,|.]')) AS price,
 15      TO_CHAR(REGEXP_REPLACE(SUBSTR(prop_info,1,INSTR(prop_info,'</h2>')-1),'<[^>]+>|[0-9].')) AS address,
 16      to_char(substr(prop_info,instr(prop_info,'<h4>',1)+4,instr(prop_info,'</h4>',1)-instr(prop_info,'<h4>',1)-4)) AS rooms,
 17      part_of_link AS area
 18    FROM ( SELECT
 19             SUBSTR(html,instr(html,'<div class="sresult_address">',1,occurence+3),instr(html,'<div class="sresult_moredetail">',1,occurence+3)-instr(html,'<div class="sresult_address">',1,occurence+3)) AS prop_info,
 20             occurence,part_of_link FROM property_html
 21           CROSS JOIN (
 22             SELECT level occurence FROM dual CONNECT BY level <= 10) );
 23
 24
 25    COMMIT;
 26
 27    -- Get the properties that were updated or newly inserted
 28
 29    INSERT INTO property_helper
 30    SELECT
 31      link,
 32      prop_code,
 33      price,
 34      address,
 35      rooms,
 36      area,
 37      0
 38    FROM
 39      property_attributes
 40    WHERE prop_code IS NOT NULL
 41    MINUS
 42    SELECT
 43      link,
 44      prop_code,
 45      price,
 46      address,
 47      rooms,
 48      area,
 49      delete_ind
 50    FROM
 51      property;
 52
 53    COMMIT;
 54
 55    -- Get the property codes that were deleted
 56
 57    INSERT INTO property_helper
 58    SELECT
 59      '-',
 60      prop_code,
 61      -1,
 62      '-',
 63      '-',
 64      '-',
 65      1
 66    FROM
 67      property
 68    WHERE delete_ind <> 1
 69    MINUS
 70    SELECT
 71      '-',
 72      prop_code,
 73      -1,
 74      '-',
 75      '-',
 76      '-',
 77      1
 78    FROM
 79      property_attributes;
 80
 81
 82    COMMIT;
 83
 84    -- Update the updated and deleted records
 85
 86    MERGE INTO property a USING (
 87      SELECT
 88        link,
 89        prop_code,
 90        price,
 91        address,
 92        rooms,
 93        area,
 94        delete_ind
 95      FROM
 96        property_helper
 97    ) b ON (a.prop_code = b.prop_code )
 98    WHEN MATCHED THEN UPDATE SET
 99      a.valid_to_date = CASE WHEN a.valid_ind = 1 THEN SYSDATE ELSE a.valid_to_date END,
100      a.date_removed = CASE
101                         WHEN a.delete_ind = 1 THEN a.date_removed -- It has been removed previously
102                         ELSE
103                           CASE
104                             WHEN b.delete_ind = 1 THEN SYSDATE
105                             ELSE a.date_removed
106                           END
107                       END,
108      a.valid_ind = 0,
109      a.delete_ind = CASE WHEN b.delete_ind = 1 THEN 1 ELSE a.delete_ind END;
110
111
112    COMMIT;
113
114    -- Create the updated and newly inserted records. Updated records get a new record to audit changes
115
116    INSERT INTO property
117    SELECT
118      seq_property.nextval,
119      link,
120      prop_code,
121      price,
122      address,
123      rooms,
124      area,
125      SYSDATE,
126      TO_DATE('31/12/9999','DD/MM/YYYY'),
127      TO_DATE('31/12/9999','DD/MM/YYYY'),
128      1,
129      0
130    FROM (
131      SELECT
132        link,
133        prop_code,
134        price,
135        address,
136        rooms,
137        area,
138        delete_ind
139      FROM
140        property_helper
141      MINUS
142      SELECT
143        link,
144        prop_code,
145        price,
146        address,
147        rooms,
148        area,
149        delete_ind
150      FROM
151        property
152    )
153    WHERE
154      delete_ind <> 1;
155
156    COMMIT;
157
158  END merge_prop;
159  /

Procedure created.
The above procedure consists of five parts.
On lines 11-22 we parse the relevant attributes out of the piece of HTML we extracted in the previous step: the link to the property’s details page, the prop_code (a unique identifier for the property), the price, the address, and the room details. Again we are using regular expressions to achieve this.
On lines 29-51 we store the properties that were either updated or added since our last extract batch in a helper table (property_helper). To identify them we have to do a full comparison between all our previously extracted properties in the property table and the properties we have just extracted. We do this via the MINUS operator.
Note: For a large volume of records and depending on our hardware, we might run into performance issues doing a full diff between the two result sets. Anything below 1M records should not be a problem though.
On lines 57-79 we store the properties that were deleted since our last extract job in the property_helper table. Again the only option we have here is a full comparison between the records we extracted previously and those we extracted in the current batch cycle.
On lines 86-109 we merge the records that were updated or deleted with the previously extracted property records. For each record that was updated we close off its valid period and set valid_ind to 0, marking the record as no longer current. For each record that was deleted we do the same and in addition set its delete_ind to 1, marking the record as deleted at source.
On lines 116-154 we insert the new records we came across in the current extract batch. We also create a new record for each updated property (similar to a Slowly Changing Dimension Type 2). This gives us an audit trail for any updates made to a record, e.g. when the price is increased or decreased.
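Once a few batch runs have accumulated, this audit trail can be queried per property; a sketch along these lines would show a property’s price history (the prop_code below is just a placeholder):

SELECT prop_code, price, valid_from_date, valid_to_date, valid_ind, delete_ind
FROM   property
WHERE  prop_code = 123456
ORDER  BY valid_from_date;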
6. Extract property details
As part of the previous step we extracted the link to each property’s details page. In this step we will use this link in an HTTP GET request and scrape the information we are interested in from the details page.
SQL> CREATE OR REPLACE PROCEDURE insert_prop_desc
  2
  3  IS
  4
  5  -- exec insert_prop_desc
  6
  7    CURSOR c_prop_desc
  8    IS
  9    SELECT
 10      link,
 11      property_id,
 12      prop_code
 13    FROM
 14      property a
 15    WHERE
 16      NOT EXISTS ( SELECT NULL FROM property_description b WHERE a.prop_code = b.prop_code);
 17
 18  BEGIN
 19
 20    FOR r_prop_desc IN c_prop_desc
 21    LOOP
 22
 23      INSERT INTO property_description
 24      SELECT
 25        r_prop_desc.property_id,
 26        r_prop_desc.prop_code,
 27        REPLACE(REGEXP_REPLACE(REGEXP_SUBSTR(REPLACE(REPLACE(REPLACE(html,CHR(10),''),CHR(13),''),CHR(9),''),'--></script></div>.*<div class="separator">'),'<[^>]+>'),'-->',''),
 28        TRUNC(SYSDATE),
 29        REGEXP_SUBSTR(TO_CHAR(REGEXP_SUBSTR(html,'show_map.*')),'(-|[0-9])[0-9].[0-9]{2,8}',1,1),
 30        REGEXP_SUBSTR(TO_CHAR(REGEXP_SUBSTR(html,'show_map.*')),'-[0-9].[0-9]{2,8}',1,1)
 31      FROM (
 32        SELECT
 33          HTTPURITYPE(r_prop_desc.link).getclob() AS html
 34        FROM dual );
 35
 36      COMMIT;
 37
 38      dbms_lock.sleep(1);
 39
 40
 41    END LOOP;
 42
 43  END insert_prop_desc;
 44  /
On lines 7-16 we define a cursor that returns those properties for which no description has been extracted yet.
On lines 23-34 we iterate over the cursor and parse the description, the latitude, and the longitude from the HTML. We will use latitude and longitude in part 2 of this series to calculate the distance between properties.
7. Bringing it all together
In a last step we bring all the individual procedures together in a master procedure.
SQL> CREATE OR REPLACE PROCEDURE prop_batch
  2
  3  IS
  4
  5  BEGIN
  6
  7    EXECUTE IMMEDIATE 'TRUNCATE TABLE property_html';
  8
  9    EXECUTE IMMEDIATE 'TRUNCATE TABLE property_helper';
 10
 11    EXECUTE IMMEDIATE 'TRUNCATE TABLE seed';
 12
 13    EXECUTE IMMEDIATE 'TRUNCATE TABLE seed_html';
 14
 15    EXECUTE IMMEDIATE 'TRUNCATE TABLE property_attributes';
 16
 17    extract_seed_html;
 18
 19    load_seed;
 20
 21    extract_prop_html (NULL);
 22
 23    merge_prop;
 24
 25    insert_prop_desc;
 26
 27  END prop_batch;
 28  /

Procedure created.
On lines 7-15 we remove the data from our previous extract batch; we then execute each extract procedure step by step.
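If you want to run the whole extract unattended, e.g. nightly, DBMS_SCHEDULER can drive prop_batch. A minimal sketch (job name and schedule are illustrative):

BEGIN
  dbms_scheduler.create_job (
    job_name        => 'PROP_BATCH_JOB',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'PROP_BATCH',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',
    enabled         => TRUE);
END;
/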
As a last step we need to add error handling and code instrumentation to our solution. However, this is out of scope for this article.