One pass SCD2 load: How to load a Slowly Changing Dimension Type 2 with one SQL Merge statement in Oracle

Uli Bethke DW Design, ETL, Oracle

The other day I came across a useful new feature of the Merge statement in SQL Server 2008: it can output the merged rows for further processing, something that (as of Oracle 11.1.0.7) is not possible in Oracle. This extra functionality can be used to load a Slowly Changing Dimension Type 2 in a single SQL statement. As a result you have only one pass over the data, less logical I/O, and improved performance. If you are interested in how exactly the Merge statement can be used in SQL Server 2008 to load an SCD2, have a look at this article. Recently, this has also featured in Kimball's design tip 107.

The whole thing was wrecking my head a bit: how can we achieve the same thing in Oracle? Finally, I came up with a solution. I have to admit it is not perfect, as it only works if you can identify a changed record at source as either an Insert or an Update. So you need to know whether the record is a new one or was just modified. In the vast majority of cases you should be able to have this information available at extract time. So if you extract your information from source via change data capture, timestamp created/timestamp modified, audit records, or Oracle Total Recall, you should be good to go. Still, SQL Server 2008 has more powerful functionality here. I can see an enhancement request coming ...


For this post we will rely on the good old SH schema (Oracle 11.1.0.7).

Let's first create a sequence that we will be using later on for the surrogate key of our SCD2.
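A minimal sketch of the sequence; the name and options are my own choosing:

```sql
-- Sequence supplying surrogate key values for the SCD2 dimension
CREATE SEQUENCE products_scd_seq
  START WITH 1
  INCREMENT BY 1
  NOCACHE;
```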

Next we will create an SCD2 based on the existing sh.products dimension. We will also add a valid_ind indicator and populate the effective-to date with 'end of time'. To refresh your memory, an SCD2 as defined by Monsieur Kimball is used to track history in a DW. Each time an attribute in the underlying source of the dimension is modified, a new record with the updated attribute is created in the dimension. An SCD2 typically has three helper columns: the effective-from and effective-to dates of the record, and a flag to indicate the most recent record.
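A sketch of the dimension, assuming a reduced set of columns from sh.products (the table, column, and sequence names are illustrative):

```sql
-- SCD2 dimension with the three helper columns:
-- effective_from, effective_to and valid_ind
CREATE TABLE products_scd (
  prod_sk         NUMBER PRIMARY KEY,   -- surrogate key from the sequence
  prod_id         NUMBER,
  prod_name       VARCHAR2(50),
  prod_min_price  NUMBER,
  effective_from  DATE,
  effective_to    DATE,
  valid_ind       VARCHAR2(1)           -- 'Y' = most recent record
);

-- Initial load from the existing sh.products dimension,
-- with the effective-to date set to 'end of time'
INSERT INTO products_scd
SELECT products_scd_seq.NEXTVAL,
       prod_id,
       prod_name,
       prod_min_price,
       SYSDATE,
       DATE '9999-12-31',
       'Y'
FROM   sh.products;
```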

Now we will prepare and populate our staging table for the SCD2 load. I have called it products_source. We will create six records in this staging table: five represent updates to existing records, and one is a brand new insert.
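The staging table carries the business attributes plus a dml_type flag; something along these lines, with five arbitrary products staged as updates (again, names are illustrative):

```sql
CREATE TABLE products_source (
  prod_id         NUMBER,
  prod_name       VARCHAR2(50),
  prod_min_price  NUMBER,
  dml_type        VARCHAR2(1)   -- 'I' = insert at source, 'U' = update
);

-- Stage five existing products as updates; skip product names starting
-- with '1', which would clash with the prefix trick used in the merge
INSERT INTO products_source
SELECT prod_id, prod_name, prod_min_price, 'U'
FROM   sh.products
WHERE  prod_name NOT LIKE '1%'
AND    ROWNUM <= 5;
```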

The 'U' attribute in the dml_type column marks these records as updates.

Next we update the prod_min_price to simulate a reduction in price and we also create a brand new record.
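For example (the new product's values are made up for the illustration):

```sql
-- Simulate a price reduction on the staged updates
UPDATE products_source
SET    prod_min_price = prod_min_price - 1
WHERE  dml_type = 'U';

-- And stage one brand new product as an insert at source
INSERT INTO products_source (prod_id, prod_name, prod_min_price, dml_type)
VALUES (9999, 'My New Product', 99.99, 'I');

COMMIT;
```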

We should now have the following records in our products_source staging table.
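A quick check of the staging table:

```sql
SELECT prod_id, prod_name, prod_min_price, dml_type
FROM   products_source
ORDER  BY dml_type, prod_id;
```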

Finally, we merge our staging table into our products_scd dimension table.

For the records updated at source we need to both create a new record for the updated attributes and update the existing record to reflect the changes in effective dates and the valid indicator. For the new records at source (in our case there is just one) we need to create new records in the SCD2.
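A sketch of the one-pass merge, under the assumption that no genuine product name starts with the character '1' (table and column names as assumed above; the effective dates are illustrative):

```sql
MERGE INTO products_scd t
USING (
       -- Updated records, copy 1: prod_name prefixed with '1' so they do
       -- not match an existing row and fall through to the insert branch
       SELECT prod_id, '1' || prod_name AS prod_name, prod_min_price
       FROM   products_source
       WHERE  dml_type = 'U'
       UNION ALL
       -- Updated records, copy 2: unprefixed, so they match the current
       -- dimension record, which then gets expired
       SELECT prod_id, prod_name, prod_min_price
       FROM   products_source
       WHERE  dml_type = 'U'
       UNION ALL
       -- Brand new records at source: no match, straight insert
       SELECT prod_id, prod_name, prod_min_price
       FROM   products_source
       WHERE  dml_type = 'I'
      ) s
ON    (t.prod_name = s.prod_name)
WHEN MATCHED THEN
  UPDATE SET t.valid_ind    = 'N',
             t.effective_to = SYSDATE
  WHERE      t.valid_ind    = 'Y'
WHEN NOT MATCHED THEN
  INSERT (prod_sk, prod_id, prod_name, prod_min_price,
          effective_from, effective_to, valid_ind)
  VALUES (products_scd_seq.NEXTVAL,
          s.prod_id,
          -- strip the '1' prefix off the duplicated update records
          CASE WHEN SUBSTR(s.prod_name, 1, 1) = '1'
               THEN SUBSTR(s.prod_name, 2)
               ELSE s.prod_name
          END,
          s.prod_min_price,
          SYSDATE,
          DATE '9999-12-31',
          'Y');
```

Note that the valid_ind filter sits in the WHERE clause of the update branch rather than in the ON condition: Oracle does not allow a column that is updated by WHEN MATCHED to be referenced in the ON clause (ORA-38104).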

In the subquery of the USING clause we effectively duplicate the records that were updated at source.

We prefix the prod_name of the first copy of the updated records with the number 1, while the second copy does not undergo this treatment. The resultset we will now merge therefore contains two copies of each updated record plus the brand new record.

In the Merge statement we merge the source and target resultsets on the prod_name column. This now matches for the unprefixed copies of the updated records. For these, the WHEN MATCHED clause updates the valid_ind and the effective-to date of the existing dimension record.

The other records are created as new records in our target dimension. As part of the insert, we strip the prefixed records of their prefix.

Each product updated at source should now have two records in our SCD2: the expired original with valid_ind set to 'N' and a new current record, while the brand new product has a single current record.
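A query along these lines (with the illustrative column names) shows the outcome:

```sql
SELECT prod_sk, prod_id, prod_name, prod_min_price,
       effective_from, effective_to, valid_ind
FROM   products_scd
WHERE  prod_id IN (SELECT prod_id FROM products_source)
ORDER  BY prod_id, effective_from;
```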


About the author

Uli Bethke LinkedIn Profile

Uli has 18 years' hands-on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books, holds an Oracle ACE award, and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a not-for-profit global data management organization, and has co-founded the Irish Oracle Big Data User Group.