During one of my session at SQL Day (as a speaker who remember the 2009 edition, I can say this event is getting better and better) I notice that lot of folks didn’t know how efficient a MERGE statement is in SCD scenarios. Or at least they didn’t know how easy it could be due to SCD Merge Wizard, a free tool available at http://scdmergewizard.codeplex.com/. So, inspired by Paweł Potasiński posts I also decided to share with you some thoughts.
Set-based versus Row-by-row processing
I was genuinely surprised that somebody actually using SCD Transformation in production. This SSIS transformation is great to explain how to load SCD type 0, 1 and 2 attributes, but it so slow that in most cases completely useless.
The problem is that in a data flow tasks generated by this transformation are a lot of components that work in infamous RBAR (Row By Agonizing Row) mode. Can you see those transformations?
If your answer was “OLE DB commands” you are almost right. Yes, those commands looks like that:
UPDATE [dbo].[DimCustomers] SET [BirthDate] = ?,[EnglishEducation] = ?,[FirstName] = ?,[LastName] = ?,[Phone] = ?,[Title] = ? WHERE [CustomerBusinessKey] = ? (for inferred mmebers)
UPDATE [dbo].[DimCustomers] SET [EndDate] = ? WHERE [CustomerBusinessKey] = ? AND [EndDate] IS NULL (for type 2 attributes)
UPDATE [dbo].[DimCustomers] SET [BirthDate] = ?,[LastName] = ? WHERE [CustomerBusinessKey] = ? AND [EndDate] IS NULL (for type 1 attributes)
And are, of course, they are called on row by row basis.
So, the first factor that affects performance is the number of rows that have changed since the last synchronization — as this umber grows, the overall performance drops.
But this transformation is also slow in situations when no rows need to be updated. Take this example:
18 484 rows in source, literally none of them was updated, but the execution time was 17 seconds:
It seems like this Slowly Changing Dimension tasks split rows in not so smart way — instead of using some kind of set-based solution rows in source are compered one by one with rows that exist in dimension table. This is why the performance of SCD transformation also depends on the row number of rows in both (stage and dimension) tables.
Lesson learned — use SCD transformation only for small dimension tables that change infrequently.
In all other cases you should turn to other solution, from which in my opinion the best is putting MERGE statements into Execute SQL task.
SCD MERGE Wizard
As you probably know MERGE allows to perform INSERT/UPDATE/DELETE operations as a single, atomic task. In addition it scans both (source and destination) tables only once. However, write a correct MERGE statement is not a trivial task, especially for SCD type 2 attributes (those attributes require two changes — the old row has to be updated, and a new row has to be inserted). Luckily for us, the SCD MERGE Wizard will generate those statement for us.
Let me advertise this really useful tool:
First, go to CodePlex page and download it.
Then install this standalone tool and run it:
Click Next and connect to your stage database (a database where the source table lives):
Then connect to your data warehouse and select the dimension table:
Third question will be about variables used to mark rows as active or not active — little advice here, if you are not sure which variables will be needed, leave them all. You can always delete them later, but this wizard will not allow you adding variables at later stages:
Then you will be asked to define and configure all attributes — this step is quite important and you really should think what values are stored in DateTo and DateFrom columns. Yes, you can always edit the generated MERGE statements, but some errors are subtle and can pop up later, when user will notice that some SCD2 data are incorrect:
Because MERGE is (in opposite to UPSERT) deterministic, which means that the same row will never be modified more than once, the next question will check if there are duplicate keys in source table.
The last question also needs some attention because you probably want to change all three default global settings. Otherwise, you will need to manually add database prefixes, some rows could be deleted from dimension table and you will lose important comments:
That’s it, after clicking Next both MERGE statement (for SCD type 1 and SCD type 2 attributes) will be show:
I strongly recommend to spend some time analyzing this script. Finally, use it inside Execute SQL Task and appreciate huge performance improvement. In this case, the time dropped form 17 seconds to 0.3 second, but in worst case scenario (large tables with lots of changed rows) it will be even bigger:
See you again