People who write data integration solutions often have a tough job on their hands. I can tell because I get to see all the questions and bugs that get reported.
That is the main reason I committed a first collection of 30 examples to the codebase to be included in the next GA release (2.3.1 will be released next Friday, more on that later).
Today I would like to talk about one of the more interesting examples, on the de-normaliser step. What the de-normaliser step does is help you with the lookup of key-value pairs so that you can assign the looked-up value to a certain field. The step is a fairly recent addition to Pentaho Data Integration, but it gets a lot of attention. I guess that’s because key-value pairs are often used in situations where programmers need a very flexible way of storing data in a relational database.
The typical example is that you have a Customer table with a Customer ID, and all of a sudden 50 extra attributes need to be stored in the database somewhere. However, the main Customer table is not chosen to store these fields (usually because they would be empty in over 90% of the cases); instead they go into a table containing key-value pairs. These key-value pair tables typically look like this:
| CustomerID | Key                | Value |
|------------|--------------------|-------|
| 101        | Cousins_First_Name | …     |
| 101        | …45 more keys…     | …     |
OK, that’s fine. The problem starts when the reporting and data warehouse developers want to get all these 50 fields into the Customer dimension. That is where it would make sense to de-normalise the data and present it in an easy-to-retrieve fashion, next to the other customer data. That’s the point of a data warehouse, remember.
Typically, what you would do is launch 50 lookups like this:
SELECT Value AS Cousins_First_Name
FROM Customer_Key_Value WHERE CustomerID = ? AND Key = 'Cousins_First_Name'
Now, you can imagine the maintenance nightmare that results from such a system:
- If key-value pairs get added we need to add extra lookups
- If a key changes, we need to look in 50 lookups to find the right one
- What if, for some reason, 2 or more of the same keys appear for the same CustomerID (a key collision)? What would you do: take the first, take the last, concatenate the values separated by commas? Perhaps an error should be given, causing your transformation to fail?
- Usually, not only strings but also dates, numbers and booleans are stuffed into the Value field, so all kinds of data conversions have to take place.
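To make the collision question concrete, here is a small Python sketch (purely illustrative, not Kettle code; the sample values are made up) of the three usual resolution strategies:

```python
# Two values were found for the same key and the same CustomerID
# (the values themselves are made up for illustration).
values = ["555-1234", "555-9999"]

take_first = values[0]            # keep the first occurrence
take_last = values[-1]            # keep the last occurrence
concatenated = ",".join(values)   # comma-separated concatenation

print(take_first)    # 555-1234
print(take_last)     # 555-9999
print(concatenated)  # 555-1234,555-9999
```

Whichever strategy you pick, it has to be applied consistently for all 50 keys, which is exactly the kind of logic you do not want to duplicate across 50 separate lookups.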
Also, performance would go right down the drain because of the many individual calls to the database. That is why we created the de-normaliser step: to let you solve this problem once and for all. The way to do it is easy. First, you join with the rest of the customer dimension:
SELECT C.*, KV.Key, KV.Value
FROM Customer C, Customer_Key_Value KV
WHERE C.CustomerID = KV.CustomerID
This way we get all the key-value pairs as multiple rows per customer, and then we apply the step as shown in this image:
As you can see, we have put all the key-value pair lookups, the conversions AND the aggregation logic into a single dialog, making maintenance very easy.
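If you prefer code to a screenshot, the pivoting logic of the step can be sketched in a few lines of Python (a simplified illustration of the technique, not the actual Kettle implementation; the sample rows and the Shoe_Size field are made up, only Cousins_First_Name appears above):

```python
from collections import defaultdict

# Rows as produced by the join above: (CustomerID, Key, Value).
# The data is made up; only Cousins_First_Name comes from the text.
rows = [
    (101, "Cousins_First_Name", "Jane"),
    (101, "Shoe_Size", "43"),
    (102, "Cousins_First_Name", "John"),
]

# One entry per key: the target field name and the type conversion to apply.
targets = {
    "Cousins_First_Name": ("Cousins_First_Name", str),
    "Shoe_Size": ("Shoe_Size", int),
}

# Group by CustomerID and turn each key-value pair into a typed column;
# assigning into the dict means "take the last value" on a key collision.
customers = defaultdict(dict)
for customer_id, key, value in rows:
    field, convert = targets[key]
    customers[customer_id][field] = convert(value)

print(customers[101])  # {'Cousins_First_Name': 'Jane', 'Shoe_Size': 43}
```

The step does the equivalent of this grouping, conversion and collision handling in one place, driven by the mappings you enter in the dialog.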
Now, yesterday someone asked me if the step could also de-normalise multiple value fields for the same key. Since that requirement had never come up before, I had to say that it could not. However, this morning I added support for it as part of a fix for a bug report. So now you can indeed de-normalise different value fields using the same key (if you update to the latest kettle.jar library).
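To sketch what that new behaviour means (again in Python with made-up field names; the real configuration happens in the step’s dialog): each input row now carries one key and two value fields, and every combination of key and value field becomes its own de-normalised column:

```python
from collections import defaultdict

# Made-up input rows with ONE key column and TWO value columns:
# (CustomerID, Key, Value1, Value2).
rows = [
    (101, "phone", "555-1234", "home"),
    (101, "fax", "555-9999", "work"),
]

# The same key now feeds two target columns, one per value field.
out = defaultdict(dict)
for customer_id, key, value1, value2 in rows:
    out[customer_id][key + "_1"] = value1  # first series of values
    out[customer_id][key + "_2"] = value2  # second series of values

print(out[101])
```

In the step itself you would simply list each target field twice, once per value field, against the same key.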
The relevant example can be found in the new release distribution in the samples directory. (or if you access the source code using subversion)
The file is called:
Denormaliser – 2 series of key-value pairs.ktr
Until next time,