Key-value madness

People who write data integration solutions often have a tough job on their hands. I can tell, because I get to see all the questions and bugs that get reported.

That is the main reason I committed a first collection of 30 examples to the codebase to be included in the next GA release (2.3.1 will be released next Friday, more on that later).

Today I would like to talk about one of the more interesting examples, which deals with the de-normaliser step. What the de-normaliser step does is help you with the lookup of key-value pairs, so that you can assign the looked-up value to a certain field. The step is a fairly recent addition to Pentaho Data Integration, but it gets a lot of attention. I guess that’s because key-value pair systems are often used in situations where programmers need a very flexible way of storing data in a relational database.

The typical example is that you have a Customer table with a Customer ID and all of a sudden 50 extra attributes need to be stored in the database somewhere. However, these fields are not added to the main Customer table (usually because they would be empty in more than 90% of the cases); instead they are stored in a table containing key-value pairs. These key-value pair tables typically look like this:

CustomerID  Key                  Value
----------  -------------------  ----------
101         COUSINS_FIRST_NAME   Mary
101         COUSINS_SECOND_NAME  J.
101         COUSINS_LAST_NAME    Blige
101         COUSINS_BIRTH_DATE   1969/02/14
101         COUSINS_INCOME       1723.86
101         …45 more keys…
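
In SQL terms, such a table is usually declared along these lines. This is just a sketch: the table and column names are the ones used in the queries below, but the data types and lengths are assumptions, and in some databases Key and Value are reserved words that would need quoting:

CREATE TABLE Customer_Key_Value
(
  CustomerID INTEGER      NOT NULL,  -- points to the main Customer table
  Key        VARCHAR(64)  NOT NULL,  -- the attribute name, stored as data
  Value      VARCHAR(255)            -- the attribute value, whatever its real type
);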

OK, that’s fine. The problem starts when the reporting and data warehouse developers want to get all these 50 fields into the Customer Dimension. That is because in the data warehouse it makes sense to de-normalise the data and present it in an easy-to-retrieve fashion, next to the other customer data. That’s the point of the data warehouse, remember.

Typically, what you would do is launch 50 lookups like this:

SELECT Value AS Cousins_First_Name
FROM Customer_Key_Value
WHERE CustomerID=101
AND Key='COUSINS_FIRST_NAME'

Now, you can imagine the maintenance nightmare that results from such a system:

  • If key-value pairs get added, we need to add extra lookups.
  • If a key changes, we need to look through 50 lookups to find the right one.
  • What if, for some reason, 2 or more of the same keys appear for the same CustomerID (a key collision)? What would you do: take the first, the last, concatenate them separated by commas, etc.? Perhaps an error should be raised, causing your transformation to fail? (A small check query for this case follows the list.)
  • Usually, not only strings but also dates, numbers and booleans are stuffed into the Value field, so all kinds of data conversions have to take place.
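
To spot the key collision case mentioned above, a query along these lines (a sketch against the Customer_Key_Value layout shown earlier) lists the CustomerID/Key combinations that occur more than once:

SELECT CustomerID, Key, COUNT(*) AS Occurrences
FROM Customer_Key_Value
GROUP BY CustomerID, Key
HAVING COUNT(*) > 1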

Also, the performance would go right down the drain because of the many individual calls to the database. That is why we created the de-normaliser step: to allow you to solve this problem once and for all. The way to do it is easy. First, you join with the rest of the customer dimension:

SELECT C.*, KV.Key, KV.Value
FROM Customer C, Customer_Key_Value KV
WHERE C.CustomerID=KV.CustomerID

This way we get all the key-value pairs in multiple rows and then apply the step as shown in this image:

Sample Denormaliser Step

As you can see, we have put all the key-value pair lookups, the conversions AND the aggregation logic into a single dialog making it very easy to do maintenance.
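
For those who prefer to think in SQL: the logic the step performs on those joined rows is roughly comparable to a conditional aggregation like the one below. This is only a sketch for two of the fifty keys; the step itself does the pivoting, the data type conversions and the collision handling in memory, outside the database:

SELECT KV.CustomerID,
       MAX(CASE WHEN KV.Key = 'COUSINS_FIRST_NAME' THEN KV.Value END) AS Cousins_First_Name,
       MAX(CASE WHEN KV.Key = 'COUSINS_BIRTH_DATE' THEN KV.Value END) AS Cousins_Birth_Date
FROM Customer_Key_Value KV
GROUP BY KV.CustomerID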

Now, yesterday, someone asked me if the step could also de-normalise multiple value fields for the same key. Since the requirement had never come up until yesterday, I had to say that it could not. However, this morning I added support for this as part of a fix for a bug report. So now you can indeed de-normalise different value fields using the same key (if you update to the latest kettle.jar library).

The relevant example can be found in the samples directory of the new release distribution (or in the source code if you check it out using Subversion).

The file is called:

Denormaliser – 2 series of key-value pairs.ktr

Until next time,

Matt

11 comments

  • One point must be made very strongly — this is not denormalization, because the key-value data model is not a legitimate normal form.

    In normal form, all the attributes are defined as columns of the table for which they are attributes. The attribute name is the column name, and the attribute value is stored as data in the respective column. If most of the columns in a given row are NULL, so be it.

    In the key-value data model, the attribute name is itself data, which muddies the definition of metadata.

    The key-value data model suffers from many problems, including the scalability problem you pointed out. Other problems include:

    - No declarative way to make any of the attributes mandatory (NOT NULL) for the entity.

    - Any attribute name may be used, even if the attribute is not valid for the entity, unless the attribute name column is a foreign key to an attribute lookup table.

    - No way to use data types. Everything must be VARCHAR. How can you enforce that a given attribute is stored as a DATE value, or an integer, or a string shorter than the maximum length of the attribute value column?

    - No way to use referential integrity for a given attribute. Some attributes should be limited to a set of valid values, by making it a reference to a lookup table. This is not possible in key-value models, because all attribute data share the same column.

    The key-value data model is often used when there has been inadequate attention paid to analyzing the attributes and creating a proper schema.
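
    To make the contrast concrete: a conventional design simply declares the attributes as typed, constrained columns, roughly like this (an illustrative sketch using the attributes from the post; the types and the NOT NULL are only there to show what becomes possible):

    CREATE TABLE Customer
    (
      CustomerID         INTEGER PRIMARY KEY,
      Cousins_First_Name VARCHAR(50) NOT NULL,  -- an attribute can be declared mandatory
      Cousins_Birth_Date DATE,                  -- a real DATE instead of a VARCHAR
      Cousins_Income     DECIMAL(10,2)          -- a real number instead of a VARCHAR
    );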

    So don’t call fetching such data “denormalization” — that implies it’s stored in normal form to begin with, which it is not.

  • Bill, thanks for the comment.

    Let’s get this straight: I absolutely agree with you that these kinds of tables are indeed one of the ugliest possible products of “clever” programmers and IT departments. I’m in no possible way advocating their use :-)

    Unfortunately, these revolting tables keep being used despite the obvious drawbacks. And that is why we provided for ways to help people out there.

    So, although I can understand your anger at the stupidity of seeing these tables pop up everywhere, there is no need to direct your anger this way. I’m on the receiving end just the same.

    Hey, the key-value pair lookup came as a bit of a side-effect when we chose the name of the step, and that is the reason why we chose the name “Denormalize”. The term “Unpivot”, as chosen by some other ETL tool, didn’t quite seem fitting here either, so we went with the name it has now.

    That’s all there is to it, really, no need to get worked up over it ;-)

  • I’m not sure I agree with Bill’s definition of normalization.

    Certainly, a table in the form of:

    Customer,field,value
    primary key(Customer,field)

    is in normal form. You could have a foreign key reference a table of valid field names.

    Then you’d have:

    TABLE ValidFields with one column, “validField”
    with data like: “first name”,”income”, etc.

    TABLE CustomerValues with 3 columns:
    “CustomerId” (perhaps a foreign key to a customer db)
    “field” foreign key to ValidFields(validField)
    “value” string

    Certainly, that’s a normalized table. It might not have the best performance with indexes, but with 50 different key-value pairs, I might choose that method instead of having 51 columns in 1 table.
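
    Spelled out as DDL, that suggestion looks roughly like this (a sketch; the lengths and exact names are only illustrative):

    CREATE TABLE ValidFields
    (
      validField VARCHAR(50) PRIMARY KEY       -- "first name", "income", ...
    );

    CREATE TABLE CustomerValues
    (
      CustomerId INTEGER     NOT NULL,         -- perhaps a foreign key to a customer table
      field      VARCHAR(50) NOT NULL REFERENCES ValidFields (validField),
      value      VARCHAR(255),
      PRIMARY KEY (CustomerId, field)
    );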

  • Whoah there… I’m not angry… and I didn’t say never to use the design you described. It may be necessary in certain cases (probably less frequently than it is used, as you point out).

    But this model has limitations, as I described. As long as those limitations don’t break our applications, and there isn’t a better way to do it with conventional data modeling, then EAV is acceptable.

    This data model pattern is usually called EAV, by the way.
    http://en.wikipedia.org/wiki/EAV

    But my point was that it’s specifically not in normal form. Normal form has a specific definition, and that ain’t it. In other words, there is no rule of normalization that would lead you to store attributes in that manner.

    “Denormalization” also has a specific meaning: to restructure data out of normal form, and into a more human-readable form, usually with redundant values.

    So because EAV doesn’t qualify as a normal form, it is not correct to say you are “denormalizing” it. All you can say is that you’re querying to get all the attributes.

    I’m not saying never use EAV, and I’m not trying to detract from your recommended methods to work with schema using that structure. Apologies if I gave that impression.

  • Well Bill, I understand where you’re coming from, but I still think that the term “De-normalization” very well applies to what we do when we construct a data warehouse.

    We could have called the step “Entity-Attribute-Value Model Lookup”, but it doesn’t really sound as good does it ;-)

    Sheeri: I would probably still go for the 70-column table, even if these fields would be mostly empty all the time. I bet most database engines can deal with those kinds of sparse matrices perfectly well. What those database engines DO choke on is the constant hammering on these smaller tables. I’ve seen it happen a couple of times and it’s not a pretty sight! There is a reason why I called the blog entry “Key-Value madness”.

    Whatever the case, if you want to migrate the data and put the fields in the Customer column after all, you can use the “De-normalize” step and Pentaho Data Integration ;-)

    Thank you for this explanation on de-normalizing. It helped me a lot. Maybe it is something to include in the documentation, which currently gives only a very limited description of this function.

    Grtz,

    J.

    You’re welcome, Jeroen.
    The steps themselves are being ported to the wiki over time. That way, it will be easier to link to this post, add comments, etc.

    Matt

  • jim

    Matt – I’m considering some form of an EAV/CR schema because I need to track attributes about attributes and I can not identify another method without adding an excessive number of columns.

    For instance, I would like to know for every attribute, who entered the attribute and the source of the attribute – this information is critical to our process.

    Like I said, the only solution not involving a key-value table would require adding an instance of these fields for every attribute in the table – you can imagine how quickly the number of columns would increase.

    Could you recommend any alternative approach?

  • Hi Jim,

    You are probably right that using key-value pairs is the way to go here. However, please remember that such highly dynamic and configurable systems are a pain to do BI on (the subject of my work and this blog, I guess).
    In a system where everyone can create new attributes and types of attributes, the real database is no longer the RDBMS but the EAV/CR schema you set up on top of it. As such, it becomes a lot harder to get meaningful KPIs and information out of such a system.

    Then again, you can’t hide complexity. If this is what needs to be done, it needs to be done. :-)

    All the best,

    Matt

  • Charles Fisher

    I’m not sure why another ETL tool calls this “Unpivot”. The procedure performed here is what I have always thought of as “Pivot”.

    I agree that this is not denormalization. To me, denormalization almost always involves creating redundant copies of some information, for expediency and performance. In this transformation, no redundant information has been created, just a different structure.

    Before I even learned the word “pivot”, I called the procedure described here “reification”, which is in general the process of taking something abstract and making it concrete. So if we consider the key term to be “abstract” when it is data in the “Key” column, but to be “concrete” when it is a column in a table, this fits pretty well.

  • Pingback: Matt Casters on Data Integration » Dynamic de-normalization of attributes stored in key-value pair tables